[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
> But there can be a on chip disk controller on the motherboard, I'm not sure.

There is always some kind of controller. Could be on-board. Usually, the cache settings are accessible when booting into the BIOS set-up.

> If your worry is fsync persistence

No, what I worry about is the volatile write cache, which is usually enabled by default. This cache exists on the disk as well as on the controller. To avoid losing writes on power failure, the controller needs to be in write-through mode and the disk write cache disabled. The latter can be done with smartctl, the former in the BIOS set-up.

Did you test power failure? If so, how often? On how many hosts simultaneously? Pulling network cables will not trigger cache-related problems. The problem with write cache is that you rely on a lot of bells and whistles, some of which usually fail. With ceph, this will lead to exactly the problem you are observing now.

Your pool configuration looks OK. You need to find out where exactly the scrub errors are situated. It looks like meta-data damage and you might lose some data. Be careful to do only read-only admin operations for now.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Sagara Wijetunga
Sent: 02 November 2020 16:08:58
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

> Hmm, I'm getting a bit confused. Could you also send the output of "ceph osd
> pool ls detail".

File ceph-osd-pool-ls-detail.txt attached.

> Did you look at the disk/controller cache settings?

I don't have disk controllers on the Ceph machines. The hard disk is directly attached to the motherboard via a SATA cable. But there can be an on-chip disk controller on the motherboard, I'm not sure. If your worry is fsync persistence, I have thoroughly tested database fsync reliability on Ceph RBD with hundreds of transactions per second, removing network cables and restarting the database machine, etc. while inserts were going on, and I did not lose a single transaction. I simulated this many times and persistence on my Ceph cluster was perfect (i.e. not a single loss).

> I think you should start a deep-scrub with "ceph pg deep-scrub 3.b" and
> record the output of "ceph -w | grep '3\.b'" (note the single quotes).
> The error messages you included in one of your first e-mails are only on 1
> out of 3 scrub errors (3 lines for 1 error). We need to find all 3 errors.

I ran "ceph pg deep-scrub 3.b" again; here is the whole output of ceph -w:

2020-11-02 22:33:48.224392 osd.0 [ERR] 3.b shard 2 soid 3:d577e975:::123675e.:head : candidate had a missing snapset key, candidate had a missing info key
2020-11-02 22:33:48.224396 osd.0 [ERR] 3.b soid 3:d577e975:::123675e.:head : failed to pick suitable object info
2020-11-02 22:35:30.087042 osd.0 [ERR] 3.b deep-scrub 3 errors

Btw, I'm very grateful for your perseverance on this.

Best regards
Sagara

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
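The write-cache check Frank describes above can be sketched with standard tools (a sketch only: /dev/sda is an example device, and these settings usually do not survive a reboot, so they would need to be re-applied from a boot script):

```shell
# Show whether the disk's volatile write cache is enabled
# (example device; adjust to your system).
smartctl -g wcache /dev/sda

# Disable the volatile write cache on the disk.
smartctl -s wcache,off /dev/sda

# hdparm can do the same for SATA disks:
hdparm -W /dev/sda     # report write-cache state
hdparm -W0 /dev/sda    # turn write cache off
```

The controller side (write-through vs. write-back) is set in the BIOS/controller set-up, as Frank notes, not from the OS.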
[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
Hmm, I'm getting a bit confused. Could you also send the output of "ceph osd pool ls detail".

Did you look at the disk/controller cache settings?

I think you should start a deep-scrub with "ceph pg deep-scrub 3.b" and record the output of "ceph -w | grep '3\.b'" (note the single quotes). The error messages you included in one of your first e-mails cover only 1 of the 3 scrub errors (3 lines for 1 error). We need to find all 3 errors.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Sagara Wijetunga
Sent: 02 November 2020 14:25:08
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

Hi Frank

> the primary OSD is probably not listed as a peer. Can you post the complete
> output of
> - ceph pg 3.b query
> - ceph pg dump
> - ceph osd df tree
> in a pastebin?

Yes, the primary OSD is 0. I have attached the above as .txt files. Please let me know if you still cannot read them.

Regards
Sagara
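Frank's deep-scrub-and-watch step looks like this in practice (the pg id 3.b is from this thread; the backslash matters, so grep treats the dot as a literal character rather than a wildcard):

```shell
# Re-run the deep scrub on the inconsistent PG.
ceph pg deep-scrub 3.b

# Follow the cluster log, keeping only lines that mention this PG.
# '3\.b' escapes the dot so the pattern matches the literal pg id
# and not, say, a timestamp fragment containing "3" + any char + "b".
ceph -w | grep '3\.b'
```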
[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
Hi Sagara,

the primary OSD is probably not listed as a peer. Can you post the complete output of
- ceph pg 3.b query
- ceph pg dump
- ceph osd df tree
in a pastebin?

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Sagara Wijetunga
Sent: 02 November 2020 11:53:58
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

Hi Frank

> Please note, there is no peer 0 in "ceph pg 3.b query". Also no word osd.

I checked other PGs with "active+clean"; there is a "peer": "0". But "ceph pg <pgid> query" always shows only two peers, sometimes peers 0 and 1, or 1 and 2, or 0 and 2, etc.

Regards
Sagara
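The "only two peers" observation follows from how pg query reports peering: the primary lists the other OSDs under peer_info but not itself. A quick way to see this (a sketch assuming jq is installed; the field names match the Nautilus-era pg query JSON):

```shell
# Acting set of the PG (includes the primary):
ceph pg 3.b query | jq '.acting'

# Peers reported by the primary; the primary itself is not in this
# list, which is why "peer 0" never appears when osd.0 is primary.
ceph pg 3.b query | jq -r '.peer_info[].peer'
```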
[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
Hi Sagara,

looks like you have one on a new and 2 on an old version. Can you add the information about which OSD each version resides on?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Sagara Wijetunga
Sent: 02 November 2020 10:10:02
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

Hi Frank

> I'm not sure if my hypothesis can be correct. Ceph sends an acknowledge of a
> write only after all copies are on disk. In other words, if PGs end up on
> different versions after a power outage, one always needs to roll back. Since
> you have two healthy OSDs in the PG and the PG is active (successfully
> peered), it might just be a broken disk and read/write errors. I would focus
> on that.

I tried to revert the PG as follows:

# ceph pg 3.b query | grep version
    "last_user_version": 2263481,
    "version": "4825'2264303",
    "last_user_version": 2263481,
    "version": "4825'2264301",
    "last_user_version": 2263481,
    "version": "4825'2264301",

# ceph pg 3.b list_unfound
{
    "num_missing": 0,
    "num_unfound": 0,
    "objects": [],
    "more": false
}

# ceph pg 3.b mark_unfound_lost revert
pg has no unfound objects

# ceph pg 3.b revert
Invalid command: revert not in query
pg <pgid> query : show details of a specific pg
Error EINVAL: invalid command

How to revert/rollback a PG?

> Another question, do you have write caches enabled (disk cache and controller
> cache)? This is known to cause problems on power outages and also degraded
> performance with ceph. You should check and disable any caches if necessary.

No. The HDD is directly connected to the motherboard.

Thank you
Sagara
[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
Hi Sagara,

I'm not sure if my hypothesis can be correct. Ceph sends an acknowledge of a write only after all copies are on disk. In other words, if PGs end up on different versions after a power outage, one always needs to roll back. Since you have two healthy OSDs in the PG and the PG is active (successfully peered), it might just be a broken disk and read/write errors. I would focus on that.

Another question, do you have write caches enabled (disk cache and controller cache)? This is known to cause problems on power outages and also degraded performance with ceph. You should check and disable any caches if necessary.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Frank Schilder
Sent: 01 November 2020 14:37:41
To: Sagara Wijetunga; ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

sorry: *badblocks* can force remappings of broken sectors (non-destructive read-write check)

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Frank Schilder
Sent: 01 November 2020 14:35:35
To: Sagara Wijetunga; ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

Hi Sagara,

looks like your situation is more complex. Before doing anything potentially destructive, you need to investigate some more.

A possible interpretation (numbering just for the example):

OSD 0 PG at version 1
OSD 1 PG at version 2
OSD 2 PG has scrub error

Depending on the version of the PG on OSD 2, either OSD 0 needs to roll forward (OSD 2 PG at version 2), or OSD 1 needs to roll back (OSD 2 PG at version 1). Part of the relevant information on OSD 2 seems to be unreadable, therefore pg repair bails out. You need to find out if you are in this situation or some other case. If you are, you need to find out somehow if you need to roll back or forward. I'm afraid in your current situation, even taking the OSD with the scrub errors down will not rebuild the PG.

I would probably try:

- find out with smartctl if the OSD with scrub errors is in a pre-fail state (has remapped sectors)
- if it is:
  * take it down and try to make a full copy with ddrescue
  * if ddrescue manages to copy everything, copy back to a new disk and add to ceph
  * if ddrescue fails to copy everything, you could try if badblocks manages to get the disk back; badblocks can force remappings of broken sectors (non-destructive read-write check) and it can happen that data becomes readable again; exchange the disk as soon as possible thereafter
- if the disk is healthy:
  * try to find out if you can deduce the state of the copies on every OSD

The tool for low-level operations is bluestore-tool. I never used it, so you need to look at the documentation.

If everything fails, I guess your last option is to decide for one of the copies, export it from one OSD and inject it into another one (but not any of 0,1,2!). This will establish 2 identical copies and the third one will be changed to this one automatically. Note that this may lead to data loss on objects that were in the undefined state. As far as I can see, it's only 1 object and probably possible to recover from (backup, snapshot).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Sagara Wijetunga
Sent: 01 November 2020 14:05:36
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

Hi Frank

Thanks for the reply.

> I think this happens when a PG has 3 different copies and cannot decide which
> one is correct. You might have hit a very rare case. You should start with
> the scrub errors, check which PGs and which copies (OSDs) are affected. It
> sounds almost like all 3 scrub errors are on the same PG.

Yes, all 3 errors are for the same PG and on the same OSD:

2020-11-01 18:25:09.39 osd.0 [ERR] 3.b shard 2 soid 3:d577e975:::123675e.:head : candidate had a missing snapset key, candidate had a missing info key
2020-11-01 18:25:09.42 osd.0 [ERR] 3.b soid 3:d577e975:::123675e.:head : failed to pick suitable object info
2020-11-01 18:26:33.496255 osd.0 [ERR] 3.b repair 3 errors, 0 fixed

> You might have had a combination of crash and OSD fail, your situation is
> probably not covered by "single point of failure".

Yes, it was a complex crash, all went down.

> In case you have a PG with scrub errors on 2 copies, you should be able to
> reconstruct the PG from the third with PG export/PG import commands.

I have not done a PG export/import before. Mind if you could send the instructions or a link for it.

Thanks
Sagara
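The PG export/import asked about above is done offline with ceph-objectstore-tool. A minimal sketch (the PG id is from this thread; OSD numbers and file paths are illustrative, Frank's advice is to import into an OSD outside the acting set [0,1,2], and the exact flags should be double-checked against the documentation for your release before touching anything):

```shell
# The tool needs exclusive access, so stop the source OSD first.
systemctl stop ceph-osd@1

# Export the chosen (healthy) copy of the PG to a file.
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
    --pgid 3.b --op export --file /root/pg3.b.export

# Stop the target OSD (one outside the acting set) and import the copy.
systemctl stop ceph-osd@3
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
    --op import --file /root/pg3.b.export

# Restart both OSDs and let peering settle on the now-majority copy.
systemctl start ceph-osd@1
systemctl start ceph-osd@3
```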
[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
I think this happens when a PG has 3 different copies and cannot decide which one is correct. You might have hit a very rare case. You should start with the scrub errors, check which PGs and which copies (OSDs) are affected. It sounds almost like all 3 scrub errors are on the same PG.

You might have had a combination of crash and OSD fail; your situation is probably not covered by "single point of failure". In case you have a PG with scrub errors on 2 copies, you should be able to reconstruct the PG from the third with PG export/PG import commands.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Sagara Wijetunga
Sent: 01 November 2020 13:16:08
To: ceph-users@ceph.io
Subject: [ceph-users] How to recover from active+clean+inconsistent+failed_repair?

Hi all

I have a Ceph cluster (Nautilus 14.2.11) with 3 Ceph nodes. A crash happened and all 3 Ceph nodes went down. One (1) PG turned "active+clean+inconsistent"; I tried to repair it. After the repair, the PG in question now shows "active+clean+inconsistent+failed_repair" and I cannot bring the cluster to "active+clean". How do I rescue the cluster? Is this a false positive?

Here are the details. All three Ceph nodes run ceph-mon, ceph-mgr, ceph-osd and ceph-mds.

1. ceph -s
   health: HEALTH_ERR
           3 scrub errors
           Possible data damage: 1 pg inconsistent
   pgs:    191 active+clean
           1   active+clean+inconsistent

2. ceph health detail
   HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
   OSD_SCRUB_ERRORS 3 scrub errors
   PG_DAMAGED Possible data damage: 1 pg inconsistent
       pg 3.b is active+clean+inconsistent, acting [0,1,2]

3. rados list-inconsistent-pg rbd
   []

4. ceph pg deep-scrub 3.b

5. ceph pg repair 3.b

6. ceph health detail
   HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
   OSD_SCRUB_ERRORS 3 scrub errors
   PG_DAMAGED Possible data damage: 1 pg inconsistent
       pg 3.b is active+clean+inconsistent+failed_repair, acting [0,1,2]

7. rados list-inconsistent-obj 3.b --format=json-pretty
   {
       "epoch": 4769,
       "inconsistents": []
   }

8. ceph pg 3.b list_unfound
   {
       "num_missing": 0,
       "num_unfound": 0,
       "objects": [],
       "more": false
   }

Appreciate your help.

Thanks
Sagara
[ceph-users] Re: Very high read IO during backfilling
Are you a victim of bluefs_buffered_io=false: https://www.mail-archive.com/ceph-users@ceph.io/msg05550.html ?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Kamil Szczygieł
Sent: 27 October 2020 21:39:22
To: ceph-users@ceph.io
Subject: [ceph-users] Very high read IO during backfilling

Hi,

We're running Octopus and we have 3 control plane nodes (12 cores, 64 GB memory each) that are running mon, mds and mgr, and also 4 data nodes (12 cores, 256 GB memory, 13x10TB HDDs each). We've increased the number of PGs inside our pool, which resulted in all OSDs going crazy and reading an average of 900 M/s constantly (based on iotop). This has resulted in slow ops and very low recovery speed.

Any tips on how to handle this kind of situation? We have osd_recovery_sleep_hdd set to 0.2, osd_recovery_max_active set to 5 and osd_max_backfills set to 4. Some OSDs are reporting slow ops constantly and iowait on machines is at 70-80% constantly.
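Checking the setting from the linked thread, and throttling recovery while investigating, might look like this (a sketch: the option names are real Ceph options, but whether bluefs_buffered_io takes effect at runtime or needs an OSD restart depends on the release):

```shell
# Inspect the current value on one OSD.
ceph config get osd.0 bluefs_buffered_io

# Set it cluster-wide for all OSDs (the fix discussed in the
# linked mail-archive thread).
ceph config set osd bluefs_buffered_io true

# Dial recovery/backfill down while the cluster is struggling.
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep_hdd 0.5
```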
[ceph-users] Re: MDS_CLIENT_LATE_RELEASE: 3 clients failing to respond to capability release
umount + mount worked. Thanks!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Dan van der Ster
Sent: 30 October 2020 10:22:38
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] MDS_CLIENT_LATE_RELEASE: 3 clients failing to respond to capability release

Hi,

You said you dropped caches -- can you try again echo 3 > /proc/sys/vm/drop_caches ? Otherwise, does umount then mount from one of the clients clear the warning? (I don't believe this is due to a "busy client", but rather a kernel client bug where it doesn't release caps in some cases -- we've seen this in the past but not recently).

-- Dan

On Fri, Oct 30, 2020 at 10:13 AM Frank Schilder wrote:
>
> Dear cephers,
>
> I have a somewhat strange situation. I have the health warning:
>
> # ceph health detail
> HEALTH_WARN 3 clients failing to respond to capability release
> MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
>     mdsceph-12(mds.0): Client sn106.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 30716617
>     mdsceph-12(mds.0): Client sn269.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 30717358
>     mdsceph-12(mds.0): Client sn009.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 30749150
>
> However, these clients are not busy right now. Also, they hold almost nothing; see snippets from "session ls" below. It is possible that a very IO intensive application was running on these nodes and these release requests got stuck. How do I resolve this issue? Can I just evict the client?
>
> Version is mimic 13.2.8. Note that we execute a drop cache command after a job finishes on these clients. It's possible that the clients dropped the caps already before the MDS request was handled/received.
>
> Best regards,
> Frank
>
> {
>     "id": 30717358,
>     "num_leases": 0,
>     "num_caps": 44,
>     "state": "open",
>     "request_load_avg": 0,
>     "uptime": 6632206.332307,
>     "replay_requests": 0,
>     "completed_requests": 0,
>     "reconnecting": false,
>     "inst": "client.30717358 192.168.57.140:0/3212676185",
>     "client_metadata": {
>         "features": "00ff",
>         "entity_id": "con-fs2-hpc",
>         "hostname": "sn269.hpc.ait.dtu.dk",
>         "kernel_version": "3.10.0-957.12.2.el7.x86_64",
>         "root": "/hpc/home"
>     }
> },
> --
> {
>     "id": 30716617,
>     "num_leases": 0,
>     "num_caps": 48,
>     "state": "open",
>     "request_load_avg": 1,
>     "uptime": 6632206.336307,
>     "replay_requests": 0,
>     "completed_requests": 1,
>     "reconnecting": false,
>     "inst": "client.30716617 192.168.56.233:0/2770977433",
>     "client_metadata": {
>         "features": "00ff",
>         "entity_id": "con-fs2-hpc",
>         "hostname": "sn106.hpc.ait.dtu.dk",
>         "kernel_version": "3.10.0-957.12.2.el7.x86_64",
>         "root": "/hpc/home"
>     }
> },
> --
> {
>     "id": 30749150,
>     "num_leases": 0,
>     "num_caps": 44,
>     "state": "open",
>     "request_load_avg": 0,
>     "uptime": 6632206.338307,
>     "replay_requests": 0,
>     "completed_requests": 0,
>     "reconnecting": false,
>     "inst": "client.30749150 192.168.56.136:0/2578719015",
>     "client_metadata": {
>         "features": "00ff",
>         "entity_id": "con-fs2-hpc",
>         "hostname": "sn009.hpc.ait.dtu.dk",
>         "kernel_version": "3.10.0-957.12.2.el7.x86_64",
>         "root": "/hpc/home"
>     }
> },
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
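The two remedies discussed in this exchange, run on an affected client (the mountpoint /mnt/cephfs is an example):

```shell
# 1. Drop clean kernel caches; this can prompt the kernel cephfs
#    client to release the caps the MDS is waiting for.
echo 3 > /proc/sys/vm/drop_caches

# 2. If the warning persists, remount the filesystem
#    (this is what resolved it in this thread).
umount /mnt/cephfs
mount /mnt/cephfs
```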
[ceph-users] Re: frequent Monitor down
I remember exactly this discussion some time ago, where one of the developers gave some more subtle reasons for not using even numbers. The maths sounds simple: with 4 mons you can tolerate the loss of 1, just like with 3 mons. The added benefit seems to be the extra copy of a mon. However, the reality is not that simple. There is apparently some kind of subtlety, having more to do with the physical set-up, that makes 4 mons worse than 3 (more likely to lead to loss of service). I do not remember the thread, but it was within the last year.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Janne Johansson
Sent: 29 October 2020 22:07:45
To: Tony Liu
Cc: Marc Roos; ceph-users
Subject: [ceph-users] Re: frequent Monitor down

On Thu 29 Oct 2020 at 20:16, Tony Liu wrote:

> Typically, the number of nodes is 2n+1 to cover n failures.
> It's OK to have 4 nodes; from a failure-covering POV, it's the same
> as 3 nodes. 4 nodes will cover 1 failure. If 2 nodes go down, the
> cluster is down. It works, it just doesn't make much sense.

Well, you can see it the other way around: with 3 configured mons, and only 2 up, you know you have a majority and can go on with writes. With 4 configured mons and only 2 up, it stops because you get the split-brain scenario. For a 2-DC setup with 2 mons at each place, a split is still fatal.

--
May the most significant bit of your life be positive.
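The 2n+1 arithmetic in this exchange can be made concrete in a few lines (an illustrative sketch, not part of any Ceph tooling):

```shell
# A mon quorum needs a strict majority of the *configured* mons:
# majority(n) = n/2 + 1 (integer division), so the number of mon
# failures tolerated is n - majority(n).
for n in 3 4 5; do
  majority=$(( n / 2 + 1 ))
  echo "$n mons tolerate $(( n - majority )) failure(s)"
done
```

For 3 and 4 mons this prints a tolerance of 1 failure in both cases: the fourth mon adds another copy of the mon data but no extra failure tolerance, and with 4 configured mons and only 2 up, 2 < 3 = majority, which is exactly Janne's split-brain point.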
[ceph-users] Re: Huge HDD ceph monitor usage [EXT]
> ... i will use now only one site, but need first stabilize the
> cluster to remove the EC erasure coding and use replicate ...

If you change to one site only, there is no point in getting rid of the EC pool. Your main problem will be restoring the lost data. Do you have backup of everything? Do you still have the old OSDs? You never answered these questions.

To give you an idea why this is important: with ceph, losing 1% of data on an rbd pool does *not* mean you lose 1% of the disks. It means that, on average, every disk loses 1% of its blocks. In other words, getting everything up again will be a lot of work either way.

The best path to follow is what Eugen suggested: add mons to have at least 3 and dig out the old disks to be able to export and import PGs. Look at Eugen's last 2 e-mails; it's a starting point. You might be able to recover more by temporarily reducing min_size to 1 on the replicated pools and to 4 on the EC pool. If possible, make sure there is no client access during that time. The missing rest needs to be scraped off the OSDs you deleted from the cluster.

If you have backup of everything, starting from scratch and populating the ceph cluster from backup might be the fastest option.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eugen Block
Sent: 28 October 2020 07:23:09
To: Ing. Luis Felipe Domínguez Vega
Cc: Ceph Users
Subject: [ceph-users] Re: Huge HDD ceph monitor usage [EXT]

If you have that many spare hosts I would recommend to deploy two more MONs on them, and probably also additional MGRs so they can fail over. What is the EC profile for the data_storage pool? Can you also share

ceph pg dump pgs | grep -v "active+clean"

to see which PGs are affected. The remaining issue with unfound objects and unknown PGs could be because you removed OSDs. That could mean data loss, but maybe there's a chance to recover anyway.

Quoting "Ing. Luis Felipe Domínguez Vega":

> Well, recovering is not working yet... i started 6 more servers and
> the cluster has not yet recovered. Ceph status does not show any
> recovery progress.
>
> ceph -s : https://pastebin.ubuntu.com/p/zRQPbvGzbw/
> ceph osd tree : https://pastebin.ubuntu.com/p/sTDs8vd7Sk/
> ceph osd df : https://pastebin.ubuntu.com/p/ysbh8r2VVz/
> ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
> crush rules (ceph osd crush rule dump) : https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
>
> On 2020-10-27 09:59, Eugen Block wrote:
>> Your pool 'data_storage' has a size of 7 (or 7 chunks since it's
>> erasure-coded) and the rule requires each chunk on a different host,
>> but you currently have only 5 hosts available; that's why the recovery
>> is not progressing. It's waiting for two more hosts. Unfortunately,
>> you can't change the EC profile or the rule of that pool. I'm not sure
>> if it would work in the current cluster state, but if you can't add
>> two more hosts (which would be your best option for recovery) it might
>> be possible to create a new replicated pool (you seem to have enough
>> free space) and copy the contents from that EC pool. But as I said,
>> I'm not sure if that would work in a degraded state; I've never tried
>> that.
>>
>> So your best bet is to get two more hosts somehow.
>>
>>> pool 4 'data_storage' erasure profile desoft size 7 min_size 5
>>> crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32
>>> autoscale_mode off last_change 154384 lfor 0/121016/121014 flags
>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
>>> application rbd
>>
>> Quoting "Ing. Luis Felipe Domínguez Vega":
>>
>>> Needed data:
>>>
>>> ceph -s : https://pastebin.ubuntu.com/p/S9gKjyZtdK/
>>> ceph osd tree : https://pastebin.ubuntu.com/p/SCZHkk6Mk4/
>>> ceph osd df : (later, because i'm waiting since 10 minutes and no output yet)
>>> ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
>>> crush rules (ceph osd crush rule dump) : https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
>>>
>>> On 2020-10-27 07:14, Eugen Block wrote:
>>>>> I understand, but i deleted the OSDs from the CRUSH map, so ceph
>>>>> doesn't wait for these OSDs, am i right?
>>>>
>>>> It depends on your actual crush tree and rules. Can you share (maybe
>>>> you already did)
>>>>
>>>> ceph osd tree
>>>> ceph osd df
>>>> ceph osd pool ls detail
>>>>
>>>> and a dump
[ceph-users] Re: monitor sst files continue growing
I think you really need to sit down and explain the full story. Dropping one-liners with new information will not work via e-mail. I have never heard of the problem you are facing, so you did something that possibly no-one else has done before. Unless we know the full history from the last time the cluster was health_ok until now, it will almost certainly not be possible to figure out what is going on via e-mail. Usually, setting "norebalance" and "norecover" should stop any recovery IO and allow the PGs to peer. If they do not become active, something is wrong and the information we have so far does not give a clue what this could be. Please post the output of "ceph health detail", "ceph osd pool stats" and "ceph osd pool ls detail" and a log of actions and results since the last health_ok status here; maybe it gives a clue what is going on. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Zhenshi Zhou Sent: 29 October 2020 09:44:14 To: Frank Schilder Cc: ceph-users Subject: Re: [ceph-users] monitor sst files continue growing I reset the pg_num after adding OSDs; it made some PGs inactive (in activating state). Frank Schilder <fr...@dtu.dk> wrote on Thursday, 29 October 2020 at 15:56: This does not explain incomplete and inactive PGs. Are you hitting https://tracker.ceph.com/issues/46847 (see also thread "Ceph does not recover from OSD restart")? In that case, temporarily stopping and restarting all new OSDs might help. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Zhenshi Zhou <deader...@gmail.com> Sent: 29 October 2020 08:30:25 To: Frank Schilder Cc: ceph-users Subject: Re: [ceph-users] monitor sst files continue growing After adding OSDs into the cluster, the recovery and backfill progress has not finished yet. Zhenshi Zhou <deader...@gmail.com> wrote on Thursday, 29 October 2020 at 15:29: The MGR was stopped by me because it took too much memory.
For pg status, I added some OSDs in this cluster, and it Frank Schilder <fr...@dtu.dk> wrote on Thursday, 29 October 2020 at 15:27: Your problem is the overall cluster health. The MONs store cluster history information that will be trimmed once it reaches HEALTH_OK. Restarting the MONs only makes things worse right now. The health status is a mess: no MGR, a bunch of PGs inactive, etc. This is what you need to resolve. How did your cluster end up like this? It looks like all OSDs are up and in. You need to find out
- why there are inactive PGs
- why there are incomplete PGs
This usually happens when OSDs go missing. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Zhenshi Zhou <deader...@gmail.com> Sent: 29 October 2020 07:37:19 To: ceph-users Subject: [ceph-users] monitor sst files continue growing Hi all, My cluster is in a bad state. SST files in /var/lib/ceph/mon/xxx/store.db continue growing. It claims the MONs are using a lot of disk space. I set "mon compact on start = true" and restarted one of the monitors. But it started compacting for a long time; it seems to have no end. [image.png] ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: pgs stuck backfill_toofull
He he. > It will prevent OSDs from being marked out if you shut them down or the . ... down or the MONs lose heartbeats due to high network load during peering. ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 29 October 2020 09:05:27 To: Mark Johnson; ceph-users@ceph.io Subject: [ceph-users] Re: pgs stuck backfill_toofull It will prevent OSDs from being marked out if you shut them down or the . Changing PG counts does not require a shutdown of OSDs, but sometimes OSDs get overloaded by peering traffic and the MONs can lose contact for a while. Setting noout will prevent flapping and also reduce the administrative traffic a bit. It's just a precaution. If this is a production system, you need to rethink your size 2 min_size 1 config. This is the major problem for keeping the service available under maintenance. Please take your time and read the docs on all the commands I sent you. The cluster status is not critical as far as I can see. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Mark Johnson Sent: 29 October 2020 08:58:15 To: ceph-users@ceph.io; Frank Schilder Subject: Re: pgs stuck backfill_toofull Thanks again Frank. That gives me something to digest (and try to understand). One question regarding maintenance mode: these are production systems that are required to be available all the time. What, exactly, will happen if I issue this command for maintenance mode? Thanks, Mark On Thu, 2020-10-29 at 07:51 +0000, Frank Schilder wrote: Cephfs pools are uncritical, because ceph fs splits very large files into chunks of objectsize. The RGW pool is the problem, because RGW does not split them, as far as I know. A few 1TB uploads and you have a problem. The calculation is confusing because the term PG is used in two different meanings, unfortunately. The pool PG count and OSD PG count are different things. A PG is a virtual raid set distributed over some OSDs.
The number of PGs in a pool is the count of such raid sets. The PG count for an OSD is in fact the PG membership count - something completely different. It says in how many PGs an OSD is a member. To create 100 PGs with replication 3 you need 3x100=300 PG memberships. If you have 3 OSDs, these will have 100 PG memberships each. This is shown as PGs in the utilisation columns. If these terms were used with a bit more precision, it would be less confusing. If the data distribution will remain more or less the same in the near future, changing the PG count as follows should help: Assuming that you have 20 OSDs (OSD 1 seems to be gone), increasing the PG count for pool 20 from 64 to 512 will require 2x(512-64)=896 additional PG memberships. Distributed over 20 OSDs, this is on average 44.8 memberships per OSD. This will leave PG memberships available for the future and should sort out your distribution problem. If you want to follow this route, you can do the following:
- ceph osd set noout # maintenance mode
- ceph osd set norebalance # prevent immediate start of rebalancing
- increase pg_num and pgp_num of pool 20 to 512
- increase the reweight of osd.3 to, say, 0.8
- wait for peering to finish and any recovery to complete
- ceph osd unset noout # leave maintenance mode
- if everything is OK (all PGs active, no degraded objects, no recovery) do ceph osd unset norebalance
- once the rebalancing is finished, reweight the OSDs manually; the built-in reweight commands are a bit limited
> is that just a matter of "ceph osd reweight osd.3 1"
Yes, this will do. However, you should probably increase it in less aggressive steps. You will need some rebalancing, because you run a bit low on available space. As a final note, running with size 2 min_size 1 is a serious data redundancy risk. You should get another server and upgrade to 3(2).
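The membership arithmetic above can be sketched in a few lines (the helper names are mine for illustration, not a Ceph API):

```python
# Hypothetical helpers illustrating the PG-membership arithmetic above;
# the function names are illustrative, this is not a Ceph API.

def pool_pg_memberships(pg_num: int, size: int) -> int:
    """Total PG memberships a pool consumes: one per replica of each PG."""
    return pg_num * size

def extra_memberships_per_osd(old_pg_num: int, new_pg_num: int,
                              size: int, num_osds: int) -> float:
    """Average additional memberships per OSD after increasing pg_num."""
    return size * (new_pg_num - old_pg_num) / num_osds

# 100 PGs at replication 3 need 3x100 = 300 memberships.
print(pool_pg_memberships(100, 3))                # 300
# Pool 20 (size 2), pg_num 64 -> 512, spread over 20 OSDs.
print(extra_memberships_per_osd(64, 512, 2, 20))  # 44.8
```

The same arithmetic explains why the target of 100-200 PGs per OSD mentioned elsewhere in the thread leaves headroom for future pools.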
Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Mark Johnson <ma...@iovox.com> Sent: 29 October 2020 08:19:01 To: ceph-users@ceph.io; Frank Schilder Subject: Re: pgs stuck backfill_toofull Thanks for your swift reply. Below is the requested information. I understand the bit about not being able to reduce the pg count, as we've come across this issue once before. This is the reason I've been hesitant to make any changes there without being 100% certain of getting it right and the impact of these changes. That, and the more I read about how to calculate this, the more confused I get. As for the reweight, is that just a matter of "ceph osd reweight osd.3 1" once the other issues are sorted out (or perhaps start with a less dramatic change and work up)? Also, presuming I need to change the pg/pgp num, would you be
[ceph-users] Re: pgs stuck backfill_toofull
It will prevent OSDs from being marked out if you shut them down or the . Changing PG counts does not require a shutdown of OSDs, but sometimes OSDs get overloaded by peering traffic and the MONs can lose contact for a while. Setting noout will prevent flapping and also reduce the administrative traffic a bit. It's just a precaution. If this is a production system, you need to rethink your size 2 min_size 1 config. This is the major problem for keeping the service available under maintenance. Please take your time and read the docs on all the commands I sent you. The cluster status is not critical as far as I can see. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Mark Johnson Sent: 29 October 2020 08:58:15 To: ceph-users@ceph.io; Frank Schilder Subject: Re: pgs stuck backfill_toofull Thanks again Frank. That gives me something to digest (and try to understand). One question regarding maintenance mode: these are production systems that are required to be available all the time. What, exactly, will happen if I issue this command for maintenance mode? Thanks, Mark On Thu, 2020-10-29 at 07:51 +0000, Frank Schilder wrote: Cephfs pools are uncritical, because ceph fs splits very large files into chunks of objectsize. The RGW pool is the problem, because RGW does not split them, as far as I know. A few 1TB uploads and you have a problem. The calculation is confusing because the term PG is used in two different meanings, unfortunately. The pool PG count and OSD PG count are different things. A PG is a virtual raid set distributed over some OSDs. The number of PGs in a pool is the count of such raid sets. The PG count for an OSD is in fact the PG membership count - something completely different. It says in how many PGs an OSD is a member. To create 100 PGs with replication 3 you need 3x100=300 PG memberships. If you have 3 OSDs, these will have 100 PG memberships each. This is shown as PGs in the utilisation columns.
If these terms were used with a bit more precision, it would be less confusing. If the data distribution will remain more or less the same in the near future, changing the PG count as follows should help: Assuming that you have 20 OSDs (OSD 1 seems to be gone), increasing the PG count for pool 20 from 64 to 512 will require 2x(512-64)=896 additional PG memberships. Distributed over 20 OSDs, this is on average 44.8 memberships per OSD. This will leave PG memberships available for the future and should sort out your distribution problem. If you want to follow this route, you can do the following:
- ceph osd set noout # maintenance mode
- ceph osd set norebalance # prevent immediate start of rebalancing
- increase pg_num and pgp_num of pool 20 to 512
- increase the reweight of osd.3 to, say, 0.8
- wait for peering to finish and any recovery to complete
- ceph osd unset noout # leave maintenance mode
- if everything is OK (all PGs active, no degraded objects, no recovery) do ceph osd unset norebalance
- once the rebalancing is finished, reweight the OSDs manually; the built-in reweight commands are a bit limited
> is that just a matter of "ceph osd reweight osd.3 1"
Yes, this will do. However, you should probably increase it in less aggressive steps. You will need some rebalancing, because you run a bit low on available space. As a final note, running with size 2 min_size 1 is a serious data redundancy risk. You should get another server and upgrade to 3(2). Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Mark Johnson <ma...@iovox.com> Sent: 29 October 2020 08:19:01 To: ceph-users@ceph.io; Frank Schilder Subject: Re: pgs stuck backfill_toofull Thanks for your swift reply. Below is the requested information. I understand the bit about not being able to reduce the pg count, as we've come across this issue once before.
This is the reason I've been hesitant to make any changes there without being 100% certain of getting it right and the impact of these changes. That, and the more I read about how to calculate this, the more confused I get. As for the reweight, is that just a matter of "ceph osd reweight osd.3 1" once the other issues are sorted out (or perhaps start with a less dramatic change and work up)? Also, presuming I need to change the pg/pgp num, would you be suggesting on pool 2 based on the below info (the pool with a few large files) or on pool 20 (the pool with the most data but an average of about 250KB file size)? I'm just completely confused as to what's caused this issue in the first place and how to go about fixing it. On top of that, am I going to be able to increase the pg/pgp count with the cluster in a state of health_warn? Just some posts I've read
[ceph-users] Re: monitor sst files continue growing
This does not explain incomplete and inactive PGs. Are you hitting https://tracker.ceph.com/issues/46847 (see also thread "Ceph does not recover from OSD restart")? In that case, temporarily stopping and restarting all new OSDs might help. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Zhenshi Zhou Sent: 29 October 2020 08:30:25 To: Frank Schilder Cc: ceph-users Subject: Re: [ceph-users] monitor sst files continue growing After adding OSDs into the cluster, the recovery and backfill progress has not finished yet. Zhenshi Zhou <deader...@gmail.com> wrote on Thursday, 29 October 2020 at 15:29: The MGR was stopped by me because it took too much memory. For pg status, I added some OSDs in this cluster, and it Frank Schilder <fr...@dtu.dk> wrote on Thursday, 29 October 2020 at 15:27: Your problem is the overall cluster health. The MONs store cluster history information that will be trimmed once it reaches HEALTH_OK. Restarting the MONs only makes things worse right now. The health status is a mess: no MGR, a bunch of PGs inactive, etc. This is what you need to resolve. How did your cluster end up like this? It looks like all OSDs are up and in. You need to find out
- why there are inactive PGs
- why there are incomplete PGs
This usually happens when OSDs go missing. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Zhenshi Zhou <deader...@gmail.com> Sent: 29 October 2020 07:37:19 To: ceph-users Subject: [ceph-users] monitor sst files continue growing Hi all, My cluster is in a bad state. SST files in /var/lib/ceph/mon/xxx/store.db continue growing. It claims the MONs are using a lot of disk space. I set "mon compact on start = true" and restarted one of the monitors. But it started compacting for a long time; it seems to have no end.
[ceph-users] Re: pgs stuck backfill_toofull
Cephfs pools are uncritical, because ceph fs splits very large files into chunks of objectsize. The RGW pool is the problem, because RGW does not split them, as far as I know. A few 1TB uploads and you have a problem. The calculation is confusing because the term PG is used in two different meanings, unfortunately. The pool PG count and OSD PG count are different things. A PG is a virtual raid set distributed over some OSDs. The number of PGs in a pool is the count of such raid sets. The PG count for an OSD is in fact the PG membership count - something completely different. It says in how many PGs an OSD is a member. To create 100 PGs with replication 3 you need 3x100=300 PG memberships. If you have 3 OSDs, these will have 100 PG memberships each. This is shown as PGs in the utilisation columns. If these terms were used with a bit more precision, it would be less confusing. If the data distribution will remain more or less the same in the near future, changing the PG count as follows should help: Assuming that you have 20 OSDs (OSD 1 seems to be gone), increasing the PG count for pool 20 from 64 to 512 will require 2x(512-64)=896 additional PG memberships. Distributed over 20 OSDs, this is on average 44.8 memberships per OSD. This will leave PG memberships available for the future and should sort out your distribution problem. If you want to follow this route, you can do the following:
- ceph osd set noout # maintenance mode
- ceph osd set norebalance # prevent immediate start of rebalancing
- increase pg_num and pgp_num of pool 20 to 512
- increase the reweight of osd.3 to, say, 0.8
- wait for peering to finish and any recovery to complete
- ceph osd unset noout # leave maintenance mode
- if everything is OK (all PGs active, no degraded objects, no recovery) do ceph osd unset norebalance
- once the rebalancing is finished, reweight the OSDs manually; the built-in reweight commands are a bit limited
> is that just a matter of "ceph osd reweight osd.3 1"
Yes, this will do.
However, you should probably increase it in less aggressive steps. You will need some rebalancing, because you run a bit low on available space. As a final note, running with size 2 min_size 1 is a serious data redundancy risk. You should get another server and upgrade to 3(2). Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Mark Johnson <ma...@iovox.com> Sent: 29 October 2020 08:19:01 To: ceph-users@ceph.io; Frank Schilder Subject: Re: pgs stuck backfill_toofull Thanks for your swift reply. Below is the requested information. I understand the bit about not being able to reduce the pg count, as we've come across this issue once before. This is the reason I've been hesitant to make any changes there without being 100% certain of getting it right and the impact of these changes. That, and the more I read about how to calculate this, the more confused I get. As for the reweight, is that just a matter of "ceph osd reweight osd.3 1" once the other issues are sorted out (or perhaps start with a less dramatic change and work up)? Also, presuming I need to change the pg/pgp num, would you be suggesting on pool 2 based on the below info (the pool with a few large files) or on pool 20 (the pool with the most data but an average of about 250KB file size)? I'm just completely confused as to what's caused this issue in the first place and how to go about fixing it. On top of that, am I going to be able to increase the pg/pgp count with the cluster in a state of health_warn? Just some posts I've read seem to indicate that the health state needs to be ok before this sort of thing can be changed (but I could be misunderstanding what I'm reading).
Anyway, here's the info:

# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    28219G     11227G     15558G       55.13
POOLS:
    NAME                          ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd                           0      0          0         690G          0
    KUBERNETES                    1      122G       15.11     690G          34188
    KUBERNETES_METADATA           2      49310k     0         690G          1426
    default.rgw.control           11     0          0         690G          8
    default.rgw.data.root         12     20076k     0         690G          54412
    default.rgw.gc                13     0          0         690G          32
    default.rgw.log               14     0          0         690G          127
    default.rgw.users.uid         15     4942       0         690G          15
    default.rgw.users.keys        16     126        0         690G          4
    default.rgw.users.swift       17     252        0         690G          8
    default.rgw.buckets.index     18     0          0         690G          27206
    .rgw.root
[ceph-users] Re: monitor sst files continue growing
Your problem is the overall cluster health. The MONs store cluster history information that will be trimmed once it reaches HEALTH_OK. Restarting the MONs only makes things worse right now. The health status is a mess: no MGR, a bunch of PGs inactive, etc. This is what you need to resolve. How did your cluster end up like this? It looks like all OSDs are up and in. You need to find out
- why there are inactive PGs
- why there are incomplete PGs
This usually happens when OSDs go missing. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Zhenshi Zhou Sent: 29 October 2020 07:37:19 To: ceph-users Subject: [ceph-users] monitor sst files continue growing Hi all, My cluster is in a bad state. SST files in /var/lib/ceph/mon/xxx/store.db continue growing. It claims the MONs are using a lot of disk space. I set "mon compact on start = true" and restarted one of the monitors. But it started compacting for a long time; it seems to have no end.
[ceph-users] Re: pgs stuck backfill_toofull
Hi Mark, it looks like you have some very large PGs. Also, you run with a quite low PG count, in particular for the large pool. Please post the output of "ceph df" and "ceph osd pool ls detail" to see how much data is in each pool and some pool info. I guess you need to increase the PG count of the large pool to split PGs up and also reduce the impact of imbalance. When I look at this:
3 1.37790 0.45013 1410G 1079G 259G 76.49 1.39 21
4 1.37790 0.95001 1410G 1086G 253G 76.98 1.40 44
I would conclude that the PGs are too large; the reweight of 0.45 without much utilization effect indicates that. This weight will need to be rectified as well at some point. You should be able to run with 100-200 PGs per OSD. Please be aware that PG planning requires caution, as you cannot reduce the PG count of a pool in your version. You need to know how much data is in the pools right now and what the future plan is. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Mark Johnson Sent: 29 October 2020 06:55:55 To: ceph-users@ceph.io Subject: [ceph-users] pgs stuck backfill_toofull I've been struggling with this one for a few days now. We had an OSD report as near full a few days ago. This has happened a couple of times before, and a reweight-by-utilization has sorted it out in the past. Tried the same again, but this time we ended up with a couple of pgs in a state of backfill_toofull and a handful of misplaced objects as a result. Tried doing the reweight a few more times and it's been moving data around. We did have another osd trigger the near full alert, but running the reweight a couple more times seems to have moved some of that data around a bit better. However, the original near_full osd doesn't seem to have changed much and the backfill_toofull pgs are still there. I'd keep doing the reweight-by-utilization but I'm not sure if I'm heading down the right path and if it will eventually sort it out.
We have 14 pools, but the vast majority of data resides in just one of those pools (pool 20). The pgs in the backfill state are in pool 2 (as far as I can tell). That particular pool is used for some cephfs stuff and has a handful of large files in there (not sure if this is significant to the problem). All up, our utilization is showing as 55.13% but some of our OSDs are showing as 76% in use, with this one problem OSD sitting at 85.02%. Right now, I'm just not sure what the proper corrective action is. The last couple of reweights I've run have been a bit more targeted in that I've set it to only function on two OSDs at a time. If I run a test-reweight targeting only one osd, it does say it will reweight OSD 9 (the one at 85.02%). I gather this will move data away from this OSD and potentially get it below the threshold. However, at one point in the past couple of days, it showed no OSDs in a near full state, yet the two pgs in backfill_toofull didn't change. So, that's why I'm not sure continually reweighting is going to solve this issue. I'm a long way from knowledgeable on Ceph, so I'm not really sure what information is useful here. Here's a bit of info on what I'm seeing. Can provide anything else that might help. Basically, we have a three node cluster but only two have OSDs. The third is there simply to enable a quorum to be established. The OSDs are evenly spread across these two nodes and the configuration of each is identical. We are running Jewel and are not in a position to upgrade at this stage.
# ceph --version
ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
# ceph health detail
HEALTH_WARN 2 pgs backfill_toofull; 2 pgs stuck unclean; recovery 33/62099566 objects misplaced (0.000%); 1 near full osd(s)
pg 2.52 is stuck unclean for 201822.031280, current state active+remapped+backfill_toofull, last acting [17,3]
pg 2.18 is stuck unclean for 202114.617682, current state active+remapped+backfill_toofull, last acting [18,2]
pg 2.18 is active+remapped+backfill_toofull, acting [18,2]
pg 2.52 is active+remapped+backfill_toofull, acting [17,3]
recovery 33/62099566 objects misplaced (0.000%)
osd.9 is near full at 85%
# ceph osd df
ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
 2 1.37790 1.0      1410G  842G  496G 59.75 1.08  33
 3 1.37790 0.45013  1410G 1079G  259G 76.49 1.39  21
 4 1.37790 0.95001  1410G 1086G  253G 76.98 1.40  44
 5 1.37790 1.0      1410G  617G  722G 43.74 0.79  43
 6 1.37790 0.65009  1410G  616G  722G 43.69 0.79  39
 7 1.37790 0.95001  1410G  495G  844G 35.10 0.64  40
 8 1.37790 1.0      1410G  732G  606G 51.93 0.94  52
 9 1.37790 0.70007  1410G 1199G  139G 85.02 1.54  37
10 1.37790 1.0      1410G  611G  727G 43.35 0.79  41
11 1.37790 0.75006  1410G  495G  843G 35.11 0.64  32
 0 1.37790 1.0      1410G  731G  608G 51.82 0.94  43
12 1.37790 1.0      1410G
[ceph-users] Re: Huge HDD ceph monitor usage [EXT]
Hi all, I need to go back to a small piece of information: > It was 3 mons, but I have 2 physical datacenters; one of them broke with > no short-term fix, so I removed all OSDs and ceph mons (2 of them) and > now I have only the OSDs of 1 datacenter with the monitor. When I look at the data about pools and the crush map, I don't see anything that is multi-site. Maybe the physical location was 2-site, but the crush rules don't reflect that. Consequently, the ceph cluster was configured single-site and will act accordingly when you lose 50% of it. Quick interlude: when people recommend to add servers, they do not necessarily mean *new* servers. They mean you have to go to ground zero, dig out as much hardware as you can, drive it to the working site and make it rejoin the cluster. A hypothesis: assume we want to build a 2-site cluster (sites A and B) that can sustain the total loss of any 1 site, with equal (mirrored) capacity at each site. Short answer: this is not exactly possible, because you always need a qualified majority of monitors for quorum and you cannot distribute both N MONs and a qualified majority evenly and simultaneously over 2 sites. We have already an additional constraint: site A will have 2 and site B 1 monitor. The condition is that, in case site A goes down, the monitors from site A can be rescued and moved to site B to restore data access. Hence, a loss of site A will imply temporary loss of service. (Note that 2+2=4 MONs will not help, because then 3 MONs are required for a qualified majority; again, MONs need to be rescued from the down site.) If this constraint is satisfied, then one can configure pools as follows: replicated: size 4, min_size 2, crush rule places 2 copies at each site; erasure coded: k+m with min_size=k+1, m even and m>=k+2, for example k=2, m=4, crush rule places 3 shards at each site. With such a configuration, it is possible to sustain the loss of any one site if the monitors can be recovered from site A.
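The quorum and one-site-loss constraints above can be sanity-checked with a little arithmetic (illustrative helpers only, not Ceph code; the numbers are the ones from this thread):

```python
# Illustrative helpers for the quorum and one-site-loss constraints above;
# not Ceph code, just the arithmetic spelled out.

def quorum_size(num_mons: int) -> int:
    """Qualified majority of monitors: strictly more than half."""
    return num_mons // 2 + 1

def survives_one_site(size: int, min_size: int, shards_on_lost_site: int) -> bool:
    """True if losing one site still leaves at least min_size copies/shards."""
    return size - shards_on_lost_site >= min_size

# 3 MONs need 2 for quorum; 4 MONs need 3, so 2+2 across two sites does not help.
print(quorum_size(3), quorum_size(4))  # 2 3
# Replicated size 4, min_size 2, 2 copies per site: survives a site loss.
print(survives_one_site(4, 2, 2))      # True
# EC k=2, m=4 (size 6, min_size k+1 = 3), 3 shards per site: also survives.
print(survives_one_site(6, 3, 3))      # True
# The existing EC 5+2 pool (size 7, min_size 5) with roughly 4 of its
# shards on the lost site cannot: 7 - 4 < 5.
print(survives_one_site(7, 5, 4))      # False
```

This also makes the asymmetry explicit: the 2+1 monitor split only works if the two MONs on site A can be physically rescued after a site-A failure.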
Note that such EC pools will be very compute intense and have high latency (use option fast_read to get at least reasonable read speeds). Essentially, EC is not really suitable for multi-site redundancy, but the above EC set-up will require a bit less capacity than 4 copies. This setup can sustain the total loss of 1 site (minus MONs on site A) and will rebuild all data once a large enough second site is brought up again. When I look at the information you posted, I see replication 3(2) and EC 5+2 pools, all having crush root default. I do not see any of these mandatory configurations; the sites are ignored in the crush rules. Hence, if you can't get material from the down site back up, you are looking at permanent data loss. You may be able to recover some more data in the replicated pools by setting min_size=1 for some time. However, you will lose any writes that are on the other 2 but not on the 1 disk now used for recovery, and it will certainly not recover PGs with all 3 copies on the down site. Therefore, I would not attempt this, also because for the EC pools you will need to get hold of the hosts from the down site and re-integrate them into the cluster anyway. If you can't do this, the data is lost. In the long run, given your crush map and rules, you either stop placing stuff at 2 sites, or you create a proper 2-site set-up and copy data over. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Ing. Luis Felipe Domínguez Vega Sent: 28 October 2020 05:14:27 To: Eugen Block Cc: Ceph Users Subject: [ceph-users] Re: Huge HDD ceph monitor usage [EXT] Well, recovery is not working yet... I started 6 more servers and the cluster has not yet recovered.
Ceph status not show any recover progress ceph -s : https://pastebin.ubuntu.com/p/zRQPbvGzbw/ ceph osd tree : https://pastebin.ubuntu.com/p/sTDs8vd7Sk/ ceph osd df : https://pastebin.ubuntu.com/p/ysbh8r2VVz/ ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/ crush rules : (ceph osd crush rule dump) https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/ El 2020-10-27 09:59, Eugen Block escribió: > Your pool 'data_storage' has a size of 7 (or 7 chunks since it's > erasure-coded) and the rule requires each chunk on a different host > but you currently have only 5 hosts available, that's why the recovery > is not progressing. It's waiting for two more hosts. Unfortunately, > you can't change the EC profile or the rule of that pool. I'm not sure > if it would work in the current cluster state, but if you can't add > two more hosts (which would be your best option for recovery) it might > be possible to create a new replicated pool (you seem to have enough > free space) and copy the contents from that EC pool. But as I said, > I'm not sure if that would work in a degraded state, I'v
[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
Thanks for digging this out. I believed I remembered exactly this method (I don't know where from), but couldn't find it in the documentation and started doubting it. Yes, this would be very useful information to add to the documentation, and it also confirms that your simpler setup with just a specialized crush rule will work exactly as intended and is long-term stable. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: 胡 玮文 Sent: 26 October 2020 17:19 To: Frank Schilder Cc: Anthony D'Atri; ceph-users@ceph.io Subject: Re: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool > On 2020-10-26 at 15:43, Frank Schilder wrote: > > >> I’ve never seen anything that implies that lead OSDs within an acting set >> are a function of CRUSH rule ordering. > > This is actually a good question. I believed that I had seen/heard that > somewhere, but I might be wrong. > > Looking at the definition of a PG, it states that a PG is an ordered set of > OSD (IDs) and the first up OSD will be the primary. In other words, it seems > that the lowest OSD ID is decisive. If the SSDs were deployed before the > HDDs, they have the smallest IDs and, hence, will be preferred as primary > OSDs. I don’t think this is correct. From my experiments, using the previously mentioned CRUSH rule, no matter what the IDs of the SSD OSDs are, the primary OSDs are always SSDs. I also had a look at the code; if I understand it correctly: * If the default primary affinity is not changed, the logic about primary affinity is skipped, and the primary will be the first OSD returned by the CRUSH algorithm [1]. * The order of OSDs returned by CRUSH still matters if you change the primary affinity. The affinity represents the probability that a test succeeds. The first OSD is tested first and thus has a higher probability of becoming primary. [2] * If any OSD has primary affinity = 1.0, the test will always succeed, and no OSD after it will ever be primary. 
* Suppose CRUSH returned 3 OSDs, each with primary affinity set to 0.5. Then the 2nd OSD has a probability of 0.25 of being primary and the 3rd a probability of 0.125. Otherwise, the 1st will be primary. * If no test succeeds (suppose all OSDs have an affinity of 0), the 1st OSD will be primary as a fallback. [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456 [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561 So, setting the primary affinity of all SSD OSDs to 1.0 should be sufficient for them to be the primary in my case. Do you think I should contribute these to the documentation? > This, however, is not a sustainable situation. Any addition of OSDs will mess > this up and the distribution scheme will fail in the future. A way out seems > to be: > > - subdivide your HDD storage using device classes: > * define a device class for HDDs with primary affinity=0, for example, pick 5 > HDDs and change their device class to hdd_np (for no primary) > * set the primary affinity of these HDD OSDs to 0 > * modify your crush rule to use "step take default class hdd_np" > * this will create a pool with primaries on SSD and balanced storage > distribution between SSD and HDD > * all-HDD pools deployed as usual on class hdd > * when increasing capacity, one needs to take care of adding disks to the hdd_np > class and set their primary affinity to 0 > * somewhat increased admin effort, but a fully working solution > > Best regards, > ===== > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
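The affinity rules listed above can be checked numerically. A minimal sketch of the probability model described in the post (not the actual OSDMap.cc code):

```python
# Sketch of the primary-affinity selection described above: walk the OSDs
# in CRUSH order; each one "passes" with probability equal to its affinity.
# The first OSD to pass becomes primary; the 1st OSD is the fallback.
def primary_probabilities(affinities):
    probs = [0.0] * len(affinities)
    remaining = 1.0  # probability that no earlier OSD has passed yet
    for i, a in enumerate(affinities):
        probs[i] = remaining * a
        remaining *= (1.0 - a)
    probs[0] += remaining  # fallback: if no test succeeds, the 1st is primary
    return probs

# Three OSDs, each with affinity 0.5, as in the example above:
print(primary_probabilities([0.5, 0.5, 0.5]))  # [0.625, 0.25, 0.125]
# An OSD with affinity 1.0 shadows everything after it:
print(primary_probabilities([1.0, 0.5]))       # [1.0, 0.0]
```

This reproduces the numbers in the post: the 2nd and 3rd OSDs get 0.25 and 0.125, and the 1st collects the rest (0.625) via the fallback.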
[ceph-users] Re: Question about expansion existing Ceph cluster - adding OSDs
Hi Kristof, I missed that: why do you need to do manual compaction? Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Kristof Coucke Sent: 26 October 2020 11:33:52 To: Frank Schilder; a.jazdzew...@googlemail.com Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Question about expansion existing Ceph cluster - adding OSDs Hi Ansgar, Frank, all, Thanks for the feedback in the first place. In the meantime, I've added all the disks and the cluster is rebalancing itself... Which will take ages, as you've mentioned. Last week after this conversation it was around 50% (a little bit more), today it's around 44.5%. Every day, I have to take the cluster down to run manual compaction on some disks :-(, but that's a known bug that Igor is working on. (Kudos to him when I get my sleep back at night for this one...) Though, I'm still having an issue which I don't completely understand. When I look into the Ceph dashboard - OSDs, I can see the #pgs for a specific OSD. Does someone know how this is calculated? Because it seems incorrect... E.g. a specific disk shows 189 PGs in the dashboard...? However, examining the pg dump output I can see that for that particular disk there are 145 PGs where the disk is in the "up" list, and 168 PGs where that particular disk is in the "acting" list... Of those 2 lists, 135 are in common, meaning 10 PGs will need to be moved to that disk, while 33 PGs will need to be moved away... I can't figure out how the dashboard gets to the figure of 189... It's the same on other disks (a delta between the pg dump output and the info in the Ceph dashboard). Another example is one disk which I've set to weight 0 as it's marked to have a predictable failure in the future... So the "up" list is 0 (which is correct), and the number of PGs where this disk is in acting is 49. So this seems correct, as these 49 PGs need to be moved away. However... Looking into the Ceph dashboard, the UI is saying that there are 71 PGs on that disk... 
So: - How does the Ceph dashboard get that number in the 1st place? - Is there a possibility that there are "orphaned" PG parts left behind on a particular OSD? - If it is possible that there are orphaned parts of a PG left behind on a disk, how do I clean this up? I've also tried examining the osdmap; however, the output seems to be limited(??). I only see the PGs for pools 1 and 2. (I don't know if the file is concatenated by exporting the osd map, or by the osdmaptool --print.) The cluster is running Nautilus v14.2.11, all on the same version. I'll make some time for writing documentation and documenting all the findings from my journey of the last 2 weeks in Ceph's wunderland... Thanks for all your input so far! Regards, Kristof On Wed 21 Oct 2020 at 14:01, Frank Schilder <fr...@dtu.dk> wrote: There have been threads on exactly this. It might depend a bit on your Ceph version. We are running Mimic and have no issues doing: - set noout, norebalance, nobackfill - add all OSDs (with weight 1) - wait for peering to complete - unset all flags and let the rebalancing loose Starting with Nautilus there seem to be issues with this procedure; mainly, the peering phase can cause a collapse of the cluster. In your case, it sounds like you added the OSDs already. You should be able to do, relatively safely: - set noout, norebalance, nobackfill - set the weight of the OSDs to 1 one by one and wait for peering to complete every time - unset all flags and let the rebalancing loose I believe once the peering has succeeded without crashes, the rebalancing will just work fine. You can easily control how much rebalancing is going on. I noted that Ceph seems to have a strange concept of priority, though. I needed to gain capacity by adding OSDs, and Ceph was very consistent in moving PGs from the fullest OSDs last. The opposite of what should happen. 
Thus, it took ages for additional capacity to become available, and the backfill_toofull warnings stayed the whole time. You can influence this to some degree by using force_recovery commands on PGs on the fullest OSDs. Best regards and good luck, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Kristof Coucke <kristof.cou...@gmail.com> Sent: 21 October 2020 13:29:00 To: ceph-users@ceph.io Subject: [ceph-users] Question about expansion existing Ceph cluster - adding OSDs Hi, I have a cluster with 182 OSDs; this has been expanded towards 282 OSDs. Some disks were near full. The new disks have been added with initial weight = 0. The original plan was to increase this slowly towards their full weight using the gentle reweight script. However, this is going way too slow and I'm also having issues now with "backfill_toofull"
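Kristof's up/acting comparison can be reproduced with plain set arithmetic over 'ceph pg dump' output. A toy illustration (the PG IDs are made up; only the counts 145/168/135 from his mail matter):

```python
# Toy illustration of comparing one OSD's "up" and "acting" PG lists from
# 'ceph pg dump'. The PG IDs are fabricated; only the set arithmetic matters.
up = {f"7.{i:x}" for i in range(145)}          # 145 PGs where the OSD is "up"
acting = {f"7.{i:x}" for i in range(10, 178)}  # 168 PGs where it is "acting"

common = up & acting       # PGs already where they belong
moving_in = up - acting    # PGs that still need to backfill onto the disk
moving_out = acting - up   # PGs that will be moved away

print(len(up), len(acting), len(common))  # 145 168 135
print(len(moving_in), len(moving_out))    # 10 33
# The dashboard figure of 189 matches neither len(up), len(acting),
# nor len(up | acting) = 178, which is what prompted the question.
```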
[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
> I’ve never seen anything that implies that lead OSDs within an acting set are > a function of CRUSH rule ordering. This is actually a good question. I believed that I had seen/heard that somewhere, but I might be wrong. Looking at the definition of a PG, it states that a PG is an ordered set of OSD (IDs) and the first up OSD will be the primary. In other words, it seems that the lowest OSD ID is decisive. If the SSDs were deployed before the HDDs, they have the smallest IDs and, hence, will be preferred as primary OSDs. This, however, is not a sustainable situation. Any addition of OSDs will mess this up and the distribution scheme will fail in the future. A way out seems to be: - subdivide your HDD storage using device classes: * define a device class for HDDs with primary affinity=0, for example, pick 5 HDDs and change their device class to hdd_np (for no primary) * set the primary affinity of these HDD OSDs to 0 * modify your crush rule to use "step take default class hdd_np" * this will create a pool with primaries on SSD and balanced storage distribution between SSD and HDD * all-HDD pools deployed as usual on class hdd * when increasing capacity, one needs to take care of adding disks to the hdd_np class and set their primary affinity to 0 * somewhat increased admin effort, but a fully working solution Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Anthony D'Atri Sent: 25 October 2020 17:07:15 To: ceph-users@ceph.io Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool > I'm not entirely sure if primary on SSD will actually make the read happen on > SSD. My understanding is that by default reads always happen from the lead OSD in the acting set. Octopus seems to (finally) have an option to spread the reads around, which IIRC defaults to false. I’ve never seen anything that implies that lead OSDs within an acting set are a function of CRUSH rule ordering. 
I’m not asserting that they aren’t, though, but I’m … skeptical. Setting primary affinity would do the job, and you’d want to have cron continually update it across the cluster to react to topology changes. I was told of this strategy back in 2014, but haven’t personally seen it implemented. That said, HDDs are more of a bottleneck for writes than reads and just might be fine for your application. Tiny reads are going to limit you to some degree regardless of drive type, and you do mention throughput, not IOPS. I must echo Frank’s notes about capacity too. Ceph can do a lot of things, but that doesn’t mean something exotic is necessarily the best choice. You’re concerned about 3R only yielding 1/3 of raw capacity if using an all-SSD cluster, but the architecture you propose limits you anyway because of drive size. Consider also chassis, CPU, RAM, RU, switch port costs as well, and the cost of you fussing over an exotic solution instead of the hundreds of other things in your backlog. And your cluster as described is *tiny*. Honestly I’d suggest considering one of these alternatives: * Ditch the HDDs, use QLC flash. The emerging EDSFF drives are really promising for replacing HDDs for density in this kind of application. You might even consider ARM if IOPS aren’t a concern. * An NVMeoF solution Cache tiers are “deprecated”, but then so are custom cluster names. Neither appears > For EC pools there is an option "fast_read" > (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values), > which states that a read will return as soon as the first k shards have > arrived. The default is to wait for all k+m shards (all replicas). This > option is not available for replicated pools. > > Now, not sure if this option is not available for replicated pools because > the read will always be served by the acting primary, or if it currently > waits for all replicas. In the latter case, reads will wait for the slowest > device. 
> > I'm not sure if I interpret this correctly. I think you should test the setup > with HDD only and SSD+HDD to see if read speed improves. Note that write > speed will always depend on the slowest device. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Frank Schilder > Sent: 25 October 2020 15:03:16 > To: 胡 玮文; Alexander E. Patrakov > Cc: ceph-users@ceph.io > Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool > > A cache pool might be an alternative, heavily depending on how much data is > hot. However, then you will have much less SSD capacity available, because it > also requires replication. > > Looking at the setup that you have only 10*1T =10T SSD, but 20*6T =
[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
I would like to add one comment. I'm not entirely sure if primary on SSD will actually make the read happen on SSD. For EC pools there is an option "fast_read" (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values), which states that a read will return as soon as the first k shards have arrived. The default is to wait for all k+m shards (all replicas). This option is not available for replicated pools. Now, I'm not sure if this option is not available for replicated pools because the read will always be served by the acting primary, or if it currently waits for all replicas. In the latter case, reads will wait for the slowest device. I'm not sure if I interpret this correctly. I think you should test the setup with HDD only and SSD+HDD to see if read speed improves. Note that write speed will always depend on the slowest device. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Frank Schilder Sent: 25 October 2020 15:03:16 To: 胡 玮文; Alexander E. Patrakov Cc: ceph-users@ceph.io Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool A cache pool might be an alternative, heavily depending on how much data is hot. However, then you will have much less SSD capacity available, because it also requires replication. Looking at the setup (you have only 10*1T = 10T SSD, but 20*6T = 120T HDD), you will probably run short of SSD capacity. Or, looking at it the other way around, with copies on 1 SSD+3 HDD, you will only be able to use about 30T out of 120T HDD capacity. With this replication, the usable storage will be 10T and raw used will be 10T SSD and 30T HDD. If you can't do anything else with the HDD space, you will need more SSDs. If your servers have more free disk slots, you can add SSDs over time until you have at least 40T SSD capacity to balance SSD and HDD capacity. Personally, I think the 1 SSD + 3 HDD is a good option compared with a cache pool. 
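The capacity arithmetic above can be double-checked in a few lines (a sketch; sizes in TB as given in the thread):

```python
# Sketch of the 1 SSD copy + 3 HDD copies capacity estimate from above.
ssd_raw = 10 * 1   # 10 nodes * 1 TB SSD each = 10 TB raw
hdd_raw = 20 * 6   # 20 HDDs  * 6 TB each     = 120 TB raw

ssd_copies, hdd_copies = 1, 3
# Usable capacity is limited by whichever tier fills up first.
usable = min(ssd_raw / ssd_copies, hdd_raw / hdd_copies)

print(usable)                # 10.0 TB usable (SSD-bound)
print(usable * hdd_copies)   # 30.0 TB of HDD raw actually used
print(hdd_raw / hdd_copies)  # 40.0 TB of SSD needed to use all HDD capacity
```

This reproduces the figures in the mail: 10T usable, 30T of 120T HDD used, and 40T SSD required to balance the tiers.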
You have the data security of 3-times replication and, if everything is up, need only 1 copy in the SSD cache, which means that you have 3 times the cache capacity. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: 胡 玮文 Sent: 25 October 2020 13:40:55 To: Alexander E. Patrakov Cc: ceph-users@ceph.io Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool Yes. This is the limitation of the CRUSH algorithm, in my mind. In order to guard against 2 host failures, I’m going to use 4 replicas, 1 on SSD and 3 on HDD. This will work as intended, right? Because at least I can ensure the 3 HDDs are from different hosts. > On 2020-10-25 at 20:04, Alexander E. Patrakov wrote: > > On Sun, Oct 25, 2020 at 12:11 PM huw...@outlook.com > wrote: >> >> Hi all, >> >> We are planning for a new pool to store our dataset using CephFS. These data >> are almost read-only (but not guaranteed) and consist of a lot of small >> files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we will >> deploy about 10 such nodes. We aim at getting the highest read throughput. >> >> If we just use a replicated pool of size 3 on SSD, we should get the best >> performance; however, that only leaves us 1/3 of usable SSD space. And EC >> pools are not friendly to such small-object read workloads, I think. >> >> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want >> 3 data replicas, each on a different host (failure domain), 1 of them on >> SSD, the other 2 on HDD. And normally every read request is directed to SSD. >> So, if every SSD OSD is up, I’d expect the same read throughput as the all-SSD >> deployment. >> >> I’ve read the documents and did some tests. 
Here is the crush rule I’m testing with:
>>
>> rule mixed_replicated_rule {
>>     id 3
>>     type replicated
>>     min_size 1
>>     max_size 10
>>     step take default class ssd
>>     step chooseleaf firstn 1 type host
>>     step emit
>>     step take default class hdd
>>     step chooseleaf firstn -1 type host
>>     step emit
>> }
>>
>> Now I have the following conclusions, but I’m not very sure: >> * The first OSD produced by crush will be the primary OSD (at least if I >> don’t change the “primary affinity”). So, the above rule is guaranteed to >> map an SSD OSD as primary in a PG. And every read request will read from SSD if >> it is up. >> * It is currently not possible to enforce the SSD and HDD OSDs to be chosen from >> different hosts. So, if I want to ensure data availability even if 2 hosts >> fail, I need to choose 1 SSD a
[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
A cache pool might be an alternative, heavily depending on how much data is hot. However, then you will have much less SSD capacity available, because it also requires replication. Looking at the setup (you have only 10*1T = 10T SSD, but 20*6T = 120T HDD), you will probably run short of SSD capacity. Or, looking at it the other way around, with copies on 1 SSD+3 HDD, you will only be able to use about 30T out of 120T HDD capacity. With this replication, the usable storage will be 10T and raw used will be 10T SSD and 30T HDD. If you can't do anything else with the HDD space, you will need more SSDs. If your servers have more free disk slots, you can add SSDs over time until you have at least 40T SSD capacity to balance SSD and HDD capacity. Personally, I think the 1 SSD + 3 HDD is a good option compared with a cache pool. You have the data security of 3-times replication and, if everything is up, need only 1 copy in the SSD cache, which means that you have 3 times the cache capacity. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: 胡 玮文 Sent: 25 October 2020 13:40:55 To: Alexander E. Patrakov Cc: ceph-users@ceph.io Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool Yes. This is the limitation of the CRUSH algorithm, in my mind. In order to guard against 2 host failures, I’m going to use 4 replicas, 1 on SSD and 3 on HDD. This will work as intended, right? Because at least I can ensure the 3 HDDs are from different hosts. > On 2020-10-25 at 20:04, Alexander E. Patrakov wrote: > > On Sun, Oct 25, 2020 at 12:11 PM huw...@outlook.com > wrote: >> >> Hi all, >> >> We are planning for a new pool to store our dataset using CephFS. These data >> are almost read-only (but not guaranteed) and consist of a lot of small >> files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we will >> deploy about 10 such nodes. We aim at getting the highest read throughput. 
>> >> If we just use a replicated pool of size 3 on SSD, we should get the best >> performance; however, that only leaves us 1/3 of usable SSD space. And EC >> pools are not friendly to such small-object read workloads, I think. >> >> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want >> 3 data replicas, each on a different host (failure domain), 1 of them on >> SSD, the other 2 on HDD. And normally every read request is directed to SSD. >> So, if every SSD OSD is up, I’d expect the same read throughput as the all-SSD >> deployment. >> >> I’ve read the documents and did some tests. Here is the crush rule I’m >> testing with:
>>
>> rule mixed_replicated_rule {
>>     id 3
>>     type replicated
>>     min_size 1
>>     max_size 10
>>     step take default class ssd
>>     step chooseleaf firstn 1 type host
>>     step emit
>>     step take default class hdd
>>     step chooseleaf firstn -1 type host
>>     step emit
>> }
>>
>> Now I have the following conclusions, but I’m not very sure: >> * The first OSD produced by crush will be the primary OSD (at least if I >> don’t change the “primary affinity”). So, the above rule is guaranteed to >> map an SSD OSD as primary in a PG. And every read request will read from SSD if >> it is up. >> * It is currently not possible to enforce the SSD and HDD OSDs to be chosen from >> different hosts. So, if I want to ensure data availability even if 2 hosts >> fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the >> replication size to 4, instead of the ideal value 3, on the pool using the >> above crush rule. >> >> Am I correct about the above statements? How would this work from your >> experience? Thanks. > > This works (i.e. guards against host failures) only if you have > strictly separate sets of hosts that have SSDs and that have HDDs. > I.e., there should be no host that has both, otherwise there is a > chance that one hdd and one ssd from that host will be picked. > > -- > Alexander E. 
Patrakov > CV: http://pc.cd/PLz7 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
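Alexander's caveat, that a host holding both an SSD and an HDD can be picked by both steps of the rule, can be illustrated with a toy selection. The host names are hypothetical and this is not the real CRUSH algorithm:

```python
# Toy model of the two-step rule above: pick 1 host for the SSD copy, then
# 3 different hosts for the HDD copies. The two steps are independent, so a
# host that appears in both device classes can be chosen by both of them.
ssd_hosts = {"node1", "node2"}                    # hosts with an SSD
hdd_hosts = {"node2", "node3", "node4", "node5"}  # hosts with HDDs; node2 has both

def worst_case_distinct_hosts():
    # Adversarial pick: SSD copy lands on node2, and node2 is also chosen
    # for one of the HDD copies.
    ssd_pick = ["node2"]
    hdd_picks = ["node2", "node3", "node4"]
    return len(set(ssd_pick + hdd_picks))

# 4 copies, but only 3 distinct hosts: losing node2 plus one other HDD host
# takes out 2 of the copies' hosts, so size 4 does not guarantee surviving
# 2 host failures unless the SSD and HDD host sets are strictly separate.
print(worst_case_distinct_hosts())  # 3
```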
[ceph-users] Re: multiple OSD crash, unfound objects
Hi Michael. > I still don't see any traffic to the pool, though I'm also unsure how much > traffic is to be expected. Probably not much. If ceph df shows that the pool contains some objects, I guess that's sorted. That osdmaptool crashes indicates that your cluster runs with corrupted internal data. I tested your crush map and you should get complete PGs for the fs data pool. That you don't and that osdmaptool crashes points at a corruption of internal data. I'm afraid this is the point where you need support from ceph developers and should file a tracker report (https://tracker.ceph.com/projects/ceph/issues). A short description of the origin of the situation with the osdmaptool output and a reference to this thread linked in should be sufficient. Please post a link to the ticket here. In parallel, you should probably open a new thread focussed on the osd map corruption. Maybe there are low-level commands to repair it. You should wait with trying to clean up the unfound objects until this is resolved. Not sure about adding further storage either. To me, this sounds quite serious. Best regards and good luck! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Urgent help needed please - MDS offline
The post was titled "mds behind on trimming - replay until memory exhausted". > Load up with swap and try the up:replay route. > Set the beacon to 10 until it finishes. Good point! The MDS will not send beacons for a long time. Same was necessary in the other case. Good luck! ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Urgent help needed please - MDS offline
If you can't add RAM, you could try provisioning swap on a reasonably fast drive. There is a thread from this year where someone had a similar problem, the MDS running out of memory during replay. He could quickly add sufficient swap and the MDS managed to come up. It took a long time, but it might be faster than getting more RAM and will not lose data. Your clients will not be able to do much, if anything, during recovery though. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Dan van der Ster Sent: 22 October 2020 18:11:57 To: David C Cc: ceph-devel; ceph-users Subject: [ceph-users] Re: Urgent help needed please - MDS offline I assume you aren't able to quickly double the RAM on this MDS? Or fail over to a new MDS with more RAM? Failing that, you shouldn't reset the journal without recovering dentries, otherwise the cephfs_data objects won't be consistent with the metadata. The full procedure to be used is here: https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts backup the journal, recover dentries, then reset the journal. (The steps after might not be needed.) That said -- maybe there is a more elegant procedure than using cephfs-journal-tool. A cephfs dev might have better advice. -- dan On Thu, Oct 22, 2020 at 6:03 PM David C wrote: > > I'm pretty sure it's replaying the same ops every time, the last > "EMetaBlob.replay updated dir" before it dies is always referring to > the same directory. Although interestingly that particular dir shows > up in the log thousands of times - the dir appears to be where a > desktop app is doing some analytics collecting - I don't know if > that's likely to be a red herring or the reason why the journal > appears to be so long. It's a dir I'd be quite happy to lose changes > to or remove from the file system altogether. 
> > I'm loath to update during an outage although I have seen people > update the MDS code independently to get out of a scrape - I suspect > you wouldn't recommend that. > > I feel like this leaves me with having to manipulate the journal in > some way, is there a nuclear option where I can choose to disregard > the uncommitted events? I assume that would be a journal reset with > the cephfs-journal-tool but I'm unclear on the impact of that, I'd > expect to lose any metadata changes that were made since my cluster > filled up but are there further implications? I also wonder what's the > riskier option, resetting the journal or attempting an update. > > I'm very grateful for your help so far > > Below is more of the debug 10 log with ops relating to the > aforementioned dir (name changed but inode is accurate): > > 2020-10-22 16:44:00.488850 7f424659e700 10 mds.0.journal > EMetaBlob.replay updated dir [dir 0x10009e1ec8d /path/to/desktop/app/ > [2,head] auth v=911968 cv=0/0 state=1610612736 f(v0 m2020-10-14 > 16:32:42.596652 1=0+1) n(v6164 rc2020-10-22 08:46:44.932805 b17592 > 89216=89215+1)/n(v6164 rc2020-10-22 08:46:43.950805 b17592 > 89214=89213+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x5654f8288300] > 2020-10-22 16:44:00.488864 7f424659e700 10 mds.0.journal > EMetaBlob.replay for [2,head] had [dentry > #0x1/path/to/desktop/app/Upload [2,head] auth (dversion lock) v=911967 > inode=0x5654f8288a00 state=1610612736 | inodepin=1 dirty=1 > 0x5654f82794a0] > 2020-10-22 16:44:00.488873 7f424659e700 10 mds.0.journal > EMetaBlob.replay for [2,head] had [inode 0x10009e1ec8e [...2,head] > /path/to/desktop/app/Upload/ auth v911967 f(v0 m2020-10-22 > 08:46:44.932805 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805 > b17592 89216=89215+1) (iversion lock) | dirfrag=1 dirty=1 > 0x5654f8288a00] > 2020-10-22 16:44:00.44 7f424659e700 10 mds.0.journal > EMetaBlob.replay dir 0x10009e1ec8e > 2020-10-22 16:44:00.45 7f424659e700 10 mds.0.journal > EMetaBlob.replay updated dir [dir 
0x10009e1ec8e > /path/to/desktop/app/Upload/ [2,head] auth v=904150 cv=0/0 > state=1073741824 f(v0 m2020-10-22 08:46:44.932805 89215=89215+0) n(v2 > rc2020-10-22 08:46:44.932805 b17592 89215=89215+0) > hs=42926+1178,ss=0+0 dirty=2375 | child=1 0x5654f8289100] > 2020-10-22 16:44:00.488898 7f424659e700 10 mds.0.journal > EMetaBlob.replay added (full) [dentry > #0x1/path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp > [2,head] auth NULL (dversion lock) v=904149 inode=0 > state=1610612800|bottomlru | dirty=1 0x56586df52f00] > 2020-10-22 16:44:00.488911 7f424659e700 10 mds.0.journal > EMetaBlob.replay added [inode 0x1000e4c0ff4 [2,head] > /path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp > auth v904149 s=0 n(v0 1=1+0) (iversion lock) 0x566ce168ce00] > 2020-10-22 16:44:00.488918 7f424659e700 10 >
[ceph-users] Re: multiple OSD crash, unfound objects
Could you also execute (and post the output of) # osdmaptool osd.map --test-map-pgs-dump --pool 7 with the osd map you pulled out (pool 7 should be the fs data pool)? Please check what mapping is reported for PG 7.39d; just checking if the osd map and pg dump agree here. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 22 October 2020 09:32:07 To: Michael Thomas; ceph-users@ceph.io Subject: [ceph-users] Re: multiple OSD crash, unfound objects Sounds good. Did you re-create the pool again? If not, please do, to give the devicehealth manager module its storage. In case you can't see any IO, it might be necessary to restart the MGR to flush out a stale rados connection. I would probably give the pool 10 PGs instead of 1, but that's up to you. I hope I find time today to look at the incomplete PG. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 21 October 2020 22:58:47 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects On 10/21/20 6:47 AM, Frank Schilder wrote: > Hi Michael, > > some quick thoughts. > > > That you can create a pool with 1 PG is a good sign, the crush rule is OK. > That pg query says it doesn't have PG 1.0 points in the right direction. > There is an inconsistency in the cluster. This is also indicated by the fact > that no upmaps seem to exist (the clean-up script was empty). With the osd > map you extracted, you could check what the osd map believes the mapping of > the PGs of pool 1 are: > ># osdmaptool osd.map --test-map-pgs-dump --pool 1 https://pastebin.com/seh6gb7R As I suspected, it thinks that OSDs 0, 41 are the acting set. > or if it also claims the PG does not exist. It looks like something went > wrong during pool creation and you are not the only one having problems with > this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html > . Sounds a lot like a bug in cephadm. 
> > In principle, it looks like the idea to delete and recreate the health > metrics pool is a way forward. Please look at the procedure mentioned in the > thread quoted above. Deletion of the pool there led to some crashes and some > surgery on some OSDs was necessary. However, in your case it might just work, > because you redeployed the OSDs in question already - if I remember correctly. That is correct. The original OSDs 0 and 41 were removed and redeployed on new disks. > In order to do so cleanly, however, you will probably want to shut down all > clients accessing this pool. Note that clients accessing the health metrics > pool are not FS clients, so the mds cannot tell you anything about them. The > only command that seems to list all clients is > ># ceph daemon mon.MON-ID sessions > > that needs to be executed on all mon hosts. On the other hand, you could also > just go ahead and see if something crashes (an MGR module probably) or > disable all MGR modules during this recovery attempt. I found some info that > cephadm creates this pool and starts an MGR module. > > If you google "device_health_metric pool" you should find descriptions of > similar cases. It looks solvable. Unfortunately, in Octopus you cannot disable the devicehealth manager module, and the manager is required for operation. So I just went ahead and removed the pool with everything still running. Fortunately, this did not appear to cause any problems, and the single unknown PG has disappeared from the ceph health output. > I will look at the incomplete PG issue. I hope this is just some PG tuning. > At least pg query didn't complain :) I have OSDs ready to add to the pool, in case you think we should try. > The stuck MDS request could be an attempt to access an unfound object. It > should be possible to locate the fs client and find out what it was trying to > do. I see this sometimes when people are too impatient. 
They manage to > trigger a race condition and an MDS operation gets stuck (there are MDS bugs > and in my case it was an ls command that got stuck). Usually, evicting the > client temporarily solves the issue (but tell the user :). I found the fs client and rebooted it. The MDS still reports the slow OPs, but according to the mds logs the offending ops were established before the client was rebooted, and the offending client session (now defunct) has been blacklisted. I'll check back later to see if the slow OPS get cleared from 'ceph status'. Regards, --Mike > From: Michael Thomas > Sent: 20 October 2020 23:48:36 > To: Frank Schilder; ceph-users@ceph.io > Subject: Re: [ceph-users] Re: multipl
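The osdmaptool check requested at the top of this message can be run like this (a sketch; pool id 7 and PG 7.39d are taken from this thread, the file names are arbitrary):

```shell
# Pull the cluster's current OSD map into a local file (read-only operation)
ceph osd getmap -o osd.map

# Dump the PG->OSD mappings the map computes for pool 7 (the fs data pool)
osdmaptool osd.map --test-map-pgs-dump --pool 7 > pool7-mappings.txt

# Compare the computed mapping for PG 7.39d against what the cluster reports
grep '7\.39d' pool7-mappings.txt
ceph pg dump pgs_brief 2>/dev/null | grep '^7\.39d'
```

If the two mappings disagree, the osd map and the pg dump are out of sync, which is exactly what this check is meant to rule out.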
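Since the pool ended up being removed with everything still running, a delete-and-recreate sequence along these lines may be useful for reference (a hedged sketch, not verified on this cluster; the mgr_devicehealth application tag matches Octopus defaults, but verify against your own cluster before running anything destructive):

```shell
# Allow pool deletion (disabled by default as a safety measure)
ceph config set mon mon_allow_pool_delete true

# Destructive: remove the damaged pool; the name must be given twice
ceph osd pool delete device_health_metrics device_health_metrics \
    --yes-i-really-really-mean-it

# Re-create it with a handful of PGs and tag it for the devicehealth module
ceph osd pool create device_health_metrics 10
ceph osd pool application enable device_health_metrics mgr_devicehealth

# Re-enable the safety, then restart the active MGR (e.g. via
# 'systemctl restart ceph-mgr@<id>') to flush any stale rados connection
ceph config set mon mon_allow_pool_delete false
```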
[ceph-users] Re: Question about expansion existing Ceph cluster - adding OSDs
There have been threads on exactly this. Might depend a bit on your ceph version. We are running mimic and have no issues doing: - set noout, norebalance, nobackfill - add all OSDs (with weight 1) - wait for peering to complete - unset all flags and let the rebalance loose Starting with nautilus there seem to be issues with this procedure; mainly, the peering phase can cause a collapse of the cluster. In your case, it sounds like you added the OSDs already. You should be able to do the following relatively safely: - set noout, norebalance, nobackfill - set weight of OSDs to 1 one by one and wait for peering to complete every time - unset all flags and let the rebalance loose I believe once the peering succeeds without crashes, the rebalancing will just work fine. You can easily control how much rebalancing is going on. I noted that ceph seems to have a strange concept of priority, though. I needed to gain capacity by adding OSDs, and ceph insisted on moving PGs off the fullest OSDs last, the opposite of what should happen. Thus, it took ages for additional capacity to become available, and the backfill_toofull warnings stayed the whole time. You can influence this to some degree by using force_recovery commands on PGs on the fullest OSDs. Best regards and good luck, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Kristof Coucke Sent: 21 October 2020 13:29:00 To: ceph-users@ceph.io Subject: [ceph-users] Question about expansion existing Ceph cluster - adding OSDs Hi, I have a cluster with 182 OSDs, this has been expanded to 282 OSDs. Some disks were near full. The new disks have been added with initial weight = 0. The original plan was to increase this slowly towards their full weight using the gentle reweight script. However, this is going way too slow and I'm also having issues now with "backfill_toofull". Can I just add all the OSDs with their full weight, or will I get a lot of issues when I'm doing that? 
I know that a lot of PGs will have to be re-placed, but increasing the weight slowly will take a year at the current speed. I'm already playing with the max backfill to increase the speed, but every time I increase the weight it will take a lot of time again... I can accept that there will be a performance decrease. Looking forward to your comments! Regards, Kristof ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
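The flag-based expansion procedure described above could be sketched as follows (illustrative OSD ids; note the thread's "weight 1" is shorthand, as crush weights are normally the disk size in TiB):

```shell
# 1. Freeze data movement while the new OSDs peer
ceph osd set noout
ceph osd set norebalance
ceph osd set nobackfill

# 2. Raise the already-added, weight-0 OSDs to their target crush weight
#    one at a time, waiting for peering to finish after each step
for id in 182 183 184; do                      # example OSD ids
    ceph osd crush reweight "osd.${id}" 1.0    # target weight per the thread
    # watch 'ceph status' here until all PGs have finished peering
done

# 3. Release the flags and let backfill run at a throttled rate
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset noout
ceph tell 'osd.*' injectargs '--osd_max_backfills=1'
```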
[ceph-users] Re: multiple OSD crash, unfound objects
Hi Michael, some quick thoughts. That you can create a pool with 1 PG is a good sign, the crush rule is OK. That pg query says it doesn't have PG 1.0 points in the right direction. There is an inconsistency in the cluster. This is also indicated by the fact that no upmaps seem to exist (the clean-up script was empty). With the osd map you extracted, you could check what the osd map believes the mapping of the PGs of pool 1 are: # osdmaptool osd.map --test-map-pgs-dump --pool 1 or if it also claims the PG does not exist. It looks like something went wrong during pool creation and you are not the only one having problems with this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . Sounds a lot like a bug in cephadm. In principle, it looks like the idea to delete and recreate the health metrics pool is a way forward. Please look at the procedure mentioned in the thread quoted above. Deletion of the pool there led to some crashes, and some surgery on some OSDs was necessary. However, in your case it might just work, because you redeployed the OSDs in question already - if I remember correctly. In order to do so cleanly, however, you will probably want to shut down all clients accessing this pool. Note that clients accessing the health metrics pool are not FS clients, so the mds cannot tell you anything about them. The only command that seems to list all clients is # ceph daemon mon.MON-ID sessions which needs to be executed on all mon hosts. On the other hand, you could also just go ahead and see if something crashes (an MGR module probably) or disable all MGR modules during this recovery attempt. I found some info that cephadm creates this pool and starts an MGR module. If you google "device_health_metrics pool" you should find descriptions of similar cases. It looks solvable. I will look at the incomplete PG issue. I hope this is just some PG tuning. 
At least pg query didn't complain :) The stuck MDS request could be an attempt to access an unfound object. It should be possible to locate the fs client and find out what it was trying to do. I see this sometimes when people are too impatient. They manage to trigger a race condition and an MDS operation gets stuck (there are MDS bugs and in my case it was an ls command that got stuck). Usually, evicting the client temporarily solves the issue (but tell the user :). Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 20 October 2020 23:48:36 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects On 10/20/20 1:18 PM, Frank Schilder wrote: > Dear Michael, > >>> Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an >>> OSD mapping? > > I meant here with crush rule replicated_host_nvme. Sorry, forgot. Seems to have worked fine: https://pastebin.com/PFgDE4J1 >> Yes, the OSD was still out when the previous health report was created. > > Hmm, this is odd. If this is correct, then it did report a slow op even > though it was out of the cluster: > >> from https://pastebin.com/3G3ij9ui: >> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons >> [osd.0,osd.41] have slow ops. > > Not sure what to make of that. It looks almost like you have a ghost osd.41. > > > I think (some of) the slow ops you are seeing are directed to the > health_metrics pool and can be ignored. If it is too annoying, you could try > to find out who runs the client with IDs client.7524484 and disable it. Might > be an MGR module. I'm also pretty certain that the slow ops are related to the health metrics pool, which is why I've been ignoring them. What I'm not sure about is whether re-creating the device_health_metrics pool will cause any problems in the ceph cluster. 
> Looking at the data you provided and also some older threads of yours > (https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I'm starting > to consider that we are looking at the fall-out of a past admin operation. A > possibility is that an upmap for PG 1.0 exists that conflicts with the crush > rule replicated_host_nvme and, hence, prevents the assignment of OSDs to PG > 1.0. For example, the upmap specifies HDDs, but the crush rule requires > NVMEs. The result is an empty set. So far I've been unable to locate the client with the ID 7524484. It's not showing up in the manager dashboard -> Filesystems page, nor in the output of 'ceph tell mds.ceph1 client ls'. I'm digging through the compressed logs for the past week to see if I can find the culprit. > I couldn't really find a simple command to list up-maps. The only > non-destructive way seems to be to extract the osdmap and create a clean-up > command file. The clea
[ceph-users] Re: multiple OSD crash, unfound objects
Dear Michael, > > Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an > > OSD mapping? I meant here with crush rule replicated_host_nvme. Sorry, forgot. > Yes, the OSD was still out when the previous health report was created. Hmm, this is odd. If this is correct, then it did report a slow op even though it was out of the cluster: > from https://pastebin.com/3G3ij9ui: > [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons > [osd.0,osd.41] have slow ops. Not sure what to make of that. It looks almost like you have a ghost osd.41. I think (some of) the slow ops you are seeing are directed to the health_metrics pool and can be ignored. If it is too annoying, you could try to find out who runs the client with ID client.7524484 and disable it. Might be an MGR module. Looking at the data you provided and also some older threads of yours (https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I'm starting to consider that we are looking at the fall-out of a past admin operation. A possibility is that an upmap for PG 1.0 exists that conflicts with the crush rule replicated_host_nvme and, hence, prevents the assignment of OSDs to PG 1.0. For example, the upmap specifies HDDs, but the crush rule requires NVMEs. The result is an empty set. I couldn't really find a simple command to list up-maps. The only non-destructive way seems to be to extract the osdmap and create a clean-up command file. The cleanup file should contain a command for every PG with an upmap. To check this, you can execute (see also https://docs.ceph.com/en/latest/man/8/osdmaptool/) # ceph osd getmap > osd.map # osdmaptool osd.map --upmap-cleanup cleanup.cmd If you do this, could you please post as usual the contents of cleanup.cmd? Also, with the OSD map of your cluster, you can simulate certain admin operations and check resulting PG mappings for pools and other things without having to touch the cluster; see https://docs.ceph.com/en/latest/man/8/osdmaptool/. 
To dig a little bit deeper, could you please post as usual the output of: - ceph pg 1.0 query - ceph pg 7.39d query It would also be helpful if you could post the decoded crush map. You can get the map as a txt-file as follows: # ceph osd getcrushmap -o crush-orig.bin # crushtool -d crush-orig.bin -o crush.txt and post the contents of file crush.txt. Did the slow MDS request complete by now? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 Contents of previous messages removed. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: multiple OSD crash, unfound objects
Dear Michael, this is a bit of a nut. I can't see anything obvious. I have two hypotheses that you might consider testing. 1) Problem with 1 incomplete PG. In the shadow hierarchy for your cluster I can see quite a lot of nodes like { "id": -135, "name": "node229~hdd", "type_id": 1, "type_name": "host", "weight": 0, "alg": "straw2", "hash": "rjenkins1", "items": [] }, I would have expected that hosts without a device of a certain device class are *excluded* completely from a tree instead of having weight 0. I'm wondering if this could lead to the crush algorithm fail in the way described here: https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon . This might be a long shot, but could you export your crush map and play with the tunables as described under this link to see if more tries lead to a valid mapping? Note that testing this is harmless and does not change anything on the cluster. The hypothesis here is that buckets with weight 0 are not excluded from drawing a-priori, but a-posteriori. If there are too many draws of an empty bucket, a mapping fails. Allowing more tries should then lead to success. We should at least rule out this possibility. 2) About the incomplete PG. I'm wondering if the problem is that the pool has exactly 1 PG. I don't have a test pool with Nautilus and cannot try this out. Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD mapping? If not, can you then increase pg_num and pgp_num to, say, 10 and see if this has any effect? I'm wondering here if there needs to be a minimum number >1 of PGs in a pool. Again, this is more about ruling out a possibility than expecting success. As an extension to this test, you could increase pg_num and pgp_num of the pool device_health_metrics to see if this has any effect. The crush rules and crush tree look OK to me. I can't really see why the missing OSDs are not assigned to the two PGs 1.0 and 7.39d. 
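Hypothesis 1 above can be tested entirely offline against an exported crush map (a sketch following the linked troubleshooting page; the rule id, replica count, and tries value are placeholders to adapt):

```shell
# Export the crush map; nothing below touches the cluster
ceph osd getcrushmap -o crush.bin

# List inputs for which crush rule 1 fails to produce a full mapping
crushtool -i crush.bin --test --show-bad-mappings --rule 1 --num-rep 8

# Raise the retry budget and re-test: if the bad mappings disappear,
# repeated draws of empty weight-0 buckets were exhausting the tries
crushtool -i crush.bin --set-choose-total-tries 100 -o crush-more-tries.bin
crushtool -i crush-more-tries.bin --test --show-bad-mappings --rule 1 --num-rep 8
```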
Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 16 October 2020 15:41:29 To: Michael Thomas; ceph-users@ceph.io Subject: [ceph-users] Re: multiple OSD crash, unfound objects Dear Michael, > Please mark OSD 41 as "in" again and wait for some slow ops to show up. I forgot. "wait for some slow ops to show up" ... and then what? Could you please go to the host of the affected OSD and look at the output of "ceph daemon osd.ID ops" or "ceph daemon osd.ID dump_historic_slow_ops" and check what type of operations get stuck? I'm wondering if it's administrative, like peering attempts. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 16 October 2020 15:09:20 To: Michael Thomas; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects Dear Michael, thanks for this initial work. I will need to look through the files you posted in more detail. In the meantime: Please mark OSD 41 as "in" again and wait for some slow ops to show up. As far as I can see, marking it "out" might have cleared hanging slow ops (there were 1000 before), but they then started piling up again. From the OSD log it looks like an operation that is sent to/from PG 1.0, which doesn't respond because it is inactive. Hence, getting PG 1.0 active should resolve this issue (later). It's a bit strange that I see slow ops for OSD 41 in the latest health detail (https://pastebin.com/3G3ij9ui). Was the OSD still out when this health report was created? I think we might have misunderstood my question 6. My question was whether or not each host bucket corresponds to a physical host and vice versa, that is, each physical host has exactly 1 host bucket. I'm asking because it is possible to have multiple host buckets assigned to a single physical host and this has implications on how to manage things. 
Coming back to PG 1.0 (the only PG in pool device_health_metrics as far as I can see), the problem is that it has no OSDs assigned. I need to look a bit longer at the data you uploaded to find out why. I can't see anything obvious. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Michael Thomas Sent: 16 October 2020 02:08:01 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects On 10/14/20 3:49 PM, Frank Schilder wrote: > Hi Michael, > > it doesn't look too bad. All degraded objects are due to the undersi
[ceph-users] Re: multiple OSD crash, unfound objects
Dear Michael, thanks for this initial work. I will need to look through the files you posted in more detail. In the meantime: Please mark OSD 41 as "in" again and wait for some slow ops to show up. As far as I can see, marking it "out" might have cleared hanging slow ops (there were 1000 before), but they then started piling up again. From the OSD log it looks like an operation that is sent to/from PG 1.0, which doesn't respond because it is inactive. Hence, getting PG 1.0 active should resolve this issue (later). It's a bit strange that I see slow ops for OSD 41 in the latest health detail (https://pastebin.com/3G3ij9ui). Was the OSD still out when this health report was created? I think we might have misunderstood my question 6. My question was whether or not each host bucket corresponds to a physical host and vice versa, that is, each physical host has exactly 1 host bucket. I'm asking because it is possible to have multiple host buckets assigned to a single physical host and this has implications on how to manage things. Coming back to PG 1.0 (the only PG in pool device_health_metrics as far as I can see), the problem is that it has no OSDs assigned. I need to look a bit longer at the data you uploaded to find out why. I can't see anything obvious. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 16 October 2020 02:08:01 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects On 10/14/20 3:49 PM, Frank Schilder wrote: > Hi Michael, > > it doesn't look too bad. All degraded objects are due to the undersized PG. > If this is an EC pool with m>=2, data is currently not in danger. > > I see a few loose ends to pick up, let's hope this is something simple. For > any of the below, before attempting the next step, please wait until all > induced recovery IO has completed before continuing. 
> > 1) Could you please paste the output of the following commands to pastebin > (bash syntax): > >ceph osd pool get device_health_metrics all https://pastebin.com/6D83mjsV >ceph osd pool get fs.data.archive.frames all https://pastebin.com/7XAaQcpC >ceph pg dump |& grep -i -e PG_STAT -e "^7.39d" https://pastebin.com/tBLaq63Q >ceph osd crush rule ls https://pastebin.com/6f5B778G >ceph osd erasure-code-profile ls https://pastebin.com/uhAaMH1c >ceph osd crush dump # this is a big one, please be careful with copy-paste > (see point 3 below) https://pastebin.com/u92D23jV > 2) I don't see any IO reported (neither user nor recovery). Could you please > confirm that the command outputs were taken during a zero-IO period? That's correct, there was no activity at this time. Access to the cephfs filesystem is very bursty, varying from completely idle to multiple GB/s (read). > 3) Something is wrong with osd.41. Can you check its health status with > smartctl? If it is reported healthy, give it one more clean restart. If the > slow ops do not disappear, it could be a disk fail that is not detected by > health monitoring. You could set it to "out" and see if the cluster recovers > to a healthy state (modulo the currently degraded objects) with no slow ops. > If so, I would replace the disk. smartctl reports no problems. osd.41 (and osd.0) was one of the original OSDs used for the device_health_metrics pool. Early on, before I knew better, I had removed this OSD (and osd.0) from the cluster, and the OSD ids got recycled when new disks were later added. This is when the slow ops on osd.0 and osd.41 started getting reported. On advice from another user on ceph-users, I updated my crush map to remap the device_health_metrics pool to a different set of OSDs (and the slow ops persisted). osd.0 usually also shows slow ops. I was a little surprised that it didn't when I took this snapshot, but now it does. I have now run 'ceph osd out 41', and the recovery I/O has finished. 
With the exception of one less OSD marked in, the output of 'ceph status' looks the same. The last few lines of the osd.41 logfile are here: https://pastebin.com/k06aArW4 How long does it take for ceph to clear the slow ops status? > 4) In the output of "df tree" node141 shows up twice. Could you confirm that > this is a copy-paste error or is this node indeed twice in the output? This > is easiest to see in the pastebin when switching to "raw" view. This was a copy/paste error. > 5) The crush tree contains an empty host bucket (node308). Please delete this > host bucket (ceph osd crush rm node308) for now and let me know if this > caused any data movements (recovery IO). This did not cause any data movement, according to 'ceph status'. > 6) The crush tree looks a bit
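The crush-map remapping Michael mentions (moving the device_health_metrics pool to a different set of OSDs) is normally done by pointing the pool at a different crush rule rather than editing the map by hand (a sketch; the rule name and device class are illustrative):

```shell
# Create a replicated rule that selects hosts from a given device class
ceph osd crush rule create-replicated rule-nvme default host nvme

# Point the pool at the new rule; its PGs remap and backfill to the new OSDs
ceph osd pool set device_health_metrics crush_rule rule-nvme
```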
[ceph-users] Re: multiple OSD crash, unfound objects
Hi Michael, it doesn't look too bad. All degraded objects are due to the undersized PG. If this is an EC pool with m>=2, data is currently not in danger. I see a few loose ends to pick up, let's hope this is something simple. For any of the below, before attempting the next step, please wait until all induced recovery IO has completed before continuing. 1) Could you please paste the output of the following commands to pastebin (bash syntax): ceph osd pool get device_health_metrics all ceph osd pool get fs.data.archive.frames all ceph pg dump |& grep -i -e PG_STAT -e "^7.39d" ceph osd crush rule ls ceph osd erasure-code-profile ls ceph osd crush dump # this is a big one, please be careful with copy-paste (see point 3 below) 2) I don't see any IO reported (neither user nor recovery). Could you please confirm that the command outputs were taken during a zero-IO period? 3) Something is wrong with osd.41. Can you check its health status with smartctl? If it is reported healthy, give it one more clean restart. If the slow ops do not disappear, it could be a disk fail that is not detected by health monitoring. You could set it to "out" and see if the cluster recovers to a healthy state (modulo the currently degraded objects) with no slow ops. If so, I would replace the disk. 4) In the output of "df tree" node141 shows up twice. Could you confirm that this is a copy-paste error or is this node indeed twice in the output? This is easiest to see in the pastebin when switching to "raw" view. 5) The crush tree contains an empty host bucket (node308). Please delete this host bucket (ceph osd crush rm node308) for now and let me know if this caused any data movements (recovery IO). 6) The crush tree looks a bit exotic. Do the nodes with a single OSD correspond to a physical host with 1 OSD disk? If not, could you please state how the host buckets are mapped onto physical hosts? 
7) In case there was a change to the health status, could you please include an updated "ceph health detail"? I don't expect to get the incomplete PG resolved with the above, but it will move some issues out of the way before proceeding. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 14 October 2020 20:52:10 To: Andreas John; ceph-users@ceph.io Subject: [ceph-users] Re: multiple OSD crash, unfound objects Hello, The original cause of the OSD instability has already been fixed. It was due to user jobs (via condor) consuming too much memory and causing the machine to swap. The OSDs didn't actually crash, but weren't responding in time and were being flagged as down. In most cases, the problematic OSD servers were also not responding on the console and had to be physically power cycled to recover. Since adding additional memory limits to user jobs, we have only had 1 or 2 unstable OSDs that were fixed by killing the remaining rogue user jobs. Regards, --Mike On 10/10/20 9:22 AM, Andreas John wrote: > Hello Mike, > > do your OSDs go down from time to time? I once had an issue with > unrecoverable objects, because I had only n+1 (size 2) redundancy and > ceph wasn't able to decide what's the correct copy of the object. In my > case there were half-deleted snapshots in one of the copies. I used > ceph-objectstore-tool to remove the "wrong" part. Did you check your OSD > logs? Do the OSDs go down with an obscure stacktrace (and maybe they are > restarted by systemd ...) > > rgds, > > j. > > > > On 09.10.20 22:33, Michael Thomas wrote: >> Hi Frank, >> >> That was a good tip. I was able to move the broken files out of the >> way and restore them for users. However, after 2 weeks I'm still left >> with unfound objects. Even more annoying, I now have 82k objects >> degraded (up from 74), which hasn't changed in over a week. 
>>
>> I'm ready to claim that the auto-repair capabilities of ceph are not
>> able to fix my particular issues, and will have to continue to
>> investigate alternate ways to clean this up, including a pg
>> export/import (as you suggested) and perhaps an mds backward scrub
>> (after testing in a junk pool first).
>>
>> I have other tasks I need to perform on the filesystem (removing OSDs,
>> adding new OSDs, increasing PG count), but I feel like I need to
>> address these degraded/lost objects before risking any more damage.
>>
>> One particular PG is in a curious state:
>>
>> 7.39d  82163  82165  2467341  3440607778070  0  2139
>>   active+recovery_unfound+undersized+degraded+remapped  23m
>>   50755'112549  50766:960500  [116,72,122,48,45,131,73,81]p116
>>   [71,109,99,48,45,90,73,NONE]p7
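To decode the last two columns of that pg dump row: they are the "up" and "acting" OSD sets with the primary appended after "p", and a NONE entry in the acting set is an EC shard that currently has no OSD serving it. A small sketch of a parser for that column format (the example set is adapted from the row above; the truncated primary is replaced by a made-up value):

```python
def parse_osd_set(s):
    """Parse a pg-dump OSD set like '[71,109,NONE]p71' into
    (list of OSD ids or None for NONE, primary OSD id).
    Sketch based on the column format shown in the message above."""
    body, primary = s.rsplit("]p", 1)
    ids = [None if t == "NONE" else int(t)
           for t in body.lstrip("[").split(",")]
    return ids, int(primary)

# Values adapted from the pg dump row above; primary is made up.
acting, primary = parse_osd_set("[71,109,99,48,45,90,73,NONE]p71")
missing_shards = [i for i, osd in enumerate(acting) if osd is None]
print(missing_shards)  # → [7]: shard index 7 has no acting OSD
```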
[ceph-users] Long heartbeat ping times
Dear all, occasionally I find messages like

Health check update: Long heartbeat ping times on front interface seen, longest is 1043.153 msec (OSD_SLOW_PING_TIME_FRONT)

in the cluster log. Unfortunately, I seem to be unable to find out which OSDs were affected (a posteriori). I cannot find related messages in any OSD log, and the messages I find in /var/log/messages do not contain IP addresses or OSD IDs. Is there a way to find out which OSDs/hosts were the problem after the health status is back to healthy? Thanks! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
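Until someone points to the proper place, the cluster log itself can at least be scraped for these events. A throwaway sketch that matches the message format quoted above (newer releases reportedly also expose per-OSD ping statistics via the OSD admin socket, e.g. a "dump_osd_network" command; check your version):

```python
import re

# Example line in the format quoted in the message above.
MSG = ("Health check update: Long heartbeat ping times on front interface "
       "seen, longest is 1043.153 msec (OSD_SLOW_PING_TIME_FRONT)")

def slow_ping(line):
    """Extract (interface, msec) from an OSD_SLOW_PING_TIME_* cluster
    log line; returns None if the line doesn't match."""
    m = re.search(r"on (\w+) interface seen, longest is ([\d.]+) msec", line)
    return (m.group(1), float(m.group(2))) if m else None

print(slow_ping(MSG))  # → ('front', 1043.153)
```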
[ceph-users] Re: cephfs tag not working
There used to be / is a bug in ceph fs commands when using data pools. If you enable the application cephfs on a pool explicitly before running "ceph fs add_data_pool", the fs-tag is not applied. Maybe it's that? There is an older thread on the topic in the users list and also a fix/workaround. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Eugen Block Sent: 01 October 2020 15:33:53 To: ceph-users@ceph.io Subject: [ceph-users] Re: cephfs tag not working

Hi, I have a one-node cluster (also 15.2.4) for testing purposes and just created a cephfs with the tag, and it works for me. But my node is also its own client, so there's that. And it was installed with 15.2.4, no upgrade.

> For the 2nd, mds works, files can be created or removed, but client
> read/write (native client, kernel version 5.7.4) fails with I/O
> error, so osd part does not seem to be working properly.

You mean it works if you mount it from a different host (within the cluster maybe) with the new client's key, but it doesn't work with the designated clients? I'm not sure about the OSD part, since the other syntax seems to work, you say. Can you share more details about the error? The mount on the clients works but they can't read/write? Regards, Eugen

Zitat von Andrej Filipcic:
> Hi,
>
> on octopus 15.2.4 I have an issue with cephfs tag auth. The
> following works fine:
>
> client.f9desktop
>     key:
>     caps: [mds] allow rw
>     caps: [mon] allow r
>     caps: [osd] allow rw pool=cephfs_data, allow rw
>     pool=ssd_data, allow rw pool=fast_data, allow rw pool=arich_data,
>     allow rw pool=ecfast_data
>
> but this one does not:
>
> client.f9desktopnew
>     key:
>     caps: [mds] allow rw
>     caps: [mon] allow r
>     caps: [osd] allow rw tag cephfs data=cephfs
>
> For the 2nd, mds works, files can be created or removed, but client
> read/write (native client, kernel version 5.7.4) fails with I/O
> error, so osd part does not seem to be working properly.
>
> Any clues what can be wrong?
the cephfs was created in jewel...
>
> Another issue is: if osd caps are updated (adding data pool), then
> some clients refresh the caps, but most of them do not, and the only
> way to refresh it is to remount the filesystem. working tag would
> solve it.
>
> Best regards,
> Andrej
>
> --
> _
> prof. dr. Andrej Filipcic, E-mail: andrej.filip...@ijs.si
> Department of Experimental High Energy Physics - F9
> Jozef Stefan Institute, Jamova 39, P.o.Box 3000
> SI-1001 Ljubljana, Slovenia
> Tel.: +386-1-477-3674   Fax: +386-1-477-3166
> -
[ceph-users] Re: hdd pg's migrating when converting ssd class osd's
Dear Marc and Nico, I think this might be the time to file a tracker report. As far as I can see, your set-up is as it should be; OSD operations on your clusters should behave exactly as on ours. I don't know of any other configuration option that influences placement calculation. The problems you (Nico in particular) describe seem serious enough. I have also heard other reports of admin operations killing a cluster starting with Nautilus, most notably this one: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/W4M5XQRDBLXFGJGDYZALG6TQ4QBVGGAJ/#4KY3OW7PTOODLQVYKARZLGE5FZUNQOER . Maybe there are regressions in crush placement computations (and elsewhere)? I will add this to the list of tests before considering an upgrade from mimic. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Marc Roos Sent: 30 September 2020 22:26:11 To: eblock; Frank Schilder Cc: ceph-users; nico.schottelius Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

I am not sure, but it looks like this remapping at hdd's is not being done when adding back the same ssd osd.
[ceph-users] Re: hdd pg's migrating when converting ssd class osd's
Hi Nico and Marc, your crush trees indeed look like they have been converted properly to using device classes already. Changing something within one device class should not influence placement in another. Maybe I'm overlooking something? The only other place I know of where such a mix-up could occur is the crush rules. Do your rules look like this:

{
    "rule_id": 5,
    "rule_name": "sr-rbd-data-one",
    "ruleset": 5,
    "type": 3,
    "min_size": 3,
    "max_size": 8,
    "steps": [
        { "op": "set_chooseleaf_tries", "num": 50 },
        { "op": "set_choose_tries", "num": 1000 },
        { "op": "take", "item": -185, "item_name": "ServerRoom~rbd_data" },
        { "op": "chooseleaf_indep", "num": 0, "type": "host" },
        { "op": "emit" }
    ]
}

Notice the "~rbd_data" qualifier. It is important that the device class is specified at the root selection. I'm really surprised that with your crush tree you observe changes in SSD implying changes in HDD placements. I was really rough on our mimic cluster with moving disks in and out and between servers, and I have never seen this problem. Could it be a regression in nautilus? Is the auto-balancer interfering?

> we recently also noticed that rebuilding one pool ("ssd")
> influenced speed on other pools, which was unexpected.

Could this be something else? Was PG/object placement influenced, or performance only? I'm asking because during one of our service windows we observed something very strange. We have a multi-location cluster with pools with completely isolated storage devices in different locations. On one of these sub-clusters we run a ceph fs. During maintenance we needed to shut down the ceph fs. When our admin issued the umount command (ca. 1500 clients), we noticed that RBD pools seemed to have problems even though there is absolutely no overlap in disks (disjoint crush trees); they are not even in the same physical location and sit on their own switches. The fs and RBD only share the MONs/MGRs.
I'm not entirely sure if we observed something real or only a network blip. However, nagios went crazy on our VM environment for a few minutes. Maybe there is another issue that causes unexpected cross-dependencies that affect performance? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Marc Roos Sent: 30 September 2020 14:59:50 To: eblock; Frank Schilder Cc: ceph-users; nico.schottelius Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

Hi Frank, thanks, this 'root default' indeed looks different with these 0 there. I have also uploaded mine[1] because it looks very similar to Nico's. I guess his hdd pg's can also start moving in some occasions. Thanks for the 'crushtool reclassify' hint, I guess I missed this in the release notes or so. [1] https://pastebin.com/PFx0V3S7

-Original Message- To: Eugen Block Cc: Marc Roos; ceph-users Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

This is how my crush tree including shadow hierarchies looks like (a mess :): https://pastebin.com/iCLbi4Up Every device class has its own tree. Starting with mimic, this is automatic when creating new device classes. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Eugen Block Sent: 30 September 2020 08:43:47 To: Frank Schilder Cc: Marc Roos; ceph-users Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

Interesting, I also did this test on an upgraded cluster (L to N). I'll repeat the test on a native Nautilus to see it for myself.

Zitat von Frank Schilder
> Somebody on this list posted a script that can convert pre-mimic crush
> trees with buckets for different types of devices to crush trees with
> device classes with minimal data movement (trying to maintain IDs as
> much as possible). Don't have a thread name right now, but could try
> to find it tomorrow.
>
> I can check tomorrow how our crush tree unfolds.
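The check for the "~class" qualifier described above can be automated against "ceph osd crush rule dump". A sketch, run here on an embedded sample instead of live output; the steps/op/item_name field names follow the rule dump shown in the message above:

```python
def rules_without_device_class(rule_dump):
    """Return names of crush rules whose 'take' step selects a plain
    bucket (no '~class' qualifier), i.e. rules that can mix device
    classes. rule_dump is the parsed 'ceph osd crush rule dump' list."""
    flagged = []
    for rule in rule_dump:
        takes = [s for s in rule["steps"] if s["op"] == "take"]
        if any("~" not in s["item_name"] for s in takes):
            flagged.append(rule["rule_name"])
    return flagged

# Sample standing in for 'ceph osd crush rule dump' output:
sample = [
    {"rule_name": "sr-rbd-data-one",
     "steps": [{"op": "take", "item": -185,
                "item_name": "ServerRoom~rbd_data"},
               {"op": "chooseleaf_indep", "num": 0, "type": "host"}]},
    {"rule_name": "replicated_ruleset",
     "steps": [{"op": "take", "item": -1, "item_name": "default"},
               {"op": "chooseleaf_firstn", "num": 0, "type": "host"}]},
]
print(rules_without_device_class(sample))  # → ['replicated_ruleset']
```

Any rule this flags takes from a plain root and will therefore see its placement recomputed when weights change in either device class.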
> Basically, for every device class there is a full copy (shadow
> hierarchy) for each device class with its own weights etc.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> From: Marc Roos
> Sent: 29 September 2020 22:19:33
> To: eblock; Frank Schilder &
[ceph-users] Re: hdd pg's migrating when converting ssd class osd's
> To me it looks like the structure of both maps is pretty much the same -
> or am I mistaken?

Yes, but you are not Marc Roos. Do you work on the same cluster or do you observe the same problem? In any case, here is a thread pointing to the crush tree/rule conversion I mentioned: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/675QZ2JXXX4RPRNPK2NL7FB5MVANKUB2/#675QZ2JXXX4RPRNPK2NL7FB5MVANKUB2 The tool is "crushtool reclassify"; it is recommended when upgrading from luminous to a newer release, to convert crush rules to use device classes. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Nico Schottelius Sent: 30 September 2020 09:12:49 To: Frank Schilder Cc: Eugen Block; Marc Roos; ceph-users@ceph.io Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

Hey Frank, I uploaded our kraken-created and nautilus-upgraded crush map at [0]. To me it looks like the structure of both maps is pretty much the same - or am I mistaken? Best regards, Nico [0] https://www.nico.schottelius.org/temp/ceph-shadowtree20200930

Frank Schilder writes:
> This is how my crush tree including shadow hierarchies looks like (a mess :):
> https://pastebin.com/iCLbi4Up
>
> Every device class has its own tree. Starting with mimic, this is automatic
> when creating new device classes.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> From: Eugen Block
> Sent: 30 September 2020 08:43:47
> To: Frank Schilder
> Cc: Marc Roos; ceph-users
> Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class
> osd's
>
> Interesting, I also did this test on an upgraded cluster (L to N).
> I'll repeat the test on a native Nautilus to see it for myself.
> > > Zitat von Frank Schilder : > >> Somebody on this list posted a script that can convert pre-mimic >> crush trees with buckets for different types of devices to crush >> trees with device classes with minimal data movement (trying to >> maintain IDs as much as possible). Don't have a thread name right >> now, but could try to find it tomorrow. >> >> I can check tomorrow how our crush tree unfolds. Basically, for >> every device class there is a full copy (shadow hierarchy) for each >> device class with its own weights etc. >> >> Best regards, >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> >> >> From: Marc Roos >> Sent: 29 September 2020 22:19:33 >> To: eblock; Frank Schilder >> Cc: ceph-users >> Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd >> class osd's >> >> Yes correct this is coming from Luminous or maybe even Kraken. How does >> a default crush tree look like in mimic or octopus? Or is there some >> manual how to bring this to the new 'default'? >> >> >> -Original Message- >> Cc: ceph-users >> Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd >> class osd's >> >> Are these crush maps inherited from pre-mimic versions? I have >> re-balanced SSD and HDD pools in mimic (mimic deployed) where one device >> class never influenced the placement of the other. I have mixed hosts >> and went as far as introducing rbd_meta, rbd_data and such classes to >> sub-divide even further (all these devices have different perf specs). >> This worked like a charm. When adding devices of one class, only pools >> in this class were ever affected. >> >> As far as I understand, starting with mimic, every shadow class defines >> a separate tree (not just leafs/OSDs). Thus, device classes are >> independent of each other. 
>> >> >> >> >> Sent: 29 September 2020 20:54:48 >> To: eblock >> Cc: ceph-users >> Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class >> osd's >> >> Yes correct, hosts have indeed both ssd's and hdd's combined. Is this >> not more of a bug then? I would assume the goal of using device classes >> is that you separate these and one does not affect the other, even the >> host weight of the ssd and hdd class are already available. The >> algorithm should just use that instead of the weight of the whole host. >> Or is there some specific use case, where these classes combined is >> required? >> >> >> -Original Message- >> Cc: ceph-users >> Subject: *SPAM* Re: [ceph-users]
[ceph-users] Re: hdd pg's migrating when converting ssd class osd's
This is how my crush tree including shadow hierarchies looks like (a mess :): https://pastebin.com/iCLbi4Up Every device class has its own tree. Starting with mimic, this is automatic when creating new device classes. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: 30 September 2020 08:43:47 To: Frank Schilder Cc: Marc Roos; ceph-users Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's Interesting, I also did this test on an upgraded cluster (L to N). I'll repeat the test on a native Nautilus to see it for myself. Zitat von Frank Schilder : > Somebody on this list posted a script that can convert pre-mimic > crush trees with buckets for different types of devices to crush > trees with device classes with minimal data movement (trying to > maintain IDs as much as possible). Don't have a thread name right > now, but could try to find it tomorrow. > > I can check tomorrow how our crush tree unfolds. Basically, for > every device class there is a full copy (shadow hierarchy) for each > device class with its own weights etc. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Marc Roos > Sent: 29 September 2020 22:19:33 > To: eblock; Frank Schilder > Cc: ceph-users > Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd > class osd's > > Yes correct this is coming from Luminous or maybe even Kraken. How does > a default crush tree look like in mimic or octopus? Or is there some > manual how to bring this to the new 'default'? > > > -Original Message- > Cc: ceph-users > Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd > class osd's > > Are these crush maps inherited from pre-mimic versions? I have > re-balanced SSD and HDD pools in mimic (mimic deployed) where one device > class never influenced the placement of the other. 
I have mixed hosts > and went as far as introducing rbd_meta, rbd_data and such classes to > sub-divide even further (all these devices have different perf specs). > This worked like a charm. When adding devices of one class, only pools > in this class were ever affected. > > As far as I understand, starting with mimic, every shadow class defines > a separate tree (not just leafs/OSDs). Thus, device classes are > independent of each other. > > > > > Sent: 29 September 2020 20:54:48 > To: eblock > Cc: ceph-users > Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class > osd's > > Yes correct, hosts have indeed both ssd's and hdd's combined. Is this > not more of a bug then? I would assume the goal of using device classes > is that you separate these and one does not affect the other, even the > host weight of the ssd and hdd class are already available. The > algorithm should just use that instead of the weight of the whole host. > Or is there some specific use case, where these classes combined is > required? > > > -Original Message- > Cc: ceph-users > Subject: *SPAM* Re: [ceph-users] Re: hdd pg's migrating when > converting ssd class osd's > > They're still in the same root (default) and each host is member of both > device-classes, I guess you have a mixed setup (hosts c01/c02 have both > HDDs and SSDs)? I don't think this separation is enough to avoid > remapping even if a different device-class is affected (your report > confirms that). > > Dividing the crush tree into different subtrees might help here but I'm > not sure if that's really something you need. You might also just deal > with the remapping as long as it doesn't happen too often, I guess. On > the other hand, if your setup won't change (except adding more OSDs) you > might as well think about a different crush tree. It really depends on > your actual requirements. 
> > We created two different subtrees when we got new hardware and it helped > us a lot moving the data only once to the new hardware avoiding multiple > remappings, now the older hardware is our EC environment except for some > SSDs on those old hosts that had to stay in the main subtree. So our > setup is also very individual but it works quite nice. > :-) > > > Zitat von : > >> I have practically a default setup. If I do a 'ceph osd crush tree >> --show-shadow' I have a listing like this[1]. I would assume from the >> hosts being listed within the default~ssd and default~hdd, they are >> separate (enough)? >> >> >> [1] >> root default~ssd >> host c01~ssd >> .. >> ..
[ceph-users] Re: hdd pg's migrating when converting ssd class osd's
Somebody on this list posted a script that can convert pre-mimic crush trees with buckets for different types of devices to crush trees with device classes with minimal data movement (trying to maintain IDs as much as possible). Don't have a thread name right now, but could try to find it tomorrow. I can check tomorrow how our crush tree unfolds. Basically, for every device class there is a full copy (shadow hierarchy) for each device class with its own weights etc. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Marc Roos Sent: 29 September 2020 22:19:33 To: eblock; Frank Schilder Cc: ceph-users Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's Yes correct this is coming from Luminous or maybe even Kraken. How does a default crush tree look like in mimic or octopus? Or is there some manual how to bring this to the new 'default'? -Original Message- Cc: ceph-users Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's Are these crush maps inherited from pre-mimic versions? I have re-balanced SSD and HDD pools in mimic (mimic deployed) where one device class never influenced the placement of the other. I have mixed hosts and went as far as introducing rbd_meta, rbd_data and such classes to sub-divide even further (all these devices have different perf specs). This worked like a charm. When adding devices of one class, only pools in this class were ever affected. As far as I understand, starting with mimic, every shadow class defines a separate tree (not just leafs/OSDs). Thus, device classes are independent of each other. Sent: 29 September 2020 20:54:48 To: eblock Cc: ceph-users Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's Yes correct, hosts have indeed both ssd's and hdd's combined. Is this not more of a bug then? 
I would assume the goal of using device classes is that you separate these and one does not affect the other, even the host weight of the ssd and hdd class are already available. The algorithm should just use that instead of the weight of the whole host. Or is there some specific use case, where these classes combined is required? -Original Message- Cc: ceph-users Subject: *SPAM* Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's They're still in the same root (default) and each host is member of both device-classes, I guess you have a mixed setup (hosts c01/c02 have both HDDs and SSDs)? I don't think this separation is enough to avoid remapping even if a different device-class is affected (your report confirms that). Dividing the crush tree into different subtrees might help here but I'm not sure if that's really something you need. You might also just deal with the remapping as long as it doesn't happen too often, I guess. On the other hand, if your setup won't change (except adding more OSDs) you might as well think about a different crush tree. It really depends on your actual requirements. We created two different subtrees when we got new hardware and it helped us a lot moving the data only once to the new hardware avoiding multiple remappings, now the older hardware is our EC environment except for some SSDs on those old hosts that had to stay in the main subtree. So our setup is also very individual but it works quite nice. :-) Zitat von : > I have practically a default setup. If I do a 'ceph osd crush tree > --show-shadow' I have a listing like this[1]. I would assume from the > hosts being listed within the default~ssd and default~hdd, they are > separate (enough)? > > > [1] > root default~ssd > host c01~ssd > .. > .. > host c02~ssd > .. > root default~hdd > host c01~hdd > .. > host c02~hdd > .. 
> root default > > > > > -Original Message- > To: ceph-users@ceph.io > Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class > osd's > > Are all the OSDs in the same crush root? I would think that since the > crush weight of hosts change as soon as OSDs are out it impacts the > whole crush tree. If you separate the SSDs from the HDDs logically (e.g. > different bucket type in the crush tree) the ramapping wouldn't affect > the HDDs. > > > > >> I have been converting ssd's osd's to dmcrypt, and I have noticed >> that > >> pg's of pools are migrated that should be (and are?) on hdd class. >> >> On a healthy ok cluster I am getting, when I set the crush reweight >> to > >> 0.0 of a ssd osd this: >> >> 17.35 10415 00 9907 0 >> 36001743890 0 0 3045 3045 >> active+remapped+backfilling 2020-09-27 12:55:49.093054 >> active+remapp
[ceph-users] Re: hdd pg's migrating when converting ssd class osd's
Are these crush maps inherited from pre-mimic versions? I have re-balanced SSD and HDD pools in mimic (mimic deployed) where one device class never influenced the placement of the other. I have mixed hosts and went as far as introducing rbd_meta, rbd_data and such classes to sub-divide even further (all these devices have different perf specs). This worked like a charm. When adding devices of one class, only pools in this class were ever affected. As far as I understand, starting with mimic, every shadow class defines a separate tree (not just leafs/OSDs). Thus, device classes are independent of each other. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Marc Roos Sent: 29 September 2020 20:54:48 To: eblock Cc: ceph-users Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's Yes correct, hosts have indeed both ssd's and hdd's combined. Is this not more of a bug then? I would assume the goal of using device classes is that you separate these and one does not affect the other, even the host weight of the ssd and hdd class are already available. The algorithm should just use that instead of the weight of the whole host. Or is there some specific use case, where these classes combined is required? -Original Message- Cc: ceph-users Subject: *SPAM* Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's They're still in the same root (default) and each host is member of both device-classes, I guess you have a mixed setup (hosts c01/c02 have both HDDs and SSDs)? I don't think this separation is enough to avoid remapping even if a different device-class is affected (your report confirms that). Dividing the crush tree into different subtrees might help here but I'm not sure if that's really something you need. You might also just deal with the remapping as long as it doesn't happen too often, I guess. 
On the other hand, if your setup won't change (except adding more OSDs) you might as well think about a different crush tree. It really depends on your actual requirements. We created two different subtrees when we got new hardware and it helped us a lot moving the data only once to the new hardware avoiding multiple remappings, now the older hardware is our EC environment except for some SSDs on those old hosts that had to stay in the main subtree. So our setup is also very individual but it works quite nice. :-) Zitat von : > I have practically a default setup. If I do a 'ceph osd crush tree > --show-shadow' I have a listing like this[1]. I would assume from the > hosts being listed within the default~ssd and default~hdd, they are > separate (enough)? > > > [1] > root default~ssd > host c01~ssd > .. > .. > host c02~ssd > .. > root default~hdd > host c01~hdd > .. > host c02~hdd > .. > root default > > > > > -Original Message- > To: ceph-users@ceph.io > Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class > osd's > > Are all the OSDs in the same crush root? I would think that since the > crush weight of hosts change as soon as OSDs are out it impacts the > whole crush tree. If you separate the SSDs from the HDDs logically (e.g. > different bucket type in the crush tree) the ramapping wouldn't affect > the HDDs. > > > > >> I have been converting ssd's osd's to dmcrypt, and I have noticed >> that > >> pg's of pools are migrated that should be (and are?) on hdd class. 
>> >> On a healthy ok cluster I am getting, when I set the crush reweight >> to > >> 0.0 of a ssd osd this: >> >> 17.35 10415 00 9907 0 >> 36001743890 0 0 3045 3045 >> active+remapped+backfilling 2020-09-27 12:55:49.093054 >> active+remapped+83758'20725398 >> 83758:100379720 [8,14,23] 8 [3,14,23] 3 >> 83636'20718129 2020-09-27 00:58:07.098096 83300'20689151 2020-09-24 >> 21:42:07.385360 0 >> >> However osds 3,14,23,8 are all hdd osd's >> >> Since this is a cluster from Kraken/Luminous, I am not sure if the >> device class of the replicated_ruleset[1] was set when the pool 17 >> was > >> created. >> Weird thing is that all pg's of this pool seem to be on hdd osd[2] >> >> Q. How can I display the definition of 'crush_rule 0' at the time of >> the pool creation? (To be sure it had already this device class hdd >> configured) >> >> >> >> [1] >> [@~]# ceph osd pool ls detail | grep 'pool 17' >> pool 17 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash >> rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 83712 >> f
[ceph-users] Re: samba vfs_ceph: client_mds_namespace not working?
Hi Stefan, thanks for your answer. I think the deprecated option is still supported, and I found something else - I will update to the new option though. On the ceph side, I now see in the log:

client session with non-allowable root '/' denied (client.31382084 192.168.48.135:0/2576875769)

It looks like the path option of vfs_ceph is not passed on correctly. I neither tried nor allowed mounting the root itself, only a sub-directory. It looks like one of the combinations I tested works, but the path into the ceph fs is not used. The option I use is

path = /shares/FOLDER-NAME

and this should show up in the client session as root '/shares/FOLDER-NAME'. This is starting to look like a bug in vfs_ceph.c. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Stefan Kooman Sent: 23 September 2020 11:49:29 To: Frank Schilder; ceph-users Subject: Re: [ceph-users] samba vfs_ceph: client_mds_namespace not working?

On 2020-09-23 11:00, Frank Schilder wrote:
> Dear all,
>
> maybe someone has experienced this before. We are setting up a SAMBA gateway
> and would like to use the vfs_ceph module. In case of several file systems
> one needs to choose an mds namespace. There is an option in ceph.conf:
>
> client mds namespace = CEPH-FS-NAME
>
> Unfortunately, it seems not to work. I tried it in all possible versions, in
> [global] and [client], with and without "client" at the beginning, to no
> avail. I either get a time out or an error. I also found the libcephfs
> function

In ceph/src/common/options.cc I found this:

Option("client_fs", Option::TYPE_STR, Option::LEVEL_ADVANCED)
.set_flag(Option::FLAG_STARTUP)
.set_default("")
.set_description("CephFS file system name to mount")
.set_long_description("Use this with ceph-fuse, or with any process "
    "that uses libcephfs. Programs using libcephfs may also pass "
    "the filesystem name into mount(), which will override this setting. "
    "If no filesystem name is given in mount() or this setting, the default "
    "filesystem will be mounted (usually the first created)."),

/* Alias for client_fs. Deprecated */
Option("client_mds_namespace", Option::TYPE_STR, Option::LEVEL_DEV)
.set_flag(Option::FLAG_STARTUP)
.set_default(""),

So client_mds_namespace is deprecated, and maybe even removed? Does it work if you specify "client_fs"? Gr. Stefan
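Putting the pieces of this thread together, the relevant configuration would look roughly like this (CEPH-FS-NAME, FOLDER-NAME, USER-NAME and the share name are placeholders; whether ceph:user_id wants the "client." prefix is exactly what is being probed elsewhere in this thread, so treat this as a sketch):

```ini
# ceph.conf on the SAMBA gateway
[client]
    client_fs = CEPH-FS-NAME       ; replaces the deprecated client_mds_namespace

# smb.conf share using the vfs_ceph module
[SHARE-NAME]
    vfs objects = ceph
    path = /shares/FOLDER-NAME     ; path inside the ceph fs, relative to its root
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = USER-NAME
```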
[ceph-users] Re: Documentation broken
Hi Lenz, thanks for that, this should do. Please retain the copy until all is migrated :) Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Lenz Grimmer Sent: 23 September 2020 10:55:13 To: ceph-users@ceph.io Subject: [ceph-users] Re: Documentation broken Hi Frank, On 9/22/20 4:30 PM, Frank Schilder wrote: > during the migration of documentation, would it be possible to make > the old documentation available somehow? A lot of pages are broken > and I can't access the documentation for mimic at all any more. > > Is there an archive or something similar? The wayback machine has an online copy from May this year: https://web.archive.org/web/20191226012841/https://docs.ceph.com/docs/mimic/ Alternatively, all previous versions of the docs are of course stored in the git repo (but admittedly not that easy to browse/read): https://github.com/ceph/ceph/tree/mimic/doc Hope that helps, Lenz -- SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg GF: Felix Imendörffer, HRB 36809 (AG Nürnberg) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: samba vfs_ceph: client_mds_namespace not working?
Update: setting "ceph fs set-default CEPH-FS-NAME" allows to do a kernel fs mount without providing the mds_namespace mount option, but the vfs_ceph module still fails with either cephwrap_connect: [CEPH] Error return: Operation not permitted or cephwrap_connect: [CEPH] Error return: Operation not supported depending on whether I use ceph:user_id = USER-NAME or ceph:user_id = client.USER-NAME I guess the second user spec is correct as the first error message indicates an auth problem. On the client side I see the same message in both cases: tree connect failed: NT_STATUS_UNSUCCESSFUL Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Frank Schilder Sent: 23 September 2020 11:00:50 To: ceph-users Subject: [ceph-users] samba vfs_ceph: client_mds_namespace not working? Dear all, maybe someone has experienced this before. We are setting up a SAMBA gateway and would like to use the vfs_ceph module. In case of several file systems one needs to choose an mds namespace. There is an option in ceph.conf: client mds namespace = CEPH-FS-NAME Unfortunately, it seems not to work. I tried it in all possible versions, in [global] and [client], with and without "client" at the beginning, to no avail. I either get a time out or an error. I also found the libcephfs function ceph_select_filesystem(cmount, CEPH-FS-NAME) added it to vfs_ceph.c just before the ceph_mount with the same result, I get an error (operation not permitted). Does anyone know how to get this to work? And, yes, I tested an ordinary kernel fs mount with the credentials for the ceph client without problems. I can't access any documentation on the libcephfs api, I always get a page not found error. My last resort is now to ceph fs set-default CEPH-FS-NAME to the fs to be used and live with the implied restrictions and ugliness. 
Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14
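For anyone landing on this thread later, a minimal smb.conf share sketch for vfs_ceph (untested here; the vfs_ceph man page documents ceph:config_file and ceph:user_id, with the id given without the "client." prefix — exactly the ambiguity discussed above; the share name and path are placeholders):

```ini
[cephfs]
    path = /
    vfs objects = ceph
    ; point libcephfs at the cluster configuration
    ceph:config_file = /etc/ceph/ceph.conf
    ; per the man page: id without the "client." prefix
    ceph:user_id = samba
    ; commonly recommended for vfs_ceph shares
    kernel share modes = no
```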
[ceph-users] samba vfs_ceph: client_mds_namespace not working?
Dear all, maybe someone has experienced this before. We are setting up a SAMBA gateway and would like to use the vfs_ceph module. In case of several file systems one needs to choose an mds namespace. There is an option in ceph.conf: client mds namespace = CEPH-FS-NAME Unfortunately, it seems not to work. I tried it in all possible versions, in [global] and [client], with and without "client" at the beginning, to no avail. I either get a time out or an error. I also found the libcephfs function ceph_select_filesystem(cmount, CEPH-FS-NAME) and added it to vfs_ceph.c just before the ceph_mount, with the same result: I get an error (operation not permitted). Does anyone know how to get this to work? And, yes, I tested an ordinary kernel fs mount with the credentials for the ceph client without problems. I can't access any documentation on the libcephfs api, I always get a page not found error. My last resort is now to ceph fs set-default CEPH-FS-NAME to the fs to be used and live with the implied restrictions and ugliness. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14
[ceph-users] Re: Unknown PGs after osd move
No, the recipe I gave was for trying to recover healthy status of all PGs in the current situation. I would avoid moving OSDs at all costs, because it will always imply rebalancing. Any change to the crush map changes how PGs are hashed onto OSDs, which in turn triggers a rebalancing. If moving OSDs cannot be avoided, I usually do: - evacuate OSDs that need to move - move empty (!) OSDs to new location - let data move back onto OSDs There are other ways of doing it, with their own pros and cons. For example, if your client load allows high-bandwidth rebuild operations, you can also - shut down OSDs that need to move (make sure you don't shut down too many from different failure domains at the same time) - let the remaining OSDs rebuild the missing data - after health is back to OK, move OSDs and start up The second way is usually faster, but has the drawback that new writes will go to less redundant storage for a while. The first method takes longer, but there is no redundancy degradation along the way. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Nico Schottelius Sent: 22 September 2020 22:13:49 To: Frank Schilder Cc: Nico Schottelius; Andreas John; ceph-users@ceph.io Subject: Re: [ceph-users] Re: Unknown PGs after osd move Hey Frank, Frank Schilder writes: >> > Is the crush map aware about that? >> >> Yes, it correctly shows the osds at serve8 (previously server15). >> >> > I didn't ever try that, but don't you need to cursh move it? >> >> I originally imagined this, too. But as soon as the osd starts on a new >> server it is automatically put into the serve8 bucket. > It does not work like this, unfortunately. If you physically move > disks to a new server without "informing ceph" in advance, that is, > crush move the OSDs while they are up, ceph loses placement > information. 
You can post-repair such a situation by temporarily > "crush moving" (software move, not hardware move) the OSDs back to > their previous host buckets, wait for peering to complete, and then > "crush move" them to their new location again. That is good to know. So in theory: - crush move osd to a different server bucket - shutdown osd - move physically to another server - no rebalancing needed Should do the job? It won't accept today's rebalance, but it would be good to have a sane way for the future. Cheers, Nico -- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
[ceph-users] Re: Unknown PGs after osd move
> > Is the crush map aware about that? > > Yes, it correctly shows the osds at serve8 (previously server15). > > > I didn't ever try that, but don't you need to cursh move it? > > I originally imagined this, too. But as soon as the osd starts on a new > server it is automatically put into the serve8 bucket. It does not work like this, unfortunately. If you physically move disks to a new server without "informing ceph" in advance, that is, crush move the OSDs while they are up, ceph loses placement information. You can post-repair such a situation by temporarily "crush moving" (software move, not hardware move) the OSDs back to their previous host buckets, wait for peering to complete, and then "crush move" them to their new location again. Do not restart OSDs during this process or while rebalancing of misplaced objects is going on. There is a long-standing issue that causes placement information to be lost again and one would need to repeat the procedure. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Nico Schottelius Sent: 22 September 2020 21:14:07 To: Andreas John Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Unknown PGs after osd move Hey Andreas, Andreas John writes: > Hello, > > On 22.09.20 20:45, Nico Schottelius wrote: >> Hello, >> >> after having moved 4 ssds to another host (+ the ceph tell hanging issue >> - see previous mail), we ran into 241 unknown pgs: > > You mean, that you re-seated the OSDs into another chassis/host? That is correct. > Is the crush map aware about that? Yes, it correctly shows the osds at serve8 (previously server15). > I didn't ever try that, but don't you need to cursh move it? I originally imagined this, too. But as soon as the osd starts on a new server it is automatically put into the serve8 bucket. Cheers, Nico -- Modern, affordable, Swiss Virtual Machines. 
Visit www.datacenterlight.ch
[ceph-users] Documentation broken
Hi all, during the migration of documentation, would it be possible to make the old documentation available somehow? A lot of pages are broken and I can't access the documentation for mimic at all any more. Is there an archive or something similar? Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14
[ceph-users] Re: Setting up a small experimental CEPH network
Hi all, we use heavily bonded interfaces (6x10G) and also needed to look at this balancing question. We use LACP bonding and, while the host OS probably tries to balance outgoing traffic over all NICs, the real decision is made by the switches (incoming traffic). Our switches hash packets to a port by (source?) MAC address, meaning that it is not the number of TCP/IP connections that helps balancing, but only the number of MAC addresses. In an LACP bond, all NICs have the same MAC address and balancing happens by (physical) host. The more hosts, the better it will work. In a way, for us this both is and is not a problem. We have about 550 physical clients (an HPC cluster) and 12 OSD hosts, which means that we probably have a good load on every single NIC for client traffic. On the other hand, rebalancing between 12 servers is unlikely to use all NICs effectively. So far, we don't have enough disks per host to notice that, but it could become visible at some point. Basically, the host with the worst switch-sided hashing for incoming traffic will become the bottleneck. On some switches the hashing method for LACP bonds can be configured, however, not with much detail. I have not seen a possibility to use IP:PORT for hashing to a switch port. I have no experience with bonding mode 6 (ALB) that might provide a per-connection hashing. Would be interested to hear how it performs. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Marc Roos Sent: 21 September 2020 11:08:55 To: ceph-users; lindsay.mathieson Subject: [ceph-users] Re: Setting up a small experimental CEPH network I tested something in the past[1] where I could notice that an osd saturated a bond link and did not use the available 2nd one. I think I maybe made a mistake in writing down it was a 1x replicated pool. 
However it has been written here multiple times that these osd processes are single thread, so afaik they cannot use more than one link, and at the moment your osd has a saturated link, your clients will notice this. [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html -Original Message- From: Lindsay Mathieson [mailto:lindsay.mathie...@gmail.com] Sent: Monday 21 September 2020 2:42 To: ceph-users@ceph.io Subject: [ceph-users] Re: Setting up a small experimental CEPH network On 21/09/2020 5:40 am, Stefan Kooman wrote: > My experience with bonding and Ceph is pretty good (OpenvSwitch). Ceph > uses lots of tcp connections, and those can get shifted (balanced) > between interfaces depending on load. Same here - I'm running 4*1GB (LACP, Balance-TCP) on a 5 node cluster with 19 OSD's. 20 Active VM's and it idles at under 1 MiB/s, spikes up to 100MiB/s no problem. When doing a heavy rebalance/repair data rates on any one node can hit 400MiBs+ It scales out really well. -- Lindsay
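The switch-side behaviour described above — hashing by MAC address, so one host's traffic sticks to one bond member no matter how many TCP connections it opens — can be illustrated with a toy hash. Real switch ASICs use vendor-specific hash functions; this sketch only shows why per-MAC hashing cannot spread a single host across links:

```python
def lacp_link(src_mac: str, n_links: int) -> int:
    """Toy layer-2 hash: XOR-fold the source MAC's octets, pick a link.
    Illustration only -- the link choice depends on the MAC alone,
    not on how many TCP connections are open."""
    h = 0
    for octet in src_mac.split(":"):
        h ^= int(octet, 16)
    return h % n_links

# One client host (one MAC) always lands on the same bond member,
# so a single busy client cannot use more than one link:
links = {lacp_link("aa:bb:cc:dd:ee:01", 4) for _ in range(1000)}
print(len(links))  # 1 -- every "connection" hashes to the same link
```

With IP:PORT-based hashing (which the poster could not configure on his switches), each TCP connection would get its own hash input and the load could spread.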
[ceph-users] Re: multiple OSD crash, unfound objects
Dear Michael, maybe there is a way to restore access for users and solve the issues later. Someone else with a lost/unfound object was able to move the affected file (or directory containing the file) to a separate location and restore the now missing data from backup. This will "park" the problem of cluster health for later fixing. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Frank Schilder Sent: 18 September 2020 15:38:51 To: Michael Thomas; ceph-users@ceph.io Subject: [ceph-users] Re: multiple OSD crash, unfound objects Dear Michael, > I disagree with the statement that trying to recover health by deleting > data is a contradiction. In some cases (such as mine), the data in ceph > is backed up in another location (eg tape library). Restoring a few > files from tape is a simple and cheap operation that takes a minute, at > most. I would agree with that if the data was deleted using the appropriate high-level operation. Deleting an unfound object is like marking a sector on a disk as bad with smartctl. How should the file system react to that? Purging an OSD is like removing a disk from a raid set. Such operations increase inconsistencies/degradation rather than resolving them. Cleaning this up also requires executing other operations to remove all references to the object and, finally, the file inode itself. The ls on a dir with corrupted file(s) hangs if ls calls stat on every file. For example, when coloring is enabled, ls will stat every file in the dir to be able to choose the color according to permissions. If one then disables coloring, a plain "ls" will return all names while an "ls -l" will hang due to stat calls. An "rm" or "rm -f" should succeed if the folder permissions allow that. It should not stat the file itself, so it sounds a bit odd that it's hanging. I guess in some situations it does, like "rm -i", which will ask before removing read-only files. How does "unlink FILE" behave? 
Most admin commands on ceph are asynchronous. A command like "pg repair" or "osd scrub" only schedules an operation. The command "ceph pg 7.1fb mark_unfound_lost delete" does probably just the same. Unfortunately, I don't know how to check that a scheduled operation has started/completed/succeeded/failed. I asked this in an earlier thread (about PG repair) and didn't get an answer. On our cluster, the actual repair happened ca. 6-12 hours after scheduling (on a healthy cluster!). I would conclude that (some of) these operations have very low priority and will not start at least as long as there is recovery going on. One might want to consider the possibility that some of the scheduled commands have not been executed yet. The output of "pg query" contains the IDs of the missing objects (in mimic) and each of these objects is on one of the peer OSDs of the PG (I think object here refers to shard or copy). It should be possible to find the corresponding OSD (or at least obtain confirmation that the object is really gone) and move the object to a place where it is expected to be found. This can probably be achieved with "PG export" and "PG import". I don't know of any other way(s). I guess, in the current situation, sitting it out a bit longer might be a good strategy. I don't know how many asynchronous commands you executed and giving the cluster time to complete these jobs might improve the situation. Sorry that I can't be of more help here. However, if you figure out a solution (ideally non-destructive), please post it here. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 18 September 2020 14:15:53 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] multiple OSD crash, unfound objects Hi Frank, On 9/18/20 2:50 AM, Frank Schilder wrote: > Dear Michael, > > firstly, I'm a bit confused why you started deleting data. The objects were > unfound, but still there. That's a small issue. 
Now the data might be gone > and that's a real issue. > > > Interval: > > Anyone reading this: I have seen many threads where ceph admins started > deleting objects or PGs or even purging OSDs way too early from a cluster. > Trying to recover health by deleting data is a contradiction. Ceph has bugs > and sometimes it needs some help finding everything again. As far as I know, > for most of these bugs there are workarounds that allow full recovery with a > bit of work. I disagree with the statement that trying to recover health by deleting data is a contradiction. In some cases (such as mine), the data in ceph is backed up in another location (eg tape library
[ceph-users] Re: multiple OSD crash, unfound objects
Dear Michael, > I disagree with the statement that trying to recover health by deleting > data is a contradiction. In some cases (such as mine), the data in ceph > is backed up in another location (eg tape library). Restoring a few > files from tape is a simple and cheap operation that takes a minute, at > most. I would agree with that if the data was deleted using the appropriate high-level operation. Deleting an unfound object is like marking a sector on a disk as bad with smartctl. How should the file system react to that? Purging an OSD is like removing a disk from a raid set. Such operations increase inconsistencies/degradation rather than resolving them. Cleaning this up also requires executing other operations to remove all references to the object and, finally, the file inode itself. The ls on a dir with corrupted file(s) hangs if ls calls stat on every file. For example, when coloring is enabled, ls will stat every file in the dir to be able to choose the color according to permissions. If one then disables coloring, a plain "ls" will return all names while an "ls -l" will hang due to stat calls. An "rm" or "rm -f" should succeed if the folder permissions allow that. It should not stat the file itself, so it sounds a bit odd that it's hanging. I guess in some situations it does, like "rm -i", which will ask before removing read-only files. How does "unlink FILE" behave? Most admin commands on ceph are asynchronous. A command like "pg repair" or "osd scrub" only schedules an operation. The command "ceph pg 7.1fb mark_unfound_lost delete" does probably just the same. Unfortunately, I don't know how to check that a scheduled operation has started/completed/succeeded/failed. I asked this in an earlier thread (about PG repair) and didn't get an answer. On our cluster, the actual repair happened ca. 6-12 hours after scheduling (on a healthy cluster!). 
I would conclude that (some of) these operations have very low priority and will not start at least as long as there is recovery going on. One might want to consider the possibility that some of the scheduled commands have not been executed yet. The output of "pg query" contains the IDs of the missing objects (in mimic) and each of these objects is on one of the peer OSDs of the PG (I think object here refers to shard or copy). It should be possible to find the corresponding OSD (or at least obtain confirmation that the object is really gone) and move the object to a place where it is expected to be found. This can probably be achieved with "PG export" and "PG import". I don't know of any other way(s). I guess, in the current situation, sitting it out a bit longer might be a good strategy. I don't know how many asynchronous commands you executed and giving the cluster time to complete these jobs might improve the situation. Sorry that I can't be of more help here. However, if you figure out a solution (ideally non-destructive), please post it here. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 18 September 2020 14:15:53 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] multiple OSD crash, unfound objects Hi Frank, On 9/18/20 2:50 AM, Frank Schilder wrote: > Dear Michael, > > firstly, I'm a bit confused why you started deleting data. The objects were > unfound, but still there. That's a small issue. Now the data might be gone > and that's a real issue. > > > Interval: > > Anyone reading this: I have seen many threads where ceph admins started > deleting objects or PGs or even purging OSDs way too early from a cluster. > Trying to recover health by deleting data is a contradiction. Ceph has bugs > and sometimes it needs some help finding everything again. As far as I know, > for most of these bugs there are workarounds that allow full recovery with a > bit of work. 
I disagree with the statement that trying to recover health by deleting data is a contradiction. In some cases (such as mine), the data in ceph is backed up in another location (eg tape library). Restoring a few files from tape is a simple and cheap operation that takes a minute, at most. For the sake of expediency, sometimes it's quicker and easier to simply delete the affected files and restore them from the backup system. This procedure has worked fine with our previous distributed filesystem (hdfs), so I (naively?) thought that it could be used with ceph as well. I was a bit surprised that ceph's behavior was to indefinitely block the 'rm' operation so that the affected file could not even be removed. Since I have 25 unfound objects spread across 9 PGs, I used a PG with a single unfound object
[ceph-users] Re: multiple OSD crash, unfound objects
Dear Michael, firstly, I'm a bit confused why you started deleting data. The objects were unfound, but still there. That's a small issue. Now the data might be gone and that's a real issue. Interval: Anyone reading this: I have seen many threads where ceph admins started deleting objects or PGs or even purging OSDs way too early from a cluster. Trying to recover health by deleting data is a contradiction. Ceph has bugs and sometimes it needs some help finding everything again. As far as I know, for most of these bugs there are workarounds that allow full recovery with a bit of work. First question is, did you delete the entire object or just a shard on one disk? Are there OSDs that might still have a copy? If the object is gone for good, the file references something that doesn't exist - it's like a bad sector. You probably need to delete the file. Bit strange that the operation does not err out with a read error. Maybe it doesn't because it waits for the unfound objects state to be resolved? For all the other unfound objects, they are there somewhere - you didn't lose a disk or something. Try pushing ceph to scan the correct OSDs, for example, by restarting the newly added OSDs one by one or something similar. Sometimes exporting and importing a PG from one OSD to another forces a re-scan and subsequent discovery of unfound objects. It is also possible that ceph will find these objects along the way of recovery or when OSDs scrub or check for objects that can be deleted. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 17 September 2020 22:27:47 To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] multiple OSD crash, unfound objects Hi Frank, Yes, it does sound similar to your ticket. 
I've tried a few things to restore the failed files: * Locate a missing object with 'ceph pg $pgid list_unfound' * Convert the hex oid to a decimal inode number * Identify the affected file with 'find /ceph -inum $inode' At this point, I know which file is affected by the missing object. As expected, attempts to read the file simply hang. Unexpectedly, attempts to 'ls' the file or its containing directory also hang. I presume from this that the stat() system call needs some information that is contained in the missing object, and is waiting for the object to become available. Next I tried to remove the affected object with: * ceph pg $pgid mark_unfound_lost delete Now 'ceph status' shows one fewer missing objects, but attempts to 'ls' or 'rm' the affected file continue to hang. Finally, I ran a scrub over the part of the filesystem containing the affected file: ceph tell mds.ceph4 scrub start /frames/postO3/hoft recursive Nothing seemed to come up during the scrub: 2020-09-17T14:56:15.208-0500 7f39bca24700 1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...) 2020-09-17T14:58:58.013-0500 7f39bca24700 1 mds.ceph4 asok_command: scrub start {path=/frames/postO3/hoft,prefix=scrub start,scrubops=[recursive]} (starting...) 2020-09-17T14:58:58.013-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub summary: active 2020-09-17T14:58:58.014-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub queued for path: /frames/postO3/hoft 2020-09-17T14:58:58.014-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub summary: active [paths:/frames/postO3/hoft] 2020-09-17T14:59:02.535-0500 7f39bca24700 1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...) 2020-09-17T15:00:12.520-0500 7f39bca24700 1 mds.ceph4 asok_command: scrub status {prefix=scrub status} (starting...) 
2020-09-17T15:02:32.944-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub summary: idle 2020-09-17T15:02:32.945-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub complete with tag '1405e5c7-3ecf-4754-918e-129e9d101f7a' 2020-09-17T15:02:32.945-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub completed for path: /frames/postO3/hoft 2020-09-17T15:02:32.945-0500 7f39b5215700 0 log_channel(cluster) log [INF] : scrub summary: idle After the scrub completed, access to the file (ls or rm) continue to hang. The MDS reports slow reads: 2020-09-17T15:11:05.654-0500 7f39b9a1e700 0 log_channel(cluster) log [WRN] : slow request 481.867381 seconds old, received at 2020-09-17T15:03:03.788058-0500: client_request(client.451432:11309 getattr pAsLsXsFs #0x105b1c0 2020-09-17T15:03:03.787602-0500 caller_uid=0, caller_gid=0{}) currently dispatched Does anyone have any suggestions on how else to clean up from a permanently lost object? --Mike On 9/16/20 2:03 AM, Frank Schilder wrote: > Sounds similar to this one: https://tracker.ceph.com/issues/46847 > > If you have or can reconstruct the crush map from before adding the OSDs, you > might be able to discove
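Michael's locate-the-file recipe above (list_unfound, convert the hex oid, find by inode) can be scripted. A sketch assuming the CephFS data-pool naming convention '<hex inode>.<hex block>'; the oid matches the #0x105b1c0 inode from the MDS slow-request log above, and /ceph is the mount point from this thread:

```python
def oid_to_inode(oid: str) -> int:
    """CephFS data-pool object names look like '<hex inode>.<hex block>';
    the part before the dot is the owning file's inode number in hex."""
    return int(oid.split(".")[0], 16)

inode = oid_to_inode("105b1c0.00000000")
print(inode)                        # 17150400
print(f"find /ceph -inum {inode}")  # shell command to locate the file
```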
[ceph-users] vfs_ceph for CentOS 8
Hi all, we are setting up a SAMBA share and would like to use the vfs_ceph module. Unfortunately, it seems not to be part of the common SAMBA packages on CentOS 8. Does anyone know how to install vfs_ceph? The SAMBA version on CentOS 8 is samba-4.11.2-13 and the documentation says the module is part of it. Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14
[ceph-users] Re: multiple OSD crash, unfound objects
Sounds similar to this one: https://tracker.ceph.com/issues/46847 If you have or can reconstruct the crush map from before adding the OSDs, you might be able to discover everything with the temporary reversal of the crush map method. Not sure if there is another method, I never got a reply to my question in the tracker. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michael Thomas Sent: 16 September 2020 01:27:19 To: ceph-users@ceph.io Subject: [ceph-users] multiple OSD crash, unfound objects Over the weekend I had multiple OSD servers in my Octopus cluster (15.2.4) crash and reboot at nearly the same time. The OSDs are part of an erasure coded pool. At the time the cluster had been busy with a long-running (~week) remapping of a large number of PGs after I incrementally added more OSDs to the cluster. After bringing all of the OSDs back up, I have 25 unfound objects and 75 degraded objects. There are other problems reported, but I'm primarily concerned with these unfound/degraded objects. The pool with the missing objects is a cephfs pool. The files stored in the pool are backed up on tape, so I can easily restore individual files as needed (though I would not want to restore the entire filesystem). I tried following the guide at https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg/#unfound-objects. I found a number of OSDs that are still 'not queried'. Restarting a sampling of these OSDs changed the state from 'not queried' to 'already probed', but that did not recover any of the unfound or degraded objects. I have also tried 'ceph pg deep-scrub' on the affected PGs, but never saw them get scrubbed. I also tried doing a 'ceph pg force-recovery' on the affected PGs, but only one seems to have been tagged accordingly (see ceph -s output below). The guide also says "Sometimes it simply takes some time for the cluster to query possible locations." 
I'm not sure how long "some time" might take, but it hasn't changed after several hours. My questions are: * Is there a way to force the cluster to query the possible locations sooner? * Is it possible to identify the files in cephfs that are affected, so that I could delete only the affected files and restore them from backup tapes? --Mike ceph -s:

  cluster:
    id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
    health: HEALTH_ERR
            1 clients failing to respond to capability release
            1 MDSs report slow requests
            25/78520351 objects unfound (0.000%)
            2 nearfull osd(s)
            Reduced data availability: 1 pg inactive
            Possible data damage: 9 pgs recovery_unfound
            Degraded data redundancy: 75/626645098 objects degraded (0.000%), 9 pgs degraded
            1013 pgs not deep-scrubbed in time
            1013 pgs not scrubbed in time
            2 pool(s) nearfull
            1 daemons have recently crashed
            4 slow ops, oldest one blocked for 77939 sec, daemons [osd.0,osd.41] have slow ops.

  services:
    mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
    mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
    mds: archive:1 {0=ceph4=up:active} 3 up:standby
    osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs

  task status:
    scrub status:
      mds.ceph4: idle

  data:
    pools:   9 pools, 2433 pgs
    objects: 78.52M objects, 298 TiB
    usage:   412 TiB used, 545 TiB / 956 TiB avail
    pgs:     0.041% pgs unknown
             75/626645098 objects degraded (0.000%)
             135224/626645098 objects misplaced (0.022%)
             25/78520351 objects unfound (0.000%)
             2421 active+clean
             5    active+recovery_unfound+degraded
             3    active+recovery_unfound+degraded+remapped
             2    active+clean+scrubbing+deep
             1    unknown
             1    active+forced_recovery+recovery_unfound+degraded

  progress:
    PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
      []
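On the first question above, the peers still in 'not queried' state can at least be enumerated automatically from 'ceph pg <pgid> query --format json'. A sketch; the might_have_unfound structure is as it appears in Octopus-era pg query output, and the field names should be double-checked against your version:

```python
import json

def unqueried_osds(pg_query_json: str) -> list:
    """List OSDs from 'might_have_unfound' whose status is 'not queried'."""
    q = json.loads(pg_query_json)
    osds = []
    for state in q.get("recovery_state", []):
        for peer in state.get("might_have_unfound", []):
            if peer.get("status") == "not queried":
                osds.append(peer["osd"])
    return osds

# toy input shaped like `ceph pg 7.1fb query` output
sample = json.dumps({"recovery_state": [{
    "name": "Started/Primary/Active",
    "might_have_unfound": [
        {"osd": "12(3)", "status": "already probed"},
        {"osd": "41(5)", "status": "not queried"},
    ]}]})
print(unqueried_osds(sample))  # ['41(5)']
```

Restarting exactly the listed OSDs (rather than a sampling) is then one way to nudge the primary into probing them.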
[ceph-users] Re: The confusing output of ceph df command
We might have the same problem. EC 6+2 on a pool for RBD images on spindles. Please see the earlier thread "mimic: much more raw used than reported". In our case, this seems to be a problem exclusively for RBD workloads and here, in particular, Windows VMs. I see no amplification at all on our ceph fs pool. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: norman Sent: 10 September 2020 08:34:42 To: ceph-users@ceph.io Subject: [ceph-users] Re: The confusing output of ceph df command Anyone else met the same problem? Using EC instead of Replica is to save spaces, but now it's worse than replica... On 9/9/2020 7:30 AM, norman kern wrote:
> Hi,
>
> I have changed most of pools from 3-replica to ec 4+2 in my cluster, when I use ceph df command to show the used capacity of the cluster:
>
> RAW STORAGE:
>     CLASS      SIZE     AVAIL    USED     RAW USED  %RAW USED
>     hdd        1.8 PiB  788 TiB  1.0 PiB  1.0 PiB   57.22
>     ssd        7.9 TiB  4.6 TiB  181 GiB  3.2 TiB   41.15
>     ssd-cache  5.2 TiB  5.2 TiB  67 GiB   73 GiB    1.36
>     TOTAL      1.8 PiB  798 TiB  1.0 PiB  1.0 PiB   56.99
>
> POOLS:
>     POOL                            ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
>     default-oss.rgw.control         1   0 B      8        0 B      0      1.3 TiB
>     default-oss.rgw.meta            2   22 KiB   97       3.9 MiB  0      1.3 TiB
>     default-oss.rgw.log             3   525 KiB  223      621 KiB  0      1.3 TiB
>     default-oss.rgw.buckets.index   4   33 MiB   34       33 MiB   0      1.3 TiB
>     default-oss.rgw.buckets.non-ec  5   1.6 MiB  48       3.8 MiB  0      1.3 TiB
>     .rgw.root                       6   3.8 KiB  16       720 KiB  0      1.3 TiB
>     default-oss.rgw.buckets.data    7   274 GiB  185.39k  450 GiB  0.14   212 TiB
>     default-fs-metadata             8   488 GiB  153.10M  490 GiB  10.65  1.3 TiB
>     default-fs-data0                9   374 TiB  1.48G    939 TiB  74.71  212 TiB
>
> ...
> The USED = 3 * STORED in 3-replica mode is completely right, but for EC 4+2 pool (for default-fs-data0) the USED is not equal 1.5 * STORED, why... :(
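For reference, the nominal space factor the poster expects follows directly from the EC profile: raw used per byte stored is (k+m)/k. Actual USED can exceed this on bluestore because small objects are padded up to the allocation unit (min_alloc_size) on every shard — a common explanation for this kind of report, though not confirmed in this thread:

```python
def ec_raw_factor(k: int, m: int) -> float:
    """Nominal raw-used bytes per byte stored for a k+m erasure-coded pool."""
    return (k + m) / k

print(ec_raw_factor(4, 2))            # 1.5 -- the 4+2 expectation in the quoted mail
print(round(ec_raw_factor(6, 2), 2))  # 1.33 -- the 6+2 profile mentioned above
# the quoted df shows 939 TiB used for 374 TiB stored, i.e. well above 1.5x:
print(round(939 / 374, 2))            # 2.51
```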
[ceph-users] Re: OSD memory leak?
Looks like the image attachment got removed. Please find it here: https://imgur.com/a/3tabzCN = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 31 August 2020 14:42 To: Mark Nelson; Dan van der Ster; ceph-users Subject: [ceph-users] Re: OSD memory leak? Hi Dan and Mark, sorry, took a bit longer. I uploaded a new archive containing files with the following format (https://files.dtu.dk/u/jb0uS6U9LlCfvS5L/heap_profiling-2020-08-31.tgz?l - valid 60 days): - osd.195.profile.*.heap - raw heap dump file - osd.195.profile.*.heap.txt - output of conversion with --text - osd.195.profile.*.heap-base0001.txt - output of conversion with --text against first dump as base - osd.195.*.heap_stats - output of ceph daemon osd.195 heap stats, every hour - osd.195.*.mempools - output of ceph daemon osd.195 dump_mempools, every hour - osd.195.*.perf - output of ceph daemon osd.195 perf dump, every hour, counters are reset Only for the last couple of days are converted files included, post-conversion of everything simply takes too long. Please find also attached a recording of memory usage on one of the relevant OSD nodes. I marked restarts of all OSDs/the host with vertical red lines. What is worrying is the self-amplifying nature of the leak. It's not a linear process, it looks at least quadratic if not exponential. What we are looking for is, given the comparably short uptime, probably still in the lower percentages with increasing rate. The OSDs just started to overrun their limit: top - 14:38:49 up 155 days, 19:17, 1 user, load average: 5.99, 4.59, 4.59 Tasks: 684 total, 1 running, 293 sleeping, 0 stopped, 0 zombie %Cpu(s): 1.9 us, 0.9 sy, 0.0 ni, 89.6 id, 7.6 wa, 0.0 hi, 0.1 si, 0.0 st KiB Mem : 65727628 total, 6937548 free, 41921260 used, 16868820 buff/cache KiB Swap: 93532160 total, 90199040 free, 120 used. 
6740136 avail Mem PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 4099023 ceph 20 0 5918704 3.8g 9700 S 1.7 6.1 378:37.01 /usr/bin/ceph-osd --cluster ceph -f -i 35 --setuser cep+ 4097639 ceph 20 0 5340924 3.0g 11428 S 87.1 4.7 14636:30 /usr/bin/ceph-osd --cluster ceph -f -i 195 --setuser ce+ 4097974 ceph 20 0 3648188 2.3g 9628 S 8.3 3.6 1375:58 /usr/bin/ceph-osd --cluster ceph -f -i 201 --setuser ce+ 4098322 ceph 20 0 3478980 2.2g 9688 S 5.3 3.6 1426:05 /usr/bin/ceph-osd --cluster ceph -f -i 223 --setuser ce+ 4099374 ceph 20 0 3446784 2.2g 9252 S 4.6 3.5 1142:14 /usr/bin/ceph-osd --cluster ceph -f -i 205 --setuser ce+ 4098679 ceph 20 0 3832140 2.2g 9796 S 6.6 3.5 1248:26 /usr/bin/ceph-osd --cluster ceph -f -i 132 --setuser ce+ 4100782 ceph 20 0 3641608 2.2g 9652 S 7.9 3.5 1278:10 /usr/bin/ceph-osd --cluster ceph -f -i 207 --setuser ce+ 4095944 ceph 20 0 3375672 2.2g 8968 S 7.3 3.5 1250:02 /usr/bin/ceph-osd --cluster ceph -f -i 108 --setuser ce+ 4096956 ceph 20 0 3509376 2.2g 9456 S 7.9 3.5 1157:27 /usr/bin/ceph-osd --cluster ceph -f -i 203 --setuser ce+ 4099731 ceph 20 0 3563652 2.2g 8972 S 3.6 3.5 1421:48 /usr/bin/ceph-osd --cluster ceph -f -i 61 --setuser cep+ 4096262 ceph 20 0 3531988 2.2g 9040 S 9.9 3.5 1600:15 /usr/bin/ceph-osd --cluster ceph -f -i 121 --setuser ce+ 4100442 ceph 20 0 3359736 2.1g 9804 S 4.3 3.4 1185:53 /usr/bin/ceph-osd --cluster ceph -f -i 226 --setuser ce+ 4096617 ceph 20 0 3443060 2.1g 9432 S 5.0 3.4 1449:29 /usr/bin/ceph-osd --cluster ceph -f -i 199 --setuser ce+ 4097298 ceph 20 0 3483532 2.1g 9600 S 5.6 3.3 1265:28 /usr/bin/ceph-osd --cluster ceph -f -i 97 --setuser cep+ 4100093 ceph 20 0 3428348 2.0g 9568 S 3.3 3.2 1298:53 /usr/bin/ceph-osd --cluster ceph -f -i 197 --setuser ce+ 4095630 ceph 20 0 3440160 2.0g 8976 S 3.6 3.2 1451:35 /usr/bin/ceph-osd --cluster ceph -f -i 62 --setuser cep+ Generally speaking, increasing the cache minimum seems to help with keeping important information in RAM. 
Unfortunately, it also means that swap usage starts much earlier. Best regards and thanks for your help, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
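To track a leak like the one in the top output above, it helps to watch the summed resident memory of all ceph-osd processes over time. A minimal sketch; on a live node the input would come from `ps -o rss= -C ceph-osd`, and the three RSS values (in KiB) below are made up stand-ins so the example is self-contained:

```shell
# Sum the resident set size of all ceph-osd daemons on a host.
# Live use:  ps -o rss= -C ceph-osd | awk '...'
# Here a printf of sample KiB values stands in for the ps output.
printf '3984588\n3145728\n2411724\n' | awk '
    { sum += $1 }                                  # accumulate RSS in KiB
    END { printf "ceph-osd total RSS: %.1f GiB over %d daemons\n",
          sum / 1048576, NR }'                     # KiB -> GiB
```

Logging one such line per hour gives a simple growth curve to compare against the osd_memory_target budget.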
[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)
I was talking about the on-disk cache, but, yes, the controller cache needs to be disabled too. The first can be done with smartctl or hdparm. Check the cache status with something like 'smartctl -g wcache /dev/sda' and disable it with something like 'smartctl -s wcache,off /dev/sda'. The controller cache needs to be disabled in the BIOS. By the way, if you can't use pass-through, you should disable the controller cache for every disk, including the HDDs. There are cases in the list demonstrating that an enabled controller cache can lead to data loss on a power outage. As I recommended before, please search the ceph-users list; you will find detailed instructions and also links to explanations and typical benchmarks. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: VELARTIS Philipp Dürhammer Sent: 31 August 2020 14:44:07 To: Frank Schilder; 'ceph-users@ceph.io' Subject: Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals) We have older LSI RAID controllers with no HBA/JBOD option, so we expose the single disks as RAID-0 devices. Ceph should not be aware of the cache status? But digging deeper into it, it seems that 1 out of 4 servers is performing a lot better and has super low commit/apply latencies, while the others have a lot more (20+) on heavy writes. This just applies to the SSDs; for the HDDs I can't see a difference... -Original Message- From: Frank Schilder Sent: Monday, 31 August 2020 13:19 To: VELARTIS Philipp Dürhammer ; 'ceph-users@ceph.io' Subject: Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals) Yes, they can - if the volatile write cache is not disabled. There are many threads on this, also recent ones. Search for "disable write cache" and/or "disable volatile write cache". You will also find different methods of doing this automatically.
Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: VELARTIS Philipp Dürhammer Sent: 31 August 2020 13:02:45 To: 'ceph-users@ceph.io' Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no extra journals) I have a productive 60-OSD cluster, no extra journals. It's performing okay. Now I added an extra SSD pool with 16 Micron 5100 MAX, and the performance is a little slower than or equal to the 60-HDD pool, for 4K random as well as sequential reads. All on a dedicated 2 x 10G network. The HDDs are still on filestore, the SSDs on bluestore, Ceph Luminous. What should be possible: 16 SSDs vs. 60 HDDs with no extra journals? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
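The cache-disabling advice above can be sketched as a loop over the data disks. This is a dry run (the commands are only echoed, never executed) and the device list is a placeholder; drop the `echo` and substitute your real devices to apply it. The `smartctl -s wcache,off` and `hdparm -W 0` spellings come from the smartmontools and hdparm man pages:

```shell
# Dry run: print the commands that would disable the volatile on-disk write
# cache on each data disk. Remove the `echo` to actually run them (as root).
for dev in /dev/sda /dev/sdb; do           # placeholder device list
    echo smartctl -s wcache,off "$dev"     # disable on-disk write cache
    echo hdparm -W 0 "$dev"                # hdparm equivalent of the same
done
```

The controller (RAID) cache still has to be set to write-through or disabled separately, typically in the BIOS or the controller's own CLI, which no disk-level command covers.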
[ceph-users] How to query status of scheduled commands.
Hi all, can anyone help me with this? In mimic, for any of these commands: ceph osd [deep-]scrub ID ceph pg [deep-]scrub ID ceph pg repair ID an operation is scheduled asynchronously. How can I check the following states: 1) Operation is pending (scheduled, not started). 2) Operation is running. 3) Operation has completed. 4) Exit code and error messages if applicable. Many thanks! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
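As far as I know, mimic offers no direct job-status query for these scheduled operations, so a common workaround (not from this thread) is to infer the four states: a PG whose state string contains "scrubbing" (or "scrubbing+deep") has the operation running; completion shows as a newer last_scrub_stamp/last_deep_scrub_stamp in `ceph pg PGID query`; and errors surface in the cluster log as scrub errors or inconsistent PGs. A sketch of the state check, with a here-doc standing in for live `ceph pg dump pgs_brief`-style output (PGID STATE ...):

```shell
# Classify PGs by whether a (deep-)scrub is currently running.
# Live use: ceph pg dump pgs_brief 2>/dev/null | awk '...'
# The here-doc below is sample output so the sketch is self-contained.
awk '{ print $1, ($2 ~ /scrubbing/ ? "scrub running" : "idle") }' <<'EOF'
1.2a active+clean+scrubbing+deep
1.2b active+clean
EOF
```

"Pending" is the hardest state to observe: the request only sets a flag on the primary OSD, so until the state changes the best signal is that the stamps have not moved yet.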
[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)
Yes, they can - if the volatile write cache is not disabled. There are many threads on this, also recent ones. Search for "disable write cache" and/or "disable volatile write cache". You will also find different methods of doing this automatically. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: VELARTIS Philipp Dürhammer Sent: 31 August 2020 13:02:45 To: 'ceph-users@ceph.io' Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no extra journals) I have a productive 60-OSD cluster, no extra journals. It's performing okay. Now I added an extra SSD pool with 16 Micron 5100 MAX, and the performance is a little slower than or equal to the 60-HDD pool, for 4K random as well as sequential reads. All on a dedicated 2 x 10G network. The HDDs are still on filestore, the SSDs on bluestore, Ceph Luminous. What should be possible: 16 SSDs vs. 60 HDDs with no extra journals? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: OSD memory leak?
Hi Mark and Dan, I can generate text files. Can you let me know what you would like to see? Without further instructions, I can do a simple conversion and a conversion against the first dump as a base. I will upload an archive with converted files added tomorrow afternoon. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Mark Nelson Sent: 20 August 2020 21:52 To: Frank Schilder; Dan van der Ster; ceph-users Subject: Re: [ceph-users] Re: OSD memory leak? Hi Frank, I downloaded but haven't had time to get the environment setup yet either. It might be better to just generate the txt files if you can. Thanks! Mark On 8/20/20 2:33 AM, Frank Schilder wrote: > Hi Dan and Mark, > > could you please let me know if you can read the files with the version info > I provided in my previous e-mail? I'm in the process of collecting data with > more FS activity and would like to send it in a format that is useful for > investigation. > > Right now I'm observing a daily growth of swap of ca. 100-200MB on servers > with 16 OSDs each, 1SSD and 15HDDs. The OS+daemons operate fine, the OS > manages to keep enough RAM available. Also the mempool dump still shows onode > and data cached at a seemingly reasonable level. Users report a more stable > performance of the FS after I increased the cach min sizes on all OSDs. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ____ > From: Frank Schilder > Sent: 17 August 2020 09:37 > To: Dan van der Ster > Cc: ceph-users > Subject: [ceph-users] Re: OSD memory leak? > > Hi Dan, > > I use the container > docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I > can see, it uses the packages from http://download.ceph.com/rpm-mimic/el7, > its a Centos 7 build. The version is: > > # ceph -v > ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable) > > On Centos, the profiler packages are called different, without the "google-" > prefix. 
The version I have installed is > > # pprof --version > pprof (part of gperftools 2.0) > > Copyright 1998-2007 Google Inc. > > This is BSD licensed software; see the source for copying conditions > and license information. > There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A > PARTICULAR PURPOSE. > > It is possible to install pprof inside this container and analyse the > *.heap-files I provided. > > If this doesn't work for you and you want me to generate the text output for > heap-files, I can do that. Please let me know if I should do all files and > with what option (eg. against a base etc.). > > Best regards, > ===== > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Dan van der Ster > Sent: 14 August 2020 10:38:57 > To: Frank Schilder > Cc: Mark Nelson; ceph-users > Subject: Re: [ceph-users] Re: OSD memory leak? > > Hi Frank, > > I'm having trouble getting the exact version of ceph you used to > create this heap profile. > Could you run the google-pprof --text steps at [1] and share the output? > > Thanks, Dan > > [1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/ > > > On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder wrote: >> Hi Mark, >> >> here is a first collection of heap profiling data (valid 30 days): >> >> https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l >> >> This was collected with the following config settings: >> >>osd dev osd_memory_cache_min >> 805306368 >>osd basicosd_memory_target >> 2147483648 >> >> Setting the cache_min value seems to help keeping cache space available. >> Unfortunately, the above collection is for 12 days only. I needed to restart >> the OSD and will need to restart it soon again. I hope I can then run a >> longer sample. The profiling does cause slow ops though. >> >> Maybe you can see something already? It seems to have collected some leaked >> memory. Unfortunately, it was a period of extremely low load. 
Basically, >> with the day of recording the utilization dropped to almost zero. >> >> Best regards, >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> >> >> From: Frank Schilder >> Sent: 21 July 2020 12:57:32 >> To: Mark Nelson; Dan van
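The two text conversions discussed in this thread (plain `--text` and `--text` against the first dump as a base) can be sketched as a small loop. This is a dry run: `echo` prints the pprof invocations instead of executing them, since pprof and the dumps are only present on the OSD host; the file names follow the naming scheme from the earlier mail:

```shell
# Dry-run sketch: convert raw heap dumps to text, both standalone and
# relative to the first dump (the base), as gperftools pprof supports.
# Remove the `echo` on a host that has pprof and the dump files.
osd_bin=/usr/bin/ceph-osd
base=osd.195.profile.0001.heap
for dump in osd.195.profile.0002.heap osd.195.profile.0003.heap; do
    echo pprof --text "$osd_bin" "$dump"                   # absolute view
    echo pprof --text --base="$base" "$osd_bin" "$dump"    # growth since base
done
```

The `--base` form is what makes slow leaks visible: it subtracts the first profile, so only allocations that accumulated after the baseline remain.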
[ceph-users] Re: OSD memory leak?
Hi Dan, no worries. I checked and osd_map_dedup is set to true, the default value. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Dan van der Ster Sent: 20 August 2020 09:41 To: Frank Schilder Cc: Mark Nelson; ceph-users Subject: Re: [ceph-users] Re: OSD memory leak? Hi Frank, I didn't get time yet. On our side, I was planning to see if the issue persists after upgrading to v14.2.11 -- it includes some updates to how the osdmap is referenced across OSD.cc. BTW, do you happen to have osd_map_dedup set to false? We do, and that surely increases the osdmap memory usage somewhat. -- Dan On Thu, Aug 20, 2020 at 9:33 AM Frank Schilder wrote: > > Hi Dan and Mark, > > could you please let me know if you can read the files with the version info > I provided in my previous e-mail? I'm in the process of collecting data with > more FS activity and would like to send it in a format that is useful for > investigation. > > Right now I'm observing a daily growth of swap of ca. 100-200MB on servers > with 16 OSDs each, 1SSD and 15HDDs. The OS+daemons operate fine, the OS > manages to keep enough RAM available. Also the mempool dump still shows onode > and data cached at a seemingly reasonable level. Users report a more stable > performance of the FS after I increased the cach min sizes on all OSDs. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Frank Schilder > Sent: 17 August 2020 09:37 > To: Dan van der Ster > Cc: ceph-users > Subject: [ceph-users] Re: OSD memory leak? > > Hi Dan, > > I use the container > docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I > can see, it uses the packages from http://download.ceph.com/rpm-mimic/el7, > its a Centos 7 build. The version is: > > # ceph -v > ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable) > > On Centos, the profiler packages are called different, without the "google-" > prefix.
The version I have installed is > > # pprof --version > pprof (part of gperftools 2.0) > > Copyright 1998-2007 Google Inc. > > This is BSD licensed software; see the source for copying conditions > and license information. > There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A > PARTICULAR PURPOSE. > > It is possible to install pprof inside this container and analyse the > *.heap-files I provided. > > If this doesn't work for you and you want me to generate the text output for > heap-files, I can do that. Please let me know if I should do all files and > with what option (eg. against a base etc.). > > Best regards, > ===== > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Dan van der Ster > Sent: 14 August 2020 10:38:57 > To: Frank Schilder > Cc: Mark Nelson; ceph-users > Subject: Re: [ceph-users] Re: OSD memory leak? > > Hi Frank, > > I'm having trouble getting the exact version of ceph you used to > create this heap profile. > Could you run the google-pprof --text steps at [1] and share the output? > > Thanks, Dan > > [1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/ > > > On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder wrote: > > > > Hi Mark, > > > > here is a first collection of heap profiling data (valid 30 days): > > > > https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l > > > > This was collected with the following config settings: > > > > osd dev osd_memory_cache_min > > 805306368 > > osd basicosd_memory_target > > 2147483648 > > > > Setting the cache_min value seems to help keeping cache space available. > > Unfortunately, the above collection is for 12 days only. I needed to > > restart the OSD and will need to restart it soon again. I hope I can then > > run a longer sample. The profiling does cause slow ops though. > > > > Maybe you can see something already? It seems to have collected some leaked > > memory. Unfortunately, it was a period of extremely low load. 
Basically, > > with the day of recording the utilization dropped to almost zero. > > > > Best regards, > > = > > Frank Schilder > > AIT Risø Campus > > Bygning 109, rum S14 > > > > > > From: Frank Schilder > > Sent: 21 July
[ceph-users] Re: OSD memory leak?
Hi Dan and Mark, could you please let me know if you can read the files with the version info I provided in my previous e-mail? I'm in the process of collecting data with more FS activity and would like to send it in a format that is useful for investigation. Right now I'm observing a daily growth of swap of ca. 100-200 MB on servers with 16 OSDs each, 1 SSD and 15 HDDs. The OS+daemons operate fine, the OS manages to keep enough RAM available. Also the mempool dump still shows onode and data cached at a seemingly reasonable level. Users report a more stable performance of the FS after I increased the cache min sizes on all OSDs. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 17 August 2020 09:37 To: Dan van der Ster Cc: ceph-users Subject: [ceph-users] Re: OSD memory leak? Hi Dan, I use the container docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I can see, it uses the packages from http://download.ceph.com/rpm-mimic/el7; it's a CentOS 7 build. The version is: # ceph -v ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable) On CentOS, the profiler packages are named differently, without the "google-" prefix. The version I have installed is # pprof --version pprof (part of gperftools 2.0) Copyright 1998-2007 Google Inc. This is BSD licensed software; see the source for copying conditions and license information. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. It is possible to install pprof inside this container and analyse the *.heap-files I provided. If this doesn't work for you and you want me to generate the text output for heap-files, I can do that. Please let me know if I should do all files and with what option (e.g. against a base, etc.).
Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Dan van der Ster Sent: 14 August 2020 10:38:57 To: Frank Schilder Cc: Mark Nelson; ceph-users Subject: Re: [ceph-users] Re: OSD memory leak? Hi Frank, I'm having trouble getting the exact version of ceph you used to create this heap profile. Could you run the google-pprof --text steps at [1] and share the output? Thanks, Dan [1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/ On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder wrote: > > Hi Mark, > > here is a first collection of heap profiling data (valid 30 days): > > https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l > > This was collected with the following config settings: > > osd dev osd_memory_cache_min > 805306368 > osd basicosd_memory_target > 2147483648 > > Setting the cache_min value seems to help keeping cache space available. > Unfortunately, the above collection is for 12 days only. I needed to restart > the OSD and will need to restart it soon again. I hope I can then run a > longer sample. The profiling does cause slow ops though. > > Maybe you can see something already? It seems to have collected some leaked > memory. Unfortunately, it was a period of extremely low load. Basically, with > the day of recording the utilization dropped to almost zero. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Frank Schilder > Sent: 21 July 2020 12:57:32 > To: Mark Nelson; Dan van der Ster > Cc: ceph-users > Subject: [ceph-users] Re: OSD memory leak? > > Quick question: Is there a way to change the frequency of heap dumps? On this > page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a function > HeapProfilerSetAllocationInterval() is mentioned, but no other way of > configuring this. Is there a config parameter or a ceph daemon call to adjust > this? > > If not, can I change the dump path? 
> > Its likely to overrun my log partition quickly if I cannot adjust either of > the two. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Frank Schilder > Sent: 20 July 2020 15:19:05 > To: Mark Nelson; Dan van der Ster > Cc: ceph-users > Subject: [ceph-users] Re: OSD memory leak? > > Dear Mark, > > thank you very much for the very helpful answers. I will raise > osd_memory_cache_min, leave everything else alone and watch what happens. I > will report back here. > > Thanks also for raising this as an issue. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, ru
[ceph-users] Re: OSD memory leak?
Hi Dan, I use the container docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I can see, it uses the packages from http://download.ceph.com/rpm-mimic/el7; it's a CentOS 7 build. The version is: # ceph -v ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable) On CentOS, the profiler packages are named differently, without the "google-" prefix. The version I have installed is # pprof --version pprof (part of gperftools 2.0) Copyright 1998-2007 Google Inc. This is BSD licensed software; see the source for copying conditions and license information. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. It is possible to install pprof inside this container and analyse the *.heap-files I provided. If this doesn't work for you and you want me to generate the text output for heap-files, I can do that. Please let me know if I should do all files and with what option (e.g. against a base, etc.). Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Dan van der Ster Sent: 14 August 2020 10:38:57 To: Frank Schilder Cc: Mark Nelson; ceph-users Subject: Re: [ceph-users] Re: OSD memory leak? Hi Frank, I'm having trouble getting the exact version of ceph you used to create this heap profile. Could you run the google-pprof --text steps at [1] and share the output? Thanks, Dan [1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/ On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder wrote: > > Hi Mark, > > here is a first collection of heap profiling data (valid 30 days): > > https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l > > This was collected with the following config settings: > > osd dev osd_memory_cache_min 805306368 > osd basic osd_memory_target 2147483648 > > Setting the cache_min value seems to help keeping cache space available. > Unfortunately, the above collection is for 12 days only.
I needed to restart > the OSD and will need to restart it soon again. I hope I can then run a > longer sample. The profiling does cause slow ops though. > > Maybe you can see something already? It seems to have collected some leaked > memory. Unfortunately, it was a period of extremely low load. Basically, with > the day of recording the utilization dropped to almost zero. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Frank Schilder > Sent: 21 July 2020 12:57:32 > To: Mark Nelson; Dan van der Ster > Cc: ceph-users > Subject: [ceph-users] Re: OSD memory leak? > > Quick question: Is there a way to change the frequency of heap dumps? On this > page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a function > HeapProfilerSetAllocationInterval() is mentioned, but no other way of > configuring this. Is there a config parameter or a ceph daemon call to adjust > this? > > If not, can I change the dump path? > > Its likely to overrun my log partition quickly if I cannot adjust either of > the two. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Frank Schilder > Sent: 20 July 2020 15:19:05 > To: Mark Nelson; Dan van der Ster > Cc: ceph-users > Subject: [ceph-users] Re: OSD memory leak? > > Dear Mark, > > thank you very much for the very helpful answers. I will raise > osd_memory_cache_min, leave everything else alone and watch what happens. I > will report back here. > > Thanks also for raising this as an issue. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Mark Nelson > Sent: 20 July 2020 15:08:11 > To: Frank Schilder; Dan van der Ster > Cc: ceph-users > Subject: Re: [ceph-users] Re: OSD memory leak? > > On 7/20/20 3:23 AM, Frank Schilder wrote: > > Dear Mark and Dan, > > > > I'm in the process of restarting all OSDs and could use some quick advice > > on bluestore cache settings. 
My plan is to set higher minimum values and > > deal with accumulated excess usage via regular restarts. Looking at the > > documentation > > (https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/), > > I find the following relevant options (with defaults): > > > > # Automatic Cache Sizing > > osd_memory_target {4294967296} # 4GB > > osd_memory_base {805306368} # 768MB > >
[ceph-users] Re: OSD memory leak?
Hi Mark, here is a first collection of heap profiling data (valid 30 days): https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l This was collected with the following config settings:

osd   dev     osd_memory_cache_min   805306368
osd   basic   osd_memory_target      2147483648

Setting the cache_min value seems to help keeping cache space available. Unfortunately, the above collection is for 12 days only. I needed to restart the OSD and will need to restart it soon again. I hope I can then run a longer sample. The profiling does cause slow ops though. Maybe you can see something already? It seems to have collected some leaked memory. Unfortunately, it was a period of extremely low load. Basically, with the day of recording the utilization dropped to almost zero. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 21 July 2020 12:57:32 To: Mark Nelson; Dan van der Ster Cc: ceph-users Subject: [ceph-users] Re: OSD memory leak? Quick question: Is there a way to change the frequency of heap dumps? On this page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a function HeapProfilerSetAllocationInterval() is mentioned, but no other way of configuring this. Is there a config parameter or a ceph daemon call to adjust this? If not, can I change the dump path? It's likely to overrun my log partition quickly if I cannot adjust either of the two. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: 20 July 2020 15:19:05 To: Mark Nelson; Dan van der Ster Cc: ceph-users Subject: [ceph-users] Re: OSD memory leak? Dear Mark, thank you very much for the very helpful answers. I will raise osd_memory_cache_min, leave everything else alone and watch what happens. I will report back here. Thanks also for raising this as an issue.
Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Mark Nelson Sent: 20 July 2020 15:08:11 To: Frank Schilder; Dan van der Ster Cc: ceph-users Subject: Re: [ceph-users] Re: OSD memory leak? On 7/20/20 3:23 AM, Frank Schilder wrote: > Dear Mark and Dan, > > I'm in the process of restarting all OSDs and could use some quick advice on > bluestore cache settings. My plan is to set higher minimum values and deal > with accumulated excess usage via regular restarts. Looking at the > documentation > (https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/), > I find the following relevant options (with defaults): > > # Automatic Cache Sizing > osd_memory_target {4294967296} # 4GB > osd_memory_base {805306368} # 768MB > osd_memory_cache_min {134217728} # 128MB > > # Manual Cache Sizing > bluestore_cache_meta_ratio {.4} # 40% ? > bluestore_cache_kv_ratio {.4} # 40% ? > bluestore_cache_kv_max {512 * 1024*1024} # 512MB > > Q1) If I increase osd_memory_cache_min, should I also increase > osd_memory_base by the same or some other amount? osd_memory_base is a hint at how much memory the OSD could consume outside the cache once it's reached steady state. It basically sets a hard cap on how much memory the cache will use to avoid over-committing memory and thrashing when we exceed the memory limit. It's not necessary to get it right, it just helps smooth things out by making the automatic memory tuning less aggressive. IE if you have a 2 GB memory target and a 512MB base, you'll never assign more than 1.5GB to the cache on the assumption that the rest of the OSD will eventually need 512MB to operate even if it's not using that much right now. I think you can probably just leave it alone. What you and Dan appear to be seeing is that this number isn't static in your case but increases over time any way. Eventually I'm hoping that we can automatically account for more and more of that memory by reading the data from the mempools. 
> Q2) The cache ratio options are shown under the section "Manual Cache > Sizing". Do they also apply when cache auto tuning is enabled? If so, is it > worth changing these defaults for higher values of osd_memory_cache_min? They actually do have an effect on the automatic cache sizing and probably shouldn't only be under the manual section. When you have the automatic cache sizing enabled, those options will affect the "fair share" values of the different caches at each cache priority level. IE at priority level 0, if both caches want more memory than is available, those ratios will determine how much each cache gets. If there is more memory available than requested, each cache gets as much as they want and we move on to the next priority level and do the s
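Mark's 2 GB target / 512 MB base example above is plain arithmetic: with automatic cache sizing, the caches are capped at osd_memory_target minus osd_memory_base. A minimal check of the numbers (values are the ones from his example, in bytes):

```shell
# Cache cap from Mark's example: target 2 GiB, base 512 MiB -> cache <= 1.5 GiB.
# osd_memory_base reserves headroom for the OSD's non-cache memory.
awk 'BEGIN {
    target = 2 * 1024 ^ 3;      # osd_memory_target (example value)
    base   = 512 * 1024 ^ 2;    # osd_memory_base   (example value)
    printf "max cache = %.1f GiB\n", (target - base) / 1024 ^ 3;
}'
```

With the mimic defaults quoted above (target 4 GiB, base 768 MiB), the same formula caps the caches at 3.25 GiB.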
[ceph-users] Re: Ceph does not recover from OSD restart
If by "monitor log" you mean the cluster log /var/log/ceph/ceph.log, I should have all of it. Please find a tgz-file here: https://files.dtu.dk/u/tFCEZJzQhH2mUIRk/logs.tgz?l (valid 100 days). Contents:
logs/ceph-2020-08-03.log - cluster log for the day of restart
logs/ceph-osd.145.2020-08-03.log - log of "old" OSD trimmed to the day of restart
logs/ceph-osd.288.log - entire log of "new" OSD
Hope this helps. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eric Smith Sent: 04 August 2020 14:15:11 To: Frank Schilder; ceph-users Subject: RE: Ceph does not recover from OSD restart Do you have any monitor / OSD logs from the maintenance when the issues occurred? -------- Original message -------- From: Frank Schilder Date: 8/4/20 8:07 AM (GMT-05:00) To: Eric Smith , ceph-users Subject: Re: Ceph does not recover from OSD restart Hi Eric, thanks for the clarification, I did misunderstand you. > You should not have to move OSDs in and out of the CRUSH tree however > in order to solve any data placement problems (This is the baffling part). Exactly. Should I create a tracker issue? I think this is not hard to reproduce with a standard crush map where host-bucket=physical host and I would, in fact, expect that this scenario is part of the integration test. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eric Smith Sent: 04 August 2020 13:58:47 To: Frank Schilder; ceph-users Subject: RE: Ceph does not recover from OSD restart All seems in order in terms of your CRUSH layout. You can speed up the rebalancing / scale-out operations by increasing the osd_max_backfills on each OSD (Especially during off hours). The unnecessary degradation is not expected behavior with a cluster in HEALTH_OK status, but with backfill / rebalancing ongoing it's not unexpected. You should not have to move OSDs in and out of the CRUSH tree however in order to solve any data placement problems (This is the baffling part).
-Original Message- From: Frank Schilder Sent: Tuesday, August 4, 2020 7:45 AM To: Eric Smith ; ceph-users Subject: Re: Ceph does not recover from OSD restart Hi Eric, I added the disks and started the rebalancing. When I ran into the issue, ca. 3 days after the start of rebalancing, it was about 25% done. The cluster does not go to HEALTH_OK before the rebalancing is finished; it shows the "xxx objects misplaced" warning. The OSD crush locations for the logical hosts are in ceph.conf, the OSDs come up in the proper crush bucket. > All seems in order then In what sense? The rebalancing is still ongoing and usually a very long operation. This time I added only 9 disks, but we will almost triple the number of disks of a larger pool soon, which has 150 OSDs at the moment. I expect the rebalancing for this expansion to take months. Due to a memory leak, I need to restart OSDs regularly. Also, a host may restart or we might have a power outage during this window. In these situations, it will be a real pain if I have to play the crush move game with 300+ OSDs. This unnecessary redundancy degradation on OSD restart cannot possibly be expected behaviour, or do I misunderstand something here? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eric Smith Sent: 04 August 2020 13:19:41 To: Frank Schilder; ceph-users Subject: RE: Ceph does not recover from OSD restart All seems in order then - when you ran into your maintenance issue, how long was it after you added the new OSDs, and did Ceph ever get to HEALTH_OK so it could trim PG history? Also, did the OSDs just start back up in the wrong place in the CRUSH tree? -Original Message- From: Frank Schilder Sent: Tuesday, August 4, 2020 7:10 AM To: Eric Smith ; ceph-users Subject: Re: Ceph does not recover from OSD restart Hi Eric, > Have you adjusted the min_size for pool sr-rbd-data-one-hdd Yes. For all EC pools located in datacenter ServerRoom, we currently set min_size=k=6, because we lack physical servers.
Hosts ceph-21 and ceph-22 are logical but not physical; disks in these buckets are co-located such that no more than 2 host buckets share the same physical host. With failure domain = host, we can ensure that no more than 2 EC shards are on the same physical host. With m=2 and min_size=k we have continued service with any 1 physical host down for maintenance, and recovery will also happen if a physical host fails. Some objects will have no redundancy for a while then. We will increase min_size to k+1 as soon as we have 2 additional hosts, and simply move the OSDs from buckets ceph-21/22 to these without rebalancing. The distribution of disks and buckets is listed below as well (longer listing).

Thanks and best regards,
Frank
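The shard arithmetic behind this layout can be sanity-checked quickly. A minimal sketch (my own illustration, not from the thread; variable names are assumptions) of why min_size=k=6 still leaves the pool serviceable with one physical host down:

```python
# EC profile sr-ec-6-2-hdd: k=6 data shards, m=2 coding shards, one shard per
# CRUSH host bucket. At most 2 host buckets share one physical host, so a
# single physical-host failure removes at most 2 shards from any PG.
k, m = 6, 2
shards = k + m                     # 8 shards total, failure domain = host
buckets_per_physical_host = 2      # worst case in this layout
surviving = shards - buckets_per_physical_host
min_size = k                       # min_size=k=6 as configured
print(surviving, surviving >= min_size)  # 6 True -> PGs stay active
```

With min_size=k there is no write redundancy left during such an outage, which is exactly why the thread plans to raise min_size to k+1 once additional hosts arrive.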
[ceph-users] Re: Ceph does not recover from OSD restart
Hi Eric,

thanks for the clarification, I did misunderstand you.

> You should not have to move OSDs in and out of the CRUSH tree however
> in order to solve any data placement problems (This is the baffling part).

Exactly. Should I create a tracker issue? I think this is not hard to reproduce with a standard crush map where host-bucket=physical host, and I would, in fact, expect that this scenario is part of the integration test.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eric Smith
Sent: 04 August 2020 13:58:47
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

All seems in order in terms of your CRUSH layout. You can speed up the rebalancing / scale-out operations by increasing osd_max_backfills on each OSD (especially during off hours). The unnecessary degradation is not expected behavior with a cluster in HEALTH_OK status, but with backfill / rebalancing ongoing it's not unexpected. You should not have to move OSDs in and out of the CRUSH tree, however, in order to solve any data placement problems (this is the baffling part).

-----Original Message-----
From: Frank Schilder
Sent: Tuesday, August 4, 2020 7:45 AM
To: Eric Smith; ceph-users
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

I added the disks and started the rebalancing. When I ran into the issue, ca. 3 days after the start of rebalancing, it was about 25% done. The cluster does not go to HEALTH_OK before the rebalancing is finished; it shows the "xxx objects misplaced" warning. The OSD crush locations for the logical hosts are in ceph.conf, and the OSDs come up in the proper crush bucket.

> All seems in order then

In what sense? The rebalancing is still ongoing and usually a very long operation. This time I added only 9 disks, but we will almost triple the number of disks of a larger pool soon, which has 150 OSDs at the moment. I expect the rebalancing for this expansion to take months.
Due to a memory leak, I need to restart OSDs regularly. Also, a host may restart or we might have a power outage during this window. In these situations, it will be a real pain if I have to play the crush move game with 300+ OSDs. This unnecessary redundancy degradation on OSD restart cannot possibly be expected behaviour, or do I misunderstand something here?

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eric Smith
Sent: 04 August 2020 13:19:41
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

All seems in order then - when you ran into your maintenance issue, how long was it after you added the new OSDs, and did Ceph ever get to HEALTH_OK so it could trim PG history? Also, did the OSDs just start back up in the wrong place in the CRUSH tree?

-----Original Message-----
From: Frank Schilder
Sent: Tuesday, August 4, 2020 7:10 AM
To: Eric Smith; ceph-users
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

> Have you adjusted the min_size for pool sr-rbd-data-one-hdd

Yes. For all EC pools located in datacenter ServerRoom, we currently set min_size=k=6, because we lack physical servers. Hosts ceph-21 and ceph-22 are logical but not physical; disks in these buckets are co-located such that no more than 2 host buckets share the same physical host. With failure domain = host, we can ensure that no more than 2 EC shards are on the same physical host. With m=2 and min_size=k we have continued service with any 1 physical host down for maintenance, and recovery will also happen if a physical host fails. Some objects will have no redundancy for a while then. We will increase min_size to k+1 as soon as we have 2 additional hosts, and simply move the OSDs from buckets ceph-21/22 to these without rebalancing. The distribution of disks and buckets is listed below as well (longer listing).
Thanks and best regards,
Frank

# ceph osd erasure-code-profile ls
con-ec-8-2-hdd
con-ec-8-2-ssd
default
sr-ec-6-2-hdd

This is the relevant one:

# ceph osd erasure-code-profile get sr-ec-6-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ServerRoom
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8

Note that the pool sr-rbd-data-one (id 2) was created with this profile and later moved to SSD. Therefore, the crush rule does not match the profile's device class any more. These two are under different roots:

# ceph osd erasure-code-profile get con-ec-8-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get con-ec-8-2-ssd
crush-device-class=ssd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
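As a side note, the stripe_width values that appear later in "ceph osd pool ls detail" follow directly from these profiles: for an EC pool, stripe_width is k times the per-shard stripe unit. A small cross-check (my own sketch, not part of the original mails):

```python
# stripe_width = k * stripe_unit for EC pools; both pools below resolve to the
# 4096-byte default stripe unit. Values taken from the thread's listings.
pools = {
    "sr-rbd-data-one":      {"k": 6, "stripe_width": 24576},
    "con-rbd-data-hpc-one": {"k": 8, "stripe_width": 32768},
}
for name, p in pools.items():
    print(name, p["stripe_width"] // p["k"])  # 4096 for both
```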
[ceph-users] Re: Ceph does not recover from OSD restart
Hi Eric,

I added the disks and started the rebalancing. When I ran into the issue, ca. 3 days after the start of rebalancing, it was about 25% done. The cluster does not go to HEALTH_OK before the rebalancing is finished; it shows the "xxx objects misplaced" warning. The OSD crush locations for the logical hosts are in ceph.conf, and the OSDs come up in the proper crush bucket.

> All seems in order then

In what sense? The rebalancing is still ongoing and usually a very long operation. This time I added only 9 disks, but we will almost triple the number of disks of a larger pool soon, which has 150 OSDs at the moment. I expect the rebalancing for this expansion to take months. Due to a memory leak, I need to restart OSDs regularly. Also, a host may restart or we might have a power outage during this window. In these situations, it will be a real pain if I have to play the crush move game with 300+ OSDs. This unnecessary redundancy degradation on OSD restart cannot possibly be expected behaviour, or do I misunderstand something here?

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eric Smith
Sent: 04 August 2020 13:19:41
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

All seems in order then - when you ran into your maintenance issue, how long was it after you added the new OSDs, and did Ceph ever get to HEALTH_OK so it could trim PG history? Also, did the OSDs just start back up in the wrong place in the CRUSH tree?

-----Original Message-----
From: Frank Schilder
Sent: Tuesday, August 4, 2020 7:10 AM
To: Eric Smith; ceph-users
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

> Have you adjusted the min_size for pool sr-rbd-data-one-hdd

Yes. For all EC pools located in datacenter ServerRoom, we currently set min_size=k=6, because we lack physical servers.
Hosts ceph-21 and ceph-22 are logical but not physical; disks in these buckets are co-located such that no more than 2 host buckets share the same physical host. With failure domain = host, we can ensure that no more than 2 EC shards are on the same physical host. With m=2 and min_size=k we have continued service with any 1 physical host down for maintenance, and recovery will also happen if a physical host fails. Some objects will have no redundancy for a while then. We will increase min_size to k+1 as soon as we have 2 additional hosts, and simply move the OSDs from buckets ceph-21/22 to these without rebalancing. The distribution of disks and buckets is listed below as well (longer listing).

Thanks and best regards,
Frank

# ceph osd erasure-code-profile ls
con-ec-8-2-hdd
con-ec-8-2-ssd
default
sr-ec-6-2-hdd

This is the relevant one:

# ceph osd erasure-code-profile get sr-ec-6-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ServerRoom
jerasure-per-chunk-alignment=false
k=6
m=2
plugin=jerasure
technique=reed_sol_van
w=8

Note that the pool sr-rbd-data-one (id 2) was created with this profile and later moved to SSD. Therefore, the crush rule does not match the profile's device class any more.
These two are under different roots:

# ceph osd erasure-code-profile get con-ec-8-2-hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get con-ec-8-2-ssd
crush-device-class=ssd
crush-failure-domain=host
crush-root=ContainerSquare
jerasure-per-chunk-alignment=false
k=8
m=2
plugin=jerasure
technique=reed_sol_van
w=8

Full physical placement information for OSDs under tree "datacenter ServerRoom":

ceph-04
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   243  ceph-04  1.8T    SSD
osd-phy1   247  ceph-21  1.8T    SSD
osd-phy2   254  ceph-04  1.8T    SSD
osd-phy3   256  ceph-04  1.8T    SSD
osd-phy4   286  ceph-04  1.8T    SSD
osd-phy5   287  ceph-04  1.8T    SSD
osd-phy6   288  ceph-04  10.7T   HDD
osd-phy7   48   ceph-04  372.6G  SSD
osd-phy8   264  ceph-21  1.8T    SSD
osd-phy9   84   ceph-04  8.9T    HDD
osd-phy10  72   ceph-21  8.9T    HDD
osd-phy11  145  ceph-04  8.9T    HDD
osd-phy14  156  ceph-04  8.9T    HDD
osd-phy15  168  ceph-04  8.9T    HDD
osd-phy16  181  ceph-04  8.9T    HDD
osd-phy17  0    ceph-21  8.9T    HDD

ceph-05
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   240  ceph-05  1.8T    SSD
osd-phy1   249  ceph-22  1.8T    SSD
osd-phy2   251  ceph-05  1.8T    SSD
osd-phy3   255  ceph-05  1.8T    SSD
osd-phy4   284  ceph-05  1.8T    SSD
osd-phy5   285  ceph-05  1.8T    SSD
osd-phy6   289  ceph-05  10.7T   HDD
osd-phy7   49   ceph-05  372.6G  SSD
osd-phy8   265  ceph-22  1.
[ceph-users] Re: Ceph does not recover from OSD restart
8.9T HDD
osd-phy16  183  ceph-07  8.9T    HDD
osd-phy17  3    ceph-22  8.9T    HDD

ceph-18
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   241  ceph-18  1.8T    SSD
osd-phy1   248  ceph-18  1.8T    SSD
osd-phy2   41   ceph-18  372.6G  SSD
osd-phy3   31   ceph-18  372.6G  SSD
osd-phy4   277  ceph-18  1.8T    SSD
osd-phy5   278  ceph-21  1.8T    SSD
osd-phy6   53   ceph-21  372.6G  SSD
osd-phy7   267  ceph-18  1.8T    SSD
osd-phy8   266  ceph-18  1.8T    SSD
osd-phy9   293  ceph-18  10.7T   HDD
osd-phy10  86   ceph-21  8.9T    HDD
osd-phy11  259  ceph-18  10.9T   HDD
osd-phy14  229  ceph-18  8.9T    HDD
osd-phy15  232  ceph-18  8.9T    HDD
osd-phy16  235  ceph-18  8.9T    HDD
osd-phy17  238  ceph-18  8.9T    HDD

ceph-19
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   261  ceph-19  1.8T    SSD
osd-phy1   262  ceph-19  1.8T    SSD
osd-phy2   295  ceph-19  10.7T   HDD
osd-phy3   43   ceph-19  372.6G  SSD
osd-phy4   275  ceph-19  1.8T    SSD
osd-phy5   294  ceph-22  10.7T   HDD
osd-phy6   51   ceph-22  372.6G  SSD
osd-phy7   269  ceph-19  1.8T    SSD
osd-phy8   268  ceph-19  1.8T    SSD
osd-phy9   276  ceph-22  1.8T    SSD
osd-phy10  73   ceph-22  8.9T    HDD
osd-phy11  263  ceph-19  10.9T   HDD
osd-phy14  231  ceph-19  8.9T    HDD
osd-phy15  233  ceph-19  8.9T    HDD
osd-phy16  236  ceph-19  8.9T    HDD
osd-phy17  239  ceph-19  8.9T    HDD

ceph-20
CONT       ID   BUCKET   SIZE    TYP
osd-phy0   245  ceph-20  1.8T    SSD
osd-phy1   28   ceph-20  372.6G  SSD
osd-phy2   44   ceph-20  372.6G  SSD
osd-phy3   271  ceph-20  1.8T    SSD
osd-phy4   272  ceph-20  1.8T    SSD
osd-phy5   273  ceph-20  1.8T    SSD
osd-phy6   274  ceph-21  1.8T    SSD
osd-phy7   296  ceph-20  10.7T   HDD
osd-phy8   76   ceph-21  8.9T    HDD
osd-phy9   39   ceph-21  372.6G  SSD
osd-phy10  270  ceph-20  1.8T    SSD
osd-phy11  260  ceph-20  10.9T   HDD
osd-phy14  228  ceph-20  8.9T    HDD
osd-phy15  230  ceph-20  8.9T    HDD
osd-phy16  234  ceph-20  8.9T    HDD
osd-phy17  237  ceph-20  8.9T    HDD

CONT is the container name and encodes the physical slot on the host where the OSD is located.

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eric Smith
Sent: 04 August 2020 12:47:12
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

Have you adjusted the min_size for pool sr-rbd-data-one-hdd at all?
Also, can you send the output of "ceph osd erasure-code-profile ls" and, for each EC profile, "ceph osd erasure-code-profile get <profile>"?

-----Original Message-----
From: Frank Schilder
Sent: Monday, August 3, 2020 11:05 AM
To: Eric Smith; ceph-users
Subject: Re: Ceph does not recover from OSD restart

Sorry for the many small e-mails: requested IDs in the commands, 288-296. One new OSD per host.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Frank Schilder
Sent: 03 August 2020 16:59:04
To: Eric Smith; ceph-users
Subject: [ceph-users] Re: Ceph does not recover from OSD restart

Hi Eric,

the procedure for re-discovering all objects is:

# Flag: norebalance
ceph osd crush move osd.288 host=bb-04
ceph osd crush move osd.289 host=bb-05
ceph osd crush move osd.290 host=bb-06
ceph osd crush move osd.291 host=bb-21
ceph osd crush move osd.292 host=bb-07
ceph osd crush move osd.293 host=bb-18
ceph osd crush move osd.295 host=bb-19
ceph osd crush move osd.294 host=bb-22
ceph osd crush move osd.296 host=bb-20

# Wait until all PGs are peered and recovery is done. In my case, there was only little I/O,
# no more than 50-100 objects had writes missing and recovery was a few seconds.
#
# The bb-hosts are under a separate crush root that I use solely as parking space
# and for draining OSDs.

ceph osd crush move osd.288 host=ceph-04
ceph osd crush move osd.289 host=ceph-05
ceph osd crush move osd.290 host=ceph-06
ceph osd crush move osd.291 host=ceph-21
ceph osd crush move osd.292 host=ceph-07
ceph osd crush move osd.293 host=ceph-18
ceph osd crush move osd.295 host=ceph-19
ceph osd crush move osd.294 host=ceph-22
ceph osd crush move osd.296 host=ceph-20

After peering, no degraded PGs/objects any more, just the misplaced ones as expected.
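Since the park-and-restore trick has to be replayed exactly in reverse, it helps to generate both command lists from a single mapping. A hypothetical helper (my own sketch; the osd-to-host mapping and the bb-* parking buckets are taken from the procedure in the thread):

```python
# Generate matching "park" and "restore" crush-move commands from one mapping,
# so the reverse moves cannot drift out of sync with the forward ones.
placements = {288: "04", 289: "05", 290: "06", 291: "21", 292: "07",
              293: "18", 295: "19", 294: "22", 296: "20"}
park    = [f"ceph osd crush move osd.{o} host=bb-{h}"   for o, h in placements.items()]
restore = [f"ceph osd crush move osd.{o} host=ceph-{h}" for o, h in placements.items()]
print(park[0])     # ceph osd crush move osd.288 host=bb-04
print(restore[0])  # ceph osd crush move osd.288 host=ceph-04
```

The printed lines could then be reviewed and pasted into a shell with norebalance set, as described above.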
Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eric Smith
Sent: 03 August 2020 16:45:28
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

You said you had to move some OSDs out and back in for Ceph to go back to normal (the OSDs you added). Which OSDs were added?
[ceph-users] Re: Ceph does not recover from OSD restart
Sorry for the many small e-mails: requested IDs in the commands, 288-296. One new OSD per host.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Frank Schilder
Sent: 03 August 2020 16:59:04
To: Eric Smith; ceph-users
Subject: [ceph-users] Re: Ceph does not recover from OSD restart

Hi Eric,

the procedure for re-discovering all objects is:

# Flag: norebalance
ceph osd crush move osd.288 host=bb-04
ceph osd crush move osd.289 host=bb-05
ceph osd crush move osd.290 host=bb-06
ceph osd crush move osd.291 host=bb-21
ceph osd crush move osd.292 host=bb-07
ceph osd crush move osd.293 host=bb-18
ceph osd crush move osd.295 host=bb-19
ceph osd crush move osd.294 host=bb-22
ceph osd crush move osd.296 host=bb-20

# Wait until all PGs are peered and recovery is done. In my case, there was only little I/O,
# no more than 50-100 objects had writes missing and recovery was a few seconds.
#
# The bb-hosts are under a separate crush root that I use solely as parking space
# and for draining OSDs.

ceph osd crush move osd.288 host=ceph-04
ceph osd crush move osd.289 host=ceph-05
ceph osd crush move osd.290 host=ceph-06
ceph osd crush move osd.291 host=ceph-21
ceph osd crush move osd.292 host=ceph-07
ceph osd crush move osd.293 host=ceph-18
ceph osd crush move osd.295 host=ceph-19
ceph osd crush move osd.294 host=ceph-22
ceph osd crush move osd.296 host=ceph-20

After peering, no degraded PGs/objects any more, just the misplaced ones as expected.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eric Smith
Sent: 03 August 2020 16:45:28
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

You said you had to move some OSDs out and back in for Ceph to go back to normal (the OSDs you added). Which OSDs were added?

-----Original Message-----
From: Frank Schilder
Sent: Monday, August 3, 2020 9:55 AM
To: Eric Smith; ceph-users
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

thanks for your fast response.
Below the output, shortened a bit as indicated. Disks have been added to pool 11 'sr-rbd-data-one-hdd' only; this is the only pool with remapped PGs and is also the only pool experiencing the "loss of track" to objects. Every other pool recovers from restart by itself.

Best regards,
Frank

# ceph osd pool stats
pool sr-rbd-meta-one id 1
  client io 5.3 KiB/s rd, 3.2 KiB/s wr, 4 op/s rd, 1 op/s wr
pool sr-rbd-data-one id 2
  client io 24 MiB/s rd, 32 MiB/s wr, 380 op/s rd, 594 op/s wr
pool sr-rbd-one-stretch id 3
  nothing is going on
pool con-rbd-meta-hpc-one id 7
  nothing is going on
pool con-rbd-data-hpc-one id 8
  client io 0 B/s rd, 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr
pool sr-rbd-data-one-hdd id 11
  53241814/346903376 objects misplaced (15.348%)
  client io 73 MiB/s rd, 3.4 MiB/s wr, 236 op/s rd, 69 op/s wr
pool con-fs2-meta1 id 12
  client io 106 KiB/s rd, 112 KiB/s wr, 3 op/s rd, 11 op/s wr
pool con-fs2-meta2 id 13
  client io 0 B/s wr, 0 op/s rd, 0 op/s wr
pool con-fs2-data id 14
  client io 5.5 MiB/s rd, 201 KiB/s wr, 34 op/s rd, 8 op/s wr
pool con-fs2-data-ec-ssd id 17
  nothing is going on
pool ms-rbd-one id 18
  client io 5.6 MiB/s wr, 0 op/s rd, 179 op/s wr

# ceph osd pool ls detail
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 80 pgp_num 80 last_change 122597 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 application rbd
        removed_snaps [1~45]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186437 lfor 0/126858 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 43980465111040 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
        removed_snaps [1~3,5~2, ... huge list ...
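The per-pool misplaced percentage reported above is just the ratio of misplaced object shards to total shards, which makes the numbers easy to cross-check (my own illustration, using pool 11's counters from the output):

```python
# Cross-check pool 11's misplaced percentage from "ceph osd pool stats".
misplaced, total = 53241814, 346903376
print(f"{100 * misplaced / total:.3f}%")  # 15.348%
```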
,11f9d~1,11fa0~2]
pool 3 'sr-rbd-one-stretch' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 160 pgp_num 160 last_change 143202 lfor 0/79983 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd
        removed_snaps [1~7,b~2,11~2,14~2,17~9e,b8~1e]
pool 7 'con-rbd-meta-hpc-one' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96357 lfor 0/90462 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 10737418240 stripe_width 0 application rbd
        removed_snaps [1~3]
pool 8 'con-rbd-data-hpc-one' erasure size 10 min_size 9 crush_rule 7 object_hash rjenkins pg_num 150 pgp_num 150 last_change 96358 lfor 0/90996 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 32768 fast_read 1 compression_mode aggressive application rbd
        removed_snaps [1~7,9~2]
pool 11 'sr-rbd-data-one-hdd' erasure size 8 min_size 6 crush_rule 9 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186331 lfor 0/127768 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 21990232200 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
[ceph-users] Re: Ceph does not recover from OSD restart
Hi Eric,

the procedure for re-discovering all objects is:

# Flag: norebalance
ceph osd crush move osd.288 host=bb-04
ceph osd crush move osd.289 host=bb-05
ceph osd crush move osd.290 host=bb-06
ceph osd crush move osd.291 host=bb-21
ceph osd crush move osd.292 host=bb-07
ceph osd crush move osd.293 host=bb-18
ceph osd crush move osd.295 host=bb-19
ceph osd crush move osd.294 host=bb-22
ceph osd crush move osd.296 host=bb-20

# Wait until all PGs are peered and recovery is done. In my case, there was only little I/O,
# no more than 50-100 objects had writes missing and recovery was a few seconds.
#
# The bb-hosts are under a separate crush root that I use solely as parking space
# and for draining OSDs.

ceph osd crush move osd.288 host=ceph-04
ceph osd crush move osd.289 host=ceph-05
ceph osd crush move osd.290 host=ceph-06
ceph osd crush move osd.291 host=ceph-21
ceph osd crush move osd.292 host=ceph-07
ceph osd crush move osd.293 host=ceph-18
ceph osd crush move osd.295 host=ceph-19
ceph osd crush move osd.294 host=ceph-22
ceph osd crush move osd.296 host=ceph-20

After peering, no degraded PGs/objects any more, just the misplaced ones as expected.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eric Smith
Sent: 03 August 2020 16:45:28
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

You said you had to move some OSDs out and back in for Ceph to go back to normal (the OSDs you added). Which OSDs were added?

-----Original Message-----
From: Frank Schilder
Sent: Monday, August 3, 2020 9:55 AM
To: Eric Smith; ceph-users
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

thanks for your fast response. Below the output, shortened a bit as indicated. Disks have been added to pool 11 'sr-rbd-data-one-hdd' only; this is the only pool with remapped PGs and is also the only pool experiencing the "loss of track" to objects. Every other pool recovers from restart by itself.
Best regards,
Frank

# ceph osd pool stats
pool sr-rbd-meta-one id 1
  client io 5.3 KiB/s rd, 3.2 KiB/s wr, 4 op/s rd, 1 op/s wr
pool sr-rbd-data-one id 2
  client io 24 MiB/s rd, 32 MiB/s wr, 380 op/s rd, 594 op/s wr
pool sr-rbd-one-stretch id 3
  nothing is going on
pool con-rbd-meta-hpc-one id 7
  nothing is going on
pool con-rbd-data-hpc-one id 8
  client io 0 B/s rd, 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr
pool sr-rbd-data-one-hdd id 11
  53241814/346903376 objects misplaced (15.348%)
  client io 73 MiB/s rd, 3.4 MiB/s wr, 236 op/s rd, 69 op/s wr
pool con-fs2-meta1 id 12
  client io 106 KiB/s rd, 112 KiB/s wr, 3 op/s rd, 11 op/s wr
pool con-fs2-meta2 id 13
  client io 0 B/s wr, 0 op/s rd, 0 op/s wr
pool con-fs2-data id 14
  client io 5.5 MiB/s rd, 201 KiB/s wr, 34 op/s rd, 8 op/s wr
pool con-fs2-data-ec-ssd id 17
  nothing is going on
pool ms-rbd-one id 18
  client io 5.6 MiB/s wr, 0 op/s rd, 179 op/s wr

# ceph osd pool ls detail
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 80 pgp_num 80 last_change 122597 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 application rbd
        removed_snaps [1~45]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186437 lfor 0/126858 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 43980465111040 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
        removed_snaps [1~3,5~2, ... huge list ...
,11f9d~1,11fa0~2]
pool 3 'sr-rbd-one-stretch' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 160 pgp_num 160 last_change 143202 lfor 0/79983 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd
        removed_snaps [1~7,b~2,11~2,14~2,17~9e,b8~1e]
pool 7 'con-rbd-meta-hpc-one' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96357 lfor 0/90462 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 10737418240 stripe_width 0 application rbd
        removed_snaps [1~3]
pool 8 'con-rbd-data-hpc-one' erasure size 10 min_size 9 crush_rule 7 object_hash rjenkins pg_num 150 pgp_num 150 last_change 96358 lfor 0/90996 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 32768 fast_read 1 compression_mode aggressive application rbd
        removed_snaps [1~7,9~2]
pool 11 'sr-rbd-data-one-hdd' erasure size 8 min_size 6 crush_rule 9 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186331 lfor 0/127768 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 21990232200 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
        removed_snaps [1~59f,5a2~fe, ... less huge list ... ,2559~1,255b~1]
        removed_snaps_q
[ceph-users] Re: Ceph does not recover from OSD restart
As a side effect of the restart, the leader also sees blocked ops that never get cleared. I need to restart the mon daemon:

  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            53242005/1492479251 objects misplaced (3.567%)
            Long heartbeat ping times on back interface seen, longest is 13854.181 msec
            Long heartbeat ping times on front interface seen, longest is 13737.799 msec
            1 pools nearfull
            129 slow ops, oldest one blocked for 1699 sec, mon.ceph-01 has slow ops

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs
         flags noout,norebalance

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.4 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     53242005/1492479251 objects misplaced (3.567%)
             2904 active+clean
             294  active+remapped+backfill_wait
             13   active+remapped+backfilling
             4    active+clean+scrubbing+deep

  io:
    client: 120 MiB/s rd, 50 MiB/s wr, 1.15 kop/s rd, 745 op/s wr

Sample OPS:

# ceph daemon mon.ceph-01 ops | grep -e description -e num_ops
"description": "osd_failure(failed timeout osd.241 192.168.32.82:6814/2178578 for 38sec e186626 v186626)",
"description": "osd_failure(failed timeout osd.243 192.168.32.68:6814/3358340 for 37sec e186626 v186626)",
[...]
"description": "osd_failure(failed timeout osd.286 192.168.32.68:6806/3354298 for 37sec e186764 v186764)",
"description": "osd_failure(failed timeout osd.287 192.168.32.68:6804/3353324 for 37sec e186764 v186764)",
"num_ops": 129

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Frank Schilder
Sent: 03 August 2020 15:54:48
To: Eric Smith; ceph-users
Subject: Re: Ceph does not recover from OSD restart

Hi Eric,

thanks for your fast response. Below the output, shortened a bit as indicated.
Disks have been added to pool 11 'sr-rbd-data-one-hdd' only; this is the only pool with remapped PGs and is also the only pool experiencing the "loss of track" to objects. Every other pool recovers from restart by itself.

Best regards,
Frank

# ceph osd pool stats
pool sr-rbd-meta-one id 1
  client io 5.3 KiB/s rd, 3.2 KiB/s wr, 4 op/s rd, 1 op/s wr
pool sr-rbd-data-one id 2
  client io 24 MiB/s rd, 32 MiB/s wr, 380 op/s rd, 594 op/s wr
pool sr-rbd-one-stretch id 3
  nothing is going on
pool con-rbd-meta-hpc-one id 7
  nothing is going on
pool con-rbd-data-hpc-one id 8
  client io 0 B/s rd, 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr
pool sr-rbd-data-one-hdd id 11
  53241814/346903376 objects misplaced (15.348%)
  client io 73 MiB/s rd, 3.4 MiB/s wr, 236 op/s rd, 69 op/s wr
pool con-fs2-meta1 id 12
  client io 106 KiB/s rd, 112 KiB/s wr, 3 op/s rd, 11 op/s wr
pool con-fs2-meta2 id 13
  client io 0 B/s wr, 0 op/s rd, 0 op/s wr
pool con-fs2-data id 14
  client io 5.5 MiB/s rd, 201 KiB/s wr, 34 op/s rd, 8 op/s wr
pool con-fs2-data-ec-ssd id 17
  nothing is going on
pool ms-rbd-one id 18
  client io 5.6 MiB/s wr, 0 op/s rd, 179 op/s wr

# ceph osd pool ls detail
pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 80 pgp_num 80 last_change 122597 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 application rbd
        removed_snaps [1~45]
pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash rjenkins pg_num 560 pgp_num 560 last_change 186437 lfor 0/126858 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 43980465111040 stripe_width 24576 fast_read 1 compression_mode aggressive application rbd
        removed_snaps [1~3,5~2, ... huge list ...
,11f9d~1,11fa0~2]
pool 3 'sr-rbd-one-stretch' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 160 pgp_num 160 last_change 143202 lfor 0/79983 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 1099511627776 stripe_width 0 compression_mode aggressive application rbd
        removed_snaps [1~7,b~2,11~2,14~2,17~9e,b8~1e]
pool 7 'con-rbd-meta-hpc-one' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 50 pgp_num 50 last_change 96357 lfor 0/90462 flags hashpspool,nodelete,selfmanaged_snaps max_bytes 10737418240 stripe_width 0 application rbd
        removed_snaps [1~3]
pool 8 'con-rbd-data-hpc-one' erasure size 10 min_size 9 crush_rule 7 object_hash rjenkins pg_num 150 pgp_num 150 last_change 96358 lfor 0/90996 flags hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 32768 fast_read 1 compression_mode aggressive application rbd
[ceph-users] Re: Ceph does not recover from OSD restart
294  hdd       10.69229  osd.294  up  1.0  1.0
249  rbd_data   1.74599  osd.249  up  1.0  1.0
250  rbd_data   1.74599  osd.250  up  1.0  1.0
265  rbd_data   1.74599  osd.265  up  1.0  1.0
276  rbd_data   1.74599  osd.276  up  1.0  1.0
281  rbd_data   1.74599  osd.281  up  1.0  1.0
 51  rbd_meta   0.36400  osd.51   up  1.0  1.0

# ceph osd crush rule dump
# crush rules outside tree under "datacenter ServerRoom" removed for brevity
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            { "op": "take", "item": -1, "item_name": "default" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 5,
        "rule_name": "sr-rbd-data-one",
        "ruleset": 5,
        "type": 3,
        "min_size": 3,
        "max_size": 8,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 50 },
            { "op": "set_choose_tries", "num": 1000 },
            { "op": "take", "item": -185, "item_name": "ServerRoom~rbd_data" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 9,
        "rule_name": "sr-rbd-data-one-hdd",
        "ruleset": 9,
        "type": 3,
        "min_size": 3,
        "max_size": 8,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 5 },
            { "op": "set_choose_tries", "num": 100 },
            { "op": "take", "item": -53, "item_name": "ServerRoom~hdd" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }
]

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eric Smith
Sent: 03 August 2020 15:40
To: Frank Schilder; ceph-users
Subject: RE: Ceph does not recover from OSD restart

Can you post the output of these commands:

ceph osd pool ls detail
ceph osd tree
ceph osd crush rule dump

-----Original Message-----
From: Frank Schilder
Sent: Monday, August 3, 2020 9:19 AM
To: ceph-users
Subject: [ceph-users] Re: Ceph does not recover from OSD restart

After moving the newly added OSDs out of the crush tree and back in again, I get exactly what I want to see:

  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_WARN
            norebalance,norecover flag(s) set
            53030026/1492404361 objects
misplaced (3.553%)
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs
         flags norebalance,norecover

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     53030026/1492404361 objects misplaced (3.553%)
             2902 active+clean
             299  active+remapped+backfill_wait
             8    active+remapped+backfilling
             5    active+clean+scrubbing+deep
             1    active+clean+snaptrim

  io:
    client: 69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr

Why does a cluster with remapped PGs not survive OSD restarts without losing track of objects?
[ceph-users] Re: Ceph does not recover from OSD restart
After moving the newly added OSDs out of the crush tree and back in again, I get exactly what I want to see:

  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_WARN
            norebalance,norecover flag(s) set
            53030026/1492404361 objects misplaced (3.553%)
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs
         flags norebalance,norecover

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     53030026/1492404361 objects misplaced (3.553%)
             2902 active+clean
             299  active+remapped+backfill_wait
             8    active+remapped+backfilling
             5    active+clean+scrubbing+deep
             1    active+clean+snaptrim

  io:
    client: 69 MiB/s rd, 117 MiB/s wr, 399 op/s rd, 856 op/s wr

Why does a cluster with remapped PGs not survive OSD restarts without losing track of objects? Why is it not finding the objects by itself? A power outage of 3 hosts will halt everything for no reason until manual intervention. How can I avoid this problem?

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Frank Schilder
Sent: 03 August 2020 15:03:05
To: ceph-users
Subject: [ceph-users] Ceph does not recover from OSD restart

Dear cephers,

I have a serious issue with degraded objects after an OSD restart. The cluster was in a state of re-balancing after adding disks to each host. Before the restart I had "X/Y objects misplaced". Apart from that, health was OK.
I now restarted all OSDs of one host and the cluster does not recover from that:

  cluster:
    id:     xxx
    health: HEALTH_ERR
            45813194/1492348700 objects misplaced (3.070%)
            Degraded data redundancy: 6798138/1492348700 objects degraded (0.456%), 85 pgs degraded, 86 pgs undersized
            Degraded data redundancy (low space): 17 pgs backfill_toofull
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     6798138/1492348700 objects degraded (0.456%)
             45813194/1492348700 objects misplaced (3.070%)
             2903 active+clean
             209  active+remapped+backfill_wait
             73   active+undersized+degraded+remapped+backfill_wait
             9    active+remapped+backfill_wait+backfill_toofull
             8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             4    active+undersized+degraded+remapped+backfilling
             3    active+remapped+backfilling
             3    active+clean+scrubbing+deep
             1    active+clean+scrubbing
             1    active+undersized+remapped+backfilling
             1    active+clean+snaptrim

  io:
    client:   47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
    recovery: 195 MiB/s, 48 objects/s

After restarting there should only be a small number of degraded objects, the ones that received writes during OSD restart. What I see, however, is that the cluster seems to have lost track of a huge amount of objects, the 0.456% degraded are 1-2 days worth of I/O. I did reboots before and saw only a few thousand objects degraded at most.
The output of ceph health detail shows a lot of lines like these:

[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized
    pg 11.9 is stuck undersized for 815.188981, current state active+undersized+degraded+remapped+backfill_wait, last acting [60,148,2147483647,263,76,230,87,169]
    8...9
    pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting [159,60,180,263,237,3,2147483647,72]
    pg 11.4a is stuck undersized for 851.162862, current state active+undersized+degraded+remapped+backfill_wait, last acting [182,233,87,228,2,180,63,2147483647]
    [...]
    pg 11.22e is stuck undersized for 851.162402, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Deg
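A side note on the odd OSD id in the acting sets: 2147483647 is 2^31 - 1, the CRUSH "none" marker printed when a shard of an EC PG currently has no OSD assigned. Counting such unmapped shards per PG can be sketched in a few lines; the acting sets below are hand-copied from the health output above, while on a live cluster one would pull them from `ceph pg dump --format json` instead.

```python
# Sketch: count unmapped shards in EC acting sets.
# 2147483647 (2**31 - 1) is CRUSH's "none" marker for a shard with no OSD.
ITEM_NONE = 2**31 - 1  # printed as 2147483647 in `ceph health detail`

def missing_shards(acting):
    """Return the shard indices that currently have no OSD assigned."""
    return [i for i, osd in enumerate(acting) if osd == ITEM_NONE]

# sample acting sets copied from the health detail output above
pgs = {
    "11.9":   [60, 148, ITEM_NONE, 263, 76, 230, 87, 169],
    "11.4a":  [182, 233, 87, 228, 2, 180, 63, ITEM_NONE],
    "11.225": [236, 183, 1, ITEM_NONE, ITEM_NONE, 169, 229, 230],
}

for pgid, acting in sorted(pgs.items()):
    lost = missing_shards(acting)
    if lost:
        print(f"pg {pgid}: {len(lost)} of {len(acting)} shards unmapped, shard index {lost}")
```

A PG with one unmapped shard is degraded but serviceable; the more shards of the same PG show the marker, the closer it is to blocking I/O.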
[ceph-users] Ceph does not recover from OSD restart
Dear cephers,

I have a serious issue with degraded objects after an OSD restart. The cluster was in a state of re-balancing after adding disks to each host. Before restart I had "X/Y objects misplaced". Apart from that, health was OK. I now restarted all OSDs of one host and the cluster does not recover from that:

  cluster:
    id:     xxx
    health: HEALTH_ERR
            45813194/1492348700 objects misplaced (3.070%)
            Degraded data redundancy: 6798138/1492348700 objects degraded (0.456%), 85 pgs degraded, 86 pgs undersized
            Degraded data redundancy (low space): 17 pgs backfill_toofull
            1 pools nearfull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 297 osds: 272 up, 272 in; 307 remapped pgs

  data:
    pools:   11 pools, 3215 pgs
    objects: 177.3 M objects, 489 TiB
    usage:   696 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     6798138/1492348700 objects degraded (0.456%)
             45813194/1492348700 objects misplaced (3.070%)
             2903 active+clean
             209  active+remapped+backfill_wait
             73   active+undersized+degraded+remapped+backfill_wait
             9    active+remapped+backfill_wait+backfill_toofull
             8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             4    active+undersized+degraded+remapped+backfilling
             3    active+remapped+backfilling
             3    active+clean+scrubbing+deep
             1    active+clean+scrubbing
             1    active+undersized+remapped+backfilling
             1    active+clean+snaptrim

  io:
    client:   47 MiB/s rd, 61 MiB/s wr, 732 op/s rd, 792 op/s wr
    recovery: 195 MiB/s, 48 objects/s

After restarting there should only be a small number of degraded objects, the ones that received writes during OSD restart. What I see, however, is that the cluster seems to have lost track of a huge amount of objects, the 0.456% degraded are 1-2 days worth of I/O. I did reboots before and saw only a few thousand objects degraded at most.
The output of ceph health detail shows a lot of lines like these:

[root@gnosis ~]# ceph health detail
HEALTH_ERR 45804316/1492356704 objects misplaced (3.069%); Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized; Degraded data redundancy (low space): 17 pgs backfill_toofull; 1 pools nearfull
OBJECT_MISPLACED 45804316/1492356704 objects misplaced (3.069%)
PG_DEGRADED Degraded data redundancy: 6792562/1492356704 objects degraded (0.455%), 85 pgs degraded, 86 pgs undersized
    pg 11.9 is stuck undersized for 815.188981, current state active+undersized+degraded+remapped+backfill_wait, last acting [60,148,2147483647,263,76,230,87,169]
    8...9
    pg 11.48 is active+undersized+degraded+remapped+backfill_wait, acting [159,60,180,263,237,3,2147483647,72]
    pg 11.4a is stuck undersized for 851.162862, current state active+undersized+degraded+remapped+backfill_wait, last acting [182,233,87,228,2,180,63,2147483647]
    [...]
    pg 11.22e is stuck undersized for 851.162402, current state active+undersized+degraded+remapped+backfill_wait+backfill_toofull, last acting [234,183,239,2147483647,170,229,1,86]
PG_DEGRADED_FULL Degraded data redundancy (low space): 17 pgs backfill_toofull
    pg 11.24 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [230,259,2147483647,1,144,159,233,146]
    [...]
    pg 11.1d9 is active+remapped+backfill_wait+backfill_toofull, acting [84,259,183,170,85,234,233,2]
    pg 11.225 is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [236,183,1,2147483647,2147483647,169,229,230]
    pg 11.22e is active+undersized+degraded+remapped+backfill_wait+backfill_toofull, acting [234,183,239,2147483647,170,229,1,86]
POOL_NEAR_FULL 1 pools nearfull
    pool 'sr-rbd-data-one-hdd' has 164 TiB (max 200 TiB)

It looks like a lot of PGs are not receiving their complete crush map placement, as if the peering is incomplete.
This is a serious issue, it looks like the cluster will see a total storage loss if just 2 more hosts reboot - without actually having lost any storage. The pool in question is a 6+2 EC pool. What is going on here? Why are the PG-maps not restored to their values from before the OSD reboot? The degraded PGs should receive the missing OSD IDs, everything is up exactly as it was before the reboot. Thanks for your help and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
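The fear of two more host reboots follows directly from the redundancy arithmetic of a k=6, m=2 EC pool. A minimal sketch of that arithmetic, assuming the common EC default min_size = k + 1 (the actual min_size of this pool is not given in the thread):

```python
# Redundancy arithmetic for a k+m EC pool, as discussed above.
# A PG stays readable while at least k shards are available; with
# min_size = k + 1 it stops accepting I/O once more than m - 1 shards
# are gone, even though no data is lost yet.

def ec_pg_state(k, m, shards_lost, min_size=None):
    """Classify a PG of a k+m EC pool after losing `shards_lost` shards."""
    if min_size is None:
        min_size = k + 1              # common default for EC pools
    available = k + m - shards_lost
    if available < k:
        return "lost"                 # fewer than k shards: unreadable
    if available < min_size:
        return "inactive"             # readable, but I/O is blocked
    return "active+degraded" if shards_lost else "active+clean"

for lost in range(4):
    print(lost, "shards lost ->", ec_pg_state(6, 2, lost))
```

With one shard per PG already unmapped, losing two more failure domains pushes those PGs below min_size and halts I/O, which is exactly the "total storage loss without actually having lost any storage" scenario described above.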
[ceph-users] Re: mimic: much more raw used than reported
Hi all, quick update: looks like copying OSDs does indeed deflate the objects with partial overwrites in an EC pool again:

       osd df tree    blue stats
   ID   SIZE   USE    alloc  store
   87    8.9   6.6      6.6    4.6   <-- old disk with inflated objects
  294     11   1.9      1.9    2.0   <-- new disk (still backfilling)

Even the small effect of compression is visible. I need to migrate everything to LVM at some point anyway. Seems like all static data will get cleaned up along the way. It was probably the copy process with too small write size causing the trouble. Unfortunately, the tool we are using does not have an option to change that.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Igor Fedotov
Sent: 01 August 2020 10:53:29
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Hi Frank,

On 7/31/2020 10:31 AM, Frank Schilder wrote:
> Hi Igor,
>
> thanks. I guess the problem with finding the corresponding images is that it happens on bluestore and not on object level. Even if I listed all rados objects and added their sizes I would not see the excess storage.
>
> Thinking about working around this issue, would re-writing the objects deflate the excess usage? For example, evacuating an OSD and adding it back to the pool after it was empty, would this re-write the objects on this OSD without the overhead?
May be but I can't say for sure..
>
> Or simply copying an entire RBD image, would the copy be deflated?
>
> Although the latter options sound a bit crazy, one could do this without (much) downtime of VMs and it might get us through this migration.
Also you might want to try pg export/import using ceph-objectstore-tool. See https://ceph.io/geen-categorie/incomplete-pgs-oh-my/ for some hints how to do that. But again I'm not certain if it's helpful. Preferably to try with some non-production cluster first...
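The alloc/store comparison above can be reproduced from each OSD's bluestore perf counters. A sketch of the calculation follows; the byte values are made-up stand-ins for the TiB figures quoted above, and on a live cluster they would come from the bluestore_allocated and bluestore_stored counters in `ceph daemon osd.N perf dump`:

```python
# Sketch: per-OSD space overhead from bluestore perf counters, mirroring
# the "alloc"/"store" columns above. Values are illustrative only; real
# numbers come from `ceph daemon osd.N perf dump` (bluestore section).

def overhead_tib(allocated_bytes, stored_bytes):
    """Return (allocated TiB, stored TiB, allocated/stored ratio)."""
    tib = 1024 ** 4
    return (allocated_bytes / tib, stored_bytes / tib,
            allocated_bytes / stored_bytes)

# osd id -> (bluestore_allocated, bluestore_stored), illustrative numbers
osds = {
    87:  (int(6.6 * 1024**4), int(4.6 * 1024**4)),   # old, inflated
    294: (int(1.9 * 1024**4), int(2.0 * 1024**4)),   # new, compressed
}

for osd, (alloc, store) in osds.items():
    a, s, r = overhead_tib(alloc, store)
    print(f"osd.{osd}: alloc {a:.1f} TiB, store {s:.1f} TiB, ratio {r:.2f}")
```

A ratio well above 1 indicates allocation overhead (inflated partial overwrites); a ratio below 1 indicates compression winning, as on the freshly backfilled disk.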
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Igor Fedotov
> Sent: 30 July 2020 15:40
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] mimic: much more raw used than reported
>
> Hi Frank,
>
> On 7/30/2020 11:19 AM, Frank Schilder wrote:
>> Hi Igor,
>>
>> thanks for looking at this. Here are a few thoughts:
>>
>> The copy goes to NTFS. I would expect between 2-4 meta data operations per write, which would go to few existing objects. I guess the difference bluestore_write_small-bluestore_write_small_new are mostly such writes and are susceptible to the partial overwrite amplification. A first question is, how many objects are actually affected? 3 small writes does not mean 3 objects have partial overwrites.
>>
>> The large number of small_new is indeed strange, although these would not lead to excess allocations. It is possible that the write size of the copy tool is not ideal, was wondering about this too. I will investigate.
> small_new might relate to small tailing chunks that presumably appear when doing unaligned appends. Each such append triggers small_new write...
>
>> To know more, I would need to find out which images these small writes come from, we have more than one active. Is there a low-level way to find out which objects are affected by partial overwrites and which image they belong to? In your post you were describing some properties like being shared/cloned etc. Can one search for such objects?
> IMO raising debug bluestore to 10 (or even 20) and subsequent OSD log inspection is likely to be the only means to learn which objects OSD is processing... Be careful - this produces a significant amount of data and negatively impacts the performance.
>> On a more fundamental level, I'm wondering why RBD images issue sub-object size writes at all.
>> I naively assumed that every I/O operation to RBD always implies full object writes, even just changing a single byte (thinking of an object as the equivalent of a sector on a disk, the smallest atomic unit). If this is not the case, what is the meaning of object size then? How does it influence I/O patterns? My benchmarks show that object size matters a lot, but it becomes a bit unclear now why.
> Not sure I can provide a good enough answer on the above. But I doubt that RBD unconditionally operates on full objects.
>
>> Thanks and best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
[ceph-users] Re: mimic: much more raw used than reported
Hi Igor,

thanks. I guess the problem with finding the corresponding images is that it happens on bluestore and not on object level. Even if I listed all rados objects and added their sizes I would not see the excess storage.

Thinking about working around this issue, would re-writing the objects deflate the excess usage? For example, evacuating an OSD and adding it back to the pool after it was empty, would this re-write the objects on this OSD without the overhead?

Or simply copying an entire RBD image, would the copy be deflated?

Although the latter options sound a bit crazy, one could do this without (much) downtime of VMs and it might get us through this migration.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Igor Fedotov
Sent: 30 July 2020 15:40
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Hi Frank,

On 7/30/2020 11:19 AM, Frank Schilder wrote:
> Hi Igor,
>
> thanks for looking at this. Here are a few thoughts:
>
> The copy goes to NTFS. I would expect between 2-4 meta data operations per write, which would go to few existing objects. I guess the difference bluestore_write_small-bluestore_write_small_new are mostly such writes and are susceptible to the partial overwrite amplification. A first question is, how many objects are actually affected? 3 small writes does not mean 3 objects have partial overwrites.
>
> The large number of small_new is indeed strange, although these would not lead to excess allocations. It is possible that the write size of the copy tool is not ideal, was wondering about this too. I will investigate.
small_new might relate to small tailing chunks that presumably appear when doing unaligned appends. Each such append triggers small_new write...
> To know more, I would need to find out which images these small writes come from, we have more than one active.
> Is there a low-level way to find out which objects are affected by partial overwrites and which image they belong to? In your post you were describing some properties like being shared/cloned etc. Can one search for such objects?
IMO raising debug bluestore to 10 (or even 20) and subsequent OSD log inspection is likely to be the only means to learn which objects OSD is processing... Be careful - this produces a significant amount of data and negatively impacts the performance.
>
> On a more fundamental level, I'm wondering why RBD images issue sub-object size writes at all. I naively assumed that every I/O operation to RBD always implies full object writes, even just changing a single byte (thinking of an object as the equivalent of a sector on a disk, the smallest atomic unit). If this is not the case, what is the meaning of object size then? How does it influence I/O patterns? My benchmarks show that object size matters a lot, but it becomes a bit unclear now why.
Not sure I can provide a good enough answer on the above. But I doubt that RBD unconditionally operates on full objects.
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> From: Igor Fedotov
> Sent: 29 July 2020 16:25:36
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] mimic: much more raw used than reported
>
> Frank,
>
> so you have a pretty high amount of small writes indeed. More than a half of the written volume (in bytes) is done via small writes.
>
> And 6x more small requests.
>
> This looks pretty odd for a sequential write pattern and is likely to be the root cause for that space overhead.
>
> I can see approx 1.4GB additionally lost per each of these 3 OSDs since perf dump reset ( = allocated_new - stored_new - (allocated_old - stored_old))
>
> Below are some speculations on what might be happening, but for sure I could be wrong/missing something. So please do not consider this as a 100% valid analysis.
> > Client does writes in 1MB chunks. This is split into 6 EC chunks (+2 > added) which results in approx 170K writing block to object store ( = > 1MB / 6). Which corresponds to 1x128K big write and 1x42K small tailing > one. Resulting in 3x64K allocations. > > The next client adjacent write results in another 128K blob, one more > "small" tailing blob and heading blob which partially overlaps with the > previous tailing 42K chunk. Overlapped chunks are expected to be merged. > But presumably this doesn't happen due to that "partial EC overwrites" > issue. So instead additional 64K blob is allocated for overlapped range. > > I.e. 2x170K writes cause 2x128K blobs, 1x64K tailing blob and 2x
[ceph-users] Re: mimic: much more raw used than reported
Hi Igor,

thanks for looking at this. Here are a few thoughts:

The copy goes to NTFS. I would expect between 2-4 meta data operations per write, which would go to few existing objects. I guess the difference bluestore_write_small-bluestore_write_small_new are mostly such writes and are susceptible to the partial overwrite amplification. A first question is, how many objects are actually affected? 3 small writes does not mean 3 objects have partial overwrites.

The large number of small_new is indeed strange, although these would not lead to excess allocations. It is possible that the write size of the copy tool is not ideal, was wondering about this too. I will investigate.

To know more, I would need to find out which images these small writes come from, we have more than one active. Is there a low-level way to find out which objects are affected by partial overwrites and which image they belong to? In your post you were describing some properties like being shared/cloned etc. Can one search for such objects?

On a more fundamental level, I'm wondering why RBD images issue sub-object size writes at all. I naively assumed that every I/O operation to RBD always implies full object writes, even just changing a single byte (thinking of an object as the equivalent of a sector on a disk, the smallest atomic unit). If this is not the case, what is the meaning of object size then? How does it influence I/O patterns? My benchmarks show that object size matters a lot, but it becomes a bit unclear now why.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Igor Fedotov
Sent: 29 July 2020 16:25:36
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Frank,

so you have a pretty high amount of small writes indeed. More than a half of the written volume (in bytes) is done via small writes.

And 6x more small requests.
This looks pretty odd for a sequential write pattern and is likely to be the root cause for that space overhead.

I can see approx 1.4GB additionally lost per each of these 3 OSDs since perf dump reset ( = allocated_new - stored_new - (allocated_old - stored_old))

Below are some speculations on what might be happening, but for sure I could be wrong/missing something. So please do not consider this as a 100% valid analysis.

Client does writes in 1MB chunks. This is split into 6 EC chunks (+2 added) which results in an approx 170K writing block to the object store ( = 1MB / 6). Which corresponds to 1x128K big write and 1x42K small tailing one. Resulting in 3x64K allocations.

The next client adjacent write results in another 128K blob, one more "small" tailing blob and a heading blob which partially overlaps with the previous tailing 42K chunk. Overlapped chunks are expected to be merged. But presumably this doesn't happen due to that "partial EC overwrites" issue. So instead an additional 64K blob is allocated for the overlapped range.

I.e. 2x170K writes cause 2x128K blobs, 1x64K tailing blob and 2x64K blobs for the range where the two writes adjoined. 64K wasted! And similarly +64K space overhead per each additional append to this object.

Again I'm not completely sure the above analysis is 100% valid and this doesn't explain that large amount of small requests. But you might want to check/tune/experiment on the client writing size. E.g. increase it to 4M if it's less, or make it divisible by 6.

Hope this helps.

Thanks,
Igor

On 7/29/2020 4:06 PM, Frank Schilder wrote:
> Hi Igor,
>
> thanks! Here is a sample extract for one OSD, time stamp (+%F-%H%M%S) in file name.
> For the second collection I let it run for about 10 minutes after reset:
>
> perf_dump_2020-07-29-142739.osd181: "bluestore_write_big": 10216689,
> perf_dump_2020-07-29-142739.osd181: "bluestore_write_big_bytes": 992602882048,
> perf_dump_2020-07-29-142739.osd181: "bluestore_write_big_blobs": 10758603,
> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small": 63863813,
> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_bytes": 1481631167388,
> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_unused": 17279108,
> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_deferred": 13629951,
> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_pre_read": 13629951,
> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_new": 32954754,
> perf_dump_2020-07-29-142739.osd181: "compress_success_count": 1167212,
> perf_dump_2020-07-29-142739.osd181: "compress_rejected_count": 1493508,
> perf_dump_2020-07-29-142739.osd181: "bluestore_compressed"
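Igor's 64K-waste hypothesis above can be put into numbers. The following is a toy model of that hypothesis only, assuming a 64K allocation unit and ~170K shard writes; it is not how bluestore itself accounts space:

```python
# Numeric sketch of the speculation above: sequential ~170K appends to an
# EC shard with a 64K bluestore allocation unit, where each partial
# overwrite of the previous tail allocates a fresh 64K blob instead of
# merging. A model of the hypothesis, not of bluestore internals.

AU = 64 * 1024               # assumed allocation unit (64K, HDD default)
CHUNK = (1024 * 1024) // 6   # ~170K shard write for a 1 MiB client write, k=6

def round_up(n, unit):
    return ((n + unit - 1) // unit) * unit

def modeled_usage(appends):
    """Return (stored, allocated) bytes after `appends` sequential writes."""
    stored = appends * CHUNK
    # ideal allocation: object size rounded up to the allocation unit
    ideal = round_up(stored, AU)
    # hypothesis: every append after the first re-allocates one extra 64K
    # blob for the range overlapping the previous tail chunk
    return stored, ideal + (appends - 1) * AU

for n in (1, 2, 10):
    stored, alloc = modeled_usage(n)
    print(f"{n:2d} appends: stored {stored} B, allocated {alloc} B, wasted {alloc - stored} B")
```

Under this model the waste grows by one allocation unit per append, which is consistent with the "+64K space overhead per each additional append" estimate above.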
[ceph-users] Re: mimic: much more raw used than reported
*$//g" | awk 'BEGIN {printf("%18s\n", "osd df tree")} /root default/ {o=0} /datacenter ServerRoom/ {o=1} (o==1 && $2=="hdd") {s+=$5;u+=$7;printf("%4s %5s %5s\n", $1, $5, $7)} f==0 {printf("%4s %5s %5s\n", $1, $5, $6);f=1} END {printf("%4s %5.1f %5.1f\n", "SUM", s, u)}')"
OSDS=( $(echo "$df_tree_data" | tail -n +3 | awk '/SUM/ {next} {print $1}') )
bs_data="$(blue_stats "${OSDS[@]}")"
paste -d " " <(echo "$df_tree_data") <(echo "$bs_data")

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Igor Fedotov
Sent: 27 July 2020 13:31
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Frank,

suggest to start with perf counter analysis as per the second part of my previous email...

Thanks,
Igor

On 7/27/2020 2:30 PM, Frank Schilder wrote:
> Hi Igor,
>
> thanks for your answer. I was thinking about that, but as far as I understood, to hit this bug actually requires a partial rewrite to happen. However, these are disk images in storage servers with basically static files, many of which are very large (15GB). Therefore, I believe, the vast majority of objects is written to only once and should not be affected by the amplification bug.
>
> Is there any way to confirm/rule out that/check how much amplification is happening?
>
> I'm wondering if I might be observing something else. Since "ceph osd df tree" does report the actual utilization and I have only one pool on these OSDs, there is no problem with accounting allocated storage to a pool. I know it's all used by this one pool. I'm more wondering if it's not the known amplification but something else (at least partly) that plays a role here.
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> From: Igor Fedotov
> Sent: 27 July 2020 12:54:02
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] mimic: much more raw used than reported
>
> Hi Frank,
>
> you might be being hit by https://tracker.ceph.com/issues/44213
>
> In short the root causes are significant space overhead due to high bluestore allocation unit (64K) and EC overwrite design.
>
> This is fixed for upcoming Pacific release by using 4K alloc unit but it is unlikely to be backported to earlier releases due to its complexity. To say nothing about the need for OSD redeployment. Hence please expect no fix for mimic.
>
> And your raw usage reports might still be not that good since mimic lacks per-pool stats collection https://github.com/ceph/ceph/pull/19454. I.e. your actual raw space usage is higher than reported. To estimate proper raw usage one can use bluestore perf counters (namely bluestore_stored and bluestore_allocated). Summing bluestore_allocated over all involved OSDs will give actual RAW usage. Summing bluestore_stored will provide actual data volume after EC processing, i.e. presumably it should be around 158TiB.
>
> Thanks,
>
> Igor
>
> On 7/26/2020 8:43 PM, Frank Schilder wrote:
>> Dear fellow cephers,
>>
>> I observe a weird problem on our mimic-13.2.8 cluster. We have an EC RBD pool backed by HDDs. These disks are not in any other pool. I noticed that the total capacity (=USED+MAX AVAIL) reported by "ceph df detail" has shrunk recently from 300TiB to 200TiB. Part but by no means all of this can be explained by imbalance of the data distribution.
>>
>> When I compare the output of "ceph df detail" and "ceph osd df tree", I find 69TiB raw capacity used but not accounted for; see calculations below. These 69TiB raw are equivalent to 20% usable capacity and I really need it back.
>> Together with the imbalance, we lose about 30% capacity.
>>
>> What is using these extra 69TiB and how can I get it back?
>>
>> Some findings:
>>
>> These are the 5 largest images in the pool, accounting for a total of 97TiB out of 119TiB usage:
>>
>> # rbd du :
>> NAME        PROVISIONED  USED
>> one-133          25 TiB  14 TiB
>> NAME        PROVISIONED  USED
>> one-153@222      40 TiB  14 TiB
>> one-153@228      40 TiB  357 GiB
>> one-153@235      40 TiB  797 GiB
>> one-153@241      40 TiB  509 GiB
>> one-153@242      40 TiB  43 GiB
>> one-153@243      40 TiB  16 MiB
>> one-153@244      40 TiB  16 MiB
>> one-153@245      40 TiB  324 MiB
[ceph-users] Re: mimic: much more raw used than reported
Hi Igor,

thanks for your answer. I was thinking about that, but as far as I understood, to hit this bug actually requires a partial rewrite to happen. However, these are disk images in storage servers with basically static files, many of which are very large (15GB). Therefore, I believe, the vast majority of objects is written to only once and should not be affected by the amplification bug.

Is there any way to confirm/rule out that/check how much amplification is happening?

I'm wondering if I might be observing something else. Since "ceph osd df tree" does report the actual utilization and I have only one pool on these OSDs, there is no problem with accounting allocated storage to a pool. I know it's all used by this one pool. I'm more wondering if it's not the known amplification but something else (at least partly) that plays a role here.

Thanks and best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Igor Fedotov
Sent: 27 July 2020 12:54:02
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] mimic: much more raw used than reported

Hi Frank,

you might be being hit by https://tracker.ceph.com/issues/44213

In short the root causes are significant space overhead due to high bluestore allocation unit (64K) and EC overwrite design.

This is fixed for the upcoming Pacific release by using a 4K alloc unit but it is unlikely to be backported to earlier releases due to its complexity. To say nothing about the need for OSD redeployment. Hence please expect no fix for mimic.

And your raw usage reports might still be not that good since mimic lacks per-pool stats collection https://github.com/ceph/ceph/pull/19454. I.e. your actual raw space usage is higher than reported. To estimate proper raw usage one can use bluestore perf counters (namely bluestore_stored and bluestore_allocated). Summing bluestore_allocated over all involved OSDs will give actual RAW usage. Summing bluestore_stored will provide actual data volume after EC processing, i.e.
presumably it should be around 158TiB.

Thanks,
Igor

On 7/26/2020 8:43 PM, Frank Schilder wrote:
> Dear fellow cephers,
>
> I observe a weird problem on our mimic-13.2.8 cluster. We have an EC RBD pool backed by HDDs. These disks are not in any other pool. I noticed that the total capacity (=USED+MAX AVAIL) reported by "ceph df detail" has shrunk recently from 300TiB to 200TiB. Part but by no means all of this can be explained by imbalance of the data distribution.
>
> When I compare the output of "ceph df detail" and "ceph osd df tree", I find 69TiB raw capacity used but not accounted for; see calculations below. These 69TiB raw are equivalent to 20% usable capacity and I really need it back. Together with the imbalance, we lose about 30% capacity.
>
> What is using these extra 69TiB and how can I get it back?
>
> Some findings:
>
> These are the 5 largest images in the pool, accounting for a total of 97TiB out of 119TiB usage:
>
> # rbd du :
> NAME         PROVISIONED  USED
> one-133           25 TiB  14 TiB
> NAME         PROVISIONED  USED
> one-153@222       40 TiB  14 TiB
> one-153@228       40 TiB  357 GiB
> one-153@235       40 TiB  797 GiB
> one-153@241       40 TiB  509 GiB
> one-153@242       40 TiB  43 GiB
> one-153@243       40 TiB  16 MiB
> one-153@244       40 TiB  16 MiB
> one-153@245       40 TiB  324 MiB
> one-153@246       40 TiB  276 MiB
> one-153@247       40 TiB  96 MiB
> one-153@248       40 TiB  138 GiB
> one-153@249       40 TiB  1.8 GiB
> one-153@250       40 TiB  0 B
> one-153           40 TiB  204 MiB
>                   40 TiB  16 TiB
> NAME         PROVISIONED  USED
> one-391@3         40 TiB  432 MiB
> one-391@9         40 TiB  26 GiB
> one-391@15        40 TiB  90 GiB
> one-391@16        40 TiB  0 B
> one-391@17        40 TiB  0 B
> one-391@18        40 TiB  0 B
> one-391@19        40 TiB  0 B
> one-391@20        40 TiB  3.5 TiB
> one-391@21        40 TiB  5.4 TiB
> one-391@22        40 TiB  5.8 TiB
> one-391@23        40 TiB  8.4 TiB
> one-391@24        40 TiB  1.4 TiB
> one-391           40 TiB  2.2 TiB
>                   40 TiB  27 TiB
> NAME         PROVISIONED  USED
> one-394@3         70 TiB  1.4 TiB
> one-394@9         70 TiB  2.5 TiB
> one-394@15        70 TiB  20 GiB
> one-394@16        70 TiB  0 B
> one-394@17        70 TiB  0 B
> one-394@18        70 TiB  0 B
> one-394@19        70 TiB  383 GiB
> one-394@20        70 TiB  3.3 TiB
> one-394@21        70 TiB  5.0 TiB
> one-394@22        70 TiB  5.0 TiB
> one-394@23        70 TiB  9.0 TiB
> one-394@24        70 TiB  1.6 TiB
> one-394           70 TiB  2.5 TiB
>                   70 TiB  31 TiB
> NAME         PROVISIONED  USED
> one-434           25 TiB  9.1 TiB
>
> The large 70TiB images one-391 and one-394 are
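Igor's figures can be cross-checked with the TiB numbers quoted in this thread: with 6+2 EC, every byte of user data costs 8/6 raw bytes, which is presumably where the "around 158TiB" estimate for the summed bluestore_stored comes from. A sketch of the arithmetic:

```python
# Cross-check of the EC accounting discussed above: a 6+2 EC pool stores
# 8 raw chunks per 6 data chunks, so raw usage should be about
# stored * 8/6. The TiB figures are the ones quoted in this thread.

K, M = 6, 2
ec_factor = (K + M) / K            # 8/6 raw bytes per byte of user data

pool_used_tib = 119                # usage reported for the pool
raw_expected = pool_used_tib * ec_factor
unaccounted_raw = 69               # raw TiB used but not accounted for

print(f"expected raw usage after EC: {raw_expected:.0f} TiB")
print(f"unaccounted 69 TiB raw ~ {unaccounted_raw / ec_factor:.1f} TiB usable equivalent")
```

Anything the OSDs allocate beyond the stored-times-8/6 figure is allocation overhead of the kind the tracker issue above describes.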
[ceph-users] mimic: much more raw used than reported
46
158 hdd 8.90999 1.0 8.9 TiB 5.6 TiB 5.5 TiB 183 MiB 17 GiB 3.4 TiB 62.30 1.90 109 osd.158
170 hdd 8.90999 1.0 8.9 TiB 5.7 TiB 5.6 TiB 205 MiB 18 GiB 3.2 TiB 63.53 1.94 112 osd.170
182 hdd 8.90999 1.0 8.9 TiB 4.7 TiB 4.6 TiB 105 MiB 14 GiB 4.3 TiB 52.27 1.60 92 osd.182
 63 hdd 8.90999 1.0 8.9 TiB 4.7 TiB 4.7 TiB 156 MiB 15 GiB 4.2 TiB 52.74 1.61 98 osd.63
148 hdd 8.90999 1.0 8.9 TiB 5.2 TiB 5.1 TiB 119 MiB 16 GiB 3.8 TiB 57.82 1.77 100 osd.148
159 hdd 8.90999 1.0 8.9 TiB 4.0 TiB 4.0 TiB 89 MiB 12 GiB 4.9 TiB 44.61 1.36 79 osd.159
172 hdd 8.90999 1.0 8.9 TiB 5.1 TiB 5.1 TiB 173 MiB 16 GiB 3.8 TiB 57.22 1.75 98 osd.172
183 hdd 8.90999 1.0 8.9 TiB 6.0 TiB 6.0 TiB 135 MiB 19 GiB 2.9 TiB 67.35 2.06 118 osd.183
229 hdd 8.90999 1.0 8.9 TiB 4.6 TiB 4.6 TiB 127 MiB 15 GiB 4.3 TiB 52.05 1.59 93 osd.229
232 hdd 8.90999 1.0 8.9 TiB 5.2 TiB 5.2 TiB 158 MiB 17 GiB 3.7 TiB 58.22 1.78 101 osd.232
235 hdd 8.90999 1.0 8.9 TiB 4.1 TiB 4.1 TiB 103 MiB 13 GiB 4.8 TiB 45.96 1.40 79 osd.235
238 hdd 8.90999 1.0 8.9 TiB 5.4 TiB 5.4 TiB 120 MiB 17 GiB 3.5 TiB 60.47 1.85 104 osd.238
259 hdd 10.91399 1.0 11 TiB 6.2 TiB 6.2 TiB 140 MiB 19 GiB 4.7 TiB 56.54 1.73 120 osd.259
231 hdd 8.90999 1.0 8.9 TiB 5.1 TiB 5.1 TiB 114 MiB 16 GiB 3.8 TiB 56.90 1.74 101 osd.231
233 hdd 8.90999 1.0 8.9 TiB 5.5 TiB 5.5 TiB 123 MiB 17 GiB 3.4 TiB 61.78 1.89 106 osd.233
236 hdd 8.90999 1.0 8.9 TiB 5.1 TiB 5.1 TiB 114 MiB 16 GiB 3.8 TiB 57.53 1.76 101 osd.236
239 hdd 8.90999 1.0 8.9 TiB 4.2 TiB 4.2 TiB 95 MiB 13 GiB 4.7 TiB 47.41 1.45 86 osd.239
263 hdd 10.91399 1.0 11 TiB 5.3 TiB 5.3 TiB 178 MiB 17 GiB 5.6 TiB 48.73 1.49 102 osd.263
228 hdd 8.90999 1.0 8.9 TiB 5.1 TiB 5.1 TiB 113 MiB 16 GiB 3.8 TiB 57.10 1.74 96 osd.228
230 hdd 8.90999 1.0 8.9 TiB 4.9 TiB 4.9 TiB 144 MiB 16 GiB 4.0 TiB 55.20 1.69 99 osd.230
234 hdd 8.90999 1.0 8.9 TiB 5.6 TiB 5.6 TiB 164 MiB 18 GiB 3.3 TiB 63.29 1.93 109 osd.234
237 hdd 8.90999 1.0 8.9 TiB 4.8 TiB 4.8 TiB 110 MiB 15 GiB 4.1 TiB 54.33 1.66 97 osd.237
260 hdd 10.91399 1.0 11 TiB 5.4 TiB 5.4 TiB 152 MiB 17 GiB 5.5 TiB 49.35 1.51 104 osd.260
  0 hdd 8.90999 1.0 8.9 TiB 5.2 TiB 5.2 TiB 157 MiB 16 GiB 3.7 TiB 58.28 1.78 102 osd.0
  2 hdd 8.90999 1.0 8.9 TiB 5.3 TiB 5.2 TiB 122 MiB 16 GiB 3.6 TiB 59.05 1.80 106 osd.2
 72 hdd 8.90999 1.0 8.9 TiB 4.4 TiB 4.4 TiB 145 MiB 14 GiB 4.5 TiB 49.89 1.52 89 osd.72
 76 hdd 8.90999 1.0 8.9 TiB 5.1 TiB 5.1 TiB 178 MiB 16 GiB 3.8 TiB 56.89 1.74 102 osd.76
 86 hdd 8.90999 1.0 8.9 TiB 4.6 TiB 4.5 TiB 155 MiB 14 GiB 4.3 TiB 51.18 1.56 94 osd.86
  1 hdd 8.90999 1.0 8.9 TiB 4.9 TiB 4.9 TiB 141 MiB 15 GiB 4.0 TiB 54.73 1.67 95 osd.1
  3 hdd 8.90999 1.0 8.9 TiB 4.7 TiB 4.7 TiB 156 MiB 15 GiB 4.2 TiB 52.40 1.60 94 osd.3
 73 hdd 8.90999 1.0 8.9 TiB 5.0 TiB 4.9 TiB 146 MiB 16 GiB 3.9 TiB 55.68 1.70 102 osd.73
 85 hdd 8.90999 1.0 8.9 TiB 5.6 TiB 5.5 TiB 192 MiB 18 GiB 3.3 TiB 62.46 1.91 109 osd.85
 87 hdd 8.90999 1.0 8.9 TiB 5.0 TiB 5.0 TiB 189 MiB 16 GiB 3.9 TiB 55.91 1.71 102 osd.87

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
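The SUM line that the awk one-liner earlier in this thread computes over such a listing can be reproduced in a few lines of Python; the rows below are a hand-copied subset of the listing above, for illustration only:

```python
# Sketch: total up SIZE and RAW USE over `ceph osd df tree` rows, the
# same aggregation as the awk pipeline earlier in this thread. The rows
# are a small subset copied from the listing above.

rows = [
    # (osd id, size TiB, used TiB)
    (158, 8.9, 5.6),
    (170, 8.9, 5.7),
    (182, 8.9, 4.7),
    (259, 11.0, 6.2),
]

size = sum(r[1] for r in rows)
used = sum(r[2] for r in rows)
print(f"SUM over {len(rows)} OSDs: {size:.1f} TiB size, {used:.1f} TiB used, "
      f"{100 * used / size:.1f}% full")
```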
[ceph-users] Re: OSD memory leak?
Quick question: Is there a way to change the frequency of heap dumps? On this page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a function HeapProfilerSetAllocationInterval() is mentioned, but no other way of configuring this. Is there a config parameter or a ceph daemon call to adjust this? If not, can I change the dump path? It's likely to overrun my log partition quickly if I cannot adjust either of the two.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Frank Schilder
Sent: 20 July 2020 15:19:05
To: Mark Nelson; Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Dear Mark,

thank you very much for the very helpful answers. I will raise osd_memory_cache_min, leave everything else alone and watch what happens. I will report back here. Thanks also for raising this as an issue.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Mark Nelson
Sent: 20 July 2020 15:08:11
To: Frank Schilder; Dan van der Ster
Cc: ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

On 7/20/20 3:23 AM, Frank Schilder wrote:
> Dear Mark and Dan,
>
> I'm in the process of restarting all OSDs and could use some quick advice on
> bluestore cache settings. My plan is to set higher minimum values and deal
> with accumulated excess usage via regular restarts. Looking at the
> documentation
> (https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/),
> I find the following relevant options (with defaults):
>
> # Automatic Cache Sizing
> osd_memory_target {4294967296} # 4GB
> osd_memory_base {805306368} # 768MB
> osd_memory_cache_min {134217728} # 128MB
>
> # Manual Cache Sizing
> bluestore_cache_meta_ratio {.4} # 40% ?
> bluestore_cache_kv_ratio {.4} # 40% ?
> bluestore_cache_kv_max {512 * 1024*1024} # 512MB
>
> Q1) If I increase osd_memory_cache_min, should I also increase
> osd_memory_base by the same or some other amount?
osd_memory_base is a hint at how much memory the OSD could consume outside the cache once it's reached steady state. It basically sets a hard cap on how much memory the cache will use to avoid over-committing memory and thrashing when we exceed the memory limit. It's not necessary to get it right, it just helps smooth things out by making the automatic memory tuning less aggressive. I.e., if you have a 2 GB memory target and a 512 MB base, you'll never assign more than 1.5 GB to the cache, on the assumption that the rest of the OSD will eventually need 512 MB to operate even if it's not using that much right now. I think you can probably just leave it alone.

What you and Dan appear to be seeing is that this number isn't static in your case but increases over time anyway. Eventually I'm hoping that we can automatically account for more and more of that memory by reading the data from the mempools.

> Q2) The cache ratio options are shown under the section "Manual Cache
> Sizing". Do they also apply when cache auto tuning is enabled? If so, is it
> worth changing these defaults for higher values of osd_memory_cache_min?

They actually do have an effect on the automatic cache sizing and probably shouldn't only be under the manual section. When you have the automatic cache sizing enabled, those options will affect the "fair share" values of the different caches at each cache priority level. I.e., at priority level 0, if both caches want more memory than is available, those ratios will determine how much each cache gets. If there is more memory available than requested, each cache gets as much as it wants and we move on to the next priority level and do the same thing again. So in this case the ratios end up being sort of more like fallback settings for when you don't have enough memory to fulfill all cache requests at a given priority level, but otherwise are not utilized until we hit that limit.

The goal with this scheme is to make sure that "high priority" items in each cache get first dibs at the memory even if it might skew the ratios. This might be things like rocksdb bloom filters and indexes, or potentially very recent hot items in one cache vs very old items in another cache. The ratios become more like guidelines than hard limits. When you change to manual mode, you set an overall bluestore cache size and each cache gets a flat percentage of it based on the ratios. With 0.4/0.4 you will always have 40% for onode, 40% for omap, and 20% for data, even if one of those caches does not use all of its memory.

> Many thanks for your help with this. I can't find answers to these questions
> in the docs.
>
> There might be two reasons for high osd_map memory usage. One is, that our
> OSDs seem to hold a large number of OSD
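Regarding the heap-dump frequency question at the top of this thread: I could not find a Ceph config option for it either, but gperftools itself reads a couple of environment variables at process start. A hedged sketch follows; HEAPPROFILE and HEAP_PROFILE_ALLOCATION_INTERVAL are documented gperftools knobs, but whether the OSD honors HEAPPROFILE when profiling is started via `ceph tell osd.N heap start_profiler` (rather than from the environment) is an assumption that needs testing on your cluster. The drop-in path defaults to a scratch directory here so the sketch is harmless to run; on a real host it would be the ceph-osd systemd drop-in directory.

```shell
# Sketch: change the tcmalloc heap-dump interval and the dump path via a
# systemd drop-in. DROPIN_DIR would really be
# /etc/systemd/system/ceph-osd@.service.d (assumption; scratch dir here).
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
mkdir -p /var/tmp/heapprof 2>/dev/null || true
cat > "$DROPIN_DIR/heapprofile.conf" <<'EOF'
[Service]
# dump files become /var/tmp/heapprof/osd.NNNN.heap instead of the log dir
Environment=HEAPPROFILE=/var/tmp/heapprof/osd
# dump every 4 GiB allocated instead of the gperftools default of 1 GiB
Environment=HEAP_PROFILE_ALLOCATION_INTERVAL=4294967296
EOF
# then: systemctl daemon-reload && systemctl restart ceph-osd@<id>
```

If the environment route turns out not to apply, the fallback is profiling in short windows (`heap start_profiler` / `heap dump` / `heap stop_profiler` via `ceph tell`) so the log partition cannot fill up unattended.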
[ceph-users] Re: OSD memory leak?
Dear Mark,

thank you very much for the very helpful answers. I will raise osd_memory_cache_min, leave everything else alone and watch what happens. I will report back here. Thanks also for raising this as an issue.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Mark Nelson
Sent: 20 July 2020 15:08:11
To: Frank Schilder; Dan van der Ster
Cc: ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

On 7/20/20 3:23 AM, Frank Schilder wrote:
> Dear Mark and Dan,
>
> I'm in the process of restarting all OSDs and could use some quick advice on
> bluestore cache settings. My plan is to set higher minimum values and deal
> with accumulated excess usage via regular restarts. Looking at the
> documentation
> (https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/),
> I find the following relevant options (with defaults):
>
> # Automatic Cache Sizing
> osd_memory_target {4294967296} # 4GB
> osd_memory_base {805306368} # 768MB
> osd_memory_cache_min {134217728} # 128MB
>
> # Manual Cache Sizing
> bluestore_cache_meta_ratio {.4} # 40% ?
> bluestore_cache_kv_ratio {.4} # 40% ?
> bluestore_cache_kv_max {512 * 1024*1024} # 512MB
>
> Q1) If I increase osd_memory_cache_min, should I also increase
> osd_memory_base by the same or some other amount?

osd_memory_base is a hint at how much memory the OSD could consume outside the cache once it's reached steady state. It basically sets a hard cap on how much memory the cache will use to avoid over-committing memory and thrashing when we exceed the memory limit. It's not necessary to get it right, it just helps smooth things out by making the automatic memory tuning less aggressive. I.e., if you have a 2 GB memory target and a 512 MB base, you'll never assign more than 1.5 GB to the cache, on the assumption that the rest of the OSD will eventually need 512 MB to operate even if it's not using that much right now. I think you can probably just leave it alone.

What you and Dan appear to be seeing is that this number isn't static in your case but increases over time anyway. Eventually I'm hoping that we can automatically account for more and more of that memory by reading the data from the mempools.

> Q2) The cache ratio options are shown under the section "Manual Cache
> Sizing". Do they also apply when cache auto tuning is enabled? If so, is it
> worth changing these defaults for higher values of osd_memory_cache_min?

They actually do have an effect on the automatic cache sizing and probably shouldn't only be under the manual section. When you have the automatic cache sizing enabled, those options will affect the "fair share" values of the different caches at each cache priority level. I.e., at priority level 0, if both caches want more memory than is available, those ratios will determine how much each cache gets. If there is more memory available than requested, each cache gets as much as it wants and we move on to the next priority level and do the same thing again. So in this case the ratios end up being sort of more like fallback settings for when you don't have enough memory to fulfill all cache requests at a given priority level, but otherwise are not utilized until we hit that limit.

The goal with this scheme is to make sure that "high priority" items in each cache get first dibs at the memory even if it might skew the ratios. This might be things like rocksdb bloom filters and indexes, or potentially very recent hot items in one cache vs very old items in another cache. The ratios become more like guidelines than hard limits. When you change to manual mode, you set an overall bluestore cache size and each cache gets a flat percentage of it based on the ratios. With 0.4/0.4 you will always have 40% for onode, 40% for omap, and 20% for data, even if one of those caches does not use all of its memory.

> Many thanks for your help with this. I can't find answers to these questions
> in the docs.
> There might be two reasons for high osd_map memory usage. One is, that our
> OSDs seem to hold a large number of OSD maps:

I brought this up in our core team standup last week. Not sure if anyone has had time to look at it yet though.
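Mark's manual-mode arithmetic (flat 40/40/20 split with meta/kv ratios of 0.4/0.4) is easy to sanity-check with a few lines of shell. The 3 GiB total below is an illustrative number only, not a recommended cache size:

```shell
# With bluestore_cache_meta_ratio=0.4 and bluestore_cache_kv_ratio=0.4,
# manual mode carves a flat 40/40/20 split out of the total cache size.
CACHE_SIZE=$((3 * 1024 * 1024 * 1024))  # e.g. bluestore_cache_size_hdd = 3 GiB
META=$(awk -v s="$CACHE_SIZE" 'BEGIN { printf "%d", int(s * 0.4) }')  # onode cache
KV=$(awk -v s="$CACHE_SIZE" 'BEGIN { printf "%d", int(s * 0.4) }')    # omap/rocksdb cache
DATA=$((CACHE_SIZE - META - KV))                                      # remaining ~20% for data
echo "onode=$META kv=$KV data=$DATA"
```

In automatic mode the same ratios only kick in as tie-breakers when the caches' requests at a given priority level exceed the available memory, as described above.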
[ceph-users] Re: OSD memory leak?
        "osd": { "items": 96, "bytes": 1115904 },
        "osd_mapbl": { "items": 80, "bytes": 8501746 },
        "osd_pglog": { "items": 328703, "bytes": 117673864 },
        "osdmap": { "items": 12101478, "bytes": 210941392 },
        "osdmap_mapping": { "items": 0, "bytes": 0 },
        "pgmap": { "items": 0, "bytes": 0 },
        "mds_co": { "items": 0, "bytes": 0 },
        "unittest_1": { "items": 0, "bytes": 0 },
        "unittest_2": { "items": 0, "bytes": 0 }
    },
    "total": { "items": 23145696, "bytes": 526245301 }
  }
}

# ceph daemon osd.211 heap stats
osd.211 tcmalloc heap stats:
MALLOC:    1727399344 ( 1647.4 MiB) Bytes in use by application
MALLOC: +      532480 (    0.5 MiB) Bytes in page heap freelist
MALLOC: +   262860912 (  250.7 MiB) Bytes in central cache freelist
MALLOC: +    11693568 (   11.2 MiB) Bytes in transfer cache freelist
MALLOC: +    29694944 (   28.3 MiB) Bytes in thread cache freelists
MALLOC: +    14024704 (   13.4 MiB) Bytes in malloc metadata
MALLOC:
MALLOC: =  2046205952 ( 1951.4 MiB) Actual memory used (physical + swap)
MALLOC: +   229212160 (  218.6 MiB) Bytes released to OS (aka unmapped)
MALLOC:
MALLOC: =  2275418112 ( 2170.0 MiB) Virtual address space used
MALLOC:
MALLOC:        145115 Spans in use
MALLOC:            32 Thread heaps in use
MALLOC:          8192 Tcmalloc page size

# ceph daemon osd.211 dump_mempools
{
  "mempool": {
    "by_pool": {
        "bloom_filter": { "items": 0, "bytes": 0 },
        "bluestore_alloc": { "items": 4691828, "bytes": 37534624 },
        "bluestore_cache_data": { "items": 894, "bytes": 163053568 },
        "bluestore_cache_onode": { "items": 165536, "bytes": 94024448 },
        "bluestore_cache_other": { "items": 33936718, "bytes": 233428234 },
        "bluestore_fsck": { "items": 0, "bytes": 0 },
        "bluestore_txc": { "items": 110, "bytes": 75680 },
        "bluestore_writing_deferred": { "items": 38, "bytes": 6061245 },
        "bluestore_writing": { "items": 0, "bytes": 0 },
        "bluefs": { "items": 9956, "bytes": 189640 },
        "buffer_anon": { "items": 293298, "bytes": 59950954 },
        "buffer_meta": { "items": 1005, "bytes": 64320 },
        "osd": { "items": 98, "bytes": 1139152 },
        "osd_mapbl": { "items": 80, "bytes": 8501690 },
        "osd_pglog": { "items": 350517, "bytes": 132253139 },
        "osdmap": { "items": 633498, "bytes": 10866360 },
        "osdmap_mapping": { "items": 0, "bytes": 0 },
        "pgmap": { "items": 0, "bytes": 0 },
        "mds_co": { "items": 0, "bytes": 0 },
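To make dumps like the two above easier to compare, the by_pool section can be ranked by size. A small helper sketch (the script name is hypothetical; it only assumes the dump_mempools JSON layout shown above):

```shell
# top_mempools.py: rank mempools by bytes, largest first, skipping empty pools
cat > top_mempools.py <<'EOF'
import json
import sys

pools = json.load(sys.stdin)["mempool"]["by_pool"]
for name, v in sorted(pools.items(), key=lambda kv: -kv[1]["bytes"]):
    b = v["bytes"]
    if b:
        print(f"{b:>12}  {name}")
EOF
# usage on a live OSD:
#   ceph daemon osd.211 dump_mempools | python3 top_mempools.py
```

On the first dump above this would put osdmap (~201 MiB) at the top, which is exactly the pool whose trimming behaviour is under discussion in this thread.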
[ceph-users] Re: OSD memory leak?
Dear Dan, cc Mark,

this sounds exactly like the scenario I'm looking at. We have rolling snapshots on the RBD images of currently ca. 200 VMs, and the number is increasing. Snapshots are daily with different retention periods. We have two pools with separate hardware backing RBD and cephfs. The mem stats I sent are from an OSD backing cephfs, which does not have any snaps currently. So the snaps on other OSDs influence the memory usage of OSDs that have nothing to do with the RBDs.

I also noticed a significant drop of memory usage across the cluster after restarting the OSDs on just one host. Not sure if this is expected either. Looks like the OSDs do collect dead baggage quite fast and the memory_target reduces the caches in an attempt to accommodate that. The fact that the kernel swaps this out in favour of disk buffers on a system with low swappiness, where the only disk access is local syslog, indicates that this memory is allocated but never used - a quite massive leak. It currently looks like the leakage exceeds the mem target after only a couple of days. I don't want my operations team dealing with the occasional OOM kill.

For now I will probably adopt a reverse strategy: give up on memory_target doing something useful, increase the minimum cache limits to ensure at least some caching, have swap take care of the leak, and restart OSDs regularly (every 2-3 months). Would be good if this could be looked at. Please let me know if there is some data I can provide.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Mark Nelson
Sent: 15 July 2020 18:36:06
To: Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?

On 7/15/20 9:58 AM, Dan van der Ster wrote:
> Hi Mark,
>
> On Mon, Jul 13, 2020 at 3:42 PM Mark Nelson wrote:
>> Hi Frank,
>>
>>
>> So the osd_memory_target code will basically shrink the size of the
>> bluestore and rocksdb caches to attempt to keep the overall mapped (not
>> rss!) memory of the process below the target.
It's sort of "best >> effort" in that it can't guarantee the process will fit within a given >> target, it will just (assuming we are over target) shrink the caches up >> to some minimum value and that's it. 2GB per OSD is a pretty ambitious >> target. It's the lowest osd_memory_target we recommend setting. I'm a >> little surprised the OSD is consuming this much memory with a 2GB target >> though. >> >> Looking at your mempool dump I see very little memory allocated to the >> caches. In fact the majority is taken up by osdmap (looks like you have >> a decent number of OSDs) and pglog. That indicates that the memory > Do you know if this high osdmap usage is known already? > Our big block storage cluster generates a new osdmap every few seconds > (due to rbd snap trimming) and we see the osdmap mempool usage growing > over a few months until osds start getting OOM killed. > > Today we proactively restarted them because the osdmap_mempool was > using close to 700MB. > So it seems that whatever is supposed to be trimming is not working. > (This is observed with nautilus 14.2.8 but iirc it has been the same > even when we were running luminous and mimic too) > > Cheers, Dan Hrm, it hasn't been on my radar, though looking back through the mailing list there appears to be various reports over the years of high usage (some of which theoretically have been fixed). Maybe submit a tracker issue? 700MB seems quite high for osdmap, but I don't really know the retention rules so someone else who knows that code better will have to chime in. > >> autotuning is probably working but simply can't do anything more to >> help. Something else is taking up the memory. Figure you've got a >> little shy of 500MB for the mempools. RocksDB will take up more (and >> potentially quite a bit more if you have memtables backing up waiting to >> be flushed to L0) and potentially some other things in the OSD itself >> that could take up memory. 
If you feel comfortable experimenting, you >> could try changing the rocksdb WAL/memtable settings. By default we >> have up to 4 256MB WAL buffers. Instead you could try something like 2 >> 64MB buffers, but be aware this could cause slow performance or even >> temporary write stalls if you have fast storage. Still, this would only >> give you up to ~0.9GB back. Since you are on mimic, you might also want >> to check what your kernel's transparent huge pages configuration is. I >> don't remember if we backported Patrick's fix to always avoid THP for >> ceph processes. If your kernel is set to "always", you might consider >> trying it with "madvise". >> >> Alte
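The transparent huge pages setting Mark mentions can be checked like this; the sed one-liner just extracts the bracketed active policy, and the kernel path is the standard one (it may be absent inside containers):

```shell
# Print the active THP policy, e.g. "always" or "madvise".
thp_active() { sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"; }

THP=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$THP" ]; then
    thp_active "$THP"
fi
# switch until next reboot (needs root):
#   echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```

With "madvise", only processes that explicitly ask for huge pages get them, which is the behaviour Patrick's fix aims at for ceph daemons.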
[ceph-users] Re: mon_osd_down_out_subtree_limit not working?
Hi Dan,

I now added it to ceph.conf and restarted all MONs. The running config now shows as:

# ceph config show mon.ceph-01 | grep -e NAME -e mon_osd_down_out_subtree_limit
NAME                            VALUE  SOURCE  OVERRIDES    IGNORES
mon_osd_down_out_subtree_limit  host   file    (mon[host])

The config DB entry moved from column ignores to overrides, that is, it is still not used. Looks like a priority bug to me. On startup, the config DB setting should have higher priority than source "default" (and lower than "file", as is the case). Should I open a tracker ticket?

I tested a shutdown of all OSDs on a host and it works now as expected and desired. Thanks!

=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Frank Schilder
Sent: 15 July 2020 10:15:12
To: Dan van der Ster
Cc: Anthony D'Atri; ceph-users
Subject: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?

Setting it in ceph.conf is exactly what I wanted to avoid :). I will give it a try though. I guess this should become an issue in the tracker? Is it, by any chance, required to restart *all* daemons or should MONs be enough?

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Dan van der Ster
Sent: 15 July 2020 10:10:44
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users
Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?

Hrmm that is strange. We set it via /etc/ceph/ceph.conf, not the config framework. Maybe try that?

-- dan

On Wed, Jul 15, 2020 at 9:59 AM Frank Schilder wrote:
>
> Hi Dan,
>
> it still does not work. When I execute
>
> # ceph config set global mon_osd_down_out_subtree_limit host
> 2020-07-15 09:17:11.890 7f36cf7fe700 -1 set_mon_vals failed to set
> mon_osd_down_out_subtree_limit = host: Configuration option
> 'mon_osd_down_out_subtree_limit' may not be modified at runtime
>
> I get now a warning that one cannot change the value at run time.
However, a
> restart of all monitors still does not apply the value:
>
> # ceph config show mon.ceph-01 | grep -e NAME -e
> mon_osd_down_out_subtree_limit | sed -e "s/ */\t/g"
> NAME                            VALUE  SOURCE   OVERRIDES  IGNORES
> mon_osd_down_out_subtree_limit  rack   default             mon
>
> so the setting in the config data base is still ignored. Any ideas? I cannot
> shut down the entire cluster for something that simple.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Dan van der Ster
> Sent: 14 July 2020 17:38:27
> To: Frank Schilder
> Cc: Anthony D'Atri; ceph-users
> Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?
>
> Seems that
>
> ceph config set mon mon_osd_down_out_subtree_limit
>
> isn't working. (I've seen this sort of config namespace issue in the past).
>
> I'd try `ceph config set global mon_osd_down_out_subtree_limit host`
> then restart the mon and check `ceph daemon mon.ceph-01 config get
> mon_osd_down_out_subtree_limit` again.
>
> -- dan
>
>
> On Tue, Jul 14, 2020 at 1:35 PM Frank Schilder wrote:
> >
> > Hi Dan,
> >
> > thanks for your reply. There is still a problem.
> >
> > Firstly, I did indeed forget to restart the mon even though I looked at the
> > help for mon_osd_down_out_subtree_limit and it says it requires a restart.
> > Stupid me. Well, now I did a restart and it still doesn't work. Here is the
> > situation:
> >
> > # ceph config dump | grep subtree
> > mon  advanced  mon_osd_down_out_subtree_limit  host        *
> > mon  advanced  mon_osd_reporter_subtree_level  datacenter
> >
> > # ceph config get mon.ceph-01 mon_osd_down_out_subtree_limit
> > host
> >
> > # ceph daemon mon.ceph-01 config get mon_osd_down_out_subtree_limit
> > {
> >     "mon_osd_down_out_subtree_limit": "rack"
> > }
> >
> > # ceph config show mon.ceph-01 | grep subtree
> > mon_osd_down_out_subtree_limit  rack        default  mon
> > mon_osd_reporter_subtree_level  datacenter  mon
> >
> > The default overrides the mon config database setting. What is going on
> > here? I restarted all 3 monitors.
> >
> > Best regards and thanks for your help,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> > &
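Since the config database entry is being ignored, the workaround Dan suggests is the classic ceph.conf route. A sketch follows; CONF defaults to a scratch file here purely for illustration, on a real mon host it would be /etc/ceph/ceph.conf, and whether [mon] or [global] is the better section for this option is worth double-checking against your release:

```shell
# Append the setting to the mon's ceph.conf (scratch file by default here).
CONF="${CONF:-$(mktemp)}"
cat >> "$CONF" <<'EOF'
[mon]
mon_osd_down_out_subtree_limit = host
EOF
# then restart the monitors, e.g.: systemctl restart ceph-mon.target
```

As the follow-up message in this thread confirms, after a mon restart the value then shows up with SOURCE "file" in `ceph config show`.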
[ceph-users] Re: mon_osd_down_out_subtree_limit not working?
Setting it in ceph.conf is exactly what I wanted to avoid :). I will give it a try though. I guess this should become an issue in the tracker? Is it, by any chance, required to restart *all* daemons or should MONs be enough?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Dan van der Ster
Sent: 15 July 2020 10:10:44
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users
Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?

Hrmm that is strange. We set it via /etc/ceph/ceph.conf, not the config framework. Maybe try that?

-- dan

On Wed, Jul 15, 2020 at 9:59 AM Frank Schilder wrote:
>
> Hi Dan,
>
> it still does not work. When I execute
>
> # ceph config set global mon_osd_down_out_subtree_limit host
> 2020-07-15 09:17:11.890 7f36cf7fe700 -1 set_mon_vals failed to set
> mon_osd_down_out_subtree_limit = host: Configuration option
> 'mon_osd_down_out_subtree_limit' may not be modified at runtime
>
> I get now a warning that one cannot change the value at run time. However, a
> restart of all monitors still does not apply the value:
>
> # ceph config show mon.ceph-01 | grep -e NAME -e
> mon_osd_down_out_subtree_limit | sed -e "s/ */\t/g"
> NAME                            VALUE  SOURCE   OVERRIDES  IGNORES
> mon_osd_down_out_subtree_limit  rack   default             mon
>
> so the setting in the config data base is still ignored. Any ideas? I cannot
> shut down the entire cluster for something that simple.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Dan van der Ster
> Sent: 14 July 2020 17:38:27
> To: Frank Schilder
> Cc: Anthony D'Atri; ceph-users
> Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?
>
> Seems that
>
> ceph config set mon mon_osd_down_out_subtree_limit
>
> isn't working. (I've seen this sort of config namespace issue in the past).
>
> I'd try `ceph config set global mon_osd_down_out_subtree_limit host`
> then restart the mon and check `ceph daemon mon.ceph-01 config get
> mon_osd_down_out_subtree_limit` again.
>
> -- dan
>
>
> On Tue, Jul 14, 2020 at 1:35 PM Frank Schilder wrote:
> >
> > Hi Dan,
> >
> > thanks for your reply. There is still a problem.
> >
> > Firstly, I did indeed forget to restart the mon even though I looked at the
> > help for mon_osd_down_out_subtree_limit and it says it requires a restart.
> > Stupid me. Well, now I did a restart and it still doesn't work. Here is the
> > situation:
> >
> > # ceph config dump | grep subtree
> > mon  advanced  mon_osd_down_out_subtree_limit  host        *
> > mon  advanced  mon_osd_reporter_subtree_level  datacenter
> >
> > # ceph config get mon.ceph-01 mon_osd_down_out_subtree_limit
> > host
> >
> > # ceph daemon mon.ceph-01 config get mon_osd_down_out_subtree_limit
> > {
> >     "mon_osd_down_out_subtree_limit": "rack"
> > }
> >
> > # ceph config show mon.ceph-01 | grep subtree
> > mon_osd_down_out_subtree_limit  rack        default  mon
> > mon_osd_reporter_subtree_level  datacenter  mon
> >
> > The default overrides the mon config database setting. What is going on
> > here? I restarted all 3 monitors.
> >
> > Best regards and thanks for your help,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> >
> > From: Dan van der Ster
> > Sent: 14 July 2020 10:53:13
> > To: Frank Schilder
> > Cc: Anthony D'Atri; ceph-users
> > Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?
> >
> > mon_osd_down_out_subtree_limit has been working well here. Did you
> > restart the mon's after making that config change?
> > Can you do this just to make sure it took effect?
> >
> >    ceph daemon mon.`hostname -s` config get mon_osd_down_out_subtree_limit
> >
> > -- dan
> >
> > On Tue, Jul 14, 2020 at 8:57 AM Frank Schilder wrote:
> > >
> > > Yes.
After the time-out of 600 secs the OSDs got marked down, all PGs got > > > remapped and recovery/rebalancing started as usual. In the past, I did > > > service on servers with the flag noout set and would expect that > > > mon_osd_down_out_subtree_limit=host has the same effect when shutting >
[ceph-users] Re: mon_osd_down_out_subtree_limit not working?
Hi Dan,

it still does not work. When I execute

# ceph config set global mon_osd_down_out_subtree_limit host
2020-07-15 09:17:11.890 7f36cf7fe700 -1 set_mon_vals failed to set mon_osd_down_out_subtree_limit = host: Configuration option 'mon_osd_down_out_subtree_limit' may not be modified at runtime

I get now a warning that one cannot change the value at run time. However, a restart of all monitors still does not apply the value:

# ceph config show mon.ceph-01 | grep -e NAME -e mon_osd_down_out_subtree_limit | sed -e "s/  */\t/g"
NAME                            VALUE  SOURCE   OVERRIDES  IGNORES
mon_osd_down_out_subtree_limit  rack   default             mon

so the setting in the config data base is still ignored. Any ideas? I cannot shut down the entire cluster for something that simple.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Dan van der Ster
Sent: 14 July 2020 17:38:27
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users
Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?

Seems that

ceph config set mon mon_osd_down_out_subtree_limit

isn't working. (I've seen this sort of config namespace issue in the past.)

I'd try `ceph config set global mon_osd_down_out_subtree_limit host` then restart the mon and check `ceph daemon mon.ceph-01 config get mon_osd_down_out_subtree_limit` again.

-- dan

On Tue, Jul 14, 2020 at 1:35 PM Frank Schilder wrote:
>
> Hi Dan,
>
> thanks for your reply. There is still a problem.
>
> Firstly, I did indeed forget to restart the mon even though I looked at the
> help for mon_osd_down_out_subtree_limit and it says it requires a restart.
> Stupid me. Well, now I did a restart and it still doesn't work. Here is the
> situation:
>
> # ceph config dump | grep subtree
> mon  advanced  mon_osd_down_out_subtree_limit  host        *
> mon  advanced  mon_osd_reporter_subtree_level  datacenter
>
> # ceph config get mon.ceph-01 mon_osd_down_out_subtree_limit
> host
>
> # ceph daemon mon.ceph-01 config get mon_osd_down_out_subtree_limit
> {
>     "mon_osd_down_out_subtree_limit": "rack"
> }
>
> # ceph config show mon.ceph-01 | grep subtree
> mon_osd_down_out_subtree_limit  rack        default  mon
> mon_osd_reporter_subtree_level  datacenter  mon
>
> The default overrides the mon config database setting. What is going on here?
> I restarted all 3 monitors.
>
> Best regards and thanks for your help,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> From: Dan van der Ster
> Sent: 14 July 2020 10:53:13
> To: Frank Schilder
> Cc: Anthony D'Atri; ceph-users
> Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?
>
> mon_osd_down_out_subtree_limit has been working well here. Did you
> restart the mon's after making that config change?
> Can you do this just to make sure it took effect?
>
>    ceph daemon mon.`hostname -s` config get mon_osd_down_out_subtree_limit
>
> -- dan
>
> On Tue, Jul 14, 2020 at 8:57 AM Frank Schilder wrote:
> >
> > Yes.
> > > > Best regards, > > = > > Frank Schilder > > AIT Risø Campus > > Bygning 109, rum S14 > > > > > > From: Anthony D'Atri > > Sent: 14 July 2020 04:32:05 > > To: Frank Schilder > > Subject: Re: [ceph-users] mon_osd_down_out_subtree_limit not working? > > > > Did it start rebalancing? > > > > > On Jul 13, 2020, at 4:29 AM, Frank Schilder wrote: > > > > > > if I shut down all OSDs on this host, these OSDs should not be marked out > > > automatically after mon_osd_down_out_interval(=600) seconds. I did a test > > > today and, unfortunately, the OSDs do get marked as out. Ceph status was > > > showing 1 host down as expected. > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Poor Windows performance on ceph RBD.
Dear all,

a few more results regarding virtio version, RAM size and ceph RBD caching. I got some wrong information from our operators. We are using virtio-win-0.1.171 and found that this version might have a regression that affects performance: https://forum.proxmox.com/threads/big-discovery-on-virtio-performance.62728/. We are considering downgrading all machines to virtio-win-0.1.164-2 until virtio-win-0.1.185-1 is marked stable. Our tests show that with both of these versions, Windows Server 2016 and 2019 perform equally well.

We also experimented with the memory size for these machines. They used to have 4GB only. With 4GB, both versions eventually run into stalled I/O. After increasing this to 8GB we don't see stalls any more.

Ceph RBD caching should have been set to writeback. Not sure why caching was disabled by default. It does not have much, if any, effect on write performance, although transfer rates seem more steady. I mainly want to enable caching to reduce read operations, which compete with writes on the OSD level. This should give a much better overall experience. We will change this setting during forthcoming service windows.

Looks like we more or less got it sorted. Hints in this thread helped pinpoint the issues. Thanks and best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Frank Schilder
Sent: 13 July 2020 15:38:58
To: André Gemünd; ceph-users
Subject: [ceph-users] Re: Poor Windows performance on ceph RBD.

> If I may ask, which version of the virtio drivers do you use?

https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/latest-virtio/virtio-win.iso
Looks like virtio-win-0.1.185.*

> And do you use caching on libvirt driver level?

In the ONE interface, we use DISK = [ driver = "raw" , cache = "none" ], which translates to the corresponding cache attribute in the libvirt domain XML. We have no qemu settings in the ceph.conf. Looks like caching is disabled. Not sure if this is the recommended way though, and why caching is disabled by default.
Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: André Gemünd Sent: 13 July 2020 11:18 To: Frank Schilder Subject: Re: [ceph-users] Re: Poor Windows performance on ceph RBD. If I may ask, which version of the virtio drivers do you use? And do you use caching on libvirt driver level? Greetings André - Am 13. Jul 2020 um 10:43 schrieb Frank Schilder fr...@dtu.dk: >> > To anyone who is following this thread, we found a possible explanation for >> > (some of) our observations. > >> If someone is following this, they probably want the possible >> explanation and not the knowledge of you having the possible >> explanation. > >> So you are saying if you do eg. a core installation (without gui) of >> 2016/2019 disable all services. The fio test results are signficantly >> different to eg. a centos 7 vm doing the same fio test? Are you sure >> this is not related to other processes writing to disk? > > Right, its not an explanation but rather a further observation. We don't > really > have an explanation yet. > > Its an identical installation of both server versions, same services > configured. > Our operators are not really into debugging Windows, that's why we were asking > here. Their hypothesis is, that the VD driver for accessing RBD images has > problems with Windows servers newer than 2016. I'm not a Windows guy, so can't > really comment on this. > > The test we do is a simple copy-test of a single 10g file and we monitor the > transfer speed. This info was cut out of this e-mail, the original report for > reference is: > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/ANHJQZLJT474B457VVM4ZZZ6HBXW4OPO/ > . > > We are very sure that it is not related to other processes writing to disk, we > monitor that too. There is also no competition on the RBD pool at the time of > testing. 
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> From: Marc Roos
> Sent: 13 July 2020 10:24
> To: ceph-users; Frank Schilder
> Subject: RE: [ceph-users] Re: Poor Windows performance on ceph RBD.
>
>>> To anyone who is following this thread, we found a possible
> explanation for
>>> (some of) our observations.
>
> If someone is following this, they probably want the possible
> explanation and not the knowledge of you having the possible
> explanation.
>
> So you are saying if you do e.g. a core installation (without GUI) of
> 2016/2019 and disable all services, the fio test results are significantly
> different to e.g. a CentOS 7 VM doing the same fio test?
[ceph-users] Re: OSD memory leak?
Hi Anthony and Mark, thanks for your answers.

I have seen recommendations derived from test clusters with bluestore OSDs that read 16GB base line + 1GB per HDD OSD + 4GB per SSD OSD, probably from the times when bluestore had a base-line plus stress-dependent memory footprint. I would actually consider this already quite something. I understand that for high-performance requirements one adds RAM etc. to speed things up. For a mostly cold data store with a thin layer of warm/hot data, however, this is quite a lot compared with what standard disk controllers can do with a cheap CPU, 4GB of RAM and 16 drives connected.

Essentially, Ceph is turning a server into a disk controller, and it should be possible to run a configuration that does not require much more than an ordinary hardware controller per disk while delivering reasonable performance. I'm thinking along the lines of 25MB/s throughput and maybe 10 IOPS per NL-SAS HDD OSD to the user side (simple collocated deployment, EC pool). This ought to be possible in a way similar to a RAID controller with comparably moderate hardware requirements. Good aggregated performance then comes from scale, and because the layer of hot data per disk is only a few GB per drive (a full re-write of just the hot data takes only a few minutes). I thought this was the idea of Ceph: instead of trying to accommodate high-performance wishes for ridiculously small Ceph clusters (I do see these "I have 3 servers with 3 disks each, why is it so slow" kind of complaints, which I would simply ignore), one talks about scale-out systems with thousands of OSDs. Something like 20 hosts serving 200 disks each would count as a small cluster. If the warm/hot data is only 1% or even less, such a system will be quite satisfying. For low-cost scale-out we have Ceph. For performance, we have technologies like Lustre (which, by the way, has much more moderate minimum hardware requirements).
For anything that requires higher performance one can then start using tiering, WAL/DB devices, SSD-only pools, lots of RAM, whatever. However, there should be a stable, well-tested and undemanding base-line config for the cold-store use case, with hardware requirements similar to a NAS box per storage unit (one server + JBODs). I am starting to miss support for the latter. 2 or even 4GB RAM and 1 core-GHz per HDD is really a lot compared with such systems.

Please don't take this as the start of a long discussion. It's just a wish from my side to have undemanding configs available that scale easily and are easy to administrate at an overall low cost. I will look into memory profiling of some OSDs. It doesn't look like a performance killer.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Anthony D'Atri
Sent: 14 July 2020 17:29
To: ceph-users@ceph.io
Subject: [ceph-users] Re: OSD memory leak?

>> In the past, the minimum recommendation was 1GB RAM per HDD blue store OSD.

There was a rule of thumb of 1GB RAM *per TB* of HDD Filestore OSD, perhaps you were influenced by that?

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
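To put rough numbers on the sizing argument above, here is a back-of-the-envelope sketch (my own illustration, not from the thread), assuming host RAM is approximated as a fixed base line plus one osd_memory_target per OSD; the 16GB base figure and the 4GB-per-OSD figure are the ones quoted earlier in the thread, and the 200-disk host is the dense example mentioned above.

```python
# Rough per-host RAM budget for bluestore OSD hosts.
# Assumption (illustrative): RAM ~= base_gib + n_osds * osd_memory_target.
GIB = 1024 ** 3

def host_ram_budget_gib(n_osds, osd_memory_target_bytes=4 * GIB, base_gib=16):
    """Suggested host RAM in GiB for n_osds bluestore OSDs."""
    return base_gib + n_osds * osd_memory_target_bytes // GIB

# A modest host with 16 HDD OSDs at the 4 GiB default target:
print(host_ram_budget_gib(16))            # -> 80
# The dense 200-disk host from the thread needs an order of magnitude more:
print(host_ram_budget_gib(200))           # -> 816
# Halving the per-OSD target (a "cold store" tuning) halves the OSD share:
print(host_ram_budget_gib(200, 2 * GIB))  # -> 416
```

This makes the complaint concrete: at default targets, a 200-disk host needs roughly 800GB of RAM, which is exactly the gap to a cheap hardware RAID controller the email is pointing at.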
[ceph-users] Re: OSD memory leak?
Dear Mark, thanks for the info. I forgot a few answers: THPs are disabled (set to "never"). The kernel almost certainly doesn't reclaim because there is not enough pressure yet.

We have 268 OSDs. I would not consider this much. We plan to triple that soonish. In the past, the minimum recommendation was 1GB RAM per HDD bluestore OSD. I'm actually not really happy that this has been quadrupled for not really convincing reasons. Compared with other storage systems, the increase in minimum requirements really starts making Ceph expensive.

We have set the OSDs to use the bitmap allocator. Is the fact that we get tcmalloc stats a contradiction to this?

I did not consider upgrading from mimic because a lot of people report stability issues that might be caused by a regression in the message queueing. There was a longer e-mail about clusters from Nautilus and higher collapsing under trivial amounts of rebalancing, pool deletion and other admin tasks. Before I consider upgrading, I want to test this on a lab cluster we plan to set up soon.

I will look at the memory profiling. If one can use it on a production system, I will give it a go.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Mark Nelson
Sent: 14 July 2020 14:48:36
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSD memory leak?

Hi Frank,

These might help:

https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/
https://gperftools.github.io/gperftools/heapprofile.html
https://gperftools.github.io/gperftools/heap_checker.html

Regarding the mempools, they don't track all of the memory usage in Ceph, only things that were allocated using mempools. There are many other things (the rocksdb block cache, for example) that don't use them. It's only giving you a partial picture of memory usage. In your example below, that byte value from the thread cache freelist looks very wrong.
Ignoring that for a moment though, there's a ton of memory that's been unmapped and released to the OS, but hasn't been reclaimed by the kernel. That's either because the kernel doesn't have enough memory pressure to bother reclaiming it, or because it's all fragmented chunks of a huge page that the kernel can't fully reclaim. That tells me you should definitely be looking at the transparent huge page (THP) configuration on your nodes. Looking back at batrick's PR that disables THP for Ceph, it looks like we only backported it to nautilus but not mimic. On that topic, have you considered upgrading to Nautilus? Mark On 7/14/20 2:56 AM, Frank Schilder wrote: > Dear Mark, > > thanks for the quick answer. I would try the memory profiler if I could find > any documentation on it. In fact, I just guessed the "heap stats" command and > have a hard time finding anything on the OSD daemon commands. Could you > possibly point me to something? Also how to interpret the mempools? Is it > correct to say that out of the memory_target only the mempools total is > actually used and the remaining memory is lost due to leaks? > > For example, for OSD 256 I get the stats below after just 2 months uptime. Am > I looking at a 5.5GB memory leak here? 
>
> # ceph config get osd.256 osd_memory_target
> 8589934592
>
> # ceph daemon osd.256 heap stats
> osd.256 tcmalloc heap stats:
> MALLOC:     7216067616 ( 6881.8 MiB) Bytes in use by application
> MALLOC: +        229376 (    0.2 MiB) Bytes in page heap freelist
> MALLOC: +    1222913888 ( 1166.3 MiB) Bytes in central cache freelist
> MALLOC: +        278016 (    0.3 MiB) Bytes in transfer cache freelist
> MALLOC: + 18446744073692937856 (17592186044400.2 MiB) Bytes in thread cache freelists
> MALLOC: +      52166656 (   49.8 MiB) Bytes in malloc metadata
> MALLOC:
> MALLOC: =    8475041792 ( 8082.4 MiB) Actual memory used (physical + swap)
> MALLOC: +    2010464256 ( 1917.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC:
> MALLOC: =   10485506048 ( 9999.8 MiB) Virtual address space used
> MALLOC:
> MALLOC:          765182 Spans in use
> MALLOC:              48 Thread heaps in use
> MALLOC:            8192 Tcmalloc page size
>
> # ceph daemon osd.256 dump_mempools
> {
>     "mempool": {
>         "by_pool": {
>             "bloom_filter": {
>                 "items": 0,
>                 "bytes": 0
>             },
>             "bluestore_alloc": {
>                 "items": 2300682,
>                 "bytes": 18405456
>             },
>
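Mark's remark that the thread-cache byte value "looks very wrong" can be checked mechanically: 18446744073692937856 is just below 2^64, which is the signature of a small negative counter printed as an unsigned 64-bit integer. A short sketch (my own illustration, not from the thread) that recovers the signed value:

```python
# tcmalloc reported 18446744073692937856 bytes in the thread cache freelists.
# Values near 2**64 are almost certainly negative counters that wrapped
# around when formatted as unsigned 64-bit integers.
def as_signed_64(value):
    """Reinterpret an unsigned 64-bit counter as a signed 64-bit integer."""
    return value - 2**64 if value >= 2**63 else value

wrapped = 18446744073692937856
signed = as_signed_64(wrapped)
print(signed)                    # -> -16613760 (an underflow of ~15.8 MiB)
print(round(signed / 2**20, 1))  # -> -15.8
```

So the "17592186044400.2 MiB" is not 16 exabytes of thread caches but a bookkeeping underflow of about 16 MB, which is why the other totals in the heap stats still add up sensibly.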
[ceph-users] Re: mon_osd_down_out_subtree_limit not working?
Hi Dan, thanks for your reply. There is still a problem. Firstly, I did indeed forget to restart the mon, even though I looked at the help for mon_osd_down_out_subtree_limit and it says it requires a restart. Stupid me. Well, now I did a restart and it still doesn't work. Here is the situation:

# ceph config dump | grep subtree
mon advanced mon_osd_down_out_subtree_limit host
mon advanced mon_osd_reporter_subtree_level datacenter

# ceph config get mon.ceph-01 mon_osd_down_out_subtree_limit
host

# ceph daemon mon.ceph-01 config get mon_osd_down_out_subtree_limit
{
    "mon_osd_down_out_subtree_limit": "rack"
}

# ceph config show mon.ceph-01 | grep subtree
mon_osd_down_out_subtree_limit rack default mon
mon_osd_reporter_subtree_level datacenter mon

The default overrides the mon config database setting. What is going on here? I restarted all 3 monitors.

Best regards and thanks for your help,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Dan van der Ster
Sent: 14 July 2020 10:53:13
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users
Subject: Re: [ceph-users] Re: mon_osd_down_out_subtree_limit not working?

mon_osd_down_out_subtree_limit has been working well here. Did you restart the mon's after making that config change? Can you do this just to make sure it took effect?

ceph daemon mon.`hostname -s` config get mon_osd_down_out_subtree_limit

-- dan

On Tue, Jul 14, 2020 at 8:57 AM Frank Schilder wrote:
>
> Yes. After the time-out of 600 secs the OSDs got marked down, all PGs got
> remapped and recovery/rebalancing started as usual. In the past, I did
> service on servers with the flag noout set and would expect that
> mon_osd_down_out_subtree_limit=host has the same effect when shutting down an
> entire host. Unfortunately, in my case these two settings behave differently.
>
> If I understand the documentation correctly, the OSDs should not get marked
> out automatically.
> > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ____ > From: Anthony D'Atri > Sent: 14 July 2020 04:32:05 > To: Frank Schilder > Subject: Re: [ceph-users] mon_osd_down_out_subtree_limit not working? > > Did it start rebalancing? > > > On Jul 13, 2020, at 4:29 AM, Frank Schilder wrote: > > > > if I shut down all OSDs on this host, these OSDs should not be marked out > > automatically after mon_osd_down_out_interval(=600) seconds. I did a test > > today and, unfortunately, the OSDs do get marked as out. Ceph status was > > showing 1 host down as expected. > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: OSD memory leak?
Dear Mark, thanks for the quick answer. I would try the memory profiler if I could find any documentation on it. In fact, I just guessed the "heap stats" command and have a hard time finding anything on the OSD daemon commands. Could you possibly point me to something? Also, how to interpret the mempools? Is it correct to say that out of the memory_target only the mempools total is actually used and the remaining memory is lost due to leaks?

For example, for OSD 256 I get the stats below after just 2 months uptime. Am I looking at a 5.5GB memory leak here?

# ceph config get osd.256 osd_memory_target
8589934592

# ceph daemon osd.256 heap stats
osd.256 tcmalloc heap stats:
MALLOC:     7216067616 ( 6881.8 MiB) Bytes in use by application
MALLOC: +        229376 (    0.2 MiB) Bytes in page heap freelist
MALLOC: +    1222913888 ( 1166.3 MiB) Bytes in central cache freelist
MALLOC: +        278016 (    0.3 MiB) Bytes in transfer cache freelist
MALLOC: + 18446744073692937856 (17592186044400.2 MiB) Bytes in thread cache freelists
MALLOC: +      52166656 (   49.8 MiB) Bytes in malloc metadata
MALLOC:
MALLOC: =    8475041792 ( 8082.4 MiB) Actual memory used (physical + swap)
MALLOC: +    2010464256 ( 1917.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:
MALLOC: =   10485506048 ( 9999.8 MiB) Virtual address space used
MALLOC:
MALLOC:          765182 Spans in use
MALLOC:              48 Thread heaps in use
MALLOC:            8192 Tcmalloc page size

# ceph daemon osd.256 dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": { "items": 0, "bytes": 0 },
            "bluestore_alloc": { "items": 2300682, "bytes": 18405456 },
            "bluestore_cache_data": { "items": 52390, "bytes": 306843648 },
            "bluestore_cache_onode": { "items": 256153, "bytes": 145494904 },
            "bluestore_cache_other": { "items": 92199353, "bytes": 656620069 },
            "bluestore_fsck": { "items": 0, "bytes": 0 },
            "bluestore_txc": { "items": 4, "bytes": 2752 },
            "bluestore_writing_deferred": { "items": 122, "bytes": 1864924 },
            "bluestore_writing": { "items": 3673, "bytes": 18440192 },
            "bluefs": { "items": 11867, "bytes": 220504 },
            "buffer_anon": { "items": 353734, "bytes": 1180837372 },
            "buffer_meta": { "items": 91646, "bytes": 5865344 },
            "osd": { "items": 134, "bytes": 1557616 },
            "osd_mapbl": { "items": 84, "bytes": 8479562 },
            "osd_pglog": { "items": 487004, "bytes": 166094788 },
            "osdmap": { "items": 117697, "bytes": 2080280 },
            "osdmap_mapping": { "items": 0, "bytes": 0 },
            "pgmap": { "items": 0, "bytes": 0 },
            "mds_co": { "items": 0, "bytes": 0 },
            "unittest_1": { "items": 0, "bytes": 0 },
            "unittest_2": { "items": 0, "bytes": 0 }
        },
        "total": { "items": 95874543, "bytes": 2512807411 }
    }
}

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Mark Nelson
Sent: 13 July 2020 15:39:50
To: ceph-users@ceph.io
Subject: [ceph-users] Re: OSD memory leak?
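As a sanity check of the dump_mempools figures for OSD 256 above, the per-pool byte counts can be summed and compared with the reported total and the osd_memory_target. The gap is not necessarily a leak: as Mark notes elsewhere in the thread, the mempools only track a subset of allocations (the rocksdb block cache, for example, is outside them). A small sketch (my own illustration) using the numbers quoted above:

```python
# Non-zero per-pool "bytes" values from the dump_mempools output for OSD 256.
mempool_bytes = {
    "bluestore_alloc": 18405456,
    "bluestore_cache_data": 306843648,
    "bluestore_cache_onode": 145494904,
    "bluestore_cache_other": 656620069,
    "bluestore_txc": 2752,
    "bluestore_writing_deferred": 1864924,
    "bluestore_writing": 18440192,
    "bluefs": 220504,
    "buffer_anon": 1180837372,
    "buffer_meta": 5865344,
    "osd": 1557616,
    "osd_mapbl": 8479562,
    "osd_pglog": 166094788,
    "osdmap": 2080280,
}

total = sum(mempool_bytes.values())
print(total)  # -> 2512807411, matching the reported "total" exactly

osd_memory_target = 8589934592
# Memory not accounted for by mempools -- roughly the "5.5GB leak" asked
# about above, but it includes everything the mempools simply don't track.
untracked = osd_memory_target - total
print(round(untracked / 2**30, 2))  # -> 5.66 (GiB)
```

So the arithmetic behind the "5.5GB leak" question checks out, but the untracked portion is expected to be large because mempools give only a partial picture.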
[ceph-users] Re: mon_osd_down_out_subtree_limit not working?
Yes. After the time-out of 600 secs the OSDs got marked down, all PGs got remapped and recovery/rebalancing started as usual. In the past, I did service on servers with the flag noout set and would expect that mon_osd_down_out_subtree_limit=host has the same effect when shutting down an entire host. Unfortunately, in my case these two settings behave differently. If I understand the documentation correctly, the OSDs should not get marked out automatically. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Anthony D'Atri Sent: 14 July 2020 04:32:05 To: Frank Schilder Subject: Re: [ceph-users] mon_osd_down_out_subtree_limit not working? Did it start rebalancing? > On Jul 13, 2020, at 4:29 AM, Frank Schilder wrote: > > if I shut down all OSDs on this host, these OSDs should not be marked out > automatically after mon_osd_down_out_interval(=600) seconds. I did a test > today and, unfortunately, the OSDs do get marked as out. Ceph status was > showing 1 host down as expected. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Poor Windows performance on ceph RBD.
> If I may ask, which version of the virtio drivers do you use?

https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/latest-virtio/virtio-win.iso Looks like virtio-win-0.1.185.*

> And do you use caching on libvirt driver level?

In the ONE interface, we use DISK = [ driver = "raw" , cache = "none" ], which translates to in the XML. We have no qemu settings in the ceph.conf. Looks like caching is disabled. Not sure if this is the recommended way, though, or why caching is disabled by default.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: André Gemünd
Sent: 13 July 2020 11:18
To: Frank Schilder
Subject: Re: [ceph-users] Re: Poor Windows performance on ceph RBD.

If I may ask, which version of the virtio drivers do you use? And do you use caching on libvirt driver level?

Greetings
André

- Am 13. Jul 2020 um 10:43 schrieb Frank Schilder fr...@dtu.dk:
>> > To anyone who is following this thread, we found a possible explanation for
>> > (some of) our observations.
>
>> If someone is following this, they probably want the possible
>> explanation and not the knowledge of you having the possible
>> explanation.
>
>> So you are saying if you do e.g. a core installation (without GUI) of
>> 2016/2019 and disable all services, the fio test results are significantly
>> different to e.g. a CentOS 7 VM doing the same fio test? Are you sure
>> this is not related to other processes writing to disk?
>
> Right, it's not an explanation but rather a further observation. We don't really
> have an explanation yet.
>
> It's an identical installation of both server versions, same services configured.
> Our operators are not really into debugging Windows, that's why we were asking
> here. Their hypothesis is that the VD driver for accessing RBD images has
> problems with Windows servers newer than 2016. I'm not a Windows guy, so I can't
> really comment on this.
>
> The test we do is a simple copy test of a single 10G file and we monitor the
> transfer speed. This info was cut out of this e-mail; the original report for
> reference is:
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/ANHJQZLJT474B457VVM4ZZZ6HBXW4OPO/
>
> We are very sure that it is not related to other processes writing to disk; we
> monitor that too. There is also no competition on the RBD pool at the time of
> testing.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> From: Marc Roos
> Sent: 13 July 2020 10:24
> To: ceph-users; Frank Schilder
> Subject: RE: [ceph-users] Re: Poor Windows performance on ceph RBD.
>
>>> To anyone who is following this thread, we found a possible
> explanation for
>>> (some of) our observations.
>
> If someone is following this, they probably want the possible
> explanation and not the knowledge of you having the possible
> explanation.
>
> So you are saying if you do e.g. a core installation (without GUI) of
> 2016/2019 and disable all services, the fio test results are significantly
> different to e.g. a CentOS 7 VM doing the same fio test? Are you sure
> this is not related to other processes writing to disk?
>
>
>
> -Original Message-
> From: Frank Schilder [mailto:fr...@dtu.dk]
> Sent: maandag 13 juli 2020 9:28
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: Poor Windows performance on ceph RBD.
>
> To anyone who is following this thread, we found a possible explanation
> for (some of) our observations.
>
> We are running Windows Server versions 2016 and 2019 as storage servers
> exporting data on an rbd image/disk. We recently found that Windows
> Server 2016 runs fine. It is still not as fast as Linux + SAMBA share on
> an rbd image (ca. 50%), but runs with a reasonable sustained bandwidth.
> With Windows Server 2019, however, we observe a near-complete stall of
> file transfers and time-outs using standard copy tools (robocopy).
We > don't have an explanation yet and are downgrading Windows servers where > possible. > > If anyone has a hint what we can do, please let us know. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > ___ > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an > email to ceph-users-le...@ceph.io > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io -- Dipl.-Inf. André Gemünd, Leiter IT / Head of IT Fraunhofer-Institute for Algorithms and Scientific Computing andre.gemu...@scai.fraunhofer.de Tel: +49 2241 14-2193 /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] OSD memory leak?
lication
MALLOC: +       3727360 (    3.6 MiB) Bytes in page heap freelist
MALLOC: +      25493688 (   24.3 MiB) Bytes in central cache freelist
MALLOC: +      17101824 (   16.3 MiB) Bytes in transfer cache freelist
MALLOC: +      20301904 (   19.4 MiB) Bytes in thread cache freelists
MALLOC: +       5242880 (    5.0 MiB) Bytes in malloc metadata
MALLOC:
MALLOC: =    1245863936 ( 1188.1 MiB) Actual memory used (physical + swap)
MALLOC: +      20488192 (   19.5 MiB) Bytes released to OS (aka unmapped)
MALLOC:
MALLOC: =    1266352128 ( 1207.7 MiB) Virtual address space used
MALLOC:
MALLOC:           54160 Spans in use
MALLOC:              33 Thread heaps in use
MALLOC:            8192 Tcmalloc page size

Am I looking at a memory leak here or are these heap stats expected? I don't mind the swap usage, it doesn't have impact. I'm just wondering if I need to restart OSDs regularly. The "leakage" above occurred within only 2 months.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
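The first heap-stats line above ("Bytes in use by app...lication") was truncated in the archive, but tcmalloc's accounting makes the missing number recoverable: "Actual memory used" is the sum of the bytes in use by the application, the freelists and the malloc metadata, and adding the unmapped bytes gives the virtual address space figure. A small sketch (my own reconstruction under that assumption, using the figures that did survive):

```python
# Surviving figures from the truncated heap stats above.
actual_used   = 1245863936  # "Actual memory used (physical + swap)"
freelists     = 3727360 + 25493688 + 17101824 + 20301904  # page heap, central, transfer, thread caches
metadata      = 5242880     # "Bytes in malloc metadata"
unmapped      = 20488192    # "Bytes released to OS (aka unmapped)"
virtual_space = 1266352128  # "Virtual address space used"

# Recover the truncated "Bytes in use by application" value:
in_use_by_app = actual_used - freelists - metadata
print(in_use_by_app)                    # -> 1173996280 (about 1119.6 MiB)
print(round(in_use_by_app / 2**20, 1))  # -> 1119.6

# Cross-check: used memory plus unmapped bytes equals the virtual figure.
assert actual_used + unmapped == virtual_space
```

The internal consistency (and the tiny 19.5 MiB unmapped figure) suggests these stats are plausible rather than obviously leaky, in contrast to the wrapped thread-cache counter seen on OSD 256 later in the thread.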