[ceph-users] Getting rid of prometheus messages in /var/log/messages
Hello,

/var/log/messages on the machines in our Ceph cluster is inundated with entries from Prometheus scraping ("GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.11.1"). Is it possible to configure Ceph not to send these to syslog? If not, can I configure something so that none of the ceph-mgr messages go to syslog and they only go to /var/log/ceph/ceph-mgr.log?

Thanks,

Vlad
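One untested workaround, assuming the entries reach /var/log/messages through rsyslog rather than being written by Ceph directly, is to drop them with a message filter (the file name below is just an example):

  # /etc/rsyslog.d/30-drop-ceph-metrics-scrapes.conf
  # discard any syslog message containing the Prometheus scrape access line
  :msg, contains, "GET /metrics HTTP/1.1" stop

  systemctl restart rsyslog

This only hides the noise at the syslog layer; it does not change what ceph-mgr itself logs.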
Re: [ceph-users] Unexpected increase in the memory usage of OSDs
Best I can tell, automatic cache sizing is enabled and all related settings are at their default values.

Looking through the cache tunables, I came across osd_memory_expected_fragmentation, which the docs define as "estimate the percent of memory fragmentation". What is the formula for computing the actual percentage of memory fragmentation? Based on /proc/buddyinfo, I suspect that our memory fragmentation is a lot worse than the osd_memory_expected_fragmentation default of 0.15. Could this be related to many OSDs' RSS far exceeding osd_memory_target?

So far the high memory consumption hasn't been a problem for us. (I guess it's possible that the kernel simply sees no need to reclaim unmapped memory until there is real memory pressure?) It's just a little scary not understanding why this started happening when memory usage had been so stable before.

Thanks,

Vlad

On 10/9/19 11:51 AM, Gregory Farnum wrote:

On Mon, Oct 7, 2019 at 7:20 AM Vladimir Brik wrote:
> Do you have statistics on the size of the OSDMaps or count of them
> which were being maintained by the OSDs?

No, I don't think so. How can I find this information?

Hmm, I don't know if we directly expose the size of maps. There are perf counters which expose the range of maps being kept around, but I don't know their names off-hand. Maybe it's something else involving the bluestore cache or whatever; if you're not using the newer memory limits I'd switch to those, but otherwise I dunno.
-Greg

Memory consumption started to climb again: https://icecube.wisc.edu/~vbrik/graph-3.png

Some more info (not sure whether it's relevant): I increased the size of the swap on the servers to 10GB and it is being completely utilized, even though there is still quite a bit of free memory. Memory appears to be highly fragmented on NUMA node 0 of all the servers; some of the servers have no free pages higher than order 0. (Memory on NUMA node 1 of the servers appears much less fragmented.) The servers have 192GB of RAM and 2 NUMA nodes.

Vlad

On 10/4/19 6:09 PM, Gregory Farnum wrote:

Do you have statistics on the size of the OSDMaps or count of them which were being maintained by the OSDs? I'm not sure why having noout set would change that if all the nodes were alive, but that's my bet.
-Greg

On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik wrote:

And, just as unexpectedly, things have returned to normal overnight: https://icecube.wisc.edu/~vbrik/graph-1.png

The change seems to have coincided with the beginning of Rados Gateway activity (before, it was essentially zero). I can see nothing in the logs that would explain what happened, though.

Vlad

On 10/2/19 3:43 PM, Vladimir Brik wrote:

Hello,

I am running a Ceph 14.2.2 cluster, and a few days ago the memory consumption of our OSDs started to unexpectedly grow on all 5 nodes, after being stable for about 6 months.

Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very light (typically <10 iops) during this period, and the number of objects stayed about the same. The only unusual occurrence was the reboot of one of the nodes the day before (for a firmware update). For the reboot, I ran "ceph osd set noout" but forgot to unset it until several days later. Unsetting noout did not stop the increase in memory consumption. I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. The resident set size of SSD OSDs is about 3.7GB. The resident set size of HDD OSDs varies from about 5GB to 12GB; I don't know why there is such a big spread. All HDDs are 10TB, 72-76% utilized, with 101-104 PGs.

Does anybody know what might be the problem here and how to address or debug it?

Thanks very much,

Vlad
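A few commands that may help pin down where the memory is going (osd.0 is just an example; dump_mempools and per-daemon config get are standard admin-socket commands):

  # the target the priority cache manager is trying to stay under
  ceph daemon osd.0 config get osd_memory_target

  # per-category memory accounting inside the OSD (bluestore caches, pglog, osdmaps, ...)
  ceph daemon osd.0 dump_mempools

  # free pages per order on each NUMA node; long runs of zeros on the right
  # indicate heavy fragmentation
  cat /proc/buddyinfo

If the mempool totals are far below the RSS, the difference is likely heap that the allocator has not returned to the kernel rather than data the OSD is actively tracking.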
Re: [ceph-users] Unexpected increase in the memory usage of OSDs
> Do you have statistics on the size of the OSDMaps or count of them
> which were being maintained by the OSDs?

No, I don't think so. How can I find this information?

Memory consumption started to climb again: https://icecube.wisc.edu/~vbrik/graph-3.png

Some more info (not sure whether it's relevant): I increased the size of the swap on the servers to 10GB and it is being completely utilized, even though there is still quite a bit of free memory. Memory appears to be highly fragmented on NUMA node 0 of all the servers; some of the servers have no free pages higher than order 0. (Memory on NUMA node 1 of the servers appears much less fragmented.) The servers have 192GB of RAM and 2 NUMA nodes.

Vlad

On 10/4/19 6:09 PM, Gregory Farnum wrote:

Do you have statistics on the size of the OSDMaps or count of them which were being maintained by the OSDs? I'm not sure why having noout set would change that if all the nodes were alive, but that's my bet.
-Greg

On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik wrote:

And, just as unexpectedly, things have returned to normal overnight: https://icecube.wisc.edu/~vbrik/graph-1.png

The change seems to have coincided with the beginning of Rados Gateway activity (before, it was essentially zero). I can see nothing in the logs that would explain what happened, though.

Vlad

On 10/2/19 3:43 PM, Vladimir Brik wrote:

Hello,

I am running a Ceph 14.2.2 cluster, and a few days ago the memory consumption of our OSDs started to unexpectedly grow on all 5 nodes, after being stable for about 6 months.

Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very light (typically <10 iops) during this period, and the number of objects stayed about the same. The only unusual occurrence was the reboot of one of the nodes the day before (for a firmware update). For the reboot, I ran "ceph osd set noout" but forgot to unset it until several days later. Unsetting noout did not stop the increase in memory consumption. I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. The resident set size of SSD OSDs is about 3.7GB. The resident set size of HDD OSDs varies from about 5GB to 12GB; I don't know why there is such a big spread. All HDDs are 10TB, 72-76% utilized, with 101-104 PGs.

Does anybody know what might be the problem here and how to address or debug it?

Thanks very much,

Vlad
Re: [ceph-users] Unexpected increase in the memory usage of OSDs
And, just as unexpectedly, things have returned to normal overnight: https://icecube.wisc.edu/~vbrik/graph-1.png

The change seems to have coincided with the beginning of Rados Gateway activity (before, it was essentially zero). I can see nothing in the logs that would explain what happened, though.

Vlad

On 10/2/19 3:43 PM, Vladimir Brik wrote:

Hello,

I am running a Ceph 14.2.2 cluster, and a few days ago the memory consumption of our OSDs started to unexpectedly grow on all 5 nodes, after being stable for about 6 months.

Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very light (typically <10 iops) during this period, and the number of objects stayed about the same. The only unusual occurrence was the reboot of one of the nodes the day before (for a firmware update). For the reboot, I ran "ceph osd set noout" but forgot to unset it until several days later. Unsetting noout did not stop the increase in memory consumption. I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. The resident set size of SSD OSDs is about 3.7GB. The resident set size of HDD OSDs varies from about 5GB to 12GB; I don't know why there is such a big spread. All HDDs are 10TB, 72-76% utilized, with 101-104 PGs.

Does anybody know what might be the problem here and how to address or debug it?

Thanks very much,

Vlad
[ceph-users] Unexpected increase in the memory usage of OSDs
Hello,

I am running a Ceph 14.2.2 cluster, and a few days ago the memory consumption of our OSDs started to unexpectedly grow on all 5 nodes, after being stable for about 6 months.

Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very light (typically <10 iops) during this period, and the number of objects stayed about the same. The only unusual occurrence was the reboot of one of the nodes the day before (for a firmware update). For the reboot, I ran "ceph osd set noout" but forgot to unset it until several days later. Unsetting noout did not stop the increase in memory consumption. I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. The resident set size of SSD OSDs is about 3.7GB. The resident set size of HDD OSDs varies from about 5GB to 12GB; I don't know why there is such a big spread. All HDDs are 10TB, 72-76% utilized, with 101-104 PGs.

Does anybody know what might be the problem here and how to address or debug it?

Thanks very much,

Vlad
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
I created a ticket: https://tracker.ceph.com/issues/41511

Note that I think I was mistaken when I said that sometimes the problem goes away on its own. I've looked back through our monitoring, and it looks like when the problem did go away, it was because either the machine was rebooted or the radosgw service was restarted.

Vlad

On 8/23/19 10:17 AM, Eric Ivancich wrote:

Good morning, Vladimir,

Please create a tracker for this (https://tracker.ceph.com/projects/rgw/issues/new) and include the link to it in an email reply. And if you can include any more potentially relevant details, please do so. I'll add my initial analysis to it.

But the threads do seem to be stuck, at least for a while, in get_obj_data::flush despite a lack of traffic. And sometimes it self-resolves, so it's not a true "infinite loop".

Thank you,

Eric

On Aug 22, 2019, at 9:12 PM, Eric Ivancich wrote:

Thank you for providing the profiling data, Vladimir.

There are 5078 threads and most of them are waiting. Here is a list of the deepest call of each thread with duplicates removed:

+ 100.00% epoll_wait
+ 100.00% get_obj_data::flush(rgw::OwningList&&)
+ 100.00% poll
+ 100.00% poll
+ 100.00% poll
+ 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
+ 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
+ 100.00% pthread_cond_wait@@GLIBC_2.3.2
+ 100.00% pthread_cond_wait@@GLIBC_2.3.2
+ 100.00% read
+ 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

The only interesting ones are the second and last:

* get_obj_data::flush(rgw::OwningList&&)
* _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

They are essentially part of the same call stack that results from processing a GetObj request, and five threads are in this call stack (the only difference is whether or not they include the call into the boost intrusive list). Here's the full call stack of those threads:

+ 100.00% clone
+ 100.00% start_thread
+ 100.00% worker_thread
+ 100.00% process_new_connection
+ 100.00% handle_request
+ 100.00% RGWCivetWebFrontend::process(mg_connection*)
+ 100.00% process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)
+ 100.00% rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)
+ 100.00% RGWGetObj::execute()
+ 100.00% RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)
+ 100.00% RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)
+ 100.00% _get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
+ 100.00% RGWRados::get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
+ 100.00% get_obj_data::flush(rgw::OwningList&&)
+ 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_

So this isn't background processing but request processing. I'm not clear why these requests are consuming so much CPU for so long. From your initial message:

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, the radosgw process on those machines starts consuming 100% of 5 CPU cores for days at a time, even though the machine is not being used for data transfers (nothing in the radosgw logs, a couple of KB/s of network). This situation can affect any number of our rados gateways, lasts from a few hours to a few days, and stops if the radosgw process is restarted or on its own.

I'm going to check with others who're more familiar with this code path.

Begin forwarded message:

From: Vladimir Brik
Subject: Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
Date: August 21, 2019 at 4:47:01 PM EDT
To: "J. Eric Ivancich", Mark Nelson, ceph-users@lists.ceph.com
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
> Are you running multisite?

No

> Do you have dynamic bucket resharding turned on?

Yes. "radosgw-admin reshard list" prints "[]"

> Are you using lifecycle?

I am not sure. How can I check? "radosgw-admin lc list" says "[]"

> And just to be clear -- sometimes all 3 of your rados gateways are
> simultaneously in this state?

Multiple, but I have not seen all 3 in this state simultaneously. Currently one gateway has 1 thread using 100% of a CPU core, and another has 5 threads each using 100% of a core.

Here are the fruits of my attempts to capture the call graph using perf and gdbpmp:

https://icecube.wisc.edu/~vbrik/perf.data
https://icecube.wisc.edu/~vbrik/gdbpmp.data

These are the commands that I ran and their outputs (note that I couldn't get perf not to generate the warning):

rgw-3 gdbpmp # ./gdbpmp.py -n 100 -p 73688 -o gdbpmp.data
Attaching to process 73688...Done.
Gathering Samples
Profiling complete with 100 samples.

rgw-3 ~ # perf record --call-graph fp -p 73688 -- sleep 10
[ perf record: Woken up 54 times to write data ]
Warning: Processed 574207 events and lost 4 chunks! Check IO/CPU overload!
[ perf record: Captured and wrote 58.866 MB perf.data (233750 samples) ]

Vlad

On 8/21/19 11:16 AM, J. Eric Ivancich wrote:

On 8/21/19 10:22 AM, Mark Nelson wrote:

Hi Vladimir,

On 8/21/19 8:54 AM, Vladimir Brik wrote:

Hello

[much elided]

You might want to try grabbing a callgraph from perf instead of just running perf top, or using my wallclock profiler, to see if you can drill down and find out where in that method it's spending the most time.

I agree with Mark -- a call graph would be very helpful in tracking down what's happening.

There are background tasks that run. Are you running multisite? Do you have dynamic bucket resharding turned on? Are you using lifecycle? And garbage collection is another background task.

And just to be clear -- sometimes all 3 of your rados gateways are simultaneously in this state?

But the call graph would be incredibly helpful.

Thank you,

Eric
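To read the captured data back with the call chains expanded, something like this should work against the perf.data file produced above:

  perf report -i perf.data -g --stdio | less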
[ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space
Hello,

After increasing the number of PGs in a pool, ceph status is reporting "Degraded data redundancy (low space): 1 pg backfill_toofull", but I don't understand why, because all OSDs seem to have enough space.

ceph health detail says:
pg 40.155 is active+remapped+backfill_toofull, acting [20,57,79,85]

$ ceph pg map 40.155
osdmap e3952 pg 40.155 (40.155) -> up [20,57,66,85] acting [20,57,79,85]

So I guess Ceph wants to move 40.155 from OSD 66 to 79 (or the other way around?). According to "ceph osd df", OSD 66's utilization is 71.90% and OSD 79's utilization is 58.45%. The OSD with the least free space in the cluster is 81.23% full, and it is not any of the ones above.

The OSD backfillfull_ratio is 90% (is there a better way to determine this?):

$ ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.7

Does anybody know why a PG could be in the backfill_toofull state if no OSD is in the backfillfull state?

Vlad
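Two commands that may help dig a bit further (the PG and OSD ids are the ones from the message; output details vary by release):

  # per-OSD utilization and PG counts, grouped by host
  ceph osd df tree

  # full peering/recovery state of the PG, including the up/acting sets it
  # is moving between and which OSDs are reported as blocking it
  ceph pg 40.155 query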
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
Correction: the number of threads stuck using 100% of a CPU core varies from 1 to 5 (it's not always 5).

Vlad

On 8/21/19 8:54 AM, Vladimir Brik wrote:

Hello,

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, the radosgw process on those machines starts consuming 100% of 5 CPU cores for days at a time, even though the machine is not being used for data transfers (nothing in the radosgw logs, a couple of KB/s of network). This situation can affect any number of our rados gateways, lasts from a few hours to a few days, and stops if the radosgw process is restarted or on its own.

Does anybody have an idea what might be going on or how to debug it? I don't see anything obvious in the logs.

perf top is saying that the CPU is consumed by the radosgw shared object in the symbol get_obj_data::flush which, if I interpret things correctly, is called from a symbol with a long name that contains the substring "boost9intrusive9list_impl".

This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s ssl_certificate=/etc/ceph/rgw.crt error_log_file=/var/log/ceph/civetweb.error.log

(the error log file doesn't exist)

Thanks,

Vlad
[ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
Hello,

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, the radosgw process on those machines starts consuming 100% of 5 CPU cores for days at a time, even though the machine is not being used for data transfers (nothing in the radosgw logs, a couple of KB/s of network). This situation can affect any number of our rados gateways, lasts from a few hours to a few days, and stops if the radosgw process is restarted or on its own.

Does anybody have an idea what might be going on or how to debug it? I don't see anything obvious in the logs.

perf top is saying that the CPU is consumed by the radosgw shared object in the symbol get_obj_data::flush which, if I interpret things correctly, is called from a symbol with a long name that contains the substring "boost9intrusive9list_impl".

This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s ssl_certificate=/etc/ceph/rgw.crt error_log_file=/var/log/ceph/civetweb.error.log

(the error log file doesn't exist)

Thanks,

Vlad
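A quick way to see which threads inside radosgw are busy, and to sample only that process with call graphs (the PID lookup assumes a single radosgw process per host):

  # per-thread CPU usage of the running radosgw
  top -H -p "$(pgrep -x radosgw)"

  # sample just that process, recording call chains
  perf top -g -p "$(pgrep -x radosgw)"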
[ceph-users] radosgw daemons constantly reading default.rgw.log pool
Hello,

I have set up a rados gateway using "ceph-deploy rgw create" (default pools, 3 machines acting as gateways) on Ceph 13.2.5. For over 2 weeks now, the three rados gateways have been generating a constant ~30MB/s and ~4K ops/s of read I/O against default.rgw.log, even though nothing is using the rados gateways.

Nothing in the logs except the occasional:
7fbce9329700 0 RGWReshardLock::lock failed to acquire lock on reshard.00 ret=-16

Anybody know what might be going on?

Thanks,

Vlad
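Some ways to confirm where the reads land and what is issuing them (the admin-socket path below is a guess and depends on how the gateway was deployed):

  # per-pool client I/O as the cluster sees it
  ceph osd pool stats default.rgw.log

  # objects in the pool; the reshard- and gc-related ones are usually easy to spot by name
  rados -p default.rgw.log ls | head -50

  # in-flight RADOS operations issued by a gateway (adjust the socket/daemon name)
  ceph daemon /var/run/ceph/ceph-client.rgw.rgw-1.asok objecter_requests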
[ceph-users] Restricting access to RadosGW/S3 buckets
Hello,

I am trying to figure out a way to restrict access to S3 buckets. Is it possible to create a RadosGW user that can only access specific bucket(s)?

Thanks,

Vlad
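One approach that should work on recent RGW releases is an S3 bucket policy that names the RGW user as the principal. An untested sketch, applied by the bucket owner ("appuser" and "mybucket" are made-up names):

  cat > policy.json <<'EOF'
  {
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"AWS": ["arn:aws:iam:::user/appuser"]},
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::mybucket", "arn:aws:s3:::mybucket/*"]
    }]
  }
  EOF
  s3cmd setpolicy policy.json s3://mybucket

Since an RGW user cannot touch other users' buckets unless explicitly granted access, creating one user per application and granting it only its own buckets this way keeps each user confined.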
[ceph-users] Bluestore nvme DB/WAL size
Hello,

I am considering using logical volumes of an NVMe drive as DB or WAL devices for OSDs on spinning disks. The documentation recommends against DB devices smaller than 4% of the slow disk's size. Our servers have 16x 10TB HDDs and a single 1.5TB NVMe, so dividing it equally will result in each OSD getting a ~90GB DB NVMe volume, which is a lot less than 4%. Will this cause problems down the road?

Thanks,

Vlad
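For reference, 4% of a 10TB HDD would be ~400GB, so 90GB is indeed well short of the recommendation. A sketch of how such an OSD could be created, assuming the NVMe is carved into one LV per OSD (device, VG, and LV names are examples):

  # one 90G DB logical volume per OSD on the shared NVMe
  vgcreate ceph-db /dev/nvme0n1
  lvcreate -L 90G -n db-sda ceph-db

  # create the OSD with its data on the HDD and its RocksDB on the NVMe LV
  ceph-volume lvm create --bluestore --data /dev/sda --block.db ceph-db/db-sda

One caveat worth checking: on releases of that era RocksDB tends to use DB space in level-sized jumps, so a 90GB partition may effectively behave like a much smaller one before metadata spills over to the HDD; the bluefs perf counters (db_used_bytes / slow_used_bytes) show whether spillover is happening.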
[ceph-users] Scrub behavior
Hello,

I am experimenting with how Ceph (13.2.2) deals with on-disk data corruption, and I've run into some unexpected behavior. I am wondering if somebody could comment on whether I understand things correctly.

In my tests I would dd /dev/urandom onto an OSD's disk and see what would happen. I don't fill up the entire disk (that causes the OSD to crash), and I choose an OSD that is pretty full.

It looks like regular scrubs don't detect any problems at all, and I actually don't see any disk activity. So I guess only the stuff that is in memory is getting scrubbed?

When I initiate a deep scrub of an OSD, it looks like only the PGs for which that OSD is the primary are checked. Is this correct? If so, how is corruption of the other PGs on that OSD detected?

Thanks,

Vlad
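"ceph osd deep-scrub" only covers PGs for which that OSD is primary, which matches the observation above, but individual PGs can be deep-scrubbed directly, so every PG that includes the OSD can be exercised regardless of who is primary (osd.12 and pg 2.1f are made-up ids):

  # every PG whose acting set contains osd.12, primary or not
  ceph pg ls-by-osd 12

  # deep-scrub one specific PG; its primary coordinates the scrub and
  # compares the copies/shards on all acting OSDs, including osd.12
  ceph pg deep-scrub 2.1f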
Re: [ceph-users] What could cause mon_osd_full_ratio to be exceeded?
> Why didn't it stop at mon_osd_full_ratio (90%)

Should be 95%

Vlad

On 11/26/18 9:28 AM, Vladimir Brik wrote:

Hello,

I am doing some Ceph testing on a near-full cluster, and I noticed that, after I brought down a node, some OSDs' utilization reached osd_failsafe_full_ratio (97%). Why didn't it stop at mon_osd_full_ratio (90%) if mon_osd_backfillfull_ratio is 90%?

Thanks,

Vlad
[ceph-users] What could cause mon_osd_full_ratio to be exceeded?
Hello,

I am doing some Ceph testing on a near-full cluster, and I noticed that, after I brought down a node, some OSDs' utilization reached osd_failsafe_full_ratio (97%). Why didn't it stop at mon_osd_full_ratio (90%) if mon_osd_backfillfull_ratio is 90%?

Thanks,

Vlad
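The ratios that are actually enforced live in the OSDMap and can be checked and adjusted at runtime (the values below are just the usual defaults, not recommendations):

  # what the cluster is currently enforcing
  ceph osd dump | grep ratio

  # runtime adjustment, if needed
  ceph osd set-nearfull-ratio 0.85
  ceph osd set-backfillfull-ratio 0.90
  ceph osd set-full-ratio 0.95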
[ceph-users] How many PGs per OSD is too many?
Hello,

I have a Ceph 13.2.2 cluster comprised of 5 hosts, each with 16 HDDs and 4 SSDs. HDD OSDs have about 50 PGs each, while SSD OSDs have about 400 PGs each (a lot more pools use SSDs than HDDs). The servers are fairly powerful: 48 HT cores, 192GB of RAM, and 2x25Gbps Ethernet.

The impression I got from the docs is that having more than 200 PGs per OSD is not a good thing, but the justifications were vague (no concrete numbers): increased peering time, increased resource consumption, and possibly decreased recovery performance. None of these appeared to be a significant problem in my testing, but the tests were very basic and done on a pretty empty cluster under minimal load, so I worry I'll run into trouble down the road.

Here are the questions I have:
- In practice, is it a big deal that some OSDs have ~400 PGs?
- In what situations would our cluster most likely fare significantly better if I went through the trouble of re-creating pools so that no OSD would have more than, say, ~100 PGs?
- What performance metrics could I monitor to detect possible issues due to having too many PGs?

Thanks,

Vlad
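For reference, the actual per-OSD PG counts are visible in the PGS column of "ceph osd df", so the HDD/SSD spread can be watched directly (the awk one-liner is a convenience sketch; column positions can shift between releases):

  # per-OSD utilization and PG count, grouped by failure domain
  ceph osd df tree

  # just "osd id, pg count" for OSD rows, highest counts last
  ceph osd df | awk '$1 ~ /^[0-9]+$/ {print $1, $NF}' | sort -k2 -n | tail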
[ceph-users] Erasure coding with more chunks than servers
Hello,

I have a 5-server cluster and I am wondering if it's possible to create a pool that uses a k=5 m=2 erasure code. In my experiments, I ended up with pools whose PGs are stuck in the creating+incomplete state, even when I created the erasure code profile with --crush-failure-domain=osd.

Assuming that what I want to do is possible, will CRUSH distribute chunks evenly among servers, so that if I need to bring one server down (e.g. for a reboot), clients' ability to write or read any object would not be disrupted? (I guess something would need to ensure that no server holds more than two chunks of an object.)

Thanks,

Vlad
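For what it's worth, a minimal sequence to create and sanity-check such a pool looks roughly like this (names and PG counts are examples; the profile must exist before the pool is created, since the pool's CRUSH rule is generated from it). Guaranteeing "no more than two chunks per host" generally needs a hand-edited CRUSH rule that first chooses hosts and then picks two OSDs within each, rather than the rule the profile generates on its own.

  # profile with per-OSD failure domain (7 chunks cannot map onto 5 hosts otherwise)
  ceph osd erasure-code-profile set ec-5-2 k=5 m=2 crush-failure-domain=osd
  ceph osd erasure-code-profile get ec-5-2

  # pool created against that profile
  ceph osd pool create ecpool 128 128 erasure ec-5-2

  # verify the PGs actually reach active+clean
  ceph pg ls-by-pool ecpool | head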
[ceph-users] NVMe SSD not assigned "nvme" device class
Hello,

It looks like Ceph (13.2.2) assigns the device class "ssd" to our Samsung PM1725a NVMe SSDs instead of "nvme". Is that a bug, or is the "nvme" class reserved for a different kind of device?

Vlad
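The class can be reassigned by hand if desired (osd.12 is a made-up id; the existing class has to be removed before a new one can be set, and any CRUSH rules that match on class "ssd" would stop selecting the device afterwards):

  ceph osd crush rm-device-class osd.12
  ceph osd crush set-device-class nvme osd.12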
Re: [ceph-users] Problems after increasing number of PGs in a pool
Thanks to everybody who responded. The problem was, indeed, that I hit the limit on the number of PGs per SSD OSD when I increased the number of PGs in a pool.

One question though: should I have received a warning that some OSDs were close to their maximum PG limit? A while back, in a Luminous test pool, I remember seeing something like "too many PGs per OSD" in some of my testing, but not this time (perhaps because this time I hit the limit during the resizing operation). Where might such a warning be recorded, if not in "ceph status"?

Thanks,

Vlad

On 09/28/2018 01:04 PM, Paul Emmerich wrote:
> I guess the pool is mapped to SSDs only from the name and you only got 20 SSDs.
> So you should have about ~2000 effective PGs taking replication into account.
>
> Your pool has ~10k effective PGs with k+m=5 and you seem to have 5 more pools.
>
> Check "ceph osd df tree" to see how many PGs per OSD you got.
>
> Try increasing these two options to "fix" it:
>
> mon max pg per osd
> osd max pg per osd hard ratio
>
> Paul
>
> Am Fr., 28. Sep. 2018 um 18:05 Uhr schrieb Vladimir Brik:
>>
>> Hello
>>
>> I've attempted to increase the number of placement groups of the pools
>> in our test cluster and now ceph status (below) is reporting problems. I
>> am not sure what is going on or how to fix this. Troubleshooting
>> scenarios in the docs don't seem to quite match what I am seeing.
>>
>> I have no idea how to begin to debug this. I see OSDs listed in
>> "blocked_by" of pg dump, but don't know how to interpret that. Could
>> somebody assist please?
>>
>> I attached output of "ceph pg dump_stuck -f json-pretty" just in case.
>>
>> The cluster consists of 5 hosts, each with 16 HDDs and 4 SSDs. I am
>> running 13.2.2.
>>
>> This is the affected pool:
>> pool 6 'fs-data-ec-ssd' erasure size 5 min_size 4 crush_rule 6
>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 2493 lfor
>> 0/2491 flags hashpspool,ec_overwrites stripe_width 12288 application cephfs
>>
>> Thanks,
>>
>> Vlad
>>
>> ceph health
>>
>>   cluster:
>>     id: 47caa1df-42be-444d-b603-02cad2a7fdd3
>>     health: HEALTH_WARN
>>             Reduced data availability: 155 pgs inactive, 47 pgs peering, 64 pgs stale
>>             Degraded data redundancy: 321039/114913606 objects degraded (0.279%), 108 pgs degraded, 108 pgs undersized
>>
>>   services:
>>     mon: 5 daemons, quorum ceph-1,ceph-2,ceph-3,ceph-4,ceph-5
>>     mgr: ceph-3(active), standbys: ceph-2, ceph-5, ceph-1, ceph-4
>>     mds: cephfs-1/1/1 up {0=ceph-5=up:active}, 4 up:standby
>>     osd: 100 osds: 100 up, 100 in; 165 remapped pgs
>>
>>   data:
>>     pools:   6 pools, 5120 pgs
>>     objects: 22.98 M objects, 88 TiB
>>     usage:   154 TiB used, 574 TiB / 727 TiB avail
>>     pgs:     3.027% pgs not active
>>              321039/114913606 objects degraded (0.279%)
>>              4903 active+clean
>>              105  activating+undersized+degraded+remapped
>>              61   stale+active+clean
>>              47   remapped+peering
>>              3    stale+activating+undersized+degraded+remapped
>>              1    active+clean+scrubbing+deep
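In case it helps anyone else, the two options Paul mentioned can be set through the config store on mimic; the values below are only examples, not recommendations:

  # raise the soft per-OSD PG limit that the mon/mgr checks against
  ceph config set global mon_max_pg_per_osd 400

  # raise the hard multiple beyond which OSDs refuse to create new PGs
  ceph config set global osd_max_pg_per_osd_hard_ratio 4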
[ceph-users] Problems after increasing number of PGs in a pool
Hello,

I've attempted to increase the number of placement groups of the pools in our test cluster, and now ceph status (below) is reporting problems. I am not sure what is going on or how to fix this. The troubleshooting scenarios in the docs don't seem to quite match what I am seeing.

I have no idea how to begin to debug this. I see OSDs listed in "blocked_by" of pg dump, but don't know how to interpret that. Could somebody assist, please?

I attached the output of "ceph pg dump_stuck -f json-pretty" just in case.

The cluster consists of 5 hosts, each with 16 HDDs and 4 SSDs. I am running 13.2.2.

This is the affected pool:
pool 6 'fs-data-ec-ssd' erasure size 5 min_size 4 crush_rule 6 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 2493 lfor 0/2491 flags hashpspool,ec_overwrites stripe_width 12288 application cephfs

Thanks,

Vlad

ceph health

  cluster:
    id: 47caa1df-42be-444d-b603-02cad2a7fdd3
    health: HEALTH_WARN
            Reduced data availability: 155 pgs inactive, 47 pgs peering, 64 pgs stale
            Degraded data redundancy: 321039/114913606 objects degraded (0.279%), 108 pgs degraded, 108 pgs undersized

  services:
    mon: 5 daemons, quorum ceph-1,ceph-2,ceph-3,ceph-4,ceph-5
    mgr: ceph-3(active), standbys: ceph-2, ceph-5, ceph-1, ceph-4
    mds: cephfs-1/1/1 up {0=ceph-5=up:active}, 4 up:standby
    osd: 100 osds: 100 up, 100 in; 165 remapped pgs

  data:
    pools:   6 pools, 5120 pgs
    objects: 22.98 M objects, 88 TiB
    usage:   154 TiB used, 574 TiB / 727 TiB avail
    pgs:     3.027% pgs not active
             321039/114913606 objects degraded (0.279%)
             4903 active+clean
             105  activating+undersized+degraded+remapped
             61   stale+active+clean
             47   remapped+peering
             3    stale+activating+undersized+degraded+remapped
             1    active+clean+scrubbing+deep
[ceph-users] CephFS on a mixture of SSDs and HDDs
Hello,

I am setting up a new Ceph cluster (probably Mimic) made up of servers that have a mixture of solid state and spinning disks. I'd like CephFS to store the data of some of our applications only on SSDs, and the data of other applications only on HDDs.

Is there a way of doing this without running multiple filesystems within the same cluster? (E.g. something like configuring CephFS to store the data of some directory trees in an SSD pool, and of others in an HDD pool.)

If not, can anybody comment on their experience running multiple file systems in a single cluster? Are there any known issues (I am only aware of some issues related to security)? Does anybody know if support/testing of multiple filesystems in a cluster is something actively being worked on, and if it might stop being "experimental" in the near future?

Thanks very much,

Vlad
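Directory-level placement is possible with a single filesystem by combining device-class CRUSH rules, per-class data pools, and CephFS file layouts. A rough sketch (pool, rule, and mount-point names are invented; setting the layout xattr requires a mounted client with the attr tools installed):

  # replicated rules restricted to each device class
  ceph osd crush rule create-replicated rep-ssd default host ssd
  ceph osd crush rule create-replicated rep-hdd default host hdd

  # data pools using those rules
  ceph osd pool create cephfs-data-ssd 128 128 replicated rep-ssd
  ceph osd pool create cephfs-data-hdd 512 512 replicated rep-hdd

  # make the extra pool usable by the filesystem
  ceph fs add_data_pool cephfs cephfs-data-ssd

  # new files created under this directory inherit the SSD pool via the file layout
  setfattr -n ceph.dir.layout.pool -v cephfs-data-ssd /mnt/cephfs/fast-apps

Note that the layout only affects files created after it is set; existing files keep whatever pool they were written to.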