Re: [ceph-users] PG inconsistent with error "size_too_large"
As I wrote here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html I saw the same after an update from Luminous to Nautilus 14.2.6.

Cheers, Massimo

On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan wrote:
> Hi,
>
> I am getting one inconsistent object on our cluster with an inconsistency
> error that I haven’t seen before. This started happening during a rolling
> upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s
> related.
>
> I was hoping to know what the error means before trying a repair.
>
> [full "ceph health detail" and "rados list-inconsistent-obj 9.20e" output
> trimmed; it is included in full in Liam's original message elsewhere in
> this digest]
>
> Thanks,
> Liam
> —
> Senior Developer
> Institute for Advanced Computer Studies
> University of Maryland

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] One lost cephfs data object
Hi all,

I'm on 13.2.6. My cephfs has managed to lose one single object from its data pool. All the cephfs docs I'm finding show me how to recover from an entire lost PG, but the rest of the PG checks out as far as I can tell. How can I track down which file that object belongs to?

I'm missing "102e2aa.3721" in pg 16.d7. Pool 16 is an EC cephfs data pool called cephfs_ecdata (this data pool is assigned to a directory by ceph.dir.layout). We store backups in this data pool, so we'll likely be fine just deleting the file.

# ceph health detail
HEALTH_ERR 60758/81263036 objects misplaced (0.075%); 1/16673236 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 1/81263036 objects degraded (0.000%), 1 pg degraded
OBJECT_MISPLACED 60758/81263036 objects misplaced (0.075%)
OBJECT_UNFOUND 1/16673236 objects unfound (0.000%)
    pg 16.d7 has 1 unfound objects
PG_DAMAGED Possible data damage: 1 pg recovery_unfound
    pg 16.d7 is active+recovery_unfound+degraded+remapped, acting [48,8,30,11,42], 1 unfound
PG_DEGRADED Degraded data redundancy: 1/81263036 objects degraded (0.000%), 1 pg degraded
    pg 16.d7 is active+recovery_unfound+degraded+remapped, acting [48,8,30,11,42], 1 unfound

# ceph pg 16.d7 list_missing
{
    "offset": { "oid": "", "key": "", "snapid": 0, "hash": 0, "max": 0, "pool": -9223372036854775808, "namespace": "" },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": { "oid": "102e2aa.3721", "key": "", "snapid": -2, "hash": 2685987031, "max": 0, "pool": 16, "namespace": "" },
            "need": "41610'2203339",
            "have": "0'0",
            "flags": "none",
            "locations": [ "42(4)" ]
        }
    ],
    "more": false
}

At one point this object showed its map as

# ceph osd map cephfs_ecdata "102e2aa.3721"
osdmap e45659 pool 'cephfs_ecdata' (16) object '102e2aa.3721' -> pg 16.a018e8d7 (16.d7) -> up ([48,52,30,11,44], p48) acting ([48,8,30,11,NONE], p48)

but I restarted osd.44, and now it's showing

# ceph osd map cephfs_ecdata "102e2aa.3721"
osdmap e45679 pool 'cephfs_ecdata' (16) object '102e2aa.3721' -> pg 16.a018e8d7 (16.d7) -> up ([48,52,30,11,44], p48) acting ([48,8,30,11,42], p48)

Thanks,
Andrew

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
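For context: objects in a CephFS data pool are named `<inode number in hex>.<stripe index>`, so the lost object can usually be traced back to a file by converting the hex prefix to a decimal inode number and searching a mounted copy of the filesystem for it. A minimal sketch; the /mnt/cephfs mount point is an assumption:

```shell
# "102e2aa.3721" = inode 0x102e2aa, stripe index 0x3721.
# Convert the inode number to decimal:
ino=$(printf '%d' 0x102e2aa)
echo "$ino"  # -> 16966314

# Then look that inode up on a mounted copy of the filesystem
# (slow on large trees, but it works):
# find /mnt/cephfs -inum "$ino"
```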
Re: [ceph-users] units of metrics
Quoting Robert LeBlanc (rob...@leblancnet.us):
> req_create
> req_getattr
> req_readdir
> req_lookupino
> req_open
> req_unlink
>
> We were graphing these as ops, but using the new avgcount, we are getting
> very different values, so I'm wondering if we are choosing the wrong new
> value, or we misunderstood what the old value really was and have been
> plotting it wrong all this time.

I think the last one: not plotting what you think you did. We are using the telegraf plugin from the manager and using "mds.request" from "ceph_daemon_stats" to plot the number of requests.

Gr. Stefan

--
| BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6  +31 318 648 688 / i...@bit.nl

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Pool Max Avail and Ceph Dashboard Pool Usage on Nautilus giving different percentages
Does anyone know if this is also respecting any nearfull values?

Thank you in advance
Mehmet

Am 14. Januar 2020 15:20:39 MEZ schrieb Stephan Mueller :
>Hi,
>I sent out this message on the 19th of December and somehow it didn't
>get into the list and I just noticed it now. Sorry for the delay.
>I tried to resend it but it just returned the same error that the mail was
>not deliverable to the ceph mailing list. I will send the message
>beneath as soon as it's finally possible, but for now this should help you
>out.
>
>Stephan
>
>--
>
>Hi,
>
>if "MAX AVAIL" displays the wrong data, the bug is just made more
>visible through the dashboard, as the calculation is correct.
>
>To get the right percentage you have to divide the used space by
>the total, and the total can only consist of two states, used and not
>used space, so both states will be added together to get the total.
>
>Or in short:
>
>used / (avail + used)
>
>Just looked into the C++ code - Max avail will be calculated the
>following way:
>
>avail_res = avail / raw_used_rate
>(https://github.com/ceph/ceph/blob/nautilus/src/mon/PGMap.cc#L905)
>
>raw_used_rate *= (sum.num_object_copies - sum.num_objects_degraded) /
>sum.num_object_copies
>(https://github.com/ceph/ceph/blob/nautilus/src/mon/PGMap.cc#L892)
>
>
>Am Dienstag, den 17.12.2019, 07:07 +0100 schrieb c...@elchaka.de:
>> I have observed this in the ceph nautilus dashboard too - and think
>> it is a display bug... but sometimes it shows the right values.
>>
>> Which nautilus do you use?
>>
>> Am 10. Dezember 2019 14:31:05 MEZ schrieb "David Majchrzak, ODERLAND
>> Webbhotell AB" :
>> > Hi!
>> >
>> > While browsing /#/pool in the nautilus ceph dashboard I noticed it said
>> > 93% used on the single pool we have (3x replica).
>> >
>> > ceph df detail however shows 81% used on the pool and 67% raw
>> > usage.
>> >
>> > # ceph df detail
>> > RAW STORAGE:
>> >     CLASS     SIZE        AVAIL       USED        RAW USED    %RAW USED
>> >     ssd       478 TiB     153 TiB     324 TiB     325 TiB         67.96
>> >     TOTAL     478 TiB     153 TiB     324 TiB     325 TiB         67.96
>> >
>> > POOLS:
>> >     POOL     ID     STORED      OBJECTS     USED        %USED     MAX AVAIL     QUOTA OBJECTS     QUOTA BYTES     DIRTY      USED COMPR     UNDER COMPR
>> >     echo      3     108 TiB     29.49M      324 TiB     81.61     24 TiB        N/A               N/A             29.49M     0 B            0 B
>
>I manually calculated the used percentage to get "avail"; in your case
>it seems to be 73 TiB. That means the total space available for
>your pool would be 397 TiB.
>I'm not sure why that is, but it's what the math behind those
>calculations says.
>(Found a thread regarding that on the new mailing list (ceph-
>us...@ceph.io) ->
>
>https://lists.ceph.io/hyperkitty/list/ceph-us...@ceph.io/thread/NH2LMMX5KVRWCURI3BARRUAETKE2T2QN/#JDHXOQKWF6NZLQMOGEPAQCLI44KB54A3
> )
>
>0.8161 = used (324) / total => total = 397
>
>Then I looked at the remaining calculations:
>
>raw_used_rate *= (sum.num_object_copies - sum.num_objects_degraded) /
>sum.num_object_copies
>
>and
>
>avail_res = avail / raw_used_rate
>
>First I looked up the init value for "raw_used_rate" for replicated
>pools. It's their size, so we can put in 3 here, and for "avail_res" it's
>24.
>
>So I first calculated the final "raw_used_rate", which is 3.042. That
>means that you have around 4.2% degraded PGs in your pool.
>
>> >
>> >
>> > I know we're looking at the most full OSD (210 PGs, 79% used, 1.17
>> > VAR) and count max avail from that. But where's the 93% full from in
>> > the dashboard?
>
>As said above, the calculation is right but the data is wrong... It
>uses the real data that can be put into the selected pool, but everywhere
>else it uses sizes that consider all pool replicas.
>
>I created an issue to fix this: https://tracker.ceph.com/issues/43384
>
>> >
>> > My guess is that it comes from calculating:
>> >
>> > 1 - Max Avail / (Used + Max Avail) = 0.93
>> >
>> >
>> > Kind Regards,
>> >
>> > David Majchrzak
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>Hope I could clarify some things and thanks for your feedback :)
>
>BTW this problem currently still exists, as there wasn't any change to
>these mentioned lines after the nautilus release.
>
>Stephan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
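Both percentages in this thread can be reproduced from the `ceph df detail` numbers quoted above (USED = 324 TiB, MAX AVAIL = 24 TiB for pool "echo") — a quick sketch of the two formulas being compared:

```shell
# The dashboard computes used / (used + avail), treating MAX AVAIL as the
# only remaining space:
awk 'BEGIN { used = 324; avail = 24; printf "%.1f\n", used / (used + avail) * 100 }'
# -> 93.1, the dashboard's ~93%

# `ceph df detail` instead reports %USED against a larger effective total
# (~397 TiB here, back-computed from 81.61% as Stephan does above):
awk 'BEGIN { printf "%.1f\n", 324 / 397 * 100 }'
# -> 81.6
```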
[ceph-users] PG inconsistent with error "size_too_large"
Hi,

I am getting one inconsistent object on our cluster with an inconsistency error that I haven’t seen before. This started happening during a rolling upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s related.

I was hoping to know what the error means before trying a repair.

[root@objmon04 ~]# ceph health detail
HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg inconsistent
OSDMAP_FLAGS noout flag(s) set
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 9.20e is active+clean+inconsistent, acting [509,674,659]

rados list-inconsistent-obj 9.20e --format=json-pretty
{
    "epoch": 759019,
    "inconsistents": [
        {
            "object": {
                "name": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 692875
            },
            "errors": [ "size_too_large" ],
            "union_shard_errors": [],
            "selected_object_info": {
                "oid": {
                    "oid": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
                    "key": "",
                    "snapid": -2,
                    "hash": 3321413134,
                    "max": 0,
                    "pool": 9,
                    "namespace": ""
                },
                "version": "281183'692875",
                "prior_version": "281183'692874",
                "last_reqid": "client.34042469.0:206759091",
                "user_version": 692875,
                "size": 146097278,
                "mtime": "2017-07-03 12:43:35.569986",
                "local_mtime": "2017-07-03 12:43:35.571196",
                "lost": 0,
                "flags": [ "dirty", "data_digest", "omap_digest" ],
                "truncate_seq": 0,
                "truncate_size": 0,
                "data_digest": "0xf19c8035",
                "omap_digest": "0x",
                "expected_object_size": 0,
                "expected_write_size": 0,
                "alloc_hint_flags": 0,
                "manifest": { "type": 0 },
                "watchers": {}
            },
            "shards": [
                { "osd": 509, "primary": true, "errors": [], "size": 146097278 },
                { "osd": 659, "primary": false, "errors": [], "size": 146097278 },
                { "osd": 674, "primary": false, "errors": [], "size": 146097278 }
            ]
        }
    ]
}

Thanks,
Liam
—
Senior Developer
Institute for Advanced Computer Studies
University of Maryland

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] units of metrics
On Tue, Jan 14, 2020 at 12:30 AM Stefan Kooman wrote:
> Quoting Robert LeBlanc (rob...@leblancnet.us):
> > The link that you referenced above is no longer available, do you have a
> > new link? We upgraded from 12.2.8 to 12.2.12 and the MDS metrics all
> > changed, so I'm trying to map the old values to the new values. Might just
> > have to look in the code. :(
>
> I cannot recall that the metrics have ever changed between 12.2.8 and
> 12.2.12. Anyways, it depends on what module you use to collect the
> metrics if the right metrics are even there. See this issue:
> https://tracker.ceph.com/issues/41881

Yes, I agree that the metrics should not change within a major version, but here is the difference. We are using diamond and the CephCollector, but I verified with the admin socket and dumping the perf counters manually.

Metrics collected with 12.2.8:

servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_client_request 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_server_request 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_request 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_session 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_slave_request 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_link 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookup 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookuphash 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupino 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupname 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupparent 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupsnap 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lssnap 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mkdir 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mknod 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mksnap 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_open 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_readdir 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rename 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_renamesnap 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmdir 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmsnap 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmxattr 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setattr 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setdirlayout 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setfilelock 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setlayout 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setxattr 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_symlink 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_unlink 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.cap_revoke_eviction 0 1578955878

Metrics collected with 12.2.12 (much more clear and descriptive, which is good):

servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_client_request 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_server_request 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_request 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_session 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_slave_request 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.avgcount 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.avgtime 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.sum 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.avgcount 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.avgtime 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.sum 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock_latency.avgcount 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock_latency.avgtime 0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock_latency
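For what it's worth, the old req_* counters were cumulative event counts, and the new req_*_latency.avgcount is still a monotonic event count, so the old "ops" series can be recovered by differencing consecutive scrapes. A sketch with made-up sample values, 60 seconds apart:

```shell
# Two samples of req_create_latency.avgcount (counts and timestamps are
# illustrative, not from a real cluster):
prev_count=1200 prev_ts=1578955818
cur_count=1500  cur_ts=1578955878

# Rate = delta(count) / delta(time), the same number the old req_create
# "ops" graph would have shown for this interval:
echo $(( (cur_count - prev_count) / (cur_ts - prev_ts) ))  # -> 5 creates/sec
```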
Re: [ceph-users] where does 100% RBD utilization come from?
Hi Philip,

I'm not sure if we're talking about the same thing, but I was also confused when I didn't see 100% OSD drive utilization during my first RBD write benchmark. Since then I collect all my confusion here: https://yourcmc.ru/wiki/Ceph_performance :)

100% RBD utilization means that something waits for some I/O ops on this device to complete all the time. This "something" (client software) can't produce more I/O operations while it's waiting for previous ones to complete; that's why it can't saturate your OSDs and your network. OSDs can't send more write requests to the drives while they're not done calculating object states on the CPU, or while they're busy with network I/O. That's why OSDs can't saturate drives.

Simply said: Ceph is slow. Partly because of the network round trips (you have 3 of them: client -> iscsi -> primary osd -> secondary osds), partly because it's just slow. Of course it's not TERRIBLY slow, so software that can send I/O requests in batches (i.e. use async I/O) feels fine. But software that sends I/Os one by one (because of transactional requirements, or just stupidity, like Oracle) runs very slow.

> Also.. "It seems like your RBD can't flush its I/O fast enough" implies
> that there is some particular measure of "fast enough", that is a tunable
> value somewhere. If my network cards aren't blocked, and my OSDs aren't
> blocked... then doesn't that mean that I can and should "turn that knob" up?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
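The "waiting" explanation above is just Little's law: a client that keeps only `qd` operations in flight against storage with a given end-to-end latency can never exceed qd / latency IOPS, however idle the OSDs and NICs look. A quick sketch with illustrative (not measured) numbers:

```shell
# Assume a ~2 ms round trip per write: client -> iscsi -> primary OSD -> replicas.
lat_us=2000

# Upper bound on IOPS = queue_depth * 1,000,000 / latency_in_microseconds:
echo $(( 1  * 1000000 / lat_us ))  # -> 500    serial writer: device shows "100% busy"
echo $(( 64 * 1000000 / lat_us ))  # -> 32000  async/batched writer, same latency
```

So the "knob" is not a Ceph tunable but the client's queue depth (or the per-op latency itself).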
Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]
Thank you all, performance is indeed better now. Can now go back to sleep ;)

KR Stefan

-----Original Message-----
From: Виталий Филиппов
Sent: Tuesday, 14 January 2020 10:28
To: Wido den Hollander ; Stefan Bauer
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

...disable signatures and rbd cache. I didn't mention it in the email to not repeat myself. But I have it in the article :-)

--
With best regards,
Vitaliy Filippov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]
Yes, that's it, see the end of the article. You'll have to disable signature checks, too:

cephx_require_signatures = false
cephx_cluster_require_signatures = false
cephx_sign_messages = false

> Hi Vitaliy,
>
> thank you for your time. Do you mean
>
> cephx sign messages = false
>
> with "disable signatures"?
>
> KR Stefan

-----Original Message-----
From: Виталий Филиппов
Sent: Tuesday, 14 January 2020 10:28
To: Wido den Hollander ; Stefan Bauer
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

...disable signatures and rbd cache. I didn't mention it in the email to not repeat myself. But I have it in the article :-)

--
With best regards,
Vitaliy Filippov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] where does 100% RBD utilization come from?
Also.. "It seems like your RBD can't flush its I/O fast enough" implies that there is some particular measure of "fast enough", that is a tunable value somewhere. If my network cards aren't blocked, and my OSDs aren't blocked... then doesn't that mean that I can and should "turn that knob" up?

----- Original Message -----
From: "Wido den Hollander"
To: "Philip Brown" , "ceph-users"
Sent: Tuesday, January 14, 2020 12:42:48 AM
Subject: Re: [ceph-users] where does 100% RBD utilization come from?

The util is calculated based on average waits, see: https://coderwall.com/p/utc42q/understanding-iostat

Just improving performance isn't just turning a knob and it will happen. It seems like your RBD can't flush its I/O fast enough and that causes the iowait to go up.

This can be all kinds of things:
- Network (latency)
- CPU on the OSDs

Wido

> --
> Philip Brown | Sr. Linux System Administrator | Medata, Inc.
> 5 Peters Canyon Rd Suite 250
> Irvine CA 92606
> Office 714.918.1310 | Fax 714.918.1325
> pbr...@medata.com | www.medata.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] where does 100% RBD utilization come from?
The odd thing is: the network interfaces on the gateways don't seem to be at 100% capacity, and the OSD disks don't seem to be at 100% utilization. So I'm confused where this could be getting held up.

----- Original Message -----
From: "Wido den Hollander"
To: "Philip Brown" , "ceph-users"
Sent: Tuesday, January 14, 2020 12:42:48 AM
Subject: Re: [ceph-users] where does 100% RBD utilization come from?

The util is calculated based on average waits, see: https://coderwall.com/p/utc42q/understanding-iostat

Just improving performance isn't just turning a knob and it will happen. It seems like your RBD can't flush its I/O fast enough and that causes the iowait to go up.

This can be all kinds of things:
- Network (latency)
- CPU on the OSDs

Wido

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]
Hi Vitaliy,

thank you for your time. Do you mean

cephx sign messages = false

with "disable signatures"?

KR Stefan

-----Original Message-----
From: Виталий Филиппов
Sent: Tuesday, 14 January 2020 10:28
To: Wido den Hollander ; Stefan Bauer
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

...disable signatures and rbd cache. I didn't mention it in the email to not repeat myself. But I have it in the article :-)

--
With best regards,
Vitaliy Filippov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] block db sizing and calculation
Hi Konstantin!

Quoting Konstantin Shalygin (k0...@k0ste.ru):
> >Is there any recommendation of how many OSDs a single flash device can
> >serve? The optane ones can do 2000MB/s write + 500.000 iop/s.
>
> Any sizes of db, except 3/30/300, is useless.

I have this from Mattia Belluco in my notes, which suggests that twice the amount is best:

> Following some discussions we had at the past Cephalocon I beg to differ
> on this point: when RocksDB needs to compact a layer it rewrites it
> *before* deleting the old data; if you'd like to be sure your db does not
> spill over to the spindle you should allocate twice the size of the
> biggest layer to allow for compaction. I guess ~60 GB would be the sweet
> spot, assuming you don't plan to mess with size and multiplier of the
> rocksDB layers and don't want to go all the way to 600 GB (300 GB x2).

Source is http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/035086.html.

And apart from the RocksDB peculiarities, the actual use case also needs to be considered. Lots of small files on a CephFS will require more DB space than mainly big files, as Paul states in the same thread.

Cheers, LF.
--
Lars Fenneberg, l...@elemental.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
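The 3/30/300 figures fall out of RocksDB's default level sizing, assuming max_bytes_for_level_base = 256 MiB and a level multiplier of 10: each level is 10x the previous, and a DB partition is only fully usable if it can hold a complete level plus all the smaller ones above it. A sketch under those default settings:

```shell
# Cumulative space needed to hold RocksDB levels L1..L4 under defaults
# (L1 target 256 MiB, each subsequent level 10x larger):
awk 'BEGIN {
    size = 0.25  # L1 target in GiB (256 MiB)
    for (i = 1; i <= 4; i++) {
        cumulative += size
        printf "L%d: %g GiB, cumulative %g GiB\n", i, size, cumulative
        size *= 10
    }
}'
# L1+L2+L3 comes to ~28 GiB (the "30 GB" tier); adding L4 gives ~278 GiB.
```

Per Mattia's note above, doubling the largest level you intend to hold leaves room for compaction's rewrite-before-delete, hence the ~60 GB suggestion for the middle tier.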
Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]
Hi Stefan,

thank you for your time. "temporary write through" does not seem to be a legit parameter. However, write through is already set:

root@proxmox61:~# echo "temporary write through" > /sys/block/sdb/device/scsi_disk/*/cache_type
root@proxmox61:~# cat /sys/block/sdb/device/scsi_disk/2\:0\:0\:0/cache_type
write through

Is that what you meant? Thank you.

KR Stefan

-----Original Message-----
From: Stefan Priebe - Profihost AG

this has something to do with the firmware and how the manufacturer handles syncs / flushes. Intel just ignores sync / flush commands for drives which have a capacitor. Samsung does not.

The problem is that Ceph sends a lot of flush commands, which slows down drives without a capacitor. You can make Linux ignore those userspace requests with the following command:

echo "temporary write through" > /sys/block/sdX/device/scsi_disk/*/cache_type

Greets,
Stefan Priebe
Profihost AG

> Thank you.
>
> Stefan
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PGs inconsistents because of "size_too_large"
This is what I see in the OSD.54 log file:

2020-01-14 10:35:04.986 7f0c20dca700 -1 log_channel(cluster) log [ERR] : 13.4 soid 13:20fbec66:::%2fhbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM%2ffile2:head : size 385888256 > 134217728 is too large
2020-01-14 10:35:08.534 7f0c20dca700 -1 log_channel(cluster) log [ERR] : 13.4 soid 13:25e2d1bd:::%2fhbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM%2ffile8:head : size 385888256 > 134217728 is too large

On Tue, Jan 14, 2020 at 11:02 AM Massimo Sgaravatto < massimo.sgarava...@gmail.com> wrote:
> I have just finished the update of a ceph cluster from luminous to nautilus.
> Everything seems to be running, but I keep receiving notifications (about
> ~10 so far, involving different PGs and different OSDs) of PGs in an
> inconsistent state.
>
> rados list-inconsistent-obj pg-id --format=json-pretty (an example is
> attached) says that the problem is "size_too_large".
>
> "ceph pg repair" is able to "fix" the problem, but I am not able to
> understand what the problem is.
>
> Thanks, Massimo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
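For what it's worth, the 134217728-byte threshold in those log lines is exactly 128 MiB, which matches the default `osd_max_object_size` in recent releases; the scrub-time check that emits "size_too_large" is newer than these objects, which would explain why the errors only surfaced after the upgrade. A sketch of the two obvious ways out, to be verified against your release's documentation before applying:

```shell
# The threshold in the log is the 128 MiB default osd_max_object_size:
echo $(( 134217728 / 1024 / 1024 ))  # -> 128

# Option 1 (sketch): raise the limit above the largest legacy object so
# scrub stops flagging them (512 MiB here is an arbitrary example value):
# ceph config set osd osd_max_object_size 536870912

# Option 2: keep the default and, as in the thread, clear the errors per PG:
# ceph pg repair 13.4
```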
[ceph-users] PGs inconsistents because of "size_too_large"
I have just finished the update of a ceph cluster from luminous to nautilus. Everything seems to be running, but I keep receiving notifications (about ~10 so far, involving different PGs and different OSDs) of PGs in an inconsistent state.

rados list-inconsistent-obj pg-id --format=json-pretty (an example is attached) says that the problem is "size_too_large".

"ceph pg repair" is able to "fix" the problem, but I am not able to understand what the problem is.

Thanks, Massimo

{
    "epoch": 1966551,
    "inconsistents": [
        {
            "object": {
                "name": "/hbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM/file2",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 368
            },
            "errors": [ "size_too_large" ],
            "union_shard_errors": [],
            "selected_object_info": {
                "oid": {
                    "oid": "/hbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM/file2",
                    "key": "",
                    "snapid": -2,
                    "hash": 1714937604,
                    "max": 0,
                    "pool": 13,
                    "namespace": ""
                },
                "version": "243582'368",
                "prior_version": "243582'367",
                "last_reqid": "client.13143063.0:20504",
                "user_version": 368,
                "size": 385888256,
                "mtime": "2017-10-10 14:09:12.098334",
                "local_mtime": "2017-10-10 14:10:29.321446",
                "lost": 0,
                "flags": [ "dirty", "data_digest", "omap_digest" ],
                "truncate_seq": 0,
                "truncate_size": 0,
                "data_digest": "0x9229f11b",
                "omap_digest": "0x",
                "expected_object_size": 0,
                "expected_write_size": 0,
                "alloc_hint_flags": 0,
                "manifest": { "type": 0 },
                "watchers": {}
            },
            "shards": [
                { "osd": 13, "primary": false, "errors": [], "size": 385888256, "omap_digest": "0x", "data_digest": "0x9229f11b" },
                { "osd": 38, "primary": false, "errors": [], "size": 385888256, "omap_digest": "0x", "data_digest": "0x9229f11b" },
                { "osd": 54, "primary": true, "errors": [], "size": 385888256, "omap_digest": "0x", "data_digest": "0x9229f11b" }
            ]
        },
        {
            "object": {
                "name": "/hbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM/file8",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 417
            },
            "errors": [ "size_too_large" ],
            "union_shard_errors": [],
            "selected_object_info": {
                "oid": {
                    "oid": "/hbWPh36KajAKcJUlCjG9XdqLGQMzkwn3NDrrLDi_mTM/file8",
                    "key": "",
                    "snapid": -2,
                    "hash": 3180021668,
                    "max": 0,
                    "pool": 13,
                    "namespace": ""
                },
                "version": "243596'417",
                "prior_version": "243596'416",
                "last_reqid": "client.13143063.0:20858",
                "user_version": 417,
                "size": 385888256,
                "mtime": "2017-10-10 14:16:32.814931",
                "local_mtime": "2017-10-10 14:17:50.248174",
                "lost": 0,
                "flags": [ "dirty", "data_digest", "omap_digest" ],
                "truncate_seq": 0,
                "truncate_size": 0,
                "data_digest": "0x9229f11b",
                "omap_digest": "0x",
                "expected_object_size": 0,
                "expected_write_size": 0,
                "alloc_hint_flags": 0,
                "manifest": { "type": 0 },
                "watchers": {}
            },
            "shards": [
                { "osd": 13, "primary": false, "errors": [], "size": 385888256, "omap_digest": "0x", "data_digest": "0x9229f11b" },
                {
Re: [ceph-users] block db sizing and calculation
> I'm planning to split the block db to a separate flash device, which I
> also would like to use as an OSD for erasure coding metadata for rbd
> devices.
>
> If I want to use 14x 14TB HDDs per node,
> https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
> recommends a minimum size of 140GB per 14TB HDD.
>
> Is there any recommendation of how many OSDs a single flash device can
> serve? The optane ones can do 2000MB/s write + 500.000 iop/s.

Any sizes of db, except 3/30/300, is useless.

How many OSDs per NVMe? The quantity of OSDs that you can afford to lose at once.

k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]
...disable signatures and rbd cache. I didn't mention it in the email to not repeat myself. But I have it in the article :-)

--
With best regards,
Vitaliy Filippov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] block db sizing and calculation
One tricky thing is that each layer of RocksDB is 100% on SSD or 100% on HDD, so either you need to tweak the rocksdb configuration, or there will be a huge waste: e.g. a 20GB DB partition makes no difference compared to a 3GB one (under the default rocksdb configuration).

Janne Johansson wrote on Tue, Jan 14, 2020 at 4:43 PM:
> (sorry for empty mail just before)
>
>>> I'm planning to split the block db to a separate flash device which I
>>> also would like to use as an OSD for erasure coding metadata for rbd
>>> devices.
>>>
>>> If I want to use 14x 14TB HDDs per node,
>>> https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>>> recommends a minimum size of 140GB per 14TB HDD.
>>>
>>> Is there any recommendation of how many OSDs a single flash device can
>>> serve? The optane ones can do 2000MB/s write + 500.000 iop/s.
>
> I think many ceph admins are more concerned with having many drives
> co-using the same DB drive, since if the DB drive fails, it also means all
> OSDs are lost at the same time.
> Optanes and decent NVMEs are probably capable of handling tons of HDDs, so
> that the bottleneck ends up being somewhere else, but the failure scenarios
> are a bit scary if the whole host is lost just by that one DB device acting
> up.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Hardware selection for ceph backup on ceph
On 1/10/20 5:32 PM, Stefan Priebe - Profihost AG wrote: > Hi, > > we're currently in the process of building a new Ceph cluster to back up RBD > images from multiple Ceph clusters. > > We would like to start with just a single Ceph cluster to back up, which is > about 50TB. The compression ratio of the data is around 30% when using zlib. We > need to scale the backup cluster up to 1PB. > > The workload on the original RBD images is mostly 4K writes, so I expect rbd > export-diff to do a lot of small writes. > > The current idea is to use the following hardware as a start: > 6 servers with: > 1x AMD EPYC 7302P 3GHz, 16C/32T > 128GB memory > 14x 12TB Toshiba Enterprise MG07ACA HDD drives, 4K native > dual 25Gb network > That should be sufficient. The AMD EPYC is a great CPU and you have enough memory. > Does it fit? Has anybody experience with the drives? Can we use EC or do we > need to use normal replication? > EC will just work, and since it's only a backup system it should be fast enough. Oh, and more servers are always better. Wido > Greets, > Stefan
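For what it's worth, a rough capacity sketch for the proposed hardware, assuming an EC profile of 4+2 and reading "compression ratio around 30%" as data shrinking to 30% of its original size (both are assumptions, not from the thread):

```python
# Rough capacity check for the proposed 6-node cluster (assumed EC 4+2).
SERVERS, HDDS, HDD_TB = 6, 14, 12
raw_tb = SERVERS * HDDS * HDD_TB            # raw capacity in TB

def usable(raw, scheme):
    k, m = scheme                           # erasure-coding profile k+m
    return raw * k / (k + m)

ec_tb = usable(raw_tb, (4, 2))              # usable with EC 4+2
rep_tb = raw_tb / 3                         # usable with 3x replication
logical_ec = ec_tb / 0.30                   # pre-compression data it could hold

print(raw_tb, ec_tb, rep_tb, round(logical_ec))  # 1008 672.0 336.0 2240
```

So under these assumptions the starting configuration already stores roughly 2PB of logical backup data with EC, but only about half that per scaling step with 3x replication, which is why EC is attractive here.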
Re: [ceph-users] block db sizing and calculation
(sorry for empty mail just before) >> I'm planning to split the block DB onto a separate flash device which I >> would also like to use as an OSD for erasure coding metadata for RBD >> devices. >> >> If I want to use 14x 14TB HDDs per node >> https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing >> recommends a minimum size of 140GB per 14TB HDD. >> >> Is there any recommendation for how many OSDs a single flash device can >> serve? The Optane ones can do 2000MB/s write + 500,000 IOPS. > > I think many Ceph admins are more concerned with having many drives co-using the same DB drive, since if the DB drive fails, it also means all OSDs are lost at the same time. Optanes and decent NVMes are probably capable of handling tons of HDDs, so the bottleneck ends up being somewhere else, but the failure scenarios are a bit scary if the whole host is lost just because that one DB device acts up. -- May the most significant bit of your life be positive.
Re: [ceph-users] where does 100% RBD utilization come from?
On 1/10/20 7:43 PM, Philip Brown wrote: > Surprisingly, a google search didn't seem to find the answer on this, so I guess > I should ask here: > > What determines if an RBD is "100% busy"? > > I have some backend OSDs, and an iSCSI gateway, serving out some RBDs. > > iostat on the gateway says the rbd is 100% utilized. > > iostat on individual OSDs only goes as high as about 60% on a per-device > basis. > CPU is idle. > Doesn't seem like the network interface is capped either. > > So... how do I improve RBD throughput? > The %util is calculated from average waits, see: https://coderwall.com/p/utc42q/understanding-iostat Improving performance isn't just turning a knob and it happens. It seems like your RBD can't flush its I/O fast enough, and that causes the iowait to go up. This can be all kinds of things: - Network (latency) - CPU on the OSDs Wido > > -- > Philip Brown | Sr. Linux System Administrator | Medata, Inc. > 5 Peters Canyon Rd Suite 250 > Irvine CA 92606 > Office 714.918.1310 | Fax 714.918.1325 > pbr...@medata.com | www.medata.com
Re: [ceph-users] block db sizing and calculation
On Mon, Jan 13, 2020 at 08:09, Stefan Priebe - Profihost AG < s.pri...@profihost.ag> wrote: > Hello, > > I'm planning to split the block DB onto a separate flash device which I > would also like to use as an OSD for erasure coding metadata for RBD > devices. > > If I want to use 14x 14TB HDDs per node > > https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing > > recommends a minimum size of 140GB per 14TB HDD. > > Is there any recommendation for how many OSDs a single flash device can > serve? The Optane ones can do 2000MB/s write + 500,000 IOPS. > > Greets, > Stefan > -- May the most significant bit of your life be positive.
Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]
On 1/13/20 6:37 PM, vita...@yourcmc.ru wrote: >> Hi, >> >> we're playing around with Ceph but are not quite happy with the IOPS: >> on average 5000 IOPS write, >> on average 13000 IOPS read. >> >> We're expecting more. :( Any ideas, or is that all we can expect? > > With server SSDs you can expect up to ~1 write / ~25000 read IOPS from > a single client. > > https://yourcmc.ru/wiki/Ceph_performance > >> Money is NOT a problem for this test bed; any ideas on how to gain more >> IOPS are greatly appreciated. > > Grab some server NVMes and the best possible CPUs :) And then: - Disable all powersaving - Pin the CPUs in C-state 1 That might increase performance even more. But due to the synchronous nature of Ceph, the performance and latency of a single thread will be limited. Wido
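A quick illustration of that last point: a single synchronous thread can never exceed 1000/latency IOPS, no matter what the drives can do in aggregate. The latencies below are hypothetical, not measurements:

```python
# Each synchronous write waits for the full client -> primary -> replicas ->
# ack round trip before the next write starts, so the per-write latency is a
# hard cap on single-thread IOPS.
def max_single_thread_iops(round_trip_ms):
    return 1000.0 / round_trip_ms

for lat in (0.5, 1.0, 2.0):     # total per-write latency in ms (hypothetical)
    print(lat, max_single_thread_iops(lat))  # 2000.0, 1000.0, 500.0 IOPS
```

This is why shaving fractions of a millisecond via C-state pinning and powersaving tweaks matters so much more for single-thread latency than adding faster drives.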
Re: [ceph-users] block db sizing and calculation
Hello, does anybody have real-life experience with an external block DB? Greets, Stefan On 13.01.20 at 08:09, Stefan Priebe - Profihost AG wrote: > Hello, > > I'm planning to split the block DB onto a separate flash device which I > would also like to use as an OSD for erasure coding metadata for RBD > devices. > > If I want to use 14x 14TB HDDs per node > https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing > > recommends a minimum size of 140GB per 14TB HDD. > > Is there any recommendation for how many OSDs a single flash device can > serve? The Optane ones can do 2000MB/s write + 500,000 IOPS. > > Greets, > Stefan >
Re: [ceph-users] units of metrics
Quoting Robert LeBlanc (rob...@leblancnet.us): > The link that you referenced above is no longer available; do you have a > new link? We upgraded from 12.2.8 to 12.2.12 and the MDS metrics all > changed, so I'm trying to map the old values to the new values. Might just > have to look in the code. :( I cannot recall that the metrics ever changed between 12.2.8 and 12.2.12. Anyway, it depends on which module you use to collect the metrics whether the right metrics are even there. See this issue: https://tracker.ceph.com/issues/41881 ... The "avgcount" metric is needed to perform calculations to obtain "avgtime" (sum/avgcount). Gr. Stefan -- | BIT BV https://www.bit.nl/ | Kamer van Koophandel 09090351 | GPG: 0xD14839C6 | +31 318 648 688 / i...@bit.nl
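A small sketch of that sum/avgcount calculation: sample the counter twice and divide the deltas to get the average latency over the interval. The sample values below are made up, not real perf dump output:

```python
# Derive an average per-operation latency from a Ceph perf counter that
# exports only "sum" (total seconds) and "avgcount" (total operations).
def avg_latency(sample1, sample2):
    dsum = sample2["sum"] - sample1["sum"]          # seconds spent
    dcount = sample2["avgcount"] - sample1["avgcount"]  # ops completed
    return dsum / dcount if dcount else 0.0         # seconds per operation

# Two hypothetical samples, e.g. taken a minute apart from `perf dump`:
s1 = {"avgcount": 1000, "sum": 2.5}
s2 = {"avgcount": 1600, "sum": 4.0}
print(avg_latency(s1, s2))   # 0.0025 s, i.e. 2.5 ms per op
```

Dividing deltas rather than the raw totals matters: the raw sum/avgcount quotient gives a lifetime average, which hides recent latency changes.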