Re: [ceph-users] blocked ops
On Thu, Aug 11, 2016 at 11:33:29PM +0100, Roeland Mertens wrote:
> Hi,
>
> I was hoping someone on this list may be able to help?
>
> We're running a 35 node 10.2.1 cluster with 595 OSDs. For the last 12 hours
> we've been plagued with blocked requests which completely kill the
> performance of the cluster.
>
> # ceph health detail
> HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs down;
> 1 pgs peering; 1 pgs stuck inactive; 100 requests are blocked > 32 sec;
> 1 osds have slow requests; noout,nodeep-scrub,sortbitwise flag(s) set
> pg 63.1a18 is stuck inactive for 135133.509820, current state
> down+remapped+peering, last acting
> [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]

That value (2147483647) is defined in src/crush/crush.h like so:

    #define CRUSH_ITEM_NONE   0x7fffffff  /* no result */

So this could be due to a bad crush rule, or maybe choose_total_tries needs to be higher.

    $ ceph osd crush rule ls

Then, for each rule listed by the above command:

    $ ceph osd crush rule dump [rule_name]

I'd then dump out the crushmap and test it, showing any bad mappings, with the commands listed here:

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon

I'd also check that the pg numbers for your pool(s) are appropriate, as not enough pgs could also be a contributing factor, IIRC.

That should hopefully give some insight.

--
HTH,
Brad

> pg 63.1a18 is down+remapped+peering, acting
> [2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]
> 100 ops are blocked > 2097.15 sec on osd.4
> 1 osds have slow requests
> noout,nodeep-scrub,sortbitwise flag(s) set
>
> The one pg down is due to us running into an odd EC issue which I mailed the
> list about earlier; it's the 100 blocked ops that are puzzling us. If we out
> the osd in question, they just shift to another osd (on a different host!).
> We even tried rebooting the node it's on, but to little avail.
>
> We get a ton of log messages like this:
>
> 2016-08-11 23:32:10.041174 7fc668d9f700 0 log_channel(cluster) log [WRN] :
> 100 slow requests, 5 included below; oldest blocked for > 139.313915 secs
> 2016-08-11 23:32:10.041184 7fc668d9f700 0 log_channel(cluster) log [WRN] :
> slow request 139.267004 seconds old, received at 2016-08-11 23:29:50.774091:
> osd_op(client.9192464.0:485640 66.b96c3a18
> default.4282484.42_442fac8195c63a2e19c3c4bb91e8800e [getxattrs,stat,read
> 0~524288] snapc 0=[] RETRY=36 ack+retry+read+known_if_redirected e50109)
> currently waiting for blocked object
> 2016-08-11 23:32:10.041189 7fc668d9f700 0 log_channel(cluster) log [WRN] :
> slow request 139.244839 seconds old, received at 2016-08-11 23:29:50.796256:
> osd_op(client.9192464.0:596033 66.942a5a18
> default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [write
> 1048576~524288] snapc 0=[] RETRY=36
> ack+ondisk+retry+write+known_if_redirected e50109) currently waiting for
> blocked object
>
> A dump of the blocked ops tells us very little; is there anyone who can
> shed some light on this? Or at least give us a hint on how we can fix this?
> # ceph daemon osd.4 dump_blocked_ops
> {
>     "description": "osd_op(client.9192464.0:596030 66.942a5a18
>     default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [writefull
>     0~0] snapc 0=[] RETRY=32 ack+ondisk+retry+write+known_if_redirected
>     e50092)",
>     "initiated_at": "2016-08-11 22:58:09.721027",
>     "age": 1515.105186,
>     "duration": 1515.113255,
>     "type_data": [
>         "reached pg",
>         {
>             "client": "client.9192464",
>             "tid": 596030
>         },
>         [
>             { "time": "2016-08-11 22:58:09.721027", "event": "initiated" },
>             { "time": "2016-08-11 22:58:09.721066", "event": "waiting_for_map not empty" },
>             { "time": "2016-08-11 22:58:09.813574", "event": "reached_pg" },
>             { "time": "2016-08-11 22:58:09.813581", "event": "waiting for peered" },
>             { "time": "2016-08-11 22:58:09.852796", "event": "reached_pg" },
>             { "time": "2016-08-11 22:58:09.852804", "event": "waiting for peered" },
>             { "time": "2016-08-11 22:58:10.876636", "event": "reached_pg" },
>             {
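A minimal sketch of the crushmap round trip Brad describes, following the linked troubleshooting page. The rule id (1) and replica count (13, to match the 13-wide acting set above) are assumptions; take the real values from ceph osd crush rule dump:

    # fetch and decompile the live crushmap
    ceph osd getcrushmap -o crush.map
    crushtool -d crush.map -o crush.txt

    # prints a "bad mapping" line for every sample input CRUSH failed to
    # map fully (the 2147483647 / CRUSH_ITEM_NONE slots in "acting" above)
    crushtool -i crush.map --test --show-bad-mappings --rule 1 --num-rep 13 --min-x 0 --max-x 1023

    # after raising "tunable choose_total_tries" in crush.txt, recompile,
    # re-test, and only then inject the new map
    crushtool -c crush.txt -o crush.new
    crushtool -i crush.new --test --show-bad-mappings --rule 1 --num-rep 13 --min-x 0 --max-x 1023
    ceph osd setcrushmap -i crush.new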
[ceph-users] blocked ops
Hi,

I was hoping someone on this list may be able to help?

We're running a 35 node 10.2.1 cluster with 595 OSDs. For the last 12 hours we've been plagued with blocked requests which completely kill the performance of the cluster.

# ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs down;
1 pgs peering; 1 pgs stuck inactive; 100 requests are blocked > 32 sec;
1 osds have slow requests; noout,nodeep-scrub,sortbitwise flag(s) set
pg 63.1a18 is stuck inactive for 135133.509820, current state
down+remapped+peering, last acting
[2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]
pg 63.1a18 is down+remapped+peering, acting
[2147483647,2147483647,2147483647,2147483647,2147483647,2147483647,235,148,290,300,147,157,370]
100 ops are blocked > 2097.15 sec on osd.4
1 osds have slow requests
noout,nodeep-scrub,sortbitwise flag(s) set

The one pg down is due to us running into an odd EC issue which I mailed the list about earlier; it's the 100 blocked ops that are puzzling us. If we out the osd in question, they just shift to another osd (on a different host!). We even tried rebooting the node it's on, but to little avail.

We get a ton of log messages like this:

2016-08-11 23:32:10.041174 7fc668d9f700 0 log_channel(cluster) log [WRN] :
100 slow requests, 5 included below; oldest blocked for > 139.313915 secs
2016-08-11 23:32:10.041184 7fc668d9f700 0 log_channel(cluster) log [WRN] :
slow request 139.267004 seconds old, received at 2016-08-11 23:29:50.774091:
osd_op(client.9192464.0:485640 66.b96c3a18
default.4282484.42_442fac8195c63a2e19c3c4bb91e8800e [getxattrs,stat,read
0~524288] snapc 0=[] RETRY=36 ack+retry+read+known_if_redirected e50109)
currently waiting for blocked object
2016-08-11 23:32:10.041189 7fc668d9f700 0 log_channel(cluster) log [WRN] :
slow request 139.244839 seconds old, received at 2016-08-11 23:29:50.796256:
osd_op(client.9192464.0:596033 66.942a5a18
default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [write
1048576~524288] snapc 0=[] RETRY=36
ack+ondisk+retry+write+known_if_redirected e50109) currently waiting for
blocked object

A dump of the blocked ops tells us very little; is there anyone who can shed some light on this? Or at least give us a hint on how we can fix this?

# ceph daemon osd.4 dump_blocked_ops
{
    "description": "osd_op(client.9192464.0:596030 66.942a5a18
    default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6 [writefull
    0~0] snapc 0=[] RETRY=32 ack+ondisk+retry+write+known_if_redirected
    e50092)",
    "initiated_at": "2016-08-11 22:58:09.721027",
    "age": 1515.105186,
    "duration": 1515.113255,
    "type_data": [
        "reached pg",
        {
            "client": "client.9192464",
            "tid": 596030
        },
        [
            { "time": "2016-08-11 22:58:09.721027", "event": "initiated" },
            { "time": "2016-08-11 22:58:09.721066", "event": "waiting_for_map not empty" },
            { "time": "2016-08-11 22:58:09.813574", "event": "reached_pg" },
            { "time": "2016-08-11 22:58:09.813581", "event": "waiting for peered" },
            { "time": "2016-08-11 22:58:09.852796", "event": "reached_pg" },
            { "time": "2016-08-11 22:58:09.852804", "event": "waiting for peered" },
            { "time": "2016-08-11 22:58:10.876636", "event": "reached_pg" },
            { "time": "2016-08-11 22:58:10.876640", "event": "waiting for peered" },
            { "time": "2016-08-11 22:58:10.902760", "event": "reached_pg" }
        ]
    ]
}
...
Kind regards,

Roeland
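For anyone hitting a similar state, a few read-only commands narrow down what a blocked op is stuck on; the pool name below is a placeholder, and the object name is taken from the slow-request log lines above:

    # what the down pg reports about peering (may stall while it is down+peering)
    ceph pg 63.1a18 query

    # which pg and OSDs a blocked object maps to
    ceph osd map <rgw-data-pool> 'default.4282484.30__shadow_.sLkZ_rUX6cvi0ifFasw1UipEIuFPzYB_6'

    # live and recently completed slow ops on the flagged OSD
    ceph daemon osd.4 dump_blocked_ops
    ceph daemon osd.4 dump_historic_ops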
Re: [ceph-users] Backfilling pgs not making progress
I just updated the bug with several questions.

-Sam

On Thu, Aug 11, 2016 at 6:56 AM, Brian Felton wrote:
> Sam,
>
> I very much appreciate the assistance. I have opened http://tracker.ceph.com/issues/16997 to track this (potential) issue.
>
> Brian
>
> On Wed, Aug 10, 2016 at 1:53 PM, Samuel Just wrote:
>> Ok, can you
>> 1) Open a bug
>> 2) Identify all osds involved in the 5 problem pgs
>> 3) enable debug osd = 20, debug filestore = 20, debug ms = 1 on all of them
>> 4) mark the primary for each pg down (should cause peering and backfill to restart)
>> 5) link all logs to the bug
>>
>> Thanks!
>> -Sam
>>
>> On Tue, Jul 26, 2016 at 9:11 AM, Samuel Just wrote:
>>> Hmm, nvm, it's not an lfn object anyway.
>>> -Sam
>>>
>>> On Tue, Jul 26, 2016 at 7:07 AM, Brian Felton wrote:
>>>> If I search on osd.580, I find default.421929.15\uTEPP\s84316222-6ddd-4ac9-8283-6fa1cdcf9b88\sbackups\s20160630091353\sp1\s\sShares\sWarehouse\sLondonWarehouse\sLondon\sRon picture's\sMISCELLANEOUS\s2014\sOct., 2014\sOct. 1\sDSC04329.JPG__head_981926C1__21__5, which has a non-zero size and a hash (981926C1) that matches that of the same file found on the other OSDs in the pg.
>>>>
>>>> If I'm misunderstanding what you're asking about a dangling link, please point me in the right direction.
>>>>
>>>> Brian
>>>>
>>>> On Tue, Jul 26, 2016 at 8:59 AM, Samuel Just wrote:
>>>>> Did you also confirm that the backfill target does not have any of those dangling links? I'd be looking for a dangling link for 981926c1/default.421929.15_TEPP/84316222-6ddd-4ac9-8283-6fa1cdcf9b88/backups/20160630091353/p1//Shares/Warehouse/LondonWarehouse/London/Ron picture's/MISCELLANEOUS/2014/Oct., 2014/Oct. 1/DSC04329.JPG/head//33 on osd.580.
>>>>> -Sam
>>>>>
>>>>> On Mon, Jul 25, 2016 at 9:04 PM, Brian Felton wrote:
>>>>>> Sam,
>>>>>>
>>>>>> I cranked up the logging on the backfill target (osd 580 on node 07) and the acting primary for the pg (453 on node 08, for what it's worth). The logs from the primary are very large, so pardon the tarballs.
>>>>>>
>>>>>> PG Primary Logs: https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-primary-log.tgz?dl=0B
>>>>>> PG Backfill Target Logs: https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-target-log.tgz?dl=0
>>>>>>
>>>>>> I'll be reviewing them with my team tomorrow morning to see if we can find anything. Thanks for your assistance.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just wrote:
>>>>>>> The next thing I'd want is for you to reproduce with
>>>>>>>
>>>>>>> debug osd = 20
>>>>>>> debug filestore = 20
>>>>>>> debug ms = 1
>>>>>>>
>>>>>>> and post the file somewhere.
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just wrote:
>>>>>>>> If you don't have the orphaned file link, it's not the same bug.
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton wrote:
>>>>>>>>> Sam,
>>>>>>>>>
>>>>>>>>> I'm reviewing that thread now, but I'm not seeing a lot of overlap with my cluster's situation. For one, I am unable to start either a repair or a deep scrub on any of the affected pgs. I've instructed all six of the pgs to scrub, deep-scrub, and repair, and the cluster has been gleefully ignoring these requests (it has been several hours since I first tried, and the logs indicate none of the pgs ever scrubbed). Second, none of my OSDs is crashing. Third, none of my pgs or objects has ever been marked inconsistent (or unfound, for that matter) -- I'm only seeing the standard mix of degraded/misplaced objects that are common during a recovery. What I'm not seeing is any further progress on the number of misplaced objects -- the number has remained effectively unchanged for the past several days.
>>>>>>>>>
>>>>>>>>> To be sure, though, I tracked down the file that the backfill operation seems to be hung on, and I can find it in both the backfill target osd (580)
Re: [ceph-users] rbd-nbd kernel requirements
Fair enough.

On Thu, Aug 11, 2016, 10:45 Jason Dillaman wrote:
> I don't think anyone has really looked into the cause yet, so it's > hard to say where the problem lies. > > On Thu, Aug 11, 2016 at 9:36 AM, Shawn Edwards > wrote: > > Is it thought that this bug is in Ceph and not the kernel? > > > > > > On Thu, Aug 11, 2016 at 8:14 AM Jason Dillaman > wrote: > >> > >> At this point, we only have automated tests that exercise it against > >> stock Ubuntu Trusty but that will eventually expand to Xenial once we > >> get our lab configured for it. There is one known issue right now > >> where the kernel can deadlock while starting the test [1] and > >> mapping/unmapping the device. > >> > >> Red Hat Enterprise Linux has the nbd driver disabled so we don't test > >> against those variants. I've personally run it on Fedora 23 and 24 > >> during manual testing without any apparent issues. > >> > >> [1] http://tracker.ceph.com/issues/16921 > >> > >> On Wed, Aug 10, 2016 at 7:16 PM, Shawn Edwards > >> wrote: > >> > Is there a minimum kernel version required for rbd-nbd to work and work > >> > well? Before I start stress testing it, I want to be sure I have a > >> > system > >> > that is expected to work. > >> > > >> > ___ > >> > ceph-users mailing list > >> > ceph-users@lists.ceph.com > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> > > >> > >> -- > >> Jason > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- > Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
> On 11 August 2016 at 15:17, Sean Sullivan wrote:
>
> Hello Wido,
>
> Thanks for the advice. While the data center has a/b circuits and redundant power, etc., if a ground fault happens it travels outside and fails, causing the whole building to fail (apparently).
>
> The monitors are each the same with
> 2x e5 cpus
> 64gb of ram
> 4x 300gb 10k SAS drives in raid 10 (write through mode).
> Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 - 3am CST)
> Ceph hammer LTS 0.94.7
>
> (we are still working on our jewel test cluster so it is planned but not in place yet)
>
> The only thing that seems to be corrupt is the monitors' leveldb store. I see multiple issues on the Google leveldb github from March 2016 about fsync and power failure, so I assume this is an issue with leveldb.
>
> I have backed up /var/lib/ceph/mon on all of my monitors before trying to proceed with any form of recovery.
>
> Is there any way to reconstruct the leveldb or replace the monitors and recover the data?

I don't know. I have never done it. Other people might know this better than me.

Maybe 'ceph-monstore-tool' can help you?

Wido

> I found the following post in which Sage says it is tedious but possible. (http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine if I have any chance of doing it. I have the fsid, the mon key map, and all of the osds look to be fine, so all of the previous osd maps are there.
>
> I just don't understand what key/values I need inside.
>
> On Aug 11, 2016 1:33 AM, "Wido den Hollander" wrote:
> > > > On 11 August 2016 at 0:10, Sean Sullivan <seapasu...@uchicago.edu> wrote: > > > > > > I think it just got worse:: > > > > > > all three monitors on my other cluster say that ceph-mon can't open > > > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all > > > 3 monitors? I saw a post by Sage saying that the data can be recovered as > > > all of the data is held on other servers. Is this possible? If so has > > > anyone had any experience doing so? > > > > I have never done so, so I couldn't tell you. > > > > However, it is weird that on all three it got corrupted. What hardware are > > you using? Was it properly protected against power failure? > > > > If your mon store is corrupted I'm not sure what might happen. > > > > However, make a backup of ALL monitors right now before doing anything. > > > > Wido > > > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
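Before any recovery attempt, a cold backup of every monitor's store is cheap insurance. A sketch, assuming Ubuntu 14.04 upstart service names and that this Hammer build ships ceph-monstore-tool (the subcommand names vary by release; check its --help first):

    # stop the mon, then copy its store wholesale
    stop ceph-mon id=$(hostname -s)
    cp -a /var/lib/ceph/mon /root/mon-backup-$(date +%F)

    # see which keys are still readable in the leveldb store
    ceph-monstore-tool /var/lib/ceph/mon/ceph-$(hostname -s) dump-keys > /root/mon-keys.txt

    # write a fresh, compacted copy of the store; sometimes enough to
    # get past leveldb-level corruption
    ceph-monstore-tool /var/lib/ceph/mon/ceph-$(hostname -s) store-copy /root/mon-store-copy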
Re: [ceph-users] rbd-nbd kernel requirements
I don't think anyone has really looked into the cause yet, so it's hard to say where the problem lies.

On Thu, Aug 11, 2016 at 9:36 AM, Shawn Edwards wrote:
> Is it thought that this bug is in Ceph and not the kernel? > > > On Thu, Aug 11, 2016 at 8:14 AM Jason Dillaman wrote: >> >> At this point, we only have automated tests that exercise it against >> stock Ubuntu Trusty but that will eventually expand to Xenial once we >> get our lab configured for it. There is one known issue right now >> where the kernel can deadlock while starting the test [1] and >> mapping/unmapping the device. >> >> Red Hat Enterprise Linux has the nbd driver disabled so we don't test >> against those variants. I've personally run it on Fedora 23 and 24 >> during manual testing without any apparent issues. >> >> [1] http://tracker.ceph.com/issues/16921 >> >> On Wed, Aug 10, 2016 at 7:16 PM, Shawn Edwards >> wrote: >> > Is there a minimum kernel version required for rbd-nbd to work and work >> > well? Before I start stress testing it, I want to be sure I have a >> > system >> > that is expected to work. >> > >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > >> >> -- >> Jason > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Include mon restart in logrotate?
Hello,

see below.

On Thu, 11 Aug 2016 12:52:32 +0200 (CEST) Wido den Hollander wrote:
> > > On 11 August 2016 at 10:18, Eugen Block wrote: > > > > > > Thanks for the really quick response! > > > > > Warning! These are not your regular log files. > > > > Thanks for the warning! > > > > > You shouldn't have to worry about that. The MONs should compact and > > > rotate those logs themselves. > > > > I believe the compaction works fine, but these large LOG files just > > grow until mon restart. Is there no way to limit the size to a desired > > value or anything similar? > > > > That's not good. That shouldn't happen. The monitor has to trim these logs as > well. > > How big is your mon store? > > $ du -sh /var/lib/ceph/mon/* > > > > What version of Ceph are you running exactly? > > > > ceph@node1:~/ceph-deploy> ceph --version > > ceph version 0.94.6-75 > > > > 0.94.7 is already out, might be worth upgrading. Release Notes don't tell > anything about this case though.

0.94.5 definitely has that bug (no compaction on either MON or OSD leveldbs) and I remember a tracker and release note entry about that. And 0.94.7 definitely doesn't have that problem. While 0.94.6 has the potential to eat all your data with cache-tiering.

So yeah, upgrade time.

Christian

> > > What is the output of ceph -s? > > > > ceph@node1:~/ceph-deploy> ceph -s > > cluster 655cb05a-435a-41ba-83d9-8549f7c36167 > > health HEALTH_OK > > monmap e7: 3 mons at > > {mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0} > > election epoch 242, quorum 0,1,2 mon1,mon2,mon3 > > osdmap e2377: 19 osds: 19 up, 19 in > > pgmap v3791457: 4336 pgs, 14 pools, 1551 GB data, 234 kobjects > > 3223 GB used, 4929 GB / 8153 GB avail > > 4336 active+clean > > client io 0 B/s rd, 72112 B/s wr, 7 op/s > > > > Ok, that's good. Monitors don't trim the logs when the cluster isn't healthy, > but yours is. > > Wido > > > > Quoting Wido den Hollander: > > > > >> On 11 August 2016 at 9:56, Eugen Block wrote: > > >> > > >> > > >> Hi list, > > >> > > >> we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 > > >> MONs. > > >> Now after a couple of weeks we noticed that we're running out of disk > > >> space on one of the nodes in /var. > > >> Similar to [1] there are two large LOG files in > > >> /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are > > >> managed when the respective MON is restarted. But the MONs are not > > >> restarted regularly so the log files can grow for months and fill up > > >> the file system. > > >> > > > > > > Warning! These are not your regular log files. They are binary logs > > > of LevelDB which are mandatory for the MONs to work! > > > > > >> I was thinking about adding another file in /etc/logrotate.d/ and > > >> trigger a monitor restart once a week. But I'm not sure if it's > > >> recommended to restart all MONs at the same time, which could happen > > >> if someone started logrotate manually. > > >> So my question is, how do you guys manage that and how is it supposed > > >> to be handled? I'd really appreciate any insights! > > >> > > > You shouldn't have to worry about that. The MONs should compact and > > > rotate those logs themselves. > > > > > > They compact their store on start, so that works for you, but they > > > should do this while running. > > > > > > What version of Ceph are you running exactly? > > > > > > What is the output of ceph -s? MONs usually only compact when the > > > cluster is healthy.
> > > > > > Wido > > > > > >> Regards, > > >> Eugen > > >> > > >> [1] > > >> http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor > > >> > > >> -- > > >> Eugen Block voice : +49-40-559 51 75 > > >> NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77 > > >> Postfach 61 03 15 > > >> D-22423 Hamburg e-mail : ebl...@nde.ag > > >> > > >> Vorsitzende des Aufsichtsrates: Angelika Mozdzen > > >>Sitz und Registergericht: Hamburg, HRB 90934 > > >>Vorstand: Jens-U. Mozdzen > > >> USt-IdNr. DE 814 013 983 > > >> > > >> ___ > > >> ceph-users mailing list > > >> ceph-users@lists.ceph.com > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > > > -- > > Eugen Block voice : +49-40-559 51 75 > > NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77 > > Postfach 61 03 15 > > D-22423 Hamburg e-mail : ebl...@nde.ag > > > > Vorsitzende des Aufsichtsrates: Angelika Mozdzen > >Sitz und Registergericht: Hamburg, HRB 90934 > >Vorstand: Jens-U. Mozdzen > > USt-IdNr. DE 814
Re: [ceph-users] Backfilling pgs not making progress
Sam,

I very much appreciate the assistance. I have opened http://tracker.ceph.com/issues/16997 to track this (potential) issue.

Brian

On Wed, Aug 10, 2016 at 1:53 PM, Samuel Just wrote:
> Ok, can you
> 1) Open a bug
> 2) Identify all osds involved in the 5 problem pgs
> 3) enable debug osd = 20, debug filestore = 20, debug ms = 1 on all of them
> 4) mark the primary for each pg down (should cause peering and backfill to restart)
> 5) link all logs to the bug
>
> Thanks!
> -Sam
>
> On Tue, Jul 26, 2016 at 9:11 AM, Samuel Just wrote:
>> Hmm, nvm, it's not an lfn object anyway.
>> -Sam
>>
>> On Tue, Jul 26, 2016 at 7:07 AM, Brian Felton wrote:
>>> If I search on osd.580, I find default.421929.15\uTEPP\s84316222-6ddd-4ac9-8283-6fa1cdcf9b88\sbackups\s20160630091353\sp1\s\sShares\sWarehouse\sLondonWarehouse\sLondon\sRon picture's\sMISCELLANEOUS\s2014\sOct., 2014\sOct. 1\sDSC04329.JPG__head_981926C1__21__5, which has a non-zero size and a hash (981926C1) that matches that of the same file found on the other OSDs in the pg.
>>>
>>> If I'm misunderstanding what you're asking about a dangling link, please point me in the right direction.
>>>
>>> Brian
>>>
>>> On Tue, Jul 26, 2016 at 8:59 AM, Samuel Just wrote:
>>>> Did you also confirm that the backfill target does not have any of those dangling links? I'd be looking for a dangling link for 981926c1/default.421929.15_TEPP/84316222-6ddd-4ac9-8283-6fa1cdcf9b88/backups/20160630091353/p1//Shares/Warehouse/LondonWarehouse/London/Ron picture's/MISCELLANEOUS/2014/Oct., 2014/Oct. 1/DSC04329.JPG/head//33 on osd.580.
>>>> -Sam
>>>>
>>>> On Mon, Jul 25, 2016 at 9:04 PM, Brian Felton wrote:
>>>>> Sam,
>>>>>
>>>>> I cranked up the logging on the backfill target (osd 580 on node 07) and the acting primary for the pg (453 on node 08, for what it's worth). The logs from the primary are very large, so pardon the tarballs.
>>>>>
>>>>> PG Primary Logs: https://www.dropbox.com/s/ipjobn2i5ban9km/backfill-primary-log.tgz?dl=0B
>>>>> PG Backfill Target Logs: https://www.dropbox.com/s/9qpiqsnahx0qc5k/backfill-target-log.tgz?dl=0
>>>>>
>>>>> I'll be reviewing them with my team tomorrow morning to see if we can find anything. Thanks for your assistance.
>>>>>
>>>>> Brian
>>>>>
>>>>> On Mon, Jul 25, 2016 at 3:33 PM, Samuel Just wrote:
>>>>>> The next thing I'd want is for you to reproduce with
>>>>>>
>>>>>> debug osd = 20
>>>>>> debug filestore = 20
>>>>>> debug ms = 1
>>>>>>
>>>>>> and post the file somewhere.
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, Jul 25, 2016 at 1:33 PM, Samuel Just wrote:
>>>>>>> If you don't have the orphaned file link, it's not the same bug.
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Mon, Jul 25, 2016 at 12:55 PM, Brian Felton wrote:
>>>>>>>> Sam,
>>>>>>>>
>>>>>>>> I'm reviewing that thread now, but I'm not seeing a lot of overlap with my cluster's situation. For one, I am unable to start either a repair or a deep scrub on any of the affected pgs. I've instructed all six of the pgs to scrub, deep-scrub, and repair, and the cluster has been gleefully ignoring these requests (it has been several hours since I first tried, and the logs indicate none of the pgs ever scrubbed). Second, none of my OSDs is crashing. Third, none of my pgs or objects has ever been marked inconsistent (or unfound, for that matter) -- I'm only seeing the standard mix of degraded/misplaced objects that are common during a recovery. What I'm not seeing is any further progress on the number of misplaced objects -- the number has remained effectively unchanged for the past several days.
>>>>>>>>
>>>>>>>> To be sure, though, I tracked down the file that the backfill operation seems to be hung on, and I can find it in both the backfill target osd (580) and a few other osds in the pg. In all cases, I was able to find the file with an identical hash value on all nodes, and I didn't find any duplicates or potential orphans. Also, none of the objects involved have long names, so they're not using the special ceph long filename handling.
>>>>>>>>
>>>>>>>> Also, we are not using XFS on our OSDs; we are using ZFS
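As a concrete sketch of Sam's steps 3 and 4, with 453 as the acting primary and 580 as the backfill target (the osd ids mentioned in this thread); injectargs applies the debug levels at runtime, and marking the primary down forces the pg to re-peer:

    for osd in 453 580; do
        ceph tell osd.$osd injectargs '--debug_osd 20 --debug_filestore 20 --debug_ms 1'
    done

    # kick the pg: the primary gets marked down, peering and backfill restart
    ceph osd down 453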
Re: [ceph-users] rbd-nbd kernel requirements
Is it thought that this bug is in Ceph and not the kernel?

On Thu, Aug 11, 2016 at 8:14 AM Jason Dillaman wrote:
> At this point, we only have automated tests that exercise it against > stock Ubuntu Trusty but that will eventually expand to Xenial once we > get our lab configured for it. There is one known issue right now > where the kernel can deadlock while starting the test [1] and > mapping/unmapping the device. > > Red Hat Enterprise Linux has the nbd driver disabled so we don't test > against those variants. I've personally run it on Fedora 23 and 24 > during manual testing without any apparent issues. > > [1] http://tracker.ceph.com/issues/16921 > > On Wed, Aug 10, 2016 at 7:16 PM, Shawn Edwards > wrote: > > Is there a minimum kernel version required for rbd-nbd to work and work > > well? Before I start stress testing it, I want to be sure I have a system > > that is expected to work. > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- > Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Include mon restart in logrotate?
> How big is your mon store?

ceph@node1:~/ceph-deploy> du -hls /var/lib/ceph/mon/*
31M     /var/lib/ceph/mon/ceph-mon1

Please note that I restarted the monitor on node1 yesterday; for reference, the output for node2:

ceph@node2:~> du -h /var/lib/ceph/mon/*
156M    /var/lib/ceph/mon/ceph-mon2
ceph@node2:~> systemctl status ceph-mon@mon2.service
ceph-mon@mon2.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled)
   Active: active (running) since Do 2016-07-07 17:15:49 CEST; 1 months 4 days ago

> That's not good. That shouldn't happen. The monitor has to trim these logs as well.

What could be the problem? Maybe a missing option in the ceph.conf?

Quoting Wido den Hollander:
On 11 August 2016 at 10:18, Eugen Block wrote: Thanks for the really quick response! > Warning! These are not your regular log files. Thanks for the warning! > You shouldn't have to worry about that. The MONs should compact and > rotate those logs themselves. I believe the compaction works fine, but these large LOG files just grow until mon restart. Is there no way to limit the size to a desired value or anything similar? That's not good. That shouldn't happen. The monitor has to trim these logs as well. How big is your mon store? $ du -sh /var/lib/ceph/mon/* > What version of Ceph are you running exactly? ceph@node1:~/ceph-deploy> ceph --version ceph version 0.94.6-75 0.94.7 is already out, might be worth upgrading. Release Notes don't tell anything about this case though. > What is the output of ceph -s? ceph@node1:~/ceph-deploy> ceph -s cluster 655cb05a-435a-41ba-83d9-8549f7c36167 health HEALTH_OK monmap e7: 3 mons at {mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0} election epoch 242, quorum 0,1,2 mon1,mon2,mon3 osdmap e2377: 19 osds: 19 up, 19 in pgmap v3791457: 4336 pgs, 14 pools, 1551 GB data, 234 kobjects 3223 GB used, 4929 GB / 8153 GB avail 4336 active+clean client io 0 B/s rd, 72112 B/s wr, 7 op/s Ok, that's good. Monitors don't trim the logs when the cluster isn't healthy, but yours is. Wido Quoting Wido den Hollander: >> On 11 August 2016 at 9:56, Eugen Block wrote: >> >> >> Hi list, >> >> we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs. >> Now after a couple of weeks we noticed that we're running out of disk >> space on one of the nodes in /var. >> Similar to [1] there are two large LOG files in >> /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are >> managed when the respective MON is restarted. But the MONs are not >> restarted regularly so the log files can grow for months and fill up >> the file system. >> > > Warning! These are not your regular log files. They are binary logs > of LevelDB which are mandatory for the MONs to work! > >> I was thinking about adding another file in /etc/logrotate.d/ and >> trigger a monitor restart once a week. But I'm not sure if it's >> recommended to restart all MONs at the same time, which could happen >> if someone started logrotate manually. >> So my question is, how do you guys manage that and how is it supposed >> to be handled? I'd really appreciate any insights! >> > You shouldn't have to worry about that. The MONs should compact and > rotate those logs themselves. > > They compact their store on start, so that works for you, but they > should do this while running. > > What version of Ceph are you running exactly? > > What is the output of ceph -s? MONs usually only compact when the > cluster is healthy.
> > Wido > >> Regards, >> Eugen >> >> [1] >> http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor >> >> -- >> Eugen Block voice : +49-40-559 51 75 >> NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77 >> Postfach 61 03 15 >> D-22423 Hamburg e-mail : ebl...@nde.ag >> >> Vorsitzende des Aufsichtsrates: Angelika Mozdzen >>Sitz und Registergericht: Hamburg, HRB 90934 >>Vorstand: Jens-U. Mozdzen >> USt-IdNr. DE 814 013 983 >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Eugen Block voice : +49-40-559 51 75 NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77 Postfach 61 03 15 D-22423 Hamburg e-mail : ebl...@nde.ag Vorsitzende des Aufsichtsrates: Angelika Mozdzen Sitz und Registergericht: Hamburg, HRB 90934 Vorstand: Jens-U. Mozdzen USt-IdNr. DE 814 013 983 -- Eugen Block voice : +49-40-559 51 75 NDE Netzdesign und
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
Hello Wido,

Thanks for the advice. While the data center has a/b circuits and redundant power, etc., if a ground fault happens it travels outside and fails, causing the whole building to fail (apparently).

The monitors are each the same with
2x e5 cpus
64gb of ram
4x 300gb 10k SAS drives in raid 10 (write through mode).
Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 - 3am CST)
Ceph hammer LTS 0.94.7

(we are still working on our jewel test cluster so it is planned but not in place yet)

The only thing that seems to be corrupt is the monitors' leveldb store. I see multiple issues on the Google leveldb github from March 2016 about fsync and power failure, so I assume this is an issue with leveldb.

I have backed up /var/lib/ceph/mon on all of my monitors before trying to proceed with any form of recovery.

Is there any way to reconstruct the leveldb or replace the monitors and recover the data?

I found the following post in which Sage says it is tedious but possible. (http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine if I have any chance of doing it. I have the fsid, the mon key map, and all of the osds look to be fine, so all of the previous osd maps are there.

I just don't understand what key/values I need inside.

On Aug 11, 2016 1:33 AM, "Wido den Hollander" wrote:
> > > On 11 August 2016 at 0:10, Sean Sullivan <seapasu...@uchicago.edu> wrote: > > > > > > I think it just got worse:: > > > > all three monitors on my other cluster say that ceph-mon can't open > > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all > > 3 monitors? I saw a post by Sage saying that the data can be recovered as > > all of the data is held on other servers. Is this possible? If so has > > anyone had any experience doing so? > > I have never done so, so I couldn't tell you. > > However, it is weird that on all three it got corrupted. What hardware are > you using? Was it properly protected against power failure? > > If your mon store is corrupted I'm not sure what might happen. > > However, make a backup of ALL monitors right now before doing anything. > > Wido > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd-nbd kernel requirements
At this point, we only have automated tests that exercise it against stock Ubuntu Trusty but that will eventually expand to Xenial once we get our lab configured for it. There is one known issue right now where the kernel can deadlock while starting the test [1] and mapping/unmapping the device.

Red Hat Enterprise Linux has the nbd driver disabled so we don't test against those variants. I've personally run it on Fedora 23 and 24 during manual testing without any apparent issues.

[1] http://tracker.ceph.com/issues/16921

On Wed, Aug 10, 2016 at 7:16 PM, Shawn Edwards wrote:
> Is there a minimum kernel version required for rbd-nbd to work and work > well? Before I start stress testing it, I want to be sure I have a system > that is expected to work. > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
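For reference, a minimal smoke test of rbd-nbd on a kernel that has the nbd driver available; the pool/image names are placeholders:

    modprobe nbd                       # fails on RHEL variants, where nbd is disabled
    rbd-nbd map rbd/testimage          # prints the attached device, e.g. /dev/nbd0
    mkfs.xfs /dev/nbd0                 # or whatever I/O you want to stress
    rbd-nbd unmap /dev/nbd0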
Re: [ceph-users] cephfs performance benchmark -- metadata intensive
Patrick and I had a related question yesterday: are we able to dynamically vary cache size, to artificially manipulate cache pressure?

On Thu, Aug 11, 2016 at 6:07 AM, John Spray wrote:
> On Thu, Aug 11, 2016 at 8:29 AM, Xiaoxi Chen > wrote: > > Hi, > > > > Here is the slide I shared yesterday on the performance meeting. > > Thanks and hoping for inputs. > > > > http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark > > These are definitely useful results and I encourage everyone working > with cephfs to go and look at Xiaoxi's slides. > > The main thing that this highlighted for me was our lack of testing so > far on systems with full caches. Too much of our existing testing is > done on freshly configured systems that never fill the MDS cache. > > Test 2.1 notes that we don't enable directory fragmentation by default > currently -- this is an issue, and I'm hoping we can switch it on by > default in Kraken (see thread "Switching on mds_bal_frag by default"). > In the meantime we have the fix that Patrick wrote for Jewel which at > least prevents people creating dirfrags too large for the OSDs to > handle. > > Test 2.2: since a "failing to respond to cache pressure" bug is > affecting this, I would guess we see the performance fall off at about > the point where the *client* caches fill up (so they start trimming > things even though they're ignoring cache pressure). It would be > interesting to see this chart with additional lines for some related > perf counters like mds_log.evtrm and mds.inodes_expired, that might > make it pretty obvious where the MDS is entering different stages that > see a decrease in the rate of handling client requests. > > We really need to sort out the "failing to respond to cache pressure" > issues that keep popping up, especially if they're still happening on > a comparatively simple test that is just creating files. We have a > specific test for this[1] that is currently being run against the fuse > client but not the kernel client[2]. This is a good time to try and > push that forward so I've kicked off an experimental run here: > http://pulpito.ceph.com/jspray-2016-08-10_16:14:52-kcephfs:recovery-master-testing-basic-mira/ > > In the meantime, although there are reports of similar issues with > newer kernels, it would be very useful to confirm if the same issue is > still occurring with more recent kernels. Issues with cache trimming > have occurred due to various (separate) bugs, so it's possible that > while some people are still seeing cache trimming issues with recent > kernels, the specific case you're hitting might be fixed. > > Test 2.3: restarting the MDS doesn't actually give you a completely > empty cache (everything in the journal gets replayed to pre-populate > the cache on MDS startup). However, the results are still valid > because you're using a different random order in the non-caching test > case, and the number of inodes in your journal is probably much > smaller than the overall cache size so it's only a little bit > populated. We don't currently have a "drop cache" command built into > the MDS but it would be pretty easy to add one for use in testing > (basically just call mds->mdcache->trim(0)). > > As one would imagine, the non-caching case is latency-dominated when > the working set is larger than the cache, where each client is waiting > for one open to finish before proceeding to the next.
The MDS is > probably capable of handling many more operations per second, but it > would need more parallel IO operations from the clients. When a > single client is doing opens one by one, you're potentially seeing a > full network+disk latency for each one (though in practice the OSD > read cache will be helping a lot here). This non-caching case would > be the main argument for giving the metadata pool low latency (SSD) > storage. > > Test 2.5: The observation that the CPU bottleneck makes using fast > storage for the metadata pool less useful (in sequential/cached cases) > is valid, although it could still be useful to isolate the metadata > OSDs (probably SSDs since not so much capacity is needed) to avoid > competing with data operations. For random access in the non-caching > cases (2.3, 2.4) I think you would probably see an improvement from > SSDs. > > Thanks again to the team from ebay for sharing all this. > > John > > > > 1. https://github.com/ceph/ceph-qa-suite/blob/master/tasks/ > cephfs/test_client_limits.py#L96 > 2. http://tracker.ceph.com/issues/9466 > > > > > > Xiaoxi > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >
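On the cache-size question: in Hammer/Jewel the MDS cache is capped by inode count (mds_cache_size, default 100000), and it can at least be poked at runtime with injectargs; whether that is a supported way to create artificial cache pressure is an assumption worth verifying:

    # locally via the admin socket
    ceph daemon mds.$(hostname -s) config set mds_cache_size 500000

    # or remotely
    ceph tell mds.0 injectargs '--mds_cache_size 50000'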
Re: [ceph-users] Include mon restart in logrotate?
> On 11 August 2016 at 10:18, Eugen Block wrote:
>
> Thanks for the really quick response!
>
> > Warning! These are not your regular log files.
>
> Thanks for the warning!
>
> > You shouldn't have to worry about that. The MONs should compact and
> > rotate those logs themselves.
>
> I believe the compaction works fine, but these large LOG files just grow until mon restart. Is there no way to limit the size to a desired value or anything similar?

That's not good. That shouldn't happen. The monitor has to trim these logs as well.

How big is your mon store?

$ du -sh /var/lib/ceph/mon/*

> > What version of Ceph are you running exactly?
>
> ceph@node1:~/ceph-deploy> ceph --version
> ceph version 0.94.6-75

0.94.7 is already out, might be worth upgrading. Release Notes don't tell anything about this case though.

> > What is the output of ceph -s?
>
> ceph@node1:~/ceph-deploy> ceph -s
>     cluster 655cb05a-435a-41ba-83d9-8549f7c36167
>      health HEALTH_OK
>      monmap e7: 3 mons at {mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0}
>             election epoch 242, quorum 0,1,2 mon1,mon2,mon3
>      osdmap e2377: 19 osds: 19 up, 19 in
>       pgmap v3791457: 4336 pgs, 14 pools, 1551 GB data, 234 kobjects
>             3223 GB used, 4929 GB / 8153 GB avail
>             4336 active+clean
>     client io 0 B/s rd, 72112 B/s wr, 7 op/s

Ok, that's good. Monitors don't trim the logs when the cluster isn't healthy, but yours is.

Wido

> Quoting Wido den Hollander:
> >> On 11 August 2016 at 9:56, Eugen Block wrote: > >> > >> > >> Hi list, > >> > >> we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs. > >> Now after a couple of weeks we noticed that we're running out of disk > >> space on one of the nodes in /var. > >> Similar to [1] there are two large LOG files in > >> /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are > >> managed when the respective MON is restarted. But the MONs are not > >> restarted regularly so the log files can grow for months and fill up > >> the file system. > >> > > > > Warning! These are not your regular log files. They are binary logs > > of LevelDB which are mandatory for the MONs to work! > > > >> I was thinking about adding another file in /etc/logrotate.d/ and > >> trigger a monitor restart once a week. But I'm not sure if it's > >> recommended to restart all MONs at the same time, which could happen > >> if someone started logrotate manually. > >> So my question is, how do you guys manage that and how is it supposed > >> to be handled? I'd really appreciate any insights! > >> > > You shouldn't have to worry about that. The MONs should compact and > > rotate those logs themselves. > > > > They compact their store on start, so that works for you, but they > > should do this while running. > > > > What version of Ceph are you running exactly? > > > > What is the output of ceph -s? MONs usually only compact when the > > cluster is healthy. > > > > Wido > > > >> Regards, > >> Eugen > >> > >> [1] > >> http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor > >> > >> -- > >> Eugen Block voice : +49-40-559 51 75 > >> NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77 > >> Postfach 61 03 15 > >> D-22423 Hamburg e-mail : ebl...@nde.ag > >> > >> Vorsitzende des Aufsichtsrates: Angelika Mozdzen > >> Sitz und Registergericht: Hamburg, HRB 90934 > >> Vorstand: Jens-U. Mozdzen > >> USt-IdNr.
DE 814 013 983 > >> > >> ___ > >> ceph-users mailing list > >> ceph-users@lists.ceph.com > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > -- > Eugen Block voice : +49-40-559 51 75 > NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77 > Postfach 61 03 15 > D-22423 Hamburg e-mail : ebl...@nde.ag > > Vorsitzende des Aufsichtsrates: Angelika Mozdzen >Sitz und Registergericht: Hamburg, HRB 90934 >Vorstand: Jens-U. Mozdzen > USt-IdNr. DE 814 013 983 > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs performance benchmark -- metadata intensive
On Thu, Aug 11, 2016 at 8:29 AM, Xiaoxi Chen wrote:
> Hi,
>
> Here is the slide I shared yesterday on the performance meeting. Thanks and hoping for inputs.
>
> http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark

These are definitely useful results and I encourage everyone working with cephfs to go and look at Xiaoxi's slides.

The main thing that this highlighted for me was our lack of testing so far on systems with full caches. Too much of our existing testing is done on freshly configured systems that never fill the MDS cache.

Test 2.1 notes that we don't enable directory fragmentation by default currently -- this is an issue, and I'm hoping we can switch it on by default in Kraken (see thread "Switching on mds_bal_frag by default"). In the meantime we have the fix that Patrick wrote for Jewel, which at least prevents people creating dirfrags too large for the OSDs to handle.

Test 2.2: since a "failing to respond to cache pressure" bug is affecting this, I would guess we see the performance fall off at about the point where the *client* caches fill up (so they start trimming things even though they're ignoring cache pressure). It would be interesting to see this chart with additional lines for some related perf counters like mds_log.evtrm and mds.inodes_expired; that might make it pretty obvious where the MDS is entering different stages that see a decrease in the rate of handling client requests.

We really need to sort out the "failing to respond to cache pressure" issues that keep popping up, especially if they're still happening on a comparatively simple test that is just creating files. We have a specific test for this[1] that is currently being run against the fuse client but not the kernel client[2]. This is a good time to try and push that forward, so I've kicked off an experimental run here:
http://pulpito.ceph.com/jspray-2016-08-10_16:14:52-kcephfs:recovery-master-testing-basic-mira/

In the meantime, although there are reports of similar issues with newer kernels, it would be very useful to confirm if the same issue is still occurring with more recent kernels. Issues with cache trimming have occurred due to various (separate) bugs, so it's possible that while some people are still seeing cache trimming issues with recent kernels, the specific case you're hitting might be fixed.

Test 2.3: restarting the MDS doesn't actually give you a completely empty cache (everything in the journal gets replayed to pre-populate the cache on MDS startup). However, the results are still valid because you're using a different random order in the non-caching test case, and the number of inodes in your journal is probably much smaller than the overall cache size so it's only a little bit populated. We don't currently have a "drop cache" command built into the MDS but it would be pretty easy to add one for use in testing (basically just call mds->mdcache->trim(0)).

As one would imagine, the non-caching case is latency-dominated when the working set is larger than the cache, where each client is waiting for one open to finish before proceeding to the next. The MDS is probably capable of handling many more operations per second, but it would need more parallel IO operations from the clients. When a single client is doing opens one by one, you're potentially seeing a full network+disk latency for each one (though in practice the OSD read cache will be helping a lot here). This non-caching case would be the main argument for giving the metadata pool low latency (SSD) storage.
Test 2.5: The observation that the CPU bottleneck makes using fast storage for the metadata pool less useful (in sequential/cached cases) is valid, although it could still be useful to isolate the metadata OSDs (probably SSDs since not so much capacity is needed) to avoid competing with data operations. For random access in the non-caching cases (2.3, 2.4) I think you would probably see an improvement from SSDs. Thanks again to the team from ebay for sharing all this. John 1. https://github.com/ceph/ceph-qa-suite/blob/master/tasks/cephfs/test_client_limits.py#L96 2. http://tracker.ceph.com/issues/9466 > > Xiaoxi > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
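A quick way to chart the counters John mentions while a test runs; the JSON paths (mds_log.evtrm, mds.inodes_expired) are assumed from his mail and may differ by release, so inspect a raw perf dump first:

    while sleep 1; do
        ceph daemon mds.$(hostname -s) perf dump > /tmp/perf.json
        python -c 'import json; d = json.load(open("/tmp/perf.json")); print("%s %s" % (d["mds_log"]["evtrm"], d["mds"]["inodes_expired"]))'
    done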
Re: [ceph-users] Include mon restart in logrotate?
I had to make a cronjob to trigger compact on the MONs as well. Ancient version, though.

Jan

> On 11 Aug 2016, at 10:09, Wido den Hollander wrote:
>
> >> Op 11 augustus 2016 om 9:56 schreef Eugen Block : >> >> >> Hi list, >> >> we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs. >> Now after a couple of weeks we noticed that we're running out of disk >> space on one of the nodes in /var. >> Similar to [1] there are two large LOG files in >> /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are >> managed when the respective MON is restarted. But the MONs are not >> restarted regularly so the log files can grow for months and fill up >> the file system. >> > > Warning! These are not your regular log files. They are binary logs of > LevelDB which are mandatory for the MONs to work! > >> I was thinking about adding another file in /etc/logrotate.d/ and >> trigger a monitor restart once a week. But I'm not sure if it's >> recommended to restart all MONs at the same time, which could happen >> if someone started logrotate manually. >> So my question is, how do you guys manage that and how is it supposed >> to be handled? I'd really appreciate any insights! >> > You shouldn't have to worry about that. The MONs should compact and rotate > those logs themselve. > > They compact their store on start, so that works for you, but they should do > this while running. > > What version of Ceph are you running exactly? > > What is the output of ceph -s? MONs usually only compact when the cluster is > healthy. > > Wido > >> Regards, >> Eugen >> >> [1] >> http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor >> >> -- >> Eugen Block voice : +49-40-559 51 75 >> NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77 >> Postfach 61 03 15 >> D-22423 Hamburg e-mail : ebl...@nde.ag >> >> Vorsitzende des Aufsichtsrates: Angelika Mozdzen >> Sitz und Registergericht: Hamburg, HRB 90934 >> Vorstand: Jens-U. Mozdzen >> USt-IdNr. DE 814 013 983 >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
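For the cron approach, the monitor can be compacted online, with no restart needed. A sketch, assuming the mon id matches the short hostname; stagger the schedule so all mons are never compacting at once:

    # one-off, from any node with an admin keyring
    ceph tell mon.$(hostname -s) compact

    # /etc/cron.d/ceph-mon-compact on the mon host: weekly, Sunday 03:00
    0 3 * * 0  root  ceph tell mon.mon1 compact

There is also a config option, mon_compact_on_start = true, which compacts the store at daemon start.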
Re: [ceph-users] Include mon restart in logrotate?
Thanks for the really quick response!

> Warning! These are not your regular log files.

Thanks for the warning!

> You shouldn't have to worry about that. The MONs should compact and
> rotate those logs themselves.

I believe the compaction works fine, but these large LOG files just grow until mon restart. Is there no way to limit the size to a desired value or anything similar?

> What version of Ceph are you running exactly?

ceph@node1:~/ceph-deploy> ceph --version
ceph version 0.94.6-75

> What is the output of ceph -s?

ceph@node1:~/ceph-deploy> ceph -s
    cluster 655cb05a-435a-41ba-83d9-8549f7c36167
     health HEALTH_OK
     monmap e7: 3 mons at {mon1=192.168.160.15:6789/0,mon2=192.168.160.17:6789/0,mon3=192.168.160.16:6789/0}
            election epoch 242, quorum 0,1,2 mon1,mon2,mon3
     osdmap e2377: 19 osds: 19 up, 19 in
      pgmap v3791457: 4336 pgs, 14 pools, 1551 GB data, 234 kobjects
            3223 GB used, 4929 GB / 8153 GB avail
            4336 active+clean
    client io 0 B/s rd, 72112 B/s wr, 7 op/s

Quoting Wido den Hollander:
On 11 August 2016 at 9:56, Eugen Block wrote: Hi list, we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs. Now after a couple of weeks we noticed that we're running out of disk space on one of the nodes in /var. Similar to [1] there are two large LOG files in /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are managed when the respective MON is restarted. But the MONs are not restarted regularly so the log files can grow for months and fill up the file system. Warning! These are not your regular log files. They are binary logs of LevelDB which are mandatory for the MONs to work! I was thinking about adding another file in /etc/logrotate.d/ and trigger a monitor restart once a week. But I'm not sure if it's recommended to restart all MONs at the same time, which could happen if someone started logrotate manually. So my question is, how do you guys manage that and how is it supposed to be handled? I'd really appreciate any insights! You shouldn't have to worry about that. The MONs should compact and rotate those logs themselves. They compact their store on start, so that works for you, but they should do this while running. What version of Ceph are you running exactly? What is the output of ceph -s? MONs usually only compact when the cluster is healthy. Wido Regards, Eugen [1] http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor -- Eugen Block voice : +49-40-559 51 75 NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77 Postfach 61 03 15 D-22423 Hamburg e-mail : ebl...@nde.ag Vorsitzende des Aufsichtsrates: Angelika Mozdzen Sitz und Registergericht: Hamburg, HRB 90934 Vorstand: Jens-U. Mozdzen USt-IdNr. DE 814 013 983 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Eugen Block                             voice : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG      fax   : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg                         e-mail: ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
Sitz und Registergericht: Hamburg, HRB 90934
Vorstand: Jens-U. Mozdzen
USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Include mon restart in logrotate?
> On 11 August 2016 at 9:56, Eugen Block wrote:
>
> Hi list,
>
> we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs.
> Now after a couple of weeks we noticed that we're running out of disk space on one of the nodes in /var.
> Similar to [1] there are two large LOG files in /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are managed when the respective MON is restarted. But the MONs are not restarted regularly so the log files can grow for months and fill up the file system.

Warning! These are not your regular log files. They are binary logs of LevelDB which are mandatory for the MONs to work!

> I was thinking about adding another file in /etc/logrotate.d/ and trigger a monitor restart once a week. But I'm not sure if it's recommended to restart all MONs at the same time, which could happen if someone started logrotate manually.
> So my question is, how do you guys manage that and how is it supposed to be handled? I'd really appreciate any insights!

You shouldn't have to worry about that. The MONs should compact and rotate those logs themselves.

They compact their store on start, so that works for you, but they should do this while running.

What version of Ceph are you running exactly?

What is the output of ceph -s? MONs usually only compact when the cluster is healthy.

Wido

> Regards,
> Eugen
>
> [1] http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor
>
> --
> Eugen Block                             voice : +49-40-559 51 75
> NDE Netzdesign und -entwicklung AG      fax   : +49-40-559 51 77
> Postfach 61 03 15
> D-22423 Hamburg                         e-mail: ebl...@nde.ag
>
> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> Sitz und Registergericht: Hamburg, HRB 90934
> Vorstand: Jens-U. Mozdzen
> USt-IdNr. DE 814 013 983
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Include mon restart in logrotate?
Hi list, we have a working cluster based on Hammer with 4 nodes, 19 OSDs and 3 MONs. Now after a couple of weeks we noticed that we're running out of disk space on one of the nodes in /var. Similar to [1] there are two large LOG files in /var/lib/ceph/mon/ceph-d/store.db/ and I already figured they are managed when the respective MON is restarted. But the MONs are not restarted regularly so the log files can grow for months and fill up the file system. I was thinking about adding another file in /etc/logrotate.d/ and trigger a monitor restart once a week. But I'm not sure if it's recommended to restart all MONs at the same time, which could happen if someone started logrotate manually. So my question is, how do you guys manage that and how is it supposed to be handled? I'd really appreciate any insights! Regards, Eugen [1] http://ceph-users.ceph.narkive.com/PBL3kuhq/large-log-like-files-on-monitor -- Eugen Block voice : +49-40-559 51 75 NDE Netzdesign und -entwicklung AG fax : +49-40-559 51 77 Postfach 61 03 15 D-22423 Hamburg e-mail : ebl...@nde.ag Vorsitzende des Aufsichtsrates: Angelika Mozdzen Sitz und Registergericht: Hamburg, HRB 90934 Vorstand: Jens-U. Mozdzen USt-IdNr. DE 814 013 983 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
I'm guessing you had the writeback cache enabled on the ceph-mon disk (smartctl -g wcache /dev/sdX) and the disk firmware did not care about respecting flush semantics.

On 11.08.2016 08:33, Wido den Hollander wrote:
> >> On 11 August 2016 at 0:10, Sean Sullivan wrote: >> >> >> I think it just got worse:: >> >> all three monitors on my other cluster say that ceph-mon can't open >> /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all >> 3 monitors? I saw a post by Sage saying that the data can be recovered as >> all of the data is held on other servers. Is this possible? If so has >> anyone had any experience doing so? > > I have never done so, so I couldn't tell you. > > However, it is weird that on all three it got corrupted. What hardware are > you using? Was it properly protected against power failure? > > If your mon store is corrupted I'm not sure what might happen. > > However, make a backup of ALL monitors right now before doing anything. > > Wido > >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Tomasz Kuzemko
tomasz.kuze...@corp.ovh.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
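A sketch for checking and disabling the volatile write cache Tomasz suspects; the SAS/SCSI syntax is shown, plain SATA usually uses hdparm, and disks behind a RAID controller need the vendor CLI instead:

    smartctl -g wcache /dev/sda      # "Write cache is: Enabled" means volatile
    smartctl -s wcache,off /dev/sda  # SAS/SCSI drives
    hdparm -W 0 /dev/sda             # SATA equivalent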
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
> On 11 August 2016 at 0:10, Sean Sullivan wrote:
>
> I think it just got worse::
>
> all three monitors on my other cluster say that ceph-mon can't open /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all 3 monitors? I saw a post by Sage saying that the data can be recovered as all of the data is held on other servers. Is this possible? If so has anyone had any experience doing so?

I have never done so, so I couldn't tell you.

However, it is weird that on all three it got corrupted. What hardware are you using? Was it properly protected against power failure?

If your mon store is corrupted I'm not sure what might happen.

However, make a backup of ALL monitors right now before doing anything.

Wido

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com