Re: [ceph-users] HEALTH_WARN 1 MDSs report oversized cache
Ah, I understand now. Makes a lot of sense. Well, we have a LOT of small files, so that might be the reason. I'll keep an eye on whether the message shows up again. Thank you!

Ranjan

On 05.12.19 19:40, Patrick Donnelly wrote:
> On Thu, Dec 5, 2019 at 9:45 AM Ranjan Ghosh wrote:
>> Ah, that seems to have fixed it. Hope it stays that way. I've raised it
>> to 4 GB. Thanks to you both!
>
> Just be aware the warning could come back. You just moved the goal posts.
>
> The 1GB default is probably too low for most deployments; I have a PR
> to increase it: https://github.com/ceph/ceph/pull/32042
>
>> Although I have to say that the message is IMHO *very* misleading: "1
>> MDSs report oversized cache" sounds to me like the cache is too large
>> (i.e. wasting RAM unnecessarily). Shouldn't the message rather be "1
>> MDSs report *undersized* cache"? Weird.
>
> No. It means the MDS cache is larger than its target. This means the
> MDS cannot trim its cache to get back under the limit. This could be
> for many reasons, but probably due to clients not releasing
> capabilities, perhaps due to a bug.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
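Patrick's point about "moving the goal posts" comes down to a single knob. A minimal sketch of what was done in this thread, assuming a Nautilus-era cluster with the centralized config store; the option takes a value in bytes, and the admin-socket check is a hedged suggestion (run it on the MDS host, substituting your MDS name):

```shell
# raise the MDS cache target to 4 GiB (the option value is in bytes)
ceph config set mds mds_cache_memory_limit 4294967296

# hedged: on the MDS host, ask the daemon whether it is back under its target
ceph daemon mds.$(hostname -s) cache status
```

If the warning returns at 4 GiB, the same trade-off applies again: either grow the limit further or find out why clients are pinning capabilities.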
Re: [ceph-users] HEALTH_WARN 1 MDSs report oversized cache
Hi,

Ah, that seems to have fixed it. Hope it stays that way. I've raised it to 4 GB. Thanks to you both!

Although I have to say that the message is IMHO *very* misleading: "1 MDSs report oversized cache" sounds to me like the cache is too large (i.e. wasting RAM unnecessarily). Shouldn't the message rather be "1 MDSs report *undersized* cache"? Weird. That's why I was wondering how small it should be to make Ceph happy but still be sufficient. If I had known that this message meant the cache is too small, I would obviously have just raised it until the message disappeared.

Thanks again for your help! Much appreciated.

BR
Ranjan

On 05.12.19 16:47, Nathan Fish wrote:
> MDS cache size scales with the number of files recently opened by
> clients. If you have RAM to spare, increase "mds cache memory limit".
> I have raised mine from the default of 1GiB to 32GiB. My rough
> estimate is 2.5kiB per inode in recent use.
>
> On Thu, Dec 5, 2019 at 10:39 AM Ranjan Ghosh wrote:
>> Okay, now that I settled the issue with the oneshot service thanks to
>> the amazing help of Paul and Richard (thanks again!), I still wonder:
>>
>> What can I do about that MDS warning:
>>
>> ===
>> health: HEALTH_WARN
>>             1 MDSs report oversized cache
>> ===
>>
>> Does anybody have any ideas? I tried googling it, of course, but came
>> up with no really relevant info on how to actually solve this.
>>
>> BR
>> Ranjan
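Nathan's rule of thumb (~2.5 kiB per recently used inode) turns into a quick back-of-the-envelope sizing. The inode count below is a made-up example, not a figure from this thread:

```shell
# back-of-the-envelope sizing from Nathan's ~2.5 KiB-per-inode estimate
inodes=4000000                 # example only: 4 million recently used inodes
bytes=$((inodes * 2560))       # 2.5 KiB = 2560 bytes
echo "$bytes bytes"
awk -v b="$bytes" 'BEGIN { printf "~%.1f GiB suggested mds_cache_memory_limit\n", b / 1024 / 1024 / 1024 }'
```

For 4 million inodes this comes out to roughly 9.5 GiB, which matches the spirit of Nathan's jump from the 1 GiB default to 32 GiB on a busy cluster.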
Re: [ceph-users] HEALTH_WARN 1 MDSs report oversized cache
Okay, now that I settled the issue with the oneshot service thanks to the amazing help of Paul and Richard (thanks again!), I still wonder:

What can I do about that MDS warning:

===
health: HEALTH_WARN
            1 MDSs report oversized cache
===

Does anybody have any ideas? I tried googling it, of course, but came up with no really relevant info on how to actually solve this.

BR
Ranjan
Re: [ceph-users] What does the ceph-volume@simple-crazyhexstuff SystemD service do? And what to do about oversized MDS cache?
Hi Richard,

Ah, I think I understand now, brilliant. It's *supposed* to do exactly that: mount it once on boot and then just exit. So everything is working as intended. Great.

Thanks
Ranjan

On 05.12.19 15:18, Richard wrote:
> On 2019-12-05 7:19 AM, Ranjan Ghosh wrote:
>> Why is my service marked inactive/dead? Shouldn't it be running?
>
> Look up the systemd service type "oneshot". The service did its job of
> performing the mount and has now exited.
>
> Systemd is a beast. It does many things. A service isn't a daemon.
> It's different from BSD init or SysV init.
>
> Congrats on your upgrade, btw! Good job!
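Richard's "oneshot" point is visible in any unit file of that type. Below is a purely illustrative unit, NOT the real `ceph-volume@.service` (which has its own ExecStart and options); it only shows the one directive that explains the "inactive (dead), status=0/SUCCESS" state:

```shell
# purely illustrative oneshot unit -- NOT the real ceph-volume@.service
cat <<'EOF' > /tmp/example-oneshot.service
[Unit]
Description=Example one-shot activation task

[Service]
Type=oneshot
ExecStart=/bin/true
EOF

# a Type=oneshot service runs ExecStart once, exits, and is then shown
# as "inactive (dead)" -- exactly the state seen in this thread
grep '^Type=' /tmp/example-oneshot.service
```

On a real system, `systemctl cat ceph-volume@<instance>.service` shows the actual unit, and `systemctl show -p Type <unit>` confirms the type.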
Re: [ceph-users] What does the ceph-volume@simple-crazyhexstuff SystemD service do? And what to do about oversized MDS cache?
Hi Paul,

thanks for the explanation. I didn't know about the JSON file yet. That's certainly good to know.

What I still don't understand, though: Why is my service marked inactive/dead? Shouldn't it be running? If I run:

systemctl start ceph-volume@simple-0-6585a10b-917f-4458-a464-b4dd729ef174.service

nothing seems to happen:

===
root@yak1 /etc/ceph/osd # systemctl status ceph-volume@simple-0-6585a10b-917f-4458-a464-b4dd729ef174.service
● ceph-volume@simple-0-6585a10b-917f-4458-a464-b4dd729ef174.service - Ceph Volume activation: simple-0-6585a10b-917f-4458-a464-b4dd729ef174
   Loaded: loaded (/lib/systemd/system/ceph-volume@.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2019-12-05 14:14:08 CET; 2min 13s ago
 Main PID: 27281 (code=exited, status=0/SUCCESS)

Dec 05 14:14:08 yak1 systemd[1]: Starting Ceph Volume activation: simple-0-6585a10b-917f-4458-a464-b4dd729ef174...
Dec 05 14:14:08 yak1 systemd[1]: ceph-volume@simple-0-6585a10b-917f-4458-a464-b4dd729ef174.service: Current command vanished from the unit file, execution of the command
Dec 05 14:14:08 yak1 sh[27281]: Running command: /usr/sbin/ceph-volume simple trigger 0-6585a10b-917f-4458-a464-b4dd729ef174
Dec 05 14:14:08 yak1 systemd[1]: ceph-volume@simple-0-6585a10b-917f-4458-a464-b4dd729ef174.service: Succeeded.
Dec 05 14:14:08 yak1 systemd[1]: Started Ceph Volume activation: simple-0-6585a10b-917f-4458-a464-b4dd729ef174.
===

It says status=0/SUCCESS and in the log "Succeeded". But then again, why is "Started Ceph Volume activation" the last log entry? It sounds like sth. is unfinished. The mount point seems to be mounted perfectly, though:

/dev/sdb1 on /var/lib/ceph/osd/ceph-0 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

Shouldn't that service be running continually?

BR
Ranjan

On 05.12.19 13:25, Paul Emmerich wrote:
> The ceph-volume services make sure that the right partitions are
> mounted at /var/lib/ceph/osd/ceph-X
>
> In "simple" mode the service gets the necessary information from a
> JSON file (long-hex-string.json) in /etc/ceph
>
> ceph-volume simple scan/activate creates the JSON file and the systemd
> unit.
>
> ceph-disk used udev instead for the activation, which was *very* messy
> and a frequent cause of long startup delays (seen > 40 minutes on
> encrypted ceph-disk OSDs)
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Thu, Dec 5, 2019 at 1:03 PM Ranjan Ghosh wrote:
>> Hi all,
>>
>> After upgrading to Ubuntu 19.10 and consequently from Mimic to
>> Nautilus, I had a mini-shock when my OSDs didn't come up. Okay, I
>> should have read the docs more closely, I had to do:
>>
>> # ceph-volume simple scan /dev/sdb1
>> # ceph-volume simple activate --all
>>
>> Hooray. The OSDs came back to life. And I saw that some weird services
>> were created. Didn't give that much thought at first, but later I
>> noticed there is now a new service in town:
>>
>> ===
>> root@yak1 ~ # systemctl status ceph-volume@simple-0-6585a10b-917f-4458-a464-b4dd729ef174.service
>> ● ceph-volume@simple-0-6585a10b-917f-4458-a464-b4dd729ef174.service - Ceph Volume activation: simple-0-6585a10b-917f-4458-a464-b4dd729ef174
>>    Loaded: loaded (/lib/systemd/system/ceph-volume@.service; enabled; vendor preset: enabled)
>>    Active: inactive (dead) since Wed 2019-12-04 23:29:15 CET; 13h ago
>>  Main PID: 10048 (code=exited, status=0/SUCCESS)
>> ===
>>
>> Hmm. It's dead. But my cluster is alive & kicking, though. Everything
>> is working. Why is this needed? Should I be worried? Or can I safely
>> delete that service from /etc/systemd/... since it's not running anyway?
>>
>> Another, probably minor issue:
>>
>> I still get a HEALTH_WARN "1 MDSs report oversized cache". But it
>> doesn't tell me any details and I cannot find anything in the logs.
>> What should I do to resolve this? Set mds_cache_memory_limit? How do I
>> determine an acceptable value?
>>
>> Thank you / Best regards
>>
>> Ranjan
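Paul's per-OSD JSON files in /etc/ceph can be inspected directly. The sketch below uses a *hypothetical, heavily abbreviated* file; the real files under /etc/ceph/osd/ carry more keys, but they are plain JSON and safe to pretty-print or parse:

```shell
# hypothetical miniature of an /etc/ceph/osd/<id>-<uuid>.json file;
# real files created by "ceph-volume simple scan" contain many more keys
cat <<'EOF' > /tmp/0-6585a10b.json
{ "osd_id": 0, "data": { "path": "/dev/sdb1" } }
EOF

# any JSON tool works; python3 ships everywhere
python3 - <<'EOF'
import json

with open("/tmp/0-6585a10b.json") as f:
    meta = json.load(f)

# the mapping that lets the oneshot unit mount the right device at boot
print(meta["osd_id"], meta["data"]["path"])
EOF
```

This is why deleting the service would be a bad idea: without it, nothing re-reads the JSON and re-mounts /var/lib/ceph/osd/ceph-X on the next boot.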
[ceph-users] What does the ceph-volume@simple-crazyhexstuff SystemD service do? And what to do about oversized MDS cache?
Hi all,

After upgrading to Ubuntu 19.10 and consequently from Mimic to Nautilus, I had a mini-shock when my OSDs didn't come up. Okay, I should have read the docs more closely, I had to do:

# ceph-volume simple scan /dev/sdb1
# ceph-volume simple activate --all

Hooray. The OSDs came back to life. And I saw that some weird services were created. Didn't give that much thought at first, but later I noticed there is now a new service in town:

===
root@yak1 ~ # systemctl status ceph-volume@simple-0-6585a10b-917f-4458-a464-b4dd729ef174.service
● ceph-volume@simple-0-6585a10b-917f-4458-a464-b4dd729ef174.service - Ceph Volume activation: simple-0-6585a10b-917f-4458-a464-b4dd729ef174
   Loaded: loaded (/lib/systemd/system/ceph-volume@.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Wed 2019-12-04 23:29:15 CET; 13h ago
 Main PID: 10048 (code=exited, status=0/SUCCESS)
===

Hmm. It's dead. But my cluster is alive & kicking, though. Everything is working. Why is this needed? Should I be worried? Or can I safely delete that service from /etc/systemd/... since it's not running anyway?

Another, probably minor issue:

I still get a HEALTH_WARN "1 MDSs report oversized cache". But it doesn't tell me any details and I cannot find anything in the logs. What should I do to resolve this? Set mds_cache_memory_limit? How do I determine an acceptable value?

Thank you / Best regards

Ranjan
Re: [ceph-users] HEALTH_WARN - 3 modules have failed dependencies
Ah, after researching some more, I think I got hit by this bug: https://github.com/ceph/ceph/pull/25585

At least that's exactly what I see in the logs: "Interpreter change detected - this module can only be loaded into one interpreter per process."

Ceph modules don't seem to work at all with the newest Ubuntu version - only one module can be loaded. Sad :-( Hope this will be fixed soon...

On 30.04.19 21:18, Ranjan Ghosh wrote:

Hi my beloved Ceph list,

After an upgrade from Ubuntu Cosmic to Ubuntu Disco (and the corresponding Ceph packages updated from 13.2.2 to 13.2.4), I now get this when I enter "ceph health":

HEALTH_WARN 3 modules have failed dependencies

"ceph mgr module ls" only reports those 3 modules enabled:

    "enabled_modules": [
        "dashboard",
        "restful",
        "status"
    ],
    ...

Then I found this page here: docs.ceph.com/docs/master/rados/operations/health-checks

Under "MGR_MODULE_DEPENDENCY" it says: "An enabled manager module is failing its dependency check. This health check should come with an explanatory message from the module about the problem."

What is "this health check"? If the page means "ceph health" or "ceph -s" then, no, there is no explanatory message there on what's wrong.

Furthermore, it says: "This health check is only applied to enabled modules. If a module is not enabled, you can see whether it is reporting dependency issues in the output of ceph module ls."

The command "ceph module ls", however, doesn't exist. If "ceph mgr module ls" is really meant, then I get this:

{
    "enabled_modules": [
        "dashboard",
        "restful",
        "status"
    ],
    "disabled_modules": [
        { "name": "balancer", "can_run": true, "error_string": "" },
        { "name": "hello", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "influx", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "iostat", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "localpool", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "prometheus", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "selftest", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "smart", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "telegraf", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "telemetry", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "zabbix", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." }
    ]
}

Usually the Ceph documentation is great, very detailed and helpful. But I can find nothing on how to resolve this problem. Any help is much appreciated.

Thank you / Best regards

Ranjan
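The long `ceph mgr module ls` output above can be boiled down to just the failing modules. A sketch that filters a *trimmed-down sample* shaped like that output; on a live cluster the same loop would run over the real JSON instead of this embedded string:

```shell
python3 - <<'EOF'
import json

# trimmed-down sample in the shape of `ceph mgr module ls` output
sample = '''{
  "enabled_modules": ["dashboard", "restful", "status"],
  "disabled_modules": [
    {"name": "balancer", "can_run": true, "error_string": ""},
    {"name": "iostat", "can_run": false,
     "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process."}
  ]
}'''

out = json.loads(sample)
# print only the modules that report a dependency problem
for mod in out["disabled_modules"]:
    if not mod["can_run"]:
        print(mod["name"], "->", mod["error_string"])
EOF
```

With the full output from this mail, the loop would list all ten modules carrying the "Interpreter change detected" error while skipping balancer.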
[ceph-users] HEALTH_WARN - 3 modules have failed dependencies
Hi my beloved Ceph list,

After an upgrade from Ubuntu Cosmic to Ubuntu Disco (and the corresponding Ceph packages updated from 13.2.2 to 13.2.4), I now get this when I enter "ceph health":

HEALTH_WARN 3 modules have failed dependencies

"ceph mgr module ls" only reports those 3 modules enabled:

    "enabled_modules": [
        "dashboard",
        "restful",
        "status"
    ],
    ...

Then I found this page here: docs.ceph.com/docs/master/rados/operations/health-checks

Under "MGR_MODULE_DEPENDENCY" it says: "An enabled manager module is failing its dependency check. This health check should come with an explanatory message from the module about the problem."

What is "this health check"? If the page means "ceph health" or "ceph -s" then, no, there is no explanatory message there on what's wrong.

Furthermore, it says: "This health check is only applied to enabled modules. If a module is not enabled, you can see whether it is reporting dependency issues in the output of ceph module ls."

The command "ceph module ls", however, doesn't exist. If "ceph mgr module ls" is really meant, then I get this:

{
    "enabled_modules": [
        "dashboard",
        "restful",
        "status"
    ],
    "disabled_modules": [
        { "name": "balancer", "can_run": true, "error_string": "" },
        { "name": "hello", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "influx", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "iostat", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "localpool", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "prometheus", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "selftest", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "smart", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "telegraf", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "telemetry", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." },
        { "name": "zabbix", "can_run": false, "error_string": "Interpreter change detected - this module can only be loaded into one interpreter per process." }
    ]
}

Usually the Ceph documentation is great, very detailed and helpful. But I can find nothing on how to resolve this problem. Any help is much appreciated.

Thank you / Best regards

Ranjan
Re: [ceph-users] Urgent: Reduced data availability / All pgs inactive
Wow. Thank you so much, Irek! Your help saved me from a lot of trouble... It turned out to be indeed a firewall issue. Port 6800 in one direction wasn't open.

On 21.02.19 07:05, Irek Fasikhov wrote:
> Hi,
>
> You have problems with MGR.
> http://docs.ceph.com/docs/master/rados/operations/pg-states/
> "The ceph-mgr hasn't yet received any information about the PG's state
> from an OSD since mgr started up."
>
> On Wed, 20 Feb 2019 at 23:10, Ranjan Ghosh wrote:
>> Hi all,
>>
>> hope someone can help me. After restarting a node of my 2-node-cluster
>> suddenly I get this:
>>
>> root@yak2 /var/www/projects # ceph -s
>>   cluster:
>>     id:     749b2473-9300-4535-97a6-ee6d55008a1b
>>     health: HEALTH_WARN
>>             Reduced data availability: 200 pgs inactive
>>
>>   services:
>>     mon: 3 daemons, quorum yak1,yak2,yak0
>>     mgr: yak0.planwerk6.de(active), standbys: yak1.planwerk6.de, yak2.planwerk6.de
>>     mds: cephfs-1/1/1 up {0=yak1.planwerk6.de=up:active}, 1 up:standby
>>     osd: 2 osds: 2 up, 2 in
>>
>>   data:
>>     pools:   2 pools, 200 pgs
>>     objects: 0 objects, 0 B
>>     usage:   0 B used, 0 B / 0 B avail
>>     pgs:     100.000% pgs unknown
>>              200 unknown
>>
>> And this:
>>
>> root@yak2 /var/www/projects # ceph health detail
>> HEALTH_WARN Reduced data availability: 200 pgs inactive
>> PG_AVAILABILITY Reduced data availability: 200 pgs inactive
>>     pg 1.34 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.35 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.36 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.37 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.38 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.39 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3a is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3b is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3c is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3d is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3e is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3f is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.40 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.41 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.42 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.43 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.44 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.45 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.46 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.47 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.48 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.49 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.4a is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.4b is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.4c is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.4d is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.34 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.35 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.36 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.38 is stuck inactive for 3506.815664
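The root cause here was an unopened port 6800 (the range Ceph OSDs and mgrs listen on; mons use 3300/6789). A small sketch of the reachability check that would have caught it; the host and port are placeholders, and 127.0.0.1 is used only so the snippet is self-contained:

```shell
python3 - <<'EOF'
import socket

# placeholders: point these at the peer node and the suspect port,
# e.g. ("yak1", 6800)
host, port = "127.0.0.1", 6800

s = socket.socket()
s.settimeout(3)
try:
    s.connect((host, port))
    print(f"port {port} reachable")
except OSError:
    print(f"port {port} unreachable")
finally:
    s.close()
EOF
```

Running the same check in both directions between the nodes (each daemon must be able to dial the other side) narrows a "pgs unknown" situation down to the firewall quickly.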
[ceph-users] Urgent: Reduced data availability / All pgs inactive
Hi all,

hope someone can help me. After restarting a node of my 2-node-cluster suddenly I get this:

root@yak2 /var/www/projects # ceph -s
  cluster:
    id:     749b2473-9300-4535-97a6-ee6d55008a1b
    health: HEALTH_WARN
            Reduced data availability: 200 pgs inactive

  services:
    mon: 3 daemons, quorum yak1,yak2,yak0
    mgr: yak0.planwerk6.de(active), standbys: yak1.planwerk6.de, yak2.planwerk6.de
    mds: cephfs-1/1/1 up {0=yak1.planwerk6.de=up:active}, 1 up:standby
    osd: 2 osds: 2 up, 2 in

  data:
    pools:   2 pools, 200 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             200 unknown

And this:

root@yak2 /var/www/projects # ceph health detail
HEALTH_WARN Reduced data availability: 200 pgs inactive
PG_AVAILABILITY Reduced data availability: 200 pgs inactive
    pg 1.34 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.35 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.36 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.37 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.38 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.39 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3a is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3b is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3c is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3d is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3e is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3f is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.40 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.41 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.42 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.43 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.44 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.45 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.46 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.47 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.48 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.49 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.4a is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.4b is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.4c is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.4d is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.34 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.35 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.36 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.38 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.39 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3a is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3b is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3c is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3d is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3e is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3f is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.40 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.41 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.42 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.43 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.44 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.45 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.46 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.47 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.48 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.49 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.4a is stuck inactive for 3506.815664, current state unknown, last acting []
[ceph-users] ceph pg dump
Hi all,

we have two small clusters (3 nodes each) called alpha and beta. One node (alpha0/beta0) is on a remote site and only has a monitor & manager. The two other nodes (alpha/beta-1/2) have all 4 services, contain the OSDs, and are connected via an internal network. In short:

alpha0 -- alpha1--alpha2
beta0 -- beta1--beta2

For a few weeks now, I cannot run "ceph pg stat" or "ceph pg dump" or anything of the like on alpha0. It works flawlessly on all other nodes, including beta0. When I start such a command on alpha0, it just hangs forever. I wonder what could be the reason? Could it be some firewall issue? I see nothing in the logs. Any ideas on how to debug this? I assume it's not necessary to run this command on a node with OSDs, right, because it works on beta0? I could swear it worked on alpha0 as well for many months... I wonder what happened. Weird.

Thank you,
Ranjan
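`ceph pg stat` and `ceph pg dump` are answered by the active mgr, so a client that can reach the mons but not the mgr hangs in exactly this way. A hedged sketch of first checks to run from alpha0 (these are standard ceph CLI options; the grep fields appear in `mgr dump` output):

```shell
# fail fast instead of hanging forever
ceph --connect-timeout 15 pg stat

# find out where the active mgr lives -- that is the address
# (and port range 6800-7300) alpha0 must be able to reach
ceph --connect-timeout 15 mgr dump | grep -E '"active_name"|"active_addr"'
```

If the first command times out while plain `ceph -s` works, a firewall between alpha0 and the active mgr is the prime suspect, mirroring the port-6800 issue from the earlier thread.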
Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2
Hi Ronny,

Thanks for the detailed answer. It's much appreciated! I will keep this in the back of my mind, but for now the cost is prohibitive, as we're using these servers not as storage-only boxes but as full-fledged servers (i.e. Ceph is mounted locally; there's a webserver and a database). And 2 servers can be connected with a cross-link cable; 3 servers would require a switch and so on. It adds up quite quickly if you are really on a tight budget. Sometimes it's not so easy to advocate for new hardware when the benefits are not apparent to everyone :-)

In addition, a reason why we're using Ceph in the first place is that we can do easy maintenance: one server keeps running, and the other catches up as soon as it comes back online. With 2/2 we'd lose exactly that - so it's a no-go. Of course, if the second node goes down as well, we have a problem, but OTOH: no new changes will happen then, since no writes will happen anyway. And in addition, both servers are equipped with hardware RAID and a BBU. In combination with our solid backup, I'm currently willing to take the risk. If we grow further, we might want to look at the 3/2 solution, though.

Thanks again for letting me know about the underlying reasons!

Best regards,
Ranjan

On 25.04.2018 19:40, Ronny Aasen wrote:
> the difference in cost between 2 and 3 servers is not HUGE. but the
> reliability difference between a size 2/1 pool and a 3/2 pool is
> massive. a 2/1 pool is just a single fault during maintenance away from
> data loss, while you need multiple simultaneous faults, and very bad
> luck, to break a 3/2 pool.
>
> I would recommend rather using 2/2 pools if you are willing to accept a
> little downtime when a disk dies. cluster io would stop until the disks
> backfill to cover for the lost disk. but it is better than having
> inconsistent pg's or data loss because a disk crashed during a routine
> reboot, or 2 disks died.
>
> also worth reading this link:
> https://www.spinics.net/lists/ceph-users/msg32895.html
> a good explanation.
>
> ... you have good backups and are willing to restore the whole pool.
> And it is of course your privilege to run 2/1 pools, but be mindful of
> the risks of doing so.
>
> kind regards
> Ronny Aasen
>
> BTW: i did not know ubuntu automagically rebooted after an upgrade.
> you can probably avoid that reboot somehow in ubuntu, and do the
> restarts of services manually, if you wish to maintain service during
> the upgrade.
>
> On 25.04.2018 11:52, Ranjan Ghosh wrote:
>> Thanks a lot for your detailed answer. The problem for us, however,
>> was that we use the Ceph packages that come with the Ubuntu
>> distribution. If you do an Ubuntu upgrade, all packages are upgraded
>> in one go and the server is rebooted. You cannot influence anything or
>> start/stop services one by one etc. This was concerning me, because
>> the upgrade instructions didn't mention anything about an alternative
>> or what to do in this case. But someone here enlightened me that - in
>> general - it all doesn't matter that much *if you are just accepting a
>> downtime*. And, indeed, it all worked nicely. We stopped all services
>> on all servers, upgraded the Ubuntu version, rebooted all servers and
>> were ready to go again. Didn't encounter any problems there. The only
>> problem turned out to be our own fault and simply a firewall
>> misconfiguration.
>>
>> And, yes, we're running "size: 2, min_size: 1" because we're on a very
>> tight budget. If I understand correctly, this means: changes to files
>> are made on one server and *eventually* copied to the other server. I
>> hope this *eventually* means after a few minutes. Up until now I've
>> never experienced *any* problems with file integrity with this
>> configuration. In fact, Ceph is incredibly stable. Amazing. I have
>> never ever had any issues whatsoever with broken files/partially
>> written files, files that contain garbage etc. Even after
>> starting/stopping services, rebooting etc. With GlusterFS and other
>> cluster file systems I've experienced many such problems over the
>> years, so this is what makes Ceph so great.
>>
>> I have now a lot of trust in Ceph, that it will eventually repair
>> everything :-) And: if a file that was written a few seconds ago is
>> really lost, it wouldn't be that bad for our use case. It's a web
>> server. The most important stuff is in the DB. We have hourly backups
>> of everything. In a huge emergency, we could even restore the backup
>> from an hour ago if we really had to. Not nice, but if it happens
>> every 6 years or so due to some freak hardware failure, I think it is
>> manageable. I accept it's not the recommended/perfect solution if you
>> have infinite amounts of money at your hands, but in our case, I think
>> it's not extremely audacious either to do it like this, right?
>>
>> On 11.04.2018 19:25, Ronny Aasen wrote:
>>> ceph upgrades are usually not a problem: ceph has to be upgraded in
>>> the right order. normally when each service is on its own machine
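Ronny's 2/2 suggestion comes down to two pool settings. A hedged sketch, with "mypool" standing in as a placeholder pool name:

```shell
# move a pool from size=2/min_size=1 to the suggested 2/2
# ("mypool" is a placeholder -- list real names with `ceph osd pool ls`)
ceph osd pool set mypool size 2
ceph osd pool set mypool min_size 2

# verify
ceph osd pool get mypool size
ceph osd pool get mypool min_size
```

With min_size=2, I/O pauses instead of accepting writes a single surviving disk could lose, which is exactly the downtime-for-safety trade-off discussed above.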
Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2
Thanks a lot for your detailed answer. The problem for us, however, was that we use the Ceph packages that come with the Ubuntu distribution. If you do an Ubuntu upgrade, all packages are upgraded in one go and the server is rebooted. You cannot influence anything or start/stop services one by one etc. This was concerning me, because the upgrade instructions didn't mention anything about an alternative or what to do in this case. But someone here enlightened me that - in general - it all doesn't matter that much *if you are just accepting a downtime*. And, indeed, it all worked nicely. We stopped all services on all servers, upgraded the Ubuntu version, rebooted all servers and were ready to go again. Didn't encounter any problems there. The only problem turned out to be our own fault and simply a firewall misconfiguration. And, yes, we're running "size:2 min_size:1" because we're on a very tight budget. If I understand correctly, this means: make changes to files on one server; *eventually* copy them to the other server. I hope this *eventually* means after a few minutes. Up until now I've never experienced *any* problems with file integrity with this configuration. In fact, Ceph is incredibly stable. Amazing. I have never ever had any issues whatsoever with broken files, partially written files, files that contain garbage etc. Even after starting/stopping services, rebooting etc. With GlusterFS and other cluster file systems I've experienced many such problems over the years, so this is what makes Ceph so great. I now have a lot of trust in Ceph, that it will eventually repair everything :-) And: if a file that was written a few seconds ago is really lost, it wouldn't be that bad for our use case. It's a web server. The most important stuff is in the DB. We have hourly backups of everything. In a huge emergency, we could even restore the backup from an hour ago if we really had to.
Not nice, but if it happens every 6 years or so due to some freak hardware failure, I think it is manageable. I accept it's not the recommended/perfect solution if you have infinite amounts of money at your hands, but in our case, I think it's not extremely audacious either to do it like this, right?

On 11.04.2018 at 19:25, Ronny Aasen wrote: Ceph upgrades are usually not a problem, but Ceph has to be upgraded in the right order. Normally, when each service is on its own machine, this is not difficult. But when you have mon, mgr, osd, mds and clients on the same host, you have to do it a bit carefully. I tend to have a terminal open with "watch ceph -s" running, and I never do another service until the health is OK again. First, apt upgrade the packages on all the hosts. This only updates the software on disk, not the running services. Then do the restarts of the services in the right order, and only on one host at a time.

Mons: first restart the mon service on all mon-running hosts. All 3 mons are active at the same time, so there is no "shifting around", but make sure the quorum is OK again before you do the next mon.

Mgr: then restart the mgr on all hosts that run a mgr. There is only one active mgr at a time, so here there will be a bit of shifting around, but it is only for statistics/management, so it may affect your ceph -s command, not the cluster operation.

OSDs: restart osd processes one OSD at a time, and make sure the cluster is healthy again before doing the next osd process. Do this for all hosts that have OSDs.

MDS: restart the MDSs one at a time. You will notice the standby MDS taking over for the MDS that was restarted. Do both.

Clients: restart clients; that means remount filesystems, migrate or restart VMs, or restart whatever process uses the old Ceph libraries.

About pools: since you only have 2 OSDs, you can obviously not be running the recommended 3-replica pools? This makes me worry that you may be running size:2 min_size:1 pools
and are running a daily risk of data loss due to corruption and inconsistencies, especially when you restart OSDs. If your pools are size:2 min_size:2, then your cluster will fail when any OSD is restarted, until the OSD is up and healthy again - but you have less chance of data loss than with 2/1 pools. If you added an OSD on a third host, you could run size:3 min_size:2 - the recommended config, where you have both redundancy and high availability. Kind regards, Ronny Aasen

On 11.04.2018 17:42, Ranjan Ghosh wrote: Ah, nevermind, we've solved it. It was a firewall issue. The only thing that's weird is that it became an issue immediately after an update. Perhaps it has something to do with monitor nodes shifting around or anything. Well, thanks again for your quick support, though. It's much appreciated. BR Ranjan

On 11.04.2018 at 17:07, Ranjan Ghosh wrote: Thank you for your answer. Do you have an
Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2
Ah, nevermind, we've solved it. It was a firewall issue. The only thing that's weird is that it became an issue immediately after an update. Perhaps it has something to do with monitor nodes shifting around or anything. Well, thanks again for your quick support, though. It's much appreciated. BR Ranjan

On 11.04.2018 at 17:07, Ranjan Ghosh wrote: Thank you for your answer. Do you have any specifics on which thread you're talking about? I would be very interested to read about a success story, because I fear that if I update the other node, the whole cluster comes down.

On 11.04.2018 at 10:47, Marc Roos wrote: I think you have to update all OSDs, mons etc. I remember running into a similar issue. You should be able to find more about this in the mailing list archive.

-----Original Message-----
From: Ranjan Ghosh [mailto:gh...@pw6.de]
Sent: Wednesday, 11 April 2018 16:02
To: ceph-users
Subject: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

Hi all, we have a two-node cluster (with a third "monitoring-only" node). Over the last months, everything ran *perfectly* smoothly. Today, I did an Ubuntu "apt-get upgrade" on one of the two servers. Among others, the Ceph packages were upgraded from 12.2.1 to 12.2.2. A minor release update, one might think. But, to my surprise, after restarting the services, Ceph is now in a degraded state :-( (see below). Only the first node - which is still on 12.2.1 - seems to be running. I did a bit of research and found this: https://ceph.com/community/new-luminous-pg-overdose-protection/ I did set "mon_max_pg_per_osd = 300" to no avail. I don't know if this is the problem at all. Looking at the status, it seems we have 264 PGs, right? When I enter "ceph osd df" (which I found on another website claiming it should print the number of PGs per OSD), it just hangs (I need to abort it with Ctrl+C). I hope anybody can help me. The cluster now works with the single node, but it is definitely quite worrying because we don't have redundancy.
Thanks in advance, Ranjan

root@tukan2 /var/www/projects # ceph -s
  cluster:
    id:     19895e72-4a0c-4d5d-ae23-7f631ec8c8e4
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            Reduced data availability: 264 pgs inactive
            Degraded data redundancy: 264 pgs unclean
  services:
    mon: 3 daemons, quorum tukan1,tukan2,tukan0
    mgr: tukan0(active), standbys: tukan2
    mds: cephfs-1/1/1 up {0=tukan2=up:active}
    osd: 2 osds: 2 up, 2 in
  data:
    pools:   3 pools, 264 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:     100.000% pgs unknown

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2
Thank you for your answer. Do you have any specifics on which thread you're talking about? I would be very interested to read about a success story, because I fear that if I update the other node, the whole cluster comes down.

On 11.04.2018 at 10:47, Marc Roos wrote: I think you have to update all OSDs, mons etc. I remember running into a similar issue. You should be able to find more about this in the mailing list archive.

-----Original Message-----
From: Ranjan Ghosh [mailto:gh...@pw6.de]
Sent: Wednesday, 11 April 2018 16:02
To: ceph-users
Subject: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

Hi all, we have a two-node cluster (with a third "monitoring-only" node). Over the last months, everything ran *perfectly* smoothly. Today, I did an Ubuntu "apt-get upgrade" on one of the two servers. Among others, the Ceph packages were upgraded from 12.2.1 to 12.2.2. A minor release update, one might think. But, to my surprise, after restarting the services, Ceph is now in a degraded state :-( (see below). Only the first node - which is still on 12.2.1 - seems to be running. I did a bit of research and found this: https://ceph.com/community/new-luminous-pg-overdose-protection/ I did set "mon_max_pg_per_osd = 300" to no avail. I don't know if this is the problem at all. Looking at the status, it seems we have 264 PGs, right? When I enter "ceph osd df" (which I found on another website claiming it should print the number of PGs per OSD), it just hangs (I need to abort it with Ctrl+C). I hope anybody can help me. The cluster now works with the single node, but it is definitely quite worrying because we don't have redundancy.
Thanks in advance, Ranjan

root@tukan2 /var/www/projects # ceph -s
  cluster:
    id:     19895e72-4a0c-4d5d-ae23-7f631ec8c8e4
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            Reduced data availability: 264 pgs inactive
            Degraded data redundancy: 264 pgs unclean
  services:
    mon: 3 daemons, quorum tukan1,tukan2,tukan0
    mgr: tukan0(active), standbys: tukan2
    mds: cephfs-1/1/1 up {0=tukan2=up:active}
    osd: 2 osds: 2 up, 2 in
  data:
    pools:   3 pools, 264 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:     100.000% pgs unknown
[ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2
Hi all, we have a two-node cluster (with a third "monitoring-only" node). Over the last months, everything ran *perfectly* smoothly. Today, I did an Ubuntu "apt-get upgrade" on one of the two servers. Among others, the Ceph packages were upgraded from 12.2.1 to 12.2.2. A minor release update, one might think. But, to my surprise, after restarting the services, Ceph is now in a degraded state :-( (see below). Only the first node - which is still on 12.2.1 - seems to be running. I did a bit of research and found this: https://ceph.com/community/new-luminous-pg-overdose-protection/ I did set "mon_max_pg_per_osd = 300" to no avail. I don't know if this is the problem at all. Looking at the status, it seems we have 264 PGs, right? When I enter "ceph osd df" (which I found on another website claiming it should print the number of PGs per OSD), it just hangs (I need to abort it with Ctrl+C). I hope anybody can help me. The cluster now works with the single node, but it is definitely quite worrying because we don't have redundancy. Thanks in advance, Ranjan

root@tukan2 /var/www/projects # ceph -s
  cluster:
    id:     19895e72-4a0c-4d5d-ae23-7f631ec8c8e4
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            Reduced data availability: 264 pgs inactive
            Degraded data redundancy: 264 pgs unclean
  services:
    mon: 3 daemons, quorum tukan1,tukan2,tukan0
    mgr: tukan0(active), standbys: tukan2
    mds: cephfs-1/1/1 up {0=tukan2=up:active}
    osd: 2 osds: 2 up, 2 in
  data:
    pools:   3 pools, 264 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:     100.000% pgs unknown
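For reference, the PG-overdose setting mentioned above lives in ceph.conf on the monitor nodes. This is a sketch only - whether this limit is actually the cause of the degradation here was never confirmed in the thread - but the arithmetic is suggestive: assuming size-2 pools, 264 PGs across only 2 OSDs means roughly 264 PGs per OSD, above Luminous' default limit of 200.

```ini
[global]
# Luminous (12.2.1+) refuses to activate PGs when the PGs-per-OSD count
# exceeds this limit (default 200). With 264 PGs on a 2-OSD size-2
# cluster, each OSD carries ~264 PGs, so the default is exceeded.
mon_max_pg_per_osd = 300
```

After changing it, the monitors need a restart (or the value injected at runtime) for it to take effect.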
[ceph-users] Ubuntu upgrade Zesty => Aardvark, Implications for Ceph?
Hi everyone, in January, support for Ubuntu Zesty runs out and we're planning to upgrade our servers to Aardvark. We have a two-node cluster (and one additional monitoring-only server) and we're using the packages that come with the distro. We have mounted CephFS on the same servers with the kernel client in fstab. AFAIK, Aardvark includes Ceph 12.0. What would happen if we used the usual "do-release-upgrade" to upgrade the servers one by one? I assume the procedure described here "http://ceph.com/releases/v12-2-0-luminous-released/" (section "Upgrade from Jewel or Kraken") probably won't work for us, because "do-release-upgrade" will upgrade all packages (including the Ceph ones) at once and then reboot the machine. So we cannot really upgrade only the monitor nodes first. And I'd rather avoid switching to PPAs beforehand. So, what are the real consequences if we upgrade all servers one by one with "do-release-upgrade" and then reboot all the nodes? Is it only the downtime why this isn't recommended, or do we lose data? Any other recommendations on how to tackle this? Thank you / BR Ranjan
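The restart order recommended elsewhere in this archive (mons, then mgrs, then OSDs, then MDSs, then clients) can be sketched as a dry-run shell snippet. This only *prints* the systemctl commands rather than executing them; the `.target` unit names are assumptions based on the standard Ceph systemd units, and on a real cluster you would run each command by hand and wait for HEALTH_OK in "watch ceph -s" between every single restart.

```shell
# Dry run: print the per-host service restart order for a staged upgrade.
# Nothing is restarted here; pipe individual lines to sh manually, one at
# a time, verifying cluster health between steps.
restart_order() {
    for unit in ceph-mon ceph-mgr ceph-osd ceph-mds; do
        printf 'systemctl restart %s.target   # wait for HEALTH_OK before the next step\n' "$unit"
    done
}
restart_order
```

On a distro upgrade that replaces all packages at once, this ordering only helps if the daemons are prevented from auto-restarting until you are ready, which is exactly the limitation discussed in this thread.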
Re: [ceph-users] Blocked requests problem
Hm. That's quite weird. On our cluster, when I set "noscrub" and "nodeep-scrub", scrubbing always stops pretty quickly (within a few minutes). I wonder why this doesn't happen on your cluster. When exactly did you set the flags? Perhaps it just needs some more time... Or there might be a disk problem, which is why the scrubbing never finishes. Perhaps it's really a good idea, just like you proposed, to shut down the corresponding OSDs. But that's just my thoughts. Perhaps some Ceph pro can shed some light on the possible reasons why a scrub might get stuck and how to resolve this.

On 22.08.2017 at 18:58, Ramazan Terzi wrote: Hi Ranjan, thanks for your reply. I did set the noscrub and nodeep-scrub flags, but the active scrubbing operation isn't finishing. The scrub is always on the same PG (20.1e).

$ ceph pg dump | grep scrub
dumped all in format plain
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
20.1e 25189 0 0 0 0 98359116362 3048 3048 active+clean+scrubbing 2017-08-21 04:55:13.354379 6930'2393 6930:20949058 [29,31,3] 29 [29,31,3] 29 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792

$ ceph -s
    cluster
     health HEALTH_WARN
            33 requests are blocked > 32 sec
            noscrub,nodeep-scrub flag(s) set
     monmap e9: 3 mons at {ceph-mon01=**:6789/0,ceph-mon02=**:6789/0,ceph-mon03=**:6789/0}
            election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
     osdmap e6930: 36 osds: 36 up, 36 in
            flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
      pgmap v17667617: 1408 pgs, 5 pools, 24779 GB data, 6494 kobjects
            70497 GB used, 127 TB / 196 TB avail
            1407 active+clean
               1 active+clean+scrubbing

Thanks, Ramazan

On 22 Aug 2017, at 18:52, Ranjan Ghosh wrote: Hi Ramazan, I'm no Ceph expert, but what I can say from my experience using Ceph is: 1) During "scrubbing", Ceph can be extremely slow. This is probably where your "blocked requests" are coming from.
BTW: perhaps you can even find out which processes are currently blocked with: ps aux | grep "D". You might even want to kill some of those and/or shut down services in order to relieve some stress from the machine until it recovers. 2) I usually have the following in my ceph.conf. This lets the scrubbing only run between midnight and 6 AM (hopefully the time of least demand; adjust as necessary) - and with the lowest priority:

# Reduce impact of scrub.
osd_disk_thread_ioprio_priority = 7
osd_disk_thread_ioprio_class = "idle"
osd_scrub_end_hour = 6

3) The scrubbing begin and end hours will always work. The low-priority mode, however, works (AFAIK!) only with the CFQ I/O scheduler. Show your current scheduler like this (replace sda with your device): cat /sys/block/sda/queue/scheduler You can also echo to this file to set a different scheduler. With these settings you can perhaps alleviate the problem enough that the scrubbing runs over many nights until it finishes. Again, AFAIK, it doesn't have to finish in one night; it will continue the next night and so on. The Ceph experts say scrubbing is important. I don't know why, but I just believe them. They've built this complex stuff, after all :-) Thus, you can use "noscrub"/"nodeep-scrub" to quickly get a hung server back to work, but you should not let it run like this forever and a day. Hope this helps at least a bit. BR, Ranjan

On 22.08.2017 at 15:20, Ramazan Terzi wrote: Hello, I have a Ceph cluster with the specifications below: 3 x monitor nodes, 6 x storage nodes (6 disks per storage node, 6 TB SATA disks, all disks have SSD journals). Distributed public and private networks; all NICs are 10 Gbit/s. osd pool default size = 3, osd pool default min size = 2. Ceph version is Jewel 10.2.6. My cluster is active and a lot of virtual machines are running on it (Linux and Windows VMs, database clusters, web servers etc.). During normal use, the cluster slowly went into a state of blocked requests. Blocked requests are periodically incrementing.
All OSDs seem healthy. Benchmarks, iowait, network tests - all of them succeed.

Yesterday, 08:00:
$ ceph health detail
HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests
1 ops are blocked > 134218 sec on osd.31
1 ops are blocked > 134218 sec on osd.3
1 ops are blocked > 8388.61 sec on osd.29
3 osds have slow requests

Today, 16:05:
$ ceph health detail
HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests
1 ops are blocked > 134218 sec on osd.31
1 ops are
Re: [ceph-users] Blocked requests problem
Hi Ramazan, I'm no Ceph expert, but what I can say from my experience using Ceph is: 1) During "scrubbing", Ceph can be extremely slow. This is probably where your "blocked requests" are coming from. BTW: perhaps you can even find out which processes are currently blocked with: ps aux | grep "D". You might even want to kill some of those and/or shut down services in order to relieve some stress from the machine until it recovers. 2) I usually have the following in my ceph.conf. This lets the scrubbing only run between midnight and 6 AM (hopefully the time of least demand; adjust as necessary) - and with the lowest priority:

# Reduce impact of scrub.
osd_disk_thread_ioprio_priority = 7
osd_disk_thread_ioprio_class = "idle"
osd_scrub_end_hour = 6

3) The scrubbing begin and end hours will always work. The low-priority mode, however, works (AFAIK!) only with the CFQ I/O scheduler. Show your current scheduler like this (replace sda with your device): cat /sys/block/sda/queue/scheduler You can also echo to this file to set a different scheduler. With these settings you can perhaps alleviate the problem enough that the scrubbing runs over many nights until it finishes. Again, AFAIK, it doesn't have to finish in one night; it will continue the next night and so on. The Ceph experts say scrubbing is important. I don't know why, but I just believe them. They've built this complex stuff, after all :-) Thus, you can use "noscrub"/"nodeep-scrub" to quickly get a hung server back to work, but you should not let it run like this forever and a day. Hope this helps at least a bit. BR, Ranjan

On 22.08.2017 at 15:20, Ramazan Terzi wrote: Hello, I have a Ceph cluster with the specifications below: 3 x monitor nodes, 6 x storage nodes (6 disks per storage node, 6 TB SATA disks, all disks have SSD journals). Distributed public and private networks; all NICs are 10 Gbit/s. osd pool default size = 3, osd pool default min size = 2. Ceph version is Jewel 10.2.6.
My cluster is active and a lot of virtual machines are running on it (Linux and Windows VMs, database clusters, web servers etc.). During normal use, the cluster slowly went into a state of blocked requests. Blocked requests are periodically incrementing. All OSDs seem healthy. Benchmarks, iowait, network tests - all of them succeed.

Yesterday, 08:00:
$ ceph health detail
HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests
1 ops are blocked > 134218 sec on osd.31
1 ops are blocked > 134218 sec on osd.3
1 ops are blocked > 8388.61 sec on osd.29
3 osds have slow requests

Today, 16:05:
$ ceph health detail
HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests
1 ops are blocked > 134218 sec on osd.31
1 ops are blocked > 134218 sec on osd.3
16 ops are blocked > 134218 sec on osd.29
11 ops are blocked > 67108.9 sec on osd.29
2 ops are blocked > 16777.2 sec on osd.29
1 ops are blocked > 8388.61 sec on osd.29
3 osds have slow requests

$ ceph pg dump | grep scrub
dumped all in format plain
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
20.1e 25183 0 0 0 0 98332537930 3066 3066 active+clean+scrubbing 2017-08-21 04:55:13.354379 6930'23908781 6930:20905696 [29,31,3] 29 [29,31,3] 29 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792

The active scrub does not finish (it has been running for about 24 hours). I did not restart any OSD in the meantime. I'm thinking of setting the noscrub, nodeep-scrub, norebalance, nobackfill and norecover flags and restarting OSDs 3, 29 and 31. Will this solve my problem? Or does anyone have a suggestion about this problem? Thanks, Ramazan
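Beyond toggling the noscrub/nodeep-scrub flags at runtime, the scrub impact can be bounded permanently in ceph.conf. The following is a sketch using Jewel-era option names, consolidating the settings discussed above; the values are illustrative, and the ioprio options only take effect with the CFQ scheduler, as noted in the reply:

```ini
[osd]
# At most one scrub per OSD at a time (Jewel default).
osd_max_scrubs = 1
# Confine scrubbing to a low-traffic window, midnight to 6 AM.
osd_scrub_begin_hour = 0
osd_scrub_end_hour = 6
# Run the disk thread at idle I/O priority (requires the CFQ scheduler).
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7
```

Note that none of these settings will unstick a scrub that is already hung on one PG, which is the actual symptom in this thread; they only reduce the routine impact of scrubbing.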
[ceph-users] WBThrottle
Hi Ceph gurus, I've got the following problem with our Ceph installation (Jewel): there are various websites served from the CephFS mount. Sometimes, when I copy many new (large?) files onto this mount, it seems that after a certain delay everything grinds to a halt. No websites are served; processes are in D state; probably until Ceph has written everything to disk. Then, after a while, everything recovers. Obviously, it would be great if I could tune some values to make the experience more "even", i.e. it can be a bit slower in general, but OTOH without such huge "spikes" in performance... Now, first, I discovered there is a "filestore flusher" documented here: http://docs.ceph.com/docs/jewel/rados/configuration/filestore-config-ref/?highlight=flusher Weirdly, when I use ceph --admin-daemon /bla/bla config show, I cannot see anything about this config option. Does it still exist? Then I found this somewhat cryptic page: http://docs.ceph.com/docs/jewel/dev/osd_internals/wbthrottle/ It says: "The flusher was not an adequate solution to this problem since it forced writeback of small writes too eagerly killing performance." Perhaps the "filestore flusher" was removed? But why is it still documented? On the other hand, "config show" lists many "wbthrottle" options:

"filestore_wbthrottle_enable": "true",
"filestore_wbthrottle_xfs_bytes_hard_limit": "419430400",
"filestore_wbthrottle_xfs_bytes_start_flusher": "41943040",
"filestore_wbthrottle_xfs_inodes_hard_limit": "5000",
"filestore_wbthrottle_xfs_inodes_start_flusher": "500",
"filestore_wbthrottle_xfs_ios_hard_limit": "5000",
"filestore_wbthrottle_xfs_ios_start_flusher": "500",

I couldn't find them documented under docs.ceph.com; however, they are documented here: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/configuration_guide/file_store_configuration_reference Quite confusing! Now, I wonder: could/should I modify (raise/lower) some of these values (we're using XFS)?
Should I perhaps disable the WBThrottle altogether for my use case? Thank you, Ranjan
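One direction that follows from the question above - throttling *earlier* rather than disabling - is to lower the xfs "start_flusher" thresholds so background flushing kicks in sooner, smoothing the write-back spikes at the cost of steady-state throughput. A hedged sketch, simply halving the defaults shown in the "config show" output above; the right values depend entirely on the disk backend and workload, and this was not validated in the thread:

```ini
[osd]
# Begin background flushing after half as much dirty data as the default
# (defaults taken from the config show listing above: 41943040 / 500 / 500).
filestore_wbthrottle_xfs_bytes_start_flusher = 20971520
filestore_wbthrottle_xfs_ios_start_flusher = 250
filestore_wbthrottle_xfs_inodes_start_flusher = 250
```

Setting filestore_wbthrottle_enable = false would instead let dirty data accumulate until the page cache flushes it, which tends to make the stalls larger and less frequent rather than eliminating them.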
Re: [ceph-users] num_caps
Ah, I understand it much better now. Thank you so much for explaining. I hope/assume the caps don't prevent other clients from accessing the stuff in some way, right? +1, though, for the idea of being able to specify a timeout. We have an rsync backup job which runs over the whole filesystem every few hours to do an incremental backup. If you have only one cronjob like that, you consequently have caps on all files - permanently - up to the defined MDS cache size. Even if the backup is finished after a few minutes. Which I think goes a bit overboard just for a backup, even though you say the actual performance impact of all those caps is not that bad... :-/

On 15.05.2017 at 14:49, John Spray wrote: On Mon, May 15, 2017 at 1:36 PM, Henrik Korkuc wrote: On 17-05-15 13:40, John Spray wrote: On Mon, May 15, 2017 at 10:40 AM, Ranjan Ghosh wrote: Hi all, when I run "ceph daemon mds. session ls" I always get a fairly large number for num_caps (200,000). Is this normal? I thought caps are something like open/locked files, meaning a client is holding a cap on a file and no other client can access it during this time. Capabilities are much broader than that; they cover clients keeping some fresh metadata in their cache, even if the client isn't doing anything with the file at that moment. It's common for a client to accumulate a large number of capabilities in normal operation, as it keeps the metadata for many files in cache. You can adjust the "client cache size" setting on the fuse client to encourage it to cache metadata on fewer files and thereby hold onto fewer capabilities if you want. John Is there an option (or planned option) for clients to release caps after some time of not being used? In my testing I saw that clients tend to hold caps for an indefinite time. Currently in prod I have a use case where there are over 8 million caps and a little over 800k inodes_with_caps. Both the MDS and client caches operate on an LRU, size-limited basis.
That means that if they aren't hitting their size thresholds, they will tend to keep lots of stuff in cache indefinitely. One could add a behaviour that also actively expires cached metadata if it has not been used for a certain period of time, but it's not clear what the right time threshold would be, and whether it would be desirable for most users. If we free up memory because the system is quiet this minute/hour, then it potentially just creates an issue when we get busy again and need that memory back. With caching/resources generally, there's a conflict between the desire to keep things in cache in case they're needed again, and the desire to evict things from cache so that we have lots of free space available for new entries. Which one is better is entirely workload-dependent: there is clearly scope to add different behaviours as options, but it's hard to know how much people would really use them -- the sanity of the defaults is the most important thing. I do think there's a reasonable argument that part of the sane defaults should not be to keep something in cache if it hasn't been used for e.g. a day or more. BTW, clients do have an additional behaviour where they will drop unneeded caps when an MDS restarts, to avoid making a newly started MDS do a lot of unnecessary work to restore those caps, so the overhead of all those extra caps isn't quite as much as one might first imagine. John How can I debug this if it is a cause of concern? Is there any way to debug on which files the caps are held exactly?
Thank you, Ranjan
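John's suggestion above - shrinking the fuse client's metadata cache so it retains fewer capabilities - translates to a client-side ceph.conf entry. A sketch; the value 4096 is purely illustrative, chosen to be well below what I understand to be the era's default of 16384 entries:

```ini
[client]
# Smaller client dentry/inode cache => fewer capabilities retained.
# 4096 is an illustrative value; the default in this era was 16384.
client_cache_size = 4096
```

This only applies to ceph-fuse clients; the kernel client's cache is governed by normal kernel VFS cache pressure instead.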
[ceph-users] num_caps
Hi all, when I run "ceph daemon mds. session ls" I always get a fairly large number for num_caps (200,000). Is this normal? I thought caps are something like open/locked files, meaning a client is holding a cap on a file and no other client can access it during this time. How can I debug this if it is a cause of concern? Is there any way to debug on which files the caps are held exactly? Thank you, Ranjan
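To pull just the per-session cap counts out of the session listing without eyeballing the full JSON, standard text tools suffice. A sketch: the sample_json variable below is a fabricated stand-in for the real output of `ceph daemon mds.<name> session ls` (the real output has many more fields), used here only to illustrate the extraction.

```shell
# Extract "num_caps" values from session-ls-style JSON with grep/sed.
# sample_json is fabricated illustrative data, standing in for:
#   ceph daemon mds.<name> session ls
sample_json='[{"id":4235,"num_caps":200000},{"id":4236,"num_caps":1500}]'
echo "$sample_json" | grep -o '"num_caps":[0-9]*' | sed 's/.*://'
# prints 200000 and 1500, one per line
```

On a real cluster you would replace the echo with the ceph daemon command; a JSON-aware tool would be more robust if available.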
[ceph-users] Unsolved questions
Hi everyone, I've now been running our two-node mini-cluster for some months. An OSD, an MDS and a monitor run on both nodes. Additionally, there is a very small third node which is only running a third monitor, but no MDS/OSD. On both main servers, CephFS is mounted via fstab/kernel driver. The mounted folder is /var/www, hosting many websites. We use Ceph in this situation to achieve redundancy, so we can easily switch over to the other node in case one of them fails. The kernel version is 4.9.6. For the most part, it's running great and the performance of the filesystem is very good. Only some stubborn problems/questions have remained over the whole time, and I'd like to settle them once and for all: 1) Every once in a while, some processes (PHP) accessing the filesystem get stuck in a D state (uninterruptible sleep). I wonder if this happens due to network fluctuations (both servers are connected via a simple Gigabit crossover cable) or how to diagnose this. Why exactly does this happen in the first place? And what is the proper way to get these processes out of this situation? Why doesn't a timeout happen or anything else? I've read about client eviction, but when I enter "ceph daemon mds.node1 session ls" I only see two "entries" - one for each server. But I don't want to evict all processes on the server, obviously - only the stuck process. So far, the only method I've found to remove the D-state process is to reboot. Which is of course not a great solution. When I tried to only restart the MDS service instead of rebooting, many more processes got stuck and the load was >500 (most probably not CPU, but processes waiting for I/O). I found this thread here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001513.html Is this (still) relevant for my problem? And I read somewhere that you should not mount the folder on the same server the MDS runs on - unless you have a "newer" kernel (I can't find where I read this).
The information was a bit older, though, so I wondered whether 4.9.6 isn't sufficient or whether this is still a problem at all... 2) A second, also still unsolved problem: most of the time, "ceph health" shows something like: "Client node2 failing to respond to cache pressure". Restarting the MDS removes this message for a while before it appears again. I could remove the message by setting "mds cache size" higher than the total number of files/folders on the whole filesystem. Which is obviously not a scalable solution. The message doesn't seem to cause any problems, though. Nevertheless, I'd like to solve this. BTW: when I run "session ls" I see the number of caps held (num_caps) is very high (8). Doesn't this mean that that many files are open/occupied by one or more processes? Is this normal? I have some cronjobs running from time to time which run find or chmod over the filesystem. Could they be responsible for this? Is there some value to make Ceph release those "caps" faster/earlier? Thank you / BR Ranjan
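For reference, the cache knob discussed above changed between releases. In this thread's (pre-Luminous) era it is an inode-count limit; later releases replaced it with a byte limit, as in the newer thread at the top of this archive. A sketch with illustrative values only - the right size depends on how many files your clients touch:

```ini
[mds]
# Pre-Luminous: MDS cache size measured in inodes (default 100000).
mds_cache_size = 400000
# Luminous and later replaced this with a byte-based limit, e.g.:
# mds_cache_memory_limit = 4294967296   # 4 GiB
```

Note that raising the limit only moves the goal posts (as Patrick Donnelly points out in the newer thread): if clients keep accumulating caps, the warning can return at any size.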
Re: [ceph-users] Ceph Very Small Cluster
Thanks JC & Greg, I've changed "mon osd min down reporters" to 1. According to this: http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/ the default is already 1, though. I don't remember the value before I changed it everywhere, so I can't say for sure now. But I think it was 2, despite what the docs say. Whatever - it's now 1 everywhere. Another somewhat weird thing I found: when I check the values of an OSD(!) with "ceph daemon osd.0 config show | sort | grep mon_osd", I see an entry "mon osd min down reporters". I can even change it. But according to the docs, this is just a setting for monitors. Why does it appear there? Does it influence anything? If not: is there a way to only show the relevant config entries for a daemon? Then, when checking the doc page mentioned above and reading the descriptions of the multitude of config settings, I wonder: how can I properly estimate the time until my cluster works again? Since I get hung requests until the failed node is finally declared *down*, this time is obviously quite important for me. What exactly is the sequence of events when a node fails (i.e. someone accidentally hits the power-off button)? My (possibly totally wrong & dumb) idea:

1) osd0 fails/doesn't answer.
2) osd1 pings osd0 every 6 seconds (osd heartbeat interval). Thus, after 6 seconds max., osd1 notices osd0 *could be* down.
3) After another 20 seconds (osd heartbeat grace), osd1 decides osd0 is definitely down.
4) Another 120 seconds might elapse (osd mon report interval max) until osd1 reports the bad news to the monitor.
5) The monitor gets the information about failed osd0, and since "mon osd min down reporters" is 1, this single OSD is sufficient for the monitor to believe the bad news that osd0 is unresponsive.
6) But since "mon osd min down reports" is 3, all the stuff up until now has to happen 3 times in a row until the monitor finally realizes osd0 is *really* unresponsive.
7) After another 900 seconds (mon osd report timeout) of waiting in hope of news that osd0 is still/back alive, the monitor marks osd0 as down.
8) After another 300 seconds (mon osd down out interval) the monitor marks osd0 as down+out.

So, by my possibly very naive understanding, it takes 3*(6+20+120) + 900 + 300 seconds from the event "someone accidentally hit the power-off switch" to "osd0 is marked down+out". Correct? I expect not. Which config variables did I misunderstand? Thank you Ranjan On 29.09.2016 at 20:48, LOPEZ Jean-Charles wrote: mon_osd_min_down_reporters is by default set to 2. I guess you'll have to set it to 1 in your case. JC On Sep 29, 2016, at 08:16, Gregory Farnum wrote: I think the problem is that Ceph requires a certain number of OSDs or a certain number of reports of failure before it marks an OSD down. These thresholds are not tuned for a 2-OSD cluster; you probably want to set them to 1. Also keep in mind that the OSDs provide a grace period of 20-30 seconds before they'll report somebody down; this helps prevent spurious recovery but means you will get paused IO on an unclean shutdown. I can't recall the exact config options off-hand, but it's something like "mon osd min down reports". Search the docs for that. :) -Greg On Thursday, September 29, 2016, Peter Maloney wrote: On 09/29/16 14:07, Ranjan Ghosh wrote: > Wow. Amazing. Thanks a lot!!! This works. 2 (hopefully) last questions > on this issue: > > 1) When the first node is coming back up, I can just call "ceph osd up > 0" and Ceph will start auto-repairing everything, right? > That is, if there are e.g. new files that were created during the time > the first node was down, they will (sooner or later) get replicated > there? Nope, there is no "ceph osd up <id>"; you just start the OSD, and it already gets recognized as up.
(If you don't like this behaviour, you set it out, not just down; and there is a "ceph osd in <id>" to undo that.) > > 2) If I don't call "osd down" manually (perhaps at the weekend when > I'm not at the office) when a node dies - did I understand correctly > that the "hanging" I experienced is temporary and that after a few > minutes (don't want to try out now) the node should also go down > automatically? I believe so, yes. Also, FYI, RBD images don't seem to have this issue, and work right away on a 3-OSD cluster. Maybe cephfs would also work better with a 3rd OSD, even an empty one (weight=0). (and I had an unresolved issue testing the same with cep
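The worst-case timeline sketched in the numbered steps above is easy to write out as plain arithmetic. Note this only reproduces the questioner's pessimistic reading of the Jewel-era defaults quoted in the thread, not how the daemons actually behave (as Greg's reply notes, failures are normally reported within the 20-30 second grace window, not after repeated report-interval cycles):

```python
# Worst-case failure-detection timeline as estimated in the message above,
# using the default values quoted in the thread (all in seconds).
heartbeat_interval = 6     # osd heartbeat interval
heartbeat_grace = 20       # osd heartbeat grace
report_interval_max = 120  # osd mon report interval max
min_down_reports = 3       # mon osd min down reports
report_timeout = 900       # mon osd report timeout
down_out_interval = 300    # mon osd down out interval

# Steps 1-6: detection must (by this reading) happen 3 times in a row.
detect = min_down_reports * (heartbeat_interval + heartbeat_grace + report_interval_max)
# Step 7: monitor marks the OSD down.
total_until_down = detect + report_timeout
# Step 8: monitor marks the OSD down+out.
total_until_out = total_until_down + down_out_interval

print(detect)           # 438
print(total_until_out)  # 1638  (~27 minutes -- implausibly long, hence the question)
```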
Re: [ceph-users] Ceph Very Small Cluster
Wow. Amazing. Thanks a lot!!! This works. 2 (hopefully) last questions on this issue: 1) When the first node is coming back up, I can just call "ceph osd up 0" and Ceph will start auto-repairing everything, right? That is, if there are e.g. new files that were created during the time the first node was down, they will (sooner or later) get replicated there? 2) If I don't call "osd down" manually (perhaps at the weekend when I'm not at the office) when a node dies - did I understand correctly that the "hanging" I experienced is temporary and that after a few minutes (don't want to try out now) the node should also go down automatically? BR, Ranjan On 29.09.2016 at 13:00, Peter Maloney wrote: And also you could try: ceph osd down
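As a sketch of the manual workflow discussed here (the OSD id 0 matches the thread; the noout flag is a standard extra precaution for planned maintenance, not something suggested in this exchange):

```shell
# Manually mark an OSD down; if the daemon is actually still running
# and heartbeating, it will simply be marked up again on its own.
ceph osd down 0

# For planned maintenance, prevent down OSDs from being marked "out"
# (which would trigger rebalancing), then undo afterwards:
ceph osd set noout
# ... maintenance ...
ceph osd unset noout
```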
Re: [ceph-users] Ceph Very Small Cluster
Hi Vasu, thank you for your answer. Yes, all the pools have min_size 1: root@uhu2 /scripts # ceph osd lspools 0 rbd,1 cephfs_data,2 cephfs_metadata, root@uhu2 /scripts # ceph osd pool get cephfs_data min_size min_size: 1 root@uhu2 /scripts # ceph osd pool get cephfs_metadata min_size min_size: 1 I stopped all the Ceph services gracefully on the first machine. But, just to get this straight: What if the first machine really suffered a catastrophic failure? My expectation was that the second machine just keeps on running and serving files. This is why we are using a cluster in the first place... Or is this expectation already wrong? When I stop the services on node1, I get this: # ceph pg stat 2016-09-29 11:51:09.514814 7fcba012f700 0 -- :/1939885874 >> 136.243.82.227:6789/0 pipe(0x7fcb9c05a730 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fcb9c05c3f0).fault v41732: 264 pgs: 264 active+clean; 18514 MB data, 144 GB used, 3546 GB / 3690 GB avail; 1494 B/s rd, 0 op/s So, my question still is: Is there a way to (preferably) automatically avoid such a situation? Or at least manually tell the second node to keep on working and forget about those files? BR, Ranjan On 28.09.2016 at 18:25, Vasu Kulkarni wrote: Are all the pools using min_size 1? Did you check pg stat and see which ones are waiting? Some steps to debug further: check http://docs.ceph.com/docs/jewel/rados/operations/monitoring-osd-pg/ Also, did you shut down the server abruptly while it was busy?
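Following Vasu's suggestion, these are the usual commands for narrowing down which requests or placement groups are stuck (pool name taken from the thread; a sketch, not an exhaustive debugging recipe):

```shell
# Expand the HEALTH_WARN into specifics, including which OSDs
# have blocked requests:
ceph health detail

# List PGs stuck in non-clean states (inactive/unclean/stale):
ceph pg dump_stuck

# Confirm the per-pool replication settings:
ceph osd pool get cephfs_data size
ceph osd pool get cephfs_data min_size
```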
[ceph-users] Ceph Very Small Cluster
Hi everyone, Up until recently, we were using GlusterFS to have two web servers in sync so we could take one down and switch back and forth between them - e.g. for maintenance or failover. Usually, both were running, though. The performance was abysmal, unfortunately. Copying many small files on the file system caused outages for several minutes - simply unacceptable. So I found Ceph. It's fairly new but I thought I'd give it a try. I especially liked the good, detailed documentation, the configurability and the many command-line tools which allow you to find out what is going on with your cluster. All of this is severely lacking with GlusterFS IMHO. Because we're on a very tiny budget for this project, we cannot currently have more than two file system servers. I added a small virtual server, though, only for monitoring, so at least we have 3 monitor nodes. I also created 3 MDSs, though as far as I understood, two are only for standby. To sum it up, we have: server0: Admin (deployment started from here) + Monitor + MDS; server1: Monitor + MDS + OSD; server2: Monitor + MDS + OSD. So, the OSDs are on server1 and server2, which are next to each other, connected by a local Gigabit Ethernet connection. The cluster is mounted (also on server1 and server2) as /var/www and Apache is serving files off the cluster. I've used these configuration settings: osd pool default size = 2 and osd pool default min_size = 1. My idea was that by default everything should be replicated on 2 servers, i.e. each file is normally written to server1 and server2. In case of emergency, though (one server has a failure), it's better to keep operating and only write the file to one server. Therefore, I set min_size = 1. My further understanding is (correct me if I'm wrong) that when the server comes back online, the files that were written to only 1 server during the outage will automatically be replicated to the server that has come back online. So far, so good.
With two servers now online, the performance is light-years away from sluggish GlusterFS. I've also worked with XtreemFS, OCFS2, AFS and never had such good performance with any cluster. In fact it's so blazingly fast that I had to check twice that I really had the cluster mounted and wasn't accidentally working on the hard drive. Impressive. I can edit files on server1 and they are immediately changed on server2 and vice versa. Great! Unfortunately, when I now stop all Ceph services on server1, the websites on server2 start to hang/freeze, and "ceph health" shows "#x blocked requests". Now, what I don't understand: Why is it blocking? Shouldn't both servers have the file? And didn't I set min_size to "1"? And if there are a few files (could be some unimportant stuff) missing on one of the servers: How can I abort the blocking? I'd rather have a missing file or whatever than a completely blocking website. Are my files really duplicated 1:1 - or are they perhaps spread evenly between both OSDs? Do I have to edit the crushmap to achieve a real "RAID-1"-type of replication? Is there a command to find out, for a specific file, where it actually resides and whether it has really been replicated? Thank you! Ranjan
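On the last question: there is a command that maps a RADOS object to its placement group and the OSDs holding it. For a CephFS file you first need the underlying object name (typically the file's inode number in hex plus a chunk suffix), so treat this as a sketch with `<objname>` as a placeholder:

```shell
# Show which PG an object maps to and which OSDs (acting set) hold it:
ceph osd map cephfs_data <objname>

# List object names in the data pool directly
# (can be slow on pools with many objects):
rados -p cephfs_data ls
```

If the acting set in the output lists both OSDs, the object is replicated to both servers as intended.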