Re: [ceph-users] Ceph luminous repo not working on Ubuntu xenial
Quoting Kashif Mumtaz (kashif.mum...@yahoo.com):
> Dear User,
> I am striving hard to install the Ceph Luminous version on Ubuntu 16.04.3 (xenial).
> Its repo is available at https://download.ceph.com/debian-luminous/
> I added it like: sudo apt-add-repository 'deb https://download.ceph.com/debian-luminous/ xenial main'
> # more sources.list
> deb https://download.ceph.com/debian-luminous/ xenial main

^^ That looks good.

> It says no packages are available. Was anybody able to install Luminous on Xenial by using the repo?

Just checkin': you did an "apt update" after adding the repo? The repo works fine for me.

Is the Ceph gpg key installed?

apt-key list | grep Ceph
uid Ceph.com (release key)

Make sure you have "apt-transport-https" installed (as the repo uses TLS).

Gr. Stefan

--
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
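For reference, a minimal sketch of the whole sequence on a fresh Xenial host (the key URL is the standard download.ceph.com release-key location; package names as in Stefan's reply):

    sudo apt install apt-transport-https        # the repo is served over TLS
    wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
    sudo apt-add-repository 'deb https://download.ceph.com/debian-luminous/ xenial main'
    sudo apt update                             # easy to forget after adding the repo
    sudo apt install ceph
    apt-key list | grep -i ceph                 # sanity check: the release key is installed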
Re: [ceph-users] osd max scrubs not honored?
Hello,

On Thu, 28 Sep 2017 22:36:22 +0000 Gregory Farnum wrote:

> Also, realize the deep scrub interval is a per-PG thing and (unfortunately) the OSD doesn't use a global view of its PG deep scrub ages to try and schedule them intelligently across that time. If you really want to try and force this out, I believe a few sites have written scripts to do it by turning off deep scrubs, forcing individual PGs to deep scrub at intervals, and then enabling deep scrubs again.
> -Greg

This approach works best and w/o surprises down the road if osd_scrub_interval_randomize_ratio is disabled, and the osd_scrub_start_hour and osd_scrub_end_hour set to your needs. I basically kick the deep scrubs off on a per OSD basis (one at a time and staggered of course) and if your cluster is small/fast enough that pattern will be retained indefinitely, with only one PG doing a deep scrub at any given time (with the default max scrub of 1 of course).

Christian

> On Wed, Sep 27, 2017 at 6:34 AM David Turner wrote:
> > This isn't an answer, but a suggestion to try and help track it down as I'm not sure what the problem is. Try querying the admin socket for your osds and look through all of their config options and settings for something that might explain why you have multiple deep scrubs happening on a single osd at the same time.
> >
> > However if you misspoke and only have 1 deep scrub per osd but multiple per node, then what you are seeing is expected behavior. I believe that luminous added a sleep setting for scrub io that also might help. Looking through the admin socket dump of settings looking for scrub should give you some ideas of things to try.
> >
> > On Tue, Sep 26, 2017, 2:04 PM J David wrote:
> >> With “osd max scrubs” set to 1 in ceph.conf, which I believe is also the default, at almost all times, there are 2-3 deep scrubs running.
> >>
> >> 3 simultaneous deep scrubs is enough to cause a constant stream of:
> >>
> >> mon.ceph1 [WRN] Health check update: 69 slow requests are blocked > 32 sec (REQUEST_SLOW)
> >>
> >> This seems to correspond with all three deep scrubs hitting the same OSD at the same time, starving out all other I/O requests for that OSD. But it can happen less frequently and less severely with two or even one deep scrub running. Nonetheless, consumers of the cluster are not thrilled with regular instances of 30-60 second disk I/Os.
> >>
> >> The cluster is five nodes, 15 OSDs, and there is one pool with 512 placement groups. The cluster is running:
> >>
> >> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
> >>
> >> All of the OSDs are bluestore, with HDD storage and SSD block.db.
> >>
> >> Even setting “osd deep scrub interval = 1843200” hasn’t resolved this issue, though it seems to get the number down from 3 to 2, which at least cuts down on the frequency of requests stalling out. With 512 pgs, that should mean that one pg gets deep-scrubbed per hour, and it seems like a deep-scrub takes about 20 minutes. So what should be happening is that 1/3rd of the time there should be one deep scrub, and 2/3rds of the time there shouldn’t be any. Yet instead we have 2-3 deep scrubs running at all times.
> >>
> >> Looking at “ceph pg dump” shows that about 7 deep scrubs get launched per hour:
> >>
> >> $ sudo ceph pg dump | fgrep active | awk '{print $23" "$24" "$1}' | fgrep 2017-09-26 | sort -rn | head -22
> >> dumped all
> >> 2017-09-26 16:42:46.781761 0.181
> >> 2017-09-26 16:41:40.056816 0.59
> >> 2017-09-26 16:39:26.216566 0.9e
> >> 2017-09-26 16:26:43.379806 0.19f
> >> 2017-09-26 16:24:16.321075 0.60
> >> 2017-09-26 16:08:36.095040 0.134
> >> 2017-09-26 16:03:33.478330 0.b5
> >> 2017-09-26 15:55:14.205885 0.1e2
> >> 2017-09-26 15:54:31.413481 0.98
> >> 2017-09-26 15:45:58.329782 0.71
> >> 2017-09-26 15:34:51.777681 0.1e5
> >> 2017-09-26 15:32:49.669298 0.c7
> >> 2017-09-26 15:01:48.590645 0.1f
> >> 2017-09-26 15:01:00.082014 0.199
> >> 2017-09-26 14:45:52.893951 0.d9
> >> 2017-09-26 14:43:39.870689 0.140
> >> 2017-09-26 14:28:56.217892 0.fc
> >> 2017-09-26 14:28:49.665678 0.e3
> >> 2017-09-26 14:11:04.718698 0.1d6
> >> 2017-09-26 14:09:44.975028 0.72
> >> 2017-09-26 14:06:17.945012 0.8a
> >> 2017-09-26 13:54:44.199792 0.ec
> >>
> >> What’s going on here?
> >>
> >> Why isn’t the limit on scrubs being honored?
> >>
> >> It would also be great if scrub I/O were surfaced in “ceph status” the way recovery I/O is, especially since it can have such a significant impact on client operations.
> >>
> >> Thanks!
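A minimal sketch of the script approach Greg describes (disable scheduled deep scrubs, force PGs one by one, re-enable), reusing the pg dump idiom from this thread; the 20-minute stagger is an assumption based on the scrub duration observed above, and it relies on manually requested scrubs running despite the nodeep-scrub flag:

    ceph osd set nodeep-scrub              # keep the scheduler from launching its own
    ceph pg dump 2>/dev/null | fgrep active | awk '{print $1}' | while read pg; do
        ceph pg deep-scrub "$pg"           # force this one PG
        sleep 1200                         # roughly one 20-minute deep scrub at a time
    done
    ceph osd unset nodeep-scrub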
Re: [ceph-users] ceph/systemd startup bug (was Re: Some OSDs are down after Server reboot)
This looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1458007 or one of the bugs/trackers attached to that.

On Thu, Sep 28, 2017 at 11:14 PM, Sean Purdy wrote:
> On Thu, 28 Sep 2017, Matthew Vernon said:
>> Hi,
>>
>> TL;DR - the timeout setting in ceph-disk@.service is (far) too small - it needs increasing and/or removing entirely. Should I copy this to ceph-devel?
>
> Just a note. Looks like debian stretch luminous packages have a 10_000 second timeout:
>
> from /lib/systemd/system/ceph-disk@.service
>
> Environment=CEPH_DISK_TIMEOUT=10000
> ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>
> Sean
>
>> On 15/09/17 16:48, Matthew Vernon wrote:
>> >On 14/09/17 16:26, Götz Reinicke wrote:
>> >>After that, 10 OSDs did not come up as the others. The disk did not get mounted and the OSD processes did nothing … even after a couple of minutes no more disks/OSDs showed up.
>> >
>> >I'm still digging, but AFAICT it's a race condition in startup - in our case, we're only seeing it if some of the filesystems aren't clean. This may be related to the thread "Very slow start of osds after reboot" from August, but I don't think any conclusion was reached there.
>>
>> This annoyed me enough that I went off to find the problem :-)
>>
>> On systemd-enabled machines[0] ceph disks are activated by systemd's ceph-disk@.service, which calls:
>>
>> /bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>>
>> ceph-disk trigger --sync calls ceph-disk activate which (among other things) mounts the osd fs (first in a temporary location, then in /var/lib/ceph/osd/ once it's extracted the osd number from the fs). If the fs is unclean, XFS auto-recovers before mounting (which takes time - range 2-25s for our 6TB disks). Importantly, there is a single global lock file[1] so only one ceph-disk activate can be doing this at once.
>>
>> So, each fs is auto-recovering one at a time (rather than in parallel), and once the elapsed time gets past 120s, timeout kills the flock, systemd kills the cgroup, and no more OSDs start up - we typically find a few fs mounted in /var/lib/ceph/tmp/mnt.. systemd keeps trying to start the remaining osds (via ceph-osd@.service), but their fs isn't in the correct place, so this never works.
>>
>> The fix/workaround is to adjust the timeout value (edit the service file directly, or for style points write an override in /etc/systemd/system remembering you need a blank ExecStart line before your revised one).
>>
>> Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to start all its osds when started up with all fss dirty. So the current 120s is far too small (it's just about OK when all the osd fss are clean).
>>
>> I think, though, that having the timeout at all is a bug - if something needs to time out under some circumstances, should it be at a lower layer, perhaps?
>>
>> A couple of final points/asides, if I may:
>>
>> ceph-disk trigger uses subprocess.communicate (via the command() function), which means it swallows the log output from ceph-disk activate (and only outputs it after that process finishes) - as well as producing confusing timestamps, this means that when systemd kills the cgroup, all the output from the ceph-disk activate command vanishes into the void.
>> That made debugging needlessly hard. Better to let called processes like that output immediately?
>>
>> Does each fs need mounting twice? could the osd be encoded in the partition label or similar instead?
>>
>> Is a single global activation lock necessary? It slows startup down quite a bit; I see no reason why (at least in the one-osd-per-disk case) you couldn't be activating all the osds at once...
>>
>> Regards,
>>
>> Matthew
>>
>> [0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the timeout, so presumably upstart systems aren't affected
>> [1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

--
Cheers,
Brad
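A sketch of the override route Matthew mentions, using a systemd drop-in rather than editing the unit in place (the 10000s value mirrors the Debian stretch packaging quoted above; the drop-in file name is an example):

    mkdir -p /etc/systemd/system/ceph-disk@.service.d
    cat > /etc/systemd/system/ceph-disk@.service.d/10-timeout.conf <<'EOF'
    [Service]
    # the blank ExecStart clears the packaged one, as noted above
    ExecStart=
    ExecStart=/bin/sh -c 'timeout 10000 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
    EOF
    systemctl daemon-reload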
Re: [ceph-users] Ceph 12.2.0 on 32bit?
Have you tried running a Luminous OSD with filestore instead of BlueStore? As BlueStore is all new code and uses a lot of optimizations and tricks for fast and efficient use of memory, some 64-bit assumptions may have snuck in there. I'm not sure how much interest there is in making sure that works on 32-bit systems at this point, but narrowing it down to a specific component would certainly help.

On Fri, Sep 22, 2017 at 8:57 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com> wrote:
> It crashes with SimpleMessenger as well (ms_type = simple)
>
> I've also tried with and without these two settings, but still crashes.
> bluestore cache size = 536870912
> bluestore cache kv max = 268435456
>
> When using SimpleMessenger, it tells me it is crashing (Segmentation Fault) in 'thread_name:ms_pipe_write'. This is common in all crashes under SimpleMessenger, just like 'msgr-worker-' was common under AsyncMessenger.
>
> The node I'm testing this on is running a 32bit kernel (4.12.5) and has 8GB ram (free -m).
>
> Per 'ps aux', VSZ and RSS never get much above 1196392 and 544024 respectively. (One time they didn't get past 999536 and 329712 respectively.)
>
> Also, under SimpleMessenger, gdb is reporting stack corruption in the back traces.
>
> What other memory tuning options should I try?
>
> On 2017-09-11 08:05, Gregory Farnum wrote:
> You could try setting it to run with SimpleMessenger instead of AsyncMessenger -- the default changed across those releases.
> I imagine the root of the problem though is that with BlueStore the OSD is using a lot more memory than it used to and so we're overflowing the 32-bit address space...which means a more permanent solution might require turning down the memory tuning options. Sage has discussed those in various places.
> On Sun, Sep 10, 2017 at 11:52 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com> wrote:
>> Hi,
>>
>> Is anyone running Ceph Luminous (12.2.0) on 32bit Linux? Have you seen any problems?
>>
>> My setup has been 1 MON and 7 OSDs (no MDS, RGW, etc), all running Jewel (10.2.1), on 32bit, with no issues at all.
>>
>> I've upgraded everything to latest version of Jewel (10.2.9) and still no issues.
>>
>> Next I upgraded my MON to Luminous (12.2.0) and added MGR to it. Still no issues.
>>
>> Next I removed one node from the cluster, wiped it clean, upgraded it to Luminous (12.2.0), and created a new BlueStore data area. Now this node crashes with segmentation fault usually within a few minutes of starting up. I've loaded symbols and used GDB to examine back traces. From what I can tell, the seg faults are happening randomly, and the stack is corrupted, so traces from GDB are unusable (even with all symbols installed for all packages on the system). However, in all cases, the seg fault is occurring in the 'msgr-worker-' thread.
>>
>> My data is fine, just would like to get Ceph 12.2.0 running stably on this node, so I can upgrade the remaining nodes and switch everything over to BlueStore.
>>
>> Thanks,
>> Dyweni
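For anyone trying to reproduce this, the knobs being toggled above would sit in ceph.conf roughly as follows (the values are the ones quoted in the thread; whether they are sufficient on 32-bit is exactly what is being tested here):

    cat >> /etc/ceph/ceph.conf <<'EOF'
    [osd]
    ms_type = simple
    bluestore cache size = 536870912
    bluestore cache kv max = 268435456
    EOF
    systemctl restart ceph-osd@0    # osd id is an example
    ps aux | grep ceph-osd          # watch VSZ/RSS, as described above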
Re: [ceph-users] osd max scrubs not honored?
I often schedule the deep scrubs for a cluster so that none of them will happen on their own and will always be run using my cron/scripts. For instance, set the deep scrub interval to 2 months and schedule a cron that will take care of all of the deep scrubs within a month. If for any reason the script stops working, the PGs will still be scrubbed at least every 2 months. But the script should ensure that they happen every month but only during the times of day I'm running the cron. That way I can ease up on deep scrubbing when the cluster needs a little more performance or is going through a big recovery, but also catch it back up.

There are also config settings to ensure that scrubs only happen during hours of the day you want them to so you can avoid major client IO regardless of how you scrub your cluster. A sketch of this cron scheme follows the quoted thread below.

On Thu, Sep 28, 2017 at 6:36 PM Gregory Farnum wrote:
> Also, realize the deep scrub interval is a per-PG thing and (unfortunately) the OSD doesn't use a global view of its PG deep scrub ages to try and schedule them intelligently across that time. If you really want to try and force this out, I believe a few sites have written scripts to do it by turning off deep scrubs, forcing individual PGs to deep scrub at intervals, and then enabling deep scrubs again.
> -Greg
>
> On Wed, Sep 27, 2017 at 6:34 AM David Turner wrote:
>> This isn't an answer, but a suggestion to try and help track it down as I'm not sure what the problem is. Try querying the admin socket for your osds and look through all of their config options and settings for something that might explain why you have multiple deep scrubs happening on a single osd at the same time.
>>
>> However if you misspoke and only have 1 deep scrub per osd but multiple per node, then what you are seeing is expected behavior. I believe that luminous added a sleep setting for scrub io that also might help. Looking through the admin socket dump of settings looking for scrub should give you some ideas of things to try.
>>
>> On Tue, Sep 26, 2017, 2:04 PM J David wrote:
>>> With “osd max scrubs” set to 1 in ceph.conf, which I believe is also the default, at almost all times, there are 2-3 deep scrubs running.
>>>
>>> 3 simultaneous deep scrubs is enough to cause a constant stream of:
>>>
>>> mon.ceph1 [WRN] Health check update: 69 slow requests are blocked > 32 sec (REQUEST_SLOW)
>>>
>>> This seems to correspond with all three deep scrubs hitting the same OSD at the same time, starving out all other I/O requests for that OSD. But it can happen less frequently and less severely with two or even one deep scrub running. Nonetheless, consumers of the cluster are not thrilled with regular instances of 30-60 second disk I/Os.
>>>
>>> The cluster is five nodes, 15 OSDs, and there is one pool with 512 placement groups. The cluster is running:
>>>
>>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>>
>>> All of the OSDs are bluestore, with HDD storage and SSD block.db.
>>>
>>> Even setting “osd deep scrub interval = 1843200” hasn’t resolved this issue, though it seems to get the number down from 3 to 2, which at least cuts down on the frequency of requests stalling out. With 512 pgs, that should mean that one pg gets deep-scrubbed per hour, and it seems like a deep-scrub takes about 20 minutes. So what should be happening is that 1/3rd of the time there should be one deep scrub, and 2/3rds of the time there shouldn’t be any.
>>> Yet instead we have 2-3 deep scrubs running at all times.
>>>
>>> Looking at “ceph pg dump” shows that about 7 deep scrubs get launched per hour:
>>>
>>> $ sudo ceph pg dump | fgrep active | awk '{print $23" "$24" "$1}' | fgrep 2017-09-26 | sort -rn | head -22
>>> dumped all
>>> 2017-09-26 16:42:46.781761 0.181
>>> 2017-09-26 16:41:40.056816 0.59
>>> 2017-09-26 16:39:26.216566 0.9e
>>> 2017-09-26 16:26:43.379806 0.19f
>>> 2017-09-26 16:24:16.321075 0.60
>>> 2017-09-26 16:08:36.095040 0.134
>>> 2017-09-26 16:03:33.478330 0.b5
>>> 2017-09-26 15:55:14.205885 0.1e2
>>> 2017-09-26 15:54:31.413481 0.98
>>> 2017-09-26 15:45:58.329782 0.71
>>> 2017-09-26 15:34:51.777681 0.1e5
>>> 2017-09-26 15:32:49.669298 0.c7
>>> 2017-09-26 15:01:48.590645 0.1f
>>> 2017-09-26 15:01:00.082014 0.199
>>> 2017-09-26 14:45:52.893951 0.d9
>>> 2017-09-26 14:43:39.870689 0.140
>>> 2017-09-26 14:28:56.217892 0.fc
>>> 2017-09-26 14:28:49.665678 0.e3
>>> 2017-09-26 14:11:04.718698 0.1d6
>>> 2017-09-26 14:09:44.975028 0.72
>>> 2017-09-26 14:06:17.945012 0.8a
>>> 2017-09-26 13:54:44.199792 0.ec
>>>
>>> What’s going on here?
>>>
>>> Why isn’t the limit on scrubs being honored?
>>>
>>> It would also be great if scrub I/O were surfaced in “ceph status” the way recovery I/O is, especially since it can have such a significant impact on client operations.
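A hedged sketch of the cron scheme David describes above; the schedule, the batch size, and the awk column positions (borrowed from the pg dump invocation quoted above) are illustrative, not from the thread:

    # /etc/cron.d/ceph-deep-scrub (example): run the batch nightly at 01:00
    # 0 1 * * * root /usr/local/bin/deep-scrub-batch.sh
    # deep-scrub-batch.sh: deep-scrub the 20 least-recently-deep-scrubbed PGs
    ceph pg dump 2>/dev/null | fgrep active | awk '{print $23" "$24" "$1}' | sort | head -20 | while read d t pg; do
        ceph pg deep-scrub "$pg"
    done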
Re: [ceph-users] osd max scrubs not honored?
Also, realize the deep scrub interval is a per-PG thing and (unfortunately) the OSD doesn't use a global view of its PG deep scrub ages to try and schedule them intelligently across that time. If you really want to try and force this out, I believe a few sites have written scripts to do it by turning off deep scrubs, forcing individual PGs to deep scrub at intervals, and then enabling deep scrubs again.
-Greg

On Wed, Sep 27, 2017 at 6:34 AM David Turner wrote:
> This isn't an answer, but a suggestion to try and help track it down as I'm not sure what the problem is. Try querying the admin socket for your osds and look through all of their config options and settings for something that might explain why you have multiple deep scrubs happening on a single osd at the same time.
>
> However if you misspoke and only have 1 deep scrub per osd but multiple per node, then what you are seeing is expected behavior. I believe that luminous added a sleep setting for scrub io that also might help. Looking through the admin socket dump of settings looking for scrub should give you some ideas of things to try.
>
> On Tue, Sep 26, 2017, 2:04 PM J David wrote:
>> With “osd max scrubs” set to 1 in ceph.conf, which I believe is also the default, at almost all times, there are 2-3 deep scrubs running.
>>
>> 3 simultaneous deep scrubs is enough to cause a constant stream of:
>>
>> mon.ceph1 [WRN] Health check update: 69 slow requests are blocked > 32 sec (REQUEST_SLOW)
>>
>> This seems to correspond with all three deep scrubs hitting the same OSD at the same time, starving out all other I/O requests for that OSD. But it can happen less frequently and less severely with two or even one deep scrub running. Nonetheless, consumers of the cluster are not thrilled with regular instances of 30-60 second disk I/Os.
>>
>> The cluster is five nodes, 15 OSDs, and there is one pool with 512 placement groups. The cluster is running:
>>
>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>
>> All of the OSDs are bluestore, with HDD storage and SSD block.db.
>>
>> Even setting “osd deep scrub interval = 1843200” hasn’t resolved this issue, though it seems to get the number down from 3 to 2, which at least cuts down on the frequency of requests stalling out. With 512 pgs, that should mean that one pg gets deep-scrubbed per hour, and it seems like a deep-scrub takes about 20 minutes. So what should be happening is that 1/3rd of the time there should be one deep scrub, and 2/3rds of the time there shouldn’t be any. Yet instead we have 2-3 deep scrubs running at all times.
>>
>> Looking at “ceph pg dump” shows that about 7 deep scrubs get launched per hour:
>>
>> $ sudo ceph pg dump | fgrep active | awk '{print $23" "$24" "$1}' | fgrep 2017-09-26 | sort -rn | head -22
>> dumped all
>> 2017-09-26 16:42:46.781761 0.181
>> 2017-09-26 16:41:40.056816 0.59
>> 2017-09-26 16:39:26.216566 0.9e
>> 2017-09-26 16:26:43.379806 0.19f
>> 2017-09-26 16:24:16.321075 0.60
>> 2017-09-26 16:08:36.095040 0.134
>> 2017-09-26 16:03:33.478330 0.b5
>> 2017-09-26 15:55:14.205885 0.1e2
>> 2017-09-26 15:54:31.413481 0.98
>> 2017-09-26 15:45:58.329782 0.71
>> 2017-09-26 15:34:51.777681 0.1e5
>> 2017-09-26 15:32:49.669298 0.c7
>> 2017-09-26 15:01:48.590645 0.1f
>> 2017-09-26 15:01:00.082014 0.199
>> 2017-09-26 14:45:52.893951 0.d9
>> 2017-09-26 14:43:39.870689 0.140
>> 2017-09-26 14:28:56.217892 0.fc
>> 2017-09-26 14:28:49.665678 0.e3
>> 2017-09-26 14:11:04.718698 0.1d6
>> 2017-09-26 14:09:44.975028 0.72
>> 2017-09-26 14:06:17.945012 0.8a
>> 2017-09-26 13:54:44.199792 0.ec
>>
>> What’s going on here?
>>
>> Why isn’t the limit on scrubs being honored?
>>
>> It would also be great if scrub I/O were surfaced in “ceph status” the way recovery I/O is, especially since it can have such a significant impact on client operations.
>>
>> Thanks!
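Concretely, the admin-socket check David suggests looks something like this (osd id and socket path are examples):

    ceph daemon osd.0 config show | grep -i scrub
    # or, going straight at the socket:
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -i scrub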
Re: [ceph-users] RGW how to delete orphans
I'm pretty sure the orphan find command does exactly that - finding orphans. I remember some emails on the dev list where Yehuda said he wasn't 100% comfortable with automating the delete just yet. So the purpose is to run the orphan find tool and then delete the orphaned objects once you're happy that they all are actually orphaned.

On Fri, Sep 29, 2017 at 7:46 AM, Webert de Souza Lima wrote:
> When I had to use that I just took for granted that it worked, so I can't really tell you if that's just it.
>
> :|
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
>
> On Thu, Sep 28, 2017 at 1:31 PM, Andreas Calminder wrote:
>> Hi,
>> Yes I'm able to run these commands, however it is unclear both in the man file and the docs what's supposed to happen with the orphans, will they be deleted once I run finish? Or will that just throw away the job? What will orphans find actually produce? At the moment it just outputs a lot of text saying something like putting $num in orphans.$jobid.$shardnum and listing objects that are not orphans?
>>
>> Regards,
>> Andreas
>>
>> On 28 Sep 2017 15:10, "Webert de Souza Lima" wrote:
>>
>> Hello,
>>
>> not an expert here but I think the answer is something like:
>>
>> radosgw-admin orphans find --pool=_DATA_POOL_ --job-id=_JOB_ID_
>> radosgw-admin orphans finish --job-id=_JOB_ID_
>>
>> _JOB_ID_ being anything.
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> Belo Horizonte - Brasil
>>
>> On Thu, Sep 28, 2017 at 9:38 AM, Andreas Calminder wrote:
>>> Hello,
>>> running Jewel on some nodes with rados gateway I've managed to get a lot of leaked multipart objects, most of them belonging to buckets that do not even exist anymore. We estimated these objects to occupy somewhere around 60TB, which would be great to reclaim. Question is how, since trying to find them one by one and perform some kind of sanity check if they're in use or not will take forever.
>>>
>>> The radosgw-admin orphans find command sounds like something I could use, but it's not clear if the command also removes the orphans? If not, what does it do? Can I use it to help me removing my orphan objects?
>>>
>>> Best regards,
>>> Andreas
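Putting the thread together, the workflow sketches out roughly as follows (the pool and job names are examples, and per the above the actual deletion is left to the operator):

    radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphans-2017
    # review the reported leaked objects, then remove the ones you are sure about:
    # rados -p default.rgw.buckets.data rm <object>
    radosgw-admin orphans finish --job-id=orphans-2017   # discards the job's search data, not the objects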
Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.
On Thu, Sep 28, 2017 at 5:16 AM Micha Krause wrote:
> Hi,
>
> I had a chance to catch John Spray at the Ceph Day, and he suggested that I try to reproduce this bug in Luminous.
>
> To fix my immediate problem we discussed 2 ideas:
>
> 1. Manually edit the meta-data, unfortunately I was not able to find any information on how the meta-data is structured :-(
>
> 2. Edit the code to set the link count to 0 if it is negative:
>
> diff --git a/src/mds/StrayManager.cc b/src/mds/StrayManager.cc
> index 9e53907..2ca1449 100644
> --- a/src/mds/StrayManager.cc
> +++ b/src/mds/StrayManager.cc
> @@ -553,6 +553,10 @@ bool StrayManager::__eval_stray(CDentry *dn, bool delay)
>      logger->set(l_mdc_num_strays_delayed, num_strays_delayed);
>    }
>
> +  if (in->inode.nlink < 0) {
> +    in->inode.nlink=0;
> +  }
> +
>    // purge?
>    if (in->inode.nlink == 0) {
>      // past snaprealm parents imply snapped dentry remote links.
>
> diff --git a/src/xxHash b/src/xxHash
> --- a/src/xxHash
> +++ b/src/xxHash
> @@ -1 +1 @@
>
> I'm not sure if this works, the patched mds no longer crashes, however I expected that this value:
>
> root@mds02:~ # ceph daemonperf mds.1
> -mds-- --mds_server-- ---objecter--- -mds_cache- ---mds_log
> rlat inos caps|hsr hcs hcr |writ read actv|recd recy stry purg|segs evts subm|
>    0 100k    0 |  0   0   0 |   0   0   0 |   0   0 625k   0 |  30  25k   0
>
> Should go down, but it stays at 625k, unfortunately I don't have another system to compare.
>
> After I started the patched mds once, I reverted back to an unpatched mds, and it also stopped crashing, so I guess it did "fix" something.
>
> A question just out of curiosity, I tried to log these events with something like:
>
> dout(10) << "Fixed negative inode count";
>
> or
>
> derr << "Fixed negative inode count";
>
> But my compiler yelled at me for trying this.

dout and derr are big macros. You need to end the line with " << dendl;" to close it off.

> Micha Krause
[ceph-users] Ceph luminous repo not working on Ubuntu xenial
Dear User,

I am striving hard to install the Ceph Luminous version on Ubuntu 16.04.3 (xenial). Its repo is available at https://download.ceph.com/debian-luminous/

I added it like:

sudo apt-add-repository 'deb https://download.ceph.com/debian-luminous/ xenial main'

# more sources.list
deb https://download.ceph.com/debian-luminous/ xenial main

It says no packages are available. Was anybody able to install Luminous on Xenial by using the repo?
Re: [ceph-users] RGW how to delete orphans
When I had to use that I just took for granted that it worked, so I can't really tell you if that's just it.

:|

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Thu, Sep 28, 2017 at 1:31 PM, Andreas Calminder <andreas.calmin...@klarna.com> wrote:

> Hi,
> Yes I'm able to run these commands, however it is unclear both in the man file and the docs what's supposed to happen with the orphans, will they be deleted once I run finish? Or will that just throw away the job? What will orphans find actually produce? At the moment it just outputs a lot of text saying something like putting $num in orphans.$jobid.$shardnum and listing objects that are not orphans?
>
> Regards,
> Andreas
>
> On 28 Sep 2017 15:10, "Webert de Souza Lima" wrote:
>
> Hello,
>
> not an expert here but I think the answer is something like:
>
> radosgw-admin orphans find --pool=_DATA_POOL_ --job-id=_JOB_ID_
> radosgw-admin orphans finish --job-id=_JOB_ID_
>
> _JOB_ID_ being anything.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
>
> On Thu, Sep 28, 2017 at 9:38 AM, Andreas Calminder <andreas.calmin...@klarna.com> wrote:
>
>> Hello,
>> running Jewel on some nodes with rados gateway I've managed to get a lot of leaked multipart objects, most of them belonging to buckets that do not even exist anymore. We estimated these objects to occupy somewhere around 60TB, which would be great to reclaim. Question is how, since trying to find them one by one and perform some kind of sanity check if they're in use or not will take forever.
>>
>> The radosgw-admin orphans find command sounds like something I could use, but it's not clear if the command also removes the orphans? If not, what does it do? Can I use it to help me removing my orphan objects?
>>
>> Best regards,
>> Andreas
Re: [ceph-users] Power outages!!! help!
On 28. sep. 2017 18:53, hjcho616 wrote:

Yay! Finally after about exactly one month I finally am able to mount the drive! Now is time to see how my data is doing. =P Doesn't look too bad though. Got to love the open source. =) I downloaded ceph source code. Built them. Then tried to run ceph-objectstore-export on that osd.4. Then started debugging it. Obviously don't have any idea of what everything do... but was able to trace to the error message. The corruption appears to be at the mount region. When it tries to decode a buffer, most buffers had very periodic (looking at the printfs I put in) access to data but then a few of them had huge numbers. Oh that "1" that didn't make sense was from the corruption that happened, and that struct_v portion of the data changed to ASCII value of 1, which happily printed 1. =P Since it was a mount portion... and hoping it doesn't impact the data much... went ahead and allowed those corrupted values. I was able to export osd.4 with journal!

congratulations and well done :)

just imagine trying to do this on $vendor's proprietary black box...

Ronny Aasen
Re: [ceph-users] Power outages!!! help!
Yay! Finally after about exactly one month I finally am able to mount the drive! Now is time to see how my data is doing. =P Doesn't look too bad though. Got to love the open source. =)

I downloaded ceph source code. Built them. Then tried to run ceph-objectstore-export on that osd.4. Then started debugging it. Obviously don't have any idea of what everything do... but was able to trace to the error message. The corruption appears to be at the mount region. When it tries to decode a buffer, most buffers had very periodic (looking at the printfs I put in) access to data but then a few of them had huge numbers. Oh that "1" that didn't make sense was from the corruption that happened, and that struct_v portion of the data changed to ASCII value of 1, which happily printed 1. =P Since it was a mount portion... and hoping it doesn't impact the data much... went ahead and allowed those corrupted values. I was able to export osd.4 with journal!

Then imported that PG. But OSDs wouldn't take them, as it decided to create an empty PG 1.28 and assigned them active. So, just as the "Incomplete PGs Oh My!" page suggested, pulled those osds down and removed those empty heads and started back up. At that point, no more incomplete data!

Working on that inconsistent data: looks like this is somewhat new in the 10.2s. I was able to get it working with rados get and put and deep-scrub: https://www.spinics.net/lists/ceph-users/msg39063.html

At this point, everything was active+clean. But MDS wasn't happy. Seems to suggest the journal is broken:

HEALTH_ERR mds rank 0 is damaged; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set

Found this. Did everything down to cephfs-table-tool all reset session: http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/

Restarted MDS.

HEALTH_WARN no legacy OSD present but 'sortbitwise' flag is not set

Mounted! Thank you everyone for the help! Learned a lot!

Regards,
Hong

On Friday, September 22, 2017 1:01 AM, hjcho616 wrote:

Ronny,

Could you help me with this log? I got this with debug osd=20 filestore=20 ms=20. This one is running "ceph pg repair 2.7". This is one of the smaller PGs, thus the log was smaller. Others have similar errors. I can see the lines with ERR, but other than that is there something I should be paying attention to?
https://drive.google.com/file/d/0By7YztAJNGUWNkpCV090dHBmOWc/view?usp=sharing

Error messages look like this:

2017-09-21 23:53:31.545510 7f51682df700 -1 log_channel(cluster) log [ERR] : 2.7 shard 2: soid 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head data_digest 0x62b74a1f != data_digest 0x43d61c5d from auth oi 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head(12962'694 osd.2.0:90545 dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od alloc_hint [0 0])
2017-09-21 23:53:31.545520 7f51682df700 -1 log_channel(cluster) log [ERR] : 2.7 shard 7: soid 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head data_digest 0x62b74a1f != data_digest 0x43d61c5d from auth oi 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head(12962'694 osd.2.0:90545 dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od alloc_hint [0 0])
2017-09-21 23:53:31.545531 7f51682df700 -1 log_channel(cluster) log [ERR] : 2.7 soid 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head: failed to pick suitable auth object

I did try to move that object to a different location as suggested from this page: http://ceph.com/geen-categorie/ceph-manually-repair-object/

This is what I ran:

systemctl stop ceph-osd@7
ceph-osd -i 7 --flush-journal
cd /var/lib/ceph/osd/ceph-7
cd current/2.7_head/
mv rb.0.145d.2ae8944a.00bb__head_6F5DBE87__2 ~/
ceph osd tree
systemctl start ceph-osd@7
ceph pg repair 2.7

Then I just get this:

2017-09-22 00:41:06.495399 7f22ac3bd700 -1 log_channel(cluster) log [ERR] : 2.7 shard 2: soid 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head data_digest 0x62b74a1f != data_digest 0x43d61c5d from auth oi 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head(12962'694 osd.2.0:90545 dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od alloc_hint [0 0])
2017-09-22 00:41:06.495417 7f22ac3bd700 -1 log_channel(cluster) log [ERR] : 2.7 shard 7 missing 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head
2017-09-22 00:41:06.495424 7f22ac3bd700 -1 log_channel(cluster) log [ERR] : 2.7 soid 2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head: failed to pick suitable auth object

Moving from osd.2 results in similar error message, just says missing on top one instead. =P I was hoping this time would give me a different result as I let one more osd copy one from OSD1 by turning down osd.7 and set noout. But it doesn't appear to care about that extra data. Maybe only true when size is 3? Basically since I had most osds alive on OSD1 I was trying to favor data from OSD1. =P

What can I do in this case? According to
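For the archive, the export/import step described earlier in the thread goes roughly like this with the stock ceph-objectstore-tool (the osd paths and pgid are taken from this thread; the OSDs must be stopped while the tool runs, and the destination osd here is just an example):

    systemctl stop ceph-osd@4
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --pgid 1.28 --op export --file /root/pg.1.28.export
    # then on the OSD that should own the PG (also stopped):
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --journal-path /var/lib/ceph/osd/ceph-2/journal --op import --file /root/pg.1.28.export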
Re: [ceph-users] RGW how to delete orphans
Hi,
Yes I'm able to run these commands, however it is unclear both in the man file and the docs what's supposed to happen with the orphans, will they be deleted once I run finish? Or will that just throw away the job? What will orphans find actually produce? At the moment it just outputs a lot of text saying something like putting $num in orphans.$jobid.$shardnum and listing objects that are not orphans?

Regards,
Andreas

On 28 Sep 2017 15:10, "Webert de Souza Lima" wrote:

Hello,

not an expert here but I think the answer is something like:

radosgw-admin orphans find --pool=_DATA_POOL_ --job-id=_JOB_ID_
radosgw-admin orphans finish --job-id=_JOB_ID_

_JOB_ID_ being anything.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Thu, Sep 28, 2017 at 9:38 AM, Andreas Calminder <andreas.calmin...@klarna.com> wrote:

> Hello,
> running Jewel on some nodes with rados gateway I've managed to get a lot of leaked multipart objects, most of them belonging to buckets that do not even exist anymore. We estimated these objects to occupy somewhere around 60TB, which would be great to reclaim. Question is how, since trying to find them one by one and perform some kind of sanity check if they're in use or not will take forever.
>
> The radosgw-admin orphans find command sounds like something I could use, but it's not clear if the command also removes the orphans? If not, what does it do? Can I use it to help me removing my orphan objects?
>
> Best regards,
> Andreas
[ceph-users] Luminous v12.2.1 released
This is the first bugfix release of the Luminous v12.2.x long term stable release series. It contains a range of bug fixes and a few features across CephFS, RBD & RGW. We recommend all users of the 12.2.x series update.

For more details, refer to the release notes entry at the official blog[1] and the complete changelog[2].

Notable Changes
---

* Dynamic resharding is now enabled by default for RGW. RGW will now automatically reshard its bucket indexes once an index grows beyond `rgw_max_objs_per_shard`.

* Limiting MDS cache via a memory limit is now supported using the new mds_cache_memory_limit config option (1GB by default). A cache reservation can also be specified using mds_cache_reservation as a percentage of the limit (5% by default). Limits by inode count are still supported using mds_cache_size. Setting mds_cache_size to 0 (the default) disables the inode limit.

* The maximum number of PGs per OSD before the monitor issues a warning has been reduced from 300 to 200 PGs. 200 is still twice the generally recommended target of 100 PGs per OSD. This limit can be adjusted via the ``mon_max_pg_per_osd`` option on the monitors. The older ``mon_pg_warn_max_per_osd`` option has been removed.

* Creating pools or adjusting pg_num will now fail if the change would make the number of PGs per OSD exceed the configured ``mon_max_pg_per_osd`` limit. The option can be adjusted if it is really necessary to create a pool with more PGs.

* There was a bug in the PG mapping behavior of the new *upmap* feature. If you made use of this feature (e.g., via the `ceph osd pg-upmap-items` command), we recommend that all mappings be removed (via the `ceph osd rm-pg-upmap-items` command) before upgrading to this point release.

* A stall in BlueStore IO submission that was affecting many users has been resolved.
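Per the upmap note above, a quick pre-upgrade check might look like the following sketch (the grep pattern assumes upmap entries appear as pg_upmap lines in the dump):

    ceph osd dump | grep pg_upmap
    # then, for each pg listed:
    ceph osd rm-pg-upmap-items <pgid>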
Other Notable Changes
---

* bluestore: async deferred_try_submit deadlock (issue#21207, pr#17494, Sage Weil)
* bluestore: fix deferred write deadlock, aio short return handling (issue#21171, pr#17601, Sage Weil)
* bluestore: osd crash when change option bluestore_csum_type from none to CRC32 (issue#21175, pr#17497, xie xingguo)
* bluestore: os/bluestore/BlueFS.cc: 1255: FAILED assert(!log_file->fnode.extents.empty()) (issue#21250, pr#17562, Sage Weil)
* build/ops: ceph-fuse RPM should require fusermount (issue#21057, pr#17470, Ken Dreyer)
* build/ops: RHEL 7.3 Selinux denials at OSD start (issue#19200, pr#17468, Boris Ranto)
* build/ops: rocksdb,cmake: build portable binaries (issue#20529, pr#17745, Kefu Chai)
* cephfs: client/mds has wrong check to clear S_ISGID on chown (issue#21004, pr#17471, Patrick Donnelly)
* cephfs: get_quota_root sends lookupname op for every buffered write (issue#20945, pr#17473, Dan van der Ster)
* cephfs: MDCache::try_subtree_merge() may print N^2 lines of debug message (issue#21221, pr#17712, Patrick Donnelly)
* cephfs: MDS rank add/remove log messages say wrong number of ranks (issue#21421, pr#17887, John Spray)
* cephfs: MDS: standby-replay mds should avoid initiating subtree export (issue#21378, issue#21222, pr#17714, "Yan, Zheng", Jianyu Li)
* cephfs: the standbys are not updated via ceph tell mds.\* command (issue#21230, pr#17565, Kefu Chai)
* common: adding line break at end of some cli results (issue#21019, pr#17467, songweibin)
* core: [cls] metadata_list API function does not honor `max_return` parameter (issue#21247, pr#17558, Jason Dillaman)
* core: incorrect erasure-code space in command ceph df (issue#21243, pr#17724, liuchang0812)
* core: interval_set: optimize intersect_of insert operations (issue#21229, pr#17487, Zac Medico)
* core: osd crush rule rename not idempotent (issue#21162, pr#17481, xie xingguo)
* core: osd/PGLog: write only changed dup entries (issue#21026, pr#17378, Josh Durgin)
* doc: doc/rbd: iSCSI Gateway Documentation (issue#20437, pr#17381, Aron Gunn, Jason Dillaman)
* mds: fix 'dirfrag end' check in Server::handle_client_readdir (issue#21070, pr#17686, "Yan, Zheng")
* mds: support limiting cache by memory (issue#20594, pr#17711, "Yan, Zheng", Patrick Donnelly)
* mgr: 500 error when attempting to view filesystem data (issue#20692, pr#17477, John Spray)
* mgr: ceph mgr versions shows active mgr as Unknown (issue#21260, pr#17635, John Spray)
* mgr: Crash in MonCommandCompletion (issue#21157, pr#17483, John Spray)
* mon: mon/OSDMonitor: deleting pool while pgs are being created leads to assert(p != pools.end) in update_creating_pgs() (issue#21309, pr#17634, Joao Eduardo Luis)
* mon: OSDMonitor: osd pool application get support (issue#20976, pr#17472, xie xingguo)
* mon: rate limit on health check update logging (issue#20888, pr#17500, John Spray)
* osd: build_initial_pg_history doesn't update up/acting/etc (issue#21203, pr#17496, w11979, Sage Weil)
* osd: osd/PG: discard msgs from down peers (issue#19605, pr#17501, Kefu Chai)
* osd/PrimaryLogPG: request osdmap
[ceph-users] Openstack (pike) Ceilometer-API deprecated. RadosGW stats?
Hi,

it looks like OpenStack (Pike) has deprecated the Ceilometer API. Is this a problem for RadosGW when it pushes stats to the OpenStack Telemetry service?

Thanks,
J.
Re: [ceph-users] ceph/systemd startup bug (was Re: Some OSDs are down after Server reboot)
On Thu, 28 Sep 2017, Matthew Vernon said:
> Hi,
>
> TL;DR - the timeout setting in ceph-disk@.service is (far) too small - it needs increasing and/or removing entirely. Should I copy this to ceph-devel?

Just a note. Looks like debian stretch luminous packages have a 10_000 second timeout:

from /lib/systemd/system/ceph-disk@.service

Environment=CEPH_DISK_TIMEOUT=10000
ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

Sean

> On 15/09/17 16:48, Matthew Vernon wrote:
> >On 14/09/17 16:26, Götz Reinicke wrote:
> >>After that, 10 OSDs did not come up as the others. The disk did not get mounted and the OSD processes did nothing … even after a couple of minutes no more disks/OSDs showed up.
> >
> >I'm still digging, but AFAICT it's a race condition in startup - in our case, we're only seeing it if some of the filesystems aren't clean. This may be related to the thread "Very slow start of osds after reboot" from August, but I don't think any conclusion was reached there.
>
> This annoyed me enough that I went off to find the problem :-)
>
> On systemd-enabled machines[0] ceph disks are activated by systemd's ceph-disk@.service, which calls:
>
> /bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>
> ceph-disk trigger --sync calls ceph-disk activate which (among other things) mounts the osd fs (first in a temporary location, then in /var/lib/ceph/osd/ once it's extracted the osd number from the fs). If the fs is unclean, XFS auto-recovers before mounting (which takes time - range 2-25s for our 6TB disks). Importantly, there is a single global lock file[1] so only one ceph-disk activate can be doing this at once.
>
> So, each fs is auto-recovering one at a time (rather than in parallel), and once the elapsed time gets past 120s, timeout kills the flock, systemd kills the cgroup, and no more OSDs start up - we typically find a few fs mounted in /var/lib/ceph/tmp/mnt.. systemd keeps trying to start the remaining osds (via ceph-osd@.service), but their fs isn't in the correct place, so this never works.
>
> The fix/workaround is to adjust the timeout value (edit the service file directly, or for style points write an override in /etc/systemd/system remembering you need a blank ExecStart line before your revised one).
>
> Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to start all its osds when started up with all fss dirty. So the current 120s is far too small (it's just about OK when all the osd fss are clean).
>
> I think, though, that having the timeout at all is a bug - if something needs to time out under some circumstances, should it be at a lower layer, perhaps?
>
> A couple of final points/asides, if I may:
>
> ceph-disk trigger uses subprocess.communicate (via the command() function), which means it swallows the log output from ceph-disk activate (and only outputs it after that process finishes) - as well as producing confusing timestamps, this means that when systemd kills the cgroup, all the output from the ceph-disk activate command vanishes into the void. That made debugging needlessly hard. Better to let called processes like that output immediately?
>
> Does each fs need mounting twice? could the osd be encoded in the partition label or similar instead?
>
> Is a single global activation lock necessary?
> It slows startup down quite a bit; I see no reason why (at least in the one-osd-per-disk case) you couldn't be activating all the osds at once...
>
> Regards,
>
> Matthew
>
> [0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the timeout, so presumably upstart systems aren't affected
> [1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
Re: [ceph-users] RGW how to delete orphans
Hello,

not an expert here but I think the answer is something like:

radosgw-admin orphans find --pool=_DATA_POOL_ --job-id=_JOB_ID_
radosgw-admin orphans finish --job-id=_JOB_ID_

_JOB_ID_ being anything.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Thu, Sep 28, 2017 at 9:38 AM, Andreas Calminder <andreas.calmin...@klarna.com> wrote:

> Hello,
> running Jewel on some nodes with rados gateway I've managed to get a lot of leaked multipart objects, most of them belonging to buckets that do not even exist anymore. We estimated these objects to occupy somewhere around 60TB, which would be great to reclaim. Question is how, since trying to find them one by one and perform some kind of sanity check if they're in use or not will take forever.
>
> The radosgw-admin orphans find command sounds like something I could use, but it's not clear if the command also removes the orphans? If not, what does it do? Can I use it to help me removing my orphan objects?
>
> Best regards,
> Andreas
[ceph-users] RGW how to delete orphans
Hello,

running Jewel on some nodes with rados gateway I've managed to get a lot of leaked multipart objects, most of them belonging to buckets that do not even exist anymore. We estimated these objects to occupy somewhere around 60TB, which would be great to reclaim. Question is how, since trying to find them one by one and perform some kind of sanity check if they're in use or not will take forever.

The radosgw-admin orphans find command sounds like something I could use, but it's not clear if the command also removes the orphans? If not, what does it do? Can I use it to help me removing my orphan objects?

Best regards,
Andreas
Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.
Hi,

I had a chance to catch John Spray at the Ceph Day, and he suggested that I try to reproduce this bug in Luminous.

To fix my immediate problem we discussed 2 ideas:

1. Manually edit the meta-data, unfortunately I was not able to find any information on how the meta-data is structured :-(

2. Edit the code to set the link count to 0 if it is negative:

diff --git a/src/mds/StrayManager.cc b/src/mds/StrayManager.cc
index 9e53907..2ca1449 100644
--- a/src/mds/StrayManager.cc
+++ b/src/mds/StrayManager.cc
@@ -553,6 +553,10 @@ bool StrayManager::__eval_stray(CDentry *dn, bool delay)
     logger->set(l_mdc_num_strays_delayed, num_strays_delayed);
   }

+  if (in->inode.nlink < 0) {
+    in->inode.nlink=0;
+  }
+
   // purge?
   if (in->inode.nlink == 0) {
     // past snaprealm parents imply snapped dentry remote links.

diff --git a/src/xxHash b/src/xxHash
--- a/src/xxHash
+++ b/src/xxHash
@@ -1 +1 @@

I'm not sure if this works, the patched mds no longer crashes, however I expected that this value:

root@mds02:~ # ceph daemonperf mds.1
-mds-- --mds_server-- ---objecter--- -mds_cache- ---mds_log
rlat inos caps|hsr hcs hcr |writ read actv|recd recy stry purg|segs evts subm|
   0 100k    0 |  0   0   0 |   0   0   0 |   0   0 625k   0 |  30  25k   0

Should go down, but it stays at 625k, unfortunately I don't have another system to compare.

After I started the patched mds once, I reverted back to an unpatched mds, and it also stopped crashing, so I guess it did "fix" something.

A question just out of curiosity, I tried to log these events with something like:

dout(10) << "Fixed negative inode count";

or

derr << "Fixed negative inode count";

But my compiler yelled at me for trying this.

Micha Krause
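To watch whether the stray count actually drains, the 'stry' column above can also be read straight from the MDS perf counters; a small sketch (daemon name as in the thread):

    ceph daemonperf mds.1
    ceph daemon mds.1 perf dump mds_cache | grep -i strays   # num_strays / num_strays_delayed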
[ceph-users] ceph/systemd startup bug (was Re: Some OSDs are down after Server reboot)
Hi,

TL;DR - the timeout setting in ceph-disk@.service is (far) too small - it needs increasing and/or removing entirely. Should I copy this to ceph-devel?

On 15/09/17 16:48, Matthew Vernon wrote:
On 14/09/17 16:26, Götz Reinicke wrote:
After that, 10 OSDs did not come up as the others. The disk did not get mounted and the OSD processes did nothing … even after a couple of minutes no more disks/OSDs showed up.

I'm still digging, but AFAICT it's a race condition in startup - in our case, we're only seeing it if some of the filesystems aren't clean. This may be related to the thread "Very slow start of osds after reboot" from August, but I don't think any conclusion was reached there.

This annoyed me enough that I went off to find the problem :-)

On systemd-enabled machines[0] ceph disks are activated by systemd's ceph-disk@.service, which calls:

/bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

ceph-disk trigger --sync calls ceph-disk activate which (among other things) mounts the osd fs (first in a temporary location, then in /var/lib/ceph/osd/ once it's extracted the osd number from the fs). If the fs is unclean, XFS auto-recovers before mounting (which takes time - range 2-25s for our 6TB disks). Importantly, there is a single global lock file[1] so only one ceph-disk activate can be doing this at once.

So, each fs is auto-recovering one at a time (rather than in parallel), and once the elapsed time gets past 120s, timeout kills the flock, systemd kills the cgroup, and no more OSDs start up - we typically find a few fs mounted in /var/lib/ceph/tmp/mnt.. systemd keeps trying to start the remaining osds (via ceph-osd@.service), but their fs isn't in the correct place, so this never works.

The fix/workaround is to adjust the timeout value (edit the service file directly, or for style points write an override in /etc/systemd/system remembering you need a blank ExecStart line before your revised one).

Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to start all its osds when started up with all fss dirty. So the current 120s is far too small (it's just about OK when all the osd fss are clean).

I think, though, that having the timeout at all is a bug - if something needs to time out under some circumstances, should it be at a lower layer, perhaps?

A couple of final points/asides, if I may:

ceph-disk trigger uses subprocess.communicate (via the command() function), which means it swallows the log output from ceph-disk activate (and only outputs it after that process finishes) - as well as producing confusing timestamps, this means that when systemd kills the cgroup, all the output from the ceph-disk activate command vanishes into the void. That made debugging needlessly hard. Better to let called processes like that output immediately?

Does each fs need mounting twice? could the osd be encoded in the partition label or similar instead?

Is a single global activation lock necessary? It slows startup down quite a bit; I see no reason why (at least in the one-osd-per-disk case) you couldn't be activating all the osds at once...
Regards,

Matthew

[0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the timeout, so presumably upstart systems aren't affected
[1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
Re: [ceph-users] "ceph fs" commands hang forever and kill monitors
On Thu, Sep 28, 2017 at 11:51 AM, Richard Heskethwrote: > On 27/09/17 19:35, John Spray wrote: >> On Wed, Sep 27, 2017 at 1:18 PM, Richard Hesketh >> wrote: >>> On 27/09/17 12:32, John Spray wrote: On Wed, Sep 27, 2017 at 12:15 PM, Richard Hesketh wrote: > As the subject says... any ceph fs administrative command I try to run > hangs forever and kills monitors in the background - sometimes they come > back, on a couple of occasions I had to manually stop/restart a suffering > mon. Trying to load the filesystem tab in the ceph-mgr dashboard dumps an > error and can also kill a monitor. However, clients can mount the > filesystem and read/write data without issue. > > Relevant excerpt from logs on an affected monitor, just trying to run > 'ceph fs ls': > > 2017-09-26 13:20:50.716087 7fc85fdd9700 0 mon.vm-ds-01@0(leader) e19 > handle_command mon_command({"prefix": "fs ls"} v 0) v1 > 2017-09-26 13:20:50.727612 7fc85fdd9700 0 log_channel(audit) log [DBG] : > from='client.? 10.10.10.1:0/2771553898' entity='client.admin' > cmd=[{"prefix": "fs ls"}]: dispatch > 2017-09-26 13:20:50.950373 7fc85fdd9700 -1 > /build/ceph-12.2.0/src/osd/OSDMap.h: In function 'const string& > OSDMap::get_pool_name(int64_t) const' thread 7fc85fdd9700 time 2017-09-26 > 13:20:50.727676 > /build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != > pool_name.end()) > > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous > (rc) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x102) [0x55a8ca0bb642] > 2: (()+0x48165f) [0x55a8c9f4165f] > 3: > (MDSMonitor::preprocess_command(boost::intrusive_ptr)+0x1d18) > [0x55a8ca047688] > 4: > (MDSMonitor::preprocess_query(boost::intrusive_ptr)+0x2a8) > [0x55a8ca048008] > 5: (PaxosService::dispatch(boost::intrusive_ptr)+0x700) > [0x55a8c9f9d1b0] > 6: (Monitor::handle_command(boost::intrusive_ptr)+0x1f93) > [0x55a8c9e63193] > 7: (Monitor::dispatch_op(boost::intrusive_ptr)+0xa0e) > [0x55a8c9e6a52e] > 8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b] > 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053] > 10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a] > 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d] > 12: (()+0x76ba) [0x7fc86b3ac6ba] > 13: (clone()+0x6d) [0x7fc869bd63dd] > NOTE: a copy of the executable, or `objdump -rdS ` is needed > to interpret this. > > I'm running Luminous. The cluster and FS have been in service since > Hammer and have default data/metadata pool names. I discovered the issue > after attempting to enable directory sharding. Well that's not good... The assertion is because your FSMap is referring to a pool that apparently no longer exists in the OSDMap. This should be impossible in current Ceph (we forbid removing pools if they're in use), but could perhaps have been caused in an earlier version of Ceph when it was possible to remove a pool even if CephFS was referring to it? Alternatively, perhaps something more severe is going on that's causing your mons to see a wrong/inconsistent view of the world. Has the cluster ever been through any traumatic disaster recovery type activity involving hand-editing any of the cluster maps? What intermediate versions has it passed through on the way from Hammer to Luminous? Opened a ticket here: http://tracker.ceph.com/issues/21568 John >>> >>> I've reviewed my notes (i.e. 
I've grepped my IRC logs); I actually >>> inherited this cluster from a colleague who left shortly after I joined, so >>> unfortunately there is some of its history I cannot fill in. >>> >>> Turns out the cluster actually predates Firefly. Looking at dates my >>> suspicion is that it went Emperor -> Firefly -> Giant -> Hammer. I >>> inherited it at Hammer, and took it Hammer -> Infernalis -> Jewel -> >>> Luminous myself. I know I did make sure to do the tmap_upgrade step on >>> cephfs but can't remember if I did it at Infernalis or Jewel. >>> >>> Infernalis was a tricky upgrade; the attempt was aborted once after the >>> first set of OSDs didn't come back up after upgrade (had to >>> remove/downgrade and readd), and setting sortbitwise as the documentation >>> suggested after a successful second attempt caused everything to break and >>> degrade slowly until it was unset and recovered. Never had disaster >>> recovery involve mucking around with the pools while I was administrating >>> it, but unfortunately I cannot speak for the cluster's pre-Hammer history. >>> The only pools I have removed were ones I created temporarily for testing >>>
Re: [ceph-users] "ceph fs" commands hang forever and kill monitors
So the problem you faced has been completely solved? On Thu, Sep 28, 2017 at 7:51 PM, Richard Heskethwrote: > On 27/09/17 19:35, John Spray wrote: >> On Wed, Sep 27, 2017 at 1:18 PM, Richard Hesketh >> wrote: >>> On 27/09/17 12:32, John Spray wrote: On Wed, Sep 27, 2017 at 12:15 PM, Richard Hesketh wrote: > As the subject says... any ceph fs administrative command I try to run > hangs forever and kills monitors in the background - sometimes they come > back, on a couple of occasions I had to manually stop/restart a suffering > mon. Trying to load the filesystem tab in the ceph-mgr dashboard dumps an > error and can also kill a monitor. However, clients can mount the > filesystem and read/write data without issue. > > Relevant excerpt from logs on an affected monitor, just trying to run > 'ceph fs ls': > > 2017-09-26 13:20:50.716087 7fc85fdd9700 0 mon.vm-ds-01@0(leader) e19 > handle_command mon_command({"prefix": "fs ls"} v 0) v1 > 2017-09-26 13:20:50.727612 7fc85fdd9700 0 log_channel(audit) log [DBG] : > from='client.? 10.10.10.1:0/2771553898' entity='client.admin' > cmd=[{"prefix": "fs ls"}]: dispatch > 2017-09-26 13:20:50.950373 7fc85fdd9700 -1 > /build/ceph-12.2.0/src/osd/OSDMap.h: In function 'const string& > OSDMap::get_pool_name(int64_t) const' thread 7fc85fdd9700 time 2017-09-26 > 13:20:50.727676 > /build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != > pool_name.end()) > > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous > (rc) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x102) [0x55a8ca0bb642] > 2: (()+0x48165f) [0x55a8c9f4165f] > 3: > (MDSMonitor::preprocess_command(boost::intrusive_ptr)+0x1d18) > [0x55a8ca047688] > 4: > (MDSMonitor::preprocess_query(boost::intrusive_ptr)+0x2a8) > [0x55a8ca048008] > 5: (PaxosService::dispatch(boost::intrusive_ptr)+0x700) > [0x55a8c9f9d1b0] > 6: (Monitor::handle_command(boost::intrusive_ptr)+0x1f93) > [0x55a8c9e63193] > 7: (Monitor::dispatch_op(boost::intrusive_ptr)+0xa0e) > [0x55a8c9e6a52e] > 8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b] > 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053] > 10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a] > 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d] > 12: (()+0x76ba) [0x7fc86b3ac6ba] > 13: (clone()+0x6d) [0x7fc869bd63dd] > NOTE: a copy of the executable, or `objdump -rdS ` is needed > to interpret this. > > I'm running Luminous. The cluster and FS have been in service since > Hammer and have default data/metadata pool names. I discovered the issue > after attempting to enable directory sharding. Well that's not good... The assertion is because your FSMap is referring to a pool that apparently no longer exists in the OSDMap. This should be impossible in current Ceph (we forbid removing pools if they're in use), but could perhaps have been caused in an earlier version of Ceph when it was possible to remove a pool even if CephFS was referring to it? Alternatively, perhaps something more severe is going on that's causing your mons to see a wrong/inconsistent view of the world. Has the cluster ever been through any traumatic disaster recovery type activity involving hand-editing any of the cluster maps? What intermediate versions has it passed through on the way from Hammer to Luminous? Opened a ticket here: http://tracker.ceph.com/issues/21568 John >>> >>> I've reviewed my notes (i.e. 
I've grepped my IRC logs); I actually >>> inherited this cluster from a colleague who left shortly after I joined, so >>> unfortunately there is some of its history I cannot fill in. >>> >>> Turns out the cluster actually predates Firefly. Looking at dates my >>> suspicion is that it went Emperor -> Firefly -> Giant -> Hammer. I >>> inherited it at Hammer, and took it Hammer -> Infernalis -> Jewel -> >>> Luminous myself. I know I did make sure to do the tmap_upgrade step on >>> cephfs but can't remember if I did it at Infernalis or Jewel. >>> >>> Infernalis was a tricky upgrade; the attempt was aborted once after the >>> first set of OSDs didn't come back up after upgrade (had to >>> remove/downgrade and readd), and setting sortbitwise as the documentation >>> suggested after a successful second attempt caused everything to break and >>> degrade slowly until it was unset and recovered. Never had disaster >>> recovery involve mucking around with the pools while I was administrating >>> it, but unfortunately I cannot speak for the cluster's pre-Hammer history. >>> The only pools I have
Re: [ceph-users] "ceph fs" commands hang forever and kill monitors
On 27/09/17 19:35, John Spray wrote: > On Wed, Sep 27, 2017 at 1:18 PM, Richard Hesketh >wrote: >> On 27/09/17 12:32, John Spray wrote: >>> On Wed, Sep 27, 2017 at 12:15 PM, Richard Hesketh >>> wrote: As the subject says... any ceph fs administrative command I try to run hangs forever and kills monitors in the background - sometimes they come back, on a couple of occasions I had to manually stop/restart a suffering mon. Trying to load the filesystem tab in the ceph-mgr dashboard dumps an error and can also kill a monitor. However, clients can mount the filesystem and read/write data without issue. Relevant excerpt from logs on an affected monitor, just trying to run 'ceph fs ls': 2017-09-26 13:20:50.716087 7fc85fdd9700 0 mon.vm-ds-01@0(leader) e19 handle_command mon_command({"prefix": "fs ls"} v 0) v1 2017-09-26 13:20:50.727612 7fc85fdd9700 0 log_channel(audit) log [DBG] : from='client.? 10.10.10.1:0/2771553898' entity='client.admin' cmd=[{"prefix": "fs ls"}]: dispatch 2017-09-26 13:20:50.950373 7fc85fdd9700 -1 /build/ceph-12.2.0/src/osd/OSDMap.h: In function 'const string& OSDMap::get_pool_name(int64_t) const' thread 7fc85fdd9700 time 2017-09-26 13:20:50.727676 /build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != pool_name.end()) ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55a8ca0bb642] 2: (()+0x48165f) [0x55a8c9f4165f] 3: (MDSMonitor::preprocess_command(boost::intrusive_ptr)+0x1d18) [0x55a8ca047688] 4: (MDSMonitor::preprocess_query(boost::intrusive_ptr)+0x2a8) [0x55a8ca048008] 5: (PaxosService::dispatch(boost::intrusive_ptr)+0x700) [0x55a8c9f9d1b0] 6: (Monitor::handle_command(boost::intrusive_ptr)+0x1f93) [0x55a8c9e63193] 7: (Monitor::dispatch_op(boost::intrusive_ptr)+0xa0e) [0x55a8c9e6a52e] 8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b] 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053] 10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a] 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d] 12: (()+0x76ba) [0x7fc86b3ac6ba] 13: (clone()+0x6d) [0x7fc869bd63dd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. I'm running Luminous. The cluster and FS have been in service since Hammer and have default data/metadata pool names. I discovered the issue after attempting to enable directory sharding. >>> >>> Well that's not good... >>> >>> The assertion is because your FSMap is referring to a pool that >>> apparently no longer exists in the OSDMap. This should be impossible >>> in current Ceph (we forbid removing pools if they're in use), but >>> could perhaps have been caused in an earlier version of Ceph when it >>> was possible to remove a pool even if CephFS was referring to it? >>> >>> Alternatively, perhaps something more severe is going on that's >>> causing your mons to see a wrong/inconsistent view of the world. Has >>> the cluster ever been through any traumatic disaster recovery type >>> activity involving hand-editing any of the cluster maps? What >>> intermediate versions has it passed through on the way from Hammer to >>> Luminous? >>> >>> Opened a ticket here: http://tracker.ceph.com/issues/21568 >>> >>> John >> >> I've reviewed my notes (i.e. I've grepped my IRC logs); I actually inherited >> this cluster from a colleague who left shortly after I joined, so >> unfortunately there is some of its history I cannot fill in. >> >> Turns out the cluster actually predates Firefly. 
Looking at dates my >> suspicion is that it went Emperor -> Firefly -> Giant -> Hammer. I inherited >> it at Hammer, and took it Hammer -> Infernalis -> Jewel -> Luminous myself. >> I know I did make sure to do the tmap_upgrade step on cephfs but can't >> remember if I did it at Infernalis or Jewel. >> >> Infernalis was a tricky upgrade; the attempt was aborted once after the >> first set of OSDs didn't come back up after upgrade (had to remove/downgrade >> and readd), and setting sortbitwise as the documentation suggested after a >> successful second attempt caused everything to break and degrade slowly >> until it was unset and recovered. Never had disaster recovery involve >> mucking around with the pools while I was administrating it, but >> unfortunately I cannot speak for the cluster's pre-Hammer history. The only >> pools I have removed were ones I created temporarily for testing crush >> rules/benchmarking. > > OK, so it sounds like a cluster with an interesting history and some > stories to tell :-) > >> I have hand-edited the crush map (extract, decompile,
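For anyone wanting to confirm the FSMap-vs-OSDMap mismatch John describes without going through the crashing "ceph fs" handler, a rough check (it assumes your release's "ceph report" output includes the fsmap/mdsmap section, and that jq is available - neither is stated in this thread):

  ceph report 2>/dev/null | jq '.fsmap'   # pool IDs the filesystem map refers to
  ceph osd lspools                        # pool IDs that actually exist in the OSDMap

If a data or metadata pool ID from the first command is missing from the second, you are looking at the inconsistency tracked in http://tracker.ceph.com/issues/21568.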
Re: [ceph-users] RDMA with mellanox connect x3pro on debian stretch and proxmox v5.0 kernel 4.10.17-3
Hi Haomai, can you please guide me to a running cluster with RDMA ? regards Gerhard W. Recher net4sec UG (haftungsbeschränkt) Leitenweg 6 86929 Penzing +49 171 4802507 Am 28.09.2017 um 04:21 schrieb Haomai Wang: > previously we have a infiniband cluster, recently we deploy a roce > cluster. they are both test purpose for users. > > On Wed, Sep 27, 2017 at 11:38 PM, Gerhard W. Recher >wrote: >> Haomai, >> >> I looked at your presentation, so i guess you already have a running >> cluster with RDMA & mellanox >> (https://www.youtube.com/watch?v=Qb2SUWLdDCw) >> >> Is nobody out there having a running cluster with RDMA ? >> any help is appreciated ! >> >> Gerhard W. Recher >> >> net4sec UG (haftungsbeschränkt) >> Leitenweg 6 >> 86929 Penzing >> >> +49 171 4802507 >> Am 27.09.2017 um 16:09 schrieb Haomai Wang: >>> https://community.mellanox.com/docs/DOC-2415 >>> >>> On Wed, Sep 27, 2017 at 10:01 PM, Gerhard W. Recher >>> wrote: How to set local gid option ? I have no glue :) Gerhard W. Recher net4sec UG (haftungsbeschränkt) Leitenweg 6 86929 Penzing +49 171 4802507 Am 27.09.2017 um 15:59 schrieb Haomai Wang: > do you set local gid option? > > On Wed, Sep 27, 2017 at 9:52 PM, Gerhard W. Recher > wrote: >> Yep ROcE >> >> i followed up all recommendations in mellanox papers ... >> >> */etc/security/limits.conf* >> >> * soft memlock unlimited >> * hard memlock unlimited >> root soft memlock unlimited >> root hard memlock unlimited >> >> >> also set properties on daemons (chapter 11) in >> https://community.mellanox.com/docs/DOC-2721 >> >> >> only gids parameter in ceph.conf is no way in proxmox, because >> cephp.conf is for all storage node the same file >> root@pve01:/etc/ceph# ls -latr >> total 16 >> lrwxrwxrwx 1 root root 18 Jun 21 19:35 ceph.conf -> >> /etc/pve/ceph.conf >> >> and each node has uniqe GIDS. >> >> >> ./showgids >> DEV PORTINDEX GID >> IPv4VER DEV >> --- - --- >> --- --- >> mlx4_0 1 0 >> fe80::::268a:07ff:fee2:6070 v1 ens1 >> mlx4_0 1 1 >> fe80::::268a:07ff:fee2:6070 v2 ens1 >> mlx4_0 1 2 ::::::c0a8:dd8d >> 192.168.221.141 v1 vmbr0 >> mlx4_0 1 3 ::::::c0a8:dd8d >> 192.168.221.141 v2 vmbr0 >> mlx4_0 2 0 >> fe80::::268a:07ff:fee2:6071 v1 ens1d1 >> mlx4_0 2 1 >> fe80::::268a:07ff:fee2:6071 v2 ens1d1 >> mlx4_0 2 2 ::::::c0a8:648d >> 192.168.100.141 v1 ens1d1 >> mlx4_0 2 3 ::::::c0a8:648d >> 192.168.100.141 v2 ens1d1 >> n_gids_found=8 >> >> next node ... showgids >> ./showgids >> DEV PORTINDEX GID >> IPv4VER DEV >> --- - --- >> --- --- >> mlx4_0 1 0 >> fe80::::268a:07ff:fef9:8730 v1 ens1 >> mlx4_0 1 1 >> fe80::::268a:07ff:fef9:8730 v2 ens1 >> mlx4_0 1 2 ::::::c0a8:dd8e >> 192.168.221.142 v1 vmbr0 >> mlx4_0 1 3 ::::::c0a8:dd8e >> 192.168.221.142 v2 vmbr0 >> mlx4_0 2 0 >> fe80::::268a:07ff:fef9:8731 v1 ens1d1 >> mlx4_0 2 1 >> fe80::::268a:07ff:fef9:8731 v2 ens1d1 >> mlx4_0 2 2 ::::::c0a8:648e >> 192.168.100.142 v1 ens1d1 >> mlx4_0 2 3 ::::::c0a8:648e >> 192.168.100.142 v2 ens1d1 >> n_gids_found=8 >> >> >> >> ifconfig ens1d1 >> ens1d1: flags=4163 mtu 9000 >> inet 192.168.100.141 netmask 255.255.255.0 broadcast >> 192.168.100.255 >> inet6 fe80::268a:7ff:fee2:6071 prefixlen 64 scopeid 0x20 >> ether 24:8a:07:e2:60:71 txqueuelen 1000 (Ethernet) >> RX packets 25450717 bytes 39981352146 (37.2 GiB) >> RX errors 0 dropped 77 overruns 77 frame 0 >> TX packets 26554236 bytes 53419159091 (49.7 GiB) >> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >> >> >> >> Gerhard W. Recher >> >> net4sec UG (haftungsbeschränkt) >> Leitenweg 6
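To Gerhard's per-node gid problem (a shared /etc/pve/ceph.conf can't carry a node-specific value in [global]): since any given daemon only runs on one node, per-daemon sections are one possible workaround. A rough sketch, assuming the async messenger RDMA options from the Mellanox guide linked above; the gid strings are illustrative guesses at the full form of the RoCE entries that showgids prints, so substitute your own:

  [global]
  ms_type = async+rdma
  ms_async_rdma_device_name = mlx4_0

  # node-specific gids via per-daemon sections:
  [osd.0]
  ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:648d
  [osd.1]
  ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:648e

Whether this satisfies the RDMA messenger on a Proxmox build is untested here - treat it as a starting point, not a recipe.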
Re: [ceph-users] tunable question
Hi Dan, list,

Our cluster is small: three nodes, totally 24 4Tb platter OSDs, SSD journals. Using rbd for VMs. That's it. Runs nicely though :-)

The fact that "tunable optimal" for jewel would result in "significantly fewer mappings change when an OSD is marked out of the cluster" is what attracts us. Reasoning behind it: upgrading to "optimal" NOW should result in faster rebuild-time when disaster strikes, and we're all stressed out. :-)

After the jewel upgrade, we also upgraded the tunables from "(require bobtail, min is firefly)" to "hammer". This resulted in approx 24 hours rebuild, but actually without significant impact on the hosted VMs.

Is it safe to assume that setting it to "optimal" would have a similar impact, or are the implications bigger?

MJ

On 09/28/2017 10:29 AM, Dan van der Ster wrote:
> Hi,
>
> How big is your cluster and what is your use case? For us, we'll likely never enable the recent tunables that need to remap *all* PGs -- it would simply be too disruptive for marginal benefit.
>
> Cheers, Dan
>
> On Thu, Sep 28, 2017 at 9:21 AM, mj wrote:
>> Hi,
>>
>> We have completed the upgrade to jewel, and we set tunables to hammer. Cluster again HEALTH_OK. :-)
>>
>> But now, we would like to proceed in the direction of luminous and bluestore OSDs, and we would like to ask for some feedback first.
>>
>> From the jewel ceph docs on tunables: "Changing tunable to "optimal" on an existing cluster will result in a very large amount of data movement as almost every PG mapping is likely to change."
>>
>> Given the above, and the fact that we would like to proceed to luminous/bluestore in the not too far away future: What is cleverer:
>>
>> 1 - keep the cluster at tunable hammer now, upgrade to luminous in a little while, change OSDs to bluestore, and then set tunables to optimal
>>
>> or
>>
>> 2 - set tunable to optimal now, take the impact of "almost all PG remapping", and when that is finished, upgrade to luminous, bluestore etc.
>>
>> Which route is the preferred one? Or is there a third (or fourth?) option..? :-)
>>
>> MJ
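For the archives, the knobs being discussed - show the current profile, then switch it (with the usual caveat that the second command is what triggers the mass remapping under discussion):

  ceph osd crush show-tunables
  ceph osd crush tunables optimal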
Re: [ceph-users] Ceph Developers Monthly - October
On 09/28/2017 04:08 AM, Leonardo Vaz wrote:
> Hey Cephers,
>
> This is just a friendly reminder that the next Ceph Developer Monthly meeting is coming up:
>
> http://wiki.ceph.com/Planning
>
> If you have work that you're doing that is feature work, significant backports, or anything you would like to discuss with the core team, please add it to the following page:
>
> http://wiki.ceph.com/CDM_04-OCT-2017

Will we, at some point, have a european-friendly time? :)

From the planning page, I see that we've been alternating between 21h00 EDT and 12h30 EDT for over a year, but this time around we're having two straight 21h00 EDT sessions instead. Was this a copy-paste mistake, or are we actually having another APAC friendly session again?

-Joao
Re: [ceph-users] Ceph Developers Monthly - October
Are we going to have the next CDM in an APAC friendly time slot again?

On Thu, Sep 28, 2017 at 12:08 PM, Leonardo Vaz wrote:
> Hey Cephers,
>
> This is just a friendly reminder that the next Ceph Developer Monthly meeting is coming up:
>
> http://wiki.ceph.com/Planning
>
> If you have work that you're doing that is feature work, significant backports, or anything you would like to discuss with the core team, please add it to the following page:
>
> http://wiki.ceph.com/CDM_04-OCT-2017
>
> If you have questions or comments, please let us know.
>
> Kindest regards,
>
> Leo
>
> --
> Leonardo Vaz
> Ceph Community Manager
> Open Source and Standards Team
Re: [ceph-users] Need some help/advice upgrading Hammer to Jewel - HEALTH_ERR shutting down OSD
David,

Thank you so much for your reply. I'm not entirely satisfied though. I'm expecting the PG states "degraded" and "undersized"; those should result in a HEALTH_WARN. I'm particularly worried about the "stuck inactive" part. Please correct me if I'm wrong, but I was under the impression that a PG would only get in that state if all OSDs that have that PG mapped are down.

Even if the cluster would recover immediately after updating and bringing the OSDs back up, I really wouldn't feel comfortable doing this while the cluster is online and being used. I think I'll schedule downtime and do an offline upgrade instead, just to be safe. Nonetheless I would really like to know what is wrong with either this cluster or my understanding of Ceph.

Below is the ceph -s for my test environment. I would expect the production cluster to act the same. I also find it odd the test setup didn't get the tunables warning. Both clusters were running a Hammer release when initialized, probably not the exact same versions though.

     health HEALTH_WARN
            53 pgs degraded
            53 pgs stuck degraded
            67 pgs stuck unclean
            53 pgs stuck undersized
            53 pgs undersized
            recovery 28/423 objects degraded (6.619%)
     monmap e3: 3 mons at {mgm1=10.10.100.21:6789/0,mgm2=10.10.100.22:6789/0,mgm3=10.10.100.23:6789/0}
            election epoch 40, quorum 0,1,2 mgm1,mgm2,mgm3
     osdmap e163: 6 osds: 4 up, 4 in; 14 remapped pgs
      pgmap v2320: 96 pgs, 2 pools, 514 MB data, 141 objects
            1638 MB used, 100707 MB / 102345 MB avail
            28/423 objects degraded (6.619%)
                  53 active+undersized+degraded
                  29 active+clean
                  14 active+remapped

Kind regards,

Eric van Blokland

On Thu, Sep 28, 2017 at 3:02 AM, David Turner wrote:
> There are new PG states that cause health_err. In this case it is undersized that is causing this state.
>
> While I decided to upgrade my tunables before upgrading the rest of my cluster, it does not seem to be a requirement. However I would recommend upgrading them sooner than later. It will cause a fair amount of backfilling when you do it. If you are using krbd, don't upgrade your tunables past Hammer.
>
> In any case, you should feel safe continuing with your upgrade. You will definitely be safe to finish this first node as you have 2 copies of your data if anything goes awry. I would say that this first node will finish and get back to a state where all backfilling is done and you can continue with the other nodes.
>
> On Wed, Sep 27, 2017, 6:32 PM Eric van Blokland wrote:
>> Hello,
>>
>> I have run into an issue while upgrading a Ceph cluster from Hammer to Jewel on CentOS. It's a small cluster with 3 monitoring servers and a humble 6 OSDs distributed over 3 servers.
>>
>> I've upgraded the 3 monitors successfully to 10.2.7. They appear to be running fine except for this health warning: "crush map has legacy tunables (require bobtail, min is firefly)". While I might completely underestimate the significance of this warning, it seemed pretty harmless to me and I decided to upgrade my OSDs (running 0.94.10) before touching the tunables.
>>
>> However, as soon as I brought down the OSDs on the first storage server to start upgrading them, the cluster immediately got a HEALTH_ERR status (see ceph -s output below), which made me abort the update process and just start the OSDs again.
>>
>> Now considering that my crushmap forces distribution of 3 copies over 3 servers, the cluster can't heal itself when I take those OSDs down, which would justify an error status. I'm worried however because my memory and my lab environment tell me that this situation should only give a health warning and only degraded PGs, not stuck/inactive (or did my lab environment not get the stuck pgs because they were not being addressed?).
>>
>>      health HEALTH_ERR
>>             199 pgs are stuck inactive for more than 300 seconds
>>             576 pgs degraded
>>             199 pgs stuck inactive
>>             238 pgs stuck unclean
>>             576 pgs undersized
>>             recovery 1415496/4246488 objects degraded (33.333%)
>>             2/6 in osds are down
>>             crush map has legacy tunables (require bobtail, min is firefly)
>>      monmap e1: 3 mons at {mgm1=10.10.3.11:6789/0,mgm2=10.10.3.12:6789/0,mgm3=10.10.3.13:6789/0}
>>             election epoch 1650, quorum 0,1,2 mgm1,mgm2,mgm3
>>      osdmap e808: 6 osds: 4 up, 6 in; 576 remapped pgs
>>       pgmap v4309615: 576 pgs, 5 pools, 1483 GB data, 1382 kobjects
>>             4445 GB used, 7836 GB / 12281 GB avail
>>             1415496/4246488 objects degraded (33.333%)
>>                    512 undersized+degraded+peered
>>                     64 active+undersized+degraded
>>
>> How should I proceed from here? Am I seeing
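A few standard commands that may help narrow down why those PGs report stuck inactive (nothing version-specific assumed; substitute a real pgid for the placeholder):

  ceph health detail | grep -i stuck   # which PGs are stuck, and for how long
  ceph pg dump_stuck inactive          # list the allegedly inactive PGs
  ceph pg <pgid> query                 # peering state of one suspect PG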
Re: [ceph-users] PG in active+clean+inconsistent, but list-inconsistent-obj doesn't show it
On 28. sep. 2017 09:27, Olivier Migeot wrote:
> Greetings,
>
> we're in the process of recovering a cluster after an electrical disaster. It didn't work badly so far - we managed to clear most of the errors. All that prevents a return to HEALTH_OK now is a bunch (6) of scrub errors, apparently from a PG that's marked as active+clean+inconsistent.
>
> Thing is, rados list-inconsistent-obj doesn't return anything but an empty list (plus, in the most recent attempts: error 2: (2) No such file or directory)
>
> We're on Jewel (waiting for this to be fixed before planning upgrade), and the pool our PG belongs to has a replica of 2. No success with ceph pg repair, and I already tried to remove and import the most recent version of said PG in both its acting OSDs: it doesn't change a thing.
>
> Is there anything else I could try? Thanks,

size=2 is of course horrible, and I assume you know that... But even more important: I hope you have min_size=2 so you avoid generating more problems in the future, or while troubleshooting!

First of all, read this link a few times: http://ceph.com/geen-categorie/ceph-manually-repair-object/

You need to locate the bad objects to fix them. Since rados list-inconsistent-obj does not work, you need to manually check the logs of the OSDs that are participating in the PG in question. grep for ERR; once you find the name of the object with the problem, you need to locate the object using

  find /path/of/pg -name 'objectname'

Once you have the object path, you need to compare the two objects and find out which one is the bad one. This is where 3x replication would have helped: when one is bad, how do you know the bad from the good...

The error message in the log may give hints. Read and understand what the error message is, since it is critical to understanding what is wrong with the object. The object type also helps when determining the wrong one: is it a rados object, an rbd block, or a cephfs metadata or data object? Knowing what it should be helps determining the wrong one.

Things to try:

ls -lh $path ; compare metadata. Are there obvious problems? Refer to the error in the log.
- one has size 0 and there should have been a size?
- one has size greater than 0 and it should have been size 0?
- one is significantly larger than the other - perhaps one is truncated? perhaps one has garbage added.

md5sum $path
- perhaps a block has a read error; it would show on this command, and be a dead giveaway to the problem object.
- compare checksums. Do you know what the object should have as its sum?

Actually look at the object: use strings or hexdump to try to determine the contents, vs what the object should contain.

If you can locate the bad object: stop the OSD, flush its journal, move away the bad object (I just mv it to somewhere else), restart the OSD, then run repair on the PG, tail the logs and wait for the repair and scrub to finish.

If you are unable to determine the good object from the bad, you can try to determine what file it refers to in cephfs, or what block it refers to in rbd; by overwriting that file or block in cephfs or rbd you can indirectly overwrite both objects with new data. If this is an rbd, you should run a filesystem check on the fs on that rbd after all the ceph problems are repaired.

Good luck

Ronny Aasen
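Condensing the walkthrough above into one hedged sequence (this mirrors the manual-repair article linked above, filestore/Jewel assumed; $OSD, $PGID and the object name are placeholders you must substitute for your own values):

  # 1. on each OSD host serving the PG, find the object the scrub errors name
  grep ERR /var/log/ceph/ceph-osd.$OSD.log
  # 2. locate the copies on disk (filestore layout)
  find /var/lib/ceph/osd/ceph-$OSD/current/${PGID}_head/ -name '<objectname>*'
  # 3. once you've decided which copy is the bad one:
  systemctl stop ceph-osd@$OSD
  ceph-osd -i $OSD --flush-journal
  mv <path-to-bad-object> /root/quarantine/   # move it aside, don't delete yet
  systemctl start ceph-osd@$OSD
  ceph pg repair $PGID
  # 4. tail the osd logs and wait for repair + scrub to finish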
Re: [ceph-users] tunable question
Hi,

How big is your cluster and what is your use case?

For us, we'll likely never enable the recent tunables that need to remap *all* PGs -- it would simply be too disruptive for marginal benefit.

Cheers, Dan

On Thu, Sep 28, 2017 at 9:21 AM, mj wrote:
> Hi,
>
> We have completed the upgrade to jewel, and we set tunables to hammer. Cluster again HEALTH_OK. :-)
>
> But now, we would like to proceed in the direction of luminous and bluestore OSDs, and we would like to ask for some feedback first.
>
> From the jewel ceph docs on tunables: "Changing tunable to "optimal" on an existing cluster will result in a very large amount of data movement as almost every PG mapping is likely to change."
>
> Given the above, and the fact that we would like to proceed to luminous/bluestore in the not too far away future: What is cleverer:
>
> 1 - keep the cluster at tunable hammer now, upgrade to luminous in a little while, change OSDs to bluestore, and then set tunables to optimal
>
> or
>
> 2 - set tunable to optimal now, take the impact of "almost all PG remapping", and when that is finished, upgrade to luminous, bluestore etc.
>
> Which route is the preferred one?
>
> Or is there a third (or fourth?) option..? :-)
>
> MJ
[ceph-users] PG in active+clean+inconsistent, but list-inconsistent-obj doesn't show it
Greetings,

we're in the process of recovering a cluster after an electrical disaster. It didn't work badly so far - we managed to clear most of the errors. All that prevents a return to HEALTH_OK now is a bunch (6) of scrub errors, apparently from a PG that's marked as active+clean+inconsistent.

Thing is, rados list-inconsistent-obj doesn't return anything but an empty list (plus, in the most recent attempts: error 2: (2) No such file or directory)

We're on Jewel (waiting for this to be fixed before planning upgrade), and the pool our PG belongs to has a replica of 2. No success with ceph pg repair, and I already tried to remove and import the most recent version of said PG in both its acting OSDs: it doesn't change a thing.

Is there anything else I could try?

Thanks,

--
Olivier Migeot
[ceph-users] tunable question
Hi,

We have completed the upgrade to jewel, and we set tunables to hammer. Cluster again HEALTH_OK. :-)

But now, we would like to proceed in the direction of luminous and bluestore OSDs, and we would like to ask for some feedback first.

From the jewel ceph docs on tunables: "Changing tunable to "optimal" on an existing cluster will result in a very large amount of data movement as almost every PG mapping is likely to change."

Given the above, and the fact that we would like to proceed to luminous/bluestore in the not too far away future: What is cleverer:

1 - keep the cluster at tunable hammer now, upgrade to luminous in a little while, change OSDs to bluestore, and then set tunables to optimal

or

2 - set tunable to optimal now, take the impact of "almost all PG remapping", and when that is finished, upgrade to luminous, bluestore etc.

Which route is the preferred one?

Or is there a third (or fourth?) option..? :-)

MJ
Re: [ceph-users] Large amount of files - cephfs?
On 17-09-27 14:57, Josef Zelenka wrote:
> Hi, we are currently working on a ceph solution for one of our customers. They run a file hosting service and they need to store approximately 100 million pictures (thumbnails). Their current code works with FTP, which they use as storage. We thought that we could use cephfs for this, but I am not sure how it would behave with that many files, how the performance would be affected etc. Is cephfs usable in this scenario, or would radosgw+swift be better (they'd likely have to rewrite some of the code, so we'd prefer not to do this)? We already have some experience with cephfs for storing bigger files, streaming etc, so I'm not completely new to this, but I thought it'd be better to ask more experienced users. Some advice on this would be greatly appreciated, thanks,
>
> Josef

Depending on your OSD count, you should be able to put 100 million files there. As others mentioned, depending on your workload, metadata may be a bottleneck. If metadata is not a concern, then you just need enough OSDs to distribute the RADOS objects. You should be fine with a few million objects per OSD; going with tens of millions per OSD may be more problematic, as you get larger memory usage, slower OSDs, and slow backfill/recovery.
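To put rough, illustrative numbers on that guidance (assuming 3x replication and each thumbnail stored as a single RADOS object - both assumptions, not facts from the thread):

  100M files x 3 replicas = 300M object copies cluster-wide
  across 100 OSDs -> ~3M copies per OSD   (comfortably in the "few million" range)
  across  15 OSDs -> ~20M copies per OSD  (into the tens-of-millions territory warned about above)

So the OSD count, more than raw capacity, is what decides whether this workload stays comfortable.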