Re: [ceph-users] units of metrics
On Tue, Jan 14, 2020 at 12:30 AM Stefan Kooman wrote: > Quoting Robert LeBlanc (rob...@leblancnet.us): > > The link that you referenced above is no longer available, do you have a > > new link?. We upgraded from 12.2.8 to 12.2.12 and the MDS metrics all > > changed, so I'm trying to may the old values to the new values. Might > just > > have to look in the code. :( > > I cannot recall that the metrics have ever changed between 12.2.8 and > 12.2.12. Anyways, it depends on what module you use to collect the > metrics if the right metrics are even there. See this issue: > https://tracker.ceph.com/issues/41881 Yes, I agree that the metrics should not change within a major version, but here is the difference. We are using diamond and the CephCollector, but I verified with the admin socket and dumping the perf counters manually Metrics collected with 12.2.8: servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_client_request 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_server_request 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_request 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_session 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_slave_request 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_link 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookup 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookuphash 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupino 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupname 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupparent 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupsnap 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lssnap 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mkdir 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mknod 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mksnap 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_open 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_readdir 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rename 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_renamesnap 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmdir 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmsnap 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmxattr 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setattr 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setdirlayout 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setfilelock 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setlayout 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setxattr 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_symlink 0 1578955818 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_unlink 0 1578955818 
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.cap_revoke_eviction 0 1578955878 Metrics collected with 12.2.12: (much more clear and descriptive which is good) servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_client_request 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_server_request 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_request 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_session 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_slave_request 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.avgcount 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.avgtime 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.sum 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.avgcount 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.avgtime 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.sum 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock_latency.avgcount 0 1578955878 servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock_latency.avgtime 0 157
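For anyone else mapping the old req_* counters to the new req_*_latency ones: the old per-op count now lives in the avgcount field, and the average latency has to be derived from two successive samples. A rough sketch of that calculation (assuming two `ceph daemon mds.<name> perf dump` JSON captures saved to files, which is an assumption about how you collect them; the field names match the 12.2.12 output above):

    #!/usr/bin/env python
    # Sketch: derive per-op request rate and average latency from two
    # successive MDS "perf dump" samples (12.2.12-style mds_server section).
    import json
    import sys

    def load(path):
        with open(path) as f:
            return json.load(f)["mds_server"]

    old, new = load(sys.argv[1]), load(sys.argv[2])
    interval = float(sys.argv[3])  # seconds between the two samples

    for name, cur in sorted(new.items()):
        if not name.endswith("_latency"):
            continue
        prev = old.get(name, {"avgcount": 0, "sum": 0.0})
        count = cur["avgcount"] - prev["avgcount"]  # ops completed in the interval
        total = cur["sum"] - prev["sum"]            # seconds spent on those ops
        avg_ms = (total / count * 1000.0) if count else 0.0
        print("%-45s %8.2f op/s %9.3f ms avg" % (name, count / interval, avg_ms))

So req_create in the old output roughly corresponds to req_create_latency.avgcount in the new one, with the sum/avgtime fields carrying the latency information that simply wasn't exported before.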
Re: [ceph-users] units of metrics
The link that you referenced above is no longer available, do you have a new link? We upgraded from 12.2.8 to 12.2.12 and the MDS metrics all changed, so I'm trying to map the old values to the new values. Might just have to look in the code. :(

Thanks!
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Thu, Sep 12, 2019 at 8:02 AM Paul Emmerich wrote: > We use a custom script to collect these metrics in croit > > Paul > > -- > Paul Emmerich > > Looking for help with your Ceph cluster? Contact us at https://croit.io > > croit GmbH > Freseniusstr. 31h > 81247 München > www.croit.io > Tel: +49 89 1896585 90 > > On Thu, Sep 12, 2019 at 5:00 PM Stefan Kooman wrote: > > > > Hi Paul, > > > > Quoting Paul Emmerich (paul.emmer...@croit.io): > > > > https://static.croit.io/ceph-training-examples/ceph-training-example-admin-socket.pdf > > > > Thanks for the link. So, what tool do you use to gather the metrics? We > > are using telegraf module of the Ceph manager. However, this module only > > provides "sum" and not "avgtime" so I can't do the calculations. The > > influx and zabbix mgr modules also only provide "sum". The only metrics > > module that *does* send "avgtime" is the prometheus module: > > > > ceph_mds_reply_latency_sum > > ceph_mds_reply_latency_count > > > > All modules use "self.get_all_perf_counters()" though: > > > > ~/git/ceph/src/pybind/mgr/ > grep -Ri get_all_perf_counters * > > dashboard/controllers/perf_counters.py:return > mgr.get_all_perf_counters() > > diskprediction_cloud/agent/metrics/ceph_mon_osd.py:perf_data = > obj_api.module.get_all_perf_counters(services=('mon', 'osd')) > > influx/module.py:for daemon, counters in > six.iteritems(self.get_all_perf_counters()): > > mgr_module.py:def get_all_perf_counters(self, prio_limit=PRIO_USEFUL, > > prometheus/module.py:for daemon, counters in > self.get_all_perf_counters().items(): > > restful/api/perf.py:counters = > context.instance.get_all_perf_counters() > > telegraf/module.py:for daemon, counters in > six.iteritems(self.get_all_perf_counters()) > > > > Besides the *ceph* telegraf module we also use the ceph plugin for > > telegraf ... but that plugin does not (yet?) provide mds metrics though. > > Ideally we would *only* use the ceph mgr telegraf module to collect *all > > the things*. > > > > Not sure what's the difference in python code between the modules that > could explain this. > > > > Gr. Stefan > > > > -- > > | BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351 > > | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
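A small addendum for anyone scraping the prometheus module mentioned above: the mean MDS reply latency can be recovered from the two counters Stefan lists with a query along these lines (a sketch; the exact metric names and labels depend on how the exporter is scraped):

    rate(ceph_mds_reply_latency_sum[5m]) / rate(ceph_mds_reply_latency_count[5m])

The other mgr modules can't do this calculation because, as noted above, they only export "sum".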
Re: [ceph-users] Annoying PGs not deep-scrubbed in time messages in Nautilus.
On Mon, Dec 9, 2019 at 11:58 AM Paul Emmerich wrote: > solved it: the warning is of course generated by ceph-mgr and not ceph-mon. > > So for my problem that means: should have injected the option in ceph-mgr. > That's why it obviously worked when setting it on the pool... > > The solution for you is to simply put the option under global and restart > ceph-mgr (or use daemon config set; it doesn't support changing config via > ceph tell for some reason) > > > Paul > > On Mon, Dec 9, 2019 at 8:32 PM Paul Emmerich > wrote: > >> >> >> On Mon, Dec 9, 2019 at 5:17 PM Robert LeBlanc >> wrote: >> >>> I've increased the deep_scrub interval on the OSDs on our Nautilus >>> cluster with the following added to the [osd] section: >>> >> >> should have read the beginning of your email; you'll need to set the >> option on the mons as well because they generate the warning. So your >> problem might be completely different from what I'm seeing here >> > > >> >> >> Paul >> >> >>> >>> osd_deep_scrub_interval = 260 >>> >>> And I started seeing >>> >>> 1518 pgs not deep-scrubbed in time >>> >>> in ceph -s. So I added >>> >>> mon_warn_pg_not_deep_scrubbed_ratio = 1 >>> >>> since the default would start warning with a whole week left to scrub. >>> But the messages persist. The cluster has been running for a month with >>> these settings. Here is an example of the output. As you can see, some of >>> these are not even two weeks old, no where close to the 75% of 4 weeks. >>> >>> pg 6.1f49 not deep-scrubbed since 2019-11-09 23:04:55.370373 >>>pg 6.1f47 not deep-scrubbed since 2019-11-18 16:10:52.561204 >>>pg 6.1f44 not deep-scrubbed since 2019-11-18 15:48:16.825569 >>>pg 6.1f36 not deep-scrubbed since 2019-11-20 05:39:00.309340 >>>pg 6.1f31 not deep-scrubbed since 2019-11-27 02:48:45.347680 >>>pg 6.1f30 not deep-scrubbed since 2019-11-11 21:34:15.795622 >>>pg 6.1f2d not deep-scrubbed since 2019-11-24 11:37:39.502829 >>>pg 6.1f27 not deep-scrubbed since 2019-11-25 07:38:58.689315 >>>pg 6.1f25 not deep-scrubbed since 2019-11-20 00:13:43.048569 >>>pg 6.1f1a not deep-scrubbed since 2019-11-09 15:08:43.51 >>>pg 6.1f19 not deep-scrubbed since 2019-11-25 10:24:47.884332 >>>1468 more pgs... >>> Mon Dec 9 08:12:01 PST 2019 >>> >>> There is very little data on the cluster, so it's not a problem of >>> deep-scrubs taking too long: >>> >>> $ ceph df >>> RAW STORAGE: >>>CLASS SIZEAVAIL USEDRAW USED %RAW USED >>>hdd 6.3 PiB 6.1 PiB 153 TiB 154 TiB 2.39 >>>nvme 5.8 TiB 5.6 TiB 138 GiB 197 GiB 3.33 >>>TOTAL 6.3 PiB 6.2 PiB 154 TiB 154 TiB 2.39 >>> >>> POOLS: >>>POOL ID STORED OBJECTS USED >>>%USED MAX AVAIL >>>.rgw.root 1 3.0 KiB 7 3.0 KiB >>> 0 1.8 PiB >>>default.rgw.control 2 0 B 8 0 B >>> 0 1.8 PiB >>>default.rgw.meta3 7.4 KiB 24 7.4 KiB >>> 0 1.8 PiB >>>default.rgw.log 4 11 GiB 341 11 GiB >>> 0 1.8 PiB >>>default.rgw.buckets.data6 100 TiB 41.84M 100 TiB >>> 1.82 4.2 PiB >>>default.rgw.buckets.index 7 33 GiB 574 33 GiB >>> 0 1.8 PiB >>>default.rgw.buckets.non-ec 8 8.1 MiB 22 8.1 MiB >>> 0 1.8 PiB >>> >>> Please help me figure out what I'm doing wrong with these settings. >>> >> Paul, Thanks, I did set both options to the global on the mons and restarted them, but that didn't help. Having the scrub interval set in the global section and restarting the mgr fixed it. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
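For anyone finding this thread later on Nautilus, the fix boils down to making sure the mgr sees the new interval as well. A sketch of two ways to do that (the interval value is only an example, four weeks in seconds):

    # centralized config, picked up by all daemons including ceph-mgr
    ceph config set global osd_deep_scrub_interval 2419200
    ceph config set global mon_warn_pg_not_deep_scrubbed_ratio 1

    # or push it straight into a running mgr over its admin socket,
    # per Paul's note that "ceph tell" doesn't work for the mgr here
    ceph daemon mgr.$(hostname -s) config set osd_deep_scrub_interval 2419200

(The mgr.$(hostname -s) daemon name is an assumption; use whatever your active mgr is called.)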
[ceph-users] Annoying PGs not deep-scrubbed in time messages in Nautilus.
I've increased the deep_scrub interval on the OSDs on our Nautilus cluster with the following added to the [osd] section:

    osd_deep_scrub_interval = 260

And I started seeing

    1518 pgs not deep-scrubbed in time

in ceph -s. So I added

    mon_warn_pg_not_deep_scrubbed_ratio = 1

since the default would start warning with a whole week left to scrub. But the messages persist. The cluster has been running for a month with these settings. Here is an example of the output. As you can see, some of these are not even two weeks old, nowhere close to the 75% of 4 weeks.

    pg 6.1f49 not deep-scrubbed since 2019-11-09 23:04:55.370373
    pg 6.1f47 not deep-scrubbed since 2019-11-18 16:10:52.561204
    pg 6.1f44 not deep-scrubbed since 2019-11-18 15:48:16.825569
    pg 6.1f36 not deep-scrubbed since 2019-11-20 05:39:00.309340
    pg 6.1f31 not deep-scrubbed since 2019-11-27 02:48:45.347680
    pg 6.1f30 not deep-scrubbed since 2019-11-11 21:34:15.795622
    pg 6.1f2d not deep-scrubbed since 2019-11-24 11:37:39.502829
    pg 6.1f27 not deep-scrubbed since 2019-11-25 07:38:58.689315
    pg 6.1f25 not deep-scrubbed since 2019-11-20 00:13:43.048569
    pg 6.1f1a not deep-scrubbed since 2019-11-09 15:08:43.51
    pg 6.1f19 not deep-scrubbed since 2019-11-25 10:24:47.884332
    1468 more pgs...
    Mon Dec 9 08:12:01 PST 2019

There is very little data on the cluster, so it's not a problem of deep-scrubs taking too long:

$ ceph df
RAW STORAGE:
    CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
    hdd    6.3 PiB  6.1 PiB  153 TiB  154 TiB   2.39
    nvme   5.8 TiB  5.6 TiB  138 GiB  197 GiB   3.33
    TOTAL  6.3 PiB  6.2 PiB  154 TiB  154 TiB   2.39

POOLS:
    POOL                        ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
    .rgw.root                   1   3.0 KiB  7        3.0 KiB  0      1.8 PiB
    default.rgw.control         2   0 B      8        0 B      0      1.8 PiB
    default.rgw.meta            3   7.4 KiB  24       7.4 KiB  0      1.8 PiB
    default.rgw.log             4   11 GiB   341      11 GiB   0      1.8 PiB
    default.rgw.buckets.data    6   100 TiB  41.84M   100 TiB  1.82   4.2 PiB
    default.rgw.buckets.index   7   33 GiB   574      33 GiB   0      1.8 PiB
    default.rgw.buckets.non-ec  8   8.1 MiB  22       8.1 MiB  0      1.8 PiB

Please help me figure out what I'm doing wrong with these settings.

Thanks,
Robert LeBlanc
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
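In case it helps with debugging: the quickest way to see which daemon is still running with the default interval is to ask each one over its admin socket (daemon names below are examples). Per the follow-up above, the mgr is the one that actually generates this warning:

    ceph daemon osd.0 config get osd_deep_scrub_interval
    ceph daemon mgr.$(hostname -s) config get osd_deep_scrub_interval
    ceph daemon mgr.$(hostname -s) config get mon_warn_pg_not_deep_scrubbed_ratio

With the option only in the [osd] section, the OSDs report the new value while the mgr still reports the default, which matches the behaviour described here.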
[ceph-users] Cephfs metadata fix tool
Our Jewel cluster is exhibiting some similar issues to the one in this thread [0] and it was indicated that a tool would need to be written to fix that kind of corruption. Has the tool been written? How would I go about repairing these 16EB directories that won't delete?

Thank you,
Robert LeBlanc

[0] https://www.spinics.net/lists/ceph-users/msg31598.html

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RGW performance with low object sizes
On Tue, Dec 3, 2019 at 9:11 AM Ed Fisher wrote: > > > On Dec 3, 2019, at 10:28 AM, Robert LeBlanc wrote: > > Did you make progress on this? We have a ton of < 64K objects as well and > are struggling to get good performance out of our RGW. Sometimes we have > RGW instances that are just gobbling up CPU even when there are no requests > to them, so it seems like things are getting hung up somewhere. There is > nothing in the logs and I haven't had time to do more troubleshooting. > > > There's a bug in the current stable Nautilus release that causes a loop > and/or crash in get_obj_data::flush (you should be able to see it gobbling > up CPU in perf top). This is the related issue: > https://tracker.ceph.com/issues/39660 -- it should be fixed as soon as > 14.2.5 is released (any day now, supposedly). > We will try out the new version when it's released and see if it improves things for us. Thanks, Robert LeBlanc Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
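A quick way to check whether we're hitting that same bug before upgrading is to profile the busy radosgw process and look for get_obj_data::flush near the top, as Ed suggests. Something like (assuming perf is installed and the process is named radosgw):

    perf top -p $(pidof radosgw)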
Re: [ceph-users] RGW performance with low object sizes
[quoted benchmark table trimmed: the 8-thread run showed ~196.3 MB/s with per-request latency percentiles (min through p99/max) in the 1-5 ms range] > [...section CLEANUP was deleted...] >

Did you make progress on this? We have a ton of < 64K objects as well and are struggling to get good performance out of our RGW. Sometimes we have RGW instances that are just gobbling up CPU even when there are no requests to them, so it seems like things are getting hung up somewhere. There is nothing in the logs and I haven't had time to do more troubleshooting.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Revert a CephFS snapshot?
On Thu, Nov 14, 2019 at 11:48 AM Sage Weil wrote: > On Thu, 14 Nov 2019, Patrick Donnelly wrote: > > On Wed, Nov 13, 2019 at 6:36 PM Jerry Lee > wrote: > > > > > > On Thu, 14 Nov 2019 at 07:07, Patrick Donnelly > wrote: > > > > > > > > On Wed, Nov 13, 2019 at 2:30 AM Jerry Lee > wrote: > > > > > Recently, I'm evaluating the snpahsot feature of CephFS from kernel > > > > > client and everthing works like a charm. But, it seems that > reverting > > > > > a snapshot is not available currently. Is there some reason or > > > > > technical limitation that the feature is not provided? Any > insights > > > > > or ideas are appreciated. > > > > > > > > Please provide more information about what you tried to do (commands > > > > run) and how it surprised you. > > > > > > The thing I would like to do is to rollback a snapped directory to a > > > previous version of snapshot. It looks like the operation can be done > > > by over-writting all the current version of files/directories from a > > > previous snapshot via cp. But cp may take lots of time when there are > > > many files and directories in the target directory. Is there any > > > possibility to achieve the goal much faster from the CephFS internal > > > via command like "ceph fs snap rollback > > > " (just a example)? Thank you! > > > > RADOS doesn't support rollback of snapshots so it needs to be done > > manually. The best tool to do this would probably be rsync of the > > .snap directory with appropriate options including deletion of files > > that do not exist in the source (snapshot). > > rsync is the best bet now, yeah. > > RADOS does have a rollback operation that uses clone where it can, but > it's a per-object operation, so something still needs to walk the > hierarchy and roll back each file's content. The MDS could do this more > efficiently than rsync give what it knows about the snapped inodes > (skipping untouched inodes or, eventually, entire subtrees) but it's a > non-trivial amount of work to implement. > > Would it make sense to extend CephFS to leverage reflinks for cases like this? That could be faster than rsync and more space efficient. It would require some development time though. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
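For reference, the manual rollback Patrick and Sage describe comes down to something like this (paths and snapshot name are made up for the example; --delete removes files created after the snapshot was taken, and a dry run with -n first is probably wise):

    rsync -a --delete /mnt/cephfs/mydir/.snap/mysnap/ /mnt/cephfs/mydir/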
Re: [ceph-users] Decreasing the impact of reweighting osds
You can try adding

    osd op queue = wpq
    osd op queue cut off = high

to all the OSD ceph configs and restarting. That has made reweighting pretty painless for us.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Tue, Oct 22, 2019 at 8:36 PM David Turner wrote: > > Most times you are better served with simpler settings like > osd_recovery_sleep, which has 3 variants if you have multiple types of OSDs > in your cluster (osd_recovery_sleep_hdd, osd_recovery_sleep_sdd, > osd_recovery_sleep_hybrid). Using those you can tweak a specific type of OSD > that might be having problems during recovery/backfill while allowing the > others to continue to backfill at regular speeds. > > Additionally you mentioned reweighting OSDs, but it sounded like you do this > manually. The balancer module, especially in upmap mode, can be configured > quite well to minimize client IO impact while balancing. You can specify > times of day that it can move data (only in UTC, it ignores local timezones), > a threshold of misplaced data that it will stop moving PGs at, the increment > size it will change weights with per operation, how many weights it will > adjust with each pass, etc. > > On Tue, Oct 22, 2019, 6:07 PM Mark Kirkwood > wrote: >> >> Thanks - that's a good suggestion! >> >> However I'd still like to know the answers to my 2 questions. >> >> regards >> >> Mark >> >> On 22/10/19 11:22 pm, Paul Emmerich wrote: >> > getting rid of filestore solves most latency spike issues during >> > recovery because they are often caused by random XFS hangs (splitting >> > dirs or just xfs having a bad day) >> > >> > >> > Paul >> > >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
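For anyone wanting the exact file change: the two options go under the [osd] (or [global]) section of /etc/ceph/ceph.conf on each OSD host,

    [osd]
    osd op queue = wpq
    osd op queue cut off = high

followed by restarting the OSDs, e.g. `systemctl restart ceph-osd.target` one host at a time (the systemd target name assumes a standard package install).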
Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery
On Thu, Oct 17, 2019 at 12:35 PM huxia...@horebdata.cn wrote:
>
> hello, Robert
>
> thanks for the quick reply. I did test with
>   osd op queue = wpq
>   osd op queue cut off = high
> and
>   osd_recovery_op_priority = 1
>   osd recovery delay start = 20
>   osd recovery max active = 1
>   osd recovery max chunk = 1048576
>   osd recovery sleep = 1
>   osd recovery sleep hdd = 1
>   osd recovery sleep ssd = 1
>   osd recovery sleep hybrid = 1
>   osd recovery priority = 1
>   osd max backfills = 1
>   osd backfill scan max = 16
>   osd backfill scan min = 4
>   osd_op_thread_suicide_timeout = 300
>
> But still the ceph cluster showed extremely huge recovery activity during the beginning of the recovery, and after ca. 5-10 minutes, the recovery gradually gets under control. I guess this is quite similar to what you encountered in Nov. 2015.
>
> It is really annoying, and what else can I do to mitigate this weird initial-recovery issue? Any suggestions are much appreciated.

Hmm, on our Luminous cluster, we have the defaults other than the op queue and cut off and bringing in a node is nearly zero impact for client traffic. Those would need to be set on all OSDs to be completely effective. Maybe go back to the defaults?

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
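One more knob that can help while the initial burst is happening: on Luminous the recovery throttles can also be adjusted (and implicitly verified) at runtime with injectargs instead of waiting for a restart. A sketch, with example values only:

    ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 1 --osd_recovery_max_active 1 --osd_max_backfills 1'

injectargs also reports when a given option is not observed at runtime, which is a useful sanity check that the settings are really in effect on every OSD.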
Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery
On Thu, Oct 17, 2019 at 12:08 PM huxia...@horebdata.cn wrote: > > I happened to find a note that you wrote in Nov 2015: > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-November/006173.html > and I believe this is what i just hit exactly the same behavior : a host down > will badly take the client performance down 1/10 (with 200MB/s recovery > workload) and then took ten minutes to get good control of OSD recovery. > > Could you please share how did you eventally solve that issue? by seting a > fair large OSD recovery delay start or any other parameter? Wow! Dusting off the cobwebs here. I think this is what lead me to dig into the code and write the WPQ scheduler. I can't remember doing anything specific. I'm sorry I'm not much help in this regard. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery
On Wed, Oct 16, 2019 at 11:53 AM huxia...@horebdata.cn wrote: > > My Ceph version is Luminuous 12.2.12. Do you think should i upgrade to Nautilus, or will Nautilus have a better control of recovery/backfilling?

We have a Jewel cluster and a Luminous cluster that we have changed these settings on and it really helped both of them.

----
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery
On Thu, Oct 10, 2019 at 2:23 PM huxia...@horebdata.cn wrote: > > Hi, folks, > > I have a middle-size Ceph cluster as cinder backup for openstack (queens). > Duing testing, one Ceph node went down unexpected and powered up again ca 10 > minutes later, Ceph cluster starts PG recovery. To my surprise, VM IOPS > drops dramatically during Ceph recovery, from ca. 13K IOPS to about 400, a > factor of 1/30, and I did put a stringent throttling on backfill and > recovery, with the following ceph parameters > > osd_max_backfills = 1 > osd_recovery_max_active = 1 > osd_client_op_priority=63 > osd_recovery_op_priority=1 > osd_recovery_sleep = 0.5 > > The most weird thing is, > 1) when there is no IO activity from any VM (ALL VMs are quiet except the > recovery IO), the recovery bandwidth is ca. 10MiB/s, 2 objects/s. Seems like > recovery throttle setting is working properly > 2) when using FIO testing inside a VM, the recovery bandwith is going up > quickly, reaching above 200MiB/s, 60 objects/s. FIO IOPS performance inside > VM, however, is only at 400 IOPS/s (8KiB block size), around 3MiB/s. Obvious > recovery throttling DOES NOT work properly > 3) If i stop the FIO testing in VM, the recovery bandwith then goes down to > 10MiB/s, 2 objects/s again, strange enough. > > How can this weird behavior happen? I just wonder, is there a method to > configure recovery bandwith to a specific value, or the number of recovery > objects per second? this may give better control of bakcfilling/recovery, > instead of the faulty logic or relative osd_client_op_priority vs > osd_recovery_op_priority. > > any ideas or suggests to make the recovery under control? > > best regards, > > Samuel Not sure which version of Ceph you are on, but add these to your /etc/ceph/ceph.conf on all your OSDs and restart them. osd op queue = wpq osd op queue cut off = high That should really help and make backfills and recovery be non-impactful. This will be the default in Octopus. -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Commit and Apply latency on nautilus
On Tue, Oct 1, 2019 at 7:54 AM Robert LeBlanc wrote: > > On Mon, Sep 30, 2019 at 5:12 PM Sasha Litvak > wrote: > > > > At this point, I ran out of ideas. I changed nr_requests and readahead > > parameters to 128->1024 and 128->4096, tuned nodes to > > performance-throughput. However, I still get high latency during benchmark > > testing. I attempted to disable cache on ssd > > > > for i in {a..f}; do hdparm -W 0 -A 0 /dev/sd$i; done > > > > and I think it make things not better at all. I have H740 and H730 > > controllers with drives in HBA mode. > > > > Other them converting them one by one to RAID0 I am not sure what else I > > can try. > > > > Any suggestions? > > If you haven't already tried this, add this to your ceph.conf and > restart your OSDs, this should help bring down the variance in latency > (It will be the default in Octopus): > > osd op queue = wpq > osd op queue cut off = high

I should clarify. This will reduce the variance in latency for client OPs. If this counter also includes recovery/backfill/deep_scrub OPs, then the latency can still be high, as these settings make recovery/backfill/deep_scrub less impactful to client I/O at the cost of them possibly being delayed a bit.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Commit and Apply latency on nautilus
On Mon, Sep 30, 2019 at 5:12 PM Sasha Litvak wrote: > > At this point, I ran out of ideas. I changed nr_requests and readahead > parameters to 128->1024 and 128->4096, tuned nodes to performance-throughput. > However, I still get high latency during benchmark testing. I attempted to > disable cache on ssd > > for i in {a..f}; do hdparm -W 0 -A 0 /dev/sd$i; done > > and I think it make things not better at all. I have H740 and H730 > controllers with drives in HBA mode. > > Other them converting them one by one to RAID0 I am not sure what else I can > try. > > Any suggestions? If you haven't already tried this, add this to your ceph.conf and restart your OSDs, this should help bring down the variance in latency (It will be the default in Octopus): osd op queue = wpq osd op queue cut off = high Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
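For completeness, the block-layer knobs mentioned above live in sysfs, so a change like 128->1024 and 128->4096 would look roughly like this per data disk (device names are examples, and the change does not survive a reboot without a udev rule or similar):

    for dev in sda sdb sdc sdd sde sdf; do
        echo 1024 > /sys/block/$dev/queue/nr_requests
        echo 4096 > /sys/block/$dev/queue/read_ahead_kb
    done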
Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage
On Tue, Sep 24, 2019 at 4:33 AM Thomas <74cmo...@gmail.com> wrote: > > Hi, > > I'm experiencing the same issue with this setting in ceph.conf: > osd op queue = wpq > osd op queue cut off = high > > Furthermore I cannot read any old data in the relevant pool that is > serving CephFS. > However, I can write new data and read this new data.

If you restarted all the OSDs with this setting, it won't necessarily prevent any blocked IO, it just really helps prevent the really long blocked IO and makes sure that IO is eventually done in a more fair manner. It sounds like you may have some MDS issues that are deeper than my understanding. First thing I'd try is to bounce the MDS service.

> > If I want to add this to my ceph-ansible playbook parameters, in which files
> > should I add it and what is the best way to do it?
> >
> > Add those 3 lines in all.yml or osds.yml?
> >
> > ceph_conf_overrides:
> >   global:
> >     osd_op_queue_cut_off: high
> >
> > Is there another (better?) way to do that?

I can't speak to either of those approaches. I wanted all my config in a single file, so I put it in my inventory file, but it looks like you have the right idea.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
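For the ceph-ansible part of the question, I would expect the override to need both options, e.g. something like this in group_vars (a sketch only; I haven't run this exact snippet through the playbook):

    ceph_conf_overrides:
      osd:
        osd_op_queue: wpq
        osd_op_queue_cut_off: high

Whether it lives in all.yml, osds.yml or the inventory file mostly comes down to how you prefer to organize your variables.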
Re: [ceph-users] hanging slow requests: failed to authpin, subtree is being exported
84:92684 lookup > > #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, > > caller_gid=0{0,}) currently failed to authpin, subtree is being exported > > 2019-09-23 12:07:40.621 7f4f401e8700 0 log_channel(cluster) log [WRN] > > : slow request 1923.409501 seconds old, received at 2019-09-23 > > 11:35:37.217113: client_request(client.38347357:111963 lookup > > #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, > > caller_gid=0{0,}) currently failed to authpin, subtree is being exported > > 2019-09-23 12:29:20.639 7f4f401e8700 0 log_channel(cluster) log [WRN] > > : slow request 3843.057602 seconds old, received at 2019-09-23 > > 11:25:17.598152: client_request(client.38352684:92684 lookup > > #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0, > > caller_gid=0{0,}) currently failed to authpin, subtree is being exported > > 2019-09-23 12:39:40.872 7f4f401e8700 0 log_channel(cluster) log [WRN] > > : slow request 3843.664914 seconds old, received at 2019-09-23 > > 11:35:37.217113: client_request(client.38347357:111963 lookup > > #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0, > > caller_gid=0{0,}) currently failed to authpin, subtree is being exported > If I try to ls this paths, the client will hang. > > I tried this using ceph kernel client of centos7.6 and now also with the > ceph-fuse of 14.2.3, I see the issue with both. I tried remounting, but > that did not solve the issue, if I restart the mds, the issue goes away > - for some time > > > > [root@mds02 ~]# ceph -s > > cluster: > > id: 92bfcf0a-1d39-43b3-b60f-44f01b630e47 > > health: HEALTH_WARN > > 1 MDSs report slow requests > > 2 MDSs behind on trimming > > > > services: > > mon: 3 daemons, quorum mds01,mds02,mds03 (age 4d) > > mgr: mds02(active, since 4d), standbys: mds01, mds03 > > mds: ceph_fs:2 {0=mds03=up:active,1=mds02=up:active} 1 up:standby > > osd: 535 osds: 535 up, 535 in > > > > data: > > pools: 3 pools, 3328 pgs > > objects: 375.14M objects, 672 TiB > > usage: 1.0 PiB used, 2.2 PiB / 3.2 PiB avail > > pgs: 3319 active+clean > > 9active+clean+scrubbing+deep > > > > io: > > client: 141 KiB/s rd, 54 MiB/s wr, 62 op/s rd, 577 op/s wr > > > > > [root@mds02 ~]# ceph health detail > > HEALTH_WARN 1 MDSs report slow requests; 2 MDSs behind on trimming > > MDS_SLOW_REQUEST 1 MDSs report slow requests > > mdsmds02(mds.1): 2 slow requests are blocked > 30 secs > > MDS_TRIM 2 MDSs behind on trimming > > mdsmds02(mds.1): Behind on trimming (3407/200) max_segments: 200, > > num_segments: 3407 > > mdsmds03(mds.0): Behind on trimming (4240/200) max_segments: 200, > > num_segments: 4240 > > Can someone help me to debug this further? What is the make up of your cluster? It sounds like it may be all HDD. If so, try adding this to /etc/ceph/ceph.conf on your OSDs and restart the processes. osd op queue cut off = high Depending on your version (default in newer versions), adding osd op queue = wpq can also help. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph; pg scrub errors
On Thu, Sep 19, 2019 at 4:34 AM M Ranga Swami Reddy wrote: > > Hi-Iam using ceph 12.2.11. here I am getting a few scrub errors. To fix these > scrub error I ran the "ceph pg repair ". > But scrub error not going and the repair is talking long time like 8-12 hours. Depending on the size of the PGs and how active the cluster is, it could take a long time as it takes another deep scrub to happen to clear the error status after a repair. Since it is not going away, either the problem is too complicated to automatically repair and needs to be done by hand, or the problem is repaired and when it deep-scrubs to check it, the problem has reappeared or another problem was found and the disk needs to be replaced. Try running: rados list-inconsistent-obj ${PG} --format=json and see what the exact problems are. -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
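To make that output easier to scan, something along these lines works (jq is assumed to be installed; the fields shown are what Luminous emits for list-inconsistent-obj):

    rados list-inconsistent-obj ${PG} --format=json-pretty

    # or just the object names, the object-level errors and the per-shard errors
    rados list-inconsistent-obj ${PG} --format=json | \
        jq '.inconsistents[] | {object: .object.name, errors, shards: [.shards[] | {osd, errors}]}'

A single bad shard (e.g. a read_error) is usually repairable automatically; mismatches that affect every shard tend to be the ones that need manual work or a disk replacement.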
Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage
blocked for > 62442.674748 secs > > 2019-09-19 08:52:53.960684 mds.icadmin006 [WRN] 10 slow requests, 2 > > included below; oldest blocked for > 62447.674825 secs > > 2019-09-19 08:52:53.960692 mds.icadmin006 [WRN] slow request 61441.895507 > > seconds old, received at 2019-09-18 17:48:52.065114: rejoin:mds.1:13 > > currently dispatched > > 2019-09-19 08:52:53.960697 mds.icadmin006 [WRN] slow request 61441.895489 > > seconds old, received at 2019-09-18 17:48:52.065131: rejoin:mds.1:14 > > currently dispatched > > 2019-09-19 08:52:57.527852 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62451.242174 secs > > 2019-09-19 08:53:02.527972 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62456.242289 secs > > 2019-09-19 08:52:58.960777 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62452.674936 secs > > 2019-09-19 08:53:03.960853 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62457.675011 secs > > 2019-09-19 08:53:07.528033 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62461.242354 secs > > 2019-09-19 08:53:12.528177 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62466.242487 secs > > 2019-09-19 08:53:08.960965 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62462.675123 secs > > 2019-09-19 08:53:13.961034 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62467.675195 secs > > 2019-09-19 08:53:17.528276 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62471.242592 secs > > 2019-09-19 08:53:22.528407 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62476.242729 secs > > 2019-09-19 08:53:18.961149 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62472.675310 secs > > 2019-09-19 08:53:23.961234 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62477.675392 secs > > 2019-09-19 08:53:27.528509 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62481.242832 secs > > 2019-09-19 08:53:32.528651 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62486.242961 secs > > 2019-09-19 08:53:28.961314 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62482.675471 secs > > 2019-09-19 08:53:33.961393 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62487.675549 secs > > 2019-09-19 08:53:37.528706 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62491.243031 secs > > 2019-09-19 08:53:42.528790 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62496.243105 secs > > 2019-09-19 08:53:38.961476 mds.icadmin006 [WRN] 10 slow requests, 1 > > included below; oldest blocked for > 62492.675617 secs > > 2019-09-19 08:53:38.961485 mds.icadmin006 [WRN] slow request 61441.151061 > > seconds old, received at 2019-09-18 17:49:37.810351: > > client_request(client.21441:176429 getattr pAsLsXsFs #0x1f2b1b3 > > 2019-09-18 17:49:37.806002 caller_uid=204878, caller_gid=11233{}) currently > > failed to rdlock, waiting > > 2019-09-19 08:53:43.961569 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62497.675728 secs > > 2019-09-19 08:53:47.528891 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 
62501.243214 secs > > 2019-09-19 08:53:52.529021 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62506.243337 secs > > 2019-09-19 08:53:48.961685 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62502.675839 secs > > 2019-09-19 08:53:53.961792 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62507.675948 secs > > 2019-09-19 08:53:57.529113 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62511.243437 secs > > 2019-09-19 08:54:02.529224 mds.icadmin007 [WRN] 3 slow requests, 0 included > > below; oldest blocked for > 62516.243546 secs > > 2019-09-19 08:53:58.961866 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62512.676025 secs > > 2019-09-19 08:54:03.961939 mds.icadmin006 [WRN] 10 slow requests, 0 > > included below; oldest blocked for > 62517.676099 secs > > Thanks for your help. If you haven't set: osd op queue cut off = high in /etc/ceph/ceph.conf on your OSDs, I'd give that a try. It should help quite a bit with pure HDD clusters. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Failure to start ceph-mon in docker
Frank, Thank you for the explanation, these are freshly installed machines and did not have ceph on them. I checked one of the other OSD nodes and there is no ceph user in /etc/passwd, nor is UID 167 allocated to any user. I did install ceph-common from the 18.04 repos before realizing that deploying ceph in containers did not update the host's /etc/apt/sources.list (or add an entry in /etc/apt/sources.list.d/). I manually added the repo for nautilus and upgraded the packages. So, I don't know if that had anything to do with it. Maybe Ubuntu packages ceph under UID 64045 and upgrading to the Ceph distributed packages didn't change the UID. Thanks, Robert LeBlanc -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Aug 29, 2019 at 12:33 AM Frank Schilder wrote: > Hi Robert, > > this is a bit less trivial than it might look right now. The ceph user is > usually created by installing the package ceph-common. By default it will > use id 167. If the ceph user already exists, I would assume it will use the > existing user to allow an operator to avoid UID collisions (if 167 is used > already). > > If you use docker, the ceph UID on the host and inside the container > should match (or need to be translated). If they don't, you will have a lot > of fun re-owning stuff all the time, because deployments will use the > symbolic name ceph, which has different UIDs on the host and inside the > container in your case. > > I would recommend removing this discrepancy as soon as possible: > > 1) Find out why there was a ceph user with UID different from 167 before > installation of ceph-common. >Did you create it by hand? Was UID 167 allocated already? > 2) If you can safely change the GID and UID of ceph to 167, just do > groupmod+usermod with new GID and UID. > 3) If 167 is used already by another service, you will have to map the > UIDs between host and container. > > To prevent ansible from deploying dockerized ceph with mismatching user ID > for ceph, add these tasks to an appropriate part of your deployment > (general host preparation or so): > > - name: "Create group 'ceph'." > group: > name: ceph > gid: 167 > local: yes > state: present > system: yes > > - name: "Create user 'ceph'." > user: > name: ceph > password: "!" > comment: "ceph-container daemons" > uid: 167 > group: ceph > shell: "/sbin/nologin" > home: "/var/lib/ceph" > create_home: no > local: yes > state: present > system: yes > > This should err if a group and user ceph already exist with IDs different > from 167. > > Best regards, > > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: ceph-users on behalf of Robert > LeBlanc > Sent: 28 August 2019 23:23:06 > To: ceph-users > Subject: Re: [ceph-users] Failure to start ceph-mon in docker > > Turns out /var/lib/ceph was ceph.ceph and not 167.167, chowning it made > things work. I guess only monitor needs that permission, rgw,mgr,osd are > all happy without needing it to be 167.167. > > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > > On Wed, Aug 28, 2019 at 1:45 PM Robert LeBlanc <mailto:rob...@leblancnet.us>> wrote: > We are trying to set up a new Nautilus cluster using ceph-ansible with > containers. We got things deployed, but I couldn't run `ceph s` on the host > so decided to `apt install ceph-common and installed the Luminous version > from Ubuntu 18.04. For some reason the docker container that was running > the monitor restarted and won't restart. 
I added the repo for Nautilus and > upgraded ceph-common, but the problem persists. The Manager and OSD docker > containers don't seem to be affected at all. I see this in the journal: > > Aug 28 20:40:55 sun-gcs02-osd01 systemd[1]: Starting Ceph Monitor... > Aug 28 20:40:55 sun-gcs02-osd01 docker[2926]: Error: No such container: > ceph-mon-sun-gcs02-osd01 > Aug 28 20:40:55 sun-gcs02-osd01 systemd[1]: Started Ceph Monitor. > Aug 28 20:40:55 sun-gcs02-osd01 docker[2949]: WARNING: Your kernel does > not support swap limit capabilities or the cgroup is not mounted. Memory > limited without swap. > Aug 28 20:40:56 sun-gcs02-osd01 docker[2949]: 2019-08-28 20:40:56 > /opt/ceph-container/bin/entrypoint.sh: Existing mon, trying to rejoin > cluster... > Aug 28 20:40:56 sun-gcs02-osd01 docker[2949]: warning: line 41: > 'osd_memory_target' in section 'osd' redefined >
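For the record, Frank's option 2 (making the host match the container's 167:167) comes down to something like this on each affected host, with the Ceph containers stopped first (a sketch; double-check that UID/GID 167 is not already taken by something else):

    # stop the ceph containers/daemons on this host first
    groupmod -g 167 ceph
    usermod -u 167 -g 167 ceph
    chown -R 167:167 /var/lib/ceph /var/log/ceph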
[ceph-users] Specify OSD size and OSD journal size with ceph-ansible
I have a new cluster and I'd like to put the DB on the NVMe device, but only make it 30GB, then use 100GB of the rest of the NVMe as an OSD for the RGW metadata pool. I set up the disks like the conf below without the block_db_size and it created all the LVs on the HDDs and one LV on the NVMe that took up all the space. I've tried using block_db_size in vars, and also as a property in the list for each OSD disk but neither work. With block_db_size in the vars I get: failed: [sun-gcs02-osd01] (item={u'db': u'/dev/nvme0n1', u'data': u'/dev/sda', u'crush_device_class': u'hdd'}) => changed=true ansible_loop_var: item cmd: - docker - run - --rm - --privileged - --net=host - --ipc=host - --ulimit - nofile=1024:1024 - -v - /run/lock/lvm:/run/lock/lvm:z - -v - /var/run/udev/:/var/run/udev/:z - -v - /dev:/dev - -v - /etc/ceph:/etc/ceph:z - -v - /run/lvm/:/run/lvm/ - -v - /var/lib/ceph/:/var/lib/ceph/:z - -v - /var/log/ceph/:/var/log/ceph/:z - --entrypoint=ceph-volume - docker.io/ceph/daemon:latest - --cluster - ceph - lvm - prepare - --bluestore - --data - /dev/sda - --block.db - /dev/nvme0n1 - --crush-device-class - hdd delta: '0:00:05.004777' end: '2019-08-28 23:26:39.074850' item: crush_device_class: hdd data: /dev/sda db: /dev/nvme0n1 msg: non-zero return code rc: 1 start: '2019-08-28 23:26:34.070073' stderr: '--> RuntimeError: unable to use device' stderr_lines: stdout: |- Running command: /bin/ceph-authtool --gen-print-key Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new bcc7b3c3-6203-47c7-9f34-7b2e2060bf59 Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-76cd6a80-17dd-4a89-a35b-0844026bc9d4 /dev/sda stdout: Physical volume "/dev/sda" successfully created. stdout: Volume group "ceph-76cd6a80-17dd-4a89-a35b-0844026bc9d4" successfully created Running command: /usr/sbin/lvcreate --yes -l 100%FREE -n osd-block-bcc7b3c3-6203-47c7-9f34-7b2e2060bf59 ceph-76cd6a80-17dd-4a89-a35b-0844026bc9d4 stdout: Logical volume "osd-block-bcc7b3c3-6203-47c7-9f34-7b2e2060bf59" created. --> blkid could not detect a PARTUUID for device: /dev/nvme0n1 --> Was unable to complete a new OSD, will rollback changes Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.21 --yes-i-really-mean-it stderr: purged osd.21 stdout_lines: ... (One for each device) ... And the LVs are created for all the HDD OSDs and none on the NVMe. Looking through the code I don't see a way to set a size for the OSD, but maybe I'm just missing it as I'm really new to Ansible. 
osds:
  hosts:
    sun-gcs02-osd[01:43]:
    sun-gcs02-osd[45:60]:
  vars:
    block_db_size: 32212254720
    lvm_volumes:
      - data: '/dev/sda'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/sdb'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/sdc'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/sdd'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/sde'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/sdf'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/sdg'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/sdh'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/sdi'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/sdj'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/sdk'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/sdl'
        db: '/dev/nvme0n1'
        crush_device_class: 'hdd'
      - data: '/dev/nvme0n1'  # Use the rest for metadata
        crush_device_class: 'nvme'

With block_db_size set for each disk, I got an error during the parameter checking phase in Ansible and no LVs were created. Please help me understand how to configure what I would like to do.

Thank you,
Robert LeBlanc

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
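In case it helps to see the intent spelled out, the layout I'm after can at least be expressed by hand with LVM plus ceph-volume, which is roughly what I'll fall back to if the playbook can't do it (a sketch for one HDD plus the NVMe leftovers; VG/LV names are made up, and the sizes match the 30GB/100GB mentioned above):

    # carve up the NVMe by hand: one 30G DB LV per HDD, 100G for an NVMe OSD
    vgcreate vg_nvme /dev/nvme0n1
    lvcreate -L 30G  -n db-sda   vg_nvme
    lvcreate -L 100G -n osd-meta vg_nvme

    # HDD OSD with its DB on the NVMe LV
    ceph-volume lvm prepare --bluestore --data /dev/sda \
        --block.db vg_nvme/db-sda --crush-device-class hdd

    # small NVMe OSD for the RGW metadata pools
    ceph-volume lvm prepare --bluestore --data vg_nvme/osd-meta \
        --crush-device-class nvme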
Re: [ceph-users] Failure to start ceph-mon in docker
Turns out /var/lib/ceph was ceph.ceph and not 167.167, chowning it made things work. I guess only monitor needs that permission, rgw,mgr,osd are all happy without needing it to be 167.167. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, Aug 28, 2019 at 1:45 PM Robert LeBlanc wrote: > We are trying to set up a new Nautilus cluster using ceph-ansible with > containers. We got things deployed, but I couldn't run `ceph s` on the host > so decided to `apt install ceph-common and installed the Luminous version > from Ubuntu 18.04. For some reason the docker container that was running > the monitor restarted and won't restart. I added the repo for Nautilus and > upgraded ceph-common, but the problem persists. The Manager and OSD docker > containers don't seem to be affected at all. I see this in the journal: > > Aug 28 20:40:55 sun-gcs02-osd01 systemd[1]: Starting Ceph Monitor... > Aug 28 20:40:55 sun-gcs02-osd01 docker[2926]: Error: No such container: > ceph-mon-sun-gcs02-osd01 > Aug 28 20:40:55 sun-gcs02-osd01 systemd[1]: Started Ceph Monitor. > Aug 28 20:40:55 sun-gcs02-osd01 docker[2949]: WARNING: Your kernel does > not support swap limit capabilities or the cgroup is not mounted. Memory > limited without swap. > Aug 28 20:40:56 sun-gcs02-osd01 docker[2949]: 2019-08-28 20:40:56 > /opt/ceph-container/bin/entrypoint.sh: Existing mon, trying to rejoin > cluster... > Aug 28 20:40:56 sun-gcs02-osd01 docker[2949]: warning: line 41: > 'osd_memory_target' in section 'osd' redefined > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: 2019-08-28 20:41:03 > /opt/ceph-container/bin/entrypoint.sh: /etc/ceph/ceph.conf is already > memory tuned > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: 2019-08-28 20:41:03 > /opt/ceph-container/bin/entrypoint.sh: SUCCESS > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: exec: PID 368: spawning > /usr/bin/ceph-mon --cluster ceph --default-log-to-file=false > --default-mon-cluster-log-to-file=false --setuser ceph --setgroup ceph -d > --mon-cluster-log-to-stderr --log-stderr-prefix=debug -i sun-gcs02-osd01 > --mon-data /var/lib/ceph/mon/ceph-sun-gcs02-osd01 --public-addr > 10.65.101.21 > > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: exec: Waiting 368 to quit > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: warning: line 41: > 'osd_memory_target' in section 'osd' redefined > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: debug 2019-08-28 > 20:41:03.835 7f401283c180 0 set uid:gid to 167:167 (ceph:ceph) > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: debug 2019-08-28 > 20:41:03.835 7f401283c180 0 ceph version 14.2.2 > (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable), process > ceph-mon, pid 368 > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: debug 2019-08-28 > 20:41:03.835 7f401283c180 -1 stat(/var/lib/ceph/mon/ceph-sun-gcs02-osd01) > (13) Permission denied > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: debug 2019-08-28 > 20:41:03.835 7f401283c180 -1 error accessing monitor data directory at > '/var/lib/ceph/mon/ceph-sun-gcs02-osd01': (13) Permission denied > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: teardown: managing teardown > after SIGCHLD > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: teardown: Waiting PID 368 to > terminate > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: teardown: Process 368 is > terminated > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: teardown: Bye Bye, container > will die with return code -1 > Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: teardown: if you don't want > me to die and have access 
to a shell to debug this situation, next time run > me with '-e DEBUG=stayalive' > Aug 28 20:41:04 sun-gcs02-osd01 systemd[1]: > ceph-mon@sun-gcs02-osd01.service: Main process exited, code=exited, > status=255/n/a > Aug 28 20:41:04 sun-gcs02-osd01 systemd[1]: > ceph-mon@sun-gcs02-osd01.service: Failed with result 'exit-code'. > > The directories for the monitor are owned by 167.167 and matches the > UID.GID that the container reports. > > oot@sun-gcs02-osd01:~# ls -lhd /var/lib/ceph/ > drwxr-x--- 14 ceph ceph 4.0K Jul 30 22:15 /var/lib/ceph/ > root@sun-gcs02-osd01:~# ls -lh /var/lib/ceph/ > total 56K > drwxr-xr-x 2 167 167 4.0K Jul 30 22:16 bootstrap-mds > drwxr-xr-x 2 167 167 4.0K Jul 30 22:16 bootstrap-mgr > drwxr-xr-x 2 167 167 4.0K Jul 30 22:16 bootstrap-osd > drwxr-xr-x 2 167 167 4.0K Jul 30 22:16 bootstrap-rbd > drwxr-xr-x 2 167 167 4.0K Jul 30 22:16 bootstrap-rbd-mirror > drwxr-xr-x 2 167 167 4.0K Jul 30 22:16 bootstrap-rgw > drwxr-xr-x 3 167 167 4.0K Jul 30 22:15 mds > drwxr-xr-x 3 167 167 4.0K Jul 30 22:15 mgr > drwxr-xr-x 3 167 167 4.0K Jul 3
[ceph-users] Failure to start ceph-mon in docker
t -rw-r--r-- 1 167 167 45M Aug 28 19:16 050228.sst -rw-r--r-- 1 167 167 16 Aug 16 07:40 CURRENT -rw-r--r-- 1 167 167 37 Jul 30 22:15 IDENTITY -rw-r--r-- 1 167 1670 Jul 30 22:15 LOCK -rw-r--r-- 1 167 167 1.3M Aug 28 19:16 MANIFEST-027846 -rw-r--r-- 1 167 167 4.7K Aug 1 23:38 OPTIONS-002825 -rw-r--r-- 1 167 167 4.7K Aug 16 07:40 OPTIONS-027849 Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] MDSs report damaged metadata
We just had metadata damage show up on our Jewel cluster. I tried a few things like renaming directories and scanning, but the damage would just show up again in less than 24 hours. I finally just copied the directories with the damage to a tmp location on CephFS, then swapped it with the damaged one. When I deleted the directories with the damage the active MDS crashed, but the replay took over just fine. I haven't had the messages now for almost a week. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, Aug 19, 2019 at 10:30 PM Lars Täuber wrote: > Hi there! > > Does anyone else have an idea what I could do to get rid of this error? > > BTW: it is the third time that the pg 20.0 is gone inconsistent. > This is a pg from the metadata pool (cephfs). > May this be related anyhow? > > # ceph health detail > HEALTH_ERR 1 MDSs report damaged metadata; 1 scrub errors; Possible data > damage: 1 pg inconsistent > MDS_DAMAGE 1 MDSs report damaged metadata > mdsmds3(mds.0): Metadata damage detected > OSD_SCRUB_ERRORS 1 scrub errors > PG_DAMAGED Possible data damage: 1 pg inconsistent > pg 20.0 is active+clean+inconsistent, acting [9,27,15] > > > Best regards, > Lars > > > Mon, 19 Aug 2019 13:51:59 +0200 > Lars Täuber ==> Paul Emmerich : > > Hi Paul, > > > > thanks for the hint. > > > > I did a recursive scrub from "/". The log says there where some inodes > with bad backtraces repaired. But the error remains. > > May this have something to do with a deleted file? Or a file within a > snapshot? > > > > The path told by > > > > # ceph tell mds.mds3 damage ls > > 2019-08-19 13:43:04.608 7f563f7f6700 0 client.894552 ms_handle_reset on > v2:192.168.16.23:6800/176704036 > > 2019-08-19 13:43:04.624 7f56407f8700 0 client.894558 ms_handle_reset on > v2:192.168.16.23:6800/176704036 > > [ > > { > > "damage_type": "backtrace", > > "id": 3760765989, > > "ino": 1099518115802, > > "path": "~mds0/stray7/15161f7/dovecot.index.backup" > > } > > ] > > > > starts a bit strange to me. > > > > Are the snapshots also repaired with a recursive repair operation? > > > > Thanks > > Lars > > > > > > Mon, 19 Aug 2019 13:30:53 +0200 > > Paul Emmerich ==> Lars Täuber > : > > > Hi, > > > > > > that error just says that the path is wrong. I unfortunately don't > > > know the correct way to instruct it to scrub a stray path off the top > > > of my head; you can always run a recursive scrub on / to go over > > > everything, though > > > > > > > > > Paul > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
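For reference, the copy-and-swap workaround described above was essentially this (paths are illustrative, and ideally clients are quiesced while the directories are swapped):

    cp -a /mnt/cephfs/data/damaged_dir /mnt/cephfs/data/damaged_dir.new
    mv /mnt/cephfs/data/damaged_dir /mnt/cephfs/data/damaged_dir.old
    mv /mnt/cephfs/data/damaged_dir.new /mnt/cephfs/data/damaged_dir
    rm -rf /mnt/cephfs/data/damaged_dir.old   # this delete is the step that crashed our active MDS

The standby took over cleanly after that crash, and the damage messages have not come back since.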
Re: [ceph-users] How does CephFS find a file?
I'm fairly new to CephFS, but in my poking around with it, this is what I understand. The MDS manages dentries as omap (simple key/value database) entries in the metadata pool. Each dentry keeps a list of filenames and some metadata about each file, such as the inode number and, I presume, some other info such as size (I can't find documentation outlining the binary format of the omap; I just did enough digging to find the inode location). The MDS can return the inode and size to the client, and the client looks up the OSDs for the inode using the CRUSH map, dividing the size by the stripe size to know how many objects to fetch for the whole file. The file is stored as objects named by the inode (in hex) with the object offset appended. The inode is the same value shown by `ls -li` in CephFS, converted to hex. I hope that is correct and useful as a starting point for you. -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, Aug 19, 2019 at 2:37 AM aot...@outlook.com wrote: > I am a student new to cephfs. I think there are 2 steps to finding a file: > > 1.Find out which objects belong to this file. > > 2.Use CRUSH to find out OSDs. > > > > What I don’t know is how does CephFS get the object list of the file. Does > MDS save all object list of all files? Or CRUSH can use some > information(what information?) to calculate the list of objects? In other > words, where is the object list of the file saved? > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
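To make the naming concrete, here is a rough sketch of how the RADOS object names for a CephFS file can be derived from its inode number and size. It assumes the default layout (4 MiB object size, no custom striping); the mount point and data pool name are placeholders:

  # List the RADOS objects backing a CephFS file (default layout assumed).
  f=/mnt/cephfs/some/file                 # placeholder path
  ino=$(stat -c %i "$f")                  # decimal inode, same value `ls -li` shows
  size=$(stat -c %s "$f")
  obj_size=$((4 * 1024 * 1024))           # default 4 MiB object size
  count=$(( (size + obj_size - 1) / obj_size ))
  for ((i = 0; i < count; i++)); do
      printf '%x.%08x\n' "$ino" "$i"      # e.g. 10000000abc.00000000
  done
  # Any of those names can then be mapped to OSDs via CRUSH, e.g.:
  # ceph osd map cephfs_data "$(printf '%x.%08x' "$ino" 0)"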
Re: [ceph-users] New CRUSH device class questions
On Wed, Aug 7, 2019 at 7:05 AM Paul Emmerich wrote: > ~ is the internal implementation of device classes. Internally it's > still using separate roots, that's how it stays compatible with older > clients that don't know about device classes. > That makes sense. > And since it wasn't mentioned here yet: consider upgrading to Nautilus > to benefit from the new and improved accounting for metadata space. > You'll be able to see how much space is used for metadata and quotas > should work properly for metadata usage. > I think I'm not explaining this well and it is confusing people. I don't want to limit the size of the metadata pool, and I also don't want to limit the size of the data pool, as the cluster's flexibility could cause the quota to be out of date at any time and probably useless (since we want to use as much space as possible for data). I would like to reserve space for the metadata pool so that no other pool can touch it, much like when you thick provision a VM disk file. It is guaranteed for that entity and no one else can use it, even if it is mostly empty. So far people have only told me how to limit the space of a pool, which is not what I'm looking for. Thank you, Robert LeBlanc Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Replay MDS server stuck
We had an outage of our Jewel 10.2.11 CephFS last night. Our primary MDS hit an assert in ceph try_remove_dentries_for_stray(), but the replay MDS never came up. The logs for MDS02 show: ---like clockwork these first two lines appear every second--- 2019-08-02 16:27:24.664508 7f6f47f5c700 1 mds.0.0 standby_replay_restart (as standby) 2019-08-02 16:27:24.689079 7f6f43341700 1 mds.0.0 replay_done (as standby) 2019-08-02 16:27:25.689210 7f6f47f5c700 1 mds.0.0 standby_replay_restart (as standby) 2019-08-09 00:01:25.298131 7f6f4a862700 1 mds.0.10 handle_mds_map i am now mds.0.10 2019-08-09 00:01:25.298135 7f6f4a862700 1 mds.0.10 handle_mds_map state change up:standby-replay --> up:replay 2019-08-09 00:43:35.382921 7f6f46f5a700 -1 mds.sun-gcs01-mds02 *** got signal Terminated *** 2019-08-09 00:43:35.382952 7f6f46f5a700 1 mds.sun-gcs01-mds02 suicide. wanted state up:replay 2019-08-09 00:43:35.409663 7f6f46f5a700 1 mds.0.10 shutdown: shutting down rank 0 2019-08-09 00:43:35.414729 7f6f43341700 0 mds.0.log _replay journaler got error -11, aborting 2019-08-09 00:48:36.819539 7fe6e89e8200 0 set uid:gid to 1001:1002 (ceph:ceph) 2019-08-09 00:48:36.819555 7fe6e89e8200 0 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e), process ceph-mds, pid 39603 2019-08-09 00:48:36.820045 7fe6e89e8200 0 pidfile_write: ignore empty --pid-file 2019-08-09 00:48:37.813857 7fe6e2088700 1 mds.sun-gcs01-mds02 handle_mds_map standby 2019-08-09 00:48:37.833089 7fe6e2088700 1 mds.0.0 handle_mds_map i am now mds.19317235.0 replaying mds.0.0 2019-08-09 00:48:37.833097 7fe6e2088700 1 mds.0.0 handle_mds_map state change up:boot --> up:standby-replay 2019-08-09 00:48:37.833106 7fe6e2088700 1 mds.0.0 replay_start 2019-08-09 00:48:37.833111 7fe6e2088700 1 mds.0.0 recovery set is 2019-08-09 00:48:37.849332 7fe6dd77e700 0 mds.0.cache creating system inode with ino:100 2019-08-09 00:48:37.849627 7fe6dd77e700 0 mds.0.cache creating system inode with ino:1 2019-08-09 00:48:40.548094 7fe6dab67700 0 log_channel(cluster) log [WRN] : replayed op client.10012302:8321663,8321660 used ino 10052d9c287 but session next is 10052d57512 2019-08-09 00:48:40.844534 7fe6dab67700 1 mds.0.0 replay_done (as standby) 2019-08-09 00:48:41.844648 7fe6df782700 1 mds.0.0 standby_replay_restart (as standby) 2019-08-09 00:48:41.868242 7fe6dab67700 1 mds.0.0 replay_done (as standby) ---last two lines repeat again every second--- I was thinking of a couple of options to improve recovery in this situation. 1. Monitor the replay status of the replay MDS and alert if it is too far behind the active MDS, or if it isn't incrementing. I'm looking at `ceph daemon mds.<name> perf dump` since mds_log:rdpos was mentioned in another thread, but the replay MDS has a higher number than the primary. Maybe even some sort of timestamp or entry that increments with time or something we can check. 2. A third MDS that isn't replaying the journal and is more of a cold standby. It would take longer to start up, but it seems that if the replay MDS failed to take over, it would read the journal and start up. It seems that the issue that caused the primary to fail is hard to catch and debug, so we just need to be sure that we can recover with failover. Any hints on how to improve that would be appreciated. Thank you, Robert LeBlanc -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
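For option 1 above, a very rough monitoring sketch along these lines might work. It is untested and makes several assumptions: jq is installed, the check runs on the standby-replay host, the daemon name is a placeholder, and the mds_log/rdpos counter names match your Ceph version (verify with a manual perf dump first):

  #!/bin/bash
  # Alert if the standby-replay MDS journal read position stops advancing.
  MDS=sun-gcs01-mds02                    # placeholder daemon name
  prev=$(ceph daemon mds.$MDS perf dump | jq '.mds_log.rdpos')
  sleep 300
  curr=$(ceph daemon mds.$MDS perf dump | jq '.mds_log.rdpos')
  if [ "$curr" -le "$prev" ]; then
      echo "WARNING: mds.$MDS replay position stuck at $curr for 5 minutes"
  fi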
Re: [ceph-users] New CRUSH device class questions
On Wed, Aug 7, 2019 at 12:08 AM Konstantin Shalygin wrote: > On 8/7/19 1:40 PM, Robert LeBlanc wrote: > > > Maybe it's the lateness of the day, but I'm not sure how to do that. > > Do you have an example where all the OSDs are of class ssd? > Can't parse what you mean. You always should paste your `ceph osd tree` > first. > Our 'ceph osd tree' is like this: ID CLASS WEIGHTTYPE NAMESTATUS REWEIGHT PRI-AFF -1 892.21326 root default -369.16382 host sun-pcs01-osd01 0 ssd 3.49309 osd.0up 1.0 1.0 1 ssd 3.42329 osd.1up 0.87482 1.0 2 ssd 3.49309 osd.2up 0.88989 1.0 3 ssd 3.42329 osd.3up 0.94989 1.0 4 ssd 3.49309 osd.4up 0.93993 1.0 5 ssd 3.42329 osd.5up 1.0 1.0 6 ssd 3.49309 osd.6up 0.89490 1.0 7 ssd 3.42329 osd.7up 1.0 1.0 8 ssd 3.49309 osd.8up 0.89482 1.0 9 ssd 3.42329 osd.9up 1.0 1.0 100 ssd 3.49309 osd.100 up 1.0 1.0 101 ssd 3.42329 osd.101 up 1.0 1.0 102 ssd 3.49309 osd.102 up 1.0 1.0 103 ssd 3.42329 osd.103 up 0.81482 1.0 104 ssd 3.49309 osd.104 up 0.87973 1.0 105 ssd 3.42329 osd.105 up 0.86485 1.0 106 ssd 3.49309 osd.106 up 0.79965 1.0 107 ssd 3.42329 osd.107 up 1.0 1.0 108 ssd 3.49309 osd.108 up 1.0 1.0 109 ssd 3.42329 osd.109 up 1.0 1.0 -562.24744 host sun-pcs01-osd02 10 ssd 3.49309 osd.10 up 1.0 1.0 11 ssd 3.42329 osd.11 up 0.72473 1.0 12 ssd 3.49309 osd.12 up 1.0 1.0 13 ssd 3.42329 osd.13 up 0.78979 1.0 14 ssd 3.49309 osd.14 up 0.98961 1.0 15 ssd 3.42329 osd.15 up 1.0 1.0 16 ssd 3.49309 osd.16 up 0.96495 1.0 17 ssd 3.42329 osd.17 up 0.94994 1.0 18 ssd 3.49309 osd.18 up 1.0 1.0 19 ssd 3.42329 osd.19 up 0.80481 1.0 110 ssd 3.49309 osd.110 up 0.97998 1.0 111 ssd 3.42329 osd.111 up 1.0 1.0 112 ssd 3.49309 osd.112 up 1.0 1.0 113 ssd 3.42329 osd.113 up 0.72974 1.0 116 ssd 3.49309 osd.116 up 0.91992 1.0 117 ssd 3.42329 osd.117 up 0.96997 1.0 118 ssd 3.49309 osd.118 up 0.93959 1.0 119 ssd 3.42329 osd.119 up 0.94481 1.0 ... plus 11 more hosts just like this How do you single out one OSD from each host for the metadata only and prevent data on that OSD when all the device classes are the same? It seems that you would need one OSD to be a different class to do that. It a previous email the conversation was: Is it possible to add a new device class like 'metadata'? Yes, but you don't need this. Just use your existing class with another crush ruleset. So, I'm trying to figure out how you use the existing class of 'ssd' with another CRUSH ruleset to accomplish the above. > > Yes, we can set quotas to limit space usage (or number objects), but > > you can not reserve some space that other pools can't use. The problem > > is if we set a quota for the CephFS data pool to the equivalent of 95% > > there are at least two scenario that make that quota useless. > > Of course. 95% of CephFS deployments is where meta_pool on flash drives > with enough space for this. > > > ``` > > pool 21 'fs_data' replicated size 3 min_size 2 crush_rule 4 object_hash > rjenkins pg_num 64 pgp_num 64 last_change 56870 flags hashpspool > stripe_width 0 application cephfs > pool 22 'fs_meta' replicated size 3 min_size 2 crush_rule 0 object_hash > rjenkins pg_num 16 pgp_num 16 last_change 56870 flags hashpspool > stripe_width 0 application cephfs > > ``` > > ``` > > # ceph osd crush rule dump replicated_racks_nvme > { > "rule_id": 0, > "rule_name": "replicated_racks_nvme", > "ruleset": 0, > "type": 1, > "min_size": 1, > "max_size&qu
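One way to do what is being asked, if a distinct class turns out to be necessary after all, is to re-class one OSD per host and build a rule on that class. This is only a sketch; the OSD ids and the pool name are placeholders, and re-classing will move data:

  # Pick one OSD per host and move it to its own class.
  for id in 0 10 20 30; do                       # placeholder: one OSD id per host
      ceph osd crush rm-device-class osd.$id
      ceph osd crush set-device-class metadata osd.$id
  done
  # Replicated rule that only chooses OSDs of that class, one per host.
  ceph osd crush rule create-replicated replicated_metadata default host metadata
  ceph osd pool set fs_meta crush_rule replicated_metadata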
Re: [ceph-users] New CRUSH device class questions
On Tue, Aug 6, 2019 at 7:56 PM Konstantin Shalygin wrote: > Is it possible to add a new device class like 'metadata'? > > > Yes, but you don't need this. Just use your existing class with another > crush ruleset. > Maybe it's the lateness of the day, but I'm not sure how to do that. Do you have an example where all the OSDs are of class ssd? > If I set the device class manually, will it be overwritten when the OSD > boots up? > > > Nope. Classes assigned automatically when OSD is created, not boot'ed. > That's good to know. > I read https://ceph.com/community/new-luminous-crush-device-classes/ and it > mentions that Ceph automatically classifies into hdd, ssd, and nvme. Hence > the question. > > But it's not a magic. Sometimes drive can be sata ssd, but in kernel is > 'rotational'... > I see, so it's not looking to see if the device is in /sys/class/pci or something. > We will still have 13 OSDs, it will be overkill for space for metadata, but > since Ceph lacks a reserve space feature, we don't have many options. This > cluster is so fast that it can fill up in the blink of an eye. > > > Not true. You always can set per-pool quota in bytes, for example: > > * your meta is 1G; > > * your raw space is 300G; > > * your data is 90G; > > Set quota to your data pool: `ceph osd pool set-quota > max_bytes 96636762000` > Yes, we can set quotas to limit space usage (or number of objects), but you cannot reserve some space that other pools can't use. The problem is that if we set a quota for the CephFS data pool to the equivalent of 95%, there are at least two scenarios that make that quota useless. 1. A host fails and the cluster recovers. The quota is now past the capacity of the cluster, so if the data pool fills up, no pool can write. 2. The CephFS data pool is an erasure coded pool, and it shares the cluster with an RGW data pool that is 3x rep. If more writes happen to the RGW data pool, then the quota will be past the capacity of the cluster. Both of these cause metadata operations to not be committed and cause lots of problems with CephFS (can't list a directory with a broken inode in it). We would prefer to get a truncated file than a broken file system. I wrote a script that calculates 95% of the pool capacity and sets the quota if the current quota is 1% out of balance. This is run by cron every 5 minutes. If there is a way to reserve some capacity for a pool that no other pool can use, please provide an example. Think of reserved inode space in ext4/XFS/etc. Thank you. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
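The cron job described above would be roughly along these lines. This is a simplified sketch, not the actual script: it assumes jq, a Luminous-style `ceph df` JSON layout, a placeholder pool name, and it resets the quota unconditionally instead of checking the 1% tolerance:

  #!/bin/bash
  # Keep the data pool quota at ~95% of what the pool could currently hold.
  POOL=fs_data                            # placeholder pool name
  stats=$(ceph df --format json)
  avail=$(echo "$stats" | jq ".pools[] | select(.name==\"$POOL\") | .stats.max_avail")
  used=$(echo "$stats" | jq ".pools[] | select(.name==\"$POOL\") | .stats.bytes_used")
  quota=$(( (avail + used) * 95 / 100 ))
  ceph osd pool set-quota "$POOL" max_bytes "$quota"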
Re: [ceph-users] New CRUSH device class questions
On Tue, Aug 6, 2019 at 11:11 AM Paul Emmerich wrote: > On Tue, Aug 6, 2019 at 7:45 PM Robert LeBlanc > wrote: > > We have a 12.2.8 luminous cluster with all NVMe and we want to take some > of the NVMe OSDs and allocate them strictly to metadata pools (we have a > problem with filling up this cluster and causing lingering metadata > problems, and this will guarantee space for metadata operations). > > Depending on the workload and metadata size: this might be a bad idea > as it reduces parallelism. > We will still have 13 OSDs, it will be overkill for space for metadata, but since Ceph lacks a reserve space feature, we don't have many options. This cluster is so fast that it can fill up in the blink of an eye. > In the past, we have done this the old-school way of creating a separate > root, but I wanted to see if we could leverage the device class function > instead. > > > > Right now all our devices show as ssd rather than nvme, but that is the > only class in this cluster. None of the device classes were manually set, > so is there a reason they were not detected as nvme? > > Ceph only distinguishes rotational vs. non-rotational for device > classes and device-specific configuration options > I read https://ceph.com/community/new-luminous-crush-device-classes/ and it mentions that Ceph automatically classifies into hdd, ssd, and nvme. Hence the question. > Is it possible to add a new device class like 'metadata'? > > yes, it's just a string, you can use whatever you want. For example, > we run a few setups that distinguish 2.5" and 3.5" HDDs. > > > > > > If I set the device class manually, will it be overwritten when the OSD > boots up? > > not sure about this, we run all of our setups with auto-updating on > start disabled > Do you know if 'osd crush location hook' can specify the device class? > Is what I'm trying to accomplish better done by the old-school separate > root and the osd_crush_location_hook (potentially using a file with a list > of partition UUIDs that should be in the metadata pool).? > > device classes are the way to go here Thanks! Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
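For reference, manually re-classing an OSD looks something like the sketch below (the OSD id is a placeholder). The "auto-updating on start" Paul mentions is the osd crush update on start option, which controls whether an OSD rewrites its CRUSH location at startup; device classes themselves are assigned at OSD creation, not on boot:

  # ceph.conf, [osd] section, to stop OSDs from updating their CRUSH location on start:
  #   osd crush update on start = false
  ceph osd crush rm-device-class osd.12
  ceph osd crush set-device-class nvme osd.12
  ceph osd crush tree --show-shadow       # verify the class change took effect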
[ceph-users] New CRUSH device class questions
We have a 12.2.8 luminous cluster with all NVMe and we want to take some of the NVMe OSDs and allocate them strictly to metadata pools (we have a problem with filling up this cluster and causing lingering metadata problems, and this will guarantee space for metadata operations). In the past, we have done this the old-school way of creating a separate root, but I wanted to see if we could leverage the device class function instead. Right now all our devices show as ssd rather than nvme, but that is the only class in this cluster. None of the device classes were manually set, so is there a reason they were not detected as nvme? Is it possible to add a new device class like 'metadata'? If I set the device class manually, will it be overwritten when the OSD boots up? Is what I'm trying to accomplish better done by the old-school separate root and the osd_crush_location_hook (potentially using a file with a list of partition UUIDs that should be in the metadata pool).? Any other options I may not be considering? Thank you, Robert LeBlanc -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Built-in HA?
Another option is if both RDMA ports are on the same card, then you can do RDMA with a bond. This does not work if you have two separate cards. As far as your questions go, my guess would be that you would want to have the different NICs in different broadcast domains, or set up Source Based Routing and bind the source port on the connection (not the easiest, but allows you to have multiple NICs in the same broadcast domain). I don't have experience with Ceph in this type of configuration. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Fri, Aug 2, 2019 at 9:41 AM Volodymyr Litovka wrote: > Dear colleagues, > > at the moment, we use Ceph in routed environment (OSPF, ECMP) and > everything is ok, reliability is high and there is nothing to complain > about. But for hardware reasons (to be more precise - RDMA offload), we are > faced with the need to operate Ceph directly on physical interfaces. > > According to documentation, "We generally recommend that dual-NIC systems > either be configured with two IPs on the same network, or bonded." > > Q1: Did anybody test and can explain, how Ceph will behave in first > scenario (two IPs on the same network)? I think this configuration require > just one statement in 'public network' (where both interfaces reside)? How > it will distribute traffic between links, how it will detect link failures > and how it will switchover? > > Q2: Did anybody test a bit another scenario - both NICs have addresses in > different networks and Ceph configuration contain two 'public networks'? > Questions are same - how Ceph distributes traffic between links and how it > recovers from link failures? > > Thank you. > > -- > Volodymyr Litovka > "Vision without Execution is Hallucination." -- Thomas Edison > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
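A generic Linux sketch of the source-based-routing idea follows. It is not Ceph-specific, is untested in this context, and the addresses, interface name, and routing table number are made up for illustration:

  # Second NIC 192.168.10.52/24 on eth1, same broadcast domain as the first NIC.
  echo "100 ceph2" >> /etc/iproute2/rt_tables
  ip route add 192.168.10.0/24 dev eth1 src 192.168.10.52 table ceph2
  ip route add default via 192.168.10.1 dev eth1 table ceph2
  ip rule add from 192.168.10.52/32 table ceph2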
Re: [ceph-users] How to add 100 new OSDs...
It does better because it is a fair share queue and doesn't let recovery ops take priority over client ops at any point for any time. It allows clients to have a much more predictable latency to the storage. Sent from a mobile device, please excuse any typos. On Sat, Aug 3, 2019, 1:10 PM Alex Gorbachev wrote: > On Fri, Aug 2, 2019 at 6:57 PM Robert LeBlanc > wrote: > > > > On Fri, Jul 26, 2019 at 1:02 PM Peter Sabaini wrote: > >> > >> On 26.07.19 15:03, Stefan Kooman wrote: > >> > Quoting Peter Sabaini (pe...@sabaini.at): > >> >> What kind of commit/apply latency increases have you seen when > adding a > >> >> large numbers of OSDs? I'm nervous how sensitive workloads might > react > >> >> here, esp. with spinners. > >> > > >> > You mean when there is backfilling going on? Instead of doing "a big > >> > >> Yes exactly. I usually tune down max rebalance and max recovery active > >> knobs to lessen impact but still I found the additional write load can > >> substantially increase i/o latencies. Not all workloads like this. > > > > > > We have been using: > > > > osd op queue = wpq > > osd op queue cut off = high > > > > It virtually eliminates the impact of backfills on our clusters. Our > backfill and recovery times have increased when the cluster has lots of > client I/O, but the clients haven't noticed that huge backfills have been > going on. > > > > > > Robert LeBlanc > > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > Would this be superior to setting: > > osd_recovery_sleep = 0.5 (or some high value) > > > -- > Alex Gorbachev > Intelligent Systems Services Inc. > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Problems understanding 'ceph-features' output
On Tue, Jul 30, 2019 at 2:06 AM Janne Johansson wrote: > Someone should make a webpage where you can enter that hex-string and get > a list back. > Providing a minimum bitmap would allow someone to do so, and someone like me to do it manually until then. ---- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
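Until such a page exists, the hex string can at least be broken into its set bits with a small shell loop; the mask below is just an example value, and the bit-to-feature mapping then has to be looked up in src/include/ceph_features.h for your release:

  features=0x3ffddff8ffacffff     # example mask from `ceph features` output
  for bit in $(seq 0 63); do
      if (( (features >> bit) & 1 )); then
          printf 'bit %d set\n' "$bit"
      fi
  done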
Re: [ceph-users] How to add 100 new OSDs...
On Fri, Jul 26, 2019 at 1:02 PM Peter Sabaini wrote: > On 26.07.19 15:03, Stefan Kooman wrote: > > Quoting Peter Sabaini (pe...@sabaini.at): > >> What kind of commit/apply latency increases have you seen when adding a > >> large numbers of OSDs? I'm nervous how sensitive workloads might react > >> here, esp. with spinners. > > > > You mean when there is backfilling going on? Instead of doing "a big > > Yes exactly. I usually tune down max rebalance and max recovery active > knobs to lessen impact but still I found the additional write load can > substantially increase i/o latencies. Not all workloads like this. > We have been using: osd op queue = wpq osd op queue cut off = high It virtually eliminates the impact of backfills on our clusters. Our backfill and recovery times have increased when the cluster has lots of client I/O, but the clients haven't noticed that huge backfills have been going on. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
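Those two settings go in the [osd] section of ceph.conf and need an OSD restart to take effect; afterwards they can be checked on a running daemon, something like the following (the OSD id is a placeholder):

  # /etc/ceph/ceph.conf
  # [osd]
  # osd op queue = wpq
  # osd op queue cut off = high
  systemctl restart ceph-osd@12           # restart required; the queue type is read at startup
  ceph daemon osd.12 config get osd_op_queue
  ceph daemon osd.12 config get osd_op_queue_cut_off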
Re: [ceph-users] Mark CephFS inode as lost
Thanks, I created a ticket. http://tracker.ceph.com/issues/40906 Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, Jul 22, 2019 at 11:45 PM Yan, Zheng wrote: > please create a ticket at http://tracker.ceph.com/projects/cephfs and > upload mds log with debug_mds =10 > > On Tue, Jul 23, 2019 at 6:00 AM Robert LeBlanc > wrote: > > > > We have a Luminous cluster which has filled up to 100% multiple times > and this causes an inode to be left in a bad state. Doing anything to these > files causes the client to hang which requires evicting the client and > failing over the MDS. Usually we move the parent directory out of the way > and things mostly are okay. However in this last fill up, we have a > significant amount of storage that we have moved out of the way and really > need to reclaim that space. I can't delete the files around it as listing > the directory causes a hang. > > > > We can get the inode that is bad from the logs/blocked_ops, how can we > tell MDS that the inode is lost and to forget about it without trying to do > any checks on it (checking the RADOS objects may be part of the problem)? > Once the inode is out of CephFS, we can clean up the RADOS objects manually > or leave them there to rot. > > > > Thanks, > > Robert LeBlanc > > > > Robert LeBlanc > > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Mark CephFS inode as lost
We have a Luminous cluster which has filled up to 100% multiple times and this causes an inode to be left in a bad state. Doing anything to these files causes the client to hang which requires evicting the client and failing over the MDS. Usually we move the parent directory out of the way and things mostly are okay. However in this last fill up, we have a significant amount of storage that we have moved out of the way and really need to reclaim that space. I can't delete the files around it as listing the directory causes a hang. We can get the inode that is bad from the logs/blocked_ops, how can we tell MDS that the inode is lost and to forget about it without trying to do any checks on it (checking the RADOS objects may be part of the problem)? Once the inode is out of CephFS, we can clean up the RADOS objects manually or leave them there to rot. Thanks, Robert LeBlanc Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Investigating Config Error, 300x reduction in IOPs performance on RGW layer
I'm pretty new to RGW, but I'm needing to get max performance as well. Have you tried moving your RGW metadata pools to nvme? Carve out a bit of NVMe space and then pin the pool to the SSD class in CRUSH, that way the small metadata ops aren't on slow media. -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, Jul 17, 2019 at 5:59 PM Ravi Patel wrote: > Hello, > > We have deployed ceph cluster and we are trying to debug a massive drop in > performance between the RADOS layer vs the RGW layer > > ## Cluster config > 4 OSD nodes (12 Drives each, NVME Journals, 1 SSD drive) 40GbE NIC > 2 RGW nodes ( DNS RR load balancing) 40GbE NIC > 3 MON nodes 1 GbE NIC > > ## Pool configuration > RGW data pool - replicated 3x 4M stripe (HDD) > RGW metadata pool - replicated 3x (SSD) pool > > ## Benchmarks > 4K Read IOP/s performance using RADOS Bench 48,000~ IOP/s > 4K Read RGW performance via s3 interface ~ 130 IOP/s > > Really trying to understand how to debug this issue. all the nodes never > break 15% CPU utilization and there is plenty of RAM. The one pathological > issue in our cluster is that the MON nodes are currently on VMs that are > sitting behind a single 1 GbE NIC. (We are in the process of moving them, > but are unsure if that will fix the issue. > > What metrics should we be looking at to debug the RGW layer. Where do we > need to look? > > --- > > Ravi Patel, PhD > Machine Learning Systems Lead > Email: r...@kheironmed.com > > > *Kheiron Medical Technologies* > > kheironmed.com | supporting radiologists with deep learning > > Kheiron Medical Technologies Ltd. is a registered company in England and > Wales. This e-mail and its attachment(s) are intended for the above named > only and are confidential. If they have come to you in error then you must > take no action based upon them but contact us immediately. Any disclosure, > copying, distribution or any action taken or omitted to be taken in > reliance on it is prohibited and may be unlawful. Although this e-mail and > its attachments are believed to be free of any virus, it is the > responsibility of the recipient to ensure that they are virus free. If you > contact us by e-mail then we will store your name and address to facilitate > communications. Any statements contained herein are those of the individual > and not the organisation. > > Registered number: 10184103. Registered office: RocketSpace, 40 Islington > High Street, London, N1 8EQ > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
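A rough sketch of that pinning is below. The pool names are placeholders; check which meta/index/log pools your zone actually uses before changing rules, and expect data movement when the rule changes:

  # Rule that only uses OSDs of class ssd, replicated across hosts.
  ceph osd crush rule create-replicated rgw-meta-ssd default host ssd
  # Move the small, latency-sensitive RGW pools onto it.
  for pool in default.rgw.meta default.rgw.log default.rgw.buckets.index; do
      ceph osd pool set "$pool" crush_rule rgw-meta-ssd
  done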
[ceph-users] Allocation recommendations for separate blocks.db and WAL
So, I see the recommendation for 4% of OSD space for blocks.db/WAL and the corresponding discussion regarding the 3/30/300GB vs 6/60/600GB allocation. How does this change when the WAL is separate from blocks.db? Reading [0], it seems that 6/60/600 is not correct. It seems that to compact a 300GB DB, you are taking values from the layer above (which is only 10% of the lower layer, and only some percentage that exceeds the trigger point of that will be merged down) and merging them in, so in the worst case you would need 333GB (300+30+3) plus some headroom. [0] https://github.com/facebook/rocksdb/wiki/Leveled-Compaction Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
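Spelling out the arithmetic in the paragraph above (illustrative only, using the level sizes commonly discussed on this list):

  # Each RocksDB level that should live on the fast device, per the discussion above:
  #   300 GB (level being compacted) + 30 GB (level above) + 3 GB (level above that)
  echo $(( 300 + 30 + 3 ))   # 333 GB, plus some headroom for the compaction itself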
Re: [ceph-users] enterprise support
We recently used Croit (https://croit.io/) and they were really good. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, Jul 15, 2019 at 12:53 PM Void Star Nill wrote: > Hello, > > Other than Redhat and SUSE, are there other companies that provide > enterprise support for Ceph? > > Thanks, > Shridhar > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] To backport or not to backport
On Thu, Jul 4, 2019 at 8:00 AM Stefan Kooman wrote: > Hi, > > Now the release cadence has been set, it's time for another discussion > :-). > > During Ceph day NL we had a panel q/a [1]. One of the things that was > discussed were backports. Occasionally users will ask for backports of > functionality in newer releases to older releases (that are still in > support). > > Ceph is quite a unique project in the sense that new functionality gets > backported to older releases. Sometimes even functionality gets changed > in the lifetime of a release. I can recall "ceph-volume" change to LVM > in the beginning of the Luminous release. While backports can enrich the > user experience of a ceph operator, it's not without risks. There have > been several issues with "incomplete" backports and or unforeseen > circumstances that had the reverse effect: downtime of (part of) ceph > services. The ones that come to my mind are: > > - MDS (cephfs damaged) mimic backport (13.2.2) > - RADOS (pg log hard limit) luminous / mimic backport (12.2.8 / 13.2.2) > > I would like to define a simple rule of when to backport: > > - Only backport fixes that do not introduce new functionality, but > addresses > (impaired) functionality already present in the release. > > Example of, IMHO, a backport that matches the backport criteria was the > "bitmap_allocator" fix. It fixed a real problem, not some corner case. > Don't get me wrong here, it is important to catch corner cases, but it > should not put the majority of clusters at risk. > > The time and effort that might be saved with this approach can indeed be > spend in one of the new focus areas Sage mentioned during his keynote > talk at Cephalocon Barcelona: quality. Quality of the backports that are > needed, improved testing, especially for upgrades to newer releases. If > upgrades are seemless, people are more willing to upgrade, because hey, > it just works(tm). Upgrades should be boring. > > How many clusters (not nautilus ;-)) are running with "bitmap_allocator" or > with the pglog_hardlimit enabled? If a new feature is not enabled by > default and it's unclear how "stable" it is to use, operators tend to not > enable it, defeating the purpose of the backport. > > Backporting fixes to older releases can be considered a "business > opportunity" for the likes of Red Hat, SUSE, Fujitsu, etc. Especially > for users that want a system that "keeps on running forever" and never > needs "dangerous" updates. > > This is my view on the matter, please let me know what you think of > this. > > Gr. Stefan > > P.s. Just to make things clear: this thread is in _no way_ intended to > pick on > anybody. > > > [1]: https://pad.ceph.com/p/ceph-day-nl-2019-panel I prefer a released version to be fairly static and not have new features introduced, only bug fixes. For one, I'd prefer not to have to read the release notes to figure out how dangerous a "bug-fix" release should be. The fixes in a released version should be tested extremely well so it "Just Works". By not back porting new features, I think it gives more time to bake the features into the new version and frees up the developers to focus on the forward direction of the product. If I want a new feature, then the burden is on me to test a new version and verify that it works in my environment (or vendors), not the developers. I wholeheartedly support only bug fixes and security fixes going into released versions. 
Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cannot add fuse options to ceph-fuse command
Is this a Ceph specific option? If so, you may need to prefix it with "ceph.", at least I had to for FUSE to pass it to the Ceph module/code portion. -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Jul 4, 2019 at 7:35 AM songz.gucas wrote: > Hi, > > > I try to add some fuse options when mount cephfs using ceph-fuse tool, but > it errored: > > > ceph-fuse -m 10.128.5.1,10.128.5.2,10.128.5.3 -r /test1 /cephfs/test1 -o > entry_timeout=5 > > ceph-fuse[3857515]: starting ceph client2019-07-04 21:55:37.767 > 7fc1d9cbdbc0 -1 init, newargv = 0x555d6f847490 newargc=9 > > > fuse: unknown option `entry_timeout=5' > > ceph-fuse[3857515]: fuse failed to start > > 2019-07-04 21:55:37.796 7fc1d9cbdbc0 -1 fuse_lowlevel_new failed > > > > How can I pass options to fuse? > > > Thank you for your precious help ! > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph-ansible with docker
I need some help getting up the learning curve and hope someone can get me on the right track. I need to set up a new cluster, but want the mon, mgr and rgw services as containers on the non-container osd nodes. It seem that doing no containers or all containers is fairly easy but I'm trying to understand if I can do what I want. I'd like to use containers to set cgroups for resource management. With the OSDs having a public network and cluster network, is Ansible smart enough to connect to the right network based on mon IP address for instance? How do you tell Ansible to place the mon container on a specific host? I watched the video from Sébastien Han that he made in 2015, but it seems that the config has changed quite a bit since then. I'm quite new to Ansible (used Puppet and Salt in the past) and Docker (used LXC and LXD), so any help would be appreciated. Does Docker have their own IP address and are bridges created like LXD or does it share the host IP? Thank you, Robert LeBlanc -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] increase pg_num error
On Mon, Jul 1, 2019 at 11:57 AM Brett Chancellor wrote: > In Nautilus just pg_num is sufficient for both increases and decreases. > > Good to know, I haven't gotten to Nautilus yet. -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] increase pg_num error
I believe he needs to increase the pgp_num first, then pg_num. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, Jul 1, 2019 at 7:21 AM Nathan Fish wrote: > I ran into this recently. Try running "ceph osd require-osd-release > nautilus". This drops backwards compat with pre-nautilus and allows > changing settings. > > On Mon, Jul 1, 2019 at 4:24 AM Sylvain PORTIER wrote: > > > > Hi all, > > > > I am using ceph 14.2.1 (Nautilus) > > > > I am unable to increase the pg_num of a pool. > > > > I have a pool named Backup, the current pg_num is 64 : ceph osd pool get > > Backup pg_num => result pg_num: 64 > > > > And when I try to increase it using the command > > > > ceph osd pool set Backup pg_num 512 => result "set pool 6 pg_num to 512" > > > > And then I check with the command : ceph osd pool get Backup pg_num => > > result pg_num: 64 > > > > I don't how to increase the pg_num of a pool, I also tried the autoscale > > module, but it doesn't work (unable to activate the autoscale, always > > warn mode). > > > > Thank you for your help, > > > > > > Cabeur. > > > > > > --- > > L'absence de virus dans ce courrier électronique a été vérifiée par le > logiciel antivirus Avast. > > https://www.avast.com/antivirus > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How does monitor know OSD is dead?
On Sat, Jun 29, 2019 at 8:12 PM Bryan Henderson wrote: > > I'm not sure why the monitor did not mark it _out_ after 600 seconds > > (default) > > Well, that part I understand. The monitor didn't mark the OSD out because the > monitor still considered the OSD up. No reason to mark an up OSD out. > > I think the monitor should have marked the OSD down upon not hearing from it > for 15 minutes ("mon osd report interval"), then out 10 minutes after that > ("mon osd down out interval"). > > And that's worst case. Though details of how OSDs watch each other are vague, > I suspect an existing OSD was supposed to detect the dead OSDs and report that > to the monitor, which would believe it within about a minute and mark the OSDs > down. ("osd heartbeat interval", "mon osd min down reports", "mon osd min down > reporters", "osd reporter subtree level"). > > -- > Bryan Henderson San Jose, California > So, if an OSD (osd.2) gets no heartbeat reply from another OSD (osd.1) for three heartbeat intervals (6 seconds each), then the OSD sending the heartbeats (osd.2) tells the monitor that the OSD (osd.1) is down. It takes reports from two OSDs in different CRUSH subtrees (host by default) for the monitor to mark the OSD down. The OSD is also supposed to report to the monitor each time there is a change, or every 120 seconds; if 600 seconds pass with the monitor not hearing from the OSD, it will mark it down. It 'should' only take 20 seconds to detect a downed OSD. Usually, the problem is that an OSD gets too busy and misses heartbeats, so other OSDs wrongly report it down. If 'nodown' is set, then the monitor will not mark OSDs down. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How does monitor know OSD is dead?
On Sat, Jun 29, 2019 at 6:51 PM Bryan Henderson wrote: > > The reason it is so long is that you don't want to move data > > around unnecessarily if the osd is just being rebooted/restarted. > > I think you're confusing down with out. When an OSD is out, Ceph > backfills. While it is merely down, Ceph hopes that it will come back. > But it will direct I/O to other redundant OSDs instead of a down one. > > Going down leads to going out, and I believe that is the 600 seconds you > mention - the time between when the OSD is marked down and when Ceph marks > it > out (if all other conditions permit). > > There is a pretty good explanation of how OSDs get marked down, which is > pretty complicated, at > > > http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/ > > It just doesn't seem to match the implementation. > > -- > Bryan Henderson San Jose, California > I mixed up my terminology, the first line should have read: " I'm not sure why the monitor did not mark it _out_ after 600 seconds (default) " The "down timeout" I mention is the "mon osd down out interval". The rest of what I wrote is correct. Just to make sure I don't confuse anyone else. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating a cephfs data pool
Yes, 'mv' on the client is just a metadata operation and not what I'm talking about. The idea is to bring the old pool in as a cache layer, then bring the new pool in as a lower layer, then flush/evict the data from the cache and Ceph will move the data to the new pool, but still be able to access it by the old pool name. You then add an overlay so that the new pool name acts the same, then the idea is that you can remove the old pool from the cache and remove the overlay. The only problem is updating cephfs to look at the new pool name for data that it knows is at the old pool name. The other option is to add a data mover to cephfs so you can do something like `ceph fs mv old_pool new_pool` and it would move all the objects and update the metadata as it performs the data moving. The question is how to do the data movement since the MDS is not in the data path. Since both pool names act the same with the overlay, the best option sounds like; configure the tiering, add the overlay, then do a `ceph fs migrate old_pool new_pool` which causes the MDS to scan through all the metadata and update all references of 'old_pool' to 'new_pool'. Once that is done and the eviction is done, then you can remove the pool from cephfs and the overlay. That way the OSDs are the one doing the data movement. I don't know that part of the code, so I can't quickly propose any patches. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Fri, Jun 28, 2019 at 9:37 AM Marc Roos wrote: > > Afaik is the mv now fast because it is not moving any real data, just > some meta data. Thus a real mv will be slow (only in the case between > different pools) because it copies the data to the new pool and when > successful deletes the old one. This will of course take a lot more > time, but you at least are able to access the cephfs on both locations > during this time and can fix things in your client access. > > My problem with mv now is that if you accidentally use it between data > pools, it does not really move data. > > > > -Original Message- > From: Robert LeBlanc [mailto:rob...@leblancnet.us] > Sent: vrijdag 28 juni 2019 18:30 > To: Marc Roos > Cc: ceph-users; jgarcia > Subject: Re: [ceph-users] Migrating a cephfs data pool > > Given that the MDS knows everything, it seems trivial to add a ceph 'mv' > command to do this. I looked at using tiering to try and do the move, > but I don't know how to tell cephfs that the data is now on the new pool > instead of the old pool name. Since we can't take a long enough downtime > to move hundreds of Terabytes, we need something that can be done > online, and if it has a minute or two of downtime would be okay. > > > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > > On Fri, Jun 28, 2019 at 9:02 AM Marc Roos > wrote: > > > > > 1. > change data pool for a folder on the file system: > setfattr -n ceph.dir.layout.pool -v fs_data.ec21 foldername > > 2. > cp /oldlocation /foldername > Remember that you preferably want to use mv, but this leaves > (meta) > data > on the old pool, that is not what you want when you want to delete > that > pool. > > 3. When everything is copied-removed, you should end up with an > empty > datapool with zero objects. > > 4. Verify here with others, if you can just remove this one. > > I think this is a reliable technique to switch, because you use > the > > basic cephfs functionality that supposed to work. I prefer that > the > ceph > guys implement a mv that does what you expect from it. 
Now it acts > more > or less like a linking. > > > > > -Original Message- > From: Jorge Garcia [mailto:jgar...@soe.ucsc.edu] > Sent: vrijdag 28 juni 2019 17:52 > To: Marc Roos; ceph-users > Subject: Re: [ceph-users] Migrating a cephfs data pool > > Are you talking about adding the new data pool to the current > filesystem? Like: > >$ ceph fs add_data_pool my_ceph_fs new_ec_pool > > I have done that, and now the filesystem shows up as having two > data > pools: > >$ ceph fs ls >name: my_ceph_fs, metadata pool: cephfs_meta, data pools: > [cephfs_data new_ec_pool ] > > but then I run into two issues: > > 1. How do I actually copy/move/migrate the data from the old pool > to the > new pool? > 2. When I'm done moving the data, how do I get rid of the old da
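For the record, the tiering idea sketched in this thread would look roughly like the commands below. This is completely untested, the pool names are made up, and the open problem noted above remains: CephFS metadata still references the old pool, so the old pool and the overlay cannot simply be removed afterwards:

  # Old data pool: cephfs_data, new pool: cephfs_data_new (placeholders).
  ceph osd tier add cephfs_data_new cephfs_data --force-nonempty   # old pool becomes the cache
  ceph osd tier cache-mode cephfs_data forward --yes-i-really-mean-it
  ceph osd tier set-overlay cephfs_data_new cephfs_data            # new pool name now answers I/O
  rados -p cephfs_data cache-flush-evict-all                       # push objects down to the new pool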
Re: [ceph-users] Migrating a cephfs data pool
Given that the MDS knows everything, it seems trivial to add a ceph 'mv' command to do this. I looked at using tiering to try and do the move, but I don't know how to tell cephfs that the data is now on the new pool instead of the old pool name. Since we can't take a long enough downtime to move hundreds of Terabytes, we need something that can be done online, and if it has a minute or two of downtime would be okay. -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Fri, Jun 28, 2019 at 9:02 AM Marc Roos wrote: > > > 1. > change data pool for a folder on the file system: > setfattr -n ceph.dir.layout.pool -v fs_data.ec21 foldername > > 2. > cp /oldlocation /foldername > Remember that you preferably want to use mv, but this leaves (meta) data > on the old pool, that is not what you want when you want to delete that > pool. > > 3. When everything is copied-removed, you should end up with an empty > datapool with zero objects. > > 4. Verify here with others, if you can just remove this one. > > I think this is a reliable technique to switch, because you use the > basic cephfs functionality that supposed to work. I prefer that the ceph > guys implement a mv that does what you expect from it. Now it acts more > or less like a linking. > > > > > -Original Message- > From: Jorge Garcia [mailto:jgar...@soe.ucsc.edu] > Sent: vrijdag 28 juni 2019 17:52 > To: Marc Roos; ceph-users > Subject: Re: [ceph-users] Migrating a cephfs data pool > > Are you talking about adding the new data pool to the current > filesystem? Like: > >$ ceph fs add_data_pool my_ceph_fs new_ec_pool > > I have done that, and now the filesystem shows up as having two data > pools: > >$ ceph fs ls >name: my_ceph_fs, metadata pool: cephfs_meta, data pools: > [cephfs_data new_ec_pool ] > > but then I run into two issues: > > 1. How do I actually copy/move/migrate the data from the old pool to the > new pool? > 2. When I'm done moving the data, how do I get rid of the old data pool? > > I know there's a rm_data_pool option, but I have read on the mailing > list that you can't remove the original data pool from a cephfs > filesystem. > > The other option is to create a whole new cephfs with a new metadata > pool and the new data pool, but creating multiple filesystems is still > experimental and not allowed by default... > > On 6/28/19 8:28 AM, Marc Roos wrote: > > > > What about adding the new data pool, mounting it and then moving the > > files? (read copy because move between data pools does not what you > > expect it do) > > > > > > -Original Message- > > From: Jorge Garcia [mailto:jgar...@soe.ucsc.edu] > > Sent: vrijdag 28 juni 2019 17:26 > > To: ceph-users > > Subject: *SPAM* [ceph-users] Migrating a cephfs data pool > > > > This seems to be an issue that gets brought up repeatedly, but I > > haven't seen a definitive answer yet. So, at the risk of repeating a > > question that has already been asked: > > > > How do you migrate a cephfs data pool to a new data pool? The obvious > > case would be somebody that has set up a replicated pool for their > > cephfs data and then wants to convert it to an erasure code pool. Is > > there a simple way to do this, other than creating a whole new ceph > > cluster and copying the data using rsync? 
> > > > Thanks for any clues > > > > Jorge > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How does monitor know OSD is dead?
I'm not sure why the monitor did not mark it down after 600 seconds (default). The reason it is so long is that you don't want to move data around unnecessarily if the osd is just being rebooted/restarted. Usually, you will still have min_size OSDs available for all PGs that will allow IO to continue. Then when the down timeout expires it will start backfilling and recovering the PGs that were affected. Double check that size != min_size for your pools. -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Jun 27, 2019 at 5:26 PM Bryan Henderson wrote: > What does it take for a monitor to consider an OSD down which has been > dead as > a doornail since the cluster started? > > A couple of times, I have seen 'ceph status' report an OSD was up, when it > was > quite dead. Recently, a couple of OSDs were on machines that failed to > boot > up after a power failure. The rest of the Ceph cluster came up, though, > and > reported all OSDs up and in. I/Os stalled, probably because they were > waiting > for the dead OSDs to come back. > > I waited 15 minutes, because the manual says if the monitor doesn't hear a > heartbeat from an OSD in that long (default value of > mon_osd_report_timeout), > it marks it down. But it didn't. I did "osd down" commands for the dead > OSDs > and the status changed to down and I/O started working. > > And wouldn't even 15 minutes of grace be unacceptable if it means I/Os > have to > wait that long before falling back to a redundant OSD? > > -- > Bryan Henderson San Jose, California > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS : Kernel/Fuse technical differences
There may also be more memory copying involved instead of just passing pointers around as well, but I'm not 100% sure. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, Jun 24, 2019 at 10:28 AM Jeff Layton wrote: > On Mon, 2019-06-24 at 15:51 +0200, Hervé Ballans wrote: > > Hi everyone, > > > > We successfully use Ceph here for several years now, and since recently, > > CephFS. > > > > From the same CephFS server, I notice a big difference between a fuse > > mount and a kernel mount (10 times faster for kernel mount). It makes > > sense to me (an additional fuse library versus a direct access to a > > device...), but recently, one of our users asked me to explain him in > > more detail the reason for this big difference...Hum... > > > > I then realized that I didn't really know how to explain the reasons to > > him !! > > > > As well, does anyone have a more detailed explanation in a few words or > > know a good web resource on this subject (I guess it's not specific to > > Ceph but it's generic to all filesystems ?..) > > > > Thanks in advance, > > Hervé > > > > A lot of it is the context switching. > > Every time you make a system call (or other activity) that accesses a > FUSE mount, it has to dispatch that request to the fuse device, the > userland ceph-fuse daemon then has to wake up and do its thing (at least > once) and then send the result back down to the kernel which then wakes > up the original task so it can get the result. > > FUSE is a wonderful thing, but it's not really built for speed. > > -- > Jeff Layton > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rebalancing ceph cluster
The placement of PGs is random in the cluster and takes into account any CRUSH rules which may also skew the distribution. Having more PGs will help give more options for placing PGs, but it still may not be adequate. It is recommended to have between 100-150 PGs per OSD, and you are pretty close. If you aren't planning to add any more pools, then splitting the PGs for pools that have a lot of data can help. To get things to be more balanced, you can reweight the high utlization OSDs down to cause CRUSH to migrate some PGs off. This won't mean that they will get moved to the lowest utilized OSDs (they might wind up on another one that is pretty full). So, it may take several iterations to get things balanced. Just be sure that if you reweighted one down and it is now much lower usage than the others to reweight it back up to attract some PGs back to it. ```ceph osd reweight {osd-num} {weight}``` -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, Jun 24, 2019 at 2:25 AM jinguk.k...@ungleich.ch < jinguk.k...@ungleich.ch> wrote: > Hello everyone, > > We have some osd on the ceph. > Some osd's usage is more than 77% and another osd's usage is 39% in the > same host. > > I wonder why osd’s usage is different.(Difference is large) and how can i > fix it? > > ID CLASS WEIGHTREWEIGHT SIZEUSE AVAIL %USE VAR PGS TYPE > NAME > -2 93.26010- 93.3TiB 52.3TiB 41.0TiB 56.04 0.98 - > host serverA > …... > 33 HDD 9.09511 1.0 9.10TiB 3.55TiB 5.54TiB 39.08 0.68 66 > osd.4 > 45 HDD 7.27675 1.0 7.28TiB 5.64TiB 1.64TiB 77.53 1.36 81 > osd.7 > …... > > -5 79.99017- 80.0TiB 47.7TiB 32.3TiB 59.62 1.04 - > host serverB > 1 HDD 9.09511 1.0 9.10TiB 4.79TiB 4.31TiB 52.63 0.92 87 > osd.1 > 6 HDD 9.09511 1.0 9.10TiB 6.62TiB 2.48TiB 72.75 1.27 99 > osd.6 > …... > > Thank you > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
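Instead of picking OSDs by hand, the built-in helper can do one iteration at a time; something like the following, where the 120 threshold and the OSD id are just examples:

  ceph osd test-reweight-by-utilization 120   # dry run: only touches OSDs >20% above the mean
  ceph osd reweight-by-utilization 120        # apply it
  ceph osd reweight 7 0.90                    # or nudge a single overfull OSD manually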
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
I'm glad it's working, to be clear did you use wpq, or is it still the prio queue? Sent from a mobile device, please excuse any typos. On Mon, Jun 10, 2019, 4:45 AM BASSAGET Cédric wrote: > an update from 12.2.9 to 12.2.12 seems to have fixed the problem ! > > Le lun. 10 juin 2019 à 12:25, BASSAGET Cédric < > cedric.bassaget...@gmail.com> a écrit : > >> Hi Robert, >> Before doing anything on my prod env, I generate r/w on ceph cluster >> using fio . >> On my newest cluster, release 12.2.12, I did not manage to get >> the (REQUEST_SLOW) warning, even if my OSD disk usage goes above 95% (fio >> ran from 4 diffrent hosts) >> >> On my prod cluster, release 12.2.9, as soon as I run fio on a single >> host, I see a lot of REQUEST_SLOW warninr gmessages, but "iostat -xd 1" >> does not show me a usage more that 5-10% on disks... >> >> Le lun. 10 juin 2019 à 10:12, Robert LeBlanc a >> écrit : >> >>> On Mon, Jun 10, 2019 at 1:00 AM BASSAGET Cédric < >>> cedric.bassaget...@gmail.com> wrote: >>> >>>> Hello Robert, >>>> My disks did not reach 100% on the last warning, they climb to 70-80% >>>> usage. But I see rrqm / wrqm counters increasing... >>>> >>>> Device: rrqm/s wrqm/s r/s w/srkB/swkB/s >>>> avgrq-sz avgqu-sz await r_await w_await svctm %util >>>> >>>> sda 0.00 4.000.00 16.00 0.00 104.00 >>>> 13.00 0.000.000.000.00 0.00 0.00 >>>> sdb 0.00 2.001.00 3456.00 8.00 25996.00 >>>> 15.04 5.761.670.001.67 0.03 9.20 >>>> sdd 4.00 0.00 41462.00 1119.00 331272.00 7996.00 >>>> 15.9419.890.470.480.21 0.02 66.00 >>>> >>>> dm-0 0.00 0.00 6825.00 503.00 330856.00 7996.00 >>>> 92.48 4.000.550.560.30 0.09 66.80 >>>> dm-1 0.00 0.001.00 1129.00 8.00 25996.00 >>>> 46.02 1.030.910.000.91 0.09 10.00 >>>> >>>> >>>> sda is my system disk (SAMSUNG MZILS480HEGR/007 GXL0), sdb and sdd >>>> are my OSDs >>>> >>>> would "osd op queue = wpq" help in this case ? >>>> Regards >>>> >>> >>> Your disk times look okay, just a lot more unbalanced than I would >>> expect. I'd give wpq a try, I use it all the time, just be sure to also >>> include the op_cutoff setting too or it doesn't have much effect. Let me >>> know how it goes. >>> >>> Robert LeBlanc >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >>> >> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
On Mon, Jun 10, 2019 at 1:00 AM BASSAGET Cédric < cedric.bassaget...@gmail.com> wrote: > Hello Robert, > My disks did not reach 100% on the last warning, they climb to 70-80% > usage. But I see rrqm / wrqm counters increasing...
>
> Device:  rrqm/s  wrqm/s      r/s      w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> sda         0.00    4.00     0.00    16.00       0.00    104.00     13.00      0.00   0.00     0.00     0.00   0.00   0.00
> sdb         0.00    2.00     1.00  3456.00       8.00  25996.00     15.04      5.76   1.67     0.00     1.67   0.03   9.20
> sdd         4.00    0.00 41462.00  1119.00  331272.00   7996.00     15.94     19.89   0.47     0.48     0.21   0.02  66.00
> dm-0        0.00    0.00  6825.00   503.00  330856.00   7996.00     92.48      4.00   0.55     0.56     0.30   0.09  66.80
> dm-1        0.00    0.00     1.00  1129.00       8.00  25996.00     46.02      1.03   0.91     0.00     0.91   0.09  10.00
>
> sda is my system disk (SAMSUNG MZILS480HEGR/007 GXL0), sdb and sdd are > my OSDs > > would "osd op queue = wpq" help in this case ? > Regards > Your disk times look okay, just a lot more unbalanced than I would expect. I'd give wpq a try, I use it all the time, just be sure to also include the op_cutoff setting or it doesn't have much effect. Let me know how it goes. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
With the low number of OSDs, you are probably saturating the disks. Check with `iostat -xd 2` and see what the utilization of your disks is. A lot of SSDs don't perform well with Ceph's heavy sync writes, and performance is terrible. If some of your drives are at 100% while others are at lower utilization, you can possibly get more performance and greatly reduce the blocked I/O with the WPQ scheduler. In ceph.conf, add this to the [osd] section and restart the processes:
osd op queue = wpq
osd op queue cut off = high
This has helped our clusters with fairness between OSDs and making backfills not so disruptive. -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Jun 6, 2019 at 1:43 AM BASSAGET Cédric wrote: > Hello, > > I see messages related to REQUEST_SLOW a few times per day. > > here's my ceph -s : > > root@ceph-pa2-1:/etc/ceph# ceph -s > cluster: > id: 72d94815-f057-4127-8914-448dfd25f5bc > health: HEALTH_OK > > services: > mon: 3 daemons, quorum ceph-pa2-1,ceph-pa2-2,ceph-pa2-3 > mgr: ceph-pa2-3(active), standbys: ceph-pa2-1, ceph-pa2-2 > osd: 6 osds: 6 up, 6 in > > data: > pools: 1 pools, 256 pgs > objects: 408.79k objects, 1.49TiB > usage: 4.44TiB used, 37.5TiB / 41.9TiB avail > pgs: 256 active+clean > > io: > client: 8.00KiB/s rd, 17.2MiB/s wr, 1 op/s rd, 546 op/s wr > > > Running ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) > luminous (stable) > > I've checked: > - all my network stack : OK ( 2*10G LAG ) > - memory usage : ok (256G on each host, about 2% used per osd) > - cpu usage : OK (Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz) > - disk status : OK (SAMSUNG AREA7680S5xnNTRI 3P04 => samsung DC series) > > I heard on IRC that it can be related to samsung PM / SM series. > > Is anybody here facing the same problem ? What can I do to solve that ? > Regards, > Cédric > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
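For reference, a sketch of that change with the option names spelled out as they appear in Luminous; osd.0 in the check is just an example, and the queue options are only read when the OSD starts:

```
# [osd] section of ceph.conf on each OSD host, then restart the ceph-osd daemons
osd op queue = wpq
osd op queue cut off = high

# confirm the running values on one OSD via the admin socket
ceph daemon osd.0 config get osd_op_queue
ceph daemon osd.0 config get osd_op_queue_cut_off
```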
Re: [ceph-users] performance in a small cluster
On Fri, May 24, 2019 at 6:26 AM Robert Sander wrote: > On 24.05.19 at 14:43, Paul Emmerich wrote: > > 20 MB/s at 4K blocks is ~5000 iops, that's 1250 IOPS per SSD (assuming > > replica 3). > > > > What we usually check in scenarios like these: > > > > * SSD model? Lots of cheap SSDs simply can't handle more than that > > The system has been newly created and is not busy at all. > > We tested a single SSD without OSD on top with fio: it can do 50K IOPS > read and 16K IOPS write. > You probably tested with async writes; try passing sync to fio, as that is much closer to what Ceph will do: it syncs every write to make sure it is written to disk before acknowledging back to the client that the write is done. When I did these tests, I also filled the entire drive and ran the test for an hour. Most drives looked fine with short tests or small amounts of data, but once the drive started getting full, the performance dropped off a cliff. Considering that Ceph is really hard on drives, it's good to test the extreme. Robert LeBlanc ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
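A sketch of the kind of sync-write fio run being suggested, assuming /dev/sdX is a scratch SSD whose contents can be destroyed; the block size, runtime, and device name are only examples:

```
# Destroys data on /dev/sdX -- only run against a disposable test device.
fio --name=sync-write --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=3600 --time_based --group_reporting
```

Comparing this against the same run without --sync=1, and letting it run long enough to fill the drive, usually shows the cliff described above.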
Re: [ceph-users] CephFS object mapping.
On Fri, May 24, 2019 at 2:14 AM Burkhard Linke < burkhard.li...@computational.bio.uni-giessen.de> wrote: > Hi, > On 5/22/19 5:53 PM, Robert LeBlanc wrote: > > When you say 'some', is there a fixed offset where the file data starts? Is the > first stripe just metadata? > > No, the first stripe contains the first 4 MB of a file by default. The > xattr and omap data are stored separately. > Ahh, so it must be in the XFS xattrs, that makes sense. For future posterity, I combined a couple of your commands to remove the temporary intermediate file for others who may run across this: rados -p <pool> getxattr <inode-hex>.00000000 parent | ceph-dencoder type inode_backtrace_t import - decode dump_json Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
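A small wrapper along those lines, assuming the data pool name is passed in and that the usual first-stripe object name of <inode-hex>.00000000 applies; the script name and arguments are made up for illustration:

```
#!/bin/bash
# backtrace.sh <data-pool> <path-on-cephfs>: print the decoded backtrace for a file
set -eu
pool="$1"; path="$2"
# stat prints the inode in decimal; the RADOS object name uses lowercase hex
obj="$(printf '%x.00000000' "$(stat -c %i "$path")")"
rados -p "$pool" getxattr "$obj" parent | ceph-dencoder type inode_backtrace_t import - decode dump_json
```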
Re: [ceph-users] Major ceph disaster
I'd say that if you can't find that object in Rados, then your assumption may be good. I haven't run into this problem before. Try doing a Rados get for that object and see if you get anything. I've done a Rados list grepping for the hex inode, but it took almost two days on our cluster that had half a billion objects. Your cluster may be faster. Sent from a mobile device, please excuse any typos. On Fri, May 24, 2019, 8:21 AM Kevin Flöh wrote: > ok this just gives me: > > error getting xattr ec31/10004dfce92./parent: (2) No such file or > directory > > Does this mean that the lost object isn't even a file that appears in the > ceph directory. Maybe a leftover of a file that has not been deleted > properly? It wouldn't be an issue to mark the object as lost in that case. > On 24.05.19 5:08 nachm., Robert LeBlanc wrote: > > You need to use the first stripe of the object as that is the only one > with the metadata. > > Try "rados -p ec31 getxattr 10004dfce92. parent" instead. > > Robert LeBlanc > > Sent from a mobile device, please excuse any typos. > > On Fri, May 24, 2019, 4:42 AM Kevin Flöh wrote: > >> Hi, >> >> we already tried "rados -p ec31 getxattr 10004dfce92.003d parent" but >> this is just hanging forever if we are looking for unfound objects. It >> works fine for all other objects. >> >> We also tried scanning the ceph directory with find -inum 1099593404050 >> (decimal of 10004dfce92) and found nothing. This is also working for non >> unfound objects. >> >> Is there another way to find the corresponding file? >> On 24.05.19 11:12 vorm., Burkhard Linke wrote: >> >> Hi, >> On 5/24/19 9:48 AM, Kevin Flöh wrote: >> >> We got the object ids of the missing objects with ceph pg 1.24c >> list_missing: >> >> { >> "offset": { >> "oid": "", >> "key": "", >> "snapid": 0, >> "hash": 0, >> "max": 0, >> "pool": -9223372036854775808, >> "namespace": "" >> }, >> "num_missing": 1, >> "num_unfound": 1, >> "objects": [ >> { >> "oid": { >> "oid": "10004dfce92.003d", >> "key": "", >> "snapid": -2, >> "hash": 90219084, >> "max": 0, >> "pool": 1, >> "namespace": "" >> }, >> "need": "46950'195355", >> "have": "0'0", >> "flags": "none", >> "locations": [ >> "36(3)", >> "61(2)" >> ] >> } >> ], >> "more": false >> } >> >> we want to give up those objects with: >> >> ceph pg 1.24c mark_unfound_lost revert >> >> But first we would like to know which file(s) is affected. Is there a way to >> map the object id to the corresponding file? >> >> >> The object name is composed of the file inode id and the chunk within the >> file. The first chunk has some metadata you can use to retrieve the >> filename. See the 'CephFS object mapping' thread on the mailing list for >> more information. >> >> >> Regards, >> >> Burkhard >> >> >> ___ >> ceph-users mailing >> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
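A sketch of that check, using the pool name from this thread; the object name is left as a placeholder since it is whatever list_missing reported:

```
OBJ='<oid reported by "ceph pg 1.24c list_missing">'

# Does the object exist at all? A "No such file or directory" here supports the leftover theory.
rados -p ec31 stat "$OBJ"
rados -p ec31 get "$OBJ" /tmp/stripe.bin

# Look for any sibling stripes of the same inode (slow: full pool listing, as noted above).
rados -p ec31 ls | grep "^${OBJ%%.*}\."
```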
Re: [ceph-users] Major ceph disaster
You need to use the first stripe of the object as that is the only one with the metadata. Try "rados -p ec31 getxattr 10004dfce92. parent" instead. Robert LeBlanc Sent from a mobile device, please excuse any typos. On Fri, May 24, 2019, 4:42 AM Kevin Flöh wrote: > Hi, > > we already tried "rados -p ec31 getxattr 10004dfce92.003d parent" but > this is just hanging forever if we are looking for unfound objects. It > works fine for all other objects. > > We also tried scanning the ceph directory with find -inum 1099593404050 > (decimal of 10004dfce92) and found nothing. This is also working for non > unfound objects. > > Is there another way to find the corresponding file? > On 24.05.19 11:12 vorm., Burkhard Linke wrote: > > Hi, > On 5/24/19 9:48 AM, Kevin Flöh wrote: > > We got the object ids of the missing objects with ceph pg 1.24c > list_missing: > > { > "offset": { > "oid": "", > "key": "", > "snapid": 0, > "hash": 0, > "max": 0, > "pool": -9223372036854775808, > "namespace": "" > }, > "num_missing": 1, > "num_unfound": 1, > "objects": [ > { > "oid": { > "oid": "10004dfce92.003d", > "key": "", > "snapid": -2, > "hash": 90219084, > "max": 0, > "pool": 1, > "namespace": "" > }, > "need": "46950'195355", > "have": "0'0", > "flags": "none", > "locations": [ > "36(3)", > "61(2)" > ] > } > ], > "more": false > } > > we want to give up those objects with: > > ceph pg 1.24c mark_unfound_lost revert > > But first we would like to know which file(s) is affected. Is there a way to > map the object id to the corresponding file? > > > The object name is composed of the file inode id and the chunk within the > file. The first chunk has some metadata you can use to retrieve the > filename. See the 'CephFS object mapping' thread on the mailing list for > more information. > > > Regards, > > Burkhard > > > ___ > ceph-users mailing > listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Major ceph disaster
On Wed, May 22, 2019 at 4:31 AM Kevin Flöh wrote: > Hi, > > thank you, it worked. The PGs are not incomplete anymore. Still we have > another problem, there are 7 PGs inconsistent and a ceph pg repair is > not doing anything. I just get "instructing pg 1.5dd on osd.24 to > repair" and nothing happens. Does somebody know how we can get the PGs > to repair? > > Regards, > > Kevin > Kevin, I just fixed an inconsistent PG yesterday. You will need to figure out why they are inconsistent. Do these steps and then we can figure out how to proceed. 1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of them.) 2. Print out the inconsistency report for each inconsistent PG. `rados list-inconsistent-obj <pgid> --format=json-pretty` 3. Look at the error messages and see whether all the shards have the same data. Robert LeBlanc ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
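The same three steps as a command sketch; 1.5dd is just the example PG from the message above:

```
# 1. Re-run a deep scrub on the PG.
ceph pg deep-scrub 1.5dd

# 2. Once the scrub finishes, dump the inconsistency report.
rados list-inconsistent-obj 1.5dd --format=json-pretty

# 3. If the report shows a single bad shard (size/digest mismatch), a repair is usually the next step.
ceph pg repair 1.5dd
```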
Re: [ceph-users] CephFS object mapping.
On Wed, May 22, 2019 at 12:22 AM Burkhard Linke < burkhard.li...@computational.bio.uni-giessen.de> wrote: > Hi, > > On 5/21/19 9:46 PM, Robert LeBlanc wrote: > > I'm at a new job working with Ceph again and am excited to back in the > > community! > > > > I can't find any documentation to support this, so please help me > > understand if I got this right. > > > > I've got a Jewel cluster with CephFS and we have an inconsistent PG. > > All copies of the object are zero size, but the digest says that it > > should be a non-zero size, so it seems that my two options are, delete > > the file that the object is part of, or rewrite the object with RADOS > > to update the digest. So, this leads to my question, how to I tell > > which file the object belongs to. > > > > From what I found, the object is prefixed with the hex value of the > > inode and suffixed by the stripe number: > > 1000d2ba15c.0005 > > . > > > > I then ran `find . -xdev -inum 1099732590940` and found a file on the > > CephFS file system. I just want to make sure that I found the right > > file before I start trying recovery options. > > > > The first stripe XYZ. has some metadata stored as xattr (rados > xattr, not cephfs xattr). One of the entries has the key 'parent': > When you say 'some' is it a fixed offset that the file data starts? Is the first stripe just metadata? > # ls Ubuntu16.04-WS2016-17.ova > Ubuntu16.04-WS2016-17.ova > > # ls -i Ubuntu16.04-WS2016-17.ova > 1099751898435 Ubuntu16.04-WS2016-17.ova > > # rados -p cephfs_test_data stat 1000e523d43. > cephfs_test_data/1000e523d43. mtime 2016-10-13 16:20:10.00, > size 4194304 > > # rados -p cephfs_test_data listxattr 1000e523d43. > layout > parent > > # rados -p cephfs_test_data getxattr 1000e523d43. parent | strings > Ubuntu16.04-WS2016-17.ova5: > adm2 > volumes > > > The complete path of the file is > /volumes/adm/Ubuntu16.04-WS2016-17.ova5. For a complete check you can > store the content of the parent key and use ceph-dencoder to print its > content: > > # rados -p cephfs_test_data getxattr 1000e523d43. parent > > parent.bin > > # ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json > { > "ino": 1099751898435, > "ancestors": [ > { > "dirino": 1099527190071, > "dname": "Ubuntu16.04-WS2016-17.ova", > "version": 14901 > }, > { > "dirino": 1099521974514, > "dname": "adm", > "version": 61190706 > }, > { > "dirino": 1, > "dname": "volumes", > "version": 48394885 > } > ], > "pool": 7, > "old_pools": [] > } > > > One important thing to note: ls -i prints the inode id in decimal, > cephfs uses hexadecimal for the rados object names. Thus the different > value in the above commands. > Thank you for this, this is much faster than doing a find for the inode (that took many hours, I let it run overnight and it found it some time. It took about 21 hours to search the whole filesystem.) Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
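The decimal/hex conversion mentioned at the end is easy to do in the shell; the inode number below is the one from the example above:

```
# ls -i prints the inode in decimal; RADOS object names use hex
printf '%x\n' 1099751898435      # -> 1000e523d43
# and back again
printf '%d\n' 0x1000e523d43      # -> 1099751898435
```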
[ceph-users] CephFS object mapping.
I'm at a new job working with Ceph again and am excited to be back in the community! I can't find any documentation to support this, so please help me understand if I got this right. I've got a Jewel cluster with CephFS and we have an inconsistent PG. All copies of the object are zero size, but the digest says that it should be a non-zero size, so it seems that my two options are to delete the file that the object is part of, or to rewrite the object with RADOS to update the digest. So, this leads to my question: how do I tell which file the object belongs to? From what I found, the object is prefixed with the hex value of the inode and suffixed by the stripe number: 1000d2ba15c.0005 . I then ran `find . -xdev -inum 1099732590940` and found a file on the CephFS file system. I just want to make sure that I found the right file before I start trying recovery options. Thank you, Robert LeBlanc -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] deep-scrubbing has large impact on performance
If you use wpq, I recommend also setting "osd_op_queue_cut_off = high" as well, otherwise replication OPs are not weighted and really reduces the benefit of wpq. -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Tue, Nov 22, 2016 at 5:34 AM, Eugen Block wrote: > Thank you! > > > Zitat von Nick Fisk : > >>> -Original Message- >>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >>> Eugen Block >>> Sent: 22 November 2016 10:11 >>> To: Nick Fisk >>> Cc: ceph-users@lists.ceph.com >>> Subject: Re: [ceph-users] deep-scrubbing has large impact on performance >>> >>> Thanks for the very quick answer! >>> >>> > If you are using Jewel >>> >>> We are still using Hammer (0.94.7), we wanted to upgrade to Jewel in a >>> couple of weeks, would you recommend to do it now? >> >> >> It's been fairly solid for me, but you might want to wait for the >> scrubbing hang bug to be fixed before upgrading. I think this >> might be fixed in the upcoming 10.2.4 release. >> >>> >>> >>> Zitat von Nick Fisk : >>> >>> >> -Original Message- >>> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >>> >> Of Eugen Block >>> >> Sent: 22 November 2016 09:55 >>> >> To: ceph-users@lists.ceph.com >>> >> Subject: [ceph-users] deep-scrubbing has large impact on performance >>> >> >>> >> Hi list, >>> >> >>> >> I've been searching the mail archive and the web for some help. I >>> >> tried the things I found, but I can't see the effects. We use >>> > Ceph for >>> >> our Openstack environment. >>> >> >>> >> When our cluster (2 pools, each 4092 PGs, in 20 OSDs on 4 nodes, 3 >>> >> MONs) starts deep-scrubbing, it's impossible to work with the VMs. >>> >> Currently, the deep-scrubs happen to start on Monday, which is >>> >> unfortunate. I already plan to start the next deep-scrub on >>> > Saturday, >>> >> so it has no impact on our work days. But if I imagine we had a large >>> >> multi-datacenter, such performance breaks are not >>> > reasonable. So >>> >> I'm wondering how do you guys manage that? >>> >> >>> >> What I've tried so far: >>> >> >>> >> ceph tell osd.* injectargs '--osd_scrub_sleep 0.1' >>> >> ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7' >>> >> ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle' >>> >> ceph tell osd.* injectargs '--osd_scrub_begin_hour 0' >>> >> ceph tell osd.* injectargs '--osd_scrub_end_hour 7' >>> >> >>> >> And I also added these options to the ceph.conf. >>> >> To be able to work again, I had to set the nodeep-scrub option and >>> >> unset it when I left the office. Today, I see the cluster deep- >>> >> scrubbing again, but only one PG at a time, it seems that now the >>> >> default for osd_max_scrubs is working now and I don't see major >>> >> impacts yet. >>> >> >>> >> But is there something else I can do to reduce the performance impact? >>> > >>> > If you are using Jewel, the scrubing is now done in the client IO >>> > thread, so those disk thread options won't do anything. Instead there >>> > is a new priority setting, which seems to work for me, along with a >>> > few other settings. >>> > >>> > osd_scrub_priority = 1 >>> > osd_scrub_sleep = .1 >>> > osd_scrub_chunk_min = 1 >>> > osd_scrub_chunk_max = 5 >>> > osd_scrub_load_threshold = 5 >>> > >>> > Also enabling the weighted priority queue can assist the new priority >>> > options >>> > >>> > osd_op_queue = wpq >>> > >>> > >>> >> I just found [1] and will have a look into it. 
>>> >> >>> >> [1] http://prob6.com/en/ceph-pg-deep-scrub-cron/ >>> >> >>> >> Thanks! >>> >> Eugen >>> >> >>> >> -- >>> >> Eugen Block voice : +49-40-559 51 75 >>>
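For anyone collecting the knobs from the thread above into one place, a sketch of applying them at runtime and then persisting them; the hours and values are just the ones discussed here, and osd_scrub_priority only exists on Jewel and later:

```
# Runtime (lost when the OSDs restart)
ceph tell osd.* injectargs '--osd_scrub_begin_hour 23 --osd_scrub_end_hour 7 --osd_scrub_sleep 0.1'
ceph tell osd.* injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 5 --osd_scrub_priority 1'

# Persistent, in the [osd] section of ceph.conf
# osd scrub begin hour = 23
# osd scrub end hour = 7
# osd scrub sleep = 0.1
# osd op queue = wpq
# osd op queue cut off = high
```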
Re: [ceph-users] Blocked ops, OSD consuming memory, hammer
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 I've seen something similar to this when bringing an OSD back into a cluster that has a lot of I/O that is "close" to the max performance of the drives. For Jewell, there is a "mon osd prime pg temp" [0] which really helped reduce the huge memory usage when an OSD starts up and helped a bit with the slow/blocked I/O too. I created a backport for Hammer that I didn't have problems with, but was rejected to prevent adding new features to Hammer. You could patch Hammer and you only have to run the new code on the monitors to get the benefit.[1] [0] http://docs.ceph.com/docs/jewel/rados/configuration/mon-config-ref/#miscellaneous [1] https://github.com/ceph/ceph/pull/7848 -BEGIN PGP SIGNATURE- Version: Mailvelope v1.4.0 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJXRxdoCRDmVDuy+mK58QAAIhYQAJuJ4k7OEOwkAoGnILar M+etyX5nbacWSBxX/NhH7pD++Nmu1JxHa1KM1ymytzEfIwhBenCb/4exdkaQ KpQQREB2STDSCWXvutoAfhc3YsqL0XY/XH2gRMX+crK2NXQoRsEQVzBgWVYh uIZ+wJ2EjzML9nZX5t4Qcxf2o7Z130/FIwcAAx2IkIRex3PsgCWy9t6sVSZB 5zytRQECL+bwa04/Oy+xQqMhekJyLiYkKk0m3c6HI10LOtkoVO/iSj723jMs 5AWazaJOP8A8P5UyzXrDIuM2mcia4yws1INke8r8fcLFtll+rDLIs6icSzsQ aMJHpHu2HNSv+EfqAmX7LpH/ebxcx6CtS51fW2BuQWCszzmeSwbNkx/8VVSS VKgL4ARy1596sdtQVwXuPAQqmV65Cw9K/gP5E/LtISC3tnQM/bugZhGfZNz7 X3ZYy3ujYhrswMsKVw+5i1dPolqxZerIt6rq2r56JK5ZqjNBi5EhUDGsfqiZ LT84jcAhazI+inrIF/O4bg8ili1uNeqhcyNQWnrFawyt3C5MzOD8GSHLgA3A D1IR5I2hpwO3TMqzED8+eQ/Qgd1qF1zMaAkja95aC7mxzfXTsxQj68iIAKUp 47Nwaz4ln2A5f20SQe3W4jxp33MKsAJYej2/xn/B0roxH7ZTAhXlcpYhU8Ni s5aw =X+rk -END PGP SIGNATURE- ---- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Tue, May 24, 2016 at 3:16 PM, Heath Albritton wrote: > Having some problems with my cluster. Wondering if I could get some > troubleshooting tips: > > Running hammer 0.94.5. Small cluster with cache tiering. 3 spinning > nodes and 3 SSD nodes. > > Lots of blocked ops. OSDs are consuming the entirety of the system > memory (128GB) and then falling over. Lots of blocked ops, slow > requests. Seeing logs like this: > > 2016-05-24 19:30:09.288941 7f63c126b700 1 heartbeat_map is_healthy > 'FileStore::op_tp thread 0x7f63cb3cd700' had timed out after 60 > 2016-05-24 19:30:09.503712 7f63c5273700 0 log_channel(cluster) log > [WRN] : map e7779 wrongly marked me down > 2016-05-24 19:30:11.190178 7f63cabcc700 0 -- > 10.164.245.22:6831/5013886 submit_message MOSDPGPushReply(9.10d 7762 > [PushReplyOp(3110010d/rbd_data.9647882ae8944a.26e7/head//9)]) > v2 remote, 10.164.245.23:6821/3028423, failed lossy con, dropping > message 0xfc21e00 > 2016-05-24 19:30:22.832381 7f63bca62700 -1 osd.23 7780 > lsb_release_parse - failed to call lsb_release binary with error: (12) > Cannot allocate memory > > Eventually the OSD fails. Cluster is in an unhealthy state. > > I can set noup, restart the OSDs and get them on the current map, but > once I put them back into the cluster, they eventually fail. > > > -H > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
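A sketch of what enabling that option looks like, assuming monitors running Jewel or carrying the backported patch; it is a mon-side setting:

```
# [mon] section of ceph.conf on the monitor hosts
mon osd prime pg temp = true

# or injected at runtime (restarting the mons is the safer way to be sure it sticks)
ceph tell mon.* injectargs '--mon_osd_prime_pg_temp true'
```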
Re: [ceph-users] mark out vs crush weight 0
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Check out the Weighted Priority Queue option in Jewel, this really helped reduce the impact of recovery and backfill on client traffic with my testing. I think it really addresses a lot of the pain points you mention. -BEGIN PGP SIGNATURE- Version: Mailvelope v1.4.0 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJXQybACRDmVDuy+mK58QAANFMP/iFHpRFx8Xiik5axDSZl zKjUQGUetuGzh6hu/y1+RNtrbUaC+Gg6L4A3ivT5f7CUsCcnOquQ/bBxQMe5 ve5M8XrEREPlBOzcQS+IIFK66bN8OC1Q/Rf1OzCFpWJmoMumbcBxrGV5KV8l 5m/GrOjmtxJzH/olaAzktOMAm3mTpWyL7KIPjUiBXvPi4EnyifIV3Hqc55TX 6/oz7vX7U9cg+JouVvnDAkLcb5C/hxNRNCGKO7Vxk0usuvYbvsbmRbQddAFt 6z6dJ9SFiPpys50WR8vpmsabqFEwKBAZSCemv/LdeAp+moLhFAydVD46LRsP NUNj23NuB5lDJKt444Y97/udDgnwJM4uq/8fHfTGMdptkzDsfdbOxDG4SPqd m7/bOJJET0UByCgtNuU0dUq0Rme0iidrH/9gZt6Y2w0jY4VSvPmkuP+GSIfj Boc2EIw39SoyaNgC/m5WvEru5trsH+vE7RcJpStLzwv+3MejQPzr9UDay/k4 7gxrNrB7YJ7YIX5i2yGYfE+tNVNUD4nGBgPCcBY7yDAzvbBKM5HzZSxWfYv6 JULq+EVc592gGjUx8BI+vJnckV3yGABCrVdUda2xxYwjMkIHbnoQtL7yi3DL W7Y5Z5iIDGSSpDcMOIEzSCiABzuKJHQC+EPf1NHGbEtK7ZGFPqmVx98eREgO oyjl =U0LK -END PGP SIGNATURE- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, May 19, 2016 at 5:26 AM, Oliver Dzombic wrote: > Hi, > > a sparedisk is a nice idea. > > But i think thats something you can also do with a shellscript. > > Checking if an osd is down or out and just using your spare disk. > > Maybe the programming ressources should not be used for something most > of us can do with a simple shell script checking every 5 seconds the > situation. > > > > Maybe better idea ( in my humble opinion ) is to solve this stuff by > optimizing the code in recovery situations. > > Currently we have things like > > client-op-priority, > recovery-op-priority, > max-backfills, > recovery-max-active and so on > > to limit the performance impact in a recovery situation. > > And still in a situation of recovery the performance go downhill ( a lot > ) when all OSD's start to refill the to_be_recovered OSD. > > In my case, i was removing old HDD's from a cluster. > > If i down/out them ( 6 TB drives 40-50% full ) the cluster's performance > will go down very dramatically. So i had to reduce the weight by 0.1 > steps to ease this pain, but could not remove it completely. > > > So i think the tools / code to protect the cluster's performance ( even > in recovery situation ) can be improved. > > Of course, on one hand, we want to make sure, that asap the configured > amount of replica's and this way, datasecurity is restored. > > But on the other hand, it does not help too much if the recovery > proceedure will impact the cluster's performance on a level where the > useability is too much reduced. > > So maybe introcude another config option to controle this ratio ? > > To control more effectively how much IOPS/Bandwidth is used ( maybe > streight in numbers in form of an IO ratelimit ) so that administrator's > have the chance to config, according to the hardware environment, the > "perfect" settings for their individual usecase. > > > Because, right now, when i reduce the weight of a 6 TB HDD, while having > ~ 30 OSD's in the cluster, from 1.0 to 0.9, around 3-5% of data will be > moved around the cluster ( replication 2 ). > > While its moving, there is a true performance hit on the virtual servers. > > So if this could be solved, by a IOPS/HDD Bandwidth rate limit, that i > can simply tell the cluster to use max. 
10 IOPS and/or 10 MB/s for the > recovery, then i think it would be a great help for any usecase and > administrator. > > Thanks ! > > > -- > Mit freundlichen Gruessen / Best regards > > Oliver Dzombic > IP-Interactive > > mailto:i...@ip-interactive.de > > Anschrift: > > IP Interactive UG ( haftungsbeschraenkt ) > Zum Sonnenberg 1-3 > 63571 Gelnhausen > > HRB 93402 beim Amtsgericht Hanau > Geschäftsführung: Oliver Dzombic > > Steuer Nr.: 35 236 3622 1 > UST ID: DE274086107 > > > Am 19.05.2016 um 04:57 schrieb Christian Balzer: > > > > Hello Sage, > > > > On Wed, 18 May 2016 17:23:00 -0400 (EDT) Sage Weil wrote: > > > >> Currently, after an OSD has been down for 5 minutes, we mark the OSD > >> "out", whic redistributes the data to other OSDs in the cluster. If the > >> OSD comes back up, it marks the OSD back in (with the same reweight > >> value, usually 1.0). > >> > >> The good thing about marking OSDs out is that exactly the amount of data > >> on the OSD moves. (Well, pretty close.) It is uniformly distributed > >> across al
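A sketch of the throttling-plus-stepped-drain approach described above; osd.12 and the weights are only examples:

```
# Keep recovery/backfill gentle while draining.
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'

# Lower the CRUSH weight of the outgoing 6 TB OSD in ~0.1 steps, waiting for the cluster
# to settle (ceph -s) between steps, then mark it out and remove it once it reaches 0.
ceph osd crush reweight osd.12 5.36
ceph osd crush reweight osd.12 5.26
```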
Re: [ceph-users] Ceph InfiniBand Cluster - Jewel - Performance
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Ceph is not able to use native Infiniband protocols yet and so it is only leveraging IPoIB at the moment. The most likely reason you are only getting ~10 Gb performance is that IPoIB heavily leverages multicast in Infiniband (if you do so research in this area you will understand why unicast IP still uses multicast on an Inifiniband network). To be extremely compatible with all adapters, the subnet manager will set the speed of multicast to 10 Gb/s so that SDR adapters can be used and not drop packets. If you know that you will never have adapters under a certain speed, you can configure the subnet manager to use a higher speed. This does not change IPoIB networks that are already configured (I had to down all the IPoIB adapter at the same time and bring them back up to upgrade the speed). Even after that, there still wasn't similar performance to native Infiniband, but I got at least a 2x improvement (along with setting the MTU to 64K) on the FDR adapters. There is still a ton of overhead for doing IPoIB so it is not an ideal transport to get performance on Infiniband, I think of it as a compatibility feature. Hopefully, that will give you enough information to perform the research. If you search the OFED mailing list, you will see some posts from me 2-3 years ago regarding this very topic. Good luck and keep holding out for Ceph with XIO. -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.6 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJXBrtICRDmVDuy+mK58QAAqVkP/2hpe93FYIbQtpV4Qta4 9Fohqf478kVPX/v6XkAYOlAFFAISxfbDdm0FxOjbGSEOMKGNs/oSaFRCsqb9 +T5dfMUHyhY51wyaNeVF3k3zgvGpNUO1xEQ1IenUquZp9825VRBze5/T6r8Z PMFySNtuHBp8AhARisPJcXqKv/Vowfy/LqyvlL6ytIHfwqsVHngbtVN7L/HX vzMZM93cLwwV44v2bT8t63U76GKyQpbksDx02CktMIFzNbfApsiMaA1dyx1O 9HEgirtddMO358f+1DN/OjNc/Z3zECILaw3tq/HUWJyBJqO95uBw++znIacb UKwqJ1HmUeDvdqY72ZQa2fQT7ayMMlPPwzoVtdQGMZnSaAjn8MlunDFCrdLw +JPT+kt0qnjzs9qK0zEp5drfUwnV5BXS4hZhKUvuxWmVjUv1EfJrIFCszSFO 2be/xLxqBTpCEcHL9fsc16P7HsrdBW8GDy3X5PC2sOl/2DSes4y2TpCfr7w9 V8Mhs7mmkEQtwcvyaYQ0bx0Bs3o4cvTTeYbJUpLWEgMmGAEBZbf7Sx+y3dIp jUHb2jPEchBb83BGeLvAkCTfouq/J3pzQK96gA2Kh/KOlVJTpFdKUU5x+wpM ACqD+S/AFkgnfGm4fcgBexhro7GImiO6VIaOdxvTSdQbSsaoKckZOxFhVWih XyBJ =EF9A -END PGP SIGNATURE- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Apr 7, 2016 at 1:43 PM, German Anders wrote: > Hi Cephers, > > I've setup a production environment Ceph cluster with the Jewel release > (10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)) consisting of 3 MON > Servers and 6 OSD Servers: > > 3x MON Servers: > 2x Intel Xeon E5-2630v3@2.40Ghz > 384GB RAM > 2x 200G Intel DC3700 in RAID-1 for OS > 1x InfiniBand ConnectX-3 ADPT DP > > 6x OSD Servers: > 2x Intel Xeon E5-2650v2@2.60Ghz > 128GB RAM > 2x 200G Intel DC3700 in RAID-1 for OS > 12x 800G Intel DC3510 (osd & journal) on same device > 1x InfiniBand ConnectX-3 ADPT DP (one port on PUB network and the other on > the CLUS network) > > ceph.conf file is: > > [global] > fsid = xxx > mon_initial_members = cibm01, cibm02, cibm03 > mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3 > auth_cluster_required = cephx > auth_service_required = cephx > auth_client_required = cephx > filestore_xattr_use_omap = true > public_network = xx.xx.16.0/20 > cluster_network = xx.xx.32.0/20 > > [mon] > > [mon.cibm01] > host = cibm01 > mon_addr = xx.xx.xx.1:6789 > > [mon.cibm02] > host = cibm02 > mon_addr = xx.xx.xx.2:6789 > > [mon.cibm03] > host = cibm03 > mon_addr = xx.xx.xx.3:6789 > > [osd] > osd_pool_default_size = 2 > 
osd_pool_default_min_size = 1 > > ## OSD Configuration ## > [osd.0] > host = cibn01 > public_addr = xx.xx.17.1 > cluster_addr = xx.xx.32.1 > > [osd.1] > host = cibn01 > public_addr = xx.xx.17.1 > cluster_addr = xx.xx.32.1 > > ... > > > > They are all running Ubuntu 14.04.4 LTS. Journals are 5GB partitions on each > disk, since all the OSD daemons are SSD disks (Intel DC3510 800G). For > example: > > sdc 8:32 0 745.2G 0 disk > |-sdc1 8:33 0 740.2G 0 part > /var/lib/ceph/osd/ceph-0 > `-sdc2 8:34 0 5G 0 part > > The purpose of this cluster will be to serve as a backend storage for Cinder > volumes (RBD) and Glance images in an OpenStack cloud, most of the clusters > on OpenStack will be non-relational databases like Cassandra with many > instances each. > > All of the nodes of the cluster are running InfiniBand FDR 56Gb/s with > Mellanox Technologies MT27500 Family [ConnectX-3] adapters. > > > So I assume that performance will be really nice, right?...but.. I'm getting > some numbers that I
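On the IPoIB points above, a sketch of where those knobs live; the rate encoding and interface name are assumptions to verify against the opensm documentation for your version:

```
# /etc/opensm/partitions.conf on the subnet manager node: raise the default
# partition's MTU and multicast rate (mtu=5 is a 4K MTU; rate=7 is 40 Gb/s in the
# mapping I recall), then restart opensm and bounce all IPoIB interfaces together.
#   Default=0x7fff, ipoib, mtu=5, rate=7 : ALL=full;

# Per-host IPoIB tuning: connected mode and a large MTU.
echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520
```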
Re: [ceph-users] data corruption with hammer
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 I'm having trouble finding documentation about using ceph_test_rados. Can I run this on the existing cluster and will that provide useful info? It seems running it in the build will not have the caching set up (vstart.sh). I have accepted a job with another company and only have until Wednesday to help with getting information about this bug. My new job will not be using Ceph, so I won't be able to provide any additional info after Tuesday. I want to leave the company on a good trajectory for upgrading, so any input you can provide will be helpful. I've found: ./ceph_test_rados --op read 100 --op write 100 --op delete 50 - --max-ops 40 --objects 1024 --max-in-flight 64 --size 400 - --min-stride-size 40 --max-stride-size 80 --max-seconds 600 - --op copy_from 50 --op snap_create 50 --op snap_remove 50 --op rollback 50 --op setattr 25 --op rmattr 25 --pool unique_pool_0 Is that enough if I change --pool to the cached pool and do the toggling while ceph_test_rados is running? I think this will run for 10 minutes. Thanks, -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.6 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJW6tjwCRDmVDuy+mK58QAANKgP/ia5TA/7kTUpmciVR2BW t0MrilXAIvdikHlaWTVIxEmb4S8X+57hziEZUd6hLBMnKnuUQxsDb3yyuZX4 iqaE8KBXDjMFjHnhTOFf7eB2JIjM1WkZxmlA23yBRMNtvlBArbwxYYnAyTXt /fW1QmgLZIvuql1y01TdRot/owqJ3B2Ah896lySrltWj626R+1rhTLVDWYr6 EKa1mf8BiRBeGpjEVhN6Vihb7T1IzHtCi1E6+mlSqhWGNf8AeZh8IKUT0tbm C/JiUVGmG8/t7WFzCiQWd1w8UdkdCzms7k662CsSLIpbjNo4ouwEkpb5sZLP ELgWxo8hvad7USqSXvXqJNzmoenUwQwdUvSjYbNk+4D+8eHqptlNXDmDfpiE pN7dp8wbJ+yICxMPLuUe/Iqzp6rRnjPwam/CiDZu52N1ncH3X1X4u0cuAD0Z dFjEfdAZJAJ+fqvts2zVvtOwq/q41eTuV3ZRSn5ubA6iAeKnxMtPoEcuozEp Su1Iud2fYdma5w8MFStjp1BAV3osg1WgIM6KYzsSZI1BkCQAqU58ROZ0ZsMb D05/AEK/A6fp0ROXUczhXDcXlXcGEWyJm1QEtg7cSu3C+9qu5qvQQxyrrwbZ MK8C5lhVb44sqSVcSIZ+KCrPC+x8UKodDQZCz6O6NrJjZLn2g06583cMFWK8 qLo+ =qgB7 -END PGP SIGNATURE- -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Mar 17, 2016 at 8:19 AM, Sage Weil wrote: > On Thu, 17 Mar 2016, Robert LeBlanc wrote: > > We are trying to figure out how to use rados bench to reproduce. Ceph > > itself doesn't seem to think there is any corruption, but when you do a > > verify inside the RBD, there is. Can rados bench verify the objects after > > they are written? It also seems to be primarily the filesystem metadata > > that is corrupted. If we fsck the volume, there is missing data (put into > > lost+found), but if it is there it is primarily OK. There only seems to > be > > a few cases where a file's contents are corrupted. I would suspect on an > > object boundary. We would have to look at blockinfo to map that out and > see > > if that is what is happening. > > 'rados bench' doesn't do validation. ceph_test_rados does, though--if you > can reproduce with that workload then it should be pretty easy to track > down. > > Thanks! > sage > > > > We stopped all the IO and did put the tier in writeback mode with recency > > 1, set the recency to 2 and started the test and there was corruption, > so > > it doesn't seem to be limited to changing the mode. I don't know how that > > patch could cause the issue either. Unless there is a bug that reads from > > the back tier, but writes to cache tier, then the object gets promoted > > wiping that last write, but then it seems like it should not be as much > > corruption since the metadata should be in the cache pretty quick. 
We > > usually evited the cache before each try so we should not be evicting on > > writeback. > > > > Sent from a mobile device, please excuse any typos. > > On Mar 17, 2016 6:26 AM, "Sage Weil" wrote: > > > > > On Thu, 17 Mar 2016, Nick Fisk wrote: > > > > There is got to be something else going on here. All that PR does is > to > > > > potentially delay the promotion to hit_set_period*recency instead of > > > > just doing it on the 2nd read regardless, it's got to be uncovering > > > > another bug. > > > > > > > > Do you see the same problem if the cache is in writeback mode before > you > > > > start the unpacking. Ie is it the switching mid operation which > causes > > > > the problem? If it only happens mid operation, does it still occur if > > > > you pause IO when you make the switch? > > > > > > > > Do you also see this if you perform on a RBD mount, to rule out any > > > > librbd/qemu weirdness? > > > > > > > > Do you know if it’s the actual data that is getting corrupted or if > it's > > &g
Re: [ceph-users] data corruption with hammer
Yep, let me pull and build that branch. I tried installing the dbg packages and running it in gdb, but it didn't load the symbols. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Mar 17, 2016 at 11:36 AM, Sage Weil wrote: > On Thu, 17 Mar 2016, Robert LeBlanc wrote: >> Also, is this ceph_test_rados rewriting objects quickly? I think that >> the issue is with rewriting objects so if we can tailor the >> ceph_test_rados to do that, it might be easier to reproduce. > > It's doing lots of overwrites, yeah. > > I was albe to reproduce--thanks! It looks like it's specific to > hammer. The code was rewritten for jewel so it doesn't affect the > latest. The problem is that maybe_handle_cache may proxy the read and > also still try to handle the same request locally (if it doesn't trigger a > promote). > > Here's my proposed fix: > > https://github.com/ceph/ceph/pull/8187 > > Do you mind testing this branch? > > It doesn't appear to be directly related to flipping between writeback and > forward, although it may be that we are seeing two unrelated issues. I > seemed to be able to trigger it more easily when I flipped modes, but the > bug itself was a simple issue in the writeback mode logic. :/ > > Anyway, please see if this fixes it for you (esp with the RBD workload). > > Thanks! > sage > > > > >> -------- >> Robert LeBlanc >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >> >> >> On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc >> wrote: >> > I'll miss the Ceph community as well. There was a few things I really >> > wanted to work in with Ceph. >> > >> > I got this: >> > >> > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028) >> > dirty exists >> > 1038: left oid 13 (ObjNum 1028 snap 0 seq_num 1028) >> > 1040: finishing write tid 1 to nodez23350-256 >> > 1040: finishing write tid 2 to nodez23350-256 >> > 1040: finishing write tid 3 to nodez23350-256 >> > 1040: finishing write tid 4 to nodez23350-256 >> > 1040: finishing write tid 6 to nodez23350-256 >> > 1035: done (4 left) >> > 1037: done (3 left) >> > 1038: done (2 left) >> > 1043: read oid 430 snap -1 >> > 1043: expect (ObjNum 429 snap 0 seq_num 429) >> > 1040: finishing write tid 7 to nodez23350-256 >> > update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029) >> > dirty exists >> > 1040: left oid 256 (ObjNum 1029 snap 0 seq_num 1029) >> > 1042: expect (ObjNum 664 snap 0 seq_num 664) >> > 1043: Error: oid 430 read returned error code -2 >> > ./test/osd/RadosModel.h: In function 'virtual void >> > ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time >> > 2016-03-17 10:47:19.085414 >> > ./test/osd/RadosModel.h: 1109: FAILED assert(0) >> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) >> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> > const*)+0x76) [0x4db956] >> > 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c] >> > 3: (()+0x9791d) [0x7fa1d472191d] >> > 4: (()+0x72519) [0x7fa1d46fc519] >> > 5: (()+0x13c178) [0x7fa1d47c6178] >> > 6: (()+0x80a4) [0x7fa1d425a0a4] >> > 7: (clone()+0x6d) [0x7fa1d2bd504d] >> > NOTE: a copy of the executable, or `objdump -rdS ` is >> > needed to interpret this. >> > terminate called after throwing an instance of 'ceph::FailedAssertion' >> > Aborted >> > >> > I had to toggle writeback/forward and min_read_recency_for_promote a >> > few times to get it, but I don't know if it is because I only have one >> > job running. 
Even with six jobs running, it is not easy to trigger >> > with ceph_test_rados, but it is very instant in the RBD VMs. >> > >> > Here are the six run crashes (I have about the last 2000 lines of each >> > if needed): >> > >> > nodev: >> > update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num >> > 1014) dirty exists >> > 1015: left oid 1015 (ObjNum 1014 snap 0 seq_num 1014) >> > 1016: finishing write tid 1 to nodev21799-1016 >> > 1016: finishing write tid 2 to nodev21799-1016 >> > 1016: finishing write tid 3 to nodev21799-1016 >> > 1016: finishing write tid 4 to nodev21799-1016 >> > 1016: finishing write tid 6 to nodev21799-1016 >> > 1016: finish
Re: [ceph-users] data corruption with hammer
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Possible, it looks like all the messages comes from a test suite. Is there some logging that would expose this or an assert that could be added? We are about ready to do some testing in our lab to see if we can replicate it and workaround the issue. I also can't tell which version introduced this in Hammer, it doesn't look like it has been resolved. Thanks, -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.6 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJW6bqRCRDmVDuy+mK58QAANTsP/1jceRh9zYDlm2rkVq3e F6UKgezyCWV7h1cou8/rSVkxOfyyWEDSy1nMPBTHCtfMuOHzlx9VZftmPCiY BmxbclpUhAbAbjMb/E7t0jFR7fAZylX4okjUTN1y7NII+6xMXyxb51drYrZv AJzNcXfWYL1+y0Mz/QqOgEyij27OF8vYpSTJqXFDUcXtZNPfyvTjJ1ttYtuR saFJJ6SrFXA5LliGBNQK+pTDq0ZF0Bn0soE73rpzwpQvIdiOf/Jg7hAbERCc Vqjhg34YVLdpGd8W7IvaT0RirYbz8SmRdwOw1IIkBcqe0r9Mt08OgKu5NPT3 Rm0MKYynE1E7nKgutPisJQidT9QuaSVuY40oRDBIlrFA1BxNjGjwFxZn7y8r WyNMHKqB9Y+78uWdtEZtGfiSwyxC2UZTQFI4+eLs/XOoRLWv9oxRYV55Co0W e8zPW0nL1pm9iD9J+3fCRlNEL+cyDjsLLmW005BkF2q7da1XgxkoNndUBTlM Az9RGHoCELfI6kle315/2BEGfE2aRokLngbyhQWKAWmrdTCTDZaJwDKIi4hb 69LGT2eHofTWB5KgMHoCFLUSy2lYa86GxLLsBvPuqOfAXPWHMZERGv94qH/E CppgbnchgRHuI68rNM6nFYPJa4C3MlyQhu2WmOialAGgQi+IQP/g6h70e0RR eqLX =DcjE -END PGP SIGNATURE- -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, Mar 16, 2016 at 1:40 PM, Gregory Farnum wrote: > This tracker ticket happened to go by my eyes today: > http://tracker.ceph.com/issues/12814 . There isn't a lot of detail > there but the headline matches. > -Greg > > On Wed, Mar 16, 2016 at 2:02 AM, Nick Fisk wrote: > > > > > >> -Original Message- > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > Of > >> Christian Balzer > >> Sent: 16 March 2016 07:08 > >> To: Robert LeBlanc > >> Cc: Robert LeBlanc ; ceph-users >> us...@lists.ceph.com>; William Perkins > >> Subject: Re: [ceph-users] data corruption with hammer > >> > >> > >> Hello Robert, > >> > >> On Tue, 15 Mar 2016 10:54:20 -0600 Robert LeBlanc wrote: > >> > >> > -BEGIN PGP SIGNED MESSAGE- > >> > Hash: SHA256 > >> > > >> > There are no monitors on the new node. > >> > > >> So one less possible source of confusion. > >> > >> > It doesn't look like there has been any new corruption since we > >> > stopped changing the cache modes. Upon closer inspection, some files > >> > have been changed such that binary files are now ASCII files and visa > >> > versa. These are readable ASCII files and are things like PHP or > >> > script files. Or C files where ASCII files should be. > >> > > >> What would be most interesting is if the objects containing those > > corrupted > >> files did reside on the new OSDs (primary PG) or the old ones, or both. > >> > >> Also, what cache mode was the cluster in before the first switch > > (writeback I > >> presume from the timeline) and which one is it in now? > >> > >> > I've seen this type of corruption before when a SAN node misbehaved > >> > and both controllers were writing concurrently to the backend disks. > >> > The volume was only mounted by one host, but the writes were split > >> > between the controllers when it should have been active/passive. > >> > > >> > We have killed off the OSDs on the new node as a precaution and will > >> > try to replicate this in our lab. > >> > > >> > I suspicion is that is has to do with the cache promotion code update, > >> > but I'm not sure how it would have caused this. 
> >> > > >> While blissfully unaware of the code, I have a hard time imagining how > it > >> would cause that as well. > >> Potentially a regression in the code that only triggers in one cache > mode > > and > >> when wanting to promote something? > >> > >> Or if it is actually the switching action, not correctly promoting > things > > as it > >> happens? > >> And thus referencing a stale object? > > > > I can't think of any other reason why the recency would break things in > any > > other way. Can the OP confirm what recency setting is being used? > > > > When you switch to writeback, if you haven't reached the required recency > > yet, all reads will be proxied, previous behaviour would have pretty much > > promoted all the time regardless. So unless something is happening where > &g
Re: [ceph-users] data corruption with hammer
We are trying to figure out how to use rados bench to reproduce. Ceph itself doesn't seem to think there is any corruption, but when you do a verify inside the RBD, there is. Can rados bench verify the objects after they are written? It also seems to be primarily the filesystem metadata that is corrupted. If we fsck the volume, there is missing data (put into lost+found), but if it is there it is primarily OK. There only seems to be a few cases where a file's contents are corrupted. I would suspect on an object boundary. We would have to look at blockinfo to map that out and see if that is what is happening. We stopped all the IO and did put the tier in writeback mode with recency 1, set the recency to 2 and started the test and there was corruption, so it doesn't seem to be limited to changing the mode. I don't know how that patch could cause the issue either. Unless there is a bug that reads from the back tier, but writes to cache tier, then the object gets promoted wiping that last write, but then it seems like it should not be as much corruption since the metadata should be in the cache pretty quick. We usually evited the cache before each try so we should not be evicting on writeback. Sent from a mobile device, please excuse any typos. On Mar 17, 2016 6:26 AM, "Sage Weil" wrote: > On Thu, 17 Mar 2016, Nick Fisk wrote: > > There is got to be something else going on here. All that PR does is to > > potentially delay the promotion to hit_set_period*recency instead of > > just doing it on the 2nd read regardless, it's got to be uncovering > > another bug. > > > > Do you see the same problem if the cache is in writeback mode before you > > start the unpacking. Ie is it the switching mid operation which causes > > the problem? If it only happens mid operation, does it still occur if > > you pause IO when you make the switch? > > > > Do you also see this if you perform on a RBD mount, to rule out any > > librbd/qemu weirdness? > > > > Do you know if it’s the actual data that is getting corrupted or if it's > > the FS metadata? I'm only wondering as unpacking should really only be > > writing to each object a couple of times, whereas FS metadata could > > potentially be being updated+read back lots of times for the same group > > of objects and ordering is very important. > > > > Thinking through it logically the only difference is that with recency=1 > > the object will be copied up to the cache tier, where recency=6 it will > > be proxy read for a long time. If I had to guess I would say the issue > > would lie somewhere in the proxy read + writeback<->forward logic. > > That seems reasonable. Was switching from writeback -> forward always > part of the sequence that resulted in corruption? Not that there is a > known ordering issue when switching to forward mode. I wouldn't really > expect it to bite real users but it's possible.. > > http://tracker.ceph.com/issues/12814 > > I've opened a ticket to track this: > > http://tracker.ceph.com/issues/15171 > > What would be *really* great is if you could reproduce this with a > ceph_test_rados workload (from ceph-tests). I.e., get ceph_test_rados > running, and then find the sequence of operations that are sufficient to > trigger a failure. 
> > sage > > > > > > > > > > > > -Original Message- > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > Of > > > Mike Lovell > > > Sent: 16 March 2016 23:23 > > > To: ceph-users ; sw...@redhat.com > > > Cc: Robert LeBlanc ; William Perkins > > > > > > Subject: Re: [ceph-users] data corruption with hammer > > > > > > just got done with a test against a build of 0.94.6 minus the two > commits that > > > were backported in PR 7207. everything worked as it should with the > cache- > > > mode set to writeback and the min_read_recency_for_promote set to 2. > > > assuming it works properly on master, there must be a commit that we're > > > missing on the backport to support this properly. > > > > > > sage, > > > i'm adding you to the recipients on this so hopefully you see it. the > tl;dr > > > version is that the backport of the cache recency fix to hammer > doesn't work > > > right and potentially corrupts data when > > > the min_read_recency_for_promote is set to greater than 1. > > > > > > mike > > > > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell > > > wrote: > > > robert and i have done some further investigation the
Re: [ceph-users] data corruption with hammer
Cherry-picking that commit onto v0.94.6 wasn't clean so I'm just building your branch. I'm not sure what the difference between your branch and 0.94.6 is, I don't see any commits against osd/ReplicatedPG.cc in the last 5 months other than the one you did today. -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Mar 17, 2016 at 11:38 AM, Robert LeBlanc wrote: > Yep, let me pull and build that branch. I tried installing the dbg > packages and running it in gdb, but it didn't load the symbols. > > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > > On Thu, Mar 17, 2016 at 11:36 AM, Sage Weil wrote: >> On Thu, 17 Mar 2016, Robert LeBlanc wrote: >>> Also, is this ceph_test_rados rewriting objects quickly? I think that >>> the issue is with rewriting objects so if we can tailor the >>> ceph_test_rados to do that, it might be easier to reproduce. >> >> It's doing lots of overwrites, yeah. >> >> I was albe to reproduce--thanks! It looks like it's specific to >> hammer. The code was rewritten for jewel so it doesn't affect the >> latest. The problem is that maybe_handle_cache may proxy the read and >> also still try to handle the same request locally (if it doesn't trigger a >> promote). >> >> Here's my proposed fix: >> >> https://github.com/ceph/ceph/pull/8187 >> >> Do you mind testing this branch? >> >> It doesn't appear to be directly related to flipping between writeback and >> forward, although it may be that we are seeing two unrelated issues. I >> seemed to be able to trigger it more easily when I flipped modes, but the >> bug itself was a simple issue in the writeback mode logic. :/ >> >> Anyway, please see if this fixes it for you (esp with the RBD workload). >> >> Thanks! >> sage >> >> >> >> >>> >>> Robert LeBlanc >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >>> >>> >>> On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc >>> wrote: >>> > I'll miss the Ceph community as well. There was a few things I really >>> > wanted to work in with Ceph. 
>>> > >>> > I got this: >>> > >>> > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028) >>> > dirty exists >>> > 1038: left oid 13 (ObjNum 1028 snap 0 seq_num 1028) >>> > 1040: finishing write tid 1 to nodez23350-256 >>> > 1040: finishing write tid 2 to nodez23350-256 >>> > 1040: finishing write tid 3 to nodez23350-256 >>> > 1040: finishing write tid 4 to nodez23350-256 >>> > 1040: finishing write tid 6 to nodez23350-256 >>> > 1035: done (4 left) >>> > 1037: done (3 left) >>> > 1038: done (2 left) >>> > 1043: read oid 430 snap -1 >>> > 1043: expect (ObjNum 429 snap 0 seq_num 429) >>> > 1040: finishing write tid 7 to nodez23350-256 >>> > update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029) >>> > dirty exists >>> > 1040: left oid 256 (ObjNum 1029 snap 0 seq_num 1029) >>> > 1042: expect (ObjNum 664 snap 0 seq_num 664) >>> > 1043: Error: oid 430 read returned error code -2 >>> > ./test/osd/RadosModel.h: In function 'virtual void >>> > ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time >>> > 2016-03-17 10:47:19.085414 >>> > ./test/osd/RadosModel.h: 1109: FAILED assert(0) >>> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) >>> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>> > const*)+0x76) [0x4db956] >>> > 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c] >>> > 3: (()+0x9791d) [0x7fa1d472191d] >>> > 4: (()+0x72519) [0x7fa1d46fc519] >>> > 5: (()+0x13c178) [0x7fa1d47c6178] >>> > 6: (()+0x80a4) [0x7fa1d425a0a4] >>> > 7: (clone()+0x6d) [0x7fa1d2bd504d] >>> > NOTE: a copy of the executable, or `objdump -rdS ` is >>> > needed to interpret this. >>> > terminate called after throwing an instance of 'ceph::FailedAssertion' >>> > Aborted >>> > >>> > I had to toggle writeback/forward and min_read_recency_for_promote a >>> > few times to get it, but I don't know if it is because I only have one >>> > job running. Ev
Re: [ceph-users] data corruption with hammer
1017: finishing write tid 3 to nodezz25161-1017 1017: finishing write tid 5 to nodezz25161-1017 1017: finishing write tid 6 to nodezz25161-1017 update_object_version oid 1017 v 3011 (ObjNum 1016 snap 0 seq_num 1016) dirty exists 1017: left oid 1017 (ObjNum 1016 snap 0 seq_num 1016) 1018: finishing write tid 1 to nodezz25161-1018 1018: finishing write tid 2 to nodezz25161-1018 1018: finishing write tid 3 to nodezz25161-1018 1018: finishing write tid 4 to nodezz25161-1018 1018: finishing write tid 6 to nodezz25161-1018 1018: finishing write tid 7 to nodezz25161-1018 update_object_version oid 1018 v 1099 (ObjNum 1017 snap 0 seq_num 1017) dirty exists 1018: left oid 1018 (ObjNum 1017 snap 0 seq_num 1017) 1019: finishing write tid 1 to nodezz25161-1019 1019: finishing write tid 2 to nodezz25161-1019 1019: finishing write tid 3 to nodezz25161-1019 1019: finishing write tid 5 to nodezz25161-1019 1019: finishing write tid 6 to nodezz25161-1019 update_object_version oid 1019 v 1300 (ObjNum 1018 snap 0 seq_num 1018) dirty exists 1019: left oid 1019 (ObjNum 1018 snap 0 seq_num 1018) 1020: finishing write tid 1 to nodezz25161-1020 1020: finishing write tid 2 to nodezz25161-1020 1020: finishing write tid 3 to nodezz25161-1020 1020: finishing write tid 5 to nodezz25161-1020 1020: finishing write tid 6 to nodezz25161-1020 update_object_version oid 1020 v 1324 (ObjNum 1019 snap 0 seq_num 1019) dirty exists 1020: left oid 1020 (ObjNum 1019 snap 0 seq_num 1019) 1021: finishing write tid 1 to nodezz25161-1021 1021: finishing write tid 2 to nodezz25161-1021 1021: finishing write tid 3 to nodezz25161-1021 1021: finishing write tid 5 to nodezz25161-1021 1021: finishing write tid 6 to nodezz25161-1021 update_object_version oid 1021 v 890 (ObjNum 1020 snap 0 seq_num 1020) dirty exists 1021: left oid 1021 (ObjNum 1020 snap 0 seq_num 1020) 1022: finishing write tid 1 to nodezz25161-1022 1022: finishing write tid 2 to nodezz25161-1022 1022: finishing write tid 3 to nodezz25161-1022 1022: finishing write tid 5 to nodezz25161-1022 1022: finishing write tid 6 to nodezz25161-1022 update_object_version oid 1022 v 464 (ObjNum 1021 snap 0 seq_num 1021) dirty exists 1022: left oid 1022 (ObjNum 1021 snap 0 seq_num 1021) 1023: finishing write tid 1 to nodezz25161-1023 1023: finishing write tid 2 to nodezz25161-1023 1023: finishing write tid 3 to nodezz25161-1023 1023: finishing write tid 5 to nodezz25161-1023 1023: finishing write tid 6 to nodezz25161-1023 update_object_version oid 1023 v 1516 (ObjNum 1022 snap 0 seq_num 1022) dirty exists 1023: left oid 1023 (ObjNum 1022 snap 0 seq_num 1022) 1024: finishing write tid 1 to nodezz25161-1024 1024: finishing write tid 2 to nodezz25161-1024 1025: Error: oid 219 read returned error code -2 ./test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fbb1bfff700 time 2016-03-17 10:53:53.071338 ./test/osd/RadosModel.h: 1109: FAILED assert(0) ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0x4db956] 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c] 3: (()+0x9791d) [0x7fbb30ff191d] 4: (()+0x72519) [0x7fbb30fcc519] 5: (()+0x13c178) [0x7fbb31096178] 6: (()+0x80a4) [0x7fbb30b2a0a4] 7: (clone()+0x6d) [0x7fbb2f4a504d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
terminate called after throwing an instance of 'ceph::FailedAssertion' Aborted Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Mar 17, 2016 at 10:39 AM, Sage Weil wrote: > On Thu, 17 Mar 2016, Robert LeBlanc wrote: >> -BEGIN PGP SIGNED MESSAGE- >> Hash: SHA256 >> >> I'm having trouble finding documentation about using ceph_test_rados. Can I >> run this on the existing cluster and will that provide useful info? It seems >> running it in the build will not have the caching set up (vstart.sh). >> >> I have accepted a job with another company and only have until Wednesday to >> help with getting information about this bug. My new job will not be using C >> eph, so I won't be able to provide any additional info after Tuesday. I want >> to leave the company on a good trajectory for upgrading, so any input you c >> an provide will be helpful. > > I'm sorry to hear it! You'll be missed. :) > >> I've found: >> >> ./ceph_test_rados --op read 100 --op write 100 --op delete 50 >> - --max-ops 40 --objects 1024 --max-in-flight 64 --size 400 >> - --min-stride-size 40 --max-stride-size 80 --max-seconds 600 >> - --op copy_from 50 --op snap_create 50 --op snap_remove 50 --op >> rollback 50 --op setattr 25 --op rmattr 25 --pool unique_pool_0 >> >> Is that enough if I change --pool
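For anyone trying to reproduce this, a variation of the invocation above that is biased toward overwriting a small set of objects might look like the following. This is only a sketch: the pool name and the op weights/counts are assumptions, not values from the original test suite.

  # Sketch: small object count so the same objects get rewritten often,
  # heavier write weight, pointed at the base pool that has the cache tier
  # in front of it
  ceph_test_rados --op read 100 --op write 200 --op delete 50 \
      --max-ops 400000 --objects 50 --max-in-flight 16 \
      --max-seconds 600 --pool cache_tiered_pool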
Re: [ceph-users] data corruption with hammer
Also, is this ceph_test_rados rewriting objects quickly? I think that the issue is with rewriting objects so if we can tailor the ceph_test_rados to do that, it might be easier to reproduce. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc wrote: > I'll miss the Ceph community as well. There was a few things I really > wanted to work in with Ceph. > > I got this: > > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028) > dirty exists > 1038: left oid 13 (ObjNum 1028 snap 0 seq_num 1028) > 1040: finishing write tid 1 to nodez23350-256 > 1040: finishing write tid 2 to nodez23350-256 > 1040: finishing write tid 3 to nodez23350-256 > 1040: finishing write tid 4 to nodez23350-256 > 1040: finishing write tid 6 to nodez23350-256 > 1035: done (4 left) > 1037: done (3 left) > 1038: done (2 left) > 1043: read oid 430 snap -1 > 1043: expect (ObjNum 429 snap 0 seq_num 429) > 1040: finishing write tid 7 to nodez23350-256 > update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029) > dirty exists > 1040: left oid 256 (ObjNum 1029 snap 0 seq_num 1029) > 1042: expect (ObjNum 664 snap 0 seq_num 664) > 1043: Error: oid 430 read returned error code -2 > ./test/osd/RadosModel.h: In function 'virtual void > ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time > 2016-03-17 10:47:19.085414 > ./test/osd/RadosModel.h: 1109: FAILED assert(0) > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x76) [0x4db956] > 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c] > 3: (()+0x9791d) [0x7fa1d472191d] > 4: (()+0x72519) [0x7fa1d46fc519] > 5: (()+0x13c178) [0x7fa1d47c6178] > 6: (()+0x80a4) [0x7fa1d425a0a4] > 7: (clone()+0x6d) [0x7fa1d2bd504d] > NOTE: a copy of the executable, or `objdump -rdS ` is > needed to interpret this. > terminate called after throwing an instance of 'ceph::FailedAssertion' > Aborted > > I had to toggle writeback/forward and min_read_recency_for_promote a > few times to get it, but I don't know if it is because I only have one > job running. Even with six jobs running, it is not easy to trigger > with ceph_test_rados, but it is very instant in the RBD VMs. 
> > Here are the six run crashes (I have about the last 2000 lines of each > if needed): > > nodev: > update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num > 1014) dirty exists > 1015: left oid 1015 (ObjNum 1014 snap 0 seq_num 1014) > 1016: finishing write tid 1 to nodev21799-1016 > 1016: finishing write tid 2 to nodev21799-1016 > 1016: finishing write tid 3 to nodev21799-1016 > 1016: finishing write tid 4 to nodev21799-1016 > 1016: finishing write tid 6 to nodev21799-1016 > 1016: finishing write tid 7 to nodev21799-1016 > update_object_version oid 1016 v 1957 (ObjNum 1015 snap 0 seq_num > 1015) dirty exists > 1016: left oid 1016 (ObjNum 1015 snap 0 seq_num 1015) > 1017: finishing write tid 1 to nodev21799-1017 > 1017: finishing write tid 2 to nodev21799-1017 > 1017: finishing write tid 3 to nodev21799-1017 > 1017: finishing write tid 5 to nodev21799-1017 > 1017: finishing write tid 6 to nodev21799-1017 > update_object_version oid 1017 v 1010 (ObjNum 1016 snap 0 seq_num > 1016) dirty exists > 1017: left oid 1017 (ObjNum 1016 snap 0 seq_num 1016) > 1018: finishing write tid 1 to nodev21799-1018 > 1018: finishing write tid 2 to nodev21799-1018 > 1018: finishing write tid 3 to nodev21799-1018 > 1018: finishing write tid 4 to nodev21799-1018 > 1018: finishing write tid 6 to nodev21799-1018 > 1018: finishing write tid 7 to nodev21799-1018 > update_object_version oid 1018 v 1093 (ObjNum 1017 snap 0 seq_num > 1017) dirty exists > 1018: left oid 1018 (ObjNum 1017 snap 0 seq_num 1017) > 1019: finishing write tid 1 to nodev21799-1019 > 1019: finishing write tid 2 to nodev21799-1019 > 1019: finishing write tid 3 to nodev21799-1019 > 1019: finishing write tid 5 to nodev21799-1019 > 1019: finishing write tid 6 to nodev21799-1019 > update_object_version oid 1019 v 462 (ObjNum 1018 snap 0 seq_num 1018) > dirty exists > 1019: left oid 1019 (ObjNum 1018 snap 0 seq_num 1018) > 1021: finishing write tid 1 to nodev21799-1021 > 1020: finishing write tid 1 to nodev21799-1020 > 1020: finishing write tid 2 to nodev21799-1020 > 1020: finishing write tid 3 to nodev21799-1020 > 1020: finishing write tid 5 to nodev21799-1020 > 1020: finishing write tid 6 to nodev21799-1020 > update_object_version oid 1020 v 1287 (ObjNum 1019 snap 0 seq_num > 1019) dirty exists > 1020: left oid 1020 (
Re: [ceph-users] data corruption with hammer
Sage,

Your patch seems to have resolved the issue for us. We can't reproduce the problem with ceph_test_rados or our VM test. I also figured out that those are all backports that were cherry-picked, so it was showing the original commit date. There was quite a bit of work on ReplicatedPG.cc since 0.94.6, so it probably only makes sense to wait for 0.94.7 for this fix. Thanks for looking into this so quickly!

As a workaround for 0.94.6, our testing shows that setting min_read_recency_for_promote to 1 does not hit the corruption, as it keeps the original behavior (a minimal sketch of the setting follows the quoted messages below). Something for people to be aware of with 0.94.6 and cache tiers. Hopefully there is a way to detect this in a unittest.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Thu, Mar 17, 2016 at 11:55 AM, Robert LeBlanc wrote:
> Cherry-picking that commit onto v0.94.6 wasn't clean so I'm just
> building your branch. I'm not sure what the difference between your
> branch and 0.94.6 is, I don't see any commits against
> osd/ReplicatedPG.cc in the last 5 months other than the one you did
> today.
>
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Thu, Mar 17, 2016 at 11:38 AM, Robert LeBlanc wrote:
>> Yep, let me pull and build that branch. I tried installing the dbg
>> packages and running it in gdb, but it didn't load the symbols.
>>
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Thu, Mar 17, 2016 at 11:36 AM, Sage Weil wrote:
>>> On Thu, 17 Mar 2016, Robert LeBlanc wrote:
>>>> Also, is this ceph_test_rados rewriting objects quickly? I think that
>>>> the issue is with rewriting objects so if we can tailor the
>>>> ceph_test_rados to do that, it might be easier to reproduce.
>>>
>>> It's doing lots of overwrites, yeah.
>>>
>>> I was able to reproduce--thanks! It looks like it's specific to
>>> hammer. The code was rewritten for jewel so it doesn't affect the
>>> latest. The problem is that maybe_handle_cache may proxy the read and
>>> also still try to handle the same request locally (if it doesn't trigger a
>>> promote).
>>>
>>> Here's my proposed fix:
>>>
>>> https://github.com/ceph/ceph/pull/8187
>>>
>>> Do you mind testing this branch?
>>>
>>> It doesn't appear to be directly related to flipping between writeback and
>>> forward, although it may be that we are seeing two unrelated issues. I
>>> seemed to be able to trigger it more easily when I flipped modes, but the
>>> bug itself was a simple issue in the writeback mode logic. :/
>>>
>>> Anyway, please see if this fixes it for you (esp with the RBD workload).
>>>
>>> Thanks!
>>> sage >>> >>> >>> >>> >>>> >>>> Robert LeBlanc >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >>>> >>>> >>>> On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc >>>> wrote: >>>> > I'll miss the Ceph community as well. There was a few things I really >>>> > wanted to work in with Ceph. >>>> > >>>> > I got this: >>>> > >>>> > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028) >>>> > dirty exists >>>> > 1038: left oid 13 (ObjNum 1028 snap 0 seq_num 1028) >>>> > 1040: finishing write tid 1 to nodez23350-256 >>>> > 1040: finishing write tid 2 to nodez23350-256 >>>> > 1040: finishing write
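For reference, the workaround mentioned above is a single pool setting. A minimal sketch, assuming the cache tier pool is named 'cache' (substitute your own pool name):

  # Revert to the pre-0.94.6 promotion behavior on the cache tier pool
  ceph osd pool set cache min_read_recency_for_promote 1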
Re: [ceph-users] data corruption with hammer
There are no monitors on the new node. It doesn't look like there has been any new corruption since we stopped changing the cache modes.

Upon closer inspection, some files have been changed such that binary files are now ASCII files and vice versa. These are readable ASCII files, things like PHP or script files, or C files where ASCII files should be. I've seen this type of corruption before, when a SAN node misbehaved and both controllers were writing concurrently to the backend disks. The volume was only mounted by one host, but the writes were split between the controllers when it should have been active/passive.

We have killed off the OSDs on the new node as a precaution and will try to replicate this in our lab. My suspicion is that it has to do with the cache promotion code update, but I'm not sure how it would have caused this.

--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 14 Mar 2016 20:51:04 -0600 Mike Lovell wrote:
>
>> something weird happened on one of the ceph clusters that i administer
>> tonight which resulted in virtual machines using rbd volumes seeing
>> corruption in multiple forms.
>>
>> when everything was fine earlier in the day, the cluster was a number of
>> storage nodes spread across 3 different roots in the crush map. the first
>> bunch of storage nodes have both hard drives and ssds in them with the
>> hard drives in one root and the ssds in another. there is a pool for
>> each and the pool for the ssds is a cache tier for the hard drives. the
>> last set of storage nodes were in a separate root with their own pool
>> that is being used for burn in testing.
>>
>> these nodes had run for a while with test traffic and we decided to move
>> them to the main root and pools. the main cluster is running 0.94.5 and
>> the new nodes got 0.94.6 due to them getting configured after that was
>> released. i removed the test pool and did a ceph osd crush move to move
>> the first node into the main cluster, the hard drives into the root for
>> that tier of storage and the ssds into the root and pool for the cache
>> tier. each set was done about 45 minutes apart and they ran for a couple
>> hours while performing backfill without any issue other than high load
>> on the cluster.
>>
> Since I glanced what your setup looks like from Robert's posts and yours I
> won't say the obvious thing, as you aren't using EC pools.
>
>> we normally run the ssd tier in the forward cache-mode due to the ssds we
>> have not being able to keep up with the io of writeback.
this results in >> io on the hard drives slowing going up and performance of the cluster >> starting to suffer. about once a week, i change the cache-mode between >> writeback and forward for short periods of time to promote actively used >> data to the cache tier. this moves io load from the hard drive tier to >> the ssd tier and has been done multiple times without issue. i normally >> don't do this while there are backfills or recoveries happening on the >> cluster but decided to go ahead while backfill was happening due to the >> high load. >> > As you might recall, I managed to have "rados bench" break (I/O error) when > doing these switches with Firefly on my crappy test cluster, but not with > Hammer. > However I haven't done any such switches on my production cluster with a > cache tier, both because the cache pool hasn't even reached 50% capacity > after 2 weeks of pounding and because I'm sure that everything will hold > up when it comes to the first flushing. > > Maybe the extreme load (as opposed to normal VM ops) of your cluster > during the backfilling triggered the same or a simi
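For context, the cache-mode flipping described above is done with the tier commands. A minimal sketch, assuming the cache pool is named 'ssd-cache' (the name is an assumption):

  # Forward new requests to the backing pool instead of promoting into the cache
  ceph osd tier cache-mode ssd-cache forward
  # ...later, switch back to normal caching...
  ceph osd tier cache-mode ssd-cache writeback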
Re: [ceph-users] how to downgrade when upgrade from firefly to hammer fail
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 There is no downgrade path. You are best off trying to fix the issue preventing the upgrade. Post some of the logs from the upgraded OSD and people can try to help you out. -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.6 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJW3fYBCRDmVDuy+mK58QAAma8P/iqr+OaE/qzI9DModRp6 jk16h6k5rH/VSuIyFk74yvfRN28vtywJc62dx/sK4ke3oGpzo9Fn8ay0YMWM 7WEK6E0qZydMYGsNMKw5uBN1VfhiThwKLiip+U/t0ZUV963d+djH0yOFK4bF 9jKX/p8ZIWfAgIRyFgz2QhFT2zPL9UIasVo7eg8nsAE9YNSE2CkBEvxTrxWC C26BrJ24+4TY6qliruhniQNOkWAstYMtPeiFqh5IlH/5/oOqQ6wK+dgyhEmP 4A/RAS7bom0cWP4eu0b+St+IvefC7kzoJM38yTEHku5YALAAFYitLh1Fbzp8 99lS/piObnAgjNTPW6h1KteweIYZJJ3ki9yhq8cBpQ4O5PqHc/64SBq/NY4o 69dpUUqp6L7HudMwDs5z8Q76BjuCu4NhMCieKgks+CuF7mwmCPTEN2A+enaD MTHkQeM5MNRZ4xigucIrYhiT18SMvaI4aKkLCq7GGHkaInk5+91WLcYF+KDa L+9n4M0jW14n2BXejMZjpKXxNa86N5cF7yO/hILCtz1CVJgNcqT2z+kIDZ3z 50aZva/SHsvxmdwK+UxrB3jnFldhzPUB6nU/xJCQWN+BBTSQByFmAg+JkEuX 13qV0h4yWRfH4uaKYdKuzTVSX0zY8HkAA4ZHTatxiPXiVET+NwNE+4aqdbTz hw+f =nLNP -END PGP SIGNATURE- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Sun, Mar 6, 2016 at 7:59 PM, Dong Wu wrote: > hi, cephers > I want to upgrade my ceph cluster from firefly(0.80.11) to hammer, > when i successfully install hammer deb package on all my hosts, then i > update monitor first, and it success. > but when i restart osds on one host to upgrade, it failed, osds > cannot startup, then i want to downgrade to firefly again to keep my > cluster going on, after i reinstall firefly deb package, i failed to > start osds on the host, here is the log: > > 2016-03-07 09:47:14.704242 7f2f11ba87c0 0 ceph version 0.80.11 > (8424145d49264624a3b0a204aedb127835161070), process ceph-osd, pid > 37459 > 2016-03-07 09:47:14.709159 7f2f11ba87c0 -1 > filestore(/var/lib/ceph/osd/ceph-0) FileStore::mount : stale version > stamp 4. Please run the FileStore update script before starting the > OSD, or set filestore_update_to to 3 > 2016-03-07 09:47:14.709176 7f2f11ba87c0 -1 ** ERROR: error converting > store /var/lib/ceph/osd/ceph-0: (22) Invalid argument > 2016-03-07 09:47:18.385399 7f98478187c0 0 ceph version 0.80.11 > (8424145d49264624a3b0a204aedb127835161070), process ceph-osd, pid > 39041 > 2016-03-07 09:47:18.390320 7f98478187c0 -1 > filestore(/var/lib/ceph/osd/ceph-0) FileStore::mount : stale version > stamp 4. Please run the FileStore update script before starting the > OSD, or set filestore_update_to to 3 > 2016-03-07 09:47:18.390337 7f98478187c0 -1 ** ERROR: error converting > store /var/lib/ceph/osd/ceph-0: (22) Invalid argument > > how can i downgrade to firefly successfully? > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
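If it helps with collecting those logs, one way (a sketch only; adjust the OSD id and debug levels to taste) is to run the failing OSD in the foreground with verbose logging and capture the output:

  # Run one OSD in the foreground with extra OSD/filestore logging
  ceph-osd -i 0 -f --debug-osd 20 --debug-filestore 20 2>&1 | tee osd.0.start.log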
Re: [ceph-users] Cache Pool and EC: objects didn't flush to a cold EC storage
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Did you also set "target_max_bytes" to the size of the pool? That bit us when we didn't have it set. The ratio then uses the target_max_bytes to know when to flush. -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.6 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJW3fWJCRDmVDuy+mK58QAAWg0P/131uHJVnkZ7jFCRi8yY iD2//WQJl5VJtcsYwqR0PxqKdMHGTDs263BpVSyUj5tMF+dgOpfvkPSlZ2Uf SGYBXMvmUOu3WOMWfgcu6Tkt6Sai7vJtn6m8P1B+jKPEiTRqk+Apkft87JAE rOsVM1lEwGZNH6+C8XUz13xcZeM15MTHn/QyRRhjt0cNLHxcG0/oWBBX753j BIhde7XtORq6U0T79E6N6kd8KRE0XgOiwWa3bk9mKHWxkrc+1W53RfefwexU rA9VkJKI+7YCh307TXF2cFEw8JPglOJdMcn5G96tb//jMBGh+kBfoT3FbM4F Pb9LASt+DRIptZsF4DJJHLCOs6HseLmAiDp6z+wntjMITkeRGdxcA92llXz+ +/nnGKJtOZj76agXhYmkEZeEVSCiKaKC2xFqUy+p+B1UVGff+cSRt5Fz3NfB NOSlYXbYCdahXaoKcaxa6oupep3TtjI6TBQ7JS4kHHfBMj8JHpSga4WkKqlz e3Oz9PsDU9Tw2UVyo4zLEqgpcWcbY8E1VAAoirKAGcCqnwzwjvhGM2e1h66L yYjepiUQ9oLbIct9MXJOSAMwctsrAYgvR1veG+vqND5ZLr+OIR7at9Vpeg8m +oBVG+4PgxlIEfxVGf+8OjLK9sJUTm+AtLMzsbDqMFX9VQtpoTlsqYGd5gTW 9t/H =7sfH -END PGP SIGNATURE- -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Sun, Mar 6, 2016 at 2:17 AM, Mike Almateia wrote: > Hello Cephers! > > When my cluster hit "full ratio" settings, objects from cache pull didn't > flush to a cold storage. > > 1. Hit the 'full ratio': > > 2016-03-06 11:35:23.838401 osd.64 10.22.11.21:6824/31423 4327 : cluster > [WRN] OSD near full (90%) > 2016-03-06 11:35:55.447205 osd.64 10.22.11.21:6824/31423 4329 : cluster > [WRN] OSD near full (90%) > 2016-03-06 11:36:29.255815 osd.64 10.22.11.21:6824/31423 4332 : cluster > [WRN] OSD near full (90%) > 2016-03-06 11:37:04.769765 osd.64 10.22.11.21:6824/31423 4333 : cluster > [WRN] OSD near full (90%) > ... > > 2. Well, ok. Set the option 'ceph osd pool set hotec cache_target_full_ratio > 0.8'. > But no one of objects didn't flush at all > > 3. Ok. Try flush all object manually: > [root@c1 ~]# rados -p hotec cache-flush-evict-all > rbd_data.34d1f5746d773.00016ba9 > > 4. After full day objects still in cache pool, didn't flush at all: > [root@c1 ~]# rados df > pool name KB objects clones degraded unfound > rdrd KB wrwr KB > data 00000 > 64 158212215700473 > hotec 797656118 25030755000 > 370599163045649 69947951 17786794779 > rbd00000 > 0000 > total used 2080570792 25030755 > > It a bug or predictable action? > > -- > Mike. runs! > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
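For reference, both knobs are plain pool settings. A minimal sketch using the 'hotec' pool name from the thread; the byte value is only an example and should reflect the usable capacity of your cache pool:

  # Tell the tiering agent how large the cache pool is allowed to grow
  ceph osd pool set hotec target_max_bytes 1000000000000    # example: ~1 TB
  # The full/dirty ratios are evaluated against target_max_bytes
  ceph osd pool set hotec cache_target_full_ratio 0.8
  ceph osd pool set hotec cache_target_dirty_ratio 0.4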
Re: [ceph-users] Replacing OSD drive without rempaping pg's
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 With a fresh disk, you will need to remove the old key in ceph (ceph auth del osd.X) and the old osd (ceph osd rm X), but I think you can leave the CRUSH map alone (don't do ceph osd crush rm osd.X) so that there isn't any additional data movement (if there aren't any available OSD numbers less than the OSD being replaced, it will get the same ID. There may also be a way to specify an ID, but I haven't used it). Then when you add the new disk in, it only backfills what the previous disk had, unless the size is different, then it will take on more or less and shuffle some things around the cluster. -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.6 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJW1cZBCRDmVDuy+mK58QAAm9EQAMVOnCOrBkqGaczy5+ds yplotd9kKt/eyhp1nSgPJD+4RdOVQjoL4VVLtCfApXcMfxHkW/vBjpOWD1Bh l14NDjCzpkXM5HpHqQkiel/7thcN45u/Z7wSX8T+x9ontYn1Bv0CfI/6qaFb DmIYdAGjdLgWKpORyeN1WjgrU5DzUbCMHw/3sLfieVpoYsh91dMuxt33366z mcMQ6RYIE/5xpm8LkTsjYkmnl7Xes5fGsIAlx6kJDHpAoBBWEfstjgtCXIBt PgDnBJ/SwisAQKXuQOZg87/3OE+qFQUyILwFE3USD3ugx8xvo1aUGnerY/mT 8rUNfFLCPLhdiAp1fr2kkQW/SfV7spkNkZ/v99J/9dEwSj2pgJ7iHMGNr/Em K3oLezrm7NO2RHsMrn/pz82bO1CSzHrRQ5Aq7Re2r48zYeFxSgvcbMk6Ogzh rDPb2q+QEw/UbIuotl09ab3OGCjzXxhfDIQ44iEUEj0l2Cl5MQQcakdYakoC WCPaqIN7ocqiWnQPY/RnSXuhUgsd8uTBtxcXtHp+y0feAf/80nxc3dFWDfiK 8sKmt+rHoBQKQz0yhc0A0YqM8vnWYatVrVh1+SZe7iJE3/qyglNFmbJQ0O54 au/AJ7OqEy1MnJ06fIaLbSIQMXXMWdEqcib2gIKeunhLDkwoUbi+JRJLBY5X ITts =yIjM -END PGP SIGNATURE----- -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, Feb 29, 2016 at 10:29 PM, Lindsay Mathieson wrote: > I was looking at replacing an osd drive in place as per the procedure here: > > http://www.spinics.net/lists/ceph-users/msg05959.html > > "If you are going to replace the drive immediately, set the “noout” flag. > Take the OSD “down” and replace drive. Assuming it is mounted in the same > place as the bad drive, bring the OSD back up. This will replicate exactly > the same PGs the bad drive held back to the replacement drive." > > > > But the new drive mount will be blank - what happens with the journal, > keyring etc? does starting the OSD process recreate them automatically? > > > thanks, > > -- > Lindsay Mathieson > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
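A rough sketch of that sequence, assuming the dead drive was osd.X and the replacement mounts in the same location (X is a placeholder; adapt to your deployment tooling):

  # Keep CRUSH from remapping data while the drive is being swapped
  ceph osd set noout
  # Remove the old key and OSD entry, but leave the CRUSH entry in place
  ceph auth del osd.X
  ceph osd rm X
  # Prepare/activate the new disk with your usual tooling; with no lower
  # free id available it should come back as osd.X again
  # ...once the new OSD is up and backfill has finished:
  ceph osd unset noout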
Re: [ceph-users] List of SSDs
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Honestly, we are scared to try the same tests with the m600s. When we first put them in, we had them more full, but we backed them off to reduce the load on them. Based on that I don't expect them to fair any better. We'd love to get more IOPs out of our clusters considering the ability of the s3610s. We constantly tune the cluster and try to provide code back to Ceph that helps under high loads and congestion. -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.6 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJW0PHnCRDmVDuy+mK58QAAxh4QAIJG5blyxFMRJ9DdF3U+ J1U47Yd1jGMNUrzSsipA6TCm7FeoKs9/y2PRTIIFdnanmKj/J3X+F2T4M+b4 oHr1V8HzbxRk6dB6q2+Z6DkXMG9I48qhTbxnksAsn/vbEDrCq1t0ctvL3Zxu h97FRsnW5DBH/HAfble6cLZ9vbBbo394yqtR8wu/1gE0E/zpLAgAw5ZnJ4t8 n8RIWyOIwQgapUspQ3KtzGdFl1HP/MLjA/QQiexn8CEhtluwxJTZZB6Fy4q6 b60gLj3HEszo76bcExCrGETCqlT5kiy1qGJCUrHO6sQ7YHpSDVGt9o1muoGJ FXvbGkdhSbqGYB0P5xx83ab3ZQ9Eyg2tf0hreZo9q1kyP5rXTfylr6IGgaCF qNj0QTvcE0TYeUVIUxKkHfG0Ys06kFqdxJAEF3A4tJJp0KyBKwK7eJrj4P2H xclQWUDMTDJk+JSufBNxo5AY94TOLhUsWieuEFGyZeW8gji+oOrIWHHilxz7 De0Xi2Y+9O/OKcKkbBE/g+Pys0S/L9ZwAId5EEMzNRXEoQwlbPVclvukpEQJ xFiLdEJLQzwXP7hRT9lMQkHs3IKKL/0TgsfN2bszoXbHk1rN1NqMVt9BDqHr ZGb++dyfjUFaMOM/S8WXfkxV3dtYi7LKGEn4pSQ2IyZ92REwcTWej2TPV5r9 Nq0g =LM6/ -END PGP SIGNATURE- -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Fri, Feb 26, 2016 at 5:41 PM, Shinobu Kinjo wrote: > Thank you for your very precious output. > "s3610s write iops high-load" is very interesting to me. > Have you every did any same test set of s3610s for m600s? > >> These clusters normally service 12K IOPs with bursts up to 22K IOPs all RBD. >> I've seen a peak of 64K IOPs from client traffic. > > That's pretty good result, isn't it? > I guess you've been tuning your cluster?? > > Rgds, > Shinobu > > - Original Message - > From: "Robert LeBlanc" > To: "Shinobu Kinjo" > Cc: "Christian Balzer" , ceph-users@lists.ceph.com > Sent: Saturday, February 27, 2016 8:52:34 AM > Subject: Re: [ceph-users] List of SSDs > > A picture is worth a thousand words: > > > The red lines are the m600s IO time (dotted) and IOPs (solid) and our > baseline s3610s in green and our test set of s3610s in blue. > > We used weighting to manipulate how many PGs each SSD took. The m600s are > 1TB while the s3610s are 800GBs and we only have the m600s about half > filled. So we weighted the s3610s individually until they were about ~40GBs > within the m600s. We did the same weighting to achieve similar percentage > usage and 80% usage. This graph is stepping from 50% to 70% and finally > very close to 80%. > > We have two production clusters currently, third one will be built in the > next month all about the same size. > > 16 nodes, 3 - 1TB m600 drives and 9 - 4TB HGST HDDs, single E5-2640v2 and > 64 GB RAM dual 40 Gigabit Ethernet ports, direct attached SATA. These > clusters normally service 12K IOPs with bursts up to 22K IOPs all RBD. I've > seen a peak of 64K IOPs from client traffic. > > > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > On Fri, Feb 26, 2016 at 4:05 PM, Shinobu Kinjo wrote: > >> Hello, >> >> > We started having high wait times on the M600s so we got 6 S3610s, 6 >> M500dcs, and 6 500 GB M600s (they have the SLC to MLC conversion that we >> thought might work better). >> >> Is it working better as you were expecting? >> >> > We have graphite gathering stats on the admin sockets for Ceph and the >> standard system stats. >> >> Very cool! 
>> >> > We weighted the drives so they had the same byte usage and let them run >> for a week or so, then made them the same percentage of used space, let >> them run a couple of weeks, then set them to 80% full and let them run a >> couple of weeks. >> >> Almost exactly same *byte* usage? I'm pretty interesting to how you >> realized that. >> >> > We compared IOPS and IO time of the drives to get our comparison. >> >> What is your feeling about the comparison? >> >> > This was done on live production clusters and not synthetic benchmarks. >> >> How large is your production the Ceph cluster? >> >> Rgds, >> Shinobu >> >> > >> > Hello, >> > >> > On Wed, 24 Feb 2016 22:56:15 -0700 Robert LeBlanc wrote: >> > >> > > We are moving to the Intel S3610, from our testing it is a good balance >> > > between price, performance and longevity. But as with all things, do >>
Re: [ceph-users] List of SSDs
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Resending sans attachment... A picture is worth a thousand words: http://robert.leblancnet.us/files/s3610-load-test-20160224.png The red lines are the m600s IO time (dotted) and IOPs (solid) and our baseline s3610s in green and our test set of s3610s in blue. We used weighting to manipulate how many PGs each SSD took. The m600s are 1TB while the s3610s are 800GBs and we only have the m600s about half filled. So we weighted the s3610s individually until they were about ~40GBs within the m600s. We did the same weighting to achieve similar percentage usage and 80% usage. This graph is stepping from 50% to 70% and finally very close to 80%. We have two production clusters currently, third one will be built in the next month all about the same size. 16 nodes, 3 - 1TB m600 drives and 9 - 4TB HGST HDDs, single E5-2640v2 and 64 GB RAM dual 40 Gigabit Ethernet ports, direct attached SATA. These clusters normally service 12K IOPs with bursts up to 22K IOPs all RBD. I've seen a peak of 64K IOPs from client traffic. -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.6 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJW0OgCCRDmVDuy+mK58QAAp+4P/0HJ+UU3gaAdRXyELCg5 mLifFliWYDFuabP+K5aI6mBn4qlF/1BAe6d9K8Zrcz+nZvXP+BcSEd1puUAW GIy+5O3xJkDUM5O9lAN+jIqw0X7ple2xni3Q5/fKwgGpD1TuEjGEnZlFfRJC 8HWfw6rnL+J7WEirhhXrk+NmOvLJRaozROuzKmKcbBVS2oVtrhOPA7eiNrUz NhN/YbvArGrQFneBO39Tp3YPn8cJ2nVgwv6eru9nnrvkEUD9nwJXlgyNf/NC IjX+LnKET0q0ouCFbjJGaUm4+tvNWWtXypYpcdC78RF+XMdsYHMKAikQ0aG7 7UbYlvf+DhFPqskXhpaB1+lEj+qyhYNwvaxt5QtYsuPK7zDfbV23ed/aiw7c 58q3ROMmIZGsVyBh3fR7EAvKcp3W8KQr9JUq3K3vLcWplNZsuvg4QZIx0ia2 YfGzBsJKugxMVGbmqnXCAcjUyEI/haoovIdMOVBWw8Uv8R9m2IpoNXgqsqi1 xJjIJ5pmiwMZliq2YLwcUy/6e3uPpPRYhgRkkHr167DDB0A5ijI7Y8Q5GX28 AeraQSHLBtOtyrXBcFCtZv2YVbl2juwwC2lNXHJZBd0b/iUDnrBA358U0crm +TqyYR7LoZiUjUMI0HZzjeyVIsST201R6uQ1Tv9b6DFAOxDMPWD7ViJLcSIO yAiI =vXUO -END PGP SIGNATURE- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Fri, Feb 26, 2016 at 4:05 PM, Shinobu Kinjo wrote: > Hello, > >> We started having high wait times on the M600s so we got 6 S3610s, 6 >> M500dcs, and 6 500 GB M600s (they have the SLC to MLC conversion that we >> thought might work better). > > Is it working better as you were expecting? > >> We have graphite gathering stats on the admin sockets for Ceph and the >> standard system stats. > > Very cool! > >> We weighted the drives so they had the same byte usage and let them run for >> a week or so, then made them the same percentage of used space, let them run >> a couple of weeks, then set them to 80% full and let them run a couple of >> weeks. > > Almost exactly same *byte* usage? I'm pretty interesting to how you realized > that. > >> We compared IOPS and IO time of the drives to get our comparison. > > What is your feeling about the comparison? > >> This was done on live production clusters and not synthetic benchmarks. > > How large is your production the Ceph cluster? > > Rgds, > Shinobu > >> >> Hello, >> >> On Wed, 24 Feb 2016 22:56:15 -0700 Robert LeBlanc wrote: >> >> > We are moving to the Intel S3610, from our testing it is a good balance >> > between price, performance and longevity. But as with all things, do your >> > testing ahead of time. This will be our third model of SSDs for our >> > cluster. The S3500s didn't have enough life and performance tapers off >> > add it gets full. The Micron M600s looked good with the Sebastian journal >> > tests, but once in use for a while go downhill pretty bad. 
We also tested >> > Micron M500dc drives and they were on par with the S3610s and are more >> > expensive and are closer to EoL. The S3700s didn't have quite the same >> > performance as the S3610s, but they will last forever and are very stable >> > in terms of performance and have the best power loss protection. >> > >> That's interesting, how did you come to that conclusion and how did test >> it? >> Also which models did you compare? >> >> >> > Short answer is test them for yourself to make sure they will work. You >> > are pretty safe with the Intel S3xxx drives. The Micron M500dc is also >> > pretty safe based on my experience. It had also been mentioned that >> > someone has had good experience with a Samsung DC Pro (has to have both >> > DC and Pro in the name), but we weren't able to get any quick enough to >> > test so I can't vouch for them. >> > >> I have some Samsung DC Pro EVOs in production (non-Ceph, see that >> non-barrier thread). >> They do have issues with LSI o
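For anyone wanting to repeat the comparison, the weighting described above is ordinary CRUSH reweighting; the OSD ids and weights below are made up for illustration:

  # Nudge individual OSD weights until the drives being compared hold
  # roughly the same amount of data (or the same percentage used)
  ceph osd crush reweight osd.12 0.73
  ceph osd crush reweight osd.13 0.80
  # Check how much data each OSD actually holds
  ceph osd df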
Re: [ceph-users] List of SSDs
We replaced 32 S3500s with 48 Micron M600s in our production cluster. The S3500s were only doing journals because they were too small and we still ate 3-4% of their life in a couple of months. We started having high wait times on the M600s so we got 6 S3610s, 6 M500dcs, and 6 500 GB M600s (they have the SLC to MLC conversion that we thought might work better). And we swapped out 18 of the M600s throughout our cluster with these test drives. We have graphite gathering stats on the admin sockets for Ceph and the standard system stats. We weighted the drives so they had the same byte usage and let them run for a week or so, then made them the same percentage of used space, let them run a couple of weeks, then set them to 80% full and let them run a couple of weeks. We compared IOPS and IO time of the drives to get our comparison. This was done on live production clusters and not synthetic benchmarks. Some of the data about the S3500s is from my test cluster that has them. Sent from a mobile device, please excuse any typos. On Feb 25, 2016 9:20 PM, "Christian Balzer" wrote: > > Hello, > > On Wed, 24 Feb 2016 22:56:15 -0700 Robert LeBlanc wrote: > > > We are moving to the Intel S3610, from our testing it is a good balance > > between price, performance and longevity. But as with all things, do your > > testing ahead of time. This will be our third model of SSDs for our > > cluster. The S3500s didn't have enough life and performance tapers off > > add it gets full. The Micron M600s looked good with the Sebastian journal > > tests, but once in use for a while go downhill pretty bad. We also tested > > Micron M500dc drives and they were on par with the S3610s and are more > > expensive and are closer to EoL. The S3700s didn't have quite the same > > performance as the S3610s, but they will last forever and are very stable > > in terms of performance and have the best power loss protection. > > > That's interesting, how did you come to that conclusion and how did test > it? > Also which models did you compare? > > > > Short answer is test them for yourself to make sure they will work. You > > are pretty safe with the Intel S3xxx drives. The Micron M500dc is also > > pretty safe based on my experience. It had also been mentioned that > > someone has had good experience with a Samsung DC Pro (has to have both > > DC and Pro in the name), but we weren't able to get any quick enough to > > test so I can't vouch for them. > > > I have some Samsung DC Pro EVOs in production (non-Ceph, see that > non-barrier thread). > They do have issues with LSI occasionally, haven't gotten around to make > that FS non-barrier to see if it fixes things. > > The EVOs are also similar to the Intel DC S3500s, meaning that they are > not really suitable for Ceph due to their endurance. > > Never tested the "real" DC Pro ones, but they are likely to be OK. > > Christian > > > Sent from a mobile device, please excuse any typos. > > On Feb 24, 2016 6:37 PM, "Shinobu Kinjo" wrote: > > > > > Hello, > > > > > > There has been a bunch of discussion about using SSD. > > > Does anyone have any list of SSDs describing which SSD is highly > > > recommended, which SSD is not. 
> > > > > > Rgds, > > > Shinobu > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
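For reference, the per-OSD counters that feed graphite come from the admin socket. A minimal sketch of pulling them by hand (the daemon id and socket path are examples and may differ on your install):

  # Dump all perf counters for one OSD
  ceph daemon osd.3 perf dump
  # Equivalent form, addressing the socket directly
  ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok perf dump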
Re: [ceph-users] Observations with a SSD based pool under Hammer
I was only testing one SSD per node and it used 3.5-4.5 cores on my 8 core Atom boxes. I've also set these boxes to only 4 GB of RAM to reduce the effects of page cache. So no, I still had some headroom, but I was also running fio on my nodes too. I don't remember how much idle I had overall, but there was some. Sent from a mobile device, please excuse any typos. On Feb 25, 2016 9:15 PM, "Christian Balzer" wrote: > > Hello, > > On Wed, 24 Feb 2016 23:01:43 -0700 Robert LeBlanc wrote: > > > With my S3500 drives in my test cluster, the latest master branch gave me > > an almost 2x increase in performance compare to just a month or two ago. > > There looks to be some really nice things coming in Jewel around SSD > > performance. My drives are now 80-85% busy doing about 10-12K IOPS when > > doing 4K fio to libRBD. > > > That's good news, but then again the future is always bright. ^o^ > Before that (or even now with the SSDs still 15% idle), were you > exhausting your CPUs or are they also still not fully utilized as I am > seeing below? > > Christian > > > Sent from a mobile device, please excuse any typos. > > On Feb 24, 2016 8:10 PM, "Christian Balzer" wrote: > > > > > > > > Hello, > > > > > > For posterity and of course to ask some questions, here are my > > > experiences with a pure SSD pool. > > > > > > SW: Debian Jessie, Ceph Hammer 0.94.5. > > > > > > HW: > > > 2 nodes (thus replication of 2) with each: > > > 2x E5-2623 CPUs > > > 64GB RAM > > > 4x DC S3610 800GB SSDs > > > Infiniband (IPoIB) network > > > > > > Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4, > > > Ceph journal is inline (journal file). > > > > > > Performance: > > > A test run with "rados -p cache bench 30 write -t 32" (4MB blocks) > > > gives me about 620MB/s, the storage nodes are I/O bound (all SSDs are > > > 100% busy according to atop) and this meshes nicely with the speeds I > > > saw when testing the individual SSDs with fio before involving Ceph. > > > > > > To elaborate on that, an individual SSD of that type can do about > > > 500MB/s sequential writes, so ideally you would see 1GB/s writes with > > > Ceph (500*8/2(replication)/2(journal on same disk). > > > However my experience tells me that other activities (FS journals, > > > leveldb PG updates, etc) impact things as well. > > > > > > A test run with "rados -p cache bench 30 write -t 32 -b 4096" (4KB > > > blocks) gives me about 7200 IOPS, the SSDs are about 40% busy. > > > All OSD processes are using about 2 cores and the OS another 2, but > > > that leaves about 6 cores unused (MHz on all cores scales to max > > > during the test run). > > > Closer inspection with all CPUs being displayed in atop shows that no > > > single core is fully used, they all average around 40% and even the > > > busiest ones (handling IRQs) still have ample capacity available. > > > I'm wondering if this an indication of insufficient parallelism or if > > > it's latency of sorts. > > > I'm aware of the many tuning settings for SSD based OSDs, however I was > > > expecting to run into a CPU wall first and foremost. > > > > > > > > > Write amplification: > > > 10 second rados bench with 4MB blocks, 6348MB written in total. > > > nand-writes per SSD:118*32MB=3776MB. > > > 30208MB total written to all SSDs. > > > Amplification:4.75 > > > > > > Very close to what you would expect with a replication of 2 and > > > journal on same disk. > > > > > > > > > 10 second rados bench with 4KB blocks, 219MB written in total. > > > nand-writes per SSD:41*32MB=1312MB. 
> > > 10496MB total written to all SSDs. > > > Amplification:48!!! > > > > > > Le ouch. > > > In my use case with rbd cache on all VMs I expect writes to be rather > > > large for the most part and not like this extreme example. > > > But as I wrote the last time I did this kind of testing, this is an > > > area where caveat emptor most definitely applies when planning and > > > buying SSDs. And where the Ceph code could probably do with some > > > attention. > > > > > > Regards, > > > > > > Christian > > > -- > > > Christian BalzerNetwork/Systems Engineer > > > ch...@gol.com Global OnLine Japan/Rakuten Communications > > > http://www.gol.com/ > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Can not disable rbd cache
My guess would be that if you are already running hammer on the client it is already using the new watcher API. This would be a fix on the OSDs to allow the object to be moved because the current client is smart enough to try again. It would be watchers per object. Sent from a mobile device, please excuse any typos. On Feb 25, 2016 9:10 PM, "Christian Balzer" wrote: > On Thu, 25 Feb 2016 10:07:37 -0500 (EST) Jason Dillaman wrote: > > > > > Let's start from the top. Where are you stuck with [1]? I have > > > > noticed that after evicting all the objects with RBD that one object > > > > for each active RBD is still left, I think this is the head object. > > > Precisely. > > > That came up in my extensive tests as well. > > > > Is this in reference to the RBD image header object (i.e. XYZ.rbd or > > rbd_header.XYZ)? > Yes. > > > The cache tier doesn't currently support evicting > > objects that are being watched. This guard was added to the OSD because > > it wasn't previously possible to alert clients that a watched object had > > encountered an error (such as it no longer exists in the cache tier). > > Now that Hammer (and later) librbd releases will reconnect the watch on > > error (eviction), perhaps this guard can be loosened [1]. > > > > [1] http://tracker.ceph.com/issues/14865 > > > > How do I interpret "all watchers" in the issue above? > As in, all watchers of an object, or all watchers in general. > > If it is per object (which I guess/hope), than this fix would mean that > after an upgrade to Hammer or later on the client side a restart of the VM > would allow the header object to be evicted, while the header objects for > VMs that have been running since the dawn of time can not. > > Correct? > > This would definitely be better than having to stop the VM, flush things > and then start it up again. > > Christian > > > -- > > > > Jason > > > > > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
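For reference, whether a header object is still being watched (and therefore cannot be evicted) can be checked directly. A sketch, with a made-up pool name and image id:

  # List the clients watching an RBD header object in the cache pool
  rados -p cache-pool listwatchers rbd_header.10f61238e1f29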
Re: [ceph-users] Observations with a SSD based pool under Hammer
With my S3500 drives in my test cluster, the latest master branch gave me an almost 2x increase in performance compare to just a month or two ago. There looks to be some really nice things coming in Jewel around SSD performance. My drives are now 80-85% busy doing about 10-12K IOPS when doing 4K fio to libRBD. Sent from a mobile device, please excuse any typos. On Feb 24, 2016 8:10 PM, "Christian Balzer" wrote: > > Hello, > > For posterity and of course to ask some questions, here are my experiences > with a pure SSD pool. > > SW: Debian Jessie, Ceph Hammer 0.94.5. > > HW: > 2 nodes (thus replication of 2) with each: > 2x E5-2623 CPUs > 64GB RAM > 4x DC S3610 800GB SSDs > Infiniband (IPoIB) network > > Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4, > Ceph journal is inline (journal file). > > Performance: > A test run with "rados -p cache bench 30 write -t 32" (4MB blocks) gives > me about 620MB/s, the storage nodes are I/O bound (all SSDs are 100% busy > according to atop) and this meshes nicely with the speeds I saw when > testing the individual SSDs with fio before involving Ceph. > > To elaborate on that, an individual SSD of that type can do about 500MB/s > sequential writes, so ideally you would see 1GB/s writes with Ceph > (500*8/2(replication)/2(journal on same disk). > However my experience tells me that other activities (FS journals, leveldb > PG updates, etc) impact things as well. > > A test run with "rados -p cache bench 30 write -t 32 -b 4096" (4KB > blocks) gives me about 7200 IOPS, the SSDs are about 40% busy. > All OSD processes are using about 2 cores and the OS another 2, but that > leaves about 6 cores unused (MHz on all cores scales to max during the > test run). > Closer inspection with all CPUs being displayed in atop shows that no > single core is fully used, they all average around 40% and even the > busiest ones (handling IRQs) still have ample capacity available. > I'm wondering if this an indication of insufficient parallelism or if it's > latency of sorts. > I'm aware of the many tuning settings for SSD based OSDs, however I was > expecting to run into a CPU wall first and foremost. > > > Write amplification: > 10 second rados bench with 4MB blocks, 6348MB written in total. > nand-writes per SSD:118*32MB=3776MB. > 30208MB total written to all SSDs. > Amplification:4.75 > > Very close to what you would expect with a replication of 2 and journal on > same disk. > > > 10 second rados bench with 4KB blocks, 219MB written in total. > nand-writes per SSD:41*32MB=1312MB. > 10496MB total written to all SSDs. > Amplification:48!!! > > Le ouch. > In my use case with rbd cache on all VMs I expect writes to be rather > large for the most part and not like this extreme example. > But as I wrote the last time I did this kind of testing, this is an area > where caveat emptor most definitely applies when planning and buying SSDs. > And where the Ceph code could probably do with some attention. > > Regards, > > Christian > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
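For anyone wanting to run a similar test, a minimal fio sketch against librbd; the pool and image names are placeholders, and fio has to be built with the rbd engine:

  # 4K random writes through librbd
  fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin \
      --pool=rbd --rbdname=fio-test \
      --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based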
Re: [ceph-users] List of SSDs
We are moving to the Intel S3610; from our testing it is a good balance between price, performance, and longevity. But as with all things, do your testing ahead of time. This will be our third model of SSD for our cluster. The S3500s didn't have enough life, and performance tapers off as they get full. The Micron M600s looked good with the Sebastian journal tests, but once in use for a while they go downhill pretty badly. We also tested Micron M500dc drives; they were on par with the S3610s, but they are more expensive and closer to EoL. The S3700s didn't have quite the same performance as the S3610s, but they will last forever, are very stable in terms of performance, and have the best power loss protection.

Short answer: test them for yourself to make sure they will work. You are pretty safe with the Intel S3xxx drives. The Micron M500dc is also pretty safe based on my experience. It has also been mentioned that someone has had good experience with a Samsung DC Pro (it has to have both DC and Pro in the name), but we weren't able to get any quickly enough to test, so I can't vouch for them.

Sent from a mobile device, please excuse any typos.

On Feb 24, 2016 6:37 PM, "Shinobu Kinjo" wrote:
> Hello,
>
> There has been a bunch of discussion about using SSD.
> Does anyone have any list of SSDs describing which SSD is highly
> recommended, which SSD is not.
>
> Rgds,
> Shinobu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
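For the journal-style qualification testing mentioned above, the usual approach is a single-job sync write; a sketch (this writes directly to the device and destroys whatever is on it, so point it at a scratch disk):

  # O_DSYNC 4K sequential writes, the pattern a Ceph journal produces
  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=ssd-journal-test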
Re: [ceph-users] ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)
We have not seen this issue, but we don't run EC pools yet (we are waiting for multiple layers to be available). We are not running 0.94.6 in production yet either. We have adopted the policy to only run released versions in production unless there is a really pressing need to have a patch. We are running 0.94.6 through our alpha and staging clusters and hoping to do the upgrade in the next couple of weeks. We won't know how much the recency fix will help until then because we have not been able to replicate our workload with fio accurately enough to get good test results. Unfortunately we will probably be swapping out our M600s with S3610s. We've burned through 30% of the life in 2 months and they have 8x the op latency. Due to the 10 Minutes of Terror, we are going to have to do both at the same time to reduce the impact. Luckily, when you have weighted out OSDs or empty ones, it is much less impactful. If you get your upgrade done before ours, I'd like to know how it went. I'll be posting the results from ours when it is done. Sent from a mobile device, please excuse any typos. On Feb 24, 2016 5:43 PM, "Christian Balzer" wrote: > > Hello Jason (Ceph devs et al), > > On Wed, 24 Feb 2016 13:15:34 -0500 (EST) Jason Dillaman wrote: > > > If you run "rados -p ls | grep "rbd_id." and > > don't see that object, you are experiencing that issue [1]. > > > > You can attempt to work around this issue by running "rados -p irfu-virt > > setomapval rbd_id. dummy value" to force-promote the object > > to the cache pool. I haven't tested / verified that will alleviate the > > issue, though. > > > > [1] http://tracker.ceph.com/issues/14762 > > > > This concerns me greatly, as I'm about to phase in a cache tier this > weekend into a very busy, VERY mission critical Ceph cluster. > That is on top of a replicated pool, Hammer. > > That issue and the related git blurb are less than crystal clear, so for > my and everybody else's benefit could you elaborate a bit more on this? > > 1. Does this only affect EC base pools? > 2. Is this a regressions of sorts and when came it about? >I have a hard time imagining people not running into this earlier, >unless that problem is very hard to trigger. > 3. One assumes that this isn't fixed in any released version of Ceph, >correct? > > Robert, sorry for CC'ing you, but AFAICT your cluster is about the closest > approximation in terms of busyness to mine here. > And I a assume that you're neither using EC pools (since you need > performance, not space) and haven't experienced this bug all? > > Also, would you consider the benefits of the recency fix (thanks for > that) being worth risk of being an early adopter of 0.94.6? > In other words, are you eating your own dog food already and 0.94.6 hasn't > eaten your data babies yet? ^o^ > > Regards, > > Christian > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Can not disable rbd cache
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Let's start from the top. Where are you stuck with [1]? I have noticed that after evicting all the objects with RBD that one object for each active RBD is still left, I think this is the head object. We haven't tried this, but our planned procedure for finishing the deactivation of a cache tier is to shut down the active VM, then flush again and then start the VM again. Once all VMs have been stopped, flushed and restarted, we should be able to remove the cache tier. That way we don't have to stop all the VMs at once or for long periods of time. I hope at some point the last object can be flushed without shutting down the VM. If you are experiencing something different, please provide some more info, especially more detailed steps of what you tried. [1] http://docs.ceph.com/docs/master/rados/operations/cache-tiering/?highlight=cache#removing-a-cache-tier -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.6 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJWzkPgCRDmVDuy+mK58QAABrEQAIFAxEEmKroSKqqGluFE aCwvTTxye5IfIBjmVoreFZy+/r5B5D+aMBUFArANCk/A9V678mb/24MkCggT 8Ehb0eBVbkWxptUfexXfSuXvFqTGWA5BDnVTzT9rJ5liTQinXbhDCuJcVCDb hcHmNRnUrituZoDfivwp9ZMpe/ZqsQsIN06NyVhLyPWtA1/Ji06v1WwVkEKe b6FkS4J4C6RdmgBi1+QNntcgLjgWi5CXNBrPwhyvRMHYyjGFGUJQ87S7mQJL 4bBSs5e/bBraMBZlv59DgRjmvlGuBQHlSiqSsy3BKsHErKzjxYsh06fNTAZe TJ6bVPsa+vUKprRdWtUIaxqbY6vAXytwpswL57zgvD4PuPAFD80Wz9AK0mgz ypoUacAocRu+rIZ2NgEt4Xr6+K3pJ2wRT2Fs+xMmKt2uoH7XyccU+7kIrEhy CD4AZfCXlOgA5LWYPFpBXC9087OygNZ7907klCG2QMn5Qh15W/MiylU0ECF8 n3kNm4qEO4ICl5MiAXfaw2yaFa7Hht6N+oyDBRUI93Oj9I7pFA4uCrPhuPNt oRgNN9nTwBdVqUICvWJxOsb0AHuJoVIZbLbJ5dNKpcxehrO9aC9Ursa5/Wqt BGljYMYyg1QNf/CbAhZTpT+H4NQLPbN4D0muCchVKe7gekvj6u6vKjWwEiWR cl7D =U5aZ -END PGP SIGNATURE- -------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, Feb 24, 2016 at 4:29 AM, Oliver Dzombic wrote: > Hi Esta, > > how do you know, that its still active ? > > -- > Mit freundlichen Gruessen / Best regards > > Oliver Dzombic > IP-Interactive > > mailto:i...@ip-interactive.de > > Anschrift: > > IP Interactive UG ( haftungsbeschraenkt ) > Zum Sonnenberg 1-3 > 63571 Gelnhausen > > HRB 93402 beim Amtsgericht Hanau > Geschäftsführung: Oliver Dzombic > > Steuer Nr.: 35 236 3622 1 > UST ID: DE274086107 > > > Am 24.02.2016 um 12:27 schrieb wikison: >> Hi, >> I want to disable rbd cache in my ceph cluster. I've set the *rbd cache* >> to be false in the [client] section of ceph.conf and rebooted the >> cluster. But caching system was still working. How can I disable the rbd >> caching system? Any help? >> >> best regards. >> >> 2016-02-24 >> >> Esta Wang >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
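For completeness, a sketch of the flush and tear-down steps from that procedure, assuming a writeback cache pool named 'cachepool' in front of 'basepool' (both names are placeholders):

  # Stop new writes from entering the cache, then flush and evict everything
  ceph osd tier cache-mode cachepool forward
  rados -p cachepool cache-flush-evict-all
  # Only once the cache pool is empty (the watched head objects flush after
  # their VMs have been restarted) remove the overlay and the tier
  ceph osd tier remove-overlay basepool
  ceph osd tier remove basepool cachepool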
Re: [ceph-users] Crush map customization for production use
I think I saw someone say that they had issues with "step take" when it was not a "root" node. Otherwise it looks good to me. The "step chooseleaf firstn 0 type chassis" step says to pick one OSD from a different chassis for each copy, where 0 says to take as many as the replication factor. Since an OSD can only be in one server, that means you are accomplishing what you want. (A quick crushtool check of the rule is sketched below, after the quoted map.)

----
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Wed, Feb 24, 2016 at 4:09 AM, Vickey Singh wrote:
> Hello Geeks
>
> Can someone please review and comment on my custom crush maps. I would
> really appreciate your help
>
>
> My setup : 1 Rack , 4 chassis , 3 storage nodes each chassis ( so total 12
> storage nodes ) , pool size = 3
>
> What i want to achieve is:
> - Survive chassis failures , even if i loose 2 complete chassis (containing
> 3 nodes each) , data should not be lost
> - The crush ruleset should store each copy on a unique chassis and host
>
> For example :
> copy 1 ---> c1-node1
> copy 2 ---> c2-node3
> copy 3 ---> c4-node2
>
>
>
> Here is my crushmap
> =
>
> chassis block_storage_chassis_4 {
> id -17 # do not change unnecessarily
> # weight weight 163.350
> alg straw
> hash 0 # rjenkins1
> item c4-node1 weight 54.450
> item c4-node2 weight 54.450
> item c4-node3 weight 54.450
>
> }
>
> chassis block_storage_chassis_3 {
> id -16 # do not change unnecessarily
> # weight weight 163.350
> alg straw
> hash 0 # rjenkins1
> item c3-node1 weight 54.450
> item c3-node2 weight 54.450
> item c3-node3 weight 54.450
>
> }
>
> chassis block_storage_chassis_2 {
> id -15 # do not change unnecessarily
> # weight weight 163.350
> alg straw
> hash 0 # rjenkins1
> item c2-node1 weight 54.450
> item c2-node2 weight 54.450
> item c3-node3 weight 54.450
>
> }
>
> chassis block_storage_chassis_1 {
> id -14 # do not change unnecessarily
> # weight 163.350
> alg straw
> hash 0 # rjenkins1
> item c1-node1 weight 54.450
> item c1-node2 weight 54.450
> item c1-node3 weight 54.450
>
> }
>
> rack block_storage_rack_1 {
> id -10 # do not change unnecessarily
> # weight 174.240
> alg straw
> hash 0 # rjenkins1
> item block_storage_chassis_1 weight 163.350
> item block_storage_chassis_2 weight 163.350
> item block_storage_chassis_3 weight 163.350
> item block_storage_chassis_4 weight 163.350
>
> }
>
> class block_storage {
> id -6 # do not change unnecessarily
> # weight 210.540
> alg straw
> hash 0 # rjenkins1
> item block_storage_rack_1 weight 656.400
> }
>
> rule ruleset_block_storage {
> ruleset 1
> type replicated
> min_size 1
> max_size 10
> step take block_storage
> step chooseleaf firstn 0 type chassis
> step emit
> }
>
> ___
> ceph-users mailing
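For anyone who wants to sanity-check a rule like this before loading it into a live cluster, crushtool can simulate the placements offline. A rough sketch; the file names and the rule number are placeholders matching the map quoted above, and flags can vary slightly between releases:

$ crushtool -c crushmap.txt -o crushmap.bin
$ crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings | head
$ crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-bad-mappings

If --show-bad-mappings prints nothing and every --show-mappings line lists OSDs that sit under three different chassis, the rule is doing what was intended.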
Re: [ceph-users] Incorrect output from ceph osd map command
ceph pg dump

Since all objects map to a PG, as long as you can verify that no PG has two of its OSDs on the same host/chassis/rack, you are good.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Tue, Feb 23, 2016 at 3:33 PM, Vickey Singh wrote:
> Adding community for further help on this.
>
> On Tue, Feb 23, 2016 at 10:57 PM, Vickey Singh wrote:
>>
>> On Tue, Feb 23, 2016 at 9:53 PM, Gregory Farnum wrote:
>>>
>>> On Tuesday, February 23, 2016, Vickey Singh wrote:
>>>>
>>>> Thanks Greg,
>>>>
>>>> Do you mean ceph osd map command is not displaying accurate information ?
>>>>
>>>> I guess, either of these things are happening with my cluster
>>>> - ceph osd map is not printing true information
>>>> - Object to PG mapping is not correct ( one object is mapped to multiple PG's )
>>>>
>>>> This is happening for several objects , but the cluster is Healthy.
>>>
>>> No, you're looking for the map command to do something it was not
>>> designed for. If you want to see if an object exists, you will need to use a
>>> RADOS client to fetch the object and see if it's there. "map" is a mapping
>>> command: given an object name, which PG/OSD does CRUSH map that name to?
>>
>> well your 6th sense is amazing :)
>>
>> This is exactly i want to achieve , i wan to see my PG/OSD mapping for
>> objects. ( basically i have changed my crush hierarchy , now i want to
>> verify that no 2 objects should go to a single host / chassis / rack ) so to
>> verify them i was using ceph osd map command.
>>
>> Is there a smarter way to achieve this ?
>>
>>>>
>>>> Need expert suggestion.
>>>>
>>>> On Tue, Feb 23, 2016 at 7:20 PM, Gregory Farnum wrote:
>>>>>
>>>>> This is not a bug. The map command just says which PG/OSD an object
>>>>> maps to; it does not go out and query the osd to see if there actually is
>>>>> such an object.
>>>>> -Greg
>>>>>
>>>>> On Tuesday, February 23, 2016, Vickey Singh wrote:
>>>>>>
>>>>>> Hello Guys
>>>>>>
>>>>>> I am getting wired output from osd map. The object does not exists on
>>>>>> pool but osd map still shows its PG and OSD on which its stored.
>>>>>>
>>>>>> So i have rbd device coming from pool 'gold' , this image has an
>>>>>> object 'rb.0.10f61.238e1f29.2ac5'
>>>>>>
>>>>>> The below commands verifies this
>>>>>>
>>>>>> [root@ceph-node1 ~]# rados -p gold ls | grep -i rb.0.10f61.238e1f29.2ac5
>>>>>> rb.0.10f61.238e1f29.2ac5
>>>>>> [root@ceph-node1 ~]#
>>>>>>
>>>>>> This object lives on pool gold and OSD 38,0,20 , which is correct
>>>>>>
>>>>>> [root@ceph-node1 ~]# ceph osd map gold rb.0.10f61.238e1f29.2ac5
>>>>>> osdmap e1357 pool 'gold' (1) object 'rb.0.10f61.238e1f29.2ac5' -> pg 1.11692600 (1.0) -> up ([38,0,20], p38) acting ([38,0,20], p38)
>>>>>> [root@ceph-node1 ~]#
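A quicker way to check a new hierarchy than mapping individual objects is to look at whole PGs, since every object lands in exactly one PG. A rough sketch, assuming a replicated pool and that the chassis/host layout is visible in the OSD tree:

$ ceph osd tree
$ ceph pg dump pgs_brief | less

ceph osd tree shows which host and chassis each OSD belongs to, and pgs_brief prints one line per PG with its up and acting OSD sets; no PG should list two OSDs that live under the same chassis. For spot checks, ceph pg map <pgid> gives the same answer one PG at a time.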
Re: [ceph-users] Ceph and its failures
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 You probably haven't written to any objects after fixing the problem. Do some client I/O on the cluster and the PG will show fixed again. I had this happen to me as well. -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.5 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJWzSv5CRDmVDuy+mK58QAABe4P/jJ4Vtp9qsV6T49/17FW qgoZlxIfTLDXNnsTUUFju3c20hDHTET8uMCsaCrLb02ZujbGV0a1LcW/ffJe hjWx1ExyyrN0bTdwBe+RRycKriHTFH19Fx3zVoRQvDaWoTAbjTFZkvQAxftN vqKonYxsWyvITYLCFMtX0aPEljo+kQ8BNK4vJoPA2hw6cc0TKIKHSsbt9a0Q 6eCjuSPB76cGDRfbxnZbTXT79UgPD4m5ztNo3stXjvfzRMq0/6YLov8rBXTJ y5bnlheBOHfwcS/9P1Vdi+LDDy+iaZb5/gEwXPPzV2uGr/z8RTgGMk0dKyk3 fzZHWU7FhUIl3OVDF3IqQe2tZtWTs59fithHRme7T7+tmQaG0VOd1noMYlNz n3bCQOJutfcyWvU4naQSkgAPfvTH0GwNp16ETAZlB6pADKtH3oXMOPW3CH5H HyY5+H9w7ELbYiuJlGwMRyko/sNIiVEoj2dZB/ta+61G8+nlYR2GsjLceXOM HP9Wi3MrVJtXDLFrnQRglB2dfFWvBlrlBTj3uG7Ebn5DO6glxPEAvzrOgsJ2 O8D5+AMvooc41T74aUcWQK8NHNrrN+eL18yhRfjCgyadA2VYvWeu6K7sIUFo NKFE66ahsxrNKZUrLjeCo69iP4Zf5+AgY7rCau81vzQNtmFUPjzUKyOzgpsb Y2fQ =TGcG -END PGP SIGNATURE- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Tue, Feb 23, 2016 at 2:08 PM, Nmz wrote: >>> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299) >>> >>> Ceph contains >>> MON: 3 >>> OSD: 3 >>> >> For completeness sake, the OSDs are on 3 different hosts, right? > > It is single machine. I`m doing tests only. > >>> File system: ZFS >> That is the odd one out, very few people I'm aware of use it, support for >> it is marginal at best. >> And some of its features may of course obscure things. > > I`m using ZFS on linux for a log time and I`m happy with it. > > >> Exact specification please, as in how is ZFS configured (single disk, >> raid-z, etc)? > > 2 disks in mirror mode. > >>> Kernel: 4.2.6 >>> >> While probably not related, I vaguely remember 4.3 being recommended for >> use with Ceph. > > At this time I can run only this kernel. But IF I decide to use Ceph (only if > Ceph satisfy requirements) I can use any other kernel. > >>> 3. Does Ceph have auto heal option? >> No. >> And neither is the repair function a good idea w/o checking the data on >> disk first. >> This is my biggest pet peeve with Ceph and you will find it mentioned >> frequently in this ML, just a few days ago this thread for example: >> "pg repair behavior? (Was: Re: getting rid of misplaced objects)" > > It is very strange to recovery data manually without know which data is good. > If I have 3 copies of data and 2 of them are corrupted then I cat recovery > the bad one. > > > -- > > Did some new test. Now new 3 OSD are in different systems. FS is ext3 > > Same start as before. 
> > # grep "a" * -R > Binary file > osd/nmz-5/current/17.17_head/rbd\udata.1bef77ac761fb.0001__head_FB98F317__11 > matches > Binary file osd/nmz-5-journal/journal matches > > # ceph pg dump | grep 17.17 > dumped all in format plain > 17.17 1 0 0 0 0 40961 1 > active+clean2016-02-23 16:14:32.234638 291'1 309:44 [5,4,3] 5 > [5,4,3] 5 0'0 2016-02-22 20:30:04.255301 0'0 2016-02-22 > 20:30:04.255301 > > # md5sum rbd\\udata.1bef77ac761fb.0001__head_FB98F317__11 > \c2642965410d118c7fe40589a34d2463 > rbd\\udata.1bef77ac761fb.0001__head_FB98F317__11 > > # sed -i -r 's/aa/ab/g' > rbd\\udata.1bef77ac761fb.0001__head_FB98F317__11 > > > # ceph pg deep-scrub 17.17 > > 7fbd99e6c700 0 log_channel(cluster) log [INF] : 17.17 deep-scrub starts > 7fbd97667700 0 log_channel(cluster) log [INF] : 17.17 deep-scrub ok > > -- restartind OSD.5 > > # ceph pg deep-scrub 17.17 > > 7f00f40b8700 0 log_channel(cluster) log [INF] : 17.17 deep-scrub starts > 7f00f68bd700 -1 log_channel(cluster) log [ERR] : 17.17 shard 5: soid > 17/fb98f317/rbd_data.1bef77ac761fb.0001/head data_digest > 0x389d90f6 != known data_digest 0x4f18a4a5 from auth shard 3, missing attr _, > missing attr snapset > 7f00f68bd700 -1 log_channel(cluster) log [ERR] : 17.17 deep-scrub 0 missing, > 1 inconsistent objects > 7f00f68bd700 -1 log_channel(cluster) log [ERR] : 17.17 deep-scrub 1 errors > > > Ceph 9.2.0 bug ? > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
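For reference, the usual way to handle an inconsistency like the one above is to find the authoritative copy by hand before repairing, because in releases of this era pg repair can, in some situations, treat the primary's copy as authoritative even when the primary holds the damaged replica. A rough sketch using the PG from the example; the filestore paths are hypothetical defaults:

$ ceph pg map 17.17
$ md5sum /var/lib/ceph/osd/ceph-5/current/17.17_head/rbd*data.1bef77ac761fb*
$ md5sum /var/lib/ceph/osd/ceph-4/current/17.17_head/rbd*data.1bef77ac761fb*
$ md5sum /var/lib/ceph/osd/ceph-3/current/17.17_head/rbd*data.1bef77ac761fb*
$ ceph pg repair 17.17

Compare the checksums across the acting set first, and only run the repair once you are sure the copy that will win is the good one.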
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
I'm still going to see if I can get Ceph clients to hardly notice that an OSD comes back in. Our setup is EXT4 and our SSDs have the hardest time with the longest recovery impact. It should be painless no matter how slow the drives/CPU/etc are. If it means waiting to service client I/O until all the peering and such (not including backfilling/recovery, because that can already be done in the background without much impact) is completed before sending client I/O to the OSD, then that is what I'm going to target. That way, if it takes 5 minutes for the OSD to get its bearings because it is swapping due to low memory or whatever, the clients happily ignore the OSD until it says it is ready and don't have all the client I/O fighting to get a piece of scarce resources.

I appreciate all the suggestions that have been mentioned and believe that there is a fundamental issue here that causes a problem when you run your hardware into the red zone (like we have to do out of necessity). You may be happy with how things are set up in your environment, but I'm not ready to give up on it and I think we can make it better. That way it "Just Works" (TM) with more hardware and configurations and doesn't need tons of effort to get it tuned just right. Oh, and be careful not to touch it, the balance of the force might get thrown off and the whole thing will tank. That does not make me feel confident. Ceph is so resilient in so many ways already, why should this be an Achilles heel for some?

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Sat, Feb 13, 2016 at 8:51 PM, Tom Christensen wrote:
>> > Next this :
>> > ---
>> > 2016-02-12 01:35:33.915981 7f75be4d57c0 0 osd.2 1788 load_pgs
>> > 2016-02-12 01:36:32.989709 7f75be4d57c0 0 osd.2 1788 load_pgs opened 564 pgs
>> > ---
>> > Another minute to load the PGs.
>> Same OSD reboot as above : 8 seconds for this.
>
> Do you really have 564 pgs on a single OSD? I've never had anything like
> decent performance on an OSD with greater than about 150pgs. In our
> production clusters we aim for 25-30 primary pgs per osd, 75-90 pgs/osd total
> (with size set to 3). When we initially deployed our large cluster with
> 150-200 pgs/osd (total, 50-70 primary pgs/osd, again size 3) we had no end of
> trouble getting pgs to peer. The OSDs ate RAM like nobody's business, took
> forever to do anything, and in general caused problems. If you're running
> 564 pgs/osd in this 4 OSD cluster, I'd look at that first as the potential
> culprit. That is a lot of threads inside the OSD process that all need to
> get CPU/network/disk time in order to peer as they come up.
Especially on > firefly I would point to this. We've moved to Hammer and that did improve a > number of our performance bottlenecks, though we've also grown our cluster > without adding pgs, so we are now down in the 25-30 primary pgs/osd range, > and restarting osds, or whole nodes (24-32 OSDs for us) no longer causes us > pain. In the past restarting a node could cause 5-10 minutes of peering and > pain/slow requests/unhappiness of various sorts (RAM exhaustion, OOM Killer, > Flapping OSDs). This all improved greatly once we got our pg/osd count > under 100 even before we upgraded to hammer. > > > > > > On Sat, Feb 13, 2016 at 11:08 AM, Lionel Bouton > wrote: >> >> Hi, >> >> Le 13/02/2016 15:52, Christian Balzer a écrit : >> > [..] >> > >> > Hum that's surprisingly long. How much data (size and nb of files) do >> > you have on this OSD, which FS do you use, what are the mount options, >> > what is the hardware and the kind of access ? >> > >> > I already mentioned the HW, Areca RAID controller with 2GB HW cache and >> > a >> > 7 disk RAID6 per OSD. >> > Not
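As a rough rule of thumb for the PG numbers being discussed here: total PGs across all pools are usually sized to land around 100 PGs per OSD after replication, i.e. pg_num ≈ (num_osds × 100) / pool_size, rounded to a power of two. A quick way to see where a cluster actually stands (column names differ a little between releases):

$ ceph osd df
$ ceph -s | grep pgmap

On Hammer and later, ceph osd df ends each row with the number of PGs hosted by that OSD, and the pgmap line in ceph -s gives the cluster-wide total. For the 4-OSD example above, 4 * 100 / 3 ≈ 133, so something on the order of 128 PGs in total rather than 564 per OSD.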
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
Christian,

Yep, that describes what I see too. The good news is that I made a lot of progress on optimizing the queue today: a 10-50% performance increase in my microbenchmarks (that is only the improvement in enqueueing and dequeueing ops, which is a small part of the whole IO path, but every little bit helps). I have some more knobs to turn in my microbenchmarks, then I'll run it through some real tests, document the results, and submit the pull request, hopefully mid next week.

Then the next thing to look into is what I've affectionately called "The 10 Minutes of Terror", which can be anywhere from 5 minutes to 20 minutes in our cluster. It is our biggest pain point after the starved IO situation, which the new queue I wrote goes a long way toward mitigating, and the cache promotion/demotion issues which Nick and Sage have been working on (thanks for all the work on that). I hope in a few weeks' time I can have a report on what I find. Hopefully we can have it fixed for Jewel and Hammer. Fingers crossed.

Robert LeBlanc

Sent from a mobile device, please excuse any typos.

On Feb 12, 2016 10:32 PM, "Christian Balzer" wrote:
>
> Hello,
>
> for the record what Robert is writing below matches my experience the best.
>
> On Fri, 12 Feb 2016 22:17:01 + Steve Taylor wrote:
>
> > I could be wrong, but I didn't think a PG would have to peer when an OSD
> > is restarted with noout set. If I'm wrong, then this peering would
> > definitely block I/O. I just did a quick test on a non-busy cluster and
> > didn't see any peering when my OSD went down or up, but I'm not sure how
> > good a test that is. The OSD should also stay "in" throughout the
> > restart with noout set, so it wouldn't have been "out" before to cause
> > peering when it came "in."
> >
> It stays in as far as the ceph -s output is concerned, but clearly for
> things to work as desired/imagined/expected some redirection state has to
> be enabled for the duration.
>
> > I do know that OSDs don’t mark themselves "up" until they're caught up
> > on OSD maps. They won't accept any op requests until they're "up," so
> > they shouldn't have any catching up to do by the time they start taking
> > op requests. In theory they're ready to handle I/O by the time they
> > start handling I/O. At least that's my understanding.
> >
>
> Well, here is the 3 minute ordeal of a restart, things went downhill from
> the moment the shutdown was initiated to the time it was fully back up,
> with some additional fun at the recovery after that, but compared to the 3
> minutes of near silence, that was a minor hiccup.
> > --- > 2016-02-12 01:33:45.408348 7f7f0c786700 -1 osd.2 1788 *** Got signal > Terminated *** > 2016-02-12 01:33:45.408414 7f7f0c786700 0 osd.2 1788 prepare_to_stop > telling mon we are shutting down > 2016-02-12 01:33:45.408807 7f7f1cfa7700 0 osd.2 1788 got_stop_ack > starting shutdown > 2016-02-12 01:33:45.408841 7f7f0c786700 0 osd.2 1788 prepare_to_stop > starting shutdown > 2016-02-12 01:33:45.408852 7f7f0c786700 -1 osd.2 1788 shutdown > 2016-02-12 01:33:45.409541 7f7f0c786700 20 osd.2 1788 kicking pg 1.10 > 2016-02-12 01:33:45.409547 7f7f0c786700 30 osd.2 pg_epoch: 1788 pg[1.10( > empty local-les=1788 n=0 ec=1 les/c 1788/1788 > 1787/1787/1787) [1,2] r=1 lpr=1787 pi=1656-1786/12 crt=0'0 active] lock > 2016-02-12 01:33:45.409562 7f7f0c786700 10 osd.2 pg_epoch: 1788 pg[1.10( > empty local-les=1788 n=0 ec=1 les/c 1788/1788 > 1787/1787/1787) [1,2] r=1 lpr=1787 pi=1656-1786/12 crt=0'0 active] > on_shutdown > --- > This goes on for quite a while with other PGs and more shutdown fun, until > we get here: > --- > 2016-02-12 01:33:46.413966 7f7f0c786700 20 osd.2 1788 kicking pg 2.17c > 2016-02-12 01:33:46.413974 7f7f0c786700 30 osd.2 pg_epoch: 1788 pg[2.17c( > v 1788'10083124 (1782'10080123,1788'10083124 > ] local-les=1785 n=1522 ec=1 les/c 1785/1785 1784/1784/1784) [0,2] r=1 > lpr=1784 pi=1652-1783/16 luod=0'0 crt=1692'3885 > 111 lcod 1788'10083123 active] lock > 2016-02-12 01:33:48.690760 7f75be4d57c0 0 ceph version 0.80.10 > (ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ce > ph-osd, pid 24967 > --- > So from shutdown to startup about 2 seconds, not that bad. > However here is where the cookie crumbles massively: > --- > 2016-02-12 01:33:50.263152 7f75be4d57c0 0 > filestore(/var/lib/ceph/osd/ceph-2) limited size xattrs > 2016-02-12 01:35:31.809897 7f75be4d57c0 0 > filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode > : checkpoint is not enabled > --- > Nearly 2 minutes to mount things, it probably had to go
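For anyone wanting to break down where their own restarts spend time, the stages above are all timestamped in the OSD log, so something like the following works as a rough measurement (default log path assumed, and the exact log strings vary by release):

$ grep -E 'Got signal|shutdown|filestore.*mount|load_pgs|done with init' /var/log/ceph/ceph-osd.2.log
$ ceph daemon osd.2 status

Comparing the timestamps shows how long the shutdown, filestore mount, PG load and boot phases each took, and the admin socket status confirms when the OSD considers itself up again.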
Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
What I've seen is that when an OSD starts up in a busy cluster, as soon as it is "in" (could be "out" before) it starts getting client traffic. However, it has to be "in" to start catching up and peering to the other OSDs in the cluster. The OSD is not ready to service requests for that PG yet, but it has the OP queued until it is ready. On a busy cluster it can take an OSD a long time to become ready, especially if it is servicing client requests at the same time.

If someone isn't able to look into the code to resolve this by the time I'm finished with the queue optimizations I'm doing (hopefully in a week or two), I plan on looking into this to see if there is something that can be done to prevent the OPs from being accepted until the OSD is ready for them.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Fri, Feb 12, 2016 at 9:42 AM, Nick Fisk wrote:
> I wonder if Christian is hitting some performance issue when the OSD or
> number of OSD's all start up at once? Or maybe the OSD is still doing some
> internal startup procedure and when the IO hits it on a very busy cluster,
> it causes it to become overloaded for a few seconds?
>
> I've seen similar things in the past where if I did not have enough min free
> KB's configured, PG's would take a long time to peer/activate and cause slow
> ops.
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Steve Taylor
>> Sent: 12 February 2016 16:32
>> To: Nick Fisk ; 'Christian Balzer' ; ceph-us...@lists.ceph.com
>> Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)
>>
>> Nick is right. Setting noout is the right move in this scenario. Restarting an
>> OSD shouldn't block I/O unless nodown is also set, however. The exception
>> to this would be a case where min_size can't be achieved because of the
>> down OSD, i.e. min_size=3 and 1 of 3 OSDs is restarting. That would certainly
>> block writes. Otherwise the cluster will recognize down OSDs as down
>> (without nodown set), redirect I/O requests to OSDs that are up, and backfill
>> as necessary when things are back to normal.
>>
>> You can set min_size to something lower if you don't have enough OSDs to
>> allow you to restart one without blocking writes. If this isn't the case,
>> something deeper is going on with your cluster. You shouldn't get slow
>> requests due to restarting a single OSD with only noout set and idle disks on
>> the remaining OSDs. I've done this many, many times.
>>
>> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
>> 380 Data Drive Suite 300 | Draper | Utah | 84020
>> Office: 801.871.2799 | Fax: 801.545.4705
>>
>> If you are not the intended recipient of this message, be advised that any
>> dissemination or copying of this message is prohibited.
>> If you received this message erroneously, please notify the sender and
>> delete it, together with any attachments.
>> >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Nick Fisk >> Sent: Friday, February 12, 2016 9:07 AM >> To: 'Christian Balzer' ; ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't >> uptosnuff) >> >> >> >> > -Original Message- >> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >> > Of Christian Balzer >> > Sent: 12 February 2016 15:38 >> > To: ceph-users@lists.ceph.com >> > Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout >> > ain't >> > uptosnuff) >> > >> > On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote: >> > >> > > Hi, >> > > >> > > On 02/12/2016 03:47 PM, Christian Balzer wrote: >> > > > Hello, >> > > > >> > > > yesterday I upgraded our most busy (in other words lethally >> > > > overloaded) production cluster to the latest Firefly in >> > > > preparation for a Hammer upgrade and then phasing in of a cache > tier. >> > > > >> > > > When restarting the ODSs it took 3 minutes (1 minute in a >> > > > consecutive repeat to test the impact of primed caches) during >> > > > which the cluster crawled to a near stand-still and
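For completeness, the procedure being debated here boils down to something like the following (pool name, OSD id and service commands are placeholders; use whichever init system the release ships with):

$ ceph osd set noout
$ systemctl stop ceph-osd@2          # or: /etc/init.d/ceph stop osd.2
  ... do the maintenance ...
$ systemctl start ceph-osd@2
$ ceph osd unset noout
$ ceph osd pool get rbd size
$ ceph osd pool get rbd min_size

Writes to a PG block whenever fewer than min_size of its replicas are up, so min_size can be lowered temporarily (ceph osd pool set rbd min_size 1) to ride through a restart, at the cost of less protection while it is lowered.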
Re: [ceph-users] cls_rbd ops on rbd_id.$name objects in EC pool
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Is this only a problem with EC base tiers or would replicated base tiers see this too? - Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Feb 11, 2016 at 6:09 PM, Sage Weil wrote: > On Thu, 11 Feb 2016, Nick Fisk wrote: >> That’s a relief, I was sensing a major case of face palm occuring when I >> read Jason's email!!! > > https://github.com/ceph/ceph/pull/7617 > > The tangled logic in maybe_handle_cache wasn't respecting the force > promotion bool. > > sage > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.4 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJWvVgFCRDmVDuy+mK58QAAepwQAIYDDy9BKOqCN6AYg6QK XOipjXIAwU+lwJA9dV6GOLSeztyg03i1h0Nvibww9JuRYoWWDfPmRCqWWCyl qHoa1q3RgByUTlrxQwl2j0oqVdj2Gn238yyZLqqkhvJS4icc1Xl42710xQa8 OZCMmrJZQ6ZF4n9rU4tUZy6+4l+FjhmqGCu4PHw6SK2TiA6SJR4pcMsbFb6Y h5yWHLNJaCNxe3JVI4sd/tDFxU+pnalz4u2/QkUg2I22C1rYOelbvQ8qeVsR TFy3wc62GGqjaZ9+cjvY3VwrsScFh9skz/cBg7ANRs20rdwX74xfdsIUeAdW f1zfNaobOBt0ZbrYcrp28BhjpIik7GriBiFSUaJ/xIWc8wDNSYhAApGUMNhc oLcsl11zpHzAce8z/Jv5uVRH7VG0jqJKQg8t2l09V/LryxcTktrEQctq+6LS zqh46uToc2jpvTvhIwUiT5fhg9NA2j2cOo1lhMgWJBpT41qambgBWAeIXliE oHMexkdpN80Fqv+yXjEyDocDH1c1bTlyXI71btHptQNvC2VtTqoviz96CtEt jOAFV3nfYz36XkAeayH22ExAwwvF8d9FWpnzEnI3cf63QChN3LAk1C52T5Lw T5rp/4kNF3UnTlJ/ejf3twGaR+FSUWMeum4pKpakfpntRH/ZxpU1VSKSL0TM BdYi =gvlV -END PGP SIGNATURE- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] K is for Kraken
Too bad K isn't an LTS. It would be fun to release the Kraken many times. I like liliput.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Mon, Feb 8, 2016 at 11:36 AM, Sage Weil wrote:
> I didn't find any other good K names, but I'm not sure anything would top
> kraken anyway, so I didn't look too hard. :)
>
> For L, the options I found were
>
> luminous (flying squid)
> longfin (squid)
> long barrel (squid)
> liliput (octopus)
>
> Any other suggestions?
>
> sage
Re: [ceph-users] Unified queue in Infernalis
I believe this is referring to combining the previously separate queues into a single queue (PrioritizedQueue and soon to be WeightedPriorityQueue) in ceph. That way client IO and recovery IO can be better prioritized in the Ceph code. This is all before the disk queue. Robert LeBlanc Sent from a mobile device please excuse any typos. On Feb 5, 2016 4:28 PM, "Stillwell, Bryan" wrote: > I saw the following in the release notes for Infernalis, and I'm wondering > where I can find more information about it? > > * There is now a unified queue (and thus prioritization) of client IO, > recovery, scrubbing, and snapshot trimming. > > I've tried checking the docs for more details, but didn't have much luck. > Does this mean we can adjust the ionice priority of each of these > operations if we're using the CFQ scheduler? > > Thanks, > Bryan > > > > > This E-mail and any of its attachments may contain Time Warner Cable > proprietary information, which is privileged, confidential, or subject to > copyright belonging to Time Warner Cable. This E-mail is intended solely > for the use of the individual or entity to which it is addressed. If you > are not the intended recipient of this E-mail, you are hereby notified that > any dissemination, distribution, copying, or action taken in relation to > the contents of and attachments to this E-mail is strictly prohibited and > may be unlawful. If you have received this E-mail in error, please notify > the sender immediately and permanently delete the original and any copy of > this E-mail and any printout. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
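A hedged sketch of the knobs that feed that unified queue on the OSD side; these are existing options, but the defaults and the queue implementation itself differ between releases:

$ ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
$ ceph tell osd.* injectargs '--osd_client_op_priority 63 --osd_recovery_op_priority 1'

The same keys can be set in the [osd] section of ceph.conf to persist across restarts. These are priorities inside the Ceph op queue rather than ionice settings, so they apply regardless of the disk scheduler; on later releases the queue implementation is additionally selectable with the osd_op_queue option.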
Re: [ceph-users] Set cache tier pool forward state automatically!
On Thu, Feb 4, 2016 at 8:32 PM, Christian Balzer wrote:
> On Wed, 3 Feb 2016 22:42:32 -0700 Robert LeBlanc wrote:
> I just finished downgrading my test cluster from testing to Jessie and
> then upgrading Ceph from Firefly to Hammer (that was fun few hours).
>
> And I can confirm that I don't see that issue with Hammer, wonder if it's
> worth prodding the devs about.
> I sorta dread the time the PG upgrade process will take when going to
> Hammer on the overloaded production server.
> But then again, a fix for Firefly is both unlikely and going to take too
> long for my case anyway.

There isn't any PG upgrading as part of the Firefly to Hammer process that I can think of. If there is one, it wasn't long. Setting noout would prevent backfilling, so only recovery of the delta changes would be needed. You don't have to reboot the node, just restart the process, which limits the amount of changes to only a minute or two's worth. It's been a long time since I did that upgrade, so I may just be plain forgetting something.

I think you are right about the fix not going into Firefly since it will EoL this year. Glad to know that it is fixed in Hammer.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Re: [ceph-users] Upgrading with mon & osd on same host
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Just make sure that your monitors and OSDs are on the very latest of Hammer or else your Infernalis OSDs won't activate. - Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Feb 4, 2016 at 12:23 AM, Mika c wrote: > Hi, > >* Do the packages (Debian) restart the services upon upgrade? > No need, restart by yourself. > >>Do I need to actually stop all OSDs, or can I upgrade them one by one? > No need to stop. Just upgrade osd server one by one and restart each osd > daemons. > > > > Best wishes, > Mika > > > 2016-02-03 18:55 GMT+08:00 Udo Waechter : >> >> Hi, >> >> I would like to upgrade my ceph cluster from hammer to infernalis. >> >> I'm reading the upgrade notes, that I need to upgrade & restart the >> monitors first, then the OSDs. >> >> Now, my cluster has OSDs and Mons on the same hosts (I know that should >> not be the case, but it is :( ). >> >> I'm just wondering: >> * Do the packages (Debian) restart the services upon upgrade? >> >> >> In theory it should work this way: >> >> * install / upgrade the new packages >> * restart all mons >> * stop OSD one by one and change the user accordingly. >> >> Another question then: >> >> Do I need to actually stop all OSDs, or can I upgrade them one by one? >> I don't want to take the whole cluster down :( >> >> Thanks very much, >> udo. >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.4 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJWs4YFCRDmVDuy+mK58QAAdjUP/RNYDkRYaPuspNei14sh XrM23GLCsmuv7jderKwOkG2wsIQOxR86E5F/dUEnB0UB+CKAvYBi3w2cHNc1 PrZoWkPkiEy2+bsQW65CoPK4UoghFoNGWABSPIDgcNjrblbTJ+Ph0FEfXSNn PnlZ40/NySiCypPbKTOFag8o30eOqIO1UDjWqTdeQWhVKmQpAGWEAMc/A1Dk YHLqq1MQOiZ1Zh16Bx664sspR68GYnWw57MF5bVterEahlhm8/n17rJVDFT/ 440+Idph3GEpIWqiXLLYM8nCIiwsXO30OxdwTVVpoDrszh782E2jAMkW9cCs IXBkZRgq4M6Gz4P76BWiNJN0CeTsA0NUwQVZQl9cndeLgyqhCzFS8825ixfl fFFiz3RFqluVzP55V+D3IEFZHlbiYMZtx1HbrjWR1UG1Q40PnB3XxwxiNBDT dKsjpGMYeHs/KPUdMaWraQqBxjWC1bvc00eqVhQZm/Xz+jniitr+DGfh9afi sTYYiHJcURgpvvbi77oOglzYfMes+b5oOxJT5KII2eEDothG6GF63Bn7c75W 7BjjlR4ugmD6kO4PsyF2NisfdL7IpEQe/aiieGPU10QRvVfRdu5LEGd6/An2 YxvAhzQxx+gJzknBDlbh95wcdVy/MHKDO3XoK1FXOpRaejCcPLRhu3rW/vgy ZRJc =rHFo -END PGP SIGNATURE- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
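A rough outline of that sequence for a combined mon+OSD host, including the ownership change Infernalis introduces because the daemons no longer run as root (ids, paths and package commands are placeholders; the release notes are the authoritative list of steps):

$ apt-get update && apt-get install ceph      # upgrade the packages; no daemons restarted yet
  # restart each mon one at a time, chowning its data dir first since the daemons now run as user ceph:
$ systemctl stop ceph-mon@$(hostname -s)
$ chown -R ceph:ceph /var/lib/ceph/mon
$ systemctl start ceph-mon@$(hostname -s)
  # then one OSD at a time:
$ systemctl stop ceph-osd@3
$ chown -R ceph:ceph /var/lib/ceph/osd/ceph-3
$ systemctl start ceph-osd@3
$ ceph tell osd.3 version

Waiting for HEALTH_OK between OSDs keeps only one daemon recovering at a time.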
Re: [ceph-users] Set cache tier pool forward state automatically!
On Wed, Feb 3, 2016 at 9:00 PM, Christian Balzer wrote:
> On Wed, 3 Feb 2016 16:57:09 -0700 Robert LeBlanc wrote:
> That's an interesting strategy, I suppose you haven't run into the issue I
> wrote about 2 days ago when switching to forward while running rdb bench?

We haven't, but we are running 0.94.5. If you are running Firefly, that could be why.

> In my case I venture that the number of really hot objects is small enough
> to not overwhelm things and that 5K IOPS would be all that cluster ever
> needs to provide.

We have 48x Micron M600 1TB drives. They do not perform as well as the Intel S3610 800GB, which seems to have the right balance of performance and durability for our needs. Since we had to under-provision the M600s, we should be able to get by with the 800GB just fine. Once we get the drives swapped out, we may do better than the 10K IOPS, and the recency fix going into the next version of Hammer should help with writeback as well. With our new cluster, we did fine in writeback mode until we hit that 10K IOP limit and started getting slow I/O messages; switching to forward mode sped things up a lot.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
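For readers landing on this thread later, the mode flip being described is a single command each way (the pool name is a placeholder, and newer releases ask for --yes-i-really-mean-it when putting an active cache into forward mode):

$ ceph osd tier cache-mode hot-storage forward
  ... ride out the burst, optionally flushing the cache ...
$ rados -p hot-storage cache-flush-evict-all
$ ceph osd tier cache-mode hot-storage writeback

ceph osd dump should show the current cache_mode on the pool entry, which is an easy way to confirm which mode a tier is actually in.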