Re: [ceph-users] fio test rbd - single thread - qd1
One thing you can check is the CPU performance (the cpufreq governor in particular). On such light loads I've seen CPUs sitting in low-performance mode (slower clocks), giving MUCH worse performance results than with heavier loads. Try "cpupower monitor" on the OSD nodes in a loop and observe the core frequencies.

On 2019-03-19 3:17 p.m., jes...@krogh.cc wrote:
> Hi All.
>
> I'm trying to get a feel for how far we can stretch our Ceph cluster and for which applications. Parallelism works excellently, but baseline throughput is - perhaps - not what I would expect it to be.
>
> Luminous cluster running BlueStore - all OSD daemons have 16GB of cache. Fio files attached - 4KB random read and 4KB random write - the test file is "only" 1GB. Here I ONLY care about raw IOPS numbers.
>
> I have 2 pools, both 3x replicated: one backed by SSDs (14x 1TB S4510) and one backed by HDDs (84x 10TB).
>
> Network latency from the rbd mount host to one of the OSD hosts:
>
> --- ceph-osd01.nzcorp.net ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9189ms
> rtt min/avg/max/mdev = 0.084/0.108/0.146/0.022 ms
>
> SSD, randr:
> # grep iops read*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
>     N      Min      Max   Median        Avg     Stddev
> x  38  1727.07  2033.66  1954.71  1949.4789  46.592401
>
> SSD, randw:
> # grep iops write*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
>     N     Min     Max  Median        Avg     Stddev
> x  36  400.05  455.26  436.58  433.91417  12.468187
>
> The double (or triple) network penalty of course kicks in and delivers a lower throughput here. Are these performance numbers in the ballpark of what we'd expect?
>
> With a 1GB test file I would really expect this to be memory-cached in the OSD/BlueStore cache and thus deliver read IOPS closer to the theoretical max: 1s/0.108ms => 9.2K IOPS.
>
> Again on the write side - all OSDs are backed by a battery-backed write cache, thus writes should go directly into the memory of the controller. Still slower than reads - due to having to visit 3 hosts - but not this low?
>
> Suggestions for improvements? Are other people seeing similar results?
>
> For the HDD tests I get similar - surprisingly slow - numbers:
>
> # grep iops write*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
>     N    Min    Max  Median        Avg    Stddev
> x  38  36.91  118.8   69.14  72.926842  21.75198
>
> This should have the same performance characteristics as the SSDs, as the writes should be hitting the BBWC.
>
> # grep iops read*json | grep -v 0.00 | perl -ane'print $F[-1] . "\n"' | cut -d\, -f1 | ministat -n
>     N    Min     Max  Median        Avg    Stddev
> x  39  26.18  181.51   48.16  50.574872  24.01572
>
> Same here - should be cached in the BlueStore cache, as it is 16GB x 84 OSDs with a 1GB test file.
>
> Any thoughts - suggestions - insights?
>
> Jesper

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
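The theoretical ceiling quoted above (1s / 0.108 ms => 9.2K IOPS) follows directly from the measured RTT. A minimal sketch, assuming each QD=1 read costs exactly one network round trip and each 3x-replicated write at least two sequential ones (these are idealized assumptions; real OSDs add request processing time on top):

```shell
# Back-of-envelope QD=1 IOPS ceilings from the measured average RTT.
rtt_ms=0.108
awk -v rtt="$rtt_ms" 'BEGIN {
    # reads: one client->OSD round trip per IO
    printf "read ceiling:  %.0f IOPS\n", 1000 / rtt
    # writes: roughly primary hop + replica hop, serialized
    printf "write ceiling: %.0f IOPS\n", 1000 / (2 * rtt)
}'
```

The gap between this ~9.2K ceiling and the measured ~1.9K read IOPS is what points at per-request CPU cost - such as cores stuck at low clocks - rather than the network.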
Re: [ceph-users] ceph block - volume with RAID#0
On 2019-01-31 6:05 a.m., M Ranga Swami Reddy wrote:
> My thought was: a Ceph block volume with RAID#0 (meaning I mount Ceph block volumes to an instance/VM, and there I would like to configure these volumes with RAID0). Just to know: is anyone doing the same as above, and if yes, what are the constraints?

Exclusive lock on RBD images will kill any (theoretical) performance gains. Without exclusive lock, you lose some RBD features. Plus, using 2+ clients with a single image doesn't sound like a good idea.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
Re: [ceph-users] Fwd: what are the potential risks of mixed cluster and client ms_type
On 2018-11-19 8:17 a.m., Honggang(Joseph) Yang wrote:
> Thank you. But I encountered a problem: https://tracker.ceph.com/issues/37300 - I don't know if this is because of the mixed use of messenger types.

Have you done basic troubleshooting, like checking osd.179's networking? Usually this means firewall or network hardware issues.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
Re: [ceph-users] Fwd: what are the potential risks of mixed cluster and client ms_type
On 2018-11-19 5:05 a.m., Honggang(Joseph) Yang wrote:
> Hello, our cluster-side ms_type is async, while the client-side ms_type is simple. I want to know if this is a proper way to use it, and what the potential risks are.

None, as long as Ceph doesn't complain about the async messenger being experimental - both messengers use the same wire protocol.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
[ceph-users] RBD image "lightweight snapshots"
Hello,

At OVH we're heavily utilizing snapshots for our backup system, and we think there's an interesting optimization opportunity regarding snapshots that I'd like to discuss here.

The idea is to introduce a concept of "lightweight" snapshots - such a snapshot would not contain data, but only the information about what has changed on the image since it was created (so basically only the object map part of a snapshot).

Our backup solution (which seems to be a pretty common practice) is as follows:
1. Create a snapshot of the image we want to back up.
2. If there's a previous backup snapshot, export the diff and apply it to the backup image.
3. If there's no older snapshot, just do a full backup of the image.

This introduces one big issue: it enforces COW on the image, meaning that the original image's access latencies and consumed space increase. "Lightweight" snapshots would remove these inefficiencies - no COW performance and storage overhead.

At first glance, it seems like this could be implemented as an extension to the current RBD snapshot system, leaving out the machinery required for copy-on-write. In theory it could even co-exist with regular snapshots. Removal of these "lightweight" snapshots would be instant (or near instant).

So what do others think about this?

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
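For reference, the three-step backup flow described above maps onto standard rbd commands roughly like this. A sketch only - the pool, image, and snapshot names are made up for illustration, and this is not our exact tooling:

```shell
IMG=rbd/vm-disk-1            # hypothetical source image
DST=backup/vm-disk-1         # hypothetical backup image
NEW=backup-new
PREV=backup-prev

# Step 1: snapshot the image we want to back up
rbd snap create "${IMG}@${NEW}"

if rbd snap ls "${IMG}" | grep -q "${PREV}"; then
    # Step 2: incremental - ship only blocks changed between the snapshots
    rbd export-diff --from-snap "${PREV}" "${IMG}@${NEW}" - \
        | rbd import-diff - "${DST}"
else
    # Step 3: first run - full copy
    rbd export "${IMG}@${NEW}" - | rbd import - "${DST}"
fi
```

It's the "rbd snap create" step that turns on COW for the image; a "lightweight" snapshot would keep the export-diff capability without that cost.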
Re: [ceph-users] Safe to use rados -p rbd cleanup?
On 18-07-16 01:40 PM, Wido den Hollander wrote:
> On 07/15/2018 11:12 AM, Mehmet wrote:
>> hello guys,
>> in my production cluster i've many objects like this:
>> "#> rados -p rbd ls | grep 'benchmark'"
>> ...
>> benchmark_data_inkscope.example.net_32654_object1918
>> benchmark_data_server_26414_object1990
>> ...
>> Is it safe to run "rados -p rbd cleanup" or is there any risk for my images?
> The cleanup will require more than just that, as you will need to specify the benchmark prefix as well.

Yes and no. "rados -p rbd cleanup" will try to locate the benchmark metadata object and remove only the objects indexed by that metadata. "--prefix" is used when the metadata is lost or overwritten.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
Re: [ceph-users] Safe to use rados -p rbd cleanup?
On 18-07-15 11:12 AM, Mehmet wrote:
> hello guys,
> in my production cluster i've many objects like this:
> "#> rados -p rbd ls | grep 'benchmark'"
> ...
> benchmark_data_inkscope.example.net_32654_object1918
> benchmark_data_server_26414_object1990
> ...
> Is it safe to run "rados -p rbd cleanup" or is there any risk for my images?

It'll probably fail due to hostname mismatch ("rados bench write" produces objects with the caller's hostname embedded in the object name). Try what Wido suggested to clean up all benchmark-made objects. Otherwise yes, it's safe, as objects for rbd images are named differently.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
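A sketch of what Wido's suggestion looks like in practice, using the prefixes from the listing above:

```shell
# rados bench names its objects benchmark_data_<hostname>_<pid>_object<N>,
# so a plain "cleanup" run from a different host won't match them.
# Point cleanup at each prefix explicitly:
rados -p rbd cleanup --prefix benchmark_data_inkscope.example.net_32654
rados -p rbd cleanup --prefix benchmark_data_server_26414

# RBD image data objects are named rbd_data.<image id>.<object number>
# (format 2 images), so a benchmark prefix can't match them.
```

These commands need a live cluster, so treat them as an ops fragment rather than something to copy-paste blindly - list the matching objects first with "rados -p rbd ls | grep benchmark".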
Re: [ceph-users] SSDs for data drives
On 18-07-11 02:35 PM, David Blundell wrote:
> Hi,
> I'm looking at the 4TB Intel DC P4510 for data drives, running BlueStore with WAL, DB and data on the same drives. Has anyone had any good / bad experiences with them? As Intel's new data centre NVMe SSD it should be fast and reliable, but then I would have thought the same about the DC S4600 drives, which currently seem best to avoid…
> David

tl;dr - try to avoid TLC NAND flash at all costs if consistent write performance is your target.

Lately I was benchmarking the Intel DC P4500 (not the DC P4510, mind you) and I easily ran into performance issues. Both the DC P4500 and DC P4510 utilize 3D TLC NAND flash chips, so you won't get great speeds at very low queue depths. What's interesting about the DC P4500 is that it seems to use an SLC cache that provides fast qd=1 4k random writes, close to 300MB/s (or ~90k IOPS), but qd=1 4k random reads are from a totally different league (~38MB/s, ~10k IOPS). What is worse, it's not that difficult to exhaust that SLC cache, and then your overall write performance drops BADLY. In my case, I was getting RBD write IOPS varying from 10k to 40k depending on whether and for how long the write test had been running, and how heavy it was.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
Re: [ceph-users] Prioritize recovery over backfilling
On 18-06-07 12:43 PM, Caspar Smit wrote:
> Hi Piotr,
> Thanks for your answer! I've set nodown and now it doesn't mark any OSDs as down anymore :) Any tip on when everything is recovered/backfilled, for unsetting the nodown flag?

When all pgs are reported as active+clean (any scrubbing/deep scrubbing is fine).

> Shut down all activity to the ceph cluster before that moment?

Depends on whether that's actually possible in your case and what load your users generate - you have to decide.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
Re: [ceph-users] Prioritize recovery over backfilling
On 18-06-06 09:29 PM, Caspar Smit wrote:
> Hi all,
> We have a Luminous 12.2.2 cluster with 3 nodes and I recently added a node to it. osd-max-backfills is at the default 1, so backfilling didn't go very fast, but that doesn't matter.
>
> Once it started backfilling everything looked ok: ~300 pgs in backfill_wait, ~10 pgs backfilling (~the number of new OSDs). But I noticed the number of degraded objects increasing a lot. I presume a pg in backfill_wait state doesn't accept any new writes anymore - hence the increasing degraded objects?
>
> So far so good, but once in a while I noticed a random OSD flapping (they come back up automatically). This isn't because the disk is saturated, but a driver/controller/kernel incompatibility which 'hangs' the disk for a short time (scsi abort_task error in syslog). Investigating further, I noticed this was already the case before the node expansion.
>
> These OSDs flapping results in lots of pg states which are a bit worrying:
>
> 109 active+remapped+backfill_wait
>  80 active+undersized+degraded+remapped+backfill_wait
>  51 active+recovery_wait+degraded+remapped
>  41 active+recovery_wait+degraded
>  27 active+recovery_wait+undersized+degraded+remapped
>  14 active+undersized+remapped+backfill_wait
>   4 active+undersized+degraded+remapped+backfilling
>
> I think the recovery_wait is more important than the backfill_wait, so I'd like to prioritize these, because the recovery_wait was triggered by the flapping OSDs. Furthermore, the undersized ones should get absolute priority - or is that already the case?
>
> I was thinking about setting "nobackfill" to prioritize recovery instead of backfilling. Would that help in this situation? Or am I making it even worse then?
>
> ps. I tried increasing the heartbeat values for the OSDs to no avail; they still get flagged as down once in a while after a hiccup of the driver.

First of all, use the "nodown" flag so OSDs won't be marked down automatically, and unset it once everything backfills/recovers and settles for good -- note that there might be lingering osd down reports, so unsetting nodown might cause some of the problematic OSDs to be instantly marked as down. Second, since Luminous you can use "ceph pg force-recovery" to ask particular pgs to recover first, even if there are other pgs to backfill and/or recover.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
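The suggested sequence, sketched as commands (the pg ids are placeholders; force-recovery requires Luminous or newer):

```shell
ceph osd set nodown                 # keep flapping OSDs from being marked down
ceph pg force-recovery 1.28 1.4f    # example pg ids - pick your recovery_wait pgs
# ... wait until all pgs report active+clean ...
ceph osd unset nodown               # lingering down reports may now take effect
```

These operate on a live cluster, so they are an ops fragment, not a runnable script.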
Re: [ceph-users] Reduced productivity because of slow requests
On 18-06-06 01:57 PM, Grigory Murashov wrote:
> Hello cephers!
> I have a Luminous 12.2.5 cluster of 3 nodes, 5 OSDs each, with an S3 RGW. All OSDs are HDD. I often (about twice a day) have a slow request problem which reduces cluster efficiency. It can start both in the day peak and at night - doesn't matter. That's what I have in ceph health detail: https://avatars.mds.yandex.net/get-pdb/234183/9ba023d0-4352-4235-8826-76b412016e9f/s1200 [..]
> Since it starts at any time, but twice a day and for a fixed period of time, I assume it could be some recovery or rebalancing operation. I tried to find something in the osd logs, but there's nothing about it. Any thoughts on how to avoid it?

Have you tried disabling scrub and deep scrub?

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
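A quick way to test that hypothesis, assuming temporarily suspended scrubbing is acceptable:

```shell
ceph osd set noscrub
ceph osd set nodeep-scrub
# ... watch "ceph health detail" over the next day or two; if the
# twice-daily slow requests disappear, scrubbing was the culprit ...
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```

Don't leave the flags set long-term - scrubbing is what catches silent data corruption.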
Re: [ceph-users] a big cluster or several small
On 18-05-14 06:49 PM, Marc Boisis wrote:
> Hi,

Hello,

> Currently we have a 294 OSD (21 hosts / 3 racks) cluster with RBD clients only, and a single pool (size=3). We want to divide this cluster into several to minimize the risk in case of failure/crash. For example: a cluster for mail, another for the file servers, a test cluster...
> Do you think it's a good idea?

If reliability and data availability are your main concern, and you don't share data between clusters - yes.

> Do you have experience feedback on multiple clusters in production on the same hardware:
> - containers (LXD or Docker)
> - multiple clusters on the same host without virtualization (with ceph-deploy ... --cluster ...)
> - multiple pools
> ...
> Do you have any advice?

We're using containers to host OSDs, but we don't host multiple clusters on the same machine (in other words, a single physical machine hosts containers for one and the same cluster). We're using Ceph for RBD images, so having multiple clusters isn't a problem for us.

Our main reason for using multiple clusters is that Ceph has a bad reliability history when scaling up, and even now there are many unresolved issues (https://tracker.ceph.com/issues/21761 for example). By dividing a single large cluster into a few smaller ones, we reduce the impact on customers when things go fatally wrong - when one cluster goes down, or its performance drops to single-ESDI-drive levels due to recovery, the other clusters - and their users - are unaffected. For us this has already proved useful in the past.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
Re: [ceph-users] Integrating XEN Server : Long query time for "rbd ls -l" queries
On 18-04-25 02:29 PM, Marc Schöchlin wrote:
> Hello list,
> we are trying to integrate a storage repository in XenServer. (I also described the problem as an issue in the Ceph bug tracker: https://tracker.ceph.com/issues/23853)
> Summary: the slowness is a real pain for us, because it prevents the Xen storage repository from working efficiently. Gathering information for Xen pools with hundreds of virtual machines (using "--format json") would be a real pain... The high user time consumption and the really huge number of threads suggest that there is something really inefficient in the "rbd" utility.
> So what can I do to make "rbd ls -l" faster, or to get comparable information regarding the snapshot hierarchy?

Can you run this command with the extra argument "--rbd_concurrent_management_ops=1" and share the timing of that?

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
Re: [ceph-users] High apply latency
On 18-02-02 09:55 AM, Jakub Jaszewski wrote:
> Hi,
> So I have changed the merge & split settings to
> filestore_merge_threshold = 40
> filestore_split_multiple = 8
> and restarted all OSDs, host by host.
> Let me ask a question: although the pool default.rgw.buckets.data that was affected prior to the above change has higher write bandwidth, it is very random now. Writes are random for other pools (same for EC and replicated types) too; before the change, writes to replicated pools were much more stable. Reads from pools look fine and stable. Is this the result of the mentioned change? Is the PG directory structure updating, or...?

The HUGE problem with filestore is that it can't handle large numbers of small objects well. Sure, if the number only grows slowly (the case with RBD images) then it's probably not that noticeable, but in the case of 31 million objects that come and go at a random pace, you're going to hit frequent problems with filestore collections splitting and merging. Pre-Luminous, this happened on all osds hosting a particular collection at once; in Luminous there's "filestore split rand factor", which according to the docs:

> Description: A random factor added to the split threshold to avoid
> too many filestore splits occurring at once. See ``filestore split
> multiple`` for details. This can only be changed for an existing osd
> offline, via ceph-objectstore-tool's apply-layout-settings command.

You may want to try the above as well.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
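For a sense of scale: the commonly cited approximation for the filestore split point is filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects per PG subdirectory. Treat the formula as an approximation, not gospel, but with the settings quoted above it works out to:

```shell
merge_threshold=40
split_multiple=8
# approximate objects per PG subdirectory before a split is triggered
echo $(( split_multiple * merge_threshold * 16 ))
```

That's 5120 objects per subdirectory, so with tens of millions of objects arriving and leaving, splits and merges will keep happening at random times, which matches the "very random" write behavior observed.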
Re: [ceph-users] formatting bytes and object counts in ceph status ouput
On 18-01-02 11:43 AM, Jan Fajerski wrote:
> Hi lists,
> Currently the ceph status output formats all numbers with binary unit prefixes, i.e. 1MB equals 1048576 bytes and an object count of 1M equals 1048576 objects. I received a bug report from a user that printing object counts with a base-2 multiplier is confusing (I agree), so I opened a bug and https://github.com/ceph/ceph/pull/19117. In the PR discussion a couple of questions arose that I'd like to get some opinions on:
> - Should we print binary unit prefixes (MiB, GiB, ...), since that would be technically correct?

+1

> - Should counters (like object counts) be formatted with a base-10 multiplier or a multiplier with base 2?

+1

> My proposal would be to both use binary unit prefixes and use base-10 multipliers for counters. I think this aligns with user expectations as well as the relevant standard(s?).

Most users expect that non-size counters - like object counts - use base-10 units, and size counters use base-2 units. Ceph's "standard" of using base-2 everywhere was confusing for me as well initially, but I got used to it... Still, I wouldn't mind if this got sorted out once and for all.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
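The ambiguity under discussion, in numbers - "1M" differs by about 4.9% depending on the multiplier base, and the gap grows with each prefix step:

```shell
awk 'BEGIN {
    mib = 2 ^ 20      # 1 MiB, binary prefix (what ceph status uses today)
    mb  = 10 ^ 6      # 1 MB, decimal prefix (what users expect for counts)
    printf "1 MiB = %d bytes\n", mib
    printf "1 MB  = %d bytes\n", mb
    printf "relative difference: %.1f%%\n", (mib - mb) * 100 / mb
}'
```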
Re: [ceph-users] Snap trim queue length issues
On 17-12-15 03:58 PM, Sage Weil wrote:
> On Fri, 15 Dec 2017, Piotr Dałek wrote:
>> On 17-12-14 05:31 PM, David Turner wrote:
>>> I've tracked this in a much more manual way. I would grab a random subset [..]
>>
>> Thanks for your response, it pretty much confirms what I thought:
>> - users aware of the issue have their own hacks that don't need to be efficient or convenient;
>> - users unaware of the issue are, well, unaware, and at risk of serious service disruption once disk space is all used up.
>> Hopefully it'll be convincing enough for the devs. ;)
>
> Your PR looks great! I commented with a nit on the format of the warning itself.

I just addressed the comments.

> I expect this is trivial to backport to luminous; it will need to be partially reimplemented for jewel (with some care around the pg_stat_t and a different check for the jewel-style health checks).

Yeah, that's why I expected some resistance here and asked for comments. I really don't mind reimplementing this, it's not a big deal.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
Re: [ceph-users] Snap trim queue length issues
On 17-12-14 05:31 PM, David Turner wrote:
> I've tracked this in a much more manual way. I would grab a random subset [..]
> This was all on a Hammer cluster. The changes moving the snap trimming queues into the main osd thread made it so that our use case was not viable on Jewel until changes that happened after I left. It's exciting that this will actually be a reportable value from the cluster. Sorry that this story doesn't really answer your question, except to say that people aware of this problem likely have a workaround for it. However, I'm certain that a lot more clusters are impacted by this than are aware of it, and being able to quickly see that would be beneficial to troubleshooting.
> Backporting would be nice. I run a few Jewel clusters that have some VMs, and it would be nice to see how well the clusters handle snap trimming - but they are much less critical in how many snapshots they do.

Thanks for your response, it pretty much confirms what I thought:
- users aware of the issue have their own hacks that don't need to be efficient or convenient;
- users unaware of the issue are, well, unaware, and at risk of serious service disruption once disk space is all used up.

Hopefully it'll be convincing enough for the devs. ;)

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
[ceph-users] Snap trim queue length issues
Hi,

We recently ran into low disk space issues on our clusters, and it wasn't because of actual data. On the affected clusters we're hosting VMs and volumes, so naturally there are snapshots involved. For some time we observed increased disk space usage that we couldn't explain, as there was a discrepancy between what Ceph reported and the actual space used on disks. We finally found out that snap trim queues were both long and not getting any shorter, and decreasing snap trim sleep and increasing max concurrent snap trims helped reverse the trend - we're safe now.

The problem is, we hadn't been aware of this issue for some time, and there's no easy (and fast[1]) way to check it. I made a pull request[2] that makes snap trim queue lengths available to monitoring tools and also generates a health warning when things get out of control, so an admin can act before hell breaks loose.

My question is: how many Jewel users would be interested in such a feature? There are a lot of changes between Luminous and Jewel, so it's not going to be a straight backport, but it's not a big patch either, so I wouldn't mind doing it myself. But having some support from users would be helpful in pushing this into the next Jewel release.

Thanks!

[1] one of our guys hacked a bash one-liner that printed out snap trim queue lengths for all pgs, but a full run takes over an hour to complete on a cluster with over 20k pgs...
[2] https://github.com/ceph/ceph/pull/19520

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
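For the curious, the slow per-pg check mentioned in [1] can be approximated like this. An illustrative reconstruction, not the original script - the JSON field names and shapes vary between releases, so verify against your own "ceph pg <pgid> query" output first:

```shell
# Print the snap trim queue of every pg. This is one round trip per pg,
# which is exactly why it takes over an hour on a 20k-pg cluster - and
# why built-in reporting of the queue length would help.
for pg in $(ceph pg dump pgs_brief -f json 2>/dev/null | jq -r '.[].pgid'); do
    echo "$pg $(ceph pg "$pg" query -f json | jq -r '.snap_trimq')"
done
```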
Re: [ceph-users] ceph.conf tuning ... please comment
On 17-12-06 07:01 AM, Stefan Kooman wrote:
> [osd]
> # http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
> osd crush update on start = false
> osd heartbeat interval = 1     # default 6
> osd mon heartbeat interval = 10    # default 30
> osd mon report interval min = 1    # default 5
> osd mon report interval max = 15   # default 120
>
> The OSDs would almost immediately see a "cut off" to their partner OSDs in the placement group. By default they wait 6 seconds before sending their report to the monitors. During our analysis this was exactly the time the monitors were holding an election. By tuning all of the above we could get them to send their reports faster, and by the time the election process was finished, the monitors would handle the reports from the OSDs and come to the conclusion that a DC is down, flag it down, and allow normal client IO again.
> Of course, stability and data safety are most important to us, so if any of these settings makes you worry, please let us know.

Heartbeats, especially in Luminous, are quite heavy bandwidth-wise if you have a lot of OSDs in the cluster. You may want to keep "osd heartbeat interval" at 3 at the lowest, or if that's not acceptable, then at least set "osd heartbeat min size" to 0.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
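If you do lower the heartbeat interval, the bandwidth mitigation suggested above would look like this in ceph.conf. A sketch - validate the values against your own failure-detection requirements:

```
[osd]
# keep this at 3 or above - going lower multiplies heartbeat traffic
osd heartbeat interval = 3
# stop padding heartbeat packets (by default they're padded to ~2KB
# each, which adds up quickly with many OSDs)
osd heartbeat min size = 0
```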
Re: [ceph-users] Single disk per OSD ?
On 17-12-01 12:23 PM, Maged Mokhtar wrote:
> Hi all,
> I believe most existing setups use 1 disk per OSD. Is this going to be the most common setup in the future? With the move to LVM, will this favor the use of multiple disks per OSD? On the other side, I also see NVMe vendors recommending multiple OSDs (2, 4) per disk, as disks are getting too fast for a single OSD process. Can anyone shed some light/recommendations on this please?

You don't put more than one OSD on a spinning disk because access times will kill your performance - they already do [kill your performance], and asking HDDs to do double/triple/quadruple/... duty is only going to make it far worse. On the other hand, SSD drives have access times so short that they're most often bottlenecked by SSD users and not the SSD itself, so it makes perfect sense to put 2-4 OSDs on one SSD.

LVM isn't going to change much in that pattern. It may be easier to set up RAID0 HDD OSDs, but that's a questionable use case, and OSDs with JBODs under them are counterproductive (a single disk failure would be caught by Ceph, but replacing failed drives will be more difficult -- plus, JBOD OSDs significantly extend the damage area once such an OSD fails).

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
Re: [ceph-users] ceph-disk is now deprecated
On 17-11-28 09:12 AM, Wido den Hollander wrote:
>> Op 27 november 2017 om 14:36 schreef Alfredo Deza <ad...@redhat.com>:
>> For the upcoming Luminous release (12.2.2), ceph-disk will be officially in 'deprecated' mode (bug fixes only). A large banner with deprecation information has been added, which will try to raise awareness.
>
> As much as I like ceph-volume and the work being done, is it really a good idea to use a minor release to deprecate a tool? Can't we just introduce ceph-volume and deprecate ceph-disk at the release of M? Because when you upgrade to 12.2.2, suddenly existing integrations will have deprecation warnings thrown at them while they haven't upgraded to a new major version. As ceph-deploy doesn't support ceph-volume either, I don't think it's a good idea to deprecate ceph-disk right now.
> How do others feel about this?

Same, although we don't have a *big* problem with this (we haven't upgraded to Luminous yet, so we can skip to the next point release and move to ceph-volume together with Luminous). It's still a problem, though - now we have more of our infrastructure to migrate and test, meaning even more delays in production upgrades.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
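For those planning the migration, the ceph-volume replacements look roughly like this. A sketch only - check the ceph-volume documentation for your release, as subcommands and flags changed between versions, and the device/path names below are examples:

```shell
# new OSDs: replaces "ceph-disk prepare" / "ceph-disk activate"
ceph-volume lvm create --data /dev/sdb

# existing ceph-disk OSDs: capture their metadata so they can be
# managed (and activated at boot) by ceph-volume without redeploying
ceph-volume simple scan /var/lib/ceph/osd/ceph-0
ceph-volume simple activate --all
```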
Re: [ceph-users] Restart is required?
On 17-11-16 05:34 PM, Jaroslaw Owsiewski wrote:
> Thanks for your reply and information. Yes, we are using filestore. Will this still work in Luminous (http://docs.ceph.com/docs/master/rados/configuration/filestore-config-ref/):
> "filestore merge threshold
> Description: Min number of files in a subdir before merging into parent. NOTE: A negative value means to disable subdir merging"
> Will a variable definition like "filestore_merge_threshold = -50" (negative value) work? (In Jewel it worked like a charm.)

Yes, I don't see any changes to that.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
Re: [ceph-users] Restart is required?
On 17-11-16 02:44 PM, Jaroslaw Owsiewski wrote:
> Hi, what exactly does this message mean:
> filestore_split_multiple = '24' (not observed, change may require restart)
> It happened after the command:
> # ceph tell osd.0 injectargs '--filestore-split-multiple 24'

It means that "filestore split multiple" is not observed for runtime changes: the new value will be stored in the osd.0 process memory, but not used at all.

> Do I really need to restart the OSD to make the change take effect?
> ceph version 12.2.1 () luminous (stable)

Yes.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
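In practice that means persisting the value and restarting the daemon. A sketch - the systemd unit name assumes a systemd-based deployment:

```shell
# the injected value is accepted but ignored at runtime:
ceph tell osd.0 injectargs '--filestore-split-multiple 24'

# persist it in ceph.conf on the OSD host...
#   [osd]
#   filestore split multiple = 24
# ...then restart the OSD so it takes effect:
systemctl restart ceph-osd@0
```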
Re: [ceph-users] Libvirt hosts freeze after ceph osd+mon problem
On 17-11-07 12:02 AM, Jan Pekař - Imatic wrote:
> Hi,
> I'm using Debian Stretch with ceph 12.2.1-1~bpo80+1 and qemu 1:2.8+dfsg-6+deb9u3. I'm running 3 nodes with 3 monitors and 8 osds, all on IPv6.
> When I tested the cluster, I hit a strange and severe problem. On the first node I'm running qemu guests with librados disk connections to the cluster, with all 3 monitors mentioned in the connection. On the second node I stopped the mon and osd with the command:
> kill -STOP MONPID OSDPID
> Within one minute all my qemu guests on the first node froze, so they didn't even respond to ping. [..]

Why would you want to *stop* (as in, freeze) a process instead of killing it? Anyway, with the processes still there, it may take a few minutes before the cluster realizes that the daemons are stopped and kicks them out of the cluster, restoring normal behavior (assuming correctly set crush rules).

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
Re: [ceph-users] rbd rm snap on image with exclusive lock
On 17-10-25 03:30 PM, Jason Dillaman wrote:
> Hmm, hard to say off the top of my head. If you could enable "debug librbd = 20" logging on the buggy client that owns the lock, create a new snapshot, and attempt to delete it, it would be interesting to verify that the image is being properly refreshed.

I'd love to, but that would require us to restart that client - not an option. We'll try to reproduce this somehow anyway and let you know if something interesting shows up.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
Re: [ceph-users] rbd rm snap on image with exclusive lock
On 17-10-25 02:39 PM, Jason Dillaman wrote: That log is showing that a snap remove request was made from a client that couldn't acquire the lock to a client that currently owns the lock. The client that currently owns the lock responded w/ an -ENOENT error that the snapshot doesn't exist. Depending on the maintenance operation requested, different error codes are filtered out to handle the case where Ceph double (or more) delivers the request message to the lock owner. Normally this isn't an issue since the local client pre-checks the image state before sending the RPC message (i.e. snap remove will first locally ensure the snap exists and respond w/ -ENOENT if it doesn't). Therefore, in this case, the question is who is this rogue client that still owns the lock and is responding to snap remove requests, but hasn't refreshed its state to know that the snapshot exists. Thanks, that makes things clear. Seems like we have some Cinder instances using Infernalis (9.2.1) librbd. Are you aware of any bugs in 9.2.x that could cause such behavior? We've seen this for the first time... -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rbd rm snap on image with exclusive lock
693 7f752da04700 20 librbd::AioImageRequestWQ: clear_require_lock_on_read
2017-10-24 09:50:29.654694 7f752da04700  5 librbd::AioImageRequestWQ: unblock_writes: 0x7f7557932f50, num=0
2017-10-24 09:50:29.654697 7f752da04700 10 librbd::image::CloseRequest: 0x7f7557939090 handle_shut_down_exclusive_lock: r=0
2017-10-24 09:50:29.654700 7f752da04700 10 librbd::image::CloseRequest: 0x7f7557939090 send_flush_readahead
2017-10-24 09:50:29.654702 7f752da04700 10 librbd::image::CloseRequest: 0x7f7557939090 handle_flush_readahead: r=0
2017-10-24 09:50:29.654702 7f752da04700 10 librbd::image::CloseRequest: 0x7f7557939090 send_shut_down_cache
2017-10-24 09:50:29.654789 7f752da04700 10 librbd::image::CloseRequest: 0x7f7557939090 handle_shut_down_cache: r=0
2017-10-24 09:50:29.654793 7f752da04700 10 librbd::image::CloseRequest: 0x7f7557939090 send_flush_op_work_queue
2017-10-24 09:50:29.654796 7f752da04700 10 librbd::image::CloseRequest: 0x7f7557939090 handle_flush_op_work_queue: r=0
2017-10-24 09:50:29.654799 7f752da04700 10 librbd::image::CloseRequest: 0x7f7557939090 handle_flush_image_watcher: r=0
2017-10-24 09:50:29.654812 7f752da04700 10 librbd::ImageState: 0x7f7557933d90 handle_close: r=0
According to the log above, the exclusive lock code set the error code to EBUSY, which makes sense considering that the client owns the lock and is still alive. Then it gets translated to EAGAIN, which again makes sense (the client may go away at some point and just drop the lock). Then, all of a sudden, that gets translated to ENOENT, which gets swallowed by the filter in C_InvokeAsyncRequest::finish(). These two things don't make any sense at all. So, two questions: 1. Why is it possible to create snapshots, but not remove them, when an exclusive lock on the image is taken? (jewel bug?) 2. Why is the error transformed and then ignored? -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] A new SSD for journals - everything sucks?
On 17-10-11 09:50 AM, Josef Zelenka wrote: Hello everyone, lately we've had issues with buying the SSDs we use for journaling - the Kingston V300 (Kingston stopped making them) - so we decided to start using a different model and started researching which one would give the best price/value for us. We compared five models to check if they fit our needs - SSDNow V300, HyperX Fury, SSDNow KC400, SSDNow UV400 and SSDNow A400. The best one is still the V300, with the highest IOPS at 59,001. Second best, and still usable, was the HyperX Fury with 45,000 IOPS. The other three had terrible results; the max IOPS we got were around 13,000 with the dsync and direct flags. We also tested Samsung SSDs (the EVO series) and got similarly bad results. To get to the root of my question - I am pretty sure we are not the only ones affected by the V300's death. Is there anyone else out there with some benchmarking data/knowledge about good price/performance SSDs for Ceph journaling? I can also share the complete benchmarking data my coworker made, if someone is interested. Never, absolutely never pick consumer-grade SSDs for a Ceph cluster, and in particular never pick a drive with a low TBW rating for journals - Ceph is going to kill it within a few months. Besides, consumer-grade drives are not optimized for Ceph-like/enterprise workloads, resulting in weird performance characteristics: tens of thousands of IOPS for the first few seconds, then dropping to 1K IOPS (typical for drives with TLC NAND and an SLC NAND cache), or performing reasonably until some write queue depth is hit, then degrading badly (an underperforming controller), or killing your OSD journals on power failure (no BBU or capacitors to power the drive while flushing when the PSU goes down). 
You may want to look at this: https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
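[For reference, the "dsync and direct flags" test from the article above is commonly expressed as an fio job along these lines - a sketch only; the device path and runtime are placeholders, and the test destroys data on the target device:]

```ini
; WARNING: writes directly to the raw device -- destroys its contents
[journal-test]
filename=/dev/sdX     ; placeholder -- the SSD under test
rw=write
bs=4k
numjobs=1
iodepth=1
direct=1              ; O_DIRECT, bypass the page cache
sync=1                ; O_SYNC on every write, like journal writes
runtime=60
time_based=1
```

Drives that keep their IOPS up under this queue-depth-1 synchronous pattern are the ones suitable as journal devices; many consumer drives that look fast in spec sheets collapse here.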
Re: [ceph-users] why sudden (and brief) HEALTH_ERR
On 17-10-04 08:51 AM, lists wrote: Hi, Yesterday I chowned our /var/lib/ceph to ceph, to completely finalize our jewel migration, and noticed something interesting. After I brought back up the OSDs I had just chowned, the system had some recovery to do. During that recovery, the system went to HEALTH_ERR for a short moment. See below for consecutive ceph -s outputs: [..]
root@pm2:~# ceph -s
    cluster 1397f1dc-7d94-43ea-ab12-8f8792eee9c1
     health HEALTH_ERR
            2 pgs are stuck inactive for more than 300 seconds
            ^^ that.
            761 pgs degraded
            2 pgs recovering
            181 pgs recovery_wait
            2 pgs stuck inactive
            273 pgs stuck unclean
            543 pgs undersized
            recovery 1394085/8384166 objects degraded (16.628%)
            4/24 in osds are down
            noout flag(s) set
     monmap e3: 3 mons at {0=10.10.89.1:6789/0,1=10.10.89.2:6789/0,2=10.10.89.3:6789/0}
            election epoch 256, quorum 0,1,2 0,1,2
     osdmap e10230: 24 osds: 20 up, 24 in; 543 remapped pgs
            flags noout,sortbitwise,require_jewel_osds
      pgmap v36531146: 1088 pgs, 2 pools, 10703 GB data, 2729 kobjects
            32724 GB used, 56656 GB / 89380 GB avail
            1394085/8384166 objects degraded (16.628%)
                 543 active+undersized+degraded
                 310 active+clean
                 181 active+recovery_wait+degraded
                  26 active+degraded
                  13 active
                   9 activating+degraded
                   4 activating
                   2 active+recovering+degraded
recovery io 133 MB/s, 37 objects/s
  client io 64936 B/s rd, 9935 kB/s wr, 0 op/s rd, 942 op/s wr
[..] It was only very brief, but it did worry me a bit; fortunately we went back to the expected HEALTH_WARN very quickly and everything finished fine, so I guess there is nothing to worry about. But I'm curious: can anyone explain WHY we got a brief HEALTH_ERR? No smart errors, apply and commit latency are all within the expected ranges; the system basically is healthy. Curious :-) Since Jewel (AFAIR), when (re)starting OSDs, pg status is reset to "never contacted", resulting in "pgs are stuck inactive for more than 300 seconds" being reported until the OSDs re-establish connections between themselves. 
-- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Note about rbd_aio_write usage
On 17-07-06 09:39 PM, Jason Dillaman wrote: On Thu, Jul 6, 2017 at 3:25 PM, Piotr Dałek <bra...@predictor.org.pl> wrote: Is that deep copy an equivalent of what Jewel librbd did at an unspecified point in time, or an extra one? It's an equivalent / replacement -- not an additional copy. This was changed to support the scatter/gather IO API methods which the latest version of QEMU now directly utilizes (eliminating the need for a bounce-buffer copy on every IO). OK, that makes more sense now. Once we get that librados issue resolved, that initial librbd IO buffer copy will be dropped and librbd will become zero-copy for IO (at least that's the goal). That's why I am recommending that you just assume normal AIO semantics and not try to optimize for Luminous, since perhaps the next release will have that implementation detail of the extra copy removed. Is this: https://github.com/yuyuyu101/ceph/commit/794b49b5b860c538a349bdadb16bb6ae97ad9c20#commitcomment-15707924 the issue you mention? Because at this point I'm considering switching to the C++ API and passing a static bufferptr buried in my bufferlist instead of having the extra copy done by the C API's rbd_aio_write (that way I'd at least control the allocations). -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Note about rbd_aio_write usage
On 17-07-06 04:40 PM, Jason Dillaman wrote: On Thu, Jul 6, 2017 at 10:22 AM, Piotr Dałek <piotr.da...@corp.ovh.com> wrote: So I really see two problems here: a lack of API docs and a backwards-incompatible change in API behavior. Docs are always in need of updates, so any pull requests would be greatly appreciated. However, I disagree that the behavior has substantively changed -- it was always possible for pre-Luminous librbd to (sometimes) copy the buffer before the "rbd_aio_write" method completed. But that copy was buried somewhere deep in the librbd internals, and - looking at the Jewel version - most would assume that it's not really copied and that the user is responsible for keeping the buffer intact until the write is complete. An API user doesn't really care about what's going on internally; it's beyond their control. With Luminous, this behavior is more consistent -- but in a future release memory may be zero-copied. If your application can properly conform to the (unwritten) contract that the buffers should remain unchanged, there would be no need for the application to pre-copy the buffers. So far I am forced to do a copy anyway (see below). The question is whether it's me doing it, or librbd. It doesn't make sense to have both of us do the same -- especially since this is going to handle tens of terabytes of data, which for 10TB of data means at least 83,886,080 memory allocations, releases and copies, plus 2,684,354,560 page faults (assuming 4KB pages) -- and these are best-case numbers assuming a 128KB I/O size. What I understand you expect from me is to have at least the number of memory copies doubled and to push not "just" 20TB over the memory bus (reading 10TB from one buffer and writing those 10TB to another), but 40. 
In other words, if I'd written my code based on how Jewel librbd works, there would be no real issue, apart from the fact that suddenly my program would consume more memory and burn more CPU cycles once librbd was upgraded to Luminous -- which, considering the amount of data, would be a noticeable change. If the libfuse implementation requires that the memory is no longer in use by the time you return control to it (i.e. it's a synchronous API and you are using async methods), you will always need to copy it. Yes, libfuse expects that once I leave the write entry point, it is free to do anything it wishes with previously provided buffers -- and that's what it actually does. The C++ API allows you to control the copying, since you need to pass "bufferlist"s to the API methods, and since these utilize a reference counter, there is no internal copying within librbd / librados. How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy the buffer, with the assumption that it won't change) and, instead of constructing a bufferlist containing a bufferptr to copied data, construct a bufferlist containing a bufferptr made with create_static(user_buffer)? -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
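[A quick sanity check of the copy and page-fault figures quoted above, under the same assumptions as in the post (1 TB = 2^40 bytes, 128 KB I/Os, 4 KB pages):]

```python
TiB = 2 ** 40

data = 10 * TiB          # 10 TB pushed through the FUSE module
io_size = 128 * 1024     # best-case 128 KB I/O size
page = 4096              # 4 KB pages

copies = data // io_size   # one allocation/copy/free per I/O
faults = data // page      # worst case: one fault per touched page

print(copies)   # 83886080
print(faults)   # 2684354560
```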
Re: [ceph-users] Note about rbd_aio_write usage
On 17-07-06 03:43 PM, Jason Dillaman wrote: I've learned the hard way that pre-Luminous, even if it copies the buffer, it does so too late. In my specific case, my FUSE module enters the write call and issues rbd_aio_write there, then exits the write - expecting the buffer provided by FUSE to be copied by librbd (as happens now in Luminous). I didn't expect that this was new behavior, and once my code was deployed against Jewel librbd, it started to consistently corrupt data during writes. The correct (POSIX-style) program behavior is to treat the buffer as immutable until the IO operation completes. It is never safe to assume the buffer can be re-used while the IO is in flight. You should not add any logic that assumes the buffer is safely copied prior to the completion of the IO. Indeed, most systems - not only POSIX ones - supporting asynchronous writes expect the buffer to remain unchanged until the write is done. I wasn't sure how rbd_aio_write operates and consulted the source, as there are no docs for the API itself. That intermediate copy in librbd deceived me -- because if librbd copies the data, why should I do the same before calling rbd_aio_write? To stress-test the memory bus? So I really see two problems here: a lack of API docs and a backwards-incompatible change in API behavior. -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Note about rbd_aio_write usage
On 17-07-06 03:03 PM, Jason Dillaman wrote: On Thu, Jul 6, 2017 at 8:26 AM, Piotr Dałek <piotr.da...@corp.ovh.com> wrote: Hi, If you're using "rbd_aio_write()" in your code, be aware of the fact that before the Luminous release, this function expects the buffer to remain unchanged until the write op ends, while on Luminous and later this function internally copies the buffer, allocating memory where needed and freeing it once the write is done. Pre-Luminous also copies the provided buffer when using the C API -- it just copies it at a later point and not immediately. The eventual goal is to eliminate the copy completely, but that requires some additional plumbing work deep down in the librados messenger layer. I've learned the hard way that pre-Luminous, even if it copies the buffer, it does so too late. In my specific case, my FUSE module enters the write call and issues rbd_aio_write there, then exits the write - expecting the buffer provided by FUSE to be copied by librbd (as happens now in Luminous). I didn't expect that this was new behavior, and once my code was deployed against Jewel librbd, it started to consistently corrupt data during writes. -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Note about rbd_aio_write usage
Hi, If you're using "rbd_aio_write()" in your code, be aware of the fact that before the Luminous release, this function expects the buffer to remain unchanged until the write op ends, while on Luminous and later this function internally copies the buffer, allocating memory where needed and freeing it once the write is done. If you write an app that may need to work with Luminous *and* pre-Luminous versions of librbd, you may want to add a version check (using rbd_version(), for example) so that either your buffers won't change before the write is done, or you don't incur a penalty for unnecessary memory allocation and copying on your side (though it's probably unavoidable with the current state of Luminous). -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
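[The practical consequence of the two contracts can be shown with a toy model - this is not librbd, just a minimal simulation of an async write whose worker reads the submitted buffer only later. Under the pre-Luminous contract (no copy at submission), reusing the buffer before completion corrupts the write; with copy-on-submit (the Luminous behavior), early reuse is safe:]

```python
import threading
import time

class FakeAioWrite:
    """Toy async write: a worker thread reads the buffer *after* submission."""
    def __init__(self, buf, copy_on_submit):
        # Luminous-style behavior: snapshot the buffer at submission time.
        self.buf = bytes(buf) if copy_on_submit else buf
        self.stored = None
        self.t = threading.Thread(target=self._worker)
        self.t.start()

    def _worker(self):
        time.sleep(0.05)               # the write is still "in flight"
        self.stored = bytes(self.buf)  # data actually leaves the buffer here

    def wait(self):
        self.t.join()
        return self.stored

buf = bytearray(b"GOOD")
op = FakeAioWrite(buf, copy_on_submit=False)
buf[:] = b"JUNK"          # caller reuses the buffer before completion
print(op.wait())          # b'JUNK' -- corrupted under the no-copy contract

buf = bytearray(b"GOOD")
op = FakeAioWrite(buf, copy_on_submit=True)
buf[:] = b"JUNK"
print(op.wait())          # b'GOOD' -- safe with copy-on-submit
```

In either case, the portable habit is the one recommended in this thread: treat the buffer as immutable until the completion fires.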
Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs
On 17-06-21 03:24 PM, Sage Weil wrote: On Wed, 21 Jun 2017, Piotr Dałek wrote: On 17-06-14 03:44 PM, Sage Weil wrote: On Wed, 14 Jun 2017, Paweł Sadowski wrote: On 04/13/2017 04:23 PM, Piotr Dałek wrote: On 04/06/2017 03:25 PM, Sage Weil wrote: On Thu, 6 Apr 2017, Piotr Dałek wrote: [snip] I think the solution here is to use sparse_read during recovery. The PushOp data representation already supports it; it's just a matter of skipping the zeros. The recovery code could also have an option to check for fully-zero regions of the data and turn those into holes as well. For ReplicatedBackend, see build_push_op(). So far it turns out that there's an even easier solution: we just enabled "filestore seek hole" on a test cluster and that seems to fix the problem for us. We'll see if fiemap works too. Is it safe to enable "filestore seek hole"? Are there any tests that verify that everything related to RBD works fine with this enabled? Can we make this enabled by default? We would need to enable it in the qa environment first. The risk here is that users run a broad range of kernels and we are exposing ourselves to any bugs in any kernel version they may run. I'd prefer to leave it off by default. Is that a common regression? If not, we could blacklist particular kernels and call it a day. We can enable it in the qa suite, though, which covers centos7 (latest kernel) and ubuntu xenial and trusty. +1. Do you need some particular PR for that? Sure. How about a patch that adds the config option to several of the files in qa/suites/rados/thrash/thrashers? OK. I tested a few of our production images and it seems that about 30% of the data is sparse. This will be lost on any cluster-wide event (add/remove nodes, PG grow, recovery). How is/will this be handled in BlueStore? BlueStore exposes the same sparseness metadata that enabling the filestore seek hole or fiemap options does, so it won't be a problem there. 
I think the only thing that we could potentially add is zero detection on writes (so that explicitly writing zeros consumes no space). We'd have to be a bit careful measuring the performance impact of that check on non-zero writes. I saw that RBD (librbd) already does that, replacing writes with discards when the buffer contains only zeros. Some code doing the same in librados could be added and it shouldn't impact performance much; the current implementation of mem_is_zero is fast and shouldn't be a big problem. I'd rather not have librados silently translating requests; I think it makes more sense to do any zero checking in bluestore. _do_write_small and _do_write_big already break writes into (aligned) chunks; that would be an easy place to add the check. That leaves out filestore. And while I get your point, doing it at the librados level would reduce network usage for zeroed-out regions as well, and the check could be done just once, not replica_size times... -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
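[The zero-detection being discussed is cheap to sketch: scan the write buffer and either drop it entirely (turn it into a discard/hole) when it is fully zero, or, for the recovery case, split it into non-zero extents. A toy model in Python of the buffer-scanning part only - none of the actual librbd/bluestore plumbing:]

```python
def is_zero(buf: bytes) -> bool:
    # the mem_is_zero idea: compare against an equal-length zero block
    return buf == bytes(len(buf))

def nonzero_extents(buf: bytes, chunk: int = 4096):
    """Yield (offset, data) for chunks worth sending; zero chunks become holes."""
    for off in range(0, len(buf), chunk):
        piece = buf[off:off + chunk]
        if not is_zero(piece):
            yield off, piece

data = bytes(4096) + b"\x01" * 4096 + bytes(4096)
print(is_zero(bytes(8192)))                        # True
print([off for off, _ in nonzero_extents(data)])   # [4096]
```

As the thread notes, the trade-off is where this scan runs: doing it client-side saves network traffic and runs once instead of replica_size times, at the cost of the client silently rewriting requests.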
Re: [ceph-users] Prioritise recovery on specific PGs/OSDs?
On 17-06-20 02:44 PM, Richard Hesketh wrote: Is there a way, either by individual PG or by OSD, that I can prioritise backfill/recovery on a set of PGs which are currently particularly important to me? For context, I am replacing disks in a 5-node Jewel cluster on a node-by-node basis - mark out the OSDs on a node, wait for them to clear, replace the OSDs, bring them up and in, mark out the OSDs on the next set, etc. I've done my first node, but the significant CRUSH map changes mean most of my data is moving. I only currently care about the PGs on my next set of OSDs to replace - I don't care about the other remapped PGs settling, because they're only going to end up moving around again after I do the next set of disks. I do want the PGs specifically on the OSDs I am about to replace to backfill, because I don't want to compromise data integrity by downing them while they host active PGs. If I could specifically prioritise the backfill on those PGs/OSDs, I could get on with replacing disks without worrying about causing degraded PGs. I'm in a situation right now where there are merely a couple of dozen PGs on the disks I want to replace, all remapped and waiting to backfill - but there are 2200 other PGs also waiting to backfill because they've moved around too, and it's extremely frustrating to sit waiting to see when the ones I care about will finally be handled so I can get on with replacing those disks. You could prioritize recovery on a pool, if that works for you (as others wrote), or +1 this PR: https://github.com/ceph/ceph/pull/13723 (it's a bit outdated as I'm constantly low on time, but I promise to push it forward!). -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Sparse file info in filestore not propagated to other OSDs
On 17-06-14 03:44 PM, Sage Weil wrote: On Wed, 14 Jun 2017, Paweł Sadowski wrote: On 04/13/2017 04:23 PM, Piotr Dałek wrote: On 04/06/2017 03:25 PM, Sage Weil wrote: On Thu, 6 Apr 2017, Piotr Dałek wrote: [snip] I think the solution here is to use sparse_read during recovery. The PushOp data representation already supports it; it's just a matter of skipping the zeros. The recovery code could also have an option to check for fully-zero regions of the data and turn those into holes as well. For ReplicatedBackend, see build_push_op(). So far it turns out that there's an even easier solution: we just enabled "filestore seek hole" on a test cluster and that seems to fix the problem for us. We'll see if fiemap works too. Is it safe to enable "filestore seek hole"? Are there any tests that verify that everything related to RBD works fine with this enabled? Can we make this enabled by default? We would need to enable it in the qa environment first. The risk here is that users run a broad range of kernels and we are exposing ourselves to any bugs in any kernel version they may run. I'd prefer to leave it off by default. Is that a common regression? If not, we could blacklist particular kernels and call it a day. We can enable it in the qa suite, though, which covers centos7 (latest kernel) and ubuntu xenial and trusty. +1. Do you need some particular PR for that? I tested a few of our production images and it seems that about 30% of the data is sparse. This will be lost on any cluster-wide event (add/remove nodes, PG grow, recovery). How is/will this be handled in BlueStore? BlueStore exposes the same sparseness metadata that enabling the filestore seek hole or fiemap options does, so it won't be a problem there. I think the only thing that we could potentially add is zero detection on writes (so that explicitly writing zeros consumes no space). We'd have to be a bit careful measuring the performance impact of that check on non-zero writes. 
I saw that RBD (librbd) already does that, replacing writes with discards when the buffer contains only zeros. Some code doing the same in librados could be added and it shouldn't impact performance much; the current implementation of mem_is_zero is fast and shouldn't be a big problem. -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Socket errors, CRC, lossy con messages
On 04/10/2017 08:16 PM, Alex Gorbachev wrote: I am trying to understand the cause of a problem we started encountering a few weeks ago. There are 30 or so messages per hour on OSD nodes of the type: ceph-osd.33.log:2017-04-10 13:42:39.935422 7fd7076d8700 0 bad crc in data 2227614508 != exp 2469058201 and 2017-04-10 13:42:39.939284 7fd722c42700 0 -- 10.80.3.25:6826/5752 submit_message osd_op_reply(1826606251 rbd_data.922d95238e1f29.000101bf [set-alloc-hint object_size 16777216 write_size 16777216,write 6328320~12288] v103574'18626765 uv18626765 ondisk = 0) v6 remote, 10.80.3.216:0/1934733503, failed lossy con, dropping message 0x3b55600 [..] Is that happening on the entire cluster, or just on specific OSDs? This is a clear indication of data corruption: in the above example, osd.33 calculated the crc of a received data block and found that it doesn't match what was precalculated by the sending side. Try gathering some more examples of such crc errors and isolate the osd/host that sends the malformed data, then run the usual diagnostics, like a memory test, on that machine. -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
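[Mechanically, the "bad crc in data X != exp Y" message is just this check failing: the sender stamps the message with a checksum of the payload and the receiver recomputes it on arrival. Ceph uses crc32c for this; plain crc32 from the Python standard library is used below purely as an illustration of the principle:]

```python
import zlib

def make_frame(payload: bytes):
    # sender side: checksum computed over the data block before sending
    return payload, zlib.crc32(payload)

def check_frame(payload: bytes, crc: int) -> bool:
    # receiver side: logs "bad crc in data X != exp Y" when this is False
    return zlib.crc32(payload) == crc

payload, crc = make_frame(b"rbd_data.922d95238e1f29")
print(check_frame(payload, crc))       # True

corrupted = b"X" + payload[1:]         # a single corrupted byte in transit
print(check_frame(corrupted, crc))     # False -> the message is dropped
```

A mismatch means the bytes changed somewhere between the two checksum computations - bad RAM, a flaky NIC, or a bad cable/switch port - which is why the advice above is to isolate the sending host and test its hardware.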
Re: [ceph-users] slow perfomance: sanity check
On 04/06/2017 09:34 AM, Stanislav Kopp wrote: Hello, I'm evaluating a Ceph cluster to see if we can use it for our virtualization solution (proxmox). I'm using 3 nodes running Ubuntu 16.04 with stock ceph (10.2.6); every OSD uses a separate 8 TB spinning drive (XFS), the MONITORs are installed on the same nodes, and all nodes are connected via a 10G switch. The problem is, on the client I get only ~25-30 MB/s with seq. writes (dd with "oflag=direct"). [..] The 8TB size suggests these are some kind of "archive" drives (SMR drives). Is that correct? If so, you may want to use non-SMR drives, because Ceph is not optimized for those. -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Recompiling source code - to find exact RPM
On 03/23/2017 06:10 PM, nokia ceph wrote: Hello Piotr, I didn't understand; could you please elaborate on the procedure mentioned in your last reply? It would be really helpful if you could share any useful link/doc explaining what you actually meant. Yes, normally we follow this procedure, but it takes more time. My intention here is to find out which RPM contains the change. I think we are talking past each other. Here's how to build Ceph from source (the "Build Ceph" paragraph): http://docs.ceph.com/docs/master/install/build-ceph/ And here's how to install the built binaries: http://docs.ceph.com/docs/master/install/install-storage-cluster/#installing-a-build That's enough to build and install Ceph binaries on a specific host without building RPMs. After a code change, "make install" is enough to update the binaries; a restart of the Ceph daemons is still required. -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Recompiling source code - to find exact RPM
On 03/23/2017 02:02 PM, nokia ceph wrote: Hello Piotr, We customize the Ceph code for our testing purposes. It's part of our R&D :) Recompiling the source code creates 38 RPMs; out of these I need to find which one contains the change I made in the source code. That's what I'm trying to figure out. Yes, I understand that. But wouldn't it be faster and/or more convenient to just recompile the binaries in place (or use network symlinks) instead of packaging all of Ceph and (re)installing its packages each time you make a change? Generating RPMs takes a while. -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Recompiling source code - to find exact RPM
On 03/23/2017 01:41 PM, nokia ceph wrote: Hey Brad, thanks for the info. Yeah, we know that these are test RPMs. The idea behind my question is: if I make a change in the Ceph source code and then recompile it, I need to find which RPM maps to that changed file. If I find the exact RPM, I can apply just that RPM to our existing Ceph cluster instead of applying/overwriting all the compiled RPMs. I hope this clears up your doubt. And why exactly do you want to rebuild RPMs each time? If the machines are powerful enough, you could recompile the binaries in place. Or symlink them via NFS (or whatever) to a build machine and build once there. -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience
On 03/13/2017 11:07 AM, Dan van der Ster wrote: On Sat, Mar 11, 2017 at 12:21 PM, <cephmailingl...@mosibi.nl> wrote: The next and biggest problem we encountered had to do with the CRC errors on the OSD map. On every map update, the OSDs that were not yet upgraded got that CRC error and asked the monitor for a full OSD map instead of just a delta update. At first we did not understand what exactly happened; we ran the upgrade per node using a script, and in that script we watch the state of the cluster and, when the cluster is healthy again, upgrade the next host. Every time we started the script (skipping the already upgraded hosts), the first host(s) upgraded without issues and then we got blocked I/O on the cluster. The blocked I/O went away within a minute or two (not measured). After investigation we found out that the blocked I/O happened when nodes were asking the monitor for a (full) OSD map, which briefly resulted in a fully saturated network link on our monitor. Thanks for the detailed upgrade report. I wanted to zoom in on this CRC/full-map issue because it could be quite disruptive for us when we upgrade from hammer to jewel. I've read various reports that the foolproof way to avoid the full map DoS would be to upgrade all OSDs to jewel before the mons. Did anyone have success with that workaround? I'm cc'ing Bryan because he knows this issue very well. With https://github.com/ceph/ceph/pull/13131 merged into 10.2.6, this issue shouldn't be a problem anymore (at least we don't see it). -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Issue with upgrade from 0.94.9 to 10.2.5
On 01/24/2017 03:57 AM, Mike Lovell wrote: I was just testing an upgrade of some monitors in a test cluster from hammer (0.94.7) to jewel (10.2.5). After upgrading each of the first two monitors, I stopped and restarted a single osd to cause changes in the maps. The same error messages showed up in ceph -w. I haven't dug into it much, but just wanted to second that I've seen this happen on a recent hammer to recent jewel upgrade. Thanks for the confirmation. We've prepared a patch which fixes the issue for us: https://github.com/ceph/ceph/pull/13131 -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Issue with upgrade from 0.94.9 to 10.2.5
On 01/17/2017 12:52 PM, Piotr Dałek wrote: During our testing we found out that during an upgrade from 0.94.9 to 10.2.5 we're hitting issue http://tracker.ceph.com/issues/17386 ("Upgrading 0.94.6 -> 0.94.9 saturating mon node networking"). Apparently there are a few commits, for both hammer and jewel, which are supposed to fix this issue for upgrades from 0.94.6 to 0.94.9 (and possibly others), but we're still seeing it when upgrading to Jewel, and the symptoms are exactly the same - after upgrading the MONs, each not-yet-upgraded OSD fetches a full OSDMap from the monitors after failing the CRC check. Has anyone else encountered this? http://tracker.ceph.com/issues/18582 -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Issue with upgrade from 0.94.9 to 10.2.5
Hello, During our testing we found out that during an upgrade from 0.94.9 to 10.2.5 we're hitting issue http://tracker.ceph.com/issues/17386 ("Upgrading 0.94.6 -> 0.94.9 saturating mon node networking"). Apparently there are a few commits, for both hammer and jewel, which are supposed to fix this issue for upgrades from 0.94.6 to 0.94.9 (and possibly others), but we're still seeing it when upgrading to Jewel, and the symptoms are exactly the same - after upgrading the MONs, each not-yet-upgraded OSD fetches a full OSDMap from the monitors after failing the CRC check. Has anyone else encountered this? -- Piotr Dałek piotr.da...@corp.ovh.com https://www.ovh.com/us/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Any librados C API users out there?
Hello, As the subject says - are there any users/consumers of the librados C API out there? I'm asking because we're researching whether this PR: https://github.com/ceph/ceph/pull/12216 will actually be beneficial to a larger group of users. This PR adds a bunch of new APIs that perform object writes without an intermediate data copy, which will reduce CPU and memory load on clients. If you're using the librados C API for object writes, feel free to comment here or in the pull request. -- Piotr Dałek ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com