Re: [ceph-users] Limits of mds bal fragment size max
On Fri, Apr 12, 2019 at 10:31 AM Benjeman Meekhof wrote:
>
> We have a user syncing data with some kind of rsync + hardlink based
> system creating/removing large numbers of hard links. We've
> encountered many of the issues with stray inode re-integration as
> described in the thread and tracker below.
>
> As noted one fix is to increase mds_bal_fragment_size_max so the stray
> directories can accommodate the high stray count. We blew right
> through 200,000, then 300,000, and at this point I'm wondering if
> there is an upper safe limit on this parameter? If I go to something
> like 1mil to work with this use case will I have other problems?

I'd recommend trying to find a solution that doesn't require you to tweak
this. We ended up essentially doing a "repository of origin files", maybe
abusing rsync --link-dest (I don't quite recall). That was a case where
changes were always additive at the file level: files never changed and
were only ever added, never removed, so we didn't have to worry about
garbage-collecting it, and the amounts involved were also pretty small.

Assuming it doesn't fragment the stray directories, your primary problem
is going to be omap sizes. Problems we've run into with large omaps:

- Replication of omaps isn't fast if you ever have to do recovery (which
  you will).
- LevelDB/RocksDB compaction for large sets is painful, and the bigger the
  set the more painful. This is the kind of thing that creeps up on you -
  you may not notice it until you hit a multi-minute compaction, which
  blocks requests at the affected OSD(s) for the duration.
- OSDs being flagged as down due to the above, once the omaps get
  sufficiently large.
- Specifically for ceph-mds and stray directories, potentially higher
  memory usage.
- Back on hammer, we suspected we'd found some replication corner cases
  where we ended up with omaps out of sync (inconsistent objects, which
  required some surgery with ceph-objectstore-tool). This happened
  infrequently, but given that you're essentially exceeding "recommended"
  limits, you are more likely to find corner cases/bugs.

In terms of actual numbers, I'm hesitant to commit to anything. At some
point we did run the mds with mds_bal_fragment_size_max at 10M and didn't
notice any problems, though that could well be because it was the cluster
that was the target of every experiment - a very noisy environment with
relatively low expectations. Where we *really* noticed the omaps, I think
they were well over 10M entries, although since that came from radosgw on
jewel, it crept up on us and didn't appear on our radar until we had
requests blocked on an OSD for minutes, which ended up affecting e.g. rbd.

> Background:
> https://www.spinics.net/lists/ceph-users/msg51985.html
> http://tracker.ceph.com/issues/38849
>
> thanks,
> Ben

--
Kjetil Joergensen
SRE, Medallia Inc
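To make the workaround above a bit more concrete: a minimal sketch of the
"repository of origin files" idea using rsync --link-dest, plus a couple of
checks for keeping an eye on the stray count while experimenting. The
paths, dates, and the mds name are placeholders, and whether this fits
depends entirely on the user's sync workflow.

  # each run builds a new tree, hard-linking unchanged files against the
  # previous run instead of re-creating them
  prev=/backups/2019-04-11
  dest=/backups/2019-04-12
  rsync -a --link-dest="$prev" /source/data/ "$dest"/

  # watch stray counts and the current limit via the mds admin socket
  ceph daemon mds.a perf dump | grep -i stray
  ceph daemon mds.a config get mds_bal_fragment_size_max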
Re: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks
Hi,

If QDV10130 pre-dates Feb/March 2018, you may have suffered the same
firmware bug as existed on the DC S4600 series. I'm under NDA so I can't
bitch and moan about specifics, but your symptoms sound very familiar.

It's entirely possible that there's *something* about bluestore whose
access patterns differ from "regular filesystems". We burnt ourselves with
the DC S4600, which had been burn-in tested (I was told) - but the burn-in
testing was probably done through filesystems rather than ceph/bluestore.

Previously discussed around here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023835.html

On Mon, Feb 18, 2019 at 7:44 AM David Turner wrote:
>
> We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk
> (partitioned), 3 disks per node, 5 nodes per cluster. The clusters are
> 12.2.4 running CephFS and RBDs. So in total we have 15 NVMe's per cluster
> and 30 NVMe's in total. They were all built at the same time and were
> running firmware version QDV10130. On this firmware version we early on
> had 2 disk failures, a few months later we had 1 more, and then a month
> after that (just a few weeks ago) we had 7 disk failures in 1 week.
>
> The failures are such that the disk is no longer visible to the OS. This
> holds true beyond server reboots as well as placing the failed disks into
> a new server. With a firmware upgrade tool we got an error that pretty
> much said there's no way to get data back and to RMA the disk. We
> upgraded all of our remaining disks' firmware to QDV101D1 and haven't had
> any problems since then. Most of our failures happened while rebalancing
> the cluster after replacing dead disks, and we tested rigorously around
> that use case after upgrading the firmware. This firmware version seems
> to have resolved whatever the problem was.
>
> We have about 100 more of these scattered among database servers and
> other servers that have never had this problem while running the
> QDV10130 firmware, as well as firmwares between this one and the one we
> upgraded to. Bluestore on Ceph is the only use case we've had so far with
> this sort of failure.
>
> Has anyone else come across this issue before? Our current theory is
> that Bluestore is accessing the disk in a way that is triggering a bug in
> the older firmware version that isn't triggered by more traditional
> filesystems. We have a scheduled call with Intel to discuss this, but
> their preliminary searches into the bugfixes and known problems between
> firmware versions didn't indicate the bug that we triggered. It would be
> good to have some more information about what those differences in disk
> access might be, to hopefully get a better answer from them as to what
> the problem is.
>
> [1]
> https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html

--
Kjetil Joergensen
SRE, Medallia Inc
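For anyone wanting to check where their drives stand, the firmware
revision is visible via nvme-cli or smartmontools; the Intel isdct
commands below are from memory and the exact syntax may differ between
isdct releases.

  # model and firmware revision for all NVMe devices
  nvme list
  smartctl -a /dev/nvme0 | grep -i firmware

  # Intel's data center tool can also show drive details and load new
  # firmware (syntax may vary by isdct version)
  isdct show -intelssd
  isdct load -intelssd 0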
Re: [ceph-users] cephfs kernel client instability
> 7fffee40c700 20 allow all
> 2018-12-26 19:51:46.207211 7fffee40c700 10 mon.cephmon00@0(leader).osd e1208017 check_osdmap_sub 0x55592f513d00 next 1207891 (onetime)
> 2018-12-26 19:51:46.207213 7fffee40c700 5 mon.cephmon00@0(leader).osd e1208017 send_incremental [1207891..1208017] to client.36398604 10.128.36.18:0/3882984371
> 2018-12-26 19:51:46.207217 7fffee40c700 10 mon.cephmon00@0(leader).osd e1208017 build_incremental [1207891..1207930] with features 27018fb86aa42ada
> 2018-12-26 19:51:46.220019 7fffee40c700 20 mon.cephmon00@0(leader).osd e1208017 reencode_incremental_map 1207930 with features 504412504116439552
> 2018-12-26 19:51:46.230217 7fffee40c700 20 mon.cephmon00@0(leader).osd e1208017 build_incremental inc 1207930 1146701 bytes
> 2018-12-26 19:51:46.230349 7fffee40c700 20 mon.cephmon00@0(leader).osd e1208017 reencode_incremental_map 1207929 with features 504412504116439552
> 2018-12-26 19:51:46.232523 7fffee40c700 20 mon.cephmon00@0(leader).osd e1208017 build_incremental inc 1207929 175613 bytes
> ... a lot more of reencode_incremental stuff ...
> 2018-12-26 19:51:46.745394 7fffee40c700 10 mon.cephmon00@0(leader) e40 ms_handle_reset 0x637cf800 10.128.36.18:0/3882984371
> 2018-12-26 19:51:46.745395 70c11700 10 mon.cephmon00@0(leader).log v79246823 encode_full log v 79246823
> 2018-12-26 19:51:46.745469 70c11700 10 mon.cephmon00@0(leader).log v79246823 encode_pending v79246824
> 2018-12-26 19:51:46.745763 7fffee40c700 10 mon.cephmon00@0(leader) e40 reset/close on session client.36398604 10.128.36.18:0/3882984371
> 2018-12-26 19:51:46.745769 7fffee40c700 10 mon.cephmon00@0(leader) e40 remove_session 0x722bb980 client.36398604 10.128.36.18:0/3882984371 features 0x27018fb86aa42ada
>
> Any pointers to what to do here?
>
> Andras

--
Kjetil Joergensen
SRE, Medallia Inc
Re: [ceph-users] How many PGs per OSD is too many?
This may be less of an issue now - the most traumatic experience for us,
back around hammer, was memory usage under recovery+load ending in OOM
kills of OSDs, which needed more recovery - a pretty vicious cycle.

-KJ

On Wed, Nov 14, 2018 at 11:45 AM Vladimir Brik
<vladimir.b...@icecube.wisc.edu> wrote:
> Hello
>
> I have a ceph 13.2.2 cluster comprised of 5 hosts, each with 16 HDDs and
> 4 SSDs. HDD OSDs have about 50 PGs each, while SSD OSDs have about 400
> PGs each (a lot more pools use SSDs than HDDs). Servers are fairly
> powerful: 48 HT cores, 192GB of RAM, and 2x25Gbps Ethernet.
>
> The impression I got from the docs is that having more than 200 PGs per
> OSD is not a good thing, but justifications were vague (no concrete
> numbers), like increased peering time, increased resource consumption,
> and possibly decreased recovery performance. None of these appeared to
> be a significant problem in my testing, but the tests were very basic
> and done on a pretty empty cluster under minimal load, so I worry I'll
> run into trouble down the road.
>
> Here are the questions I have:
> - In practice, is it a big deal that some OSDs have ~400 PGs?
> - In what situations would our cluster most likely fare significantly
> better if I went through the trouble of re-creating pools so that no OSD
> would have more than, say, ~100 PGs?
> - What performance metrics could I monitor to detect possible issues due
> to having too many PGs?
>
> Thanks,
>
> Vlad

--
Kjetil Joergensen
SRE, Medallia Inc
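On the "what to monitor" question, a couple of low-effort checks -
assuming a Luminous-or-later cluster, which 13.2.2 is; the OSD id is a
placeholder:

  # per-OSD PG counts (PGS column) and utilization
  ceph osd df tree

  # per-OSD memory accounting, worth watching during recovery
  ceph daemon osd.0 dump_mempools
  # heap stats, on tcmalloc builds
  ceph tell osd.0 heap stats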
Re: [ceph-users] bcache, dm-cache support
Hi,

We tested bcache, dm-cache/lvmcache, and one more whose name eludes me,
with PCIe NVMe on top of large spinning rust drives behind a SAS3
expander - and decided this was not for us.

This was probably jewel with filestore, and our primary reason for trying
to go down this path was that leveldb compaction was killing us, and
putting omap/leveldb and such on separate locations was only "so-so"
supported (IIRC: some of it was explicitly supported, some you could do
with a bit of symlink or mount trickery).

The caching worked - although, when we started doing power-failure
survivability testing (power cycle the entire rig, wait for recovery,
repeat), we ended up with seriously corrupted XFS filesystems on top of
the cached block device within a handful of power cycles. We did not test
fully disabling the spinning rust's on-device cache (which was the leading
hypothesis for why this actually failed, potentially combined with the
ordering of FLUSH+FUA ending up slightly funky, combined with the rather
asymmetric commit latency). Just to rule out anything else, we ran the
same power-fail test regimen for days without the
nvme-over-spinning-rust caching, without triggering the same filesystem
corruption.

So yeah - I'd recommend looking at e.g. bluestore and sticking rocksdb,
the journal, and anything else performance-critical on faster storage
instead. If you do decide to go down the dm-cache/lvmcache/(other cache)
road, I'd recommend thoroughly testing failure scenarios like power loss,
so you don't find out accidentally when you do have a
multi-failure-domain outage. :)

- KJ

On Thu, Oct 4, 2018 at 3:42 AM Maged Mokhtar wrote:
>
> Hello all,
>
> Do bcache and dm-cache work well with Ceph? Is one recommended over the
> other? Are there any issues?
> There are a few posts in this list around them, but I could not
> determine if they are ready for mainstream use or not.
>
> Appreciate any clarifications. /Maged

--
Kjetil Joergensen
SRE, Medallia Inc
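As a rough sketch of the "fast storage for rocksdb" alternative suggested
above - device names are placeholders, and this assumes a Luminous-era
ceph-volume:

  # bluestore OSD with data on the spinner and the RocksDB/WAL portion
  # on an NVMe partition
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

  # if you do go the caching route anyway, testing with the drive's
  # volatile write cache disabled is cheap (sdparm for SAS drives)
  hdparm -W 0 /dev/sdb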
Re: [ceph-users] Have an inconsistent PG, repair not working
Hi,

Scrub or deep-scrub the pg - that should in theory get you back to
list-inconsistent-obj spitting out what's wrong. Then mail that info to
the list.

-KJ

On Sun, Apr 1, 2018 at 9:17 AM, Michael Sudnick wrote:
> Hello,
>
> I have a small cluster with an inconsistent pg. I've tried ceph pg
> repair multiple times with no luck. rados list-inconsistent-obj 49.11c
> returns:
>
> # rados list-inconsistent-obj 49.11c
> No scrub information available for pg 49.11c
> error 2: (2) No such file or directory
>
> I'm a bit at a loss here as to what to do to recover. That pg is part of
> a cephfs_data pool with compression set to force/snappy.
>
> Does anyone have any suggestions?
>
> -Michael

--
Kjetil Joergensen
SRE, Medallia Inc
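Spelled out, the sequence would look something like this (pg 49.11c taken
from the report above; only run repair once the inconsistency is
understood):

  ceph pg deep-scrub 49.11c
  # wait for the deep-scrub to complete, then:
  rados list-inconsistent-obj 49.11c --format=json-pretty
  # and, if appropriate for the type of inconsistency:
  ceph pg repair 49.11c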
Re: [ceph-users] Random individual OSD failures with "connection refused reported by" another OSD?
Hi,

Another possibility: the OSDs "refusing connections" crashed. There's a
window of time where connection attempts will fail with connection
refused, between the OSD dying, the OSD being restarted by
upstart/systemd, and the OSD getting far enough into its init process to
start listening for new connections.

While your symptoms look the same, there's no guarantee you're suffering
from the same problem, but... we're currently seeing ceph-osd v12.2.4
sporadically segfaulting. Either for config reasons or because the signal
handler fails to do its thing, we don't get the typical "oops I crashed"
report in the OSD log, although journald/systemd did capture stdout,
which mentions it, and there's a kernel log message left behind saying
that ceph-osd segfaulted. (http://tracker.ceph.com/issues/23352)

-KJ

On Wed, Mar 28, 2018 at 10:50 AM, Andre Goree wrote:
> On 2018/03/28 1:39 pm, Subhachandra Chandra wrote:
>
>> We have seen similar behavior when there are network issues. AFAIK, the
>> OSD is being reported down by an OSD that cannot reach it. But either
>> another OSD that can reach it or the heartbeat between the OSD and the
>> monitor declares it up. The OSD "boot" message does not seem to
>> indicate an actual OSD restart.
>>
>> Subhachandra
>>
>> On Wed, Mar 28, 2018 at 10:30 AM, Andre Goree wrote:
>>
>>> Hello,
>>>
>>> I've recently had a minor issue come up where random individual OSDs
>>> are failed due to a connection refused on another OSD. I say minor, bc
>>> it's not a node-wide issue, and appears to be random nodes -- and
>>> besides that, the OSD comes up within less than a second, as if the
>>> OSD is sent a "restart," or something.
>>>
>>> ...
>
> Great! Thank you! Yes I found it funny that it "restarted" so quickly,
> and from my readings I remember that it takes more than a single OSD
> heartbeat failing to produce an _actual_ failure, so as to prevent false
> positives. Thanks for the insight!
>
> --
> Andre Goree
> -=-=-=-=-=-
> Email - andre at drenet.net
> Website - http://blog.drenet.net
> PGP key - http://www.drenet.net/pubkey.html
> -=-=-=-=-=-

--
Kjetil Joergensen
SRE, Medallia Inc
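If you want to check for the crash-and-restart explanation on your own
nodes, a segfault leaves traces outside the OSD log; the OSD id and time
window below are placeholders:

  # the kernel log notes the segfault even when the OSD log does not
  dmesg -T | grep -i 'ceph-osd.*segfault'

  # journald captures the daemon's stdout/stderr and restart events
  journalctl -u ceph-osd@12 --since "2018-03-27" | grep -iE 'segfault|signal|start'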
Re: [ceph-users] Memory leak in Ceph OSD?
I retract my previous statement(s).

My current suspicion is that this isn't a leak so much as it being
load-driven; after enough waiting, it generally seems to settle around
some equilibrium. We do seem to sit at mempools x 2.4 ~ ceph-osd RSS,
which is on the higher side (I see documentation alluding to expecting
~1.5x).

-KJ

On Mon, Mar 19, 2018 at 3:05 AM, Konstantin Shalygin wrote:
>
>> We don't run compression as far as I know, so that wouldn't be it. We
>> do actually run a mix of bluestore & filestore - due to the rest of the
>> cluster predating a stable bluestore by some amount.
>
> 12.2.2 -> 12.2.4 at 2018/03/10: I don't see an increase in memory usage.
> No compression of course.
>
> http://storage6.static.itmages.com/i/18/0319/h_1521453809_9131482_859b1fb0a5.png
>
> k

--
Kjetil Joergensen
SRE, Medallia Inc
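For anyone wanting to compute the same mempools-to-RSS ratio on their own
OSDs, the mempool total comes from the admin socket (OSD id is a
placeholder; the exact JSON layout differs slightly between releases), and
can then be compared against the process's resident set size:

  # total bytes accounted for in the OSD's mempools
  ceph daemon osd.0 dump_mempools | grep -A2 '"total"'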
Re: [ceph-users] Memory leak in Ceph OSD?
Hi,

Addendum: we're running 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b).
The workload is a mix of 3x-replicated & EC-coded pools (rbd, cephfs, rgw).

-KJ

On Tue, Mar 6, 2018 at 3:53 PM, Kjetil Joergensen wrote:
> Hi,
>
> so.. +1
>
> We don't run compression as far as I know, so that wouldn't be it. We do actually run a mix of bluestore & filestore - due to the rest of the cluster predating a stable bluestore by some amount.
>
> The interesting part is - the behavior seems to be specific to our bluestore nodes.
>
> Below - yellow line, node with 10 x ~4TB SSDs, green line 8 x 800GB SSDs. Blue line - dump_mempools total bytes for all the OSDs running on the yellow line. The big dips - forced restarts after having suffered through after effects of letting linux deal with it by OOM->SIGKILL previously.
>
> A gross extrapolation - "right now" the "memory used" seems to be close enough to "sum of RSS of ceph-osd processes" running on the machines.
>
> -KJ
>
> On Thu, Mar 1, 2018 at 7:18 PM, Alex Gorbachev wrote:
>> On Thu, Mar 1, 2018 at 5:37 PM, Subhachandra Chandra wrote:
>> > Even with bluestore we saw memory usage plateau at 3-4GB with 8TB drives filled to around 90%. One thing that does increase memory usage is the number of clients simultaneously sending write requests to a particular primary OSD if the write sizes are large.
>>
>> We have not seen a memory increase in Ubuntu 16.04, but I also observed repeatedly the following phenomenon:
>>
>> When doing a VMotion in ESXi of a large 3TB file (this generates a lot of IO requests of small size) to a Ceph pool with compression set to force, after some time the Ceph cluster shows a large number of blocked requests and eventually timeouts become very large (to the point where ESXi aborts the IO due to timeouts). After abort, the blocked/slow requests messages disappear. There are no OSD errors. I have OSD logs if anyone is interested.
>>
>> This does not occur when compression is unset.
>>
>> --
>> Alex Gorbachev
>> Storcium
>>
>> > Subhachandra
>> >
>> > On Thu, Mar 1, 2018 at 6:18 AM, David Turner wrote:
>> >> With default memory settings, the general rule is 1GB ram/1TB OSD. If you have a 4TB OSD, you should plan to have at least 4GB ram. This was the recommendation for filestore OSDs, but it was a bit much memory for the OSDs. From what I've seen, this rule is a little more appropriate with bluestore now and should still be observed.
>> >>
>> >> Please note that memory usage in a HEALTH_OK cluster is not the same amount of memory that the daemons will use during recovery. I have seen deployments with 4x memory usage during recovery.
>> >>
>> >> On Thu, Mar 1, 2018 at 8:11 AM Stefan Kooman wrote:
>> >>> Quoting Caspar Smit (caspars...@supernas.eu):
>> >>> > Stefan,
>> >>> >
>> >>> > How many OSD's and how much RAM are in each server?
>> >>>
>> >>> Currently 7 OSDs, 128 GB RAM. Max will be 10 OSDs in these servers. 12 cores (at least one core per OSD).
>> >>>
>> >>> > bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM right?
>> >>>
>> >>> Apparently. Sure they will use more RAM than just cache to function correctly. I figured 3 GB per OSD would be enough ...
>> >>>
>> >>> > Our bluestore hdd OSD's with bluestore_cache_size at 1G use ~4GB of total RAM. The cache is a part of the memory usage by bluestore OSD's.
>> >>>
>> >>> A factor 4 is quite high, isn't it? Where is all this RAM used for besides cache? RocksDB?
>> >>>
>> >>> So how should I size the amount of RAM in an OSD server for 10 bluestore SSDs in a replicated setup?
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Stefan
>> >>>
>> >>> --
>> >>> | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
>> >>> | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
>
> --
> Kjetil Joergensen
> SRE, Medallia Inc

--
Kjetil Joergensen
SRE, Medallia Inc
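For reference on the cache-sizing discussion quoted above, the knobs in
question live in ceph.conf. The values below are just the Luminous-era
defaults (bluestore_cache_size = 0 means "use the hdd/ssd-specific value"),
shown for illustration rather than as a recommendation:

  [osd]
  bluestore_cache_size = 0
  bluestore_cache_size_hdd = 1073741824    # 1 GiB for HDD-backed OSDs
  bluestore_cache_size_ssd = 3221225472    # 3 GiB for SSD-backed OSDs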
Re: [ceph-users] Memory leak in Ceph OSD?
Hi,

so.. +1

We don't run compression as far as I know, so that wouldn't be it. We do
actually run a mix of bluestore & filestore - due to the rest of the
cluster predating a stable bluestore by some amount.

The interesting part is - the behavior seems to be specific to our
bluestore nodes.

Below - yellow line: node with 10 x ~4TB SSDs; green line: 8 x 800GB SSDs.
Blue line: dump_mempools total bytes for all the OSDs running on the
yellow-line node. The big dips are forced restarts, after having suffered
through the after-effects of letting linux deal with it by OOM->SIGKILL
previously.

A gross extrapolation - "right now" the "memory used" seems to be close
enough to the "sum of RSS of ceph-osd processes" running on the machines.

-KJ

On Thu, Mar 1, 2018 at 7:18 PM, Alex Gorbachev wrote:
> On Thu, Mar 1, 2018 at 5:37 PM, Subhachandra Chandra wrote:
> > Even with bluestore we saw memory usage plateau at 3-4GB with 8TB drives filled to around 90%. One thing that does increase memory usage is the number of clients simultaneously sending write requests to a particular primary OSD if the write sizes are large.
>
> We have not seen a memory increase in Ubuntu 16.04, but I also observed repeatedly the following phenomenon:
>
> When doing a VMotion in ESXi of a large 3TB file (this generates a lot of IO requests of small size) to a Ceph pool with compression set to force, after some time the Ceph cluster shows a large number of blocked requests and eventually timeouts become very large (to the point where ESXi aborts the IO due to timeouts). After abort, the blocked/slow requests messages disappear. There are no OSD errors. I have OSD logs if anyone is interested.
>
> This does not occur when compression is unset.
>
> --
> Alex Gorbachev
> Storcium
>
> > Subhachandra
> >
> > On Thu, Mar 1, 2018 at 6:18 AM, David Turner wrote:
> >> With default memory settings, the general rule is 1GB ram/1TB OSD. If you have a 4TB OSD, you should plan to have at least 4GB ram. This was the recommendation for filestore OSDs, but it was a bit much memory for the OSDs. From what I've seen, this rule is a little more appropriate with bluestore now and should still be observed.
> >>
> >> Please note that memory usage in a HEALTH_OK cluster is not the same amount of memory that the daemons will use during recovery. I have seen deployments with 4x memory usage during recovery.
> >>
> >> On Thu, Mar 1, 2018 at 8:11 AM Stefan Kooman wrote:
> >>> Quoting Caspar Smit (caspars...@supernas.eu):
> >>> > Stefan,
> >>> >
> >>> > How many OSD's and how much RAM are in each server?
> >>>
> >>> Currently 7 OSDs, 128 GB RAM. Max will be 10 OSDs in these servers. 12 cores (at least one core per OSD).
> >>>
> >>> > bluestore_cache_size=6G will not mean each OSD is using max 6GB RAM right?
> >>>
> >>> Apparently. Sure they will use more RAM than just cache to function correctly. I figured 3 GB per OSD would be enough ...
> >>>
> >>> > Our bluestore hdd OSD's with bluestore_cache_size at 1G use ~4GB of total RAM. The cache is a part of the memory usage by bluestore OSD's.
> >>>
> >>> A factor 4 is quite high, isn't it? Where is all this RAM used for besides cache? RocksDB?
> >>>
> >>> So how should I size the amount of RAM in an OSD server for 10 bluestore SSDs in a replicated setup?
> >>>
> >>> Thanks,
> >>>
> >>> Stefan
> >>>
> >>> --
> >>> | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
> >>> | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl

--
Kjetil Joergensen
SRE, Medallia Inc
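A quick way to eyeball the "sum of RSS of ceph-osd processes" comparison
described above on a given node:

  # total resident memory of all ceph-osd processes on this host, in MiB
  ps -C ceph-osd -o rss= | awk '{sum += $1} END {printf "%.0f MiB\n", sum/1024}'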
[ceph-users] Duplicate snapid's
Hi,

We currently do not understand how we got into this situation;
nevertheless, we have a set of rbd images which share the same SNAPID in
the same pool.

kjetil@sc2-r10-u09:~$ rbd snap ls _qa-staging_foo_partial_db
SNAPID NAME            SIZE
478104 2017-11-29.001  2 MB

kjetil@sc2-r10-u09:~$ rbd snap ls _qa-staging_bar_decimated_be
SNAPID NAME            SIZE
478104 2017-11-27.001  30720 kB

(We have a small collection of these.)

I currently believe this is bad - is it correct that this is bad?

My rudimentary understanding is that a snapid is monotonically increasing
and unique within a pool, at which point this becomes bad the moment one
of the snapshots gets removed: the snapid would get put into
removed_snaps, and at some point the OSDs would go trimming and might
prematurely get rid of clones.

Cheers,

--
Kjetil Joergensen
SRE, Medallia Inc
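A rough way to audit for this, assuming rbd-managed snapshots in a single
pool (the pool name is a placeholder):

  # removed_snaps intervals, if any, are listed per pool in the osd map
  ceph osd dump | grep removed_snaps

  # enumerate snapids across all images in the pool and flag snapids
  # that appear more than once
  pool=rbd
  rbd ls "$pool" | while read -r img; do
      rbd snap ls "$pool/$img" | awk -v img="$img" 'NR > 1 {print $1, img "@" $2}'
  done | sort -n | awk 'seen[$1]++ {print "duplicate snapid:", $0}'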
Re: [ceph-users] using Bcache on blueStore
Generally on bcache &, for that matter, lvmcache & dm-writeboost:

We did extensive "power off" testing with all of them and reliably managed
to break it on our hardware setup.

while true; boot box; start writing & stress metadata updates (i.e. make
piles of files and unlink them, or find something else that's picky about
write ordering); let it run for a bit; yank power; power on;

This never survived for more than a night without badly corrupting some
xfs filesystem. We did the same testing without caching and could not
reproduce it. This may have been a quirk resulting from our particular
setup - I get the impression that others use it and sleep well at night -
but I'd recommend testing it under the most unforgiving circumstances you
can think of before proceeding.

-KJ

On Thu, Oct 12, 2017 at 4:54 PM, Jorge Pinilla López wrote:
> Well, I wouldn't use bcache on filestore at all.
> First, there are problems with all that you have said, and second but
> more important, you got double writes (in FileStore data was written to
> the journal and to the storage disk at the same time), so if the journal
> and data disk were the same, speed was divided by two, giving really bad
> output.
>
> In BlueStore things change quite a lot. First, there are no double
> writes - there is no "journal" (well, there is something called a WAL,
> but it's not used in the same way); data goes directly onto the data
> disk and you only write a little metadata and make a commit into the DB.
> Rebalancing and scrub go through RocksDB, not a file system, making it
> way simpler and more effective, so you aren't supposed to have all the
> problems that you had with FileStore.
>
> In addition, cache tiering has been deprecated on Red Hat Ceph Storage,
> so I personally wouldn't use something deprecated by developers and
> support.
>
> -------- Original message --------
> From: Marek Grzybowski
> Date: 13/10/17 12:22 AM (GMT+01:00)
> To: Jorge Pinilla López , ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] using Bcache on blueStore
>
> On 12.10.2017 20:28, Jorge Pinilla López wrote:
>> Hey all!
>> I have a ceph cluster with multiple HDDs and 1 really fast SSD (30GB
>> per OSD) per host.
>>
>> I have been thinking, and all docs say that I should give all the SSD
>> space to RocksDB, so I would have HDD data and a 30GB partition for
>> RocksDB.
>>
>> But it came to my mind that if the OSD isn't full, maybe I am not using
>> all the space on the SSD, or maybe I'd prefer having a really small
>> amount of hot k/v and metadata plus the data itself on a really fast
>> device rather than just storing all the cold metadata there.
>>
>> So I thought about using bcache to make the SSD a cache; as metadata
>> and k/v are usually hot, they should end up in the cache. But this
>> doesn't guarantee that k/v and metadata are actually always on the SSD,
>> because under heavy cache load they can be pushed out (e.g. by really
>> big data files).
>>
>> So I came up with the idea of setting small 5-10GB partitions for the
>> hot RocksDB and using the rest as a cache. That way I make sure that
>> really hot metadata is always on the SSD, and the colder metadata
>> should also be on the SSD (via bcache) unless it's really freezing, in
>> which case it gets pushed to the HDD. It also doesn't make any sense to
>> have metadata that you never use taking up space on the SSD - I'd
>> rather use that space to store hotter data.
>>
>> This would also make writes faster, and in BlueStore we don't have the
>> double-write problem, so it should work fine.
>>
>> What do you think about this? Does it have any downside? Is there any
>> other way?
>
> Hi Jorge
> I was inexperienced and tried bcache on an old filestore OSD once. It
> was bad. Mostly because bcache does not have any typical disk scheduling
> algorithm, so when scrub or rebalance was running, latency on such
> storage was very high and unpredictable. The OSD daemon could not give
> any ioprio for disk reads or writes, and additionally the bcache cache
> was poisoned by scrub/rebalance.
>
> Fortunately for me, it is very easy to rolling-replace OSDs. I use some
> SSD partitions for journal now and what's left for pure SSD storage.
> This works really great.
>
> If I ever need cache, I will use cache tiering instead.
>
> --
> Kind Regards
> Marek Grzybowski

--
Kjetil Joergensen
SRE, Medallia Inc
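The test loop described above, spelled out as a rough sketch - the mount
point is a placeholder, and the power cut itself has to come from outside
(PDU, IPMI, or pulling the cord), not from the script:

  # metadata-heavy stress on the cached filesystem; leave running and
  # then cut power to the whole box
  mkdir -p /mnt/cached/stress && cd /mnt/cached/stress
  while true; do
      for i in $(seq 1 10000); do
          echo "payload $i" > "f$i"
      done
      sync
      rm -f f*
  done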