[ceph-users] Re: ceph df (octopus) shows USED is 7 times higher than STORED in erasure coded pool

2021-07-06 Thread Anthony D'Atri
> Oh, I just read your message again, and I see that I didn't answer your > question. :D I admit I don't know how MAX AVAIL is calculated, and whether > it takes things like imbalance into account (it might). It does. It’s calculated relative to the most-full OSD in the pool, and the full_ratio
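A rough back-of-the-envelope of that idea (hedged; the mon's actual math weighs each OSD's CRUSH share, so the numbers below are purely illustrative):

    full_ratio          = 0.95
    most_full_osd_used  = 0.70            # the pool's most-full OSD is 70% full
    headroom            = 0.95 - 0.70     # 0.25 of that OSD's capacity left before full_ratio
    # project that headroom across the pool's raw capacity, then divide by the
    # replication factor (or k+m overhead for EC) to arrive at the usable MAX AVAIL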

[ceph-users] Re: Ceph with BGP?

2021-07-05 Thread Anthony D'Atri
yes. > On Jul 5, 2021, at 11:23 PM, Martin Verges wrote: > > Hello, > >> This is not easy to answer without all the details. But for sure there > are cluster running with BGP in the field just fine. > > Out of curiosity, is there someone here that has his Ceph cluster running > with BGP in pro

[ceph-users] Re: upgrading from Nautilus on CentOS7 to Octopus on Ubuntu 20.04.2

2021-06-29 Thread Anthony D'Atri
> For similar reasons, CentOS 8 stream, as opposed to every other CentOS > released before, is very experimental. I would never go in production with > CentOS 8 stream. Is it, though? Was the experience really any different before “Stream” was appended to the name? We still saw dot releases

[ceph-users] Re: Can we deprecate FileStore in Quincy?

2021-06-27 Thread Anthony D'Atri
Also, only one Ethernet port. Worse yet they have *zero* HIPPI ports! Can you imagine!? Never used HIPPI. Almost nobody has ;) A 48-port gigabit managed switch is reasonably accessible to the home gamer, both in terms of availability and cost. USD 249 from Netgear, interesting. Second-hand 10GbE switches

[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-27 Thread Anthony D'Atri
GCC, the whole toolchain, myriad dependencies, the ways that Python has patterned itself after Java. Add in the way that the major Linux distributions are moving targets and building / running on just one of them is a huge task, not to mention multiple versions of each. And the way that system

[ceph-users] Re: Can we deprecate FileStore in Quincy?

2021-06-26 Thread Anthony D'Atri
>> - Bluestore requires OSD hosts with 8GB+ of RAM With Filestore I found that in production I needed to raise vm.min_free_kbytes, though inheriting the terrible mistake of -n size=65536 didn’t help. A handful of years back WD Labs did their “microserver” project, a cluster of 504 drives with

[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-20 Thread Anthony D'Atri
> 3. Why is in this cephadm still being talked about systemd? Your orchestrator > should handle restarts,namespaces and failed tasks not? There should be no > need to have a systemd dependency, at least I have not seen any container > images relying on this. Podman uses systemd to manage conta

[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-19 Thread Anthony D'Atri
Thanks, Sage. This is a terrific distillation of the challenges and benefits. FWIW here are a few of my own perspectives, as someone experienced with Ceph but with limited container experience. To be very clear, these are *perceptions* not *assertions*; my goal is discussion not argument. Fo

[ceph-users] Re: Strategy for add new osds

2021-06-16 Thread Anthony D'Atri
> Hi, > > as far as I understand it, > > you get no real benefit with doing them one by one, as each osd add, can > cause a lot of data to be moved to a different osd, even tho you just > rebalanced it. Less than with older releases, but yeah. I’ve known someone who advised against doing

[ceph-users] Re: Issues with Ceph network redundancy using L2 MC-LAG

2021-06-15 Thread Anthony D'Atri
> On Jun 15, 2021, at 10:26 AM, Andrew Walker-Brown > wrote: > > With an unstable link/port you could see the issues you describe. Ping > doesn’t have the packet rate for you to necessarily have a packet in transit > at exactly the same time as the port fails temporarily. Iperf on the othe

[ceph-users] Re: CephFS design

2021-06-11 Thread Anthony D'Atri
>> Can you suggest me what is a good cephfs design? One that uses copious complements of my employer’s components, naturally ;) >> I've never used it, only >> rgw and rbd we have, but want to give a try. Howvere in the mail list I saw >> a huge amount of issues with cephfs Something to remembe

[ceph-users] Re: SSD recommendations for RBD and VM's

2021-06-05 Thread Anthony D'Atri
>> I wonder that when a osd came back from power-lost, all the data >> scrubbing and there are 2 other copies. >> PLP is important on mostly Block Storage, Ceph should easily recover >> from that situation. >> That's why I don't understand why I should pay more for PLP and other >> protections. >

[ceph-users] Re: SAS vs SATA for OSD - WAL+DB sizing.

2021-06-03 Thread Anthony D'Atri
>> On 6/3/21 5:18 PM, Dave Hall wrote: >>> Anthony, >>> >>> I had recently found a reference in the Ceph docs that indicated >> something >>> like 40GB per TB for WAL+DB space. For a 12TB HDD that comes out to >>> 480GB. If this is n

[ceph-users] Re: SAS vs SATA for OSD

2021-06-03 Thread Anthony D'Atri
Agreed. I think oh …. maybe 15-20 years ago there was often a wider difference between SAS and SATA drives, but with modern queuing etc. my sense is that there is less of an advantage. Seek and rotational latency I suspect dwarf interface differences wrt performance. The HBA may be a bigger

[ceph-users] Re: SSD recommendations for RBD and VM's

2021-05-29 Thread Anthony D'Atri
The choice depends on scale, your choice of chassis / form factor, budget, workload and needs. The sizes you list seem awfully small. Tell us more about your use-case. OpenStack? Proxmox? QEMU? VMware? Converged? Dedicated ? —aad > On May 29, 2021, at 2:10 PM, by morphin wrote: > > Hell

[ceph-users] Re: XFS on RBD on EC painfully slow

2021-05-28 Thread Anthony D'Atri
There is also a longstanding belief that using cpio saves you context switches and data through a pipe. ymmv. > On May 28, 2021, at 7:26 AM, Reed Dier wrote: > > I had it on my list of things to possibly try, a tar in | tar out copy to see > if it yielded different results. > > On its face,

[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Anthony D'Atri
> It's not a 100% clear to me, but is the pdcache the same as the disk > internal (non battery backed up) cache? Yes, AIUI. > As we are located very nearby the hydropower plant, we actually connect > each server individually to an UPS. Lucky you. I’ve seen an entire DC go dark with a power outa

[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Anthony D'Atri
I don’t have the firmware versions handy, but at one point around the 2014-2015 timeframe I found that both LSI’s firmware and storcli claimed that the default setting was DiskDefault, ie. leave whatever the drive has alone. It turned out, though, that for the 9266 and 9271, at least, behind t

[ceph-users] Re: cephadm/podman :: upgrade to pacific stuck

2021-04-01 Thread Anthony D'Atri
I think what it’s saying is that it wants for more than one mgr daemon to be provisioned, so that it can failover when the primary is restarted. I suspect you would then run into the same thing with the mon. All sorts of things tend to crop up on a cluster this minimal. > On Apr 1, 2021, at

[ceph-users] Re: memory consumption by osd

2021-03-27 Thread Anthony D'Atri
Depending on your kernel version, MemFree can be misleading. Attend to the value of MemAvailable instead. Your OSDs all look to be well below the target, I wouldn’t think you have any problems. In fact 256GB for just 10 OSDs is an embarrassment of riches. What type of drives are you using, a

[ceph-users] Re: Cephfs metadata and MDS on same node

2021-03-26 Thread Anthony D'Atri
> On Mar 26, 2021, at 6:31 AM, Stefan Kooman wrote: > > On 3/9/21 4:03 PM, Jesper Lykkegaard Karlsen wrote: >> Dear Ceph’ers >> I am about to upgrade MDS nodes for Cephfs in the Ceph cluster (erasure code >> 8+3 ) I am administrating. >> Since they will get plenty of memory and CPU cores, I wa

[ceph-users] Re: How big an OSD disk could be?

2021-03-14 Thread Anthony D'Atri
> After you have filled that up, if such a host crashes or needs > maintenance, another 80-100TB will need recreating from the other huge > drives. A judicious setting of mon_osd_down_out_subtree_limit can help mitigate the thundering herd FWIW. > I don't think there are specific limitations on
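A hedged example of that knob (the value is a CRUSH bucket type, so pick whatever matches your failure domains):

    # don't automatically mark OSDs out when an entire host (or anything larger) goes down
    ceph config set mon mon_osd_down_out_subtree_limit host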

[ceph-users] Re: Location of Crush Map and CEPH metadata

2021-03-13 Thread Anthony D'Atri
As Nathan describes, this information is maintained in the database on mon / monitor nodes. One always runs multiple mons in production, at least 3 and commonly 5. Each has a full copy of everything, so that the loss of a node does not lose data or impact operation. BTW, it’s Ceph not CEPH

[ceph-users] Re: How big an OSD disk could be?

2021-03-12 Thread Anthony D'Atri
> I assume the limits are those that linux imposes. iops are the limits. One > 20TB has 100 iops and 4x5TB have 400 iops. 400 iops serves more clients than > 100 iops. You decide what you need/want to have. >> Any other aspects on the limits of bigger capacity hard disk drives? > > Recovery wil

[ceph-users] Re: balance OSD usage.

2021-03-06 Thread Anthony D'Atri
Which Ceph release are you running? You mention the balancer, which would imply a certain lower bound. What does `ceph balancer status` show? > > Does anyone know how I can rebalance my cluster to balance out the OSD > usage? > > > > I just added 12 more 14Tb HDDs to my cluster (cluster
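For reference, the minimal sequence along those lines (upmap mode assumes all clients are Luminous or newer):

    ceph balancer status
    ceph balancer mode upmap
    ceph balancer on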

[ceph-users] Re: Questions RE: Ceph/CentOS/IBM

2021-03-03 Thread Anthony D'Atri
I’m at something of a loss to understand all the panic here. Unless I’ve misinterpreted, CentOS isn’t killed, it’s being updated more frequently. Want something stable? Freeze a repository into a local copy, and deploy off of that. Like we all should be doing anyway, vs. relying on slurping

[ceph-users] Re: Slow cluster / misplaced objects - Ceph 15.2.9

2021-02-27 Thread Anthony D'Atri
With older releases, Michael Kidd’s log parser scripts were invaluable, notably map_reporters_to_buckets.sh https://github.com/linuxkidd/ceph-log-parsers With newer releases, at least, one can send `dump_blocked_ops` to the OSD admin socket. I collect these via Prometheus / node_exporter, it’
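A sketch of that newer approach, with an illustrative OSD id (run the daemon form on the host that owns the OSD):

    ceph daemon osd.12 dump_blocked_ops
    # on recent releases the tell form may also work remotely
    ceph tell osd.12 dump_blocked_ops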

[ceph-users] Re: Backups of monitor

2021-02-12 Thread Anthony D'Atri
>> So if you are doing maintenance on a mon host in a 5 mon cluster you will >> still have 3 in the quorum. > > Exactly. I was in exactly this situation, doing maintenance on 1 and screwing > up number 2. Service outage Been there. I had a cluster that nominally had 5 mons. Two suffered har

[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Anthony D'Atri
, etc) >>> - NVMe device in other node dies >>> - You lose data >>> >>> Although you can bring back the other node which was down but not broken >>> you are missing data. The data on the NVMe devices in there is outdated >>> and thus the PGs

[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Anthony D'Atri
Weighting up slowly so as not to DoS users. Huge omaps and EC. So yes you’re actually agreeing with me. > > Taking a month to weight up a drive suggests the cluster doesn't have > enough spare IO capacity. ___ ceph-users mailing list -- ceph-users@ce

[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Anthony D'Atri
>> Why would I when I can get a 18TB Seagate IronWolf for <$600, a 18TB Seagate >> Exos for <$500, or a 18TB WD Gold for <$600? > > IOPS Some installations don’t care so much about IOPS. Less-tangible factors include: * Time to repair and thus to restore redundancy. When an EC pool of spi

[ceph-users] Re: NVMe and 2x Replica

2021-02-04 Thread Anthony D'Atri
> I searched each to find the section where 2x was discussed. What I found was > interesting. First, there are really only 2 positions here: Micron's and Red > Hat's. Supermicro copies Micron's position paragraph word for word. Not > surprising considering that they are advertising a Superm

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-04 Thread Anthony D'Atri
> > Maybe the weakest thing in that configuration is having 2 OSDs per node; osd > nearfull must be tuned accordingly so that no OSD goes beyond about 0.45, so > that in case of failure of one disk, the other OSD in the node has enough > space for healing replication. > A careful setting of

[ceph-users] Re: Using RBD to pack billions of small files

2021-02-02 Thread Anthony D'Atri
I’d be nervous about a plan to utilize a single volume, growing indefinitely. I would think that from a blast radius perspective that you’d want to strike a balance between a single monolithic blockchain-style volume vs a zillion tiny files. Perhaps a strategy to shard into, say, 10 TB volumes

[ceph-users] Re: Planning: Ceph User Survey 2020

2021-01-28 Thread Anthony D'Atri
The survey team spent some time discussing the pros and cons of formats for a number of the questions in the new survey. I think when we initially sent out the first draft of the survey, that specific question was simple checkboxes, as I think it had been in the previous year’s edition. The fi

[ceph-users] Re: Running ceph cluster on different os

2021-01-25 Thread Anthony D'Atri
I have firsthand experience migrating multiple clusters from Ubuntu to RHEL, preserving the OSDs along the way, with no loss or problems. It’s not like you’re talking about OpenVMS ;) > On Jan 25, 2021, at 9:14 PM, Szabo, Istvan (Agoda) > wrote: > > Hi, > > Is there anybody running a cluste

[ceph-users] Re: Snaptrim making cluster unusable

2021-01-10 Thread Anthony D'Atri
When the below was first published my team tried to reproduce, and couldn’t. A couple of factors likely contribute to differing behavior: * _Micron 5100_ for example isn’t a model, the 5100 _Eco_, _Pro_, and _Max_ are different beasts. Similarly, implementation and firmware details vary by dri

[ceph-users] Re: osd gradual reweight question

2021-01-08 Thread Anthony D'Atri
> > Hi, > > We are replacing HDD with SSD, and we first (gradually) drain (reweight) the > HDDs with 0.5 steps until 0 = empty. > > Works perfectly. > > Then (just for kicks) I tried reducing HDD weight from 3.6 to 0 in one large > step. That seemed to have had more impact on the cluster, a

[ceph-users] Re: ceph stuck removing image from trash

2020-12-16 Thread Anthony D'Atri
Perhaps setting the object-map feature on the image, and/or running rbd object-map rebuild? Though I suspect that might perform an equivalent process and take just as long? > On Dec 15, 2020, at 11:49 PM, 胡 玮文 wrote: > > Hi Andre, > > I once faced the same problem. It turns out that ceph nee

[ceph-users] Re: pool nearfull, 300GB rbd image occupies 11TB!

2020-12-13 Thread Anthony D'Atri
and pool > shrinks automatically again? Or still any additional actions are required? > — > Max > >> On 13. Dec 2020, at 15:53, Anthony D'Atri wrote: >> >> rbd status >> rbd info >> >> If the ‘journaling’ flag is enabled, use ‘rbd feature’ to re

[ceph-users] Re: pool nearfull, 300GB rbd image occupies 11TB!

2020-12-13 Thread Anthony D'Atri
Any chance you might have orphaned `rados bench` objects ? This happens more than one might think. `rados ls > /tmp/out` Inspect the result. You should see a few administrative objects, some header and data objects for the RBD volume. If you see a zillion with names like `bench*` there’s y
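A minimal sketch of that check, with an illustrative pool name (verify the cleanup options on your release before running them):

    rados -p mypool ls | head -20
    # orphaned bench objects typically look like benchmark_data_<host>_<pid>_object<N>
    rados -p mypool cleanup --prefix benchmark_data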

[ceph-users] Re: add server in crush map before osd

2020-12-03 Thread Anthony D'Atri
This is what I do as well. > You can also just use a single command: > > ceph osd crush add-bucket host room= ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
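Filled in with hypothetical names, that looks something like this on releases that accept a location with add-bucket:

    ceph osd crush add-bucket node13 host room=room1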

[ceph-users] Re: slow down keys/s in recovery

2020-12-03 Thread Anthony D'Atri
>> If so why the client op priority is default 63 and recovery op is 3? This >> means that by default recovery op is more prioritize than client op! > > Exactly the opposite. Client ops take priority over recovery ops. And > various other ops have priorities as described in the document I poi

[ceph-users] Re: slow down keys/s in recovery

2020-12-03 Thread Anthony D'Atri
That’s why it is commonly suggested to set recovery_op_priority to 1 if you need to slow down recovery as well as the other values I sent you. https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#operations > Thanks. > > On Wed, Dec 2, 2020 at 10:25 PM Anthony D'Atri
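The sort of settings meant there, as an example rather than a prescription (1 is the usual "slow recovery down" value):

    ceph config set osd osd_recovery_op_priority 1
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1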

[ceph-users] Re: replace osd with Octopus

2020-12-02 Thread Anthony D'Atri
> Give my above understanding, all-to-all is no difference from > one-to-all. In either case, PGs of one disk are remapped to others. > > I must be missing something seriously:) It’s a bit subtle, but I think part of what Frank is getting at is that when OSDs are backfilled / recovered sequent

[ceph-users] Re: slow down keys/s in recovery

2020-12-02 Thread Anthony D'Atri
FWIW https://github.com/ceph/ceph/blob/master/doc/dev/osd_internals/backfill_reservation.rst has some discussion of op priorities, though client ops aren’t mentioned explicitly. If you like, enter a documentation tracker and tag me and I’ll look into adding that. > On Dec 2, 2020, at 9:56 AM,

[ceph-users] Re: slow down keys/s in recovery

2020-12-02 Thread Anthony D'Atri
In certain cases (Luminous) it can actually be faster to destroy an OSD and recreate it than to let it backfill huge maps, but I think that’s been improved by Nautilus. You might also try setting osd_op_queue_cut_off = high to reduce the impact of recovery on client operations. This became t
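A hedged example of setting that (on some releases the change only takes effect after the OSDs restart):

    ceph config set osd osd_op_queue_cut_off high
    ceph config get osd osd_op_queue_cut_off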

[ceph-users] Re: DB sizing for lots of large files

2020-11-28 Thread Anthony D'Atri
Christian wrote “post Octopus”. The referenced code seems likely to appear in Pacific. We’ll see how it works out in practice. I suspect that provisioned space will automagically be used when an OSD starts under a future release, though the release notes may give us specific instructions, li

[ceph-users] Re: replace osd with Octopus

2020-11-27 Thread Anthony D'Atri
>> > > Here is the context. > https://docs.ceph.com/en/latest/mgr/orchestrator/#replace-an-osd > > When disk is broken, > 1) orch osd rm --replace [--force] > 2) Replace disk. > 3) ceph orch apply osd -i > > Step #1 marks OSD "destroyed". I assume it has the same effect as > "ceph osd destr

[ceph-users] Re: replace osd with Octopus

2020-11-26 Thread Anthony D'Atri
>> When replacing an osd, there will be no PG remapping, and backfill >>> will restore the data on the new disk, right? >> >> That depends on how you decide to go through the replacement process. >> Usually without your intervention (e.g. setting the appropriate OSD >> flags) the remapping will

[ceph-users] Re: Misleading error (osd has already bound to class) when starting osd on nautilus?

2020-11-25 Thread Anthony D'Atri
This was my first thought too. Is it just this one drive, all drives on this host, or all drives in the cluster? I’m curious if stupid HBA tricks are afoot, if this is a SAS / SATA drive. Especially if it’s a RAID-capable HBA vs passthrough. >>> It might be an issue with the driver then rep

[ceph-users] Re: smartctl UNRECOGNIZED OPTION: json=o

2020-11-24 Thread Anthony D'Atri
context : JSON output was added to smartmontools 7 explicitly for Ceph use > > I had to roll an upstream version of the smartmon tools because everything > with redhat 7/8 was too old to support the json option. > ___ ceph-users mailing list -- ceph-

[ceph-users] Re: Ceph on ARM ?

2020-11-24 Thread Anthony D'Atri
I had hoped to stay out of this, but here I go. > 4) SATA controller and PCIe throughput SoftIron claims “wire speed” with their custom hardware FWIW. > Unfortunately these are the kinds of things that you can't easily generalize > between ARM vs x86. Some ARM processors are going to do wildl

[ceph-users] Re: ssd suggestion

2020-11-23 Thread Anthony D'Atri
Those are QLC, with low durability. They may work okay for your use case if you keep an eye on lifetime, esp if your writes tend to sequential. Random writes will eat them more quickly, as will of course EC. Remember that recovery and balancing contribute to writes, and ask Micron for the

[ceph-users] Re: Documentation of older Ceph version not accessible anymore on docs.ceph.com

2020-11-20 Thread Anthony D'Atri
Same problem: “Versions latest octopus nautilus “ This week I had to look up Jewel, Luminous, and Mimic docs and had to do so at GitHub. > >> Hello, >> maybe I missed the announcement but why is the documentation of the >> older ceph version not accessible anymore on docs.ceph.com > > It's

[ceph-users] Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time

2020-11-17 Thread Anthony D'Atri
> > I'm probably going to get crucified for this Naw. The <> in your From: header, though …. ;) ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time

2020-11-11 Thread Anthony D'Atri
> Am 11.11.20 um 11:20 schrieb Hans van den Bogert: >> Hoping to learn from this myself, why will the current setup never work? That was a bit harsh to have said. Without seeing your EC profile and the topology, it’s hard to say for sure, but I suspect that adding another node with at least o

[ceph-users] Re: How to use ceph-volume to create multiple OSDs per NVMe disk, and with fixed WAL/DB partition on another device?

2020-11-11 Thread Anthony D'Atri
Quoting in your message looks kind of messy so forgive me if I’m propagating that below. Honestly I agree that the Optanes will give diminishing returns at best for all but the most extreme workloads (which will probably want to use NVMoF natively anyway). >>> >>> This does split up the NVM

[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

2020-10-25 Thread Anthony D'Atri
> I'm not entirely sure if primary on SSD will actually make the read happen on > SSD. My understanding is that by default reads always happen from the lead OSD in the acting set. Octopus seems to (finally) have an option to spread the reads around, which IIRC defaults to false. I’ve never

[ceph-users] Re: Strange USED size

2020-10-23 Thread Anthony D'Atri
10B as in ten bytes? By chance have you run `rados bench` ? Sometimes a run is interrupted or one forgets to clean up and there are a bunch of orphaned RADOS objects taking up space, though I’d think `ceph df` would reflect that. Is your buckets.data pool replicated or EC? > On Oct 22, 2020

[ceph-users] Re: Hardware for new OSD nodes.

2020-10-22 Thread Anthony D'Atri
> Yeah, didn't think about a RAID10 really, although there wouldn't be enough > space for 8x300GB = 2400GB WAL/DBs. 300 is overkill for many applications anyway. > > Also, using a RAID10 for WAL/DBs will: > - make OSDs less movable between hosts (they'd have to be moved all > together -

[ceph-users] Re: Hardware for new OSD nodes.

2020-10-22 Thread Anthony D'Atri
> Also, any thoughts/recommendations on 12TB OSD drives? For price/capacity > this is a good size for us Last I checked HDD prices seemed linear from 10-16TB. Remember to include the cost of the drive bay, ie. the cost of the chassis, the RU(s) it takes up, power, switch ports etc. I’ll gu

[ceph-users] Re: Ceph Octopus

2020-10-20 Thread Anthony D'Atri
I wonder if this would be impactful, even if `nodown` were set. When a given OSD latches onto the new replication network, I would expect it to want to use it for heartbeats — but when its heartbeat peers aren’t using the replication network yet, they won’t be reachable. Unless something has

[ceph-users] Re: Mon DB compaction MON_DISK_BIG

2020-10-19 Thread Anthony D'Atri
th the mon DB size you expect, try removing or replacing that OSD and I’ll bet you have better results. — aad > > mon stat same yes. > > now I fininshed the email it is 8.7Gb. > > I hope I didn't break anything and it will delete everything. > > Thank you > _

[ceph-users] Re: Mon DB compaction MON_DISK_BIG

2020-10-19 Thread Anthony D'Atri
I hope you restarted those mons sequentially, waiting between each for the quorum to return. Is there any recovery or pg autoscaling going on? Are all OSDs up/in, ie. are the three numbers returned by `ceph osd stat` the same? — aad > On Oct 19, 2020, at 7:05 PM, Szabo, Istvan (Agoda) > wro

[ceph-users] Re: Proxmox+Ceph Benchmark 2020

2020-10-14 Thread Anthony D'Atri
>> >> Very nice and useful document. One thing is not clear for me, the fio >> parameters in appendix 5: >> --numjobs=<1|4> --iodepths=<1|32> >> it is not clear if/when the iodepth was set to 32, was it used with all >> tests with numjobs=4 ? or was it: >> --numjobs=<1|4> --iodepths=1 > We have

[ceph-users] Re: Bluestore migration: per-osd device copy

2020-10-12 Thread Anthony D'Atri
Poking through the source I *think* the doc should indeed refer to the “dup” function, vs “copy”. That said, arguably we shouldn’t have a section in the docs that says "there’s this thing you can do but we aren’t going to tell you how”. Looking at the history / blame info, which only seems to

[ceph-users] Re: How to clear Health Warning status?

2020-10-09 Thread Anthony D'Atri
* Monitors now have a config option ``mon_osd_warn_num_repaired``, 10 by default. If any OSD has repaired more than this many I/O errors in stored data a ``OSD_TOO_MANY_REPAIRS`` health warning is generated. Look at `dmesg` and the underlying drive’s SMART counters. You almost certainly hav
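An illustrative first pass at that investigation (device name and threshold are hypothetical):

    dmesg | grep -i error
    smartctl -a /dev/sdX        # look at reallocated / pending sector counts
    # only if the drive checks out and you accept a higher warning threshold:
    ceph config set mon mon_osd_warn_num_repaired 20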

[ceph-users] Re: Ceph iSCSI Performance

2020-10-05 Thread Anthony D'Atri
Thanks, Mark. I’m interested as well, wanting to provide block service to baremetal hosts; iSCSI seems to be the classic way to do that. I know there’s some work on MS Windows RBD code, but I’m uncertain if it’s production-worthy, and if RBD namespaces suffice for tenant isolation — and are

[ceph-users] Re: Feedback for proof of concept OSD Node

2020-10-04 Thread Anthony D'Atri
>> If you guys have any suggestions about used hardware that can be a good fit >> considering mainly low noise, please let me know. > > So we didn’t get these requirements initially, there’s no way for us to help > you when the requirements aren’t available for us to consider, even if we had >

[ceph-users] Re: Massive Mon DB Size with noout on 14.2.11

2020-10-02 Thread Anthony D'Atri
> thx for taking care. I read "works as designed, be sure to have disk > space for the mon available”. Well, yeah ;) > It sounds a little odd that the growth > from 50MB to ~15GB + compaction space happens within a couple of > seconds, when two OSD rejoin the cluster. I’m suspicious — even on

[ceph-users] Re: objects misplaced jumps up at 5%

2020-09-29 Thread Anthony D'Atri
>> I think you found the answer! >> >> When adding 100 new OSDs to the cluster, I increased both pg and pgp >> from 4096 to 16,384 >> > > Too much for your cluster, 4096 seems sufficient for a pool of size 10. > You can still reduce it relatively cheaply while it hasn't been fully > actuated y

[ceph-users] Re: NVMe's

2020-09-23 Thread Anthony D'Atri
>> With today’s networking, _maybe_ a super-dense NVMe box needs 100Gb/s where >> a less-dense probably is fine with 25Gb/s. And of course PCI lanes. >> >> https://cephalocon2019.sched.com/event/M7uJ/affordable-nvme-performance-on-ceph-ceph-on-nvme-true-unbiased-story-to-fast-ceph-wido-den-holl

[ceph-users] Re: NVMe's

2020-09-23 Thread Anthony D'Atri
Apologies for not consolidating these replies. My MUA is not my friend today. > With 10 NVMe drives per node, I'm guessing that a single EPYC 7451 is > going to be CPU bound for small IO workloads (2.4c/4.8t per OSD), but > will be network bound for large IO workloads unless you are sticking > 2x1

[ceph-users] Re: NVMe's

2020-09-23 Thread Anthony D'Atri
> How they did it? You can create partitions / LVs by hand and build OSDs on them, or you can use ceph-volume lvm batch –osds-per-device > I have an idea to create a new bucket type under host, and put two LV from > each ceph osd VG into that new bucket. Rules are the same (different host),
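A minimal sketch of the batch form, with illustrative device paths:

    ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1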

[ceph-users] Re: NVMe's

2020-09-23 Thread Anthony D'Atri
> That's pretty much the advice I've been giving people since the Inktank days. > It costs more and is lower density, but the design is simpler, you are less > likely to under provision CPU, less likely to run into memory bandwidth > bottlenecks, and you have less recovery to do when a node f

[ceph-users] Re: Setting up a small experimental CEPH network

2020-09-21 Thread Anthony D'Atri
> we use heavily bonded interfaces (6x10G) and also needed to look at this > balancing question. We use LACP bonding and, while the host OS probably tries > to balance outgoing traffic over all NICs > I tested something in the past[1] where I could notice that an osd > staturated a bond link an

[ceph-users] Re: Setting up a small experimental CEPH network

2020-09-21 Thread Anthony D'Atri
Depending what you’re looking to accomplish, setting up a cluster in VMs (VirtualBox, Fusion, cloud provider, etc) may meet your needs without having to buy anything. > > - Don't think having a few 1Gbit can replace a >10Gbit. Ceph doesn't use > such bonds optimal. I already asked about this y

[ceph-users] Re: Choosing suitable SSD for Ceph cluster

2020-09-12 Thread Anthony D'Atri
Is this a reply to Paul’s message from 11 months ago? https://bit.ly/32oZGlR The PM1725b is interesting in that it has explicitly configurable durability vs capacity, which may be even more effective than user-level short-stroking / underprovisioning. > > Hi. How do you say 883DCT is faster

[ceph-users] Re: Change crush rule on pool

2020-09-12 Thread Anthony D'Atri
If you have capacity to have both online at the same time, why not add the SSDs to the existing pool, let the cluster converge, then remove the HDDs? Either all at once or incrementally? With care you’d have zero service impact. If you want to change the replication strategy at the same time,

[ceph-users] Re: Is it possible to assign osd id numbers?

2020-09-11 Thread Anthony D'Atri
Now that’s a *very* different question from numbers assigned during an install. With recent releases instead of going down the full removal litany listed below, you can instead down/out the OSD and `destroy` it. That preserves the CRUSH bucket and OSD ID, then when you use ceph-disk, ceph-volu
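Sketched with a hypothetical OSD id and device, that flow looks roughly like:

    ceph osd out 23
    systemctl stop ceph-osd@23
    ceph osd destroy 23 --yes-i-really-mean-it
    # redeploy on the replacement device, reusing the same id
    ceph-volume lvm create --osd-id 23 --data /dev/sdX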

[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-09-06 Thread Anthony D'Atri
FWIW a handful of years back there was a bug in at least some LSI firmware where the setting “Disk Default” silently turned the volatile cache *on* instead of the documented behavior, which was to leave alone. > On Sep 3, 2020, at 8:13 AM, Reed Dier wrote: > > It looks like I ran into the same

[ceph-users] Re: PG number per OSD

2020-09-06 Thread Anthony D'Atri
> > huxia...@horebdata.cn > > From: Anthony D'Atri > Date: 2020-09-05 20:00 > To: huxia...@horebdata.cn > CC: ceph-users > Subject: Re: [ceph-users] PG number per OSD > One factor is RAM usage, that was IIRC the motivation for the lowering of the > recommendat

[ceph-users] Re: PG number per OSD

2020-09-05 Thread Anthony D'Atri
One factor is RAM usage, that was IIRC the motivation for the lowering of the recommendation of the ratio from 200 to 100. Memory needs also increase during recovery and backfill. When calculating, be sure to consider repllicas. ratio = (pgp_num x replication) / num_osds As HDDs grow the inte
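Worked through with example numbers:

    # a pool with pgp_num 1024 and 3x replication spread over 30 OSDs:
    (1024 * 3) / 30 ≈ 102 PGs per OSD    # right around the newer ~100 guideline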

[ceph-users] Re: Fwd: Upgrade Path Advice Nautilus (CentOS 7) -> Octopus (new OS)

2020-08-27 Thread Anthony D'Atri
> > Looking for a bit of guidance / approach to upgrading from Nautilus to > Octopus considering CentOS and Ceph-Ansible. > > We're presently running a Nautilus cluster (all nodes / daemons 14.2.11 as > of this post). > - There are 4 monitor-hosts with mon, mgr, and dashboard functions > consol

[ceph-users] Re: Cluster degraded after adding OSDs to increase capacity

2020-08-27 Thread Anthony D'Atri
Is your MUA wrapping lines, or is the list software? As predicted. Look at the VAR column and the STDDEV of 37.27 > On Aug 27, 2020, at 9:02 AM, Dallas Jones wrote: > > 1 122.79410- 123 TiB 42 TiB 41 TiB 217 GiB 466 GiB 81 > TiB 33.86 1.00 -root default > -3

[ceph-users] Re: Cluster degraded after adding OSDs to increase capacity

2020-08-27 Thread Anthony D'Atri
Doubling the capacity in one shot was a big topology change, hence the 53% misplaced. OSD fullness will naturally reflect a bell curve; there will be a tail of under-full and over-full OSDs. If you’d not said that your cluster was very full before expansion I would have predicted it from the f

[ceph-users] Re: does ceph rgw has any option to limit bandwidth

2020-08-19 Thread Anthony D'Atri
> I wanna limit the traffic of specific buckets. Can haproxy, nginx or any > other proxy software deal with it ? Yes. I’ve seen it done. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Ceph not warning about clock skew on an OSD-only host?

2020-08-12 Thread Anthony D'Atri
My understanding is that the existing mon_clock_drift_allowed value of 50 ms (default) is so that PAXOS among the mon quorum can function. So OSDs (and mgrs, and clients etc) are out of scope of that existing code. Things like this are why I like to ensure that the OS does `ntpdate -b` or equi

[ceph-users] Re: Problems with long taking deep-scrubbing processes causing PG_NOT_DEEP_SCRUBBED

2020-07-31 Thread Anthony D'Atri
One way this can happen is if you have the default setting osd_scrub_during_recovery=false If you’ve been doing a lot of [re]balancing, drive replacements, topology changes, expansions, etc. scrubs can be starved especially if you’re doing EC on HDDs. HDD or SSD OSDs? Replication or E

[ceph-users] Re: unbalanced pg/osd allocation

2020-07-30 Thread Anthony D'Atri
This is a natural condition of CRUSH. You don’t mention what release the back-end or the clients are running so it’s difficult to give an exact answer. Don’t mess with the CRUSH weights. Either adjust the override / reweights with `ceph osd test-reweight-by-utilization / reweight-by-utilizatio
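A hedged example of that pair of commands (120 means only OSDs above 120% of mean utilization get adjusted):

    ceph osd test-reweight-by-utilization 120    # dry run, prints the proposed changes
    ceph osd reweight-by-utilization 120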

[ceph-users] Re: OSD memory leak?

2020-07-14 Thread Anthony D'Atri
>> In the past, the minimum recommendation was 1GB RAM per HDD blue store OSD. There was a rule of thumb of 1GB RAM *per TB* of HDD Filestore OSD, perhaps you were influenced by that? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe

[ceph-users] Re: Questions on Ceph on ARM

2020-07-07 Thread Anthony D'Atri
Bear in mind that ARM and x86 are architectures, not CPU models. Both are available in a vast variety of core counts, clocks, and implementations. E.g., an 80-core Ampere Altra likely will smoke an Intel Atom D410 in every way. That said, what does “performance” mean? For object storage, it mig

[ceph-users] Re: Nautilus upgrade HEALTH_WARN legacy tunables

2020-07-04 Thread Anthony D'Atri
min_compat is a different thing entirely. You need to set the tunables as a group. This will cause data to move, so you may wish to throttle recovery, model the PG movement ahead of time, use the upmap trick to control movement etc. https://ceph.io/geen-categorie/set-tunables-optimal-on-ce
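A minimal sketch of that sequence (the throttle value is illustrative):

    ceph config set osd osd_max_backfills 1    # throttle the coming data movement
    ceph osd crush tunables optimal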

[ceph-users] Re: Bluestore performance tuning for hdd with nvme db+wal

2020-06-30 Thread Anthony D'Atri
> That is an interesting point. We are using 12 on 1 nvme journal for our > Filestore nodes (which seems to work ok). The workload for wal + db is > different so that could be a factor. However when I've looked at the IO > metrics for the nvme it seems to be only lightly loaded, so does not ap

[ceph-users] Re: Re layout help: need chassis local io to minimize net links

2020-06-29 Thread Anthony D'Atri
> Thanks for the thinking. By 'traffic' I mean: when a user space rbd > write has as a destination three replica osds in the same chassis eek. > does the whole write get shipped out to the mon and then back Mons are control-plane only. > All the 'usual suspects' like lossy ethernets and mis

[ceph-users] Re: Re layout help: need chassis local io to minimize net links

2020-06-29 Thread Anthony D'Atri
What does “traffic” mean? Reads? Writes will have to hit the net regardless of any machinations. > On Jun 29, 2020, at 7:31 PM, Harry G. Coin wrote: > > I need exactly what ceph is for a whole lot of work, that work just > doesn't represent a large fraction of the total local traffic.

[ceph-users] Re: fault tolerant about erasure code pool

2020-06-26 Thread Anthony D'Atri
M=1 is never a good choice. Just use replication instead. > On Jun 26, 2020, at 3:05 AM, Zhenshi Zhou wrote: > > Hi Janne, > > I use the default profile(2+1) and set failure-domain=host, is my best > practice? > > Janne Johansson 于2020年6月26日周五 下午4:59写道: > >> Den fre 26 juni 2020 kl 10:3

[ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba MG07ACA14TE HDDs

2020-06-24 Thread Anthony D'Atri
The benefit of disabling on-drive cache may be at least partly dependent on the HBA; I’ve done testing of one specific drive model and found no difference, where someone else reported a measurable difference for the same model. > Good to know that we're not alone :) I also looked for a newer fir

[ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba MG07ACA14TE HDDs

2020-06-24 Thread Anthony D'Atri
>> I can remember reading this before. I was hoping you maybe had some >> setup with systemd scripts or maybe udev. > > Yeah, doing this on boot up would be ideal. I was looking really hard into > tuned and other services that claimed can do it, but required plugins or > other stuff did/does n
