Analogies between a distributed system and one that isn’t can be a bit strained or nuanced.
The question really isn't IF a given solution is dangerous, but HOW dangerous it is. There is always a long tail; one picks a point along it based on capex, business needs, etc. I sometimes read that RAID1 == R2, but I've always considered it RN. After I asked HP to support R3 with their HBAs, they did; I like to think I had something to do with that, but it may have been coincidence. For RAID5, look up "write hole".

min_size = 1 has occasional utility when *very temporarily* set to allow recovery from a bad situation(1), but as a permanent topology it's Russian Roulette. Spin enough times and ....

1: These aren't as common as they used to be.

> On Feb 5, 2021, at 8:09 AM, Jack <c...@jack.fr.eu.org> wrote:
>
> Is raid1 dangerous?
> Is raid5 dangerous?
>
> They both allow non-redundant writes
>
>> On 2/5/21 4:19 PM, Frank Schilder wrote:
>> I don't run a secondary site and don't know if short windows of read-only
>> access are terrible. From the data security point of view, min_size 2 is
>> fine. It's the min_size 1 that really is dangerous, because it accepts
>> non-redundant writes.
>> Even if you lose the second site entirely, you can always re-sync from
>> scratch - assuming decent network bandwidth.
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> ________________________________________
>> From: Adam Boyhan <ad...@medent.com>
>> Sent: 05 February 2021 13:58:34
>> To: Frank Schilder
>> Cc: Jack; ceph-users
>> Subject: Re: [ceph-users] Re: NVMe and 2x Replica
>> This turned into a great thread. Lots of good information and clarification.
>> I am 100% on board with 3 copies for the primary.
>> What does everyone think about possibly only doing 2 copies on the
>> secondary? Keeping in mind that I would keep min=2, which I think will be
>> reasonable for a secondary site.
>> ________________________________
>> From: "Frank Schilder" <fr...@dtu.dk>
>> To: "Jack" <c...@jack.fr.eu.org>, "ceph-users" <ceph-users@ceph.io>
>> Sent: Friday, February 5, 2021 7:14:52 AM
>> Subject: [ceph-users] Re: NVMe and 2x Replica
>>> Picture this, using size=3, min_size=2:
>>> - One node is down for maintenance
>>> - You lose a couple of devices
>>> - You lose data
>>>
>>> Is it likely that an NVMe device dies during a short maintenance window?
>>> Is it likely that two devices die at the same time?
>> If you just look at it from this narrow point of view of fundamental laws
>> of nature, then, yes, 2+1 is safe. As safe as nuclear power is, looking
>> only at the laws of physics. So why then did Chernobyl and Fukushima
>> happen? It's because it's operated by humans. If you look around, the
>> No. 1 reason for losing data on Ceph, or losing entire clusters, is 2+1.
>> Look at the reasons. It's rarely a broken disk. A system designed with no
>> redundancy offers no margin for error and will suffer from every little
>> admin mistake, undetected race condition, bug in Ceph or bug in firmware.
>> So, if the savings are worth the sweat, downtime and consultancy budget,
>> why not?
>> Ceph has infinite uptime. During such a long period, low-probability
>> events will happen with probability 1.
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
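To put rough numbers on Frank's "probability 1" point (Jack asks below, "what
are the numbers?"): a back-of-the-envelope sketch for a size=2 pool, in which
every input is an illustrative guess, not a measured figure. Assume a 1%
annual failure rate per drive, 100 OSDs, and a 6-hour window to re-replicate
after a failure. The chance that one particular peer drive dies inside that
window is 0.01 * 6 / 8760 ~= 7e-6; with ~99 peers holding the now
non-redundant PGs, that is roughly 1 - (1 - 7e-6)^99 ~= 7e-4 per incident. At
about one drive failure per year in such a fleet, a ten-year horizon gives
1 - (1 - 7e-4)^10 ~= 0.7%. And that counts only random drive deaths; the
human, firmware, and power failure modes discussed in this thread come on
top, and those are the ones that actually bite.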
>> ________________________________________
>> From: Jack <c...@jack.fr.eu.org>
>> Sent: 05 February 2021 12:48:33
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Re: NVMe and 2x Replica
>> In the end, this is nothing but probability stuff.
>> Picture this, using size=3, min_size=2:
>> - One node is down for maintenance
>> - You lose a couple of devices
>> - You lose data
>> Is it likely that an NVMe device dies during a short maintenance window?
>> Is it likely that two devices die at the same time?
>> What are the numbers?
>>> On 2/5/21 12:26 PM, Wido den Hollander wrote:
>>>
>>> On 04/02/2021 18:57, Adam Boyhan wrote:
>>>> All great input and points, guys.
>>>>
>>>> Helps me lean towards 3 copies a bit more.
>>>>
>>>> I mean, honestly, NVMe cost per TB isn't that much more than SATA SSD
>>>> now. Somewhat surprised the salesmen aren't pitching 3x replication, as
>>>> it makes them more money.
>>>
>>> To add to this, I have seen real cases as a Ceph consultant where size=2
>>> and min_size=1 on all-flash led to data loss.
>>>
>>> Picture this:
>>>
>>> - One node is down (maintenance, failure, etc.)
>>> - An NVMe device in the other node dies
>>> - You lose data
>>>
>>> Although you can bring back the other node, which was down but not
>>> broken, you are missing data. The data on the NVMe devices in there is
>>> outdated and thus the PGs will not become active.
>>>
>>> size=2 is only safe with min_size=2, but that doesn't really provide HA.
>>>
>>> The same goes for ZFS in mirror, raidz1, etc. If you lose one device,
>>> the chances are real that you lose the other device before the array has
>>> healed itself.
>>>
>>> With Ceph it's slightly more complex, but the same principles apply.
>>>
>>> No, with NVMe I would still highly advise against using size=2,
>>> min_size=1.
>>>
>>> The question is not if you will lose data, but when will you lose data?
>>> Within one year? 2? 3? 10?
>>>
>>> Wido
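To make Wido's "size=2 is only safe with min_size=2" concrete: these are
ordinary pool properties you can inspect and change at runtime. A sketch,
with "mypool" as a placeholder pool name:

ceph osd pool get mypool size
ceph osd pool get mypool min_size
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2

And the *very temporary* escape hatch I footnoted at the top, spelled out:

# Last resort during recovery: accept non-redundant writes just long enough
# to get stuck PGs active again, then restore min_size immediately.
ceph osd pool set mypool min_size 1
# ... wait for recovery to finish ...
ceph osd pool set mypool min_size 2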
>>>>
>>>> From: "Anthony D'Atri" <anthony.da...@gmail.com>
>>>> To: "ceph-users" <ceph-users@ceph.io>
>>>> Sent: Thursday, February 4, 2021 12:47:27 PM
>>>> Subject: [ceph-users] Re: NVMe and 2x Replica
>>>>
>>>>> I searched each to find the section where 2x was discussed. What I
>>>>> found was interesting. First, there are really only 2 positions here:
>>>>> Micron's and Red Hat's. Supermicro copies Micron's position paragraph
>>>>> word for word. Not surprising, considering that they are advertising a
>>>>> Supermicro / Micron solution.
>>>>
>>>> FWIW, at Cephalocon another vendor made a similar claim during a talk.
>>>>
>>>> * Failure rates are averages, not minima. Some drives will always fail
>>>> sooner.
>>>> * Firmware and other design flaws can result in much higher rates of
>>>> failure, or insidious UREs that can result in partial data
>>>> unavailability or loss.
>>>> * Latent soft failures may not be detected until a deep scrub succeeds,
>>>> which could be weeks later.
>>>> * In a distributed system, there are up/down/failure scenarios where the
>>>> location of even one good / canonical / latest copy of data is unclear,
>>>> especially when drive or HBA cache is in play.
>>>> * One of these is a power failure. Sure, PDU / PSU redundancy helps, but
>>>> stuff happens, like a DC underprovisioning amps so that a spike in user
>>>> traffic takes the whole row down :-x Various unpleasant things can
>>>> happen.
>>>>
>>>> I was championing R3 even pre-Ceph, when I was using ZFS or HBA RAID.
>>>> As others have written, as drives get larger the time to fill them with
>>>> replica data increases, as does the chance of overlapping failures. I've
>>>> experienced R2 overlapping failures more than once, with and before
>>>> Ceph.
>>>>
>>>> My sense has been that not many people run R2 for data they care about,
>>>> and as has been written recently, 2,2 EC is safer with the same
>>>> raw:usable ratio (see the PS below). I've figured that vendors make R2
>>>> statements like these as a selling point to assert lower TCO. My first
>>>> response is often "How much would it cost you, directly and indirectly
>>>> in terms of user / customer goodwill, to lose data?".
>>>>
>>>>> Personally, this looks like marketing BS to me. SSD shops want to sell
>>>>> SSDs, but because of the cost difference they have to convince buyers
>>>>> that their products are competitive.
>>>>
>>>> ^this. I'm watching the QLC arena with interest for the potential to
>>>> narrow the CapEx gap. Durability has been one concern, though I'm seeing
>>>> newer products claiming that e.g. ZNS improves that. It also seems that
>>>> there are something like, what, *4* separate EDSFF / ruler form factors.
>>>> I really want to embrace those, e.g. for object clusters, but I'm VERY
>>>> wary of the longevity of competing standards and of any single source
>>>> for chassis or drives.
>>>>
>>>>> Our products cost twice as much, but LOOK, you only need 2/3 as many,
>>>>> and you get all these other benefits (performance). Plus, if you
>>>>> replace everything in 2 or 3 years anyway, then you won't have to worry
>>>>> about them failing.
>>>>
>>>> Refresh timelines. You're funny ;) Every time, every single time, that
>>>> I've worked in an organization that claims a 3 (or 5, or whatever) year
>>>> hardware refresh cycle, it hasn't happened. When you start getting
>>>> close, the capex doesn't materialize, nor does the opex for DC hands and
>>>> operational oversight. "How do you know that the drives will start
>>>> failing or getting slower? Let's revisit this in 6 months." Etc.
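PS: since 2,2 EC with the same raw:usable ratio as R2 came up above, here is
a sketch of what that looks like. The profile and pool names are hypothetical
and the PG count is a placeholder; size it for your cluster:

ceph osd erasure-code-profile set ec22 k=2 m=2 crush-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec22

Unlike size=2, a 2+2 pool can lose two shards of a PG without losing data,
and IIRC Ceph defaults min_size on EC pools to k+1 (3 here), so it never
accepts non-redundant writes.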