Analogies between a distributed system and one that isn’t can be a bit strained or nuanced.
The question really isn't IF a given solution is dangerous, but HOW dangerous it is. There is always a long tail; one picks a point along it based on capex, business needs, etc. I sometimes read that RAID1 == R2, but I've always considered it RN. After I asked HP to support R3 with their HBAs, they did; I like to think I had something to do with that, but it may have been coincidence. For RAID5, look up "write hole".

min_size = 1 has occasional utility when *very temporarily* set to allow recovery from a bad situation(1), but as a permanent topology it's Russian Roulette. Spin enough times and ....

1: These aren't as common as they used to be.

> On Feb 5, 2021, at 8:09 AM, Jack <c...@jack.fr.eu.org> wrote:
>
> Is raid1 dangerous?
> Is raid5 dangerous?
>
> They both allow non-redundant writes
>
>> On 2/5/21 4:19 PM, Frank Schilder wrote:
>> I don't run a secondary site and don't know if short windows of read-only
>> access are terrible. From the data security point of view, min_size 2 is
>> fine. It's the min_size 1 that really is dangerous, because it accepts
>> non-redundant writes.
>> Even if you lose the second site entirely, you can always re-sync from
>> scratch - assuming decent network bandwidth.
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> ________________________________________
>> From: Adam Boyhan <ad...@medent.com>
>> Sent: 05 February 2021 13:58:34
>> To: Frank Schilder
>> Cc: Jack; ceph-users
>> Subject: Re: [ceph-users] Re: NVMe and 2x Replica
>> This turned into a great thread. Lots of good information and clarification.
>> I am 100% on board with 3 copies for the primary.
>> What does everyone think about possibly only doing 2 copies on the
>> secondary? Keeping in mind that I would keep min=2, which I think will be
>> reasonable for a secondary site.
>> ________________________________
>> From: "Frank Schilder" <fr...@dtu.dk>
>> To: "Jack" <c...@jack.fr.eu.org>, "ceph-users" <ceph-users@ceph.io>
>> Sent: Friday, February 5, 2021 7:14:52 AM
>> Subject: [ceph-users] Re: NVMe and 2x Replica
>>> Picture this, using size=3, min_size=2:
>>> - One node is down for maintenance
>>> - You lose a couple of devices
>>> - You lose data
>>>
>>> Is it likely that an NVMe device dies during a short maintenance window?
>>> Is it likely that two devices die at the same time?
>> If you just look at it from this narrow point of view of fundamental laws
>> of nature, then, yes, 2+1 is safe. As safe as nuclear power is, looking
>> only at the laws of physics. So why then did Chernobyl and Fukushima
>> happen? It's because it's operated by humans. If you look around, the
>> No. 1 reason for losing data on Ceph, or losing entire clusters, is 2+1.
>> Look at the reasons. It's rarely a broken disk. A system designed with no
>> redundancy offers no margin for error and will suffer from every little
>> admin mistake, undetected race condition, bug in Ceph or bug in firmware.
>> So, if the savings are worth the sweat, downtime and consultancy budget,
>> why not?
>> Ceph has infinite uptime. During such a long period, low-probability
>> events will happen with probability 1.
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
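To put rough numbers on Frank's "probability 1" point (Jack asks below, "what
are the numbers?"): a back-of-the-envelope sketch for a size=2 pool, in which
every input is an illustrative guess, not a measured figure. Assume a 1%
annual failure rate per drive, 100 OSDs, and a 6-hour window to re-replicate
after a failure. The chance that one particular peer drive dies inside that
window is 0.01 * 6 / 8760 ~= 7e-6; with ~99 peers holding the now
non-redundant PGs, that is roughly 1 - (1 - 7e-6)^99 ~= 7e-4 per incident. At
about one drive failure per year in such a fleet, a ten-year horizon gives
1 - (1 - 7e-4)^10 ~= 0.7%. And that counts only random drive deaths; the
human, firmware, and power failure modes discussed in this thread come on
top, and those are the ones that actually bite.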
>> ________________________________________
>> From: Jack <c...@jack.fr.eu.org>
>> Sent: 05 February 2021 12:48:33
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Re: NVMe and 2x Replica
>> In the end, this is nothing but probability stuff.
>> Picture this, using size=3, min_size=2:
>> - One node is down for maintenance
>> - You lose a couple of devices
>> - You lose data
>> Is it likely that an NVMe device dies during a short maintenance window?
>> Is it likely that two devices die at the same time?
>> What are the numbers?
>>> On 2/5/21 12:26 PM, Wido den Hollander wrote:
>>>
>>> On 04/02/2021 18:57, Adam Boyhan wrote:
>>>> All great input and points, guys.
>>>>
>>>> Helps me lean towards 3 copies a bit more.
>>>>
>>>> I mean, honestly, NVMe cost per TB isn't that much more than SATA SSD
>>>> now. Somewhat surprised the salesmen aren't pitching 3x replication, as
>>>> it makes them more money.
>>>
>>> To add to this, I have seen real cases as a Ceph consultant where size=2
>>> and min_size=1 on all-flash led to data loss.
>>>
>>> Picture this:
>>>
>>> - One node is down (maintenance, failure, etc.)
>>> - An NVMe device in the other node dies
>>> - You lose data
>>>
>>> Although you can bring back the other node, which was down but not
>>> broken, you are missing data. The data on the NVMe devices in there is
>>> outdated and thus the PGs will not become active.
>>>
>>> size=2 is only safe with min_size=2, but that doesn't really provide HA.
>>>
>>> The same goes for ZFS in mirror, raidz1, etc. If you lose one device,
>>> the chances are real that you lose the other device before the array has
>>> healed itself.
>>>
>>> With Ceph it's slightly more complex, but the same principles apply.
>>>
>>> No, with NVMe I would still highly advise against using size=2,
>>> min_size=1.
>>>
>>> The question is not if you will lose data, but when will you lose data?
>>> Within one year? 2? 3? 10?
>>>
>>> Wido
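To make Wido's "size=2 is only safe with min_size=2" concrete: these are
ordinary pool properties you can inspect and change at runtime. A sketch,
with "mypool" as a placeholder pool name:

ceph osd pool get mypool size
ceph osd pool get mypool min_size
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2

And the *very temporary* escape hatch I footnoted at the top, spelled out:

# Last resort during recovery: accept non-redundant writes just long enough
# to get stuck PGs active again, then restore min_size immediately.
ceph osd pool set mypool min_size 1
# ... wait for recovery to finish ...
ceph osd pool set mypool min_size 2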
>>>>
>>>> From: "Anthony D'Atri" <anthony.da...@gmail.com>
>>>> To: "ceph-users" <ceph-users@ceph.io>
>>>> Sent: Thursday, February 4, 2021 12:47:27 PM
>>>> Subject: [ceph-users] Re: NVMe and 2x Replica
>>>>
>>>>> I searched each to find the section where 2x was discussed. What I
>>>>> found was interesting. First, there are really only 2 positions here:
>>>>> Micron's and Red Hat's. Supermicro copies Micron's position paragraph
>>>>> word for word. Not surprising, considering that they are advertising a
>>>>> Supermicro / Micron solution.
>>>>
>>>> FWIW, at Cephalocon another vendor made a similar claim during a talk.
>>>>
>>>> * Failure rates are averages, not minima. Some drives will always fail
>>>> sooner.
>>>> * Firmware and other design flaws can result in much higher rates of
>>>> failure, or insidious UREs that can result in partial data
>>>> unavailability or loss.
>>>> * Latent soft failures may not be detected until a deep scrub succeeds,
>>>> which could be weeks later.
>>>> * In a distributed system, there are up/down/failure scenarios where the
>>>> location of even one good / canonical / latest copy of data is unclear,
>>>> especially when drive or HBA cache is in play.
>>>> * One of these is a power failure. Sure, PDU / PSU redundancy helps, but
>>>> stuff happens, like a DC underprovisioning amps so that a spike in user
>>>> traffic takes the whole row down :-x Various unpleasant things can
>>>> happen.
>>>>
>>>> I was championing R3 even pre-Ceph, when I was using ZFS or HBA RAID.
>>>> As others have written, as drives get larger the time to fill them with
>>>> replica data increases, as does the chance of overlapping failures. I've
>>>> experienced R2 overlapping failures more than once, with and before
>>>> Ceph.
>>>>
>>>> My sense has been that not many people run R2 for data they care about,
>>>> and as has been written recently, 2,2 EC is safer with the same
>>>> raw:usable ratio (see the PS below). I've figured that vendors make R2
>>>> statements like these as a selling point to assert lower TCO. My first
>>>> response is often "How much would it cost you, directly and indirectly
>>>> in terms of user / customer goodwill, to lose data?".
>>>>
>>>>> Personally, this looks like marketing BS to me. SSD shops want to sell
>>>>> SSDs, but because of the cost difference they have to convince buyers
>>>>> that their products are competitive.
>>>>
>>>> ^this. I'm watching the QLC arena with interest for the potential to
>>>> narrow the CapEx gap. Durability has been one concern, though I'm seeing
>>>> newer products claiming that e.g. ZNS improves that. It also seems that
>>>> there are something like, what, *4* separate EDSFF / ruler form factors.
>>>> I really want to embrace those, e.g. for object clusters, but I'm VERY
>>>> wary of the longevity of competing standards and of any single source
>>>> for chassis or drives.
>>>>
>>>>> Our products cost twice as much, but LOOK, you only need 2/3 as many,
>>>>> and you get all these other benefits (performance). Plus, if you
>>>>> replace everything in 2 or 3 years anyway, then you won't have to worry
>>>>> about them failing.
>>>>
>>>> Refresh timelines. You're funny ;) Every time, every single time, that
>>>> I've worked in an organization that claims a 3 (or 5, or whatever) year
>>>> hardware refresh cycle, it hasn't happened. When you start getting
>>>> close, the capex doesn't materialize, nor does the opex for DC hands and
>>>> operational oversight. "How do you know that the drives will start
>>>> failing or getting slower? Let's revisit this in 6 months." Etc.
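PS: since 2,2 EC with the same raw:usable ratio as R2 came up above, here is
a sketch of what that looks like. The profile and pool names are hypothetical
and the PG count is a placeholder; size it for your cluster:

ceph osd erasure-code-profile set ec22 k=2 m=2 crush-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec22

Unlike size=2, a 2+2 pool can lose two shards of a PG without losing data,
and IIRC Ceph defaults min_size on EC pools to k+1 (3 here), so it never
accepts non-redundant writes.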