Re: [ceph-users] what happen to the OSDs if the OS disk dies?

Elias Abacioglu Mon, 31 Oct 2016 01:51:36 -0700

Hi Felix,

I have experience running Ceph on SATADOM on the R630, and it went badly
because we got bad SATADOMs from Dell.
If you are going to use SATADOM, make sure to buy directly from an Innodisk
reseller and not from Dell.
We bought our SATADOMs from Dell and they degraded in 5-6 months. The reason
is that Dell is too cheap to get decent SATADOMs: Innodisk makes SATADOMs
with and without TRIM, and Dell resells the ones without TRIM and uses them
for their Nutanix XC Series.


And here is the quirk. You won't find a regular 4-pin Molex power connector
inside the Dell R series. They have a small 4-pin power connector on the mobo
next to the internal SATA slots, but it's not a regular 4-pin 12V ATX
connector, it is smaller.

So unless you can get Dell to sell custom power cables or better SATADOMs,
you need to buy the SATADOM from Dell, which includes a small cable that
fits, and then you are screwed in a couple of months because their SATADOM
doesn't do TRIM.

I've also tried using R630 internal USB port with Corsair Voyager GTX USB
flash drive which supports TRIM, but unfortunately Linux (we are running
v4.4.0) does not send TRIM over USB. In MS Windows TRIM works with that USB
drive, I tested that with my laptop using virtualbox.
So these USB drives will degrade as well.
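
One way to check whether the kernel will issue TRIM to a given device is to
look at what the block layer advertises in sysfs. The following is only a
minimal sketch under assumptions, not part of the tests described above: it
reads /sys/block/<device>/queue/discard_max_bytes, where a value of 0 means
the kernel does not send discards to that device. The function name
discard_supported and the sysfs_root parameter are made up for illustration.

```python
from pathlib import Path


def discard_supported(device: str, sysfs_root: str = "/sys/block") -> bool:
    """Return True if the kernel advertises discard (TRIM) for `device`.

    Reads <sysfs_root>/<device>/queue/discard_max_bytes. A value of 0
    means the block layer will not issue discards to that device.
    `sysfs_root` is a hypothetical parameter so the check can be tested
    against a fake directory tree.
    """
    path = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(path.read_text().strip()) > 0
    except (OSError, ValueError):
        # Missing device or unreadable attribute: treat as unsupported.
        return False
```

Calling discard_supported("sdb") for the USB stick (the device name is just
an example) should give a quick yes/no before relying on fstrim.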

Whenever you are trying to do something smart, there is always a quirk it
seems.

/Elias

On Tue, Aug 16, 2016 at 10:43 AM, Félix Barbeira <fbarbe...@gmail.com>
wrote:

> Thanks everybody for the answers, it really helped me a lot. So, to sum
> up, these are the options that I have:
>
>
>    - OS on a RAID1.
>       - PROS: the cluster is protected against OS failures. If one of
>       these disks fails, it can be easily replaced because it is
>       hot-swappable.
>       - CONS: we are "wasting" 2 disk bays that could be dedicated
>       to OSDs.
>
> * In the case of the R730xd we have the option to put 2x2.5" SSDs in
> the slots on the back, like Brian says. For me this is clearly the best
> option. We'll see if the finance department has the same opinion :)
>
>
>    - OS on a single disk.
>       - PROS: we are using only 1 disk slot. It could be a cheaper disk than
>       the 4TB model because we are only going to use ~10GB.
>       - CONS: the OS is not protected against failures, and if this disk
>       fails, the OSDs in this machine (11) fail too. In this case we might
>       try to adjust the configuration so that the data of all these OSDs is
>       not reconstructed and wait until the OS disk is replaced (I'm not sure
>       if this is possible, I should check the docs).
>    - OS on a SATADOM ( http://www.innodisk.com/intel/product.html )
>       - PROS: we have all the disk slots available to use for OSDs.
>       - CONS: I have no experience with this kind of device, and I'm not
>       sure if they are trustworthy. These devices are fast but they are not
>       RAID protected; it's a single point of failure like the previous option.
>    - OS booted from a SAN (this is the option I'm considering for the non
>    R730xd machines, which do not have the 2x2.5" slots on the back).
>       - PROS: all the disk slots are available for OSDs. The OS disk is
>       protected by RAID on the remote storage.
>       - CONS: we depend on the network. I guess the OS device does not
>       require a lot of traffic, and all the Ceph OSD network traffic should
>       be handled by another network card.
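
As a rough worked example of the bay trade-off in the list above, here is the
raw OSD capacity per node for each option. This is only a sketch: it assumes
the 12 bays of 4TB disks from the original post, counts raw capacity before
any Ceph replication, and the mapping of options to bays used for the OS is
my own reading of the list.

```python
BAYS = 12      # front drive bays per node, from the original post
DISK_TB = 4    # capacity of each data disk in TB

# Front bays each option spends on the OS (assumed mapping, for illustration).
OS_BAYS = {
    "RAID1 in front bays": 2,
    "single OS disk": 1,
    "SATADOM, SAN boot or rear 2.5in slots": 0,
}


def raw_osd_capacity_tb(bays_for_os: int) -> int:
    """Raw TB left for OSDs on one node after reserving bays for the OS."""
    return (BAYS - bays_for_os) * DISK_TB


for option, bays in OS_BAYS.items():
    print(f"{option}: {BAYS - bays} OSDs, {raw_osd_capacity_tb(bays)} TB raw")
```

So a RAID1 in the front bays costs 8TB raw per node compared with the options
that keep all 12 bays for OSDs, which is the "wasted" space mentioned in the
original question.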
>
> Maybe I'm missing some other option; in that case please tell me, it would
> be helpful.
>
> It would be really helpful if somebody who has experience with booting
> the OS from a SAN could share the pros and cons, because that option is
> very interesting to me.
>
>
> 2016-08-14 14:57 GMT+02:00 Christian Balzer <ch...@gol.com>:
>
>>
>> Hello,
>>
>> I shall top-quote, summarize here.
>>
>> Firstly we have to consider that Ceph is deployed by people with a wide
>> variety of needs, budgets and most of all cluster sizes.
>>
>> Wido has the pleasure (or is that nightmare? ^o^) to deal with a really
>> huge cluster, thousands of OSDs and a correspondingly large number of nodes
>> (if memory serves me).
>>
>> Many others, like me, have comparatively small clusters, with decidedly
>> fewer than 10 storage nodes.
>>
>> So the approach and philosophy are obviously going to differ quite a bit
>> on either end of this spectrum.
>>
>> If you start large (dozens of nodes and hundreds of OSDs), where only a
>> small fraction of your data (10% or less) is in a failure domain (host
>> initially), then you can play fast and loose and save a lot of money by
>> designing your machines and infrastructure accordingly.
>> You can do without things like redundant OS drives, PSUs, even network
>> links on the host if the cluster is big enough.
>> In a cluster of sufficient size, a node failure and the resulting data
>> movement are just background noise.
>>
>> OTOH with smaller clusters, you obviously want to avoid failures if at all
>> possible, since not only is the re-balancing going to be more painful, but
>> the resulting smaller cluster will also have less performance.
>> This is why my OSD nodes have all the redundancy bells and whistles there
>> are, simply because a cluster big enough to not need them would be both
>> vastly more expensive, despite cheaper individual node costs, and also
>> underutilized.
>>
>> Of course if you should grow to a certain point, maybe your next
>> generation of OSD nodes can be built on the cheap w/o compromising safe
>> operations.
>>
>> No matter what size your cluster is though, setting
>> "mon_osd_down_out_subtree_limit" to an appropriate value (host for small
>> clusters) is a good way to avoid re-balancing storms when a node (or some
>> larger segment) goes down, given that recovering the failed part can be
>> significantly faster than moving tons of data around.
>> This of course implies 24/7 monitoring and access to the HW.
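
To illustrate the effect of that setting, here is a toy model of the
decision. It is a simplified sketch, not Ceph's actual code: it only knows
about hosts (no racks or larger CRUSH buckets), ignores the down/out timer,
and the function name and data layout are invented for the example. The idea
it shows is that with the limit set to "host", OSDs are not automatically
marked out when their entire host is down, while a single failed OSD on an
otherwise healthy host still is.

```python
from typing import Dict, List


def osds_to_mark_out(cluster: Dict[str, Dict[int, bool]],
                     subtree_limit: str = "host") -> List[int]:
    """Return ids of down OSDs that this toy model would mark out.

    cluster maps host name -> {osd_id: is_up}. With subtree_limit="host",
    OSDs on a host where every OSD is down are left alone so an operator
    can bring the host back. Any other value is treated as offering no
    host-level protection (a simplification: this model has no racks).
    """
    out: List[int] = []
    for osds in cluster.values():
        down = [osd_id for osd_id, is_up in osds.items() if not is_up]
        whole_host_down = bool(osds) and len(down) == len(osds)
        if subtree_limit == "host" and whole_host_down:
            continue  # whole host is down: do not trigger re-balancing
        out.extend(down)
    return sorted(out)


cluster = {
    "node1": {0: False, 1: False, 2: False},  # OS disk died, all OSDs down
    "node2": {3: False, 4: True},             # one failed data disk
}
print(osds_to_mark_out(cluster, "host"))  # only the single failed disk
print(osds_to_mark_out(cluster, "rack"))  # the dead host's OSDs as well
```

In ceph.conf the option itself would typically be a line such as
"mon osd down out subtree limit = host" in the [mon] or [global] section;
check the docs for your release before relying on that.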
>>
>>
>> As for dedicated MONs, I usually try to have the primary MON (lowest IP)
>> on dedicated HW and to be sure that MONs residing on OSD nodes have fast
>> storage and enough CPU/RAM to be happy even if the OSDs go on full spin.
>>
>> Which incidentally is why your shared MONs are likely a better fit for an
>> HDD based OSD node than an SSD based one used for a cache pool for example.
>>
>> Anyway, MONs are clearly candidates for having their OS (where /var/lib
>> resides) on RAIDed, hot-swappable, fast, durable and power-loss safe
>> SSDs, just so you can avoid losing one and having to shut down the whole
>> thing in the (unlikely) case of an SSD failure.
>>
>>
>> Regards,
>>
>> Christian
>>
>> On Sat, 13 Aug 2016 09:43:26 +0200 w...@42on.com wrote:
>>
>> >
>> >
>> > > On 13 Aug 2016, at 08:58, Georgios Dimitrakakis
>> > > <gior...@acmac.uoc.gr> wrote:
>> > >
>> > >
>> > >>> On 13 Aug 2016, at 03:19, Bill Sharer wrote:
>> > >>>
>> > >>> If all the system disk does is handle the o/s (i.e. osd journals are
>> > >>> on dedicated or osd drives as well), no problem. Just rebuild the
>> > >>> system and copy the ceph.conf back in when you re-install ceph.
>> > >>> Keep a spare copy of your original fstab to keep your osd filesystem
>> > >>> mounts straight.
>> > >>
>> > >> With systems deployed with ceph-disk/ceph-deploy you no longer need an
>> > >> fstab. Udev handles it.
>> > >>
>> > >>> Just keep in mind that you are down 11 osds while that system drive
>> > >>> gets rebuilt though. It's safer to do 10 osds and then have a
>> > >>> mirror set for the system disk.
>> > >>
>> > >> In the years that I have run Ceph I have rarely seen OS disks fail.
>> > >> Why bother? Ceph is designed for failure.
>> > >>
>> > >> I would not sacrifice an OSD slot for an OS disk. Also, let's say an
>> > >> additional OS disk is €100.
>> > >>
>> > >> If you put that disk in 20 machines that's €2,000. For that money
>> > >> you can even buy an additional chassis.
>> > >>
>> > >> No, I would run on a single OS disk. It fails? Let it fail. Re-install
>> > >> and you're good again.
>> > >>
>> > >> Ceph makes sure the data is safe.
>> > >>
>> > >
>> > > Wido,
>> > >
>> > > can you elaborate a little bit more on this? How does Ceph achieve
>> > > that? Is it by redundant MONs?
>> > >
>> >
>> > No, Ceph replicates over hosts by default. So you can lose a host and
>> > the other ones will have copies.
>> >
>> >
>> > > To my understanding the OSD mapping is needed to bring the cluster
>> > > back. In our setup (I assume in others as well) that is stored on the
>> > > OS disk. Furthermore, our MONs are running on the same hosts as the
>> > > OSDs. So if the OS disk fails, not only do we lose the OSD host but we
>> > > also lose the MON node. Is there another way to be protected against
>> > > such a failure besides additional MONs?
>> > >
>> >
>> > Aha, MON on the OSD host. I never recommend that. Try to use dedicated
>> > machines with a good SSD for MONs.
>> >
>> > Technically you can run the MON on the OSD nodes, but I always try to
>> > avoid it. It just isn't practical when stuff really goes wrong.
>> >
>> > Wido
>> >
>> > > We recently had a problem where a user accidentally deleted a volume.
>> > > Of course this has nothing to do with OS disk failure itself, but it
>> > > prompted us to start looking for other possible failures on our system
>> > > that could jeopardize data, and this thread got my attention.
>> > >
>> > >
>> > > Warmest regards,
>> > >
>> > > George
>> > >
>> > >
>> > >> Wido
>> > >>
>> > >> Bill Sharer
>> > >>
>> > >>> On 08/12/2016 03:33 PM, Ronny Aasen wrote:
>> > >>>
>> > >>>> On 12.08.2016 13:41, Félix Barbeira wrote:
>> > >>>>
>> > >>>> Hi,
>> > >>>>
>> > >>>> I'm planning to build a ceph cluster but I have a serious question.
>> > >>>> At the moment we have ~10 DELL R730xd servers with 12x4TB SATA
>> > >>>> disks. The official ceph docs say:
>> > >>>>
>> > >>>> "We recommend using a dedicated drive for the operating system and
>> > >>>> software, and one drive for each Ceph OSD Daemon you run on the
>> > >>>> host."
>> > >>>>
>> > >>>> I could use for example 1 disk for the OS and 11 for OSD data. In
>> > >>>> the operating system I would run 11 daemons to control the OSDs.
>> > >>>> But... what happens to the cluster if the disk with the OS fails?
>> > >>>> Maybe the cluster thinks that 11 OSDs failed and tries to replicate
>> > >>>> all that data over the cluster... that sounds no good.
>> > >>>>
>> > >>>> Should I use 2 disks for the OS in a RAID1? In this case I'm
>> > >>>> "wasting" 8TB only for the ~10GB that the OS needs.
>> > >>>>
>> > >>>> All the docs that I've been reading say ceph has no
>> > >>>> single point of failure, so I think that this scenario must have an
>> > >>>> optimal solution; maybe somebody could help me.
>> > >>>>
>> > >>>> Thanks in advance.
>> > >>>>
>> > >>>> --
>> > >>>>
>> > >>>> Félix Barbeira.
>> > >>> If you do not have dedicated slots on the back for OS disks, then I
>> > >>> would recommend using SATADOM flash modules plugged directly into an
>> > >>> internal SATA port in the machine. That saves you 2 slots for OSDs,
>> > >>> and they are quite reliable. You could even use 2 SD cards if your
>> > >>> machine has the internal SD slot.
>> > >>>
>> > >>>
>> > >>> http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf [1]
>> > >>>
>> > >>> kind regards
>> > >>> Ronny Aasen
>> > >>>
>> > >>> _______________________________________________
>> > >>> ceph-users mailing list
>> > >>> ceph-users@lists.ceph.com [2]
>> > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [3]
>> > >>>
>> > >>
>> > >>
>> > >> Links:
>> > >> ------
>> > >> [1] http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf
>> > >> [2] mailto:ceph-users@lists.ceph.com
>> > >> [3] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > >> [4] mailto:bsha...@sharerland.com
>> > >
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>>
>
>
>
> --
> Félix Barbeira.
>
>
>

