RAID10 will also suffer from LSEs on big disks, won't it?

> On 7 Dec 2016, at 13:35, Christian Balzer <ch...@gol.com> wrote:
> 
> 
> 
> Hello,
> 
> On Wed, 7 Dec 2016 13:16:45 +0300 Дмитрий Глушенок wrote:
> 
>> Hi,
>> 
>> Let me add a little math to your warning: with an LSE rate of 1 in 10^15 on 
>> modern 8 TB disks there is a 5.8% chance of hitting an LSE during the recovery 
>> of an 8 TB disk. So roughly every 18th recovery will probably fail. Just as with 
>> RAID6 (two parity disks), size=3 mitigates the problem.
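
For reference, a quick back-of-the-envelope version of that estimate, assuming
the 1-in-10^15 rate is per bit read and that a rebuild has to read the full 8 TB:

  import math

  disk_bytes = 8e12            # 8 TB, decimal
  bits_read = disk_bytes * 8   # 6.4e13 bits read per rebuild
  lse_rate = 1e-15             # unrecoverable read errors per bit read

  # P(at least one LSE) = 1 - (1 - rate)^bits; log1p/expm1 avoid precision
  # loss with such a tiny rate.
  p_lse = -math.expm1(bits_read * math.log1p(-lse_rate))
  print(f"P(LSE during one 8 TB rebuild): {p_lse:.1%}")  # ~6%, close to the 5.8% above
  print(f"roughly 1 rebuild in {1/p_lse:.0f}")           # same order as 'every 18th'
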
> 
> Indeed.
> That math changes significantly of course if you have very reliable,
> high-endurance, well-monitored and fast SSDs that are not too big.
> Something that will recover in less than an hour.
> 
> So people with SSD pools might have an acceptable risk.
> 
> That being said, I'd prefer size 3 for my SSD pool as well; alas, both the cost
> and the increased latency stopped me this time.
> Next round I'll upgrade my HW requirements and budget.
> 
>> By the way, why is it a common opinion that using RAID (RAID6) with Ceph 
>> (size=2) is a bad idea? It is cheaper than size=3, all hardware disk errors 
>> are handled by the RAID (instead of the OS/Ceph), it decreases the OSD count, 
>> adds some battery-backed cache and increases the performance of a single OSD.
>> 
> 
> I did run something like that and if your IOPS needs are low enough it
> works well (the larger the HW cache, the better).
> But once you exceed what the HW cache can coalesce, it degrades
> badly, something that's usually triggered by very mixed R/W ops and/or
> deep scrubs.
> It also depends on your cluster size: if you have dozens of OSDs based on
> such a design, it will work a lot better than with just a few.
> 
> I changed it to RAID10s with 4 HDDs each since I needed the speed (IOPS)
> and didn't require all the space.
> 
> Christian
> 
>>> On 7 Dec 2016, at 11:08, Wido den Hollander <w...@42on.com> wrote:
>>> 
>>> Hi,
>>> 
>>> As a Ceph consultant I get numerous calls throughout the year to help 
>>> people get their broken Ceph clusters back online.
>>> 
>>> The causes of downtime vary widely, but one of the biggest is that 
>>> people use 2x replication: size = 2, min_size = 1.
>>> 
>>> In 2016 the number of cases I saw where data was lost due to these 
>>> settings grew exponentially.
>>> 
>>> Usually a disk fails, recovery kicks in, and while recovery is happening a 
>>> second disk fails, causing PGs to become incomplete.
>>> 
>>> There have been too many times where I had to run xfs_repair on broken disks 
>>> and use ceph-objectstore-tool to export/import PGs.
>>> 
>>> I really don't like these cases, mainly because they can be prevented 
>>> easily by using size = 3 and min_size = 2 for all pools.
>>> 
>>> With size = 2 you go into the danger zone as soon as a single disk/daemon 
>>> fails. With size = 3 you always have two additional copies left, thus 
>>> keeping your data safe(r).
>>> 
>>> If you are running CephFS, at least consider running the 'metadata' pool 
>>> with size = 3 to keep the MDS happy.
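
For reference, this is a per-pool setting; a minimal sketch (the pool names
'rbd' and 'cephfs_metadata' are just examples, substitute your own):

  # bump replication on a data pool
  ceph osd pool set rbd size 3
  ceph osd pool set rbd min_size 2

  # same for the CephFS metadata pool
  ceph osd pool set cephfs_metadata size 3
  ceph osd pool set cephfs_metadata min_size 2

  # verify
  ceph osd pool get rbd size
  ceph osd pool get rbd min_size
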
>>> 
>>> Please, let this be a big warning to everybody who is running with size = 
>>> 2. The downtime and problems caused by missing objects/replicas are usually 
>>> severe, and it takes days to recover from them. Very often data is also lost 
>>> and/or corrupted, which causes even more problems.
>>> 
>>> I can't stress this enough. Running with size = 2 in production is a 
>>> SERIOUS hazard and should not be done imho.
>>> 
>>> To anyone out there running with size = 2, please reconsider this!
>>> 
>>> Thanks,
>>> 
>>> Wido
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>> 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer                
> ch...@gol.com          Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
--
Dmitry Glushenok
Jet Infosystems
+7-910-453-2568

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
