RAID10 will also suffer from LSEs on big disks, won't it?

> On 7 Dec 2016, at 13:35, Christian Balzer <ch...@gol.com> wrote:
>
> Hello,
>
> On Wed, 7 Dec 2016 13:16:45 +0300 Дмитрий Глушенок wrote:
>
>> Hi,
>>
>> Let me add a little math to your warning: with an LSE rate of 1 in 10^15
>> bits on modern 8 TB disks there is a 5.8% chance of hitting an LSE during
>> the recovery of an 8 TB disk. So roughly every 18th recovery will fail.
>> Just as two parity disks do for RAID6, size=3 mitigates the problem.
>
> Indeed.
> That math changes significantly of course if you have very reliable,
> endurable, well monitored and fast SSDs of not too big a size.
> Something that will recover in less than an hour.
>
> So people with SSD pools might have an acceptable risk.
>
> That being said, I'd prefer size 3 for my SSD pool as well; alas, both
> cost and the increased latency stopped me this time.
> Next round I'll upgrade my HW requirements and budget.
>
>> By the way, why is it a common opinion that using RAID (RAID6) under Ceph
>> (with size=2) is a bad idea? It is cheaper than size=3, all hardware disk
>> errors are handled by the RAID controller (instead of by the OS/Ceph), it
>> decreases the OSD count, adds some battery-backed cache and increases the
>> performance of a single OSD.
>
> I did run something like that, and if your IOPS needs are low enough it
> works well (the larger the HW cache, the better).
> But once you exceed what the HW cache can coalesce it degrades badly,
> something that's usually triggered by very mixed R/W ops and/or deep
> scrubs.
> It also depends on your cluster size: if you have dozens of OSDs based on
> such a design, it will work a lot better than with just a few.
>
> I changed it to RAID10s with 4 HDDs each, since I needed the speed (IOPS)
> and didn't require all the space.
>
> Christian
>
>>> On 7 Dec 2016, at 11:08, Wido den Hollander <w...@42on.com> wrote:
>>>
>>> Hi,
>>>
>>> As a Ceph consultant I get numerous calls throughout the year to help
>>> people get their broken Ceph clusters back online.
>>>
>>> The causes of downtime vary vastly, but one of the biggest is that
>>> people use 2x replication: size = 2, min_size = 1.
>>>
>>> In 2016 the number of cases I saw where data was lost due to these
>>> settings grew exponentially.
>>>
>>> Usually a disk fails, recovery kicks in, and while recovery is happening
>>> a second disk fails, causing PGs to become incomplete.
>>>
>>> There have been too many times where I had to run xfs_repair on broken
>>> disks and use ceph-objectstore-tool to export/import PGs.
>>>
>>> I really don't like these cases, mainly because they can easily be
>>> prevented by using size = 3 and min_size = 2 for all pools.
>>>
>>> With size = 2 you enter the danger zone as soon as a single disk/daemon
>>> fails. With size = 3 you still have two copies left after a single
>>> failure, keeping your data safe(r).
>>>
>>> If you are running CephFS, at least consider running the metadata pool
>>> with size = 3 to keep the MDS happy.
>>>
>>> Please let this be a big warning to everybody who is running with
>>> size = 2. The downtime and problems caused by missing objects/replicas
>>> are usually severe, and it takes days to recover from them. Very often
>>> data is also lost and/or corrupted, which causes even more problems.
>>>
>>> I can't stress this enough: running with size = 2 in production is a
>>> SERIOUS hazard and should not be done, imho.
>>>
>>> To anyone out there running with size = 2, please reconsider this!
>>>
>>> Thanks,
>>>
>>> Wido
>>
>> --
>> Dmitry Glushenok
>> Jet Infosystems
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
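
P.S. For anyone who wants to reproduce the LSE math quoted above: the exact
percentage depends on whether the 8 TB is taken as decimal or binary and on
how the vendor states the error rate, but a quick back-of-the-envelope check
lands in the same ballpark as the 5.8% figure. A minimal sketch, assuming an
LSE rate of 1 per 10^15 bits read and 8 TB = 8 * 10^12 bytes:

  import math

  def p_lse(disk_bytes, ber=1e-15):
      """P(at least one unrecoverable read error) when reading disk_bytes once."""
      bits = disk_bytes * 8
      # log1p/expm1 keep the result accurate despite the tiny per-bit rate
      return -math.expm1(bits * math.log1p(-ber))

  p = p_lse(8e12)                                  # "8 TB" as 8 * 10^12 bytes
  print("single full-disk read: %.1f%%" % (p * 100))  # roughly 6%
  print("about 1 in %.0f recoveries" % (1 / p))        # roughly 1 in 16-17

With size=3 an LSE hit during recovery can still be repaired from the second
surviving replica, which is why it mitigates the problem much like the second
parity disk does for RAID6.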
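
And to quickly check a running cluster for pools matching Wido's warning,
something along these lines should work. It is only a sketch: it assumes the
ceph CLI is available on the host and that the JSON output of
'ceph osd pool ls detail' carries pool_name/size/min_size fields under those
names (they may differ between releases):

  import json
  import subprocess

  def risky_pools():
      """Pools still running with size < 3 or min_size < 2."""
      out = subprocess.check_output(
          ["ceph", "osd", "pool", "ls", "detail", "--format=json"])
      pools = json.loads(out.decode())
      # Field names as found in the JSON dump; adjust if your release differs.
      return [p["pool_name"] for p in pools
              if p.get("size", 0) < 3 or p.get("min_size", 0) < 2]

  for name in risky_pools():
      print("WARNING: pool '%s' has size < 3 or min_size < 2" % name)

Raising the settings afterwards is a matter of 'ceph osd pool set <pool>
size 3' followed by 'ceph osd pool set <pool> min_size 2'; expect backfill
traffic while the extra replicas are created.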

--
Дмитрий Глушенок
Инфосистемы Джет
+7-910-453-2568
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com