> On Oct 26, 2020, at 00:07, Anthony D'Atri <anthony.da...@gmail.com> wrote:
>
>> I'm not entirely sure if primary on SSD will actually make the read happen 
>> on SSD.
>
> My understanding is that by default reads always happen from the lead OSD in 
> the acting set.  Octopus seems to (finally) have an option to spread the 
> reads around, which IIRC defaults to false.

I also remember that “by default reads always happen from the lead OSD in the 
acting set”. I dug through the git history, and it seems ceph-fuse has had a 
--localize-reads option for about 10 years [1], though it is not documented 
anywhere. I couldn’t find such a setting in the kernel ceph module.

[1]: 
https://github.com/ceph/ceph/commit/7912f5c7034bd26d22615d1be1d398849e124749
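
For reference, my guess (untested) is that this maps to the client_localize_reads
config option, so enabling it on a ceph-fuse mount might look roughly like the
sketch below; the option name and mount point are my assumptions, not something
I have verified.

# Untested sketch: enable localized reads for a ceph-fuse client.
# Assumes the option from [1] is exposed as the client_localize_reads config.
ceph-fuse --client_localize_reads=true /mnt/cephfs

# Or, persistently, in ceph.conf on the client:
# [client]
#     client localize reads = true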

> I’ve never seen anything that implies that lead OSDs within an acting set are 
> a function of CRUSH rule ordering. I’m not asserting that they aren’t though, 
> but I’m … skeptical.

That conclusion is from experiments: I created an empty pool using the 
above-mentioned CRUSH rule, and all 32 PGs got an SSD OSD as the primary.
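
For anyone who wants to reproduce this, roughly what I did (the pool name and
osd.5 are just examples from my cluster):

# Create a test pool that uses the rule, then check which OSD is primary per PG.
ceph osd pool create mixed-test 32 32 replicated mixed_replicated_rule
ceph pg ls-by-pool mixed-test            # the acting primary is listed for each PG
ceph osd crush get-device-class osd.5    # cross-check the class of a given primary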

> Setting primary affinity would do the job, and you’d want to have cron 
> continually update it across the cluster to react to topology changes.  I was 
> told of this strategy back in 2014, but haven’t personally seen it 
> implemented.

I’m also considering this. But if I set the primary affinity of all HDD OSDs to 0, 
what would happen if I later create another all-HDD pool? Or should I just set the 
primary affinity to a very small value, say 0.00001?
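
For the record, a rough, untested sketch of what such a cron job could look like
(it assumes all HDD OSDs carry the hdd device class; the 0 could be replaced by a
tiny value to keep a future all-HDD pool behaving sensibly):

#!/bin/sh
# Rough sketch: prefer SSD OSDs as primaries by lowering HDD primary affinity.
# Run periodically (e.g. from cron) so newly added HDD OSDs are picked up too.
for osd in $(ceph osd crush class ls-osd hdd); do
    ceph osd primary-affinity "osd.${osd}" 0    # or a tiny value such as 0.00001
done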

> That said, HDDs are more of a bottleneck for writes than reads and just might 
> be fine for your application.  Tiny reads are going to limit you to some 
> degree regardless of drive type, and you do mention throughput, not IOPS.
>
> I must echo Frank’s notes about capacity too.  Ceph can do a lot of things, 
> but that doesn’t mean something exotic is necessarily the best choice.  
> You’re concerned about 3R only yielding 1/3 of raw capacity if using an 
> all-SSD cluster, but the architecture you propose limits you anyway because 
> of drive size. Consider chassis, CPU, RAM, RU, and switch port costs as well, 
> and the cost of you fussing over an exotic solution instead of the hundreds 
> of other things in your backlog.
>
> And your cluster as described is *tiny*.  Honestly I’d suggest considering 
> one of these alternatives:
>
> * Ditch the HDDs, use QLC flash.  The emerging EDSFF drives are really 
> promising for replacing HDDs for density in this kind of application.  You 
> might even consider ARM if IOPS aren’t a concern.
> * An NVMeoF solution

Thanks for the advice, we will discuss these options. But this deployment is on 
existing server hardware, so we don’t have many choices, and our budget is very 
limited. We want to make the best use of our existing SSDs, and we have plenty of 
cold data to fill the HDDs, so we are not worried about wasted HDD capacity.

Sorry Anthony, I sent this mail twice; I forgot to CC the mailing list at first.

> Cache tiers are “deprecated”, but then so are custom cluster names.  Neither 
> appears
>
>> For EC pools there is an option "fast_read" 
>> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
>>  which states that a read will return as soon as the first k shards have 
>> arrived. The default is to wait for all k+m shards (all replicas). This 
>> option is not available for replicated pools.
>>
>> Now, not sure if this option is not available for replicated pools because 
>> the read will always be served by the acting primary, or if it currently 
>> waits for all replicas. In the latter case, reads will wait for the slowest 
>> device.
>>
>> I'm not sure if I interpret this correctly. I think you should test the 
>> setup with HDD only and SSD+HDD to see if read speed improves. Note that 
>> write speed will always depend on the slowest device.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <fr...@dtu.dk>
>> Sent: 25 October 2020 15:03:16
>> To: 胡 玮文; Alexander E. Patrakov
>> Cc: ceph-users@ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated 
>> pool
>>
>> A cache pool might be an alternative, heavily depending on how much data is 
>> hot. However, then you will have much less SSD capacity available, because 
>> it also requires replication.
>>
>> Looking at your setup, you have only 10*1T = 10T SSD but 20*6T = 120T HDD, 
>> so you will probably run short of SSD capacity. Or, looking at it the other 
>> way around, with copies on 1 SSD + 3 HDD, you will only be able to use about 
>> 30T out of the 120T HDD capacity.
>>
>> With this replication, the usable storage will be 10T and raw used will be 
>> 10T SSD and 30T HDD. If you can't do anything else on the HDD space, you 
>> will need more SSDs. If your servers have more free disk slots, you can add 
>> SSDs over time until you have at least 40T SSD capacity to balance SSD and 
>> HDD capacity.
>>
>> Personally, I think the 1SSD + 3HDD is a good option compared with a cache 
>> pool. You have the data security of 3-times replication and, if everything 
>> is up, need only 1 copy in the SSD cache, which means that you have 3 times 
>> the cache capacity.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: 胡 玮文 <huw...@outlook.com>
>> Sent: 25 October 2020 13:40:55
>> To: Alexander E. Patrakov
>> Cc: ceph-users@ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated 
>> pool
>>
>> Yes. This is a limitation of the CRUSH algorithm, in my mind. In order to 
>> guard against 2 host failures, I’m going to use 4 replicas, 1 on SSD and 
>> 3 on HDD. This will work as intended, right? Because at least I can ensure 
>> the 3 HDDs are from different hosts.
>>
>>>> On Oct 25, 2020, at 20:04, Alexander E. Patrakov <patra...@gmail.com> wrote:
>>>
>>> On Sun, Oct 25, 2020 at 12:11 PM huw...@outlook.com <huw...@outlook.com> 
>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> We are planning for a new pool to store our dataset using CephFS. These 
>>>> data are almost read-only (but not guaranteed) and consist of a lot of 
>>>> small files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and 
>>>> we will deploy about 10 such nodes. We aim at getting the highest read 
>>>> throughput.
>>>>
>>>> If we just use a replicated pool of size 3 on SSD, we should get the best 
>>>> performance; however, that only leaves us 1/3 of the SSD space as usable. 
>>>> And EC pools are not friendly to such a small-object read workload, I think.
>>>>
>>>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I 
>>>> want 3 data replicas, each on a different host (failure domain): 1 of 
>>>> them on SSD, the other 2 on HDD. And normally every read request is 
>>>> directed to the SSD. So, if every SSD OSD is up, I’d expect the same read 
>>>> throughput as with the all-SSD deployment.
>>>>
>>>> I’ve read the documents and did some tests. Here is the crush rule I’m 
>>>> testing with:
>>>>
>>>> rule mixed_replicated_rule {
>>>>      id 3
>>>>      type replicated
>>>>      min_size 1
>>>>      max_size 10
>>>>      step take default class ssd
>>>>      step chooseleaf firstn 1 type host
>>>>      step emit
>>>>      step take default class hdd
>>>>      step chooseleaf firstn -1 type host
>>>>      step emit
>>>> }
>>>>
>>>> Now I have the following conclusions, but I’m not very sure:
>>>> * The first OSD produced by CRUSH will be the primary OSD (at least if I 
>>>> don’t change the “primary affinity”). So, the above rule is guaranteed to 
>>>> map an SSD OSD as the primary of each PG, and every read request will be 
>>>> served from the SSD if it is up.
>>>> * It is currently not possible to enforce that the SSD and HDD OSDs are 
>>>> chosen from different hosts. So, if I want to ensure data availability 
>>>> even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means 
>>>> setting the replication size to 4, instead of the ideal value 3, on the 
>>>> pool using the above crush rule.
>>>>
>>>> Am I correct about the above statements? How would this work from your 
>>>> experience? Thanks.
>>>
>>> This works (i.e. guards against host failures) only if you have
>>> strictly separate sets of hosts that have SSDs and that have HDDs.
>>> I.e., there should be no host that has both, otherwise there is a
>>> chance that one hdd and one ssd from that host will be picked.
>>>
>>> --
>>> Alexander E. Patrakov
>>> CV: http://pc.cd/PLz7
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
