I suggest setting logging to 0/5 on everything. Depending on your
desire for reliability and availability, you may want to change your
pool min_size/size to 2/4 and adjust your CRUSH map to include rack.
Then instruct CRUSH to place two copies in each rack. That way if you
lose power to a rack, you can still continue with minimal
interruption.
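
For reference, the 0/5 levels would go in ceph.conf along these lines (just a
sketch with a handful of the debug subsystems, not an exhaustive list):

        [global]
        debug ms = 0/5
        debug osd = 0/5
        debug mon = 0/5
        debug filestore = 0/5
        debug journal = 0/5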

You would want a rule similar to this:
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
}
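
To apply a rule like that, a rough sketch of the usual workflow (the pool name,
file names, and rule number are placeholders for your cluster):

        ceph osd getcrushmap -o crushmap.bin
        crushtool -d crushmap.bin -o crushmap.txt
        # edit crushmap.txt: add the rack buckets and paste in the rule above
        crushtool -c crushmap.txt -o crushmap.new
        ceph osd setcrushmap -i crushmap.new

        ceph osd pool set <pool> crush_ruleset <rule-number>
        ceph osd pool set <pool> size 4
        ceph osd pool set <pool> min_size 2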

I would also set:
mon osd down out subtree limit = host

so that if you lose power to a whole rack, Ceph won't automatically mark all of
those OSDs out and start recovering their data. With only two racks this matters
less anyway, since the rule has nowhere else to place the lost copies. If you
move to three racks, you can adjust the min_size/size to 2/3 and change the rule to:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}
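
The matching pool settings for the three-rack layout, plus the down out limit in
ceph.conf, would look something like this (again just a sketch, the pool name is
a placeholder):

        ceph osd pool set <pool> size 3
        ceph osd pool set <pool> min_size 2

        # ceph.conf, [global] or [mon] section:
        mon osd down out subtree limit = host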

Other than that, the defaults are pretty good.



----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Thu, Aug 27, 2015 at 1:42 PM, German Anders <gand...@despegar.com> wrote:

> Thanks a lot Robert and Jan for the comments about the available and
> possible disk layouts. Is there any advice from the point of view of
> configuration? any tunable parameters, crush algorithm?
>
> Thanks a lot,
>
> Best regards,
>
> *German*
>
> 2015-08-27 16:37 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:
>
>> On Thu, Aug 27, 2015 at 1:13 PM, Jan Schermer  wrote:
>> >
>> >> On 27 Aug 2015, at 20:57, Robert LeBlanc  wrote:
>> >>
>> >> On Thu, Aug 27, 2015 at 10:25 AM, Jan Schermer  wrote:
>> >>> Some comments inline.
>> >>> A lot of it depends on your workload, but I'd say you almost certainly 
>> >>> need
>> >>> higher-grade SSDs. You can save money on memory.
>> >>>
>> >>> What will be the role of this cluster? VM disks? Object storage?
>> >>> Streaming?...
>> >>>
>> >>> Jan
>> >>>
>> >>> On 27 Aug 2015, at 17:56, German Anders  wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>>   I'm planning to deploy a new Ceph cluster with IB FDR 56Gb/s and I've 
>> >>> the
>> >>> following HW:
>> >>>
>> >>> 3x MON Servers:
>> >>>   2x Intel Xeon E5-2600@v3 8C
>> >>
>> >> This is overkill if only a monitor server.
>> >
>> > Maybe with newer releases of Ceph, but my Mons spin CPU pretty high (100% 
>> > core, which means it doesn't scale that well with cores), and when 
>> > adding/removing OSDs or shuffling data some of the peering issues I've 
>> > seen were caused by lagging Mons.
>>
>> If I remember right, you have a fairly large cluster. This is a pretty small 
>> cluster, so probably OK with less CPU. Are you running Dumpling? I haven't 
>> seen many issues with Hammer.
>>
>> >
>> >>
>> >>>
>> >>>   256GB RAM
>> >>>
>> >>>
>> >>> I don't think you need that much memory, 64GB should be plenty (if that's
>> >>> the only role for the servers).
>> >>
>> >>
>> >> If it is only monitor, you can get by with even less.
>> >>
>> >>>
>> >>>   1xIB FDR ADPT-DP (two ports for PUB network)
>> >>>   1xGB ADPT-DP
>> >>>
>> >>>   Disk Layout:
>> >>>
>> >>>   SOFT-RAID:
>> >>>   SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>> >>>   SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>> >>>
>> >>>
>> >>> I 100% recommend going with SSDs for the /var/lib/ceph/mon storage, fast
>> >>> ones (but they can be fairly small). Should be the same grade as journal
>> >>> drives IMO.
>> >>> NOT S3500!
>> >>> I can recommend S3610 (just got some :)), Samsung 845 DC PRO. At least 1
>> >>> DWPD rating, better go with 3 DWPD.
>> >>
>> >> S3500 should be just fine here. I get 25% better performance on the
>> >> S3500 vs the S3700 doing sync direct writes. Write endurance should be
>> >> just fine as the volume of data is not going to be that great. Unless
>> >> there is something else I'm not aware of.
>> >>
>> >
>> > S3500 is faster than S3700? I can compare 3700 x 3510 x 3610 tomorrow but 
>> > I'd be very surprised if the S3500 had a _sustained_ throughput better 
>> > than 36xx or 37xx. Were you comparing that on the same HBA and in the same 
>> > way? (No offense, just curious)
>>
>> None taken. I used the same box and swapped out the drives. The only 
>> difference was the S3500 has been heavily used, the 3700 was fresh from the 
>> package (if anything that should have helped the S3700).
>>
>> for i in {1..8}; do
>>   fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k \
>>     --numjobs=$i --iodepth=1 --runtime=60 --time_based \
>>     --group_reporting --name=journal-test
>> done
>>
>> # jobs  IOPs   Bandwidth (KB/s)
>>
>> Intel S3500 (SSDSC2BB240G4), spec'd max 4K random write: 7,500 IOPS
>> 1       5,617  22,468.0
>> 2       8,326  33,305.0
>> 3      11,575  46,301.0
>> 4      13,882  55,529.0
>> 5      16,254  65,020.0
>> 6      17,890  71,562.0
>> 7      19,438  77,752.0
>> 8      20,894  83,576.0
>>
>> Intel S3700 (SSDSC2BA200G3), spec'd max 4K random write: 32,000 IOPS
>>  1      4,417  17,670.0
>>  2      5,544  22,178.0
>>  3      7,337  29,352.0
>>  4      9,243  36,975.0
>>  5     11,189  44,759.0
>>  6     13,218  52,874.0
>>  7     14,801  59,207.0
>>  8     16,604  66,419.0
>>  9     17,671  70,685.0
>> 10     18,715  74,861.0
>> 11     20,079  80,318.0
>> 12     20,832  83,330.0
>> 13     20,571  82,288.0
>> 14     23,033  92,135.0
>> 15     22,169  88,679.0
>> 16     22,875  91,502.0
>>
>> >
>> > Mons can use some space, I've experienced logging havoc, leveldb bloating 
>> > havoc  (I have to compact manually or it just grows and grows), and my 
>> > Mons write quite a lot at times. I guesstimate my mons can write 200GB a 
>> > day, often less but often more. Maybe that's not normal. I can confirm 
>> > those numbers tomorrow.
>>
>> True, I haven't had the compact issues so I can't comment on that. He has a 
>> small cluster so I don't think he will get to the level you have.
>>
>> >
>> >>>
>> >>>
>> >>> 8x OSD Servers:
>> >>>   2x Intel Xeon E5-2600@v3 10C
>> >>>
>> >>>
>> >>> Go for the fastest you can afford if you need the latency - even at the
>> >>> expense of cores.
>> >>> Go for cores if you want bigger throughput.
>> >>
>> >> I'm in the middle of my testing, but it seems that with lots of I/O
>> >> depth (either from a single client or multiple clients) that clock
>> >> speed does not have as much of an impact as core count does. Once I'm
>> >> done, I'll be posting my results. Unless you have a single client that
>> >> has a QD=1, go for cores at this point.
>> >
>> > NoSQL is basically still a database, and while NoSQL is mostly more modern
>> > stuff built for clouds and horizontal scaling, you still need some baseline
>> > performance to achieve good durability/replication and so on.
>> >
>> >>
>> >>>
>> >>>   256GB RAM
>> >>>
>> >>>
>> >>> Again - I think too much if that's the only role for those nodes, 64GB
>> >>> should be plenty.
>> >>
>> >> Agree, if you can afford more RAM, it just means more page cache.
>> >
>> > But too much  page cache = bad.
>>
>> I think /proc/sys/vm/min_free_kbytes helps here.
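>>
>> Something along these lines (the value here is only an example, tune it to
>> the node's RAM):
>>
>>         sysctl -w vm.min_free_kbytes=262144
>>         # or persist it in /etc/sysctl.conf:
>>         # vm.min_free_kbytes = 262144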
>>
>> >
>> >>
>> >>>
>> >>>
>> >>>   1xIB FDR ADPT-DP (one port for PUB and one for CLUS network)
>> >>>   1xGB ADPT-DP
>> >>>
>> >>>   Disk Layout:
>> >>>
>> >>>   SOFT-RAID:
>> >>>   SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>> >>>   SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>> >>>
>> >>>   JBOD:
>> >>>   SCSI9 (0,0,0) (sdd) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
>> >>>   SCSI9 (0,1,0) (sde) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
>> >>>   SCSI9 (0,2,0) (sdf) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
>> >>>
>> >>>
>> >>> No no no. Those SSDs will die a horrible death, too little endurance.
>> >>> Better go with 2x 3700 in RAID1 and partition them for journals. Or just
>> >>> don't use journaling drives and buy better SSDs for storage.
>> >>
>> >> If he is only using these for journals, he can be just fine. He can
>> >> get the same endurance as the S3700 by only using a portion of the
>> >> drive space. [1][2]
>> >
>> > True for the 120GB drives. You only really need something like 1-10GB at 
>> > most.
>> > I'd still get a smaller higher-class drive and just not touch 
>> > provisioning, if only for the sake of warranty. But I think it's easier to 
>> > just skip dedicated journal drives in this case.
>>
>> I think I remember someone saying that journals on separate SSDs gave them
>> better performance than journals co-located on the data SSDs, but I don't
>> remember the details. If warranty replacement is your primary concern, then go
>> with the 3700. If they already have the 3500, they can get it to perform/endure
>> like the 3700 with the only cost being disk space.
>>
>> >
>> > NoSQL is very write intensive - depending on the implementation (applications)
>> > of course. But it's not unusual to have 300MB of semi-structured data and
>> > 100GB indexes that are rebuilt all the time (of course that indicates the
>> > developers were just lazy/stupid, which is exactly why NoSQL is so popular
>> > and Agile :)).
>>
>> Understandable. Our cluster is primarily write because reads are being 
>> served out of all the layers of cache. Overprovisioned 3500s will work just 
>> as well as the 3700.
>>
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>