-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I suggest setting debug logging to 0/5 on everything. Depending on your
desire for reliability and availability, you may want to change your pool
min_size/size to 2/4 and adjust your CRUSH map to include racks. Then
instruct CRUSH to place two copies in each rack. That way, if you lose
power to a rack, you can still continue with minimal interruption.
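As a quick sanity check of the arithmetic behind 2/4, here is a minimal sketch. It is not anything Ceph itself runs, and the function name and rack names are made up for illustration. It assumes only that a placement group keeps serving I/O while at least min_size copies remain available.

```python
from __future__ import annotations


def survives_rack_loss(copies_per_rack: dict[str, int], min_size: int) -> dict[str, bool]:
    """For each rack, report whether I/O can continue if that rack goes down.

    Assumption: a placement group stays active only while at least
    `min_size` of its copies are still available.
    """
    total = sum(copies_per_rack.values())
    return {rack: total - lost >= min_size for rack, lost in copies_per_rack.items()}


# size=4 with two copies per rack and min_size=2 (the layout suggested above)
print(survives_rack_loss({"rack1": 2, "rack2": 2}, min_size=2))

# size=3 over two racks necessarily puts two copies in one rack, one in the other
print(survives_rack_loss({"rack1": 2, "rack2": 1}, min_size=2))
```

With two copies in each rack, losing either rack still leaves two copies, so min_size=2 is met. With size=3 split 2+1, losing the rack that holds two copies leaves only one, and I/O would block until that rack comes back.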
You would want a rule similar to this:

rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 2 type rack
    step chooseleaf firstn 2 type host
    step emit
}

I would also set:

mon osd down out subtree limit = host

so that if you lose power in a rack it won't try to recover. If you only
have two racks, this is not an issue. If you move to three racks, then you
can adjust the min_size/size to 2/3 and adjust the rule to:

rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}

Other than that, the defaults are pretty good.

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV32vaCRDmVDuy+mK58QAAt3EP/0VPChXtbijtIXZmItuG
H+e4moCAfsu5dLAfpdorZOEivjh2xVdni9XlHlBE8Qm7UmfpyycP1SUST8bd
3BcI2xC0xlV0xJShJcoL5+vXyVZYPhrSKdooCuo5coYhRZOtSqg86uVojpHA
8hy0eLVd8qXKjvqvQJBIDZXQP41Ct6UoejT+sP7JuepH9SWb+0c61+TpOCQm
BSTraapfyqNxo5y40FI7pM7E0EZw1H3Ag8Ie1HiQ3NfbkVQ4N4KMmRGzsCzl
QpZB/gAkUmdpJptRUzo2habaLzl0szuaXiP/JnFE8Vu5H2GnrsFelHfOnQQx
hrEhqfVXtZ7oCQLYy0N+KpgfAf9b7+2kA9Tm8Ztx+nw8YOgAPrWheFUj9Jjs
Ry9dK/J9toaKAXfW12EKiU+qNKOgHYKEn+FSR+y+y7UJSbexhmeUhPy5S4Jt
he1KJMUe7BnGRuFM/94vCCApAgqoHiatpFeKY7cEd6x0V3YOA+j8MDbr5YWJ
PCWXWyFpClyp9h9LW0uqlwE3LtYBD0ec3d4nJmqNy5v2sszWJo4UWptRhEdi
XOwoda3DNnqoj5G7dmKkSrvXJqSRXA784gIMD0rO7JfXlahjCOsVaYQdo76v
U+bQtxGRTXTAV+1ygOL7rElXMyc4Wo6IyUkpE6dnhFPGsi0lZnOih+kM0Wmt
wt/B
=mSex
-----END PGP SIGNATURE-----
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Thu, Aug 27, 2015 at 1:42 PM, German Anders <gand...@despegar.com> wrote:
> Thanks a lot Robert and Jan for the comments about the available and
> possible disk layouts. Is there any advice from the point of view of
> configuration? any tunable parameters, crush algorithm?
>
> Thanks a lot,
>
> Best regards,
>
> *German*
>
> 2015-08-27 16:37 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:
>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> On Thu, Aug 27, 2015 at 1:13 PM, Jan Schermer wrote:
>> >
>> >> On 27 Aug 2015, at 20:57, Robert LeBlanc wrote:
>> >>
>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >> Hash: SHA256
>> >>
>> >> On Thu, Aug 27, 2015 at 10:25 AM, Jan Schermer wrote:
>> >>> Some comments inline.
>> >>> A lot of it depends on your workload, but I'd say you almost certainly
>> >>> need higher-grade SSDs. You can save money on memory.
>> >>>
>> >>> What will be the role of this cluster? VM disks? Object storage?
>> >>> Streaming?...
>> >>>
>> >>> Jan
>> >>>
>> >>> On 27 Aug 2015, at 17:56, German Anders wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> I'm planning to deploy a new Ceph cluster with IB FDR 56Gb/s and I've
>> >>> the following HW:
>> >>>
>> >>> 3x MON Servers:
>> >>> 2x Intel Xeon E5-2600@v3 8C
>> >>
>> >> This is overkill if only a monitor server.
>> >
>> > Maybe with newer releases of Ceph, but my Mons spin CPU pretty high (100%
>> > core, which means it doesn't scale that well with cores), and when
>> > adding/removing OSDs or shuffling data some of the peering issues I've
>> > seen were caused by lagging Mons.
>>
>> If I remember right, you have a fairly large cluster. This is a pretty small
>> cluster, so probably OK with less CPU. Are you running Dumpling? I haven't
>> seen many issues with Hammer.
>>
>> >
>> >>
>> >>>
>> >>> 256GB RAM
>> >>>
>> >>> I don't think you need that much memory, 64GB should be plenty (if that's
>> >>> the only role for the servers).
>> >>
>> >> If it is only monitor, you can get by with even less.
>> >>
>> >>>
>> >>> 1xIB FDR ADPT-DP (two ports for PUB network)
>> >>> 1xGB ADPT-DP
>> >>>
>> >>> Disk Layout:
>> >>>
>> >>> SOFT-RAID:
>> >>> SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>> >>> SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>> >>>
>> >>> I 100% recommend going with SSDs for the /var/lib/ceph/mon storage, fast
>> >>> ones (but they can be fairly small). Should be the same grade as journal
>> >>> drives IMO.
>> >>> NOT S3500!
>> >>> I can recommend S3610 (just got some :)), Samsung 845 DC PRO. At least 1
>> >>> DWPD rating, better go with 3 DWPD.
>> >>
>> >> S3500 should be just fine here. I get 25% better performance on the
>> >> S3500 vs the S3700 doing sync direct writes. Write endurance should be
>> >> just fine as the volume of data is not going to be that great. Unless
>> >> there is something else I'm not aware of.
>> >>
>> >
>> > S3500 is faster than S3700? I can compare 3700 x 3510 x 3610 tomorrow but
>> > I'd be very surprised if the S3500 had a _sustained_ throughput better
>> > than 36xx or 37xx. Were you comparing that on the same HBA and in the same
>> > way? (No offense, just curious)
>>
>> None taken. I used the same box and swapped out the drives. The only
>> difference was the S3500 has been heavily used, the 3700 was fresh from the
>> package (if anything that should have helped the S3700).
>>
>> for i in {1..8}; do fio --filename=/dev/sda --direct=1 --sync=1 --rw=write \
>>   --bs=4k --numjobs=$i --iodepth=1 --runtime=60 --time_based --group_reporting \
>>   --name=journal-test; done
>>
>> # jobs     IOPs    Bandwidth (KB/s)
>>
>> Intel S3500 (SSDSC2BB240G4) Max 4K RW 7,500
>>      1    5,617        22,468.0
>>      2    8,326        33,305.0
>>      3   11,575        46,301.0
>>      4   13,882        55,529.0
>>      5   16,254        65,020.0
>>      6   17,890        71,562.0
>>      7   19,438        77,752.0
>>      8   20,894        83,576.0
>>
>> Intel S3700 (SSDSC2BA200G3) Max 4K RW 32,000
>>      1    4,417        17,670.0
>>      2    5,544        22,178.0
>>      3    7,337        29,352.0
>>      4    9,243        36,975.0
>>      5   11,189        44,759.0
>>      6   13,218        52,874.0
>>      7   14,801        59,207.0
>>      8   16,604        66,419.0
>>      9   17,671        70,685.0
>>     10   18,715        74,861.0
>>     11   20,079        80,318.0
>>     12   20,832        83,330.0
>>     13   20,571        82,288.0
>>     14   23,033        92,135.0
>>     15   22,169        88,679.0
>>     16   22,875        91,502.0
>>
>> >
>> > Mons can use some space, I've experienced logging havoc, leveldb bloating
>> > havoc (I have to compact manually or it just grows and grows), and my
>> > Mons write quite a lot at times. I guesstimate my mons can write 200GB a
>> > day, often less but often more. Maybe that's not normal. I can confirm
>> > those numbers tomorrow.
>>
>> True, I haven't had the compact issues so I can't comment on that. He has a
>> small cluster so I don't think he will get to the level you have.
>>
>> >
>> >>>
>> >>> 8x OSD Servers:
>> >>> 2x Intel Xeon E5-2600@v3 10C
>> >>>
>> >>> Go for the fastest you can afford if you need the latency - even at the
>> >>> expense of cores.
>> >>> Go for cores if you want bigger throughput.
>> >>
>> >> I'm in the middle of my testing, but it seems that with lots of I/O
>> >> depth (either from a single client or multiple clients) that clock
>> >> speed does not have as much of an impact as core count does. Once I'm
>> >> done, I'll be posting my results. Unless you have a single client that
>> >> has a QD=1, go for cores at this point.
>> >
>> > NoSQL is basically still a database, and while NoSQL is mostly more
>> > modern stuff which is built for clouds and horizontal scaling, you still
>> > need some baseline performance to achieve a good durability/replication
>> > and stuff.
>> >
>> >>
>> >>>
>> >>> 256GB RAM
>> >>>
>> >>> Again - I think too much if that's the only role for those nodes, 64GB
>> >>> should be plenty.
>> >>
>> >> Agree, if you can afford more RAM, it just means more page cache.
>> >
>> > But too much page cache = bad.
>>
>> I think /proc/sys/vm/min_free_kbytes helps.
>>
>> >
>> >>
>> >>>
>> >>> 1xIB FDR ADPT-DP (one port for PUB and one for CLUS network)
>> >>> 1xGB ADPT-DP
>> >>>
>> >>> Disk Layout:
>> >>>
>> >>> SOFT-RAID:
>> >>> SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>> >>> SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
>> >>>
>> >>> JBOD:
>> >>> SCSI9 (0,0,0) (sdd) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
>> >>> SCSI9 (0,1,0) (sde) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
>> >>> SCSI9 (0,2,0) (sdf) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
>> >>>
>> >>> No no no. Those SSDs will die a horrible death, too little endurance.
>> >>> Better go with 2x 3700 in RAID1 and partition them for journals. Or just
>> >>> don't use journaling drives and buy better SSDs for storage.
>> >>
>> >> If he is only using these for journals, he can be just fine. He can
>> >> get the same endurance as the S3700 by only using a portion of the
>> >> drive space. [1][2]
>> >
>> > True for the 120GB drives. You only really need something like 1-10GB at
>> > most.
>> > I'd still get a smaller higher-class drive and just not touch
>> > provisioning, if only for the sake of warranty. But I think it's easier to
>> > just skip dedicated journal drives in this case.
>>
>> I think I remember someone saying that journals on separate SSDs gave them
>> better performance than journals co-located on the SSD, I don't remember
>> though. If warranty replacement is your primary concern, then go with the
>> 3700. If they already have the 3500, they can get it to perform/endure like
>> the 3700, with the only cost being disk space.
>>
>> >
>> > NoSQL is very write intensive - depending on implementation (applications)
>> > of course. But it's not unusual to have 300MB of semi-structured data and
>> > 100GB indexes that are rebuilt all the time (of course that indicates the
>> > developers were just lazystupid, which is exactly why NoSQL is so popular
>> > and Agile :)).
>>
>> Understandable. Our cluster is primarily write because reads are being
>> served out of all the layers of cache. Overprovisioned 3500s will work just
>> as well as the 3700.
>>
>> - ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>> -----BEGIN PGP SIGNATURE-----
>> Version: Mailvelope v1.0.2
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJV32bnCRDmVDuy+mK58QAA0e4P/3jclEcvCRWgOYwUz0bo
>> scf42NOhyNp3bPt4sUMN5h1aptX1s9TtUQxaq9yficjHhIb9ZBt1/SPxzDpf
>> cbWBMgjKgEPHhN7AAGK6HwlQ+zrB8znRPabv81JO9heIwrcOY7LLJTl8kpij
>> 0ktU7oRBn4xTDINTugZnq+YaBL+8N1/5g65lev6nnMs9ngTh4DSmjYuDjxFH
>> Y8YuToImBQtuUQiL4feNN+lA+fPy3k0iYaTS2XvO7yX+w84ElDjUHvjZxOTt
>> kZE5/YMKz7sImhhvLmvRRpqpEbJVPDl6JqhbyMTwpH4fkebrEGY/EbVYV+bT
>> m3Hq6iMIs2NleExShOwdUK0r0cw1MnWPThdEtOAHefefDcsWPZoQpvPiuqwJ
>> MdFxGP1LnX7yx1vYAt89nRhUsBQUvCcparcjjbM4aIe/6Q39Orkqb4sMuygf
>> VyxFRwULDPwnl6xMn/oVIAXycXOMs3dWM12t6UGfe4kmSGEoShzkwimgJcvC
>> lQnrp8u6jFYz6lflMMOQRauJSA4vDAU63JJMb7MLDqI6zy7MqXjnA9kyS1PP
>> Px7mgxLINQ/KG4ymGtlRNKfZVF29fe+CGYZEwrVFsRGAIJsfG9TZj3IhdO1r
>> /9gkXHvvE6NMPQWWNwxnvnFseqdNDbCZl3DFy9fciCgofznNo2sQumY8eG9P
>> k5jF
>> =HkOn
>> -----END PGP SIGNATURE-----
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com