On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer <ch...@gol.com> wrote:
> Hello,
>
> On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
>
> > Question:
> > I'm curious if there is anybody else out there running CephFS at the
> > scale I'm planning for. I'd like to know some of the issues you didn't
> > expect that I should be looking out for. I'd also like to simply see
> > when CephFS hasn't worked out and why. Basically, give me your war
> > stories.
>
> Not me, but diligently search the archives; there are people with large
> CephFS deployments (despite the non-production status when they did
> them). Also look at the current horror-story thread about what happens
> when you have huge directories.
>
> > Problem Details:
> > Now that I'm out of my design phase and finished testing on VMs, I'm
> > ready to drop $100k on a pilot. I'd like to get some sense of
> > confidence from the community that this is going to work before I
> > pull the trigger.
> >
> > I'm planning to replace my 110-disk, 300TB (usable) Oracle ZFS 7320
> > with CephFS by this time next year (hopefully by December). My
> > workload is a mix of small and very large files (100GB+ in size). We
> > do fMRI analysis on DICOM image sets as well as other physio data
> > collected from subjects. We also have plenty of spreadsheets,
> > scripts, etc. Currently 90% of our analysis is I/O bound and
> > generally sequential.
>
> There are other people here doing similar things (medical institutes,
> universities); again, search the archives and maybe contact them
> directly.
>
> > In deploying Ceph, I am hoping to see more throughput than the 7320
> > can currently provide. I'm also looking to get away from traditional
> > file-systems that require forklift upgrades. That's where Ceph really
> > shines for us.
> >
> > I don't have a total file count, but I do know that we have about
> > 500k directories.
> >
> >
> > Planned Architecture:
>
> Well, we talked about this 2 months ago; you seem to have changed only
> a few things.
> So let's dissect this again...
>
> > Storage Interconnect:
> > Brocade VDX 6940 (40 gig)
>
> Is this a flat (single) network for all the storage nodes?
> And then from these 40Gb/s switches links to the access switches?

This will start as a single 40Gb/s switch with a single link to each node
(upgraded in the future to dual-switch + dual-link). The 40Gb/s switch
will also be connected to several 10Gb/s and 1Gb/s access switches with
dual 40Gb/s uplinks. We do intend to segment the public and private
networks using VLANs, untagged at the node (rough ceph.conf sketch in the
P.S. below). There are obviously many subnets on our network; the 40Gb/s
switch will handle routing for those. See the list discussion in "Public
and Private network over 1 interface" (May 23, 2016) regarding some of
this.

> > Access Switches for clients (servers):
> > Brocade VDX 6740 (10 gig)
> >
> > Access Switches for clients (workstations):
> > Brocade ICX 7450
> >
> > 3x MON:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
>
> Total overkill in the CPU core arena; fewer but faster cores would be
> better suited for this task.
> A 6-8 core, 2.8-3GHz base speed would be nice; alas, Intel has nothing
> like that, the closest being the E5-2643v4.
>
> Same for RAM, MON processes are pretty frugal.
>
> No need for NVMes for the leveldb; use 2x 400GB DC S3710 for the OS
> (and thus the leveldb) and that's being overly generous in the
> speed/IOPS department.
> Note also that 40Gb/s isn't really needed here; alas, latency and KISS
> do speak in favor of it, especially if you can afford it.

Noted.

> > 2x MDS:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB (is this necessary?)
>
> No, there isn't any persistent data with MDS, unlike what I assumed as
> well before reading up on it and trying it out for the first time.

That's what I thought. For some reason, my VAR keeps throwing these on
the config.

> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
>
> Dedicated MONs/MDS are often a waste; they are suggested to keep people
> who don't know what they're doing from overloading things.
>
> So in your case, I'd (again) suggest 3 mixed MON/MDS nodes; make the
> first one a dedicated MON and give it the lowest IP.
> HW specs as discussed above; make sure to use DIMMs that allow you to
> upgrade to 256GB RAM, as MDS can grow larger than the other Ceph
> daemons (from my limited experience with CephFS).
> So:
>
> 128GB RAM (expandable to 256GB or more)
> 2x E5-2643v4
> 2x 400GB DC S3710
> 1x Dual Port 40Gb Ethernet
>
> > 8x OSD:
> > 128GB RAM
>
> Use your savings above to make that 256GB for great performance
> improvements, as hot objects stay in memory and so will all dir-entries
> (in SLAB).

I like this idea.

> > 2x 200GB SSD for OS
>
> Overkill, really. Other than the normally rather terse OSD logs,
> nothing much will ever be written to them. So 3510s or at most 3610s.
>
> > 2x 400GB P3700 for Journals
>
> As discussed 2 months ago, this limits you to writes at half (or a
> quarter, depending on your design and whether you do LACP/vLAG) of what
> your network is capable of.
> OTOH, I wouldn't expect your 24 HDDs to do much better than 2GB/s
> either (at least with filestore, and bluestore is a year away at best).
> So good enough, especially if you're read-heavy.

Yeah, the thought is that we're going to be close to equilibrium
(back-of-envelope numbers in the P.S. below). It's not too big a deal to
add an extra card, so my plan was to expand to 3 if necessary after our
pilot project.

> > 24x 6TB Enterprise SATA
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
>
> Regards,
>
> Christian

As always, I appreciate your comments and time. I'm looking forward to
joining you and the rest of the community in operating a great Ceph
environment.

> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
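P.S. A few sketches for the archives, since they're easier to show than
to describe. All of this is back-of-the-envelope and worth re-checking
against current docs before anyone copies it.

For the VLAN split mentioned above, my understanding is that the
relevant ceph.conf bits would look something like the following; the
subnets are made up purely for illustration:

    [global]
    # client/MON traffic on one VLAN, OSD replication traffic on the
    # other, both riding the same dual-port 40Gb NIC per the "Public
    # and Private network over 1 interface" thread
    public network  = 10.10.10.0/24
    cluster network = 10.10.20.0/24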
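On the journal-vs-network equilibrium, the rough per-OSD-node math I'm
working from (the P3700 and HDD figures are datasheet/ballpark numbers,
not measurements):

    2x P3700 400GB journals: 2 x ~1.08 GB/s      = ~2.2 GB/s
        (with filestore, every write funnels through the journal,
         so this is the effective write ceiling)
    24x 6TB SATA HDD:        24 x ~100-150 MB/s  = ~2.4-3.6 GB/s raw,
        realistically closer to the ~2 GB/s Christian estimates
    1x 40GbE link:           ~5 GB/s (~10 GB/s with LACP/vLAG on the
        dual-port NIC)

So ~2.2 GB/s of journal against ~2 GB/s of spinning rust is close to
balanced, and a third P3700 only starts to pay off once the network is
doubled up, which is why that's a post-pilot decision for us.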
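And on the MDS RAM point: if I'm reading the docs right, the Jewel-era
knob here is mds cache size, an inode count (default 100000), so the MDS
only uses the extra RAM if we raise it. The value below is purely
illustrative and would need sizing against the memory actually in the
box:

    [mds]
    # more cached inodes = more RAM; tune against observed usage
    mds cache size = 4000000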