On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer <ch...@gol.com> wrote:
> Hello,
>
> On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
>
> > Question:
> > I'm curious if there is anybody else out there running CephFS at the
> > scale I'm planning for. I'd like to know some of the issues you didn't
> > expect that I should be looking out for. I'd also like to simply see
> > when CephFS hasn't worked out and why. Basically, give me your war
> > stories.
>
> Not me, but diligently search the archives; there are people with large
> CephFS deployments (despite the non-production status when they did
> them). Also look at the current horror-story thread about what happens
> when you have huge directories.
>
> > Problem Details:
> > Now that I'm out of my design phase and finished testing on VMs, I'm
> > ready to drop $100k on a pilot. I'd like to get some sense of
> > confidence from the community that this is going to work before I
> > pull the trigger.
> >
> > I'm planning to replace my 110-disk, 300TB (usable) Oracle ZFS 7320
> > with CephFS by this time next year (hopefully by December). My
> > workload is a mix of small and very large files (100GB+ in size). We
> > do fMRI analysis on DICOM image sets as well as other physio data
> > collected from subjects. We also have plenty of spreadsheets,
> > scripts, etc. Currently 90% of our analysis is I/O bound and
> > generally sequential.
>
> There are other people here doing similar things (medical institutes,
> universities); again, search the archives and maybe contact them
> directly.
>
> > In deploying Ceph, I am hoping to see more throughput than the 7320
> > can currently provide. I'm also looking to get away from traditional
> > file-systems that require forklift upgrades. That's where Ceph really
> > shines for us.
> >
> > I don't have a total file count, but I do know that we have about
> > 500k directories.
> >
> >
> > Planned Architecture:
>
> Well, we talked about this 2 months ago; you seem to have changed only
> a few things.
> So let's dissect this again...
>
> > Storage Interconnect:
> > Brocade VDX 6940 (40 gig)
>
> Is this a flat (single) network for all the storage nodes?
> And then from these 40Gb/s switches links to the access switches?

This will start as a single 40Gb/s switch with a single link to each node
(upgraded in the future to dual-switch + dual-link). The 40Gb/s switch
will also be connected to several 10Gb/s and 1Gb/s access switches with
dual 40Gb/s uplinks. We do intend to segment the public and private
networks using VLANs, untagged at the node (rough ceph.conf sketch in the
P.S. below). There are obviously many subnets on our network; the 40Gb/s
switch will handle routing for those. See the list discussion in "Public
and Private network over 1 interface" (May 23, 2016) regarding some of
this.

> > Access Switches for clients (servers):
> > Brocade VDX 6740 (10 gig)
> >
> > Access Switches for clients (workstations):
> > Brocade ICX 7450
> >
> > 3x MON:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
>
> Total overkill in the CPU core arena; fewer but faster cores would be
> better suited for this task.
> A 6-8 core, 2.8-3GHz base speed would be nice; alas, Intel has nothing
> like that, the closest being the E5-2643v4.
>
> Same for RAM, MON processes are pretty frugal.
>
> No need for NVMes for the leveldb; use 2x 400GB DC S3710 for the OS
> (and thus the leveldb) and that's being overly generous in the
> speed/IOPS department.
> Note also that 40Gb/s isn't really needed here; alas, latency and KISS
> do speak in favor of it, especially if you can afford it.

Noted.

> > 2x MDS:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB (is this necessary?)
>
> No, there isn't any persistent data with MDS, unlike what I assumed as
> well before reading up on it and trying it out for the first time.

That's what I thought. For some reason, my VAR keeps throwing these on
the config.

> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
>
> Dedicated MONs/MDS are often a waste; they are suggested to keep people
> who don't know what they're doing from overloading things.
>
> So in your case, I'd (again) suggest 3 mixed MON/MDS nodes; make the
> first one a dedicated MON and give it the lowest IP.
> HW specs as discussed above; make sure to use DIMMs that allow you to
> upgrade to 256GB RAM, as MDS can grow larger than the other Ceph
> daemons (from my limited experience with CephFS).
> So:
>
> 128GB RAM (expandable to 256GB or more)
> 2x E5-2643v4
> 2x 400GB DC S3710
> 1x Dual Port 40Gb Ethernet
>
> > 8x OSD:
> > 128GB RAM
>
> Use your savings above to make that 256GB for great performance
> improvements, as hot objects stay in memory and so will all dir-entries
> (in SLAB).

I like this idea.

> > 2x 200GB SSD for OS
>
> Overkill, really. Other than the normally rather terse OSD logs,
> nothing much will ever be written to them. So 3510s or at most 3610s.
>
> > 2x 400GB P3700 for Journals
>
> As discussed 2 months ago, this limits you to writes at half (or a
> quarter, depending on your design and whether you do LACP/vLAG) of what
> your network is capable of.
> OTOH, I wouldn't expect your 24 HDDs to do much better than 2GB/s
> either (at least with filestore, and bluestore is a year away at best).
> So good enough, especially if you're read-heavy.

Yeah, the thought is that we're going to be close to equilibrium
(back-of-envelope numbers in the P.S. below). It's not too big a deal to
add an extra card, so my plan was to expand to 3 if necessary after our
pilot project.

> > 24x 6TB Enterprise SATA
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
>
> Regards,
>
> Christian

As always, I appreciate your comments and time. I'm looking forward to
joining you and the rest of the community in operating a great Ceph
environment.

> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
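P.S. A few sketches for the archives, since they're easier to show than
to describe. All of this is back-of-the-envelope and worth re-checking
against current docs before anyone copies it.

For the VLAN split mentioned above, my understanding is that the
relevant ceph.conf bits would look something like the following; the
subnets are made up purely for illustration:

    [global]
    # client/MON traffic on one VLAN, OSD replication traffic on the
    # other, both riding the same dual-port 40Gb NIC per the "Public
    # and Private network over 1 interface" thread
    public network  = 10.10.10.0/24
    cluster network = 10.10.20.0/24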
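On the journal-vs-network equilibrium, the rough per-OSD-node math I'm
working from (the P3700 and HDD figures are datasheet/ballpark numbers,
not measurements):

    2x P3700 400GB journals: 2 x ~1.08 GB/s      = ~2.2 GB/s
        (with filestore, every write funnels through the journal,
         so this is the effective write ceiling)
    24x 6TB SATA HDD:        24 x ~100-150 MB/s  = ~2.4-3.6 GB/s raw,
        realistically closer to the ~2 GB/s Christian estimates
    1x 40GbE link:           ~5 GB/s (~10 GB/s with LACP/vLAG on the
        dual-port NIC)

So ~2.2 GB/s of journal against ~2 GB/s of spinning rust is close to
balanced, and a third P3700 only starts to pay off once the network is
doubled up, which is why that's a post-pilot decision for us.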
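And on the MDS RAM point: if I'm reading the docs right, the Jewel-era
knob here is mds cache size, an inode count (default 100000), so the MDS
only uses the extra RAM if we raise it. The value below is purely
illustrative and would need sizing against the memory actually in the
box:

    [mds]
    # more cached inodes = more RAM; tune against observed usage
    mds cache size = 4000000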