This is an interesting idea that I hadn't yet considered testing.

My test cluster is also showing roughly 2KB of metadata per object.

It looks like our hardware purchase for a half-sized pilot is getting
approved, and I don't really want to modify it when we're this close to
moving forward. So, using spare NVMe capacity is certainly an option, but
increasing my OS disk size or replacing OSDs is pretty much a no-go for
this iteration of the cluster.

My one concern with using the spare NVMe capacity is the potential impact
on journal performance, which is already cutting it close with each NVMe
supporting 12 journals. It seems to me it would be better to replace 2 HDD
OSDs with 2 SSD OSDs and put the metadata pool on those dedicated SSDs.
Even if testing goes well on the NVMe-based pool, dedicated SSDs seem like
a safer play and may be what I implement when we buy our second round of
hardware to finish out the cluster and go live (January-March 2017).
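
For example, a rough back-of-the-envelope check of that journal concern
(the throughput figures below are placeholder assumptions, not measured
numbers):

    # Rough check: 12 filestore journals sharing one NVMe device.
    hdd_write_mb_s = 150     # assumed sustained sequential write per HDD
    journals_per_nvme = 12   # from the current design (24 HDDs, 2 NVMe/node)
    nvme_write_mb_s = 1000   # assumed sequential write of one 400GB P3700

    worst_case = hdd_write_mb_s * journals_per_nvme   # 1800 MB/s
    print(f"worst-case journal demand: {worst_case} MB/s "
          f"vs ~{nvme_write_mb_s} MB/s of NVMe write bandwidth")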



On Mon, Jun 6, 2016 at 12:02 PM, David <dclistsli...@gmail.com> wrote:

>
>
> On Mon, Jun 6, 2016 at 7:06 AM, Christian Balzer <ch...@gol.com> wrote:
>
>>
>> Hello,
>>
>> On Fri, 3 Jun 2016 15:43:11 +0100 David wrote:
>>
>> > I'm hoping to implement CephFS in production at some point this year, so
>> > I'd be interested to hear about your progress on this.
>> >
>> > Have you considered SSD for your metadata pool? You wouldn't need loads
>> > of capacity, although even with reliable SSDs I'd probably still do 3x
>> > replication for metadata. I've been looking at the Intel S3610s for
>> > this.
>> >
>> That's an interesting and potentially quite beneficial thought, but it
>> depends on a number of things (more below).
>>
>> I'm using S3610s (800GB) for a cache pool with 2x replication and am quite
>> happy with that, but then again I have a very predictable usage pattern,
>> am monitoring those SSDs religiously, and am sure they will outlive
>> things by a huge margin.
>>
>> We didn't go for 3x replication due to (in order):
>> a) cost
>> b) rack space
>> c) increased performance with 2x
>
>
> I'd also be happy with 2x replication for data pools, and that's probably
> what I'll do for the reasons you've given. I plan on using File Layouts to
> map some dirs to the SSD pool; I'm testing this at the moment and it's an
> awesome feature. I'm just very paranoid about the metadata, and given the
> relatively low capacity requirement I'd stick with 3x replication, even
> though, as you say, that means a performance hit.
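>
> A minimal sketch of that file layout mapping (assuming the filesystem
> is mounted at /mnt/cephfs and an SSD-backed data pool, here called
> "cephfs_ssd", has already been added to the filesystem; both names are
> placeholders):
>
>     import os
>
>     # Direct new files created under this directory to the SSD-backed
>     # data pool by setting the CephFS layout virtual xattr.
>     os.setxattr("/mnt/cephfs/hot-dirs",
>                 "ceph.dir.layout.pool", b"cephfs_ssd")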
>
>
>>
>> Now for how useful/helpful a fast meta-data pool would be, I reckon it
>> depends on a number of things:
>>
>> a) Is the cluster write or read heavy?
>> b) Do reads, flocks, anything that is not directly considered a read
>>    cause writes to the meta-data pool?
>> c) Anything else that might cause write storms to the meta-data pool, like
>>    the bit discussed in the current NFS over CephFS thread with sync?
>>
>> A quick glance at my test cluster seems to indicate that CephFS meta data
>> per filesystem object is about 2KB; somebody with actual clues, please
>> confirm this.
>>
>
> 2K per object appears to be the case on my test cluster too.
>
>
>> Brady has a large amount of NVMe space left over in his current design:
>> assuming 10GB journals, about 2.8TB of raw space.
>> So if running the (verified) numbers indicates that the meta data can fit
>> in this space, I'd put it there.
>>
>> Otherwise larger SSDs (indeed S3610s) for OS and meta-data pool storage
>> may be the way forward.
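>>
>> For what it's worth, a quick way of running those numbers (the object
>> count is a placeholder; the ~2KB per object and 3x replication figures
>> are the ones discussed in this thread):
>>
>>     objects = 100_000_000        # placeholder: total files + directories
>>     bytes_per_object = 2 * 1024  # ~2KB of metadata per filesystem object
>>     replicas = 3                 # replication level for the metadata pool
>>
>>     need_tb = objects * bytes_per_object * replicas / 10**12  # decimal TB
>>     print(f"metadata pool raw need: ~{need_tb:.2f} TB "
>>           f"vs ~2.8 TB of spare NVMe")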
>>
>> Regards,
>>
>> Christian
>> >
>> >
>> > On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz <bde...@gmail.com> wrote:
>> >
>> > > Question:
>> > > I'm curious if there is anybody else out there running CephFS at the
>> > > scale I'm planning for. I'd like to know some of the issues you didn't
>> > > expect that I should be looking out for. I'd also like to simply see
>> > > when CephFS hasn't worked out and why. Basically, give me your war
>> > > stories.
>> > >
>> > >
>> > > Problem Details:
>> > > Now that I'm out of my design phase and have finished testing on VMs,
>> > > I'm ready to drop $100k on a pilot. I'd like to get some sense of
>> > > confidence from the community that this is going to work before I pull
>> > > the trigger.
>> > >
>> > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
>> > > CephFS by this time next year (hopefully by December). My workload is
>> > > a mix of small and very large files (100GB+ in size). We do fMRI
>> > > analysis on DICOM image sets as well as other physio data collected
>> > > from subjects. We also have plenty of spreadsheets, scripts, etc.
>> > > Currently 90% of our analysis is I/O bound and generally sequential.
>> > >
>> > > In deploying Ceph, I am hoping to see more throughput than the 7320 can
>> > > currently provide. I'm also looking to get away from traditional
>> > > file-systems that require forklift upgrades. That's where Ceph really
>> > > shines for us.
>> > >
>> > > I don't have a total file count, but I do know that we have about 500k
>> > > directories.
>> > >
>> > >
>> > > Planned Architecture:
>> > >
>> > > Storage Interconnect:
>> > > Brocade VDX 6940 (40 gig)
>> > >
>> > > Access Switches for clients (servers):
>> > > Brocade VDX 6740 (10 gig)
>> > >
>> > > Access Switches for clients (workstations):
>> > > Brocade ICX 7450
>> > >
>> > > 3x MON:
>> > > 128GB RAM
>> > > 2x 200GB SSD for OS
>> > > 2x 400GB P3700 for LevelDB
>> > > 2x E5-2660v4
>> > > 1x Dual Port 40Gb Ethernet
>> > >
>> > > 2x MDS:
>> > > 128GB RAM
>> > > 2x 200GB SSD for OS
>> > > 2x 400GB P3700 for LevelDB (is this necessary?)
>> > > 2x E5-2660v4
>> > > 1x Dual Port 40Gb Ethernet
>> > >
>> > > 8x OSD:
>> > > 128GB RAM
>> > > 2x 200GB SSD for OS
>> > > 2x 400GB P3700 for Journals
>> > > 24x 6TB Enterprise SATA
>> > > 2x E5-2660v4
>> > > 1x Dual Port 40Gb Ethernet
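>> > >
>> > > As a rough sanity check on raw capacity (simple arithmetic from the
>> > > figures above; usable space depends on the replication factor chosen):
>> > >
>> > >     osd_nodes = 8
>> > >     hdds_per_node = 24
>> > >     hdd_tb = 6
>> > >
>> > >     raw_tb = osd_nodes * hdds_per_node * hdd_tb   # 1152 TB raw
>> > >     print(f"{raw_tb / 3:.0f} TB usable at 3x replication, "
>> > >           f"{raw_tb / 2:.0f} TB at 2x")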
>> > >
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
