Re: [ceph-users] CephFS in the wild

2016-06-06 Thread Christian Balzer

Hello,

On Mon, 6 Jun 2016 14:14:17 -0500 Brady Deetz wrote:

> This is an interesting idea that I hadn't yet considered testing.
> 
> My test cluster is also looking like 2K per object.
> 
> It looks like our hardware purchase for a one-half sized pilot is getting
> approved and I don't really want to modify it when we're this close to
> moving forward. So, using spare NVMe capacity is certainly an option, but
> increasing my OS disk size or replacing OSDs is pretty much a no go for
> this iteration of the cluster.
> 
> My single concern with the idea of using the NVMe capacity is the
> potential to affect journal performance which is already cutting it
> close with each NVMe supporting 12 journals. 

I thought you might say that. ^o^

Consider however that when your journals are busy due to massive large
writes, that also means little meta-data activity.

> It seems to me that a better approach would be to replace 2 HDD OSDs
> with 2 SSD OSDs and
> put the metadata pool on those dedicated SSDs. Even if testing goes well
> on the NVMe based pool, dedicated SSDs seem like a safer play and may be
> what I implement when we buy our second round of hardware to finish out
> the cluster and go live (January-March 2017).
> 
Again, if you can afford this, bully for you. ^_^
With dedicated SSDs, small to medium sized S3710s are probably the way
forward.

Christian 
> 
> 
> On Mon, Jun 6, 2016 at 12:02 PM, David  wrote:
> 
> >
> >
> > On Mon, Jun 6, 2016 at 7:06 AM, Christian Balzer  wrote:
> >
> >>
> >> Hello,
> >>
> >> On Fri, 3 Jun 2016 15:43:11 +0100 David wrote:
> >>
> >> > I'm hoping to implement cephfs in production at some point this
> >> > year so I'd be interested to hear your progress on this.
> >> >
> >> > Have you considered SSD for your metadata pool? You wouldn't need
> >> > loads of capacity although even with reliable SSD I'd probably
> >> > still do x3 replication for metadata. I've been looking at the
> >> > intel s3610's for this.
> >> >
> >> That's an interesting and potentially quite beneficial thought, but it
> >> depends on a number of things (more below).
> >>
> >> I'm using S3610s (800GB) for a cache pool with 2x replication and am
> >> quite happy with that, but then again I have a very predictable usage
> >> pattern and am monitoring those SSDs religiously and I'm sure they
> >> will outlive things by a huge margin.
> >>
> >> We didn't go for 3x replication due to (in order):
> >> a) cost
> >> b) rack space
> >> c) increased performance with 2x
> >
> >
> > I'd also be happy with 2x replication for data pools and that's
> > probably what I'll do for the reasons you've given. I plan on using
> > File Layouts to map some dirs to the ssd pool. I'm testing this at the
> > moment and it's an awesome feature. I'm just very paranoid with the
> > metadata and considering the relatively low capacity requirement I'd
> > stick with the 3x replication although as you say that means a
> > performance hit.
> >
> >
> >>
> >> Now for how useful/helpful a fast meta-data pool would be, I reckon it
> >> depends on a number of things:
> >>
> >> a) Is the cluster write or read heavy?
> >> b) Do reads, flocks, anything that is not directly considered a read
> >>cause writes to the meta-data pool?
> >> c) Anything else that might cause write storms to the meta-data pool,
> >> like the bit in the current NFS over CephFS thread with sync?
> >>
> >> A quick glance at my test cluster seems to indicate that CephFS meta
> >> data per filesystem object is about 2KB, somebody with actual clues
> >> please confirm this.
> >>
> >
> > 2K per object appears to be the case on my test cluster too.
> >
> >
> >> Brady has large amounts of NVMe space left over in his current design,
> >> assuming 10GB journals, about 2.8TB of raw space.
> >> So if running the (verified) numbers indicates that the meta data can
> >> fit in this space, I'd put it there.
> >>
> >> Otherwise larger SSDs (indeed S3610s) for OS and meta-data pool
> >> storage may
> >> be the way forward.
> >>
> >> Regards,
> >>
> >> Christian
> >> >
> >> >
> >> > On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz 
> >> > wrote:
> >> >
> >> > > Question:
> >> > > I'm curious if there is anybody else out there running CephFS at
> >> > > the scale I'm planning for. I'd like to know some of the issues
> >> > > you didn't expect that I should be looking out for. I'd also like
> >> > > to simply see when CephFS hasn't worked out and why. Basically,
> >> > > give me your war stories.
> >> > >
> >> > >
> >> > > Problem Details:
> >> > > Now that I'm out of my design phase and finished testing on VMs,
> >> > > I'm ready to drop $100k on a pilot. I'd like to get some sense of
> >> > > confidence from the community that this is going to work before I
> >> > > pull the trigger.
> >> > >
> >> > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320
> >> with
> >> > > CephFS by this time next year (hopefully by 

Re: [ceph-users] CephFS in the wild

2016-06-06 Thread Brady Deetz
This is an interesting idea that I hadn't yet considered testing.

My test cluster is also looking like 2K per object.

It looks like our hardware purchase for a one-half sized pilot is getting
approved and I don't really want to modify it when we're this close to
moving forward. So, using spare NVMe capacity is certainly an option, but
increasing my OS disk size or replacing OSDs is pretty much a no go for
this iteration of the cluster.

My single concern with the idea of using the NVMe capacity is the potential
to affect journal performance which is already cutting it close with each
NVMe supporting 12 journals. It seems to me that a better approach would be
to replace 2 HDD OSDs with 2 SSD OSDs and put the metadata pool on
those dedicated SSDs. Even if testing goes well on the NVMe based pool,
dedicated SSDs seem like a safer play and may be what I implement when we
buy our second round of hardware to finish out the cluster and go live
(January-March 2017).



On Mon, Jun 6, 2016 at 12:02 PM, David  wrote:

>
>
> On Mon, Jun 6, 2016 at 7:06 AM, Christian Balzer  wrote:
>
>>
>> Hello,
>>
>> On Fri, 3 Jun 2016 15:43:11 +0100 David wrote:
>>
>> > I'm hoping to implement cephfs in production at some point this year so
>> > I'd be interested to hear your progress on this.
>> >
>> > Have you considered SSD for your metadata pool? You wouldn't need loads
>> > of capacity although even with reliable SSD I'd probably still do x3
>> > replication for metadata. I've been looking at the intel s3610's for
>> > this.
>> >
>> That's an interesting and potentially quite beneficial thought, but it
>> depends on a number of things (more below).
>>
>> I'm using S3610s (800GB) for a cache pool with 2x replication and am quite
>> happy with that, but then again I have a very predictable usage pattern
>> and am monitoring those SSDs religiously and I'm sure they will outlive
>> things by a huge margin.
>>
>> We didn't go for 3x replication due to (in order):
>> a) cost
>> b) rack space
>> c) increased performance with 2x
>
>
> I'd also be happy with 2x replication for data pools and that's probably
> what I'll do for the reasons you've given. I plan on using File Layouts to
> map some dirs to the ssd pool. I'm testing this at the moment and it's an
> awesome feature. I'm just very paranoid with the metadata and considering
> the relatively low capacity requirement I'd stick with the 3x replication
> although as you say that means a performance hit.
>
>
>>
>> Now for how useful/helpful a fast meta-data pool would be, I reckon it
>> depends on a number of things:
>>
>> a) Is the cluster write or read heavy?
>> b) Do reads, flocks, anything that is not directly considered a read
>>cause writes to the meta-data pool?
>> c) Anything else that might cause write storms to the meta-data pool, like
>>    the bit in the current NFS over CephFS thread with sync?
>>
>> A quick glance at my test cluster seems to indicate that CephFS meta data
>> per filesystem object is about 2KB, somebody with actual clues please
>> confirm this.
>>
>
> 2K per object appears to be the case on my test cluster too.
>
>
>> Brady has large amounts of NVMe space left over in his current design,
>> assuming 10GB journals, about 2.8TB of raw space.
>> So if running the (verified) numbers indicates that the meta data can fit
>> in this space, I'd put it there.
>>
>> Otherwise larger SSDs (indeed S3610s) for OS and meta-data pool storage
>> may
>> be the way forward.
>>
>> Regards,
>>
>> Christian
>> >
>> >
>> > On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz  wrote:
>> >
>> > > Question:
>> > > I'm curious if there is anybody else out there running CephFS at the
>> > > scale I'm planning for. I'd like to know some of the issues you didn't
>> > > expect that I should be looking out for. I'd also like to simply see
>> > > when CephFS hasn't worked out and why. Basically, give me your war
>> > > stories.
>> > >
>> > >
>> > > Problem Details:
>> > > Now that I'm out of my design phase and finished testing on VMs, I'm
>> > > ready to drop $100k on a pilot. I'd like to get some sense of
>> > > confidence from the community that this is going to work before I pull
>> > > the trigger.
>> > >
>> > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320
>> with
>> > > CephFS by this time next year (hopefully by December). My workload is
>> > > a mix of small and very large files (100GB+ in size). We do fMRI
>> > > analysis on DICOM image sets as well as other physio data collected
>> > > from subjects. We also have plenty of spreadsheets, scripts, etc.
>> > > Currently 90% of our analysis is I/O bound and generally sequential.
>> > >
>> > > In deploying Ceph, I am hoping to see more throughput than the 7320
>> can
>> > > currently provide. I'm also looking to get away from traditional
>> > > file-systems that require forklift upgrades. That's where Ceph really
>> > > shines for us.
>> > >
>> > > I 

Re: [ceph-users] CephFS in the wild

2016-06-06 Thread David
On Mon, Jun 6, 2016 at 7:06 AM, Christian Balzer  wrote:

>
> Hello,
>
> On Fri, 3 Jun 2016 15:43:11 +0100 David wrote:
>
> > I'm hoping to implement cephfs in production at some point this year so
> > I'd be interested to hear your progress on this.
> >
> > Have you considered SSD for your metadata pool? You wouldn't need loads
> > of capacity although even with reliable SSD I'd probably still do x3
> > replication for metadata. I've been looking at the intel s3610's for
> > this.
> >
> That's an interesting and potentially quite beneficial thought, but it
> depends on a number of things (more below).
>
> I'm using S3610s (800GB) for a cache pool with 2x replication and am quite
> happy with that, but then again I have a very predictable usage pattern
> and am monitoring those SSDs religiously and I'm sure they will outlive
> things by a huge margin.
>
> We didn't go for 3x replication due to (in order):
> a) cost
> b) rack space
> c) increased performance with 2x


I'd also be happy with 2x replication for data pools and that's probably
what I'll do for the reasons you've given. I plan on using File Layouts to
map some dirs to the ssd pool. I'm testing this at the moment and it's an
awesome feature. I'm just very paranoid with the metadata and considering
the relatively low capacity requirement I'd stick with the 3x replication
although as you say that means a performance hit.
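As a minimal sketch of what mapping a directory to an SSD pool involves: CephFS file layouts are set through the `ceph.dir.layout.*` virtual xattrs on a directory, and files created beneath it then land in the named pool. The pool name and mount path below are hypothetical placeholders, not from this thread.

```python
# Sketch: build the ceph.dir.layout xattr name/value pairs that would be
# applied (via setfattr or os.setxattr on a mounted CephFS) to pin new
# files under a directory to a given data pool. Pool name "ssd-data" and
# the mount path are hypothetical examples.

def layout_xattrs(pool, stripe_unit=4194304, stripe_count=1,
                  object_size=4194304):
    """Return the xattr pairs for a CephFS directory layout."""
    return {
        "ceph.dir.layout.pool": pool,
        "ceph.dir.layout.stripe_unit": str(stripe_unit),
        "ceph.dir.layout.stripe_count": str(stripe_count),
        "ceph.dir.layout.object_size": str(object_size),
    }

# Equivalent shell on a live mount would be, e.g.:
#   setfattr -n ceph.dir.layout.pool -v ssd-data /mnt/cephfs/fast-dirs
for name, value in layout_xattrs("ssd-data").items():
    print(f"setfattr -n {name} -v {value} /mnt/cephfs/fast-dirs")
```

Only `ceph.dir.layout.pool` needs changing for the pool mapping; the striping values shown are the usual 4MB defaults.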


>
> Now for how useful/helpful a fast meta-data pool would be, I reckon it
> depends on a number of things:
>
> a) Is the cluster write or read heavy?
> b) Do reads, flocks, anything that is not directly considered a read
>cause writes to the meta-data pool?
> c) Anything else that might cause write storms to the meta-data pool, like
>    the bit in the current NFS over CephFS thread with sync?
>
> A quick glance at my test cluster seems to indicate that CephFS meta data
> per filesystem object is about 2KB, somebody with actual clues please
> confirm this.
>

2K per object appears to be the case on my test cluster too.


> Brady has large amounts of NVMe space left over in his current design,
> assuming 10GB journals, about 2.8TB of raw space.
> So if running the (verified) numbers indicates that the meta data can fit
> in this space, I'd put it there.
>
> Otherwise larger SSDs (indeed S3610s) for OS and meta-data pool storage may
> be the way forward.
>
> Regards,
>
> Christian
> >
> >
> > On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz  wrote:
> >
> > > Question:
> > > I'm curious if there is anybody else out there running CephFS at the
> > > scale I'm planning for. I'd like to know some of the issues you didn't
> > > expect that I should be looking out for. I'd also like to simply see
> > > when CephFS hasn't worked out and why. Basically, give me your war
> > > stories.
> > >
> > >
> > > Problem Details:
> > > Now that I'm out of my design phase and finished testing on VMs, I'm
> > > ready to drop $100k on a pilot. I'd like to get some sense of
> > > confidence from the community that this is going to work before I pull
> > > the trigger.
> > >
> > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> > > CephFS by this time next year (hopefully by December). My workload is
> > > a mix of small and very large files (100GB+ in size). We do fMRI
> > > analysis on DICOM image sets as well as other physio data collected
> > > from subjects. We also have plenty of spreadsheets, scripts, etc.
> > > Currently 90% of our analysis is I/O bound and generally sequential.
> > >
> > > In deploying Ceph, I am hoping to see more throughput than the 7320 can
> > > currently provide. I'm also looking to get away from traditional
> > > file-systems that require forklift upgrades. That's where Ceph really
> > > shines for us.
> > >
> > > I don't have a total file count, but I do know that we have about 500k
> > > directories.
> > >
> > >
> > > Planned Architecture:
> > >
> > > Storage Interconnect:
> > > Brocade VDX 6940 (40 gig)
> > >
> > > Access Switches for clients (servers):
> > > Brocade VDX 6740 (10 gig)
> > >
> > > Access Switches for clients (workstations):
> > > Brocade ICX 7450
> > >
> > > 3x MON:
> > > 128GB RAM
> > > 2x 200GB SSD for OS
> > > 2x 400GB P3700 for LevelDB
> > > 2x E5-2660v4
> > > 1x Dual Port 40Gb Ethernet
> > >
> > > 2x MDS:
> > > 128GB RAM
> > > 2x 200GB SSD for OS
> > > 2x 400GB P3700 for LevelDB (is this necessary?)
> > > 2x E5-2660v4
> > > 1x Dual Port 40Gb Ethernet
> > >
> > > 8x OSD:
> > > 128GB RAM
> > > 2x 200GB SSD for OS
> > > 2x 400GB P3700 for Journals
> > > 24x 6TB Enterprise SATA
> > > 2x E5-2660v4
> > > 1x Dual Port 40Gb Ethernet
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> 

Re: [ceph-users] CephFS in the wild

2016-06-06 Thread Christian Balzer

Hello,

On Fri, 3 Jun 2016 15:43:11 +0100 David wrote:

> I'm hoping to implement cephfs in production at some point this year so
> I'd be interested to hear your progress on this.
> 
> Have you considered SSD for your metadata pool? You wouldn't need loads
> of capacity although even with reliable SSD I'd probably still do x3
> replication for metadata. I've been looking at the intel s3610's for
> this.
> 
That's an interesting and potentially quite beneficial thought, but it
depends on a number of things (more below).

I'm using S3610s (800GB) for a cache pool with 2x replication and am quite
happy with that, but then again I have a very predictable usage pattern
and am monitoring those SSDs religiously and I'm sure they will outlive
things by a huge margin. 

We didn't go for 3x replication due to (in order):
a) cost
b) rack space
c) increased performance with 2x


Now for how useful/helpful a fast meta-data pool would be, I reckon it
depends on a number of things:

a) Is the cluster write or read heavy?
b) Do reads, flocks, anything that is not directly considered a read
   cause writes to the meta-data pool?
c) Anything else that might cause write storms to the meta-data pool, like
   the bit in the current NFS over CephFS thread with sync?

A quick glance at my test cluster seems to indicate that CephFS meta data
per filesystem object is about 2KB, somebody with actual clues please
confirm this.

Brady has large amounts of NVMe space left over in his current design,
assuming 10GB journals, about 2.8TB of raw space.
So if running the (verified) numbers indicates that the meta data can fit
in this space, I'd put it there.
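Running those numbers can be sketched as follows. The 2KB-per-object and 2.8TB-spare figures are the estimates from this thread; the file counts are placeholders to plug your own estimate into, since only the ~500k directory count is known here.

```python
# Back-of-the-envelope check: does the CephFS metadata pool fit in the
# NVMe space left over after journals? 2KB/object and 2.8TB spare are
# the thread's estimates; file counts below are illustrative only.

KB, TB = 1024, 1024**4

META_PER_OBJECT = 2 * KB      # ~2KB of metadata per filesystem object
REPLICATION     = 3           # 3x replication planned for metadata
SPARE_NVME_RAW  = 2.8 * TB    # leftover NVMe raw space after journals

def metadata_fits(n_files):
    """Return (raw bytes needed, whether it fits in the spare NVMe)."""
    raw_needed = n_files * META_PER_OBJECT * REPLICATION
    return raw_needed, raw_needed <= SPARE_NVME_RAW

for n in (10_000_000, 100_000_000, 600_000_000):
    raw, fits = metadata_fits(n)
    print(f"{n:>12,} files -> {raw / TB:5.2f} TB raw, fits: {fits}")
```

By this estimate even ~400M files would fit with 3x replication; somewhere past ~500M the spare space runs out and dedicated SSDs become the safer play.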

Otherwise larger SSDs (indeed S3610s) for OS and meta-data pool storage may
be the way forward.

Regards,

Christian
> 
> 
> On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz  wrote:
> 
> > Question:
> > I'm curious if there is anybody else out there running CephFS at the
> > scale I'm planning for. I'd like to know some of the issues you didn't
> > expect that I should be looking out for. I'd also like to simply see
> > when CephFS hasn't worked out and why. Basically, give me your war
> > stories.
> >
> >
> > Problem Details:
> > Now that I'm out of my design phase and finished testing on VMs, I'm
> > ready to drop $100k on a pilot. I'd like to get some sense of
> > confidence from the community that this is going to work before I pull
> > the trigger.
> >
> > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> > CephFS by this time next year (hopefully by December). My workload is
> > a mix of small and very large files (100GB+ in size). We do fMRI
> > analysis on DICOM image sets as well as other physio data collected
> > from subjects. We also have plenty of spreadsheets, scripts, etc.
> > Currently 90% of our analysis is I/O bound and generally sequential.
> >
> > In deploying Ceph, I am hoping to see more throughput than the 7320 can
> > currently provide. I'm also looking to get away from traditional
> > file-systems that require forklift upgrades. That's where Ceph really
> > shines for us.
> >
> > I don't have a total file count, but I do know that we have about 500k
> > directories.
> >
> >
> > Planned Architecture:
> >
> > Storage Interconnect:
> > Brocade VDX 6940 (40 gig)
> >
> > Access Switches for clients (servers):
> > Brocade VDX 6740 (10 gig)
> >
> > Access Switches for clients (workstations):
> > Brocade ICX 7450
> >
> > 3x MON:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
> >
> > 2x MDS:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB (is this necessary?)
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
> >
> > 8x OSD:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for Journals
> > 24x 6TB Enterprise SATA
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
> >
> >
> >


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] CephFS in the wild

2016-06-05 Thread Gregory Farnum
On Wed, Jun 1, 2016 at 1:50 PM, Brady Deetz  wrote:
> Question:
> I'm curious if there is anybody else out there running CephFS at the scale
> I'm planning for. I'd like to know some of the issues you didn't expect that
> I should be looking out for. I'd also like to simply see when CephFS hasn't
> worked out and why. Basically, give me your war stories.
>
>
> Problem Details:
> Now that I'm out of my design phase and finished testing on VMs, I'm ready
> to drop $100k on a pilot. I'd like to get some sense of confidence from the
> community that this is going to work before I pull the trigger.
>
> I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> CephFS by this time next year (hopefully by December). My workload is a mix
> of small and very large files (100GB+ in size). We do fMRI analysis on DICOM
> image sets as well as other physio data collected from subjects. We also
> have plenty of spreadsheets, scripts, etc. Currently 90% of our analysis is
> I/O bound and generally sequential.
>
> In deploying Ceph, I am hoping to see more throughput than the 7320 can
> currently provide. I'm also looking to get away from traditional
> file-systems that require forklift upgrades. That's where Ceph really shines
> for us.
>
> I don't have a total file count, but I do know that we have about 500k
> directories.
>
>
> Planned Architecture:
>
> Storage Interconnect:
> Brocade VDX 6940 (40 gig)
>
> Access Switches for clients (servers):
> Brocade VDX 6740 (10 gig)
>
> Access Switches for clients (workstations):
> Brocade ICX 7450
>
> 3x MON:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for LevelDB
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet
>
> 2x MDS:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for LevelDB (is this necessary?)
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet

The MDS doesn't use any local storage, other than for storing its
ceph.conf and keyring.

>
> 8x OSD:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for Journals
> 24x 6TB Enterprise SATA
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet

I don't know what kind of throughput you're currently seeing on your
ZFS system. Unfortunately most of the big CephFS users are pretty
quiet on the lists :( although they sometimes come out to play at
events like https://www.msi.umn.edu/sc15Ceph. :)

You'll definitely want to do some tuning. Right now we default to 100k
inodes in the metadata cache for instance, which fits in <1GB of RAM.
You'll want to bump that way, way up. Also keep in mind that CephFS'
performance characteristics are just weirdly different to NAS boxes or
ZFS in ways you might not be ready for. So large streaming writes will
do great, but if you have shared RW files or directories, that might
be much faster in some places and much slower in ones you didn't think
about. Large streaming reads and writes will go as quickly as RADOS
can drive them (80-100MB/s per OSD for reads is generally a good
estimate, I think? And divide that by replication factor for writes);
with smaller ops you start running into latency issues and the fact
that CephFS (since it's sending RADOS writes to separate objects)
can't coalesce writes as much as local FSes (or boxes built on them).
-Greg
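For reference, the inode-cache default Greg mentions is a ceph.conf setting on the MDS hosts. A hedged sketch of bumping it follows; the 4M figure is purely illustrative and should be sized against the RAM actually available on the MDS nodes (in Ceph releases of this era the cache is sized in inodes via `mds cache size`).

```ini
; ceph.conf fragment for the MDS hosts -- illustrative values only.
; The MDS cache is sized in inodes via "mds cache size"
; (default 100000, i.e. the ~100k / <1GB figure mentioned above).
[mds]
mds cache size = 4000000
```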


Re: [ceph-users] CephFS in the wild

2016-06-02 Thread Christian Balzer
On Thu, 2 Jun 2016 21:13:41 -0500 Brady Deetz wrote:

> On Thu, Jun 2, 2016 at 8:58 PM, Christian Balzer  wrote:
> 
> > On Thu, 2 Jun 2016 11:11:19 -0500 Brady Deetz wrote:
> >
> > > On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer 
> > > wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
> > > >
[snip]
> > > > > Planned Architecture:
> > > > >
> > > > Well, we talked about this 2 months ago, you seem to have changed
> > > > only a few things.
> > > > So let's dissect this again...
> > > >
> > > > > Storage Interconnect:
> > > > > Brocade VDX 6940 (40 gig)
> > > > >
> > > > Is this a flat (single) network for all the storage nodes?
> > > > And then from these 40Gb/s switches links to the access switches?
> > > >
> > >
> > > This will start as a single 40Gb/s switch with a single link to each
> > > node (upgraded in the future to dual-switch + dual-link). The 40Gb/s
> > > switch will also be connected to several 10Gb/s and 1Gb/s access
> > > switches with dual 40Gb/s uplinks.
> > >
> > So initially 80Gb/s and with the 2nd switch probably 160Gb/s for your
> > clients.
> > Network wise, your 8 storage servers outstrip that, actual storage
> > bandwidth and IOPS wise, you're looking at 8x2GB/s aka 160Gb/s best
> > case writes, so a match.
> >
> > > We do intend to segment the public and private networks using VLANs
> > > untagged at the node. There are obviously many subnets on our
> > > network. The 40Gb/s switch will handle routing for those networks.
> > >
> > > You can see list discussion in "Public and Private network over 1
> > > interface" May 23,2016 regarding some of this.
> > >
> > And I did comment in that thread, the final one actually. ^o^
> >
> > Unless you can come up with a _very_ good reason not covered in that
> > thread, I'd keep it to one network.
> >
> > Once the 2nd switch is in place and running vLAG (LACP on your servers)
> > your network bandwidth per host VASTLY exceeds that of your storage.
> >
> >
> My theory is that with a single switch, I can QoS traffic for the private
> network in case of the situation where we do see massive client I/O at
> the same time that a re-weight or something like that was happening.
> But... I think you're right. KISS
> 
Let's run with this example:

1. You just lost an NVMe (cosmic rays, dontcha know) and 12 of your OSDs
are toast.

2. Ceph does its thing and kicks off all that recovery and backfill magic
(terror would be a better term).

3. Your clients at this point also would like to read (just read for
simplicity, R/W would make it worse of course) at the max speed of your
initial network layout, that is 8GB/s 

4. As stated your nodes can't write more than 2GB/s, which in turn also
means that recovery/backfill traffic from another node (reads) can't
exceed this value. (Wrongly assuming equal distribution of activity
per node, but this will be correct cluster wide)
Leaving 2GB/s per node (or 16GB/s total) of read bandwidth.
So from a network perspective you should have no need for QoS at all, ever.

This of course leaves out the pertinent detail that all this activity will
result in severely degraded performance due to the thrashing HDDs with
default parameters.
So your clients will be hobbled by your storage, not your network.

And if you tuned down things so that recovery/backfill have the least
possible impact on your client I/O, that in turn also means vastly
reduced network needs.
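The arithmetic in that example can be written out explicitly. One detail is inferred rather than stated: "leaving 2GB/s per node" implies roughly 4GB/s of aggregate read capacity per node, which is an assumption here, not a measured number.

```python
# Worked version of the recovery-vs-client bandwidth argument, using the
# thread's figures. The 4 GB/s aggregate read capacity per node is an
# inference from "leaving 2GB/s per node", not a stated number.

NODES             = 8
WRITE_BW_PER_NODE = 2.0   # GB/s: stated write ceiling per OSD node
READ_BW_PER_NODE  = 4.0   # GB/s: inferred aggregate read capacity per node
CLIENT_DEMAND     = 8.0   # GB/s: max of the initial client-facing network

# Recovery writes are capped by each node's write bandwidth, which in
# turn caps the recovery reads it pulls from its peers:
recovery_reads_per_node = WRITE_BW_PER_NODE               # 2 GB/s

# Whatever read bandwidth remains serves client reads:
client_reads_per_node = READ_BW_PER_NODE - recovery_reads_per_node
cluster_client_reads  = NODES * client_reads_per_node     # 16 GB/s total

print(f"client read headroom: {cluster_client_reads} GB/s vs "
      f"{CLIENT_DEMAND} GB/s of network -> "
      f"QoS unnecessary: {cluster_client_reads >= CLIENT_DEMAND}")
```

Even mid-recovery, the surviving read bandwidth (16GB/s) exceeds what the initial network can deliver to clients (8GB/s), which is the basis for "no need for QoS at all, ever"; the real bottleneck is the thrashing HDDs.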


> My initial KISS thought was that a single network was the opposite, due to the
> alternate and maybe less tested configuration of Ceph. Perhaps
> multi-netting is a better compromise. We still run 2 networks, but not
> over separate VLANs.
> 
If you look at that other thread, you will find that many people run and
prefer single networks.
Just because it's in the documentation and an option doesn't mean it's the
best/correct approach.

I use split networks in exactly one cluster, my shitty test one which has
2 1Gb/s ports per node and _more_ IO bandwidth per node than a single
link. 

> Terrible idea?
> 
More to the tune of pointless.

Christian

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] CephFS in the wild

2016-06-02 Thread Brady Deetz
On Thu, Jun 2, 2016 at 8:58 PM, Christian Balzer  wrote:

> On Thu, 2 Jun 2016 11:11:19 -0500 Brady Deetz wrote:
>
> > On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer  wrote:
> >
> > >
> > > Hello,
> > >
> > > On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
> > >
> > > > Question:
> > > > I'm curious if there is anybody else out there running CephFS at the
> > > > scale I'm planning for. I'd like to know some of the issues you
> > > > didn't expect that I should be looking out for. I'd also like to
> > > > simply see when CephFS hasn't worked out and why. Basically, give me
> > > > your war stories.
> > > >
> > > Not me, but diligently search the archives, there are people with large
> > > CephFS deployments (despite the non-production status when they did
> > > them). Also look at the current horror story thread about what happens
> > > when you have huge directories.
> > >
> > > >
> > > > Problem Details:
> > > > Now that I'm out of my design phase and finished testing on VMs, I'm
> > > > ready to drop $100k on a pilot. I'd like to get some sense of
> > > > confidence from the community that this is going to work before I
> > > > pull the trigger.
> > > >
> > > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320
> > > > with CephFS by this time next year (hopefully by December). My
> > > > workload is a mix of small and very large files (100GB+ in size). We
> > > > do fMRI analysis on DICOM image sets as well as other physio data
> > > > collected from subjects. We also have plenty of spreadsheets,
> > > > scripts, etc. Currently 90% of our analysis is I/O bound and
> > > > generally sequential.
> > > >
> > > There are other people here doing similar things (medical institutes,
> > > universities), again search the archives and maybe contact them
> > > directly.
> > >
> > > > In deploying Ceph, I am hoping to see more throughput than the 7320
> > > > can currently provide. I'm also looking to get away from traditional
> > > > file-systems that require forklift upgrades. That's where Ceph really
> > > > shines for us.
> > > >
> > > > I don't have a total file count, but I do know that we have about
> > > > 500k directories.
> > > >
> > > >
> > > > Planned Architecture:
> > > >
> > > Well, we talked about this 2 months ago, you seem to have changed only
> > > a few things.
> > > So let's dissect this again...
> > >
> > > > Storage Interconnect:
> > > > Brocade VDX 6940 (40 gig)
> > > >
> > > Is this a flat (single) network for all the storage nodes?
> > > And then from these 40Gb/s switches links to the access switches?
> > >
> >
> > This will start as a single 40Gb/s switch with a single link to each node
> > (upgraded in the future to dual-switch + dual-link). The 40Gb/s switch
> > will also be connected to several 10Gb/s and 1Gb/s access switches with
> > dual 40Gb/s uplinks.
> >
> So initially 80Gb/s and with the 2nd switch probably 160Gb/s for your
> clients.
> Network wise, your 8 storage servers outstrip that, actual storage
> bandwidth and IOPS wise, you're looking at 8x2GB/s aka 160Gb/s best case
> writes, so a match.
>
> > We do intend to segment the public and private networks using VLANs
> > untagged at the node. There are obviously many subnets on our network.
> > The 40Gb/s switch will handle routing for those networks.
> >
> > You can see list discussion in "Public and Private network over 1
> > interface" May 23,2016 regarding some of this.
> >
> And I did comment in that thread, the final one actually. ^o^
>
> Unless you can come up with a _very_ good reason not covered in that
> thread, I'd keep it to one network.
>
> Once the 2nd switch is in place and running vLAG (LACP on your servers)
> your network bandwidth per host VASTLY exceeds that of your storage.
>
>
My theory is that with a single switch, I can QoS traffic for the private
network in case of the situation where we do see massive client I/O at the
same time that a re-weight or something like that was happening. But... I
think you're right. KISS

My initial KISS thought was that a single network was the opposite, due to the
alternate and maybe less tested configuration of Ceph. Perhaps
multi-netting is a better compromise. We still run 2 networks, but not over
separate VLANs.

Terrible idea?


> >
> > >
> > > > Access Switches for clients (servers):
> > > > Brocade VDX 6740 (10 gig)
> > > >
> > > > Access Switches for clients (workstations):
> > > > Brocade ICX 7450
> > > >
> > > > 3x MON:
> > > > 128GB RAM
> > > > 2x 200GB SSD for OS
> > > > 2x 400GB P3700 for LevelDB
> > > > 2x E5-2660v4
> > > > 1x Dual Port 40Gb Ethernet
> > > >
> > > Total overkill in the CPU core arena, fewer but faster cores would be
> > > more suited for this task.
> > > A 6-8 core, 2.8-3GHz base speed would be nice, alas Intel has nothing
> > > like that, the closest one would be the E5-2643v4.
> > >
> > > Same for RAM, MON processes are pretty frugal.
> > >
> > > No need for NVMes for the leveldb, use 2 400GB 

Re: [ceph-users] CephFS in the wild

2016-06-02 Thread Christian Balzer
On Thu, 2 Jun 2016 11:11:19 -0500 Brady Deetz wrote:

> On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer  wrote:
> 
> >
> > Hello,
> >
> > On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
> >
> > > Question:
> > > I'm curious if there is anybody else out there running CephFS at the
> > > scale I'm planning for. I'd like to know some of the issues you
> > > didn't expect that I should be looking out for. I'd also like to
> > > simply see when CephFS hasn't worked out and why. Basically, give me
> > > your war stories.
> > >
> > Not me, but diligently search the archives, there are people with large
> > CephFS deployments (despite the non-production status when they did
> > them). Also look at the current horror story thread about what happens
> > when you have huge directories.
> >
> > >
> > > Problem Details:
> > > Now that I'm out of my design phase and finished testing on VMs, I'm
> > > ready to drop $100k on a pilot. I'd like to get some sense of
> > > confidence from the community that this is going to work before I
> > > pull the trigger.
> > >
> > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320
> > > with CephFS by this time next year (hopefully by December). My
> > > workload is a mix of small and very large files (100GB+ in size). We
> > > do fMRI analysis on DICOM image sets as well as other physio data
> > > collected from subjects. We also have plenty of spreadsheets,
> > > scripts, etc. Currently 90% of our analysis is I/O bound and
> > > generally sequential.
> > >
> > There are other people here doing similar things (medical institutes,
> > universities), again search the archives and maybe contact them
> > directly.
> >
> > > In deploying Ceph, I am hoping to see more throughput than the 7320
> > > can currently provide. I'm also looking to get away from traditional
> > > file-systems that require forklift upgrades. That's where Ceph really
> > > shines for us.
> > >
> > > I don't have a total file count, but I do know that we have about
> > > 500k directories.
> > >
> > >
> > > Planned Architecture:
> > >
> > Well, we talked about this 2 months ago, you seem to have changed only
> > a few things.
> > So let's dissect this again...
> >
> > > Storage Interconnect:
> > > Brocade VDX 6940 (40 gig)
> > >
> > Is this a flat (single) network for all the storage nodes?
> > And then from these 40Gb/s switches links to the access switches?
> >
> 
> This will start as a single 40Gb/s switch with a single link to each node
> (upgraded in the future to dual-switch + dual-link). The 40Gb/s switch
> will also be connected to several 10Gb/s and 1Gb/s access switches with
> dual 40Gb/s uplinks.
> 
So initially 80Gb/s and with the 2nd switch probably 160Gb/s for your
clients.
Network wise, your 8 storage servers outstrip that, actual storage
bandwidth and IOPS wise, you're looking at 8x2GB/s aka 160Gb/s best case
writes, so a match.
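
The back-of-the-envelope arithmetic above can be reproduced; the sketch
below assumes ~2GB/s of best-case write bandwidth per OSD node and uses
the crude rule of ~10 wire-bits per usable byte to absorb protocol
overhead, which is how 8 nodes lands near 160Gb/s:

```python
# Rough cluster bandwidth arithmetic from the thread.
# Assumptions: ~2 GB/s usable writes per OSD node (2x P3700 journals),
# ~10 bits on the wire per usable byte (crude overhead allowance).
BITS_PER_BYTE_ON_WIRE = 10

def gbytes_to_gbits(gb_per_s: float) -> float:
    """Convert GB/s of payload into approximate Gb/s on the wire."""
    return gb_per_s * BITS_PER_BYTE_ON_WIRE

osd_nodes = 8
per_node_write_GBps = 2.0  # best case, journal-limited
cluster_write_Gbps = gbytes_to_gbits(osd_nodes * per_node_write_GBps)

client_uplinks_Gbps = 2 * 40  # dual 40Gb/s uplinks to the access switches

print(f"cluster best-case writes ~{cluster_write_Gbps:.0f} Gb/s, "
      f"initial client uplinks {client_uplinks_Gbps} Gb/s")
```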

> We do intend to segment the public and private networks using VLANs
> untagged at the node. There are obviously many subnets on our network.
> The 40Gb/s switch will handle routing for those networks.
> 
> You can see list discussion in "Public and Private network over 1
> interface" May 23,2016 regarding some of this.
> 
And I did comment in that thread, the final one actually. ^o^

Unless you can come up with a _very_ good reason not covered in that
thread, I'd keep it to one network.

Once the 2nd switch is in place and running vLAG (LACP on your servers)
your network bandwidth per host VASTLY exceeds that of your storage.

> 
> >
> > > Access Switches for clients (servers):
> > > Brocade VDX 6740 (10 gig)
> > >
> > > Access Switches for clients (workstations):
> > > Brocade ICX 7450
> > >
> > > 3x MON:
> > > 128GB RAM
> > > 2x 200GB SSD for OS
> > > 2x 400GB P3700 for LevelDB
> > > 2x E5-2660v4
> > > 1x Dual Port 40Gb Ethernet
> > >
> > Total overkill in the CPU core arena, fewer but faster cores would be
> > more suited for this task.
> > A 6-8 core, 2.8-3GHz base speed would be nice, alas Intel has nothing
> > like that, the closest one would be the E5-2643v4.
> >
> > Same for RAM, MON processes are pretty frugal.
> >
> > No need for NVMes for the leveldb, use 2 400GB DC S3710 for OS (and
> > thus the leveldb) and that's being overly generous in the speed/IOPS
> > department.
> >
> > Note also that 40Gb/s isn't really needed here, alas latency and KISS
> > do speak in favor of it, especially if you can afford it.
> >
> 
> Noted
> 
> 
> >
> > > 2x MDS:
> > > 128GB RAM
> > > 2x 200GB SSD for OS
> > > 2x 400GB P3700 for LevelDB (is this necessary?)
> > No, there isn't any persistent data with MDS, unlike what I assumed as
> > well before reading up on it and trying it out for the first time.
> >
> 
> That's what I thought. For some reason, my VAR keeps throwing these on
> the config.
> 
That's their job after all, selling you hardware that you don't need so
that they can create added value (for themselves). ^o^
 
> 
> >
> > > 2x 

Re: [ceph-users] CephFS in the wild

2016-06-02 Thread Scottix
I have three comments on our CephFS deployment. Some background first, we
have been using CephFS since Giant with some not so important data. We are
using it more heavily now in Infernalis. We have our own raw data storage
using the POSIX semantics and keep everything as basic as possible.
Basically open, read, and write.

1st: if you have a lot of files or directories in a folder, the lookup
can get slow; I would say when you get to about 5000 items you can feel
the latency. Traditionally this has never been ultra fast on regular file
systems either, but just be aware.
2nd: we do see an increase in parallelization of reading and writing data
compared to a traditional spinning-RAID file system. I think this is a
testament to Ceph.
3rd: when we upgrade an MDS, we basically have to stop all activity on
CephFS to restart it. Replaying the journal backlog at startup, if it is
large, can eat a lot of memory, and you'd better hope you don't hit swap.
This does create some downtime for us, but it usually isn't long.
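
The directory-size effect in the first point is easy to probe from the
client side. A rough, self-contained sketch (it uses a local temp
directory so it runs anywhere; point `time_listing` at a CephFS mount to
measure the real thing, and expect very different numbers there):

```python
# Measure how long it takes to list a directory and stat every entry,
# which is roughly the "lookup" cost felt in large folders.
import os
import tempfile
import time

def time_listing(path: str) -> float:
    """Seconds to list a directory and stat each of its entries."""
    start = time.perf_counter()
    for name in os.listdir(path):
        os.stat(os.path.join(path, name))
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    n = 5000  # roughly where the latency reportedly becomes noticeable
    for i in range(n):
        open(os.path.join(d, f"f{i:05d}"), "w").close()
    elapsed = time_listing(d)
    print(f"listed and statted {n} entries in {elapsed:.3f}s")
```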

I am hoping for more improvements in MDS like HA and various other things
to make it even better.

On Thu, Jun 2, 2016 at 9:11 AM Brady Deetz  wrote:

> On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer  wrote:
>
>>
>> Hello,
>>
>> On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
>>
>> > Question:
>> > I'm curious if there is anybody else out there running CephFS at the
>> > scale I'm planning for. I'd like to know some of the issues you didn't
>> > expect that I should be looking out for. I'd also like to simply see
>> > when CephFS hasn't worked out and why. Basically, give me your war
>> > stories.
>> >
>> Not me, but diligently search the archives, there are people with large
>> CephFS deployments (despite the non-production status when they did them).
>> Also look at the current horror story thread about what happens when you
>> have huge directories.
>>
>> >
>> > Problem Details:
>> > Now that I'm out of my design phase and finished testing on VMs, I'm
>> > ready to drop $100k on a pilot. I'd like to get some sense of confidence
>> > from the community that this is going to work before I pull the trigger.
>> >
>> > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
>> > CephFS by this time next year (hopefully by December). My workload is a
>> > mix of small and very large files (100GB+ in size). We do fMRI analysis
>> > on DICOM image sets as well as other physio data collected from
>> > subjects. We also have plenty of spreadsheets, scripts, etc. Currently
>> > 90% of our analysis is I/O bound and generally sequential.
>> >
>> There are other people here doing similar things (medical institutes,
>> universities), again search the archives and maybe contact them directly.
>>
>> > In deploying Ceph, I am hoping to see more throughput than the 7320 can
>> > currently provide. I'm also looking to get away from traditional
>> > file-systems that require forklift upgrades. That's where Ceph really
>> > shines for us.
>> >
>> > I don't have a total file count, but I do know that we have about 500k
>> > directories.
>> >
>> >
>> > Planned Architecture:
>> >
>> Well, we talked about this 2 months ago, you seem to have changed only a
>> few things.
>> So let's dissect this again...
>>
>> > Storage Interconnect:
>> > Brocade VDX 6940 (40 gig)
>> >
>> Is this a flat (single) network for all the storage nodes?
>> And then from these 40Gb/s switches links to the access switches?
>>
>
> This will start as a single 40Gb/s switch with a single link to each node
> (upgraded in the future to dual-switch + dual-link). The 40Gb/s switch will
> also be connected to several 10Gb/s and 1Gb/s access switches with dual
> 40Gb/s uplinks.
>
> We do intend to segment the public and private networks using VLANs
> untagged at the node. There are obviously many subnets on our network. The
> 40Gb/s switch will handle routing for those networks.
>
> You can see list discussion in "Public and Private network over 1
> interface" May 23,2016 regarding some of this.
>
>
>>
>> > Access Switches for clients (servers):
>> > Brocade VDX 6740 (10 gig)
>> >
>> > Access Switches for clients (workstations):
>> > Brocade ICX 7450
>> >
>> > 3x MON:
>> > 128GB RAM
>> > 2x 200GB SSD for OS
>> > 2x 400GB P3700 for LevelDB
>> > 2x E5-2660v4
>> > 1x Dual Port 40Gb Ethernet
>> >
>> Total overkill in the CPU core arena, fewer but faster cores would be more
>> suited for this task.
>> A 6-8 core, 2.8-3GHz base speed would be nice, alas Intel has nothing like
>> that, the closest one would be the E5-2643v4.
>>
>> Same for RAM, MON processes are pretty frugal.
>>
>> No need for NVMes for the leveldb, use 2 400GB DC S3710 for OS (and thus
>> the leveldb) and that's being overly generous in the speed/IOPS
>> department.
>>
>> Note also that 40Gb/s isn't really needed here, alas latency and KISS do
>> speak in favor of it, especially if you can afford it.
>>
>
> Noted
>
>
>>
>> > 2x MDS:
>> > 

Re: [ceph-users] CephFS in the wild

2016-06-02 Thread Brady Deetz
On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer  wrote:

>
> Hello,
>
> On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
>
> > Question:
> > I'm curious if there is anybody else out there running CephFS at the
> > scale I'm planning for. I'd like to know some of the issues you didn't
> > expect that I should be looking out for. I'd also like to simply see
> > when CephFS hasn't worked out and why. Basically, give me your war
> > stories.
> >
> Not me, but diligently search the archives, there are people with large
> CephFS deployments (despite the non-production status when they did them).
> Also look at the current horror story thread about what happens when you
> have huge directories.
>
> >
> > Problem Details:
> > Now that I'm out of my design phase and finished testing on VMs, I'm
> > ready to drop $100k on a pilot. I'd like to get some sense of confidence
> > from the community that this is going to work before I pull the trigger.
> >
> > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> > CephFS by this time next year (hopefully by December). My workload is a
> > mix of small and very large files (100GB+ in size). We do fMRI analysis
> > on DICOM image sets as well as other physio data collected from
> > subjects. We also have plenty of spreadsheets, scripts, etc. Currently
> > 90% of our analysis is I/O bound and generally sequential.
> >
> There are other people here doing similar things (medical institutes,
> universities), again search the archives and maybe contact them directly.
>
> > In deploying Ceph, I am hoping to see more throughput than the 7320 can
> > currently provide. I'm also looking to get away from traditional
> > file-systems that require forklift upgrades. That's where Ceph really
> > shines for us.
> >
> > I don't have a total file count, but I do know that we have about 500k
> > directories.
> >
> >
> > Planned Architecture:
> >
> Well, we talked about this 2 months ago, you seem to have changed only a
> few things.
> So let's dissect this again...
>
> > Storage Interconnect:
> > Brocade VDX 6940 (40 gig)
> >
> Is this a flat (single) network for all the storage nodes?
> And then from these 40Gb/s switches links to the access switches?
>

This will start as a single 40Gb/s switch with a single link to each node
(upgraded in the future to dual-switch + dual-link). The 40Gb/s switch will
also be connected to several 10Gb/s and 1Gb/s access switches with dual
40Gb/s uplinks.

We do intend to segment the public and private networks using VLANs
untagged at the node. There are obviously many subnets on our network. The
40Gb/s switch will handle routing for those networks.

You can see list discussion in "Public and Private network over 1
interface" May 23,2016 regarding some of this.


>
> > Access Switches for clients (servers):
> > Brocade VDX 6740 (10 gig)
> >
> > Access Switches for clients (workstations):
> > Brocade ICX 7450
> >
> > 3x MON:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
> >
> Total overkill in the CPU core arena, fewer but faster cores would be more
> suited for this task.
> A 6-8 core, 2.8-3GHz base speed would be nice, alas Intel has nothing like
> that, the closest one would be the E5-2643v4.
>
> Same for RAM, MON processes are pretty frugal.
>
> No need for NVMes for the leveldb, use 2 400GB DC S3710 for OS (and thus
> the leveldb) and that's being overly generous in the speed/IOPS department.
>
> Note also that 40Gb/s isn't really needed here, alas latency and KISS do
> speak in favor of it, especially if you can afford it.
>

Noted


>
> > 2x MDS:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB (is this necessary?)
> No, there isn't any persistent data with MDS, unlike what I assumed as
> well before reading up on it and trying it out for the first time.
>

That's what I thought. For some reason, my VAR keeps throwing these on the
config.


>
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
> >
> Dedicated MONs/MDS are often a waste; they are suggested to prevent
> people who don't know what they're doing from overloading things.
>
> So in your case, I'd (again) suggest to get 3 mixed MON/MDS nodes, make
> the first one a dedicated MON and give it the lowest IP.
> HW Specs as discussed above, make sure to use DIMMs that allow you to
> upgrade to 256GB RAM, as MDS can grow larger than the other Ceph daemons
> (from my limited experience with CephFS).
> So:
>
> 128GB RAM (expandable to 256GB or more)
> 2x E5-2643v4
> 2x 400GB DC S3710
> 1x Dual Port 40Gb Ethernet
>
> > 8x OSD:
> > 128GB RAM
> Use your savings above to make that 256GB for great performance
> improvements as hot objects stay in memory and so will all dir-entries (in
> SLAB).
>

I like this idea.


>
> > 2x 200GB SSD for OS
> Overkill really. Other than the normally rather terse OSD logs, nothing
> much will ever be written to them. So 3510s or at most 3610s.
>
> > 

Re: [ceph-users] CephFS in the wild

2016-06-01 Thread Christian Balzer

Hello,

On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:

> Question:
> I'm curious if there is anybody else out there running CephFS at the
> scale I'm planning for. I'd like to know some of the issues you didn't
> expect that I should be looking out for. I'd also like to simply see
> when CephFS hasn't worked out and why. Basically, give me your war
> stories.
>
Not me, but diligently search the archives, there are people with large
CephFS deployments (despite the non-production status when they did them).
Also look at the current horror story thread about what happens when you
have huge directories.
  
> 
> Problem Details:
> Now that I'm out of my design phase and finished testing on VMs, I'm
> ready to drop $100k on a pilot. I'd like to get some sense of confidence
> from the community that this is going to work before I pull the trigger.
> 
> I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> CephFS by this time next year (hopefully by December). My workload is a
> mix of small and very large files (100GB+ in size). We do fMRI analysis
> on DICOM image sets as well as other physio data collected from
> subjects. We also have plenty of spreadsheets, scripts, etc. Currently
> 90% of our analysis is I/O bound and generally sequential.
> 
There are other people here doing similar things (medical institutes,
universities), again search the archives and maybe contact them directly.

> In deploying Ceph, I am hoping to see more throughput than the 7320 can
> currently provide. I'm also looking to get away from traditional
> file-systems that require forklift upgrades. That's where Ceph really
> shines for us.
> 
> I don't have a total file count, but I do know that we have about 500k
> directories.
> 
> 
> Planned Architecture:
> 
Well, we talked about this 2 months ago, you seem to have changed only a
few things.
So let's dissect this again...

> Storage Interconnect:
> Brocade VDX 6940 (40 gig)
> 
Is this a flat (single) network for all the storage nodes?
And then from these 40Gb/s switches links to the access switches?

> Access Switches for clients (servers):
> Brocade VDX 6740 (10 gig)
> 
> Access Switches for clients (workstations):
> Brocade ICX 7450
> 
> 3x MON:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for LevelDB
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet
> 
Total overkill in the CPU core arena, fewer but faster cores would be more
suited for this task.
A 6-8 core, 2.8-3GHz base speed would be nice, alas Intel has nothing like
that, the closest one would be the E5-2643v4.

Same for RAM, MON processes are pretty frugal.

No need for NVMes for the leveldb, use 2 400GB DC S3710 for OS (and thus
the leveldb) and that's being overly generous in the speed/IOPS department.

Note also that 40Gb/s isn't really needed here, alas latency and KISS do
speak in favor of it, especially if you can afford it.

> 2x MDS:
> 128GB RAM
> 2x 200GB SSD for OS
> 2x 400GB P3700 for LevelDB (is this necessary?)
No, there isn't any persistent data with MDS, unlike what I assumed as
well before reading up on it and trying it out for the first time.

> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet
> 
Dedicated MONs/MDS are often a waste; they are suggested to prevent people
who don't know what they're doing from overloading things.

So in your case, I'd (again) suggest to get 3 mixed MON/MDS nodes, make
the first one a dedicated MON and give it the lowest IP.
HW Specs as discussed above, make sure to use DIMMs that allow you to
upgrade to 256GB RAM, as MDS can grow larger than the other Ceph daemons
(from my limited experience with CephFS).
So:

128GB RAM (expandable to 256GB or more)
2x E5-2643v4
2x 400GB DC S3710
1x Dual Port 40Gb Ethernet

> 8x OSD:
> 128GB RAM
Use your savings above to make that 256GB for great performance
improvements as hot objects stay in memory and so will all dir-entries (in
SLAB). 

> 2x 200GB SSD for OS
Overkill really. Other than the normally rather terse OSD logs, nothing
much will ever be written to them. So 3510s or at most 3610s.

> 2x 400GB P3700 for Journals
As discussed 2 months ago, this limits you to writes at half (or quarter
depending on your design and if you do LACP, vLAG) of what your network is
capable of. 
OTOH, I wouldn't expect your 24 HDDs to do much better than 2GB/s either
(at least with filestore; bluestore is a year away at best).
So good enough, especially if you're read heavy.
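
A rough sanity check of the half-or-quarter claim, assuming ~1 GB/s of
sustained sequential write per 400GB P3700 (vendor ballpark) and ignoring
filestore's double write:

```python
# Journal throughput vs. network capacity per OSD node.
# Assumptions: ~1 GB/s sustained write per 400GB P3700; plain 8 bits
# per byte here since we're comparing raw link capacity.
journal_GBps = 2 * 1.0        # 2x P3700 journals per node
net_single_GBps = 40 / 8      # one 40Gb/s link ~ 5 GB/s
net_vlag_GBps = 80 / 8        # dual links (LACP/vLAG) ~ 10 GB/s

ratio_single = journal_GBps / net_single_GBps  # journals vs one link
ratio_vlag = journal_GBps / net_vlag_GBps      # journals vs dual links

print(f"journal-limited writes: {ratio_single:.0%} of a single link, "
      f"{ratio_vlag:.0%} with vLAG")
```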

> 24x 6TB Enterprise SATA
> 2x E5-2660v4
> 1x Dual Port 40Gb Ethernet

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com