On Wed, 01 Oct 2014 20:12:03 +0200 Massimiliano Cuttini wrote:

> Hello Christian,
> 
> 
> Il 01/10/2014 19:20, Christian Balzer ha scritto:
> > Hello,
> >
> > On Wed, 01 Oct 2014 18:26:53 +0200 Massimiliano Cuttini wrote:
> >
> >> Dear all,
> >>
> >> I need a few tips about the best Ceph solution for a drive
> >> controller. I'm getting confused about IT mode, RAID and JBOD.
> >> I read many posts saying not to go for RAID but to use a JBOD
> >> configuration instead.
> >>
> >> I have 2 storage alternatives right now in my mind:
> >>
> >>      *SuperStorage Server 2027R-E1CR24L*
> >>      which use SAS3 via LSI 3008 AOC; IT Mode/Pass-through
> >>      http://www.supermicro.nl/products/system/2U/2027/SSG-2027R-E1CR24L.cfm
> >>
> >> and
> >>
> >>      *SuperStorage Server 2027R-E1CR24N*
> >>      which use SAS3 via LSI 3108 SAS3 AOC (in RAID mode?)
> >>      http://www.supermicro.nl/products/system/2U/2027/SSG-2027R-E1CR24N.cfm
> >>
> > Firstly, both of these use an expander backplane.
> > So if you're planning on putting SSDs in there (even if just like 6 for
> > journals) you may be hampered by that.
> > The Supermicro homepage is vague as usual and the manual doesn't
> > actually have a section for that backplane. I guess it will be a 4-lane
> > (x4) connection, so 4x12Gb/s, i.e. about 4.8 GB/s after encoding overhead.
> > If the disks are all going to be HDDs you're OK, but keep that bit in mind.
> >   
> OK, I was thinking about connecting 24 SSDs with SATA3 (6Gbps). This is
> why I chose an 8-port SAS3 LSI card with a PCIe 3.0 connection, which
> even supports 12Gbps.
> This should allow me to use the full speed of the SSDs (I guess).
> 
Given the SSD speeds you cite below, SAS2 aka SATA3 would do, too. 
And of course be cheaper.

Also what SSDs are you planning to deploy?

> I made this analysis:
> - Total output: 8x12 = 96Gbps full speed available via PCIe 3.0
That's the speed/capacity of the controller.

I'm talking about the actual backplane, where the drives plug in.
And that is connected either by one cable (and thus 48Gb/s) or two (and
thus the 96Gb/s you're expecting); the documentation on the homepage is
unclear and the manual of that server doesn't cover it. Digging around I
found
http://www.supermicro.com.tw/manuals/other/BPN-SAS3-216EL.pdf
which suggests two ports, so your basic assumptions are correct.

But verify that with your Supermicro vendor and read up about SAS/SATA
expanders.
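
To put numbers on it, here's a quick back-of-the-envelope in Python; it
assumes each cable carries a x4 wide port at the raw SAS3 line rate of
12Gb/s per lane (before encoding overhead), which is my reading of that
backplane manual, not a verified spec:

# Rough uplink math for an expander backplane shared by 24 drives.
# Assumptions: each cable is a x4 wide port at 12 Gb/s per lane (raw).
LANE_GBPS = 12
LANES_PER_CABLE = 4
DRIVES = 24

for cables in (1, 2):
    uplink_gbps = cables * LANES_PER_CABLE * LANE_GBPS
    per_drive_gbps = uplink_gbps / DRIVES
    print(f"{cables} cable(s): {uplink_gbps} Gb/s uplink, "
          f"~{per_drive_gbps:.1f} Gb/s per drive")
# 1 cable(s): 48 Gb/s uplink, ~2.0 Gb/s per drive
# 2 cable(s): 96 Gb/s uplink, ~4.0 Gb/s per drive

So whether there are one or two cables is the difference between roughly
2 and 4 Gb/s per drive.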

If you want/need full speed, the only SAS3 option from Supermicro at this
time seems to be
http://www.supermicro.com.tw/products/chassis/2U/216/SC216BAC-R920LP.cfm

Of course a direct connect backplane chassis with SAS2/SATA3 will do fine
as I wrote above, like this one.
http://www.supermicro.com.tw/products/chassis/2U/216/SC216BA-R1K28LP.cfm

In either case get the fastest motherboard/CPUs (Ceph will need those for
SSDs) and the appropriate controller(s). If you're unwilling to build them
yourself, I'm sure some vendor will do BTO. ^^

> - Then I should have, for each disk, a maximum speed of 96Gbps / 24
> disks = 4Gbps per disk.
> - The disks are SATA3 (6Gbps), so I have a small bottleneck here that
> limits me to 4Gbps.
> - However, a common SSD never hits the interface speed; they tend to be
> around 450MB/s.
> 
> Average speed of an SSD (MB/s):
>           Min    Avg    Max
> Read      369    485    522
> Write     162    428    504
> Mixed     223    449    512
> 
> 
> Then having a bottleneck at 4Gbps (which means about 400MB/s) should be
> fine (if I'm not wrong).
> Is what I thought correct?
> 
Also expanders introduce some level of overhead, so you're probably going
to wind up with less than 400MB/s per drive.
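
As a rough illustration (the 10% expander overhead below is just a guess
for the sake of the example, not a measured figure):

# Per-drive throughput estimate with two uplink cables and 24 drives.
# Assumptions: ~10 bits on the wire per data byte (8b/10b, as on the
# SATA3 drive links) and an invented 10% expander overhead.
raw_gbps_per_drive = 96 / 24        # ~4 Gb/s share per drive
encoding_efficiency = 0.8           # 8b/10b: 8 data bits per 10 line bits
expander_overhead = 0.9             # assumed, not measured

usable_mb_s = raw_gbps_per_drive * 1000 / 8 * encoding_efficiency
realistic_mb_s = usable_mb_s * expander_overhead
print(f"~{usable_mb_s:.0f} MB/s raw share, "
      f"maybe ~{realistic_mb_s:.0f} MB/s in practice")
# ~400 MB/s raw share, maybe ~360 MB/s in practice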

> I think that the only bottleneck here is the 4x1Gb ethernet connection.
>
With a fire-breathing storage server like that, you definitely do NOT want
to limit yourself to 1Gb/s links. The latency of those links, never mind
the bandwidth, will render all your investment in the storage nodes rather
moot.

Even if your clients won't be on something faster, at least use 10Gb/s
Ethernet for replication, or my favorite (price- and performance-wise),
InfiniBand.
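
To make the mismatch concrete, a small sketch using the ~400 MB/s
per-drive figure from above; the nominal link rates and the ~95% usable
fraction after protocol overhead are my assumptions:

# Aggregate storage bandwidth vs. network bandwidth.
drives = 24
per_drive_mb_s = 400          # from the estimate above
link_efficiency = 0.95        # rough guess for protocol overhead

storage_mb_s = drives * per_drive_mb_s
net_4x1g = 4 * 1000 / 8 * link_efficiency
net_2x10g = 2 * 10000 / 8 * link_efficiency

print(f"storage side : ~{storage_mb_s} MB/s")
print(f"4x 1GbE bond : ~{net_4x1g:.0f} MB/s ({net_4x1g / storage_mb_s:.0%})")
print(f"2x 10GbE bond: ~{net_2x10g:.0f} MB/s ({net_2x10g / storage_mb_s:.0%})")
# storage side : ~9600 MB/s
# 4x 1GbE bond : ~475 MB/s (5%)
# 2x 10GbE bond: ~2375 MB/s (25%)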

> >> OK, both solutions should support JBOD.
> >> However, I read that only an LSI HBA and/or one flashed in IT mode
> >> allows you to:
> >>
> >>    * "plug & play" a new drive and see it immediately on a Linux
> >>      distribution (without rescanning disks)
> >>    * see S.M.A.R.T. data (because there is no volume layer between
> >>      the motherboard and the disks)
> > smartctl can handle the LSI RAID stuff fine.
> Good
> 
> >
> >>    * reduce the disk latency
> >>
> > Not sure about that; depending on the actual RAID and configuration,
> > any cache of the RAID subsystem might get used, so improving things.
> >
> > The most important reason to use IT mode for me would be in conjunction
> > with SSDs: none of the RAID controllers I'm aware of allow TRIM/DISCARD
> > to work.
> 
> Do you know if I can flash the LSI 3108 to IT mode?
> 
I don't know; given that the 2108 can't, I would expect the answer to be no.

> >> Then I should probably avoid the LSI 3108 (which has a RAID config by
> >> default) and go for the LSI 3008 (already flashed in IT mode).
> >>
> > Of the 2 I would pick the IT mode one for a "classic" Ceph deployment.
> 
> Ok, but why?
Because you're using SSDs, for starters, and thus REALLY want an HBA in IT
mode. And because it is cheaper and more straightforward.

Also, having to create 24 single-disk RAID0 volumes with certain
controllers (and the 3108 is among them, if it is anything like the 2108)
is a pain.

Other controllers, like Areca, will automatically make their onboard cache
available in JBOD mode. So you get the best of both worlds, at a price of
course.

> Can you suggest some good technical documentation about IT mode?
> 
Not really, it's called experience and reading what others wrote.

> >
> >> Is that so, or am I completely wasting my time on useless specs?
> >>
> > It might be a good idea to tell us what your actual plans are.
> > As in, how many nodes (these are quite dense ones with 24 drives!), how
> > much storage in total, what kind of use pattern, clients.
> Right now we are just testing and experimenting.
> We would start with a non-production environment with 2 nodes, learn
> Ceph in depth, then replicate the tests & findings on another 2 nodes,
> upgrade to 10Gb Ethernet and go live.

Given that you're aiming for all SSDs, definitely consider InfiniBand for
the backend (replication network) at least.
It's cheaper/faster and will also have more native support (thus be even
faster) in upcoming Ceph releases.
Failing that, definitely use dedicated client and replication networks,
each with 2x10Gb/s bonded links, to get somewhere close to your storage
capabilities/bandwidth.

Next consider the HA aspects of your cluster. Aside from the obvious like
having redundant power feeds and network links/switches, what happens if a
storage node fails?
If you're starting with 2 nodes, that's risky in and of itself (also,
deploy at least 3 mons).

If you start with 4 nodes and one goes down, the default behavior of Ceph
is to redistribute the data onto the 3 remaining nodes to maintain the
replication level (a level of 2 is probably acceptable with the right kind
of SSDs).
What that means is a LOT of traffic for the replication, potentially
impacting your performance depending on the configuration options and
actual hardware used. It also means your "near full" settings should be at
70% or lower, because otherwise a node failure could result in full OSDs
and thus a blocked cluster. And of course, after the data is rebalanced,
the lack of one node means that your cluster is about 25% slower than
before.
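
Here's that near-full math as a minimal sketch, assuming equal-sized
nodes, perfectly even rebalancing and the default full ratio of 0.95:

# Why ~70% "near full" (or lower) with 4 nodes: after one node fails, the
# surviving 3 must absorb its data, so utilization grows by N/(N-1).
nodes = 4
full_ratio = 0.95   # default full ratio; OSDs block writes beyond this

for utilization in (0.85, 0.70, 0.65):
    after_failure = utilization * nodes / (nodes - 1)
    verdict = "OK" if after_failure < full_ratio else "full OSDs, blocked I/O"
    print(f"start at {utilization:.0%} -> ~{after_failure:.0%} "
          f"after a node failure: {verdict}")
# start at 85% -> ~113% after a node failure: full OSDs, blocked I/O
# start at 70% -> ~93% after a node failure: OK
# start at 65% -> ~87% after a node failure: OK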

There are many threads in this ML that touch on this subject, with
various suggestions on how to minimize or negate the impact of a node
failure.

The most common, and from a pure HA perspective most sensible, suggestion
is to start with enough nodes that a failure won't have too much impact,
but that of course is also the most expensive option. ^^

> I don't want to start with a bad hardware environment from the very
> beginning, so I'm reading a lot to find the perfect config for our
> needs. However, the LSI specs are a nightmare... they are completely
> confusing.
>
I just told the Supermicro manager for Japan that last week. ^o^
She acknowledged it and suggested pestering their sales people who
supposedly all know every last detail that is to be known. ^^ 

> About the kind of use: keep in mind that we need Ceph to run Xen VMs
> with high availability (LUNs on a NAS); they commonly run MySQL and
> other low-latency applications.
Read the "[Single OSD performance on SSD] Can't go over 3,2K IOPS" thread
and related ones here.

Christian

> We'll probably implement them with OpenStack in the near future.
> Let me know if you need some more specs.
> 


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
