Re: [ceph-users] design guidance

2017-06-07 Thread Christian Balzer

Hello,

On Tue, 6 Jun 2017 20:59:40 -0400 Daniel K wrote:

> Christian,
> 
> Thank you for the tips -- I certainly googled my eyes out for a good while
> before asking -- maybe my google-fu wasn't too good last night.
> 
> > I love using IB, alas with just one port per host you're likely best off
> > ignoring it, unless you have a converged network/switches that can make
> > use of it (or run it in Ethernet mode).  
> 
> I've always heard people speak fondly of IB, but I've honestly never dealt
> with it. I'm mostly a network guy at heart, so I'm perfectly comfortable
> aggregating 10Gb/s connections till the cows come home. What are some of
> the virtues of IB over Ethernet (not Ethernet over IB)?
>
IB natively is very low latency and until not so long ago was also
significantly cheaper than respective Ethernet offerings.

With IPoIB you lose some of that latency advantage, but it's still quite
good. And with the advent of "cheap" whitebox as well as big-brand switches,
that cost advantage has been eroding, too.

Native IB support for Ceph has been in development for years, so don't
hold your breath there, though. 
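
For what it's worth, if you end up running IPoIB on those single ports
anyway, bringing the interface up is roughly this (just a sketch; the device
name, subnet and MTU below are assumptions, not something from your setup):

  # Sketch: bring up an IPoIB interface (ib0, subnet and MTU are assumptions)
  modprobe ib_ipoib
  echo connected > /sys/class/net/ib0/mode   # connected mode allows the large MTU
  ip link set ib0 mtu 65520
  ip addr add 10.10.10.11/24 dev ib0
  ip link set ib0 up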
 
> > Bluestore doesn't have journals per se and unless you're going to wait for
> > Luminous I wouldn't recommend using Bluestore in production.
> > Hell, I won't be using it any time soon, but anything pre L sounds
> > like outright channeling Murphy to smite you  
> 
> I do like to play with fire often, but not normally with other people's
> data. I suppose I will stay away from Bluestore for now, unless Luminous is
> released within the next few weeks. I am using it on  Kraken in my small
> test-cluster so far without a visit from Murphy.
> 
If you look at the ML archives, there seem to be plenty of problems
cropping up, some of them more serious than others.
And expect another slew when it goes mainstream and/or becomes the default.

> > That said, what SSD is it?
> > Bluestore WAL needs are rather small.
> > OTOH, a single SSD isn't something I'd recommend either, SPOF and all.  
> 
> > I'm guessing you have no budget to improve on that gift horse?  
> 
> It's a Micron 1100 256GB, rated for 120 TBW, which works out to about
> 100GB/day for 3 years, so not even 0.5 DWPD. I doubt it has the endurance to
> journal 36x 1TB drives.
> 
Yeah, no go with that.

> I do have some room in the budget, and NVMe journals have been on the back
> of my mind. These servers have 6 PCIe x8 slots in them, so tons of room.
> But then I'm going to get asked about a cache tier, which everyone seems to
> think is the holy grail (and probably would be, if they could 'just work')
> 
> But from what I read, they're an utter nightmare to manage, particularly
> without a well defined workload, and often would hurt more than they help.
> 
I'm very happy with them, but I do have a perfect match in terms of
workload, use case and experience. ^o^

But any way you shape it, those servers are underpowered CPU-wise, and
putting more things into them won't improve matters.
Building a cache-tier on dedicated nodes (that's what I do) is another
story.

To give you something to correlate CPU usage/needs against, I'm literally in
the middle of setting up a new cluster (small, as usual).

The 3 HDD storage nodes have 1 E5-1650 v3 @ 3.50GHz CPU (6 core/SMT = 12
linux cores), 64GB RAM, IB, 12 3TB SAS HDDs and 2 400GB DC S3710 SSDs for
OS and journals.

The 3 cache-tier nodes have 2 E5-2623 v3 @ 3.00GHz CPUs (4 cores/SMT), 64GB
RAM, IB and 5 800GB DC S3610 SSDs.

If you run this fio against a kernel mounted RBD image (same diff from a 
userspace Ceph VM):
"fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=randwrite --name=fiojob --blocksize=4k --iodepth=32"

one winds up with 50% CPU usage (that's half a pseudo-core) per OSD
process on the HDD storage nodes, because the HDDs are at 100% util and are
the bottleneck. The journal SSDs are bored at around 20%.
Still, half of those 3.5GHz cores (yes, it ramps up to full speed) is gone;
now relate that to your three times as many OSDs per node and slower CPUs...
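
(A sketch of one way to watch those numbers during such a run; it assumes
sysstat is installed, nothing more exotic:)

  # Per-OSD CPU and per-disk utilization while the fio job runs
  pidstat -u -p $(pgrep -d, -f ceph-osd) 5   # %CPU per ceph-osd process
  iostat -x 5                                # %util per HDD/SSD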

On the cache-tier pool that fio results in about 50% SSD utilization,
but with each OSD process now consuming about 210%, a full real core.
So this one is neither CPU nor storage limited; the latency/RTT is the
limiting factor here.

Fun fact, for large sequential writes the cache-tier is actually slightly
slower, due to the co-location of the journals on the OSD SSDs. 
For small, random IOPS that of course is not true. ^.^
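
(If you want to repeat the test, setting up the kernel-mounted RBD image
that the fio job above runs against is roughly this; the pool and image
names are just placeholders:)

  # Sketch: create, map and mount a test image via the kernel RBD client
  rbd create rbd/fiotest --size 10240      # 10GB image, size is in MB
  rbd map rbd/fiotest                      # returns e.g. /dev/rbd0
  mkfs.xfs /dev/rbd0
  mount /dev/rbd0 /mnt/fiotest
  cd /mnt/fiotest                          # run the fio job from here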

> I haven't spent a ton of time with the network gear that was dumped on me,
> but the switches I have now are a Nexus 7000, x4 Force10 S4810 (so I do
> have some stackable 10Gb that I can MC-LAG), x2 Mellanox IS5023 (18 port IB
> switch), what appears to be a giant IB switch (Qlogic 12800-120) and
> another apparently big boy (Qlogic 12800-180). I'm going to pick them up
> from the warehouse tomorrow.
> 
> If I stay away from IB completely, may just use the IB card as a 4x10GB +
> the 2x 10GB on board like I had originally mentioned. But if that IB 

Re: [ceph-users] design guidance

2017-06-06 Thread Daniel K
I started down that path and got so deep that I couldn't even find where I
went in. I couldn't make heads or tails out of what would or wouldn't work.

We didn't need multiple hosts accessing a single datastore, so on the
client side I just have a single VM guest running on each ESXi host, with
the CephFS filesystem mounted on it (via a 10Gb connection to the Ceph
environment), then exported via NFS on a host-only network and mounted
on the host.
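
(In case it helps anyone, the export VM side is roughly along these lines; a
sketch only, the monitor address, paths and host-only subnet are placeholders:)

  # 1) kernel-mount CephFS on the export VM
  mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret

  # 2) export it over NFS on the host-only network, via /etc/exports:
  #    /mnt/cephfs  192.168.100.0/24(rw,sync,no_subtree_check,fsid=1)
  exportfs -ra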

Not quite as redundant as it could be, but good enough for our usage. I'm
seeing ~500MB/s going to a 4-node cluster with 5x 1TB 7200rpm drives.
I tried it first in a similar config, except using LIO to export an RBD
device via iSCSI, still on the local host network. Write performance was
good, but read performance was only around 120MB/s. I didn't do much
troubleshooting, just tried NFS after that and was happy with it.

On Tue, Jun 6, 2017 at 2:33 AM, Adrian Saul 
wrote:

> > > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5
> > > and
> > > 6.0 hosts(migrating from a VMWare environment), later to transition to
> > > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio
> > > and saw much worse performance with the first cluster, so it seems
> > > this may be the better way, but I'm open to other suggestions.
> > >
> > I've never seen any ultimate solution to providing HA iSCSI on top of
> Ceph,
> > though other people here have made significant efforts.
>
> In our tests our best results were with SCST - also because it provided
> proper ALUA support at the time.  I ended up developing my own pacemaker
> cluster resources to manage the SCST orchestration and ALUA failover.  In
> our model we have  a pacemaker cluster in front being an RBD client
> presenting LUNs/NFS out to VMware (NFS), Solaris and Hyper-V (iSCSI).  We
> are using CephFS over NFS but performance has been poor, even using it just
> for VMware templates.  We are on an earlier version of Jewel, so it's
> possible some later versions may improve CephFS for that, but I have not had
> time to test it.
>
> We have been running a small production/POC for over 18 months on that
> setup, and gone live into a much larger setup in the last 6 months based on
> that model.  It's not without its issues, but most of that is a lack of
> test resources to be able to shake out some of the client compatibility and
> failover shortfalls we have.


Re: [ceph-users] design guidance

2017-06-06 Thread Daniel K
Christian,

Thank you for the tips -- I certainly googled my eyes out for a good while
before asking -- maybe my google-fu wasn't too good last night.

> I love using IB, alas with just one port per host you're likely best off
> ignoring it, unless you have a converged network/switches that can make
> use of it (or run it in Ethernet mode).

I've always heard people speak fondly of IB, but I've honestly never dealt
with it. I'm mostly a network guy at heart, so I'm perfectly comfortable
aggregating 10Gb/s connections till the cows come home. What are some of
the virtues of IB over Ethernet (not Ethernet over IB)?

> Bluestore doesn't have journals per se and unless you're going to wait for
> Luminous I wouldn't recommend using Bluestore in production.
> Hell, I won't be using it any time soon, but anything pre L sounds
> like outright channeling Murphy to smite you

I do like to play with fire often, but not normally with other people's
data. I suppose I will stay away from Bluestore for now, unless Luminous is
released within the next few weeks. I am using it on  Kraken in my small
test-cluster so far without a visit from Murphy.

> That said, what SSD is it?
> Bluestore WAL needs are rather small.
> OTOH, a single SSD isn't something I'd recommend either, SPOF and all.

> I'm guessing you have no budget to improve on that gift horse?

It's a Micron 1100 256GB, rated for 120 TBW, which works out to about
100GB/day for 3 years, so not even 0.5 DWPD. I doubt it has the endurance to
journal 36x 1TB drives.
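
(Back-of-the-envelope, in case anyone wants to check that math:)

  # 120 TBW spread over the 3-year warranty of a 256GB drive
  echo "scale=3; 120 * 1000 / (3 * 365)" | bc          # ~110 GB/day within warranty
  echo "scale=3; 120 * 1000 / (3 * 365) / 256" | bc    # ~0.43 drive writes per day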

I do have some room in the budget, and NVMe journals have been on the back
of my mind. These servers have 6 PCIe x8 slots in them, so tons of room.
But then I'm going to get asked about a cache tier, which everyone seems to
think is the holy grail (and probably would be, if they could 'just work')

But from what I read, they're an utter nightmare to manage, particularly
without a well defined workload, and often would hurt more than they help.

I haven't spent a ton of time with the network gear that was dumped on me,
but the switches I have now are a Nexus 7000, x4 Force10 S4810 (so I do
have some stackable 10Gb that I can MC-LAG), x2 Mellanox IS5023 (18 port IB
switch), what appears to be a giant IB switch (Qlogic 12800-120) and
another apparently big boy (Qlogic 12800-180). I'm going to pick them up
from the warehouse tomorrow.

If I stay away from IB completely, I may just use the IB card as 4x 10Gb +
the 2x 10Gb on board like I had originally mentioned. But if that IB gear
is good, I'd hate to see it go to waste. Might be worth getting a second IB
card for each server.



Again, thanks a million for the advice. I'd rather learn this the easy way
than to have to rebuild this 6 times over the next 6 months.






On Tue, Jun 6, 2017 at 2:05 AM, Christian Balzer  wrote:

>
> Hello,
>
> lots of similar questions in the past, google is your friend.
>
> On Mon, 5 Jun 2017 23:59:07 -0400 Daniel K wrote:
>
> > I've built 'my-first-ceph-cluster' with two of the 4-node, 12 drive
> > Supermicro servers and dual 10Gb interfaces(one cluster, one public)
> >
> > I now have 9x 36-drive supermicro StorageServers made available to me,
> each
> > with dual 10GB and a single Mellanox IB/40G nic. No 1G interfaces except
> > IPMI. 2x 6-core 6-thread 1.7ghz xeon processors (12 cores total) for 36
> > drives. Currently 32GB of ram. 36x 1TB 7.2k drives.
> >
> I love using IB, alas with just one port per host you're likely best off
> ignoring it, unless you have a converged network/switches that can make
> use of it (or run it in Ethernet mode).
>
> > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and
> > 6.0 hosts(migrating from a VMWare environment), later to transition to
> > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio and
> saw
> > much worse performance with the first cluster, so it seems this may be
> the
> > better way, but I'm open to other suggestions.
> >
> I've never seen any ultimate solution to providing HA iSCSI on top of
> Ceph, though other people here have made significant efforts.
>
> > Considerations:
> > Best practice documents indicate .5 cpu per OSD, but I have 36 drives and
> > 12 CPUs. Would it be better to create 18x 2-drive raid0 on the hardware
> > raid card to present a fewer number of larger devices to ceph? Or run
> > multiple drives per OSD?
> >
> You're definitely underpowered in the CPU department and I personally
> would make RAID1 or 10s for never having to re-balance an OSD.
> But if space is an issue, RAID0s would do.
> OTOH, w/o any SSDs in the game your HDD only cluster is going to be less
> CPU hungry than others.
>
> > There is a single 256GB SSD which I feel would be a bottleneck if I used
> it
> > as a journal for all 36 drives, so I believe bluestore with a journal on
> > each drive would be the best option.
> >
> Bluestore doesn't have journals per se and unless you're going to wait for
> Luminous I wouldn't recommend using 

Re: [ceph-users] design guidance

2017-06-06 Thread Maxime Guyot
Hi Daniel,

The flexibility of Ceph is that you can start with your current config,
scale out and upgrade (CPUs, journals etc...) as your performance
requirement increase.

6x 1.7GHz, are we speaking about the Xeon E5-2603L v4? Any chance to bump
that to a 2620 v4 or 2630 v4?
Test how the 6x 1.7GHz handles 36 OSDs, then based on that decide whether
to RAID0/LVM or not.
If you need large, low-performance block storage, it could be worthwhile to
do a hybrid setup with *some* OSDs on RAID0/LVM, as sketched below.
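
The LVM flavour of that would be roughly the following (a sketch only; the
device names and stripe size are assumptions):

  # Sketch: stripe two HDDs into one LV to back a single OSD
  pvcreate /dev/sdb /dev/sdc
  vgcreate vg_osd0 /dev/sdb /dev/sdc
  lvcreate -i 2 -I 256 -l 100%FREE -n lv_osd0 vg_osd0   # 2 stripes, 256K stripe size
  # then hand /dev/vg_osd0/lv_osd0 to ceph-disk as the OSD data device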

Since this is a virtualisation use case (VMware and KVM), did you consider
journals? That 256GB SATA SSD is not enough for 36 filestore journals.
Assuming those 256GB SSDs have a performance profile suitable for journaling,
a split between OSDs with SSD journals (20%) and OSDs with collocated
journals (80%) could be nice. Then you place the VMs in different tiers based
on their write-latency requirements.

If you have the budget for it, you can fit 3x PCIe SSD/NVMe cards into
those StorageServers; that would make a 1:12 journal ratio and give pretty
good write latency.
Another option is to start with filestore and then upgrade to Bluestore once
it is stable.
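
With filestore, the NVMe-journal variant is roughly this per OSD (a sketch;
the device names are placeholders):

  # Sketch: filestore OSD with its journal carved out of an NVMe card
  ceph-disk prepare /dev/sdb /dev/nvme0n1   # data device, then journal device
  ceph-disk activate /dev/sdb1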

IMO a single network for cluster and public is easier to manage. Since you
already have a 10G cluster, continue with that. Either:
1) If you are tight on 10G ports, do 2x 10G per node and skip the 40G NIC.
2) If you have plenty of ports, do 4x 10G per node: split the 40G NIC into
4x 10G.
13 servers (9+3) is usually too small to warrant more than a single ToR
setup, so you should be good with an LACP/MLAG pair of standard 10G switches
as ToR, which you probably already have?
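
On the node side the LACP bond is then just something like this (a sketch;
interface names and the address are placeholders):

  # Sketch: 802.3ad/LACP bond of two 10G ports
  ip link add bond0 type bond mode 802.3ad miimon 100
  ip link set eth2 down; ip link set eth2 master bond0
  ip link set eth3 down; ip link set eth3 master bond0
  ip link set bond0 up
  ip addr add 10.0.0.21/24 dev bond0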

Cheers,
Maxime

On Tue, 6 Jun 2017 at 08:33 Adrian Saul 
wrote:

> > > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5
> > > and
> > > 6.0 hosts(migrating from a VMWare environment), later to transition to
> > > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio
> > > and saw much worse performance with the first cluster, so it seems
> > > this may be the better way, but I'm open to other suggestions.
> > >
> > I've never seen any ultimate solution to providing HA iSCSI on top of
> Ceph,
> > though other people here have made significant efforts.
>
> In our tests our best results were with SCST - also because it provided
> proper ALUA support at the time.  I ended up developing my own pacemaker
> cluster resources to manage the SCST orchestration and ALUA failover.  In
> our model we have  a pacemaker cluster in front being an RBD client
> presenting LUNs/NFS out to VMware (NFS), Solaris and Hyper-V (iSCSI).  We
> are using CephFS over NFS but performance has been poor, even using it just
> for VMware templates.  We are on an earlier version of Jewel, so it's
> possible some later versions may improve CephFS for that, but I have not had
> time to test it.
>
> We have been running a small production/POC for over 18 months on that
> setup, and gone live into a much larger setup in the last 6 months based on
> that model.  It's not without its issues, but most of that is a lack of
> test resources to be able to shake out some of the client compatibility and
> failover shortfalls we have.


Re: [ceph-users] design guidance

2017-06-06 Thread Adrian Saul
> > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5
> > and
> > 6.0 hosts(migrating from a VMWare environment), later to transition to
> > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio
> > and saw much worse performance with the first cluster, so it seems
> > this may be the better way, but I'm open to other suggestions.
> >
> I've never seen any ultimate solution to providing HA iSCSI on top of Ceph,
> though other people here have made significant efforts.

In our tests our best results were with SCST - also because it provided proper 
ALUA support at the time.  I ended up developing my own pacemaker cluster 
resources to manage the SCST orchestration and ALUA failover.  In our model we 
have  a pacemaker cluster in front being an RBD client presenting LUNs/NFS out 
to VMware (NFS), Solaris and Hyper-V (iSCSI).  We are using CephFS over NFS but 
performance has been poor, even using it just for VMware templates.  We are on 
an earlier version of Jewel, so it's possible some later versions may improve 
CephFS for that, but I have not had time to test it.

We have been running a small production/POC for over 18 months on that setup, 
and gone live into a much larger setup in the last 6 months based on that 
model.  It's not without its issues, but most of that is a lack of test 
resources to be able to shake out some of the client compatibility and failover 
shortfalls we have.



Re: [ceph-users] design guidance

2017-06-06 Thread Christian Balzer

Hello,

lots of similar questions in the past, google is your friend.

On Mon, 5 Jun 2017 23:59:07 -0400 Daniel K wrote:

> I've built 'my-first-ceph-cluster' with two of the 4-node, 12 drive
> Supermicro servers and dual 10Gb interfaces(one cluster, one public)
> 
> I now have 9x 36-drive supermicro StorageServers made available to me, each
> with dual 10GB and a single Mellanox IB/40G nic. No 1G interfaces except
> IPMI. 2x 6-core 6-thread 1.7ghz xeon processors (12 cores total) for 36
> drives. Currently 32GB of ram. 36x 1TB 7.2k drives.
>
I love using IB, alas with just one port per host you're likely best off
ignoring it, unless you have a converged network/switches that can make
use of it (or run it in Ethernet mode).
 
> Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5 and
> 6.0 hosts(migrating from a VMWare environment), later to transition to
> qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio and saw
> much worse performance with the first cluster, so it seems this may be the
> better way, but I'm open to other suggestions.
> 
I've never seen any ultimate solution to providing HA iSCSI on top of
Ceph, though other people here have made significant efforts.

> Considerations:
> Best practice documents indicate .5 cpu per OSD, but I have 36 drives and
> 12 CPUs. Would it be better to create 18x 2-drive raid0 on the hardware
> raid card to present a fewer number of larger devices to ceph? Or run
> multiple drives per OSD?
> 
You're definitely underpowered in the CPU department and I personally
would make RAID1 or 10s for never having to re-balance an OSD.
But if space is an issue, RAID0s would do.
OTOH, w/o any SSDs in the game your HDD only cluster is going to be less
CPU hungry than others.
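
If the RAID card can't be talked into RAID10, the software equivalent per
OSD would be roughly this (a sketch; device names are placeholders):

  # Sketch: software RAID10 over 4 HDDs as the backing device for one OSD
  mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[b-e]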

> There is a single 256GB SSD which I feel would be a bottleneck if I used it
> as a journal for all 36 drives, so I believe bluestore with a journal on
> each drive would be the best option.
> 
Bluestore doesn't have journals per se and unless you're going to wait for
Luminous I wouldn't recommend using Bluestore in production.
Hell, I won't be using it any time soon, but anything pre L sounds
like outright channeling Murphy to smite you.

That said, what SSD is it? 
Bluestore WAL needs are rather small.
OTOH, a single SSD isn't something I'd recommend either, SPOF and all.

I'm guessing you have no budget to improve on that gift horse?

> Is 1.7Ghz too slow for what I'm doing?
> 
If you're going to have a lot of small I/Os it probably will be.

> I like the idea of keeping the public and cluster networks separate. 

I don't, at least not on a physical level when you pay for it by losing
redundancy.
Do you have 2 switches, are they MC-LAG capable (aka stackable)?

>Any
> suggestions on which interfaces to use for what? I could theoretically push
> 36Gb/s, figuring 125MB/s for each drive, but in reality will I ever see
> that? 
Not by a long shot, even with Bluestore. 
With the WAL and other bits on SSD and very kind write patterns, maybe
100MB/s per drive, but IIRC there were issues with current Bluestore and
performance as well.

>Perhaps bond the two 10GB and use them as the public, and the 40gb as
> the cluster network? Or split the 40gb in to 4x10gb and use 3x10GB bonded
> for each?
>
If you can actually split it up, see above, MC-LAG.
That will give you 60Gb/s, half that if a switch fails, and if it makes you
feel better, do the cluster and public networks with VLANs.
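
The Ceph side of that split is just a couple of lines (a sketch; the subnets
are placeholders for whatever VLANs you pick):

  # ceph.conf sketch -- subnets are placeholders for the respective VLANs
  [global]
      public network  = 10.10.1.0/24    # client-facing traffic
      cluster network = 10.10.2.0/24    # OSD replication/recovery traffic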

But that will cost you in not so cheap switch ports, of course.

Christian
> If there is a more appropriate venue for my request, please point me in
> that direction.
> 
> Thanks,
> Dan


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications