Dear cephers,

I believe we are facing a bottleneck due to an inappropriate overall network 
design and would like to hear about your experience and recommendations. I 
start with a description of the urgent problem/question and follow up with 
more details/questions.

These observations are on our HPC home file system served with ceph. It has 12 
storage servers facing 550+ client servers.

Under high load, I start seeing "slow ping time" warnings with quite incredible 
latencies, and I suspect we have a network bottleneck. On the storage servers we 
have 6x10G LACP trunks. Clients are on single 10G NICs. We have separate VLANs 
for the front and back networks, but both go through all NICs in the same way, 
so, technically, it's just one cluster network shared with the clients. The 
aggregated bandwidth is sufficient for a single storage server's load (it 
roughly matches the disk controller's IO capacity). However, point-to-point 
connections are 10G only, and I believe we are starting to see clients saturate 
a 10G link and starve all other ceph cluster traffic that has to go through 
that link as well. This, in turn, leads to backlog effects with slow ops on 
unrelated OSDs, affecting the overall user experience. The number of OSDs 
reporting slow ping times is about the percentage one would expect if one or 
two 10G links are congested. It's usually just one storage server that acts up.
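
In case it helps to reproduce what I'm looking at: this is how I narrow down 
which OSDs/links are affected (assuming a Nautilus or newer release, where the 
heartbeat ping-time tracking is available; osd.12 is just a placeholder ID):

  # which OSDs are currently reporting slow heartbeat/ping times
  ceph health detail

  # ping-time statistics per peer OSD, split by front and back network
  # (run on the host of the affected OSD)
  ceph daemon osd.12 dump_osd_network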

I guess the users with aggressive workloads getting the full bandwidth are 
happy, but everyone else is complaining. What I observe is that one or two 
clients can DoS everyone else. I typically see very high read bandwidth from 
only a few OSDs, and my suspicion is that this is a large job of 50-100 nodes 
starting the same application at the same time, for example, 50-100 clients 
reading the same executable simultaneously. I see 5-6GB/s and up to 10K IOPS 
read, which is really good in principle. Except that it is not fairly shared 
with other users.

Question: I am starting to consider enabling QoS on the switches for the 
traffic between storage servers and would like to know if anyone is doing this 
and what the experience is. Unfortunately, our network design is probably 
flawed and now makes this difficult; see below.

More Info.

Our FS data pool is EC 8+2 and I have fast_read enabled. Hence, the network 
traffic amplification for both read and write is quite substantial.
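
To put a rough number on that amplification (back-of-the-envelope, my own 
estimate, ignoring protocol overhead and the replicated meta-data pool):

  reading 1 GB from the EC 8+2 data pool with fast_read:
    shard data fetched by the primaries:  10 x 1/8 GB  = 1.25 GB
    thereof over the back network:        ~9 x 1/8 GB ~= 1.13 GB (one shard is local)
    shipped to the client (front):         1.00 GB
    total wire traffic:                   ~2.1 GB per 1 GB read
  (writes look similar: 1 GB from the client plus ~1.13 GB of shards to peers)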

Our network is a spine-leaf architecture where ceph servers and ceph clients 
are distributed more or less equally over the leaf switches. I'm afraid this is 
the first flaw in the design, because storage servers and clients compete for 
the same switches and the clients greatly outnumber the storage servers. It 
also makes implementing QoS a real pain, whereas it could be simple traffic 
shaping on an uplink trunk to the clients if the storage servers were isolated.

This is the first design question: Isolated storage cluster providing service 
via uplinks/gateways versus "integrated/hyper-converged" where storage servers 
and clients are distributed equally over a spine-leaf architecture. Pros and 
cons?

We have a 100G spine VLT pair with ports configured as 40G. Uplinks from the 
leafs are 2x40G; in fact, we have the leafs configured as VLT pairs for HA as 
well. A pair has 2x2x40G uplinks and 2x40G VLT interlinks. There are 2 ceph 
servers per VLT leaf pair and ca. 85+ client servers on the same pair. There 
are also clients on leaf switches without ceph servers. I don't think the 40G 
uplinks are congested, but you never know.

We started with the ceph servers having 15 HDDs for fs data and 1 SSD for fs 
meta-data each. With this configuration, the disk speed was the bottleneck and 
I observed slow ops under high load, but everything was more or less stable. I 
recently changed an MDS setting that greatly improved both client performance 
and the clients' ability to overload the OSDs. In addition, one week ago I 
added 20 HDDs in a JBOD per host, which more than doubled the HDD throughput. 
Together, these two improvements have the counter-intuitive effect that 
aggregate performance has tripled compared with 2 months ago, but the user 
experience is very erratic. My suspicion is, as explained above, that each 
server can now handle a volume of traffic that easily saturates a 10G link, 
leading to observations that seem to indicate insufficient network capacity 
whenever too many client/cluster requests go through the same 10G link.
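
What makes me fairly confident about the single-link saturation is watching 
the individual bond members on a storage server during such an episode 
(standard Linux tooling, assuming the 6x10G trunk is a kernel bond; interface 
names are examples):

  # per-interface throughput, including the individual LACP members
  sar -n DEV 5

  # bonding mode, hash policy and member state
  cat /proc/net/bonding/bond0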

In essence, we increased aggregate performance greatly, but users complain more 
than ever.

I suspect that this imbalance between server throughput capability and the 10G 
point-to-point limit is a problem. However, I cannot change the networking and 
would like some advice on how similar set-ups are configured and whether QoS 
can help. My idea is to enable dot1p layer-2 QoS and give traffic coming from 
the ports the storage servers are connected to a higher priority than traffic 
coming from everywhere else. I know it would be a lot simpler if the storage 
cluster were isolated, but I have to deal with the situation as is for now. Any 
advice and experience is highly appreciated.

If I do it, should I apply QoS on both the front and back networks, or is QoS 
on the back-network VLAN enough? Note that the MONs are only on the front 
network.
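
If classifying by switch port turns out to be awkward, an alternative I'm 
considering is marking the traffic on the storage servers themselves and 
letting the switches trust the marking and map it to dot1p. A sketch only, not 
tested; it assumes the default OSD port range 6800-7300, a VLAN sub-interface 
on the bond for the cluster network (bond0.101 is a made-up name), and a DSCP 
class still to be agreed with the network team:

  # on each storage server: mark outgoing ceph OSD traffic on the cluster
  # VLAN with a high-priority DSCP class (CS4 is a placeholder choice)
  iptables -t mangle -A POSTROUTING -o bond0.101 -p tcp --sport 6800:7300 \
      -j DSCP --set-dscp-class CS4

  # repeat without the -o restriction (or on the front VLAN) if the front
  # network should be prioritized as well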

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
