Hi,

I recently got my first Ceph cluster up and running and have been doing some 
stress tests. I quickly found that during sequential write benchmarks the 
throughput would often drop to zero. Initially I saw this inside QEMU virtual 
machines, but I can also reproduce the issue with "rados bench" within 5-10 
minutes of sustained writes.  If left alone the writes will eventually start 
going again, but it takes quite a while (at least a couple of minutes). If I 
stop and restart the benchmark, the throughput immediately goes back to where 
it should be.
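
For reference, the runs that hit it look roughly like this (the pool name, 
runtime, and thread count here are just placeholders for what I used, nothing 
exotic):

    rados bench -p rbd 600 write -t 16 --no-cleanup

The --no-cleanup just leaves the benchmark objects in place afterwards; the 
stall happens during the write phase either way.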

I have convinced myself it is not a network hardware issue.  I can load up the 
network with a bunch of parallel iperf benchmarks and it keeps chugging along 
happily. When the issue occurs with Ceph I don't see any indications of network 
issues (e.g. dropped packets).  Adding additional network load during the rados 
bench (using iperf) doesn't seem to trigger the issue any faster or more often.
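
The iperf runs were along these lines (host names and the stream/duration 
numbers are just examples of what I tried), with several of them going at once:

    # on one OSD server
    iperf -s

    # on another host, 8 parallel streams for 10 minutes
    iperf -c <osd-host> -P 8 -t 600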

I have also convinced myself it isn't an issue with a journal getting full or 
an OSD being too busy.  The amount of data being written before the problem 
occurs is much larger than the total journal capacity. Watching the load on the 
OSD servers with top/iostat, I don't see anything being overloaded; rather, 
the load everywhere drops to essentially zero when the writes stall. Before
the writes stall the load is well distributed with no visible hot spots. The 
OSDs and hosts that report slow requests are random, so I don't think it is a 
failing disk or server.  I don't see anything interesting going on in the logs 
so far (I am just about to do some tests with Ceph's debug logging cranked up).
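
For what it's worth, this is roughly what I've been watching while it stalls, 
and what I'm planning to turn up for the debug runs (the debug levels below 
are just a first guess, suggestions welcome):

    ceph -s
    ceph health detail       # shows which OSDs are reporting slow requests
    iostat -x 1              # on each OSD server, alongside top
    ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 1'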

The cluster specs are:

OS: Ubuntu 14.04 with 3.16 kernel
Ceph: 9.1.0
OSD Filesystem: XFS
Replication: 3X
Two racks with IPoIB network
10Gbps Ethernet between racks
8 OSD servers with:
  * Dual Xeon E5-2630L (12 cores @ 2.4GHz)
  * 128GB RAM
  * 12 x 6TB Seagate drives (connected to an LSI 2208 controller in JBOD mode)
  * Two 400GB Intel P3600 NVMe drives (OS on a RAID1 partition, plus 6 OSD 
journal partitions on each drive)
  * Mellanox ConnectX-3 NIC (for both Infiniband and 10Gbps Ethernet)
3 mons co-located on OSD servers

Any advice is greatly appreciated. I am planning to try this with Hammer too.

Thanks,
Brendan