Hello!

On Fri, Oct 30, 2015 at 09:30:40PM +0000, moloney wrote:

> Hi,

> I recently got my first Ceph cluster up and running and have been doing some 
> stress tests. I quickly found that during sequential write benchmarks the 
> throughput would often drop to zero. Initially I saw this inside QEMU virtual 
> machines, but I can also reproduce the issue with "rados bench" within 5-10 
> minutes of sustained writes.  If left alone the writes will eventually start 
> going again, but it takes quite a while (at least a couple minutes). If I 
> stop and restart the benchmark the write throughput will immediately be where 
> it is supposed to be.
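> 
> For reference, the rados bench run is just a plain sequential-write test
> against a scratch pool, along these lines (pool name and thread count here
> are illustrative, not the exact invocation):
> 
>   # 10 minutes of 4MB sequential writes with 16 concurrent ops
>   rados bench -p testpool 600 write -t 16 --no-cleanup
>   # remove the benchmark objects afterwards
>   rados -p testpool cleanup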

> I have convinced myself it is not a network hardware issue.  I can load up 
> the network with a bunch of parallel iperf benchmarks and it keeps chugging 
> along happily. When the issue occurs with Ceph I don't see any indications of 
> network issues (e.g. dropped packets).  Adding additional network load during 
> the rados bench (using iperf) doesn't seem to trigger the issue any faster or 
> more often.
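> 
> The parallel iperf load was along these lines (hostnames are placeholders):
> 
>   # on one OSD server
>   iperf -s
>   # from several other hosts at once: 8 parallel streams for 5 minutes
>   iperf -c osd-server-1 -P 8 -t 300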

> I have also convinced myself it isn't an issue with a journal getting full or 
> an OSD being too busy.  The amount of data being written before the problem 
> occurs is much larger than the total journal capacity. Watching the load on 
> the OSD servers with top/iostat I don't seen anything being overloaded, 
> rather I see the load everywhere drop to essentially zero when the writes 
> stall. Before the writes stall the load is well distributed with no visible 
> hot spots. The OSDs and hosts that report slow requests are random, so I 
> don't think it is a failing disk or server.  I don't see anything interesting 
> going on in the logs so far (I am just about to do some tests with Ceph's 
> debug logging cranked up).
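> 
> The checks I have in mind are roughly the following (osd.12 is just an
> example id; the "ceph daemon" command has to run on the host that carries
> that OSD):
> 
>   # which OSDs are currently reporting slow/blocked requests
>   ceph health detail
>   # recent slow ops recorded by a suspect OSD
>   ceph daemon osd.12 dump_historic_ops
>   # crank up messenger and OSD debug logging cluster-wide
>   ceph tell osd.* injectargs '--debug_ms 1 --debug_osd 20'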

> The cluster specs are:

> OS: Ubuntu 14.04 with 3.16 kernel
> Ceph: 9.1.0
> OSD Filesystem: XFS
> Replication: 3X
> Two racks with IPoIB network
> 10Gbps Ethernet between racks
> 8 OSD servers with:
>   * Dual Xeon E5-2630L (12 cores @ 2.4GHz)
>   * 128GB RAM
>   * 12 6TB Seagate drives (connected to LSI 2208 chip in JBOD mode)
>   * Two 400GB Intel P3600 NVMe drives (OS on a RAID1 partition, plus 6 OSD 
> journal partitions on each)
>   * Mellanox ConnectX-3 NIC (for both Infiniband and 10Gbps Ethernet)
> 3 Mons collocated on OSD servers

> Any advice is greatly appreciated. I am planning to try this with Hammer too.

I had the same trouble with Hammer, Ubuntu 14.04 and the 3.19 kernel on Supermicro
X9DRL-3F/iF boards with Intel 82599ES NICs, bonded into one link to 2 different
Cisco Nexus 5020 switches. It was finally fixed by dropping the MTU from jumbo
frames (>1500) back to 1500. It did work with an MTU of 9000 and the following
sysctls, but after several weeks the trouble came back and I had to drop the
MTU again:

net.ipv4.tcp_rmem = 1024000 8738000 1677721600
net.ipv4.tcp_wmem = 1024000 8738000 1677721600
net.ipv4.tcp_mem = 1024000 8738000 1677721600
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_max_syn_backlog = 150000
net.ipv4.tcp_congestion_control=htcp
net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_low_latency = 1
vm.swappiness = 1
net.ipv4.tcp_moderate_rcvbuf = 0
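
If it helps, the MTU change itself was nothing special, roughly the following
(bond0 stands for whatever the bonded interface is called on your side,
peer-host is a placeholder):

  # drop the bonded interface back to the standard MTU
  ip link set dev bond0 mtu 1500
  # while still on jumbo frames, check that the path really passes them
  # (8972 = 9000 minus 28 bytes of IP + ICMP headers)
  ping -M do -s 8972 peer-host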


> Thanks,
> Brendan


-- 
WBR, Max A. Krasilnikov
ColoCall Data Center
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
