I have 3 nodes, each running a MON and 30 OSDs. When I test my cluster with 
either rados bench or with fio via a 10GbE client using RBD, I get great 
initial speeds (>900 MB/s) and I max out my 10GbE links for a while. Then 
something goes wrong: performance falters and the cluster stops responding 
altogether. I'll see a monitor call for a new election, and then my OSDs mark 
each other down, complain that they've been wrongly marked down, and I get 
slow request warnings of >30 and >60 seconds. This eventually resolves itself 
and the cluster recovers, but the problem recurs right away. Sometimes fio 
will bail with an I/O error.
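
For reference, the commands I'm running look roughly like this (pool and 
image names are placeholders, and I vary block size and queue depth between 
runs):

        # RADOS-level write benchmark from the client
        rados bench -p testpool 60 write -b 4M -t 16

        # fio write test against an RBD image via librbd
        fio --name=rbd-write --ioengine=rbd --clientname=admin \
            --pool=testpool --rbdname=fio_test \
            --rw=write --bs=4M --iodepth=32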

How long the cluster runs before acting up varies: sometimes it's fine for 
hours, sometimes it fails after 10 seconds. Nothing significant shows up in 
dmesg. A snippet from ceph-osd.77.log (for example) is at: 
http://pastebin.com/Zb92Ei7a
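
When it acts up, I watch it with the standard ceph CLI, roughly along these 
lines (log paths assume the default layout):

        ceph -w                      # watch the cluster log for the election and mark-down storm
        ceph health detail           # list slow requests and the OSDs they're stuck on
        ceph osd tree | grep down    # see which OSDs are currently marked down
        grep -i 'wrongly marked' /var/log/ceph/ceph-osd.*.log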

I'm not sure why I can run at full speed for a while, or what goes wrong when 
it stops. Please help!

My nodes:
        Ubuntu 14.04 - Linux storage3 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
        2 x 6-core Xeon 2620s
        64GB RAM
        30 x 3TB Seagate ST3000DM001-1CH166
        6 x 128GB Samsung 840 Pro SSD
        1 x Dual port Broadcom NetXtreme II 5771x/578xx 10GbE