I have 3 nodes each running a MON and 30 OSDs. When I test my cluster with
either rados bench or with fio via a 10GbE client using RBD I get great initial
speeds >900MBps and I max out my 10GbE links for a while. Then something goes
wrong: performance falters and the cluster stops responding altogether.
I'll see a monitor call for a new election, then my OSDs mark each other
down, complain that they've been wrongly marked down, and I get slow-request
warnings of >30 and >60 seconds. This eventually resolves itself and the
cluster recovers, but the problem recurs right away. Sometimes, via fio, I'll
get an I/O error and it bails.
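For reference, the benchmarks I'm running look roughly like the following (the
pool and image names here are placeholders, not my actual setup):

```shell
# Sequential 4MB writes directly against a pool with rados bench
# ("testpool" is a placeholder pool name)
rados bench -p testpool 60 write -t 16 --no-cleanup

# Roughly equivalent fio run through librbd
# ("bench" is a placeholder RBD image; client.admin keyring assumed)
fio --name=rbdtest --ioengine=rbd --clientname=admin \
    --pool=testpool --rbdname=bench \
    --rw=write --bs=4M --iodepth=16 --runtime=60 --time_based
```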
The amount of time before the cluster starts acting up varies: sometimes it
runs fine for hours, sometimes it fails after 10 seconds. Nothing significant shows
up in dmesg. A snippet from ceph-osd.77.log (for example) is at:
http://pastebin.com/Zb92Ei7a
I'm not sure why I can run at full speed for a little while, or what goes
wrong when it stops working. Please help!
My nodes:
Ubuntu 14.04 - Linux storage3 3.13.0-32-generic #57-Ubuntu SMP Tue Jul
15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
2 x 6-core Xeon 2620s
64GB RAM
30 x 3TB Seagate ST3000DM001-1CH166
6 x 128GB Samsung 840 Pro SSD
1 x Dual port Broadcom NetXtreme II 5771x/578xx 10GbE
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com