Hello
We have a cluster of 10 Ceph servers.
The cluster hosts an EC pool with a replicated SSD cache tier, used by
OpenStack Cinder for volume storage in our production environment.
For the last two days we have been observing messages like this in the logs:
2017-07-05 10:50:13.451987 osd.114 [WRN] slow request 1165.927215
seconds old, received at 2017-07-05 10:30:47.104746:
osd_op(osd.130.50779:43441 11.57a05c54
rbd_data.5bc14d3135d111a.0000000000000084 [copy-get max 8388608] snapc
0=[]
ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e50881) currently waiting for rw locks
In this example:
* osd.114 is on the HDD backend hosting the EC pool
* osd.130 is on the SSD cache tier
We've analyzed the logs and found that the RBD image listed above
[rbd_data.5bc14d3135d111a] has been causing problems from the very
beginning. The virtual machine backed by this volume (OpenStack uses the
Ceph cluster as backend storage for Cinder) is DOWN/STOPPED, so we
conclude the problem lies on the cluster side, not the client side.
Unfortunately, this results in a huge number of blocked requests and
growing RAM consumption; eventually the system restarts the OSD daemon
and the cycle repeats.
We've tried temporarily marking the problematic OSDs down, but the
problem just propagates to a different OSD pair.
Running "ceph daemon osd.<ID> dump_ops_in_flight" against a problematic
OSD causes the OSD to hang (the command never returns), and within a few
minutes the cluster marks that OSD down.
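For anyone trying to reproduce this: wrapping the admin-socket query in timeout(1) at least keeps the shell usable when the OSD hangs (the 5-second limit is an arbitrary choice of ours, and osd.114 is simply the OSD from the log above):

```shell
# Wrap the admin-socket query in timeout(1) so a hung OSD does not
# also hang the querying shell; 5 seconds is an arbitrary limit.
timeout 5 ceph daemon osd.114 dump_ops_in_flight \
  || echo "dump_ops_in_flight failed or timed out"

# dump_historic_ops shows recently completed slow ops and may still
# respond when dump_ops_in_flight does not.
timeout 5 ceph daemon osd.114 dump_historic_ops \
  || echo "dump_historic_ops failed or timed out"
```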
The SSD model used in the cache tier pool is the Samsung MZ7KM240.
Could anyone tell us what these log messages mean? Has anyone seen such
a problem and could help us diagnose/repair it?
Thanks for any help
-------------------------------------------------
Pawel Woszuk
PSNC, Poznan Supercomputing and Networking Center
Poznań, Poland
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com