Hi,

What are likely causes for "slow requests" and "monclient: hunting for new
mon" messages? E.g.:

2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
...
2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 30.383883 secs
2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 30.383883 seconds old, received at 2013-02-12 16:27:15.508374: osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.000000000120 [write 921600~4096] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 30.383782 seconds old, received at 2013-02-12 16:27:15.508475: osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.000000000120 [write 987136~4096] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 30.383720 seconds old, received at 2013-02-12 16:27:15.508537: osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.000000000120 [write 1036288~8192] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 30.383684 seconds old, received at 2013-02-12 16:27:15.508573: osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.000000000122 [write 1454080~4096] 2.fff29a9a) v4 currently no flag points reached
2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 30.328986 seconds old, received at 2013-02-12 16:27:15.563271: osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.000000000122 [write 1482752~4096] 2.fff29a9a) v4 currently no flag points reached
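
In case it helps anyone dig through entries like these, here is a rough
Python sketch that pulls the fields out of each individual 'slow request'
line and counts how many blocked ops hit each object. It assumes each entry
sits on a single line in the log file and has exactly the shape shown in the
excerpt; the function names and the "locator" label for the 2.981cf6bc field
are my own, not anything from ceph.

```python
import re
from collections import Counter

# Pattern modelled only on the excerpt in this mail; other ceph versions
# may format these lines differently.
SLOW_REQUEST = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old, "
    r"received at (?P<received>\S+ \S+): "
    r"osd_op\((?P<client>\S+) (?P<object>\S+) \[(?P<ops>[^\]]*)\] "
    r"(?P<locator>\S+)\) v\d+ currently (?P<state>.*)$"
)


def parse_slow_request(line):
    """Return a dict of fields for one 'slow request' line, or None."""
    m = SLOW_REQUEST.search(line.rstrip("\n"))
    if m is None:
        return None
    fields = m.groupdict()
    fields["age"] = float(fields["age"])  # seconds the op has been blocked
    return fields


def count_by_object(lines):
    """Count individual slow requests per object name."""
    counts = Counter()
    for line in lines:
        fields = parse_slow_request(line)
        if fields is not None:
            counts[fields["object"]] += 1
    return counts
```

Feeding it an open log file, e.g. count_by_object(open(path)) with path
pointing at the osd log, should show whether the blocked writes cluster on a
few objects.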

I have a ceph 0.56.2 system with 3 boxes: two boxes each run a mon and a
single osd, and the 3rd box runs just a mon (see ceph.conf below). The boxes
are running an eclectic mix of self-compiled kernels: b2 is linux-3.4.6, b4
is linux-3.7.3 and b5 is linux-3.6.10.

On b5 / osd.1 the 'hunting' message appears in the osd log regularly, e.g.
190 times yesterday. The message doesn't appear at all on b4 / osd.0.

Both osd logs show the 'slow requests' messages. Generally these come in
waves, with 30-50 of the associated individual 'slow request' messages
coming in just after the initial 'slow requests' message. Each box saw
around 30 waves yesterday, with no obvious time correlation between the two.
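
For anyone wanting to make the same tally on their own logs, a minimal
Python sketch follows. It counts 'hunting' messages and 'slow requests'
summary lines per day, and assumes the line shapes match the excerpt above
and that each summary line marks one wave (which may overcount if the
summary repeats while ops stay blocked). The function name is hypothetical.

```python
import re
from collections import Counter

# Both patterns are modelled on the log excerpt in this mail: date, time,
# thread id, log level, then the message text.
PREFIX = r"^(\d{4}-\d{2}-\d{2}) \S+ \S+\s+\d+ "
HUNTING = re.compile(PREFIX + r"monclient: hunting for new mon")
WAVE = re.compile(PREFIX + r"log \[WRN\] : \d+ slow requests, ")


def summarise(lines):
    """Return (hunting, waves): per-day Counters keyed by 'YYYY-MM-DD'."""
    hunting = Counter()
    waves = Counter()
    for line in lines:
        m = HUNTING.match(line)
        if m:
            hunting[m.group(1)] += 1
            continue
        m = WAVE.match(line)
        if m:
            # Only the summary line matches; the individual
            # 'slow request ...' lines have no leading count.
            waves[m.group(1)] += 1
    return hunting, waves
```

Calling summarise(open(path)) on each osd log and comparing the two results
per day is roughly how the numbers above could be reproduced.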

The osd disks are generally cruising along at around 400-800 KB/s, with
occasional spikes up to 1.2-2 MB/s, with a mostly write load.

The gigabit network interfaces (2 per box, one public and one cluster) are
also cruising, with the busiest interface at less than 5% utilisation.

CPU utilisation is likewise low, with 90% or more idle and less than 3%
wait for io. There's plenty of free memory, 19 GB on b4 and 6 GB on b5.

Cheers,

Chris

----
ceph.conf
----
[global]
        auth supported = cephx
[mon]
[mon.b2]
        host = b2
        mon addr = 10.200.63.130:6789
[mon.b4]
        host = b4
        mon addr = 10.200.63.132:6789
[mon.b5]
        host = b5
        mon addr = 10.200.63.133:6789
[osd]
        osd journal size = 1000
        public network = 10.200.63.0/24
        cluster network = 192.168.254.0/24
[osd.0]
        host = b4
        public addr = 10.200.63.132
        cluster addr = 192.168.254.132
[osd.1]
        host = b5
        public addr = 10.200.63.133
        cluster addr = 192.168.254.133
----