Re: endless flying slow requests

2012-12-02 Thread Samuel Just
Looks good, pushed.
-Sam

On Sat, Dec 1, 2012 at 7:18 PM, Sage Weil s...@inktank.com wrote:
 I pushed another fix that plugs a leak when requests race with peering
 that fixed the 'slow request' for osd_sub_op messages for me.  It's in
 wip-osd-leak.

 Sam, when you get a chance, can you take a look and push it to next if it
 looks okay?

 Thanks!
 sage


 On Fri, 30 Nov 2012, Samuel Just wrote:

 I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
 Let me know if the problem persists with this patch.
 -Sam

 On Wed, Nov 28, 2012 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote:
  On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
  Also, these clusters aren't mixed argonaut and next, are they?  (Not
  that that shouldn't work, but it would be a useful data point.)
  -Sam
 
  On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
  Did you observe hung io along with that error?  Both sub_op_commit and
  sub_op_applied have happened, so the sub_op_reply should have been
  sent back to the primary.  This looks more like a leak.  If you also
  observed hung io, then it's possible that the problem is occurring
  between the sub_op_applied event and the response.
  -Sam
 
 
  It is relatively easy to check whether one of the client VMs has locked one or
  more cores in iowait or is simply hung, so yes, these ops correspond to real
  commit operations and they are hung.
  I'm using an all-new 0.54 cluster, without mixing of course. Did everyone who
  hit this bug make changes to the cluster (say, within the preceding day) before
  the bug showed itself?
 
  On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
  On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
  Hi Stefan,
 
  On Thu, 15 Nov 2012, Sage Weil wrote:
  On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
   On 14.11.2012 15:59, Sage Weil wrote:
Hi Stefan,
   
It would be nice to confirm that no clients are waiting on replies for
these requests; currently we suspect that the OSD request tracking is the
buggy part.  If you query the OSD admin socket you should be able to dump
requests and see the client IP, and then query the client.

Is it librbd?  In that case you likely need to change the config so that
it is listening on an admin socket ('admin socket = path').
  
   Yes it is. So i have to specify admin socket at the KVM host?
 
  Right.  IIRC the disk line is a ; (or \;) separated list of key/value
  pairs.
 
   How do i query the admin socket for requests?
 
  ceph --admin-daemon /path/to/socket help
  ceph --admin-daemon /path/to/socket objecter_dump (i think)
 
  Were you able to reproduce this?
 
  Thanks!
  sage
 
  Meanwhile, I did. :)
  Such requests will always be created if you have restarted or marked
  an osd out and then back in and scrub didn`t happen in the meantime
  (after such operation and before request arrival).
  What is more interesting, the hangup happens not exactly at the time
  of operation, but tens of minutes later.
 
{ "description": "osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[])",
  "received_at": "2012-11-28 03:54:43.094151",
  "age": "27812.942680",
  "duration": "2.676641",
  "flag_point": "started",
  "events": [
        { "time": "2012-11-28 03:54:43.094222",
          "event": "waiting_for_osdmap"},
        { "time": "2012-11-28 03:54:43.386890",
          "event": "reached_pg"},
        { "time": "2012-11-28 03:54:43.386894",
          "event": "started"},
        { "time": "2012-11-28 03:54:43.386973",
          "event": "commit_queued_for_journal_write"},
        { "time": "2012-11-28 03:54:45.360049",
          "event": "write_thread_in_journal_buffer"},
        { "time": "2012-11-28 03:54:45.586183",
          "event": "journaled_completion_queued"},
        { "time": "2012-11-28 03:54:45.586262",
          "event": "sub_op_commit"},
        { "time": "2012-11-28 03:54:45.770792",
          "event": "sub_op_applied"}]}]}
 
 
 
 
 
  sage
 
  
   Stefan
  
  
On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
   
 Hello list,

 i see this several times. Endless flying slow requests. And 
 they never
 stop
 until i restart the mentioned osd.

 2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 
 included below;
 oldest blocked for  31789.858457 secs
 2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 
 31789.858457 seconds
 old,
 received at 2012-11-14 01:22:07.654922: 
 osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 

Re: endless flying slow requests

2012-12-01 Thread Sage Weil
I pushed another fix that plugs a leak when requests race with peering 
that fixed the 'slow request' for osd_sub_op messages for me.  It's in 
wip-osd-leak.
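
To test that branch from source, something along these lines should work (the
repository URL and build layout are assumptions; adjust to your own checkout):

  git clone https://github.com/ceph/ceph.git
  cd ceph
  git checkout wip-osd-leak    # branch named above; it may later be merged into next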

Sam, when you get a chance, can you take a look and push it to next if it 
looks okay?

Thanks!
sage


On Fri, 30 Nov 2012, Samuel Just wrote:

 I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
 Let me know if the problem persists with this patch.
 -Sam
 
 On Wed, Nov 28, 2012 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote:
  On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
  Also, these clusters aren't mixed argonaut and next, are they?  (Not
  that that shouldn't work, but it would be a useful data point.)
  -Sam
 
  On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
  Did you observe hung io along with that error?  Both sub_op_commit and
  sub_op_applied have happened, so the sub_op_reply should have been
  sent back to the primary.  This looks more like a leak.  If you also
  observed hung io, then it's possible that the problem is occurring
  between the sub_op_applied event and the response.
  -Sam
 
 
  It is relatively easy to check whether one of the client VMs has locked one or
  more cores in iowait or is simply hung, so yes, these ops correspond to real
  commit operations and they are hung.
  I'm using an all-new 0.54 cluster, without mixing of course. Did everyone who
  hit this bug make changes to the cluster (say, within the preceding day) before
  the bug showed itself?
 
  On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
  On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
  Hi Stefan,
 
  On Thu, 15 Nov 2012, Sage Weil wrote:
  On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
   On 14.11.2012 15:59, Sage Weil wrote:
Hi Stefan,
   
It would be nice to confirm that no clients are waiting on replies for
these requests; currently we suspect that the OSD request tracking is the
buggy part.  If you query the OSD admin socket you should be able to dump
requests and see the client IP, and then query the client.

Is it librbd?  In that case you likely need to change the config so that
it is listening on an admin socket ('admin socket = path').
  
   Yes it is. So i have to specify admin socket at the KVM host?
 
  Right.  IIRC the disk line is a ; (or \;) separated list of key/value
  pairs.
 
   How do i query the admin socket for requests?
 
  ceph --admin-daemon /path/to/socket help
  ceph --admin-daemon /path/to/socket objecter_dump (i think)
 
  Were you able to reproduce this?
 
  Thanks!
  sage
 
  Meanwhile, I did. :)
  Such requests will always be created if you have restarted or marked
  an osd out and then back in and scrub didn`t happen in the meantime
  (after such operation and before request arrival).
  What is more interesting, the hangup happens not exactly at the time
  of operation, but tens of minutes later.
 
{ "description": "osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[])",
  "received_at": "2012-11-28 03:54:43.094151",
  "age": "27812.942680",
  "duration": "2.676641",
  "flag_point": "started",
  "events": [
        { "time": "2012-11-28 03:54:43.094222",
          "event": "waiting_for_osdmap"},
        { "time": "2012-11-28 03:54:43.386890",
          "event": "reached_pg"},
        { "time": "2012-11-28 03:54:43.386894",
          "event": "started"},
        { "time": "2012-11-28 03:54:43.386973",
          "event": "commit_queued_for_journal_write"},
        { "time": "2012-11-28 03:54:45.360049",
          "event": "write_thread_in_journal_buffer"},
        { "time": "2012-11-28 03:54:45.586183",
          "event": "journaled_completion_queued"},
        { "time": "2012-11-28 03:54:45.586262",
          "event": "sub_op_commit"},
        { "time": "2012-11-28 03:54:45.770792",
          "event": "sub_op_applied"}]}]}
 
 
 
 
 
  sage
 
  
   Stefan
  
  
On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
   
 Hello list,

 i see this several times. Endless flying slow requests. And they 
 never
 stop
 until i restart the mentioned osd.

 2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 
 included below;
 oldest blocked for  31789.858457 secs
 2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 
 31789.858457 seconds
 old,
 received at 2012-11-14 01:22:07.654922: 
 osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 
 included below;
 oldest blocked for  31790.858646 secs
 2012-11-14 10:11:58.513586 

Re: endless flying slow requests

2012-11-30 Thread Samuel Just
I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
Let me know if the problem persists with this patch.
-Sam
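
For those building from source, one way to confirm a checkout already contains
that commit is a sketch like this (it assumes the ceph.git 'next' branch is
available as origin/next):

  git fetch origin next
  git merge-base --is-ancestor 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7 origin/next \
      && echo fix present || echo fix missing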

On Wed, Nov 28, 2012 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
 Also, these clusters aren't mixed argonaut and next, are they?  (Not
 that that shouldn't work, but it would be a useful data point.)
 -Sam

 On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
 Did you observe hung io along with that error?  Both sub_op_commit and
 sub_op_applied have happened, so the sub_op_reply should have been
 sent back to the primary.  This looks more like a leak.  If you also
 observed hung io, then it's possible that the problem is occurring
 between the sub_op_applied event and the response.
 -Sam


 It is relatively easy to check whether one of the client VMs has locked one or
 more cores in iowait or is simply hung, so yes, these ops correspond to real
 commit operations and they are hung.
 I'm using an all-new 0.54 cluster, without mixing of course. Did everyone who
 hit this bug make changes to the cluster (say, within the preceding day) before
 the bug showed itself?

 On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  On 14.11.2012 15:59, Sage Weil wrote:
   Hi Stefan,
  
   It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is the
   buggy part.  If you query the OSD admin socket you should be able to dump
   requests and see the client IP, and then query the client.

   Is it librbd?  In that case you likely need to change the config so that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted or marked
 an osd out and then back in and scrub didn`t happen in the meantime
 (after such operation and before request arrival).
 What is more interesting, the hangup happens not exactly at the time
 of operation, but tens of minutes later.

{ "description": "osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[])",
  "received_at": "2012-11-28 03:54:43.094151",
  "age": "27812.942680",
  "duration": "2.676641",
  "flag_point": "started",
  "events": [
        { "time": "2012-11-28 03:54:43.094222",
          "event": "waiting_for_osdmap"},
        { "time": "2012-11-28 03:54:43.386890",
          "event": "reached_pg"},
        { "time": "2012-11-28 03:54:43.386894",
          "event": "started"},
        { "time": "2012-11-28 03:54:43.386973",
          "event": "commit_queued_for_journal_write"},
        { "time": "2012-11-28 03:54:45.360049",
          "event": "write_thread_in_journal_buffer"},
        { "time": "2012-11-28 03:54:45.586183",
          "event": "journaled_completion_queued"},
        { "time": "2012-11-28 03:54:45.586262",
          "event": "sub_op_commit"},
        { "time": "2012-11-28 03:54:45.770792",
          "event": "sub_op_applied"}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they 
never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 
included below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 
included below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 
included below;
oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 

Re: endless flying slow requests

2012-11-28 Thread Samuel Just
Did you observe hung io along with that error?  Both sub_op_commit and
sub_op_applied have happened, so the sub_op_reply should have been
sent back to the primary.  This looks more like a leak.  If you also
observed hung io, then it's possible that the problem is occurring
between the sub_op_applied event and the response.
-Sam
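
One way to check that from the admin sockets (a sketch only; the socket paths
are assumptions, and 'help' lists the exact command names on your build):

  # on the primary OSD for the affected PG
  ceph --admin-daemon /var/run/ceph/ceph-osd.<primary-id>.asok dump_ops_in_flight
  # on the replica that produced the dump quoted below
  ceph --admin-daemon /var/run/ceph/ceph-osd.<replica-id>.asok dump_ops_in_flight

If the client op is still listed on the primary while the replica already shows
sub_op_applied, the sub_op_reply most likely never made it back.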

On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  On 14.11.2012 15:59, Sage Weil wrote:
   Hi Stefan,
  
   It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is the
   buggy part.  If you query the OSD admin socket you should be able to dump
   requests and see the client IP, and then query the client.

   Is it librbd?  In that case you likely need to change the config so that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted or marked
 an osd out and then back in and scrub didn`t happen in the meantime
 (after such operation and before request arrival).
 What is more interesting, the hangup happens not exactly at the time
 of operation, but tens of minutes later.

{ "description": "osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[])",
  "received_at": "2012-11-28 03:54:43.094151",
  "age": "27812.942680",
  "duration": "2.676641",
  "flag_point": "started",
  "events": [
        { "time": "2012-11-28 03:54:43.094222",
          "event": "waiting_for_osdmap"},
        { "time": "2012-11-28 03:54:43.386890",
          "event": "reached_pg"},
        { "time": "2012-11-28 03:54:43.386894",
          "event": "started"},
        { "time": "2012-11-28 03:54:43.386973",
          "event": "commit_queued_for_journal_write"},
        { "time": "2012-11-28 03:54:45.360049",
          "event": "write_thread_in_journal_buffer"},
        { "time": "2012-11-28 03:54:45.586183",
          "event": "journaled_completion_queued"},
        { "time": "2012-11-28 03:54:45.586262",
          "event": "sub_op_commit"},
        { "time": "2012-11-28 03:54:45.770792",
          "event": "sub_op_applied"}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31793.859124 secs
2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 
seconds
old,
received at 2012-11-14 01:22:07.654922: 

Re: endless flying slow requests

2012-11-28 Thread Samuel Just
Also, these clusters aren't mixed argonaut and next, are they?  (Not
that that shouldn't work, but it would be a useful data point.)
-Sam

On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
 Did you observe hung io along with that error?  Both sub_op_commit and
 sub_op_applied have happened, so the sub_op_reply should have been
 sent back to the primary.  This looks more like a leak.  If you also
 observed hung io, then it's possible that the problem is occurring
 between the sub_op_applied event and the response.
 -Sam

 On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  On 14.11.2012 15:59, Sage Weil wrote:
   Hi Stefan,
  
   It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is the
   buggy part.  If you query the OSD admin socket you should be able to dump
   requests and see the client IP, and then query the client.

   Is it librbd?  In that case you likely need to change the config so that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted or marked
 an osd out and then back in and scrub didn`t happen in the meantime
 (after such operation and before request arrival).
 What is more interesting, the hangup happens not exactly at the time
 of operation, but tens of minutes later.

{ "description": "osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[])",
  "received_at": "2012-11-28 03:54:43.094151",
  "age": "27812.942680",
  "duration": "2.676641",
  "flag_point": "started",
  "events": [
        { "time": "2012-11-28 03:54:43.094222",
          "event": "waiting_for_osdmap"},
        { "time": "2012-11-28 03:54:43.386890",
          "event": "reached_pg"},
        { "time": "2012-11-28 03:54:43.386894",
          "event": "started"},
        { "time": "2012-11-28 03:54:43.386973",
          "event": "commit_queued_for_journal_write"},
        { "time": "2012-11-28 03:54:45.360049",
          "event": "write_thread_in_journal_buffer"},
        { "time": "2012-11-28 03:54:45.586183",
          "event": "journaled_completion_queued"},
        { "time": "2012-11-28 03:54:45.586262",
          "event": "sub_op_commit"},
        { "time": "2012-11-28 03:54:45.770792",
          "event": "sub_op_applied"}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they 
never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1 

RE: endless flying slow requests

2012-11-28 Thread Moore, Shawn M
I actually have this exact issue with hung io during these slow requests.  The
odd thing is that when I restarted the osd being mentioned via ceph -w, io did
not resume.  A bit later (unsure of the exact time) another osd started being
listed in ceph -w, and when I restarted that different osd, io came back.  I do
believe these occur when I reboot nodes.  I never reboot more than one at a
time, and I have 3 replicas in the cluster.

I have a pure ceph version 0.53 
(commit:2528b5ee105b16352c91af064af5c0b5a7d45d7c) cluster.

Regards,
Shawn


From: ceph-devel-ow...@vger.kernel.org [ceph-devel-ow...@vger.kernel.org] on 
behalf of Samuel Just [sam.j...@inktank.com]
Sent: Wednesday, November 28, 2012 16:12
To: Andrey Korolyov
Cc: Sage Weil; Stefan Priebe - Profihost AG; ceph-devel@vger.kernel.org
Subject: Re: endless flying slow requests

Also, these clusters aren't mixed argonaut and next, are they?  (Not
that that shouldn't work, but it would be a useful data point.)
-Sam

On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
 Did you observe hung io along with that error?  Both sub_op_commit and
 sub_op_applied have happened, so the sub_op_reply should have been
 sent back to the primary.  This looks more like a leak.  If you also
 observed hung io, then it's possible that the problem is occurring
 between the sub_op_applied event and the response.
 -Sam

 On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  On 14.11.2012 15:59, Sage Weil wrote:
   Hi Stefan,
  
   It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is the
   buggy part.  If you query the OSD admin socket you should be able to dump
   requests and see the client IP, and then query the client.

   Is it librbd?  In that case you likely need to change the config so that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted or marked
 an osd out and then back in and scrub didn`t happen in the meantime
 (after such operation and before request arrival).
 What is more interesting, the hangup happens not exactly at the time
 of operation, but tens of minutes later.

{ "description": "osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[])",
  "received_at": "2012-11-28 03:54:43.094151",
  "age": "27812.942680",
  "duration": "2.676641",
  "flag_point": "started",
  "events": [
        { "time": "2012-11-28 03:54:43.094222",
          "event": "waiting_for_osdmap"},
        { "time": "2012-11-28 03:54:43.386890",
          "event": "reached_pg"},
        { "time": "2012-11-28 03:54:43.386894",
          "event": "started"},
        { "time": "2012-11-28 03:54:43.386973",
          "event": "commit_queued_for_journal_write"},
        { "time": "2012-11-28 03:54:45.360049",
          "event": "write_thread_in_journal_buffer"},
        { "time": "2012-11-28 03:54:45.586183",
          "event": "journaled_completion_queued"},
        { "time": "2012-11-28 03:54:45.586262",
          "event": "sub_op_commit"},
        { "time": "2012-11-28 03:54:45.770792",
          "event": "sub_op_applied"}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they 
never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write

Re: endless flying slow requests

2012-11-28 Thread Andrey Korolyov
On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
 Also, these clusters aren't mixed argonaut and next, are they?  (Not
 that that shouldn't work, but it would be a useful data point.)
 -Sam

 On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
 Did you observe hung io along with that error?  Both sub_op_commit and
 sub_op_applied have happened, so the sub_op_reply should have been
 sent back to the primary.  This looks more like a leak.  If you also
 observed hung io, then it's possible that the problem is occurring
 between the sub_op_applied event and the response.
 -Sam


It is relatively easy to check whether one of the client VMs has locked one or
more cores in iowait or is simply hung, so yes, these ops correspond to real
commit operations and they are hung.
I'm using an all-new 0.54 cluster, without mixing of course. Did everyone who
hit this bug make changes to the cluster (say, within the preceding day) before
the bug showed itself?
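
For reference, from inside a guest something as simple as the following shows
cores stuck in iowait (the exact tools are just an example):

  iostat -c 2               # watch the %iowait column
  top -b -n 1 | head -n 5   # the 'wa' figure in the Cpu(s) line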

 On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  On 14.11.2012 15:59, Sage Weil wrote:
   Hi Stefan,
  
   It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is the
   buggy part.  If you query the OSD admin socket you should be able to dump
   requests and see the client IP, and then query the client.

   Is it librbd?  In that case you likely need to change the config so that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted or marked
 an osd out and then back in and scrub didn`t happen in the meantime
 (after such operation and before request arrival).
 What is more interesting, the hangup happens not exactly at the time
 of operation, but tens of minutes later.

{ "description": "osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[])",
  "received_at": "2012-11-28 03:54:43.094151",
  "age": "27812.942680",
  "duration": "2.676641",
  "flag_point": "started",
  "events": [
        { "time": "2012-11-28 03:54:43.094222",
          "event": "waiting_for_osdmap"},
        { "time": "2012-11-28 03:54:43.386890",
          "event": "reached_pg"},
        { "time": "2012-11-28 03:54:43.386894",
          "event": "started"},
        { "time": "2012-11-28 03:54:43.386973",
          "event": "commit_queued_for_journal_write"},
        { "time": "2012-11-28 03:54:45.360049",
          "event": "write_thread_in_journal_buffer"},
        { "time": "2012-11-28 03:54:45.586183",
          "event": "journaled_completion_queued"},
        { "time": "2012-11-28 03:54:45.586262",
          "event": "sub_op_commit"},
        { "time": "2012-11-28 03:54:45.770792",
          "event": "sub_op_applied"}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they 
never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for  31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
3.3f6d2373) v4
currently delayed
2012-11-14 

Re: endless flying slow requests

2012-11-27 Thread Sage Weil
Hi Stefan,

On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  On 14.11.2012 15:59, Sage Weil wrote:
   Hi Stefan,
   
   It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is the
   buggy part.  If you query the OSD admin socket you should be able to dump
   requests and see the client IP, and then query the client.
   
   Is it librbd?  In that case you likely need to change the config so that
   it is listening on an admin socket ('admin socket = path').
  
  Yes it is. So i have to specify admin socket at the KVM host?
 
 Right.  IIRC the disk line is a ; (or \;) separated list of key/value 
 pairs.
 
  How do i query the admin socket for requests?
 
 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

Were you able to reproduce this?

Thanks!
sage


 
 sage
 
  
  Stefan
  
  
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
   
Hello list,

i see this several times. Endless flying slow requests. And they never
stop
until i restart the mentioned osd.

2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for > 31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for > 31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for > 31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for > 31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for > 31793.859124 secs
2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed

When i now restart osd 24 they go away and everything is fine again.

Stefan


Re: endless flying slow requests

2012-11-27 Thread Stefan Priebe - Profihost AG

On 28.11.2012 02:51, Sage Weil wrote:

Hi Stefan,


Yes it is. So i have to specify admin socket at the KVM host?


Right.  IIRC the disk line is a ; (or \;) separated list of key/value
pairs.


How do i query the admin socket for requests?


ceph --admin-daemon /path/to/socket help
ceph --admin-daemon /path/to/socket objecter_dump (i think)


Were you able to reproduce this?


No, not right now. But I've saved your answers and will run the tests
when I see this again.

Right now I'm just seeing extremely slow OSD recovery.

Stefan


Re: endless flying slow requests

2012-11-27 Thread Andrey Korolyov
On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
  On 14.11.2012 15:59, Sage Weil wrote:
   Hi Stefan,
  
   It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking is the
   buggy part.  If you query the OSD admin socket you should be able to dump
   requests and see the client IP, and then query the client.
  
   Is it librbd?  In that case you likely need to change the config so that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)

 Were you able to reproduce this?

 Thanks!
 sage

Meanwhile, I did. :)
Such requests will always be created if you have restarted an osd, or marked
one out and then back in, and no scrub has happened in the meantime (i.e.
after that operation and before the request arrives).
What is more interesting, the hangup happens not at the time of the operation
itself, but tens of minutes later.
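
A sketch of the sequence described above (the osd id is a placeholder, and the
exact delay before the hang varies):

  ceph osd out <id>     # or simply restart the osd
  ceph osd in <id>
  # keep client I/O running and do not scrub the affected PGs;
  # some time later, requests against them start showing up as slow/blocked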

{ "description": "osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[])",
  "received_at": "2012-11-28 03:54:43.094151",
  "age": "27812.942680",
  "duration": "2.676641",
  "flag_point": "started",
  "events": [
        { "time": "2012-11-28 03:54:43.094222",
          "event": "waiting_for_osdmap"},
        { "time": "2012-11-28 03:54:43.386890",
          "event": "reached_pg"},
        { "time": "2012-11-28 03:54:43.386894",
          "event": "started"},
        { "time": "2012-11-28 03:54:43.386973",
          "event": "commit_queued_for_journal_write"},
        { "time": "2012-11-28 03:54:45.360049",
          "event": "write_thread_in_journal_buffer"},
        { "time": "2012-11-28 03:54:45.586183",
          "event": "journaled_completion_queued"},
        { "time": "2012-11-28 03:54:45.586262",
          "event": "sub_op_commit"},
        { "time": "2012-11-28 03:54:45.770792",
          "event": "sub_op_applied"}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for > 31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for > 31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for > 31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for > 31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included 
below;
oldest blocked for > 31793.859124 secs
2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 
seconds
old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) 
v4
currently delayed
   
When i now restart osd 24 they go away and everything is fine again.
   
Stefan

Re: endless flying slow requests

2012-11-15 Thread Stefan Priebe - Profihost AG

On 14.11.2012 15:59, Sage Weil wrote:

Hi Stefan,

It would be nice to confirm that no clients are waiting on replies for
these requests; currently we suspect that the OSD request tracking is the
buggy part.  If you query the OSD admin socket you should be able to dump
requests and see the client IP, and then query the client.

Is it librbd?  In that case you likely need to change the config so that
it is listening on an admin socket ('admin socket = path').


Yes it is. So i have to specify admin socket at the KVM host? How do i 
query the admin socket for requests?


Stefan



On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:


Hello list,

i see this several times. Endless flying slow requests. And they never stop
until i restart the mentioned osd.

2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for > 31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for > 31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for > 31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for > 31792.858971 secs
2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed
2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included below;
oldest blocked for > 31793.859124 secs
2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 seconds old,
received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
currently delayed

When i now restart osd 24 they go away and everything is fine again.

Stefan


Re: endless flying slow requests

2012-11-15 Thread Josh Durgin

On 11/15/2012 12:09 AM, Stefan Priebe - Profihost AG wrote:

On 14.11.2012 15:59, Sage Weil wrote:

Hi Stefan,

It would be nice to confirm that no clients are waiting on replies for
these requests; currently we suspect that the OSD request tracking is the
buggy part.  If you query the OSD admin socket you should be able to dump
requests and see the client IP, and then query the client.

Is it librbd?  In that case you likely need to change the config so that
it is listening on an admin socket ('admin socket = path').


Yes it is. So i have to specify admin socket at the KVM host? How do i
query the admin socket for requests?


Yes, add 'admin socket = /path/to/admin/socket' to the [client] section
in ceph.conf, and when a guest is running, show outstanding requests
with:

ceph --admin-daemon /path/to/admin/socket objecter_requests
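
For example (the socket path and pid are illustrative only):

  [client]
      admin socket = /var/run/ceph/guest.$pid.asok

and then, with the guest running:

  ceph --admin-daemon /var/run/ceph/guest.12345.asok objecter_requests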

Josh


Re: endless flying slow requests

2012-11-15 Thread Sage Weil
On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
 On 14.11.2012 15:59, Sage Weil wrote:
  Hi Stefan,
  
  It would be nice to confirm that no clients are waiting on replies for
  these requests; currently we suspect that the OSD request tracking is the
  buggy part.  If you query the OSD admin socket you should be able to dump
  requests and see the client IP, and then query the client.
  
  Is it librbd?  In that case you likely need to change the config so that
  it is listening on an admin socket ('admin socket = path').
 
 Yes it is. So i have to specify admin socket at the KVM host?

Right.  IIRC the disk line is a ; (or \;) separated list of key/value 
pairs.

 How do i query the admin socket for requests?

ceph --admin-daemon /path/to/socket help
ceph --admin-daemon /path/to/socket objecter_dump (i think)

sage

 
 Stefan
 
 
  On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
   Hello list,
   
   i see this several times. Endless flying slow requests. And they never
   stop
   until i restart the mentioned osd.
   
   2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below;
   oldest blocked for > 31789.858457 secs
   2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds
   old,
   received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
   rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
   currently delayed
   2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below;
   oldest blocked for > 31790.858646 secs
   2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds
   old,
   received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
   rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
   currently delayed
   2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below;
   oldest blocked for > 31791.858827 secs
   2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 seconds
   old,
   received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
   rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
   currently delayed
   2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included below;
   oldest blocked for > 31792.858971 secs
   2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 seconds
   old,
   received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
   rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
   currently delayed
   2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included below;
   oldest blocked for > 31793.859124 secs
   2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 seconds
   old,
   received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
   rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
   currently delayed
   
   When i now restart osd 24 they go away and everything is fine again.
   
   Stefan


Re: endless flying slow requests

2012-11-14 Thread Sage Weil
Hi Stefan,

It would be nice to confirm that no clients are waiting on replies for 
these requests; currently we suspect that the OSD request tracking is the 
buggy part.  If you query the OSD admin socket you should be able to dump 
requests and see the client IP, and then query the client.  

Is it librbd?  In that case you likely need to change the config so that 
it is listening on an admin socket ('admin socket = path').
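
A rough sketch of the OSD-side check (the socket path is an assumption; 'help'
on the socket lists the exact command names for your build):

  ceph --admin-daemon /var/run/ceph/ceph-osd.24.asok help
  ceph --admin-daemon /var/run/ceph/ceph-osd.24.asok dump_ops_in_flight

The op descriptions in that dump (e.g. osd_op(client.30286.0:6719 ...) in the
log below) include the client id, which is what to chase down on the client
side.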

Thanks!
sage


On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:

 Hello list,
 
 i see this several times. Endless flying slow requests. And they never stop
 until i restart the mentioned osd.
 
 2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below;
 oldest blocked for > 31789.858457 secs
 2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds old,
 received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below;
 oldest blocked for > 31790.858646 secs
 2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds old,
 received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below;
 oldest blocked for > 31791.858827 secs
 2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 seconds old,
 received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included below;
 oldest blocked for > 31792.858971 secs
 2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 seconds old,
 received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
 currently delayed
 2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included below;
 oldest blocked for > 31793.859124 secs
 2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 seconds old,
 received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4
 currently delayed
 
 When i now restart osd 24 they go away and everything is fine again.
 
 Stefan