Re: endless flying slow requests
Looks good, pushed.
-Sam

On Sat, Dec 1, 2012 at 7:18 PM, Sage Weil s...@inktank.com wrote:
> I pushed another fix that plugs a leak when requests race with peering; it
> fixed the 'slow request' warnings for osd_sub_op messages for me. It's in
> wip-osd-leak. Sam, when you get a chance, can you take a look and push it
> to next if it looks okay?
Re: endless flying slow requests
I pushed another fix that plugs a leak when requests race with peering; it fixed the 'slow request' warnings for osd_sub_op messages for me. It's in wip-osd-leak. Sam, when you get a chance, can you take a look and push it to next if it looks okay?

Thanks!
sage
Re: endless flying slow requests
I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7. Let me know if the problem persists with this patch.
-Sam
Re: endless flying slow requests
Did you observe hung io along with that error? Both sub_op_commit and sub_op_applied have happened, so the sub_op_reply should have been sent back to the primary. This looks more like a leak. If you also observed hung io, then it's possible that the problem is occurring between the sub_op_applied event and the response.
-Sam
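For reference when reading the osd_sub_op dump later in this thread, the event names trace a replicated write through the OSD. The annotations below are my paraphrase of the stages, not authoritative documentation:

  waiting_for_osdmap               # held until the OSD has the map epoch the op needs
  reached_pg / started             # dispatched to the owning PG and picked up
  commit_queued_for_journal_write  # write queued for the journal
  write_thread_in_journal_buffer   # journal write thread working on it
  journaled_completion_queued      # journal commit callback queued
  sub_op_commit                    # replica's commit recorded
  sub_op_applied                   # replica applied the write
  # the sub_op_reply back to the primary is what should follow -- and is
  # what never shows up for the stuck requests discussed here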
Re: endless flying slow requests
Also, these clusters aren't mixed argonaut and next, are they? (Not that that shouldn't work, but it would be a useful data point.)
-Sam
RE: endless flying slow requests
I actually have this exact issue with hung io during these slow requests. The odd thing is that when I restarted the osd being mentioned via ceph -w, io didn't resume. A bit later (unsure of the exact time) another osd started being listed in ceph -w, and when I restarted that different osd, io came back. I believe these occur when I reboot nodes; I never reboot more than one at a time, and I have 3 replicas in the cluster. I have a pure ceph version 0.53 (commit:2528b5ee105b16352c91af064af5c0b5a7d45d7c) cluster.

Regards,
Shawn
Re: endless flying slow requests
On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
> Did you observe hung io along with that error?

It is relatively easy to check whether one of the client VMs has locked one or more cores in iowait or simply hangs, so yes, these ops correspond to real commit operations, and they are hung. I'm using an all-new 0.54 cluster, without mixing of course. Did everyone who hit this bug readjust the cluster before the bug showed itself (say, within a day or so)?
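A quick way to run the check described above from inside a guest (a generic sketch; iostat comes from the sysstat package, and nothing here is specific to this thread's setup):

  # cores pinned in iowait show up in the %iowait column
  iostat -c 1 5
  # tasks blocked on io sit in uninterruptible sleep ('D' state)
  ps -eo state,pid,comm | awk '$1 == "D"'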
Re: endless flying slow requests
Hi Stefan,

Were you able to reproduce this?

Thanks!
sage
Re: endless flying slow requests
On 28.11.2012 02:51, Sage Weil wrote:
> Were you able to reproduce this?

No, not right now. But I've saved your answers and will run the tests when I see this again. Right now I'm just seeing extremely slow OSD recovery.

Stefan
Re: endless flying slow requests
On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
> Were you able to reproduce this?

Meanwhile, I did. :) Such requests will always be created if you have restarted an osd, or marked one out and then back in, and a scrub didn't happen in the meantime (after that operation and before the request arrived). What is more interesting, the hangup happens not exactly at the time of the operation, but tens of minutes later.

{ "description": "osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[])",
  "received_at": "2012-11-28 03:54:43.094151",
  "age": "27812.942680",
  "duration": "2.676641",
  "flag_point": "started",
  "events": [
    { "time": "2012-11-28 03:54:43.094222", "event": "waiting_for_osdmap" },
    { "time": "2012-11-28 03:54:43.386890", "event": "reached_pg" },
    { "time": "2012-11-28 03:54:43.386894", "event": "started" },
    { "time": "2012-11-28 03:54:43.386973", "event": "commit_queued_for_journal_write" },
    { "time": "2012-11-28 03:54:45.360049", "event": "write_thread_in_journal_buffer" },
    { "time": "2012-11-28 03:54:45.586183", "event": "journaled_completion_queued" },
    { "time": "2012-11-28 03:54:45.586262", "event": "sub_op_commit" },
    { "time": "2012-11-28 03:54:45.770792", "event": "sub_op_applied" } ] }
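For comparison, the dump above comes from the OSD side. A sketch of pulling it yourself, assuming the default socket layout; the op-tracker command is dump_ops_in_flight in the releases where I've seen it, but check help first on 0.5x builds since the name may differ:

  ceph --admin-daemon /var/run/ceph/ceph-osd.24.asok help
  ceph --admin-daemon /var/run/ceph/ceph-osd.24.asok dump_ops_in_flight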
Re: endless flying slow requests
On 14.11.2012 15:59, Sage Weil wrote:
> Is it librbd? In that case you likely need to change the config so that it
> is listening on an admin socket ('admin socket = path').

Yes it is. So I have to specify the admin socket at the KVM host? How do I query the admin socket for requests?

Stefan
Re: endless flying slow requests
On 11/15/2012 12:09 AM, Stefan Priebe - Profihost AG wrote:
> So I have to specify the admin socket at the KVM host? How do I query the
> admin socket for requests?

Yes, add 'admin socket = /path/to/admin/socket' to the [client] section in ceph.conf, and when a guest is running, show outstanding requests with:

  ceph --admin-daemon /path/to/admin/socket objecter_requests

Josh
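Putting the two steps together, a minimal sketch. The paths are made up, and the $pid metavariable (which I believe the config parser expands per process) keeps sockets from colliding when several guests run on one host:

  # /etc/ceph/ceph.conf on the KVM host
  [client]
      admin socket = /var/run/ceph/client.$pid.asok

  # with a guest running:
  ls /var/run/ceph/                                        # find the socket
  ceph --admin-daemon /var/run/ceph/client.12345.asok help
  ceph --admin-daemon /var/run/ceph/client.12345.asok objecter_requests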
Re: endless flying slow requests
On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
> Yes it is. So I have to specify the admin socket at the KVM host?

Right. IIRC the disk line is a ; (or \;) separated list of key/value pairs.

> How do I query the admin socket for requests?

  ceph --admin-daemon /path/to/socket help
  ceph --admin-daemon /path/to/socket objecter_dump

(i think)

sage
Re: endless flying slow requests
Hi Stefan,

It would be nice to confirm that no clients are waiting on replies for these requests; currently we suspect that the OSD request tracking is the buggy part. If you query the OSD admin socket you should be able to dump requests and see the client IP, and then query the client.

Is it librbd? In that case you likely need to change the config so that it is listening on an admin socket ('admin socket = path').

Thanks!
sage

On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
> Hello list,
>
> I see this several times: endless flying slow requests, and they never
> stop until I restart the mentioned osd.
>
> 2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31789.858457 secs
> 2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
> 2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31790.858646 secs
> 2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
> 2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31791.858827 secs
> 2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
> 2012-11-14 10:12:00.513909 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31792.858971 secs
> 2012-11-14 10:12:00.513916 osd.24 [WRN] slow request 31792.858971 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
> 2012-11-14 10:12:01.514061 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31793.859124 secs
> 2012-11-14 10:12:01.514063 osd.24 [WRN] slow request 31793.859124 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
>
> When I now restart osd 24 they go away and everything is fine again.
>
> Stefan