Re: rbd map command hangs for 15 minutes during system start up

2012-12-31 Thread Alex Elder
On 12/26/2012 03:36 PM, Alex Elder wrote: On 12/26/2012 11:45 AM, Nick Bartos wrote: Here's a log with a hang on the updated branch: https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log OK, new naming scheme. Please try: wip-nick-1 Now that

Re: rbd map command hangs for 15 minutes during system start up

2012-12-27 Thread Sage Weil
On Thu, 27 Dec 2012, Nick Bartos wrote: I have some exciting news. After 215 test runs, no hung processes were detected. I think we may actually have it this time. Thanks for all your hard work! Sweet! I think it was the new branch naming scheme that did it. sage -Nick On Wed, Dec

Re: rbd map command hangs for 15 minutes during system start up

2012-12-27 Thread Alex Elder
On 12/27/2012 12:43 PM, Sage Weil wrote: On Thu, 27 Dec 2012, Nick Bartos wrote: I have some exciting news. After 215 test runs, no hung processes were detected. I think we may actually have it this time. Thanks for all your hard work! This is great news Nick, and I really appreciate your

Re: rbd map command hangs for 15 minutes during system start up

2012-12-26 Thread Alex Elder
On 12/26/2012 11:45 AM, Nick Bartos wrote: Here's a log with a hang on the updated branch: https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log I'm starting to look this over. Thanks a lot for supplying it. Sorry we still haven't nailed the

Re: rbd map command hangs for 15 minutes during system start up

2012-12-26 Thread Nick Bartos
Here's a log with a hang on the updated branch: https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log On Thu, Dec 20, 2012 at 1:59 PM, Alex Elder el...@inktank.com wrote: On 12/20/2012 11:48 AM, Nick Bartos wrote: Unfortunately, we still have a

Re: rbd map command hangs for 15 minutes during system start up

2012-12-26 Thread Alex Elder
On 12/26/2012 11:45 AM, Nick Bartos wrote: Here's a log with a hang on the updated branch: https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log OK, new naming scheme. Please try: wip-nick-1 I added another simple fix, but then collapsed three

Re: rbd map command hangs for 15 minutes during system start up

2012-12-20 Thread Nick Bartos
Unfortunately, we still have a hang: https://gist.github.com/4347052/download On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder el...@inktank.com wrote: On 12/19/2012 03:25 PM, Alex Elder wrote: On 12/18/2012 12:05 PM, Nick Bartos wrote: I've added the output of ps -ef in addition to triggering a

Re: rbd map command hangs for 15 minutes during system start up

2012-12-20 Thread Alex Elder
On 12/20/2012 11:48 AM, Nick Bartos wrote: Unfortunately, we still have a hang: https://gist.github.com/4347052/download The saga continues, and each time we get a little more information. Please try branch: wip-nick-newerest Thank you. -Alex On

Re: rbd map command hangs for 15 minutes during system start up

2012-12-19 Thread Alex Elder
On 12/18/2012 12:05 PM, Nick Bartos wrote: I've added the output of ps -ef in addition to triggering a trace when a hang is detected. Not much is generally running at that point, but you can have a look:

Re: rbd map command hangs for 15 minutes during system start up

2012-12-18 Thread Nick Bartos
I've added the output of ps -ef in addition to triggering a trace when a hang is detected. Not much is generally running at that point, but you can have a look: https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt Is it possible that there is some
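
The hang detection described in this message (watch the map command, and dump process state when it stalls) might look roughly like the sketch below. It is an illustration under stated assumptions, not the script used in this thread: the function names `run_with_hang_check` and `capture_ps`, the timeout value, and the callback design are all hypothetical.

```python
import subprocess


def capture_ps():
    """Return the output of `ps -ef` as text (the diagnostic named in the thread)."""
    return subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout


def run_with_hang_check(cmd, timeout, on_hang=capture_ps):
    """Run cmd and treat it as hung if it has not exited after `timeout` seconds.

    Returns (hung, diagnostics). Diagnostics are collected by on_hang() while
    the stalled process is still present, before any attempt to kill it.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout)
        return False, None
    except subprocess.TimeoutExpired:
        diagnostics = on_hang()
        # Best effort only: a task blocked inside the kernel may ignore the
        # signal, so we deliberately do not wait for it to exit here.
        proc.kill()
        return True, diagnostics
```

A call such as `run_with_hang_check(["rbd", "map", "rbdvol"], timeout=60)` would then return `(True, <ps output>)` when the command stalls. Triggering a kernel trace, as mentioned above, would belong in the `on_hang` callback and typically needs root.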

Re: rbd map command hangs for 15 minutes during system start up

2012-12-14 Thread Alex Elder
On 12/13/2012 01:00 PM, Nick Bartos wrote: Here's another log with the kernel debugging enabled: https://gist.github.com/raw/4278697/1c9e41d275e614783fbbdee8ca5842680f46c249/rbd-hang-1355424455.log Note that it hung on the 2nd try. Just to make sure I'm working with the right code base, can

Re: rbd map command hangs for 15 minutes during system start up

2012-12-14 Thread Nick Bartos
The kernel is 3.5.7 with the following patches applied (and in the order specified below): 001-libceph_eliminate_connection_state_DEAD_13_days_ago.patch 002-libceph_kill_bad_proto_ceph_connection_op_13_days_ago.patch 003-libceph_rename_socket_callbacks_13_days_ago.patch

Re: rbd map command hangs for 15 minutes during system start up

2012-12-13 Thread Nick Bartos
Here's another log with the kernel debugging enabled: https://gist.github.com/raw/4278697/1c9e41d275e614783fbbdee8ca5842680f46c249/rbd-hang-1355424455.log Note that it hung on the 2nd try. On Wed, Dec 12, 2012 at 4:57 PM, Nick Bartos n...@pistoncloud.com wrote: Using wip-nick-newer, the

Re: rbd map command hangs for 15 minutes during system start up

2012-12-13 Thread Alex Elder
On 12/13/2012 01:00 PM, Nick Bartos wrote: Here's another log with the kernel debugging enabled: https://gist.github.com/raw/4278697/1c9e41d275e614783fbbdee8ca5842680f46c249/rbd-hang-1355424455.log Note that it hung on the 2nd try. OK, thanks for the info. We'll keep looking. -Alex On

Re: rbd map command hangs for 15 minutes during system start up

2012-12-12 Thread Nick Bartos
Using wip-nick-newer, the problem still presented itself after 4 successful runs (so it may be a fluke, but it got slightly further than before). The log is here: https://gist.github.com/raw/4273114/9085ed00d5bdd5ebab9a94b48f4a562d1fbac431/rbd-hang-1355359129.log Unfortunately I forgot to enable

Re: rbd map command hangs for 15 minutes during system start up

2012-12-11 Thread Nick Bartos
Thanks! I'm creating a build with the new patches now. I'll let you know how testing goes. On Mon, Dec 10, 2012 at 1:57 PM, Alex Elder el...@inktank.com wrote: On 12/02/2012 10:43 PM, Alex Elder wrote: On 12/01/2012 11:34 PM, Nick Bartos wrote: Unfortunately the hangs happen with the new set

Re: rbd map command hangs for 15 minutes during system start up

2012-12-11 Thread Alex Elder
On 12/11/2012 11:26 AM, Nick Bartos wrote: Thanks! I'm creating a build with the new patches now. I'll let you know how testing goes. FYI, I've been testing with these changes and have *not* been hitting the kinds of problems I'd been previously. However those problems were different from

Re: rbd map command hangs for 15 minutes during system start up

2012-12-11 Thread Alex Elder
On 12/11/2012 12:01 PM, Alex Elder wrote: On 12/11/2012 11:26 AM, Nick Bartos wrote: Thanks! I'm creating a build with the new patches now. I'll let you know how testing goes. FYI, I've been testing with these changes and have *not* been hitting the kinds of problems I'd been previously.

Re: rbd map command hangs for 15 minutes during system start up

2012-12-01 Thread Nick Bartos
Unfortunately the hangs happen with the new set of patches. Here's some debug info: https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder el...@inktank.com wrote: On 11/29/2012 02:37 PM, Alex Elder wrote: On

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Nick Bartos
My initial tests using a 3.5.7 kernel with the 55 patches from wip-nick are going well. So far I've gone through 8 installs without an incident; I'll let it run for a bit longer to see if it crops up again. Can I get a branch with these patches integrated into all of the backported patches to

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Alex Elder
On 11/30/2012 12:49 PM, Nick Bartos wrote: My initial tests using a 3.5.7 kernel with the 55 patches from wip-nick are going well. So far I've gone through 8 installs without an incident; I'll let it run for a bit longer to see if it crops up again. This is great news! Now I wonder which

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Alex Elder
On 11/29/2012 02:37 PM, Alex Elder wrote: On 11/22/2012 12:04 PM, Nick Bartos wrote: Here are the ceph log messages (including the libceph kernel debug stuff you asked for) from a node boot with the rbd command hung for a couple of minutes: I'm sorry, but I did something stupid... Yes, the

Re: rbd map command hangs for 15 minutes during system start up

2012-11-29 Thread Alex Elder
On 11/22/2012 12:04 PM, Nick Bartos wrote: Here are the ceph log messages (including the libceph kernel debug stuff you asked for) from a node boot with the rbd command hung for a couple of minutes: Nick, I have put together a branch that includes two fixes that might be helpful. I don't

Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Nick Bartos
Here are the ceph log messages (including the libceph kernel debug stuff you asked for) from a node boot with the rbd command hung for a couple of minutes: https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos

Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Nick Bartos
It's very easy to reproduce now with my automated install script. The most I've seen it succeed with that patch is 2 in a row before hanging on the 3rd, although it hangs on most builds. So it shouldn't take much to get it to do it again. I'll try to get to that tomorrow, when I'm a bit more

Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Sage Weil
On Wed, 21 Nov 2012, Nick Bartos wrote: FYI the build which included all 3.5 backports except patch #50 is still going strong after 21 builds. Okay, that one at least makes some sense. I've opened http://tracker.newdream.net/issues/3519 How easy is this to reproduce? If it is

Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Nick Bartos
FYI the build which included all 3.5 backports except patch #50 is still going strong after 21 builds. On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos n...@pistoncloud.com wrote: With 8 successful installs already done, I'm reasonably confident that it's patch #50. I'm making another build which

Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Sage Weil
On Tue, 20 Nov 2012, Nick Bartos wrote: Since I now have a decent script which can reproduce this, I decided to re-test with the same 3.5.7 kernel, but just not applying the patches from the wip-3.5 branch. With the patches, I can only go 2 builds before I run into a hang. Without the

Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Nick Bartos
It's really looking like it's the libceph_resubmit_linger_ops_when_pg_mapping_changes commit. When patches 1-50 (listed below) are applied to 3.5.7, the hang is present. So far I have gone through 4 successful installs with no hang with only 1-49 applied. I'm still leaving my test run to make
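
The narrowing described in this message (hang present with patches 1-50 applied, no hang with only 1-49) is the result a bisection over the ordered patch series would converge on. The following is a minimal sketch of that search, assuming a hypothetical predicate `hangs_with(k)` that builds a kernel with patches 1..k and reports whether any of several install runs hung. Because the hang is intermittent, a "good" verdict is only as trustworthy as the number of runs behind it.

```python
def first_bad_patch(num_patches, hangs_with):
    """Return the smallest k such that applying patches 1..k shows the hang.

    Assumes that applying no patches is good, applying all of them is bad,
    and that once the culprit is included every longer prefix also hangs.
    """
    lo, hi = 0, num_patches  # invariant: prefix lo is good, prefix hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hangs_with(mid):
            hi = mid  # culprit is at or before mid
        else:
            lo = mid  # culprit is after mid
    return hi
```

Each probe costs a full kernel build plus enough install runs to trust the answer, so halving the range keeps the number of builds to roughly log2 of the series length rather than one per patch.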

Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Nick Bartos
With 8 successful installs already done, I'm reasonably confident that it's patch #50. I'm making another build which applies all patches from the 3.5 backport branch, excluding that specific one. I'll let you know if that turns up any unexpected failures. What will the potential fall out be

Re: rbd map command hangs for 15 minutes during system start up

2012-11-20 Thread Nick Bartos
I reproduced the problem and got several sysrq states captured. During this run, the monitor running on the host complained a few times about the clocks being off, but all messages were for under 0.55 seconds. Here are the kernel logs. Note that there are several traces, I thought multiple

Re: rbd map command hangs for 15 minutes during system start up

2012-11-20 Thread Nick Bartos
Since I now have a decent script which can reproduce this, I decided to re-test with the same 3.5.7 kernel, but just not applying the patches from the wip-3.5 branch. With the patches, I can only go 2 builds before I run into a hang. Without the patches, I have gone 9 consecutive builds (and

Re: rbd map command hangs for 15 minutes during system start up

2012-11-19 Thread Nick Bartos
Making 'mon clock drift allowed' very small (0.1) does not reliably reproduce the hang. I started looking at the code for 0.48.2 and it looks like this is only used in Paxos::warn_on_future_time, which only handles the warning, nothing else. On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Nick Bartos
Turns out we're having the 'rbd map' hang on startup again, after we started using the wip-3.5 patch set. How critical is the libceph_protect_ceph_con_open_with_mutex commit? That's the one I removed before which seemed to get rid of the problem (although I'm not completely sure if it completely

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Sage Weil
I just realized I was mixing up this thread with the other deadlock thread. On Fri, 16 Nov 2012, Nick Bartos wrote: Turns out we're having the 'rbd map' hang on startup again, after we started using the wip-3.5 patch set. How critical is the libceph_protect_ceph_con_open_with_mutex commit?

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Nick Bartos
How far off do the clocks need to be before there is a problem? It would seem to be hard to ensure a very large cluster has all of its nodes synchronized within 50ms (which seems to be the default for mon clock drift allowed). Does the mon clock drift allowed parameter change anything other

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Nick Bartos
Should I be lowering the clock drift allowed, or the lease interval to help reproduce it? On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil s...@inktank.com wrote: You can safely set the clock drift allowed as high as 500ms. The real limitation is that it needs to be well under the lease interval,

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Sage Weil
On Fri, 16 Nov 2012, Nick Bartos wrote: Should I be lowering the clock drift allowed, or the lease interval to help reproduce it? clock drift allowed. On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil s...@inktank.com wrote: You can safely set the clock drift allowed as high as 500ms. The

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Gregory Farnum
To be clear, the monitor cluster needs to be within this clock drift — the rest of the Ceph cluster can be off by as much as you care to. (Well, there's also a limit imposed by cephx authorization which can keep nodes out of the cluster, but that drift allowance is measured in units of hours.)
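
As a rough illustration of the constraint described in this message, the sketch below checks whether a set of monitor clock offsets stays within an allowed drift. It is a simplified model and not how Ceph monitors actually measure skew: the function name and the offset values are hypothetical, and the 0.05 s default simply mirrors the 50 ms figure mentioned elsewhere in this thread.

```python
def monitors_within_drift(offsets_sec, allowed_drift_sec=0.05):
    """Return True if the spread between the fastest and slowest monitor
    clock is no larger than allowed_drift_sec.

    offsets_sec holds each monitor's clock offset, in seconds, relative to
    any common reference. Only monitors are included; per the message above,
    the rest of the cluster is not bound by this limit.
    """
    if not offsets_sec:
        return True
    return max(offsets_sec) - min(offsets_sec) <= allowed_drift_sec
```

For example, offsets of 0.0, 0.01 and -0.02 seconds pass with the default limit, while a monitor that is 0.55 seconds off fails even with the limit raised to 0.5.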

Re: rbd map command hangs for 15 minutes during system start up

2012-11-15 Thread Nick Bartos
Sorry I guess this e-mail got missed. I believe those patches came from the ceph/linux-3.5.5-ceph branch. I'm now using the wip-3.5 branch patches, which seem to all be fine. We'll stick with 3.5 and this backport for now until we can figure out what's wrong with 3.6. I typically ignore the

Re: rbd map command hangs for 15 minutes during system start up

2012-11-15 Thread Sage Weil
On Thu, 15 Nov 2012, Nick Bartos wrote: Sorry I guess this e-mail got missed. I believe those patches came from the ceph/linux-3.5.5-ceph branch. I'm now using the wip-3.5 branch patches, which seem to all be fine. We'll stick with 3.5 and this backport for now until we can figure out

Re: rbd map command hangs for 15 minutes during system start up

2012-11-12 Thread Nick Bartos
After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it seems we no longer have this hang. On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin josh.dur...@inktank.com wrote: On 11/08/2012 02:10 PM, Mandell Degerness wrote: We are seeing a somewhat random, but frequent hang on our systems

Re: rbd map command hangs for 15 minutes during system start up

2012-11-12 Thread Sage Weil
On Mon, 12 Nov 2012, Nick Bartos wrote: After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it seems we no longer have this hang. Hmm, that's a bit disconcerting. Did this series come from our old 3.5 stable series? I recently prepared a new one that backports *all* of the

rbd map command hangs for 15 minutes during system start up

2012-11-08 Thread Mandell Degerness
We are seeing a somewhat random, but frequent hang on our systems during startup. The hang happens at the point where an rbd map rbdvol command is run. I've attached the ceph logs from the cluster. The map command happens at Nov 8 18:41:09 on server 172.18.0.15. The process which hung can be

Re: rbd map command hangs for 15 minutes during system start up

2012-11-08 Thread Josh Durgin
On 11/08/2012 02:10 PM, Mandell Degerness wrote: We are seeing a somewhat random, but frequent hang on our systems during startup. The hang happens at the point where an rbd map rbdvol command is run. I've attached the ceph logs from the cluster. The map command happens at Nov 8 18:41:09 on