Re: rbd map command hangs for 15 minutes during system start up

2013-01-02 Thread Nick Bartos
So far basic things are working fine, and my hang test is at 78 passes and still going strong. I'll let you know if any problems crop up with it. On Mon, Dec 31, 2012 at 10:22 AM, Alex Elder wrote: > On 12/26/2012 03:36 PM, Alex Elder wrote: >> On 12/26/2012 11:45 AM, Nick Bartos wrote: >>> Here's

Re: rbd map command hangs for 15 minutes during system start up

2012-12-31 Thread Alex Elder
On 12/26/2012 03:36 PM, Alex Elder wrote: > On 12/26/2012 11:45 AM, Nick Bartos wrote: >> Here's a log with a hang on the updated branch: >> >> https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log > > OK, new naming scheme. Please try: wip-nick-1

Re: rbd map command hangs for 15 minutes during system start up

2012-12-27 Thread Alex Elder
On 12/27/2012 12:43 PM, Sage Weil wrote: > On Thu, 27 Dec 2012, Nick Bartos wrote: >> I have some exciting news. After 215 test runs, no hung processes >> were detected. I think we may actually have it this time. Thanks for >> all your hard work! This is great news Nick, and I really appreciate

Re: rbd map command hangs for 15 minutes during system start up

2012-12-27 Thread Sage Weil
On Thu, 27 Dec 2012, Nick Bartos wrote: > I have some exciting news. After 215 test runs, no hung processes > were detected. I think we may actually have it this time. Thanks for > all your hard work! Sweet! I think it was the new branch naming scheme that did it. sage > > -Nick > > On Wed

Re: rbd map command hangs for 15 minutes during system start up

2012-12-27 Thread Nick Bartos
I have some exciting news. After 215 test runs, no hung processes were detected. I think we may actually have it this time. Thanks for all your hard work! -Nick On Wed, Dec 26, 2012 at 1:36 PM, Alex Elder wrote: > On 12/26/2012 11:45 AM, Nick Bartos wrote: >> Here's a log with a hang on the u

Re: rbd map command hangs for 15 minutes during system start up

2012-12-26 Thread Alex Elder
On 12/26/2012 11:45 AM, Nick Bartos wrote: > Here's a log with a hang on the updated branch: > > https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log OK, new naming scheme. Please try: wip-nick-1 I added another simple fix, but then collapsed thr

Re: rbd map command hangs for 15 minutes during system start up

2012-12-26 Thread Nick Bartos
Here's a log with a hang on the updated branch: https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log On Thu, Dec 20, 2012 at 1:59 PM, Alex Elder wrote: > On 12/20/2012 11:48 AM, Nick Bartos wrote: >> Unfortunately, we still have a hang: >> >> http

Re: rbd map command hangs for 15 minutes during system start up

2012-12-26 Thread Alex Elder
On 12/26/2012 11:45 AM, Nick Bartos wrote: > Here's a log with a hang on the updated branch: > > https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log I'm starting to look this over. Thanks a lot for supplying it. Sorry we still haven't nailed the p

Re: rbd map command hangs for 15 minutes during system start up

2012-12-20 Thread Alex Elder
On 12/20/2012 11:48 AM, Nick Bartos wrote: > Unfortunately, we still have a hang: > > https://gist.github.com/4347052/download The saga continues, and each time we get a little more information. Please try branch: "wip-nick-newerest" Thank you. -Alex >

Re: rbd map command hangs for 15 minutes during system start up

2012-12-20 Thread Nick Bartos
Unfortunately, we still have a hang: https://gist.github.com/4347052/download On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder wrote: > On 12/19/2012 03:25 PM, Alex Elder wrote: >> On 12/18/2012 12:05 PM, Nick Bartos wrote: >>> I've added the output of "ps -ef" in addition to triggering a trace >>>

Re: rbd map command hangs for 15 minutes during system start up

2012-12-19 Thread Alex Elder
On 12/19/2012 03:25 PM, Alex Elder wrote: > On 12/18/2012 12:05 PM, Nick Bartos wrote: >> I've added the output of "ps -ef" in addition to triggering a trace >> when a hang is detected. Not much is generally running at that point, >> but you can have a look: >> >> https://gist.github.com/raw/43302

Re: rbd map command hangs for 15 minutes during system start up

2012-12-19 Thread Alex Elder
On 12/18/2012 12:05 PM, Nick Bartos wrote: > I've added the output of "ps -ef" in addition to triggering a trace > when a hang is detected. Not much is generally running at that point, > but you can have a look: > > https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-

Re: rbd map command hangs for 15 minutes during system start up

2012-12-18 Thread Nick Bartos
I've added the output of "ps -ef" in addition to triggering a trace when a hang is detected. Not much is generally running at that point, but you can have a look: https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt Is it possible that there is som

Re: rbd map command hangs for 15 minutes during system start up

2012-12-18 Thread Alex Elder
On 12/17/2012 11:12 AM, Nick Bartos wrote: > Here's a log with the rbd debugging enabled: > > https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log > > On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder wrote: >> On 12/14/2012 10:53 AM, Nick Bartos w

Re: rbd map command hangs for 15 minutes during system start up

2012-12-17 Thread Nick Bartos
Here's a log with the rbd debugging enabled: https://gist.github.com/raw/4319962/d9690fd92c169198efc5eecabf275ef1808929d2/rbd-hang-test-1355763470.log On Fri, Dec 14, 2012 at 10:03 AM, Alex Elder wrote: > On 12/14/2012 10:53 AM, Nick Bartos wrote: >> Yes I was only enabling debugging for libceph

Re: rbd map command hangs for 15 minutes during system start up

2012-12-14 Thread Alex Elder
On 12/14/2012 10:53 AM, Nick Bartos wrote: > Yes I was only enabling debugging for libceph. I'm adding debugging > for rbd as well. I'll do a repro later today when a test cluster > opens up. Excellent, thank you. -Alex -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" i

Re: rbd map command hangs for 15 minutes during system start up

2012-12-14 Thread Nick Bartos
The kernel is 3.5.7 with the following patches applied (and in the order specified below): 001-libceph_eliminate_connection_state_DEAD_13_days_ago.patch 002-libceph_kill_bad_proto_ceph_connection_op_13_days_ago.patch 003-libceph_rename_socket_callbacks_13_days_ago.patch 004-libceph_rename_kvec_res

Re: rbd map command hangs for 15 minutes during system start up

2012-12-14 Thread Alex Elder
On 12/13/2012 01:00 PM, Nick Bartos wrote: > Here's another log with the kernel debugging enabled: > https://gist.github.com/raw/4278697/1c9e41d275e614783fbbdee8ca5842680f46c249/rbd-hang-1355424455.log > > Note that it hung on the 2nd try. Just to make sure I'm working with the right code base, c

Re: rbd map command hangs for 15 minutes during system start up

2012-12-13 Thread Alex Elder
On 12/13/2012 01:00 PM, Nick Bartos wrote: > Here's another log with the kernel debugging enabled: > https://gist.github.com/raw/4278697/1c9e41d275e614783fbbdee8ca5842680f46c249/rbd-hang-1355424455.log > > Note that it hung on the 2nd try. OK, thanks for the info. We'll keep looking. -Alex >

Re: rbd map command hangs for 15 minutes during system start up

2012-12-13 Thread Nick Bartos
Here's another log with the kernel debugging enabled: https://gist.github.com/raw/4278697/1c9e41d275e614783fbbdee8ca5842680f46c249/rbd-hang-1355424455.log Note that it hung on the 2nd try. On Wed, Dec 12, 2012 at 4:57 PM, Nick Bartos wrote: > Using wip-nick-newer, the problem still presented it

Re: rbd map command hangs for 15 minutes during system start up

2012-12-12 Thread Nick Bartos
Using wip-nick-newer, the problem still presented itself after 4 successful runs (so it may be a fluke, but it got slightly further than before). The log is here: https://gist.github.com/raw/4273114/9085ed00d5bdd5ebab9a94b48f4a562d1fbac431/rbd-hang-1355359129.log Unfortunately I forgot to enable

Re: rbd map command hangs for 15 minutes during system start up

2012-12-11 Thread Alex Elder
On 12/11/2012 12:01 PM, Alex Elder wrote: > On 12/11/2012 11:26 AM, Nick Bartos wrote: >> Thanks! I'm creating a build with the new patches now. I'll let you >> know how testing goes. > > FYI, I've been testing with these changes and have *not* been > hitting the kinds of problems I'd been previo

Re: rbd map command hangs for 15 minutes during system start up

2012-12-11 Thread Alex Elder
On 12/11/2012 11:26 AM, Nick Bartos wrote: > Thanks! I'm creating a build with the new patches now. I'll let you > know how testing goes. FYI, I've been testing with these changes and have *not* been hitting the kinds of problems I'd been previously. However those problems were different from yo

Re: rbd map command hangs for 15 minutes during system start up

2012-12-11 Thread Nick Bartos
Thanks! I'm creating a build with the new patches now. I'll let you know how testing goes. On Mon, Dec 10, 2012 at 1:57 PM, Alex Elder wrote: > On 12/02/2012 10:43 PM, Alex Elder wrote: >> On 12/01/2012 11:34 PM, Nick Bartos wrote: >>> Unfortunately the hangs happen with the new set of patches.

Re: rbd map command hangs for 15 minutes during system start up

2012-12-10 Thread Alex Elder
On 12/02/2012 10:43 PM, Alex Elder wrote: > On 12/01/2012 11:34 PM, Nick Bartos wrote: >> Unfortunately the hangs happen with the new set of patches. Here's >> some debug info: >> >> https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt >> > > Well I'm sorry t

Re: rbd map command hangs for 15 minutes during system start up

2012-12-02 Thread Alex Elder
On 12/01/2012 11:34 PM, Nick Bartos wrote: > Unfortunately the hangs happen with the new set of patches. Here's > some debug info: > > https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt Well I'm sorry to hear that but I'm glad to have the new info. In retrospe

Re: rbd map command hangs for 15 minutes during system start up

2012-12-01 Thread Nick Bartos
Unfortunately the hangs happen with the new set of patches. Here's some debug info: https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder wrote: > On 11/29/2012 02:37 PM, Alex Elder wrote: >> On 11/22/2012 12:04 P

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Alex Elder
On 11/29/2012 02:37 PM, Alex Elder wrote: > On 11/22/2012 12:04 PM, Nick Bartos wrote: >> Here are the ceph log messages (including the libceph kernel debug >> stuff you asked for) from a node boot with the rbd command hung for a >> couple of minutes: I'm sorry, but I did something stupid... Yes,

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Sage Weil
On Fri, 30 Nov 2012, Alex Elder wrote: > On 11/30/2012 12:49 PM, Nick Bartos wrote: > > My initial tests using a 3.5.7 kernel with the 55 patches from > > wip-nick are going well. So far I've gone through 8 installs without > > an incident, I'll leave it run for a bit longer to see if it crops up

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Alex Elder
On 11/30/2012 12:49 PM, Nick Bartos wrote: > My initial tests using a 3.5.7 kernel with the 55 patches from > wip-nick are going well. So far I've gone through 8 installs without > an incident, I'll leave it run for a bit longer to see if it crops up > again. This is great news! Now I wonder whi

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Nick Bartos
My initial tests using a 3.5.7 kernel with the 55 patches from wip-nick are going well. So far I've gone through 8 installs without an incident, I'll leave it run for a bit longer to see if it crops up again. Can I get a branch with these patches integrated into all of the backported patches to 3

Re: rbd map command hangs for 15 minutes during system start up

2012-11-29 Thread Alex Elder
On 11/22/2012 12:04 PM, Nick Bartos wrote: > Here are the ceph log messages (including the libceph kernel debug > stuff you asked for) from a node boot with the rbd command hung for a > couple of minutes: Nick, I have put together a branch that includes two fixes that might be helpful. I don't ex

Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Nick Bartos
FYI the build which included all 3.5 backports except patch #50 is still going strong after 21 builds. On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos wrote: > With 8 successful installs already done, I'm reasonably confident that > it's patch #50. I'm making another build which applies all patches

Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Sage Weil
On Wed, 21 Nov 2012, Nick Bartos wrote: > FYI the build which included all 3.5 backports except patch #50 is > still going strong after 21 builds. Okay, that one at least makes some sense. I've opened http://tracker.newdream.net/issues/3519 How easy is this to reproduce? If it is somet

Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Nick Bartos
It's very easy to reproduce now with my automated install script. The most I've seen it succeed with that patch is 2 in a row, hanging on the 3rd, although it hangs on most builds. So it shouldn't take much to get it to do it again. I'll try and get to that tomorrow, when I'm a bit more reste

Re: rbd map command hangs for 15 minutes during system start up

2012-11-22 Thread Nick Bartos
Here are the ceph log messages (including the libceph kernel debug stuff you asked for) from a node boot with the rbd command hung for a couple of minutes: https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos wrot

Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Nick Bartos
With 8 successful installs already done, I'm reasonably confident that it's patch #50. I'm making another build which applies all patches from the 3.5 backport branch, excluding that specific one. I'll let you know if that turns up any unexpected failures. What will the potential fall out be for

Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Nick Bartos
It's really looking like it's the libceph_resubmit_linger_ops_when_pg_mapping_changes commit. When patches 1-50 (listed below) are applied to 3.5.7, the hang is present. So far I have gone through 4 successful installs with no hang with only 1-49 applied. I'm still leaving my test run to make su
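
The message above narrows the hang to one patch by comparing an ordered prefix of the series (1-49, no hang) against the next longer one (1-50, hang). A minimal sketch of that prefix search as a binary search follows. It is an illustration only, not necessarily how the culprit was found in this thread. The names `first_bad_prefix` and `hangs_with` are hypothetical, and because the hang is intermittent, `hangs_with` is assumed to run enough installs per candidate to give a trustworthy answer.

```python
from typing import Callable, Sequence


def first_bad_prefix(
    patches: Sequence[str],
    hangs_with: Callable[[Sequence[str]], bool],
) -> int:
    """Return the smallest n such that applying patches[:n] produces the hang.

    Assumptions (not taken from the thread): the unpatched kernel is good,
    the full series is bad, and once the bad patch is applied every longer
    prefix is bad as well.
    """
    good, bad = 0, len(patches)  # patches[:good] is clean, patches[:bad] hangs
    while bad - good > 1:
        mid = (good + bad) // 2
        if hangs_with(patches[:mid]):
            bad = mid   # the culprit is at or before position mid
        else:
            good = mid  # the culprit is after position mid
    return bad
```

The suspect patch is then `patches[n - 1]` for the returned `n`. With 50 patches this needs about six test builds instead of up to fifty, although each build still has to be repeated several times before it can be called clean.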

Re: rbd map command hangs for 15 minutes during system start up

2012-11-21 Thread Sage Weil
On Tue, 20 Nov 2012, Nick Bartos wrote: > Since I now have a decent script which can reproduce this, I decided > to re-test with the same 3.5.7 kernel, but just not applying the > patches from the wip-3.5 branch. With the patches, I can only go 2 > builds before I run into a hang. Without the pat

Re: rbd map command hangs for 15 minutes during system start up

2012-11-20 Thread Nick Bartos
Since I now have a decent script which can reproduce this, I decided to re-test with the same 3.5.7 kernel, but just not applying the patches from the wip-3.5 branch. With the patches, I can only go 2 builds before I run into a hang. Without the patches, I have gone 9 consecutive builds (and stil
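
The comparison above (a hang within about 2 builds with the patches, 9 clean builds without) rests on how unlikely a long clean streak is if the bug is still present. A rough sketch of that arithmetic follows. The per-build hang rate of one in three is an assumed figure for illustration, not a measured one, and builds are assumed to be independent.

```python
def chance_of_clean_streak(hang_rate: float, builds: int) -> float:
    """Probability that `builds` independent builds all finish without a hang."""
    if not 0.0 <= hang_rate <= 1.0:
        raise ValueError("hang_rate must be between 0 and 1")
    return (1.0 - hang_rate) ** builds


if __name__ == "__main__":
    # Assumed rate: roughly one hang every three builds (illustrative only).
    assumed_rate = 1.0 / 3.0
    for builds in (2, 9, 21):
        chance = chance_of_clean_streak(assumed_rate, builds)
        print(f"{builds} clean builds by luck alone: {chance:.4f}")
```

Under that assumed rate, 9 clean builds in a row would happen by chance less than 3% of the time, which is why a longer clean run is reasonable evidence that the patch set is involved.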

Re: rbd map command hangs for 15 minutes during system start up

2012-11-20 Thread Nick Bartos
I reproduced the problem and got several sysrq states captured. During this run, the monitor running on the host complained a few times about the clocks being off, but all messages were for under 0.55 seconds. Here are the kernel logs. Note that there are several traces, I thought multiple during

Re: rbd map command hangs for 15 minutes during system start up

2012-11-19 Thread Gregory Farnum
Hmm, yep — that param is actually only used for the warning; I guess we forgot what it actually covers. :( Have your monitor clocks been off by more than 5 seconds at any point? On Mon, Nov 19, 2012 at 3:04 PM, Nick Bartos wrote: > Making 'mon clock drift allowed' very small (0.1) does not >

Re: rbd map command hangs for 15 minutes during system start up

2012-11-19 Thread Nick Bartos
Making 'mon clock drift allowed' very small (0.1) does not reliably reproduce the hang. I started looking at the code for 0.48.2 and it looks like this is only used in Paxos::warn_on_future_time, which only handles the warning, nothing else. On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil wrote:

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Gregory Farnum
To be clear, the monitor cluster needs to be within this clock drift — the rest of the Ceph cluster can be off by as much as you care to. (Well, there's also a limit imposed by cephx authorization which can keep nodes out of the cluster, but that drift allowance is measured in units of hours.) -Gr

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Sage Weil
On Fri, 16 Nov 2012, Nick Bartos wrote: > Should I be lowering the clock drift allowed, or the lease interval to > help reproduce it? clock drift allowed. > > On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil wrote: > > You can safely set the clock drift allowed as high as 500ms. The real > > limit

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Nick Bartos
Should I be lowering the clock drift allowed, or the lease interval to help reproduce it? On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil wrote: > You can safely set the clock drift allowed as high as 500ms. The real > limitation is that it needs to be well under the lease interval, which is > curren

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Sage Weil
You can safely set the clock drift allowed as high as 500ms. The real limitation is that it needs to be well under the lease interval, which is currently 5 seconds by default. You might be able to reproduce more easily by lowering the threshold... sage On Fri, 16 Nov 2012, Nick Bartos wrote:

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Nick Bartos
How far off do the clocks need to be before there is a problem? It would seem to be hard to ensure a very large cluster has all of its nodes synchronized within 50ms (which seems to be the default for "mon clock drift allowed"). Does the mon clock drift allowed parameter change anything other th

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Sage Weil
I just realized I was mixing up this thread with the other deadlock thread. On Fri, 16 Nov 2012, Nick Bartos wrote: > Turns out we're having the 'rbd map' hang on startup again, after we > started using the wip-3.5 patch set. How critical is the > libceph_protect_ceph_con_open_with_mutex commi

Re: rbd map command hangs for 15 minutes during system start up

2012-11-16 Thread Nick Bartos
Turns out we're having the 'rbd map' hang on startup again, after we started using the wip-3.5 patch set. How critical is the libceph_protect_ceph_con_open_with_mutex commit? That's the one I removed before which seemed to get rid of the problem (although I'm not completely sure if it completely

Re: rbd map command hangs for 15 minutes during system start up

2012-11-15 Thread Sage Weil
On Thu, 15 Nov 2012, Nick Bartos wrote: > Sorry I guess this e-mail got missed. I believe those patches came > from the ceph/linux-3.5.5-ceph branch. I'm now using the wip-3.5 > branch patches, which seem to all be fine. We'll stick with 3.5 and > this backport for now until we can figure out wh

Re: rbd map command hangs for 15 minutes during system start up

2012-11-15 Thread Nick Bartos
Sorry I guess this e-mail got missed. I believe those patches came from the ceph/linux-3.5.5-ceph branch. I'm now using the wip-3.5 branch patches, which seem to all be fine. We'll stick with 3.5 and this backport for now until we can figure out what's wrong with 3.6. I typically ignore the wip

Re: rbd map command hangs for 15 minutes during system start up

2012-11-12 Thread Sage Weil
On Mon, 12 Nov 2012, Nick Bartos wrote: > After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it > seems we no longer have this hang. Hmm, that's a bit disconcerting. Did this series come from our old 3.5 stable series? I recently prepared a new one that backports *all* of the fix

Re: rbd map command hangs for 15 minutes during system start up

2012-11-12 Thread Nick Bartos
After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it seems we no longer have this hang. On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin wrote: > On 11/08/2012 02:10 PM, Mandell Degerness wrote: >> >> We are seeing a somewhat random, but frequent hang on our systems >> during startup.

Re: rbd map command hangs for 15 minutes during system start up

2012-11-08 Thread Josh Durgin
On 11/08/2012 02:10 PM, Mandell Degerness wrote: We are seeing a somewhat random, but frequent hang on our systems during startup. The hang happens at the point where an "rbd map " command is run. I've attached the ceph logs from the cluster. The map command happens at Nov 8 18:41:09 on serve

rbd map command hangs for 15 minutes during system start up

2012-11-08 Thread Mandell Degerness
We are seeing a somewhat random, but frequent hang on our systems during startup. The hang happens at the point where an "rbd map " command is run. I've attached the ceph logs from the cluster. The map command happens at Nov 8 18:41:09 on server 172.18.0.15. The process which hung can be seen