On 12/26/2012 03:36 PM, Alex Elder wrote:
On 12/26/2012 11:45 AM, Nick Bartos wrote:
Here's a log with a hang on the updated branch:
https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log
OK, new naming scheme. Please try: wip-nick-1
I added another simple fix, but then collapsed three
On Thu, 27 Dec 2012, Nick Bartos wrote:
I have some exciting news. After 215 test runs, no hung processes
were detected. I think we may actually have it this time. Thanks for
all your hard work!
-Nick
Sweet! I think it was the new branch naming scheme that did it.
sage
On 12/27/2012 12:43 PM, Sage Weil wrote:
On Thu, 27 Dec 2012, Nick Bartos wrote:
This is great news Nick, and I really appreciate your
On 12/26/2012 11:45 AM, Nick Bartos wrote:
Here's a log with a hang on the updated branch:
https://gist.github.com/raw/4381750/772476e1bae1e6366347a223f34aa6c440b92765/rdb-hang-1356543132.log
I'm starting to look this over. Thanks a lot for supplying it.
Sorry we still haven't nailed the
Unfortunately, we still have a hang:
https://gist.github.com/4347052/download
On Wed, Dec 19, 2012 at 2:42 PM, Alex Elder el...@inktank.com wrote:
On 12/19/2012 03:25 PM, Alex Elder wrote:
On 12/18/2012 12:05 PM, Nick Bartos wrote:
I've added the output of ps -ef in addition to triggering a
On 12/20/2012 11:48 AM, Nick Bartos wrote:
Unfortunately, we still have a hang:
https://gist.github.com/4347052/download
The saga continues, and each time we get a little more
information. Please try branch: wip-nick-newerest
Thank you.
-Alex
On 12/18/2012 12:05 PM, Nick Bartos wrote:
I've added the output of ps -ef in addition to triggering a trace
when a hang is detected. Not much is generally running at that point,
but you can have a look:
https://gist.github.com/raw/4330223/2f131ee312ee43cb3d8c307a9bf2f454a7edfe57/rbd-hang-1355851498.txt
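A rough sketch of that kind of capture, for anyone following along (this is not Nick's actual script; the volume name, timeout, and log path are made up):

  #!/usr/bin/env python
  # Sketch only: run 'rbd map' and, if it has not returned within TIMEOUT
  # seconds, record 'ps -ef' and write 'w' to /proc/sysrq-trigger so the
  # kernel logs the stacks of blocked (hung) tasks to dmesg.
  import subprocess
  import time

  CMD = ["rbd", "map", "rbdvol"]   # volume name is illustrative
  TIMEOUT = 300                    # seconds before we call it a hang
  LOG = "/tmp/rbd-hang-%d.log" % int(time.time())

  proc = subprocess.Popen(CMD)
  deadline = time.time() + TIMEOUT
  while proc.poll() is None and time.time() < deadline:
      time.sleep(5)

  if proc.poll() is None:          # still running: treat it as hung
      with open(LOG, "w") as log:
          log.write(subprocess.check_output(["ps", "-ef"]).decode())
      with open("/proc/sysrq-trigger", "w") as trig:
          trig.write("w")          # traces show up in the kernel log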
Is it possible that there is some
On 12/13/2012 01:00 PM, Nick Bartos wrote:
Here's another log with the kernel debugging enabled:
https://gist.github.com/raw/4278697/1c9e41d275e614783fbbdee8ca5842680f46c249/rbd-hang-1355424455.log
Note that it hung on the 2nd try.
Just to make sure I'm working with the right code base, can
The kernel is 3.5.7 with the following patches applied (and in the
order specified below):
001-libceph_eliminate_connection_state_DEAD.patch
002-libceph_kill_bad_proto_ceph_connection_op.patch
003-libceph_rename_socket_callbacks.patch
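(Only the first three patches of the series are quoted here.) For what it's worth, a numbered series like this is applied in numeric order against the unpacked tree; a minimal sketch, with hypothetical directory names:

  # Sketch only: apply a numbered backport series in order to a 3.5.7 tree.
  # Assumes the patches sit in ./patches and the kernel source in ./linux-3.5.7.
  import glob
  import os
  import subprocess

  for p in sorted(glob.glob("patches/[0-9][0-9][0-9]-*.patch")):
      subprocess.check_call(["patch", "-p1", "-i", os.path.abspath(p)],
                            cwd="linux-3.5.7")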
On 12/13/2012 01:00 PM, Nick Bartos wrote:
OK, thanks for the info. We'll keep looking. -Alex
Using wip-nick-newer, the problem still presented itself after 4
successful runs (so it may be a fluke, but it got slightly further
than before). The log is here:
https://gist.github.com/raw/4273114/9085ed00d5bdd5ebab9a94b48f4a562d1fbac431/rbd-hang-1355359129.log
Unfortunately I forgot to enable
Thanks! I'm creating a build with the new patches now. I'll let you
know how testing goes.
On Mon, Dec 10, 2012 at 1:57 PM, Alex Elder el...@inktank.com wrote:
On 12/02/2012 10:43 PM, Alex Elder wrote:
On 12/01/2012 11:34 PM, Nick Bartos wrote:
Unfortunately the hangs happen with the new set
On 12/11/2012 11:26 AM, Nick Bartos wrote:
FYI, I've been testing with these changes and have *not* been
hitting the kinds of problems I'd been previously. However
those problems were different from
Unfortunately the hangs happen with the new set of patches. Here's
some debug info:
https://gist.github.com/raw/4187123/90194ce172130244a9c1c968ed185eee7282d809/gistfile1.txt
On Fri, Nov 30, 2012 at 3:22 PM, Alex Elder el...@inktank.com wrote:
On 11/29/2012 02:37 PM, Alex Elder wrote:
My initial tests using a 3.5.7 kernel with the 55 patches from
wip-nick are going well. So far I've gone through 8 installs without
an incident, I'll leave it run for a bit longer to see if it crops up
again.
Can I get a branch with these patches integrated into all of the
backported patches to
On 11/30/2012 12:49 PM, Nick Bartos wrote:
This is great news! Now I wonder which
On 11/29/2012 02:37 PM, Alex Elder wrote:
On 11/22/2012 12:04 PM, Nick Bartos wrote:
I'm sorry, but I did something stupid...
On 11/22/2012 12:04 PM, Nick Bartos wrote:
Nick, I have put together a branch that includes two fixes
that might be helpful. I don't
Here are the ceph log messages (including the libceph kernel debug
stuff you asked for) from a node boot with the rbd command hung for a
couple of minutes:
https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos wrote:
It's very easy to reproduce now with my automated install script, the
most I've seen it succeed with that patch is 2 in a row, and hanging
on the 3rd, although it hangs on most builds. So it shouldn't take
much to get it to do it again. I'll try and get to that tomorrow,
when I'm a bit more
On Wed, 21 Nov 2012, Nick Bartos wrote:
FYI the build which included all 3.5 backports except patch #50 is
still going strong after 21 builds.
Okay, that one at least makes some sense. I've opened
http://tracker.newdream.net/issues/3519
How easy is this to reproduce? If it is
On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos n...@pistoncloud.com wrote:
With 8 successful installs already done, I'm reasonably confident that
it's patch #50. I'm making another build which
On Tue, 20 Nov 2012, Nick Bartos wrote:
Since I now have a decent script which can reproduce this, I decided
to re-test with the same 3.5.7 kernel, but just not applying the
patches from the wip-3.5 branch. With the patches, I can only go 2
builds before I run into a hang. Without the
It's really looking like it's the
libceph_resubmit_linger_ops_when_pg_mapping_changes commit. When
patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
So far I have gone through 4 successful installs with no hang with
only 1-49 applied. I'm still leaving my test run to make
With 8 successful installs already done, I'm reasonably confident that
it's patch #50. I'm making another build which applies all patches
from the 3.5 backport branch, excluding that specific one. I'll let
you know if that turns up any unexpected failures.
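The methodology in this thread (rebuild with a given patch subset, then count how many consecutive installs pass before the first hang) can be scripted roughly as follows. This is only a sketch; install_and_map.sh is a hypothetical stand-in for whatever drives one install plus the 'rbd map' step, and it should exit non-zero on a hang:

  # Sketch only: repeat the install/map test and report how many runs pass
  # before the first hang, so different patch subsets can be compared.
  import subprocess

  MAX_RUNS = 25
  passes = 0
  for run in range(MAX_RUNS):
      if subprocess.call(["./install_and_map.sh"]) != 0:
          print("hang after %d clean runs" % passes)
          break
      passes += 1
  else:
      print("%d consecutive clean runs, no hang" % MAX_RUNS)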
What will the potential fall out be
I reproduced the problem and got several sysrq states captured.
During this run, the monitor running on the host complained a few
times about the clocks being off, but all messages were for under 0.55
seconds.
Here are the kernel logs. Note that there are several traces, I
thought multiple
Since I now have a decent script which can reproduce this, I decided
to re-test with the same 3.5.7 kernel, but just not applying the
patches from the wip-3.5 branch. With the patches, I can only go 2
builds before I run into a hang. Without the patches, I have gone 9
consecutive builds (and
Making 'mon clock drift allowed' very small (0.1) does not
reliably reproduce the hang. I started looking at the code for 0.48.2
and it looks like this is only used in Paxos::warn_on_future_time,
which only handles the warning, nothing else.
On Fri, Nov 16, 2012 at 2:21 PM, Sage Weil wrote:
Turns out we're having the 'rbd map' hang on startup again, after we
started using the wip-3.5 patch set. How critical is the
libceph_protect_ceph_con_open_with_mutex commit? That's the one I
removed before which seemed to get rid of the problem (although I'm
not completely sure if it completely
I just realized I was mixing up this thread with the other deadlock
thread.
How far off do the clocks need to be before there is a problem? It
would seem to be hard to ensure a very large cluster has all of its
nodes synchronized within 50ms (which seems to be the default for mon
clock drift allowed). Does the mon clock drift allowed parameter
change anything other
Should I be lowering the clock drift allowed, or the lease interval to
help reproduce it?
On Fri, Nov 16, 2012 at 2:13 PM, Sage Weil s...@inktank.com wrote:
You can safely set the clock drift allowed as high as 500ms. The real
limitation is that it needs to be well under the lease interval,
To be clear, the monitor cluster needs to be within this clock drift —
the rest of the Ceph cluster can be off by as much as you care to.
(Well, there's also a limit imposed by cephx authorization which can
keep nodes out of the cluster, but that drift allowance is measured in
units of hours.)
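For reference, both settings being discussed live in the [mon] section of ceph.conf; the values below are only illustrative, the point being to keep the allowed drift well under the lease interval:

  [mon]
      # monitors must agree on time to within this many seconds
      mon clock drift allowed = 0.5
      # paxos lease length in seconds; drift must stay well below this
      mon lease = 5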
Sorry I guess this e-mail got missed. I believe those patches came
from the ceph/linux-3.5.5-ceph branch. I'm now using the wip-3.5
branch patches, which seem to all be fine. We'll stick with 3.5 and
this backport for now until we can figure out what's wrong with 3.6.
I typically ignore the
After removing 8-libceph-protect-ceph_con_open-with-mutex.patch, it
seems we no longer have this hang.
On Thu, Nov 8, 2012 at 5:43 PM, Josh Durgin josh.dur...@inktank.com wrote:
On 11/08/2012 02:10 PM, Mandell Degerness wrote:
We are seeing a somewhat random, but frequent hang on our systems
On Mon, 12 Nov 2012, Nick Bartos wrote:
Hmm, that's a bit disconcerting. Did this series come from our old 3.5
stable series? I recently prepared a new one that backports *all* of the
We are seeing a somewhat random, but frequent hang on our systems
during startup. The hang happens at the point where an 'rbd map
rbdvol' command is run.
I've attached the ceph logs from the cluster. The map command happens
at Nov 8 18:41:09 on server 172.18.0.15. The process which hung can be