Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-18 Thread Gregory Farnum
It's a little strange, but with just the one-sided log it looks as
though the OSD is setting up a bunch of connections and then
deliberately tearing them down again within a second or two (i.e., this
is not a direct messenger bug, but it might be an OSD one, or it might
be something else).
Is it possible that you have some firewalls set up that are allowing
through some traffic but not others? The OSDs use a bunch of ports and
it looks like maybe there are at least intermittent issues with them
heartbeating.
-Greg
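
A rough way to test the firewall theory above is to probe the OSD ports from another node. This is only a sketch: the helper names are made up for illustration, the ports 6819 and 6830 are simply the ones that show up in the log lines in this thread, and the 6800-7300 range mentioned in the comment is an assumption about the default OSD bind range, so check your own configuration.

```python
import socket


def port_is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable_ports(host, ports, timeout=1.0):
    """Return the subset of ports that could not be connected to."""
    return [p for p in ports if not port_is_reachable(host, p, timeout)]


# Example (not run here): probe the two ports seen in the log lines.
#   print(unreachable_ports("10.2.0.36", [6819, 6830]))
# Assumption: OSDs usually bind somewhere in 6800-7300, so a wider sweep
# would be range(6800, 7301).
```

A successful connect only shows that the port is open at that moment. It would not catch the intermittent failures described above unless it is run repeatedly.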

___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-18 Thread Scott Laird
I think I just solved at least part of the problem.

Because of the somewhat peculiar way that I have Docker configured, Docker
instances on another system were being assigned my OSD's IP address,
running for a couple seconds, and then failing (for unrelated reasons).
Effectively, there was something sitting on the network throwing random
RSTs at my TCP connections and then vanishing.
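
For anyone chasing a similar address collision, one possible check is to sample the neighbor (ARP) table over time and flag any IP that shows up with more than one MAC address. The sketch below assumes the samples have already been collected as (ip, mac) pairs; how they are collected, the function name, and the MAC values in the comment are all hypothetical.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC.

    observations is an iterable of (ip, mac) string pairs, for example
    gathered by sampling the neighbor table every few seconds.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Example with invented MAC addresses:
#   find_ip_conflicts([
#       ("10.2.0.36", "52:54:00:aa:bb:cc"),
#       ("10.2.0.36", "02:42:0a:02:00:24"),
#       ("10.2.0.34", "52:54:00:dd:ee:ff"),
#   ])
# reports 10.2.0.36 because two different MACs claimed it.
```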

Amazingly, Ceph seems to have been able to handle it *just* well enough to
make it non-obvious that the problem was external and network related.

That doesn't quite explain the issues with local OSDs acting up, though.

For now, I've moved all of my OSDs back to Ubuntu; it's more work to
manage, but on the other hand it's actually working.


Scott


Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-12 Thread Gregory Farnum
On Tue, Nov 11, 2014 at 6:28 PM, Scott Laird sc...@sigkill.org wrote:
 I'm having a problem with my cluster.  It's running 0.87 right now, but I
 saw the same behavior with 0.80.5 and 0.80.7.

 The problem is that my logs are filling up with "replacing existing (lossy)
 channel" log lines (see below), to the point where I'm filling drives to
 100% almost daily just with logs.

 It doesn't appear to be network related, because it happens even when
 talking to other OSDs on the same host.

Well, that means it's probably not physical network related, but there
can still be plenty wrong with the networking stack... ;)

 The logs pretty much all point to
 port 0 on the remote end.  Is this an indicator that it's failing to resolve
 port numbers somehow, or is this normal at this point in connection setup?

That's definitely unusual, but I'd need to see a little more to be
sure if it's bad. My guess is that these pipes are connections from
the other OSD's Objecter, which is treated as a regular client and
doesn't bind to a socket for incoming connections.

The repetitive channel replacements are concerning, though — they can
be harmless in some circumstances but this looks more like the
connection is simply failing to establish and so it's retrying over
and over again. Can you restart the OSDs with debug ms = 10 in their
config file and post the logs somewhere? (There is not really any
documentation available on what they mean, but the deeper detail ones
might also be more understandable to you.)
-Greg


 The systems that are causing this problem are somewhat unusual; they're
 running OSDs in Docker containers, but they *should* be configured to run as
 root and have full access to the host's network stack.  They manage to work,
 mostly, but things are still really flaky.

 Also, is there documentation on what the various fields mean, short of
 digging through the source?  And how does Ceph resolve OSD numbers into
 host/port addresses?


 2014-11-12 01:50:40.802604 7f7828db8700  0 -- 10.2.0.36:6819/1 >>
 10.2.0.36:0/1 pipe(0x1ce31c80 sd=135 :6819 s=0 pgs=0 cs=0 l=1
 c=0x1e070580).accept replacing existing (lossy) channel (new one lossy=1)

 2014-11-12 01:50:40.802708 7f7816538700  0 -- 10.2.0.36:6830/1 >>
 10.2.0.36:0/1 pipe(0x1ff61080 sd=120 :6830 s=0 pgs=0 cs=0 l=1
 c=0x1f3db2e0).accept replacing existing (lossy) channel (new one lossy=1)

 2014-11-12 01:50:40.803346 7f781ba8d700  0 -- 10.2.0.36:6819/1 >>
 10.2.0.36:0/1 pipe(0x1ce31180 sd=125 :6819 s=0 pgs=0 cs=0 l=1
 c=0x1e070420).accept replacing existing (lossy) channel (new one lossy=1)

 2014-11-12 01:50:40.803944 7f781996c700  0 -- 10.2.0.36:6830/1 >>
 10.2.0.36:0/1 pipe(0x1ff618c0 sd=107 :6830 s=0 pgs=0 cs=0 l=1
 c=0x1f3d8420).accept replacing existing (lossy) channel (new one lossy=1)

 2014-11-12 01:50:40.804185 7f7816538700  0 -- 10.2.0.36:6819/1 >>
 10.2.0.36:0/1 pipe(0x1ffd1e40 sd=20 :6819 s=0 pgs=0 cs=0 l=1
 c=0x1e070840).accept replacing existing (lossy) channel (new one lossy=1)

 2014-11-12 01:50:40.805235 7f7813407700  0 -- 10.2.0.36:6819/1 >>
 10.2.0.36:0/1 pipe(0x1ffd1340 sd=60 :6819 s=0 pgs=0 cs=0 l=1
 c=0x1b2d6260).accept replacing existing (lossy) channel (new one lossy=1)

 2014-11-12 01:50:40.806364 7f781bc8f700  0 -- 10.2.0.36:6819/1 >>
 10.2.0.36:0/1 pipe(0x1ffd0b00 sd=162 :6819 s=0 pgs=0 cs=0 l=1
 c=0x675c580).accept replacing existing (lossy) channel (new one lossy=1)

 2014-11-12 01:50:40.806425 7f781aa7d700  0 -- 10.2.0.36:6830/1 >>
 10.2.0.36:0/1 pipe(0x1db29600 sd=143 :6830 s=0 pgs=0 cs=0 l=1
 c=0x1f3d9600).accept replacing existing (lossy) channel (new one lossy=1)





Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-12 Thread Scott Laird
Here are the first 33k lines or so:
https://dl.dropboxusercontent.com/u/104949139/ceph-osd-log.txt

This is a different (but more or less identical) machine from the past set
of logs.  This system doesn't have quite as many drives in it, so I
couldn't spot a same-host error burst, but it's logging tons of the same
errors while trying to talk to 10.2.0.34.
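
To see which peers account for most of the noise in a log like this, the messages can be tallied per remote IP. This is a minimal sketch, assuming each entry sits on one line in the on-disk log and that addresses are IPv4 in the ip:port/nonce form shown in this thread; the function name and the log path in the comment are illustrative only.

```python
import re
from collections import Counter

# Matches "-- <local addr> >> <peer ip>:<port>/<nonce> pipe(".
# The ">>" is optional because some mail archives strip it.
PEER_RE = re.compile(r"-- \S+ (?:>> )?(?P<peer_ip>[^:\s]+):\d+/\d+ pipe\(")
NEEDLE = "replacing existing (lossy) channel"


def count_replacements_by_peer(lines):
    """Count 'replacing existing (lossy) channel' messages per peer IP."""
    counts = Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        match = PEER_RE.search(line)
        if match:
            counts[match.group("peer_ip")] += 1
    return counts


# Example (path is an assumption, adjust to your OSD's log file):
#   with open("/var/log/ceph/ceph-osd.0.log") as f:
#       print(count_replacements_by_peer(f).most_common(5))
```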



[ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-11 Thread Scott Laird
I'm having a problem with my cluster.  It's running 0.87 right now, but I
saw the same behavior with 0.80.5 and 0.80.7.

The problem is that my logs are filling up with "replacing existing (lossy)
channel" log lines (see below), to the point where I'm filling drives to
100% almost daily just with logs.

It doesn't appear to be network related, because it happens even when
talking to other OSDs on the same host.  The logs pretty much all point to
port 0 on the remote end.  Is this an indicator that it's failing to
resolve port numbers somehow, or is this normal at this point in connection
setup?

The systems that are causing this problem are somewhat unusual; they're
running OSDs in Docker containers, but they *should* be configured to run
as root and have full access to the host's network stack.  They manage to
work, mostly, but things are still really flaky.

Also, is there documentation on what the various fields mean, short of
digging through the source?  And how does Ceph resolve OSD numbers into
host/port addresses?


2014-11-12 01:50:40.802604 7f7828db8700  0 -- 10.2.0.36:6819/1 >>
10.2.0.36:0/1 pipe(0x1ce31c80 sd=135 :6819 s=0 pgs=0 cs=0 l=1
c=0x1e070580).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.802708 7f7816538700  0 -- 10.2.0.36:6830/1 >>
10.2.0.36:0/1 pipe(0x1ff61080 sd=120 :6830 s=0 pgs=0 cs=0 l=1
c=0x1f3db2e0).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.803346 7f781ba8d700  0 -- 10.2.0.36:6819/1 >>
10.2.0.36:0/1 pipe(0x1ce31180 sd=125 :6819 s=0 pgs=0 cs=0 l=1
c=0x1e070420).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.803944 7f781996c700  0 -- 10.2.0.36:6830/1 >>
10.2.0.36:0/1 pipe(0x1ff618c0 sd=107 :6830 s=0 pgs=0 cs=0 l=1
c=0x1f3d8420).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.804185 7f7816538700  0 -- 10.2.0.36:6819/1 >>
10.2.0.36:0/1 pipe(0x1ffd1e40 sd=20 :6819 s=0 pgs=0 cs=0 l=1
c=0x1e070840).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.805235 7f7813407700  0 -- 10.2.0.36:6819/1 >>
10.2.0.36:0/1 pipe(0x1ffd1340 sd=60 :6819 s=0 pgs=0 cs=0 l=1
c=0x1b2d6260).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.806364 7f781bc8f700  0 -- 10.2.0.36:6819/1 >>
10.2.0.36:0/1 pipe(0x1ffd0b00 sd=162 :6819 s=0 pgs=0 cs=0 l=1
c=0x675c580).accept replacing existing (lossy) channel (new one lossy=1)

2014-11-12 01:50:40.806425 7f781aa7d700  0 -- 10.2.0.36:6830/1 >>
10.2.0.36:0/1 pipe(0x1db29600 sd=143 :6830 s=0 pgs=0 cs=0 l=1
c=0x1f3d9600).accept replacing existing (lossy) channel (new one lossy=1)
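
On the question of what the fields mean, here is a small sketch that splits one of the entries above into named parts. The field descriptions in the comments are my reading of the messenger's pipe log prefix and should be treated as assumptions, not documentation. The function names are made up, and the parser accepts the entry with or without the ">>" between the two addresses because mail archives tend to strip it.

```python
import re

# Assumed field meanings (not taken from official documentation):
#   local/peer : ip:port/nonce of this daemon and of the remote end
#   pipe, c    : in-memory pointers to the pipe and connection objects
#   sd         : socket file descriptor
#   s          : pipe state, pgs: peer global sequence, cs: connect sequence
#   l          : 1 if the connection policy is lossy
LINE_RE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<thread>[0-9a-f]+) "
    r"(?P<level>-?\d+) "
    r"-- (?P<local>\S+) "
    r"(?:>> )?(?P<peer>\S+) "
    r"pipe\((?P<pipe>0x[0-9a-f]+) sd=(?P<sd>-?\d+) :(?P<port>\d+) "
    r"s=(?P<s>\d+) pgs=(?P<pgs>\d+) cs=(?P<cs>\d+) l=(?P<l>\d+) "
    r"c=(?P<c>0x[0-9a-f]+)\)\.(?P<message>.*)"
)


def parse_pipe_line(text):
    """Parse one (possibly line-wrapped) entry into a dict, or None on mismatch."""
    flat = " ".join(text.split())  # undo mail wrapping and doubled spaces
    match = LINE_RE.match(flat)
    return match.groupdict() if match else None


def split_addr(addr):
    """Split 'ip:port/nonce' into (ip, port, nonce); IPv4 form only."""
    host_port, _, nonce = addr.partition("/")
    host, _, port = host_port.rpartition(":")
    return host, int(port), int(nonce)
```

Feeding it the first entry above gives a peer of 10.2.0.36:0/1, which split_addr breaks into the IP, port 0, and nonce 1.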