On Thu, Nov 2, 2017 at 11:02 PM, Luis Gomez <[email protected]> wrote:
> So from the time I was debugging the issue with Jamo and Thanh, it seems
> there is a memory leak but:
>
> 1) JVM does not kill itself, the OS does instead after the java process
> grows to 3.7G in a VM of 4G RAM (note Xmx is set to 2G but still the jvm
> goes far beyond that).

Huh! Weird. There are various online articles about how a JVM can consume
more than its Xmx; total memory is, AFAIK, some combination of Xmx + thread
stacks + PermGen/Metaspace + native memory + JIT code cache. So:

Hopefully it's not just that you THINK Xmx is 2 GB but it's actually more,
say 4 GB - on a small 4 GB VM I could totally see that happening; this is
the scenario I described earlier in
https://lists.opendaylight.org/pipermail/controller-dev/2017-October/014026.html
It's not that I don't trust you guys, but you're 250% super double extra
sure Xmx is 2G, aren't you? ;) Maybe just do a "ps wwww" to see the full
argument line and double-check. The crash message shown on
https://jira.opendaylight.org/browse/NETVIRT-974 only contains environment
variable names, not the actual values. Would you terribly mind doing that
and sharing the actual full JVM start-up line here, just so that we're sure?

PermGen/Metaspace is loaded classes; that growing out of bounds is
unlikely, IMHO.

Theoretically I guess it could be some JNI native memory usage within the
JVM going crazy. Someone remind me what JNI pieces we have in the puzzle?
Karaf has some console/shell thing that's native, but I'm guessing that's
unlikely. Then there is a native LevelDB, right? What if it's that which
goes crazy?

Or, quote: "Off-heap allocations. If you happen to use off-heap memory, for
example while using direct or mapped ByteBuffers yourself or via some
clever 3rd party API, then voila - you are extending your heap to something
you actually cannot control via JVM configuration." - does anyone know if
we / CDS / the controller / Akka does this kind of thing?
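To make the "Xmx + stacks + Metaspace + native + JIT" point concrete, here is a back-of-envelope sketch; every number other than the 2G Xmx is a guess for a karaf/ODL-sized process, not a measurement:

```shell
# Rough JVM footprint estimate. Only heap_mb reflects the actual -Xmx2g;
# the other component sizes are assumptions for illustration.
heap_mb=2048          # -Xmx2g
metaspace_mb=256      # loaded classes for a large OSGi app (assumption)
threads=400           # karaf + Akka thread count (assumption)
stack_kb=1024         # HotSpot default -Xss on 64-bit Linux is 1 MB
native_mb=512         # JNI, direct ByteBuffers, JIT code cache (guess)

total_mb=$(( heap_mb + metaspace_mb + (threads * stack_kb / 1024) + native_mb ))
echo "estimated RSS ceiling: ${total_mb} MB"   # prints 3216 MB with these guesses
```

With guesses like these, a JVM capped at a 2 GB heap can still plausibly become a >3 GB resident process, which is in the same ballpark as the 3.7G the job is seeing.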
(AFAIK this includes memory-mapped files.)

Thread memory usage I don't think would just blow up like this; even if we
did spawn threads like crazy in a bug (which wouldn't surprise me), that
would lead to an exception and a more orderly termination, AFAIK. I could
write a quick Java test app to confirm that this assumption is correct.

> 2) The issue happens so fast that we did not have time to take a memory
> dump with map.

What is "map"? We would need an HPROF produced by jmap; just a typo?

BTW: could we see an HPROF you take with jmap while it's still running,
just before it goes crazy and gets killed? Maybe, with a bit of luck, we'd
already see something interesting in there..

> So I wonder if there is some java memory combination of parameters to
> prevent OS to kill the JVM, something like if the total java memory (Xmx
> and others)=3G, abort and generate heap dump.

No, I don't think so - as Jamo says, the JVM seems to just get kill -9'd by
the OS, so no JVM parameter will help us; by then it's too late. But if we
can't figure it out using any of the ideas above, then I guess we'll have
to go lower. I'm a little less familiar with the OS-level stuff, but I'm
sure one of you knows, or can figure out, how to get the OS to produce one
of those core dump files when it kills a process. Thing is, I don't know
what to do with such a file and how to analyze its memory content... but I
guess we'll have to find someone who does, once you have it.
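Since the kill happens too fast to take a dump by hand, one option is a tiny watcher that takes the jmap dump automatically once the process's resident size crosses a threshold. This is a sketch, not anything our CSIT jobs have today; the pid lookup, the 3000 MB threshold, and the output path are all assumptions:

```shell
# Sketch: poll the JVM's resident set size via /proc and trigger a heap
# dump just before the kernel OOM killer would strike.
rss_mb() {            # resident set size of a pid, in MB (Linux /proc)
  awk '/^VmRSS:/ { print int($2 / 1024) }' "/proc/$1/status"
}

watch_and_dump() {
  local pid=$1 limit_mb=$2 out=$3
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$(rss_mb "$pid")" -ge "$limit_mb" ]; then
      # jmap must run as the same user as the JVM, from the same JDK
      jmap -dump:live,format=b,file="$out" "$pid"
      return 0
    fi
    sleep 5
  done
}

# e.g.: watch_and_dump "$(pgrep -f karaf)" 3000 /tmp/pre-oom.hprof
```

Note the caveat: if the growth is native (JNI, LevelDB, direct ByteBuffers), the HPROF heap dump may look perfectly healthy; it only covers the Java heap.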
pmap of the JVM process is also mentioned as being useful in some posts
online.

> BR/Luis
>
> > On Nov 2, 2017, at 2:53 PM, Jamo Luhrsen <[email protected]> wrote:
> >
> > On 11/02/2017 02:02 PM, Michael Vorburger wrote:
> >> On Thu, Nov 2, 2017 at 9:32 PM, Jamo Luhrsen <[email protected]> wrote:
> >>>
> >>> +integration-dev
> >>>
> >>> nitrogen SR1 blocker bug for this problem:
> >>>
> >>> https://jira.opendaylight.org/browse/NETVIRT-974
> >>>
> >>> I'm actively debugging in the sandbox, although it's a very heavy
> >>> process (many hours per iteration).
> >>>
> >>> wondering if there are any extra options we can pass to the java
> >>> process that might shed more light. This is a very fast and silent
> >>> death.
> >>
> >> sounds like a memory leak.. we should have a hs_err_pid*.log file and
> >> *have* to have an *.hprof file to know where and find a fix for an
> >> OOM... have you been able to re-double-check if these files aren't
> >> already produced somewhere? How about just doing a dumb:
> >>
> >>     sudo find / -name "hs_err_pid*.log"
> >>     sudo find / -name "*.hprof"
> >>
> >> the hprof should be produced; I can see that in $ODL/bin/karaf we
> >> already have "-XX:+HeapDumpOnOutOfMemoryError" on DEFAULT_JAVA_OPTS...
> >
> > yes, we do this HeapDumpOnOutOfMemoryError.
> >
> >> to fix the folder where it would write the HPROF into, you could add:
> >> -XX:HeapDumpPath=/a/folder/you/can/recover
> >>
> >> can't wait to get my hands on a hs_err_pid*.log & *.hprof from this... ;=)
> >
> > no, there is no hs_err* or *.hprof here. The OS is killing the PID
> > because it's consuming too much memory. I don't think the OS even cares
> > that this is java. Same as me doing a kill -9, I presume.
> >
> > :(
> >
> > JamO
> >
> >>> JamO
> >>>
> >>> On 10/31/2017 06:11 PM, Sam Hague wrote:
> >>>>
> >>>> On Tue, Oct 31, 2017 at 6:44 PM, Anil Vishnoi <[email protected]> wrote:
> >>>>>
> >>>>> is it possible to collect dmesg output? That can give an idea if
> >>>>> it's a JVM native OOM.
> >>>>
> >>>> Yes, we already collect all those for the openstack nodes, so we just
> >>>> need to include it for the ODL node.
> >>>>
> >>>>> On Tue, Oct 31, 2017 at 3:40 PM, Michael Vorburger <[email protected]> wrote:
> >>>>>>
> >>>>>> On Tue, Oct 31, 2017 at 11:02 PM, Jamo Luhrsen <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 10/31/2017 12:22 AM, Michael Vorburger wrote:
> >>>>>>>>
> >>>>>>>> On Tue, Oct 31, 2017 at 12:44 AM, Jamo Luhrsen <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> On 10/30/2017 01:29 PM, Tom Pantelis wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Oct 30, 2017 at 4:25 PM, Sam Hague <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Oct 30, 2017 at 3:02 PM, Tom Pantelis <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Oct 30, 2017 at 2:49 PM, Michael Vorburger <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Sam,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Oct 30, 2017 at 7:45 PM, Sam Hague <[email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Stephen, Michael, Tom,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> do you have any ways to collect debugs when ODL crashes in
> >>>>>>>>>>>>>> CSIT?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> JVMs (almost) never "just crash" without a word... either
> >>>>>>>>>>>>> some code does java.lang.System.exit(), which you may
> >>>>>>>>>>>>> remember we do in the CDS/Akka code somewhere, or there's a
> >>>>>>>>>>>>> bug in the JVM implementation - in which case there should be
> >>>>>>>>>>>>> one of those JVM crash log type things - a file named
> >>>>>>>>>>>>> something like hs_err_pid22607.log in the "current working"
> >>>>>>>>>>>>> directory. Where would that be on these CSIT runs, and are
> >>>>>>>>>>>>> the CSIT JJB jobs set up to preserve such JVM crash log files
> >>>>>>>>>>>>> and copy them over to logs.opendaylight.org ?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Akka will do System.exit() if it encounters an error serious
> >>>>>>>>>>>> enough for that. But it doesn't do it silently. However I
> >>>>>>>>>>>> believe we disabled the automatic exiting in akka.
> >>>>>>>>>>>
> >>>>>>>>>>> Should there be any logs in ODL for this? There is nothing in
> >>>>>>>>>>> the karaf log when this happens. It literally just stops.
> >>>>>>>>>>>
> >>>>>>>>>>> The karaf.console log does say the karaf process was killed:
> >>>>>>>>>>>
> >>>>>>>>>>> /tmp/karaf-0.7.1-SNAPSHOT/bin/karaf: line 422: 11528 Killed
> >>>>>>>>>>> ${KARAF_EXEC} "${JAVA}" ${JAVA_OPTS} "$NON_BLOCKING_PRNG"
> >>>>>>>>>>> -Djava.endorsed.dirs="${JAVA_ENDORSED_DIRS}"
> >>>>>>>>>>> -Djava.ext.dirs="${JAVA_EXT_DIRS}"
> >>>>>>>>>>> -Dkaraf.instances="${KARAF_HOME}/instances"
> >>>>>>>>>>> -Dkaraf.home="${KARAF_HOME}" -Dkaraf.base="${KARAF_BASE}"
> >>>>>>>>>>> -Dkaraf.data="${KARAF_DATA}" -Dkaraf.etc="${KARAF_ETC}"
> >>>>>>>>>>> -Dkaraf.restart.jvm.supported=true
> >>>>>>>>>>> -Djava.io.tmpdir="${KARAF_DATA}/tmp"
> >>>>>>>>>>> -Djava.util.logging.config.file="${KARAF_BASE}/etc/java.util.logging.properties"
> >>>>>>>>>>> ${KARAF_SYSTEM_OPTS} ${KARAF_OPTS} ${OPTS} "$@" -classpath
> >>>>>>>>>>> "${CLASSPATH}" ${MAIN}
> >>>>>>>>>>>
> >>>>>>>>>>> In the CSIT robot files we can see the below connection errors,
> >>>>>>>>>>> so ODL is not responding to new requests. This plus the above
> >>>>>>>>>>> leads us to think ODL just died.
> >>>>>>>>>>>
> >>>>>>>>>>> [ WARN ] Retrying (Retry(total=2, connect=None, read=None,
> >>>>>>>>>>> redirect=None, status=None)) after connection broken by
> >>>>>>>>>>> 'NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection
> >>>>>>>>>>> object at 0x5ca2d50>: Failed to establish a new connection:
> >>>>>>>>>>> [Errno 111] Connection refused',)'
> >>>>>>>>>>
> >>>>>>>>>> That would seem to indicate something did a kill -9. As Michael
> >>>>>>>>>> said, if the JVM crashed there would be an hs_err_pid file and
> >>>>>>>>>> it would log a message about it.
> >>>>>>>>>
> >>>>>>>>> yeah, this is where my money is at as well. The OS must be
> >>>>>>>>> dumping it because it's misbehaving. I'll try to hack the job to
> >>>>>>>>> start collecting os level log info (e.g. journalctl, etc)
> >>>>>>>>
> >>>>>>>> JamO, do make sure you collect not just OS level but also the
> >>>>>>>> JVM's hs_err_*.log file (if any); my bet is a JVM more than an OS
> >>>>>>>> level crash...
> >>>>>>>
> >>>>>>> where are these hs_err_*.log files going to be?
> >>>>>>
> >>>>>> they would be in the "current working directory", i.e. what was the
> >>>>>> "pwd" when the JVM was started..
> >>>>>>>
> >>>>>>> This is such a dragged out process to debug. These jobs take 3+
> >>>>>>> hours and our problem only comes sporadically. ...sigh...
> >>>>>>>
> >>>>>>> But, good news is that I think we've confirmed it's an OOM - but an
> >>>>>>> OOM from the OS perspective, if I'm not mistaken.
> >>>>>>
> >>>>>> OK, that kind of thing could happen if you ran an ODL JVM in this
> >>>>>> kind of situation:
> >>>>>>
> >>>>>> * VM with say 4 GB of RAM, and no swap
> >>>>>> * JVM like ODL starts with Xms 1 GB and Xmx 2 GB, so it reserves 1
> >>>>>>   and plans to expand to 2, when needed
> >>>>>> * other stuff eats up the remaining e.g. 3 GB
> >>>>>> * JVM wants to expand, asks OS for 1 GB, but there is none left -
> >>>>>>   so boum
> >>>>>>
> >>>>>> but AFAIK (I'm not 100% sure) there would still be one of those
> >>>>>> hs_err_*.log files with some details confirming the above (like
> >>>>>> "out of native memory", kind of thing).
> >>>>>>>
> >>>>>>> here's what I saw in a sandbox job [a] that just hit this:
> >>>>>>>
> >>>>>>>     Out of memory: Kill process 11546 (java) score 933 or sacrifice child
> >>>>>>>
> >>>>>>> (more debug output is there in the console log)
> >>>>>>>
> >>>>>>> These ODL systems start with 4G and we are setting the max mem for
> >>>>>>> the odl java process to be 2G.
> >>>>>>
> >>>>>> erm, I'm not quite following what is 2 and what is 3 here.. but
> >>>>>> does my description above help you to narrow this down?
> >>>>>>>
> >>>>>>> I don't think we see this with Carbon, which makes me believe it's
> >>>>>>> *not* some problem from outside of ODL (e.g. not a kernel bug from
> >>>>>>> when we updated the java builder image back on 10/20)
> >>>>>>>
> >>>>>>> I'll keep digging at this. Ideas are welcome for things to look at.
> >>>>>>>
> >>>>>>> [a]
> >>>>>>> https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-pike-jamo-upstream-stateful-snat-conntrack-oxygen/7/consoleFull
> >>>>>>>>
> >>>>>>>> BTW: The most common fix ;) for JVM crashes often is simply
> >>>>>>>> upgrading to the latest available patch version of OpenJDK.. but
> >>>>>>>> I'm guessing/hoping we run from RPM and already have the latest -
> >>>>>>>> or is this possibly running on an older JVM version package that
> >>>>>>>> was somehow "held back" via special dnf instructions, or manually
> >>>>>>>> installed from a ZIP, kind of thing?
> >>>>>>>
> >>>>>>> these systems are built and updated periodically. jdk is installed
> >>>>>>> with "yum install". The specific version in [a] is:
> >>>>>>>
> >>>>>>>     10:57:33 Set Java version
> >>>>>>>     10:57:34 JDK default version...
> >>>>>>>     10:57:34 openjdk version "1.8.0_144"
> >>>>>>>     10:57:34 OpenJDK Runtime Environment (build 1.8.0_144-b01)
> >>>>>>>     10:57:34 OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)
> >>>>>>
> >>>>>> OK, that seems to be the latest one I also have locally on Fedora 26.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> JamO
> >>>>>>>>>
> >>>>>>>>> JamO
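Since the kernel's verdict is what we have to work with here: the "score 933" in the sandbox log is the kernel's per-process OOM victim-selection score, which can be inspected live. A small sketch, where the pid choice is an assumption (adapt to however you find the karaf process):

```shell
# Inspect the kernel OOM-killer's view of a process before it strikes.
# Higher oom_score means a more likely kill victim.
pid=$$                       # in practice: pid=$(pgrep -f karaf)
echo "oom_score: $(cat /proc/$pid/oom_score)"

# After a kill, the kernel's reasoning (per-process memory table, scores)
# is in the ring buffer / journal:
#   dmesg | grep -B 5 -A 15 'Out of memory'
#   journalctl -k | grep -i oom
# Note: the OOM killer delivers SIGKILL, which never produces a core dump,
# so "ulimit -c unlimited" only helps for other kinds of crashes.
```

The dmesg excerpt is worth archiving in the CSIT jobs regardless, since it lists every process's RSS at kill time and so shows whether the java process alone grew to 3.7G or something else ate the rest of the VM.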
_______________________________________________
controller-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/controller-dev
