On Thu, Nov 2, 2017 at 9:32 PM, Jamo Luhrsen <[email protected]> wrote:

> +integration-dev
>
> nitrogen SR1 blocker bug for this problem:
>
> https://jira.opendaylight.org/browse/NETVIRT-974
>
> I'm actively debugging in the sandbox, although it's a very heavy process
> (many hours per iteration).
>
> wondering if there are any extra options we can pass to the java process
> that might shed more light. This is a very fast and silent death.
>

sounds like a memory leak.. we should have an hs_err_pid*.log file, and we
*have* to have an *.hprof file to find out where the leak is and fix an OOM...

have you been able to double-check whether these files aren't already being
produced somewhere? How about just doing a dumb:

sudo find / -name "hs_err_pid*.log"
sudo find / -name "*.hprof"
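
in case the kernel's OOM killer is involved (it sends SIGKILL, which the JVM
cannot handle, so in that case neither file would get written), the kernel
log is worth grepping too. A rough sketch, not tested on the CSIT nodes: the
function name is made up, and the pattern only matches the "Out of memory:
Kill process ..." wording quoted further down in this thread; other kernel
versions may word it differently:

```shell
# oom_victims: hypothetical helper. Reads kernel log text on stdin and
# prints "<pid> <process name>" for every OOM-killer victim it recognises.
oom_victims() {
    sed -n 's/.*Out of memory: Kill process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Assumed usage on the ODL node (dmesg may need root):
#   sudo dmesg | oom_victims
```

if that prints a java PID matching the ODL process, the kill came from the
OS and not from inside the JVM.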

the hprof should be produced, because I can see that in $ODL/bin/karaf we
already have "-XX:+HeapDumpOnOutOfMemoryError" in DEFAULT_JAVA_OPTS...

to pin down the folder where it would write the HPROF into, you could add:
-XX:HeapDumpPath=/a/folder/you/can/recover
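
put together, something like the below is what I have in mind. Treat it as a
sketch: the function name is made up, the directory is a placeholder, and it
assumes that bin/karaf appends an EXTRA_JAVA_OPTS environment variable to the
JVM options (please check the script of the distribution under test).
-XX:ErrorFile is the HotSpot option that relocates the hs_err file, with %p
standing for the pid:

```shell
# crash_dump_opts: hypothetical helper. Prints JVM options that send both the
# heap dump and the fatal error log to the directory given as first argument.
crash_dump_opts() {
    dump_dir=$1
    printf '%s %s %s\n' \
        "-XX:+HeapDumpOnOutOfMemoryError" \
        "-XX:HeapDumpPath=${dump_dir}" \
        "-XX:ErrorFile=${dump_dir}/hs_err_pid%p.log"
}

# Assumed usage before starting ODL (EXTRA_JAVA_OPTS support is an assumption):
#   mkdir -p /a/folder/you/can/recover
#   export EXTRA_JAVA_OPTS="$(crash_dump_opts /a/folder/you/can/recover)"
#   $ODL/bin/karaf
```

having both files land in one known folder would make it easier for the job
to archive them.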

can't wait to get my hands on a hs_err_pid*.log & *.hprof from this... ;=)

> JamO
>
>
>
>
> On 10/31/2017 06:11 PM, Sam Hague wrote:
> >
> >
> > On Tue, Oct 31, 2017 at 6:44 PM, Anil Vishnoi <[email protected]> wrote:
> >
> >     is it possible to collect dmesg output? That can give an idea if
> >     it's a JVM native OOM.
> >
> > Yes, we already collect all those for the openstack nodes, so we just
> > need to include it for the ODL node.
> >
> >
> >     On Tue, Oct 31, 2017 at 3:40 PM, Michael Vorburger <[email protected]> wrote:
> >
> >         On Tue, Oct 31, 2017 at 11:02 PM, Jamo Luhrsen <[email protected]> wrote:
> >
> >             On 10/31/2017 12:22 AM, Michael Vorburger wrote:
> >             > On Tue, Oct 31, 2017 at 12:44 AM, Jamo Luhrsen <[email protected]> wrote:
> >             >
> >             >     On 10/30/2017 01:29 PM, Tom Pantelis wrote:
> >             >     > On Mon, Oct 30, 2017 at 4:25 PM, Sam Hague <[email protected]> wrote:
> >             >     >     On Mon, Oct 30, 2017 at 3:02 PM, Tom Pantelis <[email protected]> wrote:
> >             >     >         On Mon, Oct 30, 2017 at 2:49 PM, Michael Vorburger <[email protected]> wrote:
> >             >     >
> >             >     >             Hi Sam,
> >             >     >
> >             >     >             On Mon, Oct 30, 2017 at 7:45 PM, Sam Hague <[email protected]> wrote:
> >             >     >
> >             >     >                 Stephen, Michael, Tom,
> >             >     >
> >             >     >                 do you have any ways to collect debugs when ODL crashes in CSIT?
> >             >     >
> >             >     >
> >             >     >             JVMs (almost) never "just crash" without a word... either some code
> >             >     >             does java.lang.System.exit(), which you may remember we do in the
> >             >     >             CDS/Akka code somewhere, or there's a bug in the JVM implementation -
> >             >     >             in which case there should be one of those JVM crash log type
> >             >     >             things - a file named something like hs_err_pid22607.log in the
> >             >     >             "current working" directory. Where would that be on these CSIT runs,
> >             >     >             and are the CSIT JJB jobs set up to preserve such JVM crash log files
> >             >     >             and copy them over to logs.opendaylight.org ?
> >             >     >
> >             >     >
> >             >     >         Akka will do System.exit() if it encounters an error serious enough for that.
> >             >     >         But it doesn't do it silently. However I believe we disabled the automatic
> >             >     >         exiting in akka.
> >             >     >
> >             >     >     Should there be any logs in ODL for this? There is nothing in the karaf log
> >             >     >     when this happens. It literally just stops.
> >             >     >
> >             >     >     The karaf.console log does say the karaf process was killed:
> >             >     >
> >             >     >     /tmp/karaf-0.7.1-SNAPSHOT/bin/karaf: line 422: 11528 Killed ${KARAF_EXEC} "${JAVA}" ${JAVA_OPTS} "$NON_BLOCKING_PRNG"
> >             >     >     -Djava.endorsed.dirs="${JAVA_ENDORSED_DIRS}" -Djava.ext.dirs="${JAVA_EXT_DIRS}"
> >             >     >     -Dkaraf.instances="${KARAF_HOME}/instances" -Dkaraf.home="${KARAF_HOME}" -Dkaraf.base="${KARAF_BASE}"
> >             >     >     -Dkaraf.data="${KARAF_DATA}" -Dkaraf.etc="${KARAF_ETC}" -Dkaraf.restart.jvm.supported=true
> >             >     >     -Djava.io.tmpdir="${KARAF_DATA}/tmp" -Djava.util.logging.config.file="${KARAF_BASE}/etc/java.util.logging.properties"
> >             >     >     ${KARAF_SYSTEM_OPTS} ${KARAF_OPTS} ${OPTS} "$@" -classpath "${CLASSPATH}" ${MAIN}
> >             >     >
> >             >     >     In the CSIT robot files we can see the below connection errors so ODL is not
> >             >     >     responding to new requests. This plus the above lead me to think ODL just died.
> >             >     >
> >             >     >     [ WARN ] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by
> >             >     >     'NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x5ca2d50>:
> >             >     >     Failed to establish a new connection: [Errno 111] Connection refused',)'
> >             >     >
> >             >     >
> >             >     >
> >             >     > That would seem to indicate something did a kill -9.  As Michael said, if the JVM
> >             >     > crashed there would be an hs_err_pid file and it would log a message about it
> >             >
> >             >     yeah, this is where my money is at as well. The OS must be dumping it because it's
> >             >     misbehaving. I'll try to hack the job to start collecting os level log info
> >             >     (e.g. journalctl, etc)
> >             >
> >             >
> >             > JamO, do make sure you collect not just OS level but also the JVM's hs_err_*.log
> >             > file (if any); my bet is a JVM more than an OS level crash...
> >
> >             where are these hs_err_*.log files going to be?
> >
> >
> >         they would be in the "current working directory", like what was
> >         the "pwd" when the JVM was started..
> >
> >
> >             This is such a dragged out process to debug. These
> >             jobs take 3+ hours and our problem only comes sporadically. ...sigh...
> >
> >             But, good news is that I think we've confirmed it's an oom,
> >             but an OOM from the OS perspective, if I'm not mistaken.
> >
> >
> >         OK that kind of thing could happen if you ran an ODL JVM in this kind of situation:
> >
> >         * VM with say 4 GB of RAM, and no swap
> >         * JVM like ODL starts with Xms 1 GB and Xmx 2 GB, so reserves 1 and plans to expand to 2, when needed
> >         * other stuff eats up remaining e.g. 3 GB
> >         * JVM wants to expand, asks OS for 1 GB, but there is none left - so boum
> >
> >         but AFAIK (I'm not 100% sure) there would still be one of those hs_err_*.log files
> >         with some details confirming above (like "out of native memory", kind of thing).
> >
> >
> >             here's what I saw in a sandbox job [a] that just hit this:
> >
> >             Out of memory: Kill process 11546 (java) score 933 or sacrifice child
> >             (more debug output is there in the console log)
> >
> >             These ODL systems start with 4G and we are setting the max mem for the odl java
> >             process to be 2G.
> >
> >
> >         erm, I'm not quite following what is 2 and what is 3 here.. but
> >         does my description above help you to narrow this down?
> >
> >
> >             I don't think we see this with Carbon, which makes me believe it's *not* some
> >             problem from outside of ODL (e.g. not a kernel bug from when we updated the
> >             java builder image back on 10/20)
> >
> >             I'll keep digging at this. Ideas are welcome for things to look at.
> >
> >
> >
> >             [a]
> >             https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-pike-jamo-upstream-stateful-snat-conntrack-oxygen/7/consoleFull
> >
> >
> >
> >
> >
> >             > BTW: The most common fix ;) for JVM crashes often is simply upgrading to the
> >             > latest available patch version of OpenJDK.. but I'm guessing/hoping we run
> >             > from RPM and already have the latest - or is this possibly running on an
> >             > older JVM version package that was somehow "held back" via special dnf
> >             > instructions, or manually installed from a ZIP, kind of thing?
> >
> >
> >             these systems are built and updated periodically. jdk is installed with
> >             "yum install". The specific version in [a] is:
> >
> >             10:57:33 Set Java version
> >             10:57:34 JDK default version...
> >             10:57:34 openjdk version "1.8.0_144"
> >             10:57:34 OpenJDK Runtime Environment (build 1.8.0_144-b01)
> >             10:57:34 OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)
> >
> >
> >         OK, that seems to be the latest one I also have locally on Fedora 26.
> >
> >             Thanks,
> >             JamO
> >
> >
> >
> >             >     JamO
> >             >
> >             >
> >             >     >
> >             >     > _______________________________________________
> >             >     > controller-dev mailing list
> >             >     > [email protected]
> >             >     > https://lists.opendaylight.org/mailman/listinfo/controller-dev
> >             >     >
> >             >
> >             >
> >
> >
> >
> >
> >
> >
> >
> >     --
> >     Thanks
> >     Anil
> >
> >
> >
> >
> >
> >
>
