From the time I spent debugging the issue with Jamo and Thanh, it seems there is a memory leak, but:

1) The JVM does not kill itself; the OS kills it once the java process grows
   to 3.7G in a VM with 4G of RAM (note that Xmx is set to 2G, but the JVM
   still goes far beyond that).
2) The issue happens so fast that we did not have time to take a memory dump
   with jmap.

So I wonder if there is some combination of Java memory parameters that would
prevent the OS from killing the JVM, something like: if the total Java memory
(Xmx and others) reaches 3G, abort and generate a heap dump.

BR/Luis

> On Nov 2, 2017, at 2:53 PM, Jamo Luhrsen <[email protected]> wrote:
>
> On 11/02/2017 02:02 PM, Michael Vorburger wrote:
>> On Thu, Nov 2, 2017 at 9:32 PM, Jamo Luhrsen <[email protected]> wrote:
>>
>>> +integration-dev
>>>
>>> nitrogen SR1 blocker bug for this problem:
>>>
>>> https://jira.opendaylight.org/browse/NETVIRT-974
>>>
>>> I'm actively debugging in the sandbox, although it's a very heavy process
>>> (many hours per iteration).
>>>
>>> wondering if there are any extra options we can pass to the java process
>>> that might shed more light. This is a very fast and silent death.
>>
>> sounds like a memory leak.. we should have a hs_err_pid*.log file and *have*
>> to have an *.hprof file to know where and find a fix for an OOM...
>> have you been able to re-double-check if these files aren't already produced
>> somewhere? How about just doing a dumb:
>>
>> sudo find / -name "hs_err_pid*.log"
>> sudo find / -name "*.hprof"
>>
>> the hprof should be produced, as I can see that in $ODL/bin/karaf we already
>> have "-XX:+HeapDumpOnOutOfMemoryError" on DEFAULT_JAVA_OPTS...
>
> yes, we do this HeapDumpOnOutOfMemoryError.
>
>> to fix the folder where it would write the HPROF into, you could add:
>> -XX:HeapDumpPath=/a/folder/you/can/recover
>>
>> can't wait to get my hands on a hs_err_pid*.log & *.hprof from this... ;=)
>
> no there is no hs_err* or *hprof here. The OS is killing the PID because it's
> consuming too much memory.
> I don't think the OS even cares that this is java.
> Same as me doing a kill -9, I presume.
>
> :(
>
> JamO
>
>> JamO
>>
>> On 10/31/2017 06:11 PM, Sam Hague wrote:
>>> On Tue, Oct 31, 2017 at 6:44 PM, Anil Vishnoi <[email protected]> wrote:
>>>
>>> is it possible to collect dmesg output? That can give an idea if it's
>>> a JVM native OOM.
>>>
>>> Yes, we already collect all those for the openstack nodes, so we just need
>>> to include it for the ODL node.
>>>
>>> On Tue, Oct 31, 2017 at 3:40 PM, Michael Vorburger <[email protected]> wrote:
>>>
>>> On Tue, Oct 31, 2017 at 11:02 PM, Jamo Luhrsen <[email protected]> wrote:
>>>
>>> On 10/31/2017 12:22 AM, Michael Vorburger wrote:
>>> > On Tue, Oct 31, 2017 at 12:44 AM, Jamo Luhrsen <[email protected]> wrote:
>>> >
>>> > On 10/30/2017 01:29 PM, Tom Pantelis wrote:
>>> > > On Mon, Oct 30, 2017 at 4:25 PM, Sam Hague <[email protected]> wrote:
>>> > > On Mon, Oct 30, 2017 at 3:02 PM, Tom Pantelis <[email protected]> wrote:
>>> > > On Mon, Oct 30, 2017 at 2:49 PM, Michael Vorburger <[email protected]> wrote:
>>> > >
>>> > > Hi Sam,
>>> > >
>>> > > On Mon, Oct 30, 2017 at 7:45 PM, Sam Hague <[email protected]> wrote:
>>> > >
>>> > > Stephen, Michael, Tom,
>>> > >
>>> > > do you have any ways to collect debugs when ODL crashes in CSIT?
>>> > >
>>> > > JVMs (almost) never "just crash" without a word... either some code
>>> > > does java.lang.System.exit(), which you may remember we do in the
>>> > > CDS/Akka code somewhere, or there's a bug in the JVM implementation -
>>> > > in which case there should be one of those JVM crash logs type
>>> > > things - a file named something like hs_err_pid22607.log in the
>>> > > "current working" directory. Where would that be on these CSIT runs,
>>> > > and are the CSIT JJB jobs set up to preserve such JVM crash log files
>>> > > and copy them over to logs.opendaylight.org?
>>> > >
>>> > > Akka will do System.exit() if it encounters an error serious enough
>>> > > for that. But it doesn't do it silently. However I believe we disabled
>>> > > the automatic exiting in akka.
>>> > >
>>> > > Should there be any logs in ODL for this? There is nothing in the
>>> > > karaf log when this happens. It literally just stops.
>>> > >
>>> > > The karaf.console log does say the karaf process was killed:
>>> > >
>>> > > /tmp/karaf-0.7.1-SNAPSHOT/bin/karaf: line 422: 11528 Killed
>>> > > ${KARAF_EXEC} "${JAVA}" ${JAVA_OPTS} "$NON_BLOCKING_PRNG"
>>> > > -Djava.endorsed.dirs="${JAVA_ENDORSED_DIRS}"
>>> > > -Djava.ext.dirs="${JAVA_EXT_DIRS}"
>>> > > -Dkaraf.instances="${KARAF_HOME}/instances"
>>> > > -Dkaraf.home="${KARAF_HOME}" -Dkaraf.base="${KARAF_BASE}"
>>> > > -Dkaraf.data="${KARAF_DATA}" -Dkaraf.etc="${KARAF_ETC}"
>>> > > -Dkaraf.restart.jvm.supported=true
>>> > > -Djava.io.tmpdir="${KARAF_DATA}/tmp"
>>> > > -Djava.util.logging.config.file="${KARAF_BASE}/etc/java.util.logging.properties"
>>> > > ${KARAF_SYSTEM_OPTS} ${KARAF_OPTS} ${OPTS} "$@"
>>> > > -classpath "${CLASSPATH}" ${MAIN}
>>> > >
>>> > > In the CSIT robot files we can see the below connection errors so ODL
>>> > > is not responding to new requests. This plus the above leads us to
>>> > > think ODL just died.
>>> > >
>>> > > [ WARN ] Retrying (Retry(total=2, connect=None, read=None,
>>> > > redirect=None, status=None)) after connection broken by
>>> > > 'NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection
>>> > > object at 0x5ca2d50>: Failed to establish a new connection:
>>> > > [Errno 111] Connection refused',)'
>>> > >
>>> > > That would seem to indicate something did a kill -9. As Michael said,
>>> > > if the JVM crashed there would be an hs_err_pid file and it would log
>>> > > a message about it
>>> >
>>> > yeah, this is where my money is at as well. The OS must be dumping it
>>> > because it's misbehaving. I'll try to hack the job to start collecting
>>> > os level log info (e.g. journalctl, etc)
>>> >
>>> > JamO, do make sure you collect not just OS level but also the JVM's
>>> > hs_err_*.log file (if any); my bet is a JVM more than an OS level
>>> > crash...
>>>
>>> where are these hs_err_*.log files going to be?
>>>
>>> they would be in the "current working directory", like what was the "pwd"
>>> when the JVM was started..
>>>
>>> This is such a dragged out process to debug. These jobs take 3+ hours and
>>> our problem only comes sporadically. ...sigh...
>>>
>>> But, good news is that I think we've confirmed it's an oom. but an OOM
>>> from the OS perspective, if I'm not mistaken.
>>>
>>> OK that kind of thing could happen if you ran an ODL JVM in this kind of
>>> situation:
>>>
>>> * VM with say 4 GB of RAM, and no swap
>>> * JVM like ODL starts with Xms 1 GB and Xmx 2 GB, so reserves 1 and plans
>>>   to expand to 2, when needed
>>> * other stuff eats up remaining e.g. 3 GB
>>> * JVM wants to expand, asks OS for 1 GB, but there is none left - so boum
>>>
>>> but AFAIK (I'm not 100% sure) there would still be one of those
>>> hs_err_*.log files with some details confirming above (like "out of
>>> native memory", kind of thing).
>>>
>>> here's what I saw in a sandbox job [a] that just hit this:
>>>
>>> Out of memory: Kill process 11546 (java) score 933 or sacrifice child
>>> (more debug output is there in the console log)
>>>
>>> These ODL systems start with 4G and we are setting the max mem for the
>>> odl java process to be 2G.
>>>
>>> erm, I'm not quite following what is 2 and what is 3 here.. but does my
>>> description above help you to narrow this down?
>>>
>>> I don't think we see this with Carbon, which makes me believe it's *not*
>>> some problem from outside of ODL (e.g. not a kernel bug from when we
>>> updated the java builder image back on 10/20)
>>>
>>> I'll keep digging at this. Ideas are welcome for things to look at.
>>>
>>> [a]
>>> https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-pike-jamo-upstream-stateful-snat-conntrack-oxygen/7/consoleFull
>>>
>>> > BTW: The most common fix ;) for JVM crashes often is simply upgrading
>>> > to the latest available patch version of OpenJDK.. but I'm
>>> > guessing/hoping we run from RPM and already have the latest - or is
>>> > this possibly running on an older JVM version package that was somehow
>>> > "held back" via special dnf instructions, or manually installed from a
>>> > ZIP, kind of thing?
>>>
>>> these systems are built and updated periodically. jdk is installed with
>>> "yum install". The specific version in [a] is:
>>>
>>> 10:57:33 Set Java version
>>> 10:57:34 JDK default version...
>>> 10:57:34 openjdk version "1.8.0_144"
>>> 10:57:34 OpenJDK Runtime Environment (build 1.8.0_144-b01)
>>> 10:57:34 OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)
>>>
>>> OK, that seems to be the latest one I also have locally on Fedora 26.
>>>
>>> Thanks,
>>> JamO
>>>
>>> > JamO
>>> >
>>> > > _______________________________________________
>>> > > controller-dev mailing list
>>> > > [email protected]
>>> > > https://lists.opendaylight.org/mailman/listinfo/controller-dev
>>>
>>> --
>>> Thanks
>>> Anil

_______________________________________________
controller-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/controller-dev
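
A minimal sketch of one way to approach the question at the top of this message, assuming there is no single JVM option that caps total process memory and dumps the heap when the cap is reached: poll the resident set size of the java process from outside and call jmap once a threshold is crossed, before the kernel's OOM killer steps in. The names LIMIT_KB, DUMP_DIR, rss_kb, over_limit and watch_jvm, the odl-<pid>.hprof file name and the 5 second poll interval are all hypothetical, chosen only for illustration. Two caveats: since Xmx is 2G and the process reaches 3.7G, the growth may be outside the Java heap, in which case a heap dump may not show it; and given how fast the failure is, a poll loop like this may still be too slow.

```shell
#!/bin/sh
# Hypothetical watchdog: dump the heap before the kernel OOM killer fires.
# Assumptions (not from this thread): LIMIT_KB, DUMP_DIR and the 5 second
# poll interval are illustrative values; jmap must come from the same JDK
# and run as the same user as the target JVM.

LIMIT_KB=${LIMIT_KB:-3145728}      # 3 GB expressed in kB
DUMP_DIR=${DUMP_DIR:-/tmp}

# Print the resident set size of a PID in kB (empty if the process is gone).
rss_kb() {
    ps -o rss= -p "$1" 2>/dev/null | tr -d ' '
}

# Succeed when the first argument (RSS in kB) has reached the second (limit).
over_limit() {
    [ -n "$1" ] && [ "$1" -ge "$2" ]
}

# Poll the given PID; take one heap dump when the limit is crossed, then stop.
watch_jvm() {
    pid=$1
    while kill -0 "$pid" 2>/dev/null; do
        rss=$(rss_kb "$pid")
        if over_limit "$rss" "$LIMIT_KB"; then
            jmap -dump:format=b,file="$DUMP_DIR/odl-$pid.hprof" "$pid"
            return 0
        fi
        sleep 5
    done
    return 1
}

# Only start polling when a PID is passed on the command line.
if [ $# -ge 1 ]; then
    watch_jvm "$1"
fi
```

It would be started next to karaf with the java PID as its only argument (11546 in the OOM killer message quoted above), with DUMP_DIR pointing at a folder the CSIT job already archives.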
