Is it possible to collect dmesg output? That could indicate whether it's a JVM native OOM.
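A minimal sketch of how the job could scan captured dmesg text for OOM-killer entries, assuming the kernel logs them in the same form as the "Out of memory: Kill process ..." line quoted further down in this thread. The function names (find_oom_kills, read_dmesg) are made up for illustration, and other kernel versions may word the message differently, so treat the pattern as an assumption rather than a stable format.

```python
import re
import subprocess

# Assumption: the kernel message looks like the line quoted in this thread:
#   Out of memory: Kill process 11546 (java) score 933 or sacrifice child
OOM_KILL_RE = re.compile(
    r"Out of memory: Kill process (?P<pid>\d+) \((?P<name>[^)]+)\) score (?P<score>\d+)"
)


def find_oom_kills(log_text):
    """Return (pid, process_name, score) tuples for each OOM-killer line found."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append(
                (int(match.group("pid")), match.group("name"), int(match.group("score")))
            )
    return kills


def read_dmesg():
    """Hypothetical helper: capture dmesg output (needs dmesg on PATH and read access)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout
```

On the CSIT node this could be called as find_oom_kills(read_dmesg()), or the job could run dmesg into a file and feed the saved text to find_oom_kills afterwards. An entry naming the java process would point at the kernel OOM killer rather than a crash inside the JVM.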

On Tue, Oct 31, 2017 at 3:40 PM, Michael Vorburger <[email protected]> wrote:

> On Tue, Oct 31, 2017 at 11:02 PM, Jamo Luhrsen <[email protected]> wrote:
>
>> On 10/31/2017 12:22 AM, Michael Vorburger wrote:
>> > On Tue, Oct 31, 2017 at 12:44 AM, Jamo Luhrsen <[email protected]> wrote:
>> >
>> > On 10/30/2017 01:29 PM, Tom Pantelis wrote:
>> > > On Mon, Oct 30, 2017 at 4:25 PM, Sam Hague <[email protected]> wrote:
>> > > On Mon, Oct 30, 2017 at 3:02 PM, Tom Pantelis <[email protected]> wrote:
>> > > On Mon, Oct 30, 2017 at 2:49 PM, Michael Vorburger <[email protected]> wrote:
>> > >
>> > > Hi Sam,
>> > >
>> > > On Mon, Oct 30, 2017 at 7:45 PM, Sam Hague <[email protected]> wrote:
>> > >
>> > > Stephen, Michael, Tom,
>> > >
>> > > do you have any ways to collect debugs when ODL crashes in CSIT?
>> > >
>> > >
>> > > JVMs (almost) never "just crash" without a word... either some code
>> > > does java.lang.System.exit(), which you may remember we do in the
>> > > CDS/Akka code somewhere, or there's a bug in the JVM implementation -
>> > > in which case there should be a one of those JVM crash logs type
>> > > things - a file named something like hs_err_pid22607.log in the
>> > > "current working" directory. Where would that be on these CSIT runs,
>> > > and are the CSIT JJB jobs set up to preserve such JVM crash log files
>> > > and copy them over to logs.opendaylight.org
>> > > <http://logs.opendaylight.org> ?
>> > >
>> > >
>> > > Akka will do System.exit() if it encounters an error serious for
>> > > that. But it doesn't do it silently. However I believe we disabled
>> > > the automatic exiting in akka.
>> > >
>> > > Should there be any logs in ODL for this? There is nothing in the
>> > > karaf log when this happens. It literally just stops.
>> > >
>> > > The karaf.console log does say the karaf process was killed:
>> > >
>> > > /tmp/karaf-0.7.1-SNAPSHOT/bin/karaf: line 422: 11528 Killed
>> > > ${KARAF_EXEC} "${JAVA}" ${JAVA_OPTS} "$NON_BLOCKING_PRNG"
>> > > -Djava.endorsed.dirs="${JAVA_ENDORSED_DIRS}" -Djava.ext.dirs="${JAVA_EXT_DIRS}"
>> > > -Dkaraf.instances="${KARAF_HOME}/instances"
>> > > -Dkaraf.home="${KARAF_HOME}" -Dkaraf.base="${KARAF_BASE}"
>> > > -Dkaraf.data="${KARAF_DATA}" -Dkaraf.etc="${KARAF_ETC}"
>> > > -Dkaraf.restart.jvm.supported=true
>> > > -Djava.io.tmpdir="${KARAF_DATA}/tmp"
>> > > -Djava.util.logging.config.file="${KARAF_BASE}/etc/java.util.logging.properties"
>> > > ${KARAF_SYSTEM_OPTS} ${KARAF_OPTS} ${OPTS} "$@" -classpath "${CLASSPATH}" ${MAIN}
>> > >
>> > > In the CSIT robot files we can see the below connection errors so ODL
>> > > is not responding to new requests. This plus the above lead to think
>> > > ODL just died.
>> > >
>> > > [ WARN ] Retrying (Retry(total=2, connect=None, read=None,
>> > > redirect=None, status=None)) after connection broken by
>> > > 'NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection
>> > > object at 0x5ca2d50>: Failed to establish a new connection:
>> > > [Errno 111] Connection refused',)'
>> > >
>> > >
>> > > That would seem to indicate something did a kill -9. As Michael said,
>> > > if the JVM crashed there would be an hs_err_pid file and it would log
>> > > a message about it
>> >
>> > yeah, this is where my money is at as well. The OS must be dumping it
>> > because it's misbehaving. I'll try to hack the job to start collecting
>> > os level log info (e.g. journalctl, etc)
>> >
>> >
>> > JamO, do make sure you collect not just OS level but also the JVM's
>> > hs_err_*.log file (if any); my bet is a JVM more than an OS level
>> > crash...
>>
>> where are these hs_err_*.log files going to be?
>
> they would be in the "current working directory", like what was the "pwd"
> when the JVM was started..
>
>> This is such a dragged out process to debug. These jobs take 3+ hours
>> and our problem only comes sporadically. ...sigh...
>>
>> But, good news is that I think we've confirmed it's an oom. but an OOM
>> from the OS perspective, if I'm not mistaken.
>
> OK that kind of thing could happen if you ran an ODL JVM in this kind of
> situation:
>
> * VM with say 4 GB of RAM, and no swap
> * JVM like ODL starts with Xms 1 GB and Xmx 2 GB, so reserves 1 and plans
>   expand to 2, when needed
> * other stuff eats up remaining e.g. 3 GB
> * JVM wants to expand, asks OS for 1 GB, but there is none left - so boum
>
> but AFAIK (I'm not 100% sure) there would still be one of those
> hs_err_*.log files with some details confirming above (like "out of native
> memory", kind of thing).
>
>> here's what I saw in a sandbox job [a] that just hit this:
>>
>> Out of memory: Kill process 11546 (java) score 933 or sacrifice child
>> (more debug output is there in the console log)
>>
>> These ODL systems start with 4G and we are setting the max mem for the
>> odl java process to be 2G.
>
> erm, I'm not quite following what is 2 and what is 3 here.. but does my
> description above help you to narrow this down?
>
>> I don't think we see this with Carbon, which makes me believe it's *not*
>> some problem from outside of ODL (e.g. not a kernel bug from when we
>> updated the java builder image back on 10/20)
>>
>> I'll keep digging at this. Ideas are welcome for things to look at.
>>
>> [a]
>> https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-pike-jamo-upstream-stateful-snat-conntrack-oxygen/7/consoleFull
>>
>> > BTW: The most common fix ;) for JVM crashes often is simply upgrading
>> > to the latest available patch version of OpenJDK.. but I'm
>> > guessing/hoping we run from RPM and already have the latest - or is
>> > this possibly running on an older JVM version package that was somehow
>> > "held back" via special dnf instructions, or manually installed from a
>> > ZIP, kind of thing?
>>
>> these systems are built and updated periodically. jdk is installed with
>> "yum install". The specific version in [a] is:
>>
>> 10:57:33 Set Java version
>> 10:57:34 JDK default version...
>> 10:57:34 openjdk version "1.8.0_144"
>> 10:57:34 OpenJDK Runtime Environment (build 1.8.0_144-b01)
>> 10:57:34 OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)
>
> OK, that seems to be the latest one I also have locally on Fedora 26.
>
>> Thanks,
>> JamO
>>
>> > JamO

--
Thanks
Anil
_______________________________________________
controller-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/controller-dev
