On Thu, Nov 2, 2017 at 9:32 PM, Jamo Luhrsen <[email protected]> wrote:
> +integration-dev
>
> nitrogen SR1 blocker bug for this problem:
> https://jira.opendaylight.org/browse/NETVIRT-974
>
> I'm actively debugging in the sandbox, although it's a very heavy process
> (many hours per iteration).
>
> wondering if there are any extra options we can pass to the java process
> that might shed more light. This is a very fast and silent death.

sounds like a memory leak... we should have an hs_err_pid*.log file, and we
*have* to have a *.hprof file to know where an OOM comes from and find a fix
for it... have you been able to re-double-check whether these files aren't
already being produced somewhere? How about just doing a dumb:

    sudo find / -name "hs_err_pid*.log"
    sudo find / -name "*.hprof"

The hprof should be produced automatically; I can see that in $ODL/bin/karaf
we already have "-XX:+HeapDumpOnOutOfMemoryError" in DEFAULT_JAVA_OPTS... To
fix the folder it would write the HPROF into, you could add:

    -XX:HeapDumpPath=/a/folder/you/can/recover

can't wait to get my hands on an hs_err_pid*.log & *.hprof from this... ;=)

> JamO
>
> On 10/31/2017 06:11 PM, Sam Hague wrote:
> > On Tue, Oct 31, 2017 at 6:44 PM, Anil Vishnoi <[email protected]> wrote:
> >
> > > is it possible to collect dmesg output? That can give an idea if it's
> > > a JVM native OOM.
> >
> > Yes, we already collect all those for the openstack nodes, so we just
> > need to include it for the ODL node.
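Since the thread settles on collecting dmesg output for the ODL node, here is a minimal sketch of the filter the log-collection step could run. To keep it self-contained, the input is two sample lines (the OOM one is in the usual kernel oom-killer format); on a real node you would feed it `dmesg` or `journalctl -k` output instead.

```shell
# Sketch: filter kernel messages for OOM-killer activity.
# Sample input stands in for real dmesg output so this runs anywhere.
oom_lines=$(printf '%s\n' \
  "Out of memory: Kill process 11546 (java) score 933 or sacrifice child" \
  "systemd[1]: Started Session 42 of user jenkins." \
  | grep -iE 'out of memory|oom-killer')
echo "$oom_lines"

# Real invocations (assuming a systemd-based minion):
#   dmesg | grep -iE 'out of memory|oom-killer'
#   sudo journalctl -k | grep -iE 'out of memory|oom-killer'
```

Wiring something like this into the CSIT log-collection step would mean the evidence survives even when the VM is recycled after the job.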
> > > On Tue, Oct 31, 2017 at 3:40 PM, Michael Vorburger <[email protected]> wrote:
> > > > On Tue, Oct 31, 2017 at 11:02 PM, Jamo Luhrsen <[email protected]> wrote:
> > > > > On 10/31/2017 12:22 AM, Michael Vorburger wrote:
> > > > > > On Tue, Oct 31, 2017 at 12:44 AM, Jamo Luhrsen <[email protected]> wrote:
> > > > > > > On 10/30/2017 01:29 PM, Tom Pantelis wrote:
> > > > > > > > On Mon, Oct 30, 2017 at 4:25 PM, Sam Hague <[email protected]> wrote:
> > > > > > > > > On Mon, Oct 30, 2017 at 3:02 PM, Tom Pantelis <[email protected]> wrote:
> > > > > > > > > > On Mon, Oct 30, 2017 at 2:49 PM, Michael Vorburger <[email protected]> wrote:
> > > > > > > > > > > Hi Sam,
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Oct 30, 2017 at 7:45 PM, Sam Hague <[email protected]> wrote:
> > > > > > > > > > > > Stephen, Michael, Tom,
> > > > > > > > > > > >
> > > > > > > > > > > > do you have any ways to collect debug info when ODL crashes in CSIT?
> > > > > > > > > > >
> > > > > > > > > > > JVMs (almost) never "just crash" without a word...
> > > > > > > > > > > either some code does java.lang.System.exit(), which you
> > > > > > > > > > > may remember we do in the CDS/Akka code somewhere, or
> > > > > > > > > > > there's a bug in the JVM implementation, in which case
> > > > > > > > > > > there should be one of those JVM crash log type things: a
> > > > > > > > > > > file named something like hs_err_pid22607.log in the
> > > > > > > > > > > "current working" directory. Where would that be on these
> > > > > > > > > > > CSIT runs, and are the CSIT JJB jobs set up to preserve
> > > > > > > > > > > such JVM crash log files and copy them over to
> > > > > > > > > > > logs.opendaylight.org?
> > > > > > > > > >
> > > > > > > > > > Akka will do System.exit() if it encounters an error
> > > > > > > > > > serious enough for that, but it doesn't do it silently.
> > > > > > > > > > However, I believe we disabled the automatic exiting in
> > > > > > > > > > akka.
> > > > > > > > >
> > > > > > > > > Should there be any logs in ODL for this? There is nothing
> > > > > > > > > in the karaf log when this happens. It literally just stops.
> > > > > > > > >
> > > > > > > > > The karaf.console log does say the karaf process was killed:
> > > > > > > > >
> > > > > > > > > /tmp/karaf-0.7.1-SNAPSHOT/bin/karaf: line 422: 11528 Killed
> > > > > > > > > ${KARAF_EXEC} "${JAVA}" ${JAVA_OPTS} "$NON_BLOCKING_PRNG"
> > > > > > > > > -Djava.endorsed.dirs="${JAVA_ENDORSED_DIRS}"
> > > > > > > > > -Djava.ext.dirs="${JAVA_EXT_DIRS}"
> > > > > > > > > -Dkaraf.instances="${KARAF_HOME}/instances"
> > > > > > > > > -Dkaraf.home="${KARAF_HOME}" -Dkaraf.base="${KARAF_BASE}"
> > > > > > > > > -Dkaraf.data="${KARAF_DATA}" -Dkaraf.etc="${KARAF_ETC}"
> > > > > > > > > -Dkaraf.restart.jvm.supported=true
> > > > > > > > > -Djava.io.tmpdir="${KARAF_DATA}/tmp"
> > > > > > > > > -Djava.util.logging.config.file="${KARAF_BASE}/etc/java.util.logging.properties"
> > > > > > > > > ${KARAF_SYSTEM_OPTS} ${KARAF_OPTS} ${OPTS} "$@"
> > > > > > > > > -classpath "${CLASSPATH}" ${MAIN}
> > > > > > > > >
> > > > > > > > > In the CSIT robot files we can see the below connection
> > > > > > > > > errors, so ODL is not responding to new requests. This plus
> > > > > > > > > the above leads me to think ODL just died.
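That `line 422: 11528 Killed` message is the shell reporting a child that died from a signal; to the invoking script such a death shows up as exit status 128 + signal number, i.e. 137 for SIGKILL. A tiny stand-alone demonstration (`sleep` stands in here for the long-running karaf JVM), which a wrapper script could use to tell an external kill apart from a normal JVM exit:

```shell
# A child killed by SIGKILL surfaces as exit status 128 + 9 = 137.
# `sleep` is a stand-in for the karaf JVM.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid"
status=$?
echo "exit status: $status"
```

A karaf wrapper checking for status 137 after the JVM exits would be one cheap way to flag "somebody (probably the kernel) killed us" in the job's console log.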
> > > > > > > > > [ WARN ] Retrying (Retry(total=2, connect=None, read=None,
> > > > > > > > > redirect=None, status=None)) after connection broken by
> > > > > > > > > 'NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection
> > > > > > > > > object at 0x5ca2d50>: Failed to establish a new connection:
> > > > > > > > > [Errno 111] Connection refused',)'
> > > > > > > >
> > > > > > > > That would seem to indicate something did a kill -9. As
> > > > > > > > Michael said, if the JVM crashed there would be an hs_err_pid
> > > > > > > > file, and it would log a message about it.
> > > > > > >
> > > > > > > yeah, this is where my money is at as well. The OS must be
> > > > > > > dumping it because it's misbehaving. I'll try to hack the job
> > > > > > > to start collecting OS level log info (e.g. journalctl, etc.)
> > > > > >
> > > > > > JamO, do make sure you collect not just OS level but also the
> > > > > > JVM's hs_err_*.log file (if any); my bet is a JVM more than an
> > > > > > OS level crash...
> > > > >
> > > > > where are these hs_err_*.log files going to be?
> > > >
> > > > they would be in the "current working directory", i.e. whatever the
> > > > "pwd" was when the JVM was started...
> > > >
> > > > > This is such a dragged out process to debug. These jobs take 3+
> > > > > hours and our problem only comes sporadically. ...sigh...
> > > > >
> > > > > But, good news is that I think we've confirmed it's an OOM, but an
> > > > > OOM from the OS perspective, if I'm not mistaken.
> > > >
> > > > OK, that kind of thing could happen if you ran an ODL JVM in this
> > > > kind of situation:
> > > >
> > > > * VM with, say, 4 GB of RAM, and no swap
> > > > * JVM like ODL starts with Xms 1 GB and Xmx 2 GB, so it reserves
> > > >   1 GB and plans to expand to 2 GB when needed
> > > > * other stuff eats up the remaining e.g. 3 GB
> > > > * JVM wants to expand, asks the OS for 1 GB, but there is none
> > > >   left, so boum
> > > >
> > > > but AFAIK (I'm not 100% sure) there would still be one of those
> > > > hs_err_*.log files with some details confirming the above (like
> > > > "out of native memory", kind of thing).
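Michael's expansion scenario can be put into rough numbers; all the figures below come straight from the bullets in his mail (4 GB VM, Xms 1 GB, Xmx 2 GB, ~3 GB eaten by everything else), just spelled out in MB:

```shell
# Rough arithmetic for the heap-expansion scenario (all values in MB).
vm_ram=4096   # VM with 4 GB of RAM, no swap
xms=1024      # JVM reserves 1 GB up front
xmx=2048      # ...and plans to expand to 2 GB when needed
other=3072    # other stuff eats up the remaining ~3 GB
free=$((vm_ram - xms - other))   # what's left for the JVM to grow into
wanted=$((xmx - xms))            # what the JVM may still ask the OS for
echo "free=${free}MB wanted=${wanted}MB"
[ "$wanted" -gt "$free" ] && echo "expansion would fail: OOM"
```

With these numbers, `free` comes out to 0 MB against a possible 1024 MB expansion request, which is exactly the "asks OS for 1 GB, but there is none left" outcome described above.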
> > > > > here's what I saw in a sandbox job [a] that just hit this:
> > > > >
> > > > > Out of memory: Kill process 11546 (java) score 933 or sacrifice child
> > > > >
> > > > > (more debug output is there in the console log)
> > > > >
> > > > > These ODL systems start with 4G and we are setting the max mem for
> > > > > the odl java process to be 2G.
> > > >
> > > > erm, I'm not quite following what is 2 and what is 3 here... but
> > > > does my description above help you to narrow this down?
> > > >
> > > > > I don't think we see this with Carbon, which makes me believe it's
> > > > > *not* some problem from outside of ODL (e.g. not a kernel bug from
> > > > > when we updated the java builder image back on 10/20).
> > > > >
> > > > > I'll keep digging at this. Ideas are welcome for things to look at.
> > > > >
> > > > > [a] https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-pike-jamo-upstream-stateful-snat-conntrack-oxygen/7/consoleFull
> > > > >
> > > > > > BTW: The most common fix ;) for JVM crashes often is simply
> > > > > > upgrading to the latest available patch version of OpenJDK...
> > > > > > but I'm guessing/hoping we run from RPM and already have the
> > > > > > latest, or is this possibly running on an older JVM version
> > > > > > package that was somehow "held back" via special dnf
> > > > > > instructions, or manually installed from a ZIP, kind of thing?
> > > > >
> > > > > these systems are built and updated periodically. jdk is installed
> > > > > with "yum install". The specific version in [a] is:
> > > > >
> > > > > 10:57:33 Set Java version
> > > > > 10:57:34 JDK default version...
> > > > > 10:57:34 openjdk version "1.8.0_144"
> > > > > 10:57:34 OpenJDK Runtime Environment (build 1.8.0_144-b01)
> > > > > 10:57:34 OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)
> > > >
> > > > OK, that seems to be the latest one I also have locally on Fedora 26.
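For archiving these kills from the sandbox job [a], the pid, process name, and badness score can be pulled out of the oom-killer line mechanically. A sketch only; the awk field positions assume exactly the message shape quoted above:

```shell
# Parse "Out of memory: Kill process <pid> (<name>) score <score> or sacrifice child"
line="Out of memory: Kill process 11546 (java) score 933 or sacrifice child"
pid=$(echo "$line"   | awk '{print $6}')
name=$(echo "$line"  | awk '{print $7}' | tr -d '()')
score=$(echo "$line" | awk '{print $9}')
echo "killed pid=$pid name=$name score=$score"
# -> killed pid=11546 name=java score=933
```

A score of 933 (out of 1000) means the kernel considered the java process by far the "baddest" candidate, which matches it being the biggest memory consumer on the 4G node.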
> > > > >
> > > > > Thanks,
> > > > > JamO
> > > > > > >
> > > > > > > JamO
> > >
> > > --
> > > Thanks
> > > Anil
_______________________________________________
controller-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/controller-dev
