+integration-dev

Nitrogen SR1 blocker bug for this problem:

https://jira.opendaylight.org/browse/NETVIRT-974

I'm actively debugging in the sandbox, although it's a very heavy process
(many hours per iteration).

I'm wondering if there are any extra options we can pass to the Java process
that might shed more light. This is a very fast and silent death.
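
In the meantime, here is a minimal sketch (Python, not part of any existing
job) of how collected dmesg or console output could be scanned for the kernel
OOM-killer line. It assumes the message format matches the one from the
sandbox job quoted below; the function name and the regex are only
illustrative.

```python
import re

# Kernel OOM-killer message as it appears in this thread, e.g.
# "Out of memory: Kill process 11546 (java) score 933 or sacrifice child".
# Other kernel versions may word this differently (assumption).
OOM_LINE = re.compile(
    r"Out of memory: Kill process (?P<pid>\d+) \((?P<name>[^)]+)\) score (?P<score>\d+)"
)


def find_oom_kills(log_text, process_name="java"):
    """Return a list of (pid, score) for OOM-killer lines naming process_name."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_LINE.search(line)
        if match and match.group("name") == process_name:
            kills.append((int(match.group("pid")), int(match.group("score"))))
    return kills


if __name__ == "__main__":
    sample = "Out of memory: Kill process 11546 (java) score 933 or sacrifice child"
    print(find_oom_kills(sample))
```

Feeding it the saved dmesg or console text from the ODL node would at least
tell us quickly whether a given run died this way.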

JamO




On 10/31/2017 06:11 PM, Sam Hague wrote:
> 
> 
> On Tue, Oct 31, 2017 at 6:44 PM, Anil Vishnoi <[email protected]> wrote:
> 
>     Is it possible to collect dmesg output? That can give an idea whether it's a JVM native OOM.
> 
> Yes, we already collect all of those for the OpenStack nodes, so we just need to
> include it for the ODL node.
> 
> 
>     On Tue, Oct 31, 2017 at 3:40 PM, Michael Vorburger <[email protected]> wrote:
> 
>         On Tue, Oct 31, 2017 at 11:02 PM, Jamo Luhrsen <[email protected]> wrote:
> 
>             On 10/31/2017 12:22 AM, Michael Vorburger wrote:
>             > On Tue, Oct 31, 2017 at 12:44 AM, Jamo Luhrsen <[email protected]> wrote:
>             >
>             >     On 10/30/2017 01:29 PM, Tom Pantelis wrote:
>             >     > On Mon, Oct 30, 2017 at 4:25 PM, Sam Hague <[email protected]> wrote:
>             >     >     On Mon, Oct 30, 2017 at 3:02 PM, Tom Pantelis <[email protected]> wrote:
>             >     >         On Mon, Oct 30, 2017 at 2:49 PM, Michael Vorburger <[email protected]> wrote:
>             >     >
>             >     >             Hi Sam,
>             >     >
>             >     >             On Mon, Oct 30, 2017 at 7:45 PM, Sam Hague <[email protected]> wrote:
>             >     >
>             >     >                 Stephen, Michael, Tom,
>             >     >
>             >     >                 do you have any way to collect debug information when ODL crashes in CSIT?
>             >     >
>             >     >
>             >     >             JVMs (almost) never "just crash" without a word. Either some code calls
>             >     >             java.lang.System.exit(), which you may remember we do somewhere in the CDS/Akka
>             >     >             code, or there's a bug in the JVM implementation, in which case there should be
>             >     >             one of those JVM crash logs: a file named something like hs_err_pid22607.log in
>             >     >             the "current working" directory. Where would that be on these CSIT runs, and are
>             >     >             the CSIT JJB jobs set up to preserve such JVM crash log files and copy them over
>             >     >             to logs.opendaylight.org?
>             >     >
>             >     >
>             >     >         Akka will call System.exit() if it encounters an error serious enough for that, but it
>             >     >         doesn't do it silently. However, I believe we disabled the automatic exiting in Akka.
>             >     >
>             >     >     Should there be any logs in ODL for this? There is nothing in the karaf log when this
>             >     >     happens. It literally just stops.
>             >     >
>             >     >     The karaf.console log does say the karaf process was killed:
>             >     >
>             >     >     /tmp/karaf-0.7.1-SNAPSHOT/bin/karaf: line 422: 11528 Killed ${KARAF_EXEC} "${JAVA}" ${JAVA_OPTS} "$NON_BLOCKING_PRNG"
>             >     >     -Djava.endorsed.dirs="${JAVA_ENDORSED_DIRS}" -Djava.ext.dirs="${JAVA_EXT_DIRS}"
>             >     >     -Dkaraf.instances="${KARAF_HOME}/instances" -Dkaraf.home="${KARAF_HOME}" -Dkaraf.base="${KARAF_BASE}"
>             >     >     -Dkaraf.data="${KARAF_DATA}" -Dkaraf.etc="${KARAF_ETC}" -Dkaraf.restart.jvm.supported=true
>             >     >     -Djava.io.tmpdir="${KARAF_DATA}/tmp"
>             >     >     -Djava.util.logging.config.file="${KARAF_BASE}/etc/java.util.logging.properties"
>             >     >     ${KARAF_SYSTEM_OPTS} ${KARAF_OPTS} ${OPTS} "$@" -classpath "${CLASSPATH}" ${MAIN}
>             >     >
>             >     >     In the CSIT robot files we can see the connection errors below, so ODL is not responding
>             >     >     to new requests. This, plus the above, leads me to think ODL just died.
>             >     >
>             >     >     [ WARN ] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by
>             >     >     'NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x5ca2d50>: Failed to establish a new
>             >     >     connection: [Errno 111] Connection refused',)'
>             >     >
>             >     >
>             >     >
>             >     > That would seem to indicate something did a kill -9. As Michael said, if the JVM crashed
>             >     > there would be an hs_err_pid file and it would log a message about it.
>             >
>             >     yeah, this is where my money is as well. The OS must be dumping it because it's
>             >     misbehaving. I'll try to hack the job to start collecting OS-level log info (e.g. journalctl)
>             >
>             >
>             > JamO, do make sure you collect not just OS level but also the JVM's hs_err_*.log file (if any);
>             > my bet is a JVM more than an OS level crash...
> 
>             where are these hs_err_*.log files going to be? 
> 
> 
>         they would be in the "current working directory", i.e. whatever the "pwd" was when the JVM was started.
>          
> 
>             This is such a dragged out process to debug. These
>             jobs take 3+ hours and our problem only comes sporadically. ...sigh...
> 
>             But the good news is that I think we've confirmed it's an OOM, though an OOM
>             from the OS perspective, if I'm not mistaken.
> 
> 
>         OK, that kind of thing could happen if you ran an ODL JVM in this kind of situation:
> 
>         * VM with say 4 GB of RAM, and no swap
>         * JVM like ODL starts with Xms 1 GB and Xmx 2 GB, so it reserves 1 and plans to expand to 2 when needed
>         * other stuff eats up the remaining e.g. 3 GB
>         * JVM wants to expand, asks the OS for 1 GB, but there is none left - so boom
> 
>         but AFAIK (I'm not 100% sure) there would still be one of those hs_err_*.log files with some
>         details confirming the above (like "out of native memory", kind of thing).
>          
> 
>             here's what I saw in a sandbox job [a] that just hit this:
> 
>             Out of memory: Kill process 11546 (java) score 933 or sacrifice child
>             (more debug output is there in the console log)
> 
>             These ODL systems start with 4G and we are setting the max mem for the odl java
>             process to be 2G.
> 
> 
>         erm, I'm not quite following what is 2 and what is 3 here.. but does my description above
>         help you to narrow this down?
>          
> 
>             I don't think we see this with Carbon, which makes me believe it's *not* some problem from outside
>             of ODL (e.g. not a kernel bug from when we updated the java builder image back on 10/20)
> 
>             I'll keep digging at this. Ideas are welcome for things to look at.
> 
> 
> 
>             [a] https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-pike-jamo-upstream-stateful-snat-conntrack-oxygen/7/consoleFull
> 
> 
> 
> 
> 
>             > BTW: The most common fix ;) for JVM crashes often is simply upgrading to the latest available patch version of OpenJDK.. but
>             > I'm guessing/hoping we run from RPM and already have the latest - or is this possibly running on an older JVM version package
>             > that was somehow "held back" via special dnf instructions, or manually installed from a ZIP, kind of thing?
> 
> 
>             these systems are built and updated periodically. jdk is installed with "yum install". The specific version
>             in [a] is:
> 
>             10:57:33 Set Java version
>             10:57:34 JDK default version...
>             10:57:34 openjdk version "1.8.0_144"
>             10:57:34 OpenJDK Runtime Environment (build 1.8.0_144-b01)
>             10:57:34 OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)
> 
> 
>         OK, that seems to be the latest one I also have locally on Fedora 26. 
> 
>             Thanks,
>             JamO
> 
> 
> 
>             >     JamO
>             >
>             >
>             >     >
>             >     > _______________________________________________
>             >     > controller-dev mailing list
>             >     > [email protected]
>             >     > https://lists.opendaylight.org/mailman/listinfo/controller-dev
>             >     >
>             >
>             >
> 
> 
> 
> 
> 
> 
> 
>     -- 
>     Thanks
>     Anil
> 
> 
> 
> 
> 
> 
