On Tue, Oct 31, 2017 at 6:44 PM, Anil Vishnoi <[email protected]> wrote:

> is it possible to collect dmesg output? That can give an idea if it's a
> JVM native OOM.
>
Yes, we already collect all those for the openstack nodes, so we just need
to include it for the ODL node.

>
> On Tue, Oct 31, 2017 at 3:40 PM, Michael Vorburger <[email protected]>
> wrote:
>
>> On Tue, Oct 31, 2017 at 11:02 PM, Jamo Luhrsen <[email protected]>
>> wrote:
>>
>>> On 10/31/2017 12:22 AM, Michael Vorburger wrote:
>>> > On Tue, Oct 31, 2017 at 12:44 AM, Jamo Luhrsen <[email protected]
>>> <mailto:[email protected]>> wrote:
>>> >
>>> >     On 10/30/2017 01:29 PM, Tom Pantelis wrote:
>>> >     > On Mon, Oct 30, 2017 at 4:25 PM, Sam Hague <[email protected]
>>> <mailto:[email protected]> <mailto:[email protected]
>>> >     <mailto:[email protected]>>> wrote:
>>> >     >     On Mon, Oct 30, 2017 at 3:02 PM, Tom Pantelis <
>>> [email protected] <mailto:[email protected]>
>>> >     <mailto:[email protected] <mailto:[email protected]>>>
>>> wrote:
>>> >     >         On Mon, Oct 30, 2017 at 2:49 PM, Michael Vorburger <
>>> [email protected] <mailto:[email protected]> <mailto:
>>> [email protected] <mailto:[email protected]>>> wrote:
>>> >     >
>>> >     >             Hi Sam,
>>> >     >
>>> >     >             On Mon, Oct 30, 2017 at 7:45 PM, Sam Hague <
>>> [email protected] <mailto:[email protected]> <mailto:[email protected]
>>> <mailto:[email protected]>>> wrote:
>>> >     >
>>> >     >                 Stephen, Michael, Tom,
>>> >     >
>>> >     >                 do you have any ways to collect debugs when ODL
>>> crashes in CSIT?
>>> >     >
>>> >     >
>>> >     >             JVMs (almost) never "just crash" without a word...
>>> either some code does java.lang.System.exit(), which you may
>>> >     >             remember we do in the CDS/Akka code somewhere, or
>>> there's a bug in the JVM implementation - in which case there
>>> >     >             should be a one of those JVM crash logs type things
>>> - a file named something like hs_err_pid22607.log in the
>>> >     >             "current working" directory. Where would that be on
>>> these CSIT runs, and are the CSIT JJB jobs set up to preserve
>>> >     >             such JVM crash log files and copy them over to
>>> logs.opendaylight.org <http://logs.opendaylight.org>
>>> >     <http://logs.opendaylight.org> ?
>>> >     >
>>> >     >
>>> >     >         Akka will do System.exit() if it encounters an error
>>> serious for that.  But it doesn't do it silently. However I
>>> >     >         believe we disabled the automatic exiting in akka.
>>> >     >
>>> >     >     Should there be any logs in ODL for this? There is nothing
>>> in the karaf log when this happens. It literally just stops.
>>> >     >
>>> >     >     The karaf.console log does say the karaf process was killed:
>>> >     >
>>> >     >     /tmp/karaf-0.7.1-SNAPSHOT/bin/karaf: line 422: 11528 Killed
>>> ${KARAF_EXEC} "${JAVA}" ${JAVA_OPTS} "$NON_BLOCKING_PRNG"
>>> >     >     -Djava.endorsed.dirs="${JAVA_ENDORSED_DIRS}"
>>> -Djava.ext.dirs="${JAVA_EXT_DIRS}"
>>> >     >     -Dkaraf.instances="${KARAF_HOME}/instances"
>>> -Dkaraf.home="${KARAF_HOME}" -Dkaraf.base="${KARAF_BASE}"
>>> >     >     -Dkaraf.data="${KARAF_DATA}" -Dkaraf.etc="${KARAF_ETC}"
>>> -Dkaraf.restart.jvm.supported=true
>>> >     >     -Djava.io.tmpdir="${KARAF_DATA}/tmp"
>>> -Djava.util.logging.config.file="${KARAF_BASE}/etc/java.util
>>> .logging.properties"
>>> >     >     ${KARAF_SYSTEM_OPTS} ${KARAF_OPTS} ${OPTS} "$@" -classpath
>>> "${CLASSPATH}" ${MAIN}
>>> >     >
>>> >     >     In the CSIT robot files we can see the below connection
>>> errors so ODL is not responding to new requests. This plus the
>>> >     >     above lead to think ODL just died.
>>> >     >
>>> >     >     [ WARN ] Retrying (Retry(total=2, connect=None, read=None,
>>> redirect=None, status=None)) after connection broken by
>>> >     >     
>>> > 'NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection
>>> object at 0x5ca2d50>: Failed to establish a new
>>> >     >     connection: [Errno 111] Connection refused',)'
>>> >     >
>>> >     >
>>> >     >
>>> >     > That would seem to indicate something did a kill -9.  As Michael
>>> said, if the JVM crashed there would be an hs_err_pid file
>>> >     > and it would log a message about it
>>> >
>>> >     yeah, this is where my money is at as well. The OS must be dumping
>>> it because it's
>>> >     misbehaving. I'll try to hack the job to start collecting os level
>>> log info (e.g. journalctl, etc)
>>> >
>>> >
>>> > JamO, do make sure you collect not just OS level but also the
>>> JVM's hs_err_*.log  file (if any); my bet is a JVM more than an
>>> > OS level crash...
>>>
>>> where are these hs_err_*.log files going to be?
>>
>>
>> they would be in the "current working directory", like what was the "pwd"
>> when the JVM was started..
>>
>>
>>> This is such a dragged out process to debug. These
>>> jobs take 3+ hours and our problem only comes sporadically. ...sigh...
>>>
>>> But, good news is that I think we've confirmed it's an oom. but an OOM
>>> from the OS perspective,
>>> if I'm not mistaken.
>>>
>>
>> OK that kind of thing could happen if you ran an ODL JVM in this kind of
>> situation:
>>
>> * VM with say 4 GB of RAM, and no swap
>> * JVM like ODL starts with Xms 1 GB and Xmx 2 GB, so reserves 1 and plans
>> expand to 2, when needed
>> * other stuff eats up remaining e.g. 3 GB
>> * JVM wants to expand, asks OS for 1 GB, but there is none left - so boum
>>
>> but AFAIK (I'm not 100% sure) there would still be one of those
>> hs_err_*.log files with some details confirming above (like "out of native
>> memory", kind of thing).
>>
>>
>>> here's what I saw in a sandbox job [a] that just hit this:
>>>
>>> Out of memory: Kill process 11546 (java) score 933 or sacrifice child
>>> (more debug output is there in the console log)
>>>
>>> These ODL systems start with 4G and we are setting the max mem for the
>>> odl java
>>> process to be 2G.
>>>
>>
>> erm, I'm not quite following what is 2 and what is 3 here.. but does my
>> description above help you to narrow this down?
>>
>>
>>> I don't think we see this with Carbon, which makes me believe it's *not*
>>> some problem from outside
>>> of ODL (e.g. not a kernel bug from when we updated the java builder
>>> image back on 10/20)
>>>
>>> I'll keep digging at this. Ideas are welcome for things to look at.
>>>
>>>
>>>
>>> [a]
>>> https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1n
>>> ode-openstack-pike-jamo-upstream-stateful-snat-conntrack-oxy
>>> gen/7/consoleFull
>>>
>>>
>>>
>>>
>>>
>>> > BTW: The most common fix ;) for JVM crashes often is simply upgrading
>>> to the latest available patch version of OpenJDK.. but
>>> > I'm guessing/hoping we run from RPM and already have the latest - or
>>> is this possibly running on an older JVM version package
>>> > that was somehow "held back" via special dnf instructions, or manually
>>> installed from a ZIP, kind of thing?
>>>
>>>
>>> these systems are built and updated periodically. jdk is installed with
>>> "yum install". The specific version
>>> in [a] is:
>>>
>>> 10:57:33 Set Java version
>>> 10:57:34 JDK default version...
>>> 10:57:34 openjdk version "1.8.0_144"
>>> 10:57:34 OpenJDK Runtime Environment (build 1.8.0_144-b01)
>>> 10:57:34 OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)
>>>
>>
>> OK, that seems to be the latest one I also have locally on Fedora 26.
>>
>> Thanks,
>>> JamO
>>>
>>>
>>>
>>> >     JamO
>>> >
>>> >
>>> >     >
>>> >     > _______________________________________________
>>> >     > controller-dev mailing list
>>> >     > [email protected] <mailto:
>>> [email protected]>
>>> >     > https://lists.opendaylight.org/mailman/listinfo/controller-dev
>>> >     <https://lists.opendaylight.org/mailman/listinfo/controller-dev>
>>> >     >
>>> >     _______________________________________________
>>> >     controller-dev mailing list
>>> >     [email protected] <mailto:
>>> [email protected]>
>>> >     https://lists.opendaylight.org/mailman/listinfo/controller-dev
>>> >     <https://lists.opendaylight.org/mailman/listinfo/controller-dev>
>>> >
>>> >
>>>
>>
>>
>> _______________________________________________
>> controller-dev mailing list
>> [email protected]
>> https://lists.opendaylight.org/mailman/listinfo/controller-dev
>>
>>
>
>
> --
> Thanks
> Anil
>
> _______________________________________________
> controller-dev mailing list
> [email protected]
> https://lists.opendaylight.org/mailman/listinfo/controller-dev
>
>
_______________________________________________
controller-dev mailing list
[email protected]
https://lists.opendaylight.org/mailman/listinfo/controller-dev

Reply via email to