So from the time I was debugging the issue with Jamo and Thanh, it seems there
is a memory leak, but:

1) The JVM does not kill itself; the OS does instead, after the java process
grows to 3.7G in a VM with 4G of RAM (note Xmx is set to 2G, but the JVM still
goes far beyond that).
2) The issue happens so fast that we did not have time to take a memory dump
with jmap.

So I wonder if there is some combination of Java memory parameters to prevent
the OS from killing the JVM, something like: if the total Java memory (Xmx and
others) reaches 3G, abort and generate a heap dump.
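Something in that direction does exist. The flag names below are real HotSpot options (the abort-on-OOM one needs JDK 8u92 or later), but the sizes and the dump path are placeholders that would need tuning, and note there is no single flag that caps *total* process RSS at exactly 3G - the closest is bounding each native pool individually and watching Native Memory Tracking output:

```shell
# Sketch: JAVA_OPTS additions to cap the JVM's non-heap appetite and make it
# die loudly (with a dump) before the kernel OOM-killer steps in.
# Sizes and /tmp/odl-dumps are placeholder assumptions, not tested values.
EXTRA_JAVA_OPTS="-Xmx2g \
 -XX:MaxMetaspaceSize=512m \
 -XX:MaxDirectMemorySize=256m \
 -XX:NativeMemoryTracking=summary \
 -XX:+HeapDumpOnOutOfMemoryError \
 -XX:HeapDumpPath=/tmp/odl-dumps \
 -XX:+CrashOnOutOfMemoryError"
echo "$EXTRA_JAVA_OPTS"
```

With NativeMemoryTracking enabled, `jcmd <pid> VM.native_memory summary` on the live process would show where the extra ~1.7G beyond Xmx is going.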

BR/Luis


> On Nov 2, 2017, at 2:53 PM, Jamo Luhrsen <[email protected]> wrote:
> 
> 
> 
> On 11/02/2017 02:02 PM, Michael Vorburger wrote:
>> On Thu, Nov 2, 2017 at 9:32 PM, Jamo Luhrsen <[email protected]> wrote:
>> 
>>    +integration-dev
>> 
>>    nitrogen SR1 blocker bug for this problem:
>> 
>>    https://jira.opendaylight.org/browse/NETVIRT-974
>> 
>>    I'm actively debugging in the sandbox, although it's a very heavy process
>>    (many hours per iteration).
>> 
>>    wondering if there are any extra options we can pass to the java process
>>    that might shed more light. This is a very fast and silent death.
>> 
>> 
>> sounds like a memory leak.. we should have an hs_err_pid*.log file, and we
>> *have* to have an *.hprof file to know where the leak is and find a fix for
>> the OOM...
>>
>> have you been able to re-double-check if these files aren't already produced
>> somewhere? How about just doing a dumb:
>> 
>> sudo find / -name "hs_err_pid*.log"
>> sudo find / -name "*.hprof"
>> 
>> the hprof should be produced; I can see that in $ODL/bin/karaf we already
>> have "-XX:+HeapDumpOnOutOfMemoryError"
>> in DEFAULT_JAVA_OPTS...
> 
> 
> yes, we do this HeapDumpOnOutOfMemoryError.
> 
>> to fix the folder where it would write the HPROF into, you could add: 
>> -XX:HeapDumpPath=/a/folder/you/can/recover 
>> 
>> can't wait to get my hands on a hs_err_pid*.log & *.hprof from this... ;=)
> 
> 
> no, there is no hs_err* or *.hprof here. The OS is killing the PID because it's
> consuming too much memory. I don't think the OS even cares that this is java.
> Same as me doing a kill -9, I presume.
> 
> :(
> 
> JamO
> 
> 
>>    JamO
>> 
>> 
>> 
>> 
>>    On 10/31/2017 06:11 PM, Sam Hague wrote:
>>> 
>>> 
>>> On Tue, Oct 31, 2017 at 6:44 PM, Anil Vishnoi <[email protected]> wrote:
>>> 
>>>      is it possible to collect dmesg output? That can give an idea if it's 
>>> a JVM native OOM.
>>> 
>>> Yes, we already collect all those for the openstack nodes, so we just need 
>>> to include it for the ODL node. 
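The kernel OOM-killer leaves a distinctive signature in dmesg that the collection step could filter for. A sketch, where the sample line mirrors the sandbox console output quoted further down in this thread (using a canned sample keeps the filter testable; against a live node it would run over `dmesg` output instead):

```shell
# Sketch: spot the kernel OOM-killer signature in collected kernel logs.
sample="Out of memory: Kill process 11546 (java) score 933 or sacrifice child"
# extract the "Kill process <pid> (java)" part, which pins the victim
match=$(echo "$sample" | grep -oE "Kill process [0-9]+ \(java\)")
echo "$match"
# on a live ODL node, something like:
#   dmesg | grep -iE "out of memory|oom-killer" > /tmp/odl_oom_dmesg.txt
```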
>>> 
>>> 
>>>      On Tue, Oct 31, 2017 at 3:40 PM, Michael Vorburger <[email protected]> wrote:
>>> 
>>>          On Tue, Oct 31, 2017 at 11:02 PM, Jamo Luhrsen <[email protected]> wrote:
>>> 
>>>              On 10/31/2017 12:22 AM, Michael Vorburger wrote:
>>>              > On Tue, Oct 31, 2017 at 12:44 AM, Jamo Luhrsen <[email protected]> wrote:
>>>              >
>>>              >     On 10/30/2017 01:29 PM, Tom Pantelis wrote:
>>>              >     > On Mon, Oct 30, 2017 at 4:25 PM, Sam Hague <[email protected]> wrote:
>>>              >     >     On Mon, Oct 30, 2017 at 3:02 PM, Tom Pantelis <[email protected]> wrote:
>>>              >     >         On Mon, Oct 30, 2017 at 2:49 PM, Michael Vorburger <[email protected]> wrote:
>>>              >     >
>>>              >     >             Hi Sam,
>>>              >     >
>>>              >     >             On Mon, Oct 30, 2017 at 7:45 PM, Sam Hague <[email protected]> wrote:
>>>              >     >
>>>              >     >                 Stephen, Michael, Tom,
>>>              >     >
>>>              >     >                 do you have any ways to collect debugs 
>>> when ODL crashes in CSIT?
>>>              >     >
>>>              >     >
>>>              >     >             JVMs (almost) never "just crash" without a
>>>              >     >             word... either some code does
>>>              >     >             java.lang.System.exit(), which you may
>>>              >     >             remember we do in the CDS/Akka code
>>>              >     >             somewhere, or there's a bug in the JVM
>>>              >     >             implementation - in which case there should
>>>              >     >             be one of those JVM crash log type things -
>>>              >     >             a file named something like
>>>              >     >             hs_err_pid22607.log in the "current
>>>              >     >             working" directory. Where would that be on
>>>              >     >             these CSIT runs, and are the CSIT JJB jobs
>>>              >     >             set up to preserve such JVM crash log files
>>>              >     >             and copy them over to logs.opendaylight.org ?
>>>              >     >
>>>              >     >
>>>              >     >         Akka will do System.exit() if it encounters an
>>>              >     >         error serious enough for that. But it doesn't
>>>              >     >         do it silently. However, I believe we disabled
>>>              >     >         the automatic exiting in akka.
>>>              >     >
>>>              >     >     Should there be any logs in ODL for this? There is
>>>              >     >     nothing in the karaf log when this happens. It
>>>              >     >     literally just stops.
>>>              >     >
>>>              >     >     The karaf.console log does say the karaf process
>>>              >     >     was killed:
>>>              >     >
>>>              >     >     /tmp/karaf-0.7.1-SNAPSHOT/bin/karaf: line 422:
>>>              >     >     11528 Killed ${KARAF_EXEC} "${JAVA}" ${JAVA_OPTS}
>>>              >     >     "$NON_BLOCKING_PRNG"
>>>              >     >     -Djava.endorsed.dirs="${JAVA_ENDORSED_DIRS}"
>>>              >     >     -Djava.ext.dirs="${JAVA_EXT_DIRS}"
>>>              >     >     -Dkaraf.instances="${KARAF_HOME}/instances"
>>>              >     >     -Dkaraf.home="${KARAF_HOME}"
>>>              >     >     -Dkaraf.base="${KARAF_BASE}"
>>>              >     >     -Dkaraf.data="${KARAF_DATA}"
>>>              >     >     -Dkaraf.etc="${KARAF_ETC}"
>>>              >     >     -Dkaraf.restart.jvm.supported=true
>>>              >     >     -Djava.io.tmpdir="${KARAF_DATA}/tmp"
>>>              >     >     -Djava.util.logging.config.file="${KARAF_BASE}/etc/java.util.logging.properties"
>>>              >     >     ${KARAF_SYSTEM_OPTS} ${KARAF_OPTS} ${OPTS} "$@"
>>>              >     >     -classpath "${CLASSPATH}" ${MAIN}
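That "Killed" from bash is consistent with an external SIGKILL. One way a CSIT wrapper script could confirm how the karaf JVM died is the shell's exit-status convention (128 + signal number); a minimal sketch, not part of the actual job:

```shell
# Sketch: a SIGKILL'd child reports exit status 137 (= 128 + 9) to its
# parent shell, which is what a wrapper around karaf would observe when
# the kernel OOM-killer (or anyone) does kill -9 on the JVM.
sh -c 'kill -9 $$'     # child kills itself with SIGKILL
code=$?
echo "exit code: $code"
```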
>>>              >     >
>>>              >     >     In the CSIT robot files we can see the below
>>>              >     >     connection errors, so ODL is not responding to new
>>>              >     >     requests. This plus the above leads us to think
>>>              >     >     ODL just died.
>>>              >     >
>>>              >     >     [ WARN ] Retrying (Retry(total=2, connect=None,
>>>              >     >     read=None, redirect=None, status=None)) after
>>>              >     >     connection broken by
>>>              >     >     'NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x5ca2d50>:
>>>              >     >     Failed to establish a new
>>>              >     >     connection: [Errno 111] Connection refused',)'
>>>              >     >
>>>              >     >
>>>              >     >
>>>              >     > That would seem to indicate something did a kill -9.
>>>              >     > As Michael said, if the JVM crashed there would be
>>>              >     > an hs_err_pid file and it would log a message about it.
>>>              >
>>>              >     yeah, this is where my money is at as well. The OS must
>>>              >     be dumping it because it's misbehaving. I'll try to hack
>>>              >     the job to start collecting os level log info (e.g.
>>>              >     journalctl, etc)
>>>              >
>>>              >
>>>              > JamO, do make sure you collect not just OS level but also
>>>              > the JVM's hs_err_*.log file (if any); my bet is a JVM more
>>>              > than an OS level crash...
>>> 
>>>              where are these hs_err_*.log files going to be?
>>> 
>>> 
>>>          they would be in the "current working directory", i.e. whatever
>>>          the "pwd" was when the JVM was started..
>>> 
>>>              This is such a dragged-out process to debug. These jobs take
>>>              3+ hours and our problem only comes sporadically. ...sigh...
>>> 
>>>              But, good news is that I think we've confirmed it's an OOM,
>>>              but an OOM from the OS perspective, if I'm not mistaken.
>>> 
>>> 
>>>          OK, that kind of thing could happen if you ran an ODL JVM in this
>>>          kind of situation:
>>> 
>>>          * VM with say 4 GB of RAM, and no swap
>>>          * JVM like ODL starts with Xms 1 GB and Xmx 2 GB, so reserves 1
>>>            and plans to expand to 2 when needed
>>>          * other stuff eats up the remaining e.g. 3 GB
>>>          * JVM wants to expand, asks OS for 1 GB, but there is none left -
>>>            so boum
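That scenario can be sketched as back-of-envelope arithmetic (the numbers are the hypothetical ones from the bullets above, not measurements from the CSIT VMs):

```shell
# Sketch: memory budget from the scenario above, in MB.
vm_ram=4096          # VM with 4 GB RAM, no swap
jvm_reserved=1024    # Xms 1 GB already reserved by the JVM
jvm_max=2048         # Xmx 2 GB target
other=3072           # "other stuff" eats the remaining 3 GB
free=$((vm_ram - jvm_reserved - other))   # what the OS has left to hand out
need=$((jvm_max - jvm_reserved))          # what the JVM will eventually ask for
echo "free=${free}MB need=${need}MB"
```

With these numbers the expansion needs 1024 MB but 0 MB is free, so the allocation fails - hence the native OOM.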
>>> 
>>>          but AFAIK (I'm not 100% sure) there would still be one of those
>>>          hs_err_*.log files with some details confirming the above
>>>          (like "out of native memory", kind of thing).
>>>           
>>> 
>>>              here's what I saw in a sandbox job [a] that just hit this:
>>> 
>>>              Out of memory: Kill process 11546 (java) score 933 or
>>>              sacrifice child
>>>              (more debug output is there in the console log)
>>> 
>>>              These ODL systems start with 4G and we are setting the max
>>>              mem for the odl java process to be 2G.
>>> 
>>> 
>>>          erm, I'm not quite following what is 2 and what is 3 here.. but
>>>          does my description above help you to narrow this down?
>>>           
>>> 
>>>              I don't think we see this with Carbon, which makes me believe
>>>              it's *not* some problem from outside of ODL (e.g. not a kernel
>>>              bug from when we updated the java builder image back on 10/20)
>>> 
>>>              I'll keep digging at this. Ideas are welcome for things to
>>>              look at.
>>> 
>>> 
>>> 
>>>              [a]
>>>              https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-pike-jamo-upstream-stateful-snat-conntrack-oxygen/7/consoleFull
>>> 
>>> 
>>> 
>>> 
>>> 
>>>              > BTW: The most common fix ;) for JVM crashes often is simply
>>>              > upgrading to the latest available patch version of OpenJDK..
>>>              > but I'm guessing/hoping we run from RPM and already have the
>>>              > latest - or is this possibly running on an older JVM version
>>>              > package that was somehow "held back" via special dnf
>>>              > instructions, or manually installed from a ZIP, kind of thing?
>>> 
>>> 
>>>              these systems are built and updated periodically. jdk is
>>>              installed with "yum install". The specific version in [a] is:
>>> 
>>>              10:57:33 Set Java version
>>>              10:57:34 JDK default version...
>>>              10:57:34 openjdk version "1.8.0_144"
>>>              10:57:34 OpenJDK Runtime Environment (build 1.8.0_144-b01)
>>>              10:57:34 OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)
>>> 
>>> 
>>>          OK, that seems to be the latest one I also have locally on Fedora 26.
>>> 
>>>              Thanks,
>>>              JamO
>>> 
>>> 
>>> 
>>>              >     JamO
>>>              >
>>>              >
>>>              >     >
>>>              >     > _______________________________________________
>>>              >     > controller-dev mailing list
>>>              >     > [email protected]
>>>              >     > https://lists.opendaylight.org/mailman/listinfo/controller-dev
>>>              >     >
>>>              >
>>>              >
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>      --
>>>      Thanks
>>>      Anil
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
