On 07/02/2018 01:46 PM, Victor Pickard wrote:
On Mon, Jul 2, 2018 at 2:44 PM Tom Pantelis <tompante...@gmail.com> wrote:
On Mon, Jul 2, 2018 at 2:15 PM, Victor Pickard <vpick...@redhat.com> wrote:
Hi all,
I'm looking at clustering stability. One of the jobs I've been looking
at is controller clustering. This is a
good CSIT, in that it stops and starts ODL several times during the run.
In one of the failed test runs (sandbox; the logs from last week have been
wiped, but I do have this particular karaf log archived locally), ODL is
started and REST calls fail during the test. Looking at the logs, I can see
why: Karaf failed to start, or rather, took a really long time to start. In
the snippet below, you can see a gap of about 7 minutes between when Karaf
launched and when it did something else, maybe restarted again. The main
point is that Karaf failed to start in a timely manner, taking over 7 minutes
to begin starting up blueprints, etc.
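A gap like this can be found mechanically rather than by reading the log. The following is a minimal sketch, not part of the CSIT suite: it assumes the java.util.logging style timestamps seen in the snippet at the end of this message, and `find_gaps` and its one-minute threshold are hypothetical names and values chosen for illustration.

```python
from datetime import datetime, timedelta
import re

# Matches headers such as "Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main launch".
# Assumes English month abbreviations (the default C/English locale).
STAMP = re.compile(r"[A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M")


def find_gaps(lines, threshold=timedelta(minutes=1)):
    """Return (previous, current) timestamp pairs more than `threshold` apart."""
    gaps = []
    previous = None
    for line in lines:
        # finditer also copes with wrapped lines that carry a timestamp mid-line.
        for match in STAMP.finditer(line):
            current = datetime.strptime(match.group(0), "%b %d, %Y %I:%M:%S %p")
            if previous is not None and current - previous > threshold:
                gaps.append((previous, current))
            previous = current
    return gaps


# Lines taken from the log snippet in this thread.
sample = [
    "Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired",
    "INFO: Lock acquired. Setting startlevel to 100",
    "Jun 29, 2018 3:50:48 PM org.apache.karaf.main.Main launch",
    "INFO: Installing and starting initial bundles",
]

for start, end in find_gaps(sample):
    print(f"silent for {end - start} after {start}")
```

Run against an archived karaf log, this would only point at where the silence is, not why it happened; a thread dump taken during the gap is still needed for that.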
I ran a job that had karaf debug logging enabled with this setting:
log4j.rootLogger=DEBUG
This did not go very well. It generates far too much debug output, which
caused timeouts and various other errors in the CSIT run.
So, my questions are:
1. Has anyone seen this issue, where karaf seems to hang on startup (after a
kill -9 on the karaf pid)? If so, is this a known issue?
2. What debug logging would be needed to figure out why karaf was hanging?
Note that the setting above generated a log file of ~768 MB in a very short
timespan.
Vic - does this happen if you gracefully shut it down? In years past with
karaf I recall corruption could occur in
the bundle cache under data if the karaf process was killed. I don't know
if that potential issue is still present
with karaf 4. Does it clean the data dir before restarting? If not, it
would be good to do so to be safe.
Other than that, we probably need to get a thread dump.
I'm thinking that you want a thread dump at the point in time where this issue occurs? I'm not sure we can easily do
this in CSIT. I see that thread dumps are (supposed to be) collected for jobs, but only at the very end of the test
suite run. The last runs don't seem to be working because the jstack command cannot be found.
This has been sitting broken for some time now. Nobody cared about it
until now. We should try to get it fixed. I'll add it as a subtask to your
new bug (CONTROLLER-1845) and assign it to myself.
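Why jstack is not found in the CSIT jobs is not stated in this thread. One way to make the lookup more forgiving is sketched below, under the assumption that a JDK may be installed without its `bin` directory being on PATH; `find_jstack` is a hypothetical helper.

```python
import os
import shutil


def find_jstack(env=None):
    """Return a path to an executable jstack, or None if none is found.

    Looks in $JAVA_HOME/bin first, then falls back to searching PATH.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "jstack")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # shutil.which falls back to the process PATH when path is None.
    return shutil.which("jstack", path=env.get("PATH"))
```

A `None` result means a thread dump cannot be collected on that machine, which is worth reporting loudly instead of failing silently at the end of the suite.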
I'm thinking we would have to add some additional checking to see whether karaf started properly and in a timely manner,
and dump the threads right then if a failure is detected.
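The check described above could look roughly like the following. This is a sketch under stated assumptions, not an existing CSIT keyword: `wait_for_start`, `port_open`, and `dump_threads` are hypothetical names, "started" is approximated by a TCP port accepting connections, and the five-minute timeout is arbitrary. Only `jstack <pid>` itself is the standard JDK invocation.

```python
import socket
import subprocess
import time


def port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def dump_threads(pid, out_path, jstack="jstack"):
    """Write the output of `jstack <pid>` to out_path.

    Raises FileNotFoundError if the jstack executable cannot be found.
    """
    with open(out_path, "w") as out:
        subprocess.run([jstack, str(pid)], stdout=out,
                       stderr=subprocess.STDOUT, check=False)


def wait_for_start(is_up, on_timeout, timeout_s=300.0, poll_s=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll is_up() until it returns True or timeout_s elapses.

    On timeout, call on_timeout() once (e.g. to dump threads) and return False.
    """
    deadline = clock() + timeout_s
    while True:
        if is_up():
            return True
        if clock() >= deadline:
            on_timeout()
            return False
        sleep(poll_s)
```

A call might look like `wait_for_start(lambda: port_open("127.0.0.1", 8181), lambda: dump_threads(karaf_pid, "/tmp/karaf-threads.txt"))`, where port 8181, `karaf_pid`, and the output path are placeholders. The point of the design is that the dump is taken while karaf is still stuck, not at the end of the suite.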
Can you point me to a specific robot keyword that would fail in this
case? I can dig it up eventually, but I'm hoping you have it readily
available.
Jamo, do we have anything like this in CSIT now that you know of?
I think we can add some jstack dumping to certain keywords, or to test
case teardowns (to use in the sandbox), and it shouldn't be hard to do
(once we get jstack working again). But no, we don't have anything like
it at this point. It's another subtask on my plate now.
Thanks
JamO
Thanks,
Vic
Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main launch
INFO: Installing and starting initial bundles
Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main launch
INFO: All initial bundles installed and set to start
Jun 29, 2018 3:43:47 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Trying to lock /tmp/karaf-0.8.3-SNAPSHOT/lock
Jun 29, 2018 3:43:47 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired
Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
INFO: Lock acquired. Setting startlevel to 100
Jun 29, 2018 3:50:48 PM org.apache.karaf.main.Main launch
INFO: Installing and starting initial bundles
Jun 29, 2018 3:50:49 PM org.apache.karaf.main.Main launch
INFO: All initial bundles installed and set to start
Jun 29, 2018 3:50:49 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Trying to lock /tmp/karaf-0.8.3-SNAPSHOT/lock
Jun 29, 2018 3:50:49 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired
Jun 29, 2018 3:50:49 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
INFO: Lock acquired. Setting startlevel to 100
_______________________________________________
controller-dev mailing list
controller-dev@lists.opendaylight.org
<mailto:controller-dev@lists.opendaylight.org>
https://lists.opendaylight.org/mailman/listinfo/controller-dev