On Mon, Jul 2, 2018 at 3:26 PM Victor Pickard <vpick...@redhat.com> wrote:
>
> On Mon, Jul 2, 2018 at 2:44 PM Tom Pantelis <tompante...@gmail.com> wrote:
>
>> On Mon, Jul 2, 2018 at 2:15 PM, Victor Pickard <vpick...@redhat.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm looking at clustering stability. One of the jobs I've been looking at
>>> is controller clustering. This is a good CSIT, in that it stops and starts
>>> ODL several times during the run.
>>>
>>> In one of the failed test runs (sandbox, logs wiped from last week, but I
>>> do have this particular karaf log archived locally), ODL is started, and
>>> REST calls fail during the test. Looking at the logs, I can see why. Karaf
>>> failed to start, or better yet, took a really long time to start. From the
>>> snippet below, you can see about 7 mins between when Karaf launched and
>>> when it did something (maybe restarted again). But the main thing is that
>>> karaf failed to start in a timely manner, taking over 7 minutes to begin
>>> to start up blueprints, etc.
>>>
>>> I ran a job that had karaf debug logging enabled with this setting:
>>>
>>> log4j.rootLogger=DEBUG
>>>
>>> This did not go very well. It generates way too much debug info, and was
>>> causing timeouts and various other errors in the CSIT run.
>>>
>>> So, my questions are:
>>>
>>> 1. Has anyone seen this issue where karaf seems to hang on startup (after
>>> a kill -9 on karaf pid)? If so, is this a known issue?
>>>
>>> 2. What debug would be needed to figure out why karaf was hanging? Note
>>> the above generated a log file of ~768 MB in a very short timespan.
>>>
>>
>> Vic - does this happen if you gracefully shut it down?
>>
>
> Hi Tom,
> I haven't tried that. I'm just running the controller csit, which does a
> kill -9 on karaf pid.
>
>> In years past with karaf I recall corruption could occur in the bundle
>> cache under data if the karaf process was killed. I don't know if that
>> potential issue is still present with karaf 4. Does it clean the data dir
>> before restarting?
>> If not, it would be good to do so to be safe.
>>
>
> Here are the steps from the controller csit job for restarting ODL
> (Restart Odl With Tell Based False). Looking at this, yes, the data dir is
> deleted.
>
> 1. kill -9 on karaf pid ( 'ps axf | grep org.apache.karaf | grep -v grep |
>    awk '{print "kill -9 " $1}' | sh' )
> 2. Verify karaf is not running
> 3. Set Tell Based to False in config file
> 4. Copy karaf logs to /tmp
> 5. Clean the following directories
>
>    1. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/tmp/
>    2. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/data/
>    3. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/cache/
>    4. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/snapshots/
>    5. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/journal/
>    6. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/etc/opendaylight/current/
>    7. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/etc/host.key
>
> 6. Copy logs back to new snapshot dir, as below:
>
>    1. mkdir -p '/tmp/karaf-0.8.3-SNAPSHOT/data' && rm -vrf
>       '/tmp/karaf-0.8.3-SNAPSHOT/log' && mv -vf '/tmp/log'
>       '/tmp/karaf-0.8.3-SNAPSHOT/data/
>

Tom,

After re-reading my last mail, I see that the data directory is
cleaned/removed, and then, as the last step, what was stashed away is copied
back to the newly created data dir. From what I see, this only copies the
logs from the previous karaf instance back to the newly created dir. So,
this seems ok, agree?

Here are more details on step #4 above, where the karaf logs are copied to
/tmp:

mkdir -p '/tmp' && rm -vrf '/tmp/log' && mv -vf '/tmp/karaf-0.8.3-SNAPSHOT/data/log' '/tmp/'

>
>> Other than that, we probably need to get a thread dump.
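
For the thread dump, here is a minimal sketch of how a few dumps could be collected while karaf appears stuck in startup. It relies on `jstack`, which ships with the JDK and has to run as the same user as the karaf JVM. The helper names `karaf_pids` and `dump_threads`, the output directory, and the count/interval defaults are illustrative assumptions and not part of the CSIT job.

```shell
# Print the PIDs of Karaf JVMs, given `ps axf`-style lines on stdin.
# Same match as the CSIT kill command, minus the kill.
karaf_pids() {
    awk '/org\.apache\.karaf/ && !/grep/ {print $1}'
}

# Take several thread dumps of one JVM, a few seconds apart, so that a
# thread blocked in the same place across dumps stands out.
#   $1 = pid, $2 = output directory (hypothetical),
#   $3 = number of dumps (default 5), $4 = seconds between dumps (default 10)
dump_threads() {
    td_pid=$1
    td_dir=$2
    td_count=${3:-5}
    td_interval=${4:-10}
    mkdir -p "$td_dir"
    td_i=1
    while [ "$td_i" -le "$td_count" ]; do
        # -l adds lock information; stop early if the JVM has gone away
        jstack -l "$td_pid" > "$td_dir/threads-$td_pid-$td_i.txt" 2>&1 || break
        sleep "$td_interval"
        td_i=$((td_i + 1))
    done
}

# Example (not run here):
#   for pid in $(ps axf | karaf_pids); do
#       dump_threads "$pid" /tmp/karaf-thread-dumps
#   done
```

If `jstack` is not available on the test VM, `kill -3 <pid>` (SIGQUIT) should make a HotSpot JVM print the same kind of dump to its own stdout without stopping the process.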
>>
>>> Thanks,
>>>
>>> Vic
>>>
>>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main launch
>>> INFO: Installing and starting initial bundles
>>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main launch
>>> INFO: All initial bundles installed and set to start
>>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.lock.SimpleFileLock lock
>>> INFO: Trying to lock /tmp/karaf-0.8.3-SNAPSHOT/lock
>>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.lock.SimpleFileLock lock
>>> INFO: Lock acquired
>>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
>>> INFO: Lock acquired. Setting startlevel to 100
>>> Jun 29, 2018 3:50:48 PM org.apache.karaf.main.Main launch
>>> INFO: Installing and starting initial bundles
>>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.Main launch
>>> INFO: All initial bundles installed and set to start
>>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.lock.SimpleFileLock lock
>>> INFO: Trying to lock /tmp/karaf-0.8.3-SNAPSHOT/lock
>>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.lock.SimpleFileLock lock
>>> INFO: Lock acquired
>>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
>>> INFO: Lock acquired. Setting startlevel to 100
>>>
>>> _______________________________________________
>>> controller-dev mailing list
>>> controller-dev@lists.opendaylight.org
>>> https://lists.opendaylight.org/mailman/listinfo/controller-dev
>>>
>>
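
Since DEBUG on the root logger produced far too much output, a cheaper first pass might be to scan the existing karaf console log for silent periods like the one in the excerpt above. The sketch below is only that: `log_gaps` is a hypothetical helper, the 60-second default threshold is arbitrary, and it assumes all timestamps fall on the same calendar day (it compares time of day only).

```shell
# Report gaps between consecutive timestamp lines of the form
# "Jun 29, 2018 3:43:47 PM <logger> <method>" read from stdin.
#   $1 = minimum gap in seconds worth reporting (default 60)
# Assumption: no midnight rollover inside the input.
log_gaps() {
    lg_threshold=${1:-60}
    awk -v threshold="$lg_threshold" '
        # Only lines that start with "Mon DD, YYYY H:MM:SS AM|PM"
        $1 ~ /^[A-Z][a-z][a-z]$/ && $4 ~ /^[0-9]+:[0-9]+:[0-9]+$/ &&
        ($5 == "AM" || $5 == "PM") {
            split($4, t, ":")
            h = t[1] % 12                 # 12 AM -> 0, 12 PM -> 12 below
            if ($5 == "PM") h += 12
            now = h * 3600 + t[2] * 60 + t[3]
            if (seen && now - prev >= threshold)
                printf "%d seconds of silence before: %s %s\n", now - prev, $4, $5
            prev = now
            seen = 1
        }
    '
}

# Example (not run here; the log path is an assumption):
#   log_gaps 60 < /tmp/karaf-0.8.3-SNAPSHOT/data/log/karaf_console.log
```

Run against the excerpt above, this should flag the jump from 3:43:47 PM to 3:50:48 PM as 421 seconds, which is the window where thread dumps would be most useful.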