On Mon, Jul 2, 2018 at 3:26 PM Victor Pickard <vpick...@redhat.com> wrote:

>
>
> On Mon, Jul 2, 2018 at 2:44 PM Tom Pantelis <tompante...@gmail.com> wrote:
>
>>
>>
>> On Mon, Jul 2, 2018 at 2:15 PM, Victor Pickard <vpick...@redhat.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm looking at clustering stability. One of the jobs I've been looking at 
>>> is controller clustering. This is a good CSIT, in that it stops and starts 
>>> ODL several times during the run.
>>>
>>> In one of the failed test runs (sandbox; the logs from last week have been
>>> wiped, but I do have this particular karaf log archived locally), ODL is
>>> started, and REST calls fail during the test. Looking at the logs, I can
>>> see why: Karaf failed to start, or rather, took a really long time to
>>> start. In the snippet below, there are about 7 minutes between when Karaf
>>> launched and when it did something else (possibly restarted). The main
>>> point is that karaf did not start in a timely manner, taking over 7
>>> minutes to begin starting up blueprints, etc.
>>>
>>>
>>> I ran a job that had karaf debug logging enabled with this setting:
>>>
>>> log4j.rootLogger=DEBUG
>>>
>>>
>>> This did not go very well. It generates far too much debug output, and it
>>> caused timeouts and various other errors in the CSIT run.
>>>
>>>
>>> So, my questions are:
>>>
>>> 1. Has anyone seen this issue where karaf seems to hang on startup (after
>>> a kill -9 on the karaf pid)? If so, is this a known issue?
>>>
>>> 2. What debug output would be needed to figure out why karaf was hanging?
>>> Note that the above generated a log file of ~768 MB in a very short time span.
>>>
>>>
>> Vic, does this happen if you gracefully shut it down?
>>
>
> Hi Tom,
> I haven't tried that. I'm just running the controller CSIT, which does a
> kill -9 on the karaf pid.
>
>
>> In years past with karaf I recall corruption could occur in the bundle
>> cache under data if the karaf process was killed. I don't know if that
>> potential issue is still present with karaf 4. Does it clean the data dir
>> before restarting? If not, it would be good to do so to be safe.
>>
>
> Here are the steps from the controller CSIT job for restarting ODL
> (Restart Odl With Tell Based False). Looking at this, yes, the data dir is
> deleted.
>
> 1. kill -9 on karaf pid ( 'ps axf | grep org.apache.karaf | grep -v grep |
> awk '{print "kill -9 " $1}' | sh' )
> 2. Verify karaf is not running
> 3. Set Tell Based to False in config file
> 4. Copy karaf logs to /tmp
> 5. Clean the following directories and files:
>
>    1. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/tmp/
>    2. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/data/
>    3. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/cache/
>    4. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/snapshots/
>    5. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/journal/
>    6. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/etc/opendaylight/current/
>    7. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/etc/host.key
>
> 6. Move the stashed logs back into the new data dir, as below:
>
>    1. mkdir -p '/tmp/karaf-0.8.3-SNAPSHOT/data' && rm -vrf
>    '/tmp/karaf-0.8.3-SNAPSHOT/log' && mv -vf '/tmp/log'
>    '/tmp/karaf-0.8.3-SNAPSHOT/data/'
>
>
Tom,
After re-reading my last mail, I see that the data directory is removed,
and the last step then copies what was stashed away back into the newly
created data dir. As far as I can tell, this only copies the logs from the
previous karaf instance back into the new data dir, so this seems OK.
Agree?

Here are more details on step #4 above, where the karaf logs are copied to
/tmp:

mkdir -p '/tmp' && rm -vrf '/tmp/log' && mv -vf
'/tmp/karaf-0.8.3-SNAPSHOT/data/log' '/tmp/'
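For reference, the kill/stash/clean/restore sequence from the steps above can be sketched as a small shell script. This is not the CSIT job's actual implementation: the function names and the KARAF_HOME and LOG_STASH variables are hypothetical (their defaults are the paths used by the job), step 3 (the tell-based config change) is left out, and the pkill/pgrep match on org.apache.karaf simply mirrors the ps|grep|awk pipeline in step 1.

```shell
#!/bin/sh
# Hypothetical sketch of the CSIT restart cleanup (step 3, the tell-based
# config change, is omitted). Defaults mirror the paths used by the job.
KARAF_HOME="${KARAF_HOME:-/tmp/karaf-0.8.3-SNAPSHOT}"
LOG_STASH="${LOG_STASH:-/tmp/log}"

stop_karaf() {
    # Steps 1-2: hard-kill anything whose command line mentions
    # org.apache.karaf (same match as the ps|grep|awk pipeline), then verify.
    pkill -9 -f org.apache.karaf
    sleep 1
    if pgrep -f org.apache.karaf >/dev/null; then
        echo "karaf is still running" >&2
        return 1
    fi
}

stash_logs() {
    # Step 4: move data/log out of the way before data/ is deleted.
    rm -rf "$LOG_STASH"
    mkdir -p "$(dirname "$LOG_STASH")"
    mv -f "$KARAF_HOME/data/log" "$LOG_STASH"
}

clean_state() {
    # Step 5: remove runtime state; ${KARAF_HOME:?} aborts if the variable is empty.
    for p in tmp data cache snapshots journal etc/opendaylight/current etc/host.key; do
        rm -rf "${KARAF_HOME:?}/$p"
    done
}

restore_logs() {
    # Step 6: recreate data/ and move the previous instance's logs back to data/log.
    mkdir -p "$KARAF_HOME/data"
    rm -rf "$KARAF_HOME/log"
    mv -f "$LOG_STASH" "$KARAF_HOME/data/log"
}
```

Usage would be roughly `stop_karaf && stash_logs && clean_state && restore_logs`, then start karaf as usual. Splitting it into functions makes it easy to check that only data/log survives the wipe.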




>
>
>
>> Other than that, we probably need to get a thread dump.
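A thread dump taken while karaf appears stuck would show where startup is blocked; several dumps a few seconds apart make a genuinely stuck thread easier to spot. Below is a minimal sketch, not something the CSIT job does today. The helper names, the default count, interval and output directory are hypothetical, and it assumes a JDK `jstack` is on the PATH of the node running karaf.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the CSIT job): find the karaf JVM and take
# several thread dumps a few seconds apart while startup appears stuck.

find_karaf_pid() {
    # Karaf's launcher class, as seen in the log lines quoted in this thread.
    pgrep -f org.apache.karaf.main.Main | head -n 1
}

dump_threads() {
    pid="$1"
    count="${2:-3}"      # number of dumps to take
    interval="${3:-10}"  # seconds between dumps
    outdir="${4:-/tmp/karaf-threaddumps}"
    if [ -z "$pid" ]; then
        echo "usage: dump_threads PID [COUNT] [INTERVAL] [OUTDIR]" >&2
        return 1
    fi
    mkdir -p "$outdir" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        # jstack ships with the JDK; 'jcmd "$pid" Thread.print' is an alternative,
        # and 'kill -3 "$pid"' prints the dump to the JVM's stdout instead.
        jstack -l "$pid" > "$outdir/threads-$pid-$i.txt" || return 1
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

For example, `dump_threads "$(find_karaf_pid)" 3 10` during the 7-minute window would leave three dumps to compare.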
>>
>>> Thanks,
>>>
>>> Vic
>>>
>>>
>>>
>>>
>>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main launch
>>> INFO: Installing and starting initial bundles
>>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main launch
>>> INFO: All initial bundles installed and set to start
>>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.lock.SimpleFileLock lock
>>> INFO: Trying to lock /tmp/karaf-0.8.3-SNAPSHOT/lock
>>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.lock.SimpleFileLock lock
>>> INFO: Lock acquired
>>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
>>> INFO: Lock acquired. Setting startlevel to 100
>>> Jun 29, 2018 3:50:48 PM org.apache.karaf.main.Main launch
>>> INFO: Installing and starting initial bundles
>>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.Main launch
>>> INFO: All initial bundles installed and set to start
>>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.lock.SimpleFileLock lock
>>> INFO: Trying to lock /tmp/karaf-0.8.3-SNAPSHOT/lock
>>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.lock.SimpleFileLock lock
>>> INFO: Lock acquired
>>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
>>> INFO: Lock acquired. Setting startlevel to 100
>>>
>>>
>>>
>>> _______________________________________________
>>> controller-dev mailing list
>>> controller-dev@lists.opendaylight.org
>>> https://lists.opendaylight.org/mailman/listinfo/controller-dev
>>>
>>>
>>
