On 2/21/11 11:37 AM, Alex Karasulu wrote:
On Mon, Feb 21, 2011 at 11:24 AM, Emmanuel Lecharny <[email protected]> wrote:
Hi guys,
this weekend, I experienced some bad issues with the server-integ tests while trying to use the same LdapDecoder for both client and server. After 3 or 4 tests, the run gets blocked for 30 seconds, then moves on to the next test, gets blocked again, etc.
Of course, at the end, I got test failures, as expected.
I did some deep analysis of what was going on, added some traces, and that led me to a point where I get blocked while trying to access any of the LdapContainer methods.
Here is an example of the place where I get blocked:
public void decode( ByteBuffer stream, Asn1Container container ) throws DecoderException
{
    boolean hasRemaining = stream.hasRemaining();
    System.out.println( "<Has Remaining> : " + hasRemaining + "/" + stream.remaining() );

    // Increment the PDU size counter.
    int remaining = stream.remaining();
    System.out.println( "----- 0" );
    container.incrementDecodeBytes( remaining );
    System.out.println( "----- 1" );
    ...
In the console, the "----- 0" gets printed, but not the "----- 1". The incrementDecodeBytes method is very simple:
public void incrementDecodeBytes( int nb )
{
    System.out.println( "---in incrementDecodedBytes" );
    decodeBytes += nb;
}
(Note that I don't get the "---in incrementDecodedBytes" message either.)
The method is not synchronized, nor is decodeBytes an AtomicInteger.
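For reference, here is a minimal sketch of what that counter would look like with an AtomicInteger, in case we want to rule out any monitor on that field entirely. The class name is just an illustration, not the real Asn1Container code:

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the container's byte counter, for illustration only.
public class ContainerCounterSketch
{
    // Lock-free counter: addAndGet() never blocks on a monitor.
    private final AtomicInteger decodeBytes = new AtomicInteger( 0 );

    public void incrementDecodeBytes( int nb )
    {
        decodeBytes.addAndGet( nb );
    }

    public int getDecodeBytes()
    {
        return decodeBytes.get();
    }
}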
I did a thread dump, which is available here: http://pastebin.com/wPNPL1FY
It shows that 15 threads are blocked on monitors:
pool-24-thread-1                       Task    5  51326976  5680140288  waiting on condition [152900000]
pool-22-thread-1                       Task    5  50612224  5684236288  waiting on condition [152ce8000]
pool-14-thread-4                       Task    5  56862720  5715816448  waiting on condition [154b06000]
net.sf.ehcache.CacheManager@6f7cf6b6   Daemon  5  27879424  5657616384  in Object.wait() [151385000]
main                                   Task    5  50333696  4300214272  waiting on condition [1004fe000]
kdcReplayCache.data                    Daemon  5  50774016  5658677248  waiting on condition [151488000]
groupCache.data                        Daemon  5  50778112  5660798976  waiting on condition [15168e000]
changePwdReplayCache.data              Daemon  5  27881472  5659738112  waiting on condition [15158b000]
SynchWorkerThread                      Task    5  35258368  5652594688  in Object.wait() [150ebb000]
Surefire fork timeout timer            Task    5  26988544  5647831040  in Object.wait() [150a30000]
Reference Handler                      Daemon 10  50362368  5634641920  in Object.wait() [14fd9c000]
Finalizer                              Daemon  8  50364416  5635702784  in Object.wait() [14fe9f000]
FelixStartLevel                        Daemon  5  35293184  5670658048  in Object.wait() [151ff5000]
FelixPackageAdmin                      Daemon  5  33517568  5655359488  in Object.wait() [15115e000]
FelixDispatchQueue                     Task    5  28618752  5661859840  in Object.wait() [151791000]
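As a complement to the pastebin dump, we could also ask the JVM directly which monitor each blocked thread is waiting on and which thread owns it. Here is a minimal sketch using only the standard java.lang.management API (nothing ApacheDS-specific, the class name is just for illustration):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedThreadReport
{
    // Prints every BLOCKED thread, the lock it is waiting on, and the lock owner.
    public static void dump()
    {
        ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = mxBean.dumpAllThreads( true, true );

        for ( ThreadInfo info : infos )
        {
            if ( info.getThreadState() == Thread.State.BLOCKED )
            {
                System.out.println( info.getThreadName() + " is blocked on " + info.getLockName()
                    + ", owned by " + info.getLockOwnerName() );
            }
        }
    }
}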
I have *no idea* why the test gets blocked; I see no rational reason for it to be blocked. I'll add the thread id in the logs to see which thread it is running on...
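For those traces, a hypothetical helper like the one below would be enough to tag every message with the current thread name (not committed code, just what I have in mind):

public final class ThreadTrace
{
    // Prefixes each trace message with the name of the thread that emits it.
    public static void trace( String message )
    {
        System.out.println( "[" + Thread.currentThread().getName() + "] " + message );
    }
}

// In decode(), the existing printlns would then become:
//     ThreadTrace.trace( "----- 0" );
//     container.incrementDecodeBytes( remaining );
//     ThreadTrace.trace( "----- 1" );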
Can Felix be an issue here?
OK, let's just list what we know:
(1) These problems were not present until we experimented with changes to
the codec.
(2) A small part of codec code may be running inside Felix, but only if
we're hitting some of these extra controls. The codec core is not inside
Felix.
It's probably unlikely that Felix is the culprit, but who knows: are there specific tests showing this behavior, or is it all of them? This behavior is weird, so we should suspect everything. Can you post the stack traces from the thread dump?
I have a reproducible scenario; I'll commit it in a branch. I'm not sure at all that Felix is the culprit, but even if it's innocent, as the new kid on the block, throwing stones at it seems pretty natural to me :)
This is *very* weird.
It seems that there is some contention on the container instance, which
is simply strange.
More to come after lunch.
--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com