On Mon, Feb 21, 2011 at 11:24 AM, Emmanuel Lecharny <[email protected]>wrote:
> Hi guys,
>
> this week-end, I experienced some bad issues with the server-integ tests,
> while trying to use the same LdapDecoder for both client and server. After 3
> or 4 tests, I get blocked for 30 seconds, and then move to the next test,
> get blocked again, etc.
>
> Of course, at the end, I got test failures, as expected.
>
> I did some deep analysis on what was going on, added some traces, and that
> led me to a point where I get blocked while trying to access to any of the
> LdapContainer method.
>
> Here is an exemple of the plcae I get blocked in :
>
> public void decode( ByteBuffer stream, Asn1Container container ) throws
> DecoderException
> {
> boolean hasRemaining = stream.hasRemaining();
> System.out.println( "<Has Remaining> : " + hasRemaining + "/" +
> stream.remaining() );
>
> // Increment the PDU size counter.
> int remaining = stream.remaining();
> System.out.println( "----- 0" );
> container.incrementDecodeBytes( remaining );
>
> System.out.println( "----- 1" );
>
> and in the console, the "----- 0" get printed, not the "----- 1". The
> incrementDecodeBytes is a very simple method :
>
> public void incrementDecodeBytes( int nb )
> {
> System.out.println( "---in incrementDecodedBytes" );
> decodeBytes += nb;
> }
>
> (note that I don't get either the "---in incrementDecodedBytes" message).
> It's not synchronized, not the decodeBytes is an AtomicInteger.
>
> I did a threadDump, which is available here : http://pastebin.com/wPNPL1FY
>
> It shows that 15 threads are blocked on monitors :
>
> pool-24-thread-1 Task 5 51326976 5680140288
> waiting on condition [152900000]
> pool-22-thread-1 Task 5 50612224 5684236288
> waiting on condition [152ce8000]
> pool-14-thread-4 Task 5 56862720 5715816448
> waiting on condition [154b06000]
> net.sf.ehcache.CacheManager@6f7cf6b6 Daemon 5 27879424
> 5657616384 in Object.wait() [151385000]
> main Task 5 50333696 4300214272 waiting on
> condition [1004fe000]
> kdcReplayCache.data Daemon 5 50774016 5658677248
> waiting on condition [151488000]
> groupCache.data Daemon 5 50778112 5660798976
> waiting on condition [15168e000]
> changePwdReplayCache.data Daemon 5 27881472 5659738112
> waiting on condition [15158b000]
> SynchWorkerThread Task 5 35258368 5652594688 in
> Object.wait() [150ebb000]
> Surefire fork timeout timer Task 5 26988544 5647831040
> in Object.wait() [150a30000]
> Reference Handler Daemon 10 50362368 5634641920 in
> Object.wait() [14fd9c000]
> Finalizer Daemon 8 50364416 5635702784 in
> Object.wait() [14fe9f000]
> FelixStartLevel Daemon 5 35293184 5670658048 in
> Object.wait() [151ff5000]
> FelixPackageAdmin Daemon 5 33517568 5655359488 in
> Object.wait() [15115e000]
> FelixDispatchQueue Task 5 28618752 5661859840 in
> Object.wait() [151791000]
>
>
> I have *no idea* why the test get blocked, I see no rational reason for it
> to be blocked. I'll add the thread-id n the logs to see which thread it is
> running on...
>
> Can felix be an issue here ?
OK let's just list what we know:
(1) These problems were not present until we experimented with changes to
the codec.
(2) A small part of codec code may be running inside Felix, but only if
we're hitting some of these extra controls. The codec core is not inside
Felix.
It's probably unlikely that Felix is the culprit but who knows: are there
specific tests that are showing this behavior or is it all of them? This
behavior is weird so we should suspect everything. Can you post the stack
traces from the thread dump?
Thanks,
Alex