On 2/21/11 11:37 AM, Alex Karasulu wrote:
On Mon, Feb 21, 2011 at 11:24 AM, Emmanuel Lecharny <[email protected]> wrote:
Hi guys,
this weekend, I experienced some bad issues with the server-integ tests while trying to use the same LdapDecoder for both client and server. After 3 or 4 tests, the run gets blocked for 30 seconds, then moves on to the next test, gets blocked again, etc.
Of course, at the end, I got test failures, as expected.
I did some deep analysis of what was going on, added some traces, and that led me to a point where I get blocked while trying to access any of the LdapContainer methods.
Here is an example of the place where I get blocked:
public void decode( ByteBuffer stream, Asn1Container container ) throws DecoderException
{
    boolean hasRemaining = stream.hasRemaining();
    System.out.println( "<Has Remaining> : " + hasRemaining + "/" + stream.remaining() );

    // Increment the PDU size counter.
    int remaining = stream.remaining();
    System.out.println( "----- 0" );
    container.incrementDecodeBytes( remaining );
    System.out.println( "----- 1" );
    ...
In the console, the "----- 0" gets printed, but not the "----- 1". The incrementDecodeBytes method is very simple:
public void incrementDecodeBytes( int nb )
{
    System.out.println( "---in incrementDecodedBytes" );
    decodeBytes += nb;
}
(Note that I don't get the "---in incrementDecodedBytes" message either.)
The method is not synchronized, nor is decodeBytes an AtomicInteger.
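For reference, here is a minimal sketch of what that counter would look like with an AtomicInteger, in case we want to rule out any monitor on that field entirely. The class name is just an illustration, not the real Asn1Container code:

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the container's byte counter, for illustration only.
public class ContainerCounterSketch
{
    // Lock-free counter: addAndGet() never blocks on a monitor.
    private final AtomicInteger decodeBytes = new AtomicInteger( 0 );

    public void incrementDecodeBytes( int nb )
    {
        decodeBytes.addAndGet( nb );
    }

    public int getDecodeBytes()
    {
        return decodeBytes.get();
    }
}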
I did a thread dump, which is available here: http://pastebin.com/wPNPL1FY
It shows that 15 threads are blocked on monitors:
pool-24-thread-1                       Task    5  51326976  5680140288  waiting on condition [152900000]
pool-22-thread-1                       Task    5  50612224  5684236288  waiting on condition [152ce8000]
pool-14-thread-4                       Task    5  56862720  5715816448  waiting on condition [154b06000]
net.sf.ehcache.CacheManager@6f7cf6b6   Daemon  5  27879424  5657616384  in Object.wait() [151385000]
main                                   Task    5  50333696  4300214272  waiting on condition [1004fe000]
kdcReplayCache.data                    Daemon  5  50774016  5658677248  waiting on condition [151488000]
groupCache.data                        Daemon  5  50778112  5660798976  waiting on condition [15168e000]
changePwdReplayCache.data              Daemon  5  27881472  5659738112  waiting on condition [15158b000]
SynchWorkerThread                      Task    5  35258368  5652594688  in Object.wait() [150ebb000]
Surefire fork timeout timer            Task    5  26988544  5647831040  in Object.wait() [150a30000]
Reference Handler                      Daemon 10  50362368  5634641920  in Object.wait() [14fd9c000]
Finalizer                              Daemon  8  50364416  5635702784  in Object.wait() [14fe9f000]
FelixStartLevel                        Daemon  5  35293184  5670658048  in Object.wait() [151ff5000]
FelixPackageAdmin                      Daemon  5  33517568  5655359488  in Object.wait() [15115e000]
FelixDispatchQueue                     Task    5  28618752  5661859840  in Object.wait() [151791000]
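As a complement to the pastebin dump, we could also ask the JVM directly which monitor each blocked thread is waiting on and which thread owns it. Here is a minimal sketch using only the standard java.lang.management API (nothing ApacheDS-specific, the class name is just for illustration):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedThreadReport
{
    // Prints every BLOCKED thread, the lock it is waiting on, and the lock owner.
    public static void dump()
    {
        ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = mxBean.dumpAllThreads( true, true );

        for ( ThreadInfo info : infos )
        {
            if ( info.getThreadState() == Thread.State.BLOCKED )
            {
                System.out.println( info.getThreadName() + " is blocked on " + info.getLockName()
                    + ", owned by " + info.getLockOwnerName() );
            }
        }
    }
}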
I have *no idea* why the test gets blocked; I see no rational reason for it to be blocked. I'll add the thread id in the logs to see which thread it is running on...
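For those traces, a hypothetical helper like the one below would be enough to tag every message with the current thread name (not committed code, just what I have in mind):

public final class ThreadTrace
{
    // Prefixes each trace message with the name of the thread that emits it.
    public static void trace( String message )
    {
        System.out.println( "[" + Thread.currentThread().getName() + "] " + message );
    }
}

// In decode(), the existing printlns would then become:
//     ThreadTrace.trace( "----- 0" );
//     container.incrementDecodeBytes( remaining );
//     ThreadTrace.trace( "----- 1" );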
Can Felix be an issue here?
OK, let's just list what we know:
(1) These problems were not present until we experimented with changes to
the codec.
(2) A small part of codec code may be running inside Felix, but only if
we're hitting some of these extra controls. The codec core is not inside
Felix.
It's probably unlikely that Felix is the culprit, but who knows: are there specific tests showing this behavior, or is it all of them? This behavior is weird, so we should suspect everything. Can you post the stack traces from the thread dump?
I have a reproducible scenario; I'll commit it in a branch. I'm not sure at all that Felix is the culprit, but even if it's innocent, as the new kid on the block, throwing stones at it seems pretty natural to me :)
This is *very* weird.
It seems that there is some contention on the container instance, which
is simply strange.
More to come after lunch.
--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com