I was finally able to reproduce a couple of these cases (on Linux x64), and I think the problem is caused by the known weakness in the shutdown of daemon threads. gdb shows a SIGSEGV in the cancel handler, usually reporting a zombie thread.
In the current shutdown we register safepoint shutdown callbacks and do timed joins, waiting for the daemon threads to exit. We make a reasonable guess at the join timeout interval. After this, we kill the threads (on Linux, with pthread_cancel). When the cycle eater runs in the background, the join interval we have chosen is not enough. But sometimes, in the window between giving up on the joins and posting the cancel signals, the thread (created with the default attributes, so joinable rather than detached) finally completes the safepoint shutdown callback and exits. It is now a zombie or something like it, and would release all its resources on a join. But shutdown has already given up on the join and has started calling pthread_cancel(). The cancel signal cannot be handled on the zombie thread, and a SIGSEGV is raised. I don't know Linux well enough to know the exact dynamics of zombies.

I multiplied the join timeout interval by a factor of 100 and the errors went away, with the cycle eater still running in the background. I don't think we want to make changes like this in the VM. This is not a good way to tune wall clock times (some of which need to exist in the implementation).

I also have some concerns about how we are choosing to create these test scenarios. Artificial severe stress conditions can be simulated in tests, creating failures that are time consuming to debug, but I don't know how much extra information they give us. For example, we already know that daemon thread shutdown is not perfect. If we choose to create stresses, I think it is better to use real applications or well known workloads. In that case, failures would be more meaningful and would give us some good guidance on tuning things.

On 6/6/07, Vladimir Ivanov <[EMAIL PROTECTED]> wrote:
issue HARMONY-4080 was created to track it.

thanks, Vladimir

On 5/18/07, Vladimir Ivanov <[EMAIL PROTECTED]> wrote:
> The CC/CI report failures just now on linux x86_64 in default mode:
> -----------------------------
> Running test : thread.ThreadInterrupt
> *** FAILED **** : thread.ThreadInterrupt (139 res code)
> -----------------------------
>
> thanks, Vladimir
>
>
> On 5/18/07, Rana Dasgupta <[EMAIL PROTECTED]> wrote:
> > OK, I will also try to change this test to make it more meaningful
> > than it is now. We can then decide if we want to keep it or lose it?
> >
> > On 5/17/07, Pavel Rebriy <[EMAIL PROTECTED]> wrote:
> > > May be better modify tests to the correct way?
> > > The test gc.ThreadSuspension check suspension model during garbage
> > > collection. It is a very useful test for VM.
> > > --
> > > Best regards,
> > > Pavel Rebriy
> > >
> > >
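
For anyone who wants to experiment with the window between the timed join and the cancel described at the top of this message, here is a minimal sketch in C. It is not the VM's shutdown code, only an illustration under my own assumptions: daemon_t, start_daemon, shutdown_daemon and the shutdown_result values are hypothetical names, the timed join is emulated with a condition variable and pthread_cond_timedwait, and a sleep stands in for the safepoint shutdown callback. The difference from the sequence described above is that the sketch always does a final pthread_join after the timeout, so the thread is never left unjoined, and it tells a thread that was really cancelled apart from one that exited on its own in the window.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical names for illustration; none of this is VM code. */
typedef enum { JOINED_IN_TIME, EXITED_LATE, CANCELLED } shutdown_result;

typedef struct {
    pthread_t tid;
    pthread_mutex_t lock;
    pthread_cond_t done_cv;
    int done;     /* guarded by lock; set just before the thread returns */
    int work_ms;  /* simulated duration of the shutdown callback */
} daemon_t;

static void sleep_ms(int ms) {
    struct timespec ts = { ms / 1000, (long)(ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);  /* nanosleep is a cancellation point */
}

static void *daemon_main(void *arg) {
    daemon_t *d = arg;
    sleep_ms(d->work_ms);  /* stands in for the safepoint shutdown callback */
    pthread_mutex_lock(&d->lock);
    d->done = 1;
    pthread_cond_signal(&d->done_cv);
    pthread_mutex_unlock(&d->lock);
    return NULL;
}

/* Creates a joinable thread (the default attribute); returns 0 on success. */
int start_daemon(daemon_t *d, int work_ms) {
    d->done = 0;
    d->work_ms = work_ms;
    pthread_mutex_init(&d->lock, NULL);
    pthread_cond_init(&d->done_cv, NULL);
    return pthread_create(&d->tid, NULL, daemon_main, d);
}

shutdown_result shutdown_daemon(daemon_t *d, int timeout_ms) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    /* Timed wait for the thread to report that it has finished. */
    pthread_mutex_lock(&d->lock);
    int rc = 0;
    while (!d->done && rc == 0)
        rc = pthread_cond_timedwait(&d->done_cv, &d->lock, &deadline);
    int finished = d->done;
    pthread_mutex_unlock(&d->lock);

    /* The thread may still finish by itself between this check and the
       cancel; it has not been joined yet, so its ID is still valid. */
    if (!finished)
        pthread_cancel(d->tid);

    /* Always join, so the exited thread's resources are released. This
       blocks forever if the thread never reaches a cancellation point. */
    void *ret = NULL;
    rc = pthread_join(d->tid, &ret);
    assert(rc == 0);
    (void)rc;
    pthread_cond_destroy(&d->done_cv);
    pthread_mutex_destroy(&d->lock);

    if (finished)
        return JOINED_IN_TIME;
    return ret == PTHREAD_CANCELED ? CANCELLED : EXITED_LATE;
}
```

Build with something like cc -pthread. A thread that finishes well inside the timeout should come back as JOINED_IN_TIME, one that is still sleeping at the deadline as CANCELLED, and EXITED_LATE covers the window discussed above. This does nothing about the timeout being a guess; it only shows one way of not leaving an unjoined thread behind when the guess turns out to be wrong.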
