Re: [drlvm] run of smoke tests on overloaded box

Xiao-Feng Li Fri, 15 Jun 2007 23:10:55 -0700

On 6/16/07, Rana Dasgupta <[EMAIL PROTECTED]> wrote:

I could finally repro a couple of these cases (on Linux x64), and I
think the problem is caused by the known weakness in the shutdown of
daemon threads. gdb shows a SIGSEGV in the cancel handler, usually
reporting a zombie thread.


In the current shutdown we register safepoint shutdown callbacks and
do timed joins, waiting for the daemon threads to exit. We make a
reasonable guess at the join timeout interval. After this, we kill the
threads (on Linux, with pthread_cancel). When the cycle eater runs
in the background, the join interval we have chosen is not enough. But
sometimes, between the time we give up on the joins and the time we
post the cancel signals, the thread (created with the default
attributes, so joinable rather than detached) finally completes the
safepoint shutdown callback and exits. It is now a zombie or whatever,
and would release all its resources on join. But in shutdown we have
given up on join and have started pthread_cancel(). The cancel signal
cannot be handled on the zombie thread, and a SIGSEGV is raised. I
don't know Linux well enough to know the exact dynamics of zombies.
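
The window described above can be modeled in a few lines. This is
only a sketch, not the DRLVM shutdown code: it is written in Python,
which has no equivalent of pthread_cancel, so the kill step is left
out, and the names timed_join_all, release and worker are hypothetical.
It shows a thread that misses the join timeout and then exits by
itself before any forced termination would have been posted.

```python
import threading

def timed_join_all(threads, timeout_s):
    """Join each thread with a timeout; return the ones still running."""
    stragglers = []
    for t in threads:
        t.join(timeout_s)
        if t.is_alive():
            stragglers.append(t)
    return stragglers

# A worker whose shutdown callback is slow: it exits only once
# `release` is set (hypothetical stand-in for the safepoint callback).
release = threading.Event()
worker = threading.Thread(target=release.wait, daemon=True)
worker.start()

# Shutdown gives up on the join here because the timeout expires.
stragglers = timed_join_all([worker], timeout_s=0.05)

# The worker then finishes on its own, after we stopped waiting.
release.set()
worker.join()  # a later join reaps it without blocking for long

exited_after_give_up = worker in stragglers and not worker.is_alive()
```

Any forced termination issued for the threads in stragglers would, in
this run, target a thread that has already exited, which is the state
the SIGSEGV report above points at.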


Very interesting study. This situation arises not only here but also
in finalizer thread shutdown. We have a test case that runs an
infinite loop in a finalizer (or waits on a lost socket) and requires
the system to shut down correctly by somehow detecting this situation
and not waiting for the (dead) finalizer to finish. At the same time,
we have a test case that makes the finalizers do lots of heavy-duty
work and requires the system to detect this situation and wait for
the finalizers to finish.

In GCv5, we solved the problem (or passed the tests anyway) by letting
the system do a timed wait on the finalizers. If at a timeout event we
detect that at least one finalizer has been executed, we loop back
and do the timed wait again, since this means the finalizers are
still making progress. If at a timeout event we find that the number
of finalizers is unchanged, we decide the finalizers are dead and go
on to exit.
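
The loop just described can be sketched roughly as follows. This is
an illustration in Python, not the GCv5 code: FinalizerState,
finished_one, wait_for_finalizers and run_finalizers are hypothetical
names, and the pending counter is an assumed way of observing
progress. The wait returns as soon as progress is seen instead of only
checking at the timeout, but the decision is the same: a full timeout
with no change means the finalizers are treated as dead.

```python
import threading
import time

class FinalizerState:
    """Hypothetical shared state: number of finalizers still to run."""
    def __init__(self, pending):
        self.pending = pending
        self.cond = threading.Condition()

    def finished_one(self):
        with self.cond:
            self.pending -= 1
            self.cond.notify_all()

def wait_for_finalizers(state, timeout_s):
    """Return True if all finalizers finished, False if a whole
    timeout interval passed without any of them completing."""
    with state.cond:
        while state.pending > 0:
            seen = state.pending
            progressed = state.cond.wait_for(
                lambda: state.pending != seen, timeout_s)
            if not progressed:
                return False  # no progress in one interval: give up
        return True

def run_finalizers(state, count, work_s):
    # Simulated finalizer thread doing `count` pieces of slow work.
    for _ in range(count):
        time.sleep(work_s)
        state.finished_one()

# Busy but healthy finalizers: each step is well inside the timeout.
busy = FinalizerState(pending=3)
threading.Thread(target=run_finalizers, args=(busy, 3, 0.02),
                 daemon=True).start()
busy_done = wait_for_finalizers(busy, timeout_s=0.5)

# A stuck finalizer: nothing ever completes, so the wait times out.
stuck = FinalizerState(pending=1)
stuck_done = wait_for_finalizers(stuck, timeout_s=0.05)
```

With this shape the timeout bounds the time between two completed
finalizers rather than the total shutdown time, which is why heavy but
progressing work is not cut off.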

The problem is that we don't know which timeout value is reasonable,
1ms or 1s. In this case, I personally think a bigger value makes more
sense. In our case the timed wait doesn't have to run until the
timeout; it can also be woken up by the finalizers once they have
finished, so a longer timeout value does not normally hurt
performance. I guess this is the same case for the thread joining
timed wait?
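
As a small illustration of why the larger value is cheap in the normal
case, here is a Python sketch (the names done, woke_early and elapsed
are hypothetical, and the timer merely stands in for a finalizer that
finishes quickly). The waiter asks for a generous 10 second timeout
but returns as soon as it is notified.

```python
import threading
import time

done = threading.Event()

# Stands in for the finalizer finishing shortly after shutdown starts.
threading.Timer(0.05, done.set).start()

start = time.monotonic()
woke_early = done.wait(timeout=10.0)  # generous upper bound
elapsed = time.monotonic() - start    # far below 10 s when notified
```

The full timeout is only paid when nothing ever signals, which is the
stuck case the timeout exists for.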

Thanks,
xiaofeng

I multiplied the join timeout interval by a factor of 100 and the
errors went away, with the cycle eater running in the background. I
don't think we want to make changes like this in the VM. This is not a
good way to tune wall clock times (some of which need to exist in the
implementation).

I also have some concern about how we are choosing to create these
test scenarios. Tests can simulate artificially severe stress
conditions and create failures that are time consuming to debug, but I
don't know how much extra information they give us. For example, we
already know that daemon thread shutdown is not perfect. If we choose
to create stresses, I think that it is better to use real applications
or well known workloads. In that case, failures would be more
meaningful and would give us some good guidance on tuning things.

On 6/6/07, Vladimir Ivanov <[EMAIL PROTECTED]> wrote:
> issue HARMONY-4080 was created to track it.
>
>  thanks, Vladimir
>
> On 5/18/07, Vladimir Ivanov <[EMAIL PROTECTED]> wrote:
> > The CC/CI report failures just now on linux x86_64 in default mode:
> > -----------------------------
> > Running test : thread.ThreadInterrupt
> > *** FAILED **** : thread.ThreadInterrupt (139 res code)
> > -----------------------------
> >
> >  thanks, Vladimir
> >
> >
> > On 5/18/07, Rana Dasgupta <[EMAIL PROTECTED]> wrote:
> > > OK, I will also try to change this test to make it more meaningful
> > > than it is now. We can then decide if we want to keep it or lose it?
> > >
> > > On 5/17/07, Pavel Rebriy <[EMAIL PROTECTED]> wrote:
> > > > Maybe it is better to modify the tests in the correct way?
> > > > The test gc.ThreadSuspension checks the suspension model during
> > > > garbage collection. It is a very useful test for the VM.
> > > > --
> > > > Best regards,
> > > > Pavel Rebriy
> > > >
> > >
> >
>



--
http://xiao-feng.blogspot.com
