Re: RFR: 8032901: WaitForMultipleObjects() return value not handled appropriately

Aleksej Efimov Wed, 04 Jun 2014 04:44:50 -0700

Thank you David.
Thank you Staffan.

On 06/04/2014 01:46 PM, David Holmes wrote:

Me too.
Thanks,
David

On 4/06/2014 5:23 PM, Staffan Larsen wrote:
Looks ok to me.

/Staffan
On 3 jun 2014, at 15:49, Aleksej Efimov <aleksej.efi...@oracle.com>wrote:
Staffan, David,
Returning back to our WaitForMultipleObjects()/WAIT_ABANDONEDdiscussions:Thank you for your comments and remarks. I can't disagree withmotivation that it's better to have a fatal error during theincorrect mutex handling then data corruption (the consequence ofthe previous fix). In case of such error it'll be much more easierto debug/catch it (but as Staffan said - we have tried to check allcall paths and don't think that we'll encounter such behavior).I have modified the discussed code according to your suggestions:http://cr.openjdk.java.net/~aefimov/8032901/9/webrev.01/To abort the process the 'exitTransportWithError' function wasutilized.Also I have tried to check that behavior isn't changed by running"svc" regression tests set. There was no related test failuresobserved.
Best Regards,
Aleksej

On 05/15/2014 01:11 PM, Staffan Larsen wrote:
On 15 maj 2014, at 03:48, David Holmes <david.hol...@oracle.com>wrote:
On 14/05/2014 11:18 PM, Aleksej Efimov wrote:
David, Vitaly,
I totally agree with Vitaly's explanation (Vitaly - thank you forthat)
and code in shmemBase.c (the usage of enterMutex() function for
sending/receiving bytes through shared memory connection)illustrates on
how the connection shutdown event is used as a "cancellation token".
Thanks for clarifying that.
So if we were to encounter an abandoned mutex the code wouldpresently have acquired the mutex but return an error, thuspreventing a subsequent release, and preventing other threads fromacquiring (but allowing current thread to recurisevely acquire. Sothis could both hang and cause data corruption.
The new code will still return an error but release the mutex. Sono more hangs (other than by conditions caused by data corruption)but more opportunity for data corruption.
Obviously it depends on exactly what happens in the criticalsections guarded by this mutex, but in general I don't agree withthis rationale for making the change:
204     /* If the mutex is abandoned we want to return an error
205      * and also release the mutex so that we don't cause
206      * other threads to be blocked. If a mutex was abandoned
207 * we are in scary state. Data may be corrupted orinconsistent208 * but it is still better to let other threads run (andpossibly209 * crash) than having them blocked (on the premise that acrash
210      * is always easier to debug than a hang).
Considering something has to have gone drastically wrong for themutex to become abandoned, I'm more inclined to consider this afatal error and abort.
But I'll let the serviceability folk chime in here.
I was involved in fixing this and writing the comment, so obviouslyI thought it a good solution :-)
I do agree that it would probably be a good idea to consider this afatal error and abort. At that point in the code we don’t have anyreally nice ways of doing that, though. We could just print anerror and call abort(). What we are doing now is returning an errorfrom sysIPMutexEnter() and delegating the error handling to thecaller. We have tried to check all call paths to verify that theydo “the right thing” in the face of an error. It is obviously hardto verify, but it looks like they all terminate the connection withsome kind of error message.
/Staffan
Thanks,
David
Thank you,
-Aleksej


On 05/14/2014 01:05 PM, David Holmes wrote:
On 14/05/2014 11:06 AM, Vitaly Davidovich wrote:
In windows, you acquire a mutex by waiting on it using one ofthe wait
functions, one of them employed in the code in question.  If
WaitForMultipleObjects succeeds and returns the index of themutex,
current thread has ownership now.
Yes I understand the basic mechanics :)
It's also common to use multi wait functions where the event is a
"cancelation token", e.g. manual reset event; this allowssomeone tocancel waiting on mutex acquisition and return from the waitfunction.Presumably that's the case here, but I'll let Aleksej confirm;just
wanted to throw this out there in the meantime :).
Ah I see - yes cancellable lock acquisition would make sense.

Thanks,
David
Sent from my phone

On May 13, 2014 6:46 PM, "David Holmes" <david.hol...@oracle.com
<mailto:david.hol...@oracle.com>> wrote:

    Hi Aleksej,

    Thanks for the doc references regarding abandonment.
Let me rephrase my question. What is this logic trying toachieve bywaiting on both a mutex and an event? Do we already own themutex
    when this function is called?

    David

    On 13/05/2014 11:19 PM, Aleksej Efimov wrote:

        David,
The Windows has a different terminology for mutexobjects (much
        differs
from the POSIX one). This one link gave me someunderstanding of
        it [1].
Here is the MSDN [1] description of what "abandonedmutex" is:" If a thread terminates without releasing itsownership of a
mutex
        object, the mutex object is considered to be abandoned. A
        waiting thread
can acquire ownership of an abandoned mutex object, butthe waitfunction will return*WAIT_ABANDONED*to indicate thatthe mutex
        object is
abandoned. An abandoned mutex object indicates that anerror hasoccurred and that any shared resource being protectedby the
mutex
        object is in an undefined state. If the thread proceeds as
        though the
mutex object had not been abandoned, it is no longerconsideredabandoned after the thread releases its ownership. Thisrestores
        normal
        behavior if a handle to the mutex object is subsequently
        specified in a
        wait function."
What does it mean to wait on mutex and ownership of themutex
        object:
        "Any thread with a handle to a mutex object can use one of
thewait
        functions
<http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms687069%28v=vs.85%29.aspx<http://msdn.microsoft.com/en-gb/library/windows/desktop/ms687069%28v=vs.85%29.aspx>>torequest ownership of the mutex object. If the mutexobject is
        owned by
another thread, the wait function blocks the requestingthread
        until the
owning thread releases the mutex object usingthe*ReleaseMutex*<http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms685066%28v=vs.85%29.aspx<http://msdn.microsoft.com/en-gb/library/windows/desktop/ms685066%28v=vs.85%29.aspx>>__function."
        How we can release mutex and wait on already owned mutex:
" After a thread obtains ownership of a mutex, it canspecify
        the same
        mutex in repeated calls to the wait-functions
<http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms687069%28v=vs.85%29.aspx<http://msdn.microsoft.com/en-gb/library/windows/desktop/ms687069%28v=vs.85%29.aspx>>__withoutblocking its execution. This prevents a thread fromdeadlocking
        itself
while waiting for a mutex that it already owns. Torelease its
        ownership
under such circumstances, the thread mustcall*ReleaseMutex*<http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms685066%28v=vs.85%29.aspx<http://msdn.microsoft.com/en-gb/library/windows/desktop/ms685066%28v=vs.85%29.aspx>>__oncefor each time that the mutex satisfied the conditionsof a wait
        function."

        [1]
http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms684266(v=vs.85).aspx<http://msdn.microsoft.com/en-gb/library/windows/desktop/ms684266(v=vs.85).aspx>
        -Aleksej

        On 05/13/2014 04:00 PM, David Holmes wrote:
I don't understand this one at all. What is an"abandoned
            mutex"? For
that matter why does the code wait on a mutex andan event?
            Do we
already own the mutex? If so what does it mean towait on
            it? If not
            then how can we release it?

            ???

            Thanks,
            David


            On 13/05/2014 8:57 PM, Alan Bateman wrote:
This is debugger's shared memory transport socc'ing
                serviceability-dev
                as that is there this code is maintained.
Is there a test case or any outline of theconditions
                that cause this? I
                think that would be useful to understand the issue
further.

                -Alan

                On 13/05/2014 11:46, Aleksej Efimov wrote:

                    Hi,
Can I have a review for 8032901 bug [1] fix[2].
                    There is a possible
case when 'WaitForMultipleObjects' functioncan
                    return the
                    WAIT_ABANDONED_0 [3] error value.
In such case it's better to release themutex and
                    return error value.
This will prevent other threads to beblocked on
                    abandoned mutex.

                    Thank you,
                    Aleksej

                    [1]
https://bugs.openjdk.java.net/__browse/JDK-8032901
<https://bugs.openjdk.java.net/browse/JDK-8032901>
                    [2]
http://cr.openjdk.java.net/~__aefimov/8032901/9/webrev.00/
<http://cr.openjdk.java.net/~aefimov/8032901/9/webrev.00/>
                    [3]
http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms687025(v=vs.85).aspx<http://msdn.microsoft.com/en-gb/library/windows/desktop/ms687025(v=vs.85).aspx>

Re: RFR: 8032901: WaitForMultipleObjects() return value not handled appropriately

Reply via email to