Just to clarify ...

Martin Buchholz said the following on 07/16/09 07:24:
In summary,
there are two different bugs at work here,
and neither of them is in LBD.
The hotspot team is working on the LBD deadlock.

Not the "hotspot team", just me :) - I am looking into this in my role as j.u.c evaluator. Given the difficulty in reproducing this issue in-house, progress is very slow. It will be a while before I can determine whether this is a bug in AQS code, or whether there is something bad happening with regard to memory ordering/visibility on some systems.

Cheers,
David Holmes

(As always) It would be good to have a good test case for
the dead socket problem.

Martin

On Wed, Jul 15, 2009 at 12:24, Ariel Weisberg <ar...@weisberg.ws> wrote:

    Hi,
I have found that there are two different failure modes without
    involving -XX:+UseMembar. There is the LBD deadlock and then there
    is the dead socket in between two nodes. Either failure can occur
    with the same code and settings. It appears that the dead socket
    problem is more common. The LBD failure is also not correlated with
    any specific LBD (originally saw it with only the LBD for an
    Initiator's mailbox).
With -XX:+UseMembar the system is noticeably more reliable and tends
    to run much longer without failing (although it can still fail
    immediately). When it does fail it has been due to a dead
    connection. I have not reproduced a deadlock on an LBD with
    -XX:+UseMembar.
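For reference, the nodes are launched with a command along these lines when
    the flag is enabled (the jar name and heap size here are placeholders for
    illustration, not our actual command line):

        java -server -ea -Xmx4g -XX:+UseMembar -jar node.jar

    My understanding is that -XX:+UseMembar makes the VM issue real memory
    barrier instructions on thread state transitions instead of using the
    default pseudo-membar (serialization page) mechanism, which is why it is
    relevant to a suspected memory visibility problem.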
I also found that the dead socket issue was reproduced twice on
    Dell PowerEdge 2970s (two-socket AMD). It takes an hour or so to
    reproduce it on the 2970s. I have not recreated
    the LBD issue on them although given how difficult the socket issue
    is to reproduce it may be that I have not run them long enough. On
    the AMD machines I did not use -XX:+UseMembar.
Ariel

    On Mon, 13 Jul 2009 18:59 -0400, "Ariel Weisberg" <ar...@weisberg.ws> wrote:
    Hi all.
Sorry Martin, I missed reading your last email. I am not confident
    that I will get a small reproducible test case in a reasonable
    time frame. Reproducing it with the application is easy and I will
    see what I can do about getting the source available.
One interesting thing I can tell you is that if I remove the
    LinkedBlockingDeque from the mailbox of the Initiator the system
    still deadlocks. The cluster has a TCP mesh topology so any node
    can deliver messages to any other node. One of the connections
    goes dead and neither side detects that there is a problem. I added
    some assertions to the network selection thread to check that all
    the connections in the cluster are still healthy and that they have
    the correct interest ops set.
Here are the things it checks for to make sure each connection is
    working:
    > // Per-connection health checks run in the network selection thread:
    > for (ForeignHost.Port port : foreignHostPorts) {
    >     assert(port.m_selectionKey.isValid());
    >     assert(port.m_selectionKey.selector() == m_selector);
    >     assert(port.m_channel.isOpen());
    >     assert(((SocketChannel)port.m_channel).isConnected());
    >     assert(((SocketChannel)port.m_channel).socket().isInputShutdown() == false);
    >     assert(((SocketChannel)port.m_channel).socket().isOutputShutdown() == false);
    >     assert(((SocketChannel)port.m_channel).isOpen());
    >     assert(((SocketChannel)port.m_channel).isRegistered());
    >     assert(((SocketChannel)port.m_channel).keyFor(m_selector) != null);
    >     assert(((SocketChannel)port.m_channel).keyFor(m_selector) == port.m_selectionKey);
    >     if (m_selector.selectedKeys().contains(port.m_selectionKey)) {
    >         // Selected this pass: both interest ops should still be armed.
    >         assert((port.m_selectionKey.interestOps() & SelectionKey.OP_READ) != 0);
    >         assert((port.m_selectionKey.interestOps() & SelectionKey.OP_WRITE) != 0);
    >     } else {
    >         if (port.isRunning()) {
    >             assert(port.m_selectionKey.interestOps() == 0);
    >         } else {
    >             // Not selected and not running: re-arm and verify both ops.
    >             port.m_selectionKey.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE);
    >             assert((port.interestOps() & SelectionKey.OP_READ) != 0);
    >             assert((port.interestOps() & SelectionKey.OP_WRITE) != 0);
    >         }
    >     }
    >     assert(m_selector.isOpen());
    >     assert(m_selector.keys().contains(port.m_selectionKey));
    > }
    OP_READ | OP_WRITE is set as the interest ops every time through,
    and there is no other code that changes the interest ops during
    execution. The application will run for a while and then one of
    the connections will stop being selected on both sides. If I step
    in with the debugger on either side everything looks correct. The
    keys have the correct interest ops and the selectors have the keys
    in their key set.
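To illustrate the pattern, the selection loop is shaped roughly like this (a
    simplified sketch, not the actual code; handleRead/handleWrite stand in for
    the real handlers, and the classes are from java.nio.channels):

        // Simplified network selection loop: every pass re-arms each key with
        // OP_READ | OP_WRITE, so a healthy connection should always become
        // selectable again once data or socket buffer space is available.
        void runSelectionLoop(Selector selector) throws IOException {
            while (true) {
                selector.select();
                for (SelectionKey key : selector.selectedKeys()) {
                    SocketChannel channel = (SocketChannel) key.channel();
                    if (key.isReadable()) handleRead(channel);
                    if (key.isWritable()) handleWrite(channel);
                }
                selector.selectedKeys().clear();
                // No other code in the application changes interest ops.
                for (SelectionKey key : selector.keys()) {
                    key.interestOps(SelectionKey.OP_READ | SelectionKey.OP_WRITE);
                }
            }
        }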
What I suspect is happening is that a bug on one node stops the
    socket from being selected (for both read and write), and
    eventually the socket's buffers fill up so the other side can no
    longer write to it.
If I can get my VPN access together tomorrow I will run with
    -XX:+UseMembar and also try running on some 8-core AMD machines.
    Otherwise I will have to get to it Wednesday.
Thanks,
    Ariel Weisberg

    On Tue, 14 Jul 2009 05:00 +1000, "David Holmes" <davidchol...@aapt.net.au> wrote:
    Martin,
I don't think this is due to LBQ/D. This is looking similar to a
    couple of other ReentrantLock/AQS "lost wakeup" hangs that I've
    got on the radar. We have a reproducible test case for one issue
    but it only fails on one kind of system - x4450. I'm on vacation
    most of this week but will try and get back to this next week.
Ariel: one thing to try is to see if -XX:+UseMembar fixes the
    problem.
Thanks,
    David Holmes

        -----Original Message-----
        *From:* Martin Buchholz [mailto:marti...@google.com]
        *Sent:* Tuesday, 14 July 2009 8:38 AM
        *To:* Ariel Weisberg
        *Cc:* davidchol...@aapt.net.au; core-libs-dev;
        concurrency-inter...@cs.oswego.edu
        *Subject:* Re: [concurrency-interest] LinkedBlockingDeque
        deadlock?

        I did some stack trace eyeballing and did a mini-audit of the
        LinkedBlockingDeque code, with a view to finding possible bugs,
        and came up empty.  Maybe it's a deep bug in hotspot?

        Ariel, it would be good if you could get a reproducible test
        case soonish,
        while someone on the planet has the motivation and
        familiarity to fix it.
        In another month I may disavow all knowledge of j.u.c.*Blocking*

        Martin


        On Wed, Jul 8, 2009 at 15:57, Ariel Weisberg <ar...@weisberg.ws> wrote:

            Hi,

            > The poll()ing thread is blocked waiting for the
            internal lock, but
            > there's
            > no indication of any thread owning that lock. You're
            using an OpenJDK 6
            > build ... can you try JDK7 ?
I got a chance to do that today. I downloaded JDK 7 from
            http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b63-linux-x64-02_jul_2009.bin
            and was able to reproduce the problem. I have attached
            the stack trace
            from running the 1.7 version. It is the same situation as
            before except
            there are 9 execution sites running on each host. There
            are no threads
            that are missing or that have been restarted. Foo Network
            thread
            (selector thread) and Network Thread - 0 are waiting on
            0x00002aaab43d3b28. I also ran with JDK 7 and 6 and
            LinkedBlockingQueue
            and was not able to recreate the problem using that
            structure.

            > I don't recall anything similar to this, but I don't
            know what version
            > that
            > OpenJDK6 build relates to.
The cluster is running on CentOS 5.3.
            >[aweisb...@3f ~]$ rpm -qi java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5
            >Name        : java-1.6.0-openjdk            Relocations: (not relocatable)
            >Version     : 1.6.0.0                       Vendor: CentOS
            >Release     : 0.30.b09.el5                  Build Date: Tue 07 Apr 2009 07:24:52 PM EDT
            >Install Date: Thu 11 Jun 2009 03:27:46 PM EDT      Build Host: builder10.centos.org
            >Group       : Development/Languages         Source RPM: java-1.6.0-openjdk-1.6.0.0-0.30.b09.el5.src.rpm
            >Size        : 76336266                      License: GPLv2 with exceptions
            >Signature   : DSA/SHA1, Wed 08 Apr 2009 07:55:13 AM EDT, Key ID a8a447dce8562897
            >URL         : http://icedtea.classpath.org/
            >Summary     : OpenJDK Runtime Environment
            >Description :
            >The OpenJDK runtime environment.

            > Make sure you haven't missed any exceptions occurring
            in other threads.
            There are no threads missing in the application
            (terminated threads are
            not replaced) and there is a try catch pair (prints error
            and rethrows)
            around the run loop of each thread. It is possible that
            an exception may
            have been swallowed up somewhere.
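            The wrapper around each thread's run loop is shaped roughly like this
            (a simplified sketch, not the actual code):

                // Anything that escapes a worker thread's run loop is printed and
                // rethrown, so a silently dying thread should at least leave a trace.
                public void run() {
                    try {
                        runLoop();
                    } catch (Throwable t) {
                        t.printStackTrace();
                        throw new RuntimeException(t);
                    }
                }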

            >A small reproducible test case from you would be useful.
            I am working on that. I wrote a test case that mimics the
            application's
            use of the LBD, but I have not succeeded in reproducing
            the problem in
            the test case. The app has a single thread (network
            selector) that polls
            the LBD and several threads (ExecutionSites, and network
            threads that
            return results from remote ExecutionSites) that offer
            results into the
            queue. About 120k items will go into/out of the deque
            each second. In
            the actual app the problem is reproducible but
            inconsistent. If I run on
            my dual core laptop I can't reproduce it, and it is less
            likely to occur
            with a small cluster, but with 6 nodes (~560k
            transactions/sec) the
            problem will usually appear. Sometimes the cluster will
            run for several
            minutes without issue and other times it will deadlock
            immediately.
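            The skeleton of the test case is roughly the following (a simplified
            sketch of what I described above; the thread count, capacity, and
            names are illustrative, not the actual test):

                import java.util.concurrent.LinkedBlockingDeque;

                // Several offering threads and a single thread doing non-blocking
                // poll()s, mimicking the ExecutionSite/network-selector traffic.
                public class LBDStressTest {
                    public static void main(String[] args) {
                        // Bounded so offers that outrun the poller just fail
                        // instead of exhausting the heap.
                        final LinkedBlockingDeque<Long> deque =
                                new LinkedBlockingDeque<Long>(100000);
                        for (int i = 0; i < 8; i++) {
                            new Thread(new Runnable() {
                                public void run() {
                                    long n = 0;
                                    while (true) {
                                        deque.offer(Long.valueOf(n++));
                                    }
                                }
                            }, "Offerer-" + i).start();
                        }
                        // Single consumer using the non-blocking poll(), like the
                        // network selector thread in the application.
                        while (true) {
                            deque.poll();
                        }
                    }
                }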

            Thanks,

            Ariel

            On Wed, 08 Jul 2009 05:14 +1000, "Martin Buchholz" <marti...@google.com> wrote:
            >[+core-libs-dev]
            >
            >Doug Lea and I are (slowly) working on a new version of
            LinkedBlockingDeque.
            >I was not aware of a deadlock but can vaguely imagine
            how it might happen.
            >A small reproducible test case from you would be useful.
            >
            >Unfinished work in progress can be found here:
            >http://cr.openjdk.java.net/~martin/webrevs/openjdk7/BlockingQueue/
            >
            >Martin
On Wed, 08 Jul 2009 05:14 +1000, "David Holmes" <davidchol...@aapt.net.au> wrote:
            >
> Ariel,
            >
            > The poll()ing thread is blocked waiting for the
            internal lock, but
            > there's
            > no indication of any thread owning that lock. You're
            using an OpenJDK 6
            > build ... can you try JDK7 ?
            >
            > I don't recall anything similar to this, but I don't
            know what version
            > that
            > OpenJDK6 build relates to.
            >
            > Make sure you haven't missed any exceptions occurring
            in other threads.
            >
            > David Holmes
            >
            > > -----Original Message-----
            > > From: concurrency-interest-boun...@cs.oswego.edu
            > > [mailto:concurrency-interest-boun...@cs.oswego.edu] On Behalf Of
            > > Ariel Weisberg
            > > Sent: Wednesday, 8 July 2009 8:31 AM
            > > To: concurrency-inter...@cs.oswego.edu
            > > Subject: [concurrency-interest] LinkedBlockingDeque
            deadlock?
            > >
            > >
            > > Hi all,
            > >
            > > I did a search on LinkedBlockingDeque and didn't find
            anything similar
            > > to what I am seeing. Attached is the stack trace from
            an application
            > > that is deadlocked with three threads waiting for
            0x00002aaab3e91080
            > > (threads "ExecutionSite: 26", "ExecutionSite:27", and
            "Network
            > > Selector"). The execution sites are attempting to
            offer results to the
            > > deque and the network thread is trying to poll for
            them using the
            > > non-blocking version of poll. I am seeing the network
            thread never
            > > return from poll (straight poll()). Do my eyes
            deceive me?
            > >
            > > Thanks,
            > >
            > > Ariel Weisberg
            > >
            >



    _______________________________________________
    Concurrency-interest mailing list
    concurrency-inter...@cs.oswego.edu
    http://cs.oswego.edu/mailman/listinfo/concurrency-interest

