[ https://issues.apache.org/jira/browse/HAWQ-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201124#comment-15201124 ]

Lili Ma edited comment on HAWQ-559 at 3/18/16 7:57 AM:
-------------------------------------------------------

Root Cause: 
After the QD establishes all the connections to the QEs, it starts a new thread 
group to dispatch tasks to the QEs. Before dispatching, each thread checks 
whether its connection to the QE is still alive; in this case, the connection 
is no longer alive. 
In the interconnect code, the main thread periodically checks in 
receiveChunksUDP whether the QD has set a cancel. More precisely, it checks 
the errcode of the dispatch data:
{code}
                /* check to see if the dispatcher should cancel */
                if (Gp_role == GP_ROLE_DISPATCH)
                {
                        checkForCancelFromQD(pTransportStates);
                }
{code}
{code}
static void
checkForCancelFromQD(ChunkTransportState *pTransportStates)
{
        Assert(Gp_role == GP_ROLE_DISPATCH);
        Assert(pTransportStates);
        Assert(pTransportStates->estate);

        if (dispatcher_has_error(pTransportStates->estate->dispatch_data))
        {
                ereport(ERROR, (errcode(ERRCODE_GP_INTERCONNECTION_ERROR),
                                errmsg(CDB_MOTION_LOST_CONTACT_STRING)));
                /* not reached */
        }
}
{code}
{code}
bool
dispatcher_has_error(DispatchData *data)
{
        return data->results && data->results->errcode;
}
{code}
However, when the QD finds that a connection is no longer available, it does 
not set this errcode, so dispatcher_has_error() keeps returning false, 
checkForCancelFromQD() never raises the error, and receiveChunksUDP() polls 
forever. The fix is to set the errcode at that point to inform the 
interconnect.
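
A minimal sketch of that fix. Assumptions are flagged: dispatch_task_to_QE(), 
connection_is_alive(), send_task() and SegmentConnection are hypothetical 
stand-ins for the actual dispatcher internals; only data->results->errcode and 
ERRCODE_GP_INTERCONNECTION_ERROR come from the code quoted above.
{code}
/*
 * Sketch only -- connection_is_alive(), send_task() and SegmentConnection
 * are hypothetical names, not the real HAWQ dispatcher API.
 */
static void
dispatch_task_to_QE(DispatchData *data, SegmentConnection *conn)
{
        /* The liveness check each dispatcher thread performs before sending. */
        if (!connection_is_alive(conn))
        {
                /*
                 * Proposed fix: record the failure in the shared dispatch
                 * results so that dispatcher_has_error() returns true and
                 * checkForCancelFromQD() raises the interconnect error,
                 * instead of receiveChunksUDP() polling forever.
                 */
                data->results->errcode = ERRCODE_GP_INTERCONNECTION_ERROR;
                return;
        }

        send_task(data, conn);
}
{code}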



> QD hangs when QE is killed after connected to QD
> ------------------------------------------------
>
>                 Key: HAWQ-559
>                 URL: https://issues.apache.org/jira/browse/HAWQ-559
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Dispatcher
>    Affects Versions: 2.0.0
>         Environment: mac os X 10.10
>            Reporter: Chunling Wang
>            Assignee: Lili Ma
>
> When the first query finishes, the QE is still alive. Then we run the second 
> query. After the QD thread is created and bound to the QE, but before any 
> data is sent to the QE, we kill this QE and find that the QD hangs.
> Here is the backtrace when QD hangs:
> {code}
> * thread #1: tid = 0x1c4afd, 0x00007fff890355be libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #1: 0x000000010745692c postgres`receiveChunksUDP [inlined] udpSignalPoll + 42 at ic_udp.c:2882
>     frame #2: 0x0000000107456902 postgres`receiveChunksUDP + 26 at ic_udp.c:2715
>     frame #3: 0x00000001074568e8 postgres`receiveChunksUDP [inlined] waitOnCondition(timeout_us=250000) + 82 at ic_udp.c:1599
>     frame #4: 0x0000000107456896 postgres`receiveChunksUDP(pTransportStates=0x00007ff2a381ae48, pEntry=0x00007ff2a18f2230, motNodeID=<unavailable>, srcRoute=0x00007fff58c0ce96, conn=<unavailable>, inTeardown='\0') + 726 at ic_udp.c:4039
>     frame #5: 0x0000000107452a86 postgres`RecvTupleChunkFromAnyUDP [inlined] RecvTupleChunkFromAnyUDP_Internal + 498 at ic_udp.c:4146
>     frame #6: 0x0000000107452894 postgres`RecvTupleChunkFromAnyUDP(mlStates=<unavailable>, transportStates=<unavailable>, motNodeID=1, srcRoute=0x00007fff58c0ce96) + 100 at ic_udp.c:4167
>     frame #7: 0x0000000107442254 postgres`RecvTupleFrom [inlined] processIncomingChunks(mlStates=0x00007ff2a3812a30, transportStates=0x00007ff2a381ae48, motNodeID=1, srcRoute=<unavailable>) + 34 at cdbmotion.c:684
>     frame #8: 0x0000000107442232 postgres`RecvTupleFrom(mlStates=0x00007ff2a3812a30, transportStates=<unavailable>, motNodeID=1, tup_i=0x00007fff58c0cf00, srcRoute=-100) + 370 at cdbmotion.c:610
>     frame #9: 0x00000001071c8778 postgres`ExecMotion [inlined] execMotionUnsortedReceiver(node=<unavailable>) + 57 at nodeMotion.c:466
>     frame #10: 0x00000001071c873f postgres`ExecMotion(node=<unavailable>) + 1071 at nodeMotion.c:298
>     frame #11: 0x00000001071a4835 postgres`ExecProcNode(node=0x00007ff2a38164b8) + 613 at execProcnode.c:999
>     frame #12: 0x00000001071b9f82 postgres`ExecAgg + 104 at nodeAgg.c:1163
>     frame #13: 0x00000001071b9f1a postgres`ExecAgg + 316 at nodeAgg.c:1693
>     frame #14: 0x00000001071b9dde postgres`ExecAgg(node=0x00007ff2a3815348) + 126 at nodeAgg.c:1138
>     frame #15: 0x00000001071a4803 postgres`ExecProcNode(node=0x00007ff2a3815348) + 563 at execProcnode.c:979
>     frame #16: 0x000000010719ecfd postgres`ExecutePlan(estate=0x00007ff2a3814e30, planstate=0x00007ff2a3815348, operation=CMD_SELECT, numberTuples=0, direction=<unavailable>, dest=0x00007ff2a28db178) + 1181 at execMain.c:3218
>     frame #17: 0x000000010719e619 postgres`ExecutorRun(queryDesc=0x00007ff2a3811f00, direction=ForwardScanDirection, count=0) + 569 at execMain.c:1213
>     frame #18: 0x00000001072e7fc2 postgres`PortalRun + 14 at pquery.c:1649
>     frame #19: 0x00000001072e7fb4 postgres`PortalRun(portal=0x00007ff2a1893e30, count=<unavailable>, isTopLevel='\x01', dest=<unavailable>, altdest=0x00007ff2a28db178, completionTag=0x00007fff58c0d530) + 1124 at pquery.c:1471
>     frame #20: 0x00000001072e4a8e postgres`exec_simple_query(query_string=0x00007ff2a380fe30, seqServerHost=0x0000000000000000, seqServerPort=-1) + 2078 at postgres.c:1745
>     frame #21: 0x00000001072e0c4c postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, username=0x00007ff2a201bcf0) + 9404 at postgres.c:4754
>     frame #22: 0x000000010729a002 postgres`ServerLoop [inlined] BackendRun + 105 at postmaster.c:5889
>     frame #23: 0x0000000107299f99 postgres`ServerLoop at postmaster.c:5484
>     frame #24: 0x0000000107299f99 postgres`ServerLoop + 9593 at postmaster.c:2163
>     frame #25: 0x0000000107296f3b postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) + 5019 at postmaster.c:1454
>     frame #26: 0x0000000107200ca9 postgres`main(argc=9, argv=0x00007ff2a141eef0) + 1433 at main.c:209
>     frame #27: 0x00007fff95e8c5c9 libdyld.dylib`start + 1
>   thread #2: tid = 0x1c4afe, 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #1: 0x000000010744d8e3 postgres`rxThreadFunc(arg=<unavailable>) + 2163 at ic_udp.c:6251
>     frame #2: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #3: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
>     frame #4: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
>   thread #3: tid = 0x1c4b02, 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
>     frame #0: 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
>     frame #1: 0x00000001074ec47e postgres`pg_usleep(microsec=<unavailable>) + 78 at pgsleep.c:43
>     frame #2: 0x0000000107400c26 postgres`generateResourceRefreshHeartBeat(arg=0x00007ff2a141ce90) + 166 at rmcomm_QD2RM.c:1519
>     frame #3: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #4: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
>     frame #5: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
> {code}


