Chunling Wang created HAWQ-564:
----------------------------------

             Summary: QD hangs when connecting to resource manager
                 Key: HAWQ-564
                 URL: https://issues.apache.org/jira/browse/HAWQ-564
             Project: Apache HAWQ
          Issue Type: Bug
          Components: Resource Manager
            Reporter: Chunling Wang
            Assignee: Lei Chang


We first inject a panic into a QE process and run a query, which brings the 
segment down. After the segment comes back up, we run another query and get the 
correct answer. We then inject the same panic a second time. After the segment 
goes down and comes back up again, we run a query and find that the QD process 
hangs when connecting to the resource manager. 
Here is the backtrace when the QD hangs:
{code}
* thread #1: tid = 0x21d8be, 0x00007fff890355be libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
    frame #1: 0x0000000101daeafe postgres`processAllCommFileDescs + 158 at rmcomm_AsyncComm.c:156
    frame #2: 0x0000000101db85f5 postgres`callSyncRPCRemote(hostname=0x00007f9c19e00cd0, port=5437, sendbuff=0x00007f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, recvsmb=<unavailable>, errorbuf=0x000000010230c1a0, errorbufsize=<unavailable>) + 645 at rmcomm_SyncComm.c:122
    frame #3: 0x0000000101db2d85 postgres`acquireResourceFromRM [inlined] callSyncRPCToRM(sendbuff=0x00007f9c1b918f50, sendbuffsize=<unavailable>, sendmsgid=259, exprecvmsgid=2307, recvsmb=0x00007f9c1b918e70, errorbuf=<unavailable>, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
    frame #4: 0x0000000101db2d3c postgres`acquireResourceFromRM(index=<unavailable>, sessionid=12, slice_size=462524016, iobytes=134217728, preferred_nodes=0x00007f9c1a02d398, preferred_nodes_size=<unavailable>, max_seg_count_fix=<unavailable>, min_seg_count_fix=<unavailable>, errorbuf=<unavailable>, errorbufsize=<unavailable>) + 572 at rmcomm_QD2RM.c:742
    frame #5: 0x0000000101c979e7 postgres`AllocateResource(life=QRL_ONCE, slice_size=5, iobytes=134217728, max_target_segment_num=1, min_target_segment_num=1, vol_info=0x00007f9c1a02d398, vol_info_size=1) + 631 at pquery.c:796
    frame #6: 0x0000000101e8c60f postgres`calculate_planner_segment_num(query=<unavailable>, resourceLife=QRL_ONCE, fullRangeTable=<unavailable>, intoPolicy=<unavailable>, sliceNum=5) + 14287 at cdbdatalocality.c:4207
    frame #7: 0x0000000101c0f671 postgres`planner + 106 at planner.c:496
    frame #8: 0x0000000101c0f607 postgres`planner(parse=0x00007f9c1a02a140, cursorOptions=<unavailable>, boundParams=0x0000000000000000, resourceLife=QRL_ONCE) + 311 at planner.c:310
    frame #9: 0x0000000101c8eb33 postgres`pg_plan_query(querytree=0x00007f9c1a02a140, boundParams=0x0000000000000000, resource_life=QRL_ONCE) + 99 at postgres.c:837
    frame #10: 0x0000000101c956ae postgres`exec_simple_query + 21 at postgres.c:911
    frame #11: 0x0000000101c95699 postgres`exec_simple_query(query_string=0x00007f9c1a028a30, seqServerHost=0x0000000000000000, seqServerPort=-1) + 1577 at postgres.c:1671
    frame #12: 0x0000000101c91a4c postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, username=0x00007f9c1b808cf0) + 9404 at postgres.c:4754
    frame #13: 0x0000000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 105 at postmaster.c:5889
    frame #14: 0x0000000101c4ad99 postgres`ServerLoop at postmaster.c:5484
    frame #15: 0x0000000101c4ad99 postgres`ServerLoop + 9593 at postmaster.c:2163
    frame #16: 0x0000000101c47d3b postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) + 5019 at postmaster.c:1454
    frame #17: 0x0000000101bb1aa9 postgres`main(argc=9, argv=0x00007f9c19c1eef0) + 1433 at main.c:209
    frame #18: 0x00007fff95e8c5c9 libdyld.dylib`start + 1

  thread #2: tid = 0x21d8bf, 0x00007fff890355be libsystem_kernel.dylib`poll + 10
    frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
    frame #1: 0x0000000101dfe723 postgres`rxThreadFunc(arg=<unavailable>) + 2163 at ic_udp.c:6251
    frame #2: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
    frame #3: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
    frame #4: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0x21d9c2, 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
    frame #0: 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
    frame #1: 0x0000000101e9d42e postgres`pg_usleep(microsec=<unavailable>) + 78 at pgsleep.c:43
    frame #2: 0x0000000101db1a66 postgres`generateResourceRefreshHeartBeat(arg=0x00007f9c19f02480) + 166 at rmcomm_QD2RM.c:1519
    frame #3: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
    frame #4: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
    frame #5: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
{code}
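
In the backtrace, thread #1 is blocked in poll, reached from 
callSyncRPCRemote through processAllCommFileDescs. The trace does not show 
whether that poll carries a timeout or is retried in a loop. For illustration 
only, here is a minimal sketch of how a synchronous read can be bounded by an 
overall deadline so that a caller cannot block indefinitely. It is not HAWQ 
code: read_with_deadline, monotonic_ms and READ_TIMED_OUT are hypothetical 
names, and the sketch assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

#define READ_TIMED_OUT (-2)  /* hypothetical sentinel, not a HAWQ constant */

/* Current monotonic time in milliseconds. */
static long long monotonic_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper: read up to len bytes from fd, waiting at most
 * timeout_ms in total. Returns the result of read() (bytes read, 0 on
 * EOF, -1 on error), or READ_TIMED_OUT if the deadline passes first.
 */
ssize_t read_with_deadline(int fd, void *buf, size_t len, int timeout_ms)
{
    long long deadline = monotonic_ms() + timeout_ms;

    for (;;)
    {
        long long remaining = deadline - monotonic_ms();
        if (remaining <= 0)
            return READ_TIMED_OUT;

        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        int rc = poll(&pfd, 1, (int) remaining);

        if (rc > 0)
            return read(fd, buf, len);  /* data, EOF, or error */
        if (rc == 0)
            return READ_TIMED_OUT;
        if (errno != EINTR)
            return -1;
        /* Interrupted by a signal: loop and recompute the time left. */
    }
}
```

With a helper of this shape, a caller that gets READ_TIMED_OUT can report an 
error to the client or reconnect instead of waiting forever.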



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
