Chunling Wang created HAWQ-564:
----------------------------------

             Summary: QD hangs when connecting to resource manager
                 Key: HAWQ-564
                 URL: https://issues.apache.org/jira/browse/HAWQ-564
             Project: Apache HAWQ
          Issue Type: Bug
          Components: Resource Manager
            Reporter: Chunling Wang
            Assignee: Lei Chang

We first inject a panic into a QE process and run a query, and the segment goes down. After the segment comes back up, we run another query and get the correct answer. We then inject the same panic a second time. After the segment goes down and comes back up again, we run a query and find that the QD process hangs while connecting to the resource manager. Here is the backtrace when the QD hangs:

{code}
* thread #1: tid = 0x21d8be, 0x00007fff890355be libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
    frame #1: 0x0000000101daeafe postgres`processAllCommFileDescs + 158 at rmcomm_AsyncComm.c:156
    frame #2: 0x0000000101db85f5 postgres`callSyncRPCRemote(hostname=0x00007f9c19e00cd0, port=5437, sendbuff=0x00007f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, recvsmb=<unavailable>, errorbuf=0x000000010230c1a0, errorbufsize=<unavailable>) + 645 at rmcomm_SyncComm.c:122
    frame #3: 0x0000000101db2d85 postgres`acquireResourceFromRM [inlined] callSyncRPCToRM(sendbuff=0x00007f9c1b918f50, sendbuffsize=<unavailable>, sendmsgid=259, exprecvmsgid=2307, recvsmb=0x00007f9c1b918e70, errorbuf=<unavailable>, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
    frame #4: 0x0000000101db2d3c postgres`acquireResourceFromRM(index=<unavailable>, sessionid=12, slice_size=462524016, iobytes=134217728, preferred_nodes=0x00007f9c1a02d398, preferred_nodes_size=<unavailable>, max_seg_count_fix=<unavailable>, min_seg_count_fix=<unavailable>, errorbuf=<unavailable>, errorbufsize=<unavailable>) + 572 at rmcomm_QD2RM.c:742
    frame #5: 0x0000000101c979e7 postgres`AllocateResource(life=QRL_ONCE, slice_size=5, iobytes=134217728, max_target_segment_num=1, min_target_segment_num=1, vol_info=0x00007f9c1a02d398, vol_info_size=1) + 631 at pquery.c:796
    frame #6: 0x0000000101e8c60f postgres`calculate_planner_segment_num(query=<unavailable>, resourceLife=QRL_ONCE, fullRangeTable=<unavailable>, intoPolicy=<unavailable>, sliceNum=5) + 14287 at cdbdatalocality.c:4207
    frame #7: 0x0000000101c0f671 postgres`planner + 106 at planner.c:496
    frame #8: 0x0000000101c0f607 postgres`planner(parse=0x00007f9c1a02a140, cursorOptions=<unavailable>, boundParams=0x0000000000000000, resourceLife=QRL_ONCE) + 311 at planner.c:310
    frame #9: 0x0000000101c8eb33 postgres`pg_plan_query(querytree=0x00007f9c1a02a140, boundParams=0x0000000000000000, resource_life=QRL_ONCE) + 99 at postgres.c:837
    frame #10: 0x0000000101c956ae postgres`exec_simple_query + 21 at postgres.c:911
    frame #11: 0x0000000101c95699 postgres`exec_simple_query(query_string=0x00007f9c1a028a30, seqServerHost=0x0000000000000000, seqServerPort=-1) + 1577 at postgres.c:1671
    frame #12: 0x0000000101c91a4c postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, username=0x00007f9c1b808cf0) + 9404 at postgres.c:4754
    frame #13: 0x0000000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 105 at postmaster.c:5889
    frame #14: 0x0000000101c4ad99 postgres`ServerLoop at postmaster.c:5484
    frame #15: 0x0000000101c4ad99 postgres`ServerLoop + 9593 at postmaster.c:2163
    frame #16: 0x0000000101c47d3b postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) + 5019 at postmaster.c:1454
    frame #17: 0x0000000101bb1aa9 postgres`main(argc=9, argv=0x00007f9c19c1eef0) + 1433 at main.c:209
    frame #18: 0x00007fff95e8c5c9 libdyld.dylib`start + 1

  thread #2: tid = 0x21d8bf, 0x00007fff890355be libsystem_kernel.dylib`poll + 10
    frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
    frame #1: 0x0000000101dfe723 postgres`rxThreadFunc(arg=<unavailable>) + 2163 at ic_udp.c:6251
    frame #2: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
    frame #3: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
    frame #4: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0x21d9c2, 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
    frame #0: 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
    frame #1: 0x0000000101e9d42e postgres`pg_usleep(microsec=<unavailable>) + 78 at pgsleep.c:43
    frame #2: 0x0000000101db1a66 postgres`generateResourceRefreshHeartBeat(arg=0x00007f9c19f02480) + 166 at rmcomm_QD2RM.c:1519
    frame #3: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
    frame #4: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
    frame #5: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)