[ 
https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217323#comment-15217323
 ] 

Chunling Wang commented on HAWQ-564:
------------------------------------

We find this bug again in dispather tests. We first run a 'COPY' and suspend QD 
after QD connecting to QE. Then kill -9 QE and resume. The 'COPY' fails. And 
run any query immediately, QD hangs again. If run query after a while but not 
immediately, we can get the correct answer. Here is the backtrace when QD hangs 
which is similar to above.

{code}
* thread #1: tid = 0xa9b7b, 0x00007fff8748a5be libsystem_kernel.dylib`poll + 
10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff8748a5be libsystem_kernel.dylib`poll + 10
    frame #1: 0x000000010ccd83bc postgres`processAllCommFileDescs + 316 at 
rmcomm_AsyncComm.c:156
    frame #2: 0x000000010cce5321 
postgres`callSyncRPCRemote(hostname=0x00007f84e8c1f490, port=5437, 
sendbuff=0x00007f84e9141a00, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, 
recvsmb=0x00007f84e91415d8, errorbuf=0x000000010d292e20, errorbufsize=1024) + 
529 at rmcomm_SyncComm.c:122
    frame #3: 0x000000010ccdd374 
postgres`callSyncRPCToRM(sendbuff=0x00007f84e9141a00, sendbuffsize=80, 
sendmsgid=259, exprecvmsgid=2307, recvsmb=0x00007f84e91415d8, 
errorbuf=0x000000010d292e20, errorbufsize=1024) + 164 at rmcomm_QD2RM.c:2823
    frame #4: 0x000000010ccdda7f postgres`acquireResourceFromRM(index=0, 
sessionid=33, slice_size=2, iobytes=134217728, 
preferred_nodes=0x00007f84ea020048, preferred_nodes_size=1, 
max_seg_count_fix=1, min_seg_count_fix=1, errorbuf=0x00007fff53557330, 
errorbufsize=1024) + 1023 at rmcomm_QD2RM.c:757
    frame #5: 0x000000010cb4362a postgres`AllocateResource(life=QRL_ONCE, 
slice_size=2, iobytes=134217728, max_target_segment_num=1, 
min_target_segment_num=1, vol_info=0x00007f84ea020048, vol_info_size=1) + 778 
at pquery.c:797
    frame #6: 0x000000010ce0b033 
postgres`calculate_planner_segment_num(query=0x00007f84e90e5880, 
resourceLife=QRL_ONCE, fullRangeTable=0x00007f84ea014ba0, 
intoPolicy=0x00007f84ea015458, sliceNum=2, fixedVsegNum=-1) + 3731 at 
cdbdatalocality.c:4326
    frame #7: 0x000000010ca829e8 
postgres`resource_negotiator(parse=0x00007f84e90e5990, cursorOptions=0, 
boundParams=0x0000000000000000, resourceLife=QRL_ONCE, 
result=0x00007fff53557c40) + 200 at planner.c:534
    frame #8: 0x000000010ca822e0 postgres`planner(parse=0x00007f84e90e5990, 
cursorOptions=0, boundParams=0x0000000000000000, resourceLife=QRL_ONCE) + 464 
at planner.c:340
    frame #9: 0x000000010cb36ff8 
postgres`pg_plan_query(querytree=0x00007f84e90e5990, 
boundParams=0x0000000000000000, resource_life=QRL_ONCE) + 88 at postgres.c:837
    frame #10: 0x000000010cb3713d 
postgres`pg_plan_queries(querytrees=0x00007f84e90e66d0, 
boundParams=0x0000000000000000, needSnapshot='\0', resource_life=QRL_ONCE) + 
157 at postgres.c:911
    frame #11: 0x000000010cb3b5f6 
postgres`exec_simple_query(query_string=0x00007f84e90e4c30, 
seqServerHost=0x0000000000000000, seqServerPort=-1) + 1478 at postgres.c:1671
    frame #12: 0x000000010cb39d00 postgres`PostgresMain(argc=4, 
argv=0x00007f84ea803fe8, username=0x00007f84ea803ef0) + 7472 at postgres.c:4754
    frame #13: 0x000000010cade4e6 postgres`BackendRun(port=0x00007f84e8c1f9d0) 
+ 1014 at postmaster.c:5889
    frame #14: 0x000000010cadd631 
postgres`BackendStartup(port=0x00007f84e8c1f9d0) + 385 at postmaster.c:5484
    frame #15: 0x000000010cada2a0 postgres`ServerLoop + 1312 at 
postmaster.c:2163
    frame #16: 0x000000010cad88e3 postgres`PostmasterMain(argc=9, 
argv=0x00007f84e8c1edd0) + 4931 at postmaster.c:1454
    frame #17: 0x000000010c9e31e2 postgres`main(argc=9, 
argv=0x00007f84e8c1edd0) + 978 at main.c:226
    frame #18: 0x00007fff942e15c9 libdyld.dylib`start + 1

  thread #2: tid = 0xa9b7c, 0x00007fff8748a5be libsystem_kernel.dylib`poll + 10
    frame #0: 0x00007fff8748a5be libsystem_kernel.dylib`poll + 10
    frame #1: 0x000000010cd4489d postgres`rxThreadFunc(arg=0x0000000000000000) 
+ 317 at ic_udp.c:6251
    frame #2: 0x00007fff942d72fc libsystem_pthread.dylib`_pthread_body + 131
    frame #3: 0x00007fff942d7279 libsystem_pthread.dylib`_pthread_start + 176
    frame #4: 0x00007fff942d54b1 libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0xa9b7d, 0x00007fff874893f6 libsystem_kernel.dylib`__select 
+ 10
    frame #0: 0x00007fff874893f6 libsystem_kernel.dylib`__select + 10
    frame #1: 0x000000010ce26ffb postgres`pg_usleep(microsec=1000000) + 91 at 
pgsleep.c:43
    frame #2: 0x000000010ccdc344 
postgres`generateResourceRefreshHeartBeat(arg=0x00007f84e8c20ce0) + 1476 at 
rmcomm_QD2RM.c:1560
    frame #3: 0x00007fff942d72fc libsystem_pthread.dylib`_pthread_body + 131
    frame #4: 0x00007fff942d7279 libsystem_pthread.dylib`_pthread_start + 176
    frame #5: 0x00007fff942d54b1 libsystem_pthread.dylib`thread_start + 13
{code}

> QD hangs when connecting to resource manager
> --------------------------------------------
>
>                 Key: HAWQ-564
>                 URL: https://issues.apache.org/jira/browse/HAWQ-564
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Resource Manager
>    Affects Versions: 2.0.0
>            Reporter: Chunling Wang
>            Assignee: Lei Chang
>
> When first inject panic in QE process, we run a query and segment is down. 
> After the segment is up, we run another query and get correct answer. Then we 
> inject the same panic second time. After the segment is down and then up 
> again, we run a query and find QD process hangs when connecting to resource 
> manager. Here is the backtrace when QD hangs:
> {code}
> * thread #1: tid = 0x21d8be, 0x00007fff890355be libsystem_kernel.dylib`poll + 
> 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #1: 0x0000000101daeafe postgres`processAllCommFileDescs + 158 at 
> rmcomm_AsyncComm.c:156
>     frame #2: 0x0000000101db85f5 
> postgres`callSyncRPCRemote(hostname=0x00007f9c19e00cd0, port=5437, 
> sendbuff=0x00007f9c1b918f50, sendbuffsize=80, sendmsgid=259, 
> exprecvmsgid=2307, recvsmb=<unavailable>, errorbuf=0x000000010230c1a0, 
> errorbufsize=<unavailable>) + 645 at rmcomm_SyncComm.c:122
>     frame #3: 0x0000000101db2d85 postgres`acquireResourceFromRM [inlined] 
> callSyncRPCToRM(sendbuff=0x00007f9c1b918f50, sendbuffsize=<unavailable>, 
> sendmsgid=259, exprecvmsgid=2307, recvsmb=0x00007f9c1b918e70, 
> errorbuf=<unavailable>, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
>     frame #4: 0x0000000101db2d3c 
> postgres`acquireResourceFromRM(index=<unavailable>, sessionid=12, 
> slice_size=462524016, iobytes=134217728, preferred_nodes=0x00007f9c1a02d398, 
> preferred_nodes_size=<unavailable>, max_seg_count_fix=<unavailable>, 
> min_seg_count_fix=<unavailable>, errorbuf=<unavailable>, 
> errorbufsize=<unavailable>) + 572 at rmcomm_QD2RM.c:742
>     frame #5: 0x0000000101c979e7 postgres`AllocateResource(life=QRL_ONCE, 
> slice_size=5, iobytes=134217728, max_target_segment_num=1, 
> min_target_segment_num=1, vol_info=0x00007f9c1a02d398, vol_info_size=1) + 631 
> at pquery.c:796
>     frame #6: 0x0000000101e8c60f 
> postgres`calculate_planner_segment_num(query=<unavailable>, 
> resourceLife=QRL_ONCE, fullRangeTable=<unavailable>, 
> intoPolicy=<unavailable>, sliceNum=5) + 14287 at cdbdatalocality.c:4207
>     frame #7: 0x0000000101c0f671 postgres`planner + 106 at planner.c:496
>     frame #8: 0x0000000101c0f607 postgres`planner(parse=0x00007f9c1a02a140, 
> cursorOptions=<unavailable>, boundParams=0x0000000000000000, 
> resourceLife=QRL_ONCE) + 311 at planner.c:310
>     frame #9: 0x0000000101c8eb33 
> postgres`pg_plan_query(querytree=0x00007f9c1a02a140, 
> boundParams=0x0000000000000000, resource_life=QRL_ONCE) + 99 at postgres.c:837
>     frame #10: 0x0000000101c956ae postgres`exec_simple_query + 21 at 
> postgres.c:911
>     frame #11: 0x0000000101c95699 
> postgres`exec_simple_query(query_string=0x00007f9c1a028a30, 
> seqServerHost=0x0000000000000000, seqServerPort=-1) + 1577 at postgres.c:1671
>     frame #12: 0x0000000101c91a4c postgres`PostgresMain(argc=<unavailable>, 
> argv=<unavailable>, username=0x00007f9c1b808cf0) + 9404 at postgres.c:4754
>     frame #13: 0x0000000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 
> 105 at postmaster.c:5889
>     frame #14: 0x0000000101c4ad99 postgres`ServerLoop at postmaster.c:5484
>     frame #15: 0x0000000101c4ad99 postgres`ServerLoop + 9593 at 
> postmaster.c:2163
>     frame #16: 0x0000000101c47d3b postgres`PostmasterMain(argc=<unavailable>, 
> argv=<unavailable>) + 5019 at postmaster.c:1454
>     frame #17: 0x0000000101bb1aa9 postgres`main(argc=9, 
> argv=0x00007f9c19c1eef0) + 1433 at main.c:209
>     frame #18: 0x00007fff95e8c5c9 libdyld.dylib`start + 1
>   thread #2: tid = 0x21d8bf, 0x00007fff890355be libsystem_kernel.dylib`poll + 
> 10
>     frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #1: 0x0000000101dfe723 postgres`rxThreadFunc(arg=<unavailable>) + 
> 2163 at ic_udp.c:6251
>     frame #2: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #3: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
>     frame #4: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
>   thread #3: tid = 0x21d9c2, 0x00007fff890343f6 
> libsystem_kernel.dylib`__select + 10
>     frame #0: 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
>     frame #1: 0x0000000101e9d42e postgres`pg_usleep(microsec=<unavailable>) + 
> 78 at pgsleep.c:43
>     frame #2: 0x0000000101db1a66 
> postgres`generateResourceRefreshHeartBeat(arg=0x00007f9c19f02480) + 166 at 
> rmcomm_QD2RM.c:1519
>     frame #3: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #4: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
>     frame #5: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
> {code}
> And here is the operations:
> 1. Before injection, get query answer correctly.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
> test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
>  count
> -------
>   3725
> (1 row)
> {code}
> 2. Inject panic, fault triggered, and segment is down.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
> test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR:  fault triggered, fault name:'fail_qe_when_do_query' fault 
> type:'panic' (faultinjector.c:656)  (seg0 localhost:40000 pid=26936)
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
> test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR:  failed to acquire resource from resource manager, 1 of 1 segments is 
> unavailable (pquery.c:807)
> {code}
> 3. After a while and when segment is up, get correct answer.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
> test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
>  count
> -------
>   3725
> (1 row)
> {code}
> 4. Inject again, fault triggered, and segment is down.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
> test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR:  fault triggered, fault name:'fail_qe_when_do_query' fault 
> type:'panic' (faultinjector.c:656)  (seg0 localhost:40000 pid=26994)
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
> test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR:  failed to acquire resource from resource manager, 1 of 1 segments is 
> unavailable (pquery.c:807)
> {code}
> 5. After a while, run query and find QD hangs.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
> test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> {code}
> 6. Open another terminal, find segment is already up.
> {code}
> dispatch=# select * from gp_segment_configuration;
>  registration_order | role | status | port  |          hostname           |   
>         address           | description
> --------------------+------+--------+-------+-----------------------------+-----------------------------+-------------
>                   0 | m    | u      |  5432 | ChunlingdeMacBook-Pro.local | 
> ChunlingdeMacBook-Pro.local |
>                   1 | p    | u      | 40000 | localhost                   | 
> 127.0.0.1                   |
> (2 rows)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to