[ 
https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruilong Huo updated HAWQ-1342:
------------------------------
    Fix Version/s:     (was: backlog)
                   2.1.0.0-incubating

> QE process hang in shared input scan on segment node
> ----------------------------------------------------
>
>                 Key: HAWQ-1342
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1342
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Query Execution
>    Affects Versions: 2.0.0.0-incubating
>            Reporter: Amy
>            Assignee: Ming LI
>             Fix For: 2.1.0.0-incubating
>
>
> QE processes hang on a segment node while the QD and the QEs on the other 
> segment nodes have terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1   master   secondary namenode
> test2   segment   datanode
> test3   segment   datanode
> test4   segment   datanode
> test5   segment   namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin   41877      1  0 05:35 ?        00:01:04 
> /usr/local/hawq_2_1_0_0/bin/postgres -D 
> /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd
>  -i -M segment -p 20100 --silent-mode=true
> gpadmin   41878  41877  0 05:35 ?        00:00:02 postgres: port 20100, 
> logger process
> gpadmin   41881  41877  0 05:35 ?        00:00:00 postgres: port 20100, stats 
> collector process
> gpadmin   41882  41877  0 05:35 ?        00:00:07 postgres: port 20100, 
> writer process
> gpadmin   41883  41877  0 05:35 ?        00:00:01 postgres: port 20100, 
> checkpoint process
> gpadmin   41884  41877  0 05:35 ?        00:00:11 postgres: port 20100, 
> segment resource manager
> gpadmin   42108  41877  0 05:35 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   42416  41877  0 05:35 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   44807  41877  0 05:36 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC 
> SELECT
> gpadmin   44819  41877  0 05:36 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   44821  41877  0 05:36 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC 
> SELECT
> gpadmin   45447  41877  0 05:36 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   49859  41877  0 05:38 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC 
> SELECT
> gpadmin   49881  41877  0 05:38 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51937  41877  0 05:39 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51939  41877  0 05:39 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   51941  41877  0 05:39 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   51943  41877  0 05:39 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC 
> SELECT
> gpadmin   51953  41877  0 05:39 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC 
> SELECT
> gpadmin   53436  41877  0 05:40 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC 
> SELECT
> gpadmin   57095  41877  0 05:41 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   57097  41877  0 05:41 ?        00:00:04 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   63159  41877  0 05:43 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11474) con1082 seg0 cmd2 slice15 
> MPPEXEC SELECT
> gpadmin   64018  41877  0 05:44 ?        00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11905) con1121 seg0 cmd2 slice5 MPPEXEC 
> SELECT
> {code}
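
The process titles in the listing above carry `conN segN cmdN sliceN` fields. A minimal sketch of grouping the leftover QEs by connection follows, assuming these fields are the session (connection) id, segment id, command count and slice id. The regular expression and the helper name `slices_by_connection` are illustrative and not part of HAWQ, and the sample titles are a few lines copied from the listing with the mail wrapping undone.

```python
import re
from collections import defaultdict

# Assumed layout of the QE process title: "... con<N> seg<N> cmd<N> slice<N> ..."
TITLE = re.compile(r"con(\d+) seg(\d+) cmd(\d+) slice(\d+)")

# A few titles from the ps output above, joined back onto one line each.
titles = [
    "postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC SELECT",
    "postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC SELECT",
    "postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC SELECT",
    "postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC SELECT",
    "postgres: port 20100, logger process",
]

def slices_by_connection(lines):
    """Map each connection id to the sorted slice ids still running for it."""
    grouped = defaultdict(list)
    for line in lines:
        match = TITLE.search(line)
        if match is None:
            continue  # auxiliary processes (logger, writer, ...) have no con/slice fields
        con, _seg, _cmd, slice_id = map(int, match.groups())
        grouped[con].append(slice_id)
    return {con: sorted(ids) for con, ids in grouped.items()}

result = slices_by_connection(titles)
```

Grouping this way shows which sessions left more than one slice behind. For PID 42108 the title reports `slice9`, which matches `localSlice=9` in the `exec_mpp_query` frame of its backtrace.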
> The stack trace is below; the QE appears to hang in the shared input scan.
> {code}
> [gpadmin@test3 ~]$ gdb -p 42108
> (gdb) info threads
>   2 Thread 0x7f4f6b335700 (LWP 42109)  0x00000032214df283 in poll () from 
> /lib64/libc.so.6
> * 1 Thread 0x7f4f9041c920 (LWP 42108)  0x00000032214e1523 in select () from 
> /lib64/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 0x7f4f9041c920 (LWP 42108))]#1  
> 0x00000000007199be in shareinput_reader_waitready (share_id=1, 
> planGen=PLANGEN_PLANNER)
>     at nodeShareInputScan.c:760
> 760   in nodeShareInputScan.c
> (gdb) bt
> #0  0x00000032214e1523 in select () from /lib64/libc.so.6
> #1  0x00000000007199be in shareinput_reader_waitready (share_id=1, 
> planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> #2  0x0000000000718c68 in ExecSliceDependencyShareInputScan 
> (node=0x7f4f6aeb1d68) at nodeShareInputScan.c:344
> #3  0x00000000006e490f in ExecSliceDependencyNode (node=0x7f4f6aeb1d68) at 
> execProcnode.c:774
> #4  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb17c0) at 
> execProcnode.c:797
> #5  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb1230) at 
> execProcnode.c:797
> #6  0x00000000006dee81 in ExecutePlan (estate=0x3462b50, 
> planstate=0x7f4f6aeb1230, operation=CMD_SELECT, numberTuples=0, 
> direction=ForwardScanDirection, dest=0x7f4f6b229118)
>     at execMain.c:3178
> #7  0x00000000006dbba1 in ExecutorRun (queryDesc=0x346b570, 
> direction=ForwardScanDirection, count=0) at execMain.c:1197
> #8  0x000000000088e973 in PortalRunSelect (portal=0x3467c40, forward=1 
> '\001', count=0, dest=0x7f4f6b229118) at pquery.c:1731
> #9  0x000000000088e58e in PortalRun (portal=0x3467c40, 
> count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f4f6b229118, 
> altdest=0x7f4f6b229118,
>     completionTag=0x7fff4286c090 "") at pquery.c:1553
> #10 0x0000000000883e32 in exec_mpp_query (
>     query_string=0x348fa92 "SELECT 
> sale.qty,sale.pn,sale.pn,GROUPING(sale.pn,sale.qty,sale.pn), 
> TO_CHAR(COALESCE(COVAR_POP(floor(sale.vn*sale.qty),floor(sale.pn/sale.pn)),0),'99999999.9999999'),TO_CHAR(COALESCE(AVG(DISTINCT
>  floo"..., serializedQuerytree=0x0, serializedQuerytreelen=0, 
> serializedPlantree=0x348fe94 "y6\002", serializedPlantreelen=9061,
>     serializedParams=0x0, serializedParamslen=0, 
> serializedSliceInfo=0x34921f9 "\355\002", serializedSliceInfolen=230, 
> serializedResource=0x349232c "(",
>     serializedResourceLen=37, seqServerHost=0x3492351 "10.32.35.192", 
> seqServerPort=44013, localSlice=9) at postgres.c:1487
> #11 0x00000000008898fb in PostgresMain (argc=254, argv=0x3311360, 
> username=0x32f2f10 "hawqsuperuser") at postgres.c:5051
> #12 0x000000000083c198 in BackendRun (port=0x32af5f0) at postmaster.c:5915
> #13 0x000000000083b5e3 in BackendStartup (port=0x32af5f0) at postmaster.c:5484
> #14 0x0000000000835df0 in ServerLoop () at postmaster.c:2163
> #15 0x0000000000834ebb in PostmasterMain (argc=9, argv=0x32b7d10) at 
> postmaster.c:1454
> #16 0x000000000076115b in main (argc=9, argv=0x32b7d10) at main.c:226
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7f4f6b335700 (LWP 42109))]#0  
> 0x00000032214df283 in poll () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00000032214df283 in poll () from /lib64/libc.so.6
> #1  0x0000000000a29d03 in rxThreadFunc (arg=0x0) at ic_udp.c:6278
> #2  0x0000003221807aa1 in start_thread () from /lib64/libpthread.so.0
> #3  0x00000032214e8aad in clone () from /lib64/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 0x7f4f9041c920 (LWP 42108))]#0  
> 0x00000032214e1523 in select () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00000032214e1523 in select () from /lib64/libc.so.6
> #1  0x00000000007199be in shareinput_reader_waitready (share_id=1, 
> planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> #2  0x0000000000718c68 in ExecSliceDependencyShareInputScan 
> (node=0x7f4f6aeb1d68) at nodeShareInputScan.c:344
> #3  0x00000000006e490f in ExecSliceDependencyNode (node=0x7f4f6aeb1d68) at 
> execProcnode.c:774
> #4  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb17c0) at 
> execProcnode.c:797
> #5  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb1230) at 
> execProcnode.c:797
> #6  0x00000000006dee81 in ExecutePlan (estate=0x3462b50, 
> planstate=0x7f4f6aeb1230, operation=CMD_SELECT, numberTuples=0, 
> direction=ForwardScanDirection, dest=0x7f4f6b229118)
>     at execMain.c:3178
> #7  0x00000000006dbba1 in ExecutorRun (queryDesc=0x346b570, 
> direction=ForwardScanDirection, count=0) at execMain.c:1197
> #8  0x000000000088e973 in PortalRunSelect (portal=0x3467c40, forward=1 
> '\001', count=0, dest=0x7f4f6b229118) at pquery.c:1731
> #9  0x000000000088e58e in PortalRun (portal=0x3467c40, 
> count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f4f6b229118, 
> altdest=0x7f4f6b229118,
>     completionTag=0x7fff4286c090 "") at pquery.c:1553
> #10 0x0000000000883e32 in exec_mpp_query (
>     query_string=0x348fa92 "SELECT 
> sale.qty,sale.pn,sale.pn,GROUPING(sale.pn,sale.qty,sale.pn), 
> TO_CHAR(COALESCE(COVAR_POP(floor(sale.vn*sale.qty),floor(sale.pn/sale.pn)),0),'99999999.9999999'),TO_CHAR(COALESCE(AVG(DISTINCT
>  floo"..., serializedQuerytree=0x0, serializedQuerytreelen=0, 
> serializedPlantree=0x348fe94 "y6\002", serializedPlantreelen=9061,
>     serializedParams=0x0, serializedParamslen=0, 
> serializedSliceInfo=0x34921f9 "\355\002", serializedSliceInfolen=230, 
> serializedResource=0x349232c "(",
>     serializedResourceLen=37, seqServerHost=0x3492351 "10.32.35.192", 
> seqServerPort=44013, localSlice=9) at postgres.c:1487
> #11 0x00000000008898fb in PostgresMain (argc=254, argv=0x3311360, 
> username=0x32f2f10 "hawqsuperuser") at postgres.c:5051
> #12 0x000000000083c198 in BackendRun (port=0x32af5f0) at postmaster.c:5915
> #13 0x000000000083b5e3 in BackendStartup (port=0x32af5f0) at postmaster.c:5484
> #14 0x0000000000835df0 in ServerLoop () at postmaster.c:2163
> #15 0x0000000000834ebb in PostmasterMain (argc=9, argv=0x32b7d10) at 
> postmaster.c:1454
> #16 0x000000000076115b in main (argc=9, argv=0x32b7d10) at main.c:226
> (gdb) f 1
> #1  0x00000000007199be in shareinput_reader_waitready (share_id=1, 
> planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> 760   in nodeShareInputScan.c
> (gdb) p n
> $1 = 0
> (gdb) p errno
> $2 = 17
> (gdb) p InterruptPending
> $3 = 0 '\000'
> (gdb) p QueryCancelPending
> $4 = 0 '\000'
> (gdb) p ProcDiePending
> $5 = 0 '\000'
> {code}
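
In the gdb session above, `n` is 0 and none of `InterruptPending`, `QueryCancelPending` and `ProcDiePending` is set. Assuming `n` holds the return value of `select()` (the source of `nodeShareInputScan.c` is not shown here), this would mean the call timed out with nothing readable. The `errno` value 17 is `EEXIST` on Linux and is probably left over from an earlier call, since `select()` does not set `errno` when it returns 0. The sketch below is not HAWQ code. It only illustrates, in Python, how a reader that polls with a timeout and checks a cancel flag can wait indefinitely when no writer ever signals and no cancel arrives. The names `wait_ready`, `is_cancelled` and `max_wait` are hypothetical.

```python
import errno
import os
import select
import time

def wait_ready(fd, poll_seconds=0.05, is_cancelled=lambda: False, max_wait=None):
    """Poll fd until it is readable; return 'ready', 'cancelled' or 'timeout'."""
    deadline = None if max_wait is None else time.monotonic() + max_wait
    while True:
        if is_cancelled():
            return "cancelled"
        readable, _, _ = select.select([fd], [], [], poll_seconds)
        if readable:
            return "ready"
        # select() timed out with nothing readable (the analogue of n == 0).
        # With no deadline and no cancel request, the loop simply continues.
        if deadline is not None and time.monotonic() >= deadline:
            return "timeout"

read_fd, write_fd = os.pipe()

# No writer has signalled yet: only the hypothetical max_wait ends the wait.
no_writer = wait_ready(read_fd, max_wait=0.2)

# Once the writer side signals, the reader returns on the next poll.
os.write(write_fd, b"x")
with_writer = wait_ready(read_fd, max_wait=0.2)

os.close(read_fd)
os.close(write_fd)

# Symbolic name of errno 17 on this platform.
errno_17 = errno.errorcode[17]
```

Without `max_wait`, the first call would never return, which resembles the state of thread 1 in the backtrace: blocked in `select()` with no pending interrupt to break the loop.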



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
