[ https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruilong Huo updated HAWQ-1342:
------------------------------
    Fix Version/s: (was: backlog) 2.1.0.0-incubating

> QE process hang in shared input scan on segment node
> ----------------------------------------------------
>
>                 Key: HAWQ-1342
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1342
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Query Execution
>    Affects Versions: 2.0.0.0-incubating
>            Reporter: Amy
>            Assignee: Ming LI
>             Fix For: 2.1.0.0-incubating
>
>
> QE processes hang on a segment node while the QD and the QEs on other
> segment nodes have terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1 master secondary namenode
> test2 segment datanode
> test3 segment datanode
> test4 segment datanode
> test5 segment namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin 41877 1 0 05:35 ? 00:01:04 /usr/local/hawq_2_1_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
> gpadmin 41878 41877 0 05:35 ? 00:00:02 postgres: port 20100, logger process
> gpadmin 41881 41877 0 05:35 ? 00:00:00 postgres: port 20100, stats collector process
> gpadmin 41882 41877 0 05:35 ? 00:00:07 postgres: port 20100, writer process
> gpadmin 41883 41877 0 05:35 ? 00:00:01 postgres: port 20100, checkpoint process
> gpadmin 41884 41877 0 05:35 ? 00:00:11 postgres: port 20100, segment resource manager
> gpadmin 42108 41877 0 05:35 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 42416 41877 0 05:35 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 44807 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC SELECT
> gpadmin 44819 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin 44821 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC SELECT
> gpadmin 45447 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 49859 41877 0 05:38 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC SELECT
> gpadmin 49881 41877 0 05:38 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin 51937 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin 51939 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 51941 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 51943 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC SELECT
> gpadmin 51953 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC SELECT
> gpadmin 53436 41877 0 05:40 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC SELECT
> gpadmin 57095 41877 0 05:41 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin 57097 41877 0 05:41 ? 00:00:04 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 63159 41877 0 05:43 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(11474) con1082 seg0 cmd2 slice15 MPPEXEC SELECT
> gpadmin 64018 41877 0 05:44 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(11905) con1121 seg0 cmd2 slice5 MPPEXEC SELECT
> {code}
> The stack trace is below; the QE appears to hang in the shared input scan.
> {code}
> [gpadmin@test3 ~]$ gdb -p 42108
> (gdb) info threads
>   2 Thread 0x7f4f6b335700 (LWP 42109) 0x00000032214df283 in poll () from /lib64/libc.so.6
> * 1 Thread 0x7f4f9041c920 (LWP 42108) 0x00000032214e1523 in select () from /lib64/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 0x7f4f9041c920 (LWP 42108))]#1 0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER)
>     at nodeShareInputScan.c:760
> 760 in nodeShareInputScan.c
> (gdb) bt
> #0  0x00000032214e1523 in select () from /lib64/libc.so.6
> #1  0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> #2  0x0000000000718c68 in ExecSliceDependencyShareInputScan (node=0x7f4f6aeb1d68) at nodeShareInputScan.c:344
> #3  0x00000000006e490f in ExecSliceDependencyNode (node=0x7f4f6aeb1d68) at execProcnode.c:774
> #4  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb17c0) at execProcnode.c:797
> #5  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb1230) at execProcnode.c:797
> #6  0x00000000006dee81 in ExecutePlan (estate=0x3462b50, planstate=0x7f4f6aeb1230, operation=CMD_SELECT, numberTuples=0, direction=ForwardScanDirection, dest=0x7f4f6b229118)
>     at execMain.c:3178
> #7  0x00000000006dbba1 in ExecutorRun (queryDesc=0x346b570, direction=ForwardScanDirection, count=0) at execMain.c:1197
> #8  0x000000000088e973 in PortalRunSelect (portal=0x3467c40, forward=1 '\001', count=0, dest=0x7f4f6b229118) at pquery.c:1731
> #9  0x000000000088e58e in PortalRun (portal=0x3467c40, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f4f6b229118, altdest=0x7f4f6b229118,
>     completionTag=0x7fff4286c090 "") at pquery.c:1553
> #10 0x0000000000883e32 in exec_mpp_query (
>     query_string=0x348fa92 "SELECT sale.qty,sale.pn,sale.pn,GROUPING(sale.pn,sale.qty,sale.pn), TO_CHAR(COALESCE(COVAR_POP(floor(sale.vn*sale.qty),floor(sale.pn/sale.pn)),0),'99999999.9999999'),TO_CHAR(COALESCE(AVG(DISTINCT floo"..., serializedQuerytree=0x0, serializedQuerytreelen=0,
>     serializedPlantree=0x348fe94 "y6\002", serializedPlantreelen=9061, serializedParams=0x0, serializedParamslen=0,
>     serializedSliceInfo=0x34921f9 "\355\002", serializedSliceInfolen=230, serializedResource=0x349232c "(",
>     serializedResourceLen=37, seqServerHost=0x3492351 "10.32.35.192", seqServerPort=44013, localSlice=9) at postgres.c:1487
> #11 0x00000000008898fb in PostgresMain (argc=254, argv=0x3311360, username=0x32f2f10 "hawqsuperuser") at postgres.c:5051
> #12 0x000000000083c198 in BackendRun (port=0x32af5f0) at postmaster.c:5915
> #13 0x000000000083b5e3 in BackendStartup (port=0x32af5f0) at postmaster.c:5484
> #14 0x0000000000835df0 in ServerLoop () at postmaster.c:2163
> #15 0x0000000000834ebb in PostmasterMain (argc=9, argv=0x32b7d10) at postmaster.c:1454
> #16 0x000000000076115b in main (argc=9, argv=0x32b7d10) at main.c:226
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7f4f6b335700 (LWP 42109))]#0 0x00000032214df283 in poll () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00000032214df283 in poll () from /lib64/libc.so.6
> #1  0x0000000000a29d03 in rxThreadFunc (arg=0x0) at ic_udp.c:6278
> #2  0x0000003221807aa1 in start_thread () from /lib64/libpthread.so.0
> #3  0x00000032214e8aad in clone () from /lib64/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 0x7f4f9041c920 (LWP 42108))]#0 0x00000032214e1523 in select () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00000032214e1523 in select () from /lib64/libc.so.6
> #1  0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> #2  0x0000000000718c68 in ExecSliceDependencyShareInputScan (node=0x7f4f6aeb1d68) at nodeShareInputScan.c:344
> #3  0x00000000006e490f in ExecSliceDependencyNode (node=0x7f4f6aeb1d68) at execProcnode.c:774
> #4  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb17c0) at execProcnode.c:797
> #5  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb1230) at execProcnode.c:797
> #6  0x00000000006dee81 in ExecutePlan (estate=0x3462b50, planstate=0x7f4f6aeb1230, operation=CMD_SELECT, numberTuples=0, direction=ForwardScanDirection, dest=0x7f4f6b229118)
>     at execMain.c:3178
> #7  0x00000000006dbba1 in ExecutorRun (queryDesc=0x346b570, direction=ForwardScanDirection, count=0) at execMain.c:1197
> #8  0x000000000088e973 in PortalRunSelect (portal=0x3467c40, forward=1 '\001', count=0, dest=0x7f4f6b229118) at pquery.c:1731
> #9  0x000000000088e58e in PortalRun (portal=0x3467c40, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f4f6b229118, altdest=0x7f4f6b229118,
>     completionTag=0x7fff4286c090 "") at pquery.c:1553
> #10 0x0000000000883e32 in exec_mpp_query (
>     query_string=0x348fa92 "SELECT sale.qty,sale.pn,sale.pn,GROUPING(sale.pn,sale.qty,sale.pn), TO_CHAR(COALESCE(COVAR_POP(floor(sale.vn*sale.qty),floor(sale.pn/sale.pn)),0),'99999999.9999999'),TO_CHAR(COALESCE(AVG(DISTINCT floo"..., serializedQuerytree=0x0, serializedQuerytreelen=0,
>     serializedPlantree=0x348fe94 "y6\002", serializedPlantreelen=9061, serializedParams=0x0, serializedParamslen=0,
>     serializedSliceInfo=0x34921f9 "\355\002", serializedSliceInfolen=230, serializedResource=0x349232c "(",
>     serializedResourceLen=37, seqServerHost=0x3492351 "10.32.35.192", seqServerPort=44013, localSlice=9) at postgres.c:1487
> #11 0x00000000008898fb in PostgresMain (argc=254, argv=0x3311360, username=0x32f2f10 "hawqsuperuser") at postgres.c:5051
> #12 0x000000000083c198 in BackendRun (port=0x32af5f0) at postmaster.c:5915
> #13 0x000000000083b5e3 in BackendStartup (port=0x32af5f0) at postmaster.c:5484
> #14 0x0000000000835df0 in ServerLoop () at postmaster.c:2163
> #15 0x0000000000834ebb in PostmasterMain (argc=9, argv=0x32b7d10) at postmaster.c:1454
> #16 0x000000000076115b in main (argc=9, argv=0x32b7d10) at main.c:226
> (gdb) f 1
> #1  0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> 760 in nodeShareInputScan.c
> (gdb) p n
> $1 = 0
> (gdb) p errno
> $2 = 17
> (gdb) p InterruptPending
> $3 = 0 '\000'
> (gdb) p QueryCancelPending
> $4 = 0 '\000'
> (gdb) p ProcDiePending
> $5 = 0 '\000'
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
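
A note on the values printed in frame 1 of the gdb session. Assuming n holds the return value of select(), n = 0 means the call timed out rather than failed, so errno = 17 (EEXIST on Linux) is most likely left over from some earlier call and says nothing about this select(). With InterruptPending, QueryCancelPending and ProcDiePending all 0, nothing tells the reader to stop waiting. The sketch below illustrates that kind of wait loop in Python. It is not HAWQ's shareinput_reader_waitready: the function name wait_ready, its parameters, the pipe standing in for the reader/writer channel, and the overall deadline are all assumptions made for the example.

```python
import errno
import os
import select
import time


def wait_ready(fd, poll_interval=0.05, deadline=0.2, cancelled=lambda: False):
    """Wait until fd becomes readable, polling with a select() timeout.

    Hypothetical illustration only. Returns "ready", "cancelled",
    or "timed_out".
    """
    give_up_at = time.monotonic() + deadline
    while True:
        # Stand-in for the interrupt flags checked between waits.
        if cancelled():
            return "cancelled"
        readable, _, _ = select.select([fd], [], [], poll_interval)
        if readable:
            return "ready"
        # Nothing readable: select() timed out (the n == 0 case).
        # Without an overall deadline, a loop like this retries forever
        # when the writer never signals and no cancel request arrives.
        if time.monotonic() >= give_up_at:
            return "timed_out"


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    print(wait_ready(read_end))   # writer silent: prints "timed_out"
    os.write(write_end, b"x")
    print(wait_ready(read_end))   # writer signalled: prints "ready"
    print(errno.errorcode[17])    # symbolic name of errno 17 on this platform
```

Removing the deadline check from the sketch gives the pattern that matches the backtrace: each select() times out, no cancel flag is set, and the loop starts over.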