[ https://issues.apache.org/jira/browse/HAWQ-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ruilong Huo closed HAWQ-1487.
-----------------------------

> hang process due to deadlock when it try to process interrupt in error handling
> -------------------------------------------------------------------------------
>
>                 Key: HAWQ-1487
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1487
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Query Execution
>    Affects Versions: 2.2.0.0-incubating
>            Reporter: Ruilong Huo
>            Assignee: Ruilong Huo
>             Fix For: 2.3.0.0-incubating
>
>
> A process hangs when it tries to process an interrupt during error handling.
> Specifically, a QE hits a division-by-zero error and starts to error out.
> While that error is being processed, the QE tries to handle a query-cancel
> interrupt, and a deadlock occurs.
> The hung process is:
> {noformat}
> $ hawq ssh -f hostfile -e "ps -ef | grep postgres | grep -v grep"
> gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger p
> gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats co
> gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer p
> gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoi
> gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment
> gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsupe
> $ ps -ef | grep postgres | grep -v grep
> gpadmin   51245      1  0 06:15 ?        00:01:01 /usr/local/hawq_2_2_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
> gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger process
> gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats collector process
> gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer process
> gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoint process
> gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment resource manager
> gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_winow... 10.32.34.225(45462) con4405 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin  194424 194402  0 23:50 pts/0    00:00:00 grep postgres
> {noformat}
> The call stack is:
> {noformat}
> $ sudo gdb -p 182983
> (gdb) bt
> #0  0x0000003ff060e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x0000003ff0609588 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x0000003ff0609457 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x0000003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
> #4  0x0000003ff220f603 in ?? () from /lib64/libgcc_s.so.1
> #5  0x0000003ff220ff49 in ?? () from /lib64/libgcc_s.so.1
> #6  0x0000003ff22100e7 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #7  0x0000003ff02fe966 in backtrace () from /lib64/libc.so.6
> #8  0x00000000009cda3f in errstart (elevel=20, filename=0xd309e0 "postgres.c", lineno=3618, funcname=0xd32fc0 "ProcessInterrupts", domain=0x0) at elog.c:492
> #9  0x00000000008e8fcb in ProcessInterrupts () at postgres.c:3616
> #10 0x00000000008e8c9e in StatementCancelHandler (postgres_signal_arg=2) at postgres.c:3463
> #11 <signal handler called>
> #12 0x0000003ff0609451 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #13 0x0000003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
> #14 0x0000003ff220f603 in ?? () from /lib64/libgcc_s.so.1
> #15 0x0000003ff2210119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #16 0x0000003ff02fe966 in backtrace () from /lib64/libc.so.6
> #17 0x00000000009cda3f in errstart (elevel=20, filename=0xd3ba00 "float.c", lineno=839, funcname=0xd3bf3a "float8div", domain=0x0) at elog.c:492
> #18 0x0000000000921a84 in float8div (fcinfo=0x7ffd04d2b8b0) at float.c:836
> #19 0x0000000000722fe5 in ExecMakeFunctionResult (fcache=0x324a088, econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030", isDone=0x7ffd04d2bd04) at execQual.c:1762
> #20 0x0000000000723d87 in ExecEvalOper (fcache=0x324a088, econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030", isDone=0x7ffd04d2bd04) at execQual.c:2250
> #21 0x0000000000722451 in ExecEvalFuncArgs (fcinfo=0x7ffd04d2bda0, argList=0x324b378, econtext=0x32495d8) at execQual.c:1317
> #22 0x0000000000722a68 in ExecMakeFunctionResult (fcache=0x3249850, econtext=0x32495d8, isNull=0x7ffd04d2c5c1 "\306\322\004\375\177", isDone=0x0) at execQual.c:1532
> #23 0x0000000000723d1e in ExecEvalFunc (fcache=0x3249850, econtext=0x32495d8, isNull=0x7ffd04d2c5c1 "\306\322\004\375\177", isDone=0x0) at execQual.c:2228
> #24 0x000000000076eed2 in initFcinfo (wrxstate=0x31b8fe0, fcinfo=0x7ffd04d2c280, funcstate=0x7f83c7412318, econtext=0x32495d8, check_nulls=1 '\001') at nodeWindow.c:3201
> #25 0x000000000076efa4 in add_tuple_to_trans (funcstate=0x7f83c7412318, wstate=0x3248ab8, econtext=0x32495d8, check_nulls=1 '\001') at nodeWindow.c:3223
> #26 0x0000000000772f72 in processTupleSlot (wstate=0x3248ab8, slot=0x31ac150, last_peer=0 '\000') at nodeWindow.c:5105
> #27 0x0000000000772760 in ExecWindow (wstate=0x3248ab8) at nodeWindow.c:4821
> #28 0x000000000071eda7 in ExecProcNode (node=0x3248ab8) at execProcnode.c:1007
> #29 0x000000000075aded in NextInputSlot (node=0x31af928) at nodeResult.c:95
> #30 0x000000000075afba in ExecResult (node=0x31af928) at nodeResult.c:194
> #31 0x000000000071eb64 in ExecProcNode (node=0x31af928) at execProcnode.c:891
> #32 0x000000000075aded in NextInputSlot (node=0x31ae608) at nodeResult.c:95
> #33 0x000000000075afba in ExecResult (node=0x31ae608) at nodeResult.c:194
> #34 0x000000000071eb64 in ExecProcNode (node=0x31ae608) at execProcnode.c:891
> #35 0x00000000007574e1 in execMotionSender (node=0x31adf80) at nodeMotion.c:364
> #36 0x00000000007573cf in ExecMotion (node=0x31adf80) at nodeMotion.c:331
> #37 0x000000000071ed80 in ExecProcNode (node=0x31adf80) at execProcnode.c:999
> #38 0x00000000007180fc in ExecutePlan (estate=0x31b70d8, planstate=0x31adf80, operation=CMD_SELECT, numberTuples=0, direction=ForwardScanDirection, dest=0x7f83c74609f0) at execMain.c:3199
> #39 0x0000000000714450 in ExecutorRun (queryDesc=0x31c0c40, direction=ForwardScanDirection, count=0) at execMain.c:1197
> #40 0x00000000008f0b5b in PortalRunSelect (portal=0x31ba158, forward=1 '\001', count=0, dest=0x7f83c74609f0) at pquery.c:1730
> #41 0x00000000008f073b in PortalRun (portal=0x31ba158, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f83c74609f0, altdest=0x7f83c74609f0, completionTag=0x7ffd04d2d000 "") at pquery.c:1552
> #42 0x00000000008e5878 in exec_mpp_query (query_string=0x31edeca "SELECT sale.vn,sale.qty,sale.qty, TO_CHAR(COALESCE(REGR_AVGY(floor(sale.vn*sale.vn),floor(sale.prc+sale.pn)) OVER(win1),0),'99999999.9999999'),sale.dt,\nTO_CHAR(COALESCE(MIN(floor(sale.qty)) OVER(win1)"..., serializedQuerytree=0x0, serializedQuerytreelen=0, serializedPlantree=0x31ee32f "P{", serializedPlantreelen=4076, serializedParams=0x0, serializedParamslen=0, serializedSliceInfo=0x31ef31b "a\020", serializedSliceInfolen=1057, serializedResource=0x31ef78a "\260", serializedResourceLen=50, seqServerHost=0x31ef7bc "10.32.34.225", seqServerPort=19765, localSlice=7) at postgres.c:1487
> #43 0x00000000008eb6b8 in PostgresMain (argc=270, argv=0x3060678, username=0x303a938 "hawqsuperuser") at postgres.c:5080
> #44 0x00000000008947f7 in BackendRun (port=0x300b5f0) at postmaster.c:5915
> #45 0x0000000000893c16 in BackendStartup (port=0x300b5f0) at postmaster.c:5484
> #46 0x000000000088dc92 in ServerLoop () at postmaster.c:2163
> #47 0x000000000088cc9f in PostmasterMain (argc=9, argv=0x3013d10) at postmaster.c:1454
> #48 0x00000000007aa05b in main (argc=9, argv=0x3013d10) at main.c:226
> {noformat}

--
This message was sent by Atlassian JIRA (v6.4.14#64029)