[jira] [Updated] (HAWQ-1487) hang process due to deadlock when it try to process interrupt in error handling
     [ https://issues.apache.org/jira/browse/HAWQ-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruilong Huo updated HAWQ-1487:
------------------------------
    Affects Version/s: 2.2.0.0-incubating

> hang process due to deadlock when it try to process interrupt in error handling
> -------------------------------------------------------------------------------
>
>                 Key: HAWQ-1487
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1487
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Query Execution
>    Affects Versions: 2.2.0.0-incubating
>            Reporter: Ruilong Huo
>            Assignee: Ruilong Huo
>             Fix For: 2.3.0.0-incubating
>
>
> A process hangs when it tries to process an interrupt during error handling.
> Specifically, a QE encounters a division-by-zero error and errors out. While
> processing that error, it tries to handle a query-cancel interrupt, and a
> deadlock occurs.
> The hung process is:
> {noformat}
> $ hawq ssh -f hostfile -e "ps -ef | grep postgres | grep -v grep"
> gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger p
> gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats co
> gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer p
> gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoi
> gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment
> gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsupe
> $ ps -ef | grep postgres | grep -v grep
> gpadmin   51245      1  0 06:15 ?        00:01:01 /usr/local/hawq_2_2_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
> gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger process
> gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats collector process
> gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer process
> gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoint process
> gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment resource manager
> gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_winow... 10.32.34.225(45462) con4405 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin  194424 194402  0 23:50 pts/0    00:00:00 grep postgres
> {noformat}
> The call stack is:
> {noformat}
> $ sudo gdb -p 182983
> (gdb) bt
> #0  0x003ff060e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x003ff0609588 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x003ff0609457 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
> #4  0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
> #5  0x003ff220ff49 in ?? () from /lib64/libgcc_s.so.1
> #6  0x003ff22100e7 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #7  0x003ff02fe966 in backtrace () from /lib64/libc.so.6
> #8  0x009cda3f in errstart (elevel=20, filename=0xd309e0 "postgres.c", lineno=3618, funcname=0xd32fc0 "ProcessInterrupts", domain=0x0) at elog.c:492
> #9  0x008e8fcb in ProcessInterrupts () at postgres.c:3616
> #10 0x008e8c9e in StatementCancelHandler (postgres_signal_arg=2) at postgres.c:3463
> #11 <signal handler called>
> #12 0x003ff0609451 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #13 0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
> #14 0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
> #15 0x003ff2210119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #16 0x003ff02fe966 in backtrace () from /lib64/libc.so.6
> #17 0x009cda3f in errstart (elevel=20, filename=0xd3ba00 "float.c", lineno=839, funcname=0xd3bf3a "float8div", domain=0x0) at elog.c:492
> #18 0x00921a84 in float8div (fcinfo=0x7ffd04d2b8b0) at float.c:836
> #19 0x00722fe5 in ExecMakeFunctionResult (fcache=0x324a088, econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030", isDone=0x7ffd04d2bd04) at execQual.c:1762
> #20 0x00723d87 in ExecEvalOper (fcache=0x324a088, econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030", isDone=0x7ffd04d2bd04) at execQual.c:2250
> #21 0x00722451 in ExecEvalFuncArgs (fcinfo=0x7ffd04d2bda0, argList=0x324b378, econtext=0x32495d8) at execQual.c:1317
> #22 0x00722a68 in ExecMakeFunctionResult (fcache=0x3249850, econtext=0x32495d8, isNull=0x7ffd04d2c5c1 "\306\322\004\375\177", isDone=0x0) at execQual.c:1532
> #23 0x00723d1e in ExecEvalFunc (fcache=0x3249850, econtext=0x32495d8, isNull=0x7ffd04d2c5c1 "\306\322\004\375\177", isDone=0x0) at
[jira] [Updated] (HAWQ-1487) hang process due to deadlock when it try to process interrupt in error handling
     [ https://issues.apache.org/jira/browse/HAWQ-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruilong Huo updated HAWQ-1487:
------------------------------
    Fix Version/s: 2.3.0.0-incubating

> hang process due to deadlock when it try to process interrupt in error handling
> -------------------------------------------------------------------------------
>
>                 Key: HAWQ-1487
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1487
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Query Execution
>            Reporter: Ruilong Huo
>            Assignee: Ruilong Huo
>             Fix For: 2.3.0.0-incubating
>
>
> A process hangs when it tries to process an interrupt during error handling.
> Specifically, a QE encounters a division-by-zero error and errors out. While
> processing that error, it tries to handle a query-cancel interrupt, and a
> deadlock occurs.
> The hung process is:
> {noformat}
> $ hawq ssh -f hostfile -e "ps -ef | grep postgres | grep -v grep"
> gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger p
> gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats co
> gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer p
> gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoi
> gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment
> gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsupe
> $ ps -ef | grep postgres | grep -v grep
> gpadmin   51245      1  0 06:15 ?        00:01:01 /usr/local/hawq_2_2_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
> gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger process
> gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats collector process
> gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer process
> gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoint process
> gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment resource manager
> gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_winow... 10.32.34.225(45462) con4405 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin  194424 194402  0 23:50 pts/0    00:00:00 grep postgres
> {noformat}
> The call stack is:
> {noformat}
> $ sudo gdb -p 182983
> (gdb) bt
> #0  0x003ff060e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x003ff0609588 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x003ff0609457 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
> #4  0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
> #5  0x003ff220ff49 in ?? () from /lib64/libgcc_s.so.1
> #6  0x003ff22100e7 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #7  0x003ff02fe966 in backtrace () from /lib64/libc.so.6
> #8  0x009cda3f in errstart (elevel=20, filename=0xd309e0 "postgres.c", lineno=3618, funcname=0xd32fc0 "ProcessInterrupts", domain=0x0) at elog.c:492
> #9  0x008e8fcb in ProcessInterrupts () at postgres.c:3616
> #10 0x008e8c9e in StatementCancelHandler (postgres_signal_arg=2) at postgres.c:3463
> #11 <signal handler called>
> #12 0x003ff0609451 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #13 0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
> #14 0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
> #15 0x003ff2210119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #16 0x003ff02fe966 in backtrace () from /lib64/libc.so.6
> #17 0x009cda3f in errstart (elevel=20, filename=0xd3ba00 "float.c", lineno=839, funcname=0xd3bf3a "float8div", domain=0x0) at elog.c:492
> #18 0x00921a84 in float8div (fcinfo=0x7ffd04d2b8b0) at float.c:836
> #19 0x00722fe5 in ExecMakeFunctionResult (fcache=0x324a088, econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030", isDone=0x7ffd04d2bd04) at execQual.c:1762
> #20 0x00723d87 in ExecEvalOper (fcache=0x324a088, econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030", isDone=0x7ffd04d2bd04) at execQual.c:2250
> #21 0x00722451 in ExecEvalFuncArgs (fcinfo=0x7ffd04d2bda0, argList=0x324b378, econtext=0x32495d8) at execQual.c:1317
> #22 0x00722a68 in ExecMakeFunctionResult (fcache=0x3249850, econtext=0x32495d8, isNull=0x7ffd04d2c5c1 "\306\322\004\375\177", isDone=0x0) at execQual.c:1532
> #23 0x00723d1e in ExecEvalFunc (fcache=0x3249850, econtext=0x32495d8, isNull=0x7ffd04d2c5c1 "\306\322\004\375\177", isDone=0x0) at execQual.c:2228
> #24 0x0076eed2 in initFcinfo
[jira] [Updated] (HAWQ-1487) hang process due to deadlock when it try to process interrupt in error handling
     [ https://issues.apache.org/jira/browse/HAWQ-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruilong Huo updated HAWQ-1487:
------------------------------
    Description: 
A process hangs when it tries to process an interrupt during error handling. Specifically, a QE encounters a division-by-zero error and errors out. While processing that error, it tries to handle a query-cancel interrupt, and a deadlock occurs.

The hung process is:
{noformat}
$ hawq ssh -f hostfile -e "ps -ef | grep postgres | grep -v grep"
gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger p
gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats co
gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer p
gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoi
gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment
gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsupe
$ ps -ef | grep postgres | grep -v grep
gpadmin   51245      1  0 06:15 ?        00:01:01 /usr/local/hawq_2_2_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger process
gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats collector process
gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer process
gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoint process
gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment resource manager
gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_winow... 10.32.34.225(45462) con4405 seg0 cmd2 slice7 MPPEXEC SELECT
gpadmin  194424 194402  0 23:50 pts/0    00:00:00 grep postgres
{noformat}

The call stack is:
{noformat}
$ sudo gdb -p 182983
(gdb) bt
#0  0x003ff060e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x003ff0609588 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x003ff0609457 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#4  0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
#5  0x003ff220ff49 in ?? () from /lib64/libgcc_s.so.1
#6  0x003ff22100e7 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#7  0x003ff02fe966 in backtrace () from /lib64/libc.so.6
#8  0x009cda3f in errstart (elevel=20, filename=0xd309e0 "postgres.c", lineno=3618, funcname=0xd32fc0 "ProcessInterrupts", domain=0x0) at elog.c:492
#9  0x008e8fcb in ProcessInterrupts () at postgres.c:3616
#10 0x008e8c9e in StatementCancelHandler (postgres_signal_arg=2) at postgres.c:3463
#11 <signal handler called>
#12 0x003ff0609451 in pthread_mutex_lock () from /lib64/libpthread.so.0
#13 0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#14 0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
#15 0x003ff2210119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#16 0x003ff02fe966 in backtrace () from /lib64/libc.so.6
#17 0x009cda3f in errstart (elevel=20, filename=0xd3ba00 "float.c", lineno=839, funcname=0xd3bf3a "float8div", domain=0x0) at elog.c:492
#18 0x00921a84 in float8div (fcinfo=0x7ffd04d2b8b0) at float.c:836
#19 0x00722fe5 in ExecMakeFunctionResult (fcache=0x324a088, econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030", isDone=0x7ffd04d2bd04) at execQual.c:1762
#20 0x00723d87 in ExecEvalOper (fcache=0x324a088, econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030", isDone=0x7ffd04d2bd04) at execQual.c:2250
#21 0x00722451 in ExecEvalFuncArgs (fcinfo=0x7ffd04d2bda0, argList=0x324b378, econtext=0x32495d8) at execQual.c:1317
#22 0x00722a68 in ExecMakeFunctionResult (fcache=0x3249850, econtext=0x32495d8, isNull=0x7ffd04d2c5c1 "\306\322\004\375\177", isDone=0x0) at execQual.c:1532
#23 0x00723d1e in ExecEvalFunc (fcache=0x3249850, econtext=0x32495d8, isNull=0x7ffd04d2c5c1 "\306\322\004\375\177", isDone=0x0) at execQual.c:2228
#24 0x0076eed2 in initFcinfo (wrxstate=0x31b8fe0, fcinfo=0x7ffd04d2c280, funcstate=0x7f83c7412318, econtext=0x32495d8, check_nulls=1 '\001') at nodeWindow.c:3201
#25 0x0076efa4 in add_tuple_to_trans (funcstate=0x7f83c7412318, wstate=0x3248ab8, econtext=0x32495d8, check_nulls=1 '\001') at nodeWindow.c:3223
#26 0x00772f72 in processTupleSlot (wstate=0x3248ab8, slot=0x31ac150, last_peer=0 '\000') at nodeWindow.c:5105
#27 0x00772760 in ExecWindow (wstate=0x3248ab8) at nodeWindow.c:4821
---Type <return> to continue, or q <return> to quit---
#28 0x0071eda7 in ExecProcNode (node=0x3248ab8) at execProcnode.c:1007
#29 0x0075aded in NextInputSlot (node=0x31af928) at nodeResult.c:95
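
Reading the stack from the bottom up: float8div reports the division-by-zero error through errstart, which calls backtrace() and is inside pthread_mutex_lock in libgcc (frames #17 to #12) when the cancel signal arrives (frame #11). The handler then runs ProcessInterrupts, which calls errstart and backtrace() again (frames #10 to #7) and waits on what appears to be the same lock (frames #3 to #0). The interrupted code cannot release the lock until the handler returns, so the process waits on itself.

One common way to avoid this kind of re-entry is to have the signal handler only record the request while error reporting is in progress, and to act on it afterwards. The C sketch below illustrates that pattern under stated assumptions; it is not the actual HAWQ fix. All names in it (cancel_pending, holdoff_count, cancels_processed, process_interrupts, cancel_handler, report_error, run_demo) are hypothetical, and raise(SIGINT) stands in for a cancel request arriving at the worst possible moment.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical state; these are not HAWQ's actual variables. */
static volatile sig_atomic_t cancel_pending = 0;    /* set by the handler */
static volatile sig_atomic_t holdoff_count = 0;     /* > 0: do not act on interrupts */
static volatile sig_atomic_t cancels_processed = 0; /* for the demo only */

/* Act on a pending cancel, but only when no hold-off is in effect. */
static void process_interrupts(void)
{
    if (holdoff_count > 0 || !cancel_pending)
        return;
    cancel_pending = 0;
    cancels_processed++;    /* real code would raise the "canceled" error here */
}

/* Signal handler: always records the request, acts on it only if safe. */
static void cancel_handler(int signo)
{
    (void) signo;
    cancel_pending = 1;
    process_interrupts();   /* returns immediately while holdoff_count > 0 */
}

/* Stand-in for error reporting that calls lock-taking code such as backtrace(). */
static void report_error(void)
{
    int before = cancels_processed;

    holdoff_count++;        /* enter the section that must not be re-entered */
    raise(SIGINT);          /* simulate a cancel arriving mid-report */
    assert(cancels_processed == before);  /* handler deferred the cancel */
    holdoff_count--;        /* leave the section */

    process_interrupts();   /* now safe: handle the deferred cancel */
}

/* Installs the handler, runs one error report, returns cancels handled. */
int run_demo(void)
{
    signal(SIGINT, cancel_handler);
    report_error();
    return cancels_processed;
}
```

Calling run_demo() once should return 1: the cancel is delivered during report_error, deferred by the hold-off counter, and handled exactly once after the counter drops back to zero. PostgreSQL-derived code has a similar facility in its HOLD_INTERRUPTS() and RESUME_INTERRUPTS() macros; whether the eventual fix for this issue uses them is not stated in this message.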