[ 
https://issues.apache.org/jira/browse/HAWQ-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206036#comment-15206036
 ] 

Ming LI commented on HAWQ-575:
------------------------------

The log is below:

2016-03-19 23:44:54.621653 
PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p1372,th-1217652544,"172.28.8.250","15627",2016-03-19
 22:52:05 
PDT,1172392,con92730,cmd2,seg97,slice5,,x1172392,sx1,"ERROR","58030","could not 
read from temporary file: Input/output error",,,,,,"select ...

2016-03-19 23:44:54.649381 
PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p797594,th-1217652544,"172.28.8.250","14688",2016-03-19
 22:48:14 
PDT,1172286,con92501,cmd2,seg101,slice7,,x1172286,sx1,"FATAL","08006","connection
 to client lost",,,,,,,0,,"postgres.c",3512,
2016-03-19 23:44:54.675656 
PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p2466,th-1217652544,"172.28.8.250","18226",2016-03-19
 22:53:40 
PDT,1172688,con92825,cmd2,seg97,slice5,,x1172688,sx1,"ERROR","58030","could not 
read from temporary file: Input/output error",,,,,,"select
 nation,
 o_year,
 sum(amount) as sum_profit
from
 (
 select
 n_name as nation,
 extract(year from o_orderdate) as o_year,
 l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
 from
 part,
 supplier,
 lineitem,
 partsupp,
 orders,
 nation
 where
 s_suppkey = l_suppkey
 and ps_suppkey = l_suppkey
 and ps_partkey = l_partkey
 and p_partkey = l_partkey
 and o_orderkey = l_orderkey
 and s_nationkey = n_nationkey
 and p_name like '%aquamarine%'
 ) as profit
group by
 nation,
 o_year
order by
 nation,
 o_year desc;",0,,"compress_nothing.c",61,
2016-03-19 23:44:54.683709 
PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p2466,th-1217652544,"172.28.8.250","18226",2016-03-19
 22:53:40 
PDT,1172688,con92825,cmd2,seg97,slice5,,x1172688,sx1,"ERROR","58030","could not 
close temporary file 
/data21/tmp/pgsql_tmp/workfile_set_HashJoin_Slice5.XXXXSzweO6/spillfile_f95: 
Input/output error",,,,,,,0,,"bfz.c",466,
2016-03-19 23:44:54.689898 
PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p2466,th-1217652544,"172.28.8.250","18226",2016-03-19
 22:53:40 
PDT,1172688,con92825,cmd2,seg97,slice5,,x1172688,sx1,"WARNING","58030","could 
not close temporary file 
/data21/tmp/pgsql_tmp/workfile_set_HashJoin_Slice5.XXXXSzweO6/spillfile_f123: 
Input/output error",,,,,,,0,,"bfz.c",466,

2016-03-19 23:45:08.582441 
PDT,"gpadmin","tpch_row_200gpn_quicklz1_part_random_gpadmin",p2466,th-1217652544,"172.28.8.250","18226",2016-03-19
 22:53:40 
PDT,1172688,con92825,cmd2,seg97,slice5,,x1172688,sx1,"PANIC","XX000","Resume 
interrupt holdoff count is bad (0) (xact.c:2907)",,,,,,,0,,"xact.c",2907,"Stack 
trace:
1    0x871f7f postgres <symbol not found> + 0x871f7f
2    0x872659 postgres elog_finish + 0xa9
3    0x4e171b postgres AbortTransaction + 0x7cb
4    0x4e2c45 postgres AbortCurrentTransaction + 0x25
5    0x7b01ea postgres PostgresMain + 0xaba
6    0x763c03 postgres <symbol not found> + 0x763c03
7    0x76435d postgres <symbol not found> + 0x76435d
8    0x76618e postgres PostmasterMain + 0xc7e
9    0x6c028a postgres main + 0x48a
10   0x33d401ed1d libc.so.6 __libc_start_main + 0xfd
11   0x4a17e9 postgres <symbol not found> + 0x4a17e9


From the log above, the root cause is:
1) con92730,cmd2,seg97,slice5 reported: could not read from temporary file: 
Input/output error.
2) The transaction is therefore aborted. The master node sends SIGQUIT to all 
processes on the segment so that they quit.
3) con92501,cmd2,seg101,slice7: before processing SIGQUIT, it first detects 
that the connection to the QD is broken, so it reports FATAL.
4) con92825,cmd2,seg97,slice5: why is the error reported twice here? Maybe the 
second error is raised inside AbortTransaction(), which would reset 
InterruptHoldoffCount to 0.
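
To make the hypothesis in point 4 concrete, here is a minimal sketch of the 
suspected sequence. It is a toy model in Python, not HAWQ code: the class and 
method names are hypothetical stand-ins for the HOLD_INTERRUPTS / 
RESUME_INTERRUPTS pair around the abort path, and it assumes (unconfirmed) 
that reporting an ERROR resets the holdoff counter to 0 and that the abort 
then continues.

```python
class HoldoffPanic(Exception):
    """Stands in for the PANIC raised when the counter is not positive."""


class Backend:
    """Toy model of one QE process; all names here are hypothetical."""

    def __init__(self):
        self.interrupt_holdoff_count = 0

    def hold_interrupts(self):
        # Models HOLD_INTERRUPTS(): defer interrupt handling.
        self.interrupt_holdoff_count += 1

    def resume_interrupts(self):
        # Models RESUME_INTERRUPTS(): the counter must be positive here.
        if self.interrupt_holdoff_count <= 0:
            raise HoldoffPanic(
                "Resume interrupt holdoff count is bad (%d)"
                % self.interrupt_holdoff_count
            )
        self.interrupt_holdoff_count -= 1

    def report_error(self):
        # Assumption: ERROR-level recovery resets the counter to 0.
        self.interrupt_holdoff_count = 0

    def abort_transaction(self, cleanup_fails):
        self.hold_interrupts()
        if cleanup_fails:
            # e.g. closing a spill file hits an I/O error during abort
            self.report_error()
        self.resume_interrupts()


def run(cleanup_fails):
    """Return "ok", or the panic message if the counter was clobbered."""
    backend = Backend()
    try:
        backend.abort_transaction(cleanup_fails)
        return "ok"
    except HoldoffPanic as exc:
        return str(exc)
```

Calling run(False) models a clean abort. Calling run(True) models a second 
error during abort and ends in the same "holdoff count is bad (0)" message 
seen in the log, which is consistent with the guess above but does not prove 
it.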

> QE core dumped when report "Resume interrupt holdoff count is bad (0) 
> (xact.c:2907)"
> ------------------------------------------------------------------------------------
>
>                 Key: HAWQ-575
>                 URL: https://issues.apache.org/jira/browse/HAWQ-575
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Ming LI
>            Assignee: Lei Chang
>
> Core was generated by `postgres: port  5532, gpadmin tpch_row_2... 
> 172.28.8.250(18226) con92825 seg97'.
> Program terminated with signal 6, Aborted.
> #0  0x00000033d4032925 in raise () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install 
> hawq-2.0.0.0_beta-20925.x86_64
> (gdb) bt
> #0  0x00000033d4032925 in raise () from /lib64/libc.so.6
> #1  0x00000033d4034105 in abort () from /lib64/libc.so.6
> #2  0x0000000000871c6e in errfinish (dummy=<value optimized out>) at 
> elog.c:682
> #3  0x00000000008727bb in elog_finish (elevel=<value optimized out>, 
> fmt=<value optimized out>) at elog.c:1459
> #4  0x00000000004e171b in AbortTransaction () at xact.c:2907
> #5  0x00000000004e2c45 in AbortCurrentTransaction () at xact.c:3377
> #6  0x00000000007b01ea in PostgresMain (argc=37474312, argv=0x0, 
> username=<value optimized out>) at postgres.c:4507
> #7  0x0000000000763c03 in BackendRun (port=0x2373210) at postmaster.c:5889
> #8  BackendStartup (port=0x2373210) at postmaster.c:5484
> #9  0x000000000076435d in ServerLoop () at postmaster.c:2163
> #10 0x000000000076618e in PostmasterMain (argc=9, argv=0x236a5b0) at 
> postmaster.c:1454
> #11 0x00000000006c028a in main (argc=9, argv=0x236a570) at main.c:226
> (gdb) f 3
> #3  0x00000000008727bb in elog_finish (elevel=<value optimized out>, 
> fmt=<value optimized out>) at elog.c:1459
> (gdb) p *edata
> $1 = {elevel = 22, output_to_server = 1 '\001', output_to_client = 1 '\001', 
> show_funcname = 0 '\000', omit_location = 0 '\000', fatal_return = 0 '\000',
>   hide_stmt = 0 '\000', send_alert = 1 '\001', filename = 0x9cc38e "xact.c", 
> lineno = 2907, funcname = 0x9c66c0 "AbortTransaction",
>   domain = 0xafb668 "postgres-8.2", sqlerrcode = 2600, message = 0x236da50 
> "Resume interrupt holdoff count is bad (0) (xact.c:2907)", detail = 0x0,
>   detail_log = 0x0, hint = 0x0, context = 0x0, cursorpos = 0, internalpos = 
> 0, internalquery = 0x0, saved_errno = 5, stacktracearray = {0x871f7f, 
> 0x872659,
>     0x4e171b, 0x4e2c45, 0x7b01ea, 0x763c03, 0x76435d, 0x76618e, 0x6c028a, 
> 0x33d401ed1d, 0x4a17e9, 0x0 <repeats 19 times>}, stacktracesize = 11,
>   printstack = 0 '\000'}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
