[jira] [Commented] (HAWQ-1324) Query cancel cause segment to go into Crash recovery

2017-11-02 Thread Radar Lei (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235239#comment-16235239
 ] 

Radar Lei commented on HAWQ-1324:
-

Reopening to update the JIRA information.

> Query cancel cause segment to go into Crash recovery
> 
>
> Key: HAWQ-1324
> URL: https://issues.apache.org/jira/browse/HAWQ-1324
> Project: Apache HAWQ
> Issue Type: Bug
> Components: Query Execution
> Affects Versions: 2.0.0.0-incubating
> Reporter: Ming LI
> Assignee: Ming LI
> Priority: Major
> Fix For: 2.1.0.0-incubating
>
>
> A query was cancelled because of a connection issue to HDFS on Isilon. Seg26 
> then went into crash recovery because an INSERT query was cancelled. What 
> is the expected behaviour when HDFS becomes unavailable and a query 
> fails as a result?
> A core file was generated at the time of the crash recovery. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HAWQ-1324) Query cancel cause segment to go into Crash recovery

2017-02-13 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863267#comment-15863267
 ] 

Ming LI commented on HAWQ-1324:
---

Hi all,

The root cause is that backtrace() is not a safe function to call from a signal 
handler. 

It is similar to the problem below:
http://stackoverflow.com/questions/6371028/what-makes-backtrace-crashsigsegv-on-linux-64-bit
{code}
The documentation for signal handling 
(http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html) 
defines the list of safe functions to call from a signal handler, you must not 
use any other functions, including backtrace. (search for async-signal-safe in 
that document)
{code}

So the fix should be similar to 
https://issues.apache.org/jira/browse/HAWQ-978. 

I need more time to verify the fix. Thanks.
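
To illustrate the constraint described above, here is a minimal sketch of a 
handler that restricts itself to async-signal-safe operations. It is not the 
HAWQ patch: the names safe_crash_handler, install_safe_handler and 
crash_signal_seen are hypothetical, and the fixed message is only an example. 
The handler records the signal number in a sig_atomic_t flag and reports it 
with write(2), which is on the POSIX async-signal-safe list, instead of 
calling backtrace(), malloc() or printf().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler. volatile sig_atomic_t is the object type that is
 * guaranteed safe to write from inside a signal handler. */
volatile sig_atomic_t crash_signal_seen = 0;

/* Hypothetical handler: only async-signal-safe work is done here.
 * No backtrace(), no malloc(), no stdio. */
void safe_crash_handler(int signo)
{
    static const char msg[] = "fatal signal received\n";
    ssize_t rc;

    crash_signal_seen = signo;
    rc = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void) rc;  /* nothing safe can be done about a failed write here */
}

/* Install the handler for one signal; returns 0 on success, -1 on error. */
int install_safe_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = safe_crash_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

A caller would run install_safe_handler(SIGUSR1) once at startup and collect 
any richer diagnostics later, from normal (non-signal) context, after seeing 
that crash_signal_seen is non-zero.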

[jira] [Commented] (HAWQ-1324) Query cancel cause segment to go into Crash recovery

2017-02-13 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863274#comment-15863274
 ] 

Ming LI commented on HAWQ-1324:
---

The complete fix for this defect should be similar to what PostgreSQL 9.6 does:
1) StatementCancelHandler() must not call ProcessInterrupts() directly, because 
that cascades into calls to unsafe functions. A signal handler should contain 
only simple logic (e.g. setting flag variables), and ProcessInterrupts() should 
be called at the points in the code where the signal can safely be handled.
2) Porting all the related fixes from PostgreSQL to HAWQ would be a very complex 
task. For now we offer a fix that does not completely eliminate this kind of 
crash but reduces its likelihood. 
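
The pattern in point 1 can be sketched as follows. This is a simplified 
illustration under assumptions, not PostgreSQL or HAWQ code: the demo_* 
functions and the cancel_pending flag are hypothetical stand-ins for 
StatementCancelHandler() and ProcessInterrupts(), and SIGUSR1 is raised from 
inside the loop only to simulate a cancel request arriving mid-query.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, consumed by the main execution path. */
volatile sig_atomic_t cancel_pending = 0;

/* Handler: records the request and returns. No logging, no memory
 * allocation and no cleanup work happens in signal context. */
void demo_cancel_handler(int signo)
{
    (void) signo;
    cancel_pending = 1;
}

/* Runs in normal (non-signal) context, so it may call arbitrary code.
 * Returns 1 if a cancel request was consumed, 0 otherwise. */
int demo_process_interrupts(void)
{
    if (!cancel_pending)
        return 0;
    cancel_pending = 0;
    /* A real backend would abort the current query here. */
    return 1;
}

/* Hypothetical work loop that polls for a pending cancel at a safe point
 * between units of work. Returns the number of units completed. */
int demo_run_query(int total_units, int cancel_at_unit)
{
    int done = 0;

    for (int i = 0; i < total_units; i++) {
        if (i == cancel_at_unit)
            raise(SIGUSR1);      /* simulate a cancel arriving mid-query */
        if (demo_process_interrupts())
            break;               /* safe point: stop cleanly */
        done++;
    }
    return done;
}

/* Install the cancel handler; returns 0 on success, -1 on error. */
int demo_install_cancel_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = demo_cancel_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}
```

The design choice is that the handler does a single flag write, so it cannot 
re-enter non-reentrant code no matter where the signal lands. The cost is that 
cancellation only takes effect at the next polling point.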

Thanks.

> Query cancel cause segment to go into Crash recovery
> 
>
> Key: HAWQ-1324
> URL: https://issues.apache.org/jira/browse/HAWQ-1324
> Project: Apache HAWQ
> Issue Type: Bug
> Reporter: Ming LI
> Assignee: Ed Espino
>
> A query was cancelled because of a connection issue to HDFS on Isilon. Seg26 
> then went into crash recovery because an INSERT query was cancelled. What 
> is the expected behaviour when HDFS becomes unavailable and a query 
> fails as a result?
> Below is the HDFS error
> {code}
> 2017-01-04 03:04:08.382615 
> JST,"carund","dwhrun",p574246,th1862944896,"192.168.10.12","47554",2017-01-04 
> 03:03:08 JST,0,con198952,,seg29,"FATAL","08006","connection to client 
> lost",,,0,,"postgres.c",3518,
> 2017-01-04 03:04:08.420099 
> JST,,,p755778,th18629448960,,,seg-1,"LOG","0","3rd party 
> error log:
> 2017-01-04 03:04:08.419969, p574222, th140507423066240, ERROR Handle 
> Exception: NamenodeImpl.cpp: 670: Unexpected error: status: 
> STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with path=
> ""/hawq_default/16385/16563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240
> @ Hdfs::Internal::UnWrapper<Hdfs::HdfsIOException, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::UnWrapper<Hdfs::UnresolvedLinkException, Hdfs::HdfsIOException, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, Hdfs::Internal::Nothing, 
> Hdfs::Internal::Nothing, Hdfs::Internal::Nothing>::unwrap(char const, int)
> @ Hdfs::Internal::NamenodeImpl::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::NamenodeProxy::fsync(std::string const&, std::string const&)
> @ Hdfs::Internal::OutputStreamImpl::closePipeline()
> @ Hdfs::Internal::OutputStreamImpl::close()
> @ hdfsCloseFile
> @ gpfs_hdfs_closefile
> @ HdfsCloseFile
> @ HdfsFileClose
> @ CleanupTempFiles
> @ AbortTransaction
> @ AbortCurrentTransaction
> @ PostgresMain
> @ BackendStartup
> @ ServerLoop
> @ PostmasterMain
> @ main
> @ Unknown
> @ Unknown""SysLoggerMain","syslogger.c",518,
> 2017-01-04 03:04:08.420272 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"WARNING","58030","could 
> not close file 7 : 
> (hdfs://ffdlakehd.ffwin.fujifilm.co.jp:8020/hawq_default/16385/16563/802748/26) errno 
> 5","Unexpected error: status: STATUS_FILE_NOT_AVAILABLE = 0xC467 Path: 
> hawq_default/16385/16563/802748/26 with 
> path=""/hawq_default/16385/16563/802748/26"", 
> clientname=libhdfs3_client_random_866998528_count_1_pid_574222_tid_140507423066240",,0,,"fd.c",2762,
> {code}
> Segment 26 going into Crash recovery - from seg26 log file
> {code}
> 2017-01-04 03:04:08.420314 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 
> JST,40678725,con198952,cmd4,seg25,,,x40678725,sx1,"LOG","08006","could not 
> send data to client: Connection reset by peer",,,0,,"pqcomm.c",1292,
> 2017-01-04 03:04:08.420358 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 JST,0,con198952,,seg25,"LOG","08006","could not send data to 
> client: Broken pipe",,,0,,"pqcomm.c",1292,
> 2017-01-04 03:04:08.420375 
> JST,"carund","dwhrun",p574222,th1862944896,"192.168.10.12","47550",2017-01-04 
> 03:03:08 JST,0,con198952,,seg25,"FATAL","08006","connection to client 
> lost",,,0,,"postgres.c",3518,
> 2017-01-04 03:04:08.950354 
> JST,,,p755773,th18629448960,,,seg-1,"LOG","0","server process 
> (PID 574240) was terminated by signal 11: