[ 
https://issues.apache.org/jira/browse/IMPALA-12439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825675#comment-17825675
 ] 

Evgeniy commented on IMPALA-12439:
----------------------------------

It seems that the Impala daemon hangs on an RPC. We see many in-flight 
"CancelQueryFInstances" calls on inbound connections, with elapsed times far 
beyond their 10-second timeout. For example:

{
    "remote_ip": "ip.ip.ip.ip:44876",
    "num_calls_in_flight": 58,
    "socket_status": {
        "rtt": 2442,
        "rttvar": 4604,
        "snd_cwnd": 10,
        "total_retrans": 0,
        "pacing_rate": 11957411,
        "max_pacing_rate": 18446744073709551615,
        "bytes_acked": 12566319,
        "bytes_received": 10993660785,
        "segs_out": 901401,
        "segs_in": 7990372,
        "send_queue_bytes": 0,
        "receive_queue_bytes": 0,
        "send_bytes_per_sec": 5978705
    },
    "calls_in_flight": [
        {
            "header": {
                "call_id": 169977,
                "remote_method": {
                    "service_name": "impala.Control Service",
                    "method_name": "CancelQueryFInstances"
                },
                "timeout_millis": 10000
            },
            "micros_elapsed": 3175972030
        },
        {
            "header": {
                "call_id": 169975,
                "remote_method": {
                    "service_name": "impala.Control Service",
                    "method_name": "CancelQueryFInstances"
                },
                "timeout_millis": 10000
            },
            "micros_elapsed": 3185965057
        },....
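
To make the mismatch between "timeout_millis" and "micros_elapsed" easier to 
spot across many connections, a small script along these lines could be used. 
It is only a sketch: the field names are taken from the output above, the 
helper name find_overdue_calls is hypothetical, and it assumes each connection 
entry can be loaded as a standalone JSON object.

```python
import json


def find_overdue_calls(connection: dict) -> list:
    """Return in-flight calls whose elapsed time exceeds their timeout."""
    overdue = []
    for call in connection.get("calls_in_flight", []):
        header = call.get("header", {})
        timeout_ms = header.get("timeout_millis")
        elapsed_us = call.get("micros_elapsed")
        if timeout_ms is None or elapsed_us is None:
            continue  # cannot compare without both values
        # micros_elapsed is in microseconds, timeout_millis in milliseconds
        if elapsed_us / 1000 > timeout_ms:
            overdue.append({
                "call_id": header.get("call_id"),
                "method": header.get("remote_method", {}).get("method_name"),
                "elapsed_s": elapsed_us / 1_000_000,
                "timeout_s": timeout_ms / 1000,
            })
    return overdue


# Trimmed copy of the connection entry shown above (two calls only).
sample = json.loads("""
{
    "remote_ip": "ip.ip.ip.ip:44876",
    "num_calls_in_flight": 58,
    "calls_in_flight": [
        {"header": {"call_id": 169977,
                    "remote_method": {"method_name": "CancelQueryFInstances"},
                    "timeout_millis": 10000},
         "micros_elapsed": 3175972030},
        {"header": {"call_id": 169975,
                    "remote_method": {"method_name": "CancelQueryFInstances"},
                    "timeout_millis": 10000},
         "micros_elapsed": 3185965057}
    ]
}
""")

for item in find_overdue_calls(sample):
    print(f"{item['method']} call {item['call_id']}: "
          f"{item['elapsed_s']:.0f}s elapsed, timeout {item['timeout_s']:.0f}s")
```

For the two calls shown, the elapsed time is roughly 3176 s and 3186 s (about 
53 minutes) against a 10 s timeout, which is consistent with the handler 
threads being blocked rather than merely slow.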

> Impala Daemon stucks on random executors
> ----------------------------------------
>
>                 Key: IMPALA-12439
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12439
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 3.4.0
>            Reporter: Evgeniy
>            Priority: Critical
>         Attachments: resolved_420a96bf.txt, resolved_d7750c55.txt
>
>
> Hi!
> In our cluster we periodically hit the following problem: 
> 1. A query fails with an error like "Exec() rpc failed: Timed out: 
> ExecQueryFInstances RPC to <node_ip>:27000 timed out after 300.000s". The 
> affected node may be different each time the problem appears.
> 2. We have analyzed minidumps of the Impala daemon from two different cases 
> (the resolved minidumps are attached). It seems that the Impala daemon is 
> stuck cancelling a query fragment:
> Thread 244
>  0  libpthread-2.17.so + 0xba35
>     rax = 0xfffffffffffffe00   rdx = 0x0000000000000002
>     rcx = 0xffffffffffffffff   rbx = 0x000000007cd81b10
>     rsi = 0x0000000000000080   rdi = 0x000000007cd81b14
>     rbp = 0x00007f7ba5ae8580   rsp = 0x00007f7ba5ae8520
>      r8 = 0x000000007cd81b00    r9 = 0x0000000000000000
>     r10 = 0x0000000000000000   r11 = 0x0000000000000246
>     r12 = 0x00000000eafe6400   r13 = 0x00007f7ba5ae85c0
>     r14 = 0x00007f845b7287d0   r15 = 0x00007f7ba5ae8660
>     rip = 0x00007f845b727a35
>     Found by: given as instruction pointer in context
>  1  impalad!impala::QueryState::Cancel() + 0xdb
>     rbp = 0x00007f7ba5ae8600   rsp = 0x00007f7ba5ae8590
>     rip = 0x00000000011791bb
>     Found by: previous frame's frame pointer
>  2  
> impalad!impala::ControlService::CancelQueryFInstances(impala::CancelQueryFInstancesRequestPB
>  const*, impala::CancelQueryFInstancesResponsePB*, kudu::rpc::RpcContext*) + 
> 0x177
>     rbx = 0x00007f8458e136a0   rbp = 0x00007f7ba5ae8780
>     rsp = 0x00007f7ba5ae8610   r12 = 0x00007f7ba5ae8720
>     r13 = 0x00007f7ba5ae86a0   rip = 0x0000000001218f77
>     Found by: call frame info
>  3  impalad!kudu::rpc::GeneratedServiceIf::Handle(kudu::rpc::InboundCall*) + 
> 0x17c
>     rbx = 0x0000000015e4e460   rbp = 0x00007f7ba5ae87e0
>     rsp = 0x00007f7ba5ae8790   r12 = 0x00000007a6bf8ee0
>     r13 = 0x0000000014f86740   r14 = 0x0000000014f86f00
>     r15 = 0x0000000014f87480   rip = 0x0000000001788ffc
>     Found by: call frame info
>  4  impalad!impala::ImpalaServicePool::RunThread() + 0x1be
>     rbx = 0x00007f840000000d   rbp = 0x00007f7ba5ae88a0
>     rsp = 0x00007f7ba5ae87f0   r12 = 0x0000000018b30f80
>     r13 = 0x0000000000000000   r14 = 0x0000000000000051
>     r15 = 0x00007f840000000d   rip = 0x00000000010dbdee
>     Found by: call frame info
>  5  impalad!impala::Thread::SuperviseThread(std::string const&, std::string 
> const&, boost::function<void ()>, impala::ThreadDebugInfo const*, 
> impala::Promise<long, (impala::PromiseMode)0>*) + 0x30b
>     rbx = 0x00007f7ba5ae8970   rbp = 0x00007f7ba5ae8be0
>     rsp = 0x00007f7ba5ae88b0   r12 = 0x00007ffed2cdb298
>     r13 = 0x000000000592ee20   r14 = 0x00007f7ba5ae8910
>     r15 = 0x00007f8458e136a0   rip = 0x0000000001435f8b
>     Found by: call frame info
>  6  impalad!boost::detail::thread_data<boost::_bi::bind_t<void, void 
> (*)(std::string const&, std::string const&, boost::function<void ()>, 
> impala::ThreadDebugInfo const*, impala::Promise<long, 
> (impala::PromiseMode)0>*), boost::_bi::list5<boost::_bi::value<std::string>, 
> boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, 
> boost::_bi::value<impala::ThreadDebugInfo*>, 
> boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > > 
> >::run() + 0x7a
>     rbx = 0x0000000015e34e00   rbp = 0x00007f7ba5ae8c40
>     rsp = 0x00007f7ba5ae8bf0   r12 = 0x00007f7ba5ae8c00
>     r13 = 0x0000000001435c80   r14 = 0x0000000000000000
>     r15 = 0x00007f7ba5ae9700   rip = 0x0000000001436e5a
>     Found by: call frame info
>  7  impalad!thread_proxy + 0xea
>     rbx = 0x0000000015e34e00   rbp = 0x0000000000000000
>     rsp = 0x00007f7ba5ae8c50   r12 = 0x00007f7ba5ae8c50
>     r13 = 0x0000000000801000   r14 = 0x0000000000000000
>     r15 = 0x00007f7ba5ae9700   rip = 0x0000000001c18e1a
>     Found by: call frame info
>  8  libpthread-2.17.so + 0x7ea5
>     rbx = 0x0000000000000000   rbp = 0x0000000000000000
>     rsp = 0x00007f7ba5ae8ca0   r12 = 0x0000000000000000
>     r13 = 0x0000000000801000   r14 = 0x0000000000000000
>     r15 = 0x00007f7ba5ae9700   rip = 0x00007f845b723ea5
>     Found by: call frame info
>  9  libc-2.17.so + 0xfeb0d
>     rsp = 0x00007f7ba5ae8d40   rip = 0x00007f8458321b0d
>     Found by: stack scanning



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

