[
https://issues.apache.org/jira/browse/KUDU-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wenzhe Zhou updated KUDU-3582:
------------------------------
Description:
An Impala executor calls the KRPC sidecar API RpcContext::GetInboundSidecar() to
read a serialized Thrift object from KRPC and then deserializes it. (See
GetSidecar() at
https://github.com/apache/impala/blob/master/be/src/rpc/sidecar-util.h#L60-L67)
In a customer-reported case, extra workloads were added to an Impala cluster,
which caused long delays for KRPCs between Impala daemons. The long delays caused
KRPCs to be cancelled, which in turn caused Impala query failures.
{code:java}
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.383988 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.38.160:27000 (fragment_instance_id=9940332ce09828fd:b751966300000632): took 59m57s. Error: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384006 1182322 kudu-status-util.h:55] EndDataStream() to 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384631 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b751966300000735): took 59m57s. Error: Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384668 1182314 kudu-status-util.h:55] EndDataStream() to 10.34.163.32:27000 failed: Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420662 1182313 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.36.21:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033a): took 1h. Error: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420683 1182313 kudu-status-util.h:55] EndDataStream() to 10.243.36.21:27000 failed: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420779 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.38.160:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033b): took 1h. Error: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420799 1182322 kudu-status-util.h:55] EndDataStream() to 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.421937 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b75196630000043e): took 1h. Error: Aborted:
{code}
The extra workloads were then removed and the Impala cluster was restarted.
During the restart, many Impala daemons crashed. The stack traces from the
core files and the log messages show that the Impala daemons received incomplete
data from the KRPC sidecar. The incomplete data did not cause a Thrift
deserialization failure, so the valid but incomplete data was not detected and
handled properly.
See the Impala Jira IMPALA-13107. The issue could not be reproduced locally.
A quick fix on the Impala side was merged to mitigate the crashes. The issue
needs further investigation on the KRPC side.
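
To illustrate how incomplete data can pass deserialization without an error, here is a minimal, self-contained sketch. It does not use the Thrift or KRPC APIs, and it is not the fix merged for IMPALA-13107. The toy wire format, ToyMessage, Decode() and DecodeAndValidate() are all hypothetical names made up for this example. It only assumes a decoder that stops at a terminator byte, so a zero-filled or prematurely terminated buffer decodes "successfully" into an object with unset fields, and the caller has to validate the result.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Toy message with two optional one-byte fields. Hypothetical; this is NOT
// the Thrift wire format, it only mimics "decoding stops at a terminator".
struct ToyMessage {
  std::optional<uint8_t> query_id;  // encoded with tag 1
  std::optional<uint8_t> num_rows;  // encoded with tag 2
};

constexpr uint8_t kStop = 0;  // terminator byte of the toy format

// Decodes (tag, value) byte pairs until the STOP byte. Returns false only for
// malformed input: an unknown tag, a tag without a value, or no STOP byte.
// On success, *consumed is the number of bytes read, including STOP.
bool Decode(const std::vector<uint8_t>& buf, ToyMessage* msg, size_t* consumed) {
  size_t i = 0;
  while (i < buf.size()) {
    const uint8_t tag = buf[i++];
    if (tag == kStop) {
      *consumed = i;
      return true;  // Succeeds even if no field was set.
    }
    if (i >= buf.size()) return false;  // Tag without a value.
    const uint8_t value = buf[i++];
    switch (tag) {
      case 1: msg->query_id = value; break;
      case 2: msg->num_rows = value; break;
      default: return false;  // Unknown tag.
    }
  }
  return false;  // Ran out of bytes before seeing STOP.
}

// Caller-side guard: besides a successful decode, require that the whole
// buffer was consumed and that every field the caller depends on is set.
bool DecodeAndValidate(const std::vector<uint8_t>& buf, ToyMessage* msg) {
  size_t consumed = 0;
  if (!Decode(buf, msg, &consumed)) return false;
  return consumed == buf.size() &&
         msg->query_id.has_value() &&
         msg->num_rows.has_value();
}
```

With this toy format, the buffer {1, 7, 2, 42, 0} passes both functions, while a zero-filled buffer of the same length is accepted by Decode() after a single byte and rejected only by DecodeAndValidate(). Whether the crashes described here followed a similar pattern is not established by this report.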
was:
Impala executor calls KRPC sidecar API RpcContext::GetInboundSidecar() to read
serialized thrift object from KRPC, then do thrift deserialization. (See
GetSidecar() at
https://github.com/apache/impala/blob/master/be/src/rpc/sidecar-util.h#L60-L67)
In a customer reported cases, extra workloads were added to Impala cluster,
which caused long delay for KRPCs between Impala daemons. The long delay caused
KRPCs been cancelled, hence impala query failures.
{code:java}
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
05:40:09.383988 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC
to 10.243.38.160:27000
(fragment_instance_id=9940332ce09828fd:b751966300000632): took 59m57s. Error:
Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state
ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
05:40:09.384006 1182322 kudu-status-util.h:55] EndDataStream() to
10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000
is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
05:40:09.384631 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC
to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b751966300000735):
took 59m57s. Error: Aborted: EndDataStream RPC to 10.34.163.32:27000 is
cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
05:40:09.384668 1182314 kudu-status-util.h:55] EndDataStream() to
10.34.163.32:27000 failed: Aborted: EndDataStream RPC to 10.34.163.32:27000 is
cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
05:40:09.420662 1182313 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC
to 10.243.36.21:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033a):
took 1h. Error: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled
in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
05:40:09.420683 1182313 kudu-status-util.h:55] EndDataStream() to
10.243.36.21:27000 failed: Aborted: EndDataStream RPC to 10.243.36.21:27000 is
cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
05:40:09.420779 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC
to 10.243.38.160:27000
(fragment_instance_id=9940332ce09828fd:b75196630000033b): took 1h. Error:
Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state
ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
05:40:09.420799 1182322 kudu-status-util.h:55] EndDataStream() to
10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000
is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
05:40:09.421937 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC
to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b75196630000043e):
took 1h. Error: Aborted:
{code}
Then extra workloads are removed and Impala cluster was restarted. During
restarting Impala cluster, lots of Impala daemon crashed. The stacktrace of
core files and log messages shows that impala daemons received incomplete data
from KRPC sidecar. The incomplete data did not cause thrift deserialization
failure so the valid but incomplete data was not captured and handled properly.
See impala Jira: IMPALA-13107. The issue could not be re-produced locally.
A quick fixing from Impala side was merged to mitigate the crash issue. Need to
look into this issue further from KRPC internal.
> Incomplete sidecar data returned by RpcContext::GetInboundSidecar()
> -------------------------------------------------------------------
>
> Key: KUDU-3582
> URL: https://issues.apache.org/jira/browse/KUDU-3582
> Project: Kudu
> Issue Type: Bug
> Components: rpc
> Reporter: Wenzhe Zhou
> Priority: Major