[ https://issues.apache.org/jira/browse/IMPALA-11814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695847#comment-17695847 ]
Michael Smith commented on IMPALA-11814:
----------------------------------------

The most likely candidates here are CPU starvation, network starvation, or lock contention. You can use something like pidstat (e.g. https://access.redhat.com/solutions/69271) to look at CPU context switching, and top for high CPU activity. Something like iftop could be a starting point for looking at network saturation. If neither is maxed out, there are a number of tools for analyzing lock contention (example invocations are sketched after this list):
- https://docs.kernel.org/locking/lockstat.html - usually requires recompiling the kernel to add CONFIG_LOCK_STAT
- https://github.com/isc-projects/mutrace
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/systemtap_beginners_guide/futexcontentionsect
- perf sampling, looking for repeated locks
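As a rough starting point, invocations along these lines cover the CPU and network checks plus the perf-sampling option (a sketch, not a recipe: it assumes sysstat, iftop, and perf are installed, and <impalad-pid> and eth0 are placeholders for the actual impalad process ID and data-network interface):

{code:bash}
# Task switching for the impalad process every 5s; high nvcswch/s
# (involuntary context switches) is a hint of CPU starvation.
pidstat -w -p <impalad-pid> 5

# Per-thread CPU usage, to spot individual threads pinned at 100%.
pidstat -u -t -p <impalad-pid> 5

# Per-connection bandwidth on the data interface, to check for saturation.
iftop -i eth0

# Sample stacks for 30s; many samples in pthread_mutex_lock/futex paths
# in the report point at lock contention.
perf record -g -p <impalad-pid> -- sleep 30
perf report
{code}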
> TransmitData RPC become slow when multiple large queries are executed simultaneously
> -------------------------------------------------------------------------------------
>
>                 Key: IMPALA-11814
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11814
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.0.0
>            Reporter: YifanZhang
>            Priority: Major
>              Labels: rpc
>
> We have seen a lot of slow TransmitData RPCs in our clusters when multiple large queries are running. Logs in executors:
> {code:java}
> I1227 16:44:24.859488  1134 rpcz_store.cc:269] Call impala.DataStreamService.TransmitData from 11.174.152.179:53824 (request call id 10458) took 368799ms. Request Metrics: {}
> I1227 16:44:24.859828  1134 rpcz_store.cc:273] Trace:
> 1227 16:38:16.060450 (+     0us) impala-service-pool.cc:179] Inserting onto call queue
> 1227 16:38:16.060480 (+    30us) impala-service-pool.cc:278] Handling call
> 1227 16:38:16.060498 (+    18us) krpc-data-stream-recvr.cc:325] Enqueuing deferred RPC
> 1227 16:44:24.859372 (+368798874us) krpc-data-stream-recvr.cc:504] Processing deferred RPC
> 1227 16:44:24.859382 (+    10us) krpc-data-stream-recvr.cc:397] Deserializing batch
> 1227 16:44:24.859442 (+    60us) krpc-data-stream-recvr.cc:424] Enqueuing deserialized batch
> 1227 16:44:24.859451 (+     9us) inbound_call.cc:162] Queueing success response
> Metrics: {}{code}
> The RPC spends a lot of time waiting in the queue. Some other executors also take a long time to transmit data:
> {code:java}
> I1227 16:44:23.379583   528 krpc-data-stream-sender.cc:396] Slow TransmitData RPC to 11.136.184.52:50257 (fragment_instance_id=f9477dcd54c9d817:1bc346de000001e7): took 5m37s. Receiver time: 5m37s Network time: 1.343ms
> I1227 16:44:23.396365   631 krpc-data-stream-sender.cc:396] Slow TransmitData RPC to 11.174.158.152:58535 (fragment_instance_id=f9477dcd54c9d817:1bc346de00000231): took 5m57s. Receiver time: 5m57s Network time: 2.970ms
> I1227 16:44:23.396482  4654 krpc-data-stream-sender.cc:430] f9477dcd54c9d817:1bc346de000001b6] Long delay waiting for RPC to 11.174.158.152:58535 (fragment_instance_id=f9477dcd54c9d817:1bc346de00000231): took 5m57s
> I1227 16:44:23.444633   572 krpc-data-stream-sender.cc:396] Slow TransmitData RPC to 11.174.153.240:60303 (fragment_instance_id=f9477dcd54c9d817:1bc346de000002c3): took 5m53s. Receiver time: 5m53s Network time: 17.604ms{code}
> The max handler latency is also very long:
> |*Method: _TransmitData_*|
> |Handler Latency|Count: 2321637, min / max: 3.000us / 5m10s, 25th %-ile: 22.000us, 50th %-ile: 38.000us, 75th %-ile: 55.000us, 90th %-ile: 72.000us, 95th %-ile: 86.000us, 99.9th %-ile: 4.512ms|