@David Alves

Thanks for your reply, David Alves!

This is my environment.

1. 9 nodes, each with 40 cores and a 3.3 TB SSD.
2. The most common SQL is 'SELECT d1, SUM(m1) FROM table WHERE d1 IN
('some1', 'some2', 'some3', 'some4' ...)'.
3. 'd1' uses LZ4 compression with plain encoding.
4. I already ran COMPUTE STATS in Impala, but it does not seem to help
performance much, since I only filter and aggregate rather than joining
multiple tables. (A sketch of the exact statements is after this list.)
5. About 600 tablets in total. (We are still deciding the best partition
count through performance testing, so there are quite a few tables on
the Kudu servers.)
6. Concurrent query load is 40~100 QPS against the Impala HS2 endpoint.
7. Kudu and Impala run on the same nodes.
8. The maintenance thread count is 8. The ingestion job runs from an
external YARN cluster, and the I/O budget is 1 GB.
Compaction jobs run during the performance test. (This is by design; in
production, the ingestion job and the scan workload will ultimately run
independently.)
9. 'top -d 1 -H -p $(pgrep impalad)' shows Impala idle most of the
time; CPU usage is below 3%.
10. The stress-test target table's tablet distribution per node is as follows:

node1 : 12
node2 : 11
node3 : 12
node4 : 13
node5 : 16
node6 : 14
node7 : 14
node8 : 13
node9 : 15

11. One acceptor thread (--rpc_num_acceptors_per_address=1)
12. Service queue length of 50 (--rpc_service_queue_length=50)
13. 20 service threads (--rpc_num_service_threads=20)
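For reference, this is roughly what I ran for items 2 and 4 above (a
minimal sketch; 'my_table' stands in for the real table name, and I am
assuming the real query also has a GROUP BY d1, which I added here so
the aggregate is valid Impala SQL):

    -- item 4: collect table/column statistics for Impala's planner
    COMPUTE STATS my_table;

    -- item 2: the most common query shape during the stress test
    SELECT d1, SUM(m1)
    FROM my_table
    WHERE d1 IN ('some1', 'some2', 'some3', 'some4')
    GROUP BY d1;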




I found a few things during testing.

1. Impala is idle most of the time.
2. Query execution time on a few of the 9 nodes is the bottleneck.

SQL -> SELECT d1, SUM(m1) FROM table WHERE d1 in ('some1', 'some2', 'some3')

node1 -> 702ms
node2 -> 4,777ms          *
node3 -> 6,624ms          *
node4 -> 731ms
node5 -> 688ms
node6 -> 910ms
node7 -> 16,667ms        *
node8 -> 17,960ms        *
node9 -> 655ms

As you can see, some nodes are clearly the bottleneck, and the
bottleneck server changes each time a single query is issued.

3. Execution times for `in ('some')` and `in ('some1', 'some2')` are
quite different.
A single-value filter is about 3x faster than a filter with more than
one value.

4. I also tried 'SET NUM_SCANNER_THREADS=10', but that option is an
upper bound, not a lower bound: it sets the maximum number of scanner
threads, not the minimum.
Execution times with 'NUM_SCANNER_THREADS=1' and
'NUM_SCANNER_THREADS=5' are almost the same, and some nodes still lag
in that situation.
So I think Impala's scanner threads are not the bottleneck. (A sketch
of the experiment is after this list.)
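This is roughly the experiment behind finding 4 (a minimal sketch;
'my_table' is a placeholder, the GROUP BY is my addition so the
aggregate is valid, and I assume both runs happen in one impala-shell
session so the SET applies):

    -- cap Impala's Kudu scanner threads for this session (upper bound only)
    SET NUM_SCANNER_THREADS=1;
    SELECT d1, SUM(m1) FROM my_table
    WHERE d1 IN ('some1', 'some2', 'some3') GROUP BY d1;

    -- raise the cap; measured execution time was almost identical
    SET NUM_SCANNER_THREADS=5;
    SELECT d1, SUM(m1) FROM my_table
    WHERE d1 IN ('some1', 'some2', 'some3') GROUP BY d1;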



So I concluded that the ingestion job and the concurrent query workload
against Kudu are the bottleneck.
A Spark job in the external YARN cluster is sending insert RPC
requests while multiple impalad processes are sending scan RPC requests
to Kudu.


This is the reasoning that led me to decide to improve Kudu's RPC
throughput and scan throughput (the relevant flags, as currently set,
are sketched below).
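For reference, these are the current kudu-tserver RPC flags from the
thread below; my (possibly wrong) assumption is that the service-thread
count is the knob most relevant to RPC throughput if that path really
is the limit:

    # current values, as posted earlier in this thread
    --rpc_num_acceptors_per_address=1
    --rpc_num_service_threads=20     # the knob I am considering raising
    --rpc_service_queue_length=50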


Also, why do you think the execution-time difference between `in
('some')` and `in ('some1', 'some2')` is so big?
Would it be better to run the query twice, with `in ('some1')` and then
`in ('some2')`, instead of using `in ('some1', 'some2')`? (Sketch below.)
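To make the question concrete, these are the two alternatives I am
comparing (a minimal sketch; 'my_table' is a placeholder and the GROUP
BY is my addition):

    -- option A: one query with a two-value IN list
    SELECT d1, SUM(m1) FROM my_table
    WHERE d1 IN ('some1', 'some2') GROUP BY d1;

    -- option B: two single-value queries, results combined on the client side
    SELECT d1, SUM(m1) FROM my_table
    WHERE d1 IN ('some1') GROUP BY d1;
    SELECT d1, SUM(m1) FROM my_table
    WHERE d1 IN ('some2') GROUP BY d1;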






@Todd Lipcon

Thanks for your reply, Todd Lipcon!

Below is what I understood. Could you please check my understanding?

Several services are registered with the reactor (as event callbacks).

RPC (Insert, Scan, etc.) -> acceptor -> packet bound to a reactor
thread -> service's callback signaled -> service does its work (Insert, Scan)

And the top command shows that most of the CPU usage comes from the
maintenance threads (8 threads).




2017-04-29 5:48 GMT+09:00 Todd Lipcon <t...@cloudera.com>:
> To clarify one bit - the acceptor thread is the thread calling accept() on
> the listening TCP socket. Once accepted, the RPC system uses libev
> (event-based IO) to react to new packets on a "reactor thread". When a full
> RPC request is received, it is distributed to the "service threads".
>
> I'd also suggest running 'top -H -p $(pgrep kudu-tserver)' to see the thread
> activity during the workload. You can see if one of the reactor threads is
> hitting 100% CPU, for example, though I've never seen that to be a
> bottleneck. David's pointers are probably good places to start
> investigating.
>
> -Todd
>
> On Fri, Apr 28, 2017 at 1:41 PM, David Alves <davidral...@gmail.com> wrote:
>>
>> Hi
>>
>>   The acceptor thread only distributes work, it's very unlikely that is a
>> bottleneck. Same goes for the number of workers, since the number of threads
>> pulling data is defined by impala.
>>   What is "extremely" slow in this case?
>>
>>   Some things to check:
>>   It seems like this is scanning only 5 tablets? Are those all the tablets
>> in per ts? Do tablets have roughly the same size?
>>   Are you using encoding/compression?
>>   How much data per tablet?
>>   Have you ran "compute stats" on impala?
>>
>> Best
>> David
>>
>>
>>
>> On Fri, Apr 28, 2017 at 9:07 AM, 기준 <0ctopus13pr...@gmail.com> wrote:
>>>
>>> Hi!
>>>
>>> I'm using kudu 1.3, impala 2.7.
>>>
>>> I'm investigating about extreamly slow scan read in impala's profiling.
>>>
>>> So i digged source impala, kudu's source code.
>>>
>>> And i concluded this as a connection throughput problem.
>>>
>>> As i found out, impala use below steps to send scan request to kudu.
>>>
>>> 1. RunScannerThread -> Create new scan threads
>>> 2. ProcessScanToken -> Open
>>> 3. KuduScanner:GetNext
>>> 4. Send Scan RPC -> Send scan rpc continuously
>>>
>>> So i checked kudu's rpc configurations.
>>>
>>> --rpc_num_acceptors_per_address=1
>>> --rpc_num_service_threads=20
>>> --rpc_service_queue_length=50
>>>
>>>
>>> Here are my questions.
>>>
>>> 1. Does acceptor accept all rpc requests and toss those to proper
>>> service?
>>> So, Scan rpc -> Acceptor -> RpcService?
>>>
>>> 2. If i want to increase input throughput then should i increase
>>> '--rpc_num_service_threads' right?
>>>
>>> 3. Why '--rpc_num_acceptors_per_address' has so small value compared
>>> to --rpc_num_service_threads? Because I'm going to increase that value
>>> too, do you think this is a bad idea? if so can you plz describe
>>> reason?
>>>
>>> Thanks for replying me!
>>>
>>> Have a nice day~ :)
>>
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
