Re: Why may "tablet read ahead" take long time? (was: Profile a (batch) scan)

Keith Turner Thu, 17 Jan 2019 07:52:55 -0800

Maxim,

Below are some pointers into the code you asked about.  Let me know if
you have any questions about the code.


The following tablet server code puts a task on a thread pool for a batch scan.

https://github.com/apache/accumulo/blob/rel/1.9.2/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L776

This thread pool is created here.

https://github.com/apache/accumulo/blob/rel/1.9.2/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServerResourceManager.java#L227

The function that creates the thread pool calls the following code to
wrap it.

https://github.com/apache/accumulo/blob/rel/1.9.2/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServerResourceManager.java#L111
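
To make the pattern behind those links concrete (a task handed to a
fixed-size pool, with a wrapper around what gets submitted), here is a
minimal standalone sketch. It is not Accumulo code: the class name, the
Timing type and the timed() wrapper are all hypothetical and exist only for
illustration. It assumes nothing beyond java.util.concurrent, and shows how
a task's queue wait can be measured separately from its run time, which is
one possible reason a span starts late or looks long.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Illustration only, not Accumulo code. All names here are hypothetical. */
class ReadAheadQueueSketch {

  /** How long one task waited in the queue and how long it then ran. */
  static final class Timing {
    final long queuedMillis;
    final long runMillis;

    Timing(long queuedMillis, long runMillis) {
      this.queuedMillis = queuedMillis;
      this.runMillis = runMillis;
    }
  }

  /** Wraps a task at submit time so it reports queue wait apart from run time. */
  static Callable<Timing> timed(Runnable work) {
    final long submitted = System.nanoTime();
    return () -> {
      long started = System.nanoTime(); // a pool thread has picked the task up
      work.run();
      long finished = System.nanoTime();
      return new Timing(
          TimeUnit.NANOSECONDS.toMillis(started - submitted),
          TimeUnit.NANOSECONDS.toMillis(finished - started));
    };
  }

  /** Submits sleeping tasks to a fixed-size pool; returns timings in submit order. */
  static List<Timing> run(int threads, int tasks, long workMillis) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Timing>> futures = new ArrayList<>();
      for (int i = 0; i < tasks; i++) {
        futures.add(pool.submit(timed(() -> sleep(workMillis))));
      }
      List<Timing> timings = new ArrayList<>();
      for (Future<Timing> f : futures) {
        timings.add(f.get());
      }
      return timings;
    } finally {
      pool.shutdown();
    }
  }

  /** Stands in for the real work of a scan (file reads, iterators, and so on). */
  static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) throws Exception {
    // One pool thread, three tasks: later tasks wait for the earlier ones.
    for (Timing t : run(1, 3, 50)) {
      System.out.println("queued=" + t.queuedMillis + "ms run=" + t.runMillis + "ms");
    }
  }
}
```

With one pool thread and three tasks, each later task should report a
queue wait roughly equal to the run time of the tasks ahead of it. With as
many threads as tasks, the queue wait should be close to zero and any
remaining time is spent in the work itself.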

Keith

On Wed, Jan 16, 2019 at 4:09 AM Maxim Kolchin <[email protected]> wrote:
>
> Hi Adam,
>
> Thank you for the list!
>
> In my case, there is only one client which runs a single query which I'm 
> tracing, so I hope the thread pools have free threads.
>
> I'd like to look at the code which runs in the scope of the "table read
> ahead". Do you know where I should look? I tried to search by keywords on
> GitHub, but I wasn't able to find the string "table read ahead" which is
> given to the tracer.
>
> Cheers,
> Maxim
>
> On Tue, Jan 15, 2019 at 9:01 PM Adam Fuchs <[email protected]> wrote:
>>
>> Hi Maxim,
>>
>> What you're seeing is an artifact of the threading model that Accumulo uses. 
>> When you launch a query, Accumulo tablet servers will coordinate RPCs via 
>> Thrift in one thread pool (which grows unbounded) and queue up scans (rfile 
>> lookups, decryption/decompression, iterators, etc.) in another threadpool 
>> known as the readahead pool (which has a fixed number of threads). You're 
>> seeing everything that happens in that readahead thread in one big chunk. 
>> You may need to look a bit deeper into profiling/sampling tablet server CPU 
>> to get insights into how to improve your query performance. If you want to 
>> speed up queries in general you might try (in no particular order):
>> 1. Increase parallelism by bumping up the readahead threads 
>> (tserver.readahead.concurrent.max). This will still be bounded by the number 
>> of parallel scans clients are driving.
>> 2. Increase parallelism driven by clients by querying more, smaller ranges, 
>> or by splitting tablets.
>> 3. Increase scan batch sizes if the readahead thread or thrift coordination 
>> overhead is high.
>> 4. Optimize custom iterators if that is a CPU bottleneck.
>> 5. Increase cache sizes or otherwise modify queries to improve cache hit 
>> rates.
>> 6. Change compression settings if that is a CPU bottleneck. Try snappy 
>> instead of gz.
>>
>> Cheers,
>> Adam
>>
>> On Tue, Jan 15, 2019, 10:45 AM Maxim Kolchin <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> I'm trying to trace some scans with Zipkin and see that quite often the
>>> trace called "tablet read ahead" takes 10x or 100x more time than other
>>> similar traces.
>>>
>>> Why might this happen? What could be done to reduce the time? I found a
>>> similar discussion on the list, but it doesn't have an answer. It'd be
>>> great to have a how-to article listing some steps which could be taken.
>>>
>>> Attaching a screenshot of one of the traces having this issue.
>>>
>>> Maxim Kolchin
>>>
>>> E-mail: [email protected]
>>> Tel.: +7 (911) 199-55-73
>>> Homepage: http://kolchinmax.ru
>>>
>>>> Below you can find a good example of what I'm struggling to understand
>>>> right now. It's a trace for a simple scan over some columns with a
>>>> BatchScanner using 75 threads. The scan takes 877 milliseconds and the main
>>>> contributor is the entry "tablet read ahead 1", which starts at 248 ms.
>>>> These are the questions that I cannot answer with this trace:
>>>>
>>>>    1. Why does this heavy operation start after 248 ms? Summing up the
>>>>    delays before this operation gives a number which is not even close
>>>>    to 248 ms.
>>>>    2. What does "tablet read ahead 1" mean? In general, how do I map the
>>>>    entries of a trace to their meaning? Is there a guide about this?
>>>>    3. Why does "tablet read ahead 1" take 600 ms? It's clearly not the
>>>>    sum of the entries under this one, but that's the important part.
>>>>    4. I may be naive, but how much data has been read by this scan? How
>>>>    many entries? That's very important for understanding what's going on.
>>>>
>>>> Thanks for the help,
>>>>
>>>> Mario
>>>>
>>>> 877+ 0 Dice@h01 counts
>>>> 2+ 7 tserver@h12 startScan
>>>> 6+ 10 tserver@h15 startScan
>>>> 5+ 11 tserver@h15 metadata tablets read ahead 4
>>>> 843+ 34 Dice@h01 batch scanner 74- 1
>>>> 620+ 230 tserver@h09 startMultiScan
>>>> 600+ 248 tserver@h09 tablet read ahead 1
>>>> 22+ 299 tserver@h09 newDFSInputStream
>>>> 22+ 299 tserver@h09 getBlockLocations
>>>> 2+ 310 tserver@h09 ClientNamenodeProtocol#getBlockLocations
>>>> 1+ 321 tserver@h09 getFileInfo
>>>> 1+ 321 tserver@h09 ClientNamenodeProtocol#getFileInfo
>>>> 2+ 322 tserver@h09 DFSInputStream#byteArrayRead
>>>> 1+ 324 tserver@h09 DFSInputStream#byteArrayRead
>>>> 2+ 831 tserver@h09 DFSInputStream#byteArrayRead
>>>> 2+ 834 tserver@h09 DFSInputStream#byteArrayRead
>>>> 1+ 835 tserver@h09 BlockReaderLocal#fillBuffer(1091850413)
>>>> 1+ 874 tserver@h09 closeMultiScan
>>>> --
>>>> Mario Pastorelli | TERALYTICS
>>>>
>>>> *software engineer*
>>>>
>>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>>> phone: +41794381682
>>>> email: [email protected]
>>>> www.teralytics.net
>>>>
>>>> Company registration number: CH-020.3.037.709-7 | Trade register Canton
>>>> Zurich
>>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
>>>> de Vries
>>>>
>>>> This e-mail message contains confidential information which is for the sole
>>>> attention and use of the intended recipient. Please notify us at once if
>>>> you think that it may not be intended for you and delete it immediately.
