[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844658#action_12844658 ]

Jay Booth commented on HDFS-918:
--------------------------------

>>> I think it is very important to have separate pools for each partition.
>> This would be the case if I were using a fixed-size thread pool and a 
>> LinkedBlockingQueue - but I'm not, see Executors.newCachedThreadPool(),

>hmm.. does it mean that if you have thousand clients and the load is disk 
>bound, we end up with 1000 threads?

Yeah, although it'll likely turn out to be fewer than 1000 in practice.  If 
the requests are all short-lived, it could be significantly fewer than 1000 
threads once you account for thread re-use; if it's 1000 long reads, it'll 
probably wind up being only a little fewer, if that.  The threads themselves 
are really lightweight, the only resource attached to each one is a 
ThreadLocal<ByteBuffer(8096)>.  (8k seemed ok for the ByteBuffer because the 
header+checksums portion is always significantly less than that, and the main 
block file transfers are done using transferTo).
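To make that concrete, here's a rough sketch of the per-thread-buffer plus 
transferTo pattern I'm describing.  This is purely illustrative, not the patch 
itself, and the class/method names are made up:

    // Illustrative sketch only: each pooled worker thread keeps one small
    // reusable buffer for the header + checksum bytes, while the block data
    // itself goes out via FileChannel.transferTo() (zero-copy).
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.channels.WritableByteChannel;

    public class ReadWorkerBuffers {
        // One ~8KB buffer per worker thread, reused across requests.
        private static final ThreadLocal<ByteBuffer> HEADER_BUF =
            new ThreadLocal<ByteBuffer>() {
                @Override protected ByteBuffer initialValue() {
                    return ByteBuffer.allocateDirect(8096);
                }
            };

        static void sendChunk(FileChannel blockFile, long offset, long length,
                              byte[] headerAndChecksums,
                              WritableByteChannel out) throws IOException {
            ByteBuffer buf = HEADER_BUF.get();
            buf.clear();
            buf.put(headerAndChecksums);   // always well under the 8K buffer
            buf.flip();
            while (buf.hasRemaining()) {   // small copy for metadata only
                out.write(buf);
            }
            long sent = 0;
            while (sent < length) {        // bulk block data, sent by the kernel
                sent += blockFile.transferTo(offset + sent, length - sent, out);
            }
        }
    }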

I chose this approach after initially experimenting with a fixed-size 
threadpool and LinkedBlockingQueue because the handoff is faster and every 
pending IO request is guaranteed to become an actual disk-read syscall waiting 
on the operating system as fast as possible.  This way, the operating system 
decides which disk request to fulfill first, taking advantage of the 
lower-level optimizations around disk IO.  Since the threads are pretty 
lightweight and the lower-level calls do a better job of optimal fulfillment, I 
think this will work better than a fixed-size threadpool, where, for example, 2 
adjacent reads from separate threads could be separated from each other in time 
even though the disk controller could have fulfilled both simultaneously and 
faster.  This becomes even more important, I think, with the higher 512KB 
packet size -- those are big chunks of work per syscall that can be optimized 
by the underlying OS.  Regarding the extra resource allocation for the threads 
-- if we're disk-bound, then generally speaking a few extra memory resources 
shouldn't be a huge deal -- the gains from dispatching more disk requests in 
parallel should outweigh the memory allocation and context switch costs.

The above is all in theory -- I haven't benchmarked parallel implementations 
head-to-head.  But certainly for random reads, and likely for longer reads, 
this approach should get the syscall invoked as fast as possible.  Switching 
between the two models would be pretty simple, just change the parameters we 
pass to the ThreadPoolExecutor constructor.
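For reference, the two configurations really only differ in those constructor 
arguments.  Roughly (illustrative only; this mirrors what 
Executors.newCachedThreadPool() and Executors.newFixedThreadPool(n) build under 
the hood):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class ReadPools {
        // Cached pool: threads created on demand, direct handoff through a
        // SynchronousQueue, idle threads reaped after 60 seconds.  Every
        // pending read gets a thread (and hence a syscall) immediately.
        static ExecutorService cachedPool() {
            return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                    60L, TimeUnit.SECONDS,
                    new SynchronousQueue<Runnable>());
        }

        // Fixed-size pool: at most nThreads disk reads in flight, with the
        // rest of the requests waiting in the LinkedBlockingQueue.
        static ExecutorService fixedPool(int nThreads) {
            return new ThreadPoolExecutor(nThreads, nThreads,
                    0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>());
        }
    }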

> Use single Selector and small thread pool to replace many instances of 
> BlockSender for reads
> --------------------------------------------------------------------------------------------
>
>                 Key: HDFS-918
>                 URL: https://issues.apache.org/jira/browse/HDFS-918
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Jay Booth
>             Fix For: 0.22.0
>
>         Attachments: hdfs-918-20100201.patch, hdfs-918-20100203.patch, 
> hdfs-918-20100211.patch, hdfs-918-20100228.patch, hdfs-918-20100309.patch, 
> hdfs-multiplex.patch
>
>
> Currently, on read requests, the DataXCeiver server allocates a new thread 
> per request, which must allocate its own buffers and leads to 
> higher-than-optimal CPU and memory usage by the sending threads.  If we had a 
> single selector and a small threadpool to multiplex request packets, we could 
> theoretically achieve higher performance while taking up fewer resources and 
> leaving more CPU on datanodes available for mapred, hbase or whatever.  This 
> can be done without changing any wire protocols.
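For anyone skimming the description above, here's a minimal sketch of the 
selector-plus-small-pool idea.  It is purely illustrative -- not the attached 
patch -- and the class and method names are hypothetical:

    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class MultiplexedReadServer implements Runnable {
        private final Selector selector;                          // single selector
        private final ExecutorService workers =
            Executors.newFixedThreadPool(8);                      // small pool

        public MultiplexedReadServer() throws IOException {
            this.selector = Selector.open();
        }

        // Register a client connection; "readState" would track the block,
        // offset and remaining length of that client's pending read.
        public void register(SocketChannel ch, Object readState) throws IOException {
            ch.configureBlocking(false);
            ch.register(selector, SelectionKey.OP_WRITE, readState);
        }

        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    selector.select();
                    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                    while (it.hasNext()) {
                        final SelectionKey key = it.next();
                        it.remove();
                        if (key.isValid() && key.isWritable()) {
                            key.interestOps(0);  // park the key while a worker sends
                            workers.submit(new Runnable() {
                                public void run() { sendNextPacket(key); }
                            });
                        }
                    }
                } catch (IOException e) {
                    break;
                }
            }
        }

        private void sendNextPacket(SelectionKey key) {
            // ... write header/checksums, transferTo() the next packet of block
            // data, then re-enable OP_WRITE and wake the selector if more
            // packets remain for this client ...
        }
    }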

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
