[ 
https://issues.apache.org/jira/browse/HDFS-3290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256742#comment-13256742
 ] 

Colin Patrick McCabe commented on HDFS-3290:
--------------------------------------------

Hi Kihwal,

The DataNode currently keeps all of the block files for a single BlockPool in 
a single directory.  Unless you are using federation, this means that all of 
the block files for a DataNode are in a single directory.  Lookups and 
listings in that directory become inefficient as the number of blocks grows.

The idea is to make a small, incremental change to the directory structure, so 
that the block files are spread across multiple directories rather than all in 
the same directory.  This is similar to how git lays out its object store:

{code}
cmccabe@keter:~/hadoop2> ls .git/objects/
00  09  12  1b  24  2d  36  3f  48  51  5a  63  6c  75  7e  87  90  9a  a3  ac  b5  be  c7  d0  d9  e2  eb  f4  fd
01  0a  13  1c  25  2e  37  40  49  52  5b  64  6d  76  7f  88  92  9b  a4  ad  b6  bf  c8  d1  da  e3  ec  f5  fe
...
{code}

The subdirectories contain the object files:
{code}
cmccabe@keter:~/hadoop2> ls .git/objects/00
005d570a8ba44e314bb33db88499c0d385c66d  517f2a598d935eebac57453fb376c955184d72  fef1d2a78c2d30cede61a734b8fd1ae5c5f28f
41b3f30cb773267de5bb2a47169fae40ea65d7  68451b4b9acb9100fe78efc1d0b0283acc2024
471dd7fbdb1ccd6e7e079cd994b2920c5c93a8  fe8841111800e846bb961308224a826c33971c
{code}

In contrast, the DataNode puts everything in the same directory:
{code}
cmccabe@keter:~/hadoop1> ls /opt/hadoop/run4/data1/current/BP-1579759935-127.0.0.1-1333677135630/current/finalized/
blk_2787038401297968504            blk_4287797753246219082             blk_-7903322996600832353
blk_2787038401297968504_1013.meta  blk_4287797753246219082_1005.meta   blk_-7903322996600832353_1003.meta
blk_3011630105542771325            blk_-5154897827824037676            blk_-8206000470100252669
blk_3011630105542771325_1007.meta  blk_-5154897827824037676_1017.meta  blk_-8206000470100252669_1009.meta
blk_3119417112012450397            blk_-6449276351298923965
blk_3119417112012450397_1015.meta  blk_-6449276351298923965_1011.meta
{code}
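
To make the proposed mapping concrete, here is a minimal sketch, written in 
Python for brevity rather than the DataNode's Java.  The names SHARD_BITS, 
shard_dir and block_path are hypothetical, and taking the low 8 bits of the 
block ID is only an assumption borrowed from git's 256-way fan-out, not a 
settled design:

```python
import os

# Assumption: 8 bits gives 256 subdirectories, mirroring git's object store.
SHARD_BITS = 8


def shard_dir(block_id, bits=SHARD_BITS):
    """Return a hex subdirectory name built from the low `bits` bits of block_id."""
    mask = (1 << bits) - 1
    width = (bits + 3) // 4  # number of hex digits needed to print `bits` bits
    # Block IDs can be negative.  Python's & on a negative int behaves like
    # two's complement, so the result always lands in [0, mask].
    return format(block_id & mask, "0{}x".format(width))


def block_path(finalized_dir, block_id):
    """Return the sharded path of a block file under finalized_dir."""
    return os.path.join(finalized_dir, shard_dir(block_id), "blk_{}".format(block_id))


if __name__ == "__main__":
    # Block IDs taken from the listing above.
    for block_id in (2787038401297968504, -7903322996600832353):
        print(block_path("finalized", block_id))
```

The .meta file for a block would presumably live in the same subdirectory as 
the block file, so that both can be found from the block ID alone.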

P.S.  Yes, I am aware of the rbw directory.  However, most of the blocks are 
not going to be in that directory.

cheers,
Colin
                
> Use a better local directory layout for the datanode
> ----------------------------------------------------
>
>                 Key: HDFS-3290
>                 URL: https://issues.apache.org/jira/browse/HDFS-3290
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 0.23.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Minor
>
> When the HDFS DataNode stores chunks in a local directory, it currently puts 
> all of the chunk files into one big directory.  As the number of files 
> increases, this does not work well at all.  Local filesystems are not 
> optimized for the case where there are hundreds of thousands of files in the 
> same directory.  It also makes inspecting directories with standard UNIX 
> tools difficult.
> Similar to the git version control system, HDFS should create a few different 
> top level directories keyed off of a few bits in the chunk ID.  Git uses 8 
> bits.  This substantially cuts down on the number of chunk files in the same 
> directory and gives increased performance.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
