[ 
https://issues.apache.org/jira/browse/HDFS-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved HDFS-224.
--------------------------------

    Resolution: Duplicate

We have a different version of harchives.
                
> I propose a tool for creating and manipulating a new abstraction, Hadoop 
> Archives.
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-224
>                 URL: https://issues.apache.org/jira/browse/HDFS-224
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Dick King
>
> -- Introduction
> In some Hadoop map/reduce and dfs use cases, including a specific case that 
> arises in my own work, users would like to populate dfs with a family of 
> hundreds or thousands of directory trees, each of which consists of thousands 
> of files.  In our case, each tree holds perhaps 20 gigabytes: two or 
> three 3-10-gigabyte files, a thousand small ones, and a large number of files 
> of intermediate size.  I am writing this JIRA to encourage discussion of a 
> new facility I want to create and contribute to the dfs core.
> -- The problem
> You can't store such families of trees in dfs in the obvious manner.  The 
> problem is that the namenode can't handle the millions or tens of millions 
> of files that result from such a family, especially if there are a couple of 
> families.  I understand that dfs will not be able to accommodate tens of 
> millions of files in one instance for quite a while.
> -- Exposed API of my proposed solution
> I would therefore like to produce, and contribute to the dfs core, a new tool 
> that implements an abstraction called a Hadoop Archive [or harchive].  
> Conceptually, a harchive is a unit, but it manages a space that looks like a 
> directory tree.  The tool exposes an interface that allows a user to do the 
> following:
>  * directory-level operations
>    ** create a harchive [either empty, or initially populated from a 
> locally stored directory tree].  The namespace for harchives is the same as 
> the space of possible dfs directory locators, and a harchive would in fact be 
> implemented as a dfs directory with specialized contents.
>    ** Add a directory tree to an existing harchive in a specific place within 
> the harchive
>    ** retrieve a directory tree or subtree at or beneath the root of the 
> harchive directory structure, into a local directory tree
>  * file-level operations
>    ** add a local file to a specific place in the harchive
>    ** modify a file image in a specific place in the harchive to match a 
> local file
>    ** delete a file image in the harchive.
>    ** move a file image within the harchive
>    ** open a file image in the harchive for reading or writing.
>  * stream operations
>    ** open a harchive file image for reading or writing as a stream, in a 
> manner similar to dfs files, and read or write it [i.e., hdfsRead(...)].  
> This would include random access operations for reading.
>  * management operations
>    ** commit a group of changes [which would be made atomically: there 
> would be no way for half of a change to be applied to a harchive if a client 
> crashes].
>    ** clean up a harchive, if its performance has degraded because of 
> extensive editing
>    ** delete a harchive
> We would also implement a command line interface.
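The file-level and management operations above could be sketched as a small Java interface. Everything here is hypothetical illustration, not an actual Hadoop API: the names, and the in-memory model standing in for dfs-backed segments, are invented to make the commit-a-group-of-changes semantics concrete.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed harchive operations; illustrative only.
interface Harchive {
    void addFile(String path, byte[] contents); // stage: add a local file image
    void moveFile(String from, String to);      // stage: move a file image
    void deleteFile(String path);               // stage: delete a file image
    byte[] read(String path);                   // read a committed file image
    void commit();                              // apply all staged changes atomically
}

// Trivial in-memory model: a real implementation would rewrite segments and
// switch indices, but the staged-then-committed view is the same idea.
class InMemoryHarchive implements Harchive {
    private Map<String, byte[]> committed = new HashMap<>();
    private final Map<String, byte[]> staged = new HashMap<>(); // null = delete

    public void addFile(String path, byte[] contents) { staged.put(path, contents); }

    public void moveFile(String from, String to) {
        staged.put(to, read(from));
        staged.put(from, null);
    }

    public void deleteFile(String path) { staged.put(path, null); }

    public byte[] read(String path) { return committed.get(path); }

    public void commit() {
        Map<String, byte[]> next = new HashMap<>(committed);
        for (Map.Entry<String, byte[]> e : staged.entrySet()) {
            if (e.getValue() == null) next.remove(e.getKey());
            else next.put(e.getKey(), e.getValue());
        }
        committed = next; // the single switch is the "atomic" step in this model
        staged.clear();
    }
}

public class HarchiveDemo {
    public static void main(String[] args) {
        Harchive h = new InMemoryHarchive();
        h.addFile("/logs/part-0", "hello".getBytes());
        assert h.read("/logs/part-0") == null : "staged change invisible before commit";
        h.commit();
        assert new String(h.read("/logs/part-0")).equals("hello");
        h.moveFile("/logs/part-0", "/logs/part-1");
        h.commit();
        assert h.read("/logs/part-0") == null;
        assert new String(h.read("/logs/part-1")).equals("hello");
        System.out.println("ok");
    }
}
```

Note how readers see only the committed view until commit() runs, which is the property the management operations above ask for.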
> -- Brief sketch of internals
> A harchive would be represented as a small collection of files, called 
> segments, in a dfs directory at the harchive's location.  Each segment would 
> contain some of the harchive's file images, in a format to be determined, 
> plus a harchive index.  We may group files by size or some other 
> criterion.  It is likely that harchives would contain only one segment in 
> common cases.
> Changes would be made by adding the contents of the new files, either by 
> rewriting an existing segment that contains not much more data than the size 
> of the changes, or by creating a new segment, complete with a new index.  When 
> dfs is enhanced to allow appends to dfs files, as requested by 
> HADOOP-1700, we would be able to take advantage of that.
> Often, when a harchive is initially populated, it could be a single segment, 
> and a file it contains could be accessed with two random accesses into the 
> segment.  The first access retrieves the index, and the second access 
> retrieves the beginning of the file.  We could choose to put smaller files 
> closer to the index to allow lower average amortized costs per byte.
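The two-random-access read path might look like the following sketch. The layout (file bytes, then a text index, then a trailer giving the index offset) is an invented stand-in for the "format to be determined" above, not anything specified in the proposal.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative single-segment layout: [file bytes...][index][index offset].
// The index maps each file name to its (offset, length) within the segment.
public class SegmentSketch {
    // Build a single-segment harchive image from name -> contents.
    static byte[] buildSegment(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream seg = new ByteArrayOutputStream();
        StringBuilder index = new StringBuilder();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            index.append(e.getKey()).append('\t')
                 .append(seg.size()).append('\t')
                 .append(e.getValue().length).append('\n');
            seg.write(e.getValue());
        }
        long indexOffset = seg.size();
        seg.write(index.toString().getBytes(StandardCharsets.UTF_8));
        seg.write(ByteBuffer.allocate(Long.BYTES).putLong(indexOffset).array());
        return seg.toByteArray();
    }

    // Read one file image with two "random accesses": index, then data.
    static byte[] readFile(byte[] segment, String name) {
        // Access 1: the trailer locates the index; parse the index.
        long indexOffset = ByteBuffer
                .wrap(segment, segment.length - Long.BYTES, Long.BYTES).getLong();
        String index = new String(segment, (int) indexOffset,
                segment.length - Long.BYTES - (int) indexOffset, StandardCharsets.UTF_8);
        for (String line : index.split("\n")) {
            String[] f = line.split("\t");
            if (f[0].equals(name)) {
                // Access 2: jump straight to the file bytes.
                int off = Integer.parseInt(f[1]), len = Integer.parseInt(f[2]);
                byte[] out = new byte[len];
                System.arraycopy(segment, off, out, 0, len);
                return out;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("small.txt", "tiny".getBytes(StandardCharsets.UTF_8));
        files.put("big.bin", new byte[1024]);
        byte[] segment = buildSegment(files);
        assert new String(readFile(segment, "small.txt"),
                StandardCharsets.UTF_8).equals("tiny");
        assert readFile(segment, "big.bin").length == 1024;
    }
}
```

With this layout, "smaller files closer to the index" simply means writing them last, so one buffered read of the segment tail can pick up both the index and the small files.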
> We might instead choose to represent a harchive as one or a few files 
> holding the large represented files, plus separate smaller files for the 
> smaller represented files.  That would let us make modifications by copying 
> at lower cost.
> The segment containing the index is found by a naming convention.  Atomicity 
> is obtained by creating indices and renaming the files containing them 
> according to the convention, when a change is committed.
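The rename-based commit could be sketched as below. This uses the local filesystem via java.nio.file for illustration; a real harchive would rename through the dfs client instead, and the `_index` naming convention here is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of index commit by rename: write the new index under a temporary
// name, then rename it onto the conventional index name in one step, so
// readers see either the old index or the new one, never a partial file.
public class AtomicIndexCommit {
    static final String INDEX_NAME = "_index"; // hypothetical naming convention

    static void commitIndex(Path harchiveDir, String newIndexContents)
            throws IOException {
        Path tmp = harchiveDir.resolve(INDEX_NAME + ".tmp");
        Files.write(tmp, newIndexContents.getBytes(StandardCharsets.UTF_8));
        // The rename is the commit point; a crash before this line leaves
        // the previous index untouched.
        Files.move(tmp, harchiveDir.resolve(INDEX_NAME),
                StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("harchive");
        commitIndex(dir, "v1: small.txt 0 4\n");
        commitIndex(dir, "v2: small.txt 0 4 big.bin 4 1024\n");
        String current = new String(
                Files.readAllBytes(dir.resolve(INDEX_NAME)), StandardCharsets.UTF_8);
        assert current.startsWith("v2:");
    }
}
```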

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
