[ https://issues.apache.org/jira/browse/HDFS-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Owen O'Malley resolved HDFS-224.
--------------------------------
    Resolution: Duplicate

We have a different version of harchives.

> I propose a tool for creating and manipulating a new abstraction, Hadoop
> Archives.
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-224
>                 URL: https://issues.apache.org/jira/browse/HDFS-224
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Dick King
>
> -- Introduction
>
> In some hadoop map/reduce and dfs use cases, including a specific case that
> arises in my own work, users would like to populate dfs with a family of
> hundreds or thousands of directory trees, each of which consists of thousands
> of files. In our case, each tree holds perhaps 20 gigabytes: two or three
> 3-10-gigabyte files, a thousand small ones, and a large number of files of
> intermediate size. I am writing this JIRA to encourage discussion of a new
> facility I want to create and contribute to the dfs core.
>
> -- The problem
>
> You can't store such families of trees in dfs in the obvious manner. The
> problem is that the name nodes can't handle the millions or tens of millions
> of files that result from such a family, especially if there are a couple of
> families. I understand that dfs will not be able to accommodate tens of
> millions of files in one instance for quite a while.
>
> -- Exposed API of my proposed solution
>
> I would therefore like to produce, and contribute to the dfs core, a new tool
> that implements an abstraction called a Hadoop Archive [or harchive].
> Conceptually, a harchive is a unit, but it manages a space that looks like a
> directory tree. The tool exposes an interface that allows a user to do the
> following:
>
> * directory-level operations
> ** create a harchive [either empty, or initially populated from a
> locally-stored directory tree].
> The namespace for harchives is the same as the space of possible dfs
> directory locators, and a harchive would in fact be implemented as a dfs
> directory with specialized contents.
> ** add a directory tree to an existing harchive in a specific place within
> the harchive
> ** retrieve a directory tree or subtree at or beneath the root of the
> harchive directory structure, into a local directory tree
> * file-level operations
> ** add a local file to a specific place in the harchive
> ** modify a file image in a specific place in the harchive to match a
> local file
> ** delete a file image in the harchive
> ** move a file image within the harchive
> ** open a file image in the harchive for reading or writing
> * stream operations
> ** open a harchive file image for reading or writing as a stream, in a
> manner similar to dfs files, and read or write it [i.e., hdfsRead(...)].
> This would include random access operators for reading.
> * management operations
> ** commit a group of changes [which would be made atomically -- there
> would be no way half of a change could be applied to a harchive if a client
> crashes]
> ** clean up a harchive if it has become less performant because of
> extensive editing
> ** delete a harchive
>
> We would also implement a command line interface.
>
> -- Brief sketch of internals
>
> A harchive would be represented as a small collection of files, called
> segments, in a dfs directory at the harchive's location. Each segment would
> contain some of the harchive's file images, in a format to be determined,
> plus a harchive index. We may group files by size or some other criterion.
> It is likely that harchives would contain only one segment in common cases.
>
> Changes would be made by adding the text of the new files, either by
> rewriting an existing segment that contains not much more data than the size
> of the changes, or by creating a new segment, complete with a new index.
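[Editor's note: the exposed operations above, including the staged-then-committed change model, might be modeled as in the following toy in-memory Python sketch. All class and method names here are hypothetical; the proposal does not fix an API, and a real implementation would sit on top of dfs rather than a dict.]

```python
class Harchive:
    """Toy in-memory model of the proposed harchive interface (hypothetical names)."""

    def __init__(self, path):
        self.path = path      # the dfs directory that would back the harchive
        self._files = {}      # committed file images: name -> bytes
        self._pending = {}    # staged changes: name -> bytes, or None for deletion

    def add_file(self, name, data):
        """Stage a local file image at `name` inside the harchive."""
        self._pending[name] = data

    def delete_file(self, name):
        """Stage deletion of a file image."""
        self._pending[name] = None

    def move_file(self, src, dst):
        """Stage a move of a committed file image within the harchive."""
        self._pending[dst] = self._files[src]
        self._pending[src] = None

    def read_file(self, name):
        """Read a committed file image."""
        return self._files[name]

    def commit(self):
        """Apply all staged changes as one group [all or nothing]."""
        for name, data in self._pending.items():
            if data is None:
                self._files.pop(name, None)
            else:
                self._files[name] = data
        self._pending.clear()
```

In this sketch nothing is visible to readers until commit(), mirroring the atomic-commit requirement in the management operations above.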
> When dfs comes to be enhanced to allow appends to dfs files, as requested by
> HADOOP-1700, we would be able to take advantage of that.
>
> Often, when a harchive is initially populated, it could be a single segment,
> and a file it contains could be accessed with two random accesses into the
> segment: the first access retrieves the index, and the second retrieves the
> beginning of the file. We could choose to put smaller files closer to the
> index to allow lower average amortized costs per byte.
>
> We might instead choose to represent a harchive as one file or a few files
> for the large represented files, and smaller files for the represented
> smaller files. That lets us make modifications by copying at lower cost.
>
> The segment containing the index is found by a naming convention. Atomicity
> is obtained by creating indices and renaming the files containing them
> according to the convention when a change is committed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
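[Editor's note: the segment-plus-index layout and the rename-based commit sketched in the internals section could look like the following minimal Python sketch on a local filesystem. The file names "seg-0", "index", and "index.tmp" are illustrative conventions, not part of the proposal, and a local os.rename stands in for the dfs rename that would provide atomicity.]

```python
import json
import os

def write_segment(dirname, files):
    """Pack file images into one segment and publish its index atomically."""
    index = {}
    with open(os.path.join(dirname, "seg-0"), "wb") as seg:
        for name, data in files.items():
            index[name] = (seg.tell(), len(data))  # (offset, length) per file image
            seg.write(data)
    # Atomicity via the naming convention: write the new index under a
    # temporary name, then rename it into place. A reader following the
    # convention never observes a half-written index.
    tmp = os.path.join(dirname, "index.tmp")
    with open(tmp, "w") as f:
        json.dump(index, f)
    os.rename(tmp, os.path.join(dirname, "index"))

def read_image(dirname, name):
    """Two random accesses: one to fetch the index, one to fetch the bytes."""
    with open(os.path.join(dirname, "index")) as f:
        offset, length = json.load(f)[name]
    with open(os.path.join(dirname, "seg-0"), "rb") as seg:
        seg.seek(offset)
        return seg.read(length)
```

Placing small files early in the segment, near the index, would correspond to the "smaller files closer to the index" optimization mentioned above.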