I propose a tool for creating and manipulating a new abstraction, Hadoop 
Archives.
----------------------------------------------------------------------------------

                 Key: HADOOP-2146
                 URL: https://issues.apache.org/jira/browse/HADOOP-2146
             Project: Hadoop
          Issue Type: New Feature
          Components: dfs
            Reporter: Dick King


-- Introduction

In some Hadoop map/reduce and dfs use cases, including a specific case that 
arises in my own work, users would like to populate dfs with a family of 
hundreds or thousands of directory trees, each of which consists of thousands 
of files.  In our case, each tree holds perhaps 20 gigabytes: two or three 
3-10-gigabyte files, a thousand small ones, and a large number of files of 
intermediate size.  I am writing this JIRA to encourage discussion of a new 
facility I want to create and contribute to the dfs core.

-- The problem

You can't store such families of trees in dfs in the obvious manner.  The 
problem is that the name node can't handle the millions or tens of millions 
of files that result from such a family, especially if there are a couple of 
families.  I understand that dfs will not be able to accommodate tens of 
millions of files in one instance for quite a while.

-- Exposed API of my proposed solution

I would therefore like to produce, and contribute to the dfs core, a new tool 
that implements an abstraction called a Hadoop Archive [or harchive].  
Conceptually, a harchive is a single unit, but it manages a space that looks 
like a directory tree.  The tool exposes an interface that allows a user to do 
the following [a hypothetical Java sketch of the interface appears after the 
list]:

 * directory-level operations

   ** create a harchive [either empty, or initially populated from a 
locally-stored directory tree].  The namespace for harchives is the same as 
the space of possible dfs directory locators, and a harchive would in fact be 
implemented as a dfs directory with specialized contents.

   ** add a directory tree to an existing harchive in a specific place within 
the harchive

   ** retrieve a directory tree or subtree at or beneath the root of the 
harchive directory structure, into a local directory tree

 * file-level operations

   ** add a local file to a specific place in the harchive

   ** modify a file image in a specific place in the harchive to match a local 
file

   ** delete a file image in the harchive

   ** move a file image within the harchive

   ** open a file image in the harchive for reading or writing

 * stream operations

   ** open a harchive file image for reading or writing as a stream, in a 
manner similar to dfs files, and read or write it [i.e., hdfsRead(...)].  This 
would include random access operations for reading.

 * management operations

   ** commit a group of changes [which would be made atomically -- there would 
be no way half of a change could be made to a harchive if a client crashes].

   ** clean up a harchive, if its performance has degraded because of 
extensive editing

   ** delete a harchive
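
As a concrete starting point for discussion, the exposed Java API might look 
roughly like the sketch below.  Every name and signature in it is an 
illustrative placeholder, not a committed design.

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch of the proposed harchive API; all names and
// signatures are placeholders for discussion, not a committed design.
public interface Harchive {

  // directory-level operations [creation, either empty or from a local
  // tree, would live on a companion factory method]
  void addTree(Path localTree, Path destInHarchive) throws IOException;
  void retrieveTree(Path subtreeInHarchive, Path localDest) throws IOException;

  // file-level operations
  void addFile(Path localFile, Path destInHarchive) throws IOException;
  void replaceFile(Path localFile, Path destInHarchive) throws IOException;
  void deleteFile(Path fileInHarchive) throws IOException;
  void moveFile(Path from, Path to) throws IOException;

  // stream operations; the input stream would support random-access
  // reads in the same manner as dfs streams
  FSDataInputStream open(Path fileInHarchive) throws IOException;
  FSDataOutputStream create(Path fileInHarchive) throws IOException;

  // management operations
  void commit() throws IOException;   // atomically publish pending changes
  void compact() throws IOException;  // clean up after extensive editing
  void delete() throws IOException;   // delete the whole harchive
}
{code}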

We would also implement a command line interface.
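
For example, the command line interface might support invocations along the 
following lines; the command and option names here are invented purely for 
illustration, and none of them exist yet.

{noformat}
hadoop harchive -create /user/dking/trees.har -from /local/trees
hadoop harchive -add    /user/dking/trees.har -from /local/newtree -to sub/dir
hadoop harchive -get    /user/dking/trees.har/sub/dir -to /local/out
hadoop harchive -rm     /user/dking/trees.har/sub/dir/old.dat
hadoop harchive -commit /user/dking/trees.har
hadoop harchive -delete /user/dking/trees.har
{noformat}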

-- Brief sketch of internals

A harchive would be represented as a small collection of files, called 
segments, in a dfs directory at the harchive's location.  Each segment would 
contain some of the harchive's file images, in a format to be determined, plus 
a harchive index.  We might group files into segments by size or by some other 
criterion.  In common cases, a harchive would likely contain only one segment.
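
For concreteness, each index entry might record something like the fields 
below; the field choice is an assumption, since the on-disk format is 
explicitly to be determined.

{code:java}
// Hypothetical index entry describing one file image; the field choice is
// an assumption, since the segment format is still to be determined.
public class HarchiveIndexEntry {
  String path;         // logical path of the file image within the harchive
  String segmentName;  // which segment file holds the image
  long offset;         // byte offset of the image within that segment
  long length;         // length of the image in bytes
}
{code}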

Changes would be made by writing the contents of the new files, either by 
rewriting an existing segment that contains not much more data than the size 
of the changes, or by creating a new segment, complete with a new index.  When 
dfs is enhanced to allow appends to dfs files, as requested by HADOOP-1700, we 
would be able to take advantage of that.

Often, when a harchive is initially populated, it would be a single segment, 
and a file it contains could be accessed with two random accesses into the 
segment: the first access retrieves the index, and the second retrieves the 
beginning of the file.  We could choose to put smaller files closer to the 
index, to lower the average amortized cost per byte.
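
A read of one file image from a single-segment harchive might then look 
roughly like the sketch below.  How the index is located and decoded is an 
assumption; it is abstracted here as a map from image name to offset and 
length within the segment.

{code:java}
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the two-random-access read described above.
public class HarchiveReadSketch {

  static byte[] readFileImage(Configuration conf, Path segment, String name)
      throws IOException {
    FileSystem fs = segment.getFileSystem(conf);
    FSDataInputStream in = fs.open(segment);
    try {
      // First random access: fetch the index from its conventional
      // location within the segment.
      Map<String, long[]> index = readIndex(in);
      long[] entry = index.get(name);  // entry = {offset, length}

      // Second random access: fetch the file image itself.
      byte[] image = new byte[(int) entry[1]];
      in.readFully(entry[0], image);
      return image;
    } finally {
      in.close();
    }
  }

  // The index format is explicitly to be determined, so its decoding is
  // left abstract here.
  static Map<String, long[]> readIndex(FSDataInputStream in)
      throws IOException {
    throw new UnsupportedOperationException("index format not yet specified");
  }
}
{code}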

We might instead choose to represent a harchive as one or a few files holding 
the large file images, plus separate smaller files for the smaller images.  
That would let us make modifications by copying at lower cost.

The segment containing the index would be found by a naming convention.  
Atomicity would be obtained by writing new indices and then renaming the files 
that contain them to match the convention when a change is committed.
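
A sketch of that commit step, assuming readers locate the live index as the 
highest-numbered index file in the harchive directory; the naming convention 
used below is invented for illustration.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of atomic commit via rename; the "_index.N" naming convention is
// an assumption made for illustration.
public class HarchiveCommitSketch {

  static void commit(Configuration conf, Path harchiveDir, long newVersion,
                     byte[] newIndex) throws IOException {
    FileSystem fs = harchiveDir.getFileSystem(conf);
    Path tmp = new Path(harchiveDir, "_index.tmp");

    // Write the new index under a temporary name.  A client crash before
    // the rename below leaves the previously committed index untouched.
    FSDataOutputStream out = fs.create(tmp);
    try {
      out.write(newIndex);
    } finally {
      out.close();
    }

    // The rename is the commit point: once the new index file exists under
    // its conventional name, readers see the new state; until then they
    // see the old one.
    fs.rename(tmp, new Path(harchiveDir, "_index." + newVersion));

    // Superseded index files can be removed later, for example by the
    // clean-up operation.
  }
}
{code}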
