[
https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mahadev konar updated HADOOP-3307:
----------------------------------
Attachment: hadoop-3307_1.patch
This patch addresses the archives issue.
This patch includes the following:
- har:///user/mahadev/foo.har
denotes a Hadoop archive. This is the default uri, which uses the default
underlying filesystem specified in your conf.
If you want to be explicit, or want to use some other hdfs (not the default
one), then the uri is:
har://hdfs-host:port/user/mahadev/foo.har
The uris carry an implicit assumption about which part of the uri denotes the
directory for hadoop archives. The code scans the path from the end and
assumes the first part matching *.har to be the directory that is the archive.
- It has a filesystem layer, so commands like
hadoop fs -ls har:///user/mahadev/foo.har
work. Most of the mutating commands are not implemented for archives; -cat
and -copyToLocal work as expected.
- It works with map reduce.
The input to a map reduce job can be har:///user/mahadev/foo.har and this
works fine.
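
The uri convention above can be sketched as follows. This is only an
illustration of the "scan from the end for *.har" rule, not the code in the
patch; the function name split_har_uri and the host namenode:8020 are
hypothetical.

```python
from urllib.parse import urlsplit


def split_har_uri(uri):
    """Split a har uri into (underlying authority, archive path, inner path).

    Hypothetical helper: the authority is None when the uri relies on the
    default underlying filesystem (har:///...).
    """
    parts = urlsplit(uri)
    if parts.scheme != "har":
        raise ValueError("not a har uri: " + uri)
    components = [c for c in parts.path.split("/") if c]
    # Walk the path from the end; the first component ending in ".har"
    # is taken to be the archive directory.
    for i in range(len(components) - 1, -1, -1):
        if components[i].endswith(".har"):
            archive = "/" + "/".join(components[: i + 1])
            inner = "/".join(components[i + 1:])
            return parts.netloc or None, archive, inner
    raise ValueError("no *.har component in: " + uri)


if __name__ == "__main__":
    print(split_har_uri("har:///user/mahadev/foo.har"))
    print(split_har_uri("har://namenode:8020/user/mahadev/foo.har/dir/a.txt"))
```

Scanning from the end means that, if a path happens to contain more than one
*.har component, the innermost one is treated as the archive.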
Code Design and explanation -
- There are two index files. The _index file contains entries of the form
filename <dir>/<file> partfile startindex size childpathnames_if_directory.
The _index file is sorted by the hash code of the filenames.
The second index file, _masterindex, contains pointers into the _index file to
speed up the lookup time of files inside the _index file.
- To create an archive, the user needs to run
bin/hadoop archives -archiveName foo.har inputpaths outputdir
This is a map reduce job in which all the files are distributed among the
maps, which create part files of around 2GB each. The reduce then gets the
startindex and size from the maps for all the files and creates the _index and
_masterindex files.
- Permissions are not persisted, so the permissions returned by the Har
filesystem are the same as those of the index files.
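
A rough single-process sketch of the design above: files are packed into part
files, an _index-like list is sorted by filename hash, and a _masterindex-like
list of pointers narrows the search. Everything here is an assumption made for
illustration: the hash function, the use of list positions instead of byte
offsets as pointers, the stride, and all function names. The patch itself does
the packing in the maps and builds the indexes in the reduce.

```python
import bisect

PART_LIMIT = 2 * 1024 ** 3  # "around 2GB" per part file


def name_hash(path):
    # Stand-in hash; the patch may well use a different one.
    h = 0
    for ch in path:
        h = (31 * h + ord(ch)) & 0x7FFFFFFF
    return h


def build_index(files, part_limit=PART_LIMIT):
    """files: list of (path, size). Returns entries sorted by filename hash.

    Each entry is (hash, path, partfile, startindex, size).
    """
    entries, part, offset = [], 0, 0
    for path, size in files:
        if offset > 0 and offset + size > part_limit:
            part, offset = part + 1, 0  # current part is full, start a new one
        entries.append((name_hash(path), path, "part-%d" % part, offset, size))
        offset += size
    entries.sort(key=lambda e: (e[0], e[1]))
    return entries


def build_master_index(entries, stride):
    """One pointer per block of `stride` entries: (first hash, start position)."""
    return [(entries[i][0], i) for i in range(0, len(entries), stride)]


def lookup(path, entries, master):
    """Return (partfile, startindex, size) for path, or None if absent."""
    h = name_hash(path)
    first_hashes = [m[0] for m in master]
    # Entries with hash h can start in the block just before the first block
    # whose first hash is >= h, so begin scanning there.
    block = max(0, bisect.bisect_left(first_hashes, h) - 1)
    i = master[block][1] if master else 0
    while i < len(entries) and entries[i][0] <= h:
        if entries[i][1] == path:
            return entries[i][2:]
        i += 1
    return None


if __name__ == "__main__":
    files = [("/a.txt", 10), ("/dir/b.txt", 20), ("/dir/c.txt", 15), ("/d.txt", 5)]
    entries = build_index(files, part_limit=32)  # tiny limit to force two parts
    master = build_master_index(entries, stride=2)
    print(lookup("/dir/c.txt", entries, master))
```

The master index keeps a lookup from scanning the whole _index: only the
entries from the selected block up to the first larger hash are examined.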
> Archives in Hadoop.
> -------------------
>
> Key: HADOOP-3307
> URL: https://issues.apache.org/jira/browse/HADOOP-3307
> Project: Hadoop Core
> Issue Type: New Feature
> Components: fs
> Reporter: Mahadev konar
> Assignee: Mahadev konar
> Fix For: 0.18.0
>
> Attachments: hadoop-3307_1.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS.