[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Mahadev konar (JIRA) Sat, 24 May 2008 20:08:23 -0700

     [ 
https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mahadev konar updated HADOOP-3307:
----------------------------------

    Attachment: hadoop-3307_1.patch

This patch addresses the archives issue. 

This patch includes the following -- 

- har:///user/mahadev/foo.har 

denotes a Hadoop archive. This is the default URI form, which uses the default 
underlying filesystem specified in your conf. 

If you want to be explicit, or to use some other HDFS (not the default one),

then the URI is -- 

har://hdfs-host:port/user/mahadev/foo.har

The URIs carry an implicit assumption about which part of the URI denotes the 
directory for the Hadoop archive. The code scans the path from the end and 
assumes the part matching *.har is the directory that is the archive.


- it has a filesystem layer so all the commands like 

hadoop fs -ls har:///user/mahadev/foo.har 

work. Most of the mutating commands are not implemented for archives. -cat and 
-copyToLocal work as expected. 

- works with map reduce. 

So the input to a map reduce job could be har:///user/mahadev/foo.har and this 
would work fine.
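
As an illustration of the URI rule described above (scan the path from the end 
and take the part matching *.har as the archive directory), here is a minimal 
Python sketch. It is not the code in the patch; the function name split_har_uri 
and the host namenode:8020 are made up for the example.

```python
from urllib.parse import urlsplit


def split_har_uri(uri):
    """Return (authority, archive_path, inner_path) for a har: URI.

    Hypothetical helper: scans path components from the end and treats the
    first component ending in ".har" as the archive directory.
    """
    parts = urlsplit(uri)
    if parts.scheme != "har":
        raise ValueError("not a har URI: " + uri)
    components = [c for c in parts.path.split("/") if c]
    # Walk backwards so the last *.har component wins.
    for i in range(len(components) - 1, -1, -1):
        if components[i].endswith(".har"):
            archive = "/" + "/".join(components[: i + 1])
            inner = "/".join(components[i + 1:])
            return parts.netloc, archive, inner
    raise ValueError("no *.har component in: " + uri)


# Empty authority means "use the default filesystem from the conf".
print(split_har_uri("har:///user/mahadev/foo.har"))
# Explicit authority (hypothetical host and port) plus a path inside the archive.
print(split_har_uri("har://namenode:8020/user/mahadev/foo.har/dir/file"))
```

An empty authority in the result stands for the default underlying filesystem; 
how the real code resolves that is not shown here.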

Code Design and explanation - 

- There are two index files. The _index file contains entries of the form 
  filename <dir>/<file> partfile startindex size childpathnames_if_directory.
  The _index file is sorted by hashcode of filenames.
  The second index file, _masterindex, contains pointers into the _index file 
  to speed up the lookup time of files inside the _index file. 

- To create an archive, the user needs to run 
  bin/hadoop archives -archiveName foo.har inputpaths outputdir
 
  This is a map reduce job in which all the files are distributed amongst the 
  maps, which create part files of around 2GB or so. The reduce then gets the 
  startindex and size from the maps for all the files and creates the _index and 
  _masterindex. 

- Permissions are not persisted. So the permissions returned by the Har 
  filesystem are the same as those of the index files. 
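
To make the design above more concrete, here is a rough single-process Python 
sketch of the bookkeeping: files are packed into part files, each file gets a 
(partfile, startindex, size) record, the records are sorted by filename hash, 
and a small master index points into the sorted list. This is an illustration 
only. The part-N naming, the block size of the master index, and the use of a 
Java-style string hash are assumptions, and the real on-disk format of _index 
and _masterindex is not reproduced.

```python
import bisect

PART_LIMIT = 2 * 1024 ** 3  # around 2GB per part file, as described above


def java_string_hash(s):
    """Stand-in hash: mimics Java's String.hashCode for ASCII names (assumption)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def pack(files, part_limit=PART_LIMIT):
    """files: iterable of (name, size). Returns (name, partfile, start, size) records."""
    records, part_no, offset = [], 0, 0
    for name, size in files:
        # Start a new part file when this file would overflow the current one.
        if offset > 0 and offset + size > part_limit:
            part_no += 1
            offset = 0
        records.append((name, "part-%d" % part_no, offset, size))
        offset += size
    return records


def build_indexes(records, block=2):
    """Sort records by filename hash; the master keeps one pointer per `block` rows."""
    index = sorted((java_string_hash(r[0]), r) for r in records)
    master = [(index[i][0], i) for i in range(0, len(index), block)]
    return index, master


def lookup(index, master, name):
    """Use the master index to pick a starting row, then scan forward by hash."""
    if not master:
        return None
    h = java_string_hash(name)
    hashes = [m[0] for m in master]
    # Step back one block so equal hashes straddling a boundary are not missed.
    k = max(bisect.bisect_left(hashes, h) - 1, 0)
    for row_hash, record in index[master[k][1]:]:
        if row_hash > h:
            break
        if row_hash == h and record[0] == name:
            return record
    return None


records = pack([("/a", 6), ("/b", 5), ("/dir/c", 3)], part_limit=10)
index, master = build_indexes(records)
print(lookup(index, master, "/dir/c"))
```

In the actual job the packing is spread across the maps and the index 
construction happens in the reduce; the sketch collapses both into one process 
to show only the data flow.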



> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
