[
https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592164#action_12592164
]
mahadev edited comment on HADOOP-3307 at 4/24/08 1:36 PM:
----------------------------------------------------------------
Here is the design for the archives.
Archiving files in HDFS
- *Motivation*
The Namenode is a limited resource, and we usually end up with lots of small
files that users do not access very often. We would like to create an archiving
utility that can pack these files into archives that are semi-transparent and
usable by map/reduce.
- Why not just concatenate the files?
Concatenating files might be useful, but it is not a full-fledged solution for
archiving. Users want to keep their files as distinct files and would sometimes
like to unarchive without losing the original file layout.
- *Requirements*
- Transparent or semi-transparent usage of archives.
- Must be able to archive and unarchive in parallel.
- Mutable archives are not a requirement, but the design should not prevent
them from being implemented later.
- Compression is not a goal.
- *Archive Format*
- Conventional archive formats like tar are not convenient for parallel archive
creation.
- Here is a proposal that allows archives to be created in parallel.
The layout of an archive as a filesystem path is:
/user/mahadev/foo.har/_index*
/user/mahadev/foo.har/part-*
The index files store the filenames and their offsets within the part files; a
sketch of an index entry follows.
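As a concrete illustration, here is a minimal sketch of what an index entry
could hold. The class name and field layout are assumptions for illustration,
not the actual HAR index format.
{code:java}
// Hypothetical index entry; the field layout is illustrative only,
// not the actual HAR index format.
class HarIndexEntry {
    final String fileName; // path of the archived file inside the archive
    final String partFile; // which part-* file holds the file's bytes
    final long offset;     // byte offset of the file within that part file
    final long length;     // number of bytes belonging to the file

    HarIndexEntry(String fileName, String partFile, long offset, long length) {
        this.fileName = fileName;
        this.partFile = partFile;
        this.offset = offset;
        this.length = length;
    }
}
{code}
Since each part file and each index shard can be written independently, many
tasks can build pieces of the archive at the same time.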
- *URI Syntax*
The Har FileSystem is a client-side filesystem that is semi-transparent.
- har:<archivePath>!<fileInArchive> (similar to jar uri)
example: har:hdfs://host:port/pathinfilesystem/foo.har!path_inside_thearchive
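Below is a minimal sketch of splitting such a URI on the '!' separator,
assuming exactly the syntax proposed above; plain string handling, not the
actual parser.
{code:java}
// Split a har: URI into the archive path and the path inside the archive.
// Assumes the '!' separator proposed above; illustrative only.
String uri = "har:hdfs://host:port/pathinfilesystem/foo.har!dir/file.txt";
int bang = uri.indexOf('!');
String archivePath = uri.substring("har:".length(), bang); // hdfs://host:port/pathinfilesystem/foo.har
String fileInArchive = uri.substring(bang + 1);            // dir/file.txt
{code}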
- How will map/reduce work with this new filesystem?
No changes to map/reduce will be required to run jobs with archives as input;
a sketch follows.
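Since the archive is exposed through an ordinary FileSystem, a job should be
able to take a path inside an archive as its input path unchanged. A hedged
sketch against the mapred API, where the job class, host, and paths are
hypothetical and the final URI form may differ:
{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class HarInputExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HarInputExample.class);
        // The input lives inside the archive; no map/reduce changes needed.
        FileInputFormat.addInputPath(conf,
            new Path("har:hdfs://host:port/user/mahadev/foo.har!input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/mahadev/out"));
        JobClient.runJob(conf);
    }
}
{code}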
- How will the dfs commands work?
The dfs commands will have to specify the whole URI to operate on files inside
an archive. Archives are immutable, so renames, deletes, and creates will throw
an exception in the initial versions of archives, as sketched below.
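A minimal sketch of that immutability, assuming the standard
org.apache.hadoop.fs.FileSystem method signatures; an illustration of the
behaviour, not the actual Har filesystem code.
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.Path;

// Fragment of a FileSystem subclass backing har: URIs; every mutating
// operation simply throws. Illustrative only.
public boolean rename(Path src, Path dst) throws IOException {
    throw new IOException("Har: rename not allowed; archives are immutable");
}

public boolean delete(Path f) throws IOException {
    throw new IOException("Har: delete not allowed; archives are immutable");
}
{code}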
- How will permissions work with archives?
In the first version of HAR, all the files that are archived will lose the
permissions they initially had. In later versions of HAR, permissions can be
stored in the archive metadata, making it possible to unarchive without losing
permissions; a hypothetical sketch follows.
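As a purely hypothetical illustration of that later extension, the index entry
sketched earlier could carry each file's original permissions (FsPermission is
the existing Hadoop permission class; the entry layout itself remains an
assumption):
{code:java}
import org.apache.hadoop.fs.permission.FsPermission;

// Hypothetical later index entry: the original permission bits are captured
// at archive time so that unarchiving can restore them.
class HarIndexEntryV2 {
    String fileName;         // path of the archived file inside the archive
    FsPermission permission; // e.g. rw-r--r--, reapplied on unarchive
}
{code}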
- *Future Work*
- Transparent use of archives.
This will require changes in the Hadoop filesystem to support mounts that point
to archives, and changes to the DFSClient to transparently walk such a mount
to the real archive, allowing transparent use of archives.
Comments?
> Archives in Hadoop.
> -------------------
>
> Key: HADOOP-3307
> URL: https://issues.apache.org/jira/browse/HADOOP-3307
> Project: Hadoop Core
> Issue Type: New Feature
> Components: fs
> Reporter: Mahadev konar
> Assignee: Mahadev konar
> Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS.