Agreed on HDFS not being designed for this purpose.

Additionally, version control tools are typically geared toward managing many
small text files (e.g. source code) with many versions.  The versions are
usually stored as diffs from a base version to conserve disk space, and those
diffs are re-applied to recreate any given file version.  If you are talking
about a lot of very large files, a traditional version control system may not
be the most efficient option for a number of reasons, not the least of which
is re-applying diffs to GB-sized binary files.  I'm not saying it wouldn't
work, just that it isn't what VCSs are usually designed for.
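
Just to make the cost concrete, here is a toy sketch of the delta-chain idea
(purely illustrative -- real systems use formats like xdelta, but the
reconstruction cost has the same shape):

    # Illustrative only: a toy delta chain, not a real VCS storage format.
    # A "delta" here is just an (offset, replacement_bytes) pair.

    def apply_delta(data: bytes, delta) -> bytes:
        offset, replacement = delta
        return data[:offset] + replacement + data[offset + len(replacement):]

    def reconstruct(base: bytes, deltas) -> bytes:
        # Version N requires applying N deltas to the base.  With a
        # GB-sized binary, each apply is a full pass over the file.
        data = base
        for delta in deltas:
            data = apply_delta(data, delta)
        return data

With a long history, recreating a recent version means many full passes over
a multi-GB file, which is the cost I'm pointing at.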

Plain snapshots of your file set, tar/gzipped, would probably be easier to
manage.  Creating a new snapshot (or, for that matter, retrieving and
unpacking an old one) could be a simple pre/post step in your MR job workflow.
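
Very roughly, that pre/post step could look something like the following.
The directory names are made up for illustration; `hadoop fs -put`/`-get`
are the standard HDFS shell commands:

    import subprocess
    import tarfile
    import time

    def snapshot_and_upload(local_dir: str, hdfs_dir: str) -> None:
        # Pack the working file set into one compressed archive.
        name = "snapshot-%d.tar.gz" % int(time.time())
        with tarfile.open(name, "w:gz") as tar:
            tar.add(local_dir, arcname="fileset")
        # Push the archive into HDFS as a single large file -- the
        # access pattern HDFS is actually good at.
        subprocess.check_call(["hadoop", "fs", "-put", name, hdfs_dir])

    def fetch_and_unpack(hdfs_path: str, dest_dir: str) -> None:
        # Pull an old snapshot back down and unpack it before the job runs.
        subprocess.check_call(
            ["hadoop", "fs", "-get", hdfs_path, "snapshot.tar.gz"])
        with tarfile.open("snapshot.tar.gz", "r:gz") as tar:
            tar.extractall(dest_dir)

The timestamped archive name gives you a crude but perfectly serviceable
version history for free.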

On Tue, Nov 22, 2011 at 2:14 AM, Ted Dunning <tdunn...@maprtech.com> wrote:

> It is a bit off topic, but maprfs is closely equivalent to HDFS except
> that it provides the read-write and NFS semantics you are looking for.
>
> Trying to shoe-horn HDFS into a job that it wasn't intended to do (i.e.
> general file I/O) isn't a great idea.  Better to use what it is good for.
>
>
> On Mon, Nov 21, 2011 at 10:11 PM, Stuti Awasthi <stutiawas...@hcl.com> wrote:
>
>> This is what I also thought, and I am looking at Mountable HDFS options. I
>> need to check whether it's buggy.
>> To your other question: I am building an app which will have TBs of data
>> and require analysis. Right now I am using HDFS as the backing filesystem
>> and HBase as the database. To do analysis on the data, I use MR jobs.
>> None of this is provided by NFS. One part of this app includes document
>> management, in which I wanted to provide version control. This is the
>> use case I have. Maybe this helps. ☺
>>
>
>
