Agreed on HDFS not being designed for this purpose. Additionally, version control tools are typically geared toward managing many small text files (e.g. source code) with many versions each. The versions are usually stored as diffs from a base version to conserve disk space, and those diffs are re-applied to recreate any given file version. If you are talking about a lot of very large files, a traditional version control system may not be the most efficient option, not least because re-applying diffs to GB-sized binary files gets expensive. I'm not saying it wouldn't work, just that it really isn't what a VCS is usually designed for.
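To make that cost concrete, here is a toy sketch of delta-chain storage. The DeltaStore class is made up for illustration (no real VCS works exactly like this); the point is just that checking out version N means replaying N diffs, which is fine for small text files and painful for huge binaries.

    import difflib

    class DeltaStore:
        """Toy delta-chain store: one full base copy + one diff per revision."""
        def __init__(self, base_lines):
            self.base = base_lines            # full copy of version 0
            self.deltas = []                  # one ndiff delta per later version

        def commit(self, new_lines):
            prev = self.checkout(len(self.deltas))
            self.deltas.append(list(difflib.ndiff(prev, new_lines)))

        def checkout(self, version):
            lines = self.base
            for delta in self.deltas[:version]:
                # replay each stored diff to move forward one version
                lines = list(difflib.restore(delta, 2))
            return lines

    store = DeltaStore(["one line\n"])
    store.commit(["one line, edited\n"])
    store.commit(["one line, edited\n", "a second line\n"])
    assert store.checkout(2) == ["one line, edited\n", "a second line\n"]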
Simple snapshots of your file set, tar/gzipped, would probably be simpler to manage. Creating a new (or, for that matter, retrieving and unpacking an old) snapshot file could be a simple pre/post step in your MR job workflow; a rough sketch follows after the quoted thread below.

On Tue, Nov 22, 2011 at 2:14 AM, Ted Dunning <tdunn...@maprtech.com> wrote:

> It is a bit off topic, but maprfs is closely equivalent to HDFS except
> that it provides the read-write and NFS semantics you are looking for.
>
> Trying to shoe-horn HDFS into a job that it wasn't intended to do (i.e.
> general file I/O) isn't a great idea. Better to use what it is good for.
>
> On Mon, Nov 21, 2011 at 10:11 PM, Stuti Awasthi <stutiawas...@hcl.com> wrote:
>
>> This is what I also thought, and I am looking at Mountable HDFS options. I
>> need to check whether it's buggy.
>> To your other question: I am building an app which will have TBs of data
>> and requires analysis. Right now I am using HDFS as the backing filesystem
>> and HBase as the database. To do analysis on the data, I use MR jobs.
>> None of this is provided by NFS. One part of this app includes document
>> management, for which I wanted to provide version control. This is the
>> use case I have. Maybe this helps. ☺
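As promised above, a rough sketch of the snapshot-as-pre/post-step idea. Everything here is illustrative: the local directory, the HDFS path, and the helper names are assumptions, and it simply shells out to the standard "hadoop fs" CLI.

    import subprocess, tarfile, time

    HDFS_SNAPSHOT_DIR = "/snapshots"   # assumed HDFS location for snapshots
    LOCAL_DATA_DIR = "data"            # assumed local working file set

    def take_snapshot():
        """Tar/gzip the file set and push it to HDFS (pre-job step)."""
        name = "snapshot-%d.tar.gz" % int(time.time())
        with tarfile.open(name, "w:gz") as tar:
            tar.add(LOCAL_DATA_DIR)
        subprocess.check_call(["hadoop", "fs", "-put", name,
                               HDFS_SNAPSHOT_DIR + "/" + name])
        return name

    def restore_snapshot(name):
        """Pull an old snapshot back from HDFS and unpack it (post/recovery step)."""
        subprocess.check_call(["hadoop", "fs", "-get",
                               HDFS_SNAPSHOT_DIR + "/" + name, name])
        with tarfile.open(name, "r:gz") as tar:
            tar.extractall()

Wiring take_snapshot() in before job submission and restore_snapshot() in as a rollback step keeps each "version" as one opaque, write-once blob, which is exactly the access pattern HDFS is good at.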