If you are going to mention commercial distros, you should include MapR as well. It is Hadoop-compatible, very scalable, and handles very large numbers of files in a POSIX-ish environment.
On Mon, Oct 15, 2012 at 1:35 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:

> Hi,
>
> We use HDFS to process data for the LHC - somewhat similar case here. Our
> files are a bit larger, our total local data size is ~1PB logical, and we
> "bring our own" batch system, so no Map-Reduce. We perform many random
> reads, so we are quite sensitive to underlying latency.
>
> I don't see any obvious mismatch between your requirements and HDFS
> capabilities that would let you eliminate it as a candidate without an
> evaluation. Do note that HDFS does not provide complete POSIX semantics -
> but you don't appear to need them?
>
> IMHO, if you are looking for the following requirements:
> 1) Proven petascale data store (never want to be on the bleeding edge of
> your filesystem's scaling!).
> 2) Has self-healing semantics (can recover from the loss of RAIDs or
> entire storage targets).
> 3) Open source (but do consider commercial companies - your time is worth
> something!).
>
> You end up looking at a very small number of candidates. Other
> filesystems that should be on your list:
>
> 1) Gluster. A quite viable alternative. Like HDFS, you can buy commercial
> support. I personally don't know enough to provide a pros/cons list, but
> we keep it on our radar.
> 2) Ceph. Not as proven IMHO. I don't know of multiple petascale deploys.
> Requires a quite recent kernel. Quite good on-paper design.
> 3) Lustre. I think you'd be disappointed with the self-healing. A very
> "traditional" HPC/clustered filesystem design.
>
> For us, HDFS wins. I think it has the possibility of being a winner in
> your case too.
>
> Brian
>
> On Oct 15, 2012, at 3:21 PM, Jay Vyas <jayunit...@gmail.com> wrote:
>
> Seems like a heavyweight solution unless you are actually processing the
> images?
>
> Wow, no MapReduce, no streaming writes, and relatively small files. I'm
> surprised that you are considering Hadoop at all?
>
> I'm surprised there isn't a simpler solution that uses redundancy without
> all the daemons and name nodes and task trackers and stuff.
>
> Might make it kind of awkward as a normal file system.
>
> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Hey Matt,
>>
>> What do you mean by 'real-time' though? While HDFS has pretty good
>> contiguous data read speeds (and you get N x replicas to read from),
>> if you're looking to "cache" frequently accessed files into memory
>> then HDFS does not natively have support for that. Otherwise, I agree
>> with Brock, seems like you could make it work with HDFS (sans
>> MapReduce - no need to run it if you don't need it).
>>
>> The presence of NameNode audit logging will help your file access
>> analysis requirement.
>>
>> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <m...@deity.co.nz> wrote:
>> > Hi,
>> >
>> > I am a new Hadoop user, and would really appreciate your opinions on
>> > whether Hadoop is the right tool for what I'm thinking of using it for.
>> >
>> > I am investigating options for scaling an archive of around 100TB of
>> > image data. These images are typically TIFF files of around 50-100MB
>> > each and need to be made available online in realtime. Access to the
>> > files will be sporadic and occasional, but writing the files will be a
>> > daily activity. Speed of write is not particularly important.
>> >
>> > Our previous solution was a monolithic, expensive - and very full - SAN
>> > so I am excited by Hadoop's distributed, extensible, redundant
>> > architecture.
>> >
>> > My concern is that a lot of the discussion on and use cases for Hadoop
>> > is regarding data processing with MapReduce and - from what I
>> > understand - using HDFS for the purpose of input for MapReduce jobs. My
>> > other concern is the vague indication that it's not a 'real-time' system.
>> > We may be using MapReduce in small components of the application, but
>> > it will most likely be in file access analysis rather than any
>> > processing on the files themselves.
>> >
>> > In other words, what I really want is a distributed, resilient, scalable
>> > filesystem.
>> >
>> > Is Hadoop suitable if we just use this facility, or would I be misusing
>> > it and inviting grief?
>> >
>> > M
>>
>> --
>> Harsh J
>
> --
> Jay Vyas
> MMSB/UCHC