Dmitry,

It sounds like an interesting idea, but I have not really heard of anyone doing 
it before.  It would make for a good feature to have tiered file systems all 
mapped into the same namespace, but that would be a lot of work and complexity.

The quick solution would be to know what data you want to process beforehand
and then run distcp to copy it from S3 into HDFS before launching the other
map/reduce jobs.  I don't think there is anything automatic out there.
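
As a rough sketch (assuming your data sits in an s3n:// bucket -- the bucket
and paths below are just placeholders -- and that the cluster already has the
AWS credentials set via fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey),
the copy step would look something like:

    hadoop distcp s3n://your-bucket/input/ hdfs:///data/input/

Run that as the first step of the workflow, point the map/reduce (or SGE)
jobs at HDFS, and push the results back out the same way with the arguments
reversed before you tear the cluster down.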

--Bobby Evans

On 8/29/11 4:56 PM, "Dmitry Pushkarev" <u...@stanford.edu> wrote:

Dear hadoop users,

Sorry for the off-topic question. We're slowly migrating our Hadoop cluster to
EC2, and one thing I'm trying to explore is whether we can use alternative
scheduling systems like SGE with a shared FS for non-data-intensive tasks,
since they are easier for lay users to work with.

One problem for now is how to create a shared cluster filesystem similar to
HDFS: distributed, high-performance, somewhat POSIX-compliant (symlinks and
permissions), and backed by Amazon EC2 local non-persistent storage.

The idea is to keep the original data on S3, then as needed fire up a bunch of
nodes, start the shared filesystem, quickly copy data from S3 to that FS,
run the analysis with SGE, save the results, and shut down the filesystem.
I tried things like S3FS and similar native S3 implementations, but the speed
is too poor. Currently I just have a filesystem on my master node that is
shared via NFS to all the rest, but I pretty much saturate the 1GB of
bandwidth as soon as I start more than 10 nodes.

Thank you. I'd appreciate any suggestions and links to relevant resources!


Dmitry
