Re: Mixing s3, s3n and hdfs

2009-05-08 Thread Tom White
Hi Kevin,

The s3n filesystem treats each file as a single block; however, you may
be able to split files by setting the number of mappers appropriately
(or by setting mapred.max.split.size in the new MapReduce API in 0.20.0).
S3 supports range requests, and the s3n implementation uses them, so
it won't download the entire file for each split.
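
For example, here's a rough sketch with the new API (the bucket name
and paths are made up, and I haven't run this exact code):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class S3nSplitExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Cap splits at 64 MB so that a single large S3 object is
      // divided among several mappers; each split is then fetched
      // with an HTTP range request rather than a full download.
      conf.setLong("mapred.max.split.size", 64 * 1024 * 1024);
      Job job = new Job(conf, "s3n split example");
      FileInputFormat.addInputPath(job, new Path("s3n://your-bucket/input/"));
      // ... set the mapper, reducer and output path as usual, then:
      // job.waitForCompletion(true);
    }
  }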

You don't need to run a namenode for the S3 filesystems; a namenode is
only needed for HDFS. So it is feasible to run S3 and HDFS in parallel
and copy data from one to the other.
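
As a sketch (URIs and paths made up), a cross-filesystem copy can be
done with the plain FileSystem API; hadoop distcp also works for bulk
copies:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FileUtil;
  import org.apache.hadoop.fs.Path;

  public class HdfsToS3 {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // HDFS needs the namenode; the s3n side only needs credentials
      // (fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey in the conf).
      FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
      FileSystem s3 = FileSystem.get(URI.create("s3n://your-bucket/"), conf);
      // Copy a file across filesystems without deleting the source.
      FileUtil.copy(hdfs, new Path("/data/part-00000"),
          s3, new Path("/backup/part-00000"), false, conf);
    }
  }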

Cheers,
Tom

On Fri, May 8, 2009 at 8:55 AM, Kevin Peterson wrote:
> Currently, we are running our cluster in EC2 with HDFS stored on the local
> (i.e. transient) disk. We don't want to deal with EBS, because it
> complicates spinning up additional slaves as needed. We're looking at
> moving the data we care about to s3 (block) or s3n, and leaving
> lower-value data that we can recreate on HDFS.
>
> My thinking is that s3n has significant advantages in how easy it makes
> importing data from non-Hadoop processes and sampling data, but I'm not
> sure how well it actually works. I'm guessing that it wouldn't be able
> to split files, or maybe that it would need to download the entire file
> from S3 multiple times to split it? Is the issue of writes buffering the
> entire file on the local machine significant? Our jobs tend to be more
> CPU-intensive than the usual log-processing jobs, so we usually end up
> with smaller files.
>
> Is it feasible to run s3 (block) and hdfs in parallel? Would I need two
> namenodes to do this? Is this a good idea?
>
> Has anyone tried either of these configurations in EC2?
>

