Hi Kevin,
The s3n filesystem treats each file as a single block; however, you may
be able to split files by setting the number of mappers appropriately
(or by setting mapred.max.split.size in the new MapReduce API in
0.20.0). S3 supports range requests, and the s3n implementation uses
them, so it won't download the entire file for each split.
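For example, assuming your job uses ToolRunner so it accepts generic
options, something like the following would cap splits at 32 MB (the
jar, class, and bucket names are placeholders):

  hadoop jar myjob.jar MyJob -D mapred.max.split.size=33554432 \
      s3n://mybucket/input s3n://mybucket/output
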
You don't need to run a namenode for S3 filesystems; one is only
needed for HDFS. So it is feasible to run S3 and HDFS in parallel and
copy data from one to the other.
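For example, with your AWS credentials set via fs.s3n.awsAccessKeyId
and fs.s3n.awsSecretAccessKey in your configuration, a distcp along
these lines (the host and paths are placeholders) copies between the
two filesystems:

  hadoop distcp hdfs://namenode:9000/data/important s3n://mybucket/data/important
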
Cheers,
Tom
On Fri, May 8, 2009 at 8:55 AM, Kevin Peterson wrote:
> Currently, we are running our cluster in EC2, with HDFS stored on the local
> (i.e. transient) disk. We don't want to deal with EBS, because it
> complicates being able to spin up additional slaves as needed. We're looking
> at moving the data we care about to s3 (block) or s3n, and leaving
> lower-value data that we can recreate on HDFS.
>
> My thinking is that s3n has significant advantages in how easily data
> can be imported from non-Hadoop processes and sampled, but I'm not sure
> how well it actually works. I'm guessing that it wouldn't be able to
> split files, or maybe it would need to download the entire file from S3
> multiple times to split it? Is the issue of writes buffering the entire
> file on the local machine significant? Our jobs tend to be more
> CPU-intensive than the usual log-processing jobs, so we usually end up
> with smaller files.
>
> Is it feasible to run s3 (block) and HDFS in parallel? Would I need two
> namenodes to do this? Is this a good idea?
>
> Has anyone tried either of these configurations in EC2?
>