Hello,

I'm working on a project that moves data from HDFS file systems into S3 for
analysis with Hive on EMR. Recently I've become quite confused about the
state of play regarding the different S3 FileSystems: s3, s3n, and s3a.
For my
use case I require the following:

   - Support for the transfer of very large files.
   - MD5 checks on copy operations to provide data verification.
   - Excellent compatibility within an EMR/Hive environment.

To move data between clusters, it would seem that current versions of the
NativeS3FileSystem are my best bet; it appears that only s3n provides MD5
checking
<https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L120>.
It is often cited that s3n does not support files over 5GB, but I can find
no indication of such a limitation in the source code; in fact, I see that
it switches over to multipart upload for larger files
<https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.java#L130>.
So, has this limitation been removed in s3n?
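
For reference, the kind of transfer I have in mind looks roughly like the
sketch below; the multipart property names are just what I believe I see
in the 2.7.1 source, and the credentials, bucket, and paths are
placeholders:

    # enable multipart uploads on s3n so large files are uploaded in parts
    hadoop distcp \
      -Dfs.s3n.awsAccessKeyId=... \
      -Dfs.s3n.awsSecretAccessKey=... \
      -Dfs.s3n.multipart.uploads.enabled=true \
      -Dfs.s3n.multipart.uploads.block.size=67108864 \
      hdfs:///warehouse/my_table \
      s3n://my-bucket/warehouse/my_table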

Within EMR, Amazon appears to recommend s3, support s3n, and advise
against s3a
<http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-file-systems.html>.
So once again s3n would appear to win out here too? I assume that the s3n
implementation available in EMR is different from that in Apache Hadoop? I
find it hard to imagine that AWS would use JetS3t instead of their own AWS
Java client, but perhaps they do?

Finally, could I use NativeS3FileSystem to perform the actual transfer on
my Apache Hadoop cluster but then rewrite the table locations in my EMR
Hive metastore to use the s3:// protocol prefix? Could that work?
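
In case it's useful to see what I mean, I'm picturing something along
these lines being run against the EMR metastore after the copy completes
(the table, bucket, and partition values are made up):

    -- repoint the table at the copied data using the s3:// scheme
    ALTER TABLE my_table SET LOCATION 's3://my-bucket/warehouse/my_table';

    -- and, for partitioned tables, each partition individually
    ALTER TABLE my_table PARTITION (dt='2015-09-01')
      SET LOCATION 's3://my-bucket/warehouse/my_table/dt=2015-09-01';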

I'd appreciate any light that can be shed on these questions, and any
advice on my reasoning behind the proposal to use s3n for this particular
use case.

Thanks,

Elliot.
