Hi Jeff, what is the difference between this approach and what can be 
accomplished by using a connector based on the Hadoop FileSystem interface to 
talk to S3? Is it because of the consistency limitations of s3a:// 
(https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html)?

As you probably know, for Azure we went with the abfss:// connector provided as 
part of hadoop-azure 
(https://hadoop.apache.org/docs/current/hadoop-azure/abfs.html), which took 
minimal effort. Just wondering what the key difference is here for S3.

Thanks!

Arvind.

-----Original Message-----
From: Jeff Kubina <jeff.kub...@gmail.com> 
Sent: Tuesday, July 27, 2021 10:16 AM
To: dev@accumulo.apache.org
Subject: [EXTERNAL] Accumulo with Native S3 Support

All,

Some of AWS's back end services use a version of Accumulo modified to use 
Amazon's S3 as its storage system. Amazon engineers forked Accumulo 2.0 and 
merged that S3 support into it 
<https://github.com/cmilbert/accumulo/>.
Chris Milbert is the lead Amazon engineer who did the integration. Chris and I 
would like to jump start the conversation about how best to initiate the pull 
request for these changes into Accumulo 2.1.

Mike Wall suggested using this as an opportunity to abstract out the storage 
system of Accumulo and make it pluggable. He suggested the following broad 
steps:

   1. Identify everything HDFS provides, such as read, write,
   replication and failover.
   2. Abstract out a file system interface with hooks for all of those
   things (one that does not require loading Hadoop jars).
   3. Plug in HDFS as the default implementation of that interface, hiding
   all Hadoop jars there.
   4. Make another implementation that plugs in S3 and make it optionally
   configurable.
   5. Run tests to make sure we didn't break things with HDFS.
   6. Run tests to see if S3 meets all the requirements.
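
To make steps 2 through 4 a little more concrete, here is a rough sketch of what a pluggable storage interface could look like. It is not taken from the S3 fork or from Accumulo itself: the names VolumeStore, InMemoryStore and VolumeStores are hypothetical, and the in-memory class only stands in for a real HDFS or S3 implementation so that the example runs without any Hadoop or AWS jars.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical storage abstraction (step 2); not an Accumulo or Hadoop API. */
interface VolumeStore {
    byte[] read(String path) throws IOException;
    void write(String path, byte[] data) throws IOException;
    boolean exists(String path) throws IOException;
    void delete(String path) throws IOException;
}

/** In-memory stand-in for a real backend such as HDFS or S3 (assumption). */
final class InMemoryStore implements VolumeStore {
    private final Map<String, byte[]> files = new ConcurrentHashMap<>();

    @Override
    public byte[] read(String path) throws IOException {
        byte[] data = files.get(path);
        if (data == null) {
            throw new FileNotFoundException(path);
        }
        return data.clone();
    }

    @Override
    public void write(String path, byte[] data) {
        files.put(path, data.clone());
    }

    @Override
    public boolean exists(String path) {
        return files.containsKey(path);
    }

    @Override
    public void delete(String path) {
        files.remove(path);
    }
}

/** Hypothetical registry that picks an implementation by URI scheme (steps 3 and 4). */
final class VolumeStores {
    private static final Map<String, Supplier<VolumeStore>> FACTORIES =
            new ConcurrentHashMap<>();

    private VolumeStores() {}

    static void register(String scheme, Supplier<VolumeStore> factory) {
        FACTORIES.put(scheme, factory);
    }

    static VolumeStore forUri(String uri) {
        String scheme = URI.create(uri).getScheme();
        Supplier<VolumeStore> factory = scheme == null ? null : FACTORIES.get(scheme);
        if (factory == null) {
            throw new IllegalArgumentException("No store registered for: " + uri);
        }
        return factory.get();
    }
}

public class PluggableStorageSketch {
    public static void main(String[] args) throws IOException {
        // A real build would register an HDFS-backed and an S3-backed class here.
        VolumeStores.register("hdfs", InMemoryStore::new);
        VolumeStores.register("s3", InMemoryStore::new);

        VolumeStore store = VolumeStores.forUri("s3://bucket/accumulo");
        store.write("tables/1/f.rf", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(store.read("tables/1/f.rf"), StandardCharsets.UTF_8));
    }
}
```

Keeping the interface free of Hadoop types is what would let the Hadoop jars stay hidden inside the HDFS implementation. The sketch leaves out replication and failover from step 1, which would need their own hooks.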

Ed Coleman also suggested first forking Accumulo 2.1 and merging the S3 changes 
into it.

Chris and I look forward to the discussion on how best to add S3 support to 
Accumulo.

Thanks,
Jeff
--
Jeff Kubina
