Suggested S3 FileIO/Getting Started

John Clara Wed, 11 Nov 2020 13:45:23 -0800

Hello all,

Thank you all for creating/continuing this great project! I am just starting to get comfortable with the fundamentals and I'm thinking that my team has been using Iceberg the wrong way at the FileIO level.

I was wondering if people would be willing to share how they set up their FileIO/FileSystem with S3 and any customizations they had to add.

(Preferably from smaller teams. My team is small and cannot realistically customize everything. If there's an up-to-date thread discussing this that I missed, please link me to that instead.)


*** My team's specific problems/setup, which you can ignore ***

My team has been using Hadoop FileIO with the S3AFileSystem. Jars are provided by AWS EMR 5.23, which is on Hadoop 2.8.5. We use DynamoDB for atomic renames by implementing Iceberg's provided interfaces. We read/write from either Spark in EMR or on-prem JVMs in Docker containers (managed by k8s). Both use s3a, but the EMR clusters have HDFS (backed by core nodes) for the s3a buffered writes, while the on-prem containers use the Docker container's default file system, which uses an overlay2 storage driver (that I know nothing about).

Hadoop 2.8.5's S3AFileSystem makes a lot of unnecessary GET and LIST requests, which is well known in the community (but unfortunately not to my team). There are also GET PUT GET inconsistency issues with S3 that have been discussed, but I don't yet understand how they arise in the 2.8.5 S3AFileSystem (https://github.com/apache/iceberg/issues/1398).


*** End of specifics ***


The options I'm seeing are:

1. Using Iceberg's new S3 FileIO. Is anyone using this in prod?

This still seems very new, unless it is actually based on Netflix's prod implementation that they're releasing to the community? (I'm wondering if it's safe to start moving onto it in prod in the near term. If Netflix is using it (or rolling it out), that would be more than enough for my team.)

2. Using a newer Hadoop version with the S3AFileSystem. Any recommendations on a version, and are you also using S3Guard?

From a quick look, most gains compared to older versions seem to be from S3Guard. Are there substantial gains without it? (My team doesn't have experience with S3Guard and Iceberg seems to not need it outside of atomic renames?)


3. Using an alternative Hadoop file system. Any recommendations?

For the recent Iceberg S3 FileIO, the LICENSE states it was based off the Presto FileSystem. Has anyone used this file system as is with Iceberg? (https://github.com/apache/iceberg/blob/master/LICENSE#L251)

4. Roll our own Hadoop file system. Anyone have stories/blogs about pitfalls or difficulties?

rdblue hints that Netflix has already done this: https://github.com/apache/iceberg/issues/1398#issuecomment-682837392 (My team probably doesn't have the capacity for this.)



Places where I tried looking for this info:

 * https://github.com/apache/iceberg/issues/761 (issue for getting
   started guide)
 * https://iceberg.apache.org/spec/#file-system-operations

Thanks everyone,

John Clara
