Hello all,

Thank you all for creating/continuing this great project! I am just starting to get comfortable with the fundamentals and I'm thinking that my team has been using Iceberg the wrong way at the FileIO level.

I was wondering if people would be willing to share how they set up their FileIO/FileSystem with S3 and any customizations they had to add.

    (Preferably from smaller teams. My team is small and cannot realistically customize everything. If there's an up-to-date thread discussing this that I missed, please link it instead.)

*** My team's specific problems/setup (which you can ignore) ***

My team has been using Hadoop FileIO with the S3AFileSystem. Jars are provided by AWS EMR 5.23, which is on Hadoop 2.8.5. We use DynamoDB for atomic renames by implementing Iceberg's provided interfaces. We read/write from either Spark in EMR or on-prem JVMs in Docker containers (managed by k8s). Both use s3a, but the EMR clusters have HDFS (backed by core nodes) for the s3a buffered writes, while the on-prem containers use the Docker container's default file system, which uses an overlay2 storage driver (which I know nothing about).

Hadoop 2.8.5's S3AFileSystem makes a lot of unnecessary GET and LIST requests, which is well known in the community (but unfortunately was not known to my team). There are also GET PUT GET inconsistency issues with S3 that have been discussed, but I don't yet understand how they arise in the 2.8.5 S3AFileSystem (https://github.com/apache/iceberg/issues/1398).
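
My current (possibly wrong) mental model of the GET PUT GET problem, written as a toy Python simulation. The ToyS3 class and its methods are made up for illustration and are not S3 or S3A code. The assumptions are that S3 can cache a "not found" answer for a key that is read before it is written, and that the S3AFileSystem triggers that first read with an existence check before it writes the file:

```python
class ToyS3:
    """Toy model of the GET PUT GET issue; hypothetical, not real S3 code."""

    def __init__(self):
        self._objects = {}             # key -> bytes actually stored
        self._cached_missing = set()   # keys with a cached "not found" answer

    def get(self, key):
        # A cached "not found" is served even if the object now exists.
        if key in self._cached_missing:
            return None
        if key not in self._objects:
            self._cached_missing.add(key)  # remember the miss
            return None
        return self._objects[key]

    def put(self, key, data):
        # The write succeeds, but the cached "not found" is not invalidated.
        self._objects[key] = data

    def expire_cache(self):
        # Stands in for the cached "not found" eventually timing out.
        self._cached_missing.clear()


s3 = ToyS3()
first = s3.get("metadata/v2.json")    # GET before the object exists
s3.put("metadata/v2.json", b"{}")     # PUT succeeds
second = s3.get("metadata/v2.json")   # GET right after the PUT
s3.expire_cache()
third = s3.get("metadata/v2.json")    # GET after the cached miss expires
print(first, second, third)           # None None b'{}'
```

If this model is right, a plain PUT followed by a GET would be fine, and it is only the extra GET in front that causes trouble. Corrections welcome.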

*** End of specific ***


The options I'm seeing are:

1. Using Iceberg's new S3 FileIO. Is anyone using this in prod?

    This still seems very new unless it is actually based on Netflix's prod implementation that they're releasing to the community? (I'm wondering if it's safe to start moving onto it in prod in the near term. If Netflix is using it (or rolling it out) that would be more than enough for my team.)

2. Using a newer Hadoop version with the S3AFileSystem. Any recommendations on a version, and are you also using S3Guard?

    From a quick look, most gains compared to older versions seem to be from S3Guard. Are there substantial gains without it? (My team doesn't have experience with S3Guard and Iceberg seems to not need it outside of atomic renames?)

3. Using an alternative Hadoop file system. Any recommendations?

    In the recent Iceberg S3 FileIO, the LICENSE file states it was based on the Presto FileSystem. Has anyone used this file system as is with Iceberg? (https://github.com/apache/iceberg/blob/master/LICENSE#L251)

4. Rolling our own Hadoop file system. Does anyone have stories/blogs about pitfalls or difficulties?

    rdblue hints that Netflix has already done this: https://github.com/apache/iceberg/issues/1398#issuecomment-682837392 . (My team probably doesn't have the capacity for this.)
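
For option 1, my understanding is that the switch is mostly configuration when tables are loaded through an Iceberg catalog. A minimal sketch in Python, assuming a Spark catalog named my_catalog (a placeholder) and an Iceberg version that ships org.apache.iceberg.aws.s3.S3FileIO and honors the io-impl catalog property. The helper name is made up, and other required catalog properties (catalog type, warehouse location) are left out. Our DynamoDB-based setup would probably need the equivalent change inside our own table operations instead:

```python
def s3_fileio_spark_conf(catalog_name: str) -> dict:
    """Build Spark properties asking an Iceberg catalog to use S3FileIO.

    Hypothetical helper; catalog_name is a placeholder, not our real catalog.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Register the catalog with Spark.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Ask the catalog to use S3FileIO instead of the Hadoop FileIO.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


# Print the properties in spark-submit form.
for key, value in s3_fileio_spark_conf("my_catalog").items():
    print(f"--conf {key}={value}")
```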


Places where I tried looking for this info:

 * https://github.com/apache/iceberg/issues/761 (issue for getting
   started guide)
 * https://iceberg.apache.org/spec/#file-system-operations

Thanks everyone,

John Clara
