Hello all,
Thank you all for creating/continuing this great project! I am just
starting to get comfortable with the fundamentals and I'm thinking that
my team has been using Iceberg the wrong way at the FileIO level.
I was wondering if people would be willing to share how they set up
their FileIO/FileSystem with S3 and any customizations they had to add.
(Preferably from smaller teams. My team is small and cannot
realistically customize everything. If there's an up-to-date thread
discussing this that I missed, please link me to that instead.)
*** My team's specific problems/setup, which you can ignore ***
My team has been using Hadoop FileIO with the S3AFileSystem. Jars are
provided by AWS EMR 5.23 which is on Hadoop 2.8.5. We use DynamoDB for
atomic renames by implementing Iceberg's provided interfaces. We
read/write from either Spark on EMR or on-prem JVMs in Docker
containers (managed by k8s). Both use s3a, but the EMR clusters have
HDFS (backed by core nodes) for the s3a buffered writes while the
on-prem containers use the docker container's default file system which
uses an overlay2 storage driver (that I know nothing about).
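For reference, a minimal sketch of the s3a settings that control where
buffered writes go, rendered as spark-submit flags. The property names
are standard Hadoop 2.8 s3a keys and the "spark.hadoop." prefix is how
Spark forwards them to Hadoop; the helper name and the buffer path are
hypothetical placeholders, not my team's actual values.

```python
# Sketch only: the buffer directory below is an assumed placeholder path.
S3A_BUFFER_PROPS = {
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.fast.upload": "true",          # enables block-based uploads
    "fs.s3a.fast.upload.buffer": "disk",   # other values: array, bytebuffer
    "fs.s3a.buffer.dir": "/tmp/s3a-buffer",  # hypothetical local directory
}


def spark_conf_flags(hadoop_props):
    """Render Hadoop properties as spark-submit --conf arguments.

    Hypothetical helper: Spark copies any key prefixed with
    "spark.hadoop." into the Hadoop Configuration it creates.
    """
    flags = []
    for key in sorted(hadoop_props):
        flags.append("--conf")
        flags.append("spark.hadoop.%s=%s" % (key, hadoop_props[key]))
    return flags


print(" ".join(spark_conf_flags(S3A_BUFFER_PROPS)))
```

On the on-prem containers, the same keys could be set directly on the
Hadoop Configuration instead of going through Spark.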
Hadoop 2.8.5's S3AFileSystem issues many unnecessary GET and LIST
requests, which is well known in the community (but unfortunately was
not known to my team). There are also GET-PUT-GET inconsistency issues
with S3 that have been discussed, but I don't yet understand how they
arise in the 2.8.5 S3AFileSystem
(https://github.com/apache/iceberg/issues/1398).
*** End of specifics ***
The options I'm seeing are:
1. Using Iceberg's new S3 FileIO. Is anyone using this in prod?
This still seems very new unless it is actually based on Netflix's
prod implementation that they're releasing to the community? (I'm
wondering if it's safe to start moving onto it in prod in the near term.
If Netflix is using it (or rolling it out) that would be more than
enough for my team.)
2. Using a newer Hadoop version with the S3AFileSystem. Any
recommendations on a version, and are you also using S3Guard?
From a quick look, most gains compared to older versions seem to be
from S3Guard. Are there substantial gains without it? (My team doesn't
have experience with S3Guard and Iceberg seems to not need it outside of
atomic renames?)
3. Using an alternative Hadoop file system. Any recommendations?
For the recent Iceberg S3 FileIO, the LICENSE states that it was based
on the Presto FileSystem. Has anyone used this file system as is with
Iceberg? (https://github.com/apache/iceberg/blob/master/LICENSE#L251)
4. Rolling our own Hadoop file system. Does anyone have stories/blogs
about pitfalls or difficulties?
rdblue hints that Netflix has already done this:
https://github.com/apache/iceberg/issues/1398#issuecomment-682837392 .
(My team probably doesn't have the capacity for this)
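To make option 1 concrete, here is a minimal sketch of what switching
to the S3 FileIO might look like in Spark, assuming an Iceberg release
and catalog that honor the "io-impl" catalog property and that have the
iceberg-aws module and AWS SDK on the classpath. The catalog name
"my_catalog" and the helper names are hypothetical, and other catalog
settings (type, warehouse, etc.) are left out.

```python
# Sketch only: "my_catalog" is a hypothetical catalog name.
CATALOG_NAME = "my_catalog"


def s3_fileio_catalog_conf(catalog_name):
    """Build Spark properties that point an Iceberg catalog at S3FileIO."""
    prefix = "spark.sql.catalog.%s" % catalog_name
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Replaces the default Hadoop-based FileIO for this catalog.
        prefix + ".io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


def to_spark_defaults(conf):
    """Render properties as spark-defaults.conf lines ("key value")."""
    return "\n".join("%s %s" % (key, conf[key]) for key in sorted(conf))


print(to_spark_defaults(s3_fileio_catalog_conf(CATALOG_NAME)))
```

With this approach the table locations would use s3:// paths handled by
the FileIO directly, so the s3a settings would no longer apply to
Iceberg data and metadata files.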
Places where I tried looking for this info:
* https://github.com/apache/iceberg/issues/761 (issue for getting
started guide)
* https://iceberg.apache.org/spec/#file-system-operations
Thanks everyone,
John Clara