Thanks a lot Ryan for your explanation, greatly helpful. Best regards, Saisai
Ryan Blue <rb...@netflix.com.invalid> 于2020年11月14日周六 上午2:03写道: > Saisai, > > Iceberg's FileIO interface doesn't require guarantees as strict as a > Hadoop-compatible FileSystem. For S3, that allows us to avoid negative > caching that can cause problems when reading a table that has just been > updated. (Specifically, S3A performs a HEAD request to check whether a path > is a directory even for overwrites.) > > It's up to users whether they want to use S3FileIO or S3A with > HadoopFileIO, or even HadoopFileIO with a custom FileSystem. We are working > to make it easy to switch between them, but there are notable problems like > ORC not using the FileIO interface yet. I would recommend using S3FileIO to > avoid negative caching, but S3A with S3Guard may be another way to avoid > the problem. I think the default should be to use S3FileIO if it is in the > classpath, and fall back to a Hadoop FileSystem if it isn't. > > For cloud providers, I think the question is whether there is a need for > the relaxed guarantees. If there is no need, then using HadoopFileIO and a > FileSystem works great and is probably the easiest way to maintain an > implementation for the object store. > > On Thu, Nov 12, 2020 at 7:31 PM Saisai Shao <sai.sai.s...@gmail.com> > wrote: > >> Hi all, >> >> Sorry to chime in, I also have a same concern about using Iceberg with >> Object storage. >> >> One of my concerns with S3FileIO is getting tied too much to a single >>> cloud provider. I'm wondering if an ObjectStoreFileIO would be helpful >>> so that S3FileIO and (a future) GCSFileIO could share logic? I haven't >>> looked deep enough into the S3FileIO to know how much logic is not s3 >>> specific. Maybe the FileIO interface is enough. >>> >> >> Now we have a S3 specific FileIO implementation, is it recommended to use >> this one instead of s3a like HCFS implementation? Also each public cloud >> provider has its own HCFS implementation for its own object storage. Are we >> going to suggest to create a specific FileIO implementation or use the >> existing HCFS implementation? >> >> Thanks >> Saisai >> >> >> Daniel Weeks <dwe...@netflix.com.invalid> 于2020年11月13日周五 上午1:09写道: >> >>> Hey John, about the concerns around cloud provider dependency, I feel >>> like the FileIO interface is actually the right level of abstraction >>> already. >>> >>> That interface basically requires "open for read" and "open for write", >>> where the implementation will diverge across different platforms. >>> >>> I guess you could think of it as S3FileIO is to FileIO what >>> S3AFileSystem is to FileSystem (in Hadoop). You can have many different >>> implementations that coexist. >>> >>> In fact, recent changes to the Catalog allow for very flexible >>> management of FIleIO and you could even have files within a table split >>> across multiple cloud vendors. >>> >>> As to the consistency questions, the list operation can be inconsistent >>> (e.g. if a new file is created and the implementation relies on list then >>> read, it may not see newly created objects. Iceberg does not list, so that >>> should not be an issue). >>> >>> The stated read-after-write consistency is limited and does not include: >>> - Read after overwrite >>> - Read after delete >>> - Read after negative cache (e.g. a GET or HEAD that occurred before >>> the object was created). >>> >>> Some of those inconsistencies have caused problems in certain cases when >>> it comes to committing data (the negative cache being the main culprit). >>> >>> -Dan >>> >>> >>> On Wed, Nov 11, 2020 at 6:49 PM John Clara <john.anthony.cl...@gmail.com> >>> wrote: >>> >>>> Update: I think I'm wrong about the listing part. I think it will only >>>> do the HEAD request. Also it seems like the consistency issue is >>>> probably not something my team would encounter with our current jobs. >>>> >>>> On 2020/11/12 02:17:10, John Clara <j...@gmail.com> wrote: >>>> > (Not sure if this is actually replying or just starting a new >>>> thread)> >>>> > >>>> > Hi Daniel,> >>>> > >>>> > Thanks for the response! It's very helpful and answers a lot my >>>> questions.> >>>> > >>>> > A couple follow ups:> >>>> > >>>> > One of my concerns with S3FileIO is getting tied too much to a >>>> single > >>>> > cloud provider. I'm wondering if an ObjectStoreFileIO would be >>>> helpful > >>>> > so that S3FileIO and (a future) GCSFileIO could share logic? I >>>> haven't > >>>> > looked deep enough into the S3FileIO to know how much logic is not >>>> s3 > >>>> > specific. Maybe the FileIO interface is enough.> >>>> > >>>> > About consistency (no need to respond here):> >>>> > I'm seeing that during "getFileStatus" my version of s3a does some >>>> list > >>>> > requests (but I'm not sure if that could fail from consistency >>>> issues).> >>>> > I'm also confused about the read-after-(initial) write part:> >>>> > "Amazon S3 provides read-after-write consistency for PUTS of new >>>> objects > >>>> > in your S3 bucket in all Regions with one caveat. The caveat is that >>>> if > >>>> > you make a HEAD or GET request to a key name before the object is > >>>> > created, then create the object shortly after that, a subsequent GET >>>> > >>>> > might not return the object due to eventual consistency. - > >>>> > https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html"> >>>> > >>>> > When my version of s3a does a create, it first does a >>>> getMetadataRequest > >>>> > (HEAD) to check if the object exists before creating the object. I >>>> think > >>>> > this is talked about in this issue: > >>>> > https://github.com/apache/iceberg/issues/1398 and talked about in >>>> the > >>>> > S3FileIO PR: https://github.com/apache/iceberg/pull/1573. I'll >>>> follow >>>> up > >>>> > in that issue for more info.> >>>> > >>>> > John> >>>> > >>>> > >>>> > On 2020/11/12 00:36:10, Daniel Weeks <d....@netflix.com.INVALID> >>>> wrote:> >>>> > > Hey John, I might be able to help answer some of your questions >>>> and > >>>> > provide>> >>>> > > some context around how you might want to go forward.>> >>>> > >> >>>> > > So, one fundamental aspect of Iceberg is that it only relies on a >>>> few>> >>>> > > operations (as defined by the FileIO interface). This makes much >>>> of >>>> the>> >>>> > > functionality and complexity of full file system implementations>> >>>> > > unnecessary. You should not need features like S3Guard or >>>> additional S3>> >>>> > > operations these implementations rely on in order to achieve file > >>>> > system>> >>>> > > contract behavior. Consistency issues should also not be a problem >>>> > >>>> > since>> >>>> > > Iceberg does not overwrite or list and read-after-(initial)write >>>> is >>>> a>> >>>> > > guarantee provided by S3.>> >>>> > >> >>>> > > At Netflix, we use a custom FileSystem implementation (somewhat >>>> like > >>>> > S3A),>> >>>> > > but with much of the contract behavior that drives additional > >>>> > operations>> >>>> > > against S3 disabled. However, we are transitioning to a more >>>> native>> >>>> > > implementation of S3FileIO, which you'll see as part of the >>>> ongoing > >>>> > work in>> >>>> > > Iceberg.>> >>>> > >> >>>> > > Per your specific questions:>> >>>> > >> >>>> > > 1) The S3FileIO implementation is very new, though internally we >>>> have>> >>>> > > something very similar. There are features missing that we are > >>>> > working to>> >>>> > > add (e.g. progressive multipart upload for large files is likely >>>> the > >>>> > most>> >>>> > > important).>> >>>> > > 2) You can use S3AFileSystem with the HadoopFileIO implementation, >>>> > >>>> > though>> >>>> > > you may still see similar behavior with additional calls being >>>> made >>>> (I>> >>>> > > don't know if these can be disabled).>> >>>> > > 3) The PrestoS3FileSystem is tailored to Presto's use and is >>>> likely > >>>> > not as>> >>>> > > complete as S3A, but seeing as it is using the Hadoop FileSystem >>>> api, > >>>> > it>> >>>> > > would likely work for what HadoopFileIO exercises (as would the>> >>>> > > EMRFileSystem).>> >>>> > > 4) I would probably discourage you from writing your own file >>>> system > >>>> > as the>> >>>> > > S3FileIO will likely be a more optimized implementation for what > >>>> > Iceberg>> >>>> > > needs.>> >>>> > >> >>>> > > If you want to contribute or have time to help contribute to > >>>> > S3FileIO, that>> >>>> > > is the path I would recommend. As for configuration, I would say a >>>> > >>>> > lot of>> >>>> > > it comes down to how to configure the AWS S3 Client that you >>>> provide > >>>> > to the>> >>>> > > S3FileIO implementation, but a lot of the defaults are reasonable >>>> (you>> >>>> > > might want to tweak a few like max connections and maybe the retry >>>> > >>>> > policy).>> >>>> > >> >>>> > > The recently committed work to dynamically load your FileIO should >>>> > >>>> > make it>> >>>> > > relatively easy to test out and we'd love to have extra eyes and > >>>> > feedback>> >>>> > > on it.>> >>>> > >> >>>> > > Let me know if that helps,>> >>>> > > -Dan>> >>>> > >> >>>> > >> >>>> > >> >>>> > > On Wed, Nov 11, 2020 at 1:45 PM John Clara <jo...@gmail.com>>> >>>> > > wrote:>> >>>> > >> >>>> > > > Hello all,>> >>>> > > >>> >>>> > > > Thank you all for creating/continuing this great project! I am >>>> just>> >>>> > > > starting to get comfortable with the fundamentals and I'm >>>> thinking > >>>> > that my>> >>>> > > > team has been using Iceberg the wrong way at the FileIO level.>> >>>> > > >>> >>>> > > > I was wondering if people would be willing to share how they set >>>> up > >>>> > their>> >>>> > > > FileIO/FileSystem with S3 and any customizations they had to >>>> add.>> >>>> > > >>> >>>> > > > (Preferably from smaller teams. My team is small and cannot>> >>>> > > > realistically customize everything. If there's an up to date >>>> thread>> >>>> > > > discussing this that I missed, please link me that instead.)>> >>>> > > >>> >>>> > > > ***** My team's specific problems/setup which you can ignore >>>> ***>> >>>> > > >>> >>>> > > > My team has been using Hadoop FileIO with the S3AFileSystem. >>>> Jars >>>> are>> >>>> > > > provided by AWS EMR 5.23 which is on Hadoop 2.8.5. We use >>>> DynamoDB > >>>> > for>> >>>> > > > atomic renames by implementing Iceberg's provided interfaces. We >>>> > >>>> > read/write>> >>>> > > > from either Spark in EMR or on-prem JVM's in docker containers > >>>> > (managed by>> >>>> > > > k8s). Both use s3a, but the EMR clusters have HDFS (backed by >>>> core > >>>> > nodes)>> >>>> > > > for the s3a buffered writes while the on-prem containers use the >>>> > >>>> > docker>> >>>> > > > container's default file system which uses an overlay2 storage > >>>> > driver (that>> >>>> > > > I know nothing about).>> >>>> > > >>> >>>> > > > Hadoop 2.8.5's S3AFileSystem does a bunch of unnecessary get and >>>> list>> >>>> > > > requests which is well known in the community (but not to my >>>> team>> >>>> > > > unfortunately). There's also GET PUT GET inconsistency issues >>>> with > >>>> > S3 that>> >>>> > > > have been talked about, but I don't yet understand how they >>>> arise > >>>> > in the>> >>>> > > > 2.8.5 S3AFilesystem >>>> (https://github.com/apache/iceberg/issues/1398).>> >>>> > > >>> >>>> > > > *** End of specific ***>> >>>> > > >>> >>>> > > >>> >>>> > > > The options I'm seeing are:>> >>>> > > >>> >>>> > > > 1. Using Iceberg's new S3 FileIO. Is anyone using this in prod?>> >>>> > > >>> >>>> > > > This still seems very new unless it is actually based on >>>> Netflix's>> >>>> > > > prod implementation that they're releasing to the community? >>>> (I'm > >>>> > wondering>> >>>> > > > if it's safe to start moving onto it in prod in the near term. >>>> If > >>>> > Netflix>> >>>> > > > is using it (or rolling it out) that would be more than enough >>>> for > >>>> > my team.)>> >>>> > > >>> >>>> > > > 2. Using a newer hadoop version and use the S3AFileSystem. Any>> >>>> > > > recommendations on a version and are you also using S3Guard?>> >>>> > > >>> >>>> > > > From a quick look, most gains compared to older versions seem to >>>> be>> >>>> > > > from S3Guard. Are there substantial gains without it? (My team > >>>> > doesn't have>> >>>> > > > experience with S3Guard and Iceberg seems to not need it outside >>>> of > >>>> > atomic>> >>>> > > > renames?)>> >>>> > > >>> >>>> > > > 3. Using an alternative hadoop file system. Any >>>> recommendations?>> >>>> > > >>> >>>> > > > In the recent Iceberg S3 FileIO, the License states it was based >>>> off>> >>>> > > > the Presto FileSystem. Has anyone used this file system as is >>>> with > >>>> > Iceberg?>> >>>> > > > (https://github.com/apache/iceberg/blob/master/LICENSE#L251)>> >>>> > > >>> >>>> > > > 4. Roll our own hadoop file system. Anyone have stories/blogs >>>> about>> >>>> > > > pitfalls or difficulties?>> >>>> > > >>> >>>> > > > rdblue hints that Netflix already done this:>> >>>> > > > > >>>> > https://github.com/apache/iceberg/issues/1398#issuecomment-682837392 >>>> .>> >>>> > > > (My team probably doesn't have the capacity for this)>> >>>> > > >>> >>>> > > >>> >>>> > > > Places where I tried looking for this info:>> >>>> > > >>> >>>> > > > - https://github.com/apache/iceberg/issues/761 (issue for >>>> getting>> >>>> > > > started guide)>> >>>> > > > - https://iceberg.apache.org/spec/#file-system-operations>> >>>> > > >>> >>>> > > > Thanks everyone,>> >>>> > > >>> >>>> > > > John Clara>> >>>> > > >>> >>>> > >> >>>> > >>>> >>> > > -- > Ryan Blue > Software Engineer > Netflix >