+1 on deprecating now and removing them from the codebase with Iceberg 2.0.

On Thu, Jul 18, 2024 at 10:40 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> +1 on deprecating the `File System Tables` from the spec and
> `HadoopCatalog`, `HadoopTableOperations` in code for now,
> and removing them permanently in the 2.0 release.
>
> For testing we can use `InMemoryCatalog` as others mentioned.
>
> I am not sure about moving them to test or keeping them only for HDFS,
> because it would confuse existing users of the Hadoop catalog.
>
> I wanted to have it deprecated 2 years ago
> <https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1647950504955309>
> and I remember that we discussed it in a sync at that time and left it
> as it was.
> Also, when a user brought this up in Slack
> <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720075009593789?thread_ts=1719993403.208859&cid=C03LG1D563F>
> recently about lockmanager and refactoring the HadoopTableOperations,
> I asked them to open this discussion on the mailing list, so that we can
> conclude it once and for all.
>
> - Ajantha
>
> On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <fo...@apache.org>
> wrote:
>
>> Hey Ryan and others,
>>
>> Thanks for bringing this up. I would be in favor of removing the
>> HadoopTableOperations, mostly because of the reasons that you already
>> mentioned, but also because it is not fully in line with the
>> first principles of Iceberg (being object store native) as it uses
>> file-listing.
>>
>> I think we should deprecate the HadoopTables to get the attention of
>> their users. I would be reluctant to move it to test just to use it for
>> testing purposes; I'd rather remove it and replace its use in tests with
>> the InMemoryCatalog.
>>
>> Regarding the StaticTable, this is an easy way to have a read-only table
>> by directly pointing to the metadata. This also lives in Java under
>> StaticTableOperations
>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>.
>> It isn't a full-blown catalog where you can list {tables,schemas},
>> update tables, etc.
>> As ZENOTME pointed out already, it is all up to the user; for example,
>> there is no listing of directories to determine which tables are in
>> the catalog.
>>
>>> is there a probability that the strategy used by HadoopCatalog is not
>>> compatible with the table managed by other catalogs?
>>
>> Yes, so they are different. You can see in the spec that the section on
>> File System tables
>> <https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables>
>> is used by the HadoopTable implementation, whereas the other catalogs
>> follow the Metastore Tables
>> <https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables>
>> section.
>>
>> Kind regards,
>> Fokko
>>
>> On Thu, Jul 18, 2024 at 07:19, NOTME ZE <st810918...@gmail.com> wrote:
>>
>>> According to our requirements, this function is for some users who want
>>> to read Iceberg tables without relying on any catalog. I think the
>>> StaticTable may be more flexible and clearer in semantics. For
>>> StaticTable, it's the user's responsibility to decide which metadata of
>>> the table to read. But for a read-only HadoopCatalog, the metadata may
>>> be decided by the catalog. Is there a probability that the strategy used
>>> by HadoopCatalog is not compatible with the table managed by other
>>> catalogs?
>>>
>>> On Thu, Jul 18, 2024 at 11:39, Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>
>>>> I think there are two ways to do this:
>>>> 1. As Xuanwo said, we refactor HadoopCatalog to be read-only, and throw
>>>> an unsupported operation exception for other operations that manipulate
>>>> tables.
>>>> 2. Totally deprecate HadoopCatalog, and add StaticTable as we did in
>>>> pyiceberg or iceberg-rust.
>>>>
>>>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:
>>>>
>>>>> Hi, Renjie
>>>>>
>>>>> Are you suggesting that we refactor HadoopCatalog as a
>>>>> FileSystemCatalog to enable direct reading from file systems like
>>>>> HDFS, S3, and Azure Blob Storage?
>>>>> This catalog would be read-only and would not support write
>>>>> operations.
>>>>>
>>>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>>>
>>>>> Hi, Ryan:
>>>>>
>>>>> Thanks for raising this. I agree that HadoopCatalog is dangerous in
>>>>> manipulating tables/catalogs given the limitations of different file
>>>>> systems. But I see that there are some users who want to read Iceberg
>>>>> tables without relying on any catalog. This is also the motivating
>>>>> use case of StaticTable in pyiceberg and iceberg-rust. Is there
>>>>> something similar in the Java implementation?
>>>>>
>>>>>
>>>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>>>>>
>>>>> Hey everyone,
>>>>>
>>>>> There has been some recent discussion about improving
>>>>> HadoopTableOperations and the catalog based on those tables, but we've
>>>>> discouraged using file-system-only tables (or "hadoop" tables) for
>>>>> years now because of major problems:
>>>>> * It is only safe to use hadoop tables with HDFS; most local file
>>>>> systems, S3, and other common object stores are unsafe
>>>>> * Despite not providing atomicity guarantees outside of HDFS, people
>>>>> use the tables in unsafe situations
>>>>> * HadoopCatalog cannot implement atomic operations for rename and drop
>>>>> table, which are commonly used in data engineering
>>>>> * Alternative file names (for instance when using metadata file
>>>>> compression) also break guarantees
>>>>>
>>>>> While these tables are useful for testing in non-production scenarios,
>>>>> I think it's misleading to have them in the core module because
>>>>> there's an appearance that they are a reasonable choice. I propose we
>>>>> deprecate the HadoopTableOperations and HadoopCatalog implementations
>>>>> and move them to tests the next time we can make breaking API changes
>>>>> (2.0).
>>>>>
>>>>> I think we should also consider similar fixes to the table spec.
>>>>> It currently describes how HadoopTableOperations works, which does
>>>>> not work in object stores or local file systems. HDFS is becoming
>>>>> much less common and I propose that we note that the strategy in the
>>>>> spec should ONLY be used with HDFS.
>>>>>
>>>>> What do other people think?
>>>>>
>>>>> Ryan
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>>
>>>>>
>>>>> Xuanwo
>>>>>
>>>>> https://xuanwo.io/