+1 on deprecating `File System Tables` in the spec, and `HadoopCatalog` and `HadoopTableOperations` in the code, for now, and removing them permanently in the 2.0 release.
For testing, we can use `InMemoryCatalog`, as others mentioned.

I am not sure about moving them to test or keeping them only for HDFS, because that would confuse existing users of the Hadoop catalog.

I wanted to have it deprecated 2 years ago <https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1647950504955309>, and I remember that we discussed it in the sync at that time and left it as it is. Also, when a user recently brought this up in Slack <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720075009593789?thread_ts=1719993403.208859&cid=C03LG1D563F> regarding the lock manager and refactoring `HadoopTableOperations`, I asked that this discussion be opened on the mailing list so that we can conclude it once and for all.

- Ajantha

On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <fo...@apache.org> wrote:

> Hey Ryan and others,
>
> Thanks for bringing this up. I would be in favor of removing the
> HadoopTableOperations, mostly because of the reasons that you already
> mentioned, but also because it is not fully in line with the
> first principles of Iceberg (being object store native), as it uses
> file-listing.
>
> I think we should deprecate the HadoopTables to raise the attention of
> their users. I would be reluctant to move it to test to just use it for
> testing purposes; I'd rather remove it and replace its use in tests with
> the InMemoryCatalog.
>
> Regarding the StaticTable, this is an easy way to have a read-only table
> by directly pointing to the metadata. This also lives in Java under
> StaticTableOperations
> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>.
> It isn't a full-blown catalog where you can list {tables,schemas}, update
> tables, etc. As ZENOTME pointed out already, it is all up to the user; for
> example, there is no listing of directories to determine which tables are
> in the catalog.
>
>> is there a probability that the strategy used by HadoopCatalog is not
>> compatible with the table managed by other catalogs?
>
> Yes, they are different: the File System Tables section of the spec
> <https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables>
> is used by the HadoopTable implementation, whereas the other catalogs
> follow the Metastore Tables section
> <https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables>.
>
> Kind regards,
> Fokko
>
> On Thu, Jul 18, 2024 at 07:19, NOTME ZE <st810918...@gmail.com> wrote:
>
>> According to our requirements, this function is for some users who want
>> to read iceberg tables without relying on any catalogs. I think the
>> StaticTable may be more flexible and clearer in semantics. For StaticTable,
>> it's the user's responsibility to decide which metadata of the table to
>> read. But for a read-only HadoopCatalog, the metadata may be decided by
>> the catalog. Is there a probability that the strategy used by HadoopCatalog
>> is not compatible with the table managed by other catalogs?
>>
>> On Thu, Jul 18, 2024 at 11:39, Renjie Liu <liurenjie2...@gmail.com> wrote:
>>
>>> I think there are two ways to do this:
>>> 1. As Xuanwo said, we refactor HadoopCatalog to be read-only, and throw
>>> an unsupported operation exception for other operations that manipulate
>>> tables.
>>> 2. Totally deprecate HadoopCatalog, and add StaticTable as we did in
>>> pyiceberg or iceberg-rust.
>>>
>>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:
>>>
>>>> Hi, Renjie
>>>>
>>>> Are you suggesting that we refactor HadoopCatalog as a
>>>> FileSystemCatalog to enable direct reading from file systems like HDFS, S3,
>>>> and Azure Blob Storage? This catalog would be read-only and would not
>>>> support write operations.
>>>>
>>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>>
>>>> Hi, Ryan:
>>>>
>>>> Thanks for raising this.
>>>> I agree that HadoopCatalog is dangerous in
>>>> manipulating tables/catalogs given the limitations of different file systems.
>>>> But I see that there are some users who want to read iceberg tables without
>>>> relying on any catalogs; this is also the motivating use case of
>>>> StaticTable in pyiceberg and iceberg-rust. Is there something similar in
>>>> the Java implementation?
>>>>
>>>>
>>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>>>>
>>>> Hey everyone,
>>>>
>>>> There has been some recent discussion about improving
>>>> HadoopTableOperations and the catalog based on those tables, but we've
>>>> discouraged using file system only tables (or "hadoop" tables) for years now
>>>> because of major problems:
>>>> * It is only safe to use hadoop tables with HDFS; most local file
>>>> systems, S3, and other common object stores are unsafe
>>>> * Despite not providing atomicity guarantees outside of HDFS, people
>>>> use the tables in unsafe situations
>>>> * HadoopCatalog cannot implement atomic operations for rename and drop
>>>> table, which are commonly used in data engineering
>>>> * Alternative file names (for instance when using metadata file
>>>> compression) also break guarantees
>>>>
>>>> While these tables are useful for testing in non-production scenarios,
>>>> I think it's misleading to have them in the core module because there's an
>>>> appearance that they are a reasonable choice. I propose we deprecate the
>>>> HadoopTableOperations and HadoopCatalog implementations and move them to
>>>> tests the next time we can make breaking API changes (2.0).
>>>>
>>>> I think we should also consider similar fixes to the table spec. It
>>>> currently describes how HadoopTableOperations works, which does not work in
>>>> object stores or local file systems. HDFS is becoming much less common and
>>>> I propose that we note that the strategy in the spec should ONLY be used
>>>> with HDFS.
>>>>
>>>> What do other people think?
>>>>
>>>> Ryan
>>>>
>>>> --
>>>> Ryan Blue
>>>>
>>>>
>>>> Xuanwo
>>>>
>>>> https://xuanwo.io/
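
For context on the commit scheme under discussion: as I recall the File System Tables section of the spec linked above, a writer commits version V+1 by writing the new metadata to a uniquely named file and then renaming it to `v<V+1>.metadata.json`, and the commit is only safe if that rename is atomic and never overwrites an existing file. Below is a minimal Python sketch of that idea, not Iceberg code. The function name `commit_metadata` is hypothetical, and a local hard link (`os.link`) is used as a stand-in for an atomic rename-without-overwrite; that is an assumption made for illustration only and says nothing about which production file systems or object stores offer an equivalent primitive.

```python
import os
import tempfile


def commit_metadata(metadata_dir: str, current_version: int, new_metadata: bytes) -> int:
    """Hypothetical helper: try to publish new_metadata as version current_version + 1.

    Raises FileExistsError if another writer has already published that version.
    """
    target = os.path.join(metadata_dir, f"v{current_version + 1}.metadata.json")

    # Write the new metadata to a uniquely named temporary file first.
    fd, tmp_path = tempfile.mkstemp(dir=metadata_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_metadata)
        # Stand-in for "rename without overwrite": os.link raises
        # FileExistsError if the target name is already taken.
        os.link(tmp_path, target)
    finally:
        # The temporary name is no longer needed, whether or not the commit won.
        os.remove(tmp_path)
    return current_version + 1
```

Two writers that both start from version 0 cannot both succeed here: the second call raises `FileExistsError`, and that writer has to re-read the current version and retry. A store without such a create-if-absent primitive loses this guarantee, which is the safety concern raised in the thread.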