+1 on deprecating the `File System Tables` section in the spec, along with
`HadoopCatalog` and `HadoopTableOperations` in the code for now,
and removing them permanently in the 2.0 release.

For testing we can use `InMemoryCatalog` as others mentioned.
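For example, a rough sketch of a test against `InMemoryCatalog` (the
namespace, table name, and schema below are only illustrative) could be:

    import java.util.Collections;

    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Namespace;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.inmemory.InMemoryCatalog;
    import org.apache.iceberg.types.Types;

    public class InMemoryCatalogExample {
      public static void main(String[] args) {
        // Everything is kept in memory; nothing touches a real file system.
        InMemoryCatalog catalog = new InMemoryCatalog();
        catalog.initialize("test", Collections.emptyMap());

        // Illustrative schema and identifiers for the sketch.
        Schema schema =
            new Schema(Types.NestedField.required(1, "id", Types.LongType.get()));

        catalog.createNamespace(Namespace.of("db"));
        Table table = catalog.createTable(
            TableIdentifier.of("db", "t"), schema, PartitionSpec.unpartitioned());

        System.out.println(table.name());
      }
    }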

I am not sure about moving them to test or keeping them only for HDFS,
because either option leads to confusion for existing users of the
Hadoop catalog.

I wanted to have it deprecated two years ago
<https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1647950504955309>
and I remember that we discussed it in a community sync at the time and
left it as it is. Also, when a user recently brought this up in Slack
<https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720075009593789?thread_ts=1719993403.208859&cid=C03LG1D563F>
regarding LockManager and refactoring HadoopTableOperations,
I asked them to open this discussion on the mailing list so that we can
conclude it once and for all.

- Ajantha

On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <fo...@apache.org> wrote:

> Hey Ryan and others,
>
> Thanks for bringing this up. I would be in favor of removing
> HadoopTableOperations, mostly because of the reasons that you already
> mentioned, but also because it is not fully in line with the
> first principles of Iceberg (being object-store native), as it relies on
> file listing.
>
> I think we should deprecate the HadoopTables to raise awareness among
> their users. I would be reluctant to move it to test just to use it for
> testing purposes; I'd rather remove it and replace its use in tests with
> the InMemoryCatalog.
>
> Regarding the StaticTable: this is an easy way to get a read-only table
> by pointing directly at its metadata. This also exists in Java as
> StaticTableOperations
> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>.
> It isn't a full-blown catalog where you can list {tables,schemas}, update
> tables, etc. As ZENOTME already pointed out, it is all up to the user; for
> example, there is no listing of directories to determine which tables are
> in the catalog.
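>
> A rough sketch of how that looks in Java (the metadata location and the
> table name here are just illustrative; any FileIO that can read the file
> works):
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.iceberg.BaseTable;
>     import org.apache.iceberg.StaticTableOperations;
>     import org.apache.iceberg.Table;
>     import org.apache.iceberg.hadoop.HadoopFileIO;
>     import org.apache.iceberg.io.FileIO;
>
>     public class StaticTableExample {
>       public static void main(String[] args) {
>         // Hypothetical metadata file location for this sketch.
>         String metadataLocation =
>             "file:///tmp/warehouse/db/t/metadata/v3.metadata.json";
>         FileIO io = new HadoopFileIO(new Configuration());
>
>         // StaticTableOperations does not support commits, so the
>         // resulting table is read-only.
>         StaticTableOperations ops =
>             new StaticTableOperations(metadataLocation, io);
>         Table table = new BaseTable(ops, "db.t");
>
>         System.out.println(table.schema());
>       }
>     }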
>
>> is there a chance that the strategy used by HadoopCatalog is not
>> compatible with tables managed by other catalogs?
>
>
> Yes, they are different. In the spec you can see the section on File
> System tables
> <https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables>,
> which is used by the HadoopTable implementation, whereas the other catalogs
> follow the Metastore Tables
> <https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables>
> section.
>
> Kind regards,
> Fokko
>
> On Thu, Jul 18, 2024 at 07:19, NOTME ZE <st810918...@gmail.com> wrote:
>
>> According to our requirements, this functionality is for users who want
>> to read Iceberg tables without relying on any catalog, so I think
>> StaticTable may be more flexible and clearer in semantics. For StaticTable,
>> it's the user's responsibility to decide which table metadata to
>> read. But for a read-only HadoopCatalog, the metadata would be decided by the
>> catalog; is there a chance that the strategy used by HadoopCatalog is
>> not compatible with tables managed by other catalogs?
>>
>> Renjie Liu <liurenjie2...@gmail.com> wrote on Thu, Jul 18, 2024 at 11:39:
>>
>>> I think there are two ways to do this:
>>> 1. As Xuanwo said, refactor HadoopCatalog to be read-only, and throw an
>>> unsupported-operation exception for operations that manipulate tables.
>>> 2. Fully deprecate HadoopCatalog, and add a StaticTable as we did in
>>> pyiceberg and iceberg-rust.
>>>
>>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:
>>>
>>>> Hi, Renjie
>>>>
>>>> Are you suggesting that we refactor HadoopCatalog into a
>>>> FileSystemCatalog to enable direct reading from file systems like HDFS, S3,
>>>> and Azure Blob Storage? This catalog would be read-only and would not
>>>> support write operations.
>>>>
>>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>>
>>>> Hi, Ryan:
>>>>
>>>> Thanks for raising this. I agree that HadoopCatalog is dangerous for
>>>> manipulating tables/catalogs given the limitations of different file systems.
>>>> But I see that there are some users who want to read Iceberg tables without
>>>> relying on any catalog; this is also the motivating use case for
>>>> StaticTable in pyiceberg and iceberg-rust. Is there a similar thing in the
>>>> Java implementation?
>>>>
>>>>
>>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>>>>
>>>> Hey everyone,
>>>>
>>>> There has been some recent discussion about improving
>>>> HadoopTableOperations and the catalog based on those tables, but we've
>>>> discouraged using file-system-only tables (or "Hadoop" tables) for years now
>>>> because of major problems:
>>>> * It is only safe to use hadoop tables with HDFS; most local file
>>>> systems, S3, and other common object stores are unsafe
>>>> * Despite not providing atomicity guarantees outside of HDFS, people
>>>> use the tables in unsafe situations
>>>> * HadoopCatalog cannot implement atomic operations for rename and drop
>>>> table, which are commonly used in data engineering
>>>> * Alternative file names (for instance when using metadata file
>>>> compression) also break guarantees
>>>>
>>>> While these tables are useful for testing in non-production scenarios,
>>>> I think it's misleading to have them in the core module because it gives the
>>>> appearance that they are a reasonable choice. I propose we deprecate the
>>>> HadoopTableOperations and HadoopCatalog implementations and move them to
>>>> tests the next time we can make breaking API changes (2.0).
>>>>
>>>> I think we should also consider similar fixes to the table spec. It
>>>> currently describes how HadoopTableOperations works, which does not work in
>>>> object stores or local file systems. HDFS is becoming much less common, and
>>>> I propose that we note in the spec that this strategy should ONLY be used
>>>> with HDFS.
>>>>
>>>> What do other people think?
>>>>
>>>> Ryan
>>>>
>>>> --
>>>> Ryan Blue
>>>>
>>>>
>>>> Xuanwo
>>>>
>>>> https://xuanwo.io/
>>>>
>>>>
