Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

Eduard Tudenhöfner Thu, 18 Jul 2024 07:18:11 -0700

+1 on deprecating now and removing them from the codebase with Iceberg 2.0

On Thu, Jul 18, 2024 at 10:40 AM Ajantha Bhat <[email protected]> wrote:


> +1 on deprecating `File System Tables` in the spec, and `HadoopCatalog`
> and `HadoopTableOperations` in the code, for now,
> and removing them permanently in the 2.0 release.
>
> For testing we can use `InMemoryCatalog` as others mentioned.
>
> I am not sure about moving them to tests or keeping them only for HDFS,
> because that would confuse existing users of the Hadoop catalog.
>
> I wanted to have it deprecated 2 years ago
> <https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1647950504955309>
> and I remember that we discussed it in a sync at that time and left it as
> it is. Also, when a user recently brought this up in Slack
> <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720075009593789?thread_ts=1719993403.208859&cid=C03LG1D563F>
> in connection with the lock manager and refactoring HadoopTableOperations,
> I asked them to open this discussion on the mailing list so that we can
> conclude it once and for all.
>
> - Ajantha
>
> On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <[email protected]>
> wrote:
>
>> Hey Ryan and others,
>>
>> Thanks for bringing this up. I would be in favor of removing the
>> HadoopTableOperations, mostly because of the reasons that you already
>> mentioned, but also because it is not fully in line with the first
>> principles of Iceberg (being object store native), as it uses file
>> listing.
>>
>> I think we should deprecate the HadoopTables to draw the attention of
>> their users. I would be reluctant to move it to tests just to use it for
>> testing purposes; I'd rather remove it and replace its use in tests with
>> the InMemoryCatalog.
>>
>> Regarding the StaticTable, this is an easy way to have a read-only table
>> by directly pointing to the metadata. This also lives in Java under
>> StaticTableOperations
>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>.
>> It isn't a full-blown catalog where you can list {tables,schemas},
>> update tables, etc. As ZENOTME pointed out already, it is all up to the
>> user, for example, there is no listing of directories to determine which
>> tables are in the catalog.
>>
>>> is there a probability that the strategy used by HadoopCatalog is not
>>> compatible with the table managed by other catalogs?
>>
>>
>> Yes, they are different. The section of the spec on File System Tables
>> <https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables>
>> describes the strategy used by the HadoopTable implementation, whereas
>> the other catalogs follow the Metastore Tables
>> <https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables>
>> strategy.
>>
>> Kind regards,
>> Fokko
>>
>> On Thu, Jul 18, 2024 at 07:19, NOTME ZE <[email protected]> wrote:
>>
>>> According to our requirements, this feature is for users who want to
>>> read Iceberg tables without relying on any catalog, so I think
>>> StaticTable may be more flexible and clearer in semantics. With
>>> StaticTable, it's the user's responsibility to decide which metadata of
>>> the table to read. With a read-only HadoopCatalog, however, the metadata
>>> may be decided by the catalog. Is there a probability that the strategy
>>> used by HadoopCatalog is not compatible with the table managed by other
>>> catalogs?
>>>
>>> On Thu, Jul 18, 2024 at 11:39, Renjie Liu <[email protected]> wrote:
>>>
>>>> I think there are two ways to do this:
>>>> 1. As Xuanwo said, we refactor HadoopCatalog to be read-only and throw
>>>> an unsupported operation exception for other operations that manipulate
>>>> tables.
>>>> 2. Totally deprecate HadoopCatalog, and add StaticTable as we did in
>>>> pyiceberg or iceberg-rust.
>>>>
>>>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <[email protected]> wrote:
>>>>
>>>>> Hi, Renjie
>>>>>
>>>>> Are you suggesting that we refactor HadoopCatalog as a
>>>>> FileSystemCatalog to enable direct reading from file systems like HDFS,
>>>>> S3, and Azure Blob Storage? This catalog would be read-only and would
>>>>> not support write operations.
>>>>>
>>>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>>>
>>>>> Hi, Ryan:
>>>>>
>>>>> Thanks for raising this. I agree that HadoopCatalog is dangerous for
>>>>> manipulating tables/catalogs given the limitations of different file
>>>>> systems. But I see that some users want to read Iceberg tables without
>>>>> relying on any catalog; this is also the motivating use case for
>>>>> StaticTable in pyiceberg and iceberg-rust. Is there something similar
>>>>> in the Java implementation?
>>>>>
>>>>>
>>>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>> Hey everyone,
>>>>>
>>>>> There has been some recent discussion about improving
>>>>> HadoopTableOperations and the catalog based on those tables, but we've
>>>>> discouraged using file-system-only tables (or "hadoop" tables) for
>>>>> years now because of major problems:
>>>>> * It is only safe to use hadoop tables with HDFS; most local file
>>>>> systems, S3, and other common object stores are unsafe
>>>>> * Despite not providing atomicity guarantees outside of HDFS, people
>>>>> use the tables in unsafe situations
>>>>> * HadoopCatalog cannot implement atomic operations for rename and drop
>>>>> table, which are commonly used in data engineering
>>>>> * Alternative file names (for instance when using metadata file
>>>>> compression) also break guarantees
>>>>>
>>>>> While these tables are useful for testing in non-production scenarios,
>>>>> I think it's misleading to have them in the core module because there's an
>>>>> appearance that they are a reasonable choice. I propose we deprecate the
>>>>> HadoopTableOperations and HadoopCatalog implementations and move them to
>>>>> tests the next time we can make breaking API changes (2.0).
>>>>>
>>>>> I think we should also consider similar fixes to the table spec. It
>>>>> currently describes how HadoopTableOperations works, which does not
>>>>> work in object stores or local file systems. HDFS is becoming much less
>>>>> common, and I propose that we note that the strategy in the spec should
>>>>> ONLY be used with HDFS.
>>>>>
>>>>> What do other people think?
>>>>>
>>>>> Ryan
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>>
>>>>>
>>>>> Xuanwo
>>>>>
>>>>> https://xuanwo.io/
>>>>>
>>>>>
