Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

Fokko Driesprong Thu, 18 Jul 2024 00:19:11 -0700

Hey Ryan and others,

Thanks for bringing this up. I would be in favor of removing the
HadoopTableOperations, mostly for the reasons that you already
mentioned, but also because it is not fully in line with the
first principles of Iceberg (being object store native), as it relies on
file listing.


I think we should deprecate HadoopTables to get the attention of
its users. I would be reluctant to move it to the test module just to use it
for testing. I'd rather remove it and replace its use in tests with
the InMemoryCatalog.

Regarding the StaticTable, this is an easy way to have a read-only table by
directly pointing to the metadata. This also lives in Java under
StaticTableOperations
<https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>.
It isn't a full-blown catalog where you can list {tables,schemas}, update
tables, etc. As ZENOTME pointed out already, it is all up to the user. For
example, there is no listing of directories to determine which tables are
in the catalog.
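
To make the read-only idea concrete, here is a minimal Python sketch. It is not
the pyiceberg, iceberg-rust, or Java API: the class name `ReadOnlyStaticTable`,
its methods, and the `demo` helper are hypothetical stand-ins. The only point is
that the caller picks the exact metadata file, nothing is listed or discovered,
and every write path is rejected.

```python
import json
import os
import tempfile


class ReadOnlyStaticTable:
    """Hypothetical stand-in for a static table pinned to one metadata file."""

    def __init__(self, metadata_location: str):
        # The caller chooses the metadata file; no directory is listed.
        self.metadata_location = metadata_location
        with open(metadata_location, "r", encoding="utf-8") as f:
            self.metadata = json.load(f)

    def current_snapshot_id(self):
        # Read-only access to whatever the pinned metadata file says.
        return self.metadata.get("current-snapshot-id")

    def commit(self, new_metadata: dict):
        # No catalog is involved, so there is no safe way to swap metadata.
        raise NotImplementedError("static tables are read-only")


def demo() -> int:
    # Write a tiny metadata file, then read it back through the static table.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "v3.metadata.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"format-version": 2, "current-snapshot-id": 42}, f)
        table = ReadOnlyStaticTable(path)
        return table.current_snapshot_id()
```

Which metadata file to point at stays entirely the user's decision, which is
the trade-off compared to a catalog.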

> is there a probability that the strategy used by HadoopCatalog is not
> compatible with the table managed by other catalogs?


Yes, they are different. The section of the spec on File System tables
<https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables>
describes the strategy used by the HadoopTable implementation, whereas the
other catalogs follow the section on Metastore Tables
<https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables>.
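
As a rough illustration of the difference, here is a Python sketch of how I read
those two sections. It is not Iceberg code: `fs_commit` and `PointerStore` are
hypothetical names, and `os.link` is only a local stand-in for an atomic
rename that refuses to overwrite. The file-system strategy claims the next
version file name, while the metastore strategy swaps a pointer with a
check-and-put.

```python
import json
import os
import tempfile
import threading
import uuid


def fs_commit(metadata_dir: str, base_version: int, metadata: dict) -> bool:
    """File-system style commit: claim v<N+1>.metadata.json or lose the race."""
    tmp_path = os.path.join(metadata_dir, f"{uuid.uuid4()}.metadata.json")
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f)
    target = os.path.join(metadata_dir, f"v{base_version + 1}.metadata.json")
    try:
        # Assumption: os.link fails if the target exists, standing in for an
        # atomic rename-without-overwrite provided by the storage layer.
        os.link(tmp_path, target)
        return True
    except FileExistsError:
        return False  # another writer won; the caller must re-read and retry
    finally:
        os.remove(tmp_path)


class PointerStore:
    """Hypothetical metastore holding one metadata pointer per table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pointers = {}

    def get(self, table: str):
        return self._pointers.get(table)

    def check_and_put(self, table: str, expected, new: str) -> bool:
        # The swap succeeds only if nobody moved the pointer in the meantime.
        with self._lock:
            if self._pointers.get(table) != expected:
                return False
            self._pointers[table] = new
            return True
```

The first function is only as safe as the storage layer's refusal to overwrite
an existing file, which is the guarantee this thread describes as unsafe to
assume outside HDFS. The second depends only on the metastore.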

Kind regards,
Fokko

On Thu, 18 Jul 2024 at 07:19, NOTME ZE <st810918...@gmail.com> wrote:

> According to our requirements, this function is for users who want to
> read Iceberg tables without relying on any catalog. I think the
> StaticTable may be more flexible and clearer in semantics. For StaticTable,
> it's the user's responsibility to decide which metadata of the table to
> read. But for a read-only HadoopCatalog, the metadata may be decided by
> the catalog. Is there a probability that the strategy used by HadoopCatalog
> is not compatible with the table managed by other catalogs?
>
> On Thu, Jul 18, 2024 at 11:39, Renjie Liu <liurenjie2...@gmail.com> wrote:
>
>> I think there are two ways to do this:
>> 1. As Xuanwo said, we refactor HadoopCatalog to be read-only and throw an
>> unsupported operation exception for operations that manipulate tables.
>> 2. Deprecate HadoopCatalog entirely and add StaticTable, as we did in
>> pyiceberg and iceberg-rust.
>>
>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:
>>
>>> Hi, Renjie
>>>
>>> Are you suggesting that we refactor HadoopCatalog as a FileSystemCatalog
>>> to enable direct reading from file systems like HDFS, S3, and Azure Blob
>>> Storage? This catalog would be read-only and would not support write operations.
>>>
>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>
>>> Hi, Ryan:
>>>
>>> Thanks for raising this. I agree that HadoopCatalog is dangerous for
>>> manipulating tables/catalogs given the limitations of different file systems.
>>> But I see that there are some users who want to read Iceberg tables without
>>> relying on any catalog. This is also the motivating use case for
>>> StaticTable in pyiceberg and iceberg-rust. Is there something similar in
>>> the Java implementation?
>>>
>>>
>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>>>
>>> Hey everyone,
>>>
>>> There has been some recent discussion about improving
>>> HadoopTableOperations and the catalog based on those tables, but we've
>>> discouraged using file-system-only tables (or "hadoop" tables) for years now
>>> because of major problems:
>>> * It is only safe to use hadoop tables with HDFS; most local file
>>> systems, S3, and other common object stores are unsafe
>>> * Despite not providing atomicity guarantees outside of HDFS, people use
>>> the tables in unsafe situations
>>> * HadoopCatalog cannot implement atomic operations for rename and drop
>>> table, which are commonly used in data engineering
>>> * Alternative file names (for instance when using metadata file
>>> compression) also break guarantees
>>>
>>> While these tables are useful for testing in non-production scenarios, I
>>> think it's misleading to have them in the core module because there's an
>>> appearance that they are a reasonable choice. I propose we deprecate the
>>> HadoopTableOperations and HadoopCatalog implementations and move them to
>>> tests the next time we can make breaking API changes (2.0).
>>>
>>> I think we should also consider similar fixes to the table spec. It
>>> currently describes how HadoopTableOperations works, which does not work in
>>> object stores or local file systems. HDFS is becoming much less common and
>>> I propose that we note that the strategy in the spec should ONLY be used
>>> with HDFS.
>>>
>>> What do other people think?
>>>
>>> Ryan
>>>
>>> --
>>> Ryan Blue
>>>
>>>
>>> Xuanwo
>>>
>>> https://xuanwo.io/
>>>
>>>
