Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

Steven Wu Thu, 18 Jul 2024 13:23:21 -0700

Thanks Jack for the thoughtful comments.

I am not fully sold that the object storage issues have been solved. An S3
directory bucket is not a general purpose bucket and lives in a single
availability zone, so its data durability guarantee may not work for many
use cases. We also don't know when S3 will add atomic rename support.
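
To make the atomicity concern concrete, here is a toy sketch (not Iceberg
code) of why a commit needs an atomic create-if-absent or rename. ToyStore,
put_if_absent and commit are hypothetical names, the store is an in-memory
stand-in for a real backend, and the v<N>.metadata.json key only loosely
mirrors the file system table layout:

```python
import threading


class ToyStore:
    """In-memory stand-in for a store with an atomic create-if-absent."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, value):
        # Atomic check-and-write: only the first writer of `key` gets True.
        with self._lock:
            if key in self._objects:
                return False
            self._objects[key] = value
            return True

    def get(self, key):
        with self._lock:
            return self._objects.get(key)


def commit(store, base_version, metadata):
    """Try to publish version base_version + 1; lose if someone else did first."""
    key = f"metadata/v{base_version + 1}.metadata.json"
    return store.put_if_absent(key, metadata)


store = ToyStore()
# Two writers both start from version 1 and race to publish version 2.
first = commit(store, 1, "writer-a")
second = commit(store, 1, "writer-b")
```

With a plain unconditional write in place of put_if_absent, both writers
would report success and the second would silently overwrite the first,
which is the failure mode on stores without mutual exclusion.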


As the community has mostly converged on the REST catalog,
I definitely agree with Jack's comment on the gap of the REST backend
server.  I know this has been discussed before and there were some concerns
about a backend server component. But the lack of a REST backend server
raises the hurdle for many companies to adopt the REST catalog. An
open-source REST backend implementation (inside or outside the Iceberg
project) would really help the broader community. This is
probably a separate discussion from this thread.

While the Glue catalog may not support all the catalog operations (like
atomic table rename), I assume the gap is much smaller than that of the
current HadoopCatalog on object storage. Also, there was a previous
discussion on moving the vendor specific catalogs to a separate repo. The
Iceberg community could then focus on catalogs (like REST) that can support
the full features of the table format.

At minimum, I would be +1 for these changes now:
1) clarifying in the spec that HadoopCatalog should only work with HDFS
2) moving HadoopCatalog from iceberg-core to a separate iceberg-hdfs module
and adding Javadoc to explain its limitations on object storage.
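
For point 1, plus the opt-in flag Jack suggests in the quoted message below,
a guard could look roughly like this. It is only a sketch:
check_warehouse_location and allow_non_hdfs are made-up names, not existing
Iceberg APIs or catalog properties:

```python
from urllib.parse import urlparse


def check_warehouse_location(location, allow_non_hdfs=False):
    """Return the URI scheme, raising unless it is hdfs or the caller opted in."""
    scheme = urlparse(location).scheme
    if scheme == "hdfs" or allow_non_hdfs:
        return scheme
    raise ValueError(
        f"Unsupported scheme '{scheme}' for {location}: "
        "only hdfs is accepted unless allow_non_hdfs is set"
    )


# Accepted by default.
hdfs_scheme = check_warehouse_location("hdfs://namenode:8020/warehouse")
# Accepted only because the caller explicitly took on the risk.
s3a_scheme = check_warehouse_location("s3a://bucket/warehouse", allow_non_hdfs=True)
```

Failing by default keeps the unsafe path visible, while the flag preserves
the "use at your own risk" behaviour people rely on with S3A today.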


On Thu, Jul 18, 2024 at 11:48 AM John Zhuge <jzh...@apache.org> wrote:

> Appreciate the thoughtful comments!
>
> On Thu, Jul 18, 2024 at 10:29 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Thank you for bringing this up Ryan. I have also been in the camp of
>> saying HadoopCatalog is not recommended, but after thinking about this more
>> deeply last night, I now have mixed feelings about this topic. Just to
>> comment on the reasons you listed first:
>>
>> * For reason 1 & 2, it looks like the root cause is that people try to
>> use HadoopCatalog outside native HDFS because there are HDFS connectors to
>> other storages like S3AFileSystem. However, the norm for such usage has
>> been that those connectors do not strictly follow HDFS semantics, and it is
>> assumed that people acknowledge the implication of such usage and accept
>> the risk. For example, S3AFileSystem was there even before S3 was strongly
>> consistent, but people have been using that to write files.
>>
>> * For reason 3, there are multiple catalogs that do not support all
>> operations (e.g. Glue for atomic table rename) and people still widely use
>> them.
>>
>> * For reason 4, I see that more as a missing feature. More features could
>> definitely be developed in that catalog implementation.
>>
>> So the key question to me is, how can we prevent people from using
>> HadoopCatalog outside native HDFS? We know HadoopCatalog is popular because
>> it is a storage only solution. For object storages specifically,
>> HadoopCatalog is not suitable for 2 reasons:
>>
>> (1) file write does not enforce mutual exclusion, thus cannot enforce
>> Iceberg's optimistic concurrency requirement (i.e. cannot do an atomic
>> compare-and-swap)
>>
>> (2) directory-based design is not preferred in object storage and will
>> result in bad performance.
>>
>> However, now that I look at these 2 issues, they are becoming outdated.
>>
>> (1) object storage is starting to enforce file mutual exclusion. GCS
>> supports a file generation number [1] that increments monotonically, and
>> clients can use x-goog-if-generation-match [2] to perform an atomic swap.
>> A similar feature [3] exists in Azure Blob Storage. I cannot speak for the
>> S3 team roadmap. But Amazon S3 is clearly falling behind in this domain,
>> and with market competition, it is very clear that similar features will
>> come in the reasonably near future.
>>
>> (2) directory buckets are becoming the norm. Amazon S3 announced directory
>> buckets at re:Invent 2023 [4], which do not have the same performance
>> limitation even if you have very nested folders and many objects in a
>> folder. GCS also has a similar feature in preview [5] right now.
>> Azure has had this feature since 2021 [6].
>>
>> With these new developments in the industry, a storage-only Iceberg
>> catalog becomes very attractive. It is simple with only one service
>> dependency. It can safely perform atomic compare-and-swap. It is performant
>> without the need to worry about folder and file organization. If you want
>> to add additional features for things like access control, there are also
>> integrations like S3 Access Grants [7] that can do it in a very scalable
>> way.
>>
>> I know the direction in the community so far is to go with the REST
>> catalog, and I am personally a big advocate for that. However, that
>> requires either building a full REST catalog, or choosing a catalog vendor
>> that supports REST. There are many capabilities that REST would unlock, but
>> those are visions that I expect will take the community many years of
>> driving consensus and building features to realize. If I am
>> the CTO of a small company and I just want an Iceberg data lake(house)
>> right now, do I choose REST, or do I choose (or even just build) a
>> storage-only Iceberg catalog? I feel I would actually choose the latter.
>>
>> Going back to the discussion points, my current take of this topic is
>> that:
>>
>> (1) +1 for clarifying that HadoopCatalog should only work with HDFS in
>> the spec.
>>
>> (2) +1 if we want to block non-HDFS use cases in HadoopCatalog by default
>> (e.g. fail if using S3A), but we should allow a feature flag to unblock the
>> usage so that people can use it after understanding the implications and
>> risks, just like how people use S3A today.
>>
>> (3) +0 for removing HadoopCatalog from the core library. It could be in a
>> different module like iceberg-hdfs if that is more suitable.
>>
>> (4) -1 for moving HadoopCatalog to tests, because HDFS is still a valid
>> use case for Iceberg. After the measures 1-3 above, people actually having
>> an HDFS use case should be able to continue to innovate and optimize the
>> HadoopCatalog implementation. Although "HDFS is becoming much less common",
>> looking at GitHub issues and discussion forums, it still has a pretty big
>> user base.
>>
>> (5) In general, I propose we separate the discussion of HadoopCatalog
>> from a "storage only catalog" that also deals with other object storages when
>> evaluating it. With these latest industry developments, we should evaluate
>> the direction for building a storage only Iceberg catalog and see if the
>> community has an interest in that. I could help raise a thread about it
>> after this discussion is closed.
>>
>> Best,
>> Jack Ye
>>
>> [1] https://cloud.google.com/storage/docs/object-versioning#file_restoration_behavior
>> [2] https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch
>> [3] https://learn.microsoft.com/en-us/rest/api/storageservices/specifying-conditional-headers-for-blob-service-operations
>> [4] https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html
>> [5] https://cloud.google.com/storage/docs/buckets#enable-hns
>> [6] https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace
>> [7] https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants.html
>>
>> On Thu, Jul 18, 2024 at 7:16 AM Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>>
>>> +1 on deprecating now and removing them from the codebase with Iceberg
>>> 2.0
>>>
>>> On Thu, Jul 18, 2024 at 10:40 AM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> +1 on deprecating the `File System Tables` in the spec and
>>>> `HadoopCatalog`, `HadoopTableOperations` in code for now
>>>> and removing them permanently in the 2.0 release.
>>>>
>>>> For testing we can use `InMemoryCatalog` as others mentioned.
>>>>
>>>> I am not sure about moving them to tests or keeping them only for HDFS,
>>>> because it would confuse existing users of the Hadoop catalog.
>>>>
>>>> I wanted to have it deprecated 2 years ago
>>>> <https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1647950504955309>
>>>> and I remember that we discussed it in a sync at that time and left it as is.
>>>> Also, when a user recently brought up the lock manager and refactoring
>>>> HadoopTableOperations in Slack
>>>> <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720075009593789?thread_ts=1719993403.208859&cid=C03LG1D563F>,
>>>> I asked for this discussion to be opened on the mailing list so that we
>>>> can conclude it once and for all.
>>>>
>>>> - Ajantha
>>>>
>>>> On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <fo...@apache.org>
>>>> wrote:
>>>>
>>>>> Hey Ryan and others,
>>>>>
>>>>> Thanks for bringing this up. I would be in favor of removing
>>>>> HadoopTableOperations, mostly because of the reasons that you already
>>>>> mentioned, but also because it is not fully in line with the
>>>>> first principles of Iceberg (being object store native), as it relies on
>>>>> file listing.
>>>>>
>>>>> I think we should deprecate HadoopTables to get the attention of
>>>>> its users. I would be reluctant to move it to tests just to use it for
>>>>> testing purposes; I'd rather remove it and replace its use in tests with
>>>>> the InMemoryCatalog.
>>>>>
>>>>> Regarding the StaticTable, this is an easy way to have a read-only
>>>>> table by directly pointing to the metadata. This also lives in Java under
>>>>> StaticTableOperations
>>>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>.
>>>>> It isn't a full-blown catalog where you can list {tables,schemas},
>>>>> update tables, etc. As ZENOTME already pointed out, it is all up to the
>>>>> user; for example, there is no listing of directories to determine which
>>>>> tables are in the catalog.
>>>>>
>>>>>> is there a probability that the strategy used by HadoopCatalog is not
>>>>>> compatible with the table managed by other catalogs?
>>>>>
>>>>>
>>>>> Yes, they are different. The spec's section on File System Tables
>>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables>
>>>>> describes what the HadoopTable implementation uses, whereas the other
>>>>> catalogs follow the Metastore Tables section
>>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables>.
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>>
>>>>> On Thu, Jul 18, 2024 at 07:19, NOTME ZE <st810918...@gmail.com> wrote:
>>>>>
>>>>>> According to our requirements, this function is for users who
>>>>>> want to read Iceberg tables without relying on any catalog. I think
>>>>>> StaticTable may be more flexible and clearer in semantics. For StaticTable,
>>>>>> it's the user's responsibility to decide which metadata of the table to
>>>>>> read. But for a read-only HadoopCatalog, the metadata may be decided by
>>>>>> the catalog. Is there a probability that the strategy used by
>>>>>> HadoopCatalog is not compatible with the table managed by other catalogs?
>>>>>>
>>>>>> On Thu, Jul 18, 2024 at 11:39, Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>>
>>>>>>> I think there are two ways to do this:
>>>>>>> 1. As Xuanwo said, we refactor HadoopCatalog to be read only, and
>>>>>>> throw an unsupported operation exception for other operations that
>>>>>>> manipulate tables.
>>>>>>> 2. Totally deprecate HadoopCatalog, and add StaticTable as we did in
>>>>>>> pyiceberg or iceberg-rust.
>>>>>>>
>>>>>>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi, Renjie
>>>>>>>>
>>>>>>>> Are you suggesting that we refactor HadoopCatalog as a
>>>>>>>> FileSystemCatalog to enable direct reading from file systems like
>>>>>>>> HDFS, S3, and Azure Blob Storage? This catalog would be read-only and
>>>>>>>> would not support write operations.
>>>>>>>>
>>>>>>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>>>>>>
>>>>>>>> Hi, Ryan:
>>>>>>>>
>>>>>>>> Thanks for raising this. I agree that HadoopCatalog is dangerous for
>>>>>>>> manipulating tables/catalogs given the limitations of different file
>>>>>>>> systems. But I see that there are some users who want to read Iceberg
>>>>>>>> tables without relying on any catalog. This is also the motivating use
>>>>>>>> case of StaticTable in pyiceberg and iceberg-rust. Is there something
>>>>>>>> similar in the Java implementation?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>> There has been some recent discussion about improving
>>>>>>>> HadoopTableOperations and the catalog based on those tables, but we've
>>>>>>>> discouraged using file system only tables (or "hadoop" tables) for
>>>>>>>> years now because of major problems:
>>>>>>>> * It is only safe to use hadoop tables with HDFS; most local file
>>>>>>>> systems, S3, and other common object stores are unsafe
>>>>>>>> * Despite not providing atomicity guarantees outside of HDFS,
>>>>>>>> people use the tables in unsafe situations
>>>>>>>> * HadoopCatalog cannot implement atomic operations for rename and
>>>>>>>> drop table, which are commonly used in data engineering
>>>>>>>> * Alternative file names (for instance when using metadata file
>>>>>>>> compression) also break guarantees
>>>>>>>>
>>>>>>>> While these tables are useful for testing in non-production
>>>>>>>> scenarios, I think it's misleading to have them in the core module
>>>>>>>> because there's an appearance that they are a reasonable choice. I
>>>>>>>> propose we deprecate the HadoopTableOperations and HadoopCatalog
>>>>>>>> implementations and move them to tests the next time we can make
>>>>>>>> breaking API changes (2.0).
>>>>>>>>
>>>>>>>> I think we should also consider similar fixes to the table spec. It
>>>>>>>> currently describes how HadoopTableOperations works, which does not
>>>>>>>> work in object stores or local file systems. HDFS is becoming much less
>>>>>>>> common and I propose that we note that the strategy in the spec should
>>>>>>>> ONLY be used with HDFS.
>>>>>>>>
>>>>>>>> What do other people think?
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>>
>>>>>>>>
>>>>>>>> Xuanwo
>>>>>>>>
>>>>>>>> https://xuanwo.io/
>>>>>>>>
>>>>>>>>
