Hi team.
     I am not a PMC member, just a regular user. Instead of discussing whether 
HadoopCatalog needs to continue to exist, I'd like to share a more practical 
issue. 


    We currently serve over 30,000 customers, all of whom use Iceberg to store 
their foundational data, and all business analyses are conducted on Iceberg. 
However, all of the Iceberg tables use hadoop_catalog. At least, this has been 
the case since I started working with our production system. 


    In recent days, I've attempted to migrate from hadoop_catalog to 
jdbc-catalog, but I failed. We store 2PB of data, and replacing the current 
catalogs has become an almost impossible task. Users not only create 
hadoop_catalog tables through Spark, they also continuously write data into 
Iceberg as hadoop_catalog tables through third-party OLAP systems, Flink, and 
other means. Given this situation, we can only continue to fix hadoop_catalog 
and provide services to customers. 


    I understand that the community wants to make a big push toward the 
rest-catalog, and I agree with the direction the community is going. But 
considering that there might be a significant number of users facing similar 
issues, can we at least retain a module similar to iceberg-hadoop to extend 
hadoop_catalog? If it is removed, we won't be able to continue providing 
services to our customers. So, if possible, please consider this option. 


Thank you all.


Kind regards,
lisoda













At 2024-07-19 01:28:18, "Jack Ye" <yezhao...@gmail.com> wrote:

Thank you for bringing this up, Ryan. I have also been in the camp of saying 
HadoopCatalog is not recommended, but after thinking about this more deeply 
last night, I now have mixed feelings about this topic. Just to comment on the 
reasons you listed first:



* For reasons 1 & 2, it looks like the root cause is that people try to use 
HadoopCatalog outside native HDFS because there are HDFS connectors to other 
storages like S3AFileSystem. However, the norm for such usage has been that 
those connectors do not strictly follow HDFS semantics, and it is assumed that 
people acknowledge the implications of such usage and accept the risk. For 
example, S3AFileSystem was there even before S3 was strongly consistent, but 
people have been using it to write files.


* For reason 3, there are multiple catalogs that do not support all operations 
(e.g. Glue for atomic table rename) and people still widely use them.



* For reason 4, I see that more as a missing feature. More features could 
definitely be developed in that catalog implementation.


So the key question to me is: how can we prevent people from using 
HadoopCatalog outside native HDFS? We know HadoopCatalog is popular because it 
is a storage-only solution. For object storages specifically, HadoopCatalog is 
not suitable for 2 reasons:


(1) file writes do not enforce mutual exclusion, so they cannot satisfy 
Iceberg's optimistic concurrency requirement (i.e. cannot do an atomic 
compare-and-swap)



(2) the directory-based design is not a good fit for object storage and 
results in poor performance.


However, now that I look at these 2 issues again, they are getting outdated.



(1) object storage is starting to enforce file mutual exclusion. GCS supports 
a file generation number [1] that increments monotonically, and 
x-goog-if-generation-match [2] can be used to perform an atomic swap (a small 
sketch follows after point (2) below). A similar feature [3] exists in Azure 
Blob Storage. I cannot speak for the S3 team's roadmap, but Amazon S3 is 
clearly falling behind in this area, and with market competition it is very 
likely that similar features will come in the reasonably near future.


(2) directory buckets are becoming the norm. Amazon S3 announced directory 
buckets at re:Invent 2023 [4], which do not have the same performance 
limitations even with deeply nested folders and many objects per folder. GCS 
has a similar feature currently in preview [5]. Azure has had this feature 
since 2021 [6].
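

To make point (1) above concrete, here is a minimal sketch (not Iceberg code) 
of an atomic pointer swap on GCS using the x-goog-if-generation-match 
precondition from [2]. The bucket/object names, the pointer-object layout, and 
the way the expected generation is obtained are assumptions for illustration 
only:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch: atomically replace a small "current metadata pointer" object on
    // GCS using the x-goog-if-generation-match precondition. Object names and
    // the way the expected generation is obtained are illustrative assumptions.
    public class GcsConditionalSwapSketch {
      public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        long expectedGeneration = 42L; // read together with the current pointer object
        String newMetadata = "gs://bucket/db/table/metadata/v43.metadata.json";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://storage.googleapis.com/bucket/db/table/version-pointer"))
            .header("Authorization", "Bearer " + System.getenv("GCS_TOKEN"))
            .header("x-goog-if-generation-match", Long.toString(expectedGeneration))
            .PUT(HttpRequest.BodyPublishers.ofString(newMetadata))
            .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 412) {
          // Precondition failed: another writer committed first; reload and retry.
          System.out.println("concurrent commit detected, retry the commit");
        }
      }
    }

A 412 Precondition Failed response means another writer committed first, which 
is exactly the signal an optimistic-concurrency commit loop needs in order to 
reload and retry.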



With these new developments in the industry, a storage-only Iceberg catalog 
becomes very attractive. It is simple, with only one service dependency. It 
can safely perform an atomic compare-and-swap. It is performant without the 
need to worry about folder and file organization. If you want additional 
features for things like access control, there are also integrations like S3 
Access Grants [7] that can do it in a very scalable way.


I know the direction in the community so far is to go with the REST catalog, 
and I am personally a big advocate for that. However, that requires either 
building a full REST catalog or choosing a catalog vendor that supports REST. 
There are many capabilities that REST would unlock, but those are visions that 
I expect will take the community many years of driving consensus and building 
features to realize. If I were the CTO of a small company and I just wanted an 
Iceberg data lake(house) right now, would I choose REST, or would I choose (or 
even just build) a storage-only Iceberg catalog? I feel I would actually 
choose the latter.


Going back to the discussion points, my current take on this topic is:


(1) +1 for clarifying that HadoopCatalog should only work with HDFS in the spec.



(2) +1 if we want to block non-HDFS use cases in HadoopCatalog by default 
(e.g. fail if using S3A), but we should allow a feature flag to unblock the 
usage so that people can use it after understanding the implications and 
risks, just like how people use S3A today (a rough sketch of such a guard 
follows after point (5)).


(3) +0 for removing HadoopCatalog from the core library. It could be in a 
different module like iceberg-hdfs if that is more suitable.


(4) -1 for moving HadoopCatalog to tests, because HDFS is still a valid use 
case for Iceberg. After measures 1-3 above, people who actually have an HDFS 
use case should be able to continue to innovate and optimize the HadoopCatalog 
implementation. Although "HDFS is becoming much less common", looking at 
GitHub issues and discussion forums, it still has a pretty big user base.



(5) In general, I propose we separate the discussion of HadoopCatalog from 
that of a "storage-only catalog" that also covers other object stores. With 
these latest industry developments, we should evaluate the direction of 
building a storage-only Iceberg catalog and see if the community has interest 
in that. I could help raise a thread about it after this discussion is closed.
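

For point (2) above, here is a rough sketch of what such an opt-in guard could 
look like. The property name and the scheme check are purely hypothetical, not 
an existing Iceberg configuration:

    import java.util.Map;

    // Hypothetical guard for HadoopCatalog initialization: refuse non-HDFS
    // warehouse locations unless the user explicitly opts in. The property
    // name and behavior are illustrative only.
    class HadoopCatalogGuard {
      private static final String ALLOW_NON_HDFS = "hadoop.catalog.allow-non-hdfs";

      static void validateWarehouse(String warehouseLocation, Map<String, String> properties) {
        boolean isHdfs = warehouseLocation.startsWith("hdfs://");
        boolean optedIn = Boolean.parseBoolean(properties.getOrDefault(ALLOW_NON_HDFS, "false"));
        if (!isHdfs && !optedIn) {
          throw new IllegalArgumentException(
              "HadoopCatalog is only safe on HDFS; set " + ALLOW_NON_HDFS
                  + "=true to accept the risks of using it with " + warehouseLocation);
        }
      }
    }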



Best,
Jack Ye



[1] 
https://cloud.google.com/storage/docs/object-versioning#file_restoration_behavior
[2] 
https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch
[3] 
https://learn.microsoft.com/en-us/rest/api/storageservices/specifying-conditional-headers-for-blob-service-operations
[4] 
https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html
[5] https://cloud.google.com/storage/docs/buckets#enable-hns
[6] 
https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace
[7] https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants.html












On Thu, Jul 18, 2024 at 7:16 AM Eduard Tudenhöfner <etudenhoef...@apache.org> 
wrote:

+1 on deprecating now and removing them from the codebase with Iceberg 2.0



On Thu, Jul 18, 2024 at 10:40 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

+1 on deprecating the `File System Tables` section from the spec and 
`HadoopCatalog`/`HadoopTableOperations` in code for now, and removing them 
permanently in the 2.0 release.

For testing, we can use `InMemoryCatalog`, as others mentioned.

I am not sure about moving them to tests or keeping them only for HDFS, 
because that leads to confusion for existing users of the Hadoop catalog.

I wanted to have it deprecated 2 years ago, and I remember that we discussed 
it in a community sync at that time and left it as is.
Also, when a user recently brought up LockManager and refactoring 
HadoopTableOperations in Slack, I asked them to open this discussion on the 
mailing list so that we can conclude it once and for all.

- Ajantha


On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <fo...@apache.org> wrote:

Hey Ryan and others,


Thanks for bringing this up. I would be in favor of removing the 
HadoopTableOperations, mostly for the reasons you already mentioned, but also 
because it is not fully in line with the first principles of Iceberg (being 
object-store native), as it uses file listing.


I think we should deprecate HadoopTables to raise attention among their users. 
I would be reluctant to move it to test just for testing purposes; I'd rather 
remove it and replace its use in tests with the InMemoryCatalog.


Regarding the StaticTable, this is an easy way to have a read-only table by 
directly pointing to the metadata. This also lives in Java under 
StaticTableOperations. It isn't a full-blown catalog where you can list 
{tables,schemas}, update tables, etc. As ZENOTME already pointed out, it is 
all up to the user; for example, there is no listing of directories to 
determine which tables are in the catalog.
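

For readers less familiar with the Java side, here is a minimal sketch of 
opening a read-only table directly from a metadata file with 
StaticTableOperations; the metadata path is a placeholder and the exact 
constructor signatures may differ between Iceberg versions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.BaseTable;
    import org.apache.iceberg.StaticTableOperations;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.hadoop.HadoopFileIO;

    // Sketch: open a read-only table by pointing directly at a metadata file.
    // No catalog is involved; the user decides which metadata version to read.
    public class StaticTableSketch {
      public static void main(String[] args) {
        String metadataLocation = "s3://bucket/db/table/metadata/v5.metadata.json"; // placeholder
        StaticTableOperations ops =
            new StaticTableOperations(metadataLocation, new HadoopFileIO(new Configuration()));
        Table table = new BaseTable(ops, "db.table");
        table.newScan().planFiles().forEach(task -> System.out.println(task.file().path()));
      }
    }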


Is there a possibility that the strategy used by HadoopCatalog is not 
compatible with tables managed by other catalogs?


Yes, they are different: as you can see in the spec, the File System Tables 
section is what the HadoopTable implementation uses, whereas the other 
catalogs follow the Metastore Tables section.


Kind regards,
Fokko


Op do 18 jul 2024 om 07:19 schreef NOTME ZE <st810918...@gmail.com>:

Given our requirements, this function is for users who want to read Iceberg 
tables without relying on any catalog; I think StaticTable may be more 
flexible and clearer in semantics. With StaticTable, it's the user's 
responsibility to decide which table metadata to read, but with a read-only 
HadoopCatalog the metadata may be decided by the catalog. Is there a 
possibility that the strategy used by HadoopCatalog is not compatible with 
tables managed by other catalogs?


Renjie Liu <liurenjie2...@gmail.com> 于2024年7月18日周四 11:39写道:

I think there are two ways to do this:
1. As Xuanwo said, we refactor HadoopCatalog to be read-only and throw 
UnsupportedOperationException for operations that manipulate tables (see the 
sketch after this list).
2. Fully deprecate HadoopCatalog and add StaticTable, as we did in pyiceberg 
and iceberg-rust.
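

A very rough sketch of what option 1 could look like; the class and methods 
shown are illustrative only, and HadoopCatalog's real API surface (namespaces, 
registerTable, etc.) is larger:

    import org.apache.iceberg.catalog.TableIdentifier;

    // Illustrative read-only wrapper: mutating operations fail fast with
    // UnsupportedOperationException, while reads would be delegated as today.
    class ReadOnlyHadoopCatalogSketch {
      public boolean dropTable(TableIdentifier identifier, boolean purge) {
        throw new UnsupportedOperationException("This catalog is read-only");
      }

      public void renameTable(TableIdentifier from, TableIdentifier to) {
        throw new UnsupportedOperationException("This catalog is read-only");
      }
    }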


On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:

Hi, Renjie



Are you suggesting that we refactor HadoopCatalog into a FileSystemCatalog to 
enable direct reading from file systems like HDFS, S3, and Azure Blob Storage? 
This catalog would be read-only and would not support write operations.



On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:

Hi, Ryan:



Thanks for raising this. I agree that HadoopCatalog is dangerous for 
manipulating tables/catalogs given the limitations of different file systems. 
But I see that there are some users who want to read Iceberg tables without 
relying on any catalog; this is also the motivating use case for StaticTable 
in pyiceberg and iceberg-rust. Is there something similar in the Java 
implementation?





On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:

Hey everyone,



There has been some recent discussion about improving HadoopTableOperations and 
the catalog based on those tables, but we've discouraged using file-system-only 
tables (or "Hadoop" tables) for years now because of major problems:

* It is only safe to use hadoop tables with HDFS; most local file systems, S3, 
and other common object stores are unsafe

* Despite not providing atomicity guarantees outside of HDFS, people use the 
tables in unsafe situations

* HadoopCatalog cannot implement atomic operations for rename and drop table, 
which are commonly used in data engineering

* Alternative file names (for instance when using metadata file compression) 
also break guarantees



While these tables are useful for testing in non-production scenarios, I think 
it's misleading to have them in the core module because there's an appearance 
that they are a reasonable choice. I propose we deprecate the 
HadoopTableOperations and HadoopCatalog implementations and move them to tests 
the next time we can make breaking API changes (2.0).



I think we should also consider similar fixes to the table spec. It currently 
describes how HadoopTableOperations works, which does not work in object stores 
or local file systems. HDFS is becoming much less common and I propose that we 
note that the strategy in the spec should ONLY be used with HDFS.



What do other people think?



Ryan



--

Ryan Blue



Xuanwo


https://xuanwo.io/
