Hi, Renjie

Are you suggesting that we refactor HadoopCatalog as a FileSystemCatalog to 
enable direct reading from file systems like HDFS, S3, and Azure Blob Storage? 
This catalog would be read-only and would not support write operations.

On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
> Hi, Ryan:
> 
> Thanks for raising this. I agree that HadoopCatalog is dangerous for 
> manipulating tables/catalogs given the limitations of different file systems. 
> But I see that there are some users who want to read Iceberg tables without 
> relying on any catalog. This is also the motivating use case for 
> StaticTable in pyiceberg and iceberg-rust. Is there something similar in the 
> Java implementation?
> 
> 
> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>> Hey everyone,
>> 
>> There has been some recent discussion about improving HadoopTableOperations 
>> and the catalog based on those tables, but we've discouraged using 
>> file-system-only tables (or "hadoop" tables) for years now because of major 
>> problems:
>> * It is only safe to use hadoop tables with HDFS; most local file systems, 
>> S3, and other common object stores are unsafe
>> * Despite not providing atomicity guarantees outside of HDFS, people use the 
>> tables in unsafe situations
>> * HadoopCatalog cannot implement atomic operations for rename and drop 
>> table, which are commonly used in data engineering
>> * Alternative file names (for instance when using metadata file compression) 
>> also break guarantees
>> 
>> While these tables are useful for testing in non-production scenarios, I 
>> think it's misleading to have them in the core module because there's an 
>> appearance that they are a reasonable choice. I propose we deprecate the 
>> HadoopTableOperations and HadoopCatalog implementations and move them to 
>> tests the next time we can make breaking API changes (2.0).
>> 
>> I think we should also consider similar fixes to the table spec. It 
>> currently describes how HadoopTableOperations works, which does not work in 
>> object stores or local file systems. HDFS is becoming much less common and I 
>> propose that we note that the strategy in the spec should ONLY be used with 
>> HDFS.
>> 
>> What do other people think?
>> 
>> Ryan
>> 
>> --
>> Ryan Blue

Xuanwo

https://xuanwo.io/
