Nir Yanay created NIFI-15568:
--------------------------------

             Summary: Iceberg S3 on-prem support and iceberg-parquet timestamp 
fix.
                 Key: NIFI-15568
                 URL: https://issues.apache.org/jira/browse/NIFI-15568
             Project: Apache NiFi
          Issue Type: Bug
          Components: Extensions
    Affects Versions: 2.7.2, 2.7.1, 2.7.0, 2.8.0
            Reporter: Nir Yanay


While working with PutIcebergRecord in NiFi 2.7.2, I encountered two separate 
issues when writing to Apache Iceberg tables using an on-prem S3-compatible 
object store and an Iceberg REST catalog.
h3. *Issue 1: On-Prem S3 Configuration Not Supported by S3IcebergFileIOProvider*

NiFi's default S3IcebergFileIOProvider does not expose the necessary 
configuration options required to connect to an on-prem S3-compatible storage 
(e.g., MinIO).

Specifically, it does not allow configuring:
 * Custom S3 endpoint
 * Path-style access
 * Storage class

As a result, PutIcebergRecord cannot be used with an on-prem S3 backend out of 
the box. To resolve this, I extended S3IcebergFileIOProvider to support the 
missing properties, enabling connectivity to on-prem S3-compatible storage 
systems.
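For context, Iceberg's own S3FileIO already understands catalog properties for these settings ({{s3.endpoint}}, {{s3.path-style-access}}, etc.); the gap is that NiFi's provider does not expose them. A minimal stdlib-only sketch of the property set an on-prem store like MinIO needs (the endpoint and credential values are placeholders, not from the PR):

```java
import java.util.HashMap;
import java.util.Map;

public class OnPremS3Config {
    // Sketch of the FileIO properties an on-prem S3-compatible store needs.
    // The keys are Iceberg's standard S3FileIO property names; the endpoint
    // and credentials below are placeholder values.
    static Map<String, String> onPremS3Properties() {
        Map<String, String> props = new HashMap<>();
        props.put("s3.endpoint", "http://minio.internal:9000"); // custom endpoint
        props.put("s3.path-style-access", "true");              // MinIO requires path-style URLs
        props.put("s3.access-key-id", "my-access-key");
        props.put("s3.secret-access-key", "my-secret-key");
        return props;
    }
}
```

Passing such a map through to the underlying S3FileIO is essentially what the extended provider enables.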
h3. *Issue 2: Timestamp Type Mismatch Between NiFi and Iceberg*

After enabling on-prem S3 support, I encountered a timestamp compatibility 
issue when writing records containing timestamp fields: NiFi represents 
timestamps as java.sql.Timestamp, while Iceberg's Parquet writer expects 
java.time.LocalDateTime (see 
[GenericParquetWriter|https://github.com/apache/iceberg/blob/730ce29d5cd722b1751a1984d9eabb68542eba39/parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetWriter.java#L122]).
h4. Unpartitioned Tables

Initially, I added a converter to handle the type conversion, which resolved 
the issue for unpartitioned Iceberg tables.
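The conversion itself is straightforward in java.time; a minimal sketch of the kind of converter involved (not the exact code from the PR):

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;

public class TimestampConverter {
    // Convert NiFi's java.sql.Timestamp into the java.time.LocalDateTime
    // that Iceberg's Parquet value writer expects for timestamp columns.
    static LocalDateTime toIcebergTimestamp(Object value) {
        if (value instanceof Timestamp) {
            return ((Timestamp) value).toLocalDateTime(); // preserves nanosecond precision
        }
        if (value instanceof LocalDateTime) {
            return (LocalDateTime) value; // already in the expected representation
        }
        throw new IllegalArgumentException("Unsupported timestamp type: " + value.getClass());
    }
}
```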
h4. Partitioned Tables

However, when the timestamp column was used as a partition key, writes failed 
again. Further investigation showed that Iceberg internally expects timestamp 
partition key values to be represented both as Long and as LocalDateTime at 
different points in the write path.

To resolve this, I leveraged Iceberg's InternalRecordWrapper, which correctly 
handles this dual representation and allows partitioned writes to succeed.
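As context for the dual representation: Iceberg's partition transforms operate on timestamps encoded as microseconds since the Unix epoch (a Long), while the value writer sees a LocalDateTime, and InternalRecordWrapper bridges the two when partition keys are computed. A stdlib-only sketch of that encoding (the epoch constant and conversion mirror what Iceberg does internally, but this is not the project's code):

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class TimestampMicros {
    // Unix epoch as a LocalDateTime, the reference point for the encoding.
    static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0);

    // Encode a timestamp as microseconds since the epoch, the Long form
    // Iceberg's partition transforms expect for timestamp partition keys.
    static long microsFromTimestamp(LocalDateTime ts) {
        return ChronoUnit.MICROS.between(EPOCH, ts);
    }
}
```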

*PR*

A pull request has been opened addressing both issues: LINK



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
