Nir Yanay created NIFI-15568:
--------------------------------
Summary: Iceberg S3 on-prem support and iceberg-parquet timestamp fix
Key: NIFI-15568
URL: https://issues.apache.org/jira/browse/NIFI-15568
Project: Apache NiFi
Issue Type: Bug
Components: Extensions
Affects Versions: 2.7.2, 2.7.1, 2.7.0, 2.8.0
Reporter: Nir Yanay
While working with PutIcebergRecord in NiFi 2.7.2, I encountered two separate
issues when writing to Apache Iceberg tables using an on-prem S3-compatible
object store and an Iceberg REST catalog.
h3. *Issue 1: On-Prem S3 Configuration Not Supported by S3FileIOProvider*
NiFi's default S3IcebergFileIOProvider does not expose the configuration
options required to connect to an on-prem S3-compatible object store
(e.g., MinIO).
Specifically, it does not allow configuring:
* Custom S3 endpoint
* Path-style access
* Storage class
As a result, PutIcebergRecord cannot be used with an on-prem S3 backend out of
the box. To resolve this, I extended S3IcebergFileIOProvider to support the
missing properties, enabling connectivity to on-prem S3-compatible storage
systems.
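Roughly, the configuration this requires looks like the following minimal
sketch, expressed directly against Iceberg's REST catalog and S3FileIO
properties (the endpoint, credentials, and catalog URI are placeholders, and
the helper method name is illustrative, not the actual patch):
{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.rest.RESTCatalog;

// Sketch only: the properties a FileIO provider needs to pass through for an
// on-prem S3-compatible store. All values below are placeholders.
static RESTCatalog onPremCatalog() {
    Map<String, String> props = new HashMap<>();
    props.put(CatalogProperties.URI, "http://rest-catalog:8181");
    props.put(CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.aws.s3.S3FileIO");
    props.put("s3.endpoint", "http://minio:9000");   // custom S3 endpoint
    props.put("s3.path-style-access", "true");       // path-style access for MinIO-style stores
    props.put("s3.access-key-id", "minio");
    props.put("s3.secret-access-key", "minio123");
    // props.put("s3.write.storage-class", "STANDARD"); // storage class, if the store supports it

    RESTCatalog catalog = new RESTCatalog();
    catalog.initialize("onprem", props);
    return catalog;
}
{code}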
h3. *Issue 2: Timestamp Type Mismatch Between NiFi and Iceberg*
After enabling on-prem S3 support, I encountered a timestamp compatibility
issue when writing records containing timestamp fields: NiFi represents
timestamps as java.sql.Timestamp, while Iceberg's parquet writer expects
java.time.LocalDateTime (see
[GenericParquetWriter|https://github.com/apache/iceberg/blob/730ce29d5cd722b1751a1984d9eabb68542eba39/parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetWriter.java#L122]).
h4. Unpartitioned Tables
Initially, I added a converter to handle the type conversion, which resolved
the issue for unpartitioned Iceberg tables.
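A minimal sketch of such a conversion (the method name is illustrative, not
the actual patch):
{code:java}
import java.sql.Timestamp;
import java.time.LocalDateTime;

// Illustrative converter: NiFi hands over java.sql.Timestamp values, while
// Iceberg's GenericParquetWriter expects java.time.LocalDateTime.
static Object toIcebergTimestamp(Object value) {
    if (value instanceof Timestamp) {
        return ((Timestamp) value).toLocalDateTime();
    }
    return value;
}
{code}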
h4. Partitioned Tables
However, when the timestamp column was used as a partition key, writes failed
again. Further investigation showed that Iceberg internally expects timestamp
partition key values to be represented both as a Long and as a LocalDateTime
at different points in the write path.
To resolve this, I leveraged Iceberg's InternalRecordWrapper, which correctly
handles this dual representation and allows partitioned writes to succeed.
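For illustration, a sketch of the standard Iceberg pattern this relies on
(variable and method names are placeholders; the actual PR may wire this in
differently):
{code:java}
import org.apache.iceberg.PartitionKey;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.data.InternalRecordWrapper;
import org.apache.iceberg.data.Record;

// Sketch: derive the partition key for a record via InternalRecordWrapper,
// which converts timestamp fields to Iceberg's internal representation
// before partition transforms are applied.
static PartitionKey partitionFor(Table table, Record record) {
    Schema schema = table.schema();
    PartitionSpec spec = table.spec();
    InternalRecordWrapper wrapper = new InternalRecordWrapper(schema.asStruct());
    PartitionKey partitionKey = new PartitionKey(spec, schema);
    partitionKey.partition(wrapper.wrap(record));
    return partitionKey;
}
{code}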
*PR*
A pull request has been opened addressing both issues: LINK
--
This message was sent by Atlassian Jira
(v8.20.10#820010)