[
https://issues.apache.org/jira/browse/NIFI-15568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nir Yanay updated NIFI-15568:
-----------------------------
Description:
While working with PutIcebergRecord in NiFi 2.7.2, I encountered two separate
issues when writing to Apache Iceberg tables using an on-prem S3-compatible
object store and an Iceberg REST catalog.
h3. *Issue 1: On-Prem S3 Configuration Not Supported by S3FileIOProvider*
NiFi's default S3IcebergFileIOProvider does not expose the configuration
options required to connect to an on-prem S3-compatible object store
(e.g., MinIO).
Specifically, it does not allow configuring:
* Custom S3 endpoint
* Path-style access
* Storage class
As a result, PutIcebergRecord cannot be used with an on-prem S3 backend out of
the box. To resolve this, I extended S3IcebergFileIOProvider to support the
missing properties, enabling connectivity to on-prem S3-compatible storage
systems.
UPDATE:
Apologies for the confusion. I later noticed that parts of the on-prem S3
support had already been addressed in an earlier change.
The only missing piece for my use case was support for configuring the S3
storage class, which I added in this PR.
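For context, a minimal sketch of the resulting FileIO configuration, assuming Iceberg's standard S3FileIO catalog property keys (the endpoint, credentials, and the s3.write.storage-class key shown here are illustrative assumptions, not the exact NiFi property names):
{code:java}
import java.util.Map;
import org.apache.iceberg.aws.s3.S3FileIO;

// Illustrative values for an on-prem MinIO endpoint; the keys are the
// standard Iceberg S3FileIO catalog properties.
Map<String, String> properties = Map.of(
        "s3.endpoint", "http://minio.internal:9000",  // custom S3 endpoint (hypothetical host)
        "s3.path-style-access", "true",               // path-style access, typical for on-prem stores
        "s3.write.storage-class", "STANDARD",         // storage class for written objects
        "s3.access-key-id", "minio",                  // placeholder credentials
        "s3.secret-access-key", "minio123");

S3FileIO fileIO = new S3FileIO();
fileIO.initialize(properties);
{code}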
h3. *Issue 2: Timestamp Type Mismatch Between NiFi and Iceberg*
After enabling on-prem S3 support, I encountered a timestamp compatibility
issue when writing records containing timestamp fields: NiFi represents
timestamps as java.sql.Timestamp, while Iceberg expects
java.time.LocalDateTime (see
[GenericParquetWriter|https://github.com/apache/iceberg/blob/730ce29d5cd722b1751a1984d9eabb68542eba39/parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetWriter.java#L122]).
h4. Unpartitioned Tables
Initially, I added a converter to handle the type conversion, which resolved
the issue for unpartitioned Iceberg tables.
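A minimal sketch of such a conversion (the method name is illustrative, not the exact code in the PR):
{code:java}
import java.sql.Timestamp;
import java.time.LocalDateTime;

// Convert a NiFi record value to the LocalDateTime that Iceberg's
// GenericParquetWriter expects for timestamp columns.
static LocalDateTime toIcebergTimestamp(final Object value) {
    if (value instanceof Timestamp) {
        return ((Timestamp) value).toLocalDateTime();
    }
    return (LocalDateTime) value;
}
{code}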
h4. Partitioned Tables
However, when the timestamp column was used as a partition key, writes
failed again. Further investigation showed that Iceberg internally expects
timestamp partition-key values to be represented both as Long and as
LocalDateTime at different points in the write path.
To resolve this, I leveraged Iceberg's InternalRecordWrapper, which correctly
handles this dual representation and allows partitioned writes to succeed.
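A sketch of how InternalRecordWrapper fits into partition-key derivation (the method shape is illustrative; the actual wiring is in the PR):
{code:java}
import org.apache.iceberg.PartitionKey;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.data.InternalRecordWrapper;

// Derive the partition key through InternalRecordWrapper so that timestamp
// fields are exposed to the partition transforms in Iceberg's internal
// (Long, microseconds) form, while the writer still sees LocalDateTime.
static PartitionKey partitionKeyFor(final Schema schema, final PartitionSpec spec, final StructLike record) {
    final InternalRecordWrapper wrapper = new InternalRecordWrapper(schema.asStruct());
    final PartitionKey partitionKey = new PartitionKey(spec, schema);
    partitionKey.partition(wrapper.wrap(record));
    return partitionKey;
}
{code}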
*PR*
I have created a PR with the necessary change
[here|https://github.com/apache/nifi/pull/10877].
was:
While working with PutIcebergRecord in NiFi 2.7.2, I encountered two separate
issues when writing to Apache Iceberg tables using an on-prem S3-compatible
object store and an Iceberg REST catalog.
h3. *Issue 1: On-Prem S3 Configuration Not Supported by S3FileIOProvider*
NiFi's default S3IcebergFileIOProvider does not expose the configuration
options required to connect to an on-prem S3-compatible object store
(e.g., MinIO).
Specifically, it does not allow configuring:
* Custom S3 endpoint
* Path-style access
* Storage class
As a result, PutIcebergRecord cannot be used with an on-prem S3 backend out of
the box. To resolve this, I extended S3IcebergFileIOProvider to support the
missing properties, enabling connectivity to on-prem S3-compatible storage
systems.
h3. *Issue 2: Timestamp Type Mismatch Between NiFi and Iceberg*
After enabling on-prem S3 support, I encountered a timestamp compatibility
issue when writing records containing timestamp fields: NiFi represents
timestamps as java.sql.Timestamp, while Iceberg expects
java.time.LocalDateTime (see
[GenericParquetWriter|https://github.com/apache/iceberg/blob/730ce29d5cd722b1751a1984d9eabb68542eba39/parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetWriter.java#L122]).
h4. Unpartitioned Tables
Initially, I added a converter to handle the type conversion, which resolved
the issue for unpartitioned Iceberg tables.
h4. Partitioned Tables
However, when the timestamp column was used as a partition key, writes
failed again. Further investigation showed that Iceberg internally expects
timestamp partition-key values to be represented both as Long and as
LocalDateTime at different points in the write path.
To resolve this, I leveraged Iceberg's InternalRecordWrapper, which correctly
handles this dual representation and allows partitioned writes to succeed.
*PR*
I have created a PR with the necessary change
[here|https://github.com/apache/nifi/pull/10872].
> Iceberg S3 on-prem support and iceberg-parquet timestamp fix.
> -------------------------------------------------------------
>
> Key: NIFI-15568
> URL: https://issues.apache.org/jira/browse/NIFI-15568
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Affects Versions: 2.7.0, 2.8.0, 2.7.1, 2.7.2
> Reporter: Nir Yanay
> Priority: Minor
> Time Spent: 40m
> Remaining Estimate: 0h
--
This message was sent by Atlassian Jira
(v8.20.10#820010)