[ https://issues.apache.org/jira/browse/NIFI-11449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jim Steinebrey reassigned NIFI-11449:
-------------------------------------

    Assignee:     (was: Jim Steinebrey)

> Investigate Iceberg insert on Object Storage
> --------------------------------------------
>
>                 Key: NIFI-11449
>                 URL: https://issues.apache.org/jira/browse/NIFI-11449
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>    Affects Versions: 1.21.0
>         Environment: Any NiFi deployment
>            Reporter: Abdelrahim Ahmad
>            Priority: Blocker
>              Labels: Trino, autocommit, database, iceberg, putdatabaserecord
>
> The issue is with the {{PutDatabaseRecord}} processor in Apache NiFi. When the processor is used with the Trino JDBC driver or the Dremio JDBC driver to write to an Iceberg catalog, it disables autocommit. This leads to errors such as "{*}Catalog only supports writes using autocommit: iceberg{*}". The processor needs a property that allows autocommit to be enabled or disabled.
> Enabling autocommit in the NiFi {{PutDatabaseRecord}} processor is important for Delta Lake, Iceberg, and Hudi, as it ensures data consistency and integrity by allowing atomic writes in the underlying database. It would also allow the processor to be used with a wider range of databases.
> _Improving this processor would allow NiFi to be the main tool for ingesting data into these new technologies, so we would not have to deal with another tool to do so._
> +*_{color:#de350b}BUT:{color}_*+
> I have reviewed the {{PutDatabaseRecord}} processor in NiFi. It inserts records one by one into the database using a prepared statement and commits the transaction at the end of the loop that processes the records. This approach can be inefficient and slow when inserting large volumes of data into tables that are optimized for bulk ingestion, such as Delta Lake, Iceberg, and Hudi tables.
> These tables use various techniques to optimize the performance of bulk ingestion, such as partitioning, clustering, and indexing. Inserting records one by one through a prepared statement can bypass these optimizations, leading to poor performance and potentially causing issues such as excessive disk usage, increased memory consumption, and degraded query performance.
> To avoid these issues, it is recommended to add a new processor, or a feature to the current one, that performs bulk inserts with autocommit enabled when writing large volumes of data into Delta Lake, Iceberg, and Hudi tables (see the sketch after this message).
>
> P.S.: PutSQL does not offer an autocommit option either, and it has the same performance problem described above.
> Thanks and best regards :)
> Abdelrahim Ahmad


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
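For reference, a minimal JDBC sketch of the two points above, assuming the Trino JDBC driver is on the classpath; the connection URL, user, and the {{events}} table are hypothetical. It shows the autocommit setting a processor property would need to expose, together with a batched insert in place of row-by-row execution:

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IcebergAutocommitSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical Trino endpoint and Iceberg catalog/schema; adjust to your deployment.
        String url = "jdbc:trino://localhost:8080/iceberg/analytics";

        try (Connection conn = DriverManager.getConnection(url, "nifi", null)) {
            // PutDatabaseRecord calls setAutoCommit(false) and commits at the end of
            // its record loop; Trino's Iceberg connector rejects that with
            // "Catalog only supports writes using autocommit: iceberg".
            // A processor-level property would let flows keep autocommit on:
            conn.setAutoCommit(true);

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO events (id, payload) VALUES (?, ?)")) {
                for (long id = 0; id < 10_000; id++) {
                    ps.setLong(1, id);
                    ps.setString(2, "record-" + id);
                    ps.addBatch(); // accumulate rows instead of one execute per record
                }
                // One batched write; how commits are grouped in autocommit mode is
                // driver-dependent, but the per-row round trips are gone.
                ps.executeBatch();
            }
        }
    }
}
{code}

Whether batched statements commit individually or together under autocommit is driver-specific, which is why the request covers both an autocommit toggle and a bulk-insert path rather than autocommit alone.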