[ https://issues.apache.org/jira/browse/NIFI-11449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jim Steinebrey reassigned NIFI-11449:
-------------------------------------

    Assignee:     (was: Jim Steinebrey)

> Investigate Iceberg insert on Object Storage
> --------------------------------------------
>
>                 Key: NIFI-11449
>                 URL: https://issues.apache.org/jira/browse/NIFI-11449
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>    Affects Versions: 1.21.0
>         Environment: Any NiFi deployment
>            Reporter: Abdelrahim Ahmad
>            Priority: Blocker
>              Labels: Trino, autocommit, database, iceberg, putdatabaserecord
>
> The issue is with the {{PutDatabaseRecord}} processor in Apache NiFi. When the processor is used with the Trino JDBC driver or the Dremio JDBC driver to write to an Iceberg catalog, it disables autocommit. This leads to errors such as "{*}Catalog only supports writes using autocommit: iceberg{*}". The processor needs a property that allows autocommit to be enabled or disabled.
> Enabling autocommit in the NiFi {{PutDatabaseRecord}} processor is important for Delta Lake, Iceberg, and Hudi, as it ensures data consistency and integrity by allowing atomic writes in the underlying database. It would also allow the processor to be used with a wider range of databases.
> _Improving this processor would allow NiFi to be the main tool for ingesting data into these new technologies, so we would not have to deal with another tool to do so._
> +*_{color:#de350b}BUT:{color}_*+
> I have reviewed the {{PutDatabaseRecord}} processor in NiFi. It inserts records one by one into the database using a prepared statement and commits the transaction at the end of the loop that processes the records. This approach can be inefficient and slow when inserting large volumes of data into tables that are optimized for bulk ingestion, such as Delta Lake, Iceberg, and Hudi tables.
> These tables use various techniques to optimize the performance of bulk ingestion, such as partitioning, clustering, and indexing. Inserting records one by one through a prepared statement can bypass these optimizations, leading to poor performance and potentially causing issues such as excessive disk usage, increased memory consumption, and degraded query performance.
> To avoid these issues, it is recommended to add a new processor, or a feature to the current one, that performs bulk inserts with autocommit enabled when writing large volumes of data into Delta Lake, Iceberg, and Hudi tables (see the sketch after this message).
>
> P.S.: PutSQL does not offer an autocommit option either, and it has the same performance problem described above.
> Thanks and best regards :)
> Abdelrahim Ahmad


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
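For reference, a minimal JDBC sketch of the two points above, assuming the Trino JDBC driver is on the classpath; the connection URL, user, and the {{events}} table are hypothetical. It shows the autocommit setting a processor property would need to expose, together with a batched insert in place of row-by-row execution:

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IcebergAutocommitSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical Trino endpoint and Iceberg catalog/schema; adjust to your deployment.
        String url = "jdbc:trino://localhost:8080/iceberg/analytics";

        try (Connection conn = DriverManager.getConnection(url, "nifi", null)) {
            // PutDatabaseRecord calls setAutoCommit(false) and commits at the end of
            // its record loop; Trino's Iceberg connector rejects that with
            // "Catalog only supports writes using autocommit: iceberg".
            // A processor-level property would let flows keep autocommit on:
            conn.setAutoCommit(true);

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO events (id, payload) VALUES (?, ?)")) {
                for (long id = 0; id < 10_000; id++) {
                    ps.setLong(1, id);
                    ps.setString(2, "record-" + id);
                    ps.addBatch(); // accumulate rows instead of one execute per record
                }
                // One batched write; how commits are grouped in autocommit mode is
                // driver-dependent, but the per-row round trips are gone.
                ps.executeBatch();
            }
        }
    }
}
{code}

Whether batched statements commit individually or together under autocommit is driver-specific, which is why the request covers both an autocommit toggle and a bulk-insert path rather than autocommit alone.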