cleverxiao001 opened a new issue, #1494: URL: https://github.com/apache/cloudberry/issues/1494
### Apache Cloudberry version

apache-cloudberry-2.0.0-incubating

### What happened

The cluster is configured with 1 coordinator node and 24 segment nodes, with no standby or mirror nodes deployed. Storage is on NVMe drives, and both `limits.conf` and `sysctl.conf` have been modified in accordance with the documentation requirements.

We are importing paper data into the database. Each paper has multiple authors and multiple references; specifically, one paper includes 10 authors and 50 references, so 10 million papers amount to 100 million author rows and 500 million reference rows. The database consists of three tables: a basic paper information table, an author table, and a reference table. Each table has 30 columns, including varchar, text, int, and text[] types, and all tables are append-optimized, column-oriented, and zstd-compressed.

Data import is performed with Spark. After multiple tests, the basic paper data and the author data import successfully; however, random errors occur exclusively on the reference table.

<img width="1843" height="564" alt="Image" src="https://github.com/user-attachments/assets/4e3ea629-7d9d-4bd6-8344-8e044a0b04c5" />

The Spark error is shown above.

<img width="1829" height="509" alt="Image" src="https://github.com/user-attachments/assets/57b9f9b2-4255-4e52-87ac-6fc3e5a41735" />

The database error is shown above.

### What you think should happen instead

_No response_

### How to reproduce

The copy function is:

```scala
import java.io.StringReader
import java.sql.DriverManager
import java.util.Properties

import org.apache.spark.sql.{DataFrame, Row}
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection

def apply(df: DataFrame, pgUrl: String, tableName: String, connectionProperties: Properties): Unit = {
  df.rdd.foreachPartition { iter =>
    // One JDBC connection per Spark partition.
    val conn = DriverManager.getConnection(pgUrl, connectionProperties)
    val copyManager = new CopyManager(conn.asInstanceOf[BaseConnection])
    val sql = s"COPY $tableName FROM STDIN WITH (FORMAT csv, NULL '\\N')"

    // Buffer the whole partition as CSV in memory, then stream it in a single COPY.
    val sb = new StringBuilder
    iter.foreach { row: Row => sb.append(rowToCsv(row) + "\n") } // rowToCsv serializes a Row to one CSV line
    val reader = new StringReader(sb.toString)
    copyManager.copyIn(sql, reader)
    conn.close()
  }
}
```

This function is used to import the 500 million reference rows concurrently.

### Operating System

Rocky 9.7

### Anything else

A temporary workaround is to submit the data in multiple smaller batches, which allows all of the data to be imported successfully; a sketch of this batched approach is included at the end of this report.

### Are you willing to submit PR?

- [x] Yes, I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/cloudberry/blob/main/CODE_OF_CONDUCT.md).
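For completeness, here is a minimal sketch of the batched workaround mentioned under "Anything else". It assumes the same pgjdbc `CopyManager` and the same `rowToCsv` helper as the reproduction code above; the `copyInBatches` name and the 50,000-row default batch size are illustrative assumptions, not the exact code we run.

```scala
import java.io.StringReader
import java.sql.DriverManager
import java.util.Properties

import org.apache.spark.sql.DataFrame
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection

// Sketch of the batched workaround: each Spark partition is flushed to the
// database in fixed-size COPY batches instead of one COPY per whole partition.
// copyInBatches and the default batchSize of 50,000 are illustrative assumptions.
def copyInBatches(df: DataFrame, pgUrl: String, tableName: String,
                  connectionProperties: Properties, batchSize: Int = 50000): Unit = {
  df.rdd.foreachPartition { iter =>
    val conn = DriverManager.getConnection(pgUrl, connectionProperties)
    try {
      val copyManager = new CopyManager(conn.asInstanceOf[BaseConnection])
      val sql = s"COPY $tableName FROM STDIN WITH (FORMAT csv, NULL '\\N')"
      val sb = new StringBuilder

      // Send whatever has been buffered so far as one smaller COPY command.
      def flush(): Unit = {
        if (sb.nonEmpty) {
          copyManager.copyIn(sql, new StringReader(sb.toString))
          sb.clear()
        }
      }

      var count = 0
      iter.foreach { row =>
        sb.append(rowToCsv(row)).append("\n") // rowToCsv as in the reproduction code
        count += 1
        if (count % batchSize == 0) flush()
      }
      flush() // flush the tail of the partition
    } finally {
      conn.close() // release the connection even if a COPY fails
    }
  }
}
```

Flushing in bounded batches keeps the in-memory CSV buffer on each executor and each individual COPY command small, which matches the observation above that splitting the load into batches lets the full data set import successfully.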
