cleverxiao001 opened a new issue, #1494:
URL: https://github.com/apache/cloudberry/issues/1494

   ### Apache Cloudberry version
   
   apache-cloudberry-2.0.0-incubating
   
   ### What happened
   
   The cluster is configured with 1 coordinator node and 24 segment nodes, with no standby or mirror nodes deployed. Storage is on NVMe drives, and both limits.conf and sysctl.conf have been modified in accordance with the documentation.
   We are importing paper data into the database. Each paper has multiple authors and multiple references; specifically, one paper includes 10 authors and 50 references, so 10 million papers amount to 100 million author rows and 500 million reference rows. The database consists of three tables: a basic paper information table, an author table, and a reference table. Each table contains 30 fields, including varchar, text, int, and text[] columns, all using append-optimized, column-oriented storage with zstd compression.
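   For illustration, a minimal DDL sketch matching the storage options described above (Cloudberry/Greenplum WITH clause syntax); the table name, column names, and distribution key are assumptions, not the actual schema:

   ```scala
   // Hypothetical DDL for the reference table; the real table has ~30 columns.
   // appendoptimized/orientation/compresstype mirror the settings described above.
   val referenceDdl =
     """CREATE TABLE paper_reference (
       |  paper_id  bigint,
       |  ref_title text,
       |  ref_ids   text[]
       |  -- ... remaining columns elided
       |) WITH (appendoptimized = true, orientation = column, compresstype = zstd)
       |DISTRIBUTED BY (paper_id)""".stripMargin
   ```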
   Data import is performed using Spark. After multiple tests, the basic paper data and the author data import successfully; however, random errors occur, and only in the reference table.
   
   <img width="1843" height="564" alt="Image" 
src="https://github.com/user-attachments/assets/4e3ea629-7d9d-4bd6-8344-8e044a0b04c5";
 />
   spark error like this
   
   <img width="1829" height="509" alt="Image" 
src="https://github.com/user-attachments/assets/57b9f9b2-4255-4e52-87ac-6fc3e5a41735";
 />
   database error like this
   
   ### What you think should happen instead
   
   _No response_
   
   ### How to reproduce
   
   The COPY function is:

   ```scala
   import java.io.StringReader
   import java.sql.DriverManager
   import java.util.Properties
   import org.apache.spark.sql.{DataFrame, Row}
   import org.postgresql.copy.CopyManager
   import org.postgresql.core.BaseConnection

   def apply(df: DataFrame, pgUrl: String, tableName: String,
             connectionProperties: Properties): Unit = {
     df.rdd.foreachPartition { iter =>
       // One connection and one COPY per Spark partition.
       val conn = DriverManager.getConnection(pgUrl, connectionProperties)
       try {
         val copyManager = new CopyManager(conn.asInstanceOf[BaseConnection])
         val sql = s"COPY $tableName FROM STDIN WITH (FORMAT csv, NULL '\\N')"
         // Accumulate the whole partition as CSV in memory, then stream it in a single COPY.
         val sb = new StringBuilder
         iter.foreach { row: Row =>
           sb.append(rowToCsv(row)).append('\n') // rowToCsv: caller-provided CSV serializer
         }
         copyManager.copyIn(sql, new StringReader(sb.toString))
       } finally {
         conn.close() // close even if the COPY fails
       }
     }
   }
   ```
   This function is used to import the 500 million reference rows concurrently.
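   For context, a minimal sketch of how this function might be invoked, assuming it lives on an object named, say, CopyHelper; the object name, source path, partition count, and connection details below are all assumptions:

   ```scala
   import java.util.Properties
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().appName("reference-import").getOrCreate()

   val props = new Properties()
   props.setProperty("user", "gpadmin") // assumed credentials

   // Hypothetical source holding the ~500 million reference rows.
   val refs = spark.read.parquet("/data/references")

   // Each partition opens its own connection and runs its own COPY, so this
   // issues up to 200 concurrent COPY sessions against the coordinator.
   CopyHelper(refs.repartition(200),
     "jdbc:postgresql://coordinator:5432/papers", "paper_reference", props)
   ```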
   
   ### Operating System
   
   Rocky 9.7
   
   ### Anything else
   
   A temporary workaround is to submit the data in multiple batches, which allows all of the data to be imported successfully.
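   For reference, a minimal sketch of that batching approach, reusing the same rowToCsv serializer as above; the function name, batch size, and flush strategy are assumptions, not the exact workaround code:

   ```scala
   import java.io.StringReader
   import java.sql.DriverManager
   import java.util.Properties
   import org.apache.spark.sql.{DataFrame, Row}
   import org.postgresql.copy.CopyManager
   import org.postgresql.core.BaseConnection

   def applyBatched(df: DataFrame, pgUrl: String, tableName: String,
                    connectionProperties: Properties, batchSize: Int = 10000): Unit = {
     df.rdd.foreachPartition { iter =>
       val conn = DriverManager.getConnection(pgUrl, connectionProperties)
       try {
         val copyManager = new CopyManager(conn.asInstanceOf[BaseConnection])
         val sql = s"COPY $tableName FROM STDIN WITH (FORMAT csv, NULL '\\N')"
         val sb = new StringBuilder
         var n = 0
         iter.foreach { row: Row =>
           sb.append(rowToCsv(row)).append('\n')
           n += 1
           if (n % batchSize == 0) { // flush every batchSize rows as its own COPY
             copyManager.copyIn(sql, new StringReader(sb.toString))
             sb.clear()
           }
         }
         if (sb.nonEmpty) copyManager.copyIn(sql, new StringReader(sb.toString))
       } finally {
         conn.close()
       }
     }
   }
   ```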
   
   ### Are you willing to submit PR?
   
   - [x] Yes, I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/cloudberry/blob/main/CODE_OF_CONDUCT.md).
   

