Hi

Has there been a strategic change to the way the File component processes 
multiple files in one directory in version 3?

It seems that it processes them in parallel, which in our situation creates a 
memory issue.

Code:
from(file("{{esma.full.path}}")
        .delete(true)
        .sortBy("${file:name}"))
    .description("Full Import", "Imports FUL files and persists in database", "en")
    .autoStartup("{{esma.full.startup}}")
    .streamCaching()
    .log("Processing file ${file:name}")
    .unmarshal()
        .zipFile()
    .split()
        .tokenizeXML("RefData")
        .streaming()
        .parallelProcessing(true)
        .bean(XmlToSqlBean.class)
        .choice()
            .when(body().isNotNull())
                .to(jdbc("default"))
                .to(log("Full Import").level(LoggingLevel.INFO.toString())
                    .groupInterval(60_000L)
                    .groupActiveOnly(true))
            .when(simple("${header.CamelSplitComplete} == true"))
                .log("Number of records split: ${header.CamelSplitSize}")
                .log("Importing complete: ${header.CamelFileName}")
        .endChoice()
    .end();

This route processes several zip files, unmarshals them and adds records to a 
database.

The logs seem to reveal this scenario:
It logs every file in the directory, as if it were processing them in 
parallel, and then the ThroughputLogger starts printing every minute. This 
logger is using one thread.

2021-10-25 08:31:58.106 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_C_20211023_01of01.zip
2021-10-25 08:32:00.273 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_D_20211023_01of03.zip
2021-10-25 08:32:10.922 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_D_20211023_02of03.zip
2021-10-25 08:32:19.126 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_D_20211023_03of03.zip
2021-10-25 08:32:19.762 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_E_20211023_01of02.zip
2021-10-25 08:32:25.621 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_F_20211023_01of01.zip
2021-10-25 08:32:26.911 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_H_20211023_01of01.zip
2021-10-25 08:32:31.961 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_J_20211023_01of01.zip
2021-10-25 08:32:36.249 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_O_20211023_01of02.zip
2021-10-25 08:32:41.654 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_O_20211023_02of02.zip
2021-10-25 08:32:44.830 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_R_20211023_01of06.zip
2021-10-25 08:32:49.406 [Camel (FIRDSDatabase) thread #3 - ThroughputLogger] 
INFO Full Import.log - Received: 35392 new messages, with total 35392 so far. 
Last group took: 48977 millis which is: 722.625 messages per second. average: 
722.625
2021-10-25 08:32:49.724 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_R_20211023_02of06.zip
2021-10-25 08:32:54.880 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_R_20211023_03of06.zip
2021-10-25 08:33:00.867 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_R_20211023_04of06.zip
2021-10-25 08:33:06.265 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_R_20211023_05of06.zip
2021-10-25 08:33:11.222 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_R_20211023_06of06.zip
2021-10-25 08:33:14.923 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_S_20211023_01of02.zip
2021-10-25 08:33:20.119 [Camel (FIRDSDatabase) thread #5 - 
file://FIRDS/input/full] INFO Full Import.log - Processing file 
FULINS_S_20211023_02of02.zip

We are using parallel processing for the contents of each zip file (XML) but 
not for the files themselves.
If I don't use stream caching, it wreaks havoc on the server with 
OutOfMemoryError and the like.
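
To make the behaviour we expected concrete, here is a minimal sketch in plain 
Java, without Camel. It is not how the File component or the Splitter is 
implemented. It only illustrates "one file at a time, records in parallel". 
All names (SequentialFilesParallelRecords, processFile, processAll, the 
in-flight counters) are hypothetical, and record.length() is a stand-in for 
XmlToSqlBean plus the JDBC insert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class SequentialFilesParallelRecords {

    // Hypothetical instrumentation: files currently in flight, and the peak seen.
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger peakInFlight = new AtomicInteger();

    // Handles the records of ONE file on the shared pool and blocks until
    // every record is done, so the caller cannot pick up the next file early.
    static int processFile(List<String> records, ExecutorService pool) throws Exception {
        peakInFlight.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String record : records) {
                // Stand-in for the XML-to-SQL conversion and the JDBC insert.
                tasks.add(() -> record.length());
            }
            int done = 0;
            for (Future<Integer> result : pool.invokeAll(tasks)) { // waits for all tasks
                result.get(); // rethrows a failure from any record
                done++;
            }
            return done;
        } finally {
            inFlight.decrementAndGet();
        }
    }

    // Walks the files one after another on the calling thread.
    static int processAll(List<List<String>> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int total = 0;
            for (List<String> file : files) {
                total += processFile(file, pool);
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> files = List.of(
                List.of("a", "b", "c"),
                List.of("d", "e"));
        System.out.println("records: " + processAll(files, 4));
        System.out.println("peak files in flight: " + peakInFlight.get());
    }
}
```

In this sketch the peak number of files in flight stays at one because 
processFile blocks until all of its records have completed. That is the 
behaviour we assumed the route would have.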

This runs on Spring Boot 2.5.6 and Camel 3.11.3.

Maybe I have done it the wrong way, but file processing is a bread-and-butter 
EIP, so it shouldn't be a concern, but still…
The files are around 15 MB zipped; unzipped, each is one XML file of about 
0.5 GB. Each XML file contains around 500K records to split on. This is a 
critical memory issue, I know, but it wouldn't be if the files were processed 
sequentially.
Looking at the database connections (using a HikariCP pool) I see 12 
connections active; I assume this is equivalent to the number of threads in 
the split. It performs around 800 records / second.
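
As a rough cross-check of the numbers above against the log excerpt, here is a 
small sketch. The class and method names are hypothetical. The worst-case 
figure assumes that all 18 files seen in the log are held unzipped at the same 
time, which is an upper bound and ignores any spooling to disk by stream 
caching.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class ImportEstimate {

    // Seconds needed to import one file at a steady insert rate.
    static double secondsPerFile(long recordsPerFile, double recordsPerSecond) {
        return recordsPerFile / recordsPerSecond;
    }

    // Milliseconds elapsed between two log timestamps.
    static long millisBetween(String first, String last) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");
        return Duration.between(
                LocalDateTime.parse(first, fmt),
                LocalDateTime.parse(last, fmt)).toMillis();
    }

    // Upper bound on unzipped data if every file is open at once (assumption).
    static double worstCaseGb(int files, double gbPerFile) {
        return files * gbPerFile;
    }

    public static void main(String[] args) {
        // ~500K records per file at ~800 records/second.
        System.out.println("seconds per file: " + secondsPerFile(500_000, 800));
        // First and last "Processing file" lines in the excerpt.
        System.out.println("ms to start all files: "
                + millisBetween("2021-10-25 08:31:58.106", "2021-10-25 08:33:20.119"));
        // 18 "Processing file" lines, ~0.5 GB unzipped each.
        System.out.println("worst case GB: " + worstCaseGb(18, 0.5));
    }
}
```

At roughly 625 seconds per file, a sequential import could not have started 
18 files within about 82 seconds, which is why the log reads to us as if the 
files are being picked up in parallel.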

Please advise

/M
