Last year at ApacheCon, I showed a demo [1] related to processing a file in parallel in multiple threads (in 'splits' - term borrowed from hdfs - of a configurable size). I used a relatively small csv file for my demo, not xml, but it works exactly the same with xml. Take a look at it, I believe it'll help. There is another example there for generating a massive volume of test messages, if you care.

Cheers,
Hadrian


[1] https://github.com/hzbarcea/apachecon/tree/master/aceu2012/camel-filesplit
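For flavor, the split idea above can be sketched in plain Java with no Camel at all. This is a minimal sketch under my own assumptions: the class name FileSplitDemo and the line-counting "processing" are placeholders, not anything from Hadrian's demo. Each split is a byte range extended forward to the next newline so that no line straddles two splits, and each split is then processed on its own thread.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FileSplitDemo {
    /** One byte-range "split" of the file (term borrowed from HDFS). */
    record Split(long start, long end) {}

    /** Divide the file into splits of roughly splitSize bytes, each ending on a newline. */
    static List<Split> computeSplits(Path file, long splitSize) throws IOException {
        long length = Files.size(file);
        List<Split> splits = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long start = 0;
            while (start < length) {
                long end = Math.min(start + splitSize, length);
                // advance end to the next newline so no line straddles two splits
                raf.seek(end);
                while (end < length && raf.read() != '\n') end++;
                if (end < length) end++; // include the newline itself
                splits.add(new Split(start, end));
                start = end;
            }
        }
        return splits;
    }

    /** Count the lines in one split; stands in for real per-line processing. */
    static long processSplit(Path file, Split s) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(s.start());
            byte[] buf = new byte[(int) (s.end() - s.start())];
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8).lines().count();
        }
    }

    public static void main(String[] args) throws Exception {
        // build a small throwaway csv, then process its splits in parallel
        Path file = Files.createTempFile("splitdemo", ".csv");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("row-").append(i).append(",value\n");
        Files.writeString(file, sb.toString());

        List<Split> splits = computeSplits(file, 4096);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Long>> results = new ArrayList<>();
        for (Split s : splits) results.add(pool.submit(() -> processSplit(file, s)));
        long total = 0;
        for (Future<Long> f : results) total += f.get();
        pool.shutdown();
        System.out.println("splits=" + splits.size() + " lines=" + total); // lines=1000
        Files.delete(file);
    }
}
```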



On 02/21/2013 04:10 PM, cristisor wrote:
Hello everybody,

I'm using Fuse ESB with Apache Camel 2.4.0 (I think) to process some large
files. Until now, a service unit deployed in ServiceMix would read the file
line by line and send an exchange containing each line to a second service
unit, which analyzes the line and transforms it into an xml according to
some parameters. That exchange then goes to a third service unit that maps
the xml to another xml format, and finally to a fourth service unit that
unmarshals the xml and inserts the resulting object into a database. I
arrived on the project recently; the architecture and the design are not
mine, but I have to fix some serious performance problems. I suspect that
reading the files line by line is slowing the processing down considerably,
so I inserted an AggregationStrategy to aggregate 100 - 200 lines at once.
Here I get into trouble:
- if I send an exchange with more than one line, I have to make a lot of
changes to the xml-to-xml mappers, choice processors, etc.
- even if I solve the first problem, reading 500 lines at once and building
one big xml from the data leads to an OutOfMemoryError, so I would have to
stay at around 50 lines to make sure no exceptions arise

What I'm looking for is a way to read 500 - 1000 lines at once but send each
one in a separate exchange to the service unit that creates the initial
xml. My route currently looks similar to this:

from("file://myfile.txt")
        .marshal().string("UTF-8")
        .split(body().tokenize("\n")).streaming()
                .setHeader("foo", constant("foo"))
                .aggregate(header("foo"), new StringBodyAggregator())
                        .completionSize(50)
                .process(processor)
                .to("activemq queue");

I read something about a ProducerTemplate but I'm not sure whether it can
help me. Basically, I want to insert a mechanism that sends more than one
exchange, one for each line read, to the processor and then to the
endpoint. That way I read from the file in batches of hundreds or thousands
of lines, but keep using the old mechanism for mapping, one line at a time.
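The batching idea described above can be sketched in plain stdlib Java. This is only an illustration under my own assumptions: the class name BatchLineReader and the Consumer callback are hypothetical stand-ins for "send an exchange per line". (In Camel itself, newer releases let tokenize group lines together, which could feed a second per-line split, but that option may not exist in 2.4.0.) The sketch reads lines in batches of batchSize but hands them downstream one at a time, so the per-line contract is preserved while memory stays bounded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchLineReader {
    /**
     * Read the file in batches of batchSize lines, but invoke perLine once per
     * line so downstream processing keeps its one-line-per-exchange contract.
     * Returns the total number of lines dispatched.
     */
    static long process(Path file, int batchSize, Consumer<String> perLine) throws IOException {
        long total = 0;
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    batch.forEach(perLine); // one "exchange" per line
                    total += batch.size();
                    batch.clear();          // keep memory bounded
                }
            }
            batch.forEach(perLine);         // flush the final partial batch
            total += batch.size();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("batch", ".txt");
        Files.writeString(f, "one\ntwo\nthree\nfour\nfive\n");
        long n = process(f, 2, l -> System.out.println("send: " + l));
        System.out.println("lines dispatched: " + n); // lines dispatched: 5
        Files.delete(f);
    }
}
```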

Thank you.



--
View this message in context: 
http://camel.465427.n5.nabble.com/Large-file-processing-with-Apache-Camel-tp5727977.html
Sent from the Camel - Users mailing list archive at Nabble.com.
