Hi Claus,

I'm afraid it would be difficult to produce a code snippet that is
relevant to the problem without disclosing sensitive code.

I have, however, done some further analysis.

First, I wasn't exactly correct when I said that using parallel processing
makes no difference.

When I run the version without concurrency (Camel thread + 0 pool threads) it
takes 38s. When I run with a thread pool of 1 (1 Camel thread + 1 pool thread)
it takes 26s. Adding more threads to the pool doesn't improve performance any
further, and eventually makes it worse after about 8 threads (probably the
overhead of thread context switching). All of these runs use a split size of
1000.

So to me it seems that most of the work is actually done splitting the file,
and therefore there is little to be gained by adding more threads to process
the lines.
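
As a rough sanity check on that reading, here is a back-of-the-envelope
model in plain Java. It is only a sketch under assumptions: PipelineModel and
estimateSeconds are made-up names, the model assumes the splitting thread and
the pool overlap perfectly, and the 26s/12s breakdown is inferred from the two
timings above rather than measured directly. If the 26s were instead the
processing time, a second pool thread should have roughly halved it, which is
not what I see.

```java
import java.util.Locale;

/** Back-of-the-envelope model of one splitting thread feeding a worker pool. */
final class PipelineModel {

    /**
     * Estimated wall-clock seconds for one splitting thread plus `workers` pool threads.
     * workers == 0 means the splitting thread also processes the lines (no overlap).
     * Assumes perfect overlap and perfectly divisible processing work.
     */
    static double estimateSeconds(double splitSeconds, double processSeconds, int workers) {
        if (workers == 0) {
            return splitSeconds + processSeconds;
        }
        // With overlap, the slower of the two stages dictates the total time.
        return Math.max(splitSeconds, processSeconds / workers);
    }

    public static void main(String[] args) {
        double split = 26.0;   // assumption: inferred from the run with 1 pool thread
        double process = 12.0; // assumption: 38s minus 26s
        for (int workers : new int[] {0, 1, 2, 4, 8}) {
            System.out.printf(Locale.ROOT, "workers=%d -> %.1fs%n",
                    workers, estimateSeconds(split, process, workers));
        }
    }
}
```

Under these assumptions the estimate stays at 26s for any pool size of 1 or
more, which matches the plateau I observe (the model ignores the
context-switching overhead seen beyond about 8 threads).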

I have also made an interesting run with a split size of 1m (file size ~4m)
and a pool of 4 threads.

The thread activity looks like this:

<http://camel.465427.n5.nabble.com/file/n5737423/CamelParallelProcessing3.jpg> 

We can see each thread of the pool processing 1m lines but they don't seem
to interleave very well.

I have also separately tested the piece of code that processes the lines of
the file, and it scales well up to 4 threads (about 2.5x speedup).

All of this is a bit confusing, but it seems that splitting the file is the
most time-consuming task. Is there a way in Camel to leverage concurrency to
split the file?
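
To make the question concrete, this is roughly what I mean by splitting
concurrently, written as a plain Java sketch outside Camel. It is not a Camel
API. ChunkedLineCounter and its methods are names made up for illustration,
and counting lines stands in for the real per-line processing. It assumes
'\n'-terminated lines in an encoding where the byte 0x0A only ever means a
newline (ASCII or UTF-8, for example). The idea is to cut the file into byte
ranges, move each boundary forward to the start of a line, and let each
thread read its own range.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ChunkedLineCounter {

    /** Returns parts + 1 offsets; every offset is 0, the file length, or just after a '\n'. */
    static long[] lineAlignedOffsets(Path file, int parts) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            long[] offsets = new long[parts + 1];
            offsets[parts] = length;
            for (int i = 1; i < parts; i++) {
                // Start from an even cut, but never move behind the previous boundary.
                long pos = Math.max(length * i / parts, offsets[i - 1]);
                if (pos > 0) {
                    raf.seek(pos - 1);
                    int b;
                    // Unbuffered single-byte reads are fine here: at most one line per boundary.
                    while ((b = raf.read()) != -1 && b != '\n') {
                        // keep reading until a newline has been consumed or EOF is hit
                    }
                    pos = raf.getFilePointer();
                }
                offsets[i] = pos;
            }
            return offsets;
        }
    }

    /** Counts the lines in the byte range [start, end); stand-in for real per-line work. */
    static long countLines(Path file, long start, long end) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            channel.position(start);
            InputStream in = new BufferedInputStream(Channels.newInputStream(channel));
            long lines = 0;
            int last = '\n';
            for (long i = start; i < end; i++) {
                int b = in.read();
                if (b == -1) {
                    break;
                }
                if (b == '\n') {
                    lines++;
                }
                last = b;
            }
            // A final line without a trailing newline still counts as a line.
            return last == '\n' ? lines : lines + 1;
        }
    }

    /** Splits the file into `parts` line-aligned ranges and processes them on a pool. */
    static long countLinesInParallel(Path file, int parts) throws Exception {
        long[] offsets = lineAlignedOffsets(file, parts);
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < parts; i++) {
                final long start = offsets[i];
                final long end = offsets[i + 1];
                Callable<Long> task = () -> countLines(file, start, end);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("lines", ".txt");
        Files.write(file, Arrays.asList("one", "two", "three", "four", "five"));
        System.out.println(countLinesInParallel(file, 4));
        Files.delete(file);
    }
}
```

Whether this would actually help depends on whether the 26s is spent on disk
I/O or on the line scanning itself, which I have not measured separately.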

PS: btw, I'm running my tests on a 16-core/32-thread machine (2x Intel Xeon
E5-2650).



--
View this message in context: 
http://camel.465427.n5.nabble.com/Parallel-processing-of-big-file-tp5737386p5737423.html
Sent from the Camel - Users mailing list archive at Nabble.com.
