We have noticed a number of issues in 2.18.3 when using parallel multicast/recipientList with a timeout, either under heavy load or in pathological situations (for example, decreased throughput in one or more tasks).

1. If any of the tasks cannot be submitted, typically due to a RejectedExecutionException, the AggregateOnTheFlyTask never terminates and instead calls the timeout method continually. Excessive calls to timeout could also happen if the thread pool has a CallerRuns policy, but I haven't attempted to reproduce that. I am fairly sure the issue was introduced in 2.15.x by the changes in CAMEL-8081.

2. The timeout does not start until the first submitted task begins running. This can be a substantial delay if tasks are being queued in a thread pool. The timeout StopWatch really needs to start at the very beginning, before the tasks are submitted. It would also help drain the queue if the submitted Callable failed fast when the timeout has already run out.

3. Related to (2), the main thread waits forever for the aggregate task to complete. If a timeout is given, it should be honored, or at least used to provide a reasonable escape.

4. This is more of a comment, but I would be very wary of the parallelAggregate option. There are a lot of potential races there, especially after a timeout. It seems like it could spin its wheels while the parallel aggregate completes, call timeout too many times, and/or exit before the aggregation strategy actually completes.

I can reproduce (1) in a test case, so I can file a JIRA for it. I might be able to come up with a similar test for (2) and (3). I think a JIRA for those, though related, should probably be separate.
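
To make the behaviour I am asking for in (1) through (3) concrete, here is a minimal sketch using only java.util.concurrent. It is not Camel code and not a proposed patch: DeadlineFanOut, Result and run are hypothetical names invented for illustration. The assumptions are that the deadline is fixed before any task is submitted, that a rejected submission is counted as finished so the waiter can terminate, that a task which starts after the deadline skips its work, and that the caller waits at most until the deadline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names exist in Camel.
final class DeadlineFanOut {

    /** Snapshot of the counters at the moment the caller stopped waiting. */
    static final class Result {
        final int completed;    // tasks that ran to completion
        final int skipped;      // tasks that started after the deadline and failed fast
        final int rejected;     // tasks the pool refused to accept
        final boolean timedOut; // true if the caller gave up at the deadline

        Result(int completed, int skipped, int rejected, boolean timedOut) {
            this.completed = completed;
            this.skipped = skipped;
            this.rejected = rejected;
            this.timedOut = timedOut;
        }
    }

    static Result run(ExecutorService pool, List<Callable<?>> tasks, long timeoutMillis)
            throws InterruptedException {
        // Point (2): the clock starts here, before anything is submitted.
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        final CountDownLatch done = new CountDownLatch(tasks.size());
        final AtomicInteger completed = new AtomicInteger();
        final AtomicInteger skipped = new AtomicInteger();
        int rejected = 0;

        for (final Callable<?> task : tasks) {
            Callable<Void> guarded = () -> {
                try {
                    // Point (2): fail fast if the deadline passed while we sat in the queue.
                    if (System.nanoTime() - deadline >= 0) {
                        skipped.incrementAndGet();
                        return null;
                    }
                    task.call();
                    completed.incrementAndGet();
                    return null;
                } finally {
                    done.countDown();
                }
            };
            try {
                pool.submit(guarded);
            } catch (RejectedExecutionException e) {
                // Point (1): a rejected task will never run, so account for it now.
                rejected++;
                done.countDown();
            }
        }

        // Point (3): the caller waits no longer than the remaining time.
        long remainingNanos = Math.max(deadline - System.nanoTime(), 0L);
        boolean finished = done.await(remainingNanos, TimeUnit.NANOSECONDS);
        return new Result(completed.get(), skipped.get(), rejected, !finished);
    }

    public static void main(String[] args) throws InterruptedException {
        // One worker, one queue slot, AbortPolicy: the third submission is rejected.
        ExecutorService pool = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(1), new ThreadPoolExecutor.AbortPolicy());
        List<Callable<?>> tasks = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            tasks.add(() -> { Thread.sleep(200); return null; });
        }
        Result r = run(pool, tasks, 100);
        System.out.println("completed=" + r.completed + " skipped=" + r.skipped
                + " rejected=" + r.rejected + " timedOut=" + r.timedOut);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The latch is counted down on every path (completed, skipped, failed, rejected), which is the property that seems to be missing in the rejected-submission case described in (1). How this would map onto AggregateOnTheFlyTask and the aggregation strategy is a separate question that the sketch does not try to answer.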