I guess my first message was too long, but that's alright; I ended up getting the answers I needed. I figured I should let you guys in on the results of my testing.
The summary of my initial write-up is that my main requirement is message throughput per host, so that is what my testing has focused on.
The bottom line is that I was able to send 4500 messages/sec between two hosts using 3 producer processes and 2 consumer processes, each using embedded brokers. That is the highest sustained throughput I was able to achieve, and while it was successful, the 2 consumers were using ~80% of the CPU on their host (two 2.4 GHz Xeons with HT enabled), leaving little available for my own processing of the messages. Interestingly enough, the 3 producers were using ~60% of the CPU on the other host.
As a sanity check, I compared a standalone broker to embedded brokers in a 1:1 configuration. The standalone broker maxed out around 1800 messages/sec, and the embedded brokers sustained 2500 messages/sec, so embedded brokers look like they might be the more capable configuration.
Unfortunately, none of this works for our requirements. To handle our current load on existing hardware we need to support at least 8000 messages/sec, and to allow for future growth, I'd really want 15-20k messages/sec.
-Sean
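For reference, the gap between the measured rate and the stated targets can be worked out from the figures in this message alone. The snippet below is only a rough sketch: the variable names are made up for illustration, and the per-process numbers assume the load was spread evenly across processes, which the message does not state.

```python
# Figures quoted in the message above (messages/sec).
measured = 4500                  # best sustained rate: 3 producers, 2 consumers
producers, consumers = 3, 2
targets = {"current load": 8000, "growth (low)": 15000, "growth (high)": 20000}

# Assumption: load is spread evenly across processes (not stated in the message).
per_producer = measured / producers
per_consumer = measured / consumers


def scale_factor(target, base=measured):
    """Return how many times the measured rate a target rate requires."""
    return target / base


print(f"per producer: {per_producer:.0f}/sec, per consumer: {per_consumer:.0f}/sec")
for name, rate in targets.items():
    print(f"{name}: {rate}/sec is {scale_factor(rate):.2f}x the measured rate")
```

Under those assumptions, even the minimum target is well beyond the best sustained rate, before accounting for the CPU already consumed on the consumer host.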