See comments inline.

Jeff
On Sat, Jun 8, 2013 at 10:40 AM, Emmanuel Lécharny <elecha...@gmail.com> wrote:

> Hi guys,
>
> we have spent a couple of days this week with Julien and Jeff during
> EclipseCon working on MINA 3. We have experimented with some things, ran
> some benchmarks, and studied them. This is a short sum-up of what we did
> and the results we got.
>
> 1) Performances
>
> We have done some tests with MINA 3 and Netty 3 TCP. Basically, we ran
> our benchmark code either locally (the client and the server on one
> machine) or with two machines (the server and the client on separate
> machines). What it shows is that the difference between MINA 3 (M3) and
> Netty 3 (N3) varies with the size of the exchanged messages. M3 is
> slightly faster up to 100 Kb messages, then N3 is faster up to 1 Gb
> messages, then N3 clearly runs into problems.
>
> When we conduct tests with the server on one machine and the client on
> another machine, we are CPU bound. On my machine, we can reach roughly
> 65,000 1 Kb messages per second (either with M3 or N3). There is no
> statistically relevant difference. The CPU is at 90%, with roughly 85%
> system time, which means the CPU is busy processing the sockets; the
> impact of our own code is insignificant. Note that we have measured
> reads, not writes.
>
> 2) Analysis
>
> One of the major differences between M3 and N3 is the buffer usage.
> There are two kinds of buffers: direct and heap. Direct buffers are
> allocated outside the JVM, heap buffers are allocated within the JVM
> memory. It's important to understand that only direct buffers will be
> written to a socket, so at some point we must move the data into a
> direct buffer.
>
> So basically, we would like to push the message into a direct buffer as
> soon as possible, like in the encoder. That means we have to allocate a
> DirectBuffer to do the job. It seems to be a smart idea, at first, but...
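The distinction between the two kinds of buffers is visible directly in the java.nio API; a minimal illustration (the class name is mine):

```java
import java.nio.ByteBuffer;

public class BufferKinds {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[] inside the JVM heap.
        ByteBuffer heap = ByteBuffer.allocate(1024);

        // Direct buffer: memory allocated outside the JVM heap; this is
        // the kind the OS can hand to a socket without an extra copy.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);

        System.out.println(heap.isDirect());   // false
        System.out.println(direct.isDirect()); // true

        // A heap buffer exposes its backing array; a direct buffer does not.
        System.out.println(heap.hasArray());   // true
        System.out.println(direct.hasArray()); // false
    }
}
```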
> There is a bug in the JVM: http://bugs.sun.com/view_bug.do?bug_id=4469299
>
> It says: "In some cases, particularly with applications with large heaps
> and light to moderate loads where collections happen infrequently, the
> Java process can consume memory to the point of process address space
> exhaustion." Bottom line, as soon as you have heavy allocations, you
> might get an OOM, even with direct buffers.
>
> One more problem is that there is a physical limit on the size you can
> allocate, defined by a parameter: -XX:MaxDirectMemorySize=<size>. It
> defaults to 64 MB in Java 6, or to the size you have set with the -Xmx
> parameter. You can't get any farther. All in all, it's pretty much the
> same thing as for heap buffers. Assuming that allocating a direct buffer
> is twice as expensive as a heap buffer (again, it depends on the Java
> version you are using), it's quite important not to allocate too many
> direct buffers.
>
> In order to work around the JVM bug, Sun suggests three possibilities:
> 1) Insert occasional explicit System.gc() invocations.
> 2) Reduce the size of the young generation to force more frequent GCs.
> 3) Explicitly pool direct buffers at the application level.
>
> N3 has implemented the third approach, which is expensive, and creates a
> problem as soon as you send big messages, thus leading to the bad
> performance we have in this case in M3.
> We have a possible different approach: never allocate a direct buffer,
> always use a heap buffer. This leads to a 3% performance penalty, but it
> eliminates the problem.
>
> Calling the GC is simply not an option.
>
> 3) Write performances
>
> Writing data into a socket is tricky: we never know in advance how many
> bytes we will be able to write, and the data must be injected into a
> direct buffer before it can be written into the socket.
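Workaround (3) - application-level pooling, the approach N3 took - can be sketched in a few lines. This is a deliberately naive illustration (the class name and API are mine, not Netty's actual pool):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Naive direct-buffer pool: allocate rarely, reuse aggressively, so the
// JVM bug above (direct memory exhaustion between infrequent GCs) is
// sidestepped by keeping the number of live direct buffers bounded.
public class DirectBufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<ByteBuffer>();
    private final int bufferSize;

    public DirectBufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    /** Returns a pooled buffer, allocating a new one only when the pool is empty. */
    public synchronized ByteBuffer acquire() {
        ByteBuffer buffer = free.poll();
        return (buffer != null) ? buffer : ByteBuffer.allocateDirect(bufferSize);
    }

    /** Puts a buffer back; the caller must no longer touch it afterwards. */
    public synchronized void release(ByteBuffer buffer) {
        buffer.clear();
        free.push(buffer);
    }
}
```

The cost the mail alludes to is visible even in this sketch: every acquire/release goes through a synchronized pool, and big messages need either oversized pooled buffers or many round trips through small ones.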
> There are a few possible strategies:
>
> 1) write the heap buffer into the channel
> 2) write a direct buffer into the channel
> 3) get a chunk of the heap buffer, copy it into a direct buffer, and
> write it into the channel.
>
> In case (1), we delegate the copy of the buffer to the channel. If the
> heap buffer is huge, we might copy it many times, as
> channel.write(buffer) only returns the number of bytes written.
> Hopefully, channel.write() will not copy the whole heap buffer into a
> huge direct buffer, but we have no way to control what it does.
>
> In case (2), we allocate a huge direct buffer and put everything into
> it. This has the advantage of being done only once, and we don't have
> to take care of what's going on in the write() method. But the main
> issue is that we will potentially hit the JVM bug.
>
> In case (3), we can have an approach that tries to deal with both
> issues: we allocate a direct buffer associated with each thread - so
> only a few will be allocated - and we copy a maximum number of bytes
> determined by the socket sendBufferSize (roughly 64 Kb). We then copy
> the data from the heap buffer to the direct buffer at each round, and
> if everything goes well, we will do the minimal number of copies.
> However, we may perfectly well have to copy the data many times, as the
> direct buffer might be shared with many other sessions.
>
> All in all, there is no perfect strategy here. We can improve the third
> strategy with an adaptive copy: as we know how many bytes were written,
> we can limit the number of bytes we copy into the direct buffer to the
> size the socket was able to send in the last few rounds.
>
> The important thing to remember is that we *have* to keep the buffer to
> send in a stack until it has been fully written, which may lead to
> problems when the clients are slow readers and the server has many
> clients to serve.
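Strategy (3) can be sketched as follows, assuming one direct staging buffer per thread sized to the socket send buffer. The names, the 64 Kb constant, and the rewind logic are illustrative, not MINA's actual code; the method takes a WritableByteChannel so the same logic applies to any channel, not only sockets:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Sketch of strategy (3): a per-thread direct buffer used as a staging
// area, so only a handful of direct buffers are ever allocated.
public class ChunkedWriter {
    private static final int SEND_BUFFER_SIZE = 64 * 1024;

    private static final ThreadLocal<ByteBuffer> STAGING =
        new ThreadLocal<ByteBuffer>() {
            @Override protected ByteBuffer initialValue() {
                return ByteBuffer.allocateDirect(SEND_BUFFER_SIZE);
            }
        };

    /**
     * Copies at most one chunk of the heap buffer into the thread-local
     * direct buffer and writes it to the channel. Returns the number of
     * bytes the channel actually accepted; unwritten bytes stay in
     * heapData for the next round.
     */
    public static int writeChunk(WritableByteChannel channel, ByteBuffer heapData)
            throws IOException {
        ByteBuffer direct = STAGING.get();
        direct.clear();

        // Copy at most SEND_BUFFER_SIZE bytes from the heap buffer.
        int chunk = Math.min(heapData.remaining(), direct.remaining());
        int savedLimit = heapData.limit();
        heapData.limit(heapData.position() + chunk);
        direct.put(heapData);
        heapData.limit(savedLimit);

        direct.flip();
        int written = channel.write(direct);

        // If the socket took fewer bytes than we copied, rewind the heap
        // buffer so the unwritten bytes are retried on the next round.
        heapData.position(heapData.position() - (chunk - written));
        return written;
    }
}
```

The "adaptive copy" refinement mentioned above would simply replace the fixed SEND_BUFFER_SIZE cap on `chunk` with a running estimate of what the socket accepted in the last few rounds.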
> 4) Selectors
>
> There is no measurable difference on the server whether we use one
> single selector or many. It seems that most of the time is consumed in
> the select() method, no matter what. The original design, where we
> created many selectors (as many as we have processors, plus one), seems
> to be based on some urban legend, or at least on Java 4. We have to
> reassess this design.

Worth trying a multi-thousand-connection test for this. I still believe
having several outstanding select() calls may lead to better parallelism
in that case.

> 5) Conclusion
>
> We have more tests to conduct. This is not simple: it all depends on
> the JVM we are running the server on, and many of the aspects may be
> configured.
>
> The next steps would be to conduct tests with the various scenarios, on
> different JVMs, with different message sizes. We may need to design a
> pluggable system for handling the reads and the writes; we can use a
> factory for that.
>
> Bottom line, we would also like to compare a NIO-based server with a
> BIO-based server. I'm not sure that we have a big performance penalty
> with Java 7.
>
> Java 7 is way better than Java 6 in the way it handles buffers too.
> There is no reason to use Java 6 these days, it's dead anyway. It would
> be interesting to benchmark Java 8 to see what it brings.
>
> Thanks!
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com

--
Jeff MAURY

"Legacy code" often differs from its suggested alternative by actually
working and scaling. - Bjarne Stroustrup
http://www.jeffmaury.com
http://riadiscuss.jeffmaury.com
http://www.twitter.com/jeffmaury