[ https://issues.apache.org/jira/browse/CASSANDRA-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
C. Scott Andreas updated CASSANDRA-7029: ---------------------------------------- Component/s: Streaming and Messaging CQL > Investigate alternative transport protocols for both client and inter-server > communications > ------------------------------------------------------------------------------------------- > > Key: CASSANDRA-7029 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7029 > Project: Cassandra > Issue Type: Improvement > Components: CQL, Streaming and Messaging > Reporter: Benedict > Priority: Major > Labels: performance > Fix For: 4.x > > > There are a number of reasons to think we can do better than TCP for our > communications: > 1) We can actually tolerate sporadic small message losses, so guaranteed > delivery isn't essential (although for larger messages it probably is) > 2) As shown in \[1\] and \[2\], Linux can behave quite suboptimally with > regard to TCP message delivery when the system is under load. Judging from > the theoretical description, this is likely to apply even when the > system-load is not high, but the number of processes to schedule is high. > Cassandra generally has a lot of threads to schedule, so this is quite > pertinent for us. UDP performs substantially better here. > 3) Even when the system is not under load, UDP has a lower CPU burden, and > that burden is constant regardless of the number of connections it processes. > 4) On a simple benchmark on my local PC, using non-blocking IO for UDP and > busy spinning on IO I can actually push 20-40% more throughput through > loopback (where TCP should be optimal, as no latency), even for very small > messages. Since we can see networking taking multiple CPUs' worth of time > during a stress test, using a busy-spin for ~100micros after last message > receipt is almost certainly acceptable, especially as we can (ultimately) > process inter-server and client communications on the same thread/socket in > this model. > 5) We can optimise the threading model heavily: since we generally process > very small messages (200 bytes not at all implausible), the thread signalling > costs on the processing thread can actually dramatically impede throughput. > In general it costs ~10micros to signal (and passing the message to another > thread for processing in the current model requires signalling). For 200-byte > messages this caps our throughput at 20MB/s. > I propose to knock up a highly naive UDP-based connection protocol with > super-trivial congestion control over the course of a few days, with the only > initial goal being maximum possible performance (not fairness, reliability, > or anything else), and trial it in Netty (possibly making some changes to > Netty to mitigate thread signalling costs). The reason for knocking up our > own here is to get a ceiling on what the absolute limit of potential for this > approach is. Assuming this pans out with performance gains in C* proper, we > then look to contributing to/forking the udt-java project and see how easy it > is to bring performance in line with what we can get with our naive approach > (I don't suggest starting here, as the project is using blocking old-IO, and > modifying it with latency in mind may be challenging, and we won't know for > sure what the best case scenario is). > \[1\] > http://test-docdb.fnal.gov/0016/001648/002/Potential%20Performance%20Bottleneck%20in%20Linux%20TCP.PDF > \[2\] > http://cd-docdb.fnal.gov/cgi-bin/RetrieveFile?docid=1968;filename=Performance%20Analysis%20of%20Linux%20Networking%20-%20Packet%20Receiving%20(Official).pdf;version=2 > Further related reading: > http://public.dhe.ibm.com/software/commerce/doc/mft/cdunix/41/UDTWhitepaper.pdf > https://mospace.umsystem.edu/xmlui/bitstream/handle/10355/14482/ChoiUndPerTcp.pdf?sequence=1 > https://access.redhat.com/site/documentation/en-US/JBoss_Enterprise_Web_Platform/5/html/Administration_And_Configuration_Guide/jgroups-perf-udpbuffer.html > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.153.3762&rep=rep1&type=pdf -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org