[sorry if this forum is the wrong place to take this up]

Grant Grundler <[EMAIL PROTECTED]> wrote:
Grant> [ I've probably posted some of these results before...here's another
Grant> take on this problem. ]

Hopefully not rehashing too much old information.

Grant> I expect splitting the RX/TX completions would achieve something
Grant> similar since we are just "slicing" the same problem from a different
Grant> angle. Apps typically do both RX and TX and will be running on one
Grant> CPU. So on one path they will be missing cachelines.

However, the event handler(s) handling the RX/TX completions are not
guaranteed to run on the same CPU as the application unless you have the
scheduler enforce some kind of affinity between the application and the
event handler for the completion queue. In addition, if an application has
multiple sockets, then the event handlers are all over the place, because
each socket has its own completion queue. Or does one event handler handle
all completion queues?

Grant> Anyway, my take is IPoIB perf isn't as critical as SDP and RDMA perf.
Grant> If folks really care about perf, they have to migrate away from
Grant> IPoIB to either SDP or directly use RDMA (uDAPL or something).
Grant> Splitting RX/TX completions might help initial adoption, but
Grant> aren't where the big wins in perf are.

My take is, good enough is not good enough. If the cost to move from IP to
SDP or RDMA is too great, then applications (particularly in the commercial
sector) will not convert. Hence, if IPoIB is too slow, they will go
Ethernet. Currently we only get 40% of the link bandwidth, compared to 85%
for 10 GigE. (Yes, I know the cost differences favor IB.)

However, two things hurt user-level protocols. The first is scaling and
memory requirements. Looking at parallel file systems on large clusters,
SDP ended up consuming so much memory it couldn't be used. With N-by-N
socket connections per node, the buffer space and QP memory SDP required
got out of control (a back-of-the-envelope sketch of this scaling follows
below). There is something to be said for sharing buffer and QP space
across lots of sockets.

The other issue is flow control across hundreds of autonomous sockets. In
TCP/IP, traffic can be managed so that there is some fairness
(multiplexing, QoS, etc.) across all active sockets. For user-level
protocols like SDP and uDAPL, you can't manage traffic across multiple
autonomous user application connections, because there is nowhere to see
all of them at the same time for management. This can lead to overrunning
adapters or timeouts back to the applications. This tends to be a
large-system problem when you have lots of CPUs.

SDP and uDAPL have some good ideas, but they have a way to go for anything
except HPC and workloads that are not expected to scale to large
configurations. For HPC you can use MPI for application message passing,
but for the rest of the cluster traffic you need a good-performing IP
implementation for now. With time things can improve. There is also
IPoIB-CM for much lower IPoIB overhead.

Grant> Pinning netperf/netserver to a different CPU caused SDP perf
Grant> to drop from 5.5 Gb/s to 5.4 Gb/s. Service Demand went from
Grant> around 0.55 usec/KB to 0.56 usec/KB. ie a much smaller impact
Grant> on cacheline misses.

I agree cacheline misses are something that has to be watched carefully.
For some platforms we need better binding or affinity tools in Linux to
solve some of the current problems. This is a bigger long-term issue.
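To make the affinity point concrete, here is a rough user-space sketch
(my illustration only, not anything in the current stack) of pinning a
thread to one CPU with Linux's pthread_setaffinity_np(). The idea would be
to bind the application thread and the CQ event-handler thread to the same
CPU so completions stay cacheline-local; the CQ handler thread itself is
hypothetical here and elided:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Restrict 'thread' to run only on 'cpu_id'. */
static int pin_thread_to_cpu(pthread_t thread, int cpu_id)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu_id, &set);
        return pthread_setaffinity_np(thread, sizeof(set), &set);
}

int main(void)
{
        /* Bind this (application) thread to CPU 2; a CQ event-handler
         * thread would be bound to the same CPU the same way. */
        int err = pin_thread_to_cpu(pthread_self(), 2);

        if (err != 0)
                fprintf(stderr, "setaffinity failed: %d\n", err);
        return err;
}

(Compile with gcc -pthread.) Of course this only helps if the tools let you
discover and hold that placement, which is exactly the gap I mean above.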
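Going back to the N-by-N memory point above, this is the kind of
back-of-the-envelope arithmetic I mean. The node count and per-connection
buffer/QP footprint below are assumptions picked for illustration, not
measurements:

#include <stdio.h>

int main(void)
{
        long nodes          = 512;          /* assumed cluster size        */
        long socks_per_node = nodes - 1;    /* one connection to each peer */
        long bytes_per_sock = 256 * 1024;   /* assumed buffers + QP state  */

        long per_node  = socks_per_node * bytes_per_sock;
        long aggregate = nodes * per_node;

        /* 511 * 256 KB is roughly 127 MB per node just for dedicated
         * socket buffers and QPs; cluster-wide that is about 63 GB. */
        printf("per-node SDP memory:  %ld MB\n", per_node >> 20);
        printf("cluster-wide memory:  %ld GB\n", aggregate >> 30);
        return 0;
}

Even with conservative per-socket numbers, dedicated per-connection
resources grow linearly with cluster size per node and quadratically
cluster-wide, which is why sharing buffer and QP space across sockets
looks attractive.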
Grant> Keeping traffic local to the CPU that's taking the interrupt
Grant> keeps the cachelines local. I don't want to discourage anyone
Grant> from their pet projects. But the conclusion I drew from the
Grant> above data is IPoIB is a good compatibility story but cacheline
Grant> misses are going to make it hard to improve perf regardless
Grant> of how we divide the workload. IPoIB + TCP/IP code path just has
Grant> a big foot print.

The footprint of IPoIB + TCP/IP is large, as on any system. However, as
you get to higher CPU counts, the issue becomes less of a problem, since
more unused CPU cycles are available. On the other hand, the affinity
(CPU and memory) and cacheline-miss issues get greater.

Bernie King-Smith
IBM Corporation
Server Group Cluster System Performance
[EMAIL PROTECTED]
(845)433-8483 Tie. 293-8483 or wombat2 on NOTES

"We are not responsible for the world we are born into, only for the world
we leave when we die. So we have to accept what has gone before us and work
to change the only thing we can, -- The Future." William Shatner