On 10/19/2015 10:20 AM, Matan Barak wrote: > On 10/19/2015 3:23 PM, Doug Ledford wrote:
>> This is a major cause of the slowness. Unless you have a specific need >> of them, per-entry rwlocks are *NOT* a good idea. I was going to bring >> this up separately, so I'll just mention it here. Per entry locks help >> reduce contention when you have lots of multiple accessors. Using >> rwlocks help reduce contention when you have a read-mostly entry that is >> only occasionally changed. But every lock and every unlock (just like >> every atomic access) still requires a locked memory cycle. That means >> every lock acquire and every lock release requires a synchronization >> event between all CPUs. Using per-entry locks means that every entry >> triggers two of those synchronization events. On modern Intel CPUs, >> they cost about 32 cycles per event. If you are going to do something, >> and it can't be done without a lock, then grab a single lock and do it >> as fast as you can. Only in rare cases would you want per-entry locks. >> > > I agree that every rwlock costs us locked access. However, lets look at > the common scenario here. No, let's not. Especially if your "common scenario" causes you to ignore pathological bad behavior. Optimizing for the common case is not an excuse to enable pathological behavior. > I think that in production stable systems, the > IPs our rdma-cm stack uses (and thus GIDs) should be pretty stable. Probably true. > Thus, most accesses should be read calls. Probably true. > That's why IMHO read-write > access makes sense here. Probably true. > Regarding single-lock vs per entry lock, it really depends on common an > entry could be updated while another entry is being used by an > application. No, it depends on more than that. In addition to the frequency of updates versus lookups, there is also the issue of lookup cost and update cost. Using per entry locks makes both the lookup cost and the update cost of anything other than the first gid in the table grow exponentially. A good way to demonstrate this would be to create 100 vlans on a RoCE device, then run the cmtime test between two hosts using the 100th vlan IP at each end. Using per-entry locks, the performance of this test relative to using the first IP on the device will be pathologically bad. > In a (future) dynamic system, you might want to create > containers dynamically, which will add a net device and change the > hardware GID table while other application (maybe in another container) > uses other GIDs, but this might be rare scenario. This is a non-issue. Creating a container and an application inside that container to use an RDMA interface is already a heavy weight operation (minimum fork/exec/program startup). Table lookups can be needed thousands of times per second under certain conditions and must be fast. Per-entry locks are only fast for the first item in the list. After that, they progressively get slower and slower than usage with a single lock. > Ok, refactoring this code for a single lock shouldn't be problematic. > Regarding performance, I think the results here are vastly impacted by > the rate we'll add/remove IPs or upper net-devices in the background. Not really. You will never add/remove IPs fast enough to justify per-entry locks, especially when you consider that ib_cache_gid_add calls find_gid twice, once going over the entire table if the gid isn't found, before ever calling add_gid. The ocrdma table isn't bad because it's only 16 entries. But the mlx4 table is going to be 128 I suspect, so you can do the test I mentioned above and use 100 vlans. In that scenario, the 101st gid will require that you take and release 128 locks to scan for the entry in the list, it will come back empty, then you will take and release 101 locks until you find an empty entry, and then you will take a write lock as you write the gid. That's 230 total locks, for 460 total locked memory operations at 32 cycles each, so a total of 14,720 cycles spent doing nothing but locking the memory bus. And each of those operations slows down all of the CPUs in the system. Where I've seen this sort of behavior totally wreck a production system is when the GID you need to look up is on port 2. Even if it is the firstmost GID on port 2, if you have to search all of port 1's table first, then by the time you get to port 2's table, you've already lost. So, here's some suggestions: 1) Use a single lock on the table per port. 2) Change find_gid so that it will take an additional element, int *empty_idx, and as it is scanning the gid table, when it runs across an empty entry, if empty_idx != NULL, *empty_idx = idx;. This prevents needing to scan the array twice. 3) Consider optimizing the code by requiring that any time a GID is added, it must take the first empty spot, and any time one is deleted, any valid entries after it must be moved up to fill the empty spot, then optimize find_gid to quit once it finds an empty slot because it knows the rest of the table is empty. Given that mlx4 keeps a huge 128 table array, this can be very helpful. 4) Does the table port struct have to use a mutex? Could it be changed to a rwlock instead? If so, you could changing write_gid so that it assumes it is called with the read lock on the port already held and it merely updates itself to a write lock when it is ready to write the entry to the array and update the card. It would then downgrade back to a read lock on return and the calling function would hold the single read lock from the time it is ready to call find_gid until write_gid has returned. 5) I noticed that ib_cache_gid_find_by_port calls find_gid without even taking the read lock, that needs fixed. -- Doug Ledford <dledf...@redhat.com> GPG KeyID: 0E572FDD
signature.asc
Description: OpenPGP digital signature