On Mon, Apr 29, 2019 at 7:52 AM Jan Glauber <jglau...@marvell.com> wrote:
>
> It turned out the issue we have on ThunderX2 is the file open-close
> sequence with small read sizes. If the used files are opened read-only,
> the lockref code (enabled by ARCH_USE_CMPXCHG_LOCKREF) is used.
>
> The lockref CMPXCHG_LOOP uses an unbounded (as long as the associated
> spinlock isn't taken) while loop to change the lock count. This behaves
> badly under heavy contention.
Ok, excuse me when I rant a bit. Since you're at Marvell, maybe you can
forward this rant to the proper guilty parties?

Who was the absolute *GENIUS* who went

 Step 1: "Oh, we have a middling CPU that isn't world-class on its own"

 Step 2: "BUT! We can put a lot of them on a die, because that's 'easy'"

 Step 3: "But let's make sure the interconnect isn't all that special,
         because that would negate the whole 'easy' part, and really
         strong interconnects are even harder than CPUs and use even
         more power, so that wouldn't work"

 Step 4: "I wonder why this thing scales badly?"

Seriously. Why are you guys doing this? Has nobody ever looked at the
fundamental thought process above and gone "Hmm"?

If you try to compensate for a not-great core by putting twice the
number of them in a system, you need a cache system and interconnect
between them that is more than twice as good as the competition. And
honestly, from everything that I hear, you don't have it. The whole
chip is designed for "throughput when there is no contention". Is it
really a huge surprise that it then falls flat on its face when there's
something fancy going on?

So now you want to penalize everybody else in the ARM community because
you have a badly balanced system?

Ok, rant over.

The good news is that we can easily fix _this_ particular case by just
limiting the CMPXCHG_LOOP to a maximum number of retries, since the
loop is already designed to fail quickly if the spin lock value isn't
unlocked, and all the lockref code is already organized to fall back to
spinlocks.

So the attached three-liner patch may just work for you. Once _one_
thread hits the maximum retry case and goes into the spinlocked case,
everybody else will also fall back to spinlocks because they now see
that the lockref is contended. So the "retry" value probably isn't all
that important, but let's make it big enough that it probably never
happens on a well-balanced system.
But seriously: the whole "let's just do lots of CPU cores because it's
easy" needs to stop. It's fine if you have a network processor and
you're doing independent things, but it's not a GP processor approach.

Your hardware people need to improve on your CPU core (maybe the server
version of Cortex-A76 is starting to approach being good enough?) and
your interconnect (seriously!) instead of just slapping 32 cores on a
die and calling it a day.

           Linus "not a fan of the flock of chickens" Torvalds
 lib/lockref.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/lockref.c b/lib/lockref.c
index 3d468b53d4c9..a6762f8f45c9 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -9,6 +9,7 @@
  * failure case.
  */
 #define CMPXCHG_LOOP(CODE, SUCCESS) do {				\
+	int retry = 15; /* Guaranteed random number */			\
 	struct lockref old;						\
 	BUILD_BUG_ON(sizeof(old) != 8);					\
 	old.lock_count = READ_ONCE(lockref->lock_count);		\
@@ -21,6 +22,8 @@
 		if (likely(old.lock_count == prev.lock_count)) {	\
 			SUCCESS;					\
 		}							\
+		if (!--retry)						\
+			break;						\
 		cpu_relax();						\
 	}								\
 } while (0)