On Tue, 9 Mar 2021 at 17:14, John Naylor <john.nay...@enterprisedb.com> wrote: > On Tue, Mar 9, 2021 at 5:00 AM Amit Khandekar <amitdkhan...@gmail.com> wrote: > > Just a quick question before I move on to review the patch ... The > > improvement looks like it is only meant for x86 platforms. > > Actually it's meant to be faster for all platforms, since the C fallback is > quite a bit different from HEAD. I've found it to be faster on ppc64le. An > earlier version of the patch was a loser on 32-bit Arm because of alignment > issues, but if you could run the test script attached to [1] on 64-bit Arm, > I'd be curious to see how it does on 0002, and whether 0003 and 0004 make > things better or worse. If there is trouble building on non-x86 platforms, > I'd want to fix that also.
On my Arm64 VM : HEAD : mixed | ascii -------+------- 1091 | 628 (1 row) PATCHED : mixed | ascii -------+------- 681 | 119 So the fallback function does show improvements on Arm64. I guess, if at all we use the equivalent Arm NEON intrinsics, the "mixed" figures will be close to the "ascii" figures, going by your figures on x86. > > Can this be > > done in a portable way by arranging for auto-vectorization ? Something > > like commit 88709176236caf. This way it would benefit other platforms > > as well. > > I'm fairly certain that the author of a compiler capable of doing that in > this case would be eligible for some kind of AI prize. :-) :) I was not thinking about auto-vectorizing the code in pg_validate_utf8_sse42(). Rather, I was considering auto-vectorization inside the individual helper functions that you wrote, such as _mm_setr_epi8(), shift_right(), bitwise_and(), prev1(), splat(), saturating_sub() etc. I myself am not sure whether it is feasible to write code that auto-vectorizes all these function definitions. saturating_sub() seems hard, but I could see the gcc docs mentioning support for generating such instructions for a particular code loop. But for the index lookup function() it seems impossible to generate the needed index lookup intrinsics. We can have platform-specific function definitions for such exceptional cases. I am considering this only because that would make the exact code work on other platforms like arm64 and ppc, and won't have to have platform-specific files. But I understand that it is easier said than done. We will have to process the loop in pg_validate_utf8_sse42() in 128-bit chunks, and pass each chunk to individual functions, which could mean extra work and extra copy in extracting the chunk data and passing it around, which may make things drastically slow. You are passing around the chunks using __m128i type, so perhaps it means passing around just a reference to the simd registers. Not sure. -- Thanks, -Amit Khandekar Huawei Technologies