> On 30 Jul 2024, at 19:01, Andi Kleen <a...@firstfloor.org> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
>> Is that from some kind of rigorous measurement under perf? As you
>> surely know, 0.6% wall-clock time can be from boost clock variation
>> or just run-to-run noise on x86.
> 
> I compared it using hyperfine which does rigorous measurements yes.
> It was well above the run-to-run variability.

FWIW when I was experimenting with these paths I found that an -fsyntax-only 
compilation helps make the changes more pronounced.

Thanks,
Kyrill

> 
> I had some other patches that didn't meet that bar, e.g.
> i've been experimenting with more modern hashes for inchash
> and multiple ggc free lists, but so far no above noise
> results.
> 
>> 
>> I have looked at this code before. When AVX2 is available, so is SSSE3,
>> and then a much more efficient approach is available: instead of comparing
>> against \r \n \\ ? one-by-one, build a vector
>> 
>>  0  1  2  3  4  5  6  7  8  9    a   b    c     d   e   f
>> { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, '\n', 0, '\\', '\r', 0, '?' }
>> 
>> where each character C we're seeking is at position (C % 16). Then
>> you can match against them all at once using PSHUFB:
>> 
>>  t = _mm_shuffle_epi8 (lut, data);
>>  t = t == data;
> 
> I thought the PSHUFB trick only worked for some bit patterns?
> 
> At least according to this paper: https://arxiv.org/pdf/1902.08318
> 
> But yes if it applies here it's a good idea.
> 
> 
>> 
>> As you might recognize this handily beats the fancy SSE4.1 loop as well.
>> I did not pursue this because I did not measure a substantial improvement
>> (we're way into the land of diminishing returns here) and it seemed like
>> maintainers might not like to be distracted with that, but if we are
>> touching this code, might as well use the more efficient algorithm.
>> I'll be happy to propose a patch if people think it's worthwhile.
> 
> Yes makes sense.
> 
> (of course it would be even better to teach the vectorizer about it,
> although this will require fixing some other issues first, see PR116126)
> 
> -Andi

Reply via email to