On 22.05.19 20:46, Richard Henderson wrote:
> On 5/22/19 2:16 PM, David Hildenbrand wrote:
>> On 22.05.19 17:59, Richard Henderson wrote:
>>> On Wed, 22 May 2019 at 07:16, David Hildenbrand <da...@redhat.com> wrote:
>>>>> Also plausible. I guess it would be good to know, anyway.
>>>>
>>>> I'll dump the parameters when booting Linux. My gut feeling is that the
>>>> cc option is basically never used ...
>>>
>>> It looks like our intuition is wrong about that.
>>
>> Thanks for checking!
>>
>>>
>>> rth@cloudburst:~/glibc/src/sysdeps/s390$ grep -r vfaezbs * | wc -l
>>> 15
>>>
>>> These set cc, use zs, and do not use rt.
>>>
>>> rth@cloudburst:~/glibc/src/sysdeps/s390$ grep -r 'vfaeb' * | wc -l
>>> 3
>>>
>>> These do not set cc, do not use zs, and do use rt.
>>>
>>> Those are the only two VFAE forms used by glibc (note that the same
>>> variants as 'f' are used by the wide-character strings).
>>>
>>
>> I guess "rt" and "cc" make the biggest difference. Maybe special case
>> these two, result in 4 variants for each of the 3 element sizes?
>
> Sounds good.
>

So .... after all it might not be necessary, at least not for this
helper :)

Using your crazy helper functions, I have this right now:

/*
 * Returns the number of bits composing one element.
 */
static uint8_t get_element_bits(uint8_t es)
{
    return (1 << es) * BITS_PER_BYTE;
}

/*
 * Returns the bitmask for a single element.
 */
static uint64_t get_single_element_mask(uint8_t es)
{
    return -1ull >> (64 - get_element_bits(es));
}

/*
 * Returns the bitmask for a single element (excluding the MSB).
 */
static uint64_t get_single_element_lsbs_mask(uint8_t es)
{
    return -1ull >> (65 - get_element_bits(es));
}

/*
 * Returns the bitmasks for multiple elements (excluding the MSBs).
 */
static uint64_t get_element_lsbs_mask(uint8_t es)
{
    return dup_const(es, get_single_element_lsbs_mask(es));
}

static int vfae(void *v1, const void *v2, const void *v3, bool in,
                bool rt, bool zs, uint8_t es)
{
    const uint64_t mask = get_element_lsbs_mask(es);
    const int bits = get_element_bits(es);
    uint64_t a0, a1, b0, b1, e0, e1, t0, t1, z0, z1;
    uint64_t first_zero = 16;
    uint64_t first_equal;
    int i;

    a0 = s390_vec_read_element64(v2, 0);
    a1 = s390_vec_read_element64(v2, 1);
    b0 = s390_vec_read_element64(v3, 0);
    b1 = s390_vec_read_element64(v3, 1);
    e0 = 0;
    e1 = 0;
    /* compare against equality with every other element */
    for (i = 0; i < 64; i += bits) {
        t0 = i ? rol64(b0, i) : b0;
        t1 = i ? rol64(b1, i) : b1;
        e0 |= zero_search(a0 ^ t0, mask);
        e0 |= zero_search(a0 ^ t1, mask);
        e1 |= zero_search(a1 ^ t0, mask);
        e1 |= zero_search(a1 ^ t1, mask);
    }
    /* invert the result if requested - invert only the MSBs */
    if (in) {
        e0 = ~e0 & ~mask;
        e1 = ~e1 & ~mask;
    }
    first_equal = match_index(e0, e1);

    if (zs) {
        z0 = zero_search(a0, mask);
        z1 = zero_search(a1, mask);
        first_zero = match_index(z0, z1);
    }

    if (rt) {
        e0 = (e0 >> (bits - 1)) * get_single_element_mask(es);
        e1 = (e1 >> (bits - 1)) * get_single_element_mask(es);
        s390_vec_write_element64(v1, 0, e0);
        s390_vec_write_element64(v1, 1, e1);
    } else {
        s390_vec_write_element64(v1, 0, MIN(first_equal, first_zero));
        s390_vec_write_element64(v1, 1, 0);
    }

    if (first_zero == 16 && first_equal == 16) {
        return 3; /* no match */
    } else if (first_zero == 16) {
        return 1; /* matching elements, no match for zero */
    } else if (first_equal < first_zero) {
        return 2; /* matching elements before match for zero */
    }
    return 0; /* match for zero */
}

At least the kernel boots with it - am I missing something or does this
indeed work?

Cheers!

-- 

Thanks,

David / dhildenb
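
For readers without the earlier thread at hand: vfae() above relies on dup_const(), zero_search() and match_index(), none of which are defined in this message. Below is a minimal sketch of how such helpers might look. The `_sketch` names and bodies are assumptions for illustration, not the actual QEMU code. It assumes big-endian element numbering (element 0 is the most significant byte of the first doubleword) and a byte index of 16 meaning "no match", which is how vfae() uses the results.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for dup_const(): replicate the low (8 << es)
 * bits of c across all 64 bits.
 */
static uint64_t dup_const_sketch(unsigned es, uint64_t c)
{
    unsigned bits = 8u << es;

    assert(es <= 3);
    if (bits == 64) {
        return c;
    }
    c &= (1ull << bits) - 1;
    for (; bits < 64; bits *= 2) {
        c |= c << bits;
    }
    return c;
}

/*
 * Hypothetical zero_search(): mask holds all bits of each element except
 * its MSB. Adding mask to the low bits carries into the MSB exactly when
 * the low bits are nonzero, so after OR-ing in a and mask, an element is
 * all ones unless it was zero. The inversion leaves only the MSB of each
 * zero element set.
 */
static uint64_t zero_search_sketch(uint64_t a, uint64_t mask)
{
    return ~(((a & mask) + mask) | a | mask);
}

/* Portable count-leading-zeros that returns 64 for an input of 0. */
static unsigned clz64_sketch(uint64_t v)
{
    unsigned n = 0;

    if (v == 0) {
        return 64;
    }
    while (!(v >> 63)) {
        v <<= 1;
        n++;
    }
    return n;
}

/*
 * Hypothetical match_index(): byte index (0..15) of the first element
 * whose MSB is set in c0:c1, or 16 if no MSB is set at all.
 */
static unsigned match_index_sketch(uint64_t c0, uint64_t c1)
{
    return (c0 ? clz64_sketch(c0) : 64 + clz64_sketch(c1)) >> 3;
}
```

With byte elements (es = 0), the mask is 0x7f7f7f7f7f7f7f7f, and searching 0x1122330055667788 for a zero byte leaves only bit 39 set, which match_index_sketch() maps to byte index 3. The addition cannot carry between elements because each masked element is at most 0x7f, so the sum per element stays below 0x100.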