On Wed, Feb 12, 2020 at 1:37 PM Devulapalli, Raghuveer <raghuveer.devulapa...@intel.com> wrote:
>> I hope there will not be a demand to use many non-universal intrinsics
>> in ufuncs, we will need to work this out on a case-by-case basis in each
>> ufunc. Does that sound reasonable? Are there intrinsics you have already
>> used that have no parallel on other platforms?
>
> I think that is reasonable. It's hard to anticipate the future need and
> benefit of specialized intrinsics, but I tried to make a list of some of
> the specialized intrinsics currently in use in NumPy that I don't believe
> exist on other platforms (most of these actually don't exist on AVX2
> either). I am not an expert in the ARM or VSX architectures, so please
> correct me if I am wrong.
>
> a. _mm512_mask_i32gather_ps
> b. _mm512_mask_i32scatter_ps/_mm512_mask_i32scatter_pd
> c. _mm512_maskz_loadu_pd/_mm512_maskz_loadu_ps
> d. _mm512_getexp_ps
> e. _mm512_getmant_ps
> f. _mm512_scalef_ps
> g. _mm512_permutex2var_ps, _mm512_permutex2var_pd
> h. _mm512_maskz_div_ps, _mm512_maskz_div_pd
> i. _mm512_permute_ps/_mm512_permute_pd
> j. _mm512_sqrt_ps/pd (I could be wrong on this one, but from the little
>    Google search I did, it seems like the POWER ISA doesn't have a
>    vectorized sqrt instruction)
>
> Software implementations of these instructions are definitely possible,
> but some of them are not trivial to implement and are surely not going to
> be one-line macros either. I am also unsure what implications this has for
> performance, but we will hopefully find out once we convert these to
> universal intrinsics and then benchmark.

For these it seems like we don't want software implementations of the
universal intrinsics - if there's no equivalent on PPC/ARM and there's
enough value (performance gain given the additional code complexity) in
the additional AVX instructions, then we should still simply use AVX
instructions directly.
Ralf


> Raghuveer
>
> -----Original Message-----
> From: NumPy-Discussion <numpy-discussion-bounces+raghuveer.devulapalli=
> intel....@python.org> On Behalf Of Matti Picus
> Sent: Tuesday, February 11, 2020 11:19 PM
> To: numpy-discussion@python.org
> Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
>
> On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote:
>>
>> On top of that the performance implications aren't clear. Software
>> implementations of hardware instructions might perform worse and might
>> not even produce the same result.
>
> The proposal for universal intrinsics does not enable replacing an
> intrinsic on one platform with a software emulation on another: the
> intrinsics are meant to be compile-time defines that overlay the
> universal intrinsic with a platform-specific one. In order to use a new
> intrinsic, it must have parallel intrinsics on the other platforms, or it
> cannot be used there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return
> false, so the compiler will not even build a loop for that platform. I
> will try to clarify that intention in the NEP.
>
> I hope there will not be a demand to use many non-universal intrinsics in
> ufuncs; we will need to work this out on a case-by-case basis in each
> ufunc. Does that sound reasonable? Are there intrinsics you have already
> used that have no parallel on other platforms?
>
> Matti
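[Editor's note: the compile-time overlay Matti describes can be sketched roughly as below. The macro names here (UNIV_HAVE, univ_muladd_f32) are hypothetical stand-ins for the NEP's NPY_CPU_HAVE(FEATURE_NAME) and the universal intrinsics, not NumPy's actual headers.]

```c
/*
 * Illustrative sketch of the overlay mechanism: each universal
 * intrinsic is a compile-time define that expands to the native
 * platform intrinsic.  If a platform has no parallel intrinsic, the
 * feature test is a compile-time 0 and the guarded loop is never
 * built.  All names are hypothetical, not NumPy's real API.
 */
#if defined(__AVX2__)
    #include <immintrin.h>
    #define UNIV_HAVE_FMA3 1
    typedef __m256 univ_f32;
    #define univ_muladd_f32(a, b, c) _mm256_fmadd_ps(a, b, c)
#elif defined(__ARM_NEON)
    #include <arm_neon.h>
    #define UNIV_HAVE_FMA3 1
    typedef float32x4_t univ_f32;
    #define univ_muladd_f32(a, b, c) vfmaq_f32(c, a, b)
#else
    /* No parallel intrinsic on this platform: the loop below
     * is not compiled at all, mirroring NPY_CPU_HAVE() == false. */
    #define UNIV_HAVE_FMA3 0
#endif

#define UNIV_HAVE(FEATURE) UNIV_HAVE_##FEATURE

#if UNIV_HAVE(FMA3)
/* A ufunc inner loop like this only exists on platforms where every
 * universal intrinsic it uses has a native counterpart. */
static void muladd_loop(float *dst, const float *a, const float *b, int n)
{
    (void)dst; (void)a; (void)b; (void)n;
    /* ... apply univ_muladd_f32 over vector-width chunks ... */
}
#endif

/* Runtime-visible result of the compile-time decision. */
int built_fma3_loop(void) { return UNIV_HAVE(FMA3); }
```

Note the key property: this is source-level overlaying, not runtime emulation, so a platform either gets its own native instruction or no SIMD loop at all.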
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
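[Editor's note: as a concrete illustration of what a software fallback for one of the intrinsics Raghuveer lists would involve, here is a minimal scalar sketch of the semantics of _mm512_maskz_loadu_ps, the zero-masked 16-lane load. This is a hypothetical emulation, not NumPy code; the hardware instruction does all of this in a single masked vector load, which is where the performance concern comes from.]

```c
/* Scalar emulation of _mm512_maskz_loadu_ps semantics: for a 16-lane
 * float vector, lanes whose mask bit is set are loaded from src and
 * the remaining lanes are zero-filled.  Lanes with a clear mask bit
 * are never read, so src only needs to be valid where the mask is
 * set.  Hypothetical fallback for illustration, not NumPy code. */
static void maskz_loadu_ps_emul(float dst[16], unsigned short mask,
                                const float *src)
{
    for (int i = 0; i < 16; i++) {
        dst[i] = ((mask >> i) & 1) ? src[i] : 0.0f;
    }
}
```

For example, with mask = 0x0007 only the first three lanes are read from src; this masked-load pattern is what lets an AVX-512 loop handle the tail of an array without reading past its end.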