Thank you Albin, currently we don't have any mulhi function, so we didn't decide a possible interface for it. There are some comments in the code where a full mul is used, but some "mulhi" function could be enough. Many of them seem to use unbalanced operands, and might require the exact result or more limbs than the higher half...
By the way, I like the idea of computing one more limb, that way one may know whether the limbs obtained by mulhi are exact or not. Ĝis, mb 6 feb 2024, 23:27 da albin.ahlb...@gmail.com: > Hello, > > I just wrote an implementation for mpn_mulhigh_basecase for Broadwell-type > processors (that is, x86_64 with BMI2 and ADX instructions) based on > Torbjörn's `mpn_mullo_basecase'. > > It is currently declared on the form > > mp_limb_t flint_mpn_mulhigh_basecase(mp_ptr rp, mp_srcptr xp, mp_srcptr yp, > mp_size_t n), > > as it was developed for FLINT (Fast Library for Number Theory). Note that > `rp' cannot be aliased with `xp' or `yp'. > > It will compute an approximation of the upper most `n + 1' limbs of `{xp, n}' > times `{yp, n}', where the upper most `n' limbs are pushed to `rp[0]', ..., > `rp[n - 1]' and the least significant computed limb is returned (via %rax). > This returned limb should have an error of something along `n' ULPs. > > Note that this differs from MPFR's (private) function `mpfr_mulhigh_n', which > computes the approximation of the upper most `n' limbs into `rp[n]', ..., > `rp[2 * n - 1]', where `rp[n]' has an error of something along `n' ULPs at > most. > > Feel free to change it according to your needs (perhaps you do not want to > compute `n + 1' limbs, but rather `n' limbs). > > If this code will be used in GMP, feel free to remove the copyright claim for > FLINT and put my name (spelled Albin Ahlbäck) in the GMP copyright claim > instead. > > Just some notes: > > - We use our own M4 syntax for the beginning and ending of the function, but > it should be easy to translate to GMP's syntax. > - It currently only works for n > 5 (I believe) as we in FLINT have > specialized routines for small n. > - It would be nice to avoid pushing five register, and only push four. > - Reduce the size of the `L(end)' code, and try to avoid multiple jumps > therein. > - Move the code-snippet of `L(f2)' to just above `L(b2)', so that no jump is > needed in between. (This currently does not work because `L(end)' as well as > this code-snippet is too large for a relative 8-bit jump.) > - Start out with an mul_1 sequence with just a mulx+add+adc chain, just like > in `mpn_mullo_basecase'. > - Remove the first multiplication in each `L(fX)' and put it in `L(end)' > instead. > - The `adcx' instruction in `L(fX)' can be removed (then one needs to adjust > the `L(bX)'-label), but I found it to be slower. Can we remove it and somehow > maintain the same performance? > > Best, > Albin > _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel