https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83661

--- Comment #9 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Christophe Monat from comment #4)
> Hi Pratamesh,
> 
> You're absolutely right - maybe it's more efficient when there is some
> hardware sincos available (Intel FSINCOS?), but I would also check the
> actual performance carefully.

CISC hardware math instructions are laughably slow; there is never a reason to
consider them (https://www.sourceware.org/ml/libc-alpha/2019-03/msg00559.html).

> Indeed, it looks to me that either you have to use two different polynomials,
> or shift one argument and use either sin or cos - but either way, twice.

The gain comes from sharing most of the code: you do evaluate 2 polynomials,
but that's only a few extra FMAs (which will parallelize perfectly even on a
single-issue in-order core, since each polynomial is latency-bound).

> We studied that in a slightly different context with Claude-Pierre Jeannerod
> from ENS Lyon and our PhD student Jingyan Lu-Jourdan a while ago: "Simultaneous
> floating-point sine and cosine for VLIW integer processors", available here:
> https://hal.archives-ouvertes.fr/hal-00672327. We were able to gain
> significant performance by exploiting the low-level parallelism of the
> processor. Agreed, this is not a full IEEE implementation, but the important
> ideas are there.

Interesting paper!