https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83661
--- Comment #9 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Christophe Monat from comment #4)
> Hi Pratamesh,
>
> You're absolutely right - maybe it's more efficient when there is some
> hardware sincos available (Intel FSINCOS ?) but I would check also carefully
> the actual performance.

CISC hardware math instructions are laughably slow; there is never a reason to
consider them (https://www.sourceware.org/ml/libc-alpha/2019-03/msg00559.html).

> Indeed, it looks to me that either you have to use two different polynomials
> or shift one argument and use either sin or cos, but anyway twice.

The gain is due to sharing most of the code - you evaluate 2 polynomials, but
that's only a few extra FMAs (which even a single-issue in-order core will
parallelize perfectly, given that each polynomial is latency bound).

> We studied that in a slightly different context with Claude-Pierre Jeannerod
> from ENS Lyon and our PhD Jingyan Lu-Jourdan a while ago : "Simultaneous
> floating-point sine and cosine for VLIW integer processors" available here:
> https://hal.archives-ouvertes.fr/hal-00672327 and we were able to gain
> significant performance by exploiting the low-level parallelism of the
> processor. Agreed, this is not a full IEEE implementation but the important
> ideas are there.

Interesting paper!