https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63185
--- Comment #6 from Marc Glisse <glisse at gcc dot gnu.org> ---

In addition to the issues already described, it seems that we generate better code if I replace the VLAs with calls to alloca. Indeed, we assume that alloca returns 16-byte-aligned memory, whereas for a VLA we emit __builtin_alloca_with_align(..., 64) (the alignment argument is in bits, so this only promises 8 bytes), and we don't seem to have code to turn it into __builtin_alloca_with_align(..., 128), which would let us avoid all the loop adjustment code.
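
To make the comparison concrete, here is a minimal sketch of the two variants. It is not the testcase from this PR: the function names, the loop body and the size bound are made up for illustration, and <alloca.h> assumes a glibc-style target. The two functions compute the same value and differ only in how the scratch buffer is obtained, which is the part that affects the alignment GCC may assume for it.

```c
#include <alloca.h>  /* declares alloca() on glibc; an assumption about the target */
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: scratch buffer declared as a VLA.
   Per the comment above, GCC allocates this through
   __builtin_alloca_with_align with an alignment of 64 bits. */
double sum_squares_vla(size_t n)
{
    assert(n > 0 && n <= 1024);  /* keep the stack allocation small */
    double buf[n];
    for (size_t i = 0; i < n; i++)
        buf[i] = (double)i * (double)i;
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += buf[i];
    return s;
}

/* Same computation, but the buffer comes from an explicit alloca call,
   whose result GCC is said above to treat as 16-byte aligned. */
double sum_squares_alloca(size_t n)
{
    assert(n > 0 && n <= 1024);
    double *buf = alloca(n * sizeof *buf);
    for (size_t i = 0; i < n; i++)
        buf[i] = (double)i * (double)i;
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += buf[i];
    return s;
}
```

Comparing the output of something like gcc -O3 -S for the two functions should show whether the vectorized loop in the VLA version carries extra alignment handling; the exact result will depend on the GCC version and target.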