[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #21 from CVS Commits --- The master branch has been updated by hongtao Liu : https://gcc.gnu.org/g:a90f558bbb87c0b5d2b1e07d55bd585b2285cf3d commit r14-2114-ga90f558bbb87c0b5d2b1e07d55bd585b2285cf3d Author: liuhongt Date: Mon Jun 26 13:59:29 2023 +0800 Don't issue vzeroupper for vzeroupper call_insn. gcc/ChangeLog: PR target/82735 * config/i386/i386.cc (ix86_avx_u127_mode_needed): Don't emit vzeroupper for vzeroupper call_insn. gcc/testsuite/ChangeLog: * gcc.target/i386/avx-vzeroupper-30.c: New test.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #20 from CVS Commits --- The master branch has been updated by H.J. Lu : https://gcc.gnu.org/g:e89759fdfc80db223bd852aba937acb2d7c2cd80 commit r12-1265-ge89759fdfc80db223bd852aba937acb2d7c2cd80 Author: H.J. Lu Date: Mon Jun 7 11:43:25 2021 -0700 x86: Don't compile pr82735-[345].c for x32 Since -mabi=ms isn't compatible with x32, skip pr82735-[345].c for x32. PR target/82735 * gcc.target/i386/pr82735-3.c: Don't compile for x32. * gcc.target/i386/pr82735-4.c: Likewise. * gcc.target/i386/pr82735-5.c: Likewise.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #19 from Hongtao.liu --- Fixed in GCC12.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #18 from CVS Commits --- The master branch has been updated by hongtao Liu : https://gcc.gnu.org/g:9a90b311f22956addaf4f5f9bdb3592afd45083f commit r12-1253-g9a90b311f22956addaf4f5f9bdb3592afd45083f Author: liuhongt Date: Tue Jun 1 09:09:44 2021 +0800 Fix _mm256_zeroupper by representing the instructions as call_insns in which the call has a special vzeroupper ABI. When __builtin_ia32_vzeroupper is called explicitly, the corresponding vzeroupper pattern does not carry any CLOBBERS or SETs before LRA, which leads to incorrect optimization in pass_reload. In order to solve this problem, this patch refine instructions as call_insns in which the call has a special vzeroupper ABI. gcc/ChangeLog: PR target/82735 * config/i386/i386-expand.c (ix86_expand_builtin): Remove assignment of cfun->machine->has_explicit_vzeroupper. * config/i386/i386-features.c (ix86_add_reg_usage_to_vzerouppers): Delete. (ix86_add_reg_usage_to_vzeroupper): Ditto. (rest_of_handle_insert_vzeroupper): Remove ix86_add_reg_usage_to_vzerouppers, add df_analyze at the end of the function. (gate): Remove cfun->machine->has_explicit_vzeroupper. * config/i386/i386-protos.h (ix86_expand_avx_vzeroupper): Declared. * config/i386/i386.c (ix86_insn_callee_abi): New function. (ix86_initialize_callee_abi): Ditto. (ix86_expand_avx_vzeroupper): Ditto. (ix86_hard_regno_call_part_clobbered): Adjust for vzeroupper ABI. (TARGET_INSN_CALLEE_ABI): Define as ix86_insn_callee_abi. (ix86_emit_mode_set): Call ix86_expand_avx_vzeroupper directly. * config/i386/i386.h (struct GTY(()) machine_function): Delete has_explicit_vzeroupper. * config/i386/i386.md (enum unspec): New member UNSPEC_CALLEE_ABI. (ABI_DEFAULT,ABI_VZEROUPPER,ABI_UNKNOWN): New define_constants for insn callee abi index. * config/i386/predicates.md (vzeroupper_pattern): Adjust. * config/i386/sse.md (UNSPECV_VZEROUPPER): Deleted. (avx_vzeroupper): Call ix86_expand_avx_vzeroupper. (*avx_vzeroupper): Rename to .. (avx_vzeroupper_callee_abi): .. this, and adjust pattern as call_insn which has a special vzeroupper ABI. (*avx_vzeroupper_1): Deleted. gcc/testsuite/ChangeLog: PR target/82735 * gcc.target/i386/pr82735-1.c: New test. * gcc.target/i386/pr82735-2.c: New test. * gcc.target/i386/pr82735-3.c: New test. * gcc.target/i386/pr82735-4.c: New test. * gcc.target/i386/pr82735-5.c: New test.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #17 from CVS Commits --- The master branch has been updated by hongtao Liu : https://gcc.gnu.org/g:16465ceb06cc1f65cfca3c0eb2c1ee27ab03bdfd commit r12-1252-g16465ceb06cc1f65cfca3c0eb2c1ee27ab03bdfd Author: liuhongt Date: Tue Jun 1 09:00:57 2021 +0800 CALL_INSN may not be a real function call. Use "used" flag for CALL_INSN to indicate it's a fake call. If it's a fake call, it won't have its own function stack. gcc/ChangeLog PR target/82735 * df-scan.c (df_get_call_refs): When call_insn is a fake call, it won't use stack pointer reg. * final.c (leaf_function_p): When call_insn is a fake call, it won't affect caller as a leaf function. * reg-stack.c (callee_clobbers_any_stack_reg): New. (subst_stack_regs): When call_insn doesn't clobber any stack reg, don't clear the arguments. * rtl.c (shallow_copy_rtx): Don't clear flag used when orig is a insn. * shrink-wrap.c (requires_stack_frame_p): No need for stack frame for a fake call. * rtl.h (FAKE_CALL_P): New macro.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #16 from Hongtao.liu --- (In reply to Hongtao.liu from comment #15) > (In reply to Hongtao.liu from comment #14) > > (In reply to Hongtao.liu from comment #12) > > > (In reply to Jakub Jelinek from comment #10) > > > > Last touched in PR99563. > > > > I guess for the explicit user vzeroupper we need to add the > > > > clobbers/sets > > > > earlier than in the vzeroupper pass, but ideally in a way that doesn't > > > > force > > > > save/restore of registers that aren't really needed in the function. > > > > > > Yes, if we want to add the clobbers/sets earlier(than CSE1), vzeroupper > > > pass > > > should be able to remove those unnecessary clobbers/sets. > > > > Correct typo, add the clobbers/sets earlier(than RA) > > I'm trying to add a post_reload splitter to add CLOBBERS of xmm to > vzeroupper so that LRA knows vzeroupper will kill those xmm registers, then > in pass_vzeroupper, transform those CLOBBERS to SET (xmm, xmm), it will > benifit post_reload CSE which allow lower 128bits to cross vzeroupper, then > in post_reload split2, drop those SETs, it's safe since there's no CSE > between split2 and split3, problem is there's no update for data flow info > between split2 and pro_and_epilog which mean even i manually drop those > SETS, xmm6-xmm15 are still marked as used which causes redudant save and > restore under 64-bit MSabi. > > I'm thinking of adding a target_hook for updating df info just in the > begenning of pass_pro_and_epilogue, the default behavior of the target_hook > is doing nothing , and in i386 backend, df_analyse is called only under > TARGET_AVX && cfun->machine->has_explicit_vzeroupper. Oh, regs_ever_live isn't recomputed, that's why even i manually drop those SETs is post_reload splitter, but xmm6 to xmm15 is still marked as live. /* After reload, some ports add certain bits to regs_ever_live so this cannot be reset. */
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #15 from Hongtao.liu --- (In reply to Hongtao.liu from comment #14) > (In reply to Hongtao.liu from comment #12) > > (In reply to Jakub Jelinek from comment #10) > > > Last touched in PR99563. > > > I guess for the explicit user vzeroupper we need to add the clobbers/sets > > > earlier than in the vzeroupper pass, but ideally in a way that doesn't > > > force > > > save/restore of registers that aren't really needed in the function. > > > > Yes, if we want to add the clobbers/sets earlier(than CSE1), vzeroupper pass > > should be able to remove those unnecessary clobbers/sets. > > Correct typo, add the clobbers/sets earlier(than RA) I'm trying to add a post_reload splitter to add CLOBBERS of xmm to vzeroupper so that LRA knows vzeroupper will kill those xmm registers, then in pass_vzeroupper, transform those CLOBBERS to SET (xmm, xmm), it will benifit post_reload CSE which allow lower 128bits to cross vzeroupper, then in post_reload split2, drop those SETs, it's safe since there's no CSE between split2 and split3, problem is there's no update for data flow info between split2 and pro_and_epilog which mean even i manually drop those SETS, xmm6-xmm15 are still marked as used which causes redudant save and restore under 64-bit MSabi. I'm thinking of adding a target_hook for updating df info just in the begenning of pass_pro_and_epilogue, the default behavior of the target_hook is doing nothing , and in i386 backend, df_analyse is called only under TARGET_AVX && cfun->machine->has_explicit_vzeroupper.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #14 from Hongtao.liu --- (In reply to Hongtao.liu from comment #12) > (In reply to Jakub Jelinek from comment #10) > > Last touched in PR99563. > > I guess for the explicit user vzeroupper we need to add the clobbers/sets > > earlier than in the vzeroupper pass, but ideally in a way that doesn't force > > save/restore of registers that aren't really needed in the function. > > Yes, if we want to add the clobbers/sets earlier(than CSE1), vzeroupper pass > should be able to remove those unnecessary clobbers/sets. Correct typo, add the clobbers/sets earlier(than RA)
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #13 from Marcin Ślusarz --- FTR, to reproduce this problem with gcc 9 and 10 I had to either replace -mavx with -march=native or add -mtune=native. The problem starts reproducing with -march=haswell.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #12 from Hongtao.liu --- (In reply to Jakub Jelinek from comment #10) > Last touched in PR99563. > I guess for the explicit user vzeroupper we need to add the clobbers/sets > earlier than in the vzeroupper pass, but ideally in a way that doesn't force > save/restore of registers that aren't really needed in the function. Yes, if we want to add the clobbers/sets earlier(than CSE1), vzeroupper pass should be able to remove those unnecessary clobbers/sets.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #11 from Uroš Bizjak --- (In reply to Uroš Bizjak from comment #9) > (In reply to Richard Biener from comment #4) > > Indeed as far as I understand an unspec volatile isn't sth clobbering > > registers (not even memory?!). The insn is missing inputs/outputs > > (we might be able to model that lowparts are preserved). > > Since the instruction operates on the whole pack (8 or 16 SSE registers), > this approach will introduce uninitialized uses, and will clobber the whole > register pack. Since SSE registers are callee-saved this means all registers > will be saved in function prologue and restored in the epilogue. I was typing a bit too fast here: SSE registers are NOT preserved across function calls for SYSV ABI, and lower 128bit are preserved for MS ABI.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #10 from Jakub Jelinek --- Last touched in PR99563. I guess for the explicit user vzeroupper we need to add the clobbers/sets earlier than in the vzeroupper pass, but ideally in a way that doesn't force save/restore of registers that aren't really needed in the function.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #9 from Uroš Bizjak --- (In reply to Richard Biener from comment #4) > Indeed as far as I understand an unspec volatile isn't sth clobbering > registers (not even memory?!). The insn is missing inputs/outputs > (we might be able to model that lowparts are preserved). Since the instruction operates on the whole pack (8 or 16 SSE registers), this approach will introduce uninitialized uses, and will clobber the whole register pack. Since SSE registers are callee-saved this means all registers will be saved in function prologue and restored in the epilogue.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #8 from Uroš Bizjak --- (In reply to Hongtao.liu from comment #7) > Confirmed, let me fix this. Please note that the current definition of vzeroupper does not model effects of the instruction at all. The current definition is intended to handle automatic vzeroupper insertion, and the builtin somehow slipped out of the mind...
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 Hongtao.liu changed: What|Removed |Added CC||crazylht at gmail dot com --- Comment #7 from Hongtao.liu --- Confirmed, let me fix this.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #6 from Jeffrey Walton --- Add 9.3 to the know to fail list: $ gcc --version gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 Copyright (C) 2019 Free Software Foundation, Inc.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #5 from Jeffrey Walton --- I think we are seeing this bug in the field. We are catching lots of failed self tests as we test on multiple platforms, including Ubuntu 14 ERS and Ubuntu 16 LTS. The problem makes GCC 4.8.4 through 7.5 practically useless for AVX and AVX2. I don't see the problem with GCC 9.3. Maybe the problem got fixed somewhere along the way?
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #4 from Richard Biener --- Indeed as far as I understand an unspec volatile isn't sth clobbering registers (not even memory?!). The insn is missing inputs/outputs (we might be able to model that lowparts are preserved).
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #3 from Marc Glisse --- Actually, what CSE1 does might be fine, and it is LRA that should have noticed that the register it assigned was clobbered, so it should have spilled (or better rematerialized). Assuming the i386 backend does say that this unspec clobbers the registers, which I am not seeing right now (but I may not be looking in the right place).
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2017-10-26 Ever confirmed|0 |1 --- Comment #2 from Marc Glisse --- CSE1 happily turns uses of the second constant, loaded after vzeroupper, into uses of the first constant, loaded before, ignoring the fact that vzeroupper clobbers (the upper part of) all avx registers.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #1 from Marcin Ślusarz --- Heh, there are really stupid bugs in both files. Thankfully they don't change the outcome. Updated code: $ cat main.c #include #include void test(char *dest); int main() { char buf[64]; memset(buf, 0x2, 64); test(buf); for (int i = 0; i < 32; ++i) printf("%d ", buf[i]); printf("\n"); return 0; } $ cat zeroupper.c #include void test(char *dest) { __m256i ymm1 = _mm256_set1_epi8((char)0x1); _mm256_storeu_si256((__m256i *)(dest + 32), ymm1); _mm256_zeroupper(); __m256i ymm2 = _mm256_set1_epi8((char)0x1); _mm256_storeu_si256((__m256i *)dest, ymm2); } Still the output is: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0