https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90811
--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Ok, had a look today under the debugger at what happens.
In the host cc1, before the LTO info for offloading is written, we bump the alignment of the "omp simt private" variable from 64 bits to 128 bits in align_local_variable, called from:
#0  add_stack_var (decl=<optimized out>) at ../../gcc/cfgexpand.c:450
#1  0x00000000008bb7f6 in expand_one_var (var=0x7fffea9743f0, toplevel=<optimized out>, really_expand=<optimized out>) at ../../gcc/cfgexpand.c:1698
#2  0x00000000008bb8d7 in estimated_stack_frame_size (node=node@entry=0x7fffea7f75a0) at ../../gcc/cfgexpand.c:1974
#3  0x0000000000b13dc0 in compute_fn_summary (node=0x7fffea7f75a0, early=<optimized out>) at ../../gcc/ipa-fnsummary.c:2421
#4  0x0000000000b14121 in compute_fn_summary_for_current () at ../../gcc/cgraph.h:2008
#5  (anonymous namespace)::pass_local_fn_summary::execute (this=<optimized out>) at ../../gcc/ipa-fnsummary.c:3584
#6  0x0000000000c5ca75 in execute_one_pass (pass=0x22f8860) at ../../gcc/passes.c:2473
#7  0x0000000000c5d218 in execute_pass_list_1 (pass=0x22f8860) at ../../gcc/passes.c:2559
#8  0x0000000000c5d269 in execute_pass_list (fn=0x7fffea959160, pass=<optimized out>) at ../../gcc/passes.c:2570
#9  0x0000000000c5e4b6 in do_per_function_toporder (callback=0xc5d250 <execute_pass_list(function*, opt_pass*)>, data=0x22f87a0) at ../../gcc/passes.c:1705
#10 0x0000000000c5e523 in execute_ipa_pass_list (pass=0x22f8740) at ../../gcc/passes.c:2918
#11 0x00000000008f8e43 in ipa_passes () at ../../gcc/cgraphunit.c:2480

And ix86_local_alignment has similar:
  if (TARGET_64BIT && optimize_function_for_speed_p (cfun) && TARGET_SSE)
    {
      if (AGGREGATE_TYPE_P (type)
          && (va_list_type_node == NULL_TREE
              || (TYPE_MAIN_VARIANT (type)
                  != TYPE_MAIN_VARIANT (va_list_type_node)))
          && TYPE_SIZE (type)
          && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
          && wi::geu_p (wi::to_wide (TYPE_SIZE (type)), 128)
          && align < 128)
        return 128;
    }
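
To make the quoted condition easier to follow, here is a minimal standalone sketch of it in plain C. It is not GCC code. The structs target_model and type_model and the function model_local_alignment are hypothetical stand-ins for the GCC macros and tree accessors, and returning align unchanged when the condition fails is an assumption made only for this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the target predicates in the quoted code.  */
struct target_model
{
  bool target_64bit;        /* models TARGET_64BIT */
  bool optimize_for_speed;  /* models optimize_function_for_speed_p (cfun) */
  bool target_sse;          /* models TARGET_SSE */
};

/* Hypothetical stand-ins for the properties read from the type.  */
struct type_model
{
  bool is_aggregate;        /* models AGGREGATE_TYPE_P (type) */
  bool is_va_list;          /* models the va_list_type_node comparison */
  bool has_constant_size;   /* models TYPE_SIZE being an INTEGER_CST */
  unsigned long size_bits;  /* models the value of TYPE_SIZE (type) */
};

/* Return the alignment in bits after applying the modeled rule.
   Assumption: when the rule does not fire, ALIGN is returned as is.  */
unsigned int
model_local_alignment (const struct target_model *target,
                       const struct type_model *type,
                       unsigned int align)
{
  assert (target != NULL && type != NULL);

  if (target->target_64bit && target->optimize_for_speed && target->target_sse)
    {
      if (type->is_aggregate
          && !type->is_va_list
          && type->has_constant_size
          && type->size_bits >= 128
          && align < 128)
        return 128;
    }
  return align;
}
```

Under this model, an aggregate of at least 128 bits with 64-bit alignment comes back with 128-bit alignment when all three target predicates hold, which matches the 64-bit to 128-bit bump seen in the debugger.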