https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735
--- Comment #6 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 11 Sep 2019, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91735
>
> --- Comment #5 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Richard Biener from comment #3)
> > Reducing the VF here should be the goal.  For the particular case "filling"
> > the holes with neutral data and blending in the original values at store
> > time will likely be optimal.  So do
> >
> >  tem = vector load
> >  zero all [4] elements
> >  compute
> >  blend in 'tem' into the [4] elements
> >  vector store
>
> MASKMOVDQU [1] should be an excellent fit here.

Yes, but it's probably slower.  And it avoids store data races,
of course plus avoids epilogue peeling (eventually).
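
For illustration only, the load / zero / compute / blend / store sequence
quoted above could be sketched as follows with the GCC generic vector
extension.  This is a minimal sketch, not the code the vectorizer emits.
The eight-lane layout, the choice of lane 4 as the "hole", the names
update_group and live_mask, and the x * 2 + 1 computation are all
assumptions made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Eight 32-bit lanes, using the GCC/Clang generic vector extension. */
typedef int32_t v8si __attribute__((vector_size(32)));

/* Assumed layout: lane 4 of each group of eight is a "hole" that the
   original scalar loop never writes.  All-ones marks a computed lane. */
static const v8si live_mask = { -1, -1, -1, -1, 0, -1, -1, -1 };
static const v8si one = { 1, 1, 1, 1, 1, 1, 1, 1 };

/* Hypothetical kernel: p[i] = p[i] * 2 + 1 for every lane except the hole. */
static void update_group(int32_t *p)
{
    v8si tem, x, res;

    assert(p != NULL);
    memcpy(&tem, p, sizeof tem);      /* tem = vector load (may be unaligned) */
    x = tem & live_mask;              /* fill the hole with neutral data (zero) */
    res = x + x + one;                /* compute on all eight lanes */
    res = (res & live_mask)           /* keep computed values in live lanes */
        | (tem & ~live_mask);         /* blend 'tem' back into the hole lane */
    memcpy(p, &res, sizeof res);      /* full-width vector store */
}
```

Note that the final store rewrites the hole lane with the value it already
held.  That write is not present in the scalar code, which is the store data
race mentioned in the reply; a masked store such as MASKMOVDQU would leave
that lane untouched instead.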