2017-12-13 17:37 GMT+01:00 Henrik Gramner <hen...@gramner.com>:

> On Sat, Dec 9, 2017 at 1:11 PM, Martin Vignali <martin.vign...@gmail.com>
> wrote:
> > the idea in AVX2 is to load 128 bits of data (2x 64 bits),
> > then shuffle across lanes, putting the two 64 bits in the low part of
> > each lane, to keep the rest of the process similar to the sse version
>
> What about using pmovzxbw instead of movu + vpermq + punpcklbw?
>
You're right, this is faster (tested on the first one with intermediate
16-bit processing, grainextract):

vpermq load
grainextract_c:    22162.2
grainextract_sse2:  1160.9
grainextract_avx2:  1154.2

vpmovzxbw
grainextract_c:    22165.7
grainextract_sse2:  1155.7
grainextract_avx2:   772.9

> > for the store, the idea is similar in the opposite way (shuffle before
> > store)
>
> You could also do vextracti128 + 128-bit packuswb instead of 256-bit
> packuswb + vpermq.

Sorry, I don't understand this part. Do you mean 128-bit packuswb + movh for
each lane, or something else?

Martin
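
For reference, here is a scalar C sketch of the two load sequences and the two store sequences discussed above. It only models what the instructions do to the data. It is not the filter code from the patch. The store-side reading of the vextracti128 suggestion (extract the high lane, then a single 128-bit packuswb of the low and high halves) is an assumption, since the quoted message does not spell out the sequence. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load, variant 1: movu (16 bytes) + vpermq (qword 0 -> lane 0, qword 1 ->
 * lane 1) + punpcklbw with zero (widen the low 8 bytes of each lane). */
static void load_vpermq_punpcklbw(uint16_t out[16], const uint8_t src[16])
{
    uint8_t reg[32] = {0};
    memcpy(reg,      src,     8);   /* low half of lane 0 */
    memcpy(reg + 16, src + 8, 8);   /* low half of lane 1 */
    for (int lane = 0; lane < 2; lane++)
        for (int i = 0; i < 8; i++)
            out[lane * 8 + i] = reg[lane * 16 + i];  /* high byte is zero */
}

/* Load, variant 2: vpmovzxbw zero-extends 16 bytes to 16 words directly. */
static void load_pmovzxbw(uint16_t out[16], const uint8_t src[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = src[i];
}

/* Unsigned saturation of a signed word to a byte, as packuswb does. */
static uint8_t sat_u8(int16_t w)
{
    return w < 0 ? 0 : w > 255 ? 255 : (uint8_t)w;
}

/* packuswb on one 128-bit lane: 8 words from a, then 8 words from b. */
static void packuswb128(uint8_t dst[16], const int16_t a[8], const int16_t b[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i]     = sat_u8(a[i]);
        dst[i + 8] = sat_u8(b[i]);
    }
}

/* Store, variant 1: 256-bit packuswb m0, m0 (per lane), then vpermq to
 * gather qword 0 and qword 2 into the low 128 bits. */
static void store_pack256_vpermq(uint8_t out[16], const int16_t m0[16])
{
    uint8_t packed[32];
    packuswb128(packed,      m0,     m0);       /* low lane  */
    packuswb128(packed + 16, m0 + 8, m0 + 8);   /* high lane */
    memcpy(out,     packed,      8);            /* qword 0 */
    memcpy(out + 8, packed + 16, 8);            /* qword 2 */
}

/* Store, variant 2 (assumed reading): vextracti128 xm1, m0, 1 followed by
 * a 128-bit packuswb xm0, xm1. No cross-lane shuffle is needed. */
static void store_extract_pack128(uint8_t out[16], const int16_t m0[16])
{
    const int16_t *xm0 = m0;       /* low 128 bits */
    const int16_t *xm1 = m0 + 8;   /* high 128 bits, from vextracti128 */
    packuswb128(out, xm0, xm1);
}

/* Checks that both variants of each step produce identical data. */
static void check_paths(void)
{
    const uint8_t src[16] = {0, 1, 2, 127, 128, 200, 254, 255,
                             9, 10, 11, 12, 13, 14, 15, 16};
    uint16_t wa[16], wb[16];
    load_vpermq_punpcklbw(wa, src);
    load_pmovzxbw(wb, src);
    assert(memcmp(wa, wb, sizeof wa) == 0);

    const int16_t words[16] = {-5, 0, 1, 127, 128, 255, 256, 1000,
                               -32768, 32767, 42, 200, 300, -1, 254, 64};
    uint8_t ba[16], bb[16];
    store_pack256_vpermq(ba, words);
    store_extract_pack128(bb, words);
    assert(memcmp(ba, bb, sizeof ba) == 0);
}
```

Under this reading, a single 128-bit store of the packed result would be enough, so no movh per lane is required. Whether this is what was meant, and whether it is faster in practice, would need confirmation and a benchmark.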