> > > +@item x86-vect-unroll-min-ldst-threshold
> > > +The vectorizer will check with target information to determine
> > > +whether to unroll it.  This parameter is used to limit the minimum
> > > +number of loads and stores in the main loop.
> > >
> > > It's odd to "limit" the minimum number of something.  I think this
> > > warrants clarification that for some (unknown to me ;)) reason we
> > > think that when we have many loads and (or?) stores it is beneficial
> > > to unroll to get even more loads and stores in a single iteration.
> > > Btw, does the parameter limit the number of loads and stores _after_
> > > unrolling or before?
> > >
> > When the number of loads/stores exceeds the threshold, the loads/stores
> > are more likely to conflict with the loop itself in the L1 cache
> > (assuming the addresses of the loads are scattered).
> > Unroll + software scheduling will bring 2 or 4 address-contiguous
> > loads/stores closer together, which reduces the cache miss rate.
> 
> Ah, nice.  Can we express the default as a function of L1 data cache size, L1
> cache line size and, more importantly, the size of the vector memory access?
> 
> Btw, I was looking into making a more meaningful cost model for loop
> distribution.  Similar reasoning might apply there - try to _reduce_ the
> number of memory streams so L1 cache utilization allows re-use of a cache
> line in the next [next N] iteration[s]?  OTOH, given L1D is quite large, I'd
> expect the loops affected to be either quite huge or bottlenecked by
> load/store bandwidth (there are 1024 L1D cache lines in zen2, for
> example) - what's the effective L1D load you are keying off?
> Btw, how does L1D allocation on stores play a role here?
> 
Hi Richard,
To answer your question, I rechecked 549 and found that its improvement
comes from a reduction in loads: it has a 3-level loop nest, and 8 scalar
loads in the inner loop are loop invariants (due to high register pressure,
these loop invariants all spill to the stack).
After unrolling the inner loop, those scalar parts are not doubled, so
unrolling reduces the number of load instructions and of L1/L2/L3 accesses.
The inner loop references 8 different three-dimensional arrays, each sized
like "a[128][480][128]".  Although these arrays are very large, the data
does not support the theory I described before, sorry for that.  I need to
hold this patch and see if we can do something about this scenario.
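
For illustration, here is a minimal sketch of the kind of loop I mean (not
the actual 549 source; the array names, scalar names, arithmetic and bounds
are only assumptions to show the structure):

#define NI 128
#define NJ 480
#define NK 128

/* 8 large three-dimensional arrays referenced in the inner loop.  */
static double a0[NI][NJ][NK], a1[NI][NJ][NK], a2[NI][NJ][NK], a3[NI][NJ][NK];
static double a4[NI][NJ][NK], a5[NI][NJ][NK], a6[NI][NJ][NK], a7[NI][NJ][NK];

void
kernel (double c0, double c1, double c2, double c3,
        double c4, double c5, double c6, double c7)
{
  /* c0..c7 stand for the scalar loop invariants; in the real loop they
     spill to the stack under high register pressure and are reloaded in
     every inner iteration.  After unrolling the k loop by 2, each reload
     is shared by two iterations, so the scalar load count per iteration
     roughly halves, while the vector loads/stores of a0..a7 simply scale
     with the unroll factor.  */
  for (int i = 0; i < NI; i++)
    for (int j = 0; j < NJ; j++)
      for (int k = 0; k < NK; k++)
        a0[i][j][k] = c0 * a1[i][j][k] + c1 * a2[i][j][k]
                      + c2 * a3[i][j][k] + c3 * a4[i][j][k]
                      + c4 * a5[i][j][k] + c5 * a6[i][j][k]
                      + c6 * a7[i][j][k] + c7;
}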

Thanks,
Lili.

