Re: [gem5-users] Stream benchmark mismatch stats
My bad, I was missing a simple point. There are also 5K writes, which means an additional 5K reads from main memory to obtain an exclusive copy of each block before it is written (15K/8=1875). So, the simulator is fairly accurate. With different RPs, some blocks are kept in the cache during initialization, which helps reduce accesses to main memory.

The other issue I would like to solve is the low IPC. With a 3-level cache structure, tagged prefetchers (degree of 2 each), and a binary without vectorization, I see an IPC of 0.3. I increased the LQ and SQ sizes to 128, but saw no significant change. Since there is a pretty high hit rate in the caches, and reading 3 blocks (a, b, and c) should resolve 8 instructions (as each block holds values for 8 instructions), I expected a much higher IPC. Do you have any idea?

Thanks,
Majid

On Thu, Feb 20, 2020 at 12:36 PM Jason Lowe-Power wrote:
> Hi Majid,
>
> Are you taking into account the instruction fetches?
>
> Cheers,
> Jason
>
> On Thu, Feb 20, 2020 at 9:53 AM Majid Jalili wrote:
>
>> Let me correct myself. If I set the Size to 5K, then there would be total
>> of 10K loads (for a[i] and b[i]), so I expect to see 10K/8=1250.
>>
>> On Thu, Feb 20, 2020 at 11:45 AM Majid Jalili wrote:
>>
>>> I am running a simple stream benchmark that does a simple addition:
>>> m5_reset_stats(0,0);
>>> for(int i = 0 ; i
>>> c[i] =a[i]+b[i];
>>> m5_dump_stats(0,0);
>>>
>>> Each element of these arrays is a uint64_t. I turned off prefetchers and
>>> only enabled one level of cache. When I run for size of 10K elements,
>>> since 8 uint64_t elements can be fit onto a block, I expect to have at most
>>> 10K/8=1250 reads from main memory. However, if I use LRU RP at L1, I see
>>> 1792 reads at main memory. If the RP changes to RRRIP, then it would be
>>> 1340 reads.
>>>
>>> I cannot figure out why LRU is doing poorly, while it should be way
>>> better. In terms of numCycles, also LRU is slower than RRRIP?
>>>
>>> Majid

___
gem5-users mailing list
gem5-users@gem5.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
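The read counts discussed in this message (1250 for the loads alone, 1875 once the blocks of c are fetched before being written) can be reproduced with a small calculation. The following is only a sketch: the 64-byte block size is inferred from "8 uint64_t elements per block", the arrays are assumed to be contiguous and block-aligned, and the helper names blocks_touched and expected_reads are hypothetical, not part of gem5. Instruction fetches and initialization traffic are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64-byte cache blocks (8 uint64_t values per block). */
enum { BLOCK_BYTES = 64 };

/* Number of blocks covered by n_elems contiguous, block-aligned elements. */
uint64_t blocks_touched(uint64_t n_elems, uint64_t elem_bytes)
{
    assert(elem_bytes > 0 && elem_bytes <= BLOCK_BYTES);
    uint64_t per_block = BLOCK_BYTES / elem_bytes;
    return (n_elems + per_block - 1) / per_block;   /* ceiling division */
}

/* Expected block reads from memory for c[i] = a[i] + b[i] on a cold cache.
 * a and b are always read; with write-allocate, each block of c is also
 * read once before it is written. */
uint64_t expected_reads(uint64_t n_elems, uint64_t elem_bytes,
                        int write_allocate)
{
    uint64_t per_array = blocks_touched(n_elems, elem_bytes);
    return per_array * (write_allocate ? 3u : 2u);
}
```

For 5000 elements of 8 bytes, expected_reads(5000, 8, 0) gives the 10K/8 figure and expected_reads(5000, 8, 1) gives the 15K/8 figure quoted above.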
Re: [gem5-users] Stream benchmark mismatch stats
Hi Majid,

Are you taking into account the instruction fetches?

Cheers,
Jason

On Thu, Feb 20, 2020 at 9:53 AM Majid Jalili wrote:
> Let me correct myself. If I set the Size to 5K, then there would be total
> of 10K loads (for a[i] and b[i]), so I expect to see 10K/8=1250.
>
> On Thu, Feb 20, 2020 at 11:45 AM Majid Jalili wrote:
>
>> I am running a simple stream benchmark that does a simple addition:
>> m5_reset_stats(0,0);
>> for(int i = 0 ; i
>> c[i] =a[i]+b[i];
>> m5_dump_stats(0,0);
>>
>> Each element of these arrays is a uint64_t. I turned off prefetchers and
>> only enabled one level of cache. When I run for size of 10K elements,
>> since 8 uint64_t elements can be fit onto a block, I expect to have at most
>> 10K/8=1250 reads from main memory. However, if I use LRU RP at L1, I see
>> 1792 reads at main memory. If the RP changes to RRRIP, then it would be
>> 1340 reads.
>>
>> I cannot figure out why LRU is doing poorly, while it should be way
>> better. In terms of numCycles, also LRU is slower than RRRIP?
>>
>> Majid
Re: [gem5-users] Stream benchmark mismatch stats
Let me correct myself. If I set the Size to 5K, then there would be a total of 10K loads (for a[i] and b[i]), so I expect to see 10K/8=1250.

On Thu, Feb 20, 2020 at 11:45 AM Majid Jalili wrote:
> I am running a simple stream benchmark that does a simple addition:
> m5_reset_stats(0,0);
> for(int i = 0 ; i
> c[i] =a[i]+b[i];
> m5_dump_stats(0,0);
>
> Each element of these arrays is a uint64_t. I turned off prefetchers and
> only enabled one level of cache. When I run for size of 10K elements,
> since 8 uint64_t elements can be fit onto a block, I expect to have at most
> 10K/8=1250 reads from main memory. However, if I use LRU RP at L1, I see
> 1792 reads at main memory. If the RP changes to RRRIP, then it would be
> 1340 reads.
>
> I cannot figure out why LRU is doing poorly, while it should be way
> better. In terms of numCycles, also LRU is slower than RRRIP?
>
> Majid
[gem5-users] Stream benchmark mismatch stats
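I am running a simple stream benchmark that does a simple addition:
m5_reset_stats(0,0);
for(int i = 0 ; i
c[i] =a[i]+b[i];
m5_dump_stats(0,0);

Each element of these arrays is a uint64_t. I turned off prefetchers and only enabled one level of cache. When I run for a size of 10K elements, since 8 uint64_t elements fit in a block, I expect to have at most 10K/8=1250 reads from main memory. However, if I use LRU RP at L1, I see 1792 reads at main memory. If the RP changes to RRRIP, then it would be 1340 reads.

I cannot figure out why LRU is doing poorly, while it should be way better. In terms of numCycles, also LRU is slower than RRRIP?

Majid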
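For readers who want to reproduce the measurement, the kernel from this thread might look roughly like the sketch below. The loop bound did not survive in the archive, so the parameter n is a stand-in rather than the author's code, and stream_add and the USE_M5OPS build flag are hypothetical names. The header path <gem5/m5ops.h> assumes gem5's include directory is on the compiler's include path; without the flag, no-op stand-ins are used so the sketch builds outside gem5.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef USE_M5OPS   /* hypothetical build flag, not from the thread */
#include <gem5/m5ops.h>
#else
/* No-op stand-ins so the sketch compiles and runs outside gem5. */
static void m5_reset_stats(uint64_t delay, uint64_t period)
{
    (void)delay; (void)period;
}
static void m5_dump_stats(uint64_t delay, uint64_t period)
{
    (void)delay; (void)period;
}
#endif

/* Element-wise addition over three uint64_t arrays, with the stats window
 * restricted to the loop itself. */
void stream_add(uint64_t *c, const uint64_t *a, const uint64_t *b, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    m5_reset_stats(0, 0);          /* start counting at the loop */
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    m5_dump_stats(0, 0);           /* dump stats right after the loop */
}
```

Keeping the reset and dump calls tight around the loop leaves array initialization out of the reported statistics, although blocks brought into the cache during initialization can still reduce the reads observed inside the window.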