Re: [gem5-users] Stream benchmark mismatch stats

2020-02-21 Thread Majid Jalili
My bad, I was missing a simple point. There are also 5K writes, which means
an additional 5K elements' worth of blocks must be read from main memory to
obtain exclusive copies before writing (15K/8=1875 reads in total). So the
simulator is fairly accurate. As for the different RPs, some blocks are kept
in the cache during initialization, which lowers the number of accesses to
main memory.
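
The arithmetic above can be cross-checked with a minimal C sketch. It assumes 64-byte cache blocks (8 uint64_t elements per block, as stated in this thread), arrays that start on a block boundary, and a write-allocate cache that fetches a block before writing it. The function names are hypothetical and are not part of gem5.

```c
#include <assert.h>
#include <stdint.h>

/* Number of cache blocks touched by `accesses` sequential element accesses.
 * Assumption: the array starts on a block boundary. */
static uint64_t blocks_touched(uint64_t accesses, uint64_t elem_bytes,
                               uint64_t block_bytes)
{
    assert(elem_bytes > 0 && block_bytes >= elem_bytes);
    uint64_t elems_per_block = block_bytes / elem_bytes;
    /* Round up so a partially used last block still counts. */
    return (accesses + elems_per_block - 1) / elems_per_block;
}

/* Expected main-memory reads for c[i] = a[i] + b[i] over n elements.
 * Assumption: write-allocate cache, so the blocks of c are read before
 * they are written, in addition to the blocks of a and b. */
static uint64_t expected_reads(uint64_t n, uint64_t elem_bytes,
                               uint64_t block_bytes)
{
    return 3 * blocks_touched(n, elem_bytes, block_bytes);
}
```

With n = 5000, 8-byte elements, and 64-byte blocks, `expected_reads` gives 3 * 625 = 1875, which matches the 15K/8 figure; counting only the loads of a and b gives 2 * 625 = 1250. Cold misses only are modeled here, so instruction fetches and any blocks left in the cache from initialization are not accounted for.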

The other issue that I am trying to solve is the low IPC. With a 3-level
cache hierarchy, tagged prefetchers (degree of 2 each), and a binary built
without vectorization, I see an IPC of 0.3. I increased the LQ and SQ sizes
to 128, but saw no significant change. Since the hit rate in the caches is
quite high, and reading 3 blocks (a, b, and c) should satisfy 8 loop
iterations (as each block holds the values for 8 iterations), I expected a
much higher IPC. Do you have any idea?
Thanks,
Majid

On Thu, Feb 20, 2020 at 12:36 PM Jason Lowe-Power wrote:

> Hi Majid,
>
> Are you taking into account the instruction fetches?
>
> Cheers,
> Jason
___
gem5-users mailing list
gem5-users@gem5.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users

Re: [gem5-users] Stream benchmark mismatch stats

2020-02-20 Thread Jason Lowe-Power
Hi Majid,

Are you taking into account the instruction fetches?

Cheers,
Jason

On Thu, Feb 20, 2020 at 9:53 AM Majid Jalili wrote:

> Let me correct myself. If I set the Size to 5K, then there would be a total
> of 10K loads (for a[i] and b[i]), so I expect to see 10K/8=1250 reads.

Re: [gem5-users] Stream benchmark mismatch stats

2020-02-20 Thread Majid Jalili
Let me correct myself. If I set the Size to 5K, then there would be a total
of 10K loads (for a[i] and b[i]), so I expect to see 10K/8=1250 reads.

On Thu, Feb 20, 2020 at 11:45 AM Majid Jalili wrote:

> I am running a simple stream benchmark that does a simple addition:
>  m5_reset_stats(0,0);
>  for(int i = 0 ; i < ...)
>    c[i] = a[i] + b[i];
>  m5_dump_stats(0,0);
>
> Each element of these arrays is a uint64_t. I turned off the prefetchers and
> enabled only one level of cache. When I run with a size of 10K elements,
> since 8 uint64_t elements fit in a block, I expect to have at most
> 10K/8=1250 reads from main memory. However, if I use the LRU RP at L1, I see
> 1792 reads at main memory. If the RP is changed to RRRIP, there are
> 1340 reads.
>
> I cannot figure out why LRU is doing so poorly when it should be much
> better. In terms of numCycles, LRU is also slower than RRRIP.
>
> Majid
>
>

[gem5-users] Stream benchmark mismatch stats

2020-02-20 Thread Majid Jalili
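I am running a simple stream benchmark that does a simple addition:
 m5_reset_stats(0,0);
 for(int i = 0 ; i < ...)
   c[i] = a[i] + b[i];
 m5_dump_stats(0,0);

Each element of these arrays is a uint64_t. I turned off the prefetchers and
enabled only one level of cache. When I run with a size of 10K elements,
since 8 uint64_t elements fit in a block, I expect to have at most
10K/8=1250 reads from main memory. However, if I use the LRU RP at L1, I see
1792 reads at main memory. If the RP is changed to RRRIP, there are
1340 reads.

I cannot figure out why LRU is doing so poorly when it should be much
better. In terms of numCycles, LRU is also slower than RRRIP.

Majid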
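
For readers who want to reproduce the measurement, the loop discussed in this thread can be sketched as a complete C function. This is only an illustration under stated assumptions: the function name `stream_add`, the parameter `n`, the initial values, and the checksum are hypothetical, the original loop bound is not preserved in the archive, and the m5ops header location depends on the gem5 version. Stub versions of the m5 calls are provided so the sketch also builds outside gem5.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef USE_M5OPS
#include <gem5/m5ops.h> /* assumption: header path varies by gem5 version */
#else
/* Stubs so the sketch builds and runs without gem5. */
static void m5_reset_stats(uint64_t delay, uint64_t period)
{ (void)delay; (void)period; }
static void m5_dump_stats(uint64_t delay, uint64_t period)
{ (void)delay; (void)period; }
#endif

/* Hypothetical kernel: c[i] = a[i] + b[i] over n uint64_t elements.
 * Returns 0 on success and -1 on allocation failure; *checksum receives
 * the sum of c so the compiler cannot discard the loop. */
static int stream_add(size_t n, uint64_t *checksum)
{
    assert(checksum != NULL);
    uint64_t *a = malloc(n * sizeof *a);
    uint64_t *b = malloc(n * sizeof *b);
    uint64_t *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1;
    }

    /* Initialization runs before the stats reset, so some blocks may
     * already be in the cache when the measured region starts. */
    for (size_t i = 0; i < n; i++) {
        a[i] = i;
        b[i] = 2 * i;
        c[i] = 0;
    }

    m5_reset_stats(0, 0);          /* start of the measured region */
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    m5_dump_stats(0, 0);           /* end of the measured region */

    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += c[i];
    *checksum = sum;

    free(a); free(b); free(c);
    return 0;
}
```

Calling `stream_add(5000, &sum)` from `main` corresponds to the 5K-element case in the thread. Building with `USE_M5OPS` defined and linking against the gem5 m5 library is assumed to be needed for the stats calls to take effect inside the simulator.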