Hello, I am running a simple single-threaded memory benchmark that measures the time it takes to copy an array (https://github.com/BTone/cagbench). I run the benchmark in SE mode with 1 thread and 1 CPU, configured to match the setup used in gem5-skylake-config (https://github.com/darchr/gem5-skylake-config): 32 kB L1I and L1D caches, a 256 kB L2, and an 8 MB LLC.
On a real Intel Skylake (i7-6700K) with DDR4-2400:
  Array size 8 MB (total working set 16 MB): throughput is ~11,000 MB/s.
  Array size 16 MB (total working set 32 MB): throughput is ~9,500 MB/s.

In gem5 (darchr/gem5-skylake-config):
  Array size 8 MB (total working set 16 MB): throughput is ~6,000 MB/s.
  Array size 16 MB (total working set 32 MB): throughput drops to ~700 MB/s.

The performance when the workload mostly fits in the cache hierarchy is reasonable, but ~700 MB/s is far below the ~9,500 MB/s that the real system reaches at the same array size. I think this has something to do with the memory system past the last-level cache, but I am having trouble determining what exactly the issue is.

For reference, this is how I have the cache hierarchy configured (I reduced the tag/data/response latencies to rule out the caches as the cause):

Both L1I and L1D caches:
  size = '32kB'
  assoc = 8
  tag_latency = 1
  data_latency = 1
  response_latency = 1
  mshrs = 128
  tgts_per_mshr = 16
  write_buffers = 56
  demand_mshr_reserve = 96

L2 cache:
  size = '256kB'
  assoc = 4
  tag_latency = 1
  data_latency = 1
  response_latency = 1
  mshrs = 256
  tgts_per_mshr = 16
  write_buffers = 256

L3 cache:
  size = '8MB'
  assoc = 16
  tag_latency = 1
  data_latency = 1
  response_latency = 1
  mshrs = 256
  tgts_per_mshr = 20
  write_buffers = 256
  clusivity = 'mostly_excl'

Any suggestions would be greatly appreciated.
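
As a quick sanity check on the throughput figures above, here is a minimal Python sketch that turns each number into an average time per cache line and a slowdown factor relative to the real machine. It rests on assumptions that are not stated in the benchmark output: a 64-byte cache line, 1 MB = 10^6 bytes, and throughput counted as bytes copied per second. The helper names (`ns_per_line`, `slowdown`) are made up for illustration. Under those assumptions, ~700 MB/s works out to roughly 91 ns per line, which looks more like one memory access completing at a time than like many misses overlapping, but that is only a hint and not a diagnosis.

```python
# Rough sanity check on the reported throughputs.
# Assumptions (not from the benchmark itself): 64-byte cache lines,
# 1 MB = 1e6 bytes, throughput = bytes copied per second.
CACHE_LINE_BYTES = 64


def ns_per_line(throughput_mb_s, line_bytes=CACHE_LINE_BYTES):
    """Average time in nanoseconds to move one cache line at this throughput."""
    bytes_per_second = throughput_mb_s * 1e6
    return line_bytes / bytes_per_second * 1e9


def slowdown(real_mb_s, sim_mb_s):
    """How many times slower the simulated run is than the real machine."""
    return real_mb_s / sim_mb_s


# (array size in MB) -> (real i7-6700K MB/s, gem5 MB/s), as reported above
measurements = {
    8: (11000, 6000),
    16: (9500, 700),
}

for array_mb, (real, sim) in measurements.items():
    print(
        f"array {array_mb:2d} MB: "
        f"real {ns_per_line(real):6.2f} ns/line, "
        f"gem5 {ns_per_line(sim):6.2f} ns/line, "
        f"slowdown {slowdown(real, sim):5.2f}x"
    )
```

The per-line time is useful mainly because it can be compared directly against the DRAM access latency reported in gem5's stats output for the same run.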