> > That is an issue with this code. The Athlon has a 256k L2 last I
> > remember, and a 128k L1. Rather hard to keep lots of stuff in cache.
For their time (now well past), 384 KB (128 KB L1 plus 256 KB L2) was a
decent cache capacity. (Remember that AMD has traditionally used an
exclusive cache design, so whatever is in L1 is not also in L2, unlike
Intel.)

> Barton cores had 512k L2 as well as a faster front side bus.

I speculate that AMD will follow Intel to 2M/core caches as soon as they
start producing 65 nm chips. Hopefully they'll also add better _compute_
units, such as at least matching Intel's Core2 FP capabilities.

> > Right now the big issue we are running into for another aspect of this
> > project is the lack of a vector max/min function in SSE*. (If anyone

I'm a complete SSE virgin (almost), but isn't this largely just a matter
of doing a packed comparison, then using the resulting per-element mask
to load and merge?

> > from AMD/Intel is listening, this is a *big* issue, and I even have a
> > rough idea how to do it "quickly" in SSE at the expense of many SSE
> > registers.

I'd think you'd need one reg to hold the current max, one to load
candidates into, and probably another to do the flag-vector-merge thing.
At the end you do a "horizontal" min/max to get the final result.
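
The compare-and-merge idea with a running maximum and a final horizontal
step could be sketched roughly as below. This is only an illustration under
stated assumptions: single-precision floats, no NaNs in the input, unaligned
loads, and made-up helper names (max_by_mask, array_max). For packed floats
SSE already offers _mm_max_ps, which does the per-lane maximum in one
instruction; the mask version is spelled out here because it is the pattern
that carries over to element types without such an instruction.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Per-lane maximum by compare-and-merge (hypothetical helper).
   Uses three registers: current max, candidates, and the mask. */
static __m128 max_by_mask(__m128 cur, __m128 cand)
{
    __m128 mask = _mm_cmpgt_ps(cand, cur);       /* all-ones where cand > cur */
    return _mm_or_ps(_mm_and_ps(mask, cand),     /* take cand where mask is set */
                     _mm_andnot_ps(mask, cur));  /* keep cur everywhere else */
}

/* Maximum of n floats, n >= 1 (hypothetical helper; assumes no NaNs). */
float array_max(const float *x, size_t n)
{
    size_t i = 0;
    float best = x[0];

    if (n >= 4) {
        __m128 cur = _mm_loadu_ps(x);            /* running max, 4 lanes */
        for (i = 4; i + 4 <= n; i += 4)
            cur = max_by_mask(cur, _mm_loadu_ps(x + i));

        /* "Horizontal" max: fold lanes 2,3 onto 0,1, then lane 1 onto 0. */
        __m128 hi = _mm_movehl_ps(cur, cur);
        cur = max_by_mask(cur, hi);
        __m128 l1 = _mm_shuffle_ps(cur, cur, _MM_SHUFFLE(1, 1, 1, 1));
        cur = max_by_mask(cur, l1);
        best = _mm_cvtss_f32(cur);               /* lane 0 holds the result */
    }

    for (; i < n; ++i)                           /* scalar tail */
        if (x[i] > best)
            best = x[i];
    return best;
}
```

Calling array_max(data, count) on a float array returns its largest
element; swapping _mm_cmpgt_ps for _mm_cmplt_ps gives the minimum the same
way.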
