On Fri, 29 Jan 2010 14:57:51 -0600, Barry Smith <bsmith at mcs.anl.gov> wrote:

> This is why I have the 1.5 there instead of 2.5 or 3 or more you
> might see without inodes.
Okay, so suppose that the ordering is identical. With no inodes, each entry effectively costs sizeof(double) + sizeof(int) = 12 bytes. With inodes (bs=4), it's sizeof(double) + (1/4)*sizeof(int) = 9 bytes, since one row of column indices is shared by 4 rows. With BAIJ, it's sizeof(double) + (1/16)*sizeof(int) = 8.25 bytes, since one index serves a whole 4x4 block. If the ordering is different, e.g. [u0,u1,...,v0,v1,...] as seems popular for unknown reasons, then cache reuse of the vector goes out the window and it's going to be really bad.

> In addition, all the "extra" column entries are still stored in the
> a->j array. Thus when we move to each new set of rows we skip over
> those entries, thus partial cache lines of a->i are constantly
> wasted.

So a nontrivial matrix with bs=4 will have over 100 nonzeros per row, thus moving to the next block involves skipping 300*sizeof(int) = 1200 bytes. This is more than 8 cache lines ahead, so it's almost guaranteed to be a completely cold miss at all levels, which means about 250 clocks to memory. Across the block's rows we are using around 400 doubles that also need to come all the way from memory, at a bit under 4 clocks each (assuming a fully saturated bus), so roughly 1600 clocks of useful traffic. The miss due only to stepping over these column indices could therefore cost over 15% (250 of 1600 clocks) if everything else was running smoothly (big assumption, I know). Add to that the fact that a stream of matrix entries steps over a 4 kB page boundary (which hardware prefetch doesn't cross) four times as often as with BAIJ, where the entries are a single contiguous stream, and that the hardware prefetcher only follows one stream per 4 kB page.

So with this naive analysis, it seems possible for deficient hardware prefetch to be at fault for a nontrivial amount of the observed difference between Inode and BAIJ. Nobody's sparse mat-vecs are especially close to saturating the memory bus with useful stuff, even for matrices that are good at reusing the vector, so there must be more to the story than pure bandwidth and reuse of the vector.

Jed
