Hey guys, I'm new to contributing to GNU projects, but I'm guessing patches get sent through this list? I would appreciate some sort of note on the procedure on the website.
I noticed that the matrix multiplication code in cblas/source_gemm_r.h has poor cache performance because of the order of its loops. In a replicated version of my change, I saw about a 20% performance gain on my AMD FX CPU. Do let me know if this is not the correct contribution procedure.

-Max
From 0345eaf2eb48997fa3d00fae2b37cf416d3713d4 Mon Sep 17 00:00:00 2001
From: JavaProphet <[email protected]>
Date: Thu, 18 Oct 2018 20:00:47 -0700
Subject: [PATCH] Reduce cache misses for source_gemm_r

---
 cblas/source_gemm_r.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/cblas/source_gemm_r.h b/cblas/source_gemm_r.h
index a008d22..7c9848e 100644
--- a/cblas/source_gemm_r.h
+++ b/cblas/source_gemm_r.h
@@ -71,8 +71,8 @@

     /* form C := alpha*A*B + C */

-    for (k = 0; k < K; k++) {
-      for (i = 0; i < n1; i++) {
+    for (i = 0; i < n1; i++) {
+      for (k = 0; k < K; k++) {
         const BASE temp = alpha * F[ldf * i + k];
         if (temp != 0.0) {
           for (j = 0; j < n2; j++) {
@@ -86,8 +86,8 @@

     /* form C := alpha*A*B' + C */

+    for (j = 0; j < n2; j++) {
     for (i = 0; i < n1; i++) {
-      for (j = 0; j < n2; j++) {
         BASE temp = 0.0;
         for (k = 0; k < K; k++) {
           temp += F[ldf * i + k] * G[ldg * j + k];
@@ -98,8 +98,8 @@

   } else if (TransF == CblasTrans && TransG == CblasNoTrans) {

+    for (i = 0; i < n1; i++) {
     for (k = 0; k < K; k++) {
-      for (i = 0; i < n1; i++) {
         const BASE temp = alpha * F[ldf * k + i];
         if (temp != 0.0) {
           for (j = 0; j < n2; j++) {
@@ -111,8 +111,8 @@

   } else if (TransF == CblasTrans && TransG == CblasTrans) {

+    for (j = 0; j < n2; j++) {
     for (i = 0; i < n1; i++) {
-      for (j = 0; j < n2; j++) {
         BASE temp = 0.0;
         for (k = 0; k < K; k++) {
           temp += F[ldf * k + i] * G[ldg * j + k];
--
2.7.4
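
For anyone who wants to look at the first hunk (the NoTrans/NoTrans case) outside the library, here is a minimal standalone sketch of the two loop orders. It is not the GSL source: the function names gemm_kij, gemm_ikj and check_orders_agree are made up for illustration, double stands in for BASE, and the innermost update statement is an assumption, since the hunk above is cut off before it. With k outermost, every step of the i loop jumps ldf elements through F and lands on a different row of C. With i outermost, row i of F is read contiguously and row i of C is reused across the whole k loop.

```c
#include <assert.h>
#include <stddef.h>

/* Order before the patch: k outer, i middle, j inner.
   The i loop reads F with stride ldf and touches a new row of C each step. */
void gemm_kij(size_t n1, size_t n2, size_t K, double alpha,
              const double *F, size_t ldf,
              const double *G, size_t ldg,
              double *C, size_t ldc)
{
  size_t i, j, k;
  for (k = 0; k < K; k++) {
    for (i = 0; i < n1; i++) {
      const double temp = alpha * F[ldf * i + k];
      if (temp != 0.0) {
        for (j = 0; j < n2; j++) {
          /* Assumed update; the patch context ends before this line. */
          C[ldc * i + j] += temp * G[ldg * k + j];
        }
      }
    }
  }
}

/* Order after the patch: i outer, k middle, j inner.
   Row i of F is read contiguously and row i of C is reused for all k. */
void gemm_ikj(size_t n1, size_t n2, size_t K, double alpha,
              const double *F, size_t ldf,
              const double *G, size_t ldg,
              double *C, size_t ldc)
{
  size_t i, j, k;
  for (i = 0; i < n1; i++) {
    for (k = 0; k < K; k++) {
      const double temp = alpha * F[ldf * i + k];
      if (temp != 0.0) {
        for (j = 0; j < n2; j++) {
          C[ldc * i + j] += temp * G[ldg * k + j];
        }
      }
    }
  }
}

/* Runs both orders on a small 2x3 times 3x2 product and checks that
   they produce the same C. */
void check_orders_agree(void)
{
  const double F[6] = {1, 2, 3, 4, 5, 6}; /* 2x3, row-major */
  const double G[6] = {1, 0, 2, 1, 0, 3}; /* 3x2, row-major */
  double C1[4] = {0, 0, 0, 0};
  double C2[4] = {0, 0, 0, 0};
  size_t n;

  gemm_kij(2, 2, 3, 1.0, F, 3, G, 2, C1, 2);
  gemm_ikj(2, 2, 3, 1.0, F, 3, G, 2, C2, 2);

  for (n = 0; n < 4; n++) {
    assert(C1[n] == C2[n]);
  }
}
```

Both versions add the contributions to each C[i][j] in increasing k, so swapping the two outer loops should only change the memory access pattern, not the arithmetic. To compare speed, one could call each function on large matrices and time it; how much difference shows up will depend on matrix size, compiler flags and the CPU.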
