[Bug target/84280] [6/7/8 Regression] Performance regression in g++-7 with Eigen for non-AVX2 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280 --- Comment #14 from Patrik Huber --- It even seems a few percent slower with the FDO build. But `-fprofile-use` is a bit weird: if there is no .gcda file, it doesn't complain, and if you point it at a file that doesn't exist (e.g. -fprofile-use=foo), it doesn't complain either. So how can I check whether FDO was actually applied?
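One way to sanity-check this is to confirm that the instrumented run actually produced profile data before rebuilding with -fprofile-use. A rough sketch (not from the original report; the exact .gcda file names and location can vary, so the ls step is only a plausibility check):

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -fprofile-generate -o gemm_test_instr
./gemm_test_instr      # the instrumented run should write one or more .gcda files
ls *.gcda              # if nothing shows up here, -fprofile-use has no data to consume
g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -fprofile-use -o gemm_test_fdo

Note that -fprofile-generate and -fprofile-use need to be given at both compile and link time; with single-command builds like the above that happens automatically.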
[Bug target/84280] [6/7/8 Regression] Performance regression in g++-7 with Eigen for non-AVX2 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280 --- Comment #13 from Patrik Huber ---
>> Did you try with FDO? (-fprofile-generate, run, -fprofile-use)
I just tried this with g++-7. It didn't help; the final executable has the same slow run time as shown in the attached log without FDO.
[Bug target/84280] [6/7/8 Regression] Performance regression in g++-7 with Eigen for non-AVX2 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280 --- Comment #10 from Patrik Huber --- Created attachment 43367 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43367&action=edit gcc5_gemm_test.ii
[Bug target/84280] [6/7/8 Regression] Performance regression in g++-7 with Eigen for non-AVX2 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280 --- Comment #11 from Patrik Huber --- Created attachment 43368 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43368&action=edit gcc7_gemm_test.ii
[Bug target/84280] [6/7/8 Regression] Performance regression in g++-7 with Eigen for non-AVX2 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280 --- Comment #8 from Patrik Huber --- Created attachment 43366 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43366&action=edit full_log.txt
[Bug target/84280] [6/7/8 Regression] Performance regression in g++-7 with Eigen for non-AVX2 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280 --- Comment #7 from Patrik Huber --- Created attachment 43365 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43365&action=edit gemm_test.cpp
[Bug target/84280] [6/7/8 Regression] Performance regression in g++-7 with Eigen for non-AVX2 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280 --- Comment #6 from Patrik Huber --- I could also upload the .ii files, but they are 5 MB, which exceeds the bug tracker's attachment limit (1 MB).
[Bug target/84280] [6/7/8 Regression] Performance regression in g++-7 with Eigen for non-AVX2 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280 --- Comment #5 from Patrik Huber --- Created attachment 43364 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43364&action=edit gcc7_gemm_test.s
[Bug target/84280] [6/7/8 Regression] Performance regression in g++-7 with Eigen for non-AVX2 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280 --- Comment #4 from Patrik Huber --- Created attachment 43363 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43363&action=edit gcc5_gemm_test.s
[Bug target/84280] [6/7/8 Regression] Performance regression in g++-7 with Eigen for non-AVX2 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280 --- Comment #3 from Patrik Huber --- @Richard: I'm not 100% sure what you mean by "preprocessed source", but from some searching it's probably the output of compiling with "-c -save-temps". Please see the attached files.
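For reference, a minimal way to generate the preprocessed source on its own, using the same flags as the benchmark builds (a sketch of standard GCC usage, not taken from the report):

g++-7 -E gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -o gemm_test.ii    # preprocess only
g++-7 -c -save-temps gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3        # keeps gemm_test.ii and gemm_test.s

The -march flag matters even at the preprocessing stage, because Eigen selects its SIMD code paths via predefined macros such as __AVX__ and __AVX2__.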
[Bug c++/84280] New: Performance regression in g++-7 with Eigen for non-AVX2 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84280

Bug ID: 84280
Summary: Performance regression in g++-7 with Eigen for non-AVX2 CPUs
Product: gcc
Version: 7.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: patrikhuber at gmail dot com
Target Milestone: ---

Hello,

I noticed today what looks like quite a large performance regression with Eigen (3.3.4) matrix multiplication. It only seems to occur on non-AVX2 code paths: if I compile with -march=native on my Core i7 with AVX2, it's blazingly fast on both g++ versions, but not on an older Core i5 with only AVX, or if I use -march=core2. Here are some example timings, but the same applies to all matrix sizes that the benchmark tests (see the end of this message for the code):

g++-5 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -o gcc5_gemm_test
1124 1215 1465
elapsed_ms: 1970
1730 1235 1758
elapsed_ms: 3505

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -march=core2 -o gcc7_gemm_test
1124 1215 1465
elapsed_ms: 2998
1730 1235 1758
elapsed_ms: 4628

It's even worse if I test this on an i5-3550, which has AVX but not AVX2:

g++-5 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc5_gemm_test
1124 1215 1465
elapsed_ms: 941
1730 1235 1758
elapsed_ms: 1780

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_test
1124 1215 1465
elapsed_ms: 1988
1730 1235 1758
elapsed_ms: 3740

I tried the same with -O2 and it gave the same results. That's a drop to nearly half the speed in matrix multiplication on AVX CPUs. Or maybe I've done something wrong. :-)

I realise the benchmark might be a bit crude (it would be better to use Google Benchmark or something like that), but the results I'm getting are pretty consistent across various CPUs, compilers, and flags.

=== Benchmark code:

// gemm_test.cpp
#include <iostream>
#include <chrono>
#include <random>
#include <vector>
#include <Eigen/Dense>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;

template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
    using namespace std::chrono;
    float checksum = 0.0f; // to prevent the compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (size_t i = 0; i < 10; ++i)
    {
        Mat a_rm(s1, s2);
        Mat b_rm(s2, s3);
        const auto c_rm = a_rm * b_rm;
        checksum += c_rm(0, 0);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 100;
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}

int main()
{
    //std::random_device rd;
    //std::mt19937 gen(0);
    //std::uniform_int_distribution<> dis(1, 2048);
    std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758, 1116, 1736, 868, 1278, 1323, 788 };
    for (std::size_t i = 0; i < 12; ++i)
    {
        int s1 = vals[i++]; //dis(gen);
        int s2 = vals[i++]; //dis(gen);
        int s3 = vals[i];   //dis(gen);
        std::cout << s1 << " " << s2 << " " << s3 << std::endl;
        run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
        run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
        std::cout << "" << std::endl;
    }
    return 0;
}
===
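For anyone reproducing this, a small standalone check can confirm which SIMD code path the Eigen headers actually enabled for a given -march setting. This is a hypothetical helper, not part of the original benchmark; it assumes Eigen 3.3's SimdInstructionSetsInUse() helper is available:

// simd_check.cpp -- hypothetical helper, not part of the original benchmark.
// Prints the Eigen version and the SIMD instruction sets that Eigen's headers
// enabled for this compilation, so the SSE/AVX/AVX2 code paths can be told apart.
#include <iostream>
#include <Eigen/Core>

int main()
{
    std::cout << "Eigen " << EIGEN_WORLD_VERSION << "."
              << EIGEN_MAJOR_VERSION << "."
              << EIGEN_MINOR_VERSION << std::endl;
    std::cout << "SIMD in use: " << Eigen::SimdInstructionSetsInUse() << std::endl;
    return 0;
}

Built with the same flags as the benchmark (e.g. g++-7 simd_check.cpp -I 3rdparty/eigen/ -march=core2 -O3 -o simd_check), it should report the SSE family only for -march=core2 and AVX or AVX2 for -march=native on the respective machines, making it easy to verify that the slow and fast builds really differ only in the selected code path.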