[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2024-01-20 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Andrew Pinski  changed:

   What|Removed |Added

   Target Milestone|--- |14.0

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2024-01-05 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #19 from Jan Hubicka  ---
I think we can declare this one fixed.

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-11-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #18 from Jan Hubicka  ---
I made a typo:

Mainline with -O2 -flto  -march=native run manually since build machinery patch
is needed
23.03
22.85
23.04

Should be 
Mainline with -O3 -flto  -march=native run manually since build machinery patch
is needed
23.03
22.85
23.04

So with -O2 we still get a slightly lower score than clang; with -O3 we are
slightly better. push_back inlining does not seem to be a problem (as tested by
increasing limits), so perhaps it is the more aggressive unrolling/vectorization
settings clang has at -O2.

I think upstream jpegxl should use -O3 or -Ofast instead of -O2.  It is quite a
typical kind of task that benefits from higher optimization levels.

I filed https://github.com/libjxl/libjxl/issues/2970
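
For reference, a hedged sketch of what such an upstream change could look like
when configuring with CMake (whether libjxl's build honors these generic cache
variables unmodified is an assumption, not something verified here):

  cmake -B build -DCMAKE_BUILD_TYPE=Release \
        -DCMAKE_C_FLAGS="-O3" -DCMAKE_CXX_FLAGS="-O3"
  cmake --build build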

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-11-24 Thread aros at gmx dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #17 from Artem S. Tashkinov  ---
Terrific results, thanks a ton!

Maybe this bug report could be closed now?

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-11-24 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #16 from Sam James  ---
Thank you all for the effort.

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #15 from Jan Hubicka  ---
With the SRA improvements in r:aae723d360ca26cd9fd0b039fb0a616bd0eae363 we
finally get good performance at -O2. Improvements to the push_back
implementation also help a bit.

Mainline with default flags (-O2):
Input: JPEG - Quality: 90:
19.76
19.75
19.68
Mainline with -O2 -march=native:
Input: JPEG - Quality: 90:
20.01
20
19.98
Mainline with -O2 -march=native -flto
Input: JPEG - Quality: 90:
19.95
19.98
19.81
Mainline with -O2 -march=native -flto --param max-inline-insns-auto=80 (this
makes push_back inlined)
Input: JPEG - Quality: 90:
19.98
20.05
20.03
Mainline with -O2 -flto  -march=native -I/usr/include/c++/v1 -nostdinc++ -lc++
(so clang's libc++)
21.38
21.37
21.32
Mainline with -O2 -flto  -march=native run manually since build machinery patch
is needed
23.03
22.85
23.04
Clang 17 with -O2 -march=native -flto, plus -fno-tree-vectorize
-fno-tree-slp-vectorize added by cmake. This is with the system libstdc++ from
GCC 13, so before the push_back improvements.
21.16
20.95
21.06
Clang 17 with -O2 -march=native -flto, plus -fno-tree-vectorize
-fno-tree-slp-vectorize added by cmake. This is with trunk libstdc++ including
the push_back improvements.
21.2
20.93
20.98
Clang 17 with -O2 -march=native -flto -stdlib=libc++, plus
-fno-tree-vectorize -fno-tree-slp-vectorize added by cmake. This is with
clang's libc++.
Input: JPEG - Quality: 90:
22.08
21.88
21.78
Clang 17 with -O3 -march=native -flto
23.08
22.90
22.84


libc++ declares push_back always_inline and splits out the slow copying path.
I think the inlined part is still a bit too large for inlining at -O2.

We could still try to get the remaining approx. 10% at -O2 without increasing
code size, but the major part of the problem is solved.
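
For illustration, a minimal hand-written sketch of that fast-path/slow-path
split (assumed shape only; this is neither the actual libc++ nor the libstdc++
code, and the names are made up):

  #include <cstddef>
  #include <new>
  #include <utility>

  template <typename T>
  struct toy_vector {
    T* begin_ = nullptr;
    T* end_ = nullptr;
    T* cap_ = nullptr;

    // Fast path: small enough to inline (libc++ additionally marks it
    // always_inline); the cold reallocating path stays out of line.
    void push_back(const T& v) {
      if (end_ != cap_) {
        ::new (static_cast<void*>(end_)) T(v);
        ++end_;
      } else {
        realloc_append(v);
      }
    }

    // Slow path: grow and append at the end.  Unlike a general insert, an
    // append needs no code to shift later elements around.
    __attribute__((noinline)) void realloc_append(const T& v) {
      std::size_t n = static_cast<std::size_t>(end_ - begin_);
      std::size_t new_cap = n ? 2 * n : 1;
      T* p = static_cast<T*>(::operator new(new_cap * sizeof(T)));
      ::new (static_cast<void*>(p + n)) T(v);   // construct new element first,
      for (std::size_t i = 0; i < n; ++i) {     // then move the old ones over
        ::new (static_cast<void*>(p + i)) T(std::move(begin_[i]));
        begin_[i].~T();
      }
      ::operator delete(begin_);
      begin_ = p;
      end_ = p + n + 1;
      cap_ = p + new_cap;
    }
  };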

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-11-21 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #14 from CVS Commits  ---
The master branch has been updated by Jan Hubicka :

https://gcc.gnu.org/g:1d82fc2e6824bf83159389729c31a942f7b91b04

commit r14-5679-g1d82fc2e6824bf83159389729c31a942f7b91b04
Author: Jan Hubicka 
Date:   Tue Nov 21 15:17:16 2023 +0100

optimize std::vector::push_back

This patch speeds up push_back at -O3 significantly by making the
reallocation be inlined by default.  _M_realloc_insert is a general
insertion function that takes an iterator pointing to the location where
the value should be inserted.  As such it contains code to move other
entries around, which is quite large.

Since appending to the end of the array is a common operation, I think we
should have specialized code for that.  Sadly it is really hard to work
this out from IPA passes, since we basically care whether the iterator
points to the same place as the end pointer, and both are passed by
reference.  This is inter-procedural value numbering that is quite out of
reach.

I also added an extra check making it clear that the new length of the
vector is non-zero.  This saves extra conditionals.  Again it is quite a
hard case, since _M_check_len seems to be able to return 0 if its
parameter is 0.  That never happens here, but we are not able to propagate
this early nor at the IPA stage.  (A sketch of the hint follows the
ChangeLog below.)

libstdc++-v3/ChangeLog:

PR libstdc++/110287
PR middle-end/109811
PR middle-end/109849
* include/bits/stl_vector.h (_M_realloc_append): New member
function.
(push_back): Use it.
* include/bits/vector.tcc (emplace_back): Use it.
(_M_realloc_insert): Let compiler know that new vector size is
non-zero.
(_M_realloc_append): New member function.
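
As promised above, a minimal sketch of the non-zero-length hint (hypothetical
code; new_length and its growth policy are made up, not the actual
_M_check_len):

  #include <cstddef>

  // If the optimizer cannot prove the computed length is non-zero, the
  // allocation path keeps a size == 0 special case.  An explicit hint
  // removes that conditional.
  inline std::size_t new_length(std::size_t size, std::size_t extra) {
    std::size_t n = size + (size > extra ? size : extra);  // roughly 2x growth
    if (n == 0)
      __builtin_unreachable();  // promise: callers always add >= 1 element
    return n;
  }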

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-11-02 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #13 from Jan Hubicka  ---
So I re-tested it with current mainline and clang 16/17

For mainline I get (megapixels per second, bigger is better):
13.39
13.38
13.42
clang 16:
20.06
20.06
19.87
clang 17:
19.7
19.68
19.69


mainline with Martin's patch to enable SRA across calls where the parameter
doesn't escape (improvement for PR109849) I get:
19.37
19.35
19.31
this is without inlining _M_realloc_insert, which we do at -O3 but not at
-O2 since it is large (clang inlines it at both -O2 and -O3).
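
To illustrate what SRA across calls means here, a reduced hypothetical example
(names made up): the vector's begin/end/capacity triple can live in registers
in the hot loop only if the compiler can prove the out-of-line callee does not
let the address escape:

  #include <cstddef>
  #include <cstdlib>

  struct buf { int* begin; int* end; int* cap; };

  // Cold out-of-line path: reallocate and append.
  __attribute__((noinline)) void grow_and_append(buf& b, int v) {
    std::size_t n = b.end - b.begin;
    std::size_t c = n ? 2 * n : 16;
    int* p = static_cast<int*>(std::realloc(b.begin, c * sizeof(int)));
    b.begin = p; b.end = p + n; b.cap = p + c;
    *b.end++ = v;
  }

  void fill(buf& out, int n) {
    buf b = out;                 // candidate for SRA into three scalars
    for (int i = 0; i < n; ++i) {
      if (b.end != b.cap)
        *b.end++ = i;            // hot path: pure register arithmetic if SRA'd
      else
        grow_and_append(b, i);   // &b escapes into the call: unless IPA can
                                 // prove the callee does not retain the
                                 // address, b must stay in memory and b.end
                                 // is re-loaded on every iteration
    }
    out = b;
  }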

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-06-19 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #12 from CVS Commits  ---
The master branch has been updated by Jan Hubicka :

https://gcc.gnu.org/g:7b34cacc5735385e7e2855d7c0a6fad60ef4a99b

commit r14-1951-g7b34cacc5735385e7e2855d7c0a6fad60ef4a99b
Author: Jan Hubicka 
Date:   Mon Jun 19 18:28:17 2023 +0200

optimize std::max early

We currently produce very bad code on loops using std::vector as a stack,
since we fail to inline push_back, which in turn prevents SRA, and we fail
to optimize out some store-to-load pairs.

I looked into why this function is not inlined while it is inlined by
clang.  We currently estimate it to 66 instructions, and the inline limits
are 15 at -O2 and 30 at -O3.  Clang has a similar estimate, but still
decides to inline at -O2.

I looked into why the body is so large, and one problem I spotted is the
way std::max is implemented, by taking and returning references to the
values:

  const T& max( const T& a, const T& b );

This makes it necessary to store the values to memory and load them later,
and max is used by the code computing the new size of the vector on
resize.

We optimize this to MAX_EXPR, but only during late optimizations.  I think
this is a common enough coding pattern and we ought to make it transparent
to early opts and IPA.  The following is the easiest fix: it simply adds
the phiprop pass, which turns a PHI of address values into a PHI of
values, to the early optimization passes, so that later FRE can propagate
values across memory, phiopt can discover the MAX_EXPR pattern, and DSE
can remove the memory stores.  (A reduced example follows the ChangeLog
below.)

gcc/ChangeLog:

PR tree-optimization/109811
PR tree-optimization/109849
* passes.def: Add phiprop to early optimization passes.
* tree-ssa-phiprop.cc: Allow clonning.

gcc/testsuite/ChangeLog:

PR tree-optimization/109811
PR tree-optimization/109849
* gcc.dg/tree-ssa/phiprop-1.c: New test.
* gcc.dg/tree-ssa/pr21463.c: Adjust template.
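
As referenced above, a reduced example of the std::max-by-reference pattern
(hypothetical code, not taken from libstdc++):

  #include <cstddef>

  // std::max has this shape: it selects between two *references*.
  const std::size_t& max_ref(const std::size_t& a, const std::size_t& b) {
    return a < b ? b : a;
  }

  std::size_t new_size(std::size_t cur, std::size_t wanted) {
    std::size_t doubled = cur * 2;
    // After inlining max_ref, early GIMPLE has a PHI over the two
    // *addresses* (&doubled, &wanted) followed by a load.  phiprop rewrites
    // this into a PHI over the loaded values, after which phiopt can form
    // MAX_EXPR and DSE can delete the now-dead stack stores.
    return max_ref(doubled, wanted);
  }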

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-05-18 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #11 from Jan Hubicka  ---
I got -fprofile-use builds working, and with profile feedback we peel the
innermost loop 8 times, which actually gets it off the hottest spot.
We also get more selective about what to inline (we do not inline cold calls),
which may make the dependency on SRA even worse.

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-05-17 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #10 from Jan Hubicka  ---
Actually vectorization hurts on both compilers, and a bit more with clang.
It seems that all important loops are hand-vectorized, and since register
pressure is a problem, vectorizing other loops causes enough collateral
damage to register allocation to regress performance.

I believe the core of the problem (or at least one of them) is simply the way
we compile loops popping data from a std::vector-based stack; see PR109849.
We keep updating the stack data structure in the innermost loop because in
the not too common case a reallocation needs to be done, and that is done by
offlined code.
[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-05-17 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #9 from Jakub Jelinek  ---
(In reply to JuzheZhong from comment #7)
> It seems that Clang has better performance than GCC in case of no vectorizer?

That is a very general statement.  On some particular code, some particular
arch, with some particular flags Clang performs better than GCC; on other
code it is the other way around; on some it is a wash.  How the compilers
perform on larger amounts of code can be seen from standard benchmarks like
SPEC.  The Phoronix benchmark suite is known not to be a very good benchmark
for various reasons, but that doesn't mean it isn't worth looking at.

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-05-17 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #8 from Jan Hubicka  ---
Created attachment 55101
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55101&action=edit
hottest loop

The jpegxl build machinery adds -fno-vectorize and -fno-slp-vectorize to the
clang flags.  Adding -fno-tree-vectorize -fno-tree-slp-vectorize makes the
GCC-generated code more similar.  With this, most of the difference is caused
by FindBestPatchDictionary, or by FindTextLikePatches when that function is
not inlined.
  15.22%  cjxl libjxl.so.0.7.0   [.] jxl::(anonymous
namespace)::FindTextLikePatches 
  10.19%  cjxl libjxl.so.0.7.0   [.] jxl::FindBestPatchDictionary   
   5.27%  cjxl libjxl.so.0.7.0   [.] jxl::N_AVX2::QuantizeBlockAC   
   5.06%  cjxl libjxl.so.0.7.0   [.] jxl::N_AVX2::EstimateEntropy   
   4.82%  cjxl libjxl.so.0.7.0   [.] jxl::N_AVX2::EstimateEntropy   
   4.35%  cjxl libjxl.so.0.7.0   [.] jxl::N_AVX2::QuantizeBlockAC   
   4.21%  cjxl libjxl.so.0.7.0   [.] jxl::N_AVX2::(anonymous
namespace)::TransformFromPixels 
   3.87%  cjxl libjxl.so.0.7.0   [.] jxl::N_AVX2::(anonymous
namespace)::TransformFromPixels 
   3.78%  cjxl libjxl.so.0.7.0   [.] jxl::N_AVX2::FindBestMultiplier
   3.27%  cjxl libjxl.so.0.7.0   [.] jxl::N_AVX2::FindBestMultiplier

I think it is mostly register allocation not handling the internal loop quoted
above well.  I am attaching the preprocessed sources.

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-05-17 Thread juzhe.zhong at rivai dot ai via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

JuzheZhong  changed:

   What|Removed |Added

 CC||juzhe.zhong at rivai dot ai

--- Comment #7 from JuzheZhong  ---
(In reply to Jan Hubicka from comment #6)
> hottest loop in clang's profile is:
> [large source and assembly quote snipped; see comment #6 below]

It seems that Clang has better performance than GCC in case of no vectorizer?

[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16

2023-05-17 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811

--- Comment #6 from Jan Hubicka  ---
hottest loop in clang's profile is:
  for (size_t y = 0; y < opsin.ysize(); y++) {
for (size_t x = 0; x < opsin.xsize(); x++) {
  if (is_background_row[y * is_background_stride + x]) continue;
  cc.clear();
  stack.clear();
  stack.emplace_back(x, y);
  size_t min_x = x;
  size_t max_x = x;
  size_t min_y = y;
  size_t max_y = y;
  std::pair reference;
  bool found_border = false;
  bool all_similar = true;
  while (!stack.empty()) {
std::pair cur = stack.back();
stack.pop_back();
if (visited_row[cur.second * visited_stride + cur.first]) continue;
^^^
closed by this continue.
visited_row[cur.second * visited_stride + cur.first] = 1;
if (cur.first < min_x) min_x = cur.first;
if (cur.first > max_x) max_x = cur.first;
if (cur.second < min_y) min_y = cur.second;
if (cur.second > max_y) max_y = cur.second;
if (paint_ccs) {
  cc.push_back(cur);
}
for (int dx = -kSearchRadius; dx <= kSearchRadius; dx++) {
  for (int dy = -kSearchRadius; dy <= kSearchRadius; dy++) {
if (dx == 0 && dy == 0) continue;
int next_first = static_cast<int>(cur.first) + dx;
int next_second = static_cast<int>(cur.second) + dy;
if (next_first < 0 || next_second < 0 ||
static_cast<size_t>(next_first) >= opsin.xsize() ||
static_cast<size_t>(next_second) >= opsin.ysize()) {
  continue;
}
std::pair next{next_first, next_second};
if (!is_background_row[next.second * is_background_stride +
   next.first]) {
  stack.push_back(next);
} else {
  if (!found_border) {
reference = next;
found_border = true;
  } else {
if (!is_similar_b(next, reference)) all_similar = false;
  }
}
  }
}
  }
  if (!found_border || !all_similar || max_x - min_x >= kMaxPatchSize ||
  max_y - min_y >= kMaxPatchSize) {
continue;
  }
  size_t bpos = background_stride * reference.second + reference.first;
  float ref[3] = {background_rows[0][bpos], background_rows[1][bpos],
  background_rows[2][bpos]};
  bool has_similar = false;
  for (size_t iy = std::max(
   static_cast(min_y) - kHasSimilarRadius, 0);
   iy < std::min(max_y + kHasSimilarRadius + 1, opsin.ysize()); iy++) {
for (size_t ix = std::max(
 static_cast(min_x) - kHasSimilarRadius, 0);
 ix < std::min(max_x + kHasSimilarRadius + 1, opsin.xsize());
 ix++) {
  size_t opos = opsin_stride * iy + ix;
  float px[3] = {opsin_rows[0][opos], opsin_rows[1][opos],
 opsin_rows[2][opos]};
  if (pci.is_similar_v(ref, px, kHasSimilarThreshold)) {
has_similar = true;
  }
}
  }
  if (!has_similar) continue;
  info.emplace_back();
  info.back().second.emplace_back(min_x, min_y);
  QuantizedPatch& patch = info.back().first;
  patch.xsize = max_x - min_x + 1;
  patch.ysize = max_y - min_y + 1;
  int max_value = 0;
  for (size_t c : {1, 0, 2}) {
for (size_t iy = min_y; iy <= max_y; iy++) {
  for (size_t ix = min_x; ix <= max_x; ix++) {
size_t offset = (iy - min_y) * patch.xsize + ix - min_x;
patch.fpixels[c][offset] =
opsin_rows[c][iy * opsin_stride + ix] - ref[c];
int val = pci.Quantize(patch.fpixels[c][offset], c);
patch.pixels[c][offset] = val;
if (std::abs(val) > max_value) max_value = std::abs(val);
  }
}
  }
  if (max_value < kMinPeak) {
info.pop_back();
continue;
  }
  if (paint_ccs) {
float cc_color = rng.UniformF(0.5, 1.0);
for (std::pair p : cc) {
  ccs.Row(p.second)[p.first] = cc_color;
}
  }
}
  }

I guess such a large loop nest, with the hottest loop not being the innermost,
is bad for register pressure.
Clang's code is:
  0.02 │1196:┌─→cmp  %r10,-0xb8(%rbp)
       │     │ jxl::FindBestPatchDictionary(jxl::Image3 const&, jxl::PassesEncoderState*, JxlCmsInterface co
       │     │ while (!stack.empty()) {
  1.39 │     │ ↓ je   1690
       │     │ std::pair cur = stack.back();
       │11a3:│  mov  -0x8(%r10),%rbx