[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #10 from Jan Hubicka  ---
runtimes on zen4 hardware.

trunk -O3 -flto -march=native
42171
42964
42106
clang -O3 -flto -march=native
37393
37423
37508
gcc 13 -O3 -flto -march=native
42380
42314
43285

So it seems the performance did not change.
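To put the three runs per compiler side by side, the arithmetic can be sketched as below. This is only an illustration: the helper names (`mean_runtime`, `slowdown_percent`) are made up here, and the unit of the measurements is not stated in the comment, so only ratios are meaningful.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n run times (unit as reported in the comment; not stated there). */
static double mean_runtime(const long *runs, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += (double)runs[i];
    return sum / (double)n;
}

/* How much slower `a` is than `b`, in percent (negative means faster). */
static double slowdown_percent(double a, double b)
{
    return (a / b - 1.0) * 100.0;
}

/* The three measurements per compiler, copied from the comment above. */
static const long trunk_runs[] = { 42171, 42964, 42106 };
static const long clang_runs[] = { 37393, 37423, 37508 };
static const long gcc13_runs[] = { 42380, 42314, 43285 };

/* Sanity check: trunk and gcc 13 are both slower than clang on average. */
static void check_gcc_slower_than_clang(void)
{
    double clang = mean_runtime(clang_runs, 3);
    assert(slowdown_percent(mean_runtime(trunk_runs, 3), clang) > 0.0);
    assert(slowdown_percent(mean_runtime(gcc13_runs, 3), clang) > 0.0);
}
```

With these numbers, the trunk mean is roughly 13% above the clang mean, and trunk and gcc 13 are within about 1% of each other, which matches the conclusion that performance did not change.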

[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-11-06 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #9 from Andrew Pinski  ---
I should note that PR 112416 is not needed to vectorize the loop, though it
would improve code.

[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-11-06 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |112418

--- Comment #8 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #5)
> After fixing PR 112324 (and a secondary patch to phiopt to do
> factor_out_conditional_operation for all phi nodes rather than just a single
> one) we still miss the abs detection:

Filed PR 112418 for the secondary patch.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112418
[Bug 112418] factor_out_conditional_operation could be done for more phis
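
As a rough source-level picture of what factoring a conditional operation out of a PHI means, consider the sketch below. It is not GCC's implementation: the function names are hypothetical, and it only shows the shape of the transformation, where the same conversion applied in both arms of a conditional is moved after the selection. The secondary patch discussed above is about doing this for every PHI node in the block rather than only a single one.

```c
#include <assert.h>

/* Before: each arm of the conditional applies the same conversion. */
static long convert_in_each_arm(int cond, int a, int b)
{
    long r;
    if (cond)
        r = (long)a;
    else
        r = (long)b;
    return r;
}

/* After: select the operand first, then convert once. */
static long convert_after_select(int cond, int a, int b)
{
    int t = cond ? a : b;
    return (long)t;
}

/* Both shapes must compute the same value for either branch. */
static void check_factoring_preserves_result(void)
{
    assert(convert_in_each_arm(1, -3, 8) == convert_after_select(1, -3, 8));
    assert(convert_in_each_arm(0, -3, 8) == convert_after_select(0, -3, 8));
}
```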

[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-11-06 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
           Severity|normal                      |enhancement
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-11-07

--- Comment #7 from Andrew Pinski  ---
Confirmed.

[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-11-06 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |112416

--- Comment #6 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #5)
> 
> `a < 0 ? (int)-(unsigned)a : a` needs to be detected as `(int)ABSU_EXPR`.

Filed PR 112416 for that.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112416
[Bug 112416] absu is not detected

[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-11-06 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #5 from Andrew Pinski  ---
After fixing PR 112324 (and a secondary patch to phiopt to do
factor_out_conditional_operation for all phi nodes rather than just a single
one) we still miss the abs detection:

  _34 = tmp_24 < 0;
  _55 = (unsigned int) tmp_24;
  _56 = -_55;
  _1 = (intD.6) _56;
  _30 = _1 | -2147483648;
  iftmp.0_26 = (unsigned intD.9) _30;
  # .MEM_27 = VDEF <.MEM_46>
  # USE = anything
  # CLB = anything
  .MASK_STORE (datap_43, 8B, _34, iftmp.0_26);
  # RANGE [irange] int [0, +INF]
  _25 = _34 ? _1 : tmp_24;

basically

`a < 0 ? (int)-(unsigned)a : a` needs to be detected as `(int)ABSU_EXPR`.
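
The pattern can be illustrated in plain C as below. This is a sketch, not compiler code: `abs_via_unsigned` mirrors the expression from the dump, and `absu` is a hypothetical helper standing in for what ABSU_EXPR computes (the absolute value as an unsigned result). INT_MIN is left out of the samples because converting its unsigned absolute value back to `int` is implementation-defined.

```c
#include <assert.h>
#include <limits.h>

/* The form seen in the dump: negate in unsigned arithmetic, convert back. */
static int abs_via_unsigned(int a)
{
    return a < 0 ? (int)-(unsigned)a : a;
}

/* Unsigned absolute value; the negation cannot overflow in unsigned math. */
static unsigned absu(int a)
{
    return a < 0 ? -(unsigned)a : (unsigned)a;
}

/* Checks that both forms agree on a few sample inputs. */
static void check_absu_forms_agree(void)
{
    static const int samples[] = { 0, 1, -1, 42, -42, INT_MAX, -INT_MAX };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(abs_via_unsigned(samples[i]) == (int)absu(samples[i]));
}
```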

[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-10-31 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #4 from Hongtao.liu  ---
> So here we have a reduction for MAX_EXPR, but there are 2 MAX_EXPRs which can
> be merged together with MAX_EXPR >
> 
Created PR 112324.

[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-10-31 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #3 from Hongtao.liu  ---
test.c:85:23: note:   vect_is_simple_use: operand max_38 = PHI , type of def: unknown
test.c:85:23: missed:   Unsupported pattern.
test.c:62:24: missed:   not vectorized: unsupported use in stmt.
test.c:85:23: missed:  unexpected pattern.
test.c:85:23: note:  * Analysis  failed with vector mode V8SI
test.c:85:23: note:  * The result for vector mode V32QI would be the same
test.c:85:23: missed: couldn't vectorize loop
test.c:65:13: note: vectorized 0 loops in function.
Removing basic block 5
;; basic block 5, loop depth 2
;;  pred:   16
;;  43
# max_38 = PHI
# i_42 = PHI
# datap_44 = PHI
tmp_24 = *datap_44;
_35 = tmp_24 < 0;
_56 = (unsigned int) tmp_24;
_51 = -_56;
_1 = (int) _51;
_25 = MAX_EXPR <_1, max_38>;
_31 = _1 | -2147483648;
iftmp.0_27 = (unsigned int) _31;
.MASK_STORE (datap_44, 8B, _35, iftmp.0_27);
_26 = MAX_EXPR ;
max_5 = _35 ? _25 : _26;
i_29 = i_42 + 1;
datap_30 = datap_44 + 4;
if (w_22 > i_29)
  goto ; [89.00%]
else
  goto ; [11.00%]
;;  succ:   16

So here we have a reduction for MAX_EXPR, but there are 2 MAX_EXPRs which can be
merged together with MAX_EXPR >

If I manually change the loop as below, it can be vectorized.

for (j = 0; j < t1->h; ++j) {
    const OPJ_UINT32 w = t1->w;
    for (i = 0; i < w; ++i, ++datap) {
        OPJ_INT32 tmp = *datap;
        if (tmp < 0) {
            OPJ_UINT32 tmp_unsigned;
            tmp_unsigned = opj_to_smr(tmp);
            memcpy(datap, &tmp_unsigned, sizeof(OPJ_INT32));
            tmp = -tmp;
        }
        max = opj_int_max(max, tmp);
    }
}

maybe it's related to phiopt?
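
To check that the rewrite keeps the same behaviour, the two loop shapes can be compared on a flat array, as sketched below. This is a simplified stand-in for the openjpeg code, not the code itself: `TO_SMR`, `int_max`, `scan_two_max` and `scan_one_max` are hypothetical names, a 32-bit `int` is assumed, and INT32_MIN is avoided in the inputs because negating it overflows.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sign-magnitude conversion modelled on the test case; assumes 32-bit int. */
#define TO_SMR(x) ((x) >= 0 ? (uint32_t)(x) : ((uint32_t)(-(x)) | 0x80000000U))

static int32_t int_max(int32_t a, int32_t b) { return a > b ? a : b; }

/* Original shape: one max update in each branch. */
static int32_t scan_two_max(int32_t *data, size_t n)
{
    int32_t max = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t tmp = data[i];
        if (tmp < 0) {
            uint32_t smr;
            max = int_max(max, -tmp);
            smr = TO_SMR(tmp);
            memcpy(&data[i], &smr, sizeof smr);
        } else {
            max = int_max(max, tmp);
        }
    }
    return max;
}

/* Rewritten shape: negate inside the branch, single max update after it. */
static int32_t scan_one_max(int32_t *data, size_t n)
{
    int32_t max = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t tmp = data[i];
        if (tmp < 0) {
            uint32_t smr = TO_SMR(tmp);
            memcpy(&data[i], &smr, sizeof smr);
            tmp = -tmp;
        }
        max = int_max(max, tmp);
    }
    return max;
}

/* Both shapes must return the same maximum and store the same data. */
static void check_scans_agree(void)
{
    int32_t a[] = { 3, -7, 0, 5, -2 };
    int32_t b[] = { 3, -7, 0, 5, -2 };
    assert(scan_two_max(a, 5) == scan_one_max(b, 5));
    assert(memcmp(a, b, sizeof a) == 0);
}
```

The second shape leaves a single MAX_EXPR reduction in the loop body, which is the form the vectorizer accepts.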

[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-10-31 Thread zhangjungcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

jun zhang  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |zhangjungcc at gmail dot com

--- Comment #2 from jun zhang  ---
The following loop is not vectorized by GCC, but it is by LLVM, which gives a 3%
improvement.
For more information, please refer to: https://godbolt.org/z/zMbjq41h5

#include <string.h>
typedef signed int  OPJ_INT32;
typedef unsigned int OPJ_UINT32;
typedef int OPJ_BOOL;
#define OPJ_TRUE 1
#define OPJ_FALSE 0
typedef char  OPJ_CHAR;
typedef float OPJ_FLOAT32;
typedef double OPJ_FLOAT64;
typedef unsigned char OPJ_BYTE;
#define T1_NMSEDEC_FRACBITS 6
#define OPJ_RESTRICT restrict
#define OPJ_TLS_KEY_T1  0
#include <stddef.h>
typedef size_t   OPJ_SIZE_T;

typedef struct opj_tcd_cblk_enc {
    OPJ_BYTE* data;               /* Data */
    //opj_tcd_layer_t* layers;    /* layer information */
    //opj_tcd_pass_t* passes;     /* information about the passes */
    OPJ_INT32 x0, y0, x1,
              y1;                 /* dimension of the code-blocks : left upper corner (x0, y0) right low corner (x1,y1) */
    OPJ_UINT32 numbps;
    OPJ_UINT32 numlenbits;
    OPJ_UINT32 data_size;         /* Size of allocated data buffer */
    OPJ_UINT32 numpasses;         /* number of pass already done for the code-blocks */
    OPJ_UINT32 numpassesinlayers; /* number of passes in the layer */
    OPJ_UINT32 totalpasses;       /* total number of passes */
} opj_tcd_cblk_enc_t;
typedef struct opj_t1 {

    /** MQC component */
    //opj_mqc_t mqc;

    OPJ_INT32  *data;
    /** Flags used by decoder and encoder.
     * Such that flags[1+0] is for state of col=0,row=0..3,
       flags[1+1] for col=1, row=0..3, flags[1+flags_stride] for col=0,row=4..7, ...
       This array avoids too much cache thrashing when processing by 4 vertical samples
       as done in the various decoding steps. */
    //opj_flag_t *flags;

    OPJ_UINT32 w;
    OPJ_UINT32 h;
    OPJ_UINT32 datasize;
    OPJ_UINT32 flagssize;
    OPJ_BOOL   encoder;

    /* The 3 variables below are only used by the decoder */
    /* set to TRUE in multithreaded context */
    OPJ_BOOL     mustuse_cblkdatabuffer;
    /* Temporary buffer to concatenate all chunks of a codeblock */
    OPJ_BYTE*    cblkdatabuffer;
    /* Maximum size available in cblkdatabuffer */
    OPJ_UINT32   cblkdatabuffersize;
} opj_t1_t;

#define INLINE __inline__
static INLINE OPJ_INT32 opj_int_max(OPJ_INT32 a, OPJ_INT32 b)
{
    return (a > b) ? a : b;
}
#define opj_to_smr(x)   ((x) >= 0 ? (OPJ_UINT32)(x) : ((OPJ_UINT32)(-x) | 0x80000000U))
OPJ_FLOAT64 opj_t1_encode_cblk(opj_t1_t *t1,
                               opj_tcd_cblk_enc_t* cblk,
                               OPJ_UINT32 orient,
                               OPJ_UINT32 compno,
                               OPJ_UINT32 level,
                               OPJ_UINT32 qmfbid,
                               OPJ_FLOAT64 stepsize,
                               OPJ_UINT32 cblksty,
                               OPJ_UINT32 numcomps,
                               const OPJ_FLOAT64 * mct_norms,
                               OPJ_UINT32 mct_numcomps)
{
    OPJ_INT32 max;
    OPJ_UINT32 i, j;
    OPJ_INT32* datap;

    max = 0;
    datap = t1->data;
    for (j = 0; j < t1->h; ++j) {
        const OPJ_UINT32 w = t1->w;
        for (i = 0; i < w; ++i, ++datap) {
            OPJ_INT32 tmp = *datap;
            if (tmp < 0) {
                OPJ_UINT32 tmp_unsigned;
                max = opj_int_max(max, -tmp);
                tmp_unsigned = opj_to_smr(tmp);
                memcpy(datap, &tmp_unsigned, sizeof(OPJ_INT32));
            } else {
                max = opj_int_max(max, tmp);
            }
        }
    }
    cblk->numbps = max ? 6 : 0;
}

[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16

2023-05-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #1 from Jan Hubicka  ---
opj_t1_enc_refpass is not inlined due to large function growth, and some others
are not inlined due to max-inline-insns-auto.  With inlining forced I get this
profile:

  87.35%   opj_t1_cblk_encode_processor
   6.22%  opj_dwt_encode_and_deinterleave_v.lto_priv.0
   1.80%  opj_mqc_byteout
   1.50%  opj_dwt_encode_and_deinterleave_h_one_row.lto_priv.0

So pretty much the same profile as for clang. However, the runtime is still 45573 with
-O3 -flto -march=native -fno-semantic-interposition --param
large-function-insns=100  --param max-inline-insns-auto=5

So it does not seem to be missing IPA optimizations.

There are a number of conditional moves in the clang code; -mbranch-cost helps a
bit, but not enough.