[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #10 from Jan Hubicka ---
Runtimes on zen4 hardware:

trunk -O3 -flto -march=native
42171
42964
42106
clang -O3 -flto -march=native
37393
37423
37508
gcc 13 -O3 -flto -march=native
42380
42314
43285

So it seems the performance did not change.
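To put the three sets of runtimes side by side, one can average each set and express the GCC results relative to clang. The sketch below does only that arithmetic on the numbers quoted in the comment; the helper names (`mean3`, `slowdown_pct`, `check_runtime_summary`) are made up for illustration and are not part of any tool used in the bug report.

```c
#include <assert.h>

/* Hypothetical helper: arithmetic mean of three runs. */
static double mean3(const double r[3])
{
    return (r[0] + r[1] + r[2]) / 3.0;
}

/* Hypothetical helper: how much slower `slow` is than `fast`, in percent. */
static double slowdown_pct(double slow, double fast)
{
    return (slow / fast - 1.0) * 100.0;
}

/* Self-check using the runtimes quoted in comment #10. */
static void check_runtime_summary(void)
{
    const double trunk[3] = { 42171, 42964, 42106 };
    const double clang[3] = { 37393, 37423, 37508 };
    const double gcc13[3] = { 42380, 42314, 43285 };

    double trunk_vs_clang = slowdown_pct(mean3(trunk), mean3(clang));
    double gcc13_vs_clang = slowdown_pct(mean3(gcc13), mean3(clang));

    /* Both GCC builds land between 13% and 14% slower than clang. */
    assert(trunk_vs_clang > 13.0 && trunk_vs_clang < 14.0);
    assert(gcc13_vs_clang > 13.0 && gcc13_vs_clang < 14.0);
}
```

Calling `check_runtime_summary()` from a `main()` confirms that trunk and gcc 13 are both roughly 13% slower than clang on these runs, which is consistent with the remark that performance did not change between gcc 13 and trunk.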
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015 --- Comment #9 from Andrew Pinski --- I should note that PR 112416 is not needed to vectorize the loop, though it would improve code.
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |112418

--- Comment #8 from Andrew Pinski ---
(In reply to Andrew Pinski from comment #5)
> After fixing PR 112324 (and a secondary patch to phiopt to do
> factor_out_conditional_operation for all phi nodes rather than just a single
> one) we still miss the abs detection:

Filed PR 112418 for the secondary patch.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112418
[Bug 112418] factor_out_conditional_operation could be done for more phis
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
           Severity|normal                      |enhancement
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-11-07

--- Comment #7 from Andrew Pinski ---
Confirmed.
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |112416

--- Comment #6 from Andrew Pinski ---
(In reply to Andrew Pinski from comment #5)
>
> `a < 0 ? (int)-(unsigned)a : a` needs to be detected to be `(int)ABSU_EXPR`.

Filed PR 112416 for that.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112416
[Bug 112416] absu is not detected
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #5 from Andrew Pinski ---
After fixing PR 112324 (and a secondary patch to phiopt to do
factor_out_conditional_operation for all phi nodes rather than just a single
one) we still miss the abs detection:

  _34 = tmp_24 < 0;
  _55 = (unsigned int) tmp_24;
  _56 = -_55;
  _1 = (intD.6) _56;
  _30 = _1 | -2147483648;
  iftmp.0_26 = (unsigned intD.9) _30;
  # .MEM_27 = VDEF <.MEM_46>
  # USE = anything
  # CLB = anything
  .MASK_STORE (datap_43, 8B, _34, iftmp.0_26);
  # RANGE [irange] int [0, +INF]
  _25 = _34 ? _1 : tmp_24;

Basically, `a < 0 ? (int)-(unsigned)a : a` needs to be detected to be
`(int)ABSU_EXPR`.
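The equivalence being asked for can be illustrated in plain C. The sketch below is not GCC code: `abs_cond_form` spells out the conditional that remains in the dump, and `absu_ref` is a hypothetical stand-in for an unsigned absolute value, which is what ABSU_EXPR computes. INT_MIN is left out of the samples on purpose, since converting 2147483648u back to `int` is implementation-defined in ISO C.

```c
#include <assert.h>
#include <limits.h>

/* The form left in the loop: negate in unsigned arithmetic, cast back. */
static int abs_cond_form(int a)
{
    return a < 0 ? (int)-(unsigned)a : a;
}

/* Hypothetical reference: absolute value computed as an unsigned result. */
static unsigned absu_ref(int a)
{
    return a < 0 ? 0u - (unsigned)a : (unsigned)a;
}

/* Self-check: the conditional form equals the int cast of the unsigned abs. */
static void check_abs_forms(void)
{
    static const int samples[] = { 0, 1, -1, 42, -42, INT_MAX, -INT_MAX };
    unsigned i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(abs_cond_form(samples[i]) == (int)absu_ref(samples[i]));
}
```

Once the conditional is recognized as a single absolute-value operation, the select `_34 ? _1 : tmp_24` no longer has to be carried through the loop body separately.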
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #4 from Hongtao.liu ---
> So here we have a reduction for MAX_EXPR, but there's 2 MAX_EXPR which can
> be merge together with MAX_EXPR
>

Created PR 112324.
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #3 from Hongtao.liu ---
test.c:85:23: note: vect_is_simple_use: operand max_38 = PHI , type of def: unknown
test.c:85:23: missed: Unsupported pattern.
test.c:62:24: missed: not vectorized: unsupported use in stmt.
test.c:85:23: missed: unexpected pattern.
test.c:85:23: note: * Analysis failed with vector mode V8SI
test.c:85:23: note: * The result for vector mode V32QI would be the same
test.c:85:23: missed: couldn't vectorize loop
test.c:65:13: note: vectorized 0 loops in function.
Removing basic block 5
;; basic block 5, loop depth 2
;; pred: 16
;; 43
# max_38 = PHI
# i_42 = PHI
# datap_44 = PHI
tmp_24 = *datap_44;
_35 = tmp_24 < 0;
_56 = (unsigned int) tmp_24;
_51 = -_56;
_1 = (int) _51;
_25 = MAX_EXPR <_1, max_38>;
_31 = _1 | -2147483648;
iftmp.0_27 = (unsigned int) _31;
.MASK_STORE (datap_44, 8B, _35, iftmp.0_27);
_26 = MAX_EXPR ;
max_5 = _35 ? _25 : _26;
i_29 = i_42 + 1;
datap_30 = datap_44 + 4;
if (w_22 > i_29)
  goto ; [89.00%]
else
  goto ; [11.00%]
;; succ: 16

So here we have a reduction for MAX_EXPR, but there are 2 MAX_EXPRs which can
be merged together into one MAX_EXPR.

If I manually change the loop to the one below, it can be vectorized:

    for (j = 0; j < t1->h; ++j) {
        const OPJ_UINT32 w = t1->w;
        for (i = 0; i < w; ++i, ++datap) {
            OPJ_INT32 tmp = *datap;
            if (tmp < 0) {
                OPJ_UINT32 tmp_unsigned;
                tmp_unsigned = opj_to_smr(tmp);
                memcpy(datap, &tmp_unsigned, sizeof(OPJ_INT32));
                tmp = -tmp;
            }
            max = opj_int_max(max, tmp);
        }
    }

Maybe it's related to phiopt?
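That the manual rewrite preserves behavior can be checked outside the compiler. The following is a minimal sketch, assuming a flat array instead of the `opj_t1_t` structure and using hypothetical names (`to_smr`, `max_i32`, `scan_two_max`, `scan_one_max`); the sign bit `0x80000000U` is taken from the `| -2147483648` in the dump. Inputs are assumed not to contain INT32_MIN, whose negation would overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sign-magnitude conversion, mirroring the `| -2147483648` above. */
#define to_smr(x) ((x) >= 0 ? (uint32_t)(x) : ((uint32_t)(-(x)) | 0x80000000U))

static int32_t max_i32(int32_t a, int32_t b) { return a > b ? a : b; }

/* Original shape: the max update appears on both sides of the branch. */
static int32_t scan_two_max(int32_t *data, size_t n)
{
    int32_t max = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t tmp = data[i];
        if (tmp < 0) {
            uint32_t u;
            max = max_i32(max, -tmp);
            u = to_smr(tmp);
            memcpy(&data[i], &u, sizeof u);
        } else {
            max = max_i32(max, tmp);
        }
    }
    return max;
}

/* Rewritten shape: the branch only fixes up tmp, then one max update follows. */
static int32_t scan_one_max(int32_t *data, size_t n)
{
    int32_t max = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t tmp = data[i];
        if (tmp < 0) {
            uint32_t u = to_smr(tmp);
            memcpy(&data[i], &u, sizeof u);
            tmp = -tmp;
        }
        max = max_i32(max, tmp);
    }
    return max;
}

/* Self-check: both shapes return the same max and store the same data. */
static void check_scans_agree(void)
{
    int32_t a[] = { 3, -7, 0, 12, -12, 5 };
    int32_t b[sizeof a / sizeof a[0]];

    memcpy(b, a, sizeof a);
    assert(scan_two_max(a, 6) == scan_one_max(b, 6));
    assert(memcmp(a, b, sizeof a) == 0);
}
```

The second form leaves a single MAX reduction in the loop body, which is the shape the vectorizer already handles; the first form is what the compiler would need to merge on its own.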
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

jun zhang changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |zhangjungcc at gmail dot com

--- Comment #2 from jun zhang ---
The following loop is not vectorized by gcc, but it is by llvm, which gains a
3% improvement from it. For more information, see:
https://godbolt.org/z/zMbjq41h5

    #include <string.h>
    typedef signed int OPJ_INT32;
    typedef unsigned int OPJ_UINT32;
    typedef int OPJ_BOOL;
    #define OPJ_TRUE 1
    #define OPJ_FALSE 0
    typedef char OPJ_CHAR;
    typedef float OPJ_FLOAT32;
    typedef double OPJ_FLOAT64;
    typedef unsigned char OPJ_BYTE;
    #define T1_NMSEDEC_FRACBITS 6
    #define OPJ_RESTRICT restrict
    #define OPJ_TLS_KEY_T1 0
    #include <stddef.h>
    typedef size_t OPJ_SIZE_T;

    typedef struct opj_tcd_cblk_enc {
        OPJ_BYTE* data;               /* Data */
        //opj_tcd_layer_t* layers;      /* layer information */
        //opj_tcd_pass_t* passes;       /* information about the passes */
        OPJ_INT32 x0, y0, x1, y1;     /* dimension of the code-blocks : left upper corner (x0, y0) right low corner (x1,y1) */
        OPJ_UINT32 numbps;
        OPJ_UINT32 numlenbits;
        OPJ_UINT32 data_size;         /* Size of allocated data buffer */
        OPJ_UINT32 numpasses;         /* number of pass already done for the code-blocks */
        OPJ_UINT32 numpassesinlayers; /* number of passes in the layer */
        OPJ_UINT32 totalpasses;       /* total number of passes */
    } opj_tcd_cblk_enc_t;

    typedef struct opj_t1 {
        /** MQC component */
        //opj_mqc_t mqc;

        OPJ_INT32 *data;
        /** Flags used by decoder and encoder.
         * Such that flags[1+0] is for state of col=0,row=0..3,
           flags[1+1] for col=1, row=0..3, flags[1+flags_stride] for col=0,row=4..7, ...
           This array avoids too much cache trashing when processing by 4 vertical samples
           as done in the various decoding steps. */
        //opj_flag_t *flags;

        OPJ_UINT32 w;
        OPJ_UINT32 h;
        OPJ_UINT32 datasize;
        OPJ_UINT32 flagssize;
        OPJ_BOOL encoder;

        /* The 3 variables below are only used by the decoder */
        /* set to TRUE in multithreaded context */
        OPJ_BOOL mustuse_cblkdatabuffer;
        /* Temporary buffer to concatenate all chunks of a codeblock */
        OPJ_BYTE *cblkdatabuffer;
        /* Maximum size available in cblkdatabuffer */
        OPJ_UINT32 cblkdatabuffersize;
    } opj_t1_t;

    #define INLINE __inline__
    static INLINE OPJ_INT32 opj_int_max(OPJ_INT32 a, OPJ_INT32 b)
    {
        return (a > b) ? a : b;
    }

    #define opj_to_smr(x) ((x) >= 0 ? (OPJ_UINT32)(x) : ((OPJ_UINT32)(-x) | 0x8000U))

    OPJ_FLOAT64 opj_t1_encode_cblk(opj_t1_t *t1,
                                   opj_tcd_cblk_enc_t* cblk,
                                   OPJ_UINT32 orient,
                                   OPJ_UINT32 compno,
                                   OPJ_UINT32 level,
                                   OPJ_UINT32 qmfbid,
                                   OPJ_FLOAT64 stepsize,
                                   OPJ_UINT32 cblksty,
                                   OPJ_UINT32 numcomps,
                                   const OPJ_FLOAT64 * mct_norms,
                                   OPJ_UINT32 mct_numcomps)
    {
        OPJ_INT32 max;
        OPJ_UINT32 i, j;
        OPJ_INT32* datap;

        max = 0;
        datap = t1->data;
        for (j = 0; j < t1->h; ++j) {
            const OPJ_UINT32 w = t1->w;
            for (i = 0; i < w; ++i, ++datap) {
                OPJ_INT32 tmp = *datap;
                if (tmp < 0) {
                    OPJ_UINT32 tmp_unsigned;
                    max = opj_int_max(max, -tmp);
                    tmp_unsigned = opj_to_smr(tmp);
                    memcpy(datap, &tmp_unsigned, sizeof(OPJ_INT32));
                } else {
                    max = opj_int_max(max, tmp);
                }
            }
        }

        cblk->numbps = max ? 6 : 0;
    }
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015

--- Comment #1 from Jan Hubicka ---
opj_t1_enc_refpass is not inlined due to large function growth, and some
others are not inlined due to max-inline-insns-auto. With inlining forced I
get this profile:

  87.35%  opj_t1_cblk_encode_processor
   6.22%  opj_dwt_encode_and_deinterleave_v.lto_priv.0
   1.80%  opj_mqc_byteout
   1.50%  opj_dwt_encode_and_deinterleave_h_one_row.lto_priv.0

So it is pretty much the same profile as for clang. However, the runtime is
still 45573 with
-O3 -flto -march=native -fno-semantic-interposition --param large-function-insns=100 --param max-inline-insns-auto=5

So it does not seem to be missing IPA optimizations. There are a number of
conditional moves in the clang code; -mbranch-cost helps a bit, but not enough.