[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #23 from Hongtao.liu ---
> _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> _576, _125, _143, _161, _179};

The cost of vec_construct in the i386 backend is 64, calculated as 16 x 4, per this excerpt from i386.c:
---
/* N element inserts into SSE vectors. */
int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
---
From the perspective of pipeline latency, it seems OK, but from the perspective of rtx_cost, it seems inaccurate since it would be initialized as
---
vmovd %eax, %xmm0
vpinsrb $1, 1(%rsi), %xmm0, %xmm0
vmovd %eax, %xmm7
vpinsrb $1, 3(%rsi), %xmm7, %xmm7
vmovd %eax, %xmm3
vpinsrb $1, 17(%rsi), %xmm3, %xmm3
vmovd %eax, %xmm6
vpinsrb $1, 19(%rsi), %xmm6, %xmm6
vmovd %eax, %xmm1
vpinsrb $1, 33(%rsi), %xmm1, %xmm1
vmovd %eax, %xmm5
vpinsrb $1, 35(%rsi), %xmm5, %xmm5
vmovd %eax, %xmm2
vpinsrb $1, 49(%rsi), %xmm2, %xmm2
vmovd %eax, %xmm4
vpinsrb $1, 51(%rsi), %xmm4, %xmm4
vpunpcklwd %xmm6, %xmm3, %xmm3
vpunpcklwd %xmm4, %xmm2, %xmm2
vpunpcklwd %xmm7, %xmm0, %xmm0
vpunpcklwd %xmm5, %xmm1, %xmm1
vpunpckldq %xmm2, %xmm1, %xmm1
vpunpckldq %xmm3, %xmm0, %xmm0
vpunpcklqdq %xmm1, %xmm0, %xmm0
---
That is 16 "vector insert" + (4 + 2 + 1) "vector concat/permutation" operations, so the cost should be 92 (23 * 4).
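The arithmetic in this comment can be restated as a small C sketch. The names below are hypothetical, `SSE_OP_COST` stands in for `ix86_cost->sse_op` with the value 4 used in the comment, and the merge count of `nelts / 2 - 1` is an assumption generalized from the 16-element listing (4 + 2 + 1 = 7); it is not the backend's actual code.

```c
#include <assert.h>

/* Hypothetical stand-in for ix86_cost->sse_op; the comment uses 4. */
enum { SSE_OP_COST = 4 };

/* Cost as described for the current code: one SSE op per inserted element. */
int vec_construct_cost_current(int nelts)
{
    assert(nelts > 0);
    return nelts * SSE_OP_COST;
}

/* Cost that also counts the unpack tree. Assumption: elements are first
   paired into nelts/2 partial vectors, which are then merged pairwise,
   giving nelts/2 - 1 extra operations for a power-of-two nelts. */
int vec_construct_cost_with_merges(int nelts)
{
    assert(nelts >= 2 && (nelts & (nelts - 1)) == 0);
    int merges = nelts / 2 - 1;
    return (nelts + merges) * SSE_OP_COST;
}
```

For 16 elements the first function gives 64 and the second gives 92, matching the two figures quoted above.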
[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #22 from Hongtao.liu ---
> One of my workmates found that if we disable vectorization for SPEC2017
> 525.x264_r function sub4x4_dct in source file x264_src/common/dct.c with
> explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
> it can speed up by 4%.

For CLX, if we disable SLP vectorization in sub4x4_dct with __attribute__((optimize("no-tree-slp-vectorize"))), it also speeds up by 4%.

> Thanks Richi! Should we take care of this case? or neglect this kind of
> extension as "no instruction"? I was intent to handle it in target specific
> code, but it isn't recorded into cost vector while it seems too heavy to do
> the bb_info slp_instances revisits in finish_cost.

For the i386 backend, unsigned char --> unsigned short is not "no instruction", but in this case
---
_134 = MEM[(pixel *)pix1_295 + 2B];
_135 = (short unsigned int) _134;
---
it could be combined and optimized to
---
movzbl 19(%rcx), %r8d
---
So, if the "unsigned char" variable is loaded from memory, the conversion would also be "no instruction". I'm not sure whether the backend cost model can handle such a situation.
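A minimal C sketch of the load-plus-widen pattern discussed above. The function name and parameters are hypothetical. Whether a given compiler folds the conversion into the load depends on target and options; on x86-64 the two steps are typically emitted as a single movzbl, as in the comment.

```c
/* Load one 8-bit pixel from memory and widen it to 16 bits.
   The widening is written as a separate conversion in the source, but
   a zero-extending load can perform both steps at once. */
unsigned short load_pixel_widened(const unsigned char *pix, int offset)
{
    return (unsigned short) pix[offset];
}
```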
[Bug c/97215] Possible fread() malfunction of GCC 7.3.0 (Windows)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97215

--- Comment #5 from Georgi ---
(In reply to Andrew Pinski from comment #3)
> fopen/fread/fwrite do NOT come from GCC but, in this case, from mingw.

Ugh, thanks, I will alert them about this issue by giving the link to this tracker.
[Bug c/97215] Possible fread() malfunction of GCC 7.3.0 (Windows)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97215

--- Comment #4 from Georgi ---
(In reply to Andrew Pinski from comment #2)
> You need b if you don't want \r\n to be turned into just \n.

At line 11,945 I use:
```
if ((fp = fopen(argv[1], "rb")) == NULL) {
    printf("Nakamichi: Can't open '%s' file.\n", argv[1]);
    exit(13);
}
```
As far as I have investigated, the problem is that fread() reads fewer bytes (around 860) than specified; after decompression I see those bytes being ASCII 000?!
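Since fread() may return fewer items than requested, a defensive reader can loop and check the return value. The sketch below is a generic pattern, not a diagnosis of this report; `read_fully` and the 1 MiB chunk size are hypothetical choices.

```c
#include <stdio.h>
#include <stddef.h>

/* Read up to `want` bytes from `fp` into `buf` in bounded chunks.
   Returns the number of bytes actually stored; on a short count the
   caller can use feof()/ferror() to tell end-of-file from an error. */
size_t read_fully(FILE *fp, unsigned char *buf, size_t want)
{
    const size_t chunk = (size_t)1 << 20;   /* 1 MiB per call; arbitrary */
    size_t total = 0;

    while (total < want) {
        size_t ask = (want - total < chunk) ? (want - total) : chunk;
        size_t got = fread(buf + total, 1, ask, fp);
        total += got;
        if (got < ask)   /* EOF or error: stop and report what we have */
            break;
    }
    return total;
}
```

Comparing the returned count with the expected file size makes a short read visible at the point where it happens rather than later in a checksum.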
[Bug c/97215] Possible fread() malfunction of GCC 7.3.0 (Windows)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97215

--- Comment #3 from Andrew Pinski ---
fopen/fread/fwrite do NOT come from GCC but, in this case, from mingw.
[Bug c/97215] Possible fread() malfunction of GCC 7.3.0 (Windows)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97215

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #2 from Andrew Pinski ---
You need b if you don't want \r\n to be turned into just \n.
[Bug c/97215] Possible fread() malfunction of GCC 7.3.0 (Windows)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97215 --- Comment #1 from Georgi --- Oops, here are the mentioned files: www.sanmayce.com/Nakamichi/Satanichi_aka_Nakamichi_2020-Jun-09_BUG_ZEROED-END.zip
[Bug c/97215] New: Possible fread() malfunction of GCC 7.3.0 (Windows)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97215

            Bug ID: 97215
           Summary: Possible fread() malfunction of GCC 7.3.0 (Windows)
           Product: gcc
           Version: 7.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: sanmayce at sanmayce dot com
  Target Milestone: ---

Hi, a C coder here.

This concerns a C program that loads an entire 3.3GB file and checksums it.

First, I use Intel v15.0 and GCC v7.3.0 on 64-bit Windows.
To my dismay, I found that Intel's binary loads the file and reports the correct checksum, whereas GCC's binary fails; after comparing the loaded content I saw that GCC loads the whole file into a malloc-ed pool but without the last ~860 bytes?!

If you need to reproduce the issue, the two binaries (GCC and Intel) and the C source are here:
http://www.sanmayce.com/Nakamichi/Nakamichi_Kaidanji.zip

The file being loaded is the Human Genome:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_01405.28_GRCh38.p13/GCA_01405.28_GRCh38.p13_genomic.fna.gz

This bug never appeared with files 1GB or less in size; my guess is that this is a clue.

These are the files:
```
06/11/2020 09:16 AM 1,316,439 Nakamichi_Ryuugan-ditto-1TB_btree.c
06/15/2019 02:37 AM 3,313,087,324 NCBI_FTP_Homo_sapiens_(human)_GCA_01405.28_GRCh38.p13_genomic.fna
06/15/2019 02:37 AM 3,313,087,324 q
01/07/2018 05:26 PM 191,644 Satanichi_GCC730_64bit.exe
06/11/2020 09:16 AM 198,144 Satanichi_ICL150_64bit.exe
```

As you can see below, the same file is loaded differently into the malloc-ed pool:
```
D:\Satanichi_aka_Nakamichi_2020-Jun-09>Satanichi_GCC730_64bit.exe q w 20 888 i
...
Allocating Source-Buffer 3,159 MB ...
Allocating Target-Buffer 3,191 MB ...
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0xc1d4,3f7f
...
D:\Satanichi_aka_Nakamichi_2020-Jun-09>
D:\Satanichi_aka_Nakamichi_2020-Jun-09>Satanichi_ICL150_64bit.exe q w 20 888 i
Allocating Source-Buffer 3,159 MB ...
Allocating Target-Buffer 3,191 MB ...
Source-file-Hash(FNV1A_YoshimitsuTRIAD) = 0x81bd,fe4b
...
D:\Satanichi_aka_Nakamichi_2020-Jun-09>
```

If you need more info, I will add it...
I would very much like to know what causes this anomaly/bug.

Georgi
[Bug c/97208] [gcc 10.2.0] Microblaze regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97208

--- Comment #1 from Romain Naour ---
Hello,

I had to disable -ftree-loop-distribute-patterns while building the kernel on microblaze (using -Os).

The regression has appeared since commit [1], which moved -ftree-loop-distribute-patterns from the -O3 to the -O2 (-Os) optimization level.

I guess this behavior change should be documented in the gcc 10 changes page [2]?

[1] https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=5879ab5fafedc8f6f9bfe95a4cf8501b0df90edd
[2] https://gcc.gnu.org/gcc-10/changes.html

Best regards,
Romain
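For context, here is a sketch of the kind of loop this option targets. With -ftree-loop-distribute-patterns active, GCC may replace a plain zeroing loop like this one with a call to memset. The function name is hypothetical, and the comment above does not say which kernel code was affected.

```c
#include <stddef.h>

/* A plain byte-zeroing loop. When pattern distribution is enabled, the
   compiler may recognize it and emit a memset call instead of the loop. */
void clear_bytes(unsigned char *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = 0;
}
```

Building with -fno-tree-loop-distribute-patterns keeps the loop as written, which is the workaround the comment describes. One commonly cited hazard in freestanding code is a memset implementation that is itself written as such a loop; whether that is the cause here is not stated.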
[Bug fortran/97210] Intrinsic function get_team() does not work
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97210

kargl at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P4
                 CC|                            |kargl at gcc dot gnu.org
   Last reconfirmed|                            |2020-09-26
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from kargl at gcc dot gnu.org ---
It seems the implementation of get_team() was wrong from its first appearance in gfortran.
[Bug libstdc++/96817] __cxa_guard_acquire unsafe against dynamically loaded pthread
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96817

--- Comment #17 from CVS Commits ---
The master branch has been updated by Jonathan Wakely:

https://gcc.gnu.org/g:e6923541fae5081b646f240d54de2a32e17a0382

commit r11-3484-ge6923541fae5081b646f240d54de2a32e17a0382
Author: Jonathan Wakely
Date:   Sat Sep 26 20:32:36 2020 +0100

    libstdc++: Use __libc_single_threaded to optimise atomics [PR 96817]

    Glibc 2.32 adds a global variable that says whether the process is
    single-threaded. We can use this to decide whether to elide atomic
    operations, as a more precise and reliable indicator than
    __gthread_active_p.

    This means that guard variables for statics and reference counting in
    shared_ptr can use less expensive, non-atomic ops even in processes that
    are linked to libpthread, as long as no threads have been created yet.
    It also means that we switch to using atomics if libpthread gets loaded
    later via dlopen (this still isn't supported in general, for other
    reasons).

    We can't use __libc_single_threaded to replace __gthread_active_p
    everywhere. If we replaced the uses of __gthread_active_p in std::mutex
    then we would elide the pthread_mutex_lock in the code below, but not
    the pthread_mutex_unlock:

      std::mutex m;
      m.lock();            // pthread_mutex_lock
      std::thread t([]{}); // __libc_single_threaded = false
      t.join();
      m.unlock();          // pthread_mutex_unlock

    We need the lock and unlock to use the same "is threading enabled"
    predicate, and similarly for init/destroy pairs for mutexes and
    condition variables, so that we don't try to release resources that
    were never acquired.

    There are other places that could use __libc_single_threaded, such as
    _Sp_locker in src/c++11/shared_ptr.cc and locale init functions, but
    they can be changed later.

    libstdc++-v3/ChangeLog:

            PR libstdc++/96817
            * include/ext/atomicity.h (__gnu_cxx::__is_single_threaded()):
            New function wrapping __libc_single_threaded if available.
            (__exchange_and_add_dispatch, __atomic_add_dispatch): Use it.
            * libsupc++/guard.cc (__cxa_guard_acquire, __cxa_guard_abort)
            (__cxa_guard_release): Likewise.
            * testsuite/18_support/96817.cc: New test.
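The dispatch idea described in the commit message can be sketched in C as follows. This is an illustration under stated assumptions, not the libstdc++ code: `is_single_threaded` and `exchange_and_add_sketch` are hypothetical names, the `__has_include` probe is one possible way to detect the glibc 2.32 header `<sys/single_threaded.h>`, and `__atomic_fetch_add` is the GCC/Clang builtin.

```c
#if defined(__has_include)
# if __has_include(<sys/single_threaded.h>)
#  include <sys/single_threaded.h>   /* glibc 2.32+: __libc_single_threaded */
#  define HAVE_LIBC_SINGLE_THREADED 1
# endif
#endif

/* Nonzero when the process is known to be single-threaded. Without the
   glibc indicator this conservatively assumes threads may exist. */
int is_single_threaded(void)
{
#ifdef HAVE_LIBC_SINGLE_THREADED
    return __libc_single_threaded;
#else
    return 0;
#endif
}

/* Add `val` to *mem and return the previous value, skipping the atomic
   read-modify-write when no other thread can observe the update. */
int exchange_and_add_sketch(int *mem, int val)
{
    if (is_single_threaded()) {
        int old = *mem;
        *mem += val;
        return old;
    }
    return __atomic_fetch_add(mem, val, __ATOMIC_ACQ_REL);
}
```

A reference-count increment is a self-contained operation, so evaluating the predicate on each call is safe. That differs from the lock/unlock pair in the commit message, where both halves must see the same answer.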
[Bug middle-end/94195] missing warning reading a smaller object via an lvalue of a larger type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94195

Dmitry G. Dyachenko changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dimhen at gmail dot com

--- Comment #3 from Dmitry G. Dyachenko ---
(In reply to CVS Commits from comment #2)
> The master branch has been updated by Martin Sebor:
>
> https://gcc.gnu.org/g:3f9a497d1b0dd9da87908a11b59bf364ad40ddca
>
> commit r11-3306-g3f9a497d1b0dd9da87908a11b59bf364ad40ddca
> Author: Martin Sebor
> Date: Sat Sep 19 17:47:29 2020 -0600
>
> Extend -Warray-bounds to detect out-of-bounds accesses to array
> parameters.
>
> gcc/ChangeLog:
>
> PR middle-end/82608
> PR middle-end/94195
> PR c/50584
> PR middle-end/84051
> * gimple-array-bounds.cc (get_base_decl): New function.
> (get_ref_size): New function.
> (trailing_array): New function.
> (array_bounds_checker::check_array_ref): Call them. Handle arrays
> declared in function parameters.
> (array_bounds_checker::check_mem_ref): Same. Handle references to
> dynamically allocated arrays.
>
> gcc/testsuite/ChangeLog:
>
> PR middle-end/82608
> PR middle-end/94195
> PR c/50584
> PR middle-end/84051
> * c-c++-common/Warray-bounds.c: Adjust.
> * gcc.dg/Wbuiltin-declaration-mismatch-9.c: Adjust.
> * gcc.dg/Warray-bounds-63.c: New test.
> * gcc.dg/Warray-bounds-64.c: New test.
> * gcc.dg/Warray-bounds-65.c: New test.
> * gcc.dg/Warray-bounds-66.c: New test.
> * gcc.dg/Warray-bounds-67.c: New test.

I am a bit confused: GCC now produces a warning, but the access is not outside the allocated memory.
Is this expected?

$ cat x.c
#include <stdlib.h>

struct S1 {
    unsigned x;
};
struct S {
    struct S1 s1;
    int z;
};
void f1()
{
    struct S *pS = (struct S*) calloc(sizeof(struct S1),1);
    if(pS->s1.x == 0)
        return;
    free(pS);
}

$ gcc -O2 -Wall -c x.i
x.c: In function 'f1':
x.c:18:8: warning: array subscript 'struct S[0]' is partly outside array bounds of 'unsigned char[4]' [-Warray-bounds]
   18 | if(pS->s1.x == 0)
      |^~
x.c:17:30: note: referencing an object of size 4 allocated by 'calloc'
   17 | struct S *pS = (struct S*) calloc(sizeof(struct S1),1);
      | ^~~
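For comparison, a sketch in which the pointer type matches the size of the allocation, so the accessed object is never larger than what calloc returned. This does not answer whether the warning above is expected; it only shows a form where the question does not arise. The function name is hypothetical.

```c
#include <stdlib.h>

struct S1 { unsigned x; };
struct S  { struct S1 s1; int z; };

/* Allocate only the leading struct S1 and access it through a pointer to
   struct S1. Returns 1 if the member is zero, 0 if not, -1 on allocation
   failure. */
int first_member_is_zero(void)
{
    struct S1 *p = calloc(1, sizeof *p);
    if (p == NULL)
        return -1;
    int is_zero = (p->x == 0);
    free(p);
    return is_zero;
}
```

The other consistent choice is to allocate sizeof(struct S) and keep the struct S pointer.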
[Bug target/97044] Undefined format macros because of include order on AIX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97044

--- Comment #5 from CVS Commits ---
The master branch has been updated by David Edelsohn:

https://gcc.gnu.org/g:081b3517b4df826ac917147eb906bbb8fc6528b1

commit r11-3482-g081b3517b4df826ac917147eb906bbb8fc6528b1
Author: David Edelsohn
Date:   Thu Sep 17 15:18:48 2020 +

    aix: Fix _STDC_FORMAT_MACROS in inttypes.h [PR97044]

    AIX protects the STDC Format Macros in a manner that can prevent the
    definition of the macros depending on the order of header inclusion.
    The protection of the macros was referenced in C99, removed in C11,
    and never specified in any C++ standard. Also, the macros are in the
    namespace reserved to the implementation (compiler) so the compiler
    is permitted to choose to inject those names.

    fixincludes/ChangeLog:

    2020-09-17  David Edelsohn

            PR target/97044
            * inclhack.def (aix_inttypes): New fix.
            * fixincl.x: Regenerate.
            * tests/base/sys/inttypes.h: New file.
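As an illustration of the macros in question, here is a small C sketch that uses a format macro from <inttypes.h>. It defines the guard macro before any header is included, using the `__STDC_FORMAT_MACROS` spelling from the C99 footnote; that spelling and the helper name `format_u64` are assumptions of this sketch, not taken from the commit.

```c
/* Define the guard before the first header is included, so the format
   macros are available regardless of later include order. */
#define __STDC_FORMAT_MACROS 1
#include <inttypes.h>
#include <stdio.h>

/* Format a 64-bit unsigned value portably; returns snprintf's result. */
int format_u64(char *out, size_t out_size, uint64_t value)
{
    return snprintf(out, out_size, "%" PRIu64, value);
}
```

In C the define is harmless. The include-order problem described above arises when a header pulls in inttypes.h before the guard macro has been seen.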
[Bug c++/97214] New: ICE in lookup_template_class_1, at cp/pt.c:9896
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97214

            Bug ID: 97214
           Summary: ICE in lookup_template_class_1, at cp/pt.c:9896
           Product: gcc
           Version: 10.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: sfranzen85 at hotmail dot com
  Target Milestone: ---

The following snippet reproduces an error I encountered with lambdas:

struct Foo {
    // void operator()(int) {}
    template <typename T>
    void operator()(T, T) {
        auto bar = [this](auto&& v){ operator()(v); };
    }
};

int main () {
    Foo{}(0,1);
    return 0;
}

Full error:
../src/main.cpp: In instantiation of ‘void Foo::operator()(T, T) [with T = int]’:
../src/main.cpp:13:14: required from here
../src/main.cpp:7:38: internal compiler error: in lookup_template_class_1, at cp/pt.c:9896
    7 | auto bar = [this](auto&& v){ operator()(v); };
      | ^~

Further observations:
* The error appears regardless of available operator()() overloads;
* It only appears if the function call is unqualified, e.g. (*this)(v) is fine if the overload exists.

A possibly related error is given using a version of Foo without a function template:

struct Foo {
    void operator()(int) {}
    void operator()(int a, int) {
        auto bar = [this](auto&& v){ operator()(v); };
        bar(a);
    }
};

../src/main.cpp: In instantiation of ‘Foo::operator()(int, int):: [with auto:1 = int&]’:
../src/main.cpp:8:14: required from here
../src/main.cpp:7:48: error: use of ‘Foo::operator()(int, int):: [with auto:1 = int&]’ before deduction of ‘auto’
    7 | auto bar = [this](auto&& v){ operator()(v); };
      | ~~^~~

This error similarly only appears with the unqualified call, and also disappears if the lambda has '-> void'.
[Bug libgomp/97213] OpenMP "if" is dramatically slower than code-level "if" - why?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97213

Thanassis Tsiodras changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #4 from Thanassis Tsiodras ---
Marking as resolved.
[Bug libgomp/97213] OpenMP "if" is dramatically slower than code-level "if" - why?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97213

--- Comment #3 from Jakub Jelinek ---
Note, I think a significant part of the speedup comes from tail recursion optimization, which will be prevented even with a mergeable task.
Computing Fibonacci this way is not efficient.
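To illustrate the efficiency remark, an iterative version computes the same value in a linear number of additions. This sketch is not part of the bug discussion; `fib_iterative` is a hypothetical name, and it keeps the recursive version's base case of returning the argument when it is below 2.

```c
/* Iterative Fibonacci: n - 1 additions instead of an exponential number
   of recursive calls. */
long fib_iterative(int n)
{
    long prev = 0;  /* fib(0) */
    long cur = 1;   /* fib(1) */

    if (n < 2)
        return n;   /* same base case as the recursive version */
    for (int i = 2; i <= n; i++) {
        long next = prev + cur;
        prev = cur;
        cur = next;
    }
    return cur;
}
```

For n = 45 this returns 1134903170, the value printed by the benchmark in the original report.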
[Bug libgomp/97213] OpenMP "if" is dramatically slower than code-level "if" - why?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97213 --- Comment #2 from Thanassis Tsiodras --- I see. I was not aware of "mergeable", TBH - thanks for pointing it out (it led me to reading about "data environments"). Thanks, Jakub.
[Bug libgomp/97213] OpenMP "if" is dramatically slower than code-level "if" - why?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97213

--- Comment #1 from Jakub Jelinek ---
Even with if(false), the implementation has to create a new data environment etc.
if(false) just means the task will be included, i.e. the generating task will only continue when the included task finishes, and the generating thread will execute the task.
You would also need to add the mergeable clause to let the implementation, for if(false), pretend the task directive wasn't there at all, but that is just an optimization option that GCC doesn't use right now (it would basically require copying the region once again).
Also, there is the overhead of the taskwait that you perform unconditionally at all levels.
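A sketch of what adding the mergeable clause looks like in source. As the comment notes, GCC does not currently exploit it, so this shows the syntax only and is not expected to change the timings. The function name and the `top` parameter are hypothetical, and each task writes its own variable here to sidestep the race on a shared accumulator.

```c
/* Recursive Fibonacci that defers tasks only at the level `top`.
   Compiled without -fopenmp, the pragmas are ignored and the function
   runs sequentially with the same result. */
long fib_tasks(int val, int top)
{
    if (val < 2)
        return val;

    long a = 0, b = 0;
    #pragma omp task shared(a) if(val == top) mergeable
    a = fib_tasks(val - 1, top);
    #pragma omp task shared(b) if(val == top) mergeable
    b = fib_tasks(val - 2, top);
    #pragma omp taskwait
    return a + b;
}
```

The taskwait still executes at every level, which is the second overhead the comment points out.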
[Bug libgomp/97213] New: OpenMP "if" is dramatically slower than code-level "if" - why?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97213

            Bug ID: 97213
           Summary: OpenMP "if" is dramatically slower than code-level "if" - why?
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ttsiodras at gmail dot com
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

In trying to understand how OpenMP `task` works, I did this benchmark:

#include <stdio.h>
#include

long fib(int val)
{
    if (val < 2)
        return val;
    long total = 0;
    {
        #pragma omp task shared(total) if(val==45)
        total += fib(val-1);
        #pragma omp task shared(total) if(val==45)
        total += fib(val-2);
        #pragma omp taskwait
    }
    return total;
}

int main()
{
    #pragma omp parallel
    #pragma omp single
    {
        long res = fib(45);
        printf("fib(45)=%ld\n", res);
    }
}

It's a simple Fibonacci calculation that only spawns two tasks at the top level of fib(45) - basically, one thread does fib(44), the other does fib(43), and the results are added and returned.

I know there's a chance for a race on the "+=" of the total - but that's not the point of this...

Here's the performance on my i5 laptop:

$ gcc -O2 with_openmp_if.c -fopenmp
$ time ./a.out
fib(45)=1134903170

real 1m4.244s
user 1m44.696s
sys 0m0.010s

64 seconds... Now compare this to the same code, but with the "if" moved from the OpenMP level to the user code level - i.e. this change in "fib":

long fib(int val)
{
    if (val < 2)
        return val;
    long total = 0;
    {
        if (val == 45) {
            #pragma omp task shared(total)
            total += fib(val-1);
            #pragma omp task shared(total)
            total += fib(val-2);
            #pragma omp taskwait
        } else
            return fib(val-1) + fib(val-2);
    }
    return total;
}

$ gcc -O2 with_normal_if.c -fopenmp
$ time ./a.out
fib(45)=1134903170

real 0m8.585s
user 0m14.021s
sys 0m0.011s

We go from 64 seconds down to 8.5 seconds.

Why? What does the OpenMP-level "if" do so differently that it causes an order of magnitude less performance?
[Bug fortran/96495] [gfortran] Composition of user-defined operators does not copy ALLOCATABLE property of derived type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96495

--- Comment #6 from CVS Commits ---
The master branch has been updated by Paul Thomas:

https://gcc.gnu.org/g:5b26b3b3f5c75a86a5a3e851866247ac7fcb6c8b

commit r11-3480-g5b26b3b3f5c75a86a5a3e851866247ac7fcb6c8b
Author: Paul Thomas
Date:   Sat Sep 26 12:32:35 2020 +0100

    Correct overwrite of alloc_comp_result_2.f90 in fix of PR96495.

    2020-26-09  Paul Thomas

    gcc/testsuite/
            PR fortran/96495
            * gfortran.dg/alloc_comp_result_2.f90 : Restore original.
            * gfortran.dg/alloc_comp_result_3.f90 : New test.
[Bug libgomp/97212] New: [OpenMP] 'depend' clause with 'target nowait' (!) + 'task' does not work
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97212

            Bug ID: 97212
           Summary: [OpenMP] 'depend' clause with 'target nowait' (!) + 'task' does not work
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: openmp, wrong-code
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: burnus at gcc dot gnu.org
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Created attachment 49274
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49274&action=edit
C testcase, run with -fopenmp

The SOLLVE_VV testcase
https://github.com/SOLLVE/sollve_vv/blob/master/tests/4.5/task/test_target_and_task_nowait.c
FAILS.

Note: It also fails with a compiler which is not even configured for offloading, in which case everything runs on the host.

It uses 'nowait' and 'depend':

#pragma omp target map(tofrom: a, sum) depend(out: a) nowait
... (set 'a') ...
#pragma omp task depend(in: a) shared(a,errors)
... check value of ...

A comment indicates a problem with real-world code:

// This test checks if dependence expressed on target and task
// regions are honoured in the presense of nowait.
// This test is motivated by OpenMP usage in QMCPack.
[Bug bootstrap/97163] Build error with -mcpu=power9 on ppc64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97163

--- Comment #8 from CVS Commits ---
The master branch has been updated by Jakub Jelinek:

https://gcc.gnu.org/g:d00b1b023ecfc3ddc3fe952c0063dab7529d5f7a

commit r11-3476-gd00b1b023ecfc3ddc3fe952c0063dab7529d5f7a
Author: Jakub Jelinek
Date:   Sat Sep 26 10:07:41 2020 +0200

    powerpc, libcpp: Fix gcc build with clang on power8 [PR97163]

    libcpp has two specialized altivec implementations of search_line_fast,
    one for power8+ and the other one otherwise.
    Both use __attribute__((altivec(vector))) and the GCC builtins rather
    than altivec.h and the APIs from there, which is fine, but should be
    restricted to when libcpp is built with GCC, so that it can be relied on.
    The second elif is
    and thus e.g. when built with clang it isn't picked, but the first one
    was just guarded with
    and so according to the bugreporter clang fails miserably on that.

    The following patch fixes that by adding the same GCC_VERSION requirement
    as the second version. I don't know where the 4.5 in there comes from
    and the exact version doesn't matter that much, as long as it is above
    4.2 that clang pretends to be and smaller or equal to 4.8 as the oldest
    gcc we support as bootstrap compiler ATM.
    Furthermore, the patch fixes the comment, the version it is talking
    about is not pre-GCC 5, but actually the GCC 5+ one.

    2020-09-26  Jakub Jelinek

            PR bootstrap/97163
            * lex.c (search_line_fast): Only use _ARCH_PWR8 Altivec version
            for GCC >= 4.5.
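The guarding idea in the commit message can be sketched as below. This is not the libcpp source: `SKETCH_GCC_VERSION` and `search_line_impl` are hypothetical names, the macro is modeled on the major * 1000 + minor convention, and the exact conditions in lex.c are not shown in this message. The functions only report which branch the preprocessor selected.

```c
/* Hypothetical version macro: major * 1000 + minor, or 0 when the
   compiler does not define __GNUC__. A compiler that reports itself as
   4.2 yields 4002, which is below the 4005 threshold used here. */
#if defined(__GNUC__)
# define SKETCH_GCC_VERSION (__GNUC__ * 1000 + __GNUC_MINOR__)
#else
# define SKETCH_GCC_VERSION 0
#endif

/* Report which implementation the preprocessor conditions select.
   Both Altivec branches carry the same version requirement. */
const char *search_line_impl(void)
{
#if SKETCH_GCC_VERSION >= 4005 && defined(_ARCH_PWR8) && defined(__ALTIVEC__)
    return "altivec-power8";
#elif SKETCH_GCC_VERSION >= 4005 && defined(__ALTIVEC__)
    return "altivec";
#else
    return "generic";
#endif
}
```

Putting the version check on both specialized branches means a compiler below the threshold falls through to the generic path in either case.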