[Bug target/62011] False Data Dependency in popcnt instruction

2021-10-11 Thread malincns at 163 dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Ma Lin  changed:

   What|Removed |Added

 CC||malincns at 163 dot com

--- Comment #18 from Ma Lin  ---
FYI, in Intel 10th/11th Generation Processor Errata Table, there is no popcnt
problem.

In 9th Generation Errata Table, this problem exists.

[Bug target/62011] False Data Dependency in popcnt instruction

2017-11-16 Thread andrew.n.senkevich at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Andrew Senkevich  changed:

   What|Removed |Added

 CC||andrew.n.senkevich at gmail dot com

--- Comment #17 from Andrew Senkevich  ---
(In reply to Travis Downs from comment #16)
> Also, this is fixed for Skylake for tzcnt and lzcnt but not popcnt.

How can this be confirmed? As far as I can see, it is fixed for popcnt. Could
you show a reproducer?

[Bug target/62011] False Data Dependency in popcnt instruction

2017-11-11 Thread travis.downs at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

--- Comment #16 from Travis Downs  ---
Also, this is fixed for Skylake for tzcnt and lzcnt but not popcnt.

[Bug target/62011] False Data Dependency in popcnt instruction

2017-11-11 Thread travis.downs at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Travis Downs  changed:

   What|Removed |Added

 CC||travis.downs at gmail dot com

--- Comment #15 from Travis Downs  ---
For what it's worth, and because Richard asked for it above, there is an Intel
erratum for this, at least as of Haswell; for example, HSD146: "POPCNT
Instruction May Take Longer to Execute Than Expected".

It mentions only popcnt, and I found it for Haswell, Skylake (SKL029) and
Broadwell. The text is:

POPCNT Instruction May Take Longer to Execute Than Expected

Problem:
POPCNT instruction execution with a 32 or 64 bit operand may be delayed until 
previous non-dependent instructions have executed.

Implication:
Software using the POPCNT instruction may experience lower performance than
expected. 

Workaround:
None identified

[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-21 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Uroš Bizjak  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #14 from Uroš Bizjak  ---
.

[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-21 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Uroš Bizjak  changed:

   What|Removed |Added

   Target Milestone|--- |4.9.2

--- Comment #13 from Uroš Bizjak  ---
Fixed for 4.9.2+.

[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-21 Thread uros at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

--- Comment #12 from uros at gcc dot gnu.org ---
Author: uros
Date: Thu Aug 21 18:03:49 2014
New Revision: 214279

URL: https://gcc.gnu.org/viewcvs?rev=214279&root=gcc&view=rev
Log:
Backport from mainline
2014-08-19  H.J. Lu  

* config/i386/i386.md (*ctz2_falsedep_1): Don't clear
destination if it is used in source.
(*clz2_lzcnt_falsedep_1): Likewise.
(*popcount2_falsedep_1): Likewise.

Backport from mainline
2014-08-18  Uros Bizjak  

PR target/62011
* config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI):
New tune flag.
* config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BMI): New define.
* config/i386/i386.md (unspec) : New unspec.
(ffs2): Do not expand with tzcnt for
TARGET_AVOID_FALSE_DEP_FOR_BMI.
(ffssi2_no_cmove): Ditto.
(*tzcnt_1): Disable for TARGET_AVOID_FALSE_DEP_FOR_BMI.
(ctz2): New expander.
(*ctz2_falsedep_1): New insn_and_split pattern.
(*ctz2_falsedep): New insn.
(*ctz2): Rename from ctz2.
(clz2_lzcnt): New expander.
(*clz2_lzcnt_falsedep_1): New insn_and_split pattern.
(*clz2_lzcnt_falsedep): New insn.
(*clz2): Rename from ctz2.
(popcount2): New expander.
(*popcount2_falsedep_1): New insn_and_split pattern.
(*popcount2_falsedep): New insn.
(*popcount2): Rename from ctz2.
(*popcount2_cmp): Remove.
(*popcountsi2_cmp_zext): Ditto.


Modified:
branches/gcc-4_9-branch/gcc/ChangeLog
branches/gcc-4_9-branch/gcc/config/i386/i386.h
branches/gcc-4_9-branch/gcc/config/i386/i386.md
branches/gcc-4_9-branch/gcc/config/i386/x86-tune.def


[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-18 Thread uros at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

--- Comment #11 from uros at gcc dot gnu.org ---
Author: uros
Date: Mon Aug 18 18:00:52 2014
New Revision: 214112

URL: https://gcc.gnu.org/viewcvs?rev=214112&root=gcc&view=rev
Log:
PR target/62011
* config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI):
New tune flag.
* config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BMI): New define.
* config/i386/i386.md (unspec) : New unspec.
(ffs2): Do not expand with tzcnt for
TARGET_AVOID_FALSE_DEP_FOR_BMI.
(ffssi2_no_cmove): Ditto.
(*tzcnt_1): Disable for TARGET_AVOID_FALSE_DEP_FOR_BMI.
(ctz2): New expander.
(*ctz2_falsedep_1): New insn_and_split pattern.
(*ctz2_falsedep): New insn.
(*ctz2): Rename from ctz2.
(clz2_lzcnt): New expander.
(*clz2_lzcnt_falsedep_1): New insn_and_split pattern.
(*clz2_lzcnt_falsedep): New insn.
(*clz2): Rename from ctz2.
(popcount2): New expander.
(*popcount2_falsedep_1): New insn_and_split pattern.
(*popcount2_falsedep): New insn.
(*popcount2): Rename from ctz2.
(*popcount2_cmp): Remove.
(*popcountsi2_cmp_zext): Ditto.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.h
trunk/gcc/config/i386/i386.md
trunk/gcc/config/i386/x86-tune.def


[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-13 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2014-08-13
 Ever confirmed|0   |1

--- Comment #10 from Richard Biener  ---
Confirmed at least.


[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-12 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

--- Comment #9 from Yuri Rumyantsev  ---
This is not the u32 version but the u64 one. The first loop (u32) version looks like:

.L23:
        leal    1(%rdx), %ecx
        xorq    %rax, %rax
        popcntq (%rbx,%rax,8), %rax
        leal    2(%rdx), %r8d
        xorq    %rcx, %rcx
        popcntq (%rbx,%rcx,8), %rcx
        addq    %rax, %rcx
        leal    3(%rdx), %esi
        xorq    %rax, %rax
        popcntq (%rbx,%r8,8), %rax
        addq    %rax, %rcx
        xorq    %rax, %rax
        popcntq (%rbx,%rsi,8), %rax
        addq    %rax, %rcx
        leal    4(%rdx), %eax
        addq    %rcx, %r14
        movq    %rax, %rdx
        cmpq    %rax, %r12
        ja      .L23


[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-12 Thread finis at in dot tum.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

--- Comment #8 from finis at in dot tum.de ---
@Yuri: Note however, that the result of your fixed u32 version seems to be
wrong.


[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-12 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

--- Comment #7 from Yuri Rumyantsev  ---
Please ignore my previous comment - if we zero the destination register
before each popcnt (and lzcnt), performance is restored:

original test results:

unsigned    8388663    0.848533 sec    24.715 GB/s
uint64_t    8388663    1.37436 sec     15.2592 GB/s

fixed popcnt:

unsigned    9044037    0.853753 sec    24.5639 GB/s
uint64_t    8388663    0.694458 sec    30.1984 GB/s

Here is assembly for 2nd loop:

.L16:
        xorq    %rax, %rax
        popcntq -8(%rdx), %rax
        xorq    %rcx, %rcx
        popcntq (%rdx), %rcx
        addq    %rax, %rcx
        xorq    %rax, %rax
        popcntq 8(%rdx), %rax
        addq    %rcx, %rax
        addq    $32, %rdx
        xorq    %rcx, %rcx
        popcntq -16(%rdx), %rcx
        addq    %rax, %rcx
        addq    %rcx, %r13
        cmpq    %rsi, %rdx
        jne     .L16


[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-11 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Yuri Rumyantsev  changed:

   What|Removed |Added

 CC||ysrumyan at gmail dot com

--- Comment #6 from Yuri Rumyantsev  ---
I don't see any issues with 'false dependency' on HSW. I've got sep data on it:

for unsigned variant (with LEA instructions):

0x400b30   52   161   lea    0x1(%rdx),%ecx
0x400b33   53     0   popcnt (%rbx,%rax,8),%rax
0x400b39   54   353   lea    0x2(%rdx),%r8d
0x400b3d   55     0   popcnt (%rbx,%rcx,8),%rcx
0x400b43   56   170   add    %rax,%rcx
0x400b46   57    25   lea    0x3(%rdx),%esi
0x400b49   58   332   popcnt (%rbx,%r8,8),%rax
0x400b4f   59   196   add    %rax,%rcx
0x400b52   60   199   popcnt (%rbx,%rsi,8),%rax
0x400b58   61   235   add    %rax,%rcx
0x400b5b   62   414   lea    0x4(%rdx),%eax
0x400b5e   63     0   add    %rcx,%r14
0x400b61   64   312   mov    %rax,%rdx
0x400b64   65     0   cmp    %rax,%r12
0x400b67   66     0   ja     400b30

and we don't see any performance anomaly with popcnt.

But for 2nd loop we have

0x400c50   118      0   popcnt -0x8(%rdx),%rax
0x400c56   119      0   popcnt (%rdx),%rcx
0x400c5b   120   1086   add    %rax,%rcx
0x400c5e   121    492   popcnt 0x8(%rdx),%rax
0x400c64   122      3   add    %rcx,%rax
0x400c67   123    507   add    $0x20,%rdx
0x400c6b   124      0   popcnt -0x10(%rdx),%rcx
0x400c71   125    955   add    %rax,%rcx
0x400c74   126    479   add    %rcx,%r13
0x400c77   127    489   cmp    %rsi,%rdx
0x400c7a   128      0   jne    400c50

So far I can't imagine what the problem is.


[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-07 Thread finis at in dot tum.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

--- Comment #5 from finis at in dot tum.de ---
Maybe there are a lot more instructions with such a false dependency. popcnt
may only be the tip of the iceberg. I don't think Intel only got this
operation wrong and all other SSE/AVX/... instructions are correct. I rather
think a group of operations is implemented like popcnt. The source code in the
linked SO question yields a good testbed for other operations as well: Simply
replace popcount by another intrinsic and check if the performance deviations
occur.
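
A minimal sketch of such a testbed, assuming GCC or Clang builtins. The macro
DEFINE_KERNEL and the names sum_popcount and sum_ctz are hypothetical and are
not taken from the linked SO question; the 4-way unrolling only mirrors the
shape of the loops quoted in this thread. Timing is deliberately left out: wrap
a call in the timer of your choice and compare kernels.

```c
#include <stddef.h>
#include <stdint.h>

/* Operations under test. __builtin_ctzll is undefined for 0, so guard it. */
#define POPCNT_OP(x) __builtin_popcountll(x)
#define CTZ_OP(x)    ((x) ? __builtin_ctzll(x) : 64)

/* Generate one kernel per operation. A macro is used instead of a function
   pointer so the builtin is inlined and compiles to the bare instruction. */
#define DEFINE_KERNEL(name, OP)                                   \
    static uint64_t name(const uint64_t *buf, size_t len)         \
    {                                                             \
        uint64_t total = 0;                                       \
        size_t i = 0;                                             \
        /* Four independent operations per iteration. */          \
        for (; i + 4 <= len; i += 4) {                            \
            total += (uint64_t)OP(buf[i]);                        \
            total += (uint64_t)OP(buf[i + 1]);                    \
            total += (uint64_t)OP(buf[i + 2]);                    \
            total += (uint64_t)OP(buf[i + 3]);                    \
        }                                                         \
        /* Remaining elements when len is not a multiple of 4. */ \
        for (; i < len; i++)                                      \
            total += (uint64_t)OP(buf[i]);                        \
        return total;                                             \
    }

DEFINE_KERNEL(sum_popcount, POPCNT_OP)
DEFINE_KERNEL(sum_ctz, CTZ_OP)
```

To try another operation, add one more DEFINE_KERNEL line. Whether the
compiler emits popcnt, tzcnt or a fallback sequence depends on the target
flags (for example -mpopcnt or -mbmi), so check the generated assembly before
drawing conclusions from timings.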


[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-07 Thread finis at in dot tum.de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

finis at in dot tum.de changed:

   What|Removed |Added

 CC||finis at in dot tum.de

--- Comment #4 from finis at in dot tum.de ---

> Not sure if we want to
> disable popcnt use completely.

No matter how this is fixed, do not disable popcnt! Even with the false
dependency it is still the fastest instruction for popcounting. The false
dependency makes it slower, but it is still faster than a hand-written version.

The easiest fix IMHO would be using xor %r %r on the output register. It seems
to work extremely well, as you can see in the answer of the linked SO question.
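
For reference, a "hand-written version" usually means a bit-parallel (SWAR)
count like the sketch below. This is a generic textbook routine, not code from
this thread; popcount_swar and popcount_builtin are illustrative names, and
__builtin_popcountll assumes GCC or Clang.

```c
#include <stdint.h>

/* Hand-written SWAR population count: sum bits in 2-, 4- and 8-bit groups,
   then add the eight byte counts with a single multiply. */
static uint64_t popcount_swar(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

/* Compiler builtin; emitted as a single popcnt when the target allows it
   (for example with -mpopcnt), otherwise as a library or inline fallback. */
static uint64_t popcount_builtin(uint64_t x)
{
    return (uint64_t)__builtin_popcountll(x);
}
```

The SWAR routine needs roughly a dozen dependent ALU operations per word,
which is why a single popcnt is expected to stay ahead even when it is slowed
by the false dependency; measure on the CPU in question to confirm.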


[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-05 Thread debiandev at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

--- Comment #3 from Andev  ---
This seems to be specific to some recent Intel CPUs. I am not sure which other
CPUs are affected. There is no official erratum for this behavior AFAIK.

As Alexander suggested, it would be a great idea to have a workaround for this
in gcc for these specific CPUs.


[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-05 Thread amonakov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
I think adjusting for scheduling won't help much: rather than making the
compiler aware of increased latency, you would need either the register
allocator to avoid using a recently written hard register for popcnt (I'm not
aware of such a capability), or, as a stopgap measure, the compiler to issue a
dependency-breaking instruction (xor %reg %reg) just before popcnt.
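
The stopgap can be illustrated by hand with GCC extended inline assembly. This
is only a sketch of the idea, not what the compiler emits; popcnt_depbreak is
a hypothetical helper name, and the asm path is taken only when the build
targets x86-64 with POPCNT enabled (for example -mpopcnt).

```c
#include <stdint.h>

/* Hypothetical helper: popcnt preceded by a dependency-breaking xor. */
static inline uint64_t popcnt_depbreak(uint64_t x)
{
#if defined(__GNUC__) && defined(__x86_64__) && defined(__POPCNT__)
    uint64_t result;
    /* "=&r" (early clobber) forces the output into a register distinct from
       the input. Without it the xor could zero the operand before popcnt
       reads it and the result would be wrong. */
    __asm__("xorq %0, %0\n\t"
            "popcntq %1, %0"
            : "=&r"(result)
            : "r"(x)
            : "cc");
    return result;
#else
    /* Portable fallback: clear the lowest set bit until none remain. */
    uint64_t n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
#endif
}
```

The early-clobber constraint matters for the same reason a blanket xor is not
always safe in generated code: if the destination register also holds the
source or part of its address, clearing it first destroys the input.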


[Bug target/62011] False Data Dependency in popcnt instruction

2014-08-05 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Target|X86_64-*-*  |x86_64-*-*, i?86-*-*

--- Comment #1 from Richard Biener  ---
Please clarify: is this a defect in the CPU?  Can you point to an official
erratum?

In which case we might want to adjust the scheduler description used for
GENERIC tuning (and for the specific broken CPUs).  Not sure if we want to
disable popcnt use completely.