http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57529
            Bug ID: 57529
           Summary: Redundant masking of zero-extended values
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jewillco at osl dot iu.edu

Using version "gcc (GCC) 4.9.0 20130519 (experimental)" with target
"x86_64-unknown-linux-gnu" and the flags "-Ofast -std=gnu99 -march=bdver1",
the following code:

#include <stdint.h>

void foo(const uint16_t* restrict indexes, const uint64_t* restrict bits,
         unsigned int* restrict sum, int count) {
  for (int i = 0; i < count; ++i) {
    unsigned int val = indexes[i];
    if (bits[val / 64] & (1UL << (val % 64))) {sum[val] += 1;}
  }
}

produces two shifts to implement the "val / 64" operation instead of one,
seemingly because the compiler is trying to mask val to 16 bits even though it
was loaded with movzwl and thus was already masked and zero-extended. Here is
the assembly for the function body:

        testl   %ecx, %ecx      # count
        movl    %ecx, %r9d      # count, count
        jle     .L8     #,
        xorl    %eax, %eax      # ivtmp.5
        .p2align 4,,10
        .p2align 3
.L4:
        movzwl  (%rdi,%rax,2), %ecx     # MEM[base: indexes_8(D), index: ivtmp.5_52, step: 2, offset: 0B], D.2242
        movq    %rcx, %r8       # D.2242, D.2244
# **************** Redundant masking operation:
        salq    $48, %r8        #, D.2244
        shrq    $54, %r8        #, D.2244
# ****************
        movq    (%rsi,%r8,8), %r8       # *_16, D.2244
# ++++++++++++++++
        shrq    %cl, %r8        # D.2242, D.2244
        andl    $1, %r8d        #, D.2244
# ++++++++++++++++
        je      .L3     #,
# xxxxxxxxxxxxxxxx
        movzwl  %cx, %r8d       # D.2242, D.2244
# xxxxxxxxxxxxxxxx
        incl    (%rdx,%r8,4)    # *_25
.L3:
        incq    %rax    # ivtmp.5
        cmpl    %eax, %r9d      # ivtmp.5, count
        jg      .L4     #,
.L8:
        rep; ret

The seemingly-unnecessary operation is marked with stars; a single shrq by 6
should do the unsigned division operation correctly, while two instructions are
used to both mask the value to 16 bits and shift it. The zero-extension inside
x's is also unnecessary (%rcx could have been used directly in the index
expression).
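
The claim that a single shrq by 6 would suffice can be checked exhaustively
over all 16-bit inputs. The following is only a sketch, assuming the
salq $48 / shrq $54 pair operates on a 64-bit register whose upper 48 bits are
already zero after the movzwl load; the function names are hypothetical and do
not come from the report or from GCC:

```c
#include <assert.h>
#include <stdint.h>

/* Models the emitted pair: salq $48 followed by shrq $54 on a 64-bit value. */
static uint64_t word_index_masked(uint64_t v) {
    return (v << 48) >> 54;
}

/* Models the proposed replacement: a single shrq $6. */
static uint64_t word_index_plain(uint64_t v) {
    return v >> 6;
}

/* Compares both forms for every value a zero-extended uint16_t can hold.
   Returns the number of inputs checked. */
static unsigned check_all_u16(void) {
    unsigned checked = 0;
    for (uint32_t v = 0; v <= UINT16_MAX; ++v) {
        assert(word_index_masked(v) == word_index_plain(v));
        ++checked;
    }
    return checked;
}
```

The two forms differ only when bits 16 and above of the input are set, which
cannot happen for a value that was just loaded with movzwl.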
On a somewhat unrelated issue, the code marked in +'s seems to be sub-optimal as well, and could probably be replaced by a bt instruction (GCC 4.4.7 uses "btq" there using -O3 and the same -march flag).
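
For reference, the source-level test and the shrq/andl sequence marked in +'s
compute the same predicate, which is the single-bit test that a bt instruction
performs. A minimal sketch of that equivalence follows; the helper names are
hypothetical, and the shift count is reduced modulo 64 explicitly, as shrq does
with the count in %cl:

```c
#include <assert.h>
#include <stdint.h>

/* Source form: mask the word with a single shifted bit. */
static int bit_set_mask(uint64_t word, unsigned val) {
    return (word & (UINT64_C(1) << (val % 64))) != 0;
}

/* Form in the 4.9.0 output: shift the word right, then keep bit 0. */
static int bit_set_shift(uint64_t word, unsigned val) {
    return (int)((word >> (val % 64)) & 1);
}

/* Compares both forms for every bit position of one word.
   Returns the number of positions checked. */
static unsigned check_all_bit_positions(uint64_t word) {
    unsigned checked = 0;
    for (unsigned val = 0; val < 64; ++val) {
        assert(bit_set_mask(word, val) == bit_set_shift(word, val));
        ++checked;
    }
    return checked;
}
```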