[ https://issues.apache.org/jira/browse/IMPALA-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong resolved IMPALA-2055.
-----------------------------------
    Resolution: Later

> improve performance of HyperLogLog
> ----------------------------------
>
>                 Key: IMPALA-2055
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2055
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.2
>            Reporter: Daniel Lescohier
>            Priority: Minor
>         Attachments: hllmerge.cxx, hllmerge_clang.s, hllmerge_gcc.s
>
>
> I'm attaching a standalone .cxx file that can be compiled independently, to
> showcase improvements to the HLL functions in
> be/src/exprs/aggregate-functions.cc.
> The two assembler files are output from:
> - clang -S -O3 hllmerge.cxx
> - gcc -S -O3 hllmerge.cxx
> The improvements can be integrated into:
> - AggregateFunctions::HllMerge
> - AggregateFunctions::HllFinalEstimate
> I think HllMerge is probably more important, because a lot more data is put
> through that function.
> For HllMerge, I was able to get the compilers to vectorize the code with SSE
> packed-byte instructions. For the compilers to do that, I changed the
> function so that:
> - The loop limit is the compile-time constant HLL_LEN. HllMerge already has
> these checks, so it is safe to use the constant:
> -- DCHECK_EQ(dst->len, HLL_LEN);
> -- DCHECK_EQ(src.len, HLL_LEN);
> - The loop is written in a simpler form that the compilers can recognize and
> vectorize.
> The C++ code:
> {code}
> void HllMerge(FunctionContext* ctx, const StringVal& src, StringVal* dst) {
>   // Element-wise maximum of the two HLL register arrays.
>   const char* s = src.ptr;
>   const char* end = src.ptr + HLL_LEN;
>   char* d = dst->ptr;
>   for (; s < end; ++s, ++d) {
>     *d = (*s > *d) ? *s : *d;
>   }
> }
> {code}
> The inner loop of the merge that clang generates is this:
> {code}
> .LBB0_9:                               # %vector.body
>                                        # =>This Inner Loop Header: Depth=1
>         movdqu  -16(%rsi), %xmm0
>         movdqu  -16(%rdx), %xmm1
>         movdqa  %xmm0, %xmm2
>         pcmpgtb %xmm1, %xmm2
>         pand    %xmm2, %xmm0
>         pandn   %xmm1, %xmm2
>         por     %xmm0, %xmm2
>         movdqu  %xmm2, -16(%rdx)
>         movdqu  (%rsi), %xmm0
>         movdqu  (%rdx), %xmm1
>         movdqa  %xmm0, %xmm2
>         pcmpgtb %xmm1, %xmm2
>         pand    %xmm2, %xmm0
>         pandn   %xmm1, %xmm2
>         por     %xmm0, %xmm2
>         movdqu  %xmm2, (%rdx)
>         addq    $32, %rsi
>         addq    $32, %rdx
>         addq    $-32, %rcx
>         jne     .LBB0_9
> {code}
> For HllFinalEstimate, I made two helper functions, which can be static
> functions in aggregate-functions.cc.
> One counts the number of zero registers:
> {code}
> int HllCountZeroRegisters(const char* buckets) {
>   // A zero register corresponds to an empty HLL bucket.
>   int sum = 0;
>   for (int i = 0; i < HLL_LEN; ++i) {
>     sum += buckets[i] ? 0 : 1;
>   }
>   return sum;
> }
> {code}
> This also uses packed-byte instructions.
> The second helper function sums the inverse powers of 2. It uses an array of
> 64 floats holding the precomputed inverse powers of 2, to avoid calling the
> expensive powf function. The table is 256 bytes, so it easily fits in L1
> cache.
> {code}
> float HllSumInvPow2(const char* buckets) {
>   float sum = 0;
>   const uint64_t* i = reinterpret_cast<const uint64_t*>(buckets);
>   const uint64_t* end = i + (HLL_LEN / sizeof(uint64_t));
>   // Process 8 registers per iteration; each register value indexes the table.
>   for (; i < end; ++i) {
>     sum += invPow2[(*i >> 0) & 63];
>     sum += invPow2[(*i >> 8) & 63];
>     sum += invPow2[(*i >> 16) & 63];
>     sum += invPow2[(*i >> 24) & 63];
>     sum += invPow2[(*i >> 32) & 63];
>     sum += invPow2[(*i >> 40) & 63];
>     sum += invPow2[(*i >> 48) & 63];
>     sum += invPow2[(*i >> 56) & 63];
>   }
>   return sum;
> }
> {code}
> To use this function, we should add this check to HllFinalEstimate:
> - DCHECK_EQ(HLL_LEN & 7, 0)
> Reading from L1 cache should be much faster than calling the powf function.
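[Editor's note: the invPow2 table referenced by HllSumInvPow2 is not shown in the excerpt above. A minimal sketch of how such a table could be built, assuming invPow2[k] holds 2^-k for k in [0, 63]; the actual definition in hllmerge.cxx may differ:]

{code}
#include <cmath>

// 64 floats * 4 bytes = 256 bytes, matching the table size quoted above.
static float invPow2[64];

// Fill the table once at startup; afterwards HllSumInvPow2 only does
// L1-cache loads instead of powf calls.
static void InitInvPow2() {
  for (int k = 0; k < 64; ++k) {
    invPow2[k] = std::ldexp(1.0f, -k);  // exactly 2^-k
  }
}
{code}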
> The movq/shrq/andl instructions are essentially free, since they are
> interleaved with the latency of the addss instruction adding from L1 cache.
> I recommend that HllFinalEstimate call the HllCountZeroRegisters function
> first, and only call the HllSumInvPow2 function if num_zero_registers == 0.
> There is no need to do the relatively expensive floating-point sum when the
> number of distinct values is low.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
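[Editor's note: a minimal sketch of the control flow recommended above, for illustration only. The linear-counting formula and the alpha bias constant below are the standard HyperLogLog ones, not taken from the attachment, and the "skip the sum whenever any register is zero" branch follows the recommendation rather than a threshold-based check.]

{code}
#include <cmath>
#include <cstdint>

// HLL_LEN, HllCountZeroRegisters, and HllSumInvPow2 are as defined above.
uint64_t HllFinalEstimateSketch(const char* buckets) {
  const float m = HLL_LEN;  // number of registers
  const int zeros = HllCountZeroRegisters(buckets);
  if (zeros != 0) {
    // Low-cardinality path: linear counting, no floating-point sum needed.
    return static_cast<uint64_t>(m * std::log(m / zeros));
  }
  // High-cardinality path: bias-corrected harmonic mean of 2^-register.
  const float alpha = 0.7213f / (1.0f + 1.079f / m);  // standard constant for large m
  const float harmonic_sum = HllSumInvPow2(buckets);
  return static_cast<uint64_t>(alpha * m * m / harmonic_sum);
}
{code}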