[ https://issues.apache.org/jira/browse/IMPALA-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-2055.
-----------------------------------
    Resolution: Later

> improve performance of HyperLogLog
> ----------------------------------
>
>                 Key: IMPALA-2055
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2055
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.2
>            Reporter: Daniel Lescohier
>            Priority: Minor
>         Attachments: hllmerge.cxx, hllmerge_clang.s, hllmerge_gcc.s
>
>
> I'm attaching a standalone .cxx file that can be compiled independently, to 
> showcase improvements to the HLL functions in 
> be/src/exprs/aggregate-functions.cc.
> The two assembler files are output from:
>  - clang -S -O3 hllmerge.cxx
>  - gcc -S -O3 hllmerge.cxx
> The improvements can be integrated in:
>  - AggregateFunctions::HllMerge
>  - AggregateFunctions::HllFinalEstimate
> I think HllMerge is probably more important, because much more data is put 
> through that function.
> For HllMerge, I was able to get the compilers to vectorize the code with SSE 
> packed byte instructions. For the compilers to do that, I changed the 
> function so that:
>  - The loop limit is the compile-time constant HLL_LEN. HllMerge already has 
> these checks, so it's OK to use the constant:
>  -- DCHECK_EQ(dst->len, HLL_LEN);
>  -- DCHECK_EQ(src.len, HLL_LEN);
>  - The loop is in a simpler form that the compilers can recognize and 
> vectorize.
> The C++ code:
> {code}
> void HllMerge(FunctionContext* ctx, const StringVal& src, StringVal* dst) {
>   // Element-wise max of the HLL_LEN registers. The constant trip count and
>   // the simple loop body let the compiler use packed byte instructions.
>   const char* s = src.ptr;
>   const char* end = src.ptr + HLL_LEN;
>   char* d = dst->ptr;
>   for (; s < end; ++s, ++d) {
>     *d = (*s > *d) ? *s : *d;
>   }
> }
> {code}
> The inner loop of the merge that clang generates is this:
> {code}
> .LBB0_9:                                # %vector.body
>                                         # =>This Inner Loop Header: Depth=1
>         movdqu  -16(%rsi), %xmm0
>         movdqu  -16(%rdx), %xmm1
>         movdqa  %xmm0, %xmm2
>         pcmpgtb %xmm1, %xmm2
>         pand    %xmm2, %xmm0
>         pandn   %xmm1, %xmm2
>         por     %xmm0, %xmm2
>         movdqu  %xmm2, -16(%rdx)
>         movdqu  (%rsi), %xmm0
>         movdqu  (%rdx), %xmm1
>         movdqa  %xmm0, %xmm2
>         pcmpgtb %xmm1, %xmm2
>         pand    %xmm2, %xmm0
>         pandn   %xmm1, %xmm2
>         por     %xmm0, %xmm2
>         movdqu  %xmm2, (%rdx)
>         addq    $32, %rsi
>         addq    $32, %rdx
>         addq    $-32, %rcx
>         jne     .LBB0_9
> {code}
> For HllFinalEstimate, I made two helper functions, which can be static 
> functions in aggregate-functions.cc.
> One to count the number of zero registers:
> {code}
> int HllCountZeroRegisters(const char* buckets) {
>   // Number of registers that are still zero (i.e. never updated).
>   int sum = 0;
>   for (int i = 0; i < HLL_LEN; ++i) {
>     sum += buckets[i] ? 0 : 1;
>   }
>   return sum;
> }
> {code}
> This also uses packed byte instructions.
> The second helper function sums the inverse powers of 2 of the register 
> values. It uses an array of 64 floats holding the precomputed inverse powers 
> of 2, to avoid calling the expensive powf function (a sketch of how the table 
> could be built follows the code below). The table is 256 bytes, so it easily 
> fits in L1 cache.
> {code}
> float HllSumInvPow2(const char* buckets) {
>   // Walk the registers eight at a time: each byte of the 64-bit word is one
>   // register value, used as an index into the precomputed invPow2 table.
>   float sum = 0;
>   const uint64_t* i = reinterpret_cast<const uint64_t*>(buckets);
>   const uint64_t* end = i + (HLL_LEN / sizeof(*i));
>   for (; i < end; ++i) {
>     sum += invPow2[(*i >>  0) & 63];
>     sum += invPow2[(*i >>  8) & 63];
>     sum += invPow2[(*i >> 16) & 63];
>     sum += invPow2[(*i >> 24) & 63];
>     sum += invPow2[(*i >> 32) & 63];
>     sum += invPow2[(*i >> 40) & 63];
>     sum += invPow2[(*i >> 48) & 63];
>     sum += invPow2[(*i >> 56) & 63];
>   }
>   return sum;
> }
> {code}
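> A possible definition of the invPow2 table referenced above (only the name 
> and the 64-entry size come from this description; making it a file-scope 
> array filled once at startup is an assumption):
> {code}
> #include <cmath>
>
> // invPow2[k] holds 2^-k for register values 0..63, precomputed once so the
> // estimate loop never calls powf.
> static float invPow2[64];
>
> static void InitInvPow2() {
>   for (int k = 0; k < 64; ++k) {
>     invPow2[k] = powf(2.0f, -static_cast<float>(k));
>   }
> }
> {code}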
> In order to use this function, we should add this check to HllFinalEstimate:
>  - DCHECK_EQ(HLL_LEN & 7, 0)
> Reading from L1 cache should be much faster than calling the powf function. 
> The movq/shrq/andl instructions are essentially free, since they overlap with 
> the latency of the addss instructions adding values from L1 cache.
> I recommend that HllFinalEstimate call HllCountZeroRegisters first, and only 
> call HllSumInvPow2 if num_zero_registers == 0. There's no need to do the 
> relatively expensive floating-point sum if the number of distinct values is 
> low. A sketch of that control flow follows below.
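> A minimal sketch of how the two helpers could be combined in that order. Only 
> the "count zeros first, sum only when no register is zero" structure comes 
> from the recommendation above; the linear-counting formula and the alpha 
> constant are the standard HyperLogLog ones and are assumptions here, not part 
> of this proposal:
> {code}
> #include <cmath>
>
> // Hypothetical sketch, not the actual HllFinalEstimate implementation.
> static double HllEstimateSketch(const char* buckets) {
>   const double m = HLL_LEN;
>   int zeros = HllCountZeroRegisters(buckets);
>   if (zeros != 0) {
>     // Low-cardinality case: linear counting needs only the zero count,
>     // so the floating-point sum is skipped entirely.
>     return m * log(m / zeros);
>   }
>   // No zero registers: do the harmonic-mean estimate over all registers.
>   const double alpha = 0.7213 / (1.0 + 1.079 / m);  // standard HLL constant for large m
>   return alpha * m * m / HllSumInvPow2(buckets);
> }
> {code}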



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
