[ 
https://issues.apache.org/jira/browse/KUDU-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861471#comment-16861471
 ] 

Todd Lipcon edited comment on KUDU-2846 at 6/11/19 8:41 PM:
------------------------------------------------------------

Example code that does SIMD comparisons for int32 equality, 8 values at a time:
{code}
void TestFastCode(const ColumnBlock* cb, uint8_t* __restrict__ selvec, int32_t ref) {
    __m256i ref_vec = _mm256_set1_epi32(ref);
    const __m256i* data = (const __m256i*)cb->data_;
    for (int i = 0; i < cb->nrows_; i += 8) {
        __m256i m = _mm256_loadu_si256(data++);
        __m256i c = _mm256_cmpeq_epi32(m, ref_vec);
        uint8_t mask = _mm256_movemask_ps((__m256)c);
        // TODO: do we need to reverse the bits? also need to & with nulls
        *selvec++ &= mask;
    }
    // TODO: handle case of rowblock length not a multiple of 8
}
{code}

I couldn't convince the auto-vectorizer to generate the same assembly as the 
hand-written version, but it may be worth implementing these by hand for the 
most common predicates. Something like a 10x improvement over our current 
branchy mess seems possible here.
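
One way the two TODOs in the snippet might be resolved is sketched below. This is a minimal sketch under stated assumptions, not Kudu's implementation: it takes a raw int32_t array instead of a ColumnBlock, the names EvalInt32Equals and EqualsAvx2 are hypothetical, the non_null bitmap (a set bit means the cell is non-null) is assumed to share the selection vector's layout, and both bitmaps are assumed to be LSB-first (row i lives in bit i % 8 of byte i / 8). Under that layout no bit reversal is needed, because _mm256_movemask_ps places the sign bit of lane k into bit k of its result. The target attribute and __builtin_cpu_supports are GCC/Clang-specific; rows that do not fill a full group of 8 fall back to a scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_X86 1
#endif

#ifdef SKETCH_X86
// Processes full groups of 8 rows with AVX2; returns the number of rows handled.
__attribute__((target("avx2")))
static size_t EqualsAvx2(const int32_t* data, size_t nrows, int32_t ref,
                         const uint8_t* non_null, uint8_t* selvec) {
  const __m256i ref_vec = _mm256_set1_epi32(ref);
  size_t i = 0;
  for (; i + 8 <= nrows; i += 8) {
    const __m256i m =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
    const __m256i c = _mm256_cmpeq_epi32(m, ref_vec);
    // movemask puts the sign bit of lane k (row i + k) into bit k.
    const int mask = _mm256_movemask_ps(_mm256_castsi256_ps(c));
    // Null cells may hold junk, so their comparison bit is masked off here.
    selvec[i / 8] &= static_cast<uint8_t>(mask & non_null[i / 8]);
  }
  return i;
}
#endif

// Clears the selection bit of every row that is null or not equal to ref.
void EvalInt32Equals(const int32_t* data, size_t nrows, int32_t ref,
                     const uint8_t* non_null, uint8_t* selvec) {
  assert(data != nullptr && non_null != nullptr && selvec != nullptr);
  size_t done = 0;
#ifdef SKETCH_X86
  if (__builtin_cpu_supports("avx2")) {
    done = EqualsAvx2(data, nrows, ref, non_null, selvec);
  }
#endif
  // Scalar tail (or the whole block when AVX2 is unavailable).
  for (size_t i = done; i < nrows; ++i) {
    const unsigned bit = 1u << (i % 8);
    const bool keep = (data[i] == ref) && (non_null[i / 8] & bit) != 0;
    if (!keep) selvec[i / 8] &= static_cast<uint8_t>(~bit);
  }
}
```

Handling the tail in scalar code keeps the vector loop free of bounds checks; padding row blocks to a multiple of 8 would be an alternative, if that can be enforced.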


was (Author: tlipcon):
Example code that does SIMD comparisons for int32 equality, 8 values at a time:
{code}
void TestFastCode(const ColumnBlock* cb, uint8_t* selvec, int32_t ref) {
    __m256i ref_vec = _mm256_set1_epi32(ref);
    for (int i = 0; i < cb->nrows_; i += 8) {
        __m256i m = _mm256_loadu_si256((const __m256i*)&cb->data_[i * sizeof(int32_t)]);
        __m256i c = _mm256_cmpeq_epi32(m, ref_vec);
        int mask = _mm256_movemask_ps((__m256)c);
        selvec[i/8] &= mask; // TODO: do we need to reverse the bits? not sure.
    }
    // TODO: handle case of rowblock length not a multiple of 8, or can we enforce that?
}
{code}

I couldn't convince the auto-vectorizer to generate the same assembly as the 
hand-written version, but it may be worth implementing these by hand for the 
most common predicates. Something like a 10x improvement over our current 
branchy mess seems possible here.

> Special case predicate evaluation for SIMD support
> --------------------------------------------------
>
>                 Key: KUDU-2846
>                 URL: https://issues.apache.org/jira/browse/KUDU-2846
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: Todd Lipcon
>            Priority: Major
>
> In the common case of predicate evaluation on primitive types, we can likely 
> improve performance as follows:
> - do the comparisons while ignoring nullability and selectedness (any null 
> or unselected cells may hold junk data, which produces a junk comparison 
> result)
> - take the resulting bitmask of comparison results and use bitwise ops to 
> account for null/unselected cells, ensuring that those yield a 'false' 
> comparison
> For some types of comparisons this can result in SIMD operations. For others, 
> at least, this will remove most branches from the path. This should speed up 
> queries like TPC-H Q6, which spends 25% of its time in predicate evaluation.
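
The two steps in the description can be illustrated with a portable, branch-free sketch. It is an illustration only, not Kudu code: CompareEight and EvalPredicate are hypothetical names, the non_null bitmap (a set bit means non-null) and the selection vector are assumed to be LSB-first with one bit per row, and the predicate is assumed to be cheap and side-effect free so that evaluating it on junk cells is harmless.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Step 1: evaluate the predicate on up to 8 cells without looking at
// nullability or selection, packing the results into one byte (LSB first).
// The loop body has no data-dependent branch: each result is shifted and OR'ed.
template <typename T, typename Pred>
uint8_t CompareEight(const T* data, size_t n, Pred pred) {
  assert(n <= 8);
  uint8_t bits = 0;
  for (size_t k = 0; k < n; ++k) {
    bits |= static_cast<uint8_t>(static_cast<unsigned>(pred(data[k])) << k);
  }
  return bits;
}

// Step 2: combine the comparison bitmask with the non-null bitmap and the
// existing selection vector using bitwise ops, so that null or unselected
// cells always end up 'false' regardless of their junk comparison result.
template <typename T, typename Pred>
void EvalPredicate(const T* data, size_t nrows, const uint8_t* non_null,
                   uint8_t* selvec, Pred pred) {
  for (size_t i = 0; i < nrows; i += 8) {
    const size_t n = (nrows - i < 8) ? nrows - i : 8;
    const uint8_t cmp = CompareEight(data + i, n, pred);
    // Bits past the last row are zero in cmp, so they are cleared as well.
    selvec[i / 8] &= static_cast<uint8_t>(cmp & non_null[i / 8]);
  }
}
```

A call such as `EvalPredicate(values, nrows, non_null, selvec, [](int32_t x) { return x < 5; })` evaluates a less-than predicate this way. Whether a compiler turns the inner loop into SIMD instructions depends on the type and predicate, but the per-cell branches on null and selected state are gone either way.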



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
