[ https://issues.apache.org/jira/browse/IMPALA-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qifan Chen resolved IMPALA-3816. -------------------------------- Fix Version/s: Impala 4.0 Resolution: Fixed > Codegen perf-critical loops in Sorter > ------------------------------------- > > Key: IMPALA-3816 > URL: https://issues.apache.org/jira/browse/IMPALA-3816 > Project: IMPALA > Issue Type: Improvement > Components: Backend > Affects Versions: Impala 2.7.0 > Reporter: Tim Armstrong > Assignee: Qifan Chen > Priority: Minor > Labels: codegen > Fix For: Impala 4.0 > > Attachments: 7e455a394bf2622b_adcea7ad00000000_opt.ll, percentile > query profile.txt, tpch_30.txt > > > In the sorter, we codegen the comparator function but call it indirectly via > a function pointer. We should consider codegening the perf-critical loops so > that we can make the comparator function call direct and inlinable. Inlining > the comparison will be very beneficial if it is trivial, e.g. order by a > numeric column: I expect sorts on simple keys will get noticably faster. > We should also be able to get rid of FreeLocalAllocations() calls for most > comparators, although I'm not sure what the best way to approach that is. > The Partition() loop is the most perf-critical, followed by InsertionSort(). > We also don't do this yet for the TopN node, see IMPALA-3815. > Mostafa's analysis: > While evaluating Sort performance I noticed that the codegened compare > function is not inlined which results in large overhead per row. > Expected speedup is 10-15% > {code} > /// Returns a negative value if lhs is less than rhs, a positive value if > lhs is > /// greater than rhs, or 0 if they are equal. All exprs > (ordering_exprs_lhs_ and > /// ordering_exprs_rhs_) must have been prepared and opened before calling > this, > /// i.e. 'sort_key_exprs' in the constructor must have been opened. > int ALWAYS_INLINE Compare(const TupleRow* lhs, const TupleRow* rhs) const { > return codegend_compare_fn_ == NULL ? > CompareInterpreted(lhs, rhs) : > (*codegend_compare_fn_)(ordering_expr_evals_lhs_.data(), > ordering_expr_evals_rhs_.data(), lhs, rhs); > } > {code} > From Perf > {code} > │ bool Sorter::TupleSorter::Less(const TupleRow* lhs, const > TupleRow* rhs) { > > ▒ > 7.43 │ push %rbp > > > ▒ > 3.23 │ mov %rsp,%rbp > > > ▒ > 9.44 │ push %r12 > > > ▒ > 2.69 │ push %rbx > > > ▒ > 3.89 │ mov %rsi,%r12 > > > ▒ > 2.98 │ mov %rdi,%rbx > > > ▒ > 6.06 │ sub $0x10,%rsp > > > ◆ > │ --num_comparisons_till_free_; > > > ▒ > │ DCHECK_GE(num_comparisons_till_free_, 0); > > > ▒ > │ if (UNLIKELY(num_comparisons_till_free_ == 0)) { > > > ▒ > 3.75 │ subl $0x1,0x18(%rdi) > > > ▒ > 9.42 │ ↓ je 58 > > > ▒ > │ parent_->expr_results_pool_.Clear(); > > > ▒ > │ num_comparisons_till_free_ = state_->batch_size(); > > > ▒ > │ } > > > ▒ > │ return comparator_.Less(lhs, rhs); > > > ▒ > 1.09 │17: mov 0x10(%rbx),%rdi > > > ▒ > │ /// Returns a negative value if lhs is less than rhs, a > positive value if lhs is > > ▒ > │ /// greater than rhs, or 0 if they are equal. All exprs > (ordering_exprs_lhs_ and > > ▒ > │ /// ordering_exprs_rhs_) must have been prepared and opened > before calling this, > > ▒ > │ /// i.e. 'sort_key_exprs' in the constructor must have been > opened. > > ▒ > │ int ALWAYS_INLINE Compare(const TupleRow* lhs, const TupleRow* > rhs) const { > > ▒ > │ return codegend_compare_fn_ == NULL ? > > > ▒ > 2.69 │ mov 0x58(%rdi),%rax > > > ▒ > │ CompareInterpreted(lhs, rhs) : > > > ▒ > │ (*codegend_compare_fn_)(ordering_expr_evals_lhs_.data(), > > > ▒ > │ ordering_expr_evals_rhs_.data(), lhs, rhs); > > > ▒ > 5.43 │ test %rax,%rax > > > ▒ > │ ↓ je 40 > > > ▒ > 6.85 │ mov 0x20(%rdi),%rsi > > > ▒ > 0.86 │ mov %rdx,%rcx > > > ▒ > 2.55 │ mov 0x8(%rdi),%rdi > > > ▒ > 3.38 │ mov %r12,%rdx > > > ▒ > 6.10 │ → callq *(%rax) > > > ▒ > │ } > > > ▒ > 5.84 │ add $0x10,%rsp > > > ▒ > │ /// All exprs (ordering_exprs_lhs_ and ordering_exprs_rhs_) > must have been prepared > > ▒ > │ /// and opened before calling this. > > > ▒ > │ /// Force inlining because it tends not to be always inlined at > callsites, even in > > ▒ > │ /// hot loops. > > > ▒ > │ bool ALWAYS_INLINE Less(const TupleRow* lhs, const TupleRow* > rhs) const { > > ▒ > │ return Compare(lhs, rhs) < 0; > > > ▒ > 1.77 │ shr $0x1f,%eax > > > ▒ > 7.91 │ pop %rbx > > > ▒ > 4.11 │ pop %r12 > > > ▒ > 0.49 │ pop %rbp > > > ▒ > 1.75 │ ← retq > > > ▒ > │ /// i.e. 'sort_key_exprs' in the constructor must have been > opened. > > ▒ > │ int ALWAYS_INLINE Compare(const TupleRow* lhs, const TupleRow* > rhs) const { > > ▒ > │ return codegend_compare_fn_ == NULL ? > > > ▒ > │ CompareInterpreted(lhs, rhs) : > > > ▒ > │ (*codegend_compare_fn_)(ordering_expr_evals_lhs_.data(), > > > ▒ > │ ordering_expr_evals_rhs_.data(), lhs, rhs); > > > ▒ > │40: mov %r12,%rsi > > > ▒ > │ → callq > impala::TupleRowComparator::CompareInterpreted(impala::TupleRow const*, > impala::TupleRow const*) const > ▒ > │ add $0x10,%rsp > > > ▒ > │ /// All exprs (ordering_exprs_lhs_ and ordering_exprs_rhs_) > must have been prepared > > ▒ > Press 'h' for help on key bindings > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)