[GitHub] [orc] stiga-huang opened a new pull request, #1087: ORC-1150: [C++] Optimize RowReaderImpl::computeBatchSize() by pre-computation

GitBox Fri, 15 Apr 2022 18:25:12 -0700


stiga-huang opened a new pull request, #1087:
URL: https://github.com/apache/orc/pull/1087


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. File a JIRA issue first and use it as a prefix of your PR title, e.g., 
`ORC-001: Fix ABC`.
     2. Use your PR title to summarize what this PR proposes instead of 
describing the problem.
     3. Make PR title and description complete because these will be the 
permanent commit log.
     4. If possible, provide a concise and reproducible example to reproduce 
the issue for a faster review.
     5. If the PR is unfinished, use GitHub PR Draft feature.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section 
is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster 
reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class 
hierarchy will help reviewers.
     2. If there is a discussion in the mailing list, please add the link.
   -->
   In case of PPD, SargsApplier mantains a bool vector representing whether a 
RowGroup is picked after evaluating the SearchArguments. The bool vector is 
then used in RowReaderImpl::computeBatchSize() to compute the next skipped row 
id for  the current row range. The computation is done for each batch.
   
   In the worst case, if all RowGroups are picked, computeBatchSize() has to 
iterate the whole vector starts from the current RowGroup. In practise, a 
stripe can have thousands of RowGroups. The complexity of iterating the bool 
vector is comparable to materializing the batch (usually have size as 1024). 
The perf result in the JIRA description shows that it can take 1/4 of the scan 
time.
   
   Instead of maintaining the bool vector, we can pre-compute the next skipped 
row id of the row range that each RowGroup locates. 
RowReaderImpl::computeBatchSize() can use the results directly. This PR 
replaces the bool vector with the vector of row ids explained above. If the 
current RowGroup is skipped, the value is 0. This matches the case when the 
original bool value is false.
   
   The pre-computation is still done in SargsApplier::pickRowGroups(). 
RowReaderImpl::computeBatchSize() and RowReaderImpl::advanceToNextRowGroup() 
are optimized to use the pre-computed results.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   This PR improves the performance in case of PPD. I pick a 252MB file of the 
tpcds inventory table, use a modified orc-scan with PPD that picks all 
RowGroups to scan it. Here are the results measure by perf-stat:
   
   |                | original    | optimized   | speedup      |
   |----------------|-------------|-------------|--------------|
   | Time taken (s) | 0.722492053 | 0.527221077 | 1.3703777875 |
   | instructions   | 8725349519  | 5989083282  | 1.4568756366 |
   | branches       | 1143322885  | 752821194   | 1.5187177169 |
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some 
test cases that check the changes thoroughly including negative and positive 
cases if possible.
   If it was tested in a way different from regular unit tests, please clarify 
how you tested step by step, ideally copy and paste-able, so that other 
reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why 
it was difficult to add.
   -->
   Passed all existing tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [orc] stiga-huang opened a new pull request, #1087: ORC-1150: [C++] Optimize RowReaderImpl::computeBatchSize() by pre-computation

Reply via email to