Baymine opened a new pull request, #64568:
URL: https://github.com/apache/doris/pull/64568

   
   ### What problem does this PR solve?
   
   Issue Number: close #xxx
   
   Related PR: #xxx
   
   Problem Summary:
   The ORC tiny-stripe scan path (`OrcReader::set_fill_columns`, 
all-tiny-stripe branch) wraps the file reader in `io::RangeCacheFileReader` so 
that the whole file's column-stream reads can be served from one set of 
pre-merged in-RAM ranges. Since #53729 (`update scan bytes metric in 
file_scanner`, 2025-07-28), `ORCFileInputStream::read()` issues every read 
through `_tracing_file_reader` instead of directly through `_file_reader`. 
`_tracing_file_reader` is a `TracingFileReader` whose `_inner` 
`shared_ptr<FileReader>` is **captured by value in the constructor's 
initializer list**, so it does not follow a later reassignment of 
`_file_reader`. The tiny-stripe branch performs exactly such a reassignment 
when it installs the `RangeCacheFileReader`.
   
   After this reassignment, `_file_reader` points at the new 
`RangeCacheFileReader`, but `_tracing_file_reader->_inner` still holds an 
independent `shared_ptr` to the original `HdfsFileReader`. Every read still 
hits HDFS directly. The `RangeCacheFileReader` is constructed, registers spill 
counters in the runtime profile (which is why 
`RangeCacheFileReader.ReadToCacheBytes = 0` was the truthful diagnostic), and 
is **never read from**. Net effect: the BE pays the stripe-merge + 
wrapper-construction cost, gets no caching benefit, and runs 5–9× slower than 
the per-stripe `OrcMergeRangeFileReader` non-tiny path on the same physical IO.
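   
   The ownership problem described above can be reproduced in isolation. The 
sketch below is illustrative only: `FileReader`, `HdfsReader`, 
`RangeCacheReader`, `TracingReader`, `InputStream`, `wrap_stale` and 
`wrap_and_rebind` are hypothetical stand-ins, not the Doris classes, and 
rebuilding the tracing wrapper is just one possible remedy, assumed here for 
illustration rather than taken from this PR's diff.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins: only the shared_ptr ownership pattern is modelled.
struct FileReader {
    virtual ~FileReader() = default;
    // Returns the name of the layer that served the read.
    virtual std::string read() = 0;
};

struct HdfsReader : FileReader {
    std::string read() override { return "hdfs"; }
};

struct RangeCacheReader : FileReader {
    explicit RangeCacheReader(std::shared_ptr<FileReader> inner)
            : _inner(std::move(inner)) {}
    std::string read() override {
        if (!_filled) {  // first access fills the cache from the inner reader
            _inner->read();
            _filled = true;
        }
        return "range-cache";
    }
private:
    std::shared_ptr<FileReader> _inner;
    bool _filled = false;
};

struct TracingReader : FileReader {
    // _inner is a copy of the shared_ptr as it was at construction time.
    explicit TracingReader(std::shared_ptr<FileReader> inner)
            : _inner(std::move(inner)) {}
    std::string read() override { return _inner->read(); }
private:
    std::shared_ptr<FileReader> _inner;
};

struct InputStream {
    explicit InputStream(std::shared_ptr<FileReader> reader)
            : _file_reader(std::move(reader)),
              _tracing_file_reader(std::make_shared<TracingReader>(_file_reader)) {
        assert(_file_reader != nullptr);
    }
    // Buggy pattern: only _file_reader is replaced; the tracing wrapper
    // keeps its own pointer to the original reader.
    void wrap_stale() {
        _file_reader = std::make_shared<RangeCacheReader>(_file_reader);
    }
    // Assumed remedy: rebuild the tracing wrapper around the new reader.
    void wrap_and_rebind() {
        _file_reader = std::make_shared<RangeCacheReader>(_file_reader);
        _tracing_file_reader = std::make_shared<TracingReader>(_file_reader);
    }
    // All reads go through the tracing wrapper, as in the description above.
    std::string read() { return _tracing_file_reader->read(); }
private:
    std::shared_ptr<FileReader> _file_reader;
    std::shared_ptr<TracingReader> _tracing_file_reader;
};

std::string read_after_stale_wrap() {
    InputStream s(std::make_shared<HdfsReader>());
    s.wrap_stale();
    return s.read();
}

std::string read_after_rebind() {
    InputStream s(std::make_shared<HdfsReader>());
    s.wrap_and_rebind();
    return s.read();
}
```

   In this model `read_after_stale_wrap()` returns `"hdfs"`, because the cache 
wrapper is constructed but bypassed, while `read_after_rebind()` returns 
`"range-cache"`.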
   
   We tested the change on an online SQL query with 
`orc_tiny_stripe_threshold_bytes=8388608` (8 MiB):
   
   | Counter | Pre-fix (4.0.5 default, buggy) | Post-fix (4.0.5 default + this PR) | Notes |
   |---|---|---|---|
   | Total query time | **5 min 55 s** | **1 min 18 s** | **4.55× speedup** |
   | `RangeCacheFileReader.ReadToCacheBytes` | **0 B** | **146.05 GB** | Matches 2.1.6 baseline byte-for-byte |
   | `RangeCacheFileReader.RequestBytes` | 415 MB (counter aliased onto `MergedSmallIO`) | 32.25 GB | Reads now flow through the wrapper |
   | `RangeCacheFileReader.RequestIO` | (low, aliased) | 27.506914 M | Every ORC read goes through the cache |
   | `FileReadBytes` (HDFS) | 52.84 GB | 53.22 GB | Same physical IO, confirmed |
   | `FileReadCalls` (HDFS) | 27 618 259 | 27 618 259 | Identical |
   | `count(*)` result | 780 797 | 780 797 | No semantic change |
   
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [ ] Regression test
       - [x] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason <!-- Add your reason?  -->
   
   - Behavior changed:
       - [ ] No.
       - [ ] Yes. <!-- Explain the behavior change -->
   
   - Does this need documentation?
       - [ ] No.
       - [ ] Yes. <!-- Add document PR link here. eg: 
https://github.com/apache/doris-website/pull/1214 -->
   
   ### Check List (For Reviewer who merges this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label <!-- Add branch pick label that this PR should 
merge into -->
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

