JackieTien97 opened a new pull request, #815:
URL: https://github.com/apache/tsfile/pull/815
## Summary
Optimize `TsFileReader::get_timeseries_metadata()` to eliminate redundant
index tree traversals and duplicate disk reads when collecting timeseries
metadata for all devices.
### Root Causes
1. **N+1 I/O**: `get_all_device_ids()` traverses the entire device index
tree but discards offset information. Then for each device,
`load_device_index_entry()` re-searches the tree from the root — O(N×D)
redundant disk reads for N devices and tree depth D.
2. **Duplicate read**: `get_device_timeseries_meta_without_chunk_meta()`
reads the measurement index node once to check alignment, then
`load_all_measurement_index_entry()` reads the exact same byte range again.
3. **Redundant `PageArena::init()`**: Called once per device inside the loop,
even though the arena is already initialized in the constructor.
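
To make root cause 1 concrete, here is a minimal toy model of the two access patterns. It is a sketch under stated assumptions, not the TsFile implementation: `Node`, `Entry`, `build_tree`, `lookup`, and `collect_all` are hypothetical names, and every visited node is counted as one disk read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy types; the real TsFile index nodes look different.
struct Entry { int device; long start; long end; };
struct Node {
    std::vector<Entry> entries;   // leaf nodes: device entries with offsets
    std::vector<Node> children;   // internal nodes: children in device order
    int min_device = 0;           // smallest device ID stored under this node
};

// Build a balanced toy index over devices 0..num_devices-1.
Node build_tree(int num_devices, int fanout) {
    assert(num_devices > 0 && fanout > 1);
    std::vector<Node> level;
    for (int d = 0; d < num_devices; d += fanout) {
        Node leaf;
        leaf.min_device = d;
        for (int i = d; i < std::min(d + fanout, num_devices); ++i)
            leaf.entries.push_back({i, i * 100L, i * 100L + 100});
        level.push_back(std::move(leaf));
    }
    while (level.size() > 1) {  // group nodes into parents until one root remains
        std::vector<Node> parents;
        for (std::size_t i = 0; i < level.size(); i += fanout) {
            Node parent;
            parent.min_device = level[i].min_device;
            for (std::size_t j = i; j < std::min(i + fanout, level.size()); ++j)
                parent.children.push_back(std::move(level[j]));
            parents.push_back(std::move(parent));
        }
        level = std::move(parents);
    }
    return std::move(level.front());
}

// Old pattern: search from the root for a single device (depth D reads per call).
const Entry* lookup(const Node& node, int device, std::size_t& reads) {
    ++reads;
    if (node.children.empty()) {
        for (const Entry& e : node.entries)
            if (e.device == device) return &e;
        return nullptr;
    }
    const Node* next = &node.children.front();
    for (const Node& child : node.children)
        if (child.min_device <= device) next = &child;  // last child starting at or before device
    return lookup(*next, device, reads);
}

// New pattern: one traversal that keeps the offsets it sees along the way.
void collect_all(const Node& node, std::vector<Entry>& out, std::size_t& reads) {
    ++reads;
    for (const Entry& e : node.entries) out.push_back(e);
    for (const Node& child : node.children) collect_all(child, out, reads);
}
```

With 1000 devices and a fanout of 10, the toy tree has 111 nodes and depth 3: `collect_all` visits 111 nodes once, while calling `lookup` for every device visits 3000 nodes on top of the initial traversal.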
### Changes
- Add `TsFileIOReader::get_device_timeseries_meta_by_offset()`: accepts
pre-resolved offsets (skipping `load_device_index_entry`) and reuses the
deserialized top node for `get_all_leaf()` (eliminating the duplicate read)
- Add `TsFileReader::get_all_device_entries()` — single tree traversal
collecting device IDs with their `(start_offset, end_offset)`
- Rewrite `get_timeseries_metadata()` to combine the above
- Remove redundant `PageArena::init()` call from
`get_timeseries_metadata_impl()`
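
The duplicate-read fix can be illustrated in isolation with a small sketch. Everything here is hypothetical and simplified: `CountingFile` stands in for the file layer and only counts range reads, and the alignment rule in `is_aligned` is a toy assumption, not the actual check in the reader.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the file layer: counts range reads instead of doing I/O.
struct CountingFile {
    std::vector<std::string> node_payload;  // pretend contents of one measurement index node
    std::size_t reads = 0;

    std::vector<std::string> read_node(long /*start*/, long /*end*/) {
        ++reads;
        return node_payload;
    }
};

struct DeviceMeta {
    bool aligned = false;
    std::vector<std::string> measurements;
};

// Toy rule, assumed for illustration only: an empty first name marks an aligned device.
bool is_aligned(const std::vector<std::string>& node) {
    return !node.empty() && node.front().empty();
}

// Old shape: one read to check alignment, then a second read of the same byte range.
DeviceMeta load_with_duplicate_read(CountingFile& file, long start, long end) {
    DeviceMeta meta;
    meta.aligned = is_aligned(file.read_node(start, end));
    meta.measurements = file.read_node(start, end);
    return meta;
}

// New shape: deserialize the node once and reuse it for both steps.
DeviceMeta load_with_single_read(CountingFile& file, long start, long end) {
    std::vector<std::string> node = file.read_node(start, end);
    DeviceMeta meta;
    meta.aligned = is_aligned(node);
    meta.measurements = std::move(node);
    return meta;
}
```

Both functions return the same `DeviceMeta`; the second issues one read per device instead of two.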
## Benchmark
**Test file**: `ecg_dataset/part_0.tsfile` (634 MB, 53040 devices)
| Path | Avg Time | Speedup |
|------|----------|---------|
| **Before** (`get_all_device_ids` + `get_timeseries_metadata(ids)`) | ~28.8 s | baseline |
| **After** (`get_timeseries_metadata()` optimized) | ~0.30 s | **~94x** |
Raw results (5 rounds each, after 2 warmup rounds):
```
=== Old path (get_all_device_ids + get_timeseries_metadata(ids)) ===
Round 1: 28508721 us (devices=53040)
Round 2: 30220159 us (devices=53040)
Round 3: 28359948 us (devices=53040)
Round 4: 28497494 us (devices=53040)
Round 5: 28451505 us (devices=53040)
=== New path (get_timeseries_metadata(), optimized) ===
Round 1: 303186 us (devices=53040)
Round 2: 303545 us (devices=53040)
Round 3: 303869 us (devices=53040)
Round 4: 304806 us (devices=53040)
Round 5: 304579 us (devices=53040)
```
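
For reference, the averages in the table can be re-derived from the raw rounds above. The snippet below is only a convenience sketch (the names `mean_us`, `kOldPathUs`, `kNewPathUs`, and `speedup` are made up for this note); it yields means of about 28.81 s and 0.304 s, a ratio of roughly 94.8.

```cpp
#include <numeric>
#include <vector>

// Raw round timings in microseconds, copied from the results above.
const std::vector<long long> kOldPathUs = {28508721, 30220159, 28359948,
                                           28497494, 28451505};
const std::vector<long long> kNewPathUs = {303186, 303545, 303869,
                                           304806, 304579};

// Arithmetic mean in microseconds; expects a non-empty vector.
double mean_us(const std::vector<long long>& rounds) {
    return std::accumulate(rounds.begin(), rounds.end(), 0.0) / rounds.size();
}

// Ratio of the old path's mean time to the new path's mean time.
double speedup() {
    return mean_us(kOldPathUs) / mean_us(kNewPathUs);
}
```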
## Test plan
- [x] All 522 existing tests pass
- [x] Verified result correctness: both paths return identical device count
(53040)