Jzjsnow opened a new issue, #3774:
URL: https://github.com/apache/amoro/issues/3774

   ### Description
   
   Currently, Amoro determines Iceberg table optimizations through periodic 
full-refresh evaluations at the table level. While this design ensures 
consistent refreshes, it introduces inefficiencies for large-scale Iceberg 
tables with continuously growing data volumes.
   
   From log analysis on an Amoro system with 40,000+ tables, we observed:
   - Numerous redundant scans during refresh.
   - Low evaluation efficiency, as many scans did not lead to effective 
optimizations.
   - Resource wastage and system delays, limiting scalability and user 
experience.
   
   To address these limitations, we propose introducing an event-triggered 
optimization mechanism that reduces unnecessary scans and enhances evaluation 
efficiency, while ensuring timely optimizations.
   
   ### Use case/motivation
   
   - Deployments with tens of thousands of Iceberg tables suffer from frequent 
full-table refresh scans, leading to high resource usage and wasted computation.
   - Users need timely optimizations without the overhead of redundant scans.
   - For real-time scenarios, latency-sensitive workloads cannot tolerate 
inefficient refresh mechanisms.
   - The current design struggles to balance timeliness against resource 
utilization.
   
   By moving from periodic full-refresh scans to event-driven triggers, Amoro 
can:
   
   1. Reduce full-table scans, avoiding redundant work.
   2. Increase the effective scan ratio, so that a larger share of scans leads 
to actual optimizations.
   3. Improve file merge efficiency, maximizing the effect of each scan.
   4. Ensure timely optimization, even while lowering scan frequency.
   
   ### Describe the solution
   
   To address pain points in Amoro's periodic optimization refresh mechanism, 
we propose an event-triggered optimization mechanism:
   
   - Use loaded table metadata and metrics to trigger scan evaluations, 
reducing redundant scans. 
   - Replace periodic table loading with Iceberg commit-driven triggers.
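
The commit-driven idea can be sketched as follows. This is a minimal illustration in Python, not Amoro code: `CommitTriggeredScheduler`, `on_commit`, and `drain` are hypothetical names, and the table names and snapshot ids are made up. The point it shows is that only tables with a reported commit are queued for evaluation, and repeated commits on one table collapse into a single pending evaluation.

```python
from collections import OrderedDict


class CommitTriggeredScheduler:
    """Hypothetical scheduler: queues a table only after a commit is reported."""

    def __init__(self):
        # table name -> latest reported snapshot id, in order of first report
        self._pending = OrderedDict()

    def on_commit(self, table, snapshot_id):
        # A newer commit on an already queued table replaces the snapshot id
        # but does not add a second evaluation for that table.
        self._pending[table] = snapshot_id

    def drain(self):
        """Return the (table, snapshot_id) pairs to evaluate and clear the queue."""
        batch = list(self._pending.items())
        self._pending.clear()
        return batch


scheduler = CommitTriggeredScheduler()
scheduler.on_commit("db.orders", 101)
scheduler.on_commit("db.orders", 102)  # supersedes snapshot 101
scheduler.on_commit("db.users", 7)
batch = scheduler.drain()  # tables without commits never appear here
```

With a periodic refresh, every registered table would be loaded on each cycle; here the work per cycle is bounded by the number of tables that actually changed.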
   
   ### Subtasks
   
   Metadata Metric-Driven Evaluation
   - [ ] Add support for a pluggable refresh event in TableRuntimeRefreshExecutor 
and DefaultRefreshEvent
   - [ ] Add support for MSE-based refresh event: calculate partition MSE from 
loaded metadata and filter partitions based on threshold.
   - [ ] Documentation updates.
   - [ ] (optional) Bind each computed pendingInput result to its snapshotId to 
avoid repeated scans while the table is in the pending state.
   - [ ] (optional) Cache storage for MSE calculation results.
   - [ ] (optional) Cache storage for pendingInput queue (e.g., when >100 
partitions need optimization, enable queued optimization).
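
A rough sketch of the MSE-based filter is given below. The issue does not define the metric, so this assumes MSE means the mean squared deviation of each file's size from a target file size, normalized by that target. `PartitionFiles`, `partition_mse`, `select_partitions`, the 128 MiB target, and the 0.5 threshold are all hypothetical choices for illustration, and files larger than the target are treated the same way as smaller ones.

```python
from dataclasses import dataclass
from typing import List

MIB = 1024 * 1024


@dataclass
class PartitionFiles:
    partition: str
    file_sizes: List[int]  # bytes, as read from already loaded metadata


def partition_mse(file_sizes, target_size):
    """Assumed metric: mean of ((target - size) / target)^2 over the files."""
    if not file_sizes:
        return 0.0
    total = sum(((target_size - s) / target_size) ** 2 for s in file_sizes)
    return total / len(file_sizes)


def select_partitions(partitions, target_size, threshold):
    """Keep only partitions whose MSE reaches the threshold."""
    return [
        p.partition
        for p in partitions
        if partition_mse(p.file_sizes, target_size) >= threshold
    ]


partitions = [
    PartitionFiles("dt=2024-01-01", [128 * MIB, 128 * MIB]),  # already well sized
    PartitionFiles("dt=2024-01-02", [1 * MIB] * 4),           # many small files
]
selected = select_partitions(partitions, target_size=128 * MIB, threshold=0.5)
```

Because the metric is computed from metadata that is already loaded, partitions below the threshold can be skipped without a file scan.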
   
   Iceberg Commit-Triggered Table Runtime Refresh
   - [ ] REST endpoint for reporting external table events: define interface 
and JSON body format.
   - [ ] Add an ExternalEventReportService to trigger 
TableRuntimeRefreshExecutor for refresh & evaluation based on commit operations.
   - [ ] Catalog query service API.
   - [ ] Documentation updates.
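
One possible shape for the reported event and its handling is sketched below. The JSON body format is still to be defined by the subtask above, so the field names (`catalog`, `database`, `table`, `snapshotId`) and the `ExternalEventHandler` class are assumptions for illustration only. The `refresh` callback stands in for whatever would hand the table to TableRuntimeRefreshExecutor.

```python
import json

REQUIRED_FIELDS = ("catalog", "database", "table", "snapshotId")  # assumed names


def parse_commit_event(body):
    """Parse a reported commit event and check the assumed required fields."""
    event = json.loads(body)
    for field in REQUIRED_FIELDS:
        if field not in event:
            raise ValueError(f"missing field: {field}")
    return event


class ExternalEventHandler:
    """Hypothetical handler: triggers one refresh per new snapshot of a table."""

    def __init__(self, refresh):
        self._refresh = refresh    # callback standing in for the executor
        self._last_snapshot = {}   # table identifier -> last snapshot id handled

    def handle(self, body):
        event = parse_commit_event(body)
        key = (event["catalog"], event["database"], event["table"])
        if self._last_snapshot.get(key) == event["snapshotId"]:
            return False  # same snapshot reported again, nothing to do
        self._last_snapshot[key] = event["snapshotId"]
        self._refresh(key, event["snapshotId"])
        return True


refreshed = []
handler = ExternalEventHandler(lambda key, sid: refreshed.append((key, sid)))
body = json.dumps(
    {"catalog": "iceberg", "database": "db", "table": "orders", "snapshotId": 42}
)
first = handler.handle(body)   # new snapshot, refresh is triggered
second = handler.handle(body)  # duplicate report, skipped
```

Tracking the last handled snapshot per table also fits the optional subtask of binding evaluation results to a snapshotId, since a repeated report for the same snapshot does not cause another scan.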
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
