wu-sheng commented on issue #13634:
URL: https://github.com/apache/skywalking/issues/13634#issuecomment-3707851552
## 🧠 Additional Proposal: Metrics-Driven Sampling and Caching Design
### **Design Intent**
To improve the precision and adaptability of trace sampling, BanyanDB should
enable the trace pipeline to make decisions based on live system and service
performance metrics.
By aligning sampling logic with real-time latency, error characteristics,
and resource utilization, the system can intelligently balance observability
quality and data volume.
This approach moves the system from rigid, threshold-based sampling toward a
**context-aware, self-adjusting mechanism** that continuously reflects the
actual operating conditions of distributed systems.
---
### **Core Design Tenets**
1. **Metrics-Centric Sampling Logic**
Introduce a metrics-aware decision layer within the trace pipeline.
Sampling will factor in telemetry data such as latency percentiles, throughput,
and failure ratios.
This enables the database to prioritize traces that highlight critical
performance states or anomalies rather than relying on static numeric
thresholds (a rough sketch of such a decision hook follows this list).
2. **Unified Metrics Access Layer**
Incorporate a unified layer for reading aggregated metrics that can serve
multiple purposes within the pipeline.
This layer provides a consistent interface for accessing metric data, whether
it originates from BanyanDB’s internal metric storage or external observability
systems, ensuring flexibility and coherence in decision-making (see the
access-layer sketch after this list).
3. **Caching**
Introduce a caching mechanism above the metrics access layer to store
recently used or frequently needed metric values (see the cache sketch after
this list). This is different from the flagged
The cache will maintain lightweight persistence across trace-processing
batches, improving performance and stability while ensuring the data remains
sufficiently fresh for high-throughput, streaming environments.
4. **Adaptive Sampling Policy**
Shift from static threshold-based policies to **context-aware evaluators**
that can respond to live performance data (see the evaluator sketch after this list).
For instance, the policy might choose to:
- Retain traces whose latency places them among the current P95 or P99 outliers.
- Emphasize the slowest or most error-prone API endpoints.
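
---
### **Illustrative Sketches**
The Go sketches below are rough, non-normative illustrations of the four tenets; every package, type, and function name in them is hypothetical and does not refer to existing BanyanDB or SkyWalking code.

For tenet 1, the decision layer can be modeled as a small hook that sees both the trace and the live aggregates of its service, so the keep/drop choice is driven by current conditions rather than a fixed threshold:

```go
// Hypothetical sketch of a metrics-aware decision hook; none of these names
// exist in BanyanDB today.
package sampling

import "context"

// TraceSummary is the per-trace information the pipeline already has when the
// sampling decision is made.
type TraceSummary struct {
	Service       string
	LatencyMillis float64
	HasError      bool
}

// ServiceMetrics is a live aggregate snapshot for one service.
type ServiceMetrics struct {
	P99LatencyMillis float64
	Throughput       float64 // calls per second
	ErrorRatio       float64 // 0.0 - 1.0
}

// MetricsProvider is the unified metrics access layer from tenet 2.
type MetricsProvider interface {
	MetricsFor(ctx context.Context, service string) (ServiceMetrics, error)
}

// Decider is the metrics-centric decision layer: it sees the trace and the
// current state of its service, not a fixed numeric threshold.
type Decider func(t TraceSummary, m ServiceMetrics) bool

// FilterBatch keeps only the traces the decider selects, and conservatively
// keeps a trace when its service metrics cannot be read.
func FilterBatch(ctx context.Context, mp MetricsProvider, decide Decider, batch []TraceSummary) []TraceSummary {
	kept := make([]TraceSummary, 0, len(batch))
	for _, t := range batch {
		m, err := mp.MetricsFor(ctx, t.Service)
		if err != nil {
			kept = append(kept, t)
			continue
		}
		if decide(t, m) {
			kept = append(kept, t)
		}
	}
	return kept
}
```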
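
For tenet 2, the unified metrics access layer can be a single read interface with interchangeable backends; a chained provider is one possible way to prefer BanyanDB's internal measures while falling back to an external source:

```go
// Hypothetical sketch of the unified metrics access layer: one read interface,
// interchangeable backends.
package metrics

import (
	"context"
	"errors"
)

// ServiceMetrics is an illustrative aggregate snapshot read by the sampler.
type ServiceMetrics struct {
	P99LatencyMillis float64
	Throughput       float64
	ErrorRatio       float64
}

// Provider is the only interface the trace pipeline depends on, whether the
// values come from BanyanDB's own measure storage or an external system.
type Provider interface {
	MetricsFor(ctx context.Context, service string) (ServiceMetrics, error)
}

// chainProvider queries each backend in order and returns the first answer,
// so internal storage can be preferred with an external source as a fallback.
type chainProvider struct {
	backends []Provider
}

func NewChainProvider(backends ...Provider) Provider {
	return &chainProvider{backends: backends}
}

func (c *chainProvider) MetricsFor(ctx context.Context, service string) (ServiceMetrics, error) {
	var lastErr error
	for _, b := range c.backends {
		m, err := b.MetricsFor(ctx, service)
		if err == nil {
			return m, nil
		}
		lastErr = err
	}
	if lastErr == nil {
		lastErr = errors.New("no metrics backends configured")
	}
	return ServiceMetrics{}, lastErr
}
```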
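
For tenet 3, the cache can sit directly above that interface as a decorating provider with a TTL, so trace-processing batches reuse recent aggregates instead of re-reading storage on every decision (the types are repeated here so the snippet stands alone):

```go
// Hypothetical sketch of a TTL cache decorating the metrics access layer.
package metrics

import (
	"context"
	"sync"
	"time"
)

// ServiceMetrics and Provider mirror the access-layer sketch above.
type ServiceMetrics struct {
	P99LatencyMillis float64
	ErrorRatio       float64
}

type Provider interface {
	MetricsFor(ctx context.Context, service string) (ServiceMetrics, error)
}

type cacheEntry struct {
	value     ServiceMetrics
	expiresAt time.Time
}

// CachedProvider keeps recently read aggregates across trace-processing
// batches and re-reads a service's metrics only after the TTL elapses.
type CachedProvider struct {
	backend Provider
	ttl     time.Duration

	mu      sync.Mutex
	entries map[string]cacheEntry
}

func NewCachedProvider(backend Provider, ttl time.Duration) *CachedProvider {
	return &CachedProvider{backend: backend, ttl: ttl, entries: make(map[string]cacheEntry)}
}

func (c *CachedProvider) MetricsFor(ctx context.Context, service string) (ServiceMetrics, error) {
	c.mu.Lock()
	if e, ok := c.entries[service]; ok && time.Now().Before(e.expiresAt) {
		c.mu.Unlock()
		return e.value, nil // still fresh, no backend read
	}
	c.mu.Unlock()

	// Miss or stale entry: read through to the backend and refresh the entry.
	m, err := c.backend.MetricsFor(ctx, service)
	if err != nil {
		return ServiceMetrics{}, err
	}
	c.mu.Lock()
	c.entries[service] = cacheEntry{value: m, expiresAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return m, nil
}
```

The TTL is the freshness knob: short enough to track the live state of a high-throughput stream, long enough to amortize reads across batches.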
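
For tenet 4, the adaptive policy can be expressed as small, composable evaluators over the same live aggregates; the two evaluators and the 5% error-ratio knob below are only placeholders:

```go
// Hypothetical sketch of composable, context-aware sampling evaluators.
package sampling

// TraceSummary and EndpointMetrics are illustrative inputs; in practice they
// would come from the pipeline and the metrics access layer respectively.
type TraceSummary struct {
	Endpoint      string
	LatencyMillis float64
	HasError      bool
}

type EndpointMetrics struct {
	P95LatencyMillis float64
	P99LatencyMillis float64
	ErrorRatio       float64
}

// Evaluator returns true when a trace should be retained.
type Evaluator func(t TraceSummary, m EndpointMetrics) bool

// LatencyOutlier retains traces at or beyond the endpoint's live P95, i.e. the
// traces that explain the current tail latency.
func LatencyOutlier(t TraceSummary, m EndpointMetrics) bool {
	return t.LatencyMillis >= m.P95LatencyMillis
}

// ErrorHotspot emphasizes endpoints whose current failure ratio is elevated by
// keeping their error traces; the 5% knob is purely illustrative.
func ErrorHotspot(t TraceSummary, m EndpointMetrics) bool {
	return t.HasError && m.ErrorRatio >= 0.05
}

// AnyOf retains a trace when at least one evaluator selects it, so policies
// can be layered or swapped without touching the pipeline itself.
func AnyOf(evaluators ...Evaluator) Evaluator {
	return func(t TraceSummary, m EndpointMetrics) bool {
		for _, e := range evaluators {
			if e(t, m) {
				return true
			}
		}
		return false
	}
}
```

A policy matching the two bullets above would then be `AnyOf(LatencyOutlier, ErrorHotspot)`.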