janniklinde opened a new pull request, #2387: URL: https://github.com/apache/systemds/pull/2387
This PR depends on #2368.

This patch provides a major rework of the `OOCEvictionManager` that separates cache scheduling logic from I/O handling. The rework is needed to:

1. support (multi-)tile pinning: even with an LRU cache, `NullPointerException`s can occur under memory pressure, especially when multiple blocks must be resident in the cache simultaneously;
2. give better cache limit guarantees when evicted blocks are read in parallel, to prevent OOM;
3. improve I/O performance;
4. run I/O tasks in their own thread pool so that reads do not block compute tasks.

Further, we introduce detailed out-of-core statistics and a fine-grained event log that can be exported to CSV using the CLI options `-oocStats [topNHeavyHitters]` and `-oocLogEvents [savedir]`.

The event log can be visualized to identify bottlenecks (see image below; performance of PCA on a 1Mx1000 input matrix). Detailed information on the experiment can be found on . The bottom graph shows compute tasks and idle times of the fixed-size `ThreadPool`. The y-axis of the bottom three graphs shows the thread ID of the worker performing the read/write/compute tasks.

<img width="2560" height="1463" alt="monitor" src="https://github.com/user-attachments/assets/8ce838b5-b17f-4596-ac5c-82b547c530f4" />

Currently, it is still possible to exceed the hard limits of the cache because of uncontrolled producers that are not yet unified with the cache system (e.g., `ReblockOOCInstruction`).
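To illustrate why pinning matters for an LRU cache with a hard limit, here is a minimal sketch of a pin-aware cache. It is not the code from this PR: the class `PinnedLruCacheSketch` and all of its methods are hypothetical names chosen for illustration, block payloads are reduced to a size in bytes, and a real implementation would spill evicted blocks to disk in a separate I/O thread pool rather than just recording their keys.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: byte-limited LRU cache that never evicts pinned blocks. */
class PinnedLruCacheSketch {
    private static final class Entry {
        final long size;
        int pins; // number of active pins; > 0 means the block is not evictable
        Entry(long size) { this.size = size; }
    }

    private final long limitBytes;
    private long usedBytes = 0;
    // accessOrder = true: iteration runs from least to most recently used
    private final LinkedHashMap<String, Entry> entries =
        new LinkedHashMap<>(16, 0.75f, true);
    // Keys that were evicted; stands in for the spill-to-disk step
    private final List<String> evicted = new ArrayList<>();

    PinnedLruCacheSketch(long limitBytes) { this.limitBytes = limitBytes; }

    /** Adds a block, evicting unpinned LRU blocks first; false if it cannot fit. */
    synchronized boolean put(String key, long size) {
        if (entries.containsKey(key)) return true;
        if (!makeRoom(size)) return false; // caller has to wait or retry
        entries.put(key, new Entry(size));
        usedBytes += size;
        return true;
    }

    /** Pins all keys or none, so a multi-block operation sees every input resident. */
    synchronized boolean pinAll(List<String> keys) {
        for (String k : keys)
            if (!entries.containsKey(k)) return false;
        for (String k : keys)
            entries.get(k).pins++; // get() also refreshes recency
        return true;
    }

    synchronized void unpinAll(List<String> keys) {
        for (String k : keys) {
            Entry e = entries.get(k);
            if (e != null && e.pins > 0) e.pins--;
        }
    }

    private boolean makeRoom(long size) {
        // Feasibility check first, so nothing is evicted for a request that cannot fit
        long reclaimable = 0;
        for (Entry e : entries.values())
            if (e.pins == 0) reclaimable += e.size;
        if (usedBytes - reclaimable + size > limitBytes) return false;

        Iterator<Map.Entry<String, Entry>> it = entries.entrySet().iterator();
        while (usedBytes + size > limitBytes && it.hasNext()) {
            Map.Entry<String, Entry> me = it.next();
            if (me.getValue().pins > 0) continue; // skip pinned blocks
            usedBytes -= me.getValue().size;
            evicted.add(me.getKey());
            it.remove();
        }
        return true;
    }

    synchronized boolean contains(String key) { return entries.containsKey(key); }
    synchronized long usedBytes() { return usedBytes; }
    synchronized List<String> evictedKeys() { return new ArrayList<>(evicted); }
}
```

In this sketch, `put` refuses a block instead of exceeding the limit when all resident blocks are pinned, and `pinAll` is all-or-nothing so that an operation needing several blocks never starts with only part of its inputs resident.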
