janniklinde opened a new pull request, #2387:
URL: https://github.com/apache/systemds/pull/2387

   This PR depends on #2368.
   
   This patch reworks the `OOCEvictionManager` substantially by separating 
cache scheduling logic from I/O handling. The rework is needed
   
   1. to support (multi-)tile pinning: even with an LRU cache, 
`NullPointerException`s can occur under memory pressure, especially when 
multiple blocks must be resident in the cache simultaneously
   2. to give better cache limit guarantees when evicted blocks are read back 
in parallel, in order to prevent OOM
   3. to improve I/O performance
   4. to run I/O tasks in their own thread pool so that reads do not block 
compute tasks.
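
   To illustrate why pinning matters (item 1), here is a minimal sketch of an 
LRU cache that skips pinned blocks during eviction. It is not the 
`OOCEvictionManager` implementation from this PR: the class 
`PinnedLruCacheSketch`, its methods, and the block-count capacity are all 
hypothetical names and simplifications chosen for the example, and I/O is 
replaced by a list that records evicted keys.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of pin-aware LRU eviction; not the SystemDS API. */
class PinnedLruCacheSketch {
    private static final class Entry {
        final double[] data;
        int pins; // number of consumers that currently need this block resident
        Entry(double[] data) { this.data = data; }
    }

    private final int capacity; // assumed limit: max number of resident blocks
    // accessOrder=true: iteration runs from least to most recently used
    private final LinkedHashMap<String, Entry> resident =
        new LinkedHashMap<>(16, 0.75f, true);
    // stands in for the write-to-disk step of a real eviction
    private final List<String> evicted = new ArrayList<>();

    PinnedLruCacheSketch(int capacity) { this.capacity = capacity; }

    synchronized void put(String key, double[] data) {
        resident.put(key, new Entry(data));
        evictIfNeeded();
    }

    /** Returns the block and protects it from eviction until unpin is called. */
    synchronized double[] pin(String key) {
        Entry e = resident.get(key);
        if (e == null)
            return null; // a real cache would schedule a read here
        e.pins++;
        return e.data;
    }

    synchronized void unpin(String key) {
        Entry e = resident.get(key);
        if (e != null && e.pins > 0)
            e.pins--;
        evictIfNeeded();
    }

    synchronized boolean isResident(String key) { return resident.containsKey(key); }

    synchronized List<String> evictedKeys() { return new ArrayList<>(evicted); }

    // Walk from LRU to MRU and evict unpinned blocks until within capacity.
    // If every candidate is pinned, the cache stays over capacity.
    private void evictIfNeeded() {
        Iterator<Map.Entry<String, Entry>> it = resident.entrySet().iterator();
        while (resident.size() > capacity && it.hasNext()) {
            Map.Entry<String, Entry> e = it.next();
            if (e.getValue().pins == 0) {
                evicted.add(e.getKey());
                it.remove();
            }
        }
    }
}
```

   With a capacity of two blocks, pinning `A` and then inserting a third block 
evicts the unpinned block `B` instead, so a consumer holding several pinned 
blocks never sees one of them disappear mid-operation. A plain LRU cache gives 
no such guarantee, which is the failure mode described in item 1.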
   
   Further, we introduce detailed out-of-core statistics and a fine-grained 
event log that can be exported to CSV using the CLI options `-oocStats 
[topNHeavyHitters]` and `-oocLogEvents [savedir]`. The event log can be 
visualized to identify bottlenecks (see the image below, which shows the 
performance of PCA on a 1Mx1000 input matrix). Detailed information about the 
experiment can be found in 
[this repo](https://github.com/janniklinde/OOCExperiments). The bottom graph 
shows compute tasks and idle times of the fixed-size `ThreadPool`. The y-axis 
of the bottom three graphs shows the thread ID of the worker performing the 
read/write/compute tasks.
   
   <img width="2560" height="1463" alt="monitor" 
src="https://github.com/user-attachments/assets/8ce838b5-b17f-4596-ac5c-82b547c530f4" 
/>
   
   Currently, it is still possible to exceed hard limits of the cache because 
of uncontrolled producers that are not yet unified with the cache system (e.g., 
`ReblockOOCInstruction`).

