On Thu, Jan 11, 2024 at 03:25:51PM -0500, Stefan Monnier wrote:
manufacturers in different memory banks, but since it's always
possible to power down, replace or just remove memory, and power
up again,

Hmm... "always"?  What about long running computations like that
simulation (or LLM training) launched a month ago and that's expected to
finish in another month or so?

I'd expect something like that to have a checkpoint/restart capability, if avoiding a restart from scratch actually matters.
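
To illustrate what I mean by checkpoint/restart, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not how any particular simulation or training framework does it: the function names, the JSON state format, and the `stop_after` parameter (which just simulates a power-down partway through) are all made up for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    # Write to a temp file in the same directory, then rename it over the
    # old checkpoint, so a crash mid-write never leaves a truncated file.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    # No checkpoint yet means we start from the beginning.
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    # Hypothetical long-running job: sums the step numbers 0..n_steps-1.
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated power-down: work since last save is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the machine is powered down for a memory swap, calling `run()` again with the same path resumes from the last saved step, so at most `checkpoint_every` steps of work are repeated.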

Some mainframes have supported hot (un)plugging RAM modules as well

Yes, mainframes have been engineered that way for a long time. It makes them very expensive, and their market share has been declining for decades because most problems can be solved more cheaply in software (even while maintaining high availability). Hot *spare* memory is relatively common, as it solves most problems without the complexity of hot *swapping*, at the (generally low) cost of having to schedule downtime at some point in the future to actually replace the failed module.
