On Thu, Jan 11, 2024 at 03:25:51PM -0500, Stefan Monnier wrote:
>> manufacturers in different memory banks, but since it's always
>> possible to power down, replace or just remove memory, and power
>> up again,
>
> Hmm... "always"? What about long-running computations like that
> simulation (or LLM training) launched a month ago and that's expected to
> finish in another month or so?

I'd expect something like that to have a checkpoint/restart capability
if not starting over actually matters.
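
To make "checkpoint/restart" concrete, here is a minimal sketch in Python,
not taken from any particular simulation or training framework. The names
(save_checkpoint, load_checkpoint, run), the JSON file format and the
stop_after parameter (which only simulates an abrupt power-down) are all
assumptions made for illustration. The idea is that the job periodically
writes its state to disk atomically, and on startup resumes from the last
saved state, so a power-down costs at most one checkpoint interval of work.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data is on disk before renaming
    os.replace(tmp, path)     # readers see either the old or the new file


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Toy long-running job: sums 0..n_steps-1, resuming from a checkpoint.

    stop_after (hypothetical, for demonstration only) simulates a power-down
    after that many steps; progress since the last checkpoint is lost.
    """
    state = load_checkpoint(path)
    done = 0
    while state["step"] < n_steps:
        if stop_after is not None and done >= stop_after:
            return state  # simulated power-down: nothing is saved here
        state["total"] += state["step"]
        state["step"] += 1
        done += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling run() a second time with the same path after an interruption picks
up from the last checkpoint rather than from step 0. A real job would
checkpoint far larger state and choose the interval by weighing checkpoint
cost against the work lost on a restart.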

> Some mainframes have supported hot (un)plugging RAM modules as well

Yes, mainframes have been engineered that way for a long time. It makes
them very expensive, and their market share has been declining for
decades because most problems can be solved more cheaply in software
(even while maintaining high availability). Hot *spare* memory is
relatively common, as it solves most problems without the complexity of
hot *swapping*, at the (generally low) cost of having to schedule
downtime at some point in the future to actually replace the failed
module.