On Sunday, February 28 (Japan time), the ATM network of Mizuho Bank, Japan's third largest bank, suffered a massive service outage. More than 4,000 machines stopped working. In more than 5,000 cases, an affected machine stopped processing after taking a customer's cash card or bankbook, leaving the customer unable to retrieve it. Problems persisted into the following Monday.
Mizuho Bank explained that an internal data relocation process involving 700,000 savings accounts overloaded the entire system and affected ATM cash withdrawals. The bank's chairman apologized for failing to anticipate the effect of the procedure on the system.

https://www.japantimes.co.jp/news/2021/02/28/business/mizuho-cards-banks/
https://www3.nhk.or.jp/nhkworld/en/news/20210301_31/

---

A comment on a Japanese-language online article about the snafu caught my attention. It was posted by an engineer who has been maintaining web servers for financial organizations for nearly twenty years. He (or she) does not work for Mizuho Bank and admits he can only speculate about what actually happened, but his story sounds familiar. I'd like to share the tale with you through a rough translation.

Some companies do not hold their system administration and maintenance sections in high regard. When engineers do their job right, there are no problems. Management sees this, erroneously concludes that the maintenance crew is "doing nothing" worthy of their expensive salaries, and decides to axe them.

The nasty part of this is that problems won't surface for a while, thanks to the fine job done by the former crew. Problems may occur, but they can be dealt with using makeshift solutions improvised by less qualified engineers. At this point it appears that management did the right thing by laying off highly paid specialists.

Each fix may seem small, but in the aggregate they turn the system into chaos. At some point a "last straw" is thrown upon the pile and breaks the mule's back. The system breaks down in a calamity affecting a large number of clients. Management may hire expert engineers to deal with the emergency, but with the system in such a mess even they can't figure things out.

---

It was an episode like this that spurred me to look for better ways to do things, and eventually I noticed free software.
Experience tells me that a less qualified crew like the one described above is not likely to keep proper records of the changes it makes. When the emergency team arrives, the lack of documentation becomes a serious obstacle. Paradoxically, this puts the less qualified crew who made the mess in an advantageous position. The emergency team members have to beg for the information they need; they can't say or do anything that would offend the existing crew, lest the crew stop providing it. Indeed, those who caused the mess may even argue that the newly hired hands are technically inferior, with claims such as "those outside guys do not know real systems."