Hi,

A long tale of what happened next.

Back at the end of June, I wrote:

> Anyone here have an idea why this Acer Aspire Revo R3700 should have
> started chirping at random times, in clusters? I'm assuming it's
> connected to the warm weather.
>
> The noise is a very short `chirp', consistent in duration and pitch;
> more deliberately generated rather than mechanical side-effect. The
> PC has a normal `BIOS beep' speaker, but I was wondering if it could
> be from a secondary piezoelectric speaker, perhaps on the hard drive?
>
> Speaking of which, the drive is a Hitachi Travelstar 5K500.B
> HTS545016B9A300; the two-page specification I found makes no mention
> of noises. The PC has an AMI BIOS, but I wouldn't expect any of its
> code to be running once Linux has control as the kernel does it all
> itself based on ACPI tables?
>
> Recordings of `sensors' and `smartctl -x' over the days show no
> dramatic change other than background temperature pushing them all up
> a bit. Nothing unusual in the log files.
>
> I'm not alone. There are a few reports of noises like this on other
> types of PC, and several of them suggest it comes from the hard-drive
> area, e.g.
> http://www.tomshardware.co.uk/forum/287497-31-this-beep-driving-crazy

The chirps continued. They weren't getting any worse, but there was no
obvious pattern. I increased the frequency of logging the drive's SMART
data and diff(1)ing each snapshot against the previous one; still no
changes.

Then late in July I heard several chirps in quick succession at about
7.45 a.m. when the machine had been unused for hours. The log files
confirmed several cron jobs had started up at minute intervals to use
network bandwidth whilst it's still `free', before the 8 a.m. cut-off.
With systemd(1), systemd-journald(8), and Postfix's many processes,
running a simple cron command involves a lot more parallel work than in
the old days. The chirps were in response to that activity. Once I'd
cottoned on, I could see the correlation a lot of the time through the
day.

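As an aside, the snapshot comparison mentioned above (logging `smartctl -x'
and comparing against the previous run) could also be done attribute by
attribute rather than with diff(1). What follows is only a rough sketch, not
the script actually used. It assumes the attribute table has the columns ID,
name, flags, value, worst, threshold, fail, raw, in that order; the function
names and the sample rows are made up for illustration.

```python
def parse_attributes(text):
    """Map SMART attribute ID to (name, raw value) for attribute-table lines."""
    attrs = {}
    for line in text.splitlines():
        # Assumed columns: ID NAME FLAGS VALUE WORST THRESH FAIL RAW.
        # The raw value may contain spaces, so split at most seven times.
        fields = line.split(None, 7)
        if (len(fields) == 8 and fields[0].isdigit()
                and all(f.isdigit() for f in fields[3:6])):
            attrs[int(fields[0])] = (fields[1], fields[7].strip())
    return attrs


def changed_raw_values(old_text, new_text):
    """List (id, name, old raw, new raw) for attributes whose raw value moved."""
    old = parse_attributes(old_text)
    new = parse_attributes(new_text)
    return [(i, new[i][0], old[i][1], new[i][1])
            for i in sorted(old.keys() & new.keys())
            if old[i][1] != new[i][1]]


# Made-up sample rows for illustration only, not taken from the real logs.
OLD = """\
  5 Reallocated_Sector_Ct PO--CK 100 100 005 - 0
194 Temperature_Celsius -O---K 146 146 000 - 41 (Min/Max 15/52)
"""
NEW = """\
  5 Reallocated_Sector_Ct PO--CK 100 100 005 - 0
194 Temperature_Celsius -O---K 139 139 000 - 43 (Min/Max 15/52)
"""

for attr_id, name, before, after in changed_raw_values(OLD, NEW):
    print(f"{attr_id:3d} {name}: {before} -> {after}")
```

Comparing only the raw values ignores lines that always differ between
snapshots, such as the local time, which is the noise diff(1) otherwise
reports on every run.
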
Seeking alone wasn't the cause, since the drive does that a lot; rather
it was lots of seeking, probably to distant tracks, close together in
time. I thought it more likely the drive had a speaker and was making
the noise.

Wanting to remove the warmer weather from the variables, I used some
spudgers to get the PC's case open for the first time in seven years and
blew out all the dust.

    https://en.wikipedia.org/wiki/Spudger
    https://amzn.to/2Ovr4EU

This worked: the motherboard's and drive's temperature readings all
dropped a few degrees centigrade. And the chirp changed too: it was
slightly louder and sharper, less muffled. But its occurrences
continued.

Now that I'd managed to open the case, I started looking into a
replacement drive. And then this happened.

    $ diff sda.smart.2018-08-01_08* sda.smart.2018-08-01_19*
    ...
    < Local Time is: Wed Aug 1 08:32:33 2018 BST
    ---
    > Local Time is: Wed Aug 1 19:29:59 2018 BST
    ...
    < 196 Reallocated_Event_Count -O--CK 100 100 000 - 0
    ---
    > 196 Reallocated_Event_Count -O--CK 100 100 000 - 2

The Reallocated_Sector_Ct (count) raw value remained 0.

    $ grep Realloc sda.smart.2018-08-01_19*
      5 Reallocated_Sector_Ct PO--CK 100 100 005 - 0
    196 Reallocated_Event_Count -O--CK 100 100 000 - 2
    $

My reading of IDs 5 and 196 at
https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes,

    05 Reallocated Sectors Count = 0
        Count of reallocated sectors. The raw value represents a count
        of the bad sectors that have been found and remapped. Thus, the
        higher the attribute value, the more sectors the drive has had
        to reallocate. This value is primarily used as a metric of the
        life expectancy of the drive; a drive which has had any
        reallocations at all is significantly more likely to fail in
        the immediate months.

    196 Reallocation Event Count = 2
        Count of remap operations. The raw value of this attribute
        shows the total count of attempts to transfer data from
        reallocated sectors to a spare area. Both successful and
        unsuccessful attempts are counted.

suggests that, as the sector count is still 0 while the event count has
risen to 2, two *unsuccessful* attempts were made to remap a sector or
two.

Now wanting to replace the drive urgently, whether or not it was the
source of the chirping, I got a nice 500 GB 2.5" SATA SSD by Crucial,
the consumer brand of its manufacturer, Micron.

    https://amzn.to/2MKPSbO

That size was the cheapest per byte of their MX500 range. Amazon
delivered it and a USB adapter the next day, Sunday.

In the meantime, I brought forward the weekly `scrub' of the drive,
reading it all the way through from beginning to end whilst doing little
else. That took its normal hour and had no errors. Significantly,
there were no chirps during that time, unusual by then. This made it
more likely the drive was the source. The only other cause I could
think of was that the drive's sustained activity was causing some
motherboard power-supply issue to abate, but Occam's Razor...

I booted from an Ubuntu 18.04 USB stick to copy the PC's original
spinning-rust drive to the SSD over USB 2.0 with dd(1). 75% through the
copy, the X desktop annoyingly froze. That's OK, I thought, the copy
will still be working away. Then Ubuntu decided to helpfully log that X
session out and auto-login me to a new one. Not only did this kill the
dd, as I'd seen no need to start it in screen(1), but Ubuntu
auto-mounted the filesystems it could see on the new drive, even though
one was incomplete, modifying their bits along the way because there's a
`mount count', etc. This nobbled my intention of comparing the original
and the copy bit-for-bit at the end as verification.

I didn't want to start the copy from scratch, starting off a shiny SSD's
life by writing tens of GiB only to do it again, so I used cmp(1)'s `-l'
option to list the locations of byte differences between the two drives.
The physical sector containing each offset was then selectively copied
across again with dd(1). This evolved to feed cmp's output into awk to
print sector numbers, and each sector only once, in a script called
`./cmp startsector'.
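
For anyone wanting to try the same trick, the offset-to-sector step can be
sketched as below. This is an illustration in Python rather than the awk
actually used, and it rests on two assumptions: that each line of `cmp -l'
output starts with a byte number counted from 1, followed by the two
differing byte values, and that sectors are 512 bytes. The function name and
the sample lines are invented for the example.

```python
def differing_sectors(cmp_lines, sector_size=512):
    """Return sorted, unique, 0-based sector numbers from `cmp -l` lines."""
    sectors = set()
    for line in cmp_lines:
        fields = line.split()
        if not fields:
            continue  # Skip blank lines.
        offset = int(fields[0])  # cmp -l numbers bytes from 1, not 0.
        sectors.add((offset - 1) // sector_size)
    return sorted(sectors)


# Invented sample: byte number, then the two byte values in octal.
sample = [
    "   1 101 102",
    " 512   0 377",
    " 513 141 142",
    "1025  60  61",
    "1030  60  62",
]
print(differing_sectors(sample))
```

Using a set means a sector with many differing bytes is reported only once,
which is the `each sector only once' behaviour described above. The
subtraction of one matters at sector boundaries: byte 512 is the last byte
of sector 0, not the first of sector 1.
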
And the selective dd became `./fix firstsector [n]' where `n' was the
number of sectors to copy and defaulted to one. After gaining
confidence in these, the awk changed once more to count the run of
consecutive sectors and alter its output format to be
`./fix sector n # Comment with more detail'. Then I did `./cmp | ./fix'
to patch up all the remaining differences. There weren't lots, but they
were scattered.

After that, the normal dd was restarted where it left off. The detour
took about an hour, but that was less than waiting for a duplicate copy,
and I thought that if it crashed out again I'd be able to recover and
continue a second time. None of the runs of dd erred whilst reading the
rust.

At the end, I b2sum(1)'d both drives in parallel and they had the same
digest, so the underlying copies and my patchwork had pieced together
seamlessly.

I got the case open a second time and swapped the drives; that meant
removing the motherboard because the drive's screws come up through the
mobo from underneath.

As expected, disk activity that involved a lot of seeking before is now
much quicker, e.g. searching an email folder of 30,000 emails, one per
file. And not a chirp in earshot.

Crucial are going to send details of the SSD's wattage so I can compare
it with the rust's, just out of interest. It's bound to be lower,
especially because I configured the rust to never spin down for power
saving, preferring a steady state for wear and tear.

The rust did well; SMART's Power_On_Hours says

    $ units 63377hours time
            7 year + 84 day + 18 min + 38.177251 sec

The plan now is to get another of those SSDs and hook it up over USB
once a day to mirror the first to it with rsync(1). Then a new machine
will start with the second one re-formatted as one half of an md(4)
RAID1 or RAID10 pair, the other declared `missing'. Atop that it will
mainly be a LUKS dm-crypt encrypted block device. Copy the old PC's
data into it, then move the first SSD to fill in the `missing' slot of
the RAID pair.
A third, external, large spinning-rust drive will be added to the mirror
and be present occasionally for syncing.

Thanks to AndrewM and twaugh on #dorset IRC for putting up with the
running commentary, and answering questions on md and dm-crypt for the
next stages.

Cheers, Ralph.

-- 
Next meeting: Bournemouth, Tuesday, 2018-09-04 20:00
Meets, Mailing list, IRC, LinkedIn, ... http://dorset.lug.org.uk/
New thread: mailto:dorset@mailman.lug.org.uk / CHECK IF YOU'RE REPLYING
Reporting bugs well: http://goo.gl/4Xue / TO THE LIST OR THE AUTHOR