Hi,

A long tale of what happened next.

Back at the end of June, I wrote:
> Anyone here have an idea why this Acer Aspire Revo R3700 should have
> started chirping at random times, in clusters?  I'm assuming it's
> connected to the warm weather.
>
> The noise is a very short `chirp', consistent in duration and pitch;
> more deliberately generated rather than mechanical side-effect.  The
> PC has a normal `BIOS beep' speaker, but I was wondering if it could
> be from a secondary piezoelectric speaker, perhaps on the hard drive?
>
> Speaking of which, the drive is a Hitachi Travelstar 5K500.B
> HTS545016B9A300;  the two-page specification I found makes no mention
> of noises.  The PC has an AMI BIOS, but I wouldn't expect any of its
> code to be running once Linux has control as the kernel does it all
> itself based on ACPI tables?
>
> Recordings of `sensors' and `smartctl -x' over the days show no
> dramatic change other than background temperature pushing them all up
> a bit.  Nothing unusual in the log files.
>
> I'm not alone.  There are a few reports of noises like this on other
> types of PC, and several of them suggest it comes from the hard-drive
> area, e.g.
> http://www.tomshardware.co.uk/forum/287497-31-this-beep-driving-crazy

The chirps continued.  Not getting any worse, but no obvious pattern.
I increased the frequency of logging the drive's SMART data and
diff(1)ing each snapshot against the previous one;  still no changes.
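
The logging itself is nothing clever;  something like this from cron,
give or take the directory, schedule, and timestamp format, is all it
is:

    #!/bin/sh
    # Hypothetical hourly cron job:  snapshot the SMART data and diff
    # the new snapshot against the previous one.
    dir=$HOME/smart
    new=$dir/sda.smart.$(date +%Y-%m-%d_%H%M)
    smartctl -x /dev/sda >"$new"
    prev=$(ls "$dir"/sda.smart.* | tail -n 2 | head -n 1)
    [ "$prev" != "$new" ] && diff "$prev" "$new"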

Then late in July I heard several chirps in quick succession at about
7.45 a.m. when the machine had been unused for hours.  The log files
confirmed several cron jobs had started up at minute intervals to use
network bandwidth whilst it was still `free', before the 8 a.m.
cut-off.  With systemd(1), systemd-journald(8), and Postfix's many
processes, running a simple cron command involves a lot more parallel
work than in the old days.  The chirps were in response to that burst
of activity.
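
For illustration, a crontab(5) entry of that sort looks something like
this;  the command is made up, the point is the minute-by-minute
schedule squeezed in before the cut-off:

    # Hypothetical entry:  run every minute from 07:55 to 07:59,
    # grabbing downloads in the off-peak window before 8 a.m.
    55-59 7 * * *   /usr/local/bin/fetch-batch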

Once I'd cottoned on, I could see the correlation a lot of the time
through the day.  Seeking on its own wasn't the cause, the drive does
that all the time;  the trigger seemed to be lots of seeks, probably to
distant tracks, close together in time.  I thought it more likely the
drive had a speaker of its own and was making the noise deliberately.

Wanting to remove the warmer weather from the variables, I used some
spudgers to get the PC's case open for the first time in seven years and
blew out all the dust.  https://en.wikipedia.org/wiki/Spudger
https://amzn.to/2Ovr4EU

This worked: the motherboard and drive's temperature sensors all dropped
a few degrees centigrade.  And the chirp changed too: it was slightly
louder and sharper, less muffled.  But its occurrences continued.

Now that I'd managed to open the case, I started looking into a
replacement drive.  And then this happened.

    $ diff sda.smart.2018-08-01_08* sda.smart.2018-08-01_19*
    ...
    < Local Time is:    Wed Aug  1 08:32:33 2018 BST
    ---
    > Local Time is:    Wed Aug  1 19:29:59 2018 BST
    ...
    < 196 Reallocated_Event_Count -O--CK   100   100   000    -    0
    ---
    > 196 Reallocated_Event_Count -O--CK   100   100   000    -    2

The Reallocated_Sector_Ct (count) raw value remained 0.

    $ grep Realloc sda.smart.2018-08-01_19*
      5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
    196 Reallocated_Event_Count -O--CK   100   100   000    -    2
    $

My reading of IDs 5 and 196 at
https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes,

 05 Reallocated Sectors Count = 0

    Count of reallocated sectors.  The raw value represents a count of
    the bad sectors that have been found and remapped.  Thus, the higher
    the attribute value, the more sectors the drive has had to
    reallocate.  This value is primarily used as a metric of the life
    expectancy of the drive; a drive which has had any reallocations at
    all is significantly more likely to fail in the immediate months.

196 Reallocation Event Count = 2

    Count of remap operations.  The raw value of this attribute shows
    the total count of attempts to transfer data from reallocated
    sectors to a spare area.  Both successful and unsuccessful attempts
    are counted.

suggests that two *unsuccessful* attempts were made to remap a sector
or two:  the event count rose by two, yet the count of sectors actually
reallocated stayed at zero.

Now wanting to replace the drive urgently, whether it was the source of
the chirping or not, I got a nice 500 GB 2.5" SATA SSD by Crucial, the
marketing name of its manufacturer Micron.  https://amzn.to/2MKPSbO
That size gave the most bytes per £ in their MX500 range.  Amazon
delivered it and a USB adapter the next day, Sunday.

In the meantime, I brought forward the weekly `scrub' of the drive,
reading it all the way through from beginning to end whilst doing little
else.  That took its normal hour and had no errors.  Significantly,
there were no chirps during that time, which was unusual by then.  This
made it more likely the drive was the source.  The only other cause I
could think of was that the drive's high activity was causing some
motherboard power-supply issue to abate, but Occam's Razor...
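
The scrub is just a full sequential read of the device;  something like
this dd(1) to /dev/null, with the device name a placeholder, is all it
takes:

    # Read the whole drive start to finish, discarding the data;  any
    # read error would be reported.
    dd if=/dev/sda of=/dev/null bs=1M status=progress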

I booted from an Ubuntu 18.04 USB stick to copy the PC's original
spinning-rust drive to the SSD over USB 2.0 with dd(1).  About 75% of
the way through the copy, the X desktop annoyingly froze.  That's OK, I
thought, the copy will still be working away.  Then Ubuntu decided to
helpfully log that X session out and auto-log me in to a new one.  Not
only did this kill the dd, which I'd seen no need to start in screen(1),
but Ubuntu auto-mounted the filesystems it could see on the new drive,
even though one was incomplete, modifying their bits along the way
because there's a `mount count', etc.  This nobbled my intention of
comparing the original and the copy bit for bit at the end as
verification.
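
The copy itself was just a whole-device dd, something like this with
the rust as /dev/sda and the SSD showing up over the USB adapter as
/dev/sdb;  treat both device names as placeholders:

    # Whole-device copy from the old drive to the SSD.
    dd if=/dev/sda of=/dev/sdb bs=1M status=progress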

I didn't want to start the copy from scratch, starting off a shiny SSD's
life by writing tens of GiB only to do it again, so I used cmp(1)'s `-l'
option to list the locations of byte differences between the two drives.
The physical sector containing each differing offset was then
selectively copied across again with dd(1).  This evolved into feeding
cmp's output through awk to print sector numbers, each sector only once,
in a script called `./cmp startsector'.  And the selective dd became
`./fix firstsector [n]' where `n', defaulting to one, was the number of
sectors to copy.  After gaining confidence in these, the awk changed
once more to count runs of consecutive sectors and alter its output
format to be `./fix sector n  # Comment with more detail'.  Then I did
`./cmp | ./fix' to patch up all the remaining differences.  There
weren't lots, but they were scattered.  After that, the normal dd was
restarted where it left off.  The detour took about an hour, but that
was less than a second full copy would have taken, and if it crashed out
again I'd be able to recover and continue the same way.
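
The two scripts aren't worth attaching verbatim, but roughly they
amount to the following sketch, with 512-byte sectors, /dev/sda as the
rust, and /dev/sdb as the SSD all hard-coded placeholders.  `./cmp'
turns cmp -l's 1-based byte offsets into runs of differing sectors:

    #!/bin/sh
    # ./cmp [startsector] -- print one `./fix sector n  # ...' line per
    # run of consecutive differing sectors, from startsector onwards.
    start=${1:-0}
    cmp -l /dev/sda /dev/sdb |
    awk -v start="$start" '
        {
            sector = int(($1 - 1) / 512)  # cmp -l gives 1-based byte offsets
            if (sector < start) next      # before the requested start
            if (seen && sector == last) next  # another byte, same sector
            if (!seen) { first = sector; seen = 1 }
            else if (sector != last + 1) {    # a gap: report the run so far
                n = last - first + 1
                printf "./fix %d %d\t# run of %d\n", first, n, n
                first = sector
            }
            last = sector
        }
        END {
            if (seen) {
                n = last - first + 1
                printf "./fix %d %d\t# run of %d\n", first, n, n
            }
        }'

and `./fix' does the selective dd, reading `./cmp'-style lines on
standard input when given no arguments so that `./cmp | ./fix' works:

    #!/bin/sh
    # ./fix [firstsector [n]] -- re-copy n sectors (default 1) from the
    # rust to the SSD in place.  With no arguments, read `./fix sector n'
    # lines, as produced by ./cmp, from standard input.
    SRC=/dev/sda DST=/dev/sdb
    copy() {
        dd if=$SRC of=$DST bs=512 skip="$1" seek="$1" count="${2:-1}" conv=notrunc
    }
    if [ $# -gt 0 ]; then
        copy "$1" "$2"
    else
        while read -r cmd sector n comment; do   # cmd is the leading "./fix"
            copy "$sector" "$n"
        done
    fi

Restarting the main dd where it left off is the same idea at a bigger
block size:  matching skip= and seek= operands pointing just past the
last block known to be good.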

None of the runs of dd erred whilst reading the rust.  At the end, I
b2sum(1)'d both drives in parallel and they had the same digest, so the
underlying copies and my patchwork had pieced together seamlessly.  I
got the case open a second time and swapped the drives;  that meant
removing the motherboard because the drive's screws come up through the
mobo from underneath.
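
The verification is just hashing both devices at once;  if the two
aren't exactly the same size, only the matching prefix of the larger
one wants hashing.  A sketch, with the device names again placeholders:

    # Hash the rust and the same number of bytes from the start of the
    # SSD, in parallel.
    size=$(blockdev --getsize64 /dev/sda)
    b2sum /dev/sda &
    head -c "$size" /dev/sdb | b2sum &
    wait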

As expected, disk activity that involved a lot of seeking before is now
much quicker, e.g. searching an email folder of 30,000 emails, one per
file.  And not a chirp in earshot.  Crucial are going to send details of
the SSD's wattage so I can compare it with the rust's, just out of
interest.  It's bound to be lower, especially because I configured the
rust never to spin down for power saving, preferring a steady state for
wear and tear.  The rust did well;  SMART's Power_On_Hours says

    $ units 63377hours time
            7 year + 84 day + 18 min + 38.177251 sec

The plan now is to get another of those SSDs and hook it up over USB
once a day to mirror the first to it with rsync(1).  Then a new machine
will start with the second one re-formatted as one half of an md(4)
RAID1 or RAID10 pair, the other half declared `missing'.  Atop that will
mainly sit a LUKS dm-crypt encrypted block device.  Copy the old PC's
data into it, then move the first SSD over to fill in the `missing' slot
of the RAID pair.  A third, external, large spinning-rust drive will be
added to the mirror and only occasionally be present for syncing.
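
None of that exists yet, so purely as a sketch of the shape of it, with
every device name and mount point a placeholder:

    # Daily mirror of the live SSD to the second one over USB.
    rsync -aHAXx --delete / /mnt/mirror/

    # Later, on the new machine:  one-disk RAID1 with the other half
    # `missing', LUKS dm-crypt atop, and a filesystem inside that.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 missing
    cryptsetup luksFormat /dev/md0
    cryptsetup open /dev/md0 crypt0
    mkfs.ext4 /dev/mapper/crypt0

    # Once the old PC's data is copied in, the first SSD fills the
    # `missing' slot and the array rebuilds onto it.
    mdadm --manage /dev/md0 --add /dev/sda1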

Thanks to AndrewM and twaugh on #dorset IRC for putting up with the
running commentary, and answering questions on md and dm-crypt for the
next stages.

Cheers, Ralph.

-- 
Next meeting:  Bournemouth, Tuesday, 2018-09-04 20:00
Meets, Mailing list, IRC, LinkedIn, ...  http://dorset.lug.org.uk/
New thread:  mailto:dorset@mailman.lug.org.uk / CHECK IF YOU'RE REPLYING
Reporting bugs well:  http://goo.gl/4Xue     / TO THE LIST OR THE AUTHOR
