Antonio Diaz Diaz wrote:
Daniel Rudolf wrote:
in my experience a broken HDD sometimes doesn't even report a read error,
but just returns random garbage.
Interesting. Yours is the third request for a feature intended to detect
or recover from such hardware misbehavior. Are HDD manufacturers
cutting too many corners lately?
I'd say that HDD manufacturers have always cut corners 😄
I can only speak from my own experience, but the latest defective HDD
with this issue was quite old: it was a 2.5" Samsung drive manufactured
back in 2015 that hadn't accumulated many hours (below 8,000). It died
out of nowhere, without a single sign of imminent failure (apart from
its old age, of course). Nothing critical was on that HDD, but I always
try to restore data.
To add some more context about the issue: it's not that the HDD starts
returning garbage data at some point, but rather that it returns a
"wrong" sector now and then. The only reason I noticed is that I store
hashes of some of my data and thus was able to verify some of it after
running ddrescue. I saw that some hashes didn't match, so I tracked the
sectors down and was surprised to see that ddrescue didn't report these
sectors as bad. So I rescued these sectors again and then the hashes matched.
However, I can't verify all data this way, simply because I don't have
hashes of everything. I thus ran ddrescue-verify and was surprised to
see that it reported a difference of about 400 MiB (of 1.5 TB total).
ddrescue-verify excluded the four 512 byte sectors ddrescue deemed bad.
Most (but not all) differences were in areas ddrescue also struggled
with, but ultimately rescued "successfully" (with wrong data, though
ddrescue had no way of knowing that). However, I don't know how many
"wrong" sectors we're really talking about - it could be as few as one
"wrong" sector per 1 MiB ddrescue-verify block.
I experienced this before (too many non-matching hashes for the number
of bad sectors reported by ddrescue), but I didn't investigate further
back then.
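The spot check described above can be sketched roughly as follows. This is
a minimal illustration, not part of ddrescue or ddrescue-verify; the paths
and the hash-list layout are assumptions:

```python
# Sketch: verify previously stored SHA-256 hashes of known files against
# their copies in the rescued image, to locate files that ddrescue rescued
# "successfully" but with wrong data. Paths and the known_hashes mapping
# are hypothetical.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_mismatches(known_hashes: dict[str, str]) -> list[str]:
    """Return the paths whose current hash differs from the stored one."""
    return [path for path, expected in known_hashes.items()
            if sha256_of(path) != expected]
```

Any path reported by find_mismatches() marks a region worth rescuing again.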
>
> If the hardware does not fulfil its part of the job (report a read
> error), it is not possible for the software to be sure that it has got
> the correct data.
I agree: it's virtually impossible for ddrescue to decide what the
correct data is. One could indeed use a "two out of three" algorithm,
but that isn't bulletproof either: the issue could have developed after
the initial rescue. So, in the end, ddrescue should leave the decision up
to the user (ideally supported by information about how often each
version of the data was read).
Thus I like running ddrescue at least twice: Once to do the actual data
recovery, and at least once more to verify the data that was recovered.
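A minimal sketch of the "two out of three" idea mentioned above, assuming
a hypothetical read_sector callback (not a ddrescue interface) and that
three reads per sector are affordable:

```python
# Sketch: read a sector several times and accept a value only if a
# strict majority of the reads agree. read_sector is a hypothetical
# callback that returns the raw bytes of the sector at the given LBA.
from collections import Counter
from typing import Callable, Optional

def majority_read(read_sector: Callable[[int], bytes],
                  lba: int, tries: int = 3) -> Optional[bytes]:
    """Return the sector contents if a majority of reads agree, else None."""
    counts = Counter(read_sector(lba) for _ in range(tries))
    data, votes = counts.most_common(1)[0]
    return data if votes * 2 > tries else None
```

As noted above, even agreeing reads prove nothing if the sector had already
decayed before the first read; this only filters out transient garbage.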
This seems like the best one can do with such faulty HDDs. But I find
the hash approach both complicated and inefficient. Comparing the data
read during the verification run with the image (outfile) should be
better. It would also make it easier to use the domain options to verify
only a subset of the whole image.
The biggest problems I see with verification are where to efficiently
store the different reads of variable sectors, and how to decide which
of the reads holds the real data of each given sector. Maybe the drive
returns garbage most of the time for damaged sectors, and only manages
to return the correct data from time to time.
You're absolutely right: ddrescue doesn't need the hashes to compare
data, because it has access to the outfile. According to the README,
the reason ddrescue-verify uses hashes is that its developer was
required to send all the data over the wire. But I presume that this
isn't a very common use case...
However, not having to store the diverging data is the reason I kept
the idea of using hashes: is it really worth storing that data? If
ddrescue reports the data to differ, the user will investigate and thus
read the data a third time. If the data of the third try differs from
both the first and the second try, I'd consider the data of the second
try worthless anyway.
This got me thinking: creating hashes proactively indeed doesn't make
much sense, but they could be used when ddrescue detects a "vary"
sector: ddrescue could then calculate the sector's hash from the
outfile, and from the newly read data, and store both in the mapfile for
the user to decide what to do. This way we shouldn't have any issues
with domains, or with blocks consisting of sectors read in different
phases - and the mapfile remains manageable.
All the best
Daniel Rudolf