Hi,
first of all, thank you for your amazing project, I'm sure it saved many
people a lot of data, including some of mine.
In my (limited) experience a big issue with broken HDDs is that reading
from them doesn't always return the "true" data on the disk. I'm aware
of the -J / --check-on-error and -d / --idirect options, but a broken
HDD sometimes doesn't even report a read error and simply returns
random garbage.
Thus I like running ddrescue at least twice: once to do the actual data
recovery, and at least once more to verify the recovered data.
The idea is more or less an adaptation of Valentin Hilbig's
ddrescue-verify project at https://github.com/hilbix/ddrescue-verify:
ddrescue-verify consumes the rescued copy of the data, calculates an
MD5 hash of each 1M block and writes it to a hashfile similar to
ddrescue's mapfile. To skip bad areas it also consumes ddrescue's
mapfile and excludes them from hashing. One can then use the generated
hashfile in a second ddrescue-verify run to read all data from the bad
device again and compare it with the MD5 hashes previously recorded in
the hashfile. ddrescue-verify then uses this information to create a
ddrescue-compatible mapfile that marks matching areas as finished and
non-matching areas as non-tried. One can then use this mapfile with
ddrescue to copy the non-matching areas again.
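Just to illustrate the idea (this is not Hilbig's actual code, the
hashfile layout is simplified, and file names are made up), the first
ddrescue-verify step essentially boils down to something like this:

    import hashlib

    BLOCK = 1024 * 1024  # 1 MiB, the block size ddrescue-verify hashes

    def finished_ranges(mapfile_path):
        """Yield (pos, size) of the areas a ddrescue mapfile marks as finished ('+')."""
        with open(mapfile_path) as f:
            lines = [line.split() for line in f
                     if line.strip() and not line.startswith('#')]
        for fields in lines[1:]:  # lines[0] is the current_pos/current_status line
            pos, size, status = fields[0], fields[1], fields[2]
            if status == '+':
                yield int(pos, 0), int(size, 0)

    def write_hashfile(image_path, mapfile_path, hashfile_path):
        """Hash the rescued image block-wise, skipping areas not marked as finished."""
        with open(image_path, 'rb') as img, open(hashfile_path, 'w') as out:
            for pos, size in finished_ranges(mapfile_path):
                offset, end = pos, pos + size
                while offset < end:
                    img.seek(offset)
                    data = img.read(min(BLOCK, end - offset))
                    out.write('0x%X 0x%X %s\n'
                              % (offset, len(data), hashlib.md5(data).hexdigest()))
                    offset += len(data)

The second ddrescue-verify step then reads the same ranges from the bad
device, hashes them again and compares against this hashfile.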
Hilbig's project has three major disadvantages: it's a separate project
and thus calculates hashes as a separate step, it doesn't utilize
ddrescue's sophisticated algorithm to read as much data as possible (no
trimming, scraping, retries), and it doesn't support some of ddrescue's
advanced features like resuming recoveries or using rescue domains.
Thus I'd like to suggest adding this as a native feature to ddrescue.
Data verification should take place in two separate steps: first, one
records hashes; second, one performs the actual data verification.
1. With some option one can enable hash recording, which basically just
tells ddrescue to additionally calculate a hash over a configurable
number of finished sectors (1M might be a reasonable default here as
well) next to its normal operation. MD5 would work, but there might be
better-suited hashing algorithms, namely rolling hashes like Adler-32,
given ddrescue's multi-phase approach and blocks possibly spanning
multiple phases. ddrescue then writes this hash to the mapfile, which
naturally causes the mapfile to grow significantly (see the example
excerpt after point 2 below). If one uses a mapfile that was previously
created without hash recording, ddrescue should first (i.e. before its
normal operation) check whether all finished blocks already have a hash
and read the necessary data from the output file (!) to add missing
hashes.
2. With another option one can switch ddrescue to run in data
verification mode, which takes the mapfile as input and compares the
recorded hashes of finished (!) blocks with data read from the bad
device. ddrescue uses its sophisticated algorithm (all options apply
accordingly, e.g. including -n / --no-scrape) and records its progress
in a second status column. The finished status (character "+") now
indicates that hashes match. If hashes don't match, ddrescue splits the
block up into single sectors, reads the sectors from the output file and
compares the data. The non-matching sectors are then marked with the new
status "vary" (could use character "~") and ddrescue writes both the new
and old hash to the mapfile. Optionally, ddrescue also overwrites the
non-matching sectors in the output file with the newly read data, but
this should be opt-in. There should also be an option to do this later.
In any case it
should trigger the user to investigate. Read errors of previously
finished blocks are recorded as such in the second status column and no
verification takes place. If ddrescue runs in data verification mode
repeatedly, blocks with status "vary" are checked again, finished blocks
are not. Outside of data verification mode the second status column is
simply ignored.
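To illustrate point 1: the data section of a hash-recording mapfile
might look roughly like this (the exact layout, the "md5:" prefix and
the hash values are just made up for illustration):

    #      pos        size  status  hash
    0x00000000  0x00100000  +  md5:9e107d9d372bb6826bd81d3542a419d6
    0x00100000  0x00000200  -
    0x00100200  0x000FFE00  +  md5:e4d909c290d0fb1ca068ffaddf22cbd0

A verification run (point 2) would then add the second status column
("+" or "~") and, for varying blocks, the second hash next to it.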
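And to make the verification logic of point 2 a bit more concrete, here
is a very rough sketch in Python (the real thing would of course live
in ddrescue's C++ code, reuse its existing read strategy and handle
read errors, which this sketch simply ignores):

    import hashlib

    SECTOR = 512  # would really come from -b / --sector-size

    def verify_block(bad_dev, out_file, pos, size, recorded_md5):
        """Check one finished block; return [(sector_pos, 'finished' or 'vary'), ...]."""
        bad_dev.seek(pos)
        new_data = bad_dev.read(size)
        if hashlib.md5(new_data).hexdigest() == recorded_md5:
            return [(pos, 'finished')]  # block still reads back identically
        # hash mismatch: split into sectors and compare against the output file
        results = []
        for off in range(0, size, SECTOR):
            out_file.seek(pos + off)
            old = out_file.read(min(SECTOR, size - off))
            new = new_data[off:off + len(old)]
            results.append((pos + off, 'finished' if old == new else 'vary'))
        return results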
The first and second steps of data verification intentionally require
separate executions of ddrescue, because this allows the user to decide
when to verify data. Verifying data causes a lot of additional wear.
It's no easy decision whether one wants to run verification before or
after trimming and scraping: on the one hand, trimming and scraping
create a lot of additional wear and could make previously good blocks
go bad, which could then cause data verification to fail. On the other
hand, data verification creates a lot of wear on its own, making it
more likely to encounter bad blocks while trimming and scraping later.
Basically one needs to decide whether one wants to sacrifice some data
in bad areas for the sake of being sure that the other data is correct.
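Just to make the intended workflow concrete (the two long option names
are purely made-up placeholders, the other options already exist),
someone who values verified data over squeezing out the last bad areas
might run something like:

    # pass 1: copy and trim only, record hashes along the way
    ddrescue -d -n --record-hashes /dev/sdb hdd.img hdd.map
    # pass 2: verify the finished blocks before stressing the drive further
    ddrescue -d --verify-hashes /dev/sdb hdd.img hdd.map
    # pass 3: go back for the remaining bad areas with scraping and retries
    ddrescue -d -r3 --record-hashes /dev/sdb hdd.img hdd.map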
Tools like ddrescuelog and ddrescueview should work as before (both seem
to ignore additional columns, i.e. they shouldn't break due to this
addition), but should be updated to display verification information as
well.
WDYT?
All the best
Daniel Rudolf