Hi,

first of all, thank you for your amazing project, I'm sure it saved many people a lot of data, including some of mine.

In my (limited) experience, a big issue with broken HDDs is that reading from them doesn't always return the "true" data on the disk. I'm aware of the -J / --check-on-error and -d / --idirect options, but a broken HDD sometimes doesn't even report a read error and just returns random garbage.

Thus I like to run ddrescue at least twice: once to do the actual data recovery, and at least once more to verify the data that was recovered.

The idea is more or less an adaptation of Valentin Hilbig's ddrescue-verify project at https://github.com/hilbix/ddrescue-verify: ddrescue-verify consumes the rescued copy of the data, calculates an MD5 hash of each 1M block and writes it to a hashfile similar to ddrescue's mapfile. It also consumes ddrescue's mapfile so it can exclude bad areas from hashing. In a second ddrescue-verify run one can then use the generated hashfile to read all the data from the bad device again and compare it against the previously recorded MD5 hashes. From this comparison it creates a ddrescue-compatible mapfile that marks matching areas as finished and non-matching areas as non-tried, so one can hand this mapfile back to ddrescue to copy the non-matching areas again.
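
To make this a bit more concrete, a rough Python sketch of such a verify pass could look like the following. This is not ddrescue-verify's actual code; the hashfile layout, the MD5 comparison and the function name are just assumptions based on the description above:

    import hashlib

    def verify(hashfile_path, bad_device_path, out_mapfile_path):
        """Re-read hashed blocks from the bad device and emit a ddrescue-style mapfile."""
        with open(hashfile_path) as hashes, \
             open(bad_device_path, "rb") as dev, \
             open(out_mapfile_path, "w") as mapfile:
            # minimal ddrescue-style status line: current_pos, current_status
            mapfile.write("0x00000000  +\n")
            for line in hashes:
                if line.startswith("#") or not line.strip():
                    continue
                # assumed hashfile layout: "<pos> <size> <md5>" per block
                pos, size, recorded = line.split()
                pos, size = int(pos, 0), int(size, 0)
                dev.seek(pos)
                data = dev.read(size)
                # matching blocks become finished ('+'), mismatches non-tried ('?')
                status = "+" if hashlib.md5(data).hexdigest() == recorded else "?"
                mapfile.write(f"0x{pos:08X}  0x{size:08X}  {status}\n")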

Hilbig's project has three major disadvantages: it's a separate project and thus calculates hashes in a separate step, it doesn't utilize ddrescue's sophisticated algorithm to read as much data as possible (no trimming, scraping or retries), and it doesn't support some of ddrescue's advanced features like resuming recoveries or using rescue domains.

Thus I'd like to suggest adding this as a native feature to ddrescue. Data verification should take place in two separate steps: first, one records hashes; second, one performs the actual data verification.

1. With some option one can enable hash recording, which basically just tells ddrescue to additionally calculate a hash over a configurable number of finished sectors (1M might be a sensible default here, too) alongside its normal operation. MD5 would work, but due to ddrescue's multi-phase approach and blocks possibly spanning multiple phases, a rolling hash algorithm like Adler-32 might be better suited. ddrescue then writes this hash to the mapfile, which naturally causes the mapfile to grow significantly. If one uses a mapfile that was previously created without hash recording, ddrescue should first (i.e. before its normal operation) check whether all finished blocks already have a hash and read the necessary data from the output file (!) to add missing hashes (a sketch of this backfill step follows below).

2. With another option one can switch ddrescue to a data verification mode, which takes the mapfile as input and compares the recorded hashes of finished (!) blocks with data read from the bad device. ddrescue uses its sophisticated algorithm (all options apply accordingly, e.g. including -n / --no-scrape) and records its progress in a second status column. The finished status (character "+") now indicates that the hashes match. If the hashes don't match, ddrescue splits the block up into single sectors, reads the sectors from the output file and compares the data. The non-matching sectors are then marked with the new status "vary" (character "~" could be used) and ddrescue writes both the old and the new hash to the mapfile (a hypothetical mapfile excerpt follows below). Optionally, ddrescue also overwrites the non-matching sectors in the output file with the newly read data, but this should be opt-in, and there should be an option to do it later. In any case it should prompt the user to investigate. Read errors on previously finished blocks are recorded as such in the second status column and no verification takes place for them. If ddrescue runs in data verification mode repeatedly, blocks with status "vary" are checked again, while finished blocks are not. Outside of data verification mode the second status column is simply ignored.
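
For point 1, the backfill of missing hashes could conceptually work like this minimal sketch (again only an illustration under assumptions: a 1M hash block size, zlib's Adler-32 and a heavily simplified mapfile parser; it's not meant as the actual implementation):

    import zlib

    HASH_BLOCK = 1024 * 1024  # configurable hash block size, 1M by default

    def backfill_hashes(mapfile_lines, output_path):
        """Yield (pos, size, adler32) triples for all finished areas."""
        with open(output_path, "rb") as out:  # note: reads the *output* file
            for line in mapfile_lines:
                if line.startswith("#") or not line.strip():
                    continue
                fields = line.split()
                if len(fields) != 3 or fields[2] != "+":
                    continue  # only finished blocks get a hash
                pos, size = int(fields[0], 0), int(fields[1], 0)
                out.seek(pos)
                # hash fixed-size chunks, independent of how the area was
                # assembled across ddrescue's rescue phases
                for chunk_pos in range(pos, pos + size, HASH_BLOCK):
                    chunk = out.read(min(HASH_BLOCK, pos + size - chunk_pos))
                    yield chunk_pos, len(chunk), zlib.adler32(chunk)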
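
And to visualize point 2, a purely hypothetical mapfile excerpt with the proposed additional columns might look like this (the exact syntax, column order and hash notation are of course up for discussion):

    # pos          size          status  verify  hash(es)
    0x00000000     0x00100000    +       +       adler32:0x1A2B3C4D
    0x00100000     0x00000200    +       ~       adler32:0x5E6F7A8B,0x9C0D1E2F
    0x00100200     0x000FFE00    +       +       adler32:0x33445566
    0x00200000     0x00010000    -

Here the second data line is a single sector whose re-read data didn't match, so both the old and the new hash are recorded; the bad ("-") area at the end has no hash and is skipped during verification.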

The first and second step of data verification intentionally require separate executions of ddrescue, because this allows the user to decide when to verify data. Verifying data causes a lot of additional wear, and it's no easy decision whether one wants to run it before or after trimming and scraping: on the one hand, trimming and scraping create a lot of additional wear and could make previously good blocks go bad, which could then cause data verification to fail. On the other hand, data verification creates a lot of wear on its own, making it more likely to encounter bad blocks while trimming and scraping later. Basically, one needs to decide whether one wants to sacrifice some data in bad areas for the sake of being sure that the other data is correct.

Tools like ddrescuelog and ddrescueview should work as before (both seem to ignore additional columns, i.e. they shouldn't break due to this addition), but should be updated to display verification information as well.

WDYT?

All the best
Daniel Rudolf

