Hi,
first of all, thank you for your amazing project, I'm sure it saved many
people a lot of data, including some of mine.
In my (limited) experience a big issue with broken HDDs is that reading
from them doesn't always return the "true" data on the disk. I'm aware
of the -J / --check-on-error and -d / --idirect options, but a broken
HDD sometimes doesn't even report a read error and simply returns
random garbage.
Thus I like running ddrescue at least twice: once to do the actual data
recovery, and at least once more to verify the recovered data.
The idea is more or less an adaptation of Valentin Hilbig's
ddrescue-verify project at https://github.com/hilbix/ddrescue-verify:
ddrescue-verify consumes the rescued copy of the data, calculates an
MD5 hash of each 1M block and writes it to a hashfile similar to
ddrescue's mapfile. To skip bad areas it also consumes ddrescue's
mapfile and excludes them from hashing. One can then use the generated
hashfile in a second ddrescue-verify run to read all data from the bad
device again and compare it with the MD5 hashes previously recorded in
the hashfile. ddrescue-verify then uses this information to create a
ddrescue-compatible mapfile that marks matching areas as finished and
non-matching areas as non-tried. One can then use this mapfile with
ddrescue to copy the non-matching areas again.
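Just to illustrate the idea (this is not Hilbig's actual code, the
hashfile layout is simplified, and file names are made up), the first
ddrescue-verify step essentially boils down to something like this:

    import hashlib

    BLOCK = 1024 * 1024  # 1 MiB, the block size ddrescue-verify hashes

    def finished_ranges(mapfile_path):
        """Yield (pos, size) of the areas a ddrescue mapfile marks as finished ('+')."""
        with open(mapfile_path) as f:
            lines = [line.split() for line in f
                     if line.strip() and not line.startswith('#')]
        for fields in lines[1:]:  # lines[0] is the current_pos/current_status line
            pos, size, status = fields[0], fields[1], fields[2]
            if status == '+':
                yield int(pos, 0), int(size, 0)

    def write_hashfile(image_path, mapfile_path, hashfile_path):
        """Hash the rescued image block-wise, skipping areas not marked as finished."""
        with open(image_path, 'rb') as img, open(hashfile_path, 'w') as out:
            for pos, size in finished_ranges(mapfile_path):
                offset, end = pos, pos + size
                while offset < end:
                    img.seek(offset)
                    data = img.read(min(BLOCK, end - offset))
                    out.write('0x%X 0x%X %s\n'
                              % (offset, len(data), hashlib.md5(data).hexdigest()))
                    offset += len(data)

The second ddrescue-verify step then reads the same ranges from the bad
device, hashes them again and compares against this hashfile.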
Hilbig's project has three major disadvantages: it's a separate project
and thus calculates hashes as a separate step, it doesn't utilize
ddrescue's sophisticated algorithm to read as much data as possible (no
trimming, scraping, retries), and it doesn't support some of ddrescue's
advanced features like resuming recoveries or using rescue domains.
Thus I'd like to suggest adding this as a native feature to ddrescue.
Data verification should take place in two separate steps: first, one
records hashes; second, one performs the actual data verification.
1. With some option one can enable hash recording, which basically just
tells ddrescue to additionally calculate a hash over a configurable
number of finished sectors (1M might be a reasonable default here as
well) next to its normal operation. MD5 would work, but there might be
better-suited hashing algorithms, namely rolling hashes like Adler-32,
given ddrescue's multi-phase approach and blocks possibly spanning
multiple phases. ddrescue then writes this hash to the mapfile, which
naturally causes the mapfile to grow significantly (see the example
excerpt after point 2 below). If one uses a mapfile that was previously
created without hash recording, ddrescue should first (i.e. before its
normal operation) check whether all finished blocks already have a hash
and read the necessary data from the output file (!) to add missing
hashes.
2. With another option one can switch ddrescue to run in data
verification mode, which takes the mapfile as input and compares the
recorded hashes of finished (!) blocks with data read from the bad
device. ddrescue uses its sophisticated algorithm (all options apply
accordingly, e.g. including -n / --no-scrape) and records its progress
in a second status column. The finished status (character "+") now
indicates that hashes match. If hashes don't match, ddrescue splits the
block up into single sectors, reads the sectors from the output file and
compares the data. The non-matching sectors are then marked with the new
status "vary" (could use character "~") and ddrescue writes both the new
and old hash to the mapfile. Optionally, ddrescue also overwrites the
non-matching sectors in the output file with the newly read data, but
this should be opt-in. There should also be an option to do this later.
In any case it
should trigger the user to investigate. Read errors of previously
finished blocks are recorded as such in the second status column and no
verification takes place. If ddrescue runs in data verification mode
repeatedly, blocks with status "vary" are checked again, finished blocks
are not. Outside of data verification mode the second status column is
simply ignored.
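To illustrate point 1: the data section of a hash-recording mapfile
might look roughly like this (the exact layout, the "md5:" prefix and
the hash values are just made up for illustration):

    #      pos        size  status  hash
    0x00000000  0x00100000  +  md5:9e107d9d372bb6826bd81d3542a419d6
    0x00100000  0x00000200  -
    0x00100200  0x000FFE00  +  md5:e4d909c290d0fb1ca068ffaddf22cbd0

A verification run (point 2) would then add the second status column
("+" or "~") and, for varying blocks, the second hash next to it.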
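And to make the verification logic of point 2 a bit more concrete, here
is a very rough sketch in Python (the real thing would of course live
in ddrescue's C++ code, reuse its existing read strategy and handle
read errors, which this sketch simply ignores):

    import hashlib

    SECTOR = 512  # would really come from -b / --sector-size

    def verify_block(bad_dev, out_file, pos, size, recorded_md5):
        """Check one finished block; return [(sector_pos, 'finished' or 'vary'), ...]."""
        bad_dev.seek(pos)
        new_data = bad_dev.read(size)
        if hashlib.md5(new_data).hexdigest() == recorded_md5:
            return [(pos, 'finished')]  # block still reads back identically
        # hash mismatch: split into sectors and compare against the output file
        results = []
        for off in range(0, size, SECTOR):
            out_file.seek(pos + off)
            old = out_file.read(min(SECTOR, size - off))
            new = new_data[off:off + len(old)]
            results.append((pos + off, 'finished' if old == new else 'vary'))
        return results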
The first and second steps of data verification intentionally require
separate executions of ddrescue, because this allows the user to decide
when to verify data. Verifying data causes a lot of additional wear.
It's no easy decision whether one wants to run verification before or
after trimming and scraping: on the one hand, trimming and scraping
create a lot of additional wear and could make previously good blocks
go bad, which could then cause data verification to fail. On the other
hand, data verification creates a lot of wear on its own, making it
more likely to encounter bad blocks while trimming and scraping later.
Basically one needs to decide whether one wants to sacrifice some data
in bad areas for the sake of being sure that the other data is correct.
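Just to make the intended workflow concrete (the two long option names
are purely made-up placeholders, the other options already exist),
someone who values verified data over squeezing out the last bad areas
might run something like:

    # pass 1: copy and trim only, record hashes along the way
    ddrescue -d -n --record-hashes /dev/sdb hdd.img hdd.map
    # pass 2: verify the finished blocks before stressing the drive further
    ddrescue -d --verify-hashes /dev/sdb hdd.img hdd.map
    # pass 3: go back for the remaining bad areas with scraping and retries
    ddrescue -d -r3 --record-hashes /dev/sdb hdd.img hdd.map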
Tools like ddrescuelog and ddrescueview should work as before (both seem
to ignore additional columns, i.e. they shouldn't break due to this
addition), but should be updated to display verification information as
well.
WDYT?
All the best
Daniel Rudolf