On Sat, Nov 11, 2017 at 5:42 PM, Hugo Mills <h...@carfax.org.uk> wrote: > On Sat, Nov 11, 2017 at 05:18:33PM -0700, Chris Murphy wrote: >> OK this might be in the stupid questions category, but I'm not >> understanding the purpose of computing hash collisions with -ss. Or >> more correctly, why it's taking so much longer than -s. >> >> It seems like what we'd want is every filename to have the same hash, >> but for the file to go through a PBKDF so the hashes we get aren't >> (easily) brute forced. So I totally understand that -ss should take >> much longer than -s, but this is at least two orders magnitude longer >> (so far). That's why I'm confused. >> >> -s option on this file system took 5 minutes, start to finish. >> -ss option is at 8 hours and counting. >> >> The other part I'm not groking is that some filenames fail with: >> >> WARNING: cannot find a hash collision for 'Tool', generating garbage, >> it won't match indexes >> >> So? That seems like an undesirable outcome. And if it were just being >> pushed through a PBKDF function, it wouldn't fail. Every >> file/directory "Tool" would get the same hash on *this* run of >> btrs-image. If I run it again, or someone else runs it, they'd get >> some other hash (same hashes for each instance of "Tool" on their >> filesystem). > > In the FS tree, you can go from the inode of the file to its name > (where the inode is in the index, and the name is stored in the > corresponding data item). Alternatively, you can go from the filename > to the inode. In the latter case, since the keys are a structured 17 > byte object, you obviously can't fit the whole filename into the key, > so the filename is hashed (using, IIRC, CRC32), and it's the hash that > appears in the key of the index. > > When an image is made without the -s options, the whole metadata is > stored, including all the filenames in the data items. For some > people, that's a security risk, and they don't want their filenames > leaking out, so -s exists to put junk in the filename records. > However, it doesn't change the hashes in the index to correspond with > the modified filenames, because that would at minimum require the > whole tree to be rebuilt (because all the items would have different > hashes, and hence different ordering in the index). This is a bad > thing for debugging, because you're not getting the details of the > tree as it was in the broken filesystem. So, in this case, the image > is actually broken, because the filenames don't match the hashes. > > Most of the time, that's absolutely fine, because the thing being > debugged is somewhere else, and it doesn't matter that "ls" on the > restored FS won't work right. > > However, in some (possibly hypothetical) cases, it _does_ matter, > and you do need the hashes to match the filenames. This is where -ss > comes in. We can't generate random filenames and then take the hashes > of those, because of the undesirability of rewriting the whole FS tree > to reindex it with the changed hashes. So, what -ss tries to do is > stick with the original hashes and find arbitrary filenames which > match them. It's (I think) CRC32, so it shouldn't be too hard, but > it's still non-trivial amounts of work to reverse engineer a > human-readable ASCII filename which hashes to a given value. > Particularly if, as was the case when Josef wrote it, a simple > brute-force algorithm was used. > > It could definitely be improved -- I believe there are some good > (but non-trivial) algorithms for finding preimages for CRC32 checksums > out there. It's just that btrfs-image doesn't use them. However, it's > not an option that's needed very often, so it's probably not worth > putting in the effort to fix it up. (I definitely remember Josef > commenting on IRC when he wrote -s and -ss that it could almost > certainly be done more efficiently, but he had bigger fish to fry at > the time, like fixing the broken FS he was working on) > > As to the thing where it's not finding a pre-image at all -- I'm > guessing here, but it's possible that this is a case where two of the > orginal filenames hashed to the same value. If that happens, one of > the hashes is incremented by a small integer in a predictable way > before storage. So it may be that the resulting value isn't mappable > to an ASCII pre-image, or that the search just gives up before finding > one. >
Super explanation, thanks Hugo. Maybe it's worth an additional note in the man page, just how expensive it is. I'm definitely regretting not starting this imaging in a tmux session. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html