On Sat, Nov 11, 2017 at 05:18:33PM -0700, Chris Murphy wrote:
> OK this might be in the stupid questions category, but I'm not
> understanding the purpose of computing hash collisions with -ss. Or
> more correctly, why it's taking so much longer than -s.
> 
> It seems like what we'd want is every filename to have the same hash,
> but for the file to go through a PBKDF so the hashes we get aren't
> (easily) brute forced. So I totally understand that -ss should take
> much longer than -s, but this is at least two orders magnitude longer
> (so far). That's why I'm confused.
> 
> -s option on this file system took 5 minutes, start to finish.
> -ss option is at 8 hours and counting.
> 
> The other part I'm not groking is that some filenames fail with:
> 
> WARNING: cannot find a hash collision for 'Tool', generating garbage,
> it won't match indexes
> 
> So? That seems like an undesirable outcome. And if it were just being
> pushed through a PBKDF function, it wouldn't fail. Every
> file/directory "Tool" would get the same hash on *this* run of
> btrs-image. If I run it again, or someone else runs it, they'd get
> some other hash (same hashes for each instance of "Tool" on their
> filesystem).

   In the FS tree, you can go from the inode of the file to its name
(where the inode is in the index, and the name is stored in the
corresponding data item). Alternatively, you can go from the filename
to the inode. In the latter case, since the keys are a structured 17
byte object, you obviously can't fit the whole filename into the key,
so the filename is hashed (using, IIRC, CRC32), and it's the hash that
appears in the key of the index.

   When an image is made without the -s options, the whole metadata is
stored, including all the filenames in the data items. For some
people, that's a security risk, and they don't want their filenames
leaking out, so -s exists to put junk in the filename records.
However, it doesn't change the hashes in the index to correspond with
the modified filenames, because that would at minimum require the
whole tree to be rebuilt (because all the items would have different
hashes, and hence different ordering in the index). This is a bad
thing for debugging, because you're not getting the details of the
tree as it was in the broken filesystem. So, in this case, the image
is actually broken, because the filenames don't match the hashes.

   Most of the time, that's absolutely fine, because the thing being
debugged is somewhere else, and it doesn't matter that "ls" on the
restored FS won't work right.

   However, in some (possibly hypothetical) cases, it _does_ matter,
and you do need the hashes to match the filenames. This is where -ss
comes in. We can't generate random filenames and then take the hashes
of those, because of the undesirability of rewriting the whole FS tree
to reindex it with the changed hashes. So, what -ss tries to do is
stick with the original hashes and find arbitrary filenames which
match them. It's (I think) CRC32, so it shouldn't be too hard, but
it's still non-trivial amounts of work to reverse engineer a
human-readable ASCII filename which hashes to a given value.
Particularly if, as was the case when Josef wrote it, a simple
brute-force algorithm was used.

   It could definitely be improved -- I believe there are some good
(but non-trivial) algorithms for finding preimages for CRC32 checksums
out there. It's just that btrfs-image doesn't use them. However, it's
not an option that's needed very often, so it's probably not worth
putting in the effort to fix it up. (I definitely remember Josef
commenting on IRC when he wrote -s and -ss that it could almost
certainly be done more efficiently, but he had bigger fish to fry at
the time, like fixing the broken FS he was working on)

   As to the thing where it's not finding a pre-image at all -- I'm
guessing here, but it's possible that this is a case where two of the
orginal filenames hashed to the same value. If that happens, one of
the hashes is incremented by a small integer in a predictable way
before storage. So it may be that the resulting value isn't mappable
to an ASCII pre-image, or that the search just gives up before finding
one.

   Hugo.

-- 
Hugo Mills             | Yes, this is an example of something that becomes
hugo@... carfax.org.uk | less explosive as a one-to-one cocrystal with TNT.
http://carfax.org.uk/  | (Hexanitrohexaazaisowurtzitane)
PGP: E2AB1DE4          |                                            Derek Lowe

Attachment: signature.asc
Description: Digital signature

Reply via email to