I am working up code to calculate a hash over parts of the data in an sqlite
database, and as a start looked at the dbhash.c code found at

https://www.sqlite.org/src/artifact?ci=trunk&filename=tool/dbhash.c

I don't think the code as is works correctly, it is easy to construct
a hash collision between two different databases.

The code to hash a text value, merely prefixes the text string with a
'3', so if you have adjacent columns with '3's, they can hash the same.

A demonstration script follows:

#!/bin/sh

schema="create table hashc(a text, b text);"

sqlite3 hca.db "$schema"
sqlite3 hca.db "insert into hashc values ('3333', '3333')"

sqlite3 hcb.db "$schema"
sqlite3 hcb.db "insert into hashc values ('33333', '333')"

./dbhash hca.db hcb.db

rm hca.db hcb.db

On my system, I get:

granicus% sh hashc     
2d07f37961253ec993424d0997caf5fb5aede448 hca.db
2d07f37961253ec993424d0997caf5fb5aede448 hcb.db
granicus% 

This obviously isn't what is wanted, even if it is unlikely to arise in
practice.  I believe a simple fix would be to, in addition to the type
identifer of '3' for a text column, to hash the string length as some
integer type.  I would be happy to provide example code if that would be
helpful, but it's really just a strlen() and then hashing a "3", the
result of the strlen as a four or eight byte integer, which can be done
similarly to how an integer column would be hashed, and then the string
content.

For the example above, the rows would hash as
("3", 4, "3333", "3", 4, "3333") and ("3", 5, "33333", "3", 3, "333")
whereas now, they hash as
("3", "3333", "3" "3333") and ("3", "33333", "3", "333")
which are the same.

Hashing the string length as well as the data makes the rows different,
and I think it would be sufficient to make different rows unique.

-- 
nw
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to