I am working up code to calculate a hash over parts of the data in an sqlite database, and as a start looked at the dbhash.c code found at
https://www.sqlite.org/src/artifact?ci=trunk&filename=tool/dbhash.c I don't think the code as is works correctly, it is easy to construct a hash collision between two different databases. The code to hash a text value, merely prefixes the text string with a '3', so if you have adjacent columns with '3's, they can hash the same. A demonstration script follows: #!/bin/sh schema="create table hashc(a text, b text);" sqlite3 hca.db "$schema" sqlite3 hca.db "insert into hashc values ('3333', '3333')" sqlite3 hcb.db "$schema" sqlite3 hcb.db "insert into hashc values ('33333', '333')" ./dbhash hca.db hcb.db rm hca.db hcb.db On my system, I get: granicus% sh hashc 2d07f37961253ec993424d0997caf5fb5aede448 hca.db 2d07f37961253ec993424d0997caf5fb5aede448 hcb.db granicus% This obviously isn't what is wanted, even if it is unlikely to arise in practice. I believe a simple fix would be to, in addition to the type identifer of '3' for a text column, to hash the string length as some integer type. I would be happy to provide example code if that would be helpful, but it's really just a strlen() and then hashing a "3", the result of the strlen as a four or eight byte integer, which can be done similarly to how an integer column would be hashed, and then the string content. For the example above, the rows would hash as ("3", 4, "3333", "3", 4, "3333") and ("3", 5, "33333", "3", 3, "333") whereas now, they hash as ("3", "3333", "3" "3333") and ("3", "33333", "3", "333") which are the same. Hashing the string length as well as the data makes the rows different, and I think it would be sufficient to make different rows unique. -- nw _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users