Chris Hostetter wrote:
i'm not crypto expert, but i imagine it would probably take the same
amount of statistical guess work to reconstruct meaningful info from
either approach (hashing hte individual words compared to eliminating the
positions) so i would think the trade off of supporting phrase queries
would make the hasing approach more worthwhile.
hashing the words sounds like a brilliant idea, but it's subject to a
known plaintext attack. Basically if I have the index and a document in
it I can compare the reconstructed document consisting of hashed words
and the original document. I can then say that "cat" is
"d077f244def8a70e5ea758bd8352fcd8", dog is
"06d80eb0c50b49a509b49f2424e8c805" and so on. I then have a choice of
attacking the hash algorithm (there are no prizes for guessing what I
used) or simply constructing a table of known words and hashes.
Injecting my own documents into the corpus would let me fill in missing
words or test an hypothesis about the hash algorithm. It would be fun to
see how long it would take to reconstruct the index with no hashed words
in it :-) The best, really, that can be said about hashing is that it
either piques the interest of someone who would otherwise ignore it and
discourages the most casual of hackers.
Setting all the token positions to zero helps quite a lot because you
then can't distinguish between "the cat and dog sat on the mat" and "the
cat sat on the dog on the mat" (more or less) and one is much more
interesting than the other. Of course, simply knowing that there are
documents that contain the words, oh, "airport" and "bomb" would be
interesting. What would be even more interesting would be comparing the
index at different times -- like knowing that the Pentagon ordered many
more pizzas the day before the air strikes in the Gulf, that kind of
thing. (Langley, incidentally, ordered the same number they always
ordered, but perhaps fewer were put in the trash unconsumed.) (This
might also be an urban myth, I haven't ever seen anything definitive
about it.) (But I digress.)
It all depends on how much value the owner of the original documents
places on them and how much effort he thinks that a hacker might be
prepared to put into recovering the text.
The best you're ever going to do is to protect the index as well as you
do the original documents.
jch
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]