Re: "Approximate" hashes
On Wed, 1 Sep 2004, Marcel Popescu wrote: > Hence my question: is there some "approximate" hash function (which I could > use instead of SHA-1) which can verify that a text hashes "very close" to a > value? So that if I change, say, tabs into spaces, I won't get exactly the > same value, but I would get a "good enough"? Hi Marcel, You may wish to look at Cmeclax's nilsimsa. It has been used to detect slightly-modified message floods in anonymous remailer systems, and was also used in Spamassassin at some point. http://lexx.shinn.net/cmeclax/nilsimsa.html - The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
RE: "Approximate" hashes
> On Behalf Of Marcel Popescu ... > My problem is that I don't know what happens with the email in transit > (this, I believe, is an observation in the hashcash FAQ). I > am worried that some mail server might dislike ASCII characters with > > Hence my question: is there some "approximate" hash function > (which I could > use instead of SHA-1) which can verify that a > text hashes "very close" to a value? At 12:27 PM 9/1/2004, Keith Ray wrote: nilsimsa Computes nilsimsa codes of messages and compares the codes and finds clusters of similar messages so as to trash spam. Check out Vipul's Razor, which uses an approach similar to this. You'll find information at Cloudmark and on Sourceforge. There are several different kinds of differences to work around - - damage in transit, as noted, though it's the least of your worries in spite of Unicode, MS Codesets, and 8-bit-uncleanness - different mail headers getting added or subtracted or mimed (Some people include relevant parts in their message indexes, some don't.) - deliberate differences introduced in the message to discourage detection, ranging from the simple "Dear Alice"/"Dear Bob" to removal addresses than encode each spam victim's info, to different random word-scramble that's also there to discourage Bayesian spam-detectors. This one's really common these days, especially as mail systems have decreased the number of users they'll send to in a given SMTP session / envelope because of spamming - if you can only spam 5-10 recipients per TCP session, might as well make each session somewhat different so you only get hit by local detectors, not global indexers. Vipul's Razor and related approaches try to calculate a unique id for each message so that if a human detects that a message is spam, the id can be published so everybody else trashes it. This usually needs more than one human rating something as spam to prevent abuse, and there's some tuning, but it's a good start. - The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
Re: "Approximate" hashes
Hence my question: is there some "approximate" hash function (which I could use instead of SHA-1) which can verify that a text hashes "very close" to a value? So that if I change, say, tabs into spaces, I won't get exactly the same value, but I would get a "good enough"? What you want is what's called a canonicalization function. You want to hash not T, but F(T), and F can de-tabify, and so on. As has been mentioned, OpenPGP has a canonicalization for text and cleartext signatures (and we debate what it should be, even). XML Digital Signatures has one. The major other issue you have to deal with is whether the canonicalization is an interpretation, a conversion, or an assurance. If it's an interpretation, then you are hashing F(T), and there's always some slim chance that there's a collision between two texts that canonicalize to different values, but hash to the same. I wouldn't worry about it, myself. Much. (Assuming a decent canonicalizer, of course.) If it's a conversion, then you're replacing the text with the canonicalized text. This puts the canonicalization in the face of the users, but removes the problems handwaved at above. If it's an assurance, then if the text is not canonicalized, then it's not a valid message if it doesn't meet the requirements. A message with a tab, for example, just isn't a valid message. Don't even hash it, return an error. Or alert that it was an invalid message. Jon - The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
RE: "Approximate" hashes
| nilsimsa | Computes nilsimsa codes of messages and compares the codes and finds | clusters of similar messages so as to trash spam. | | What's a nilsimsa code? | | A nilsimsa code is something like a hash, but unlike hashes, a small change | in the message results in a small change in the nilsimsa code. | | http://lexx.shinn.net/cmeclax/nilsimsa.html I had a look at the code (which isn't easy to follow). This appears to be a new application of Bloom filters. -- Jerry - The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
RE: "Approximate" hashes
> -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Marcel Popescu > Sent: Wednesday, September 01, 2004 9:56 AM > To: [EMAIL PROTECTED] > Subject: "Approximate" hashes > > I am trying to build a Windows anti-spam thingy; it's > supposed to "sit" in > between the mail client and the outer world, and indicate through mail > headers whether the incoming mail has a valid hashcash > http://www.hashcash.org/ "coin" (and, of course, to automatically add > hashcash to outgoing emails). > > My problem is that I don't know what happens with the email in transit > (this, I believe, is an observation in the hashcash FAQ). I > am worried that > some mail server might dislike ASCII characters with the high > bit set, or > that a client uses some encoding which for some reason > doesn't make it to > the destination unchanged. > > Hence my question: is there some "approximate" hash function > (which I could > use instead of SHA-1) which can verify that a text hashes > "very close" to a > value? So that if I change, say, tabs into spaces, I won't > get exactly the > same value, but I would get a "good enough"? > > I don't know if this is possible. But if it is, I though this > would be a > good place to find out about it. nilsimsa Computes nilsimsa codes of messages and compares the codes and finds clusters of similar messages so as to trash spam. What's a nilsimsa code? A nilsimsa code is something like a hash, but unlike hashes, a small change in the message results in a small change in the nilsimsa code. http://lexx.shinn.net/cmeclax/nilsimsa.html -- Keith Ray <[EMAIL PROTECTED]> -- OpenPGP Key: 0x79269A12 - The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
Re: "Approximate" hashes
At 06:02 PM 9/1/04 +0300, Marcel Popescu wrote: >From: "Marcel Popescu" <[EMAIL PROTECTED]> > >> Hence my question: is there some "approximate" hash function (which I >could >> use instead of SHA-1) which can verify that a text hashes "very close" to >a >> value? So that if I change, say, tabs into spaces, I won't get exactly the >> same value, but I would get a "good enough"? This is completely what secure hashes are supposed to prevent. *Any* change in the input will flip half the hash-bits on average. Just like a block cipher. There is no input "nearness" preservation. That's part of the point of the algorithm. >I just had an idea. Would this work? > >- let S be the input string, whose hash I want to verify >- make S uppercase >- remove everything but A-Z, 0-9, and common punctuation (!;:'",.?) >- calculate the SHA1 hash of the result > >This should keep any insignificant changes out of the final result. You can encode your message in some format which is not subject to mangling, and use a hash of that encoding. You can then decode the encoding back into unicode or whatever funky but net-fragile character set you like. This is somewhat like ascii-armoring of PGP-encrypted messages. = 36 Laurelwood Dr Irvine CA 92620-1299 VOX: (714) 544-9727 (home) mnemonic: P1G JIG WRAP ICBM: -117.7621, 33.7275 HTTP: http://68.5.216.23:81 (back up, but not 99.999% reliable) PGP PUBLIC KEY: by arrangement Send plain ASCII text not HTML lest ye be misquoted -- "Don't 'sir' me, young man, you have no idea who you're dealing with" Tommy Lee Jones, MIB No, you're not 'tripping', that is an emu ---Hank R. Hill - The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
Re: "Approximate" hashes
From: ""Hal Finney"" <[EMAIL PROTECTED]> > As you are probably aware, existing hashcash implementations do not base > the stamp on the message content. Instead they only lock the stamp to > the receiver's email address. Then the receiver keeps a list of the > hashcash stamps he has seen recently, to prevent reuse. I'm not sure > what you hope to gain by locking the stamp to the message content. Me dumb, sorry. I was actually thinking of some other thing I wanted to do about spam (which involved the whole content), and haven't re-read the hashcash site in about a week. You are perfectly right, of course. Windows spam victims, here I come! Thanks, Marcel - The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
Re: "Approximate" hashes
On Wed, 1 Sep 2004, Marcel Popescu wrote: > My problem is that I don't know what happens with the email in transit > some mail server might dislike ASCII characters with the high bit set, or > Hence my question: is there some "approximate" hash function (which I could PGP has this issue with 'clear-signed' output. An excerpt from the pgp man page: If CLEARSIG is enabled, then when signing and ASCII-armoring a text file, PGP uses a different format that includes the plaintext in human-readable form. Lines beginning with "-" are quoted with "- ". To cope with some of the stupider mailers in the world, lines beginning with "From" are also quoted, and trailing whitespace on lines is stripped. PGP will remove the quoting if you use it to decrypt the message, but the trailing whitespace is not recovered. This is still useful enough to be enabled by default. You might find more details about typical mailer-munging from the PGP docs or source. --scott GRALLSPICE Moscow MKSEARCH affinity group KUDESK Nazi operative Clinton Register to vote! http://www.yourvotematters.org/VerifiedVoting ( http://cscott.net/ ) - The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]
Re: "Approximate" hashes
From: "Marcel Popescu" <[EMAIL PROTECTED]> > Hence my question: is there some "approximate" hash function (which I could > use instead of SHA-1) which can verify that a text hashes "very close" to a > value? So that if I change, say, tabs into spaces, I won't get exactly the > same value, but I would get a "good enough"? I just had an idea. Would this work? - let S be the input string, whose hash I want to verify - make S uppercase - remove everything but A-Z, 0-9, and common punctuation (!;:'",.?) - calculate the SHA1 hash of the result This should keep any insignificant changes out of the final result. Does anyone know of a mail transformation which could upset it? Can anyone see a way to "attack" this by letting a significantly different message collide on the same hash? (I'm ignoring the recent discoveries - they're not that practical, I'm only trying to fight spam, not the government.) Thanks, Marcel - The Cryptography Mailing List Unsubscribe by sending "unsubscribe cryptography" to [EMAIL PROTECTED]