Hi, I’ve recently gotten emails (a lot of them, as it happened) with the following subject line:
Subject: H¡gh level of r¡sk. Your account has been hacked. Change yøur passwørd.

I’ve seen similar emails in the past that use simple mechanical substitutions (Greek alpha for ‘a’, Cyrillic a for ‘a’, Cyrillic A for ‘A’, Cyrillic VE for ‘B’, Cyrillic IE for ‘E’, Cyrillic EN for ‘H’, etc.).

The String::Approx module (see https://metacpan.org/pod/String::Approx) allows for weighting insertions/deletions/substitutions, and what we’re seeing here is a heavy use of substitutions. I’m thinking about a module where you could enter the ASCII string:

High level of risk. Your account has been hacked. Change your password.

and every variant of it produced by substitution would be matched, as long as some threshold isn’t exceeded (say, 10 or 15% of characters substituted, which seems like a reasonable ceiling).

I’ve also seen spam where words are deliberately misspelled to avoid exact matches: doubled letters are dropped and similar letters are swapped (‘n’ for ‘m’, ‘z’ for ‘s’, ‘k’ for ‘c’, etc.). So simply replacing non-ASCII letters with their ASCII “approximates” wouldn’t be sufficient, because of the shuffling within the ASCII space as well.

Has anyone else considered approximate string matching?

Thanks,

-Philip
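
P.S. To make the idea a bit more concrete, here is a rough sketch of what I have in mind. It is in Python rather than Perl and does not use String::Approx, so treat it as an illustration only. Every name in it (CONFUSABLES, fold, edit_distance, looks_like, max_ratio) is made up for this example, the look-alike table is deliberately tiny, and it uses plain unit-cost Levenshtein distance after folding instead of separately weighted substitutions.

```python
import unicodedata

# Hypothetical, deliberately tiny look-alike table (an assumption for this
# sketch); a usable table would need far more entries.
CONFUSABLES = {
    "\u00a1": "i",  # inverted exclamation mark, used in place of 'i'
    "\u00f8": "o",  # Latin o with stroke
    "\u03b1": "a",  # Greek alpha
    "\u0430": "a",  # Cyrillic a
    "\u0410": "A",  # Cyrillic A
    "\u0412": "B",  # Cyrillic VE
    "\u0415": "E",  # Cyrillic IE
    "\u041d": "H",  # Cyrillic EN
}


def fold(text):
    """Map known look-alikes to ASCII, strip combining marks, lowercase."""
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", mapped)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def edit_distance(a, b):
    """Plain Levenshtein distance: unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def looks_like(candidate, template, max_ratio=0.15):
    """True if the folded candidate is within max_ratio edits of the template."""
    folded_template = fold(template)
    budget = max_ratio * len(folded_template)
    return edit_distance(fold(candidate), folded_template) <= budget


TEMPLATE = "High level of risk. Your account has been hacked. Change your password."
SUBJECT = "H\u00a1gh level of r\u00a1sk. Your account has been hacked. Change y\u00f8ur passw\u00f8rd."
```

Here looks_like(SUBJECT, TEMPLATE) holds because folding alone recovers the template, while the edit-distance budget is what would catch the misspellings that stay inside ASCII (dropped doubled letters, ‘z’ for ‘s’ and so on). Counting the look-alike substitutions separately from the ASCII edits, as I described above, would need a weighted distance instead of this unit-cost one.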