On Fri, Feb 11, 2011 at 10:19:07AM +0100, Andreas Tille wrote:
> since some time we get more and more SPAM which is easily to detect for
> me (and most probably automatically): SPAM in languages I do simply not
> understand and which are definitely not English.  Wouldn't it be a
> reasonable means for a SPAM filter to mark mails which blatantly fail a
> spell checker to mark as potential SPAM and just apply this filter to
> all Debian lists.  We have defined languages for each list and the "one
> mail per month" were a user just writes in the wrong language by
> accident will probably not harm the project.

I've been thinking about this some as well for my personal domain.
Debian has tools that can determine the language of a document
(libtextcat and friends).  Emails that are 70% or more composed of
languages that I have no hope of speaking or understanding (i.e.,
everything but English, Spanish, French, and Portuguese) would be
rejected.  I chose 70% as the threshold because sometimes Debian lists
get mails from users in both English and another language (in hopes of
being understood) and I wouldn't want to penalize those users.

I haven't implemented this, but I might at some point.  Obviously, this
would have to be adjusted per-list; we wouldn't want to reject
German-language emails to debian-user-german.

I also think language testing is better than spell checking for English
because honestly English has a lot of pretty irregular and bizarre
spellings; I say this as someone whose native language is English and
who spells fairly decently.  A spell checker might catch more legitimate
emails than we'd like.

-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187
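
A rough sketch of the 70% rule described above, in Python.  It is not an
existing implementation and it does not use libtextcat: the tiny stopword
lists are a toy stand-in for a real language identifier.  The function
names, the per-paragraph weighting by word count, and the choice to count
unrecognized text as "not understood" are all assumptions made for the
sake of illustration.

```python
import re

# Toy stopword lists standing in for a real language identifier such as
# libtextcat. They are illustrative only and far too small for real use.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "this", "for", "with"},
    "es": {"el", "la", "los", "que", "y", "en", "es", "por", "para", "una"},
    "fr": {"le", "les", "et", "est", "des", "que", "dans", "pour", "une", "pas"},
    "pt": {"o", "os", "que", "e", "em", "não", "para", "uma", "com", "você"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "ein", "ich"},
}


def tokenize(text):
    """Return lowercase runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def guess_language(words):
    """Pick the language whose stopwords occur most often, else 'unknown'."""
    scores = {lang: sum(w in stop for w in words)
              for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def foreign_fraction(body, allowed):
    """Fraction of words in paragraphs whose guessed language is not allowed.

    Assumption: paragraphs in an unrecognized language count as foreign.
    """
    total = foreign = 0
    for paragraph in re.split(r"\n\s*\n", body):
        words = tokenize(paragraph)
        if not words:
            continue
        total += len(words)
        if guess_language(words) not in allowed:
            foreign += len(words)
    return foreign / total if total else 0.0


def should_reject(body, allowed, threshold=0.70):
    """Apply the threshold: reject when at least 70% of the mail is foreign."""
    return foreign_fraction(body, allowed) >= threshold


# The allowed set is per list, e.g. {"de"} for debian-user-german.
UNDERSTOOD = {"en", "es", "fr", "pt"}
```

With this weighting, a mail written half in English and half in German
stays well under the threshold for an English-language list, while a mail
entirely in German would be rejected there and accepted on a list whose
allowed set is {"de"}.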