Re: Use language determination tool for SPAM prevention (Was: Spell checker as reasonable SPAM prevention tool)

2011-02-11 Thread Peter Samuelson

[Andreas Tille]
> On Fri, Feb 11, 2011 at 02:27:03PM +, brian m. carlson wrote:
> > 
> > I've been thinking about this some as well for my personal domain.
> > Debian has tools that can determine the language of a document
> > (libtextcat and friends).
> 
> So this is even better.

Amazingly, SpamAssassin has a plugin based on the same algorithm as
libtextcat.

man Mail::SpamAssassin::Plugin::TextCat

for SpamAssassin configuration information.
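
A rough sketch of the kind of algorithm meant here, assuming the usual
description of libtextcat as n-gram rank-order categorization (Cavnar and
Trenkle style). This is not the plugin's or the library's code; the
function names, the two tiny sample texts, and the profile size are all
made up for illustration, and real profiles are trained on far more text.

```python
from collections import Counter

# Tiny, made-up training samples; real language profiles need much more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is "
          "what the people think about that",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und "
          "das ist was die leute davon denken",
}


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically, so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unknown n-grams get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))


def guess_language(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

Calling guess_language(body, PROFILES) returns the key of the closest
profile. Because only n-gram ranks are compared, a few misspelled words
shift the result much less than they would with a dictionary lookup.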
-- 
Peter Samuelson | org-tld!p12n!peter | http://p12n.org/


-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20110212000858.ga10...@p12n.org



Re: Spell checker as reasonable SPAM prevention tool

2011-02-11 Thread The Fungi
On Fri, Feb 11, 2011 at 10:19:07AM +0100, Andreas Tille wrote:
[...]
> I assume that a spell checker can be configured in such a way that
> it can distinguish between an English text with some / several
> mistakes and a text with, say, a 50% error rate, which is probably
> not understandable anyway.

But could it reliably pass MBF announcements which are 99% package
names and (often numerous non-English) maintainer names? Or a
message which is 80% C source code because it contains a patch under
discussion? Those definitely seem to me like important test cases,
at least, which I don't think most human-language-oriented
spell-checkers would deal with well (though I'd love to be proven
wrong!).
-- 
{ IRL(Jeremy_Stanley); WWW(http://fungi.yuggoth.org/); PGP(43495829);
WHOIS(STANL3-ARIN); SMTP(fu...@yuggoth.org); FINGER(fu...@yuggoth.org);
MUD(kin...@katarsis.mudpy.org:6669); IRC(fu...@irc.yuggoth.org#ccl);
ICQ(114362511); YAHOO(crawlingchaoslabs); AIM(dreadazathoth); }





Use language determination tool for SPAM prevention (Was: Spell checker as reasonable SPAM prevention tool)

2011-02-11 Thread Andreas Tille
On Fri, Feb 11, 2011 at 02:27:03PM +, brian m. carlson wrote:
> 
> I've been thinking about this some as well for my personal domain.
> Debian has tools that can determine the language of a document
> (libtextcat and friends).

So this is even better.

> Emails that are 70% or more composed of
> languages that I have no hope of speaking or understanding (i.e.,
> everything but English, Spanish, French, and Portuguese) would be
> rejected.  I chose 70% as the threshold because sometimes Debian lists
> get mails from users in both English and another language (in hopes of
> being understood) and I wouldn't want to penalize those users.  I
> haven't implemented this, but I might at some point.

Publishing the implementation would be cool.

> Obviously, this would have to be adjusted per-list;

This is certainly obvious, which is why I did not mention it.  We have
a default language per list, which certainly creates a need for
configurable filtering per list - but this should be easy enough if we
get it implemented at all.

>  we wouldn't want to
> reject German-language emails to debian-user-german.  I also think
> language testing is better than spell checking for English because
> honestly English has a lot of pretty irregular and bizarre spellings; I
> say this as someone whose native language is English and who spells
> fairly decently.  A spell checker might catch more legitimate emails
> than we'd like.

My idea with the spell checker was just to detect a language - it may
well be that we have better tools than a spell checker to detect the
language in which an e-mail is written, which probably makes the
implementation of the suggestion easier.

Kind regards

  Andreas.

-- 
http://fam-tille.de





Re: Spell checker as reasonable SPAM prevention tool

2011-02-11 Thread brian m. carlson
On Fri, Feb 11, 2011 at 10:19:07AM +0100, Andreas Tille wrote:
> For some time we have been getting more and more SPAM which is easy for
> me to detect (and most probably automatically):  SPAM in languages I
> simply do not understand and which are definitely not English.  Wouldn't
> it be a reasonable means for a SPAM filter to mark mails which blatantly
> fail a spell checker as potential SPAM, and just apply this filter to
> all Debian lists?  We have defined languages for each list, and the "one
> mail per month" where a user just writes in the wrong language by
> accident will probably not harm the project.

I've been thinking about this some as well for my personal domain.
Debian has tools that can determine the language of a document
(libtextcat and friends).  Emails that are 70% or more composed of
languages that I have no hope of speaking or understanding (i.e.,
everything but English, Spanish, French, and Portuguese) would be
rejected.  I chose 70% as the threshold because sometimes Debian lists
get mails from users in both English and another language (in hopes of
being understood) and I wouldn't want to penalize those users.  I
haven't implemented this, but I might at some point.

Obviously, this would have to be adjusted per-list; we wouldn't want to
reject German-language emails to debian-user-german.  I also think
language testing is better than spell checking for English because
honestly English has a lot of pretty irregular and bizarre spellings; I
say this as someone whose native language is English and who spells
fairly decently.  A spell checker might catch more legitimate emails
than we'd like.
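
A minimal sketch of the threshold idea, under stated assumptions: the
message is split into paragraphs, each paragraph is classified, and the
share is measured in characters. The per-list table, the function names,
and toy_detect (a stand-in for a real guesser such as libtextcat) are
all hypothetical.

```python
# Hypothetical per-list settings; only the two list names come from the thread.
ACCEPTED = {
    "debian-devel": {"en"},
    "debian-user-german": {"de"},
}
REJECT_THRESHOLD = 0.70  # the 70% figure proposed above


def toy_detect(paragraph):
    """Stand-in detector; a real setup would call a language guesser."""
    german_markers = {"der", "die", "das", "und", "ist", "nicht"}
    words = set(paragraph.lower().split())
    return "de" if words & german_markers else "en"


def foreign_fraction(paragraphs, accepted, detect):
    """Share of characters in paragraphs whose language is not accepted."""
    total = sum(len(p) for p in paragraphs)
    if total == 0:
        return 0.0
    foreign = sum(len(p) for p in paragraphs if detect(p) not in accepted)
    return foreign / total


def should_reject(body, list_name, detect):
    """Reject when at least 70% of the body is in languages the list does not accept."""
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    fraction = foreign_fraction(paragraphs, ACCEPTED[list_name], detect)
    return fraction >= REJECT_THRESHOLD
```

With this shape, a mail carrying the same text in English and in another
language stays below the threshold on an English list, while a mail
entirely in another language does not.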

-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187




Re: Spell checker as reasonable SPAM prevention tool

2011-02-11 Thread Andreas Tille
On Fri, Feb 11, 2011 at 10:42:49AM +0100, Samuel Thibault wrote:
> Andreas Tille, on Fri 11 Feb 2011 10:19:07 +0100, wrote:
> > PS: I assume that a spell checker can be configured in such a way
> > that it can distinguish between an English text with some / several
> > mistakes and a text with, say, a 50% error rate, which is probably
> > not understandable anyway.
> 
> Mmm, I think we've already had users that have even 50% error rate,
> simply because they mispell things. Yes, not everybody has even a basic
> knowledge level in English, but they still can provide useful input to a
> mailing list.

What limit to put on the error rate might be a topic for further
investigation, but I'm quite positive that there are reasonable
algorithms to detect what language a text is in, or rather to detect
whether a text attempts to be written in a certain language (which is
probably easier than guessing the language).  Whether it is worth doing
some stats on the mailing list archive about this depends on whether we
finally want this language detection method for a SPAM filter or not.

My guess is that you will find a ratio of misspelled words / total
number of words which is a clear sign of non-English text, then you
have some intermediate area to which postings like the ones you are
worried about belong, and then there are the postings which are
obviously trying hard to write some English.  I'd like to get rid of
the clearly non-English texts.  I have the impression that we have been
getting more and more of these for some time, and I assume that Bayesian
filters are not (yet) trained well enough to detect these as SPAM.  So
we need to find some other means.
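
The three areas described above could be sketched roughly like this.
The two cut-off values are invented placeholders (real ones would have
to come from stats on the list archive), the function names are
hypothetical, and the dictionary is assumed to be a plain set of
lowercase words, for example loaded from a system word list.

```python
import re

# Hypothetical cut-offs; real values would come from archive statistics.
CLEARLY_FOREIGN = 0.80
PROBABLY_ENGLISH = 0.30


def misspelled_ratio(text, dictionary):
    """Misspelled words / total number of words, as a value in [0, 1]."""
    # Runs of letters (Unicode-aware), lowercased; digits and punctuation split words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in dictionary)
    return unknown / len(words)


def classify(text, dictionary):
    """Sort a posting into one of the three areas by its misspelled ratio."""
    ratio = misspelled_ratio(text, dictionary)
    if ratio >= CLEARLY_FOREIGN:
        return "non-English"
    if ratio > PROBABLY_ENGLISH:
        return "intermediate"
    return "English"
```

Only the "non-English" band would be a candidate for filtering; the
intermediate band is exactly where postings with many spelling mistakes
end up, so it is left alone in this sketch.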

Kind regards

   Andreas.

-- 
http://fam-tille.de





Re: Spell checker as reasonable SPAM prevention tool

2011-02-11 Thread Michelle Konzack
Hello Samuel Thibault,

On 2011-02-11 10:42:49, you hacked out the following:
> Andreas Tille, on Fri 11 Feb 2011 10:19:07 +0100, wrote:
> > PS: I assume that a spell checker can be configured in such a way
> > that it can distinguish between an English text with some / several
> > mistakes and a text with, say, a 50% error rate, which is probably
> > not understandable anyway.
> Mmm, I think we've already had users that have even 50% error rate,
> simply because they mispell things. Yes, not everybody has even a basic
> knowledge level in English, but they still can provide useful input to a
> mailing list.

In the roughly 600 Latvian spams I have gotten in the last 3 weeks, there
are enough keywords which identify the mails as spam, and I do not know
why, but SpamAssassin gave the messages a score of -4 and greater.

Thanks, Greetings and nice Day/Evening
Michelle Konzack

-- 
# Debian GNU/Linux Consultant ##
   Development of Intranet and Embedded Systems with Debian GNU/Linux

itsystems@tdnet France EURL       itsystems@tdnet UG (limited liability)
Owner Michelle Konzack            Owner Michelle Konzack

Apt. 917 (homeoffice)
50, rue de Soultz                 Kinzigstraße 17
67100 Strasbourg/France           77694 Kehl/Germany
Tel: +33-6-61925193 mobil         Tel: +49-177-9351947 mobil
Tel: +33-9-52705884 fix

  
 

Jabber linux4miche...@jabber.ccc.de
ICQ#328449886

Linux-User #280138 with the Linux Counter, http://counter.li.org/




Re: Spell checker as reasonable SPAM prevention tool

2011-02-11 Thread Cyril Brulebois
Samuel Thibault (11/02/2011):
> Mmm, I think we've already had users that have even 50% error rate,
> simply because they mispell things.

I like the intended pun!

KiBi.




Re: Spell checker as reasonable SPAM prevention tool

2011-02-11 Thread Samuel Thibault
Andreas Tille, on Fri 11 Feb 2011 10:19:07 +0100, wrote:
> PS: I assume that a spell checker can be configured in such a way
> that it can distinguish between an English text with some / several
> mistakes and a text with, say, a 50% error rate, which is probably
> not understandable anyway.

Mmm, I think we've already had users that have even 50% error rate,
simply because they mispell things. Yes, not everybody has even a basic
knowledge level in English, but they still can provide useful input to a
mailing list.

Samuel





Spell checker as reasonable SPAM prevention tool

2011-02-11 Thread Andreas Tille
Hi,

For some time we have been getting more and more SPAM which is easy for
me to detect (and most probably automatically):  SPAM in languages I
simply do not understand and which are definitely not English.  Wouldn't
it be a reasonable means for a SPAM filter to mark mails which blatantly
fail a spell checker as potential SPAM, and just apply this filter to
all Debian lists?  We have defined languages for each list, and the "one
mail per month" where a user just writes in the wrong language by
accident will probably not harm the project.

Just my 0.02 Euro

  Andreas.

PS: I assume that a spell checker can be configured in such a way that
it can distinguish between an English text with some / several mistakes
and a text with, say, a 50% error rate, which is probably not
understandable anyway.

-- 
http://fam-tille.de

