RE: [Mimedefang] Image validator/OCR SA plugin

2006-04-23 Thread Martin Blapp


Hi,


> be something to be gained by running the OCR scan from MIMEDefang?
> The idea would be to run the scan, and if sufficient text results
> (I'd hesitate to suggest that a quick spelling scan would be run on
> the result, but that is a possibility) that this text is written
> by MdF into a new text attachment.  The message is then reformulated
> and passed to SpamAssassin.  The advantage of this approach is that
> SA (and rules du jour) already have rules for catching things like
> pharma and stock scam e-mail, so the normal scoring should catch these


Hmm, the SA and rules du jour stock and obfu rules suck ;-) Besides that,
I also match some words which are 100% legitimate. And the OCR words
are often truncated, so one must match those too.
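
A minimal sketch of matching truncated OCR words, written in Python rather than the plugin's Perl. The keyword list, the minimum fragment length of 4, and both function names are assumptions made for illustration; none of it is taken from ocrtext.pm.

```python
import re

# Hypothetical keyword list; the plugin's real word list is not shown in this thread.
KEYWORDS = ("pharmacy", "viagra", "stock", "investor")
MIN_PREFIX = 4  # assumed: shortest truncated fragment still counted as a hit


def matches_keyword(token, keyword, min_prefix=MIN_PREFIX):
    """True if token is the keyword or a truncated piece of it (head or tail lost)."""
    token = token.lower()
    if len(token) < min_prefix:
        return False
    return keyword.startswith(token) or keyword.endswith(token)


def count_hits(ocr_text, keywords=KEYWORDS):
    """Count OCR tokens that match a keyword, allowing for truncated words."""
    tokens = re.findall(r"[A-Za-z]+", ocr_text)
    return sum(1 for t in tokens if any(matches_keyword(t, k) for k in keywords))


# "STOC", "armacy" and "invest" are truncated forms of listed keywords.
print(count_hits("Hot STOC alert: armacy prices for every invest"))  # prints 3
```

The minimum length keeps very short fragments from matching everything, which is one way to limit hits on legitimate words.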


> things.  Also this approach would work on versions of SA prior to 3.1.1.
> There is a design decision as to whether the OCR'd text attachment should
> remain in the message and then be delivered to the user, or whether it
> would only be kept if SA scores the message as spam.


If you add the OCR'd text attachment to the message you'll have to resend
the whole message. Not a good idea IMHO.

Martin
___
NOTE: If there is a disclaimer or other legal boilerplate in the above
message, it is NULL AND VOID.  You may ignore it.

Visit http://www.mimedefang.org and http://www.roaringpenguin.com
MIMEDefang mailing list MIMEDefang@lists.roaringpenguin.com
http://lists.roaringpenguin.com/mailman/listinfo/mimedefang


Re: [Mimedefang] Image validator/OCR SA plugin

2006-04-23 Thread David F. Skoll
Martin Blapp wrote:

> Hmm, the SA and rules du jour stock and obfu rules suck ;-) Besides that,
> I also match some words which are 100% legitimate. And the OCR words
> are often truncated, so one must match those too.

But the real key is Bayes.  Adding the OCR words to Bayes will be a real
advantage.

> If you add the OCR'd text attachment to the message you'll have to resend
> the whole message. Not a good idea IMHO.

No, you only add it to what you feed SpamAssassin.
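
One way to picture this, as a sketch only: build a second copy of the message that carries the OCR text as an extra part, hand that copy to the scanner, and deliver the original untouched. The example uses Python's standard email package instead of MIMEDefang's Perl API; the function name and the attachment file name are hypothetical.

```python
from email import message_from_string, policy


def build_scan_copy(raw_message, ocr_text):
    """Return a copy of the message with the OCR text added as an extra text part.

    Only this copy is handed to the spam scanner. raw_message is never
    modified, so the mail that gets delivered stays exactly as received.
    """
    scan_msg = message_from_string(raw_message, policy=policy.default)
    if scan_msg.get("Content-Type") is None:
        # Without any Content-* header the existing body would be dropped
        # when the message is converted to multipart below.
        scan_msg["Content-Type"] = "text/plain"
    # add_attachment() converts a non-multipart message to multipart/mixed first.
    scan_msg.add_attachment(ocr_text, filename="ocr-text.txt")  # hypothetical name
    return scan_msg.as_string()


raw = "From: a@example.com\nTo: b@example.com\nSubject: test\n\nSee the picture.\n"
scan_copy = build_scan_copy(raw, "BUY THIS STOCK NOW")
print("BUY THIS STOCK NOW" in scan_copy, "BUY THIS STOCK NOW" in raw)
```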

Regards,

David.


RE: AW: [Mimedefang] Image validator/OCR SA plugin

2006-04-21 Thread Gary Funck


> -Original Message-
> From: Martin Blapp
> Sent: Monday, April 17, 2006 8:00 AM
>> Spamassassin version is 3.1.0, looks like I'll have to upgrade to 3.1.1
>> to get this to work?
>
> Seems so, yes. I'll correct the manual.

Has this package/plugin been updated yet with the various fixes
suggested to date?  I had a little trouble tracking some
of the suggested fixes.




RE: [Mimedefang] Image validator/OCR SA plugin

2006-04-19 Thread Cormack, Ken
So far in my tests, this OCR plugin looks like it's working ok.  I rounded
up the needed prereqs (that was a bit of a chore, but everything compiled
cleanly), and changed the package definition as indicated in Martin's post
(be sure to run spamassassin -D --lint).  So far I've seen several hits
for the ocr SUSPECT_GIF rule, with no detectable problems.

Ken



Re: [Mimedefang] Image validator/OCR SA plugin

2006-04-19 Thread Nels Lindquist
On 14 Apr 2006 at 18:42, Martin Blapp wrote:

> This is just a little advertisement for my plugin which is now
> in a usable state and works very well.
>
> Anyone interested should keep an eye on it - it really helps
> with the image only spam we get today. But probably the spammers
> will soon change their tricks to different images which are more
> difficult to read :-(

This is a really cool idea.

As far as spammers obfuscating their images, couldn't that be worked 
around by tying OCR into the bayesian system?  Then obfuscation 
wouldn't matter--whatever munging is done to a particular image would 
produce the same OCR strings, before and after bayes training.  You 
wouldn't need to know particular strings to match beforehand in that 
case.
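
A toy illustration of that idea, assuming the OCR output is simply split into tokens and fed to a Bayesian classifier. This is a deliberately tiny classifier with add-one smoothing and equal priors, not SpamAssassin's Bayes implementation; the class name and the sample tokens are made up.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy token classifier with add-one smoothing; not SpamAssassin's Bayes."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label, tokens):
        # Count each token once per message.
        self.counts[label].update(set(tokens))
        self.docs[label] += 1

    def spam_probability(self, tokens):
        log_odds = 0.0  # equal priors assumed
        for tok in set(tokens):
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


bayes = TinyBayes()
# Whatever strings the OCR produces are learned as they are, garbled or not.
bayes.train("spam", ["st0ck", "a1ert", "buy"])
bayes.train("spam", ["st0ck", "pr1ce", "buy"])
bayes.train("ham", ["meeting", "agenda", "buy"])
print(bayes.spam_probability(["st0ck", "a1ert"]) > 0.5)  # prints True
```

No keyword list is needed here: a garbled token such as "st0ck" counts as spam evidence simply because it was seen in trained spam.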

That would force image spammers to produce a unique obfuscated 
graphic for every single message, which seems like an expensive 
proposition.

Of course, I once thought producing a unique set of (text) bayes 
poison for every message was expensive, and that sure didn't stop 
them...


Nels Lindquist *
Information Systems Manager
Morningstar Air Express Inc.



Re: [Mimedefang] Image validator/OCR SA plugin

2006-04-19 Thread David F. Skoll
Nels Lindquist wrote:

> As far as spammers obfuscating their images, couldn't that be worked
> around by tying OCR into the bayesian system?

I think the original idea was to obfuscate the images so people could
read the text, but OCR tools wouldn't be able to.

> Then obfuscation wouldn't matter--whatever munging is done to a
> particular image would produce the same OCR strings, before and
> after bayes training.  You wouldn't need to know particular strings
> to match beforehand in that case.

True, but you'd need to see enough of them to train your Bayes engine.

> That would force image spammers to produce a unique obfuscated
> graphic for every single message, which seems like an expensive
> proposition.

Sadly, serious spammers have virtually unlimited computing resources.
There are armies of thousands of zombie machines out there waiting to
do their masters' bidding...

Adding random noise that fools OCR tools but leaves the images legible
for humans probably isn't that computationally expensive.
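
To give a rough sense of the cost, here is a sketch that flips a fixed fraction of the pixels in a black-and-white bitmap. It is a single pass over the pixels per image. The bitmap representation (a list of rows of 0/1 values) and the function name are assumptions for illustration; whether noise of this kind actually defeats a given OCR tool is not something the sketch shows.

```python
import random


def add_noise(bitmap, fraction, seed=None):
    """Return a copy of a 0/1 bitmap with int(fraction * pixels) pixels flipped."""
    rng = random.Random(seed)
    height, width = len(bitmap), len(bitmap[0])
    coords = [(y, x) for y in range(height) for x in range(width)]
    noisy = [row[:] for row in bitmap]  # leave the input untouched
    for y, x in rng.sample(coords, int(fraction * len(coords))):
        noisy[y][x] ^= 1
    return noisy


blank = [[0] * 10 for _ in range(10)]
noisy = add_noise(blank, 0.1, seed=1)
print(sum(map(sum, noisy)))  # prints 10: ten of the hundred pixels were flipped
```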

The only way to defeat image spam would be if Microsoft modifies
Outlook not to display HTML or images, and for Thunderbird et al to
follow suit.  Anyone care to bet on the odds of that happening? :-(

Regards,

David.


Re: [Mimedefang] Image validator/OCR SA plugin

2006-04-18 Thread Philip Prindeville
Dave Williss wrote:

> - Original Message -
> From: Gary Funck [EMAIL PROTECTED]
> To: mimedefang@lists.roaringpenguin.com
> Sent: Sunday, April 16, 2006 6:34 PM
> Subject: RE: [Mimedefang] Image validator/OCR SA plugin
>
>> Martin wrote:
>>> But probably the spammers
>>> will soon change their tricks to different images which are more
>>> difficult to read :-(
>>>
>>> http://antispam.imp.ch/patches/patch-ocrtext
>>
>> On this topic, Nick FitzGerald mentioned this article,
>> http://www.jgc.org/blog/2006/01/do-spammers-fear-ocr.html
>>
>> Sunday, January 15, 2006
>> Do spammers fear OCR?
>> Nick FitzGerald recently sent me two sample spams that seem to indicate that
>> some spammers fear that using images in place of words isn't enough. They've
>> started to obscure their messages to prevent optical character recognition.
>
> I'm afraid they'll start using OCR themselves.  One common trick to allow a
> web site to have an email address humanly readable but not harvestable is to
> put it in an image.  That may not be so safe any more :-(  Of course, they'd
> have to scan an awful lot of images in the hopes of finding an email address
> in any of them, so they may not find it worth the effort.
  


Good!

That's all I can say.

Fine!

Putting stuff into bitmaps is a travesty. It makes websites (a) harder to
search automatically and, much more egregious, (b) less accessible to people
with vision impairment who use text-to-speech browsers.

A lot of web pages also don't seem to take into account .21 or .19 pitch
LCD monitors (like the 2560x1600 monitor I'm staring at).  Do you know what a
7x9 font looks like on that monitor?  It looks like lint.  Putting in images
or explicit font sizes (like "use a 9 pixel high font here" instead of saying
"use an 8 point font here" and letting the browser scale it appropriately) is
idiotic.
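
The arithmetic behind that complaint can be sketched as follows, assuming the quoted "pitch" is the pixel spacing in millimetres and using the standard 72 points per inch. The function names are made up for the example.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72


def glyph_height_mm(pixels, pitch_mm):
    """Physical height of a glyph that is `pixels` tall at the given pixel pitch."""
    return pixels * pitch_mm


def points_to_pixels(points, pitch_mm):
    """Pixels needed to render a font of `points` size at its true physical size."""
    dpi = MM_PER_INCH / pitch_mm
    return points * dpi / POINTS_PER_INCH


for pitch in (0.21, 0.19):
    print(f"pitch {pitch} mm: a 9 px font is {glyph_height_mm(9, pitch):.2f} mm tall, "
          f"an 8 pt font needs {points_to_pixels(8, pitch):.1f} px")
```

At a 0.21 mm pitch a 9 pixel font is under 2 mm tall, while an 8 point font scaled by the browser would use about 13 pixels.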

I'm glad SOME good has finally come of the travail of spammers (though I
never thought I'd live to hear myself say it).

Ok.  End of rant.

-Philip



AW: [Mimedefang] Image validator/OCR SA plugin

2006-04-17 Thread Martin Bene
Hi Martin,
 
> Anyone interested should keep an eye on it - it really helps
> with the image only spam we get today. But probably the spammers
> will soon change their tricks to different images which are more
> difficult to read :-(
>
> http://antispam.imp.ch/patches/patch-ocrtext

Just tried to get this to run on one of my test boxes;

* first problem, as described by Paul Murphy:  Can't locate object
method "new" via package "Mail::SpamAssassin::Plugin::ocrtext", so I
changed the package definition.

* next problem: running spamassassin -t on a test message gives me this
output:

[5681] warn: plugin: eval failed: Can't locate object method "new" via
package "Mail::SpamAssassin::Timeout" (perhaps you forgot to load
"Mail::SpamAssassin::Timeout"?) at /etc/mail/spamassassin/ocrtext.pm
line 391.

Spamassassin version is 3.1.0, looks like I'll have to upgrade to 3.1.1
to get this to work?

Thanks, Martin



Re: AW: [Mimedefang] Image validator/OCR SA plugin

2006-04-17 Thread Martin Blapp


Hi,


> Spamassassin version is 3.1.0, looks like I'll have to upgrade to 3.1.1
> to get this to work?


Seems so, yes. I'll correct the manual.


Thanks, Martin



Re: [Mimedefang] Image validator/OCR SA plugin

2006-04-17 Thread Dave Williss


- Original Message -
From: Gary Funck [EMAIL PROTECTED]
To: mimedefang@lists.roaringpenguin.com
Sent: Sunday, April 16, 2006 6:34 PM
Subject: RE: [Mimedefang] Image validator/OCR SA plugin

> Martin wrote:
>> But probably the spammers
>> will soon change their tricks to different images which are more
>> difficult to read :-(
>>
>> http://antispam.imp.ch/patches/patch-ocrtext
>
> On this topic, Nick FitzGerald mentioned this article,
> http://www.jgc.org/blog/2006/01/do-spammers-fear-ocr.html
>
> Sunday, January 15, 2006
> Do spammers fear OCR?
> Nick FitzGerald recently sent me two sample spams that seem to indicate that
> some spammers fear that using images in place of words isn't enough. They've
> started to obscure their messages to prevent optical character recognition.

I'm afraid they'll start using OCR themselves.  One common trick to allow a
web site to have an email address humanly readable but not harvestable is to
put it in an image.  That may not be so safe any more :-(  Of course, they'd
have to scan an awful lot of images in the hopes of finding an email address
in any of them, so they may not find it worth the effort.





RE: [Mimedefang] Image validator/OCR SA plugin

2006-04-16 Thread Gary Funck
Martin wrote:
> But probably the spammers
> will soon change their tricks to different images which are more
> difficult to read :-(
>
> http://antispam.imp.ch/patches/patch-ocrtext

On this topic, Nick FitzGerald mentioned this article,
http://www.jgc.org/blog/2006/01/do-spammers-fear-ocr.html

Sunday, January 15, 2006
Do spammers fear OCR?
Nick FitzGerald recently sent me two sample spams that seem to indicate that
some spammers fear that using images in place of words isn't enough. They've
started to obscure their messages to prevent optical character recognition.
[...]




RE: [Mimedefang] Image validator/OCR SA plugin

2006-04-15 Thread Paul Murphy
Martin,

I installed your plugin for testing, but found that it would not load
correctly on my system, giving the error:

[5631] dbg: plugin: loading Mail::SpamAssassin::Plugin::ocrtext from @INC
[5631] warn: plugin: failed to create instance of plugin Mail::SpamAssassin::Plugin::ocrtext: Can't locate object method "new" via package "Mail::SpamAssassin::Plugin::ocrtext" at (eval 28) line 1.

To solve this, I changed the package definition in ocrtext.pm to be:

package Mail::SpamAssassin::Plugin::ocrtext;

The distributed version has "package ocrtext;" instead, so while the plugin
is loaded from the .pm file correctly, SpamAssassin then can't find anything
that is registered under the fully qualified name.

Is there a test file available to demonstrate this working, or do I have to
make one myself?

Best Wishes,

Paul.



[Mimedefang] Image validator/OCR SA plugin

2006-04-14 Thread Martin Blapp


Hi all,

This is just a little advertisement for my plugin which is now
in a usable state and works very well.

Anyone interested should keep an eye on it - it really helps
with the image only spam we get today. But probably the spammers
will soon change their tricks to different images which are more
difficult to read :-(

http://antispam.imp.ch/patches/patch-ocrtext

Martin

Martin Blapp, [EMAIL PROTECTED] [EMAIL PROTECTED]
--
ImproWare AG, UNIXSP  ISP, Zurlindenstrasse 29, 4133 Pratteln, CH
Phone: +41 61 826 93 00 Fax: +41 61 826 93 01
PGP: finger -l [EMAIL PROTECTED]
PGP Fingerprint: B434 53FC C87C FE7B 0A18 B84C 8686 EF22 D300 551E
--



RE: [Mimedefang] Image validator/OCR SA plugin

2006-04-14 Thread Matthew.van.Eerde
Martin Blapp wrote:
> http://antispam.imp.ch/patches/patch-ocrtext

That is unbelievably sweet.

I remember a couple of years ago there was a virus that sent itself in a 
password-protected .zip file, with an image containing the password.  OCR would 
have been useful... I could easily see MIMEDefang reading the password from 
the image and feeding it to the virus scanner.
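
A rough sketch of that idea in Python, under several assumptions: the OCR text is already available as a string, the password is one of its whitespace-separated tokens, and the archive uses the legacy ZIP encryption that Python's zipfile module can read. Both function names are hypothetical, and a real filter would pass the decrypted contents on to the virus scanner rather than stop at finding the password.

```python
import re
import zipfile
import zlib


def candidate_passwords(ocr_text, min_len=4):
    """Return OCR tokens that could be the password, most likely first."""
    labelled = re.findall(r"password(?:\s+is)?\s*[:=]?\s*(\S+)", ocr_text,
                          flags=re.IGNORECASE)
    others = [t for t in re.findall(r"\S+", ocr_text) if len(t) >= min_len]
    seen, ordered = set(), []
    for tok in labelled + others:
        if tok not in seen:
            seen.add(tok)
            ordered.append(tok)
    return ordered


def first_working_password(zip_file, candidates):
    """Return the first candidate that decrypts every member, else None."""
    with zipfile.ZipFile(zip_file) as zf:
        if not any(info.flag_bits & 0x1 for info in zf.infolist()):
            return None  # nothing is encrypted, so no password is needed
        for pwd in candidates:
            try:
                zf.setpassword(pwd.encode("utf-8"))
                if zf.testzip() is None:  # all members read with a valid CRC
                    return pwd
            except (RuntimeError, zipfile.BadZipFile, zlib.error):
                continue  # wrong password
    return None


print(candidate_passwords("Password: 48213 open the zip"))
```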

-- 
Matthew.van.Eerde (at) hbinc.com   805.964.4554 x902
Hispanic Business Inc./HireDiversity.com   Software Engineer



Re: [Mimedefang] Image validator/OCR SA plugin

2006-04-14 Thread Richard Laager
On Fri, 2006-04-14 at 18:42 +0200, Martin Blapp wrote:
> Anyone interested should keep an eye on it - it really helps
> with the image only spam we get today. But probably the spammers
> will soon change their tricks to different images which are more
> difficult to read :-(

Interesting... What's the performance like with this? How many messages
do you scan per day with it?

Richard




Re: [Mimedefang] Image validator/OCR SA plugin

2006-04-14 Thread John Rudd


On Apr 14, 2006, at 9:42 AM, Martin Blapp wrote:


> Anyone interested should keep an eye on it - it really helps
> with the image only spam we get today. But probably the spammers
> will soon change their tricks to different images which are more
> difficult to read :-(



I can see it now ... pretty soon, we'll be seeing spam in CAPTCHA form.



Re: [Mimedefang] Image validator/OCR SA plugin

2006-04-14 Thread Martin Blapp



> Interesting... What's the performance like with this? How many messages
> do you scan per day with it?


It is rather fast. On a 3 GHz Pentium 4 I can scan an average JPEG/GIF picture
in 0.2 to 0.3 seconds.

I've limited the scan time to 5 seconds per image, and I allow only three
images to be scanned per mail. Of course this is user configurable.
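
The two limits can be sketched like this. It is an illustration in Python, not the plugin's Perl code: the OCR tool is run as an external command, and the command line is supplied by a caller-provided function because the thread does not say which tool is used. demo_command is a stand-in that only echoes the file name.

```python
import subprocess
import sys

MAX_IMAGES = 3      # from the message above: at most three images per mail
SCAN_TIMEOUT = 5.0  # from the message above: at most five seconds per image


def ocr_image(command, timeout=SCAN_TIMEOUT):
    """Run one OCR command; return its stdout, or None if it fails or times out."""
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout, check=True)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None
    return result.stdout


def scan_mail(image_paths, make_command, max_images=MAX_IMAGES, timeout=SCAN_TIMEOUT):
    """OCR at most max_images images, each bounded by timeout seconds."""
    texts = []
    for path in image_paths[:max_images]:
        text = ocr_image(make_command(path), timeout)
        if text is not None:
            texts.append(text)
    return texts


def demo_command(path):
    # Stand-in for a real OCR command line; it only echoes the file name.
    return [sys.executable, "-c", "import sys; print('text from', sys.argv[1])", path]


print(scan_mail(["a.gif", "b.gif", "c.gif", "d.gif"], demo_command))
```

The fourth image is never scanned, and a scan that exceeds the timeout is dropped rather than holding up the mail.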


The grep counts below cover only today so far, not a full day.

grep hits= /var/log/maillog | wc -l
   78050

grep "X-Spam-Status: Yes" /var/log/maillog | wc -l
   48400

grep "hits=.*SPAMPIC" /var/log/maillog | wc -l
    9572

grep "X-Spam-Status: Yes.*hits=.*SPAMPIC" /var/log/maillog | wc -l
    9558

grep "X-Spam-Status: Yes.*hits=.*SPAMPIC" /var/log/maillog | grep HTML_IMAGE_ONLY | wc -l
    9528

grep HTML_IMAGE_ONLY /var/log/maillog | wc -l
   35834

This means about 60% of all mail we get is spam. More than 10% of the spam
consists of GIF and JPEG pictures advertising stocks and meds.
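
For reference, the shares implied by the counts above can be worked out directly. The variable names are made up; the numbers are the grep results quoted in this message.

```python
total = 78050         # lines with "hits=" (all scanned mail)
spam = 48400          # lines with "X-Spam-Status: Yes"
spampic_all = 9572    # lines with a SPAMPIC hit
spampic_spam = 9558   # SPAMPIC hits that were also marked as spam


def pct(part, whole):
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


print(f"spam share of all mail:    {pct(spam, total):.1f}%")          # 62.0%
print(f"SPAMPIC share of spam:     {pct(spampic_spam, spam):.1f}%")   # 19.7%
print(f"SPAMPIC share of all mail: {pct(spampic_all, total):.1f}%")   # 12.3%
print(f"SPAMPIC hits not marked as spam: {spampic_all - spampic_spam}")  # 14
```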

But almost 45% of all mails match HTML_IMAGE_ONLY, so it is not usable on
its own. I even use lower scores for those rules now, which gives me fewer
false positives:

score HTML_IMAGE_ONLY_04 1.400
score HTML_IMAGE_ONLY_08 1.300
score HTML_IMAGE_ONLY_12 1.200
score HTML_IMAGE_ONLY_16 1.100
score HTML_IMAGE_ONLY_20 0.950
score HTML_IMAGE_ONLY_24 0.900
score HTML_IMAGE_ONLY_28 0.700
score HTML_IMAGE_ONLY_32 0.400

Martin


Re: [Mimedefang] Image validator/OCR SA plugin

2006-04-14 Thread Martin Blapp

> grep HTML_IMAGE_ONLY /var/log/maillog | wc -l
>   35834


This is wrong. It should have been

grep "HTML_IMAGE_ONLY.*hits=" /var/log/maillog | wc -l
   17917


> But almost 45% of all mails match HTML_IMAGE_ONLY, so it is not usable on
> its own. I even use lower scores for those rules now, which gives me fewer
> false positives:


22% is still a lot ...
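
The first count is exactly twice the corrected one, which suggests each matching mail shows up on two log lines and only one of them carries "hits=". The sketch below reproduces that effect on made-up log lines, since the real maillog format is not shown in the thread; count_matching is a hypothetical helper that behaves like grep piped into wc -l.

```python
import re


def count_matching(lines, pattern):
    """Count lines that match the regular expression, like `grep PATTERN | wc -l`."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))


# Made-up log lines for one spam and one clean message.
sample_log = [
    "msg-1: X-Spam-Status: Yes, tests=HTML_IMAGE_ONLY_08,SPAMPIC",
    "msg-1: spam tests=HTML_IMAGE_ONLY_08,SPAMPIC hits=12.3",
    "msg-2: clean tests=none hits=0.1",
]

loose = count_matching(sample_log, r"HTML_IMAGE_ONLY")          # both msg-1 lines
strict = count_matching(sample_log, r"HTML_IMAGE_ONLY.*hits=")  # msg-1 counted once
print(loose, strict)  # prints 2 1

# Shares of all scanned mail for the two counts reported in the thread.
print(f"{100 * 35834 / 78050:.1f}% vs {100 * 17917 / 78050:.1f}%")  # 45.9% vs 23.0%
```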

Martin