Re: Suggest OCR plugin on Spamassassin 3.3.1 for image spam

2011-07-21 Thread Kris Deugau

dar...@chaosreigns.com wrote:
> On 07/20, Sharma, Ashish wrote:
>> Can someone suggest some better OCR plugin for Spamassassin 3.3.1 for image
>> spam?
>
> It still seems strange to me that anybody has ever bothered with using OCR
> to deal with image spam, when it's so easy, and for me not problematic, to
> just block all emails that might be image spam - those with an attached
> image that is embedded in the body of an html mail.


I have to ask - have you ever tried this in the context of an ISP mail 
system?


A great many users consider sending pictures and videos by email to be 
the ultimate purpose of email...  and many of the same set of users take 
great delight in (ab)using Outlook's "stationery" or using Incredimail, 
as well as overdosing on funny fonts and colours in the text.


-kgd


Re: Suggest OCR plugin on Spamassassin 3.3.1 for image spam

2011-07-21 Thread David F. Skoll
On Thu, 21 Jul 2011 07:47:00 +0100
"Sharma, Ashish" wrote:

> Can you please outline the other techniques that you use to catch
> image spams?

We find Bayes (we have our own implementation) and RBLs (again, we have
our own) work pretty well.

Regards,

David.


Re: Suggest OCR plugin on Spamassassin 3.3.1 for image spam

2011-07-21 Thread Axb

http://wiki.apache.org/spamassassin/UnmaintainedCustomPlugins

"OCR scanner and image validator SA-plugin"

"OCR Plugin"

may be worth a try.. no idea how well they work


The Spamassassin wiki is so cool



On 2011-07-21 8:53, Sharma, Ashish wrote:
> All,
>
> The current functionality requires me to receive mails that contain images
> and process them.
>
> So I want a good tool to deal with image spam.
>
> Please suggest some.




RE: Suggest OCR plugin on Spamassassin 3.3.1 for image spam

2011-07-20 Thread Sharma, Ashish
All,

The current functionality requires me to receive mails that contain images and 
process them.

So I want a good tool to deal with image spam.

Please suggest some.

Thanks
Ashish Sharma

-Original Message-
From: Jason Bertoch [mailto:ja...@i6ix.com] 
Sent: Thursday, July 21, 2011 8:03 AM
To: users@spamassassin.apache.org
Subject: Re: Suggest OCR plugin on Spamassassin 3.3.1 for image spam

On 7/20/2011 9:18 PM, dar...@chaosreigns.com wrote:
> On 07/20, Sharma, Ashish wrote:
>> Can someone suggest some better OCR plugin for Spamassassin 3.3.1 for image 
>> spam?
> It still seems strange to me that anybody has ever bothered with using OCR
> to deal with image spam, when it's so easy, and for me not problematic, to
> just block all emails that might be image spam - those with an attached
> image that is embedded in the body of an html mail.
>
> Inlined attached images are not a feature that I find anywhere near worth
> having enough to justify needing to OCR image spam.
>

Image spam was a huge deal when it first came out, and there were 
several sources scrambling to offer a solution, including resources to 
involve Bayes on the decoded text.  Those worked well enough to deter, 
for the time-being anyway, that method of spamming.

That said, while I agree with your sentiment toward inline images and 
HTML mail in general, they are a common business practice and many folks 
simply can't use the outright block method.

At my last job, I eventually found that image-spam dropped to such a 
significant low that I didn't need OCR anymore but was still required to 
allow inline images through.

/Jason


RE: Suggest OCR plugin on Spamassassin 3.3.1 for image spam

2011-07-20 Thread Sharma, Ashish
David, 

>[We don't use OCR, as it happens.  We usually catch image spams anyway
>using other techniques.]

Can you please outline the other techniques that you use to catch image spams?

Thanks
Ashish Sharma

-Original Message-
From: David F. Skoll [mailto:d...@roaringpenguin.com] 
Sent: Thursday, July 21, 2011 7:50 AM
To: users@spamassassin.apache.org
Subject: Re: Suggest OCR plugin on Spamassassin 3.3.1 for image spam

On Wed, 20 Jul 2011 21:18:48 -0400
dar...@chaosreigns.com wrote:

> It still seems strange to me that anybody has ever bothered with
> using OCR to deal with image spam, when it's so easy, and for me not
> problematic, to just block all emails that might be image spam -
> those with an attached image that is embedded in the body of an html
> mail.

We receive many legitimate [sic] emails that use an embedded image
in that way.  Lots of companies think it's really cool to include their
logo in a .sig :(

> I've been very happily using this since 2006, and it completely made
> image spam go away.

Is this on a business account where it's critical for you to accept
email from... ahem... somewhat less-than-knowledgeable people?

> Inlined attached images are not a feature that I find anywhere near
> worth having enough to justify needing to OCR image spam.

Unfortunately, we can't block those.  The FP rate for us would be
horrendous.

[We don't use OCR, as it happens.  We usually catch image spams anyway
using other techniques.]

Regards,

David.



Re: Suggest OCR plugin on Spamassassin 3.3.1 for image spam

2011-07-20 Thread Jason Bertoch

On 7/20/2011 9:18 PM, dar...@chaosreigns.com wrote:
> On 07/20, Sharma, Ashish wrote:
>> Can someone suggest some better OCR plugin for Spamassassin 3.3.1 for image
>> spam?
> It still seems strange to me that anybody has ever bothered with using OCR
> to deal with image spam, when it's so easy, and for me not problematic, to
> just block all emails that might be image spam - those with an attached
> image that is embedded in the body of an html mail.
>
> Inlined attached images are not a feature that I find anywhere near worth
> having enough to justify needing to OCR image spam.



Image spam was a huge deal when it first came out, and there were 
several sources scrambling to offer a solution, including resources to 
involve Bayes on the decoded text.  Those worked well enough to deter, 
for the time-being anyway, that method of spamming.


That said, while I agree with your sentiment toward inline images and 
HTML mail in general, they are a common business practice and many folks 
simply can't use the outright block method.


At my last job, I eventually found that image-spam dropped to such a 
significant low that I didn't need OCR anymore but was still required to 
allow inline images through.


/Jason


Re: Suggest OCR plugin on Spamassassin 3.3.1 for image spam

2011-07-20 Thread David F. Skoll
On Wed, 20 Jul 2011 21:18:48 -0400
dar...@chaosreigns.com wrote:

> It still seems strange to me that anybody has ever bothered with
> using OCR to deal with image spam, when it's so easy, and for me not
> problematic, to just block all emails that might be image spam -
> those with an attached image that is embedded in the body of an html
> mail.

We receive many legitimate [sic] emails that use an embedded image
in that way.  Lots of companies think it's really cool to include their
logo in a .sig :(

> I've been very happily using this since 2006, and it completely made
> image spam go away.

Is this on a business account where it's critical for you to accept
email from... ahem... somewhat less-than-knowledgeable people?

> Inlined attached images are not a feature that I find anywhere near
> worth having enough to justify needing to OCR image spam.

Unfortunately, we can't block those.  The FP rate for us would be
horrendous.

[We don't use OCR, as it happens.  We usually catch image spams anyway
using other techniques.]

Regards,

David.



Re: Suggest OCR plugin on Spamassassin 3.3.1 for image spam

2011-07-20 Thread darxus
On 07/20, Sharma, Ashish wrote:
> Can someone suggest some better OCR plugin for Spamassassin 3.3.1 for image 
> spam?

It still seems strange to me that anybody has ever bothered with using OCR
to deal with image spam, when it's so easy, and for me not problematic, to
just block all emails that might be image spam - those with an attached
image that is embedded in the body of an html mail.

In my postfix main.cf I have:
body_checks = pcre:/etc/postfix/body_checks
And that file just contains (a single line):
/\bsrc\s*=(?:3D)?\s*["']?cid:/ REJECT Your email was rejected because you embedded an attached image in the body.

So if somebody ever sends me a legit email with an inlined attached image,
they'll still get an error, without me causing any backscatter.

My mom was annoyed that she couldn't use some tool to decorate her emails
to me with garbage, but... that doesn't qualify as a negative for me.

I've been very happily using this since 2006, and it completely made image
spam go away.

People can still send me images attached to emails, and they can still send
me emails with images embedded in the body of html emails as long as they're
hosted on a web server and not attached.  It only gets rejected if the
image is attached *and* embedded in the body of the email.
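
To see which lines the pattern above catches, here is a minimal sketch that
runs the same expression through Python's re module. It assumes Python's
regex dialect is close enough to PCRE for this pattern, and that the table
matches case-insensitively (the Postfix pcre table default). The function
name would_reject and the sample lines are made up for illustration.

```python
import re

# Same pattern as in the body_checks file above. Postfix pcre tables match
# case-insensitively by default, hence re.IGNORECASE (an assumption about
# the setup, since no flags are shown in the post).
CID_IMAGE = re.compile(r"""\bsrc\s*=(?:3D)?\s*["']?cid:""", re.IGNORECASE)


def would_reject(body_line: str) -> bool:
    """Return True if this raw body line would trigger the REJECT rule."""
    return CID_IMAGE.search(body_line) is not None


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from real mail.
    samples = [
        '<img src=3D"cid:image001.png@01CC47A2">',      # quoted-printable form
        "<IMG SRC='cid:part1.06010203'>",               # plain form
        '<img src="http://www.example.com/logo.png">',  # hosted on a web server
        "See the attached photo.",                      # ordinary attachment
    ]
    for line in samples:
        print(would_reject(line), line)
```

The optional (?:3D) is there because the rule sees the raw, still
quoted-printable-encoded body, where "=" is written as "=3D".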

Inlined attached images are not a feature that I find anywhere near worth
having enough to justify needing to OCR image spam.

-- 
"I finally figured out the only reason to be alive is to enjoy it."
- Rita Mae Brown
http://www.ChaosReigns.com


Suggest OCR plugin on Spamassassin 3.3.1 for image spam

2011-07-20 Thread Sharma, Ashish
Hi,

I am currently using FuzzyOCR(3.6.0) for image spam control on my 
Spamassassin(3.3.1) stack.

The FuzzyOCR parent location (http://fuzzyocr.own-hero.net/wiki/Downloads) 
suggests the above FuzzyOCR is available only for testing on Spamassassin 3.2.x 

Nevertheless, I am running this version of FuzzyOCR on my Spamassassin stack.

Lately I have not been satisfied with FuzzyOCR's performance, and I keep 
getting errors from it. 

Moreover, community support and active development for FuzzyOCR seem to 
be missing.

Can someone suggest some better OCR plugin for Spamassassin 3.3.1 for image 
spam?

Thanks
Ashish Sharma


Re: New spamassassin OCR plugin

2009-05-27 Thread Benny Pedersen

On Wed, May 27, 2009 23:43, decoder wrote:
> I am planning a new release, but my schedule is tight.

super, i posted a new thread with subject "FuzzyOcr wordlist"

new words to be added for latest spams

-- 
http://localhost/ 100% uptime and 100% mirrored :)



Re: New spamassassin OCR plugin

2009-05-27 Thread decoder

LuKreme wrote:
> On 24-May-2009, at 18:40, Henrik K wrote:
>> I don't know why users are so afraid of words like SVN. You have to
>> look at the project, not version numbers.
>
> I don't have FuzzyOCR installed, and it's not because of the SVN.
> First, I don't think my server can take the processing hit and second
> it requires so much to be installed that I'm SURE my server can't take
> the hit.




May I ask how many mails you process per day? Please note that

a) FuzzyOcr runs last if properly installed
b) it doesn't do anything if the score exceeds a configurable threshold
c) it supports hashes and other things that make processing faster
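
Points b) and c) can be pictured with a small sketch. This is not FuzzyOcr's
code: the class name OcrGate, the skip_above parameter and the use of a
SHA-256 digest over the raw image bytes are all assumptions made for
illustration (a real plugin may hash image features instead, so that
slightly altered images still hit the cache).

```python
import hashlib


class OcrGate:
    """Run an expensive OCR function only when it can still matter."""

    def __init__(self, ocr_func, skip_above=10.0):
        self.ocr_func = ocr_func      # hypothetical: bytes -> extracted text
        self.skip_above = skip_above  # hypothetical score threshold
        self.cache = {}               # digest -> previously extracted text

    def check(self, current_score, image_bytes):
        # b) The message already scores high enough: skip OCR entirely.
        if current_score >= self.skip_above:
            return None
        # c) Images seen before are answered from the hash cache.
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest not in self.cache:
            self.cache[digest] = self.ocr_func(image_bytes)
        return self.cache[digest]
```

With a gate like this, the OCR cost is only paid for low-scoring mail that
carries an image not seen before.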


Cheers,


Chris




Re: New spamassassin OCR plugin

2009-05-27 Thread decoder

alex k wrote:
> If only FuzzyOCR's developer would read that ;)
> Unfortunately he doesn't seem to be interested in his project anymore.
> Maybe you could take care of this orphaned code.

  


Dear Alex,


I am reading exactly everything you write ;)


The code is not orphaned, but also not being extended at the moment. The 
SVN version runs stable in all SA 3.2.x releases. I answer tickets 
and questions via email.

I am planning a new release, but my schedule is tight.


Best regards,


Chris




Re: New spamassassin OCR plugin

2009-05-25 Thread LuKreme

On 24-May-2009, at 18:40, Henrik K wrote:
> I don't know why users are so afraid of words like SVN. You have to
> look at the project, not version numbers.



I don't have FuzzyOCR installed, and it's not because of the SVN.   
First, I don't think my server can take the processing hit and second  
it requires so much to be installed that I'm SURE my server can't take  
the hit.


--
You may be anti anti-spam-kook if: Despite having invented the
FUSSP you not only don't know the difference between the SMTP
envelope and SMTP headers; you doubt there is such a thing as
the SMTP envelop because email doesn't involve paper.



Re: New spamassassin OCR plugin

2009-05-24 Thread Henrik K
On Sun, May 24, 2009 at 08:57:28AM +0200, alex k wrote:
> 
> > Looks like nothing that fuzzyOCR couldn't do, being more flexible and
> > proven
> > by time.
> 
> If only FuzzyOCR's developer would read that ;)
> Unfortunately he doesn't seem to be interested in his project anymore.
> Maybe you could take care of this orphaned code.

I don't know why you think it's orphaned. The SVN version worked fine 2 years
ago, and it still works. There was no image spam during this time.

http://fuzzyocr.own-hero.net/ticket/2934#comment:3

There's even a promise of a new official release. I don't know why users are
so afraid of words like SVN. You have to look at the project, not version
numbers.



Re: New spamassassin OCR plugin

2009-05-24 Thread Res

On Sun, 24 May 2009, LuKreme wrote:
> On 24-May-2009, at 03:10, alex k wrote:
>> You forgot ocrad. Ocrad is needed by facileOCR (see "Dependencies") and as
>> far as I know, there is no ready-to-use binary for Windows.
>
> You keep talking about Windows.  The world is not bifurcated between windows
> and Linux, there is Solaris, OS X, FreeBSD, and countless other -nonlinux-
> unix variants out there.

I thought he said it was for Linux?  That kinda means if it works on 
BSD/Slowaris goodo, if not, don't cry. The guy is trying to do something 
worthwhile and all I see here is a bunch of misfit trolls attacking him
for it. Ask yourself this: WTF are *YOU* doing to contribute to the
community to stop the crap?  A few of you need to ask yourselves that!

Now, get over it, move on with life, that means fighting the 
spammers not each other.



--
Res

-Beware of programmers who carry screwdrivers


Re: New spamassassin OCR plugin

2009-05-24 Thread LuKreme

On 24-May-2009, at 03:10, alex k wrote:
> You forgot ocrad. Ocrad is needed by facileOCR (see "Dependencies") and as
> far as I know, there is no ready-to-use binary for Windows.

You keep talking about Windows.  The world is not bifurcated between
windows and Linux, there is Solaris, OS X, FreeBSD, and countless
other -nonlinux- unix variants out there.



--
Once again I am banished to the demon section of the card catalog



Re: New spamassassin OCR plugin

2009-05-24 Thread mouss
alex k wrote:
> Hi,
> 
>> On Sun, May 24, 2009 at 08:57:28AM +0200, alex k wrote:
>>> It is Linux centric and I do mention that on the project side.
>>>
>>> The code part you mention is the one that kills a leftover convert
>>> process
>>> after it reached its timeout, an exception.
>>> You got the sources, go ahead and make a windows version.
>> You seem to think "ps" works the same on every Unix variant. What does
>> Windows have to do with it? I'm not here to nitpick, but it's pretty
>> serious
>> if processes go crazy and fill up the server. You just mention that it's
>> "tested" on Linux. Most people would assume that SA plugins run fine as
>> long
>> as you have SA and Perl.
> 
> You forgot ocrad. Ocrad is needed by facileOCR (see "Dependencies") and as
> far as I know, there is no ready-to-use binary for Windows.
> 


# uname
FreeBSD
# cd /usr/ports/graphics/ocrad
# make install clean
...
$ pkg_info|grep ocrad
ocrad-0.17_3        OCR program implemented as filter

As you see, it took one command to install ocrad.

now:

$ ps -o pid,cmd --ppid $$ --no-header
ps: cmd: keyword not found
ps: illegal option -- -
usage: ps [-aCcefHhjlmrSTuvwXxZ] [-O fmt | -o fmt] [-G gid[,gid...]]
  [-M core] [-N system]
  [-p pid[,pid...]] [-t tty[,tty...]] [-U user[,user...]]
   ps [-L]



> OK, to make it clear. To all Windows users: use any of the other plugins
> available for Windows, there are several. You will not be able to fulfill
> the requirements.
> 
> And again, you got the sources, adapt it to your needs.
> I'm afraid, this discussion is leading nowhere.
> 

good luck.


Re: New spamassassin OCR plugin

2009-05-24 Thread wolfgang
Hi Xela,

I think there has been some misunderstanding:

In an older episode (Sunday, 24. May 2009), Henrik K wrote:
> You should mention that it's pretty Linux centric, atleast code like
> "ps -o pid,cmd --ppid $$ --no-header".. why don't you use perl
> functions?

In an older episode (Sunday, 24. May 2009), Henrik K wrote:
> You seem you think "ps" works the same on every Unix variant.

Henrik is trying to point out that the "ps" command has different syntax 
requirements on different Unix variants. I can only confirm that (I use 
Linux and Solaris at work).

So, the point is that the facileOCR.pm code in its current state does 
not simply require Perl, SA and ocrad, but *also* Linux. And Henrik 
suggests making that clearer in the documentation.
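
The portability problem comes from asking ps to find the leftover child.
A sketch of the alternative Henrik hints at, letting the language runtime
start, track and kill the child itself, is shown here in Python rather than
in the plugin's Perl. The function name run_with_timeout is hypothetical,
and it assumes the helper is launched as a single direct child process;
this is not facileOCR's actual code.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_seconds):
    """Run argv; return its stdout, or None if it had to be killed."""
    try:
        done = subprocess.run(argv, capture_output=True,
                              timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here,
        # so no ps call is needed to find a leftover process.
        return None
    return done.stdout


if __name__ == "__main__":
    # A stand-in for a hanging helper: sleeps far longer than the timeout.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, 0.5))
```

Only the direct child is killed this way; a helper that spawns its own
children would need extra handling.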

Best regards,

wolfgang


Re: New spamassassin OCR plugin

2009-05-24 Thread alex k
Hi,

> On Sun, May 24, 2009 at 08:57:28AM +0200, alex k wrote:
>>
>> It is Linux centric and I do mention that on the project side.
>>
>> The code part you mention is the one that kills a leftover convert
>> process
>> after it reached its timeout, an exception.
>> You got the sources, go ahead and make a windows version.
>
> You seem to think "ps" works the same on every Unix variant. What does
> Windows have to do with it? I'm not here to nitpick, but it's pretty
> serious
> if processes go crazy and fill up the server. You just mention that it's
> "tested" on Linux. Most people would assume that SA plugins run fine as
> long
> as you have SA and Perl.

You forgot ocrad. Ocrad is needed by facileOCR (see "Dependencies") and as
far as I know, there is no ready-to-use binary for Windows.

OK, to make it clear. To all Windows users: use any of the other plugins
available for Windows, there are several. You will not be able to fulfill
the requirements.

And again, you got the sources, adapt it to your needs.
I'm afraid, this discussion is leading nowhere.

regards,
Xela




Re: New spamassassin OCR plugin

2009-05-24 Thread Henrik K
On Sun, May 24, 2009 at 08:57:28AM +0200, alex k wrote:
> 
> It is Linux centric and I do mention that on the project side.
>
> The code part you mention is the one that kills a leftover convert process
> after it reached its timeout, an exception.
> You got the sources, go ahead and make a windows version.

You seem to think "ps" works the same on every Unix variant. What does
Windows have to do with it? I'm not here to nitpick, but it's pretty serious
if processes go crazy and fill up the server. You just mention that it's
"tested" on Linux. Most people would assume that SA plugins run fine as long
as you have SA and Perl.



Re: New spamassassin OCR plugin

2009-05-23 Thread alex k
Hi,

> On Sat, May 23, 2009 at 12:43:15PM +0200, alex k wrote:
>> Hi,
>> It seems that image spam is back. So I wrote a new OCR plugin for
>> spamassassin, which uses convert and ocrad to extract text.
>> For details and download see:
>>
>> http://spielwiese.la-evento.com/facileOCR/
>>
>> We use this plugin on our servers. It kicks out every image-spam, that
>> made it through the other filters and produces not a single false
>> positive.
>>
>> OK, this is kind of promotion ;)
>> Nevertheless I hope it is helpful.
>
> You should mention that it's pretty Linux centric, atleast code like
> "ps -o pid,cmd --ppid $$ --no-header".. why don't you use perl functions?
>

It is Linux centric and I do mention that on the project site.
The code part you mention is the one that kills a leftover convert process
after it has reached its timeout, which is an exception.
You've got the sources, go ahead and make a Windows version.

> Looks like nothing that fuzzyOCR couldn't do, being more flexible and
> proven
> by time.

If only FuzzyOCR's developer would read that ;)
Unfortunately he doesn't seem to be interested in his project anymore.
Maybe you could take care of this orphaned code.

regards,
Xela
>
>




Re: New spamassassin OCR plugin

2009-05-23 Thread Henrik K
On Sat, May 23, 2009 at 12:43:15PM +0200, alex k wrote:
> Hi,
> It seems that image spam is back. So I wrote a new OCR plugin for
> spamassassin, which uses convert and ocrad to extract text.
> For details and download see:
> 
> http://spielwiese.la-evento.com/facileOCR/
> 
> We use this plugin on our servers. It kicks out every image-spam, that
> made it through the other filters and produces not a single false
> positive.
> 
> OK, this is kind of promotion ;)
> Nevertheless I hope it is helpful.

You should mention that it's pretty Linux centric, at least code like
"ps -o pid,cmd --ppid $$ --no-header".. why don't you use perl functions?

Looks like nothing that fuzzyOCR couldn't do, being more flexible and proven
by time.



Re: New spamassassin OCR plugin

2009-05-23 Thread alex k
Hi,

> On 23.05.09 12:43, alex k wrote:
>> It seems that image spam is back. So I wrote a new OCR plugin for
>> spamassassin, which uses convert and ocrad to extract text.
>> For details and download see:
>>
>> http://spielwiese.la-evento.com/facileOCR/
>>
>> We use this plugin on our servers. It kicks out every image-spam, that
>> made it through the other filters and produces not a single false
>> positive.
>
> hmmm, last two images I've checked were much nicer read by gocr, just FYI.
>
> another question I've raised some time ago was the possibility of pushing
> read text to spamassassin so it could be detected by other checks, e.g.
> spamassassin and optionally uribl's...
> The answer was gocr is not reliable enough for doing this stuff, but I
> hope
> it's worth trying...

I will explain a bit how this plugin works:
It doesn't matter how nicely the text is read. You can always get the
extracted text from the debug log and expand your spamwords list with things
like "Favorl,cllck,Fvorle".
The extracted text is filtered, so you can safely use anything you find in
the debug log.
Thus we don't need 100% word recognition, which would be very hard to
reach.
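
A minimal sketch of that idea, assuming the filter simply lowercases the OCR
output and drops everything that is not a letter. The plugin's real filter is
not shown in this thread, and the function names below are made up.

```python
import re


def filter_ocr_text(raw):
    """Hypothetical filter: lowercase and keep letters only."""
    return re.sub(r"[^a-z]", "", raw.lower())


def spamword_hits(raw_ocr_text, spamwords):
    """Return the spamwords found in the filtered OCR output."""
    filtered = filter_ocr_text(raw_ocr_text)
    return [word for word in spamwords if word.lower() in filtered]


if __name__ == "__main__":
    # The list deliberately contains misread forms taken from a debug log,
    # so the OCR engine never has to produce the correct spelling.
    spamwords = ["Favorl", "cllck", "Fvorle"]
    print(spamword_hits("Cllck  he re: F a v o r l", spamwords))
```

Because the misread forms themselves go into the list, the match is exact
and needs no fuzzy comparison.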

I decided to use ocrad and not gocr, because ocrad has some nice features
(like text filtering and builtin resizing).
I wanted to keep the list of dependencies as small as possible, so I use
only ocrad (tests with tesseract or ocropus were discouraging).

By the way, there already exists a plugin which extracts words with gocr
and feeds it to Bayes. Do you know BayesOCR?

bye,
Xela

> --
> Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
> Warning: I wish NOT to receive e-mail advertising to this address.
> Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
> "To Boot or not to Boot, that's the question." [WD1270 Caviar]
>




Re: New spamassassin OCR plugin

2009-05-23 Thread Matus UHLAR - fantomas
On 23.05.09 12:43, alex k wrote:
> It seems that image spam is back. So I wrote a new OCR plugin for
> spamassassin, which uses convert and ocrad to extract text.
> For details and download see:
> 
> http://spielwiese.la-evento.com/facileOCR/
> 
> We use this plugin on our servers. It kicks out every image-spam, that
> made it through the other filters and produces not a single false
> positive.

hmmm, last two images I've checked were much nicer read by gocr, just FYI.

another question I've raised some time ago was the possibility of pushing
read text to spamassassin so it could be detected by other checks, e.g.
spamassassin and optionally uribl's...
The answer was gocr is not reliable enough for doing this stuff, but I hope
it's worth trying...
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"To Boot or not to Boot, that's the question." [WD1270 Caviar]


Re: New spamassassin OCR plugin

2009-05-23 Thread wolfgang
In an older episode (Saturday, 23. May 2009), alex k wrote:
> Hi,
> It seems that image spam is back. So I wrote a new OCR plugin for
> spamassassin, which uses convert and ocrad to extract text.

Thank you. It works out of the box (after installing ocrad) here on 
Ubuntu 8.04.2 linux with SA 3.2.4 and Perl 5.8.8.

I miss an option to adjust the score, though. My test got a FOCR score 
of 17 which is far more than I want.

To my surprise, setting a different FOCR score in a *.cf file 
in /etc/mail/spamassassin/ did not change the result.

Regards,

wolfgang




New spamassassin OCR plugin

2009-05-23 Thread alex k
Hi,
It seems that image spam is back. So I wrote a new OCR plugin for
spamassassin, which uses convert and ocrad to extract text.
For details and download see:

http://spielwiese.la-evento.com/facileOCR/

We use this plugin on our servers. It kicks out every image-spam, that
made it through the other filters and produces not a single false
positive.

OK, this is kind of promotion ;)
Nevertheless I hope it is helpful.

bye,
Xela



Re: ocr plugin

2008-05-02 Thread decoder

Theo Van Dinter wrote:
> On Fri, May 02, 2008 at 09:12:12PM +0200, decoder wrote:
>> Also, the SA plugin architecture is not designed to modify the message
>> in any way, so you cannot push back the text into the normal processing
>> line.
>
> Really?  Who says?  I made very specific modifications in 3.2 to allow for
> just that.
>
> Search the list archives for "post_message_parse".

Ah ok, I was referring to the 3.1.x architecture. I haven't looked at the 
changes done in 3.2, but if this is technically possible now, then I 
apologize :D



Best regards,


Chris





Re: ocr plugin

2008-05-02 Thread Theo Van Dinter
On Fri, May 02, 2008 at 09:12:12PM +0200, decoder wrote:
> Also, the SA plugin architecture is not designed to modify the message 
> in any way, so you cannot push back the text into the normal processing 
> line.

Really?  Who says?  I made very specific modifications in 3.2 to allow for
just that.

Search the list archives for "post_message_parse".

-- 
Randomly Selected Tagline:
"If you ever reach total enlightenment while drinking beer, I bet it
 makes beer shoot out your nose."  - Deep Thought, Jack Handy




Re: ocr plugin

2008-05-02 Thread decoder

Matus UHLAR - fantomas wrote:
> does it push the extracted text back to SA so it could be used by e.g.
> bayes? This is how it imho should be used.
>
> (and imho the same for .pdf and/or .doc - extract text _and_ images from
> it, call OCR for images...)

That is a question that was very frequently asked around here and that's 
why I also included it in the FuzzyOcr FAQ:


"If you take a look at the actual results of the OCR engines used, then 
you'll see that the output suffers from a lot of noise. Hence, it is not 
suited for common word analysis like bayes, and FuzzyOcr uses a special 
fuzzy matching algorithm to find the words"
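
The FAQ does not say which fuzzy matching algorithm is meant. As a generic
illustration only, here is a sketch using plain Levenshtein edit distance
with a per-word tolerance. The function names and the 25% tolerance are
assumptions, not FuzzyOcr's implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_hits(ocr_text, wordlist, max_ratio=0.25):
    """Return (word, token) pairs where a noisy token is close to a word."""
    tokens = ocr_text.lower().split()
    hits = []
    for word in wordlist:
        limit = int(len(word) * max_ratio)  # allowed edits for this word
        for token in tokens:
            if edit_distance(token, word) <= limit:
                hits.append((word, token))
                break
    return hits


if __name__ == "__main__":
    print(fuzzy_hits("Cllck here now", ["click", "stock"]))
```

A tolerance that scales with word length keeps short words from matching
almost anything, which is the main risk with noisy OCR output.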


Also, the SA plugin architecture is not designed to modify the message 
in any way, so you cannot push back the text into the normal processing 
line.


As to image spam in general: Yes, it has dropped dramatically and I 
haven't seen any actually for quite a long time now. I hope that my tool 
is one reason that this annoying technique is gone now :D



Best regards,


Chris





Re: ocr plugin

2008-05-02 Thread Matus UHLAR - fantomas
> >>> Am I right to say that picture spam has dropped dramatically since the
> >>> last months?

On 02.05.08 11:38, Joseph Brennan wrote:
> Right.  There's close to none now.  Spam techniques come and go.

does it push the extracted text back to SA so it could be used by e.g.
bayes? This is how it imho should be used.

(and imho the same for .pdf and/or .doc - extract text _and_ images from
it, call OCR for images...)

-- 
Matus UHLAR - fantomas, [EMAIL PROTECTED] ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
A day without sunshine is like, night.


Re: ocr plugin

2008-05-02 Thread Joseph Brennan



> Am I right to say that picture spam has dropped dramatically since the
> last months?



Right.  There's close to none now.  Spam techniques come and go.

Joseph Brennan
Columbia University IT




Re: ocr plugin

2008-05-02 Thread William Taylor
On Fri, May 02, 2008 at 06:06:05PM +0300, Henrik K wrote:
> On Fri, May 02, 2008 at 03:38:41PM +0200, polloxx wrote:
> > Hi,
> > 
> > Am I right to say that picture spam has dropped dramatically since the
> > last months?
> 
> Has there been any in a year? That's when I dropped using it.
> 

It's probably not worth the resources running it right now. I only get a few 
that trickle in here and there.
Others' mileage may vary, though.


Re: ocr plugin

2008-05-02 Thread Henrik K
On Fri, May 02, 2008 at 03:38:41PM +0200, polloxx wrote:
> Hi,
> 
> Am I right to say that picture spam has dropped dramatically since the
> last months?

Has there been any in a year? That's when I dropped using it.



Re: ocr plugin

2008-05-02 Thread William Taylor
We are using the SVN version of FuzzyOCR. It seems to be working fine.

-William

On Fri, May 02, 2008 at 03:38:41PM +0200, polloxx wrote:
> Hi,
> 
> Am I right to say that picture spam has dropped dramatically over the
> last few months?
> Is it still reasonable to run an OCR plugin? I see the latest FuzzyOCR
> version is not SA 3.2.x compatible. Are there more recent products
> compatible with 3.2.x?
> Are you guys still running an OCR plugin on production servers?
> 
> Thanks for your answers,
> P.
> 


ocr plugin

2008-05-02 Thread polloxx
Hi,

Am I right to say that picture spam has dropped dramatically over the
last few months?
Is it still reasonable to run an OCR plugin? I see the latest FuzzyOCR
version is not SA 3.2.x compatible. Are there more recent products
compatible with 3.2.x?
Are you guys still running an OCR plugin on production servers?

Thanks for your answers,
P.


Re: Bayes combining and OCR (Was Re: SpamAssassin 3.2 compatiblity)

2007-06-01 Thread Matthias Keller

Justin Mason wrote:
> Matthias Keller writes:
>> Nix wrote:
>>> On 31 May 2007, Graham Murray said:
>>>> Nix <[EMAIL PROTECTED]> writes:
>>>>> (And, let's be blunt, the pure this-word-is-spammy recognition part of
>>>>> FuzzyOCR is much less smart than the Bayesian system already present
>>>>> in SA: FuzzyOCR should really use the Bayesian system to determine the
>>>>> spamminess of words, I suppose...)
>>>>
>>>> Or even just act as a MIME part 'decoding' system (like Base64) and feed
>>>> all words it finds in images into Bayes, along with all other text in
>>>> the mail, rather than generating a score itself.
>>>
>>> Perhaps so, but if so those words should have a score-multiplier of some
>>> sort applied, because the fact that those words originated in images is
>>> itself an obfuscation technique that should be noted in the score.
>>
>> This has been discussed here again and again and again
>>
>> first of all, these 10 words found in an image cannot stand against the
>> bayes poisoning found in all these messages - so it would literally be
>> useless for bayes filtering
>
> by the way, this is a common misconception of how our Bayes system works;
> what *should* happen is that the "poison" text winds up with "weak"
> Bayesian probability scores between 0.2 and 0.8, since it uses words that
> also appear in ham (hence why it appears as poison).  However, the OCR'd
> text would wind up with "strong" scores around 0.99 or greater.
>
> The chi-square probability combining algorithm we use takes care of this,
> by discounting the "weak" clues and taking more account of the "strong"
> clues.  (This is what makes it a more effective combining algorithm for
> Bayes than the traditional Graham style.)
  
Would be nice if that worked, but it doesn't for me. I don't know how
the algorithm works, but I have observed its results.
I trained on dozens of spams with nearly identical spam text (only the
spam content, not the poisoning), and an identical mail WITH random text
got a Bayes 0.500. So it really doesn't work for me.


Matt


Bayes combining and OCR (Was Re: SpamAssassin 3.2 compatiblity)

2007-06-01 Thread Justin Mason

Matthias Keller writes:
> Nix wrote:
> > On 31 May 2007, Graham Murray said:
> >
> >   
> >> Nix <[EMAIL PROTECTED]> writes:
> >>
> >> 
> >>> (And, let's be blunt, the pure this-word-is-spammy recognition part of
> >>> FuzzyOCR is much less smart than the Bayesian system already present
> >>> in SA: FuzzyOCR should really use the Bayesian system to determine the
> >>> spamminess of words, I suppose...)
> >>>   
> >> Or even just act as a MIME part 'decoding' system (like Base64) and feed
> >> all words it finds in images into Bayes, along with all other text in
> >> the mail, rather than generating a score itself.
> >> 
> >
> > Perhaps so, but if so those words should have a score-multiplier of some
> > sort applied, because the fact that those words originated in images is
> > itself an obfuscation technique that should be noted in the score.
> >   
> This has been discussed here again and again and again
> 
> first of all, these 10 words found in an image cannot stand against the 
> bayes poisoning found in all these messages - so it would literally be 
> useless for bayes filtering

by the way, this is a common misconception of how our Bayes system works;
what *should* happen is that the "poison" text winds up with "weak"
Bayesian probability scores between 0.2 and 0.8, since it uses words that
also appear in ham (hence why it appears as poison).  However, the OCR'd
text would wind up with "strong" scores around 0.99 or greater.

The chi-square probability combining algorithm we use takes care of this,
by discounting the "weak" clues and taking more account of the "strong"
clues.  (This is what makes it a more effective combining algorithm for
Bayes than the traditional Graham style.)

Note: this relies on the use of a different "namespace" for OCR-discovered
words, btw; ie. if the words "make money fast" are found in OCR'd text,
it'd generate "OCR:make", "OCR:money", "OCR:fast".  If the OCR-discovered
words are just thrown in with normal text words, that wouldn't work.
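
The combining step described above can be sketched roughly as follows. This is a minimal illustration of Fisher-style chi-square combining as commonly described for Bayesian spam filters, not SpamAssassin's actual code; the function names (`chi2q`, `chi_combine`, `ocr_tokens`) and the example probabilities are assumptions made for the sketch.

```python
import math


def chi2q(x2, dof):
    """Return P(chi-square with `dof` degrees of freedom >= x2); dof must be even."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    spamminess = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0


def ocr_tokens(words):
    """Put OCR-discovered words in their own namespace, e.g. 'OCR:make'."""
    return ["OCR:" + word.lower() for word in words]


# Hypothetical message: 20 neutral "poison" tokens plus 3 strong OCR tokens.
poison = [0.5] * 20
strong = [0.99 for _ in ocr_tokens(["make", "money", "fast"])]
print(chi_combine(poison))           # neutral tokens alone stay at 0.5
print(chi_combine(poison + strong))  # a few strong clues pull the result well above 0.5
```

In this sketch the neutral tokens cancel out exactly, so the handful of strong "OCR:" tokens decides the result. For very large token counts the `math.exp(-m)` term can underflow; a production implementation would need to guard against that.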

--j.


Re: Fuzzy OCR & annoying Outlook users

2007-05-11 Thread Kris Deugau
[EMAIL PROTECTED] wrote:
> I'm using FuzzyOCR which works great. However, lately I've been seeing
> annoying Outlook users using some kind of plugin which seems to add an
> image, and it has the text "Free emoticons, download here" (or
> something), mostly it's in my language and then it has the text "gratis".

It's most likely actually a piece of point-and-drool-ware called
"IncrediMail" generating these messages.  As an ISP mail system admin,
I'm distinctly unimpressed with it;  it's essentially an overlay on
Outlook Express that locks off most of what little functionality there
is in OE, and thoroughly HTMLifies any outgoing mail (and, as you've
seen, adds these annoying "advertising" images).

It is, however, quite popular with ISP residential customers.  :(

-kgd


AW: Fuzzy OCR & annoying Outlook users

2007-05-11 Thread Starckjohann, Ove
simply remove "gratis" from your wordlist and you'll be done...
I think even without "gratis" in the wordlist FuzzyOCR will do a great job on 
"real" spam ;-)

Ove



> -Original Message-
> From: news [mailto:[EMAIL PROTECTED] On behalf of
> [EMAIL PROTECTED]
> Sent: Friday, 11 May 2007 10:52
> To: users@spamassassin.apache.org
> Subject: Fuzzy OCR & annoying Outlook users
>
> Hey,
>
> I'm using FuzzyOCR which works great. However, lately I've been seeing
> annoying Outlook users using some kind of plugin which seems to add an
> image, and it has the text "Free emoticons, download here" (or something),
> mostly it's in my language and then it has the text "gratis".
>
> The word "gratis" gets matched by FuzzyOCR and the mail gets an extra score
> of 5.
>
> So I tried adding the hash of this image:
>
> # ./fuzzy-find --delete imstp_pets_cat1_du.gif
> # ./fuzzy-find --learn-ham --score=0 imstp_pets_cat1_du.gif
>
> However, when I scan the mail again, I'm still getting a score of 5:
>
> 5.0 FUZZY_OCR_KNOWN_HASH   BODY: Mail contains an image with known hash
>    Words found:
>    "gratis" in 1 lines
>    "gratis" in 1 lines
>
> Any ideas on how to teach FuzzyOCR not to tag this image as spam?
>
> Thanks!
> K.


Fuzzy OCR & annoying Outlook users

2007-05-11 Thread kshatriyak

Hey,

I'm using FuzzyOCR which works great. However, lately I've been seeing
annoying Outlook users using some kind of plugin which seems to add an
image, and it has the text "Free emoticons, download here" (or something),
mostly it's in my language and then it has the text "gratis".

The word "gratis" gets matched by FuzzyOCR and the mail gets an extra score
of 5.


So I tried adding the hash of this image:

# ./fuzzy-find --delete imstp_pets_cat1_du.gif
# ./fuzzy-find --learn-ham --score=0 imstp_pets_cat1_du.gif

However, when I scan the mail again, I'm still getting a score of 5:

   5.0 FUZZY_OCR_KNOWN_HASH   BODY: Mail contains an image with known hash
   Words found:
   "gratis" in 1 lines
   "gratis" in 1 lines

Any ideas on how to teach FuzzyOCR not to tag this image as spam?

Thanks!
K.




bad OCR with some GIF images

2007-02-10 Thread Spamy.cz - Maxim Cerny
Hello,

I'm using SA 3.1.7 with FuzzyOCR 3.5.1. This month I started having
trouble with some GIF spams. The OCR can't recognize them and prints out
only a few letters. Has anybody seen this?


Max

[EMAIL PROTECTED] f]# spamassassin --debug FuzzyOcr < Přep\:\ Now\ this\ is\
clearly\ not\ re.eml > /dev/null
[21573] dbg: FuzzyOcr: focr_bin_helper:
'pnmnorm,pnminvert,pamthreshold,ppmtopgm,pamtopnm'
[21573] info: FuzzyOcr: Adding <5> new helper apps
[21573] dbg: FuzzyOcr: focr_bin_helper: 'tesseract'
[21573] info: FuzzyOcr: Adding <1> new helper apps
[21573] info: FuzzyOcr: Starting preprocessor parser for file
"/etc/mail/spamassassin/FuzzyOcr.preps"...
[21573] dbg: FuzzyOcr: line: preprocessor normalize {
[21573] dbg: FuzzyOcr: line: command = pnmnorm
[21573] dbg: FuzzyOcr: line: }
[21573] dbg: FuzzyOcr: line: preprocessor invert {
[21573] dbg: FuzzyOcr: line: command = pnminvert
[21573] dbg: FuzzyOcr: line: }
[21573] dbg: FuzzyOcr: line: preprocessor ppmtopgm {
[21573] dbg: FuzzyOcr: line: command = ppmtopgm
[21573] dbg: FuzzyOcr: line: }
[21573] dbg: FuzzyOcr: line: preprocessor pamtopnm {
[21573] dbg: FuzzyOcr: line: command = pamtopnm
[21573] dbg: FuzzyOcr: line: }
[21573] dbg: FuzzyOcr: line: preprocessor pamthreshold {
[21573] dbg: FuzzyOcr: line: command = pamthreshold
[21573] dbg: FuzzyOcr: line: args = -simple -threshold 0.5
[21573] dbg: FuzzyOcr: line: }
[21573] dbg: FuzzyOcr: line: preprocessor maketiff {
[21573] dbg: FuzzyOcr: line: command = pnmtotiff
[21573] dbg: FuzzyOcr: line: args = -color -truecolor
[21573] dbg: FuzzyOcr: line: }
[21573] info: FuzzyOcr: Starting scanset parser for file
"/etc/mail/spamassassin/FuzzyOcr.scansets"...
[21573] dbg: FuzzyOcr: line scanset ocrad {
[21573] dbg: FuzzyOcr: line command = $ocrad
[21573] dbg: FuzzyOcr: line args = -s5 $input
[21573] dbg: FuzzyOcr: line }
[21573] dbg: FuzzyOcr: line scanset ocrad-invert {
[21573] dbg: FuzzyOcr: line command = $ocrad
[21573] dbg: FuzzyOcr: line args = -s5 -i $input
[21573] dbg: FuzzyOcr: line }
[21573] dbg: FuzzyOcr: line scanset ocrad-decolorize-invert {
[21573] dbg: FuzzyOcr: line preprocessors = ppmtopgm, pamthreshold, pamtopnm
[21573] dbg: FuzzyOcr: line command = $ocrad
[21573] dbg: FuzzyOcr: line args = -s5 -i $input
[21573] dbg: FuzzyOcr: line }
[21573] dbg: FuzzyOcr: line scanset ocrad-decolorize {
[21573] dbg: FuzzyOcr: line preprocessors = ppmtopgm, pamthreshold, pamtopnm
[21573] dbg: FuzzyOcr: line command = $ocrad
[21573] dbg: FuzzyOcr: line args = -s5 $input
[21573] dbg: FuzzyOcr: line }
[21573] dbg: FuzzyOcr: line scanset gocr {
[21573] dbg: FuzzyOcr: line command = $gocr
[21573] dbg: FuzzyOcr: line args = -i $input
[21573] dbg: FuzzyOcr: line }
[21573] dbg: FuzzyOcr: line scanset gocr-180 {
[21573] dbg: FuzzyOcr: line command = $gocr
[21573] dbg: FuzzyOcr: line args = -l 180 -d 2 -i $input
[21573] dbg: FuzzyOcr: line }
[21573] info: FuzzyOcr: Searching in: /usr/local/netpbm/bin
[21573] info: FuzzyOcr: Searching in: /usr/local/bin
[21573] info: FuzzyOcr: Searching in: /usr/bin
[21573] info: FuzzyOcr: Using gifsicle => /usr/bin/gifsicle
[21573] dbg: FuzzyOcr: Using giffix => /bin/giffix
[21573] dbg: FuzzyOcr: Using giftext => /bin/giftext
[21573] dbg: FuzzyOcr: Using gifinter => /bin/gifinter
[21573] info: FuzzyOcr: Using giftopnm => /usr/bin/giftopnm
[21573] info: FuzzyOcr: Using jpegtopnm => /usr/bin/jpegtopnm
[21573] info: FuzzyOcr: Using pngtopnm => /usr/bin/pngtopnm
[21573] info: FuzzyOcr: Using bmptopnm => /usr/bin/bmptopnm
[21573] info: FuzzyOcr: Using tifftopnm => /usr/bin/tifftopnm
[21573] info: FuzzyOcr: Using ppmhist => /usr/bin/ppmhist
[21573] info: FuzzyOcr: Using pamfile => /usr/bin/pamfile
[21573] info: FuzzyOcr: Using ocrad => /usr/bin/ocrad
[21573] dbg: FuzzyOcr: Using gocr => /usr/local/bin/gocr
[21573] info: FuzzyOcr: Using pnmnorm => /usr/bin/pnmnorm
[21573] info: FuzzyOcr: Using pnminvert => /usr/bin/pnminvert
[21573] info: FuzzyOcr: Using pamthreshold => /usr/bin/pamthreshold
[21573] info: FuzzyOcr: Using ppmtopgm => /usr/bin/ppmtopgm
[21573] info: FuzzyOcr: Using pamtopnm => /usr/bin/pamtopnm
[21573] info: FuzzyOcr: Using tesseract => /usr/bin/tesseract
[21573] dbg: FuzzyOcr: Threshold[max_hash] => 5
[21573] dbg: FuzzyOcr: Threshold[c] => 5
[21573] dbg: FuzzyOcr: Threshold[s] => 0.01
[21573] dbg: FuzzyOcr: Threshold[w] => 0.01
[21573] dbg: FuzzyOcr: Threshold[h] => 0.01
[21573] dbg: FuzzyOcr: Threshold[cn] => 0.01
[21573] dbg: FuzzyOcr: focr_add_score => 1
[21573] dbg: FuzzyOcr: focr_autodisable_negative_score => -8
[21573] dbg: FuzzyOcr: focr_autodisable_score => 1000
[21573] dbg: FuzzyOcr: focr_autosort_buffer => 10
[21573] dbg: FuzzyOcr: focr_autosort_scanset => 1
[21573] dbg: FuzzyOcr: focr_base_score => 5
[21573] dbg: FuzzyOcr: focr_corrupt_score => 2.5
[21573] dbg: FuzzyOcr: focr_corrupt_u

Skipping OCR on Delivery Failures?

2007-01-31 Thread Josh Graham
I've set up Sendmail to send double bounces to /dev/null, but I'm still
getting a large amount of "Delivery failures" in my spambox, and each one
of them has been scanned by OCR.  According to my logs, I've scanned
1.3 million incoming messages in the last 48 hours and the server is
seriously bogged down. Is there a way to set SA to not use FuzzyOCR on
these types of messages?



Re: Despeckling images for OCR and anti-spam purposes

2006-12-23 Thread René Berber
Kelly Jones wrote:

> Spammers are starting to put "speckles" in their images to defeat
> OCR-scanning plugins such as FuzzyOCR.

That's a very old technique.

> I thought ImageMagick's -despeckle option would help, but it doesn't
> seem to, not even when applied multiple times, not even in conjunction
> with -monochrome.

Have you tried a simple `gocr -d 4 ...`? It does a good job with those images.

> I want a filter that does this for each pixel X:

man gocr:
...
   -d size
  set  dust  size  in  pixels  (clusters  smaller  than  this  are
  removed), 0 means no clusters are removed, the default is -1 for
  auto detection
...
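
To illustrate what a "dust size" filter of this kind does, here is a minimal sketch that erases ink clusters smaller than a given pixel count from a binary image. It is not gocr's implementation; the function name `remove_dust`, the 0/1 image representation, and the choice of 8-connectivity are assumptions made for the example.

```python
from collections import deque


def remove_dust(image, min_size):
    """Return a copy of `image` (rows of 0/1, 1 = ink) with every ink cluster
    of fewer than `min_size` pixels erased. Clusters use 8-connectivity."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected cluster of ink pixels.
            cluster = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                cluster.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(cluster) < min_size:
                for cy, cx in cluster:
                    result[cy][cx] = 0
    return result


speckled = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0],
]
for row in remove_dust(speckled, 2):
    print(row)
```

With `min_size` set to 0 nothing is removed, which mirrors the "0 means no clusters are removed" wording in the man page excerpt.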
> 1) if any of X's 8 neighbor pixels is the same color, turn X black
> 2) otherwise, turn X white
> 
> Can some combination of options to convert do this?
> 
> I realize that:
> 
> 1. This will only work w/ indexed-color images (eg, GIFs) and not JPEGs,
> etc.
> 2. Spammers will soon work around this, so this is just a short-term
> bandage.
> 3. I could write something in libgd to do this (blech!)

Whatever.
-- 
René Berber



Re: Despeckling images for OCR and anti-spam purposes

2006-12-23 Thread René Berber
Kenneth Porter wrote:

> --On Saturday, December 23, 2006 12:43 PM +0100 decoder
> <[EMAIL PROTECTED]> wrote:
> 
>> Which images are you referring to? If you can put up a sample, then I
>> can tell you which scanner setting will catch it :)
> 
> Does the SA wiki support uploading of images? Perhaps we could have a
> page of just problem images. [snip]

Bad idea, it would help spammers more than anybody else.
-- 
René Berber



Re: Despeckling images for OCR and anti-spam purposes

2006-12-23 Thread decoder


Kenneth Porter wrote:
> --On Saturday, December 23, 2006 12:43 PM +0100 decoder
> <[EMAIL PROTECTED]> wrote:
>
>> Which images are you referring to? If you can put up a sample,
>> then I can tell you which scanner setting will catch it :)
>
> Does the SA wiki support uploading of images? Perhaps we could have
>  a page of just problem images. Such a page is likely to grow large
>  and consume a lot of bandwidth, so perhaps we could get a resource
>  that thumbnails them and runs them through the Coral Cache.
I'm not sure about the SA wiki but you can create a ticket for it on
our side and attach the picture :) Maybe I can create a wiki page for
it as well on our page that allows uploading/appending of images. You
can find the page at fuzzyocr.own-hero.net.

Chris
>
>




Re: Despeckling images for OCR and anti-spam purposes

2006-12-23 Thread Kenneth Porter
--On Saturday, December 23, 2006 12:43 PM +0100 decoder
<[EMAIL PROTECTED]> wrote:

> Which images are you referring to? If you can put up a sample, then I
> can tell you which scanner setting will catch it :)

Does the SA wiki support uploading of images? Perhaps we could have a page
of just problem images. Such a page is likely to grow large and consume a
lot of bandwidth, so perhaps we could get a resource that thumbnails them
and runs them through the Coral Cache.





Re: Despeckling images for OCR and anti-spam purposes

2006-12-23 Thread decoder


Kelly Jones wrote:
> Spammers are starting to put "speckles" in their images to defeat
> OCR-scanning plugins such as FuzzyOCR.
Which images are you referring to? If you can put up a sample, then I
can tell you which scanner setting will catch it :)


Best regards,

Chris


>
> I thought ImageMagick's -despeckle option would help, but it
> doesn't seem to, not even when applied multiple times, not even in
> conjunction with -monochrome.
>
> I want a filter that does this for each pixel X:
>
> 1) if any of X's 8 neighbor pixels is the same color, turn X black
> 2) otherwise, turn X white
>
> Can some combination of options to convert do this?
>
> I realize that:
>
> 1. This will only work w/ indexed-color images (eg, GIFs) and not
>    JPEGs, etc.
> 2. Spammers will soon work around this, so this is just a short-term
>    bandage.
> 3. I could write something in libgd to do this (blech!)
>




Despeckling images for OCR and anti-spam purposes

2006-12-22 Thread Kelly Jones

Spammers are starting to put "speckles" in their images to defeat
OCR-scanning plugins such as FuzzyOCR.

I thought ImageMagick's -despeckle option would help, but it doesn't
seem to, not even when applied multiple times, not even in conjunction
with -monochrome.

I want a filter that does this for each pixel X:

1) if any of X's 8 neighbor pixels is the same color, turn X black
2) otherwise, turn X white

Can some combination of options to convert do this?

I realize that:

1. This will only work w/ indexed-color images (eg, GIFs) and not JPEGs, etc.
2. Spammers will soon work around this, so this is just a short-term bandage.
3. I could write something in libgd to do this (blech!)
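
The per-pixel rule described above could be sketched like this. It is only an illustration in plain Python on a grid of palette indices, not an ImageMagick or libgd solution; the name `neighbor_filter` and the 0/255 output values are assumptions made for the example.

```python
BLACK, WHITE = 0, 255


def neighbor_filter(pixels):
    """Apply the rule literally: a pixel with at least one same-colored
    8-neighbor becomes black, every other pixel becomes white.
    `pixels` is a list of rows of palette indices (indexed-color image)."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    out = [[WHITE] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            color = pixels[y][x]
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue  # skip the pixel itself
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and pixels[ny][nx] == color):
                        out[y][x] = BLACK
    return out


# A single speckle (index 3) surrounded by background (index 7).
sample = [
    [7, 7, 7],
    [7, 3, 7],
    [7, 7, 7],
]
for row in neighbor_filter(sample):
    print(row)
```

Note that, taken literally, the rule also turns the background black, since background pixels have same-colored neighbors; only isolated speckles end up white. A practical filter would probably need an extra step to separate text from background.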

--
We're just a Bunch Of Regular Guys, a collective group that's trying
to understand and assimilate technology. We feel that resistance to
new ideas and technology is unwise and ultimately futile.


Re: Custom OCR ?

2006-12-12 Thread Theo Van Dinter
On Tue, Dec 12, 2006 at 06:06:02PM +0100, Janek Kozicki wrote:
> Can you tell me how to painlessly tell spamassassin to call my script?

Write a plugin.  See something like FuzzyOcr.

-- 
Randomly Selected Tagline:
If Major BBS sucked, it would be good for something.




Custom OCR ?

2006-12-12 Thread Janek Kozicki
Hello,

I have Debian sarge with SpamAssassin 3.1.7 backported from testing.

To configure it, I only modified the file /etc/default/spamassassin,
which has the following content:

ENABLED=1
OPTIONS="--create-prefs --max-children 5 --helper-home-dir -s /var/log/spamd.log"
PIDFILE="/var/run/spamd.pid"
NICE="--nicelevel 15"

I want to try several different OCR programs to filter spam. Let's
say that I have written a script ~/bin/img2txt which takes the image
file as its single argument and prints the OCRed text to stdout.


Can you tell me how to painlessly tell spamassassin to call my script?

Ideally it would be an OPTION, so that I could for example say:

OPTIONS="--create-prefs --max-children 5 --helper-home-dir -s /var/log/spamd.log -ocr=/home/janek/bin/img2txt"


Then I will be able to focus my work on the content of ~/bin/img2txt,
experimenting with all the OCR programs available on the net, and
even with my own neural networks...
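
A script with the interface described above (one image path in, OCRed text on stdout) could be sketched as follows. This is only an assumed shape for such a wrapper, not something SpamAssassin calls by itself; the function names, the "{input}" placeholder, and the default `gocr -i` invocation are illustrative choices.

```python
import subprocess
import sys


def image_to_text(image_path, command):
    """Run an external OCR command and return whatever it prints to stdout.
    `command` is an argument list; the placeholder "{input}" is replaced
    by the image path. Returns an empty string if the command fails."""
    argv = [arg.replace("{input}", image_path) for arg in command]
    completed = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    if completed.returncode != 0:
        return ""
    return completed.stdout


def main(argv):
    """Entry point for a hypothetical img2txt script: img2txt IMAGE_FILE."""
    if len(argv) != 2:
        print("usage: img2txt IMAGE_FILE", file=sys.stderr)
        return 2
    # Assumed default OCR engine; swap the list to try a different program.
    sys.stdout.write(image_to_text(argv[1], ["gocr", "-i", "{input}"]))
    return 0
```

Keeping the OCR command as a plain argument list makes it easy to swap engines while the calling side stays unchanged. To use it as a script, one would call `sys.exit(main(sys.argv))` at the bottom of the file.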

-- 
Janek Kozicki |


Re: spammers dodging OCR

2006-11-21 Thread alex
lol, just got a spam with the image obfuscated like captchas in a bbs,
to avoid detection by ocr.

On Mon, Nov 06, 2006 at 02:06:45PM -0600, Jorge Valdes wrote:
> Gary V wrote:
> >This morning I received my copy of networkworld. Here is an 
> >interesting article:
> >
> >http://www.networkworld.com/columnists/2006/103006buzz-spammers-dodging-ocr.html
> > 
> >
> >
> >Gary V
> >
> FuzzyOcr (devel version) is already catching these... has been for a 
> while now.
> 
> -- 
> Jorge Valdes
> 


Re: Fuzzy OCR - first time user

2006-11-18 Thread decoder

Marc Perkel wrote:
> OK - trying out the FuzzyOCR plugin. So far it's all the default stuff
> with minimal installation. I'm running Fedora Core 6. Used the gocr
> RPM and didn't patch the source. Everything is default and it doesn't
> seem to be complaining so .
>
> If I like this what do I need to change to really do it right? Should
> I grab the devel code? Do I really need the gocr patch? Should I tweak
> the scores? What do the hard core users change?

My suggested FuzzyOcr version is 3.4.x, since it is a lot better. I
also recommend enabling image hashing, which is disabled by default.

About the patch for gocr: I highly suggest building it from source,
because I don't know if Fedora Core 6 has the proper bindings to netpbm
compiled into gocr. Red Hat does not. That leads to a dramatic decrease
in effectiveness. Also, the patch prevents segmentation faults with some
pictures, and afaik, this bug still hasn't been fixed.

The scores normally do not need changing, unless you get serious problems
with FPs.

And what do the hardcore users change? lol... well, experienced users have
different scansets; for example, they invoke "ocrad" instead of gocr in
their scansets because it runs faster and recognizes better in most
situations. In the shipped config file, there is an example of a
scanset which includes ocrad (if you want to try it out, make sure to
read the "Notes about the config file" page on the FuzzyOcr download
page, as the ocrad scanset contains a small typo which should be fixed
first :))


Finally, if you run into problems, try our mailing list at 
http://lists.own-hero.net/mailman/listinfo/devel-spam



Best regards,


Chris


Fuzzy OCR - first time user

2006-11-17 Thread Marc Perkel
OK - trying out the FuzzyOCR plugin. So far it's all the default stuff
with minimal installation. I'm running Fedora Core 6. Used the gocr RPM
and didn't patch the source. Everything is default and it doesn't seem
to be complaining so .


If I like this what do I need to change to really do it right? Should I
grab the devel code? Do I really need the gocr patch? Should I tweak the
scores? What do the hard core users change?




Re: spammers dodging OCR

2006-11-06 Thread Jorge Valdes

Gary V wrote:
> This morning I received my copy of networkworld. Here is an
> interesting article:
>
> http://www.networkworld.com/columnists/2006/103006buzz-spammers-dodging-ocr.html
>
> Gary V

FuzzyOcr (devel version) is already catching these... has been for a
while now.


--
Jorge Valdes




spammers dodging OCR

2006-11-06 Thread Gary V
This morning I received my copy of networkworld. Here is an interesting 
article:


http://www.networkworld.com/columnists/2006/103006buzz-spammers-dodging-ocr.html

Gary V





Re: Fuzzy OCR false positives from Screenshots...

2006-09-08 Thread jdow

From: "David B Funk" <[EMAIL PROTECTED]>

> On Fri, 8 Sep 2006, Michael Grey wrote:
>
>> In regards to the second, many large companies have outside companies do work
>> for them in the areas of marketing and other aspects. So this also will
>> happen regardless.
>>
>> Let me clarify; this is an OUTSIDE relay to INSIDE...
>>
>> A FuzzyOCR White List with (very privately held) keywords would help.
>>
>> Any other ideas ?
>
> Sit down and have a little heart-to-heart with your marketing people.
> They may want to rethink their methods. Put the images on a web-server
> and e-mail links to them.
>
> You can hack your local mail system to not spam-tag those messages
> but what about the intended potential customer recipients?
> If you're tagging them then that should be an indication that other
> people will too.
>
> The world changes, sometimes due to actions of bad people. After 9/11
> it became a bad idea to try to send white powder thru snail-mail.
> Thanks to image-spammers it's becoming not so good an idea to send
> embedded-image e-mails (particularly if it's a marketing campaign ;).


Marketing campaigns via email are deadly these days regardless of
embedded whazzits or whozzits. I don't CARE if I have any prior
relationship with company fubar. If they send me marketing it goes
to the spam bucket. And it is liable to be used for anti-spam training
if the BAYES score was too low.

Sending an account summary as an email image is perhaps one of the
stupidest ideas any marketdroid has concocted. (If it is part of a
marketing plan this raises sincere concerns about the cell company's
privacy policy, too.)

{^_^}



Re: Fuzzy OCR false positives from Screenshots...

2006-09-08 Thread jdow

It MAY be that FuzzyOcr needs to become aware of whitelists. But then,
if you use a whitelist entry for the cell account summaries the FuzzyOcr
scores are basically meaningless.

{^_^}
- Original Message - 
From: "Michael Grey" <[EMAIL PROTECTED]>

To: 
Sent: Friday, September 08, 2006 09:40
Subject: Fuzzy OCR false positives from Screenshots...


We are testing a new configuration using FuzzyOCR, and found it to work very
well overall...

However, there have been two occasions in the last 24 hrs where screenshots
embedded into the emails caused false positives.

One was an 'account summary' from a cell company, the other was some internal
marketing info.

Are there other approaches to getting certain images white listed if they
contain, say, our specific company name ?

Any other ideas on how to deal with this ?

Many thanks !

Michael Grey








RE: Fuzzy OCR false positives from Screenshots...

2006-09-08 Thread David B Funk
On Fri, 8 Sep 2006, Michael Grey wrote:

> In regards to the second, many large companies have outside companies do work
> for them in the areas of marketing and other aspects. So this also will
> happen regardless.
>
> Let me clarify; this is an OUTSIDE relay to INSIDE...
>
> A FuzzyOCR White List with (very privately held) keywords would help.
>
> Any other ideas ?

Sit down and have a little heart-to-heart with your marketing people.
They may want to rethink their methods. Put the images on a web-server
and e-mail links to them.

You can hack your local mail system to not spam-tag those messages
but what about the intended potential customer recipients?
If you're tagging them then that should be an indication that other
people will too.

The world changes, sometimes due to actions of bad people. After 9/11
it became a bad idea to try to send white powder thru snail-mail.
Thanks to image-spammers it's becoming not so good an idea to send
embedded-image e-mails (particularly if it's a marketing campaign ;).

-- 
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_adminIowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{


Re: Fuzzy OCR false positives from Screenshots...

2006-09-08 Thread Logan Shaw

On Fri, 8 Sep 2006, Michael Grey wrote:

> We are testing a new configuration using FuzzyOCR, and found it to work very
> well overall...
>
> However, there have been two occasions in the last 24 hrs where screenshots
> embedded into the emails caused false positives.
>
> One was an 'account summary' from a cell company, the other was some internal
> marketing info.
>
> Are there other approaches to getting certain images white listed if they
> contain, say, our specific company name ?


You could probably hack FuzzyOcr.pm pretty easily.

The basic strategy would be to create another list just like
@words, but with whitelist words instead.  You should be able
to duplicate the code where it parses config file options (look
for "focr_word") and put in your own config file option, say
"focr_word_whitelist".  Then at the bottom, there is a foreach
loop that iterates through @words and looks for matches.
You can just duplicate that loop and create a separate count
of whitelist words matched.  Then modify the way the score is
computed (the "my $score = ...") line, and you're done.
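
The idea above can be sketched outside of Perl as follows. This is not FuzzyOcr.pm code: the function names, the exact-match word test, the scoring formula, and the `whitelist_discount` parameter are all assumptions for illustration. The default values 5 and 1 echo the `focr_base_score` and `focr_add_score` settings seen in FuzzyOcr debug output, but how the plugin actually combines them is not shown here.

```python
def count_matches(ocr_text, wordlist):
    """Count how many wordlist entries occur in the OCR'd text.
    Uses exact, case-insensitive matching; the real plugin matches fuzzily."""
    tokens = set(ocr_text.lower().split())
    return sum(1 for word in wordlist if word.lower() in tokens)


def fuzzy_ocr_score(ocr_text, spam_words, whitelist_words,
                    base_score=5.0, add_score=1.0, whitelist_discount=2.0):
    """Hypothetical score: base score for the first spam word, a bonus per
    extra spam word, and a discount for every whitelist word that is found."""
    spam_hits = count_matches(ocr_text, spam_words)
    white_hits = count_matches(ocr_text, whitelist_words)
    if spam_hits == 0:
        return 0.0
    score = base_score + add_score * (spam_hits - 1)
    score -= whitelist_discount * white_hits
    return max(score, 0.0)


spam_words = ["gratis", "mortgage"]
whitelist_words = ["acme"]  # stand-in for a privately held company keyword
print(fuzzy_ocr_score("gratis mortgage offer", spam_words, whitelist_words))
print(fuzzy_ocr_score("ACME gratis mortgage summary", spam_words, whitelist_words))
```

A discount per whitelist hit, rather than an outright skip, keeps the rule useful if a spammer happens to include a whitelisted word; whether that trade-off is right depends on how private the keywords really are.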

  - Logan


RE: Fuzzy OCR false positives from Screenshots...

2006-09-08 Thread Logan Shaw

On Fri, 8 Sep 2006, Randal, Phil wrote:

> Score appropriately, train your Bayes well, and the false positives
> should diminish.


FUZZY_OCR gives crazily high scores to certain things.
One point per matched keyword, I believe.  I've seen FUZZY_OCR,
by itself, give scores as high as 24.00.

Here's the distribution from one of my log files, as a matter
of fact:

score   count
-----   -----
 4.00      21
 5.00       9
 6.00       6
 7.00       4
 8.00       4
 9.00      15
10.00       7
11.00       7
13.00       1
14.00       1
24.00       1

As you can see, 24 only happened once, but 9, 10, and 11 are
very common.

So yeah, false positives should diminish some, but there is
no way a BAYES_00 is going to make up for a score of 11.

Personally, I think the scoring strategy for FUZZY_OCR needs
to be revamped...

  - Logan


RE: Fuzzy OCR false positives from Screenshots...

2006-09-08 Thread Randal, Phil
Score appropriately, train your Bayes well, and the false positives
should diminish.

Cheers,

Phil

--
Phil Randal
Network Engineer
Herefordshire Council
Hereford, UK  

> -Original Message-
> From: Michael Grey [mailto:[EMAIL PROTECTED] 
> Sent: 08 September 2006 18:43
> To: users@spamassassin.apache.org
> Subject: RE: Fuzzy OCR false positives from Screenshots...
> 
> 
> You will have to ask the cell company about the first issue ...
> 
> In regards to the second, many large companies have outside companies do work
> for them in the areas of marketing and other aspects. So this also will
> happen regardless.
> 
> Let me clarify; this is an OUTSIDE relay to INSIDE...
> 
> A FuzzyOCR White List with (very privately held) keywords would help. 
> 
> Any other ideas ?
> 
> 
> 
> Michael Grey
> 
> 
> 
> 
> -Original Message-
> From: John D. Hardin [mailto:[EMAIL PROTECTED] 
> Sent: Friday, September 08, 2006 10:10 AM
> To: Michael Grey
> Cc: users@spamassassin.apache.org
> Subject: Re: Fuzzy OCR false positives from Screenshots...
> 
> On Fri, 8 Sep 2006, Michael Grey wrote:
> 
> > However, there have been two occasions in the last 24 hrs where screenshots
> > embedded into the emails caused false positives.
> > 
> > One was an 'account summary' from a cell company, the other was some internal
> > marketing info.
> > 
> > Are there other approaches to getting certain images white listed if they
> > contain, say, our specific company name ?
> 
> Don't run SA against internal email.
> 
> And what the heck is a cell-phone company doing sending you
> screenshots?
> 
> --
>  John Hardin KA7OHZICQ#15735746http://www.impsec.org/~jhardin/
>  [EMAIL PROTECTED]FALaholic #11174pgpk -a [EMAIL PROTECTED]
>  key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> --
> -
>   If someone has a gun and is trying to kill you, it would be
>   reasonable to shoot back with your own gun.
>   -- the Dalai Lama, May 15, 2001
> --
> -
>  9 days until The 219th anniversary of the signing of the 
> U.S. Constitution
> 


RE: Fuzzy OCR false positives from Screenshots...

2006-09-08 Thread Michael Grey

You will have to ask the cell company about the first issue ...

In regards to the second, many large companies have outside companies do work
for them in the areas of marketing and other aspects. So this also will
happen regardless.

Let me clarify; this is an OUTSIDE relay to INSIDE...

A FuzzyOCR White List with (very privately held) keywords would help. 

Any other ideas ?



Michael Grey




-Original Message-
From: John D. Hardin [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 08, 2006 10:10 AM
To: Michael Grey
Cc: users@spamassassin.apache.org
Subject: Re: Fuzzy OCR false positives from Screenshots...

On Fri, 8 Sep 2006, Michael Grey wrote:

> However, there have been two occasions in the last 24 hrs where screenshots
> embedded into the emails caused false positives.
> 
> One was an 'account summary' from a cell company, the other was some internal
> marketing info.
> 
> Are there other approaches to getting certain images white listed if they
> contain, say, our specific company name ?

Don't run SA against internal email.

And what the heck is a cell-phone company doing sending you
screenshots?

--
 John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]    FALaholic #11174    pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  If someone has a gun and is trying to kill you, it would be
  reasonable to shoot back with your own gun.
  -- the Dalai Lama, May 15, 2001
---
 9 days until The 219th anniversary of the signing of the U.S. Constitution



Re: Fuzzy OCR false positives from Screenshots...

2006-09-08 Thread John D. Hardin
On Fri, 8 Sep 2006, Michael Grey wrote:

> However, there have been two occasions in the last 24 hrs where screenshots
> embedded into the emails caused false positives.
> 
> One was an 'account summary' from a cell company, the other was some internal
> marketing info.
> 
> Are there other approaches to getting certain images white listed if they
> contain, say, our specific company name ?

Don't run SA against internal email.

And what the heck is a cell-phone company doing sending you
screenshots?

--
 John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]    FALaholic #11174    pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  If someone has a gun and is trying to kill you, it would be
  reasonable to shoot back with your own gun.
  -- the Dalai Lama, May 15, 2001
---
 9 days until The 219th anniversary of the signing of the U.S. Constitution



Fuzzy OCR false positives from Screenshots...

2006-09-08 Thread Michael Grey








We are testing a new configuration using FuzzyOCR, and found it to work
very well overall…

However, there have been two occasions in the last 24 hrs where
screenshots embedded into the emails caused false positives.

One was an ‘account summary’ from a cell company, the other was some
internal marketing info.

Are there other approaches to getting certain images white listed if
they contain, say, our specific company name ?

Any other ideas on how to deal with this ?

Many thanks !

Michael Grey



Re: Tesseract OCR open sourced

2006-09-05 Thread John D. Hardin
On Tue, 5 Sep 2006, Robert LeBlanc wrote:

> The only /non-technical/ issue that occurs to me is in the
> licensing, which is a combination of the Apache License (2.0) and
> a custom clause that may be a non-starter for some applications:
> "If you wish to use it for commercial gain you must contact The
> MITRE Corporation for conditions of use."  This might preclude its
> (free) use in SpamAssassin-based appliances, or at for-profit ISPs
> and offsite mail-filtering services.

If licensing is in any way an issue, then FuzzyOCR should be
structured such that it's very easy to choose which OCR engine you
wish it to use.

--
 John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]    FALaholic #11174    pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The problem is when people look at Yahoo, slashdot, or groklaw and
  jump from obvious and correct observations like "Oh my God, this
  place is teeming with utter morons" to incorrect conclusions like
  "there's nothing of value here".-- Al Petrofsky, in Y! SCOX
---
 12 days until The 219th anniversary of the signing of the U.S. Constitution



Re: Tesseract OCR open sourced

2006-09-05 Thread Robert LeBlanc
Kenneth Porter wrote:
> <http://developers.slashdot.org/comments.pl?sid=195752&cid=16041870>
> 
> Theo just mentioned this on the -devel list:
> 
> <http://article.gmane.org/gmane.mail.spam.spamassassin.devel/45374>

I posted about this last week on the Devel-Spam and Maia-Users lists,
along with the results of some preliminary tests I conducted with
Tesseract OCR vs. GOCR, and it looks promising.  Here's what I posted:

=== post begins ===

It's already "usable"; I've compiled it and done some basic tests with
it, and it does seem to work pretty well.  On an arbitrary spam image,
for instance, which starts with:

  CRITICAL INVESTOR ALERT!
  ESPION INTERNATIONAL INC (EPLJ.PK)

The Tesseract OCR engine scanned this image and produced:

  CRITICAL INVESTOR ALERT!
  ESPION INTERNATIONAL INC (EPLJPK)

By comparison, the GOCR engine (with default options) produces:

  cRITIcAE INvEsToR AEERT!.

With -l 180 -d 2 though, GOCR does about as well as Tesseract, if you
ignore case:

  cRITIcAL INvEsToR ALERT!.
  EspIoN INTERNATIoNAL INc (EpLJ.pK)

One potential snag, though, is that Tesseract OCR only operates on TIFF
images, and pnmtotiff wasn't able to produce a usable TIFF in this
sample test.  The ImageMagick "convert" utility worked, though.

Another issue is that Tesseract OCR doesn't behave as a filter (i.e. it
doesn't read from STDIN or write to STDOUT), it expects to be called
like this:

  tesseract <infile.tif> <outfile> batch

which then produces three files:

  outfile.map
  outfile.txt
  outfile.raw

The *.txt file is the extracted text that we're interested in.  It
shouldn't be too difficult to modify the sources to make Tesseract OCR
behave like a proper filter though, and it's likely that such a patch
would be welcomed by the maintainers at Google.

At the moment--and based solely on this very basic bit of testing--I'd
say that Tesseract OCR is comparable to GOCR, perhaps a bit more clever.
More testing on different types of spam images will be required to know
for sure.  There's practically no documentation available for it yet,
apart from the brief README, so it's unclear whether it accepts any
parameters of the sort that GOCR does, or whether it scans everything
with fixed settings or uses some sort of adaptive setting mechanism.

Google seems to be in the process of hiring people to work on it,
however, and with their large-scale book-scanning projects underway they
have a vested interest in producing a high-quality OCR engine, so in six
months or so Tesseract OCR might well be the engine of choice.  Right
now, though, I think GOCR can do a comparable job.

=== end of post ===

Upon discussing this with decoder (the author of the FuzzyOcr plugin),
he maintains that none of these potential snags are real obstacles.  His
next release will supposedly include a new temporary file structure that
can handle the *.txt file output without requiring the OCR engine to
behave as a proper filter.  I intend to continue testing Tesseract OCR
with more image spam samples as I get ahold of them, but thus far it
looks promising.

The only /non-technical/ issue that occurs to me is in the licensing,
which is a combination of the Apache License (2.0) and a custom clause
that may be a non-starter for some applications: "If you wish to use it
for commercial gain you must contact The MITRE Corporation for
conditions of use."  This might preclude its (free) use in
SpamAssassin-based appliances, or at for-profit ISPs and offsite
mail-filtering services.  If the commercial license fees (or "conditions
of use") are reasonable, however, this may be a minor issue,
particularly if the engine proves to be better than anything else in its
price range.  Always nice to have options, even if they're not always free.

-- 
Robert LeBlanc <[EMAIL PROTECTED]>
Renaissoft, Inc.
Maia Mailguard <http://www.maiamailguard.com/>
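
The temporary-file approach mentioned above (letting the engine write its *.txt file instead of forcing it to act as a filter) might be sketched like this. It is a minimal illustration, not FuzzyOcr's code: the argument order `<binary> <infile> <outbase> batch` and the `outbase + ".txt"` output name are assumptions drawn from the description in this post, and `run` is injectable only so the wrapper can be exercised without the real binary.

```python
import os
import subprocess
import tempfile


def ocr_via_tempfiles(image_bytes, run=subprocess.run, binary="tesseract"):
    """Run a file-based OCR tool on image bytes and return the extracted text.

    Assumed calling convention: <binary> <infile> <outbase> batch,
    with the recognized text written to <outbase>.txt.
    """
    with tempfile.TemporaryDirectory() as workdir:
        infile = os.path.join(workdir, "input.tif")
        outbase = os.path.join(workdir, "outfile")
        with open(infile, "wb") as handle:
            handle.write(image_bytes)
        # The engine reads and writes named files, so no pipes are involved.
        run([binary, infile, outbase, "batch"], check=True)
        with open(outbase + ".txt", encoding="utf-8", errors="replace") as handle:
            return handle.read()
```

Because everything lives in one scratch directory, the extra *.map and *.raw files are cleaned up together with the input image when the function returns.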





Re: Tesseract OCR open sourced

2006-09-05 Thread Kenneth Porter



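<http://developers.slashdot.org/comments.pl?sid=195752&cid=16041870>

Theo just mentioned this on the -devel list:

<http://article.gmane.org/gmane.mail.spam.spamassassin.devel/45374>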




Tesseract OCR open sourced

2006-09-04 Thread jdow

http://developers.slashdot.org/developers/06/09/04/2215210.shtml

Tesseract, developed by HP labs, is touted as one of the most
accurate OCR programs available. Google cleaned it up and has
released it as open source.

{^_^}


Re: OCR plugin doesn't seem to work

2006-08-23 Thread decoder

Mike Pepe wrote:
> decoder wrote:
>
>> Which OCR plugin are you using there? If it is the original
>> OcrPlugin, then you might try FuzzyOcr instead. The original
>> OcrPlugin was more proof-of-concept, and will cause you lots of
>> headaches with the current image spam...
>
> I did upgrade to FuzzyOCR after I read your message. But, I don't
> think it's working- however other rules seem to be catching these
> stock gifs. Here's the headers from one of them:
>
> Content analysis details:   (10.6 points, 5.0 required)
>
>  pts rule name              description
> ---- ---------------------- --------------------------------------------------
>  1.1 EXTRA_MPART_TYPE       Header has extraneous Content-type:...type= entry
>  4.2 HELO_DYNAMIC_IPADDR    Relay HELO'd using suspicious hostname (IP addr 1)
>  0.1 FORGED_RCVD_HELO       Received: contains a forged HELO
>  1.1 HTML_IMAGE_ONLY_32     BODY: HTML: images with 2800-3200 bytes of words
>  0.4 HTML_30_40             BODY: Message is 30% to 40% HTML
>  1.0 BAYES_60               BODY: Bayesian spam probability is 60 to 80%
>                             [score: 0.7765]
>  0.0 HTML_MESSAGE           BODY: HTML included in message
>  0.8 SARE_GIF_ATTACH        FULL: Email has a inline gif
>  2.0 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
>                             [71.197.31.248 listed in dnsbl.sorbs.net]
>
> I don't see OCR mentioned in there at all. I still don't think it's
>  working.
>
> Spamassassin --lint doesn't indicate anything is wrong. How can I
> test it?
>
> -Mike
>

The download page of FuzzyOcr provides a sample-mails.tar.gz. It
contains some messages which should all get detected.


Chris
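
For checking the sample mails, one rough way to see whether an OCR rule fired is to pull the rule names out of the "Content analysis details" report and look for one that mentions OCR. The sketch below assumes the report layout shown in this thread; the substring "OCR" is an assumption, since the plugin's actual rule name does not appear in these messages.

```python
import re

# One report row: score, rule name, then the description.
ROW = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s+([A-Z][A-Z0-9_]+)\s+(.*)$")


def rules_hit(report):
    """Return {rule_name: score} for each row of a 'Content analysis details' report."""
    hits = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            hits[match.group(2)] = float(match.group(1))
    return hits


def ocr_rule_fired(report, needle="OCR"):
    """True if any rule name contains `needle` (the real rule name is an assumption)."""
    return any(needle in name for name in rules_hit(report))
```

Header and continuation lines (such as the bracketed Bayes score) do not start with a number followed by an upper-case rule name, so they are ignored.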



Re: OCR plugin doesn't seem to work

2006-08-22 Thread Mike Pepe

decoder wrote:


Which OCR plugin are you using there? If it is the original OcrPlugin,
then you might try FuzzyOcr instead. The original OcrPlugin was more
proof-of-concept, and will cause you lots of headaches with the
current image spam...


I did upgrade to FuzzyOCR after I read your message. But, I don't think 
it's working- however other rules seem to be catching these stock gifs. 
Here's the headers from one of them:


Content analysis details:   (10.6 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.1 EXTRA_MPART_TYPE       Header has extraneous Content-type:...type= entry
 4.2 HELO_DYNAMIC_IPADDR    Relay HELO'd using suspicious hostname (IP addr 1)
 0.1 FORGED_RCVD_HELO       Received: contains a forged HELO
 1.1 HTML_IMAGE_ONLY_32     BODY: HTML: images with 2800-3200 bytes of words
 0.4 HTML_30_40             BODY: Message is 30% to 40% HTML
 1.0 BAYES_60               BODY: Bayesian spam probability is 60 to 80%
                            [score: 0.7765]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.8 SARE_GIF_ATTACH        FULL: Email has a inline gif
 2.0 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [71.197.31.248 listed in dnsbl.sorbs.net]

I don't see OCR mentioned in there at all. I still don't think it's working.

Spamassassin --lint doesn't indicate anything is wrong. How can I test it?

-Mike



Re: OCR plugin doesn't seem to work

2006-08-21 Thread decoder

Mike Pepe wrote:
> Hey guys,
>
> Running SA 3.1.1, on Fedora Core 3, with Perl 5.8.5
>
> I installed gocr and imagemagick packages, copied the Ocr.pm and cf
>  files into /etc/mail/spamassassin
>
> The tests don't seem to run, the pump 'n dump GIFs are still
> arriving and I don't see that the test is being run in the headers.
>  Other SARE and custom rules in that directory are running though.
> The permissions are the same, etc. Anyone have any ideas?
>
> # ls
> 70_sare_adult.cf     70_sare_uri1.cf           spamassassin-default.rc
> 70_sare_obfu0.cf     99_sare_fraud_post25x.cf  spamassassin-helper.sh
> 70_sare_obfu1.cf     99_sare_fraud_pre25x.cf   spamassassin-spamc.rc
> 70_sare_oem.cf       cathy_caparula.cf         tripwire.cf
> 70_sare_random.cf    init.pre                  v310.pre
> 70_sare_specific.cf  local.cf                  WebRedirect.cf
> 70_sare_spoof.cf     Ocr.cf                    WebRedirect.pm
> 70_sare_stocks.cf    Ocr.pm
> 70_sare_uri0.cf      RulesDuJour
>
Which OCR plugin are you using there? If it is the original OcrPlugin,
then you might try FuzzyOcr instead. The original OcrPlugin was more
proof-of-concept, and will cause you lots of headaches with the
current image spam...


Chris



OCR plugin doesn't seem to work

2006-08-21 Thread Mike Pepe

Hey guys,

Running SA 3.1.1, on Fedora Core 3, with Perl 5.8.5

I installed gocr and imagemagick packages, copied the Ocr.pm and cf 
files into /etc/mail/spamassassin


The tests don't seem to run, the pump 'n dump GIFs are still arriving 
and I don't see that the test is being run in the headers. Other SARE 
and custom rules in that directory are running though. The permissions 
are the same, etc. Anyone have any ideas?


# ls
70_sare_adult.cf     70_sare_uri1.cf           spamassassin-default.rc
70_sare_obfu0.cf     99_sare_fraud_post25x.cf  spamassassin-helper.sh
70_sare_obfu1.cf     99_sare_fraud_pre25x.cf   spamassassin-spamc.rc
70_sare_oem.cf       cathy_caparula.cf         tripwire.cf
70_sare_random.cf    init.pre                  v310.pre
70_sare_specific.cf  local.cf                  WebRedirect.cf
70_sare_spoof.cf     Ocr.cf                    WebRedirect.pm
70_sare_stocks.cf    Ocr.pm
70_sare_uri0.cf      RulesDuJour

-Mike


Re: Improved OCR Plugin with approximate matching

2006-08-18 Thread Matthias Keller

decoder wrote:
> decoder wrote:
>> Hello there,
>>
>> I have improved the original OcrPlugin (found at
>> http://wiki.apache.org/spamassassin/OcrPlugin), so it contains
>> fuzzy matching. Like that, mistakes made by the OCR recognition or
>> intentional obfuscations in the text don't make the recognition
>> impossible. This is being done with a relative distance calculation
>> between the pattern (word from a given word list) and a line in
>> the recognized input. Also, the plugin uses dynamic scoring (more
>> matched words means more score, this can be adjusted in the
>> source).
>>
>> You can find a full description and an example in the wiki under:
>>
>> http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
>>
>> Ideas for improvements or criticism are always welcome :)
>>
>> Best regards,
>>
>> Chris
>
> A new beta is available (2.2-beta1).
>
> It includes a bugfix for a bug with jpeg content-types reported by
> Matthias Keller. Other changes:
>
> - Debug file stuff removed, instead of that, the tempfiles don't get
>   deleted when in debug mode (verbose > 1).
> - Logfile support, all debug messages go there
> - Much more debug messages
> - Error handling/logging (Thanks to Ron Bender for pointing that out)
> - Added the necessary priority line to the cf file. (Thanks to Mark
>   Martinec and others for reminding me about that)
>
> Please note that this is a beta... so you should probably try it out
> in non-production environments first before blaming me ;D


Hi Chris

Wanted to report back - it's all working nicely and smoothly so far.
And thanks to your plugin, an onslaught of about 30 image spams within
one minute was blocked efficiently yesterday. Especially with my much
extended wordlist, most of them get blocked accurately - my only concern
is the varying results from gocr, which nobody has been able to help me
with. I've tried 3 different gocr 0.40 versions and none seems to be as
good as yours... you don't have the source to your version somewhere so
I could try yours??


I've got one request though: When announcing a new version, could you
post a new subject instead of replying to the old one, maybe with a
subject "FuzzyOcr v2.2-beta1 released" ? In my thread sorted view I
always have to go look for a message in an old thread...


btw, I've subscribed to your list, though I feel general discussion about
your plugin should be done here whereas support inquiries and that stuff
can nicely fit on the separate one...


Matt
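
The fuzzy matching and dynamic scoring described in the quoted announcement can be illustrated with a small sketch. The plugin's real distance measure and score values are not shown in this thread, so plain Levenshtein distance, the 0.25 threshold, and the `base`, `per_extra_word` and `min_words` values below are stand-in assumptions.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_matches(word, line, threshold=0.25):
    """True if some window of `line` is within `threshold` relative distance of `word`."""
    word, line = word.lower(), line.lower()
    width = len(word)
    if width == 0:
        return False
    windows = [line[i:i + width] for i in range(max(1, len(line) - width + 1))]
    best = min(edit_distance(word, window) for window in windows)
    # Dividing by the word length makes the tolerance relative, not absolute.
    return best / width <= threshold


def fuzzy_score(ocr_lines, wordlist, base=4.0, per_extra_word=1.0, min_words=2):
    """Dynamic score: nothing below `min_words` hits, then more for each additional hit."""
    hits = sum(1 for word in wordlist
               if any(word_matches(word, line) for line in ocr_lines))
    if hits < min_words:
        return 0.0
    return base + per_extra_word * (hits - min_words)
```

With the GOCR output quoted elsewhere in this archive, `cRITIcAE INvEsToR AEERT!.`, the word "alert" still matches because "aeert" is one substitution away, a relative distance of 0.2.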


Re: Improved OCR Plugin with approximate matching

2006-08-17 Thread decoder

decoder wrote:
> Hello there,
>
> I have improved the original OcrPlugin (found at
> http://wiki.apache.org/spamassassin/OcrPlugin), so it contains
> fuzzy matching. Like that, mistakes made by the OCR recognition or
> intentional obfuscations in the text don't make the recognition
> impossible. This is being done with a relative distance calculation
>  between the pattern (word from a given word list) and a line in
> the recognized input. Also, the plugin uses dynamic scoring (more
> matched words means more score, this can be adjusted in the
> source).
>
> You can find a full description and an example in the wiki under:
>
> http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
>
>
> Ideas for improvements or criticism are always welcome :)
>
>
> Best regards,
>
>
> Chris

A new beta is available (2.2-beta1).

It includes a bugfix for a bug with jpeg content-types reported by
Matthias Keller. Other changes:

- Debug file stuff removed, instead of that, the tempfiles don't get
  deleted when in debug mode (verbose > 1).
- Logfile support, all debug messages go there
- Much more debug messages
- Error handling/logging (Thanks to Ron Bender for pointing that out)
- Added the necessary priority line to the cf file. (Thanks to Mark
  Martinec and others for reminding me about that)

Please note that this is a beta... so you should probably try it out
in non-production environments first before blaming me ;D

Chris



Re: Improved OCR Plugin with approximate matching

2006-08-13 Thread decoder

decoder wrote:
> Hello there,
>
> I have improved the original OcrPlugin (found at
> http://wiki.apache.org/spamassassin/OcrPlugin), so it contains
> fuzzy matching. Like that, mistakes made by the OCR recognition or
> intentional obfuscations in the text don't make the recognition
> impossible. This is being done with a relative distance calculation
>  between the pattern (word from a given word list) and a line in
> the recognized input. Also, the plugin uses dynamic scoring (more
> matched words means more score, this can be adjusted in the
> source).
>
> You can find a full description and an example in the wiki under:
>
> http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
>
>
> Ideas for improvements or criticism are always welcome :)
>
>
> Best regards,
>
>
> Chris

Hello there,

I've just released version 2.1c, which fixes problems when using
Spamassassin + Mailscanner (score is always 1.0).

Thanks for this bug report and patch to Howard Kash.

Other (minor) changes:

- Fixed a typo (treshold -> threshold); if you are using this variable
  in your config, you need to fix it there too.
- Removed the '-' from jpegtopnm arguments to provide backwards
  compatibility with older netpbm (as someone else mentioned here before)

The updated version can be found at the usual download URL (see the
spamassassin wiki under FuzzyOcr)


Best regards

Christian



Re: Improved OCR Plugin with approximate matching

2006-08-10 Thread Theo Van Dinter
On Thu, Aug 10, 2006 at 10:55:30AM -0700, Dave . wrote:
>  foreach my $p ( $pms->{msg}->find_parts("image") ) {

>Does this mean the message must have the text "image" and/or "image/gif"
>within the body? Many of the "penny stock" spam gifs I get appear as follows:

Generally speaking, RTM (Mail::SpamAssassin::Message and
Mail::SpamAssassin::Message::Node). :)

In short, it says to find all the parts with "image" in the content-type
header.  (for various reasons, I'd change '"image"' to '[EMAIL PROTECTED]/@i', 
btw.)

>   my ( $ctype, $boundary, $charset, $name ) =
> Mail::SpamAssassin::Util::parse_content_type(
>  $p->get_header('content-type') );
>   if ( $ctype eq "image/gif" ) {

Unless you need the other values, drop the first function call and just use
$p->{'type'} instead of $ctype.  It's already parsed out.

-- 
Randomly Generated Tagline:
"L: Well... Do you have any kids?
  T: No...
  L: Oh.  Well, do you have any grandkids?
  T: Ummm  ... No ..."
 - Telephone saleswoman trying to sell me a family portrait
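
To illustrate the point that the check looks at each MIME part's Content-Type header rather than at text in the body, here is an analogous sketch using Python's standard `email` package. It is not the SpamAssassin API (`find_parts` and `$p->{'type'}` are Perl); the function name `image_parts` is made up for this example.

```python
from email import message_from_bytes


def image_parts(raw_message):
    """Yield (content_type, decoded_bytes) for every MIME part whose type is image/*."""
    message = message_from_bytes(raw_message)
    for part in message.walk():
        # The library has already parsed and lower-cased the content type,
        # so there is no need to parse the header a second time.
        if part.get_content_maintype() == "image":
            yield part.get_content_type(), part.get_payload(decode=True)
```

An HTML body that only references the picture through a `cid:` URL is still found this way, because the image itself travels as a separate part with its own `image/gif` header.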




Re: Improved OCR Plugin with approximate matching

2006-08-10 Thread decoder

Dave . wrote:
> Given the code from Ocr.pm:
>
> ---
> foreach my $p ( $pms->{msg}->find_parts("image") ) {
>   my ( $ctype, $boundary, $charset, $name ) =
>     Mail::SpamAssassin::Util::parse_content_type(
>       $p->get_header('content-type') );
>   if ( $ctype eq "image/gif" ) {
>     open OCR, "|/usr/bin/convert - pnm:-|/usr/bin/gocr -i - > /tmp/spamassassin.ocr.$$";
>     foreach $p ( $p->decode() ) {
>       print OCR $p;
> ---
>
> Does this mean the message must have the text "image" and/or
> "image/gif" within the body? Many of the "penny stock" spam gifs I get
> appear as follows:
>
> Should {Fuzzy}OcrPlugin be able to catch this?
>
>
>  src="cid:000b01c6bc94$a534f210$9ab30b3b@hniyb.kwf" align=baseline
> border=0>
>

First of all, you are using an old version of the plugin, please
upgrade it :)

Secondly, could you send me one or two examples of these mails? Maybe
I can extend the plugin to catch more tricks like this that spammers
try to avoid detection :) The word "image" is supposed to be in the
body, when the content-type of an attachment (image) is specified.


Chris



RE: Improved OCR Plugin with approximate matching

2006-08-10 Thread Dave .
Given the code from Ocr.pm:

---
foreach my $p ( $pms->{msg}->find_parts("image") ) {
  my ( $ctype, $boundary, $charset, $name ) =
    Mail::SpamAssassin::Util::parse_content_type(
      $p->get_header('content-type') );
  if ( $ctype eq "image/gif" ) {
    open OCR, "|/usr/bin/convert - pnm:-|/usr/bin/gocr -i - > /tmp/spamassassin.ocr.$$";
    foreach $p ( $p->decode() ) {
      print OCR $p;
---

Does this mean the message must have the text "image" and/or "image/gif"
within the body? Many of the "penny stock" spam gifs I get appear as follows:

Should {Fuzzy}OcrPlugin be able to catch this?

 src="cid:000b01c6bc94$a534f210$9ab30b3b@hniyb.kwf" align=baseline border=0>
	

Re: Improved OCR Plugin with approximate matching

2006-08-10 Thread amosch . security
On Tue, Aug 08, 2006 at 12:43:24AM +0200, decoder wrote:
> 
> You can find a full description and an example in the wiki under:
> 
> http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
> 
> 
> Ideas for improvements or criticism are always welcome :)
> 
> 

Hi, 

First, thanks for working on such a great plugin!

I have to make this adjustment to the jpegtopnm call to get it to work
with jpeg files:

< open IMAGE_PROCESSOR, "|/usr/bin/jpegtopnm - |/usr/bin/gocr -i - > $tempfile";
---
> open IMAGE_PROCESSOR, "|/usr/bin/jpegtopnm |/usr/bin/gocr -i - > $tempfile";

Hope this helps..

---

Yifang


Re: Improved OCR Plugin with approximate matching

2006-08-10 Thread decoder

Bill Landry wrote:
> - Original Message - From: "Spamassassin List"
> <[EMAIL PROTECTED]>
> To: 
> Sent: Wednesday, August 09, 2006 2:26 PM
> Subject: Re: Improved OCR Plugin with approximate matching
>
>>> Spamassassin List wrote:
>>>>>> decoder wrote:
>>>>>>
>>>>>> See http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
>>>>>>
>>>>>> Major changes: Replaced imagemagick with netpbm, support png,
>>>>>> invoked giffix for broken gifs, detect image format with magic
>>>>>> bytes and not by content-type, added various configuration
>>>>>> options.
>>>>
>>>> I install the above plugin, and i keep getting the same error.
>>>>
>>>> [EMAIL PROTECTED] spamtest]# spamassassin -t < spam-gif-1.txt
>>>> sh: /usr/bin/giffix: No such file or directory
>>>> giftopnm: error reading magic number
>>>> (null): Error reading magic number from Netpbm image stream.  Most
>>>> often, this means your input file is empty.
>>>> sh: /usr/bin/giffix: No such file or directory
>>>> giftopnm: error reading magic number
>>>> (null): Error reading magic number from Netpbm image stream.  Most
>>>> often, this means your input file is empty.
>>>> sh: /usr/bin/giffix: No such file or directory
>>>> giftopnm: error reading magic number
>>>> (null): Error reading magic number from Netpbm image stream.  Most
>>>> often, this means your input file is empty.
>>>>
>>>> I notice the error occur when the attachment is gif format.
>>>>
>>>
>>> You are missing a tool. It is called "giffix" and part of the
>>> "giflib"
>>> package. Without it, the plugin can't fix broken gifs to analyze
>>> them.
>>> Install giflib.
>>>
>>
>> I did a yum install giflib, but it install another package. What is
>> the package for yum?
>>
>> libungif   i386   4.1.3-3.fc4.2   updates-released   39 k
>
> Check that the path is correct to giffix.  I had to change the
> following line in the FuzzyOcr.pm from:
>
> open OCR, "|/usr/bin/convert - pnm:-|/usr//bin/gocr -i - >
> /tmp/spamassassin.focr.$$";
>
>to
>
> open OCR, "|/usr/local/bin/convert - pnm:-|/usr/local/bin/gocr -i -
> > /tmp/spamassassin.focr.$$";
>
> Chris, you may want to consider adding an editable path statement to
> the top of the plug-in file rather than using hard-coded paths that
> do not match everyone else's file locations.
>
> Bill
>

Noted for the next release :)

Chris
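
The change "detect image format with magic bytes and not by content-type" quoted in this message can be sketched as follows. This is an illustration of the general technique rather than the plugin's own code; the function name and the returned labels are arbitrary, while the byte signatures are the standard ones for GIF, PNG and JPEG.

```python
SIGNATURES = (
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
)


def sniff_image_format(data):
    """Return 'gif', 'png', 'jpeg', or None based on the leading bytes of `data`."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None
```

Sniffing the payload means a GIF that is mislabeled as image/jpeg in the Content-Type header is still handed to the right converter.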



Re: Improved OCR Plugin with approximate matching

2006-08-10 Thread Mathias Tauber
> > yum install libungif* will get both libungif and libungif-progs (which
> > contains giffix)


I'm using Debian (Sarge), and I think libungif-bin is the better package here.
giflib-bin additionally wants to install the packages libx11-6, xfree86-common
and xlibs-data, which means 10MB more than installing libungif-bin.


Mathias


RE: Improved OCR Plugin with approximate matching

2006-08-09 Thread Rick Cooper


> -Original Message-
> From: decoder [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 09, 2006 5:31 PM
> To: Spamassassin List; users@spamassassin.apache.org
> Subject: Re: Improved OCR Plugin with approximate matching
>
>
[snip]
>
> According to google, libungif seems correct for yum... If the giffix
> binary still isn't present, try installing giflib from source.. that
> isn't a big deal
>
> Chris

yum install libungif* will get both libungif and libungif-progs (which
contains giffix)

Rick






Re: Improved OCR Plugin with approximate matching

2006-08-09 Thread Bill Landry
- Original Message - 
From: "Spamassassin List" <[EMAIL PROTECTED]>

To: 
Sent: Wednesday, August 09, 2006 2:26 PM
Subject: Re: Improved OCR Plugin with approximate matching


>> Spamassassin List wrote:
>>>>> decoder wrote:
>>>>>
>>>>> See http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
>>>>>
>>>>> Major changes: Replaced imagemagick with netpbm, support png,
>>>>> invoked giffix for broken gifs, detect image format with magic
>>>>> bytes and not by content-type, added various configuration
>>>>> options.
>>>
>>> I install the above plugin, and i keep getting the same error.
>>>
>>> [EMAIL PROTECTED] spamtest]# spamassassin -t < spam-gif-1.txt
>>> sh: /usr/bin/giffix: No such file or directory
>>> giftopnm: error reading magic number
>>> (null): Error reading magic number from Netpbm image stream.  Most
>>> often, this means your input file is empty.
>>> sh: /usr/bin/giffix: No such file or directory
>>> giftopnm: error reading magic number
>>> (null): Error reading magic number from Netpbm image stream.  Most
>>> often, this means your input file is empty.
>>> sh: /usr/bin/giffix: No such file or directory
>>> giftopnm: error reading magic number
>>> (null): Error reading magic number from Netpbm image stream.  Most
>>> often, this means your input file is empty.
>>>
>>> I notice the error occur when the attachment is gif format.
>>
>> You are missing a tool. It is called "giffix" and part of the "giflib"
>> package. Without it, the plugin can't fix broken gifs to analyze them.
>> Install giflib.
>
> I did a yum install giflib, but it install another package. What is the
> package for yum?
>
> libungif   i386   4.1.3-3.fc4.2   updates-released   39 k


Check that the path is correct to giffix.  I had to change the following 
line in the FuzzyOcr.pm from:


open OCR, "|/usr/bin/convert - pnm:-|/usr//bin/gocr -i - > 
/tmp/spamassassin.focr.$$";


   to

open OCR, "|/usr/local/bin/convert - pnm:-|/usr/local/bin/gocr -i - > 
/tmp/spamassassin.focr.$$";


Chris, you may want to consider adding an editable path statement to the top
of the plug-in file rather than using hard-coded paths that do not match
everyone else's file locations.


Bill
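
The suggestion above, one editable block of paths instead of paths hard-coded into every `open` call, might look like this. It is a sketch of the idea only: the `TOOL_PATHS` table and the `resolve_tools` helper are hypothetical names, and falling back to a PATH search via `shutil.which` is an added assumption, not something the plugin is known to do.

```python
import shutil

# Hypothetical defaults; an administrator would edit these in one place.
TOOL_PATHS = {
    "giffix": None,                   # None means "search PATH"
    "giftopnm": None,
    "gocr": "/usr/local/bin/gocr",    # explicit location overrides the search
}


def resolve_tools(configured, which=shutil.which):
    """Map each tool name to an executable path, and list the ones that are missing."""
    found, missing = {}, []
    for name, path in configured.items():
        resolved = which(path or name)
        if resolved:
            found[name] = resolved
        else:
            missing.append(name)
    return found, missing
```

Running such a check once at startup would turn the "No such file or directory" errors seen in this thread into a single clear message naming the missing tool, for example giffix.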



Re: Improved OCR Plugin with approximate matching

2006-08-09 Thread decoder

Spamassassin List wrote:
>> Spamassassin List wrote:
>>>> decoder wrote:
>>>>
>>>> See http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
>>>>
>>>> Major changes: Replaced imagemagick with netpbm, support
>>>> png, invoked giffix for broken gifs, detect image format
>>>> with magic bytes and not by content-type, added various
>>>> configuration options.
>>>
>>> I install the above plugin, and i keep getting the same error.
>>>
>>> [EMAIL PROTECTED] spamtest]# spamassassin -t < spam-gif-1.txt
>>> sh: /usr/bin/giffix: No such file or directory
>>> giftopnm: error reading magic number
>>> (null): Error reading magic number from Netpbm image stream.  Most
>>> often, this means your input file is empty.
>>> sh: /usr/bin/giffix: No such file or directory
>>> giftopnm: error reading magic number
>>> (null): Error reading magic number from Netpbm image stream.  Most
>>> often, this means your input file is empty.
>>> sh: /usr/bin/giffix: No such file or directory
>>> giftopnm: error reading magic number
>>> (null): Error reading magic number from Netpbm image stream.  Most
>>> often, this means your input file is empty.
>>>
>>> I notice the error occur when the attachment is gif format.
>>>
>>
>> You are missing a tool. It is called "giffix" and part of the
>> "giflib" package. Without it, the plugin can't fix broken gifs to
>> analyze them. Install giflib.
>>
>
> I did a yum install giflib, but it install another package. What is
>  the package for yum?
>
> libungif   i386   4.1.3-3.fc4.2   updates-released   39 k

According to google, libungif seems correct for yum... If the giffix
binary still isn't present, try installing giflib from source.. that
isn't a big deal

Chris



Re: Improved OCR Plugin with approximate matching

2006-08-09 Thread Spamassassin List

Spamassassin List wrote:

decoder wrote:

See http://wiki.apache.org/spamassassin/FuzzyOcrPlugin

Major changes: Replaced imagemagick with netpbm, support png,
invoked giffix for broken gifs, detect image format with magic
bytes and not by content-type, added various configuration
options.


I installed the above plugin, and I keep getting the same error.

[EMAIL PROTECTED] spamtest]# spamassassin -t < spam-gif-1.txt
sh: /usr/bin/giffix: No such file or directory
giftopnm: error reading magic number
(null): Error reading magic number from Netpbm image stream.  Most often,
this means your input file is empty.

sh: /usr/bin/giffix: No such file or directory
giftopnm: error reading magic number
(null): Error reading magic number from Netpbm image stream.  Most often,
this means your input file is empty.

sh: /usr/bin/giffix: No such file or directory
giftopnm: error reading magic number
(null): Error reading magic number from Netpbm image stream.  Most often,
this means your input file is empty.

I notice the error occurs when the attachment is in GIF format.



You are missing a tool. It is called "giffix" and part of the "giflib"
package. Without it, the plugin can't fix broken gifs to analyze them.
Install giflib.



I did a yum install giflib, but it installed another package. What is the
package name for yum?


libungif   i386   4.1.3-3.fc4.2   updates-released   39 k



Re: Improved OCR Plugin with approximate matching

2006-08-09 Thread decoder

Spamassassin List wrote:
>>> decoder wrote:
>>>
>>> See http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
>>>
>>> Major changes: Replaced imagemagick with netpbm, support png,
>>> invoked giffix for broken gifs, detect image format with magic
>>> bytes and not by content-type, added various configuration
>>> options.
>
> I installed the above plugin, and I keep getting the same error.
>
> [EMAIL PROTECTED] spamtest]# spamassassin -t < spam-gif-1.txt
> sh: /usr/bin/giffix: No such file or directory
> giftopnm: error reading magic number
> (null): Error reading magic number from Netpbm image stream.  Most
> often, this means your input file is empty.
>
> sh: /usr/bin/giffix: No such file or directory
> giftopnm: error reading magic number
> (null): Error reading magic number from Netpbm image stream.  Most
> often, this means your input file is empty.
>
> sh: /usr/bin/giffix: No such file or directory
> giftopnm: error reading magic number
> (null): Error reading magic number from Netpbm image stream.  Most
> often, this means your input file is empty.
>
> I notice the error occurs when the attachment is in GIF format.
>

You are missing a tool. It is called "giffix" and part of the "giflib"
package. Without it, the plugin can't fix broken gifs to analyze them.
Install giflib.

Chris




Re: Improved OCR Plugin with approximate matching

2006-08-09 Thread Spamassassin List

decoder wrote:

See http://wiki.apache.org/spamassassin/FuzzyOcrPlugin

Major changes: Replaced imagemagick with netpbm, support png, invoked
giffix for broken gifs, detect image format with magic bytes and not
by content-type, added various configuration options.


I installed the above plugin, and I keep getting the same error.

[EMAIL PROTECTED] spamtest]# spamassassin -t < spam-gif-1.txt
sh: /usr/bin/giffix: No such file or directory
giftopnm: error reading magic number
(null): Error reading magic number from Netpbm image stream.  Most often, 
this means your input file is empty.

sh: /usr/bin/giffix: No such file or directory
giftopnm: error reading magic number
(null): Error reading magic number from Netpbm image stream.  Most often, 
this means your input file is empty.

sh: /usr/bin/giffix: No such file or directory
giftopnm: error reading magic number
(null): Error reading magic number from Netpbm image stream.  Most often, 
this means your input file is empty.


I notice the error occurs when the attachment is in GIF format.



Re: Improved OCR Plugin with approximate matching

2006-08-09 Thread decoder

Expertsites, Inc. wrote:
>> decoder wrote:
>>
>> See http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
>>
>> Major changes: Replaced imagemagick with netpbm, support png, invoked
>> giffix for broken gifs, detect image format with magic bytes and not
>> by content-type, added various configuration options.
>>
>> Feedback is welcome  :)
>>
>> Chris
>
> Since installation yesterday, my system hit FUZZY_OCR in 204
> messages.  One scored 18, ten scored in the 20s and the rest
> between 30 and 83.  Scan time ran between 6.4 and 16.6 seconds per
> message.  I'm using a ton of SARE rules on a RHE server, dual Xeon
> 2.4 GHz with 2 GB RAM.
>
> If OCR is processor/memory intensive, could it be configured to kick
> in for lower scoring messages only?
>
> Tom Green
> --
> Expertsites, Inc.
>

You mean, run the OCR only if the message doesn't already have xy
points? That should be fairly easy, as long as the tests that score
the message run before the OCR test. :)

I will implement that asap.
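
The idea of skipping OCR for messages that already score high can be sketched as below. This is Python for illustration only, not the plugin's Perl. The function name, the `run_ocr` callable and the cutoff of 10.0 are all assumptions, and it only works if the other tests have produced a score before the OCR step runs.

```python
def score_with_optional_ocr(score_so_far, run_ocr, skip_at=10.0):
    """Return (final_score, ocr_ran).

    `run_ocr` is any zero-argument callable that returns the OCR score.
    It is only called while the message is still below `skip_at`, so the
    expensive image conversion is avoided for obvious spam.
    """
    if score_so_far >= skip_at:
        return score_so_far, False
    return score_so_far + run_ocr(), True
```

Passing the OCR step in as a callable keeps the expensive work lazy: nothing is converted or recognized unless the cutoff check passes.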

Chris



Re: Improved OCR Plugin with approximate matching

2006-08-09 Thread Expertsites, Inc.

decoder wrote:

See http://wiki.apache.org/spamassassin/FuzzyOcrPlugin

Major changes: Replaced imagemagick with netpbm, support png, invoked
giffix for broken gifs, detect image format with magic bytes and not
by content-type, added various configuration options.

Feedback is welcome  :)

Chris


Since installation yesterday, my system hit FUZZY_OCR in 204 messages.  One
scored 18, ten scored in the 20s and the rest between 30 and 83.  Scan time
ran between 6.4 and 16.6 seconds per message.  I'm using a ton of SARE rules
on a RHE server, dual Xeon 2.4 GHz with 2 GB RAM.


If OCR is processor/memory intensive, could it be configured to kick in for 
lower scoring messages only?


Tom Green
--
Expertsites, Inc. 





Re: Improved OCR Plugin with approximate matching

2006-08-08 Thread decoder

decoder wrote:
> Hello there,
>
> I have improved the original OcrPlugin (found at
> http://wiki.apache.org/spamassassin/OcrPlugin) so that it supports
> fuzzy matching. That way, OCR mistakes or intentional obfuscations
> in the text don't make recognition impossible. This is done with a
> relative distance calculation between the pattern (a word from a
> given word list) and a line in the recognized input. Also, the
> plugin uses dynamic scoring (more matched words mean a higher
> score; this can be adjusted in the source).
>
> You can find a full description and an example in the wiki under:
>
> http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
>
>
> Ideas for improvements or criticism are always welcome :)
>
>
> Best regards,
>
>
> Chris

See http://wiki.apache.org/spamassassin/FuzzyOcrPlugin

Major changes: Replaced imagemagick with netpbm, support png, invoked
giffix for broken gifs, detect image format with magic bytes and not
by content-type, added various configuration options.
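
For readers unfamiliar with magic-byte detection, a minimal Python sketch follows. It is an illustration, not the plugin's Perl code, and `sniff_image_type` is a hypothetical name. The byte signatures themselves are the standard ones for GIF, JPEG and PNG.

```python
def sniff_image_type(data):
    """Guess the image format from the leading bytes, ignoring Content-Type.

    Returns "gif", "jpeg", "png", or None if the signature is unknown.
    """
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    return None
```

Because the check looks only at the attachment's own bytes, a GIF sent with a wrong or misleading Content-Type header would still be handed to the GIF converter.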

Feedback is welcome  :)

Chris



Re: Improved OCR Plugin with approximate matching

2006-08-08 Thread John D. Hardin
On Tue, 8 Aug 2006, decoder wrote:

> I only wanted to add a small note: I recently saw gifs that cannot be
> converted using imagemagick because they are either sloppily generated
> or intentionally partly corrupted. Please think about using giftopnm
> and jpegtopnm instead. If you have a better idea, tell me.
> 
> giftopnm: Extraneous data at end of image.  Skipped to end of image
> giftopnm: bogus character 0x4f, ignoring
> giftopnm: bogus character 0xa7, ignoring
> giftopnm: bogus character 0xc0, ignoring
> giftopnm: bogus character 0x8a, ignoring
> giftopnm: Unable to read Color 33 from colormap

Add a few points for the fact that it is a corrupt GIF?

--
 John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]    FALaholic #11174    pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  In 1998 more than three times as many people in the US were killed
  by incompetent physicians than were killed by handguns, yet the
  President of the A.M.A. is adopting "gun safety" as his platform.
---



Re: Improved OCR Plugin with approximate matching

2006-08-08 Thread Marc Perkel




Perhaps corrupted gifs should be treated as spam?

decoder wrote:

Hello again,


I only wanted to add a small note: I recently saw gifs that cannot be
converted using imagemagick because they are either sloppily generated
or intentionally partly corrupted. Please think about using giftopnm
and jpegtopnm instead. If you have a better idea, tell me.

To use giftopnm and jpegtopnm, change the code from:

  if (($ctype eq "image/gif") || ($ctype eq "image/jpeg")) {
      open OCR, "|/usr/bin/convert - pnm:-|/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";

to:

  if (($ctype eq "image/gif") || ($ctype eq "image/jpeg")) {
      if ($ctype eq "image/gif") {
          open OCR, "|/usr/bin/giftopnm - |/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";
      } else {
          open OCR, "|/usr/bin/jpegtopnm - |/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";
      }


Note that with imagemagick, things can get really bad. I experienced a
highly increased time to convert (about 30 seconds and then an error
message from imagemagick for a 7kb gif file). So I really advise you
to change the code to use different tools. These will also complain,
for example:

giftopnm: Extraneous data at end of image.  Skipped to end of image
giftopnm: bogus character 0x4f, ignoring
giftopnm: bogus character 0xa7, ignoring
giftopnm: bogus character 0xc0, ignoring
giftopnm: bogus character 0x8a, ignoring
giftopnm: Unable to read Color 33 from colormap

But it still continues and the text gets recognized correctly.


Best regards,

Chris


  





Re: Improved OCR Plugin with approximate matching

2006-08-08 Thread decoder

Matthias Keller wrote:
> decoder wrote:
>> Hello there,
>>
>> I have improved the original OcrPlugin (found at
>> http://wiki.apache.org/spamassassin/OcrPlugin) so that it supports
>> fuzzy matching. That way, OCR mistakes or intentional obfuscations
>> in the text don't make recognition impossible. This is done with a
>> relative distance calculation between the pattern (a word from a
>> given word list) and a line in the recognized input. Also, the
>> plugin uses dynamic scoring (more matched words mean a higher
>> score; this can be adjusted in the source).
>>
>> You can find a full description and an example in the wiki under:
>>
>>
>> http://wiki.apache.org/spamassassin/FuzzyOcrPlugin
>>
>>
>> Ideas for improvements or criticism are always welcome :)
>>
> Hi
>
> Could this plugin be extended to support png images? I receive
> quite a few of them... I guess it's probably just a line or two in
> addition to the jpg and gif. Also, might it be a good idea not to
> trust the content-type but instead use file or another 'detection
> utility'? As mentioned on the original OcrPlugin page, weren't
> gif2pnm and jpg2pnm abandoned because content types are sometimes
> wrong?
>
>
> Matt

That is a good idea... I will try to implement the file command
somewhere to make sure we are really using the correct tool to
convert. I explicitly use giftopnm and jpegtopnm here (from netpbm)
because, as I mentioned in an earlier reply, I received some gifs
which are corrupt, and these cause convert from imagemagick to drain
CPU for 30 seconds and more without any result... so one should really
avoid imagemagick here.

In the latest version I am working on, I invoke giffix (from the
giflib) to fix these gifs before converting them with giftopnm...
Adding png support will not be hard, I will put it on the todo list.

I will post a new version in the wiki and announce it here as soon as
I am finished. :)


Thanks for the input :)

Chris



Re: Improved OCR Plugin with approximate matching

2006-08-08 Thread Matthias Keller

decoder wrote:

Hello there,

I have improved the original OcrPlugin (found at
http://wiki.apache.org/spamassassin/OcrPlugin) so that it supports fuzzy
matching. That way, OCR mistakes or intentional obfuscations in the text
don't make recognition impossible. This is done with a relative distance
calculation between the pattern (a word from a given word list) and a
line in the recognized input. Also, the plugin uses dynamic scoring (more
matched words mean a higher score; this can be adjusted in the source).

You can find a full description and an example in the wiki under:

http://wiki.apache.org/spamassassin/FuzzyOcrPlugin


Ideas for improvements or criticism are always welcome :)
  

Hi

Could this plugin be extended to support png images?
I receive quite a few of them...
I guess it's probably just a line or two in addition to the jpg and gif.
Also, might it be a good idea not to trust the content-type but instead
use file or another 'detection utility'? As mentioned on the original
OcrPlugin page, weren't gif2pnm and jpg2pnm abandoned because content
types are sometimes wrong?



Matt


Re: Improved OCR Plugin with approximate matching

2006-08-08 Thread decoder

Hello again,


I only wanted to add a small note: I recently saw gifs that cannot be
converted using imagemagick because they are either sloppily generated
or intentionally partly corrupted. Please think about using giftopnm
and jpegtopnm instead. If you have a better idea, tell me.

To use giftopnm and jpegtopnm, change the code from:

  if (($ctype eq "image/gif") || ($ctype eq "image/jpeg")) {
      open OCR, "|/usr/bin/convert - pnm:-|/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";

to:

  if (($ctype eq "image/gif") || ($ctype eq "image/jpeg")) {
      if ($ctype eq "image/gif") {
          open OCR, "|/usr/bin/giftopnm - |/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";
      } else {
          open OCR, "|/usr/bin/jpegtopnm - |/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";
      }


Note that with imagemagick, things can get really bad. I experienced a
highly increased time to convert (about 30 seconds and then an error
message from imagemagick for a 7kb gif file). So I really advise you
to change the code to use different tools. These will also complain,
for example:

giftopnm: Extraneous data at end of image.  Skipped to end of image
giftopnm: bogus character 0x4f, ignoring
giftopnm: bogus character 0xa7, ignoring
giftopnm: bogus character 0xc0, ignoring
giftopnm: bogus character 0x8a, ignoring
giftopnm: Unable to read Color 33 from colormap

But it still continues and the text gets recognized correctly.
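
Since a stuck converter can hold up a scan for a long time, one option is to run each external tool with a hard time limit. The sketch below is in Python rather than the plugin's Perl. The command paths and arguments are the ones quoted above; the function names and the 10-second default are assumptions.

```python
import subprocess

# Converter commands as quoted in this thread ("-" means read stdin).
CONVERTERS = {
    "gif": ["/usr/bin/giftopnm", "-"],
    "jpeg": ["/usr/bin/jpegtopnm", "-"],
}
GOCR = ["/usr/bin/gocr", "-i", "-"]


def run_with_timeout(cmd, data, timeout=10):
    """Feed `data` to `cmd` on stdin and return its stdout as bytes.

    Returns None if the tool is missing, cannot be started, runs longer
    than `timeout` seconds, or produces no output. The exit status is
    deliberately not checked, because the converters may warn about a
    damaged image and still emit usable data.
    """
    try:
        result = subprocess.run(
            cmd,
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    return result.stdout or None


def ocr_text(kind, image_bytes, timeout=10):
    """Convert an image to PNM, then run gocr on it; None on any failure."""
    pnm = run_with_timeout(CONVERTERS[kind], image_bytes, timeout)
    if pnm is None:
        return None
    text = run_with_timeout(GOCR, pnm, timeout)
    return text.decode(errors="replace") if text is not None else None
```

Running the two stages separately, instead of as one shell pipeline, also avoids the temporary file and makes it possible to tell which stage failed.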


Best regards,

Chris



Re: Improved OCR Plugin with approximate matching

2006-08-07 Thread jdow

From: "uNiXpSyChO" <[EMAIL PROTECTED]>


decoder wrote:

Hello there,

I have improved the original OcrPlugin (found at
http://wiki.apache.org/spamassassin/OcrPlugin) so that it supports fuzzy
matching. That way, OCR mistakes or intentional obfuscations in the text
don't make recognition impossible. This is done with a relative distance
calculation between the pattern (a word from a given word list) and a
line in the recognized input. Also, the plugin uses dynamic scoring (more
matched words mean a higher score; this can be adjusted in the source).

You can find a full description and an example in the wiki under:

http://wiki.apache.org/spamassassin/FuzzyOcrPlugin


Ideas for improvements or criticism are always welcome :)



seems to work... but i never see a score about 1.00.

the docs say the default score is 4.  did i miss something?


You probably never amended your local.cf or equivalent with the
score for the rule. So it gets the default score of 1.

{^_^}


Re: Improved OCR Plugin with approximate matching

2006-08-07 Thread uNiXpSyChO




seems to work... but i never see a score about 1.00.

the docs say the default score is 4.  did i miss something?


above 1.00 i meant.



Re: Improved OCR Plugin with approximate matching

2006-08-07 Thread uNiXpSyChO

decoder wrote:

Hello there,

I have improved the original OcrPlugin (found at
http://wiki.apache.org/spamassassin/OcrPlugin) so that it supports fuzzy
matching. That way, OCR mistakes or intentional obfuscations in the text
don't make recognition impossible. This is done with a relative distance
calculation between the pattern (a word from a given word list) and a
line in the recognized input. Also, the plugin uses dynamic scoring (more
matched words mean a higher score; this can be adjusted in the source).

You can find a full description and an example in the wiki under:

http://wiki.apache.org/spamassassin/FuzzyOcrPlugin


Ideas for improvements or criticism are always welcome :)



seems to work... but i never see a score about 1.00.

the docs say the default score is 4.  did i miss something?



Improved OCR Plugin with approximate matching

2006-08-07 Thread decoder

Hello there,

I have improved the original OcrPlugin (found at
http://wiki.apache.org/spamassassin/OcrPlugin) so that it supports fuzzy
matching. That way, OCR mistakes or intentional obfuscations in the text
don't make recognition impossible. This is done with a relative distance
calculation between the pattern (a word from a given word list) and a
line in the recognized input. Also, the plugin uses dynamic scoring (more
matched words mean a higher score; this can be adjusted in the source).
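
To make the idea concrete, here is a rough Python sketch of matching by relative distance. It is not the plugin's code (which is Perl), and it assumes Levenshtein edit distance divided by the word length as the "relative distance", which the message above does not specify. The function names and the 0.25 threshold are likewise assumptions.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def line_matches(word, line, max_rel_distance=0.25):
    """True if some word-sized window of `line` is close enough to `word`."""
    word, line = word.lower(), line.lower()
    n = len(word)
    if n == 0:
        return False
    # Slide a window of the word's length across the recognized line.
    windows = [line[i:i + n] for i in range(max(1, len(line) - n + 1))]
    best = min(levenshtein(word, w) for w in windows)
    return best / n <= max_rel_distance


def count_hits(words, lines, max_rel_distance=0.25):
    """Number of list words found in at least one recognized line."""
    return sum(
        any(line_matches(w, l, max_rel_distance) for l in lines) for w in words
    )
```

With these settings an obfuscated "v1agra" still matches "viagra", because one substitution in six characters is below the threshold. A dynamic score could then grow with the value returned by `count_hits`.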

You can find a full description and an example in the wiki under:

http://wiki.apache.org/spamassassin/FuzzyOcrPlugin


Ideas for improvements or criticism are always welcome :)


Best regards,


Chris



OCR

2006-08-07 Thread Filbert
Hi,

I'm planning to test the OCR module in SA very soon.

I was wondering if other (commercial) anti-spam products already have an OCR
module built in?

Thx
F.

