Re: [Clamav-users] Complexity limit on (custom) signatures?

2006-10-28 Thread Henrik Krohns
On Sat, Oct 28, 2006 at 04:28:47PM -0700, Dennis Peterson wrote:
> >
> >I don't get it.. unless you have some big honeypot, maybe 5% of traffic
> >contain small images to be OCRd. If your server can't handle that, I guess
> >it's running out of juice anyway. :)
> >
> >You can even easily create separate scanning queue for OCR, so it doesn't
> >interfere with normal traffic.
> 
> You may have missed that I'm in the image industry - a great deal of 
> what we do is imagery including imagery with text in it, and as we have 
> to scan all images over a particular size, it would require more cpu 
> than is worth it.

OK, that's fair. But you probably meant scanning everything _under_ the
SpamAssassin scan size limit. By default, most software only passes whole
messages smaller than ~256kB to the scanner. I guess if you get images from
all over, you can't whitelist etc. then.

Cheers,
Henrik
___
http://lurker.clamav.net/list/clamav-users.html


Re: [Clamav-users] Complexity limit on (custom) signatures?

2006-10-28 Thread Dennis Peterson

Bill Randle wrote:
> On Sat, 2006-10-28 at 16:21 -0700, Dennis Peterson wrote:
>
> Actually, the FuzzyOCR plugin already handles animated gifs using
> various techniques to extract the hidden text. It also is able to
> decode png and jpeg files.

Ah - so it does. I hadn't looked at v. 2.3. I'll have another look.
Thanks, Bill.


dp


Re: [Clamav-users] Complexity limit on (custom) signatures?

2006-10-28 Thread Bill Randle
On Sat, 2006-10-28 at 16:21 -0700, Dennis Peterson wrote:
> Bill Randle wrote:
> > On Sat, 2006-10-28 at 16:54 -0400, Kris Deugau wrote:
> >>
> >> However, in the long run, OCR to feed the text to SpamAssassin's other
> >> rules is a better solution;  it's much more flexible.
> > 
> > Indeed. For those interested in the topic of OCR to feed SpamAssassin,
> > there's an active project with its own mailing list that does just this.
> > It turns out to be a non-trivial task because many of these image spam
> > are animated gifs, so you need to find the right frame to pass to the
> > OCR program.
> > 
> > Start here: http://wiki.apache.org/spamassassin/FuzzyOcrPlugin then
> > subscribe to the Devel-Spam mailing list (there's a link on that page).
> 
> 
> You might want to consider the next level of image spam before you go 
> too far down the OCR path:
> 
> http://www.iss.net/threats/Animated%20GIF.html

Actually, the FuzzyOCR plugin already handles animated gifs using
various techniques to extract the hidden text. It also is able to
decode png and jpeg files.

-Bill
 



Re: [Clamav-users] Complexity limit on (custom) signatures?

2006-10-28 Thread Dennis Peterson

Henrik Krohns wrote:
> On Sat, Oct 28, 2006 at 09:20:55AM -0700, Dennis Peterson wrote:
> > I've explored OCR on both color and de-colorized images and there have
> > been successes, but not enough to warrant turning it on in production. It
> > is very cpu intensive.
>
> I don't get it.. unless you have some big honeypot, maybe 5% of traffic
> contain small images to be OCRd. If your server can't handle that, I guess
> it's running out of juice anyway. :)
>
> You can even easily create separate scanning queue for OCR, so it doesn't
> interfere with normal traffic.


You may have missed that I'm in the image industry - a great deal of 
what we do is imagery including imagery with text in it, and as we have 
to scan all images over a particular size, it would require more cpu 
than is worth it. And when you consider repeating it all at a disaster 
recovery site it's starting to be a lot of computer power with a high 
false positive probability.


You cannot count on the image spam being gif as png images are showing 
up now as are jpg, and animated gifs are also out there. OCR isn't 
practical for me but may be for others for a while - at least until they 
start to use CAPTCHA technology to get around it.


dp


Re: [Clamav-users] Complexity limit on (custom) signatures?

2006-10-28 Thread Dennis Peterson

Bill Randle wrote:
> On Sat, 2006-10-28 at 16:54 -0400, Kris Deugau wrote:
> > Henrik Krohns wrote:
> > > I don't get it.. unless you have some big honeypot, maybe 5% of traffic
> > > contain small images to be OCRd. If your server can't handle that, I guess
> > > it's running out of juice anyway. :)
> >
> > Well... yeah. The basic problem is that all the other garbage
> > (with the occasional inevitable exception) is getting caught by Clam
> > (viruses and most phishes) or SpamAssassin (all but a few text-based spams).
> >
> > I've found *enough* similarities in the raw binary image data to
> > usefully make signatures for a lot of what is otherwise getting through;
> > at the moment this is just a stopgap until these machines can be retired.
> >
> > However, in the long run, OCR to feed the text to SpamAssassin's other
> > rules is a better solution; it's much more flexible.
>
> Indeed. For those interested in the topic of OCR to feed SpamAssassin,
> there's an active project with its own mailing list that does just this.
> It turns out to be a non-trivial task because many of these image spam
> are animated gifs, so you need to find the right frame to pass to the
> OCR program.
>
> Start here: http://wiki.apache.org/spamassassin/FuzzyOcrPlugin then
> subscribe to the Devel-Spam mailing list (there's a link on that page).



You might want to consider the next level of image spam before you go 
too far down the OCR path:


http://www.iss.net/threats/Animated%20GIF.html

dp


Re: [Clamav-users] Complexity limit on (custom) signatures?

2006-10-28 Thread Bill Randle
On Sat, 2006-10-28 at 16:54 -0400, Kris Deugau wrote:
> Henrik Krohns wrote:
> > I don't get it.. unless you have some big honeypot, maybe 5% of traffic
> > contain small images to be OCRd. If your server can't handle that, I guess
> > it's running out of juice anyway. :)
> 
> Well... yeah. The basic problem is that all the other garbage
> (with the occasional inevitable exception) is getting caught by Clam
> (viruses and most phishes) or SpamAssassin (all but a few text-based spams).
> 
> I've found *enough* similarities in the raw binary image data to
> usefully make signatures for a lot of what is otherwise getting through;
>  at the moment this is just a stopgap until these machines can be retired.
> 
> However, in the long run, OCR to feed the text to SpamAssassin's other
> rules is a better solution;  it's much more flexible.

Indeed. For those interested in the topic of OCR to feed SpamAssassin,
there's an active project with its own mailing list that does just this.
It turns out to be a non-trivial task because many of these image spam
are animated gifs, so you need to find the right frame to pass to the
OCR program.

Start here: http://wiki.apache.org/spamassassin/FuzzyOcrPlugin then
subscribe to the Devel-Spam mailing list (there's a link on that page).

-Bill




Re: [Clamav-users] Complexity limit on (custom) signatures?

2006-10-28 Thread Kris Deugau
Henrik Krohns wrote:
> I don't get it.. unless you have some big honeypot, maybe 5% of traffic
> contain small images to be OCRd. If your server can't handle that, I guess
> it's running out of juice anyway. :)

Well... yeah. The basic problem is that all the other garbage
(with the occasional inevitable exception) is getting caught by Clam
(viruses and most phishes) or SpamAssassin (all but a few text-based spams).

I've found *enough* similarities in the raw binary image data to
usefully make signatures for a lot of what is otherwise getting through;
 at the moment this is just a stopgap until these machines can be retired.

However, in the long run, OCR to feed the text to SpamAssassin's other
rules is a better solution;  it's much more flexible.

-kgd


Re: [Clamav-users] Complexity limit on (custom) signatures?

2006-10-28 Thread Henrik Krohns
On Sat, Oct 28, 2006 at 09:20:55AM -0700, Dennis Peterson wrote:
>
> I've explored OCR on both color and de-colorized images and there have
> been successes, but not enough to warrant turning it on in production. It
> is very cpu intensive.

I don't get it.. unless you have some big honeypot, maybe 5% of traffic
contains small images to be OCRd. If your server can't handle that, I guess
it's running out of juice anyway. :)

You can even easily create a separate scanning queue for OCR, so it doesn't
interfere with normal traffic.
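
A minimal sketch of what such a split could look like, assuming messages
arrive as raw bytes and that anything carrying an image part and small enough
to be worth OCRing is diverted to its own queue. The function name `route`,
the two queues and the `max_ocr_bytes` threshold are hypothetical and not
taken from any particular mail scanner:

```python
import queue
from email import message_from_bytes

# Hypothetical cut-off: only messages at or below this size are sent for OCR.
DEFAULT_MAX_OCR_BYTES = 256 * 1024


def has_image_part(raw: bytes) -> bool:
    """Return True if any MIME part of the message has an image/* type."""
    msg = message_from_bytes(raw)
    return any(part.get_content_maintype() == "image" for part in msg.walk())


def route(raw: bytes, normal_q: queue.Queue, ocr_q: queue.Queue,
          max_ocr_bytes: int = DEFAULT_MAX_OCR_BYTES) -> str:
    """Put the message on the OCR queue or the normal queue; return which."""
    if len(raw) <= max_ocr_bytes and has_image_part(raw):
        ocr_q.put(raw)
        return "ocr"
    normal_q.put(raw)
    return "normal"
```

A separate worker (or a separate machine) would then drain the OCR queue at
its own pace, so slow OCR runs never hold up text-only mail.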

Cheers,
Henrik


Re: [Clamav-users] Complexity limit on (custom) signatures?

2006-10-28 Thread Dennis Peterson

Kris Deugau wrote:
> The stock and pill spams that I'm trying to tag, however, have images
> that have *very small* variations message-to-message, but over a larger
> sample there's really very little that can be seen as "common" across
> the whole set - or even a significant part of the set.  Automating the
> process of finding "all possible values for the byte at this position"
> is the only way I can usefully get anywhere.


I did a binary diff and md5 checksums on hundreds of the stock and pill 
images and never found any two to be the same. They use a random noise 
generator to sprinkle the images with enough debris to prevent analysis, 
so even splitting the files into 128 and 512 byte slices and checking 
each of the slices was not helpful. Even when you convert the image to 
black and white to remove the color element there's still sufficient 
randomness to prevent go-nogo certainty. I've explored OCR on both color 
and de-colorized images and there have been successes, but not enough to 
warrant turning it on in production. It is very cpu intensive.
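
The slice comparison described above can be sketched roughly as follows. This
is an illustration only, not the tooling actually used: the function names are
hypothetical, and it assumes slices are compared at the same offset in each
file.

```python
import hashlib
from collections import Counter
from typing import Iterable, List, Set, Tuple


def slice_digests(data: bytes, size: int) -> List[Tuple[int, str]]:
    """MD5 each fixed-size slice, keyed by the slice's byte offset."""
    return [(off, hashlib.md5(data[off:off + size]).hexdigest())
            for off in range(0, len(data), size)]


def shared_slices(images: Iterable[bytes], size: int) -> Set[Tuple[int, str]]:
    """Return the (offset, digest) pairs that occur in more than one image."""
    seen: Counter = Counter()
    for data in images:
        seen.update(slice_digests(data, size))
    return {key for key, count in seen.items() if count > 1}
```

Running `shared_slices` with a size of 128 and then 512 over a batch of
samples shows whether any aligned block survives unchanged between images; an
empty result matches the "not helpful" outcome reported here.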


I attempted to see if there were any digital watermarks in these images 
and found nothing although the math for doing this pushes my limits.


I work in the image industry so have to be more careful than most 
regarding these, so others may have better luck than I which is another 
way of saying acceptable risk is site dependent.


I'd be very interested in any headway you make.

FWIW, I checked my current logs and found the MSRBL sigs blocked over 
6,000 images in a two week period. The Sanesecurity filters stopped an 
additional 4,000. There were a total of 16383 messages blocked using all 
ClamAV filters, and many more thousands found by various milters and 
RBL/SURBL scans. This is on one of the smaller servers I run. The bigger 
mail farms are magnitudes greater for all categories. I mention this 
only because the out of pocket cost for these successes was $0.00 USD 
and very little time invested. Which reminds me, I should send some 
donation money to all the great folks who made these successes possible.


dp


Re: [Clamav-users] Complexity limit on (custom) signatures?

2006-10-28 Thread Kris Deugau
Dennis Peterson wrote:
> Not to change the direction on you, but you might want to take advantage
> of the work Steve Basford is doing at
> http://www.sanesecurity.com/clamav/ for phishing problems, and also look
> at http://www.msrbl.com/site/stats for image and spam solutions. Both
> sites are providing excellent results on systems I'm running. The
> patterns are downloadable and very up to date. I've not had a single
> complaint of false positives, and the number of patterns provided is
> quite large.

Those both look like excellent projects for the things they're
targeting... but they don't really fit my problem.

Phishing scams are mostly tagged by Clam already, and if not, they're
generally tagged by SpamAssassin.  This is working fine.

Imagespam that doesn't mutate will quickly get noticed and tagged either
via SpamAssassin's Bayes learner, or when I find a run of copies of the
exact same image (which is all you can really tag with the MD5
signatures).  FWIW, I have seen a few of these...  about one in several
thousand reported missed spams.  :/
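
For reference, a hash-based ClamAV signature (a line in a `.hdb` file) has the
form `MD5:FileSize:Name`, which is why it only ever matches byte-identical
copies. A minimal sketch of building such a line; the helper name `hdb_entry`
and the signature names are hypothetical:

```python
import hashlib


def hdb_entry(data: bytes, name: str) -> str:
    """Build one .hdb-style line (MD5:FileSize:Name) for an exact file."""
    return f"{hashlib.md5(data).hexdigest()}:{len(data)}:{name}"


if __name__ == "__main__":
    original = b"GIF87a" + bytes(32)
    mutated = b"GIF87a" + bytes(31) + b"\x01"   # a single byte differs
    print(hdb_entry(original, "ImgSpam.Exact.1"))
    print(hdb_entry(mutated, "ImgSpam.Exact.2"))  # completely different hash
```

One changed byte gives an unrelated digest, so every mutated copy would need
its own entry.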

The stock and pill spams that I'm trying to tag, however, have images
that have *very small* variations message-to-message, but over a larger
sample there's really very little that can be seen as "common" across
the whole set - or even a significant part of the set.  Automating the
process of finding "all possible values for the byte at this position"
is the only way I can usefully get anywhere.

On rare occasion, I find a duplicate, but that's ~1 in 500 or worse,
which would add up to a LOT of MD5 sigs that wouldn't really do me any
good.  I've seen general patterns in the hex dumps, but there's enough
variation that manually creating a signature to match these things is
unworkable.

> Steve has also written a very useable how-to for creating these patterns.

A lot of the how-tos I've seen assume that whatever you're trying to
create a signature for shows minor variations message-to-message, but
shows a *very* large range over a larger number of messages (100+).  :/
  Thus the scripts I wrote to extract a chunk of hex-coded bytes, and
crunch those down to what should be valid ClamAV signatures.

An average signature from this process might look something like:

ImgSpam.Misc.5:0:0:474946383761??(01|00)??00442c??(01|00)??0084(00|48|53)(00|15)(00|30|1c)f0f0f0(f0|e0|c0)f0(e0|b0|f0|d0|c0)f0(00|f0|40)(00|d0|e0|60|70)(f0|90|00|c0)(e0|90|00|b0|70)f0??(00|90|40|7d|10)(f0|ea)??(f0|00|e0|d0|46)

Watch for linewrap; this is just the first ~175 characters of a
~630-character sig.  The complexity is typical of results I've been
getting, and the rest of the sig is similar.
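
The "all possible values for the byte at this position" step can be sketched
as below. This is not the actual script, only an illustration under stated
assumptions: the function names are hypothetical, samples are compared from
offset 0, and a position with more than `max_alternatives` distinct values
(an arbitrary threshold) is collapsed to the `??` wildcard. The `:0:0:` fields
simply mirror the example signature above.

```python
from typing import Iterable, List


def build_signature(samples: Iterable[bytes], length: int,
                    max_alternatives: int = 5) -> str:
    """Collapse the first `length` bytes of every sample into one hex pattern."""
    chunks = [s[:length] for s in samples]
    if not chunks or any(len(c) < length for c in chunks):
        raise ValueError("every sample must supply at least `length` bytes")
    parts: List[str] = []
    for pos in range(length):
        values = sorted({c[pos] for c in chunks})
        if len(values) == 1:
            parts.append(f"{values[0]:02x}")               # constant byte
        elif len(values) <= max_alternatives:
            alts = "|".join(f"{v:02x}" for v in values)
            parts.append(f"({alts})")                      # short alternation
        else:
            parts.append("??")                             # too varied: wildcard
    return "".join(parts)


def ndb_line(name: str, body: str) -> str:
    """Prefix the pattern with a name, target type 0 and offset 0."""
    return f"{name}:0:0:{body}"
```

For three samples starting `GIF87a` that differ only in the next two bytes,
`build_signature(samples, 8)` gives the fixed prefix `474946383761` followed
by one alternation per varying position.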

-kgd


Re: [Clamav-users] Complexity limit on (custom) signatures?

2006-10-28 Thread Gerard Seibert
On Friday October 27, 2006 at 08:42:34 (PM) Dennis Peterson wrote:

> Not to change the direction on you, but you might want to take advantage 
> of the work Steve Basford is doing at 
> http://www.sanesecurity.com/clamav/ for phishing problems, and also look 
> at http://www.msrbl.com/site/stats for image and spam solutions. Both 
> sites are providing excellent results on systems I'm running. The 
> patterns are downloadable and very up to date. I've not had a single 
> complaint of false positives, and the number of patterns provided is 
> quite large.
> 
> Steve has also written a very useable how-to for creating these patterns.

Steve has done a remarkable job with his 'sig' files. He is constantly
updating them. I know because I use them. They are always catching
'phishing' threats on my PC.

He also has two automated installers for downloading and installing his
signature files. I wrote the 'script' version. There is also a Perl
version available on his site.


-- 
Gerard

 "There is nothing wrong with making love with the light on. Just make
 sure the car door is closed."

  George Burns