R: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-29 Thread Giampaolo Tomassoni
> -----Original Message-----
> From: Dan Barker [mailto:[EMAIL PROTECTED]
> 
> 
> 
> > The main purpose of the FuzzyOcr's db was of course to avoid computing the
> > OCR passes needed to decode the image text for known images. The problem is
> > that the cache content is not searched for an exact match of the key values
> > (which are image type, width, height, number of colors and color
> > frequencies): it looks for the best match of these values within a given
> > range. This has a number of drawbacks:
> >
> >  a) range search defeats look-up indexing in the db,
> > thereby resulting in browsing the whole db for a match;
>
> Range searching in a database is (can be?) vastly faster than a full table
> scan. You have to USE the indexes, not just assume an FTS will be required.
> The database optimizer _SHOULD_ figure this out, but only if it has
> reasonable statistics and is passed a reasonable query.
>
> I got some SQL from MapQuest once, that had a WHERE clause containing the
> arithmetic to compute the distance from a given location (a "radius"
> search). As coded, of course, a Full Table Scan was required and the
> distance function was evaluated to determine a row's presence in the result
> set. It was very slow, even for just a few hundred thousand location
> records.
>
> I indexed the Latitude and Longitude columns, and expressed the query
> without the radius in the WHERE, but rather the ranges of possible Latitude
> and Longitude values (In effect, the rectangle that just contained the
> circle the user desired). The unwanted "corners" of the result set were
> discarded, rather than every single row outside the desired radius. The
> performance gains (for normal radiuses, 10, 20, 50, 100 miles) were
> enormous. The average gain was 100:1 - two orders of magnitude.
>
> MapQuest, of course, wasn't interested. The SQL ran on their clients'
> machines, not theirs.
>
> If fuzzyOCR caching method has any merit at all, tuning the SQL and/or the
> database will provide decent performance.
> 
> "Explain Execution Plan" is your friend!

I totally agree with you, Dan.

I recall such a discussion when FuzzyOcr was still under development and
there were some of us contributing ideas about this.

Nevertheless, when the final code came out, it did nothing to let the SQL
server at least reduce the number of rows to retrieve (for example, the
select in question didn't even put rough allowed ranges on the image width
and height attributes in its WHERE expression). The result is point a) of
the list of drawbacks in my post.

However, even with that fixed, points b) and c) still hold. This is a problem
inherent in the range checking itself: it is not always correct to assert that
two images are the same if their "distance" is less than a given epsilon.
This is true regardless of how you compute your "distance" or how low your
epsilon is, and in this specific case FuzzyOcr's distance function is maybe
not the best we could get...

So, in my opinion, drawback a) could be fixed by not allowing any range
search at all, and instead computing a true hash (md5 or any other good one)
of the image file and using it as the primary key in the db. Spammers often
use the very same image in only a few messages, so the performance gain
would be low but, nevertheless, non-zero. c) would still hold anyway, so
maybe even this "solution" wouldn't help that much when you're trying to
tune FO to defenestrate that damn spammer.
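
A minimal sketch of the exact-hash idea, assuming a SQLite table keyed by a
digest of the raw image bytes. The table name `ocr_cache`, the helper names
and the `run_ocr` callable are hypothetical and are not FuzzyOcr's actual
code or schema; SHA-256 stands in here for "md5 or whatever good".

```python
import hashlib
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache whose primary key is the image digest."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_cache ("
        "digest TEXT PRIMARY KEY, ocr_text TEXT NOT NULL)"
    )
    return conn


def cached_ocr(conn, image_bytes, run_ocr):
    """Return OCR text for image_bytes, calling run_ocr only on a cache miss."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    # Exact-match lookup on the primary key: no range search, no full scan.
    row = conn.execute(
        "SELECT ocr_text FROM ocr_cache WHERE digest = ?", (digest,)
    ).fetchone()
    if row is not None:
        return row[0]
    text = run_ocr(image_bytes)  # hypothetical OCR callable supplied by caller
    conn.execute(
        "INSERT INTO ocr_cache (digest, ocr_text) VALUES (?, ?)", (digest, text)
    )
    conn.commit()
    return text
```

A byte-identical image hits the cache; an image that differs by a single
byte does not, which is exactly the trade-off described above.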

As someone else pointed out on this list, the whole caching code was due to
concerns about the execution time needed to run OCR on a lot of images.
These concerns seem much more relaxed now, so the best option we actually
have is to disable caching altogether, which is like discarding the caching
code in FuzzyOcr.

Giampaolo


> Dan Barker


RE: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-29 Thread Dan Barker


> The main purpose of the FuzzyOcr's db was of course to avoid computing the
> OCR passes needed to decode the image text for known images. The problem is
> that the cache content is not searched for an exact match of the key values
> (which are image type, width, height, number of colors and color
> frequencies): it looks for the best match of these values within a given
> range. This has a number of drawbacks:
>
>  a) range search defeats look-up indexing in the db,
> thereby resulting in browsing the whole db for a match;



Range searching in a database is (can be?) vastly faster than a full table
scan. You have to USE the indexes, not just assume an FTS will be required.
The database optimizer _SHOULD_ figure this out, but only if it has
reasonable statistics and is passed a reasonable query.

I got some SQL from MapQuest once, that had a WHERE clause containing the
arithmetic to compute the distance from a given location (a "radius"
search). As coded, of course, a Full Table Scan was required and the
distance function was evaluated to determine a row's presence in the result
set. It was very slow, even for just a few hundred thousand location
records.

I indexed the Latitude and Longitude columns, and expressed the query
without the radius in the WHERE, but rather the ranges of possible Latitude
and Longitude values (In effect, the rectangle that just contained the
circle the user desired). The unwanted "corners" of the result set were
discarded, rather than every single row outside the desired radius. The
performance gains (for normal radiuses, 10, 20, 50, 100 miles) were
enormous. The average gain was 100:1 - two orders of magnitude.
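
A minimal sketch of the bounding-rectangle technique described above, using
SQLite only because it ships with Python. The `locations` table, its
columns, the index name and the helper functions are hypothetical; this is
not the MapQuest SQL. It assumes a spherical Earth and a search circle that
does not reach a pole or cross the 180-degree meridian.

```python
import math
import sqlite3

EARTH_RADIUS_MILES = 3958.8  # mean radius, spherical approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def make_db(rows):
    """Build an in-memory table with an index on the two range columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE locations (name TEXT, lat REAL, lon REAL)")
    conn.execute("CREATE INDEX idx_locations_lat_lon ON locations (lat, lon)")
    conn.executemany("INSERT INTO locations VALUES (?, ?, ?)", rows)
    return conn


def within_radius(conn, lat, lon, radius_miles):
    """Names of locations within radius_miles of (lat, lon), sorted."""
    d = radius_miles / EARTH_RADIUS_MILES  # angular radius in radians
    dlat = math.degrees(d)
    dlon = math.degrees(math.asin(math.sin(d) / math.cos(math.radians(lat))))
    # Step 1: plain range predicates the optimizer can serve from the index.
    candidates = conn.execute(
        "SELECT name, lat, lon FROM locations "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (lat - dlat, lat + dlat, lon - dlon, lon + dlon),
    ).fetchall()
    # Step 2: evaluate the distance only for the few rows inside the rectangle,
    # discarding the "corners".
    return sorted(
        name for name, rlat, rlon in candidates
        if haversine_miles(lat, lon, rlat, rlon) <= radius_miles
    )


conn = make_db([
    ("near", 40.10, -75.00),    # inside the circle
    ("corner", 40.14, -74.82),  # inside the rectangle, outside the circle
    ("far", 41.00, -75.00),     # outside the rectangle
])
print(within_radius(conn, 40.0, -75.0, 10))
```

The distance function still runs, but only on rows the range predicates let
through, instead of on every row in the table.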

MapQuest, of course, wasn't interested. The SQL ran on their clients'
machines, not theirs.

If fuzzyOCR caching method has any merit at all, tuning the SQL and/or the
database will provide decent performance. 
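
One way to check whether a query is "reasonable" in this sense is to ask the
database for its plan. The sketch below uses SQLite's `EXPLAIN QUERY PLAN`
from Python; the `images` table, its columns and the index name are
hypothetical stand-ins, not FuzzyOcr's real schema, and the exact plan
wording varies between SQLite versions.

```python
import sqlite3


def plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (id INTEGER PRIMARY KEY, width INTEGER, "
    "height INTEGER, ocr_text TEXT)"
)
conn.execute("CREATE INDEX idx_images_width ON images (width)")

# Arithmetic on the column inside WHERE: the index on width cannot be used.
scan_plan = plan(
    conn,
    "SELECT id, ocr_text FROM images WHERE abs(width - ?) <= ?",
    (640, 8),
)

# The same tolerance written as an explicit range on the indexed column.
range_plan = plan(
    conn,
    "SELECT id, ocr_text FROM images WHERE width BETWEEN ? AND ?",
    (632, 648),
)

print(scan_plan)   # a SCAN of the whole table
print(range_plan)  # a SEARCH using idx_images_width
```

Both queries express the same tolerance; only the second is written in a
form the optimizer can match against the index.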

"Explain Execution Plan" is your friend!

Dan Barker



R: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-29 Thread Giampaolo Tomassoni
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of
> snowcrash+sa
> 
> hi andy,
> 
> > For what it's worth, the fuzzyocr hashing is of very limited value, and in
> > many cases is a severe performance hit. I found that scanning the hashes,
> > due to the "fuzzy" nature, is more costly than just rescanning the file
> > with OCR, as *each* *and* *every* hash must be checked iteratively.
> 
> now, *that's* an interesting point to consider.
> 
> i'd be interested in what, then, the 'goal' of the hashing/comparison *is*?
> 
> is it performance, and it just missed the mark for the reasons you
> state?  or is it something else?

The main purpose of the FuzzyOcr's db was of course to avoid computing the
OCR passes needed to decode the image text for known images. The problem is
that the cache content is not searched for an exact match of the key values
(which are image type, width, height, number of colors and color
frequencies): it looks for the best match of these values within a given
range. This has a number of drawbacks:

 a) range search defeats look-up indexing in the db,
thereby resulting in browsing the whole db for a match;

 b) range search also increases false positive matches
on the db content;

 c) the db caches OCR results, so a match on it may return
an unwanted/imprecise result if you tweak FuzzyOcr config
and/or words files.

The first drawback may yield high processing times and even timeouts when
you have a medium-loaded mail server, the second is probably the worst
problem for most of us, and the last is, well, another problem.

So, yes: FuzzyOCR's cache was meant to improve performance and, yes again,
it basically missed the mark.

The solution is simply to discard the cache db and run the OCR phases on
each and every image: on all but the most lightly loaded servers this is the
most effective way to deal with it. Most of us are used to turning glitches
off while keeping what works... :)

Giampaolo


> dunno.
> 
> but, your point bears some benchmarking ...
> 
> thx!


Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-29 Thread Mark Martinec
Andy Dills wrote:

> For what it's worth, the fuzzyocr hashing is of very limited value, and in
> many cases is a severe performance hit. I found that scanning the hashes,
> due to the "fuzzy" nature, is more costly than just rescanning the file
> with OCR, as *each* *and* *every* hash must be checked iteratively.
>
> Because of the "fuzzy" nature, you can't just check the db to "see if this
> hash exists." You have to go through and compare the generated hash to
> every hash in the db, and it considers it a match if it's "close enough".
>
> It's severely less computationally expensive to just rescan the damn
> image. It won't matter if you only get a couple hundred emails per day,
> but once the number of stored hashes reaches a reasonably low number, it
> becomes faster to rescan the image than to go through every single stored
> hash to see if you've already scanned a similar image.

I fully agree. When a fuzzyocr caching database grows beyond a certain (small)
size, it becomes a severe penalty, costlier than rescanning images.


snowcrash wrote:
> i'd be interested in what, then, the 'goal' of the hashing/comparison *is*?
> is it performance, and it just missed the mark for the reasons you
> state?  or is it something else?

The desired goal was no doubt performance increase,
but the implementation made it into a performance drag.

A possible compromise is to ditch the fuzzyocr database every couple of days,
and let it be started anew. This does bring some (limited) benefits.
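
A minimal sketch of such a periodic reset, meant to be run from a scheduler
(cron or similar) every couple of days. The `Fuzzy*db*` pattern is the one
used with `rm` elsewhere in this thread; the function name and the directory
passed in are assumptions, since the location of the database files depends
on the local FuzzyOcr configuration.

```python
from pathlib import Path


def reset_cache(directory, pattern="Fuzzy*db*"):
    """Delete every regular file in `directory` matching `pattern`.

    Returns the names of the removed files so the caller can log them.
    """
    removed = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

The database then starts empty again and stays small until the next reset.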

  Mark


Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-28 Thread snowcrash+sa
hi andy,

> For what it's worth, the fuzzyocr hashing is of very limited value, and in
> many cases is a severe performance hit. I found that scanning the hashes,
> due to the "fuzzy" nature, is more costly than just rescanning the file
> with OCR, as *each* *and* *every* hash must be checked iteratively.

now, *that's* an interesting point to consider.

i'd be interested in what, then, the 'goal' of the hashing/comparison *is*?

is it performance, and it just missed the mark for the reasons you
state?  or is it something else?

dunno.

but, your point bears some benchmarking ...

thx!


Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-28 Thread Andy Dills
On Tue, 28 Aug 2007, snowcrash+sa wrote:

> aha!
> 
>  in FuzzyOcr.cf,
> 
>   -   focr_hashing_learn_scanned 1
>   +   focr_hashing_learn_scanned 0
> 
> then,
> 
>   rm Fuzzy*db*
> 

...

> 
> i did not realize that if the HASH pre-exists, then the image's total
> score hits -- and is reused from the hash db -- but that none of the
> word-hit data is stored/reused.

For what it's worth, the fuzzyocr hashing is of very limited value, and in 
many cases is a severe performance hit. I found that scanning the hashes, 
due to the "fuzzy" nature, is more costly than just rescanning the file 
with OCR, as *each* *and* *every* hash must be checked iteratively.

Because of the "fuzzy" nature, you can't just check the db to "see if this 
hash exists." You have to go through and compare the generated hash to 
every hash in the db, and it considers it a match if it's "close enough".

It's severely less computationally expensive to just rescan the damn 
image. It won't matter if you only get a couple hundred emails per day, 
but once the number of stored hashes reaches a reasonably low number, it 
becomes faster to rescan the image than to go through every single stored 
hash to see if you've already scanned a similar image. 
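
A minimal sketch of why the "close enough" comparison costs more as the
database grows. The feature tuples, the sum-of-absolute-differences distance
and the function names are illustrative assumptions, not FuzzyOcr's actual
hash format or distance function.

```python
def fuzzy_lookup(stored, candidate, tolerance):
    """Linear scan: compare the candidate against every stored entry.

    `stored` is a list of (features, result) pairs. Returns the result of
    the closest entry within `tolerance` (or None) and the number of
    comparisons that were needed.
    """
    best, best_dist, comparisons = None, None, 0
    for features, result in stored:
        comparisons += 1
        dist = sum(abs(a - b) for a, b in zip(features, candidate))
        if dist <= tolerance and (best_dist is None or dist < best_dist):
            best, best_dist = result, dist
    return best, comparisons


def exact_lookup(index, candidate):
    """Exact match: a single dictionary lookup, independent of cache size."""
    return index.get(tuple(candidate))
```

With N stored entries the fuzzy lookup always performs N comparisons, hit or
miss, while the exact lookup stays constant-time but only finds identical
keys.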

Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---


R: Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-28 Thread Giampaolo Tomassoni
> -----Original Message-----
> From: Giampaolo Tomassoni [mailto:[EMAIL PROTECTED]
> 
> ...snip...
> 
> I *guess* it is focr_enable_image_hashing, which I already commented out
> (and thereby disabled hashing) because I was experiencing problems when
> changing something in my FuzzyOcr.words. This happened some months ago and I
> can't exactly recall how I did it...
> 
> Giampaolo

Ok, I didn't see your post in which you said you had already succeeded in
disabling the hash db.

Sorry about my redundant reply.

Giampaolo

> 
> 
> >
> > thanks!


R: Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-28 Thread Giampaolo Tomassoni
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of
> 
> hi,
> 
> (you 'busted out' of the thread ... replying back in it.)
> 
> > Disable the FuzzyOcr's result hash cache on both machines before testing for
> > differences: you are looking at stale results.
> >
> > If these systems cached the results when the version or config of the two
> > FuzzyOcr(s) were not the same, of course you see a difference...
> 
> You might have a point.
> 
> So, silly question:
> 
> HOW do I "Disable the FuzzyOcr's result hash cache"?
> 
> commenting out:
> 
>   #body FUZZY_OCR_KNOWN_HASH eval:dummy_check()
> 
> in FuzzyOcr.cf isn't doing it :-/

I *guess* it is focr_enable_image_hashing, which I already commented out
(and thereby disabled hashing) because I was experiencing problems when
changing something in my FuzzyOcr.words. This happened some months ago and I
can't exactly recall how I did it...

Giampaolo


> 
> thanks!


Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-28 Thread snowcrash+sa
aha!

 in FuzzyOcr.cf,

-   focr_hashing_learn_scanned 1
+   focr_hashing_learn_scanned 0

then,

rm Fuzzy*db*


now, as expected ...

  18 FUZZY_OCR  BODY: Img with common spam text inside
[Words found:]
["investor" in 1 lines]
["price" in 2 lines]
["company" in 1 lines]
["alert" in 1 lines]
["valium" in 1 lines]
["trade" in 1 lines]
["banking" in 1 lines]
["news" in 1 lines]
[(13.5 word occurrences found)]


i did not realize that if the HASH pre-exists, then the image's total
score hits -- and is reused from the hash db -- but that none of the
word-hit data is stored/reused.

got it,

thx!


Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-28 Thread snowcrash+sa
hi,

(you 'busted out' of the thread ... replying back in it.)

> Disable the FuzzyOcr's result hash cache on both machines before testing for
> differences: you are looking at stale results.
>
> If these systems cached the results when the version or config of the two
> FuzzyOcr(s) were not the same, of course you see a difference...

You might have a point.

So, silly question:

HOW do I "Disable the FuzzyOcr's result hash cache"?

commenting out:

  #body FUZZY_OCR_KNOWN_HASH eval:dummy_check()

in FuzzyOcr.cf isn't doing it :-/

thanks!


Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-28 Thread snowcrash
Hi,

> Unless I've missed something, above is the only difference in the two
> reports.  So that says the only real difference is the lack of the 'words
> found' report in FuzzyOCR.

Yup. You're correct.

> I think I'd add a line that should end up in the report and say
> something like "debug report enabled".

In building up the two boxes I'd co'd the plugin setup from a single
local Hg repo.  And, diff'd the co'd files -- no diff.

Nonetheless, probly a good idea ... thx

> See if that line shows up in both cases.  If not, you probably have a
> problem with whatever is setting the switch.  If it does, look at how those
> word strings are output.  The only thing that looks interesting about them
> is they contain quote marks.  Maybe something strange has happened and the
> quote marks are breaking the debug output in one case.

Digging around as I type ... stay tuned.

> If all else fails, probably best to take this over to the FuzzyOCR mailing
> list for a while.

Sadly, a bit "lonely" over there these days ... but, will (eventually) do!

Cheers,


R: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-28 Thread Giampaolo Tomassoni
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of
> snowcrash+sa
>
> ...snip...
>
> 16 FUZZY_OCR_KNOWN_HASH   BODY: Image with known hash
>
> ...snip...
>
> any hints/suggestions as to what i might've missed? how to find it?

Disable the FuzzyOcr's result hash cache on both machines before testing for
differences: you are looking at stale results.

If these systems cached the results when the version or config of the two
FuzzyOcr(s) were not the same, of course you see a difference...

Giampaolo

> 
> thanks!


Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-28 Thread Loren Wilton

   16 FUZZY_OCR_KNOWN_HASH   BODY: Image with known hash
 []
 [Words found:]
 ["investor" in 1 lines]
 ["price" in 2 lines]
 ["company" in 1 lines]
 ["alert" in 1 lines]
 ["valium" in 1 lines]
 ["trade" in 1 lines]
 ["banking" in 1 lines]
 ["news" in 1 lines]
 [(13.5 word occurrences found)]

   18 FUZZY_OCR_KNOWN_HASH   BODY: Image with known hash
 []
 [Words found:]
 []
 [(13.5 word occurrences found)]



Unless I've missed something, above is the only difference in the two
reports.  So that says the only real difference is the lack of the 'words
found' report in FuzzyOCR.
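
A minimal sketch of making this comparison mechanical rather than by eye,
assuming the report from each box has been saved as text. It uses Python's
standard `difflib`; the function name and the "box1"/"box2" labels are
illustrative.

```python
import difflib


def report_diff(report_a, report_b, name_a="box1", name_b="box2"):
    """Return unified-diff lines between two saved report texts."""
    return list(difflib.unified_diff(
        report_a.splitlines(),
        report_b.splitlines(),
        fromfile=name_a,
        tofile=name_b,
        lineterm="",
    ))
```

Lines prefixed with `-` appear only in the first report and lines prefixed
with `+` only in the second, so a missing "Words found" entry shows up
directly.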


The first thing I'd do is look at the plugin code to see what controls the 
output of those report lines, and see what it is based on.  I think I'd add 
a line that should end up in the report and say something like "debug report 
enabled".


See if that line shows up in both cases.  If not, you probably have a 
problem with whatever is setting the switch.  If it does, look at how those 
word strings are output.  The only thing that looks interesting about them 
is they contain quote marks.  Maybe something strange has happened and the 
quote marks are breaking the debug output in one case.


If all else fails, probably best to take this over to the FuzzyOCR mailing 
list for a while.


   Loren




two supposedly identical SA boxes, with slightly different report output -- help find the diff?

2007-08-28 Thread snowcrash+sa
hi,

grr. i'm at that resorting-to-visine stage of wtf ... :-/

i've

spamassassin --version
SpamAssassin version 3.2.4-r564346
  running on Perl version 5.8.8

with, among numerous other rules/plugins, FuzzyOcr/r330 installed.

i've just updated two supposedly identical boxes, building from clean
sources, and running the same setup scripts on both.

no errors in the installs.

on testing of FuzzyOcr image processing on one of its included test files with,

spamassassin -D -t -x < FuzzyOcr/samples/ocr-animated.eml

i see in the debug output the following report on one box,

  ...
  Content analysis details:   (38.2 points, 4.0 required)

   pts rule name              description
  ---- ---------------------- --------------------------------------------------
   4.2 MID_DEGREES            MID_DEGREES
   3.7 CTYPE_8SPACE_GIF       BODY: Stock spam image part 'Content-Type' found
                              (8 spc)
   0.0 HTML_MESSAGE           BODY: HTML included in message
   1.5 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                              [score: 0.4467]
   1.7 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
   2.5 HTML_IMAGE_ONLY_16     BODY: HTML: images with 1200-1600 bytes of words
   1.2 SARE_GIF_ATTACH        FULL: Email has a inline gif
   1.5 MY_CID_AND_STYLE       SARE cid and style
   2.9 DRUGS_STOCK_MIMEOLE    Stock-spam forged headers found (5510)
    16 FUZZY_OCR_KNOWN_HASH   BODY: Image with known hash
                              []
                              [Words found:]
                              ["investor" in 1 lines]
                              ["price" in 2 lines]
                              ["company" in 1 lines]
                              ["alert" in 1 lines]
                              ["valium" in 1 lines]
                              ["trade" in 1 lines]
                              ["banking" in 1 lines]
                              ["news" in 1 lines]
                              [(13.5 word occurrences found)]


and, similarly on the other box,

  ...
  Content analysis details:   (38.5 points, 4.0 required)

   pts rule name              description
  ---- ---------------------- --------------------------------------------------
   3.7 MID_DEGREES            MID_DEGREES
   1.6 CTYPE_8SPACE_GIF       BODY: Stock spam image part 'Content-Type' found
                              (8 spc)
   0.0 HTML_MESSAGE           BODY: HTML included in message
   1.5 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                              [score: 0.4467]
   1.5 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
   1.5 HTML_IMAGE_ONLY_16     BODY: HTML: images with 1200-1600 bytes of words
   1.2 SARE_GIF_ATTACH        FULL: Email has a inline gif
   1.5 MY_CID_AND_STYLE       SARE cid and style
   3.5 DRUGS_STOCK_MIMEOLE    Stock-spam forged headers found (5510)
    18 FUZZY_OCR_KNOWN_HASH   BODY: Image with known hash
                              []
                              [Words found:]
                              []
                              [(13.5 word occurrences found)]


NOTE the "words found" detail in the second box's debug output :-/

trying to find what's causing the different output, i've pored over
the debug output, googl'd the lists, diff'd the config files, etc.

nada.  to my weary eye, all looks "the same".

obviously, it's not.

any hints/suggestions as to what i might've missed? how to find it?

thanks!