R: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
> -Original Message-
> From: Dan Barker [mailto:[EMAIL PROTECTED]
>
> > The main purpose of the FuzzyOcr's db was of course to avoid computing the OCR passes needed to decode the image text for known images. The problem is that the cache content is not searched for an exact match of the key values (which are image type, width, height, number of colors and color frequencies): it looks for the best match of these values within a given range. This has a number of drawbacks:
> >
> > a) range search defeats look-up indexing in the db, thereby resulting in browsing the whole db for a match;
>
> Range searching in a database is (can be?) vastly faster than a full table scan. You have to USE the indexes, not just assume an FTS will be required. The database optimizer _SHOULD_ figure this out, but only if it has reasonable statistics and is passed a reasonable query.
>
> I got some SQL from MapQuest once, that had a WHERE clause containing the arithmetic to compute the distance from a given location (a "radius" search). As coded, of course, a Full Table Scan was required and the distance function was evaluated to determine a row's presence in the result set. It was very slow, even for just a few hundred thousand location records.
>
> I indexed the Latitude and Longitude columns, and expressed the query without the radius in the WHERE, but rather the ranges of possible Latitude and Longitude values (In effect, the rectangle that just contained the circle the user desired). The unwanted "corners" of the result set were discarded, rather than every single row outside the desired radius. The performance gains (for normal radiuses, 10, 20, 50, 100 miles) were enormous. The average gain was 100:1 - two orders of magnitude.
>
> MapQuest, of course, wasn't interested. The SQL ran on their clients' machines, not theirs.
>
> If fuzzyOCR caching method has any merit at all, tuning the SQL and/or the database will provide decent performance.
>
> "Explain Execution Plan" is your friend!

I totally agree with you, Dan. I recall such a discussion when FuzzyOcr was still under development and some of us were contributing ideas about this. Nevertheless, when the final code came out, it didn't implement anything that would at least allow the SQL server to reduce the number of rows to retrieve (for example, the specific select didn't even put the allowed, rough ranges on the image width and height attributes in its WHERE expression). The result is point a) of the list of drawbacks in my post.

However, even with that fixed, points b) and c) still hold. This is a problem bound to the range checking itself: it is not always correct to assert that two images are the same if their "distance" is less than a given epsilon. This is true regardless of how you compute your "distance" or how low your epsilon is, but in this specific case maybe FuzzyOcr's distance function is not the best we could get...

Therefore, in my opinion, drawback a) could eventually be fixed by not allowing any range search at all, and instead computing a true hash (md5 or any other good one) of the image file and using it as the primary key in the db. Spammers often use the very same image in only a few messages, so the performance gain would be low but, nevertheless, nonzero. c) would still hold anyway, so maybe even this "solution" wouldn't help that much when you're trying to tune FO to defenestrate that damn spammer.

As someone else pointed out on this list, the whole caching code came from concerns about the execution time needed to run OCR on a lot of images. These concerns seem much relaxed now, so the best option we actually have is to disable caching altogether, which is like discarding any caching code in FuzzyOcr.

Giampaolo

> Dan Barker
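The exact-key idea above (a true hash of the image file as the primary key) could look roughly like this in Python with sqlite3. This is only a sketch under assumptions: the table name ocr_cache, the stored score column and all function names are made up for illustration, and none of it is FuzzyOcr's actual code.

```python
import hashlib
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache keyed by an exact digest of the image bytes."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: the digest is the primary key, so a lookup is an
    # indexed equality match rather than a range scan.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ocr_cache ("
        "digest TEXT PRIMARY KEY, score REAL NOT NULL)"
    )
    return conn


def image_digest(image_bytes):
    """Exact fingerprint of the file: any changed byte gives a different key."""
    return hashlib.md5(image_bytes).hexdigest()


def lookup(conn, image_bytes):
    """Return the cached score for this exact image, or None if unknown."""
    row = conn.execute(
        "SELECT score FROM ocr_cache WHERE digest = ?",
        (image_digest(image_bytes),),
    ).fetchone()
    return None if row is None else row[0]


def store(conn, image_bytes, score):
    """Remember the OCR score for this exact image."""
    conn.execute(
        "INSERT OR REPLACE INTO ocr_cache (digest, score) VALUES (?, ?)",
        (image_digest(image_bytes), score),
    )
    conn.commit()
```

An image that differs by even one byte misses the cache, which is why the gain would be low; and the stored score still goes stale when the config or words file changes, so drawback c) is untouched.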
RE: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
> The main purpose of the FuzzyOcr's db was of course to avoid computing the OCR passes needed to decode the image text for known images. The problem is that the cache content is not searched for an exact match of the key values (which are image type, width, height, number of colors and color frequencies): it looks for the best match of these values within a given range. This has a number of drawbacks:
>
> a) range search defeats look-up indexing in the db, thereby resulting in browsing the whole db for a match;

Range searching in a database is (can be?) vastly faster than a full table scan. You have to USE the indexes, not just assume an FTS will be required. The database optimizer _SHOULD_ figure this out, but only if it has reasonable statistics and is passed a reasonable query.

I got some SQL from MapQuest once, that had a WHERE clause containing the arithmetic to compute the distance from a given location (a "radius" search). As coded, of course, a Full Table Scan was required and the distance function was evaluated to determine a row's presence in the result set. It was very slow, even for just a few hundred thousand location records.

I indexed the Latitude and Longitude columns, and expressed the query without the radius in the WHERE, but rather the ranges of possible Latitude and Longitude values (in effect, the rectangle that just contained the circle the user desired). The unwanted "corners" of the result set were discarded, rather than every single row outside the desired radius. The performance gains (for normal radiuses, 10, 20, 50, 100 miles) were enormous. The average gain was 100:1 - two orders of magnitude.

MapQuest, of course, wasn't interested. The SQL ran on their clients' machines, not theirs.

If the fuzzyOCR caching method has any merit at all, tuning the SQL and/or the database will provide decent performance.

"Explain Execution Plan" is your friend!

Dan Barker
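The rectangle-then-circle trick described above can be sketched in Python with sqlite3. Everything here is illustrative: the locations table, the composite index and the function names are assumptions, not the original MapQuest SQL, and the sketch ignores wrap-around at the antimeridian.

```python
import math
import sqlite3

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def build_demo_db():
    """Hypothetical table: a regular grid of points with an index on (lat, lon)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE locations (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")
    conn.execute("CREATE INDEX idx_locations_lat_lon ON locations (lat, lon)")
    points = [(38.0 + i * 0.1, -77.0 + j * 0.1) for i in range(41) for j in range(41)]
    conn.executemany("INSERT INTO locations (lat, lon) VALUES (?, ?)", points)
    return conn


def radius_search(conn, lat, lon, radius_miles):
    """Select the bounding rectangle via plain range predicates, then drop the corners."""
    d = radius_miles / EARTH_RADIUS_MILES  # angular radius in radians
    dlat = math.degrees(d)
    cos_lat = math.cos(math.radians(lat))
    sin_d = math.sin(d)
    # Widest longitude offset reached by the circle; near the poles (or for
    # very large radii) fall back to the full longitude range.
    if d < math.pi / 2 and sin_d < cos_lat:
        dlon = math.degrees(math.asin(sin_d / cos_lat))
    else:
        dlon = 180.0
    rows = conn.execute(
        "SELECT id, lat, lon FROM locations "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (lat - dlat, lat + dlat, lon - dlon, lon + dlon),
    ).fetchall()
    # The exact distance test runs only on the few rows inside the rectangle.
    return sorted(r[0] for r in rows if haversine_miles(lat, lon, r[1], r[2]) <= radius_miles)
```

The point of the rewrite is that the WHERE clause contains only column-versus-constant ranges, which an optimizer can serve from an index; the distance arithmetic moves out of the predicate. Running the database's plan explainer on the SELECT (EXPLAIN QUERY PLAN in SQLite) is the way to confirm an index is actually used.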
R: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of snowcrash+sa
>
> hi andy,
>
> > For what it's worth, the fuzzyocr hashing is of very limited value, and in many cases is a severe performance hit. I found that scanning the hashes, due to the "fuzzy" nature, is more costly than just rescanning the file with OCR, as *each* *and* *every* hash must be checked iteratively.
>
> now, *that's* an interesting point to consider.
>
> i'd be interested in what, then, the 'goal' of the hashing/comparison *is*?
>
> is it performance, and it just missed the mark for the reasons you state? or is it something else?

The main purpose of the FuzzyOcr's db was of course to avoid computing the OCR passes needed to decode the image text for known images. The problem is that the cache content is not searched for an exact match of the key values (which are image type, width, height, number of colors and color frequencies): it looks for the best match of these values within a given range. This has a number of drawbacks:

a) range search defeats look-up indexing in the db, thereby resulting in browsing the whole db for a match;

b) range search also increases false positive matches on the db content;

c) the db caches OCR results, so a match on it may return an unwanted/imprecise result if you tweak the FuzzyOcr config and/or words files.

The first drawback may yield high processing times and even timeouts when you have a medium-loaded mail server, the second one is probably the worst problem for most of us, and the last is, well, another problem.

So, yes: FuzzyOCR's cache was meant to increase performance and, yes again, it basically missed the mark.

The solution is to simply discard the cache db and run the OCR phases on each and every image: on most servers, except the least loaded ones, this is the most effective way to deal with it.

Most of us are used to turning glitches off while keeping the good work... :)

Giampaolo

> dunno.
>
> but, your point bears some benchmarking ...
>
> thx!
Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
Andy Dills wrote:
> For what it's worth, the fuzzyocr hashing is of very limited value, and in many cases is a severe performance hit. I found that scanning the hashes, due to the "fuzzy" nature, is more costly than just rescanning the file with OCR, as *each* *and* *every* hash must be checked iteratively.
>
> Because of the "fuzzy" nature, you can't just check the db to "see if this hash exists." You have to go through and compare the generated hash to every hash in the db, and it considers it a match if it's "close enough".
>
> It's severely less computationally expensive to just rescan the damn image. It won't matter if you only get a couple hundred emails per day, but once the number of stored hashes reaches a reasonably low number, it becomes faster to rescan the image than to go through every single stored hash to see if you've already scanned a similar image.

I fully agree. When a fuzzyocr caching database grows beyond a certain (small) size, it becomes a severe penalty, costlier than rescanning images.

snowcrash wrote:
> i'd be interested in what, then, the 'goal' of the hashing/comparison *is*?
> is it performance, and it just missed the mark for the reasons you state? or is it something else?

The desired goal was no doubt a performance increase, but the implementation turned it into a performance drag.

A possible compromise is to ditch the fuzzyocr database every couple of days and let it start anew. This does bring some (limited) benefits.

Mark
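The "ditch the database every couple of days" compromise could be scripted, for instance from cron. A rough Python sketch follows; the Fuzzy*db* pattern is the one used with rm elsewhere in this thread, while the directory argument, the marker file name .focr_last_reset and the function name are assumptions made up for illustration. Whether it is safe to delete the files while the scanner is running is not something this sketch addresses.

```python
import os
import time
from pathlib import Path


def reset_cache_if_due(cache_dir, interval_days=2, pattern="Fuzzy*db*", now=None):
    """Delete the hash db files if the last reset is older than interval_days.

    Returns the names of the files that were removed (empty if not due yet).
    """
    cache_dir = Path(cache_dir)
    marker = cache_dir / ".focr_last_reset"  # hypothetical bookkeeping file
    now = time.time() if now is None else now

    # Not due yet: the marker records when the cache was last wiped.
    if marker.exists() and now - marker.stat().st_mtime < interval_days * 86400:
        return []

    removed = []
    for path in sorted(cache_dir.glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)

    # Record the reset time so the next run can tell whether it is due.
    marker.touch()
    os.utime(marker, (now, now))
    return removed
```

The marker file is used instead of the db's own modification time because an active db is rewritten constantly and would never look old.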
Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
hi andy,

> For what it's worth, the fuzzyocr hashing is of very limited value, and in many cases is a severe performance hit. I found that scanning the hashes, due to the "fuzzy" nature, is more costly than just rescanning the file with OCR, as *each* *and* *every* hash must be checked iteratively.

now, *that's* an interesting point to consider.

i'd be interested in what, then, the 'goal' of the hashing/comparison *is*?

is it performance, and it just missed the mark for the reasons you state? or is it something else?

dunno.

but, your point bears some benchmarking ...

thx!
Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
On Tue, 28 Aug 2007, snowcrash+sa wrote:

> aha!
>
> in FuzzyOcr.cf,
>
> - focr_hashing_learn_scanned 1
> + focr_hashing_learn_scanned 0
>
> then,
>
> rm Fuzzy*db*
> ...
>
> i did not realize that if the HASH pre-exists, then the images' total score hits -- and is reused from the hash db, but that none of the word-hit data is stored/reused.

For what it's worth, the fuzzyocr hashing is of very limited value, and in many cases is a severe performance hit. I found that scanning the hashes, due to the "fuzzy" nature, is more costly than just rescanning the file with OCR, as *each* *and* *every* hash must be checked iteratively.

Because of the "fuzzy" nature, you can't just check the db to "see if this hash exists." You have to go through and compare the generated hash to every hash in the db, and it considers it a match if it's "close enough".

It's severely less computationally expensive to just rescan the damn image. It won't matter if you only get a couple hundred emails per day, but once the number of stored hashes reaches a reasonably low number, it becomes faster to rescan the image than to go through every single stored hash to see if you've already scanned a similar image.

Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---
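To make the cost argument concrete, here is a toy Python sketch of a "close enough" lookup. The feature tuple (width, height, number of colors), the relative tolerance and the function names are assumptions for illustration; FuzzyOcr's real distance function is different and is not reproduced here.

```python
def is_close(stored_key, candidate, tolerance):
    """True if every feature differs by at most `tolerance` (relative)."""
    return all(
        abs(s - c) <= tolerance * max(abs(s), abs(c))
        for s, c in zip(stored_key, candidate)
    )


def fuzzy_lookup(entries, candidate, tolerance=0.05):
    """Walk the stored (key, result) pairs until one is close enough.

    Returns (result, comparisons). A miss always costs len(entries)
    comparisons, because closeness cannot be answered by a single
    equality probe the way an exact key can.
    """
    comparisons = 0
    for stored_key, result in entries:
        comparisons += 1
        if is_close(stored_key, candidate, tolerance):
            return result, comparisons
    return None, comparisons
```

With 1000 stored keys, an unseen image costs 1000 comparisons before the OCR runs anyway, and a merely similar image is reported as a hit.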
R: Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
> -Original Message-
> From: Giampaolo Tomassoni [mailto:[EMAIL PROTECTED]
>
> ...omitted...
>
> I *guess* it is focr_enable_image_hashing, which I already commented out (and thereby disabled hashing) because I was experiencing problems when changing something in my FuzzyOcr.words. This happened some months ago and I can't exactly recall how I did it...
>
> Giampaolo

Ok, I didn't see the post in which you said you had already succeeded in disabling the hash db. Sorry about my redundant reply.

Giampaolo

> > thanks!
R: Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of
>
> hi,
>
> (you 'busted out' of the thread ... replying back in it.)
>
> > Disable the FuzzyOcr's result hash cache on both machines before testing for differences: you are looking at stale results.
> >
> > If these systems cached the results when the version or config of the two FuzzyOcr(s) were not the same, of course you see a difference...
>
> You might have a point.
>
> So, silly question:
>
> HOW do I "Disable the FuzzyOcr's result hash cache"?
>
> commenting out:
>
> #body FUZZY_OCR_KNOWN_HASH eval:dummy_check()
>
> in FuzzyOcr.cf isn't doing it :-/

I *guess* it is focr_enable_image_hashing, which I already commented out (and thereby disabled hashing) because I was experiencing problems when changing something in my FuzzyOcr.words. This happened some months ago and I can't exactly recall how I did it...

Giampaolo

> thanks!
Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
aha!

in FuzzyOcr.cf,

- focr_hashing_learn_scanned 1
+ focr_hashing_learn_scanned 0

then,

rm Fuzzy*db*

now, as expected ...

  18 FUZZY_OCR              BODY: Img with common spam text inside
                            [Words found:]
                            ["investor" in 1 lines]
                            ["price" in 2 lines]
                            ["company" in 1 lines]
                            ["alert" in 1 lines]
                            ["valium" in 1 lines]
                            ["trade" in 1 lines]
                            ["banking" in 1 lines]
                            ["news" in 1 lines]
                            [(13.5 word occurrences found)]

i did not realize that if the HASH pre-exists, then the images' total score hits -- and is reused from the hash db, but that none of the word-hit data is stored/reused.

got it, thx!
Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
hi,

(you 'busted out' of the thread ... replying back in it.)

> Disable the FuzzyOcr's result hash cache on both machines before testing for differences: you are looking at stale results.
>
> If these systems cached the results when the version or config of the two FuzzyOcr(s) were not the same, of course you see a difference...

You might have a point.

So, silly question:

HOW do I "Disable the FuzzyOcr's result hash cache"?

commenting out:

#body FUZZY_OCR_KNOWN_HASH eval:dummy_check()

in FuzzyOcr.cf isn't doing it :-/

thanks!
Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
Hi,

> Unless I've missed something, above is the only difference in the two reports. So that says the only real difference is the lack of the 'words found' report in FuzzyOCR.

Yup. You're correct.

> I think I'd add a line that should end up in the report and say something like "debug report enabled".

In building up the two boxes I'd co'd the plugin setup from a single local Hg repo. And, diff'd the co'd files -- no diff.

Nonetheless, probably a good idea ... thx

> See if that line shows up in both cases. If not, you probably have a problem with whatever is setting the switch. If it does, look at how those word strings are output. The only thing that looks interesting about them is they contain quote marks. Maybe something strange has happened and the quote marks are breaking the debug output in one case.

Digging around as I type ... stay tuned.

> If all else fails, probably best to take this over to the FuzzyOCR mailing list for a while.

Sadly, a bit "lonely" over there these days ... but, will (eventually) do!

Cheers,
R: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On behalf of snowcrash+sa
>
> ...omitted...
>
> 16 FUZZY_OCR_KNOWN_HASH BODY: Image with known hash
>
> ...omitted...
>
> any hints/suggestions as to what i might've missed? how to find it?

Disable the FuzzyOcr's result hash cache on both machines before testing for differences: you are looking at stale results.

If these systems cached the results when the version or config of the two FuzzyOcr(s) were not the same, of course you see a difference...

Giampaolo

> thanks!
Re: two supposedly identical SA boxes, with slightly different report output -- help find the diff?
>   16 FUZZY_OCR_KNOWN_HASH   BODY: Image with known hash
>                             []
>                             [Words found:]
>                             ["investor" in 1 lines]
>                             ["price" in 2 lines]
>                             ["company" in 1 lines]
>                             ["alert" in 1 lines]
>                             ["valium" in 1 lines]
>                             ["trade" in 1 lines]
>                             ["banking" in 1 lines]
>                             ["news" in 1 lines]
>                             [(13.5 word occurrences found)]
>
>   18 FUZZY_OCR_KNOWN_HASH   BODY: Image with known hash
>                             []
>                             [Words found:]
>                             []
>                             [(13.5 word occurrences found)]

Unless I've missed something, above is the only difference in the two reports. So that says the only real difference is the lack of the 'words found' report in FuzzyOCR.

The first thing I'd do is look at the plugin code to see what controls the output of those report lines, and see what it is based on.

I think I'd add a line that should end up in the report and say something like "debug report enabled".

See if that line shows up in both cases. If not, you probably have a problem with whatever is setting the switch. If it does, look at how those word strings are output. The only thing that looks interesting about them is they contain quote marks. Maybe something strange has happened and the quote marks are breaking the debug output in one case.

If all else fails, probably best to take this over to the FuzzyOCR mailing list for a while.

Loren
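For spotting this kind of difference without relying on a weary eye, the two report texts can be compared mechanically. A small Python sketch using the standard difflib module; the function name and the box-a/box-b labels are made up, and it assumes the "Content analysis details" section from each box has been saved as a string.

```python
import difflib


def report_diff(report_a, report_b, name_a="box-a", name_b="box-b"):
    """Return unified-diff lines between two SpamAssassin report texts.

    Runs of whitespace are collapsed and blank lines dropped first, so
    differences in column padding are not reported as changes.
    """
    lines_a = [" ".join(line.split()) for line in report_a.splitlines() if line.strip()]
    lines_b = [" ".join(line.split()) for line in report_b.splitlines() if line.strip()]
    return list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm="", n=0
        )
    )
```

Lines starting with "-" appear only in the first report and lines starting with "+" only in the second; an empty list means the normalized reports are identical.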
two supposedly identical SA boxes, with slightly different report output -- help find the diff?
hi,

grr. i'm at that resorting-to-visine stage of wtf ... :-/

i've

spamassassin --version
SpamAssassin version 3.2.4-r564346
running on Perl version 5.8.8

with, among numerous other rules/plugins, FuzzyOcr/r330 installed.

i've just updated two supposedly identical boxes, building from clean sources, and running the same setup scripts on both. no errors in the installs.

on testing of FuzzyOcr image processing on one of its included test files with,

spamassassin -D -t -x < FuzzyOcr/samples/ocr-animated.eml

i see in the debug output the following report on one box,

...
Content analysis details:   (38.2 points, 4.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 4.2 MID_DEGREES            MID_DEGREES
 3.7 CTYPE_8SPACE_GIF       BODY: Stock spam image part 'Content-Type' found (8 spc)
 0.0 HTML_MESSAGE           BODY: HTML included in message
 1.5 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                            [score: 0.4467]
 1.7 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 2.5 HTML_IMAGE_ONLY_16     BODY: HTML: images with 1200-1600 bytes of words
 1.2 SARE_GIF_ATTACH        FULL: Email has a inline gif
 1.5 MY_CID_AND_STYLE       SARE cid and style
 2.9 DRUGS_STOCK_MIMEOLE    Stock-spam forged headers found (5510)
  16 FUZZY_OCR_KNOWN_HASH   BODY: Image with known hash
                            []
                            [Words found:]
                            ["investor" in 1 lines]
                            ["price" in 2 lines]
                            ["company" in 1 lines]
                            ["alert" in 1 lines]
                            ["valium" in 1 lines]
                            ["trade" in 1 lines]
                            ["banking" in 1 lines]
                            ["news" in 1 lines]
                            [(13.5 word occurrences found)]

and, similarly on the other box,

...
Content analysis details:   (38.5 points, 4.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 3.7 MID_DEGREES            MID_DEGREES
 1.6 CTYPE_8SPACE_GIF       BODY: Stock spam image part 'Content-Type' found (8 spc)
 0.0 HTML_MESSAGE           BODY: HTML included in message
 1.5 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                            [score: 0.4467]
 1.5 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 1.5 HTML_IMAGE_ONLY_16     BODY: HTML: images with 1200-1600 bytes of words
 1.2 SARE_GIF_ATTACH        FULL: Email has a inline gif
 1.5 MY_CID_AND_STYLE       SARE cid and style
 3.5 DRUGS_STOCK_MIMEOLE    Stock-spam forged headers found (5510)
  18 FUZZY_OCR_KNOWN_HASH   BODY: Image with known hash
                            []
                            [Words found:]
                            []
                            [(13.5 word occurrences found)]

NOTE the "words found" detail in the second box's debug output :-/

trying to find what's causing the different output, i've pored over the debug output, googl'd the lists, diff'd the config files, etc. nada.

to my weary eye, all looks "the same". obviously, it's not.

any hints/suggestions as to what i might've missed? how to find it?

thanks!