Re: FuzzyOcr 3.5.1 released

jdow Mon, 08 Jan 2007 04:18:15 -0800

From: "Andy Dills" <[EMAIL PROTECTED]>

On Sun, 7 Jan 2007, Andy Dills wrote:

On Sun, 7 Jan 2007, decoder wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> Hello all,
>
>
> since 3.5.0 RC1 was released, we fixed many bugs, thanks to the many
> testers and bug reporters :) so big thanks.


I have something I'm curious about, having run FuzzyOcr in a medium-sized
(300-400k messages per day) mail cluster for about a week now.

Why do you do database maintenance with every unmatched check?

From Hashing.pm:

        unless ($match) {
            my $then = time - ($conf->{focr_db_max_days}*86400);
--->        $sql = qq(select * from $db.$dbfile order by $dbfile.check);
            my $sth  = $ddb->prepare($sql); $sth->execute;
            while (my @row = $sth->fetchrow_array) {
                my $hash2 = $row[1] || "0:0:0:0";
                $hash2 .= "::$row[0]";
                if (within_threshold($digest,$hash2)) {
                    $txt   = 'Approx';
                    $key   = $row[0];
                    $next  = $row[5] + 1;
                    $when  = $row[7] || $now;

                    $ret   = $dbfile eq $conf->{focr_mysql_hash} ? $row[8] : $row[5];
                    $dinfo = $row[9] || '';
                    infolog("Found[$dbfile]: Score='$row[8]' Info:'$row[9]'");
                    last;
                }
            }
            # Expire old records...
--->        $sql = qq(delete from $db.$dbfile where $dbfile.check < $then);
            debuglog($sql,2);
            $ddb->do($sql);
        }

Those two queries are extremely expensive in a larger environment... I have
commented this code segment out on our cluster, and have written a quick
maintenance script that runs once per day... dropped the response time from
2-3s to .01-.05s on queries, and eliminated the suddenly large
and customer-annoying mail queues.
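
For illustration, a once-per-day expiry job along those lines might look
roughly like the sketch below. It is not the script mentioned above. Python's
sqlite3 module stands in for the MySQL connection, and the function name, the
index, and every column other than `check` are assumptions. Only the delete
from the quoted segment is moved into the job; the select loop is left alone.

```python
import sqlite3
import time


def expire_old_records(conn, table, max_days, now=None):
    """Delete rows whose `check` timestamp is older than max_days days.

    Hypothetical helper: `table` must be a trusted name, because it is
    interpolated into the SQL text rather than bound as a parameter.
    """
    now = time.time() if now is None else now
    cutoff = int(now - max_days * 86400)  # same arithmetic as $then in the quoted code
    # Assumption: an index on `check` lets the delete avoid a full table scan.
    conn.execute(
        f'CREATE INDEX IF NOT EXISTS "{table}_check_idx" ON "{table}" ("check")'
    )
    cur = conn.execute(f'DELETE FROM "{table}" WHERE "check" < ?', (cutoff,))
    conn.commit()
    return cur.rowcount  # number of expired rows
```

Run from cron once a day, this keeps the expiry cost out of the per-message
path; it does nothing about the cost of the select loop.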


Sorry to follow up to my own post, but now that I read this segment a
little closer I realize that I'm basically commenting out the matching
capability of the Hashing mechanism, eliminating all value of the Hashing
in the first place.

So...I guess my point is, unless there is a better way of determining the
match than checking every single hash in the database (hoping that you
find one that is close enough along the way), it's more efficient (in
larger environments at least) to just scan each mail message without
hashing enabled.

Thoughts?

Andy


Hash the hashes and store them in a suitable tree?
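
One way to read "a suitable tree" is a metric tree such as a BK-tree, which
can answer "is anything within this distance?" without visiting every stored
hash. The sketch below is only that, a sketch: the "a:b:c:d" digest layout is
taken from the "0:0:0:0" default in the quoted code, but the Manhattan
distance and all of the names are assumptions, and it does not reproduce
whatever within_threshold() actually computes. A BK-tree needs an
integer-valued distance that obeys the triangle inequality, so the real
comparison would have to be expressible that way.

```python
def parse_digest(text):
    """Turn an 'a:b:c:d' digest string into a tuple of ints."""
    return tuple(int(part) for part in text.split(":"))


def distance(a, b):
    """Manhattan distance between equal-length digests (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


class BKTree:
    """Minimal BK-tree; each node is [digest, {edge_distance: child_node}]."""

    def __init__(self):
        self.root = None

    def add(self, digest):
        if self.root is None:
            self.root = [digest, {}]
            return
        node = self.root
        while True:
            d = distance(digest, node[0])
            if d == 0:
                return  # identical digest already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [digest, {}]
                return
            node = child

    def search(self, digest, radius):
        """Return sorted (distance, stored_digest) pairs within radius."""
        found = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node = stack.pop()
            d = distance(digest, node[0])
            if d <= radius:
                found.append((d, node[0]))
            # Triangle inequality: only subtrees whose edge distance lies in
            # [d - radius, d + radius] can contain a match.
            for edge, child in node[1].items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(found)
```

The tree would be built once from the database and kept in memory, with new
digests added as they are stored. How much of the tree a lookup skips depends
on the threshold and on how the digests are spread out.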

{^_^}
