RE: [Devel-spam] FuzzyOcr 3.5.1 released

Giampaolo Tomassoni Mon, 08 Jan 2007 12:19:26 -0800

From: Dan Barker [mailto:[EMAIL PROTECTED]
> 
> Giampaolo: I hope you succeed.
> 
> I've given up hope on convincing folks (Mapquest in particular) that radius
> searches can be indexed. You needn't pull the lat/long of every single entry
> to run the distance function, and then discard the ones too far away. You
> can index on LAT and LONG and structure the query such that only the
> "possible" lat/long values need the distance function (and the rest of the
> record fetched) evaluated.


Right.
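
For the archive, here is a minimal sketch of the idea Dan describes, not
anything taken from Mapquest or FuzzyOcr. It assumes a hypothetical SQLite
table "places" with an index on (lat, lon): the query first narrows the
candidates to a bounding box that the index can serve, and the exact
great-circle distance is computed only for the rows that survive. The table,
column and function names are invented for illustration, and the box ignores
wrap-around at the 180 degree meridian.

```python
import math
import sqlite3

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def build_demo_db():
    """Create a hypothetical in-memory 'places' table indexed on (lat, lon)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE places (name TEXT, lat REAL, lon REAL)")
    conn.execute("CREATE INDEX idx_places_lat_lon ON places (lat, lon)")
    conn.executemany(
        "INSERT INTO places VALUES (?, ?, ?)",
        [
            ("Frederick", 39.4143, -77.4105),
            ("Baltimore", 39.2904, -76.6122),
            ("Washington", 38.9072, -77.0369),
            ("New York", 40.7128, -74.0060),
        ],
    )
    return conn


def radius_search(conn, lat, lon, radius_km):
    """Return the sorted names of places within radius_km of (lat, lon)."""
    d = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(d)
    if abs(lat) + dlat >= 90.0:
        # The circle reaches a pole: every longitude is a candidate.
        dlon = 180.0
    else:
        # Half-width in longitude of the box enclosing the circle.
        dlon = math.degrees(math.asin(math.sin(d) / math.cos(math.radians(lat))))

    # Step 1: cheap range predicates that the (lat, lon) index can narrow.
    rows = conn.execute(
        "SELECT name, lat, lon FROM places "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (lat - dlat, lat + dlat, lon - dlon, lon + dlon),
    )
    # Step 2: the exact distance function runs only on the box survivors.
    return sorted(
        name
        for name, plat, plon in rows
        if haversine_km(lat, lon, plat, plon) <= radius_km
    )
```

Calling radius_search(build_demo_db(), 39.4143, -77.4105, 100.0) evaluates
the distance only for rows inside the box; rows outside it are never handed
to haversine_km at all.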


> Just because it's two orders of magnitude more efficient doesn't make
> anybody listen.
>
> Same conversation, different universe!

You mean that it is probably a concept too far away from the origin of
someone's comprehensibility space? :)

giampaolo


> Dan
> 
> -----Original Message-----
> From: Giampaolo Tomassoni [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 08, 2007 2:00 PM
> To: [EMAIL PROTECTED]; users@spamassassin.apache.org
> Subject: RE: [Devel-spam] FuzzyOcr 3.5.1 released
> 
> 
> From: Andy Dills [mailto:[EMAIL PROTECTED]
> >
> > ...omitted...
> >
> > > I understand that the "order" keyword in select is potentially
> > > expensive, but necessary because matches occur generally towards the
> > > most recent entries, thus increasing the possibility of a match earlier
> > > on.  When your hash count is in the thousands, earlier matches mean
> > > fewer queries to the database, and potentially faster results.
> >
> > It's not just the order directive, it's the iteration throughout the
> > entire database.
> >
> > Consider when the database grows to >50k records. For a new image that
> > doesn't have a hash, that's 50k records that must be sorted then sent from
> > the DB server to the mail server, then all 50k records must be checked
> > against the hash before we decide that we haven't seen this image before.
> > That just isn't a workable algorithm. If iteration throughout the entire
> > database is a requirement, hashing is a performance hit rather than a
> > performance gain.
> >
> > A better solution might be a separate daemon that holds the hashes in
> > memory, to which you submit the hash being considered.
> 
> Other ways could be the ones depicted in my recent post (Message-ID:
> <[EMAIL PROTECTED]>), in which close images are basically clustered
> together thanks to a surrogate index.
> 
> giampaolo
> 
> >
> > Honestly, I have been extremely impressed with having hashing turned
> > completely off.
> >
> > Andy
> >
> > ---
> > Andy Dills
> > Xecunet, Inc.
> > www.xecu.net
> > 301-682-9972
> > ---
> 
>
