Re: [Devel-spam] FuzzyOcr 3.5.1 released

2007-01-12 Thread A Mennucc
hi

netpbm in Debian is 10.1, so it is too old.

I asked for a newer one in
http://bugs.debian.org/159847

Jim Knuth wrote:
 
 I've forgotten: libnetpbm10-dev (Debian Etch)
 
 


what do you mean by this ?

a.





Re: [Devel-spam] FuzzyOcr 3.5.1 released

2007-01-12 Thread Jim Knuth
Today (12.01.2007, 11:51), A Mennucc wrote:

 hi

 netpbm in Debian is 10.1, so it is too old.

 I asked for a newer one in
 http://bugs.debian.org/159847

 Jim Knuth wrote:
 
 I've forgotten: libnetpbm10-dev (Debian Etch)
 
 


 what do you mean by this ?

I mean: This is also installed. ;)

 a.


-- 
Kind regards,
 Jim Knuth
 [EMAIL PROTECTED]
 ICQ #277289867



Re: [Devel-spam] FuzzyOcr 3.5.1 released

2007-01-09 Thread jdow

Yup - if you are looking for matches within 10 miles, you can do a crude
first comparison on the whole-degree lat/lon values and remove anything
more than two degrees apart. That knocks down your search by 180 times
in each direction, over 3:1 savings right there. If you store all
the data as degrees and fractional degrees, you can remove everything more
than a small fraction of a degree apart.

But for the first cut, storing everything in the grid square 117 to 118
longitude and 34 to 35 latitude in its own part of the tree structure
allows almost instant selection of likely candidates. You could also
use links to store 117 to 118, 34-35 in one box and 117.5-118.5, 34-35 in
another box, noting the overlap in the concept. That means a site right
on a corner or edge of a box isn't lost. Anything like that
which reduces the amount of data that needs to be tested,
even at the expense of cross-linked trees, is a huge savings. You enter
an item into the database once, and that insert does the work of building
the crude region linkages. Then the searches, which are the frequent
operation, can proceed quicker because the excess comparisons are filtered out.
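
A minimal sketch of the grid-square idea, in Python. It is a variant of
what is described above: rather than storing each item in overlapping
boxes, it stores each item once and checks the eight neighbouring squares
at query time, which also keeps sites near an edge or corner from being
lost. The names GridIndex, add, within and haversine_miles are made up for
illustration. It assumes the radius is well under one degree and ignores
wrap-around at +/-180 longitude.

```python
import math
from collections import defaultdict

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, used by the exact check


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points (degrees)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


class GridIndex:
    """Hypothetical index: one bucket per one-degree grid square."""

    def __init__(self):
        self.cells = defaultdict(list)

    def add(self, name, lat, lon):
        # Done once per item, at insert time.
        self.cells[(math.floor(lat), math.floor(lon))].append((name, lat, lon))

    def within(self, lat, lon, radius_miles):
        # Assumes the radius is well under one degree, so the home square
        # plus its 8 neighbours contain every possible match.
        home_lat, home_lon = math.floor(lat), math.floor(lon)
        hits = []
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                bucket = self.cells.get((home_lat + dy, home_lon + dx), ())
                for name, plat, plon in bucket:
                    # Only the few candidates in nearby squares pay for
                    # the exact distance calculation.
                    if haversine_miles(lat, lon, plat, plon) <= radius_miles:
                        hits.append(name)
        return sorted(hits)
```

The overlapping-box layout trades extra storage at insert time for a
single bucket lookup per search; the neighbour lookup here trades the
other way.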

{^_^}
- Original Message - 
From: Dan Barker [EMAIL PROTECTED]




Giampaolo: I hope you succeed.

I've given up hope on convincing folks (Mapquest in particular) that radius
searches can be indexed. You needn't pull the lat/long of every single entry
to run the distance function, and then discard the ones too far away. You
can index on LAT and LONG and structure the query such that only the
possible lat/long values need the distance function (and the rest of the
record fetched) evaluated.

Just because it's two orders of magnitude more efficient doesn't make
anybody listen.

Same conversation, different universe!

Dan






Re: [Devel-spam] FuzzyOcr 3.5.1 released

2007-01-08 Thread Andy Dills
On Mon, 8 Jan 2007, Jorge Valdes wrote:

 I do understand that in large environments, optimizations have to be made in
 order not to kill server performance, and expiration is probably something
 that could be done at more convenient times.  I will commit a script that
 can safely be run as a cronjob soon.

Excellent.

 I understand that the order keyword in select is potentially expensive, but
 necessary because matches occur generally towards the most recent entries,
 thus increasing the possibility of a match earlier on.  When your hash count
 is in the thousands, earlier matches mean fewer queries to the database, and
 potentially faster results.

It's not just the order directive, it's the iteration throughout the 
entire database.

Consider when the database grows to 50k records. For a new image that 
doesn't have a hash, that's 50k records that must be sorted then sent from 
the DB server to the mail server, then all 50k records must be checked 
against the hash before we decide that we haven't seen this image before. 
That just isn't a workable algorithm. If iteration throughout the entire 
database is a requirement, hashing is a performance hit rather than a 
performance gain.

A better solution might be a separate daemon that holds the hashes in 
memory, to which you submit the hash being considered.
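
A rough sketch of the in-memory part of such a daemon, in Python, with the
network side left out. Everything here is an assumption for illustration:
the class name HashCache, the idea that a hash is a short tuple of numbers,
and the relative tolerance used to call two hashes a match are not taken
from FuzzyOcr. Newest entries are checked first, since matches are said to
occur mostly among recent entries, and the oldest entries fall off once
the cache is full.

```python
from collections import deque


class HashCache:
    """Hypothetical in-memory store of image hashes, newest first."""

    def __init__(self, max_entries=50_000, tolerance=0.01):
        # A bounded deque: when full, adding on the left drops the
        # oldest entry from the right, which doubles as expiration.
        self.entries = deque(maxlen=max_entries)
        self.tolerance = tolerance

    def add(self, features, verdict):
        self.entries.appendleft((tuple(features), verdict))

    def _close(self, a, b):
        # Assumed matching rule: every component within a relative tolerance.
        return len(a) == len(b) and all(
            abs(x - y) <= self.tolerance * max(abs(x), abs(y), 1)
            for x, y in zip(a, b)
        )

    def lookup(self, features):
        # Still a linear scan, but over local memory: nothing is sorted
        # or shipped from the DB server for each image.
        features = tuple(features)
        for known, verdict in self.entries:
            if self._close(known, features):
                return verdict
        return None
```

A mail server would submit the hash under consideration to the daemon and
get back either a stored verdict or nothing.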

Honestly, I have been extremely impressed with having hashing turned 
completely off.

Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---


RE: [Devel-spam] FuzzyOcr 3.5.1 released

2007-01-08 Thread Giampaolo Tomassoni
From: Andy Dills [mailto:[EMAIL PROTECTED]
 
 ...snip...

  I understand that the order keyword in select is potentially expensive, but
  necessary because matches occur generally towards the most recent entries,
  thus increasing the possibility of a match earlier on.  When your hash count
  is in the thousands, earlier matches mean fewer queries to the database, and
  potentially faster results.
 
 It's not just the order directive, it's the iteration throughout the 
 entire database.
 
 Consider when the database grows to 50k records. For a new image that
 doesn't have a hash, that's 50k records that must be sorted then sent from
 the DB server to the mail server, then all 50k records must be checked
 against the hash before we decide that we haven't seen this image before.
 That just isn't a workable algorithm. If iteration throughout the entire
 database is a requirement, hashing is a performance hit rather than a
 performance gain.
 
 A better solution might be a separate daemon that holds the hashes in
 memory, to which you submit the hash being considered.

Other ways could be the ones depicted in my recent post (Message-ID:
[EMAIL PROTECTED]), in which close images are basically clustered together
thanks to a surrogate index.

giampaolo

 
 Honestly, I have been extremely impressed with having hashing turned 
 completely off.
 
 Andy
 
 ---
 Andy Dills
 Xecunet, Inc.
 www.xecu.net
 301-682-9972
 ---



RE: [Devel-spam] FuzzyOcr 3.5.1 released

2007-01-08 Thread Dan Barker
Giampaolo: I hope you succeed.

I've given up hope on convincing folks (Mapquest in particular) that radius
searches can be indexed. You needn't pull the lat/long of every single entry
to run the distance function, and then discard the ones too far away. You
can index on LAT and LONG and structure the query so that the distance
function is evaluated (and the rest of the record fetched) only for the
lat/long values that could possibly be in range.
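
As a hedged illustration of that kind of query, here is a small Python
sketch using the standard sqlite3 module. The table name places, its
columns, and the helpers make_demo_db and nearby are invented for the
example. The WHERE clause restricts rows to a bounding box that an index
on (lat, lon) can serve, and the exact distance is computed only for the
rows that survive. It ignores wrap-around at +/-180 longitude and is not
meant for searches near the poles.

```python
import math
import sqlite3

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points (degrees)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def make_demo_db(rows):
    """Build a throwaway in-memory table with an index on lat and lon."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE places (name TEXT, lat REAL, lon REAL)")
    conn.execute("CREATE INDEX places_lat_lon ON places (lat, lon)")
    conn.executemany("INSERT INTO places VALUES (?, ?, ?)", rows)
    return conn


def nearby(conn, lat, lon, radius_miles):
    # Half-height of the bounding box in degrees of latitude.
    dlat = math.degrees(radius_miles / EARTH_RADIUS_MILES)
    # Degrees of longitude shrink away from the equator; use the box edge
    # farthest from it so the box is never too narrow.
    widest_lat = min(abs(lat) + dlat, 89.0)
    dlon = dlat / math.cos(math.radians(widest_lat))
    rows = conn.execute(
        "SELECT name, lat, lon FROM places "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (lat - dlat, lat + dlat, lon - dlon, lon + dlon),
    ).fetchall()
    # Exact distance only for the candidates inside the box.
    return sorted(
        name for name, plat, plon in rows
        if haversine_miles(lat, lon, plat, plon) <= radius_miles
    )
```

The box is slightly larger than the circle, so the final distance check is
still needed to drop the corners.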

Just because it's two orders of magnitude more efficient doesn't make
anybody listen.

Same conversation, different universe!

Dan





RE: [Devel-spam] FuzzyOcr 3.5.1 released

2007-01-08 Thread Giampaolo Tomassoni
From: Dan Barker [mailto:[EMAIL PROTECTED]
 
 Giampaolo: I hope you succeed.
 
 I've given up hope on convincing folks (Mapquest in particular) that radius
 searches can be indexed. You needn't pull the lat/long of every single entry
 to run the distance function, and then discard the ones too far away. You
 can index on LAT and LONG and structure the query such that only the
 possible lat/long values need the distance function (and the rest of the
 record fetched) evaluated.

Right.


 Just because it's two orders of magnitude more efficient doesn't make
 anybody listen.

 Same conversation, different universe!

You mean that it is probably a concept too far away from the origin of
someone's comprehensibility space? :)

giampaolo


 Dan
 