Re: [Devel-spam] FuzzyOcr 3.5.1 released
hi,

netpbm in Debian is 10.1, so it is too old. I asked for a newer one in
http://bugs.debian.org/159847

Jim Knuth wrote:
> I've forgotten: libnetpbm10-dev (Debian Etch)

what do you mean by this?

a.
Re: [Devel-spam] FuzzyOcr 3.5.1 released
Today (12.01.2007, 11:51) A Mennucc wrote:
> hi,
>
> netpbm in Debian is 10.1, so it is too old. I asked for a newer one in
> http://bugs.debian.org/159847
>
> Jim Knuth wrote:
>> I've forgotten: libnetpbm10-dev (Debian Etch)
>
> what do you mean by this?

I mean: this is also installed. ;)

a.

--
Kind regards,
Jim Knuth [EMAIL PROTECTED]
ICQ #277289867

-- Random quote --
Liberty means responsibility; that is why most men dread it.
(George Bernard Shaw, Irish dramatist, 1856-1950)
-- The quote has nothing to do with the recipient of this mail --
Re: [Devel-spam] FuzzyOcr 3.5.1 released
Yup - if you are looking for within 10 miles, you can perform a raw comparison on the lat-lon degree numbers to remove anything more than two degrees apart. That knocks down your search by 180 times in each direction, over 3:1 savings right there. If you store all the data as degree and fractional degree, you can remove everything more than a small fraction of a degree apart. But for the first cut, storing everything in the grid square 117 to 118 longitude and 34 to 35 latitude in its own part of the tree structure allows almost instant selection of likely candidates.

You could also use links to store 117 to 118, 34-35 in one box and 117.5 to 118.5, 34-35 in another box - noting the overlap in the concept. That means a site right on a corner or edge of a criterion marker isn't lost. Anything like that which can be used to reduce the amount of data that needs to be tested, even at the expense of cross-linked trees, is a huge savings. You enter an item into the database once, and that performs the searches for the crude region linkages. Then the searches - the frequent operation - can proceed more quickly by filtering out excess candidates.

{^_^}

- Original Message -
From: Dan Barker [EMAIL PROTECTED]

Giampaolo:

I hope you succeed. I've given up hope on convincing folks (Mapquest in particular) that radius searches can be indexed. You needn't pull the lat/long of every single entry to run the distance function and then discard the ones too far away. You can index on LAT and LONG and structure the query such that only the possible lat/long values need the distance function evaluated (and the rest of the record fetched).

Just because it's two orders of magnitude more efficient doesn't make anybody listen. Same conversation, different universe!

Dan

-----Original Message-----
From: Giampaolo Tomassoni [mailto:[EMAIL PROTECTED]

From: Andy Dills [mailto:[EMAIL PROTECTED]
...omissis...

>> I understand that the order keyword in select is potentially expensive, but necessary because matches occur generally towards the most recent entries, thus increasing the possibility of a match earlier on. When your hash count is in the thousands, earlier matches mean fewer queries to the database, and potentially faster results.
>
> It's not just the order directive, it's the iteration throughout the entire database. Consider when the database grows to 50k records. For a new image that doesn't have a hash, that's 50k records that must be sorted and then sent from the DB server to the mail server; then all 50k records must be checked against the hash before we decide that we haven't seen this image before. That just isn't a workable algorithm.
>
> If iteration throughout the entire database is a requirement, hashing is a performance hit rather than a performance gain. A better solution might be a separate daemon that holds the hashes in memory, to which you submit the hash being considered.

Other ways could be the ones depicted in my recent post (Message-ID: [EMAIL PROTECTED]), in which close images are basically clustered together thanks to a surrogate index.

giampaolo

> Honestly, I have been extremely impressed with having hashing turned completely off.
>
> Andy
>
> ---
> Andy Dills
> Xecunet, Inc.
> www.xecu.net
> 301-682-9972
> ---
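The grid-square bucketing described above can be sketched in a few lines of Python. This is an illustrative sketch only (the names `add_site`, `sites_near`, and the sample cities are made up, not anything from the thread): sites are bucketed by their whole-degree (lat, lon) cell, so a radius query only scans the few cells overlapping the search window before running the exact distance test.

```python
import math
from collections import defaultdict

EARTH_RADIUS_MI = 3959.0

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in statute miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))

# One bucket per whole-degree grid square - the "box" from the message.
grid = defaultdict(list)

def add_site(lat, lon, name):
    grid[(math.floor(lat), math.floor(lon))].append((lat, lon, name))

def sites_near(lat, lon, radius_mi):
    # ~69 miles per degree of latitude; widen the window of cells enough to
    # cover the radius. (Simplification: longitude degrees shrink toward the
    # poles, so a production version would widen the longitude span further.)
    span = int(radius_mi / 69.0) + 1
    base_lat, base_lon = math.floor(lat), math.floor(lon)
    hits = []
    for dlat in range(-span, span + 1):
        for dlon in range(-span, span + 1):
            for slat, slon, name in grid.get((base_lat + dlat, base_lon + dlon), []):
                # Exact distance test only on the handful of bucketed candidates.
                if haversine_miles(lat, lon, slat, slon) <= radius_mi:
                    hits.append(name)
    return hits
```

Scanning 2*span+1 cells in each direction also covers the corner/edge case the message raises about sites near a box boundary, without needing the overlapping-box links.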
Re: [Devel-spam] FuzzyOcr 3.5.1 released
On Mon, 8 Jan 2007, Jorge Valdes wrote:

> I do understand that in large environments, optimizations have to be made in order not to kill server performance, and expiration is probably something that could be done at more convenient times. I will commit a script that can safely be run as a cronjob soon.

Excellent.

> I understand that the order keyword in select is potentially expensive, but necessary because matches occur generally towards the most recent entries, thus increasing the possibility of a match earlier on. When your hash count is in the thousands, earlier matches mean fewer queries to the database, and potentially faster results.

It's not just the order directive, it's the iteration throughout the entire database. Consider when the database grows to 50k records. For a new image that doesn't have a hash, that's 50k records that must be sorted and then sent from the DB server to the mail server; then all 50k records must be checked against the hash before we decide that we haven't seen this image before. That just isn't a workable algorithm.

If iteration throughout the entire database is a requirement, hashing is a performance hit rather than a performance gain. A better solution might be a separate daemon that holds the hashes in memory, to which you submit the hash being considered.

Honestly, I have been extremely impressed with having hashing turned completely off.

Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---
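The in-memory alternative Andy suggests can be sketched as below. This is a hypothetical illustration, not FuzzyOcr's actual code (the name `check_and_record` and the choice of MD5 are assumptions): held in a set, the digests give constant-time membership checks with no sorting and no per-message transfer of the whole table, regardless of whether there are 50 or 50k of them.

```python
import hashlib

seen_hashes = set()  # all previously seen image digests, held in memory

def check_and_record(image_bytes):
    """Return True if this image was seen before; record it otherwise."""
    digest = hashlib.md5(image_bytes).hexdigest()
    if digest in seen_hashes:
        return True   # known image: a cached verdict could be reused
    seen_hashes.add(digest)
    return False      # new image: run OCR, then its result is cached
```

A standalone daemon would wrap this in a small socket protocol and persist the set across restarts, but the lookup itself stays O(1) per message.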
RE: [Devel-spam] FuzzyOcr 3.5.1 released
From: Andy Dills [mailto:[EMAIL PROTECTED]
...omissis...

>> I understand that the order keyword in select is potentially expensive, but necessary because matches occur generally towards the most recent entries, thus increasing the possibility of a match earlier on. When your hash count is in the thousands, earlier matches mean fewer queries to the database, and potentially faster results.
>
> It's not just the order directive, it's the iteration throughout the entire database. Consider when the database grows to 50k records. For a new image that doesn't have a hash, that's 50k records that must be sorted and then sent from the DB server to the mail server; then all 50k records must be checked against the hash before we decide that we haven't seen this image before. That just isn't a workable algorithm.
>
> If iteration throughout the entire database is a requirement, hashing is a performance hit rather than a performance gain. A better solution might be a separate daemon that holds the hashes in memory, to which you submit the hash being considered.

Other ways could be the ones depicted in my recent post (Message-ID: [EMAIL PROTECTED]), in which close images are basically clustered together thanks to a surrogate index.

giampaolo

> Honestly, I have been extremely impressed with having hashing turned completely off.
>
> Andy
>
> ---
> Andy Dills
> Xecunet, Inc.
> www.xecu.net
> 301-682-9972
> ---
RE: [Devel-spam] FuzzyOcr 3.5.1 released
Giampaolo:

I hope you succeed. I've given up hope on convincing folks (Mapquest in particular) that radius searches can be indexed. You needn't pull the lat/long of every single entry to run the distance function and then discard the ones too far away. You can index on LAT and LONG and structure the query such that only the possible lat/long values need the distance function evaluated (and the rest of the record fetched).

Just because it's two orders of magnitude more efficient doesn't make anybody listen. Same conversation, different universe!

Dan

-----Original Message-----
From: Giampaolo Tomassoni [mailto:[EMAIL PROTECTED]
Sent: Monday, January 08, 2007 2:00 PM
To: [EMAIL PROTECTED]; users@spamassassin.apache.org
Subject: RE: [Devel-spam] FuzzyOcr 3.5.1 released

From: Andy Dills [mailto:[EMAIL PROTECTED]
...omissis...

>> I understand that the order keyword in select is potentially expensive, but necessary because matches occur generally towards the most recent entries, thus increasing the possibility of a match earlier on. When your hash count is in the thousands, earlier matches mean fewer queries to the database, and potentially faster results.
>
> It's not just the order directive, it's the iteration throughout the entire database. Consider when the database grows to 50k records. For a new image that doesn't have a hash, that's 50k records that must be sorted and then sent from the DB server to the mail server; then all 50k records must be checked against the hash before we decide that we haven't seen this image before. That just isn't a workable algorithm.
>
> If iteration throughout the entire database is a requirement, hashing is a performance hit rather than a performance gain. A better solution might be a separate daemon that holds the hashes in memory, to which you submit the hash being considered.

Other ways could be the ones depicted in my recent post (Message-ID: [EMAIL PROTECTED]), in which close images are basically clustered together thanks to a surrogate index.

giampaolo

> Honestly, I have been extremely impressed with having hashing turned completely off.
>
> Andy
>
> ---
> Andy Dills
> Xecunet, Inc.
> www.xecu.net
> 301-682-9972
> ---
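Dan's point about indexable radius searches can be sketched with SQLite. The table, column names, and sample rows are illustrative, not from any real system: a composite index on (lat, lon) lets the BETWEEN clauses narrow the candidate rows, and the exact distance test runs only on those survivors instead of on every record.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (name TEXT, lat REAL, lon REAL)")
conn.execute("CREATE INDEX idx_sites_lat_lon ON sites (lat, lon)")
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?)",
    [("LA", 34.05, -118.24), ("Pasadena", 34.15, -118.14), ("SF", 37.77, -122.42)],
)

def radius_search(lat, lon, radius_mi):
    # Bounding box in degrees: ~69 miles per degree of latitude, and the
    # longitude window widened by 1/cos(latitude) since meridians converge.
    dlat = radius_mi / 69.0
    dlon = radius_mi / (69.0 * math.cos(math.radians(lat)))
    # The indexed BETWEEN constraints do the cheap filtering on the DB side.
    rows = conn.execute(
        "SELECT name, lat, lon FROM sites "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (lat - dlat, lat + dlat, lon - dlon, lon + dlon),
    ).fetchall()
    # Exact (here: flat-earth approximate) distance test on candidates only.
    hits = []
    for name, slat, slon in rows:
        dy = (slat - lat) * 69.0
        dx = (slon - lon) * 69.0 * math.cos(math.radians(lat))
        if math.hypot(dx, dy) <= radius_mi:
            hits.append(name)
    return hits
```

Only the rows inside the box ever reach the distance function, which is the two-orders-of-magnitude win Dan describes once the table is large.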
RE: [Devel-spam] FuzzyOcr 3.5.1 released
From: Dan Barker [mailto:[EMAIL PROTECTED]

> Giampaolo:
>
> I hope you succeed. I've given up hope on convincing folks (Mapquest in particular) that radius searches can be indexed. You needn't pull the lat/long of every single entry to run the distance function and then discard the ones too far away. You can index on LAT and LONG and structure the query such that only the possible lat/long values need the distance function evaluated (and the rest of the record fetched).

Right.

> Just because it's two orders of magnitude more efficient doesn't make anybody listen. Same conversation, different universe!

You mean that it is probably a concept too far away from the origin of someone's comprehensibility space? :)

giampaolo

> Dan

-----Original Message-----
From: Giampaolo Tomassoni [mailto:[EMAIL PROTECTED]
Sent: Monday, January 08, 2007 2:00 PM
To: [EMAIL PROTECTED]; users@spamassassin.apache.org
Subject: RE: [Devel-spam] FuzzyOcr 3.5.1 released

From: Andy Dills [mailto:[EMAIL PROTECTED]
...omissis...

>> I understand that the order keyword in select is potentially expensive, but necessary because matches occur generally towards the most recent entries, thus increasing the possibility of a match earlier on. When your hash count is in the thousands, earlier matches mean fewer queries to the database, and potentially faster results.
>
> It's not just the order directive, it's the iteration throughout the entire database. Consider when the database grows to 50k records. For a new image that doesn't have a hash, that's 50k records that must be sorted and then sent from the DB server to the mail server; then all 50k records must be checked against the hash before we decide that we haven't seen this image before. That just isn't a workable algorithm.
>
> If iteration throughout the entire database is a requirement, hashing is a performance hit rather than a performance gain. A better solution might be a separate daemon that holds the hashes in memory, to which you submit the hash being considered.

Other ways could be the ones depicted in my recent post (Message-ID: [EMAIL PROTECTED]), in which close images are basically clustered together thanks to a surrogate index.

giampaolo

> Honestly, I have been extremely impressed with having hashing turned completely off.
>
> Andy
>
> ---
> Andy Dills
> Xecunet, Inc.
> www.xecu.net
> 301-682-9972
> ---