Re: similarity detection

2002-10-09 Thread Chris Devers
On Tue, 8 Oct 2002, Tim Sweetman wrote: Well, sort of - search engines find documents to fit certain criteria; this tries to find documents similar to other documents. Arguably part of the same problem space though. I don't know where you can find them anymore [ironically], but when it was

Re: similarity detection

2002-10-09 Thread Mike Jarvis
On Wed, Oct 09, 2002 at 07:48:32AM -0400, Chris Devers wrote: On Tue, 8 Oct 2002, Tim Sweetman wrote: Well, sort of - search engines find documents to fit certain criteria; this tries to find documents similar to other documents. Arguably part of the same problem space though. I don't

Re: similarity detection

2002-10-09 Thread Alex McLintock
At 10:05 09/10/02, Andy Wardley wrote: I even wrote some Perl code to test it out... but I can't find that either. I'm *sure* I read it... I *wasn't* just dreaming... I think there may be something wrong when you start dreaming algorithms and perl.. :-) Alex Openweb Analysts Ltd,

Re: similarity detection

2002-10-09 Thread Belden Lyman
Chris Devers wrote: On Tue, 8 Oct 2002, Tim Sweetman wrote: Well, sort of - search engines find documents to fit certain criteria; this tries to find documents similar to other documents. Arguably part of the same problem space though. I don't know where you can find them anymore

Re: similarity detection

2002-10-09 Thread Paul Makepeace
On Wed, Oct 09, 2002 at 05:26:31PM +0100, Alex McLintock wrote: At 10:05 09/10/02, Andy Wardley wrote: I even wrote some Perl code to test it out... but I can't find that either. I'm *sure* I read it... I *wasn't* just dreaming... I think there may be something wrong when you start

Re: similarity detection

2002-10-09 Thread Paul Makepeace
On Tue, Oct 08, 2002 at 11:23:59AM +0100, nemesis wrote: Hello again, I have a database (MySQL) full of variable length text fields (average about 1500 characters, 250 words). Currently there are about 250 fields, but I hope this to expand to as many as possible (it is an online joke

Re: similarity detection

2002-10-08 Thread nemesis
alex wrote: probably completely crap but following is an approach i have been thinking about for a while and have been looking for the right soft/textual dataset to try it out on. Thanks everyone for the suggestions. I certainly have some more ideas to work on. Will

Re: similarity detection

2002-10-08 Thread Shevek
On Tue, 8 Oct 2002, alex wrote: so, in your particular example you could try a 26 dimensional space where each dimension is the frequency of a particular letter in the alphabet. if This will fail for the same reason that this is a crappy hash algorithm. All English sentences tend to have the

Re: similarity detection

2002-10-08 Thread Shevek
On Tue, 8 Oct 2002, alex wrote: indeed - i seem to vaguely remember that i didn't use the sqrt in my postal sector[0] comparisons (it was to calculate nearest specsavers retail outlets to a postcode) and sql looked something like this: Metric space theory tells you that your distance

Re: similarity detection

2002-10-08 Thread Ben
On Tue, Oct 08, 2002 at 12:11:38PM +0100, Shevek wrote: On Tue, 8 Oct 2002, alex wrote: indeed - i seem to vaguely remember that i didn't use the sqrt in my postal sector[0] comparisons (it was to calculate nearest specsavers retail outlets to a postcode) and sql looked something like

Re: similarity detection

2002-10-08 Thread Shevek
On Tue, 8 Oct 2002, Ben wrote: On Tue, Oct 08, 2002 at 12:11:38PM +0100, Shevek wrote: Metric space theory tells you that your distance computation is valid whether you square or not. It's still a valid metric. The unit ball is a slightly different shape ... Nonsense. d(x,y) = (x1
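
A small sketch of the point under dispute, not taken from the thread: dropping the sqrt keeps the nearest-first ordering of points, because squaring is monotonic on non-negative numbers, but the squared value can break the triangle inequality. The function names and sample points here are made up for illustration.

```python
import math

def squared_distance(p, q):
    """Sum of squared coordinate differences (no sqrt)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def distance(p, q):
    """Ordinary Euclidean distance."""
    return math.sqrt(squared_distance(p, q))

# Hypothetical sample data: rank three points by closeness to the origin.
origin = (0.0, 0.0)
points = [(3.0, 4.0), (1.0, 1.0), (0.0, 2.0)]

by_distance = sorted(points, key=lambda p: distance(origin, p))
by_squared = sorted(points, key=lambda p: squared_distance(origin, p))

# Three collinear points on a line: 0, 1 and 2.
a, b, c = (0.0,), (1.0,), (2.0,)
triangle_holds_squared = (
    squared_distance(a, c) <= squared_distance(a, b) + squared_distance(b, c)
)
triangle_holds_plain = distance(a, c) <= distance(a, b) + distance(b, c)
```

For a nearest-outlet lookup like the one alex describes, only the ordering matters, so skipping the sqrt is harmless there; whether the squared value counts as a metric is a separate question.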

Re: similarity detection

2002-10-08 Thread Paul Makepeace
On Tue, Oct 08, 2002 at 11:43:47AM +0100, alex wrote: sqrt( (x0-x1)^2 + (y0-y1)^2 + (z0-z1)^2) so, in your particular example you could try a 26 dimensional space where each dimension is the frequency of a particular letter in the alphabet. if I think you will find that this
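
The 26-dimensional idea quoted above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names are invented, the frequencies are normalised by text length (the thread does not say whether raw counts or proportions were meant), and anything outside a-z is ignored.

```python
import math
import string
from collections import Counter

def letter_frequencies(text):
    """Return a 26-element vector of relative letter frequencies (a-z)."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 26
    return [counts[ch] / total for ch in string.ascii_lowercase]

def euclidean_distance(u, v):
    """sqrt of the sum of squared per-dimension differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def text_distance(a, b):
    """Distance between two texts in letter-frequency space."""
    return euclidean_distance(letter_frequencies(a), letter_frequencies(b))
```

Note that `text_distance("abc", "cab")` is zero, since letter order is discarded entirely; any two texts with the same letter proportions look identical under this measure.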