Re: Comparing 2 similar strings?

2005-05-28 Thread Rob W. W. Hooft
Irmen de Jong wrote:
> Wouldn't it be easier if it is normalized to a float between 0.0 and 1.0?

Maybe. You could do that by ignoring "negative" values, and by dividing 
by min(len(s1),len(s2))+1. For my application this is irrelevant, I only 
need a scale to compare a single word to many different words, and pick 
the one that stands out. I can relate the (unnormalized) score to the 
number of typos in the word.

> Furthermore, I can also produce wrong(?) results:
> 
> $ python comparestrings.py
> s1: test
> s2: x
> Score: -0.4
> 
> Minus 0.4... Is this a bug?

Not in the procedure, it is a bug in my description. It was late at 
night when I wrote it. What you are doing here is aligning the two words 
like this:

test
--x-

the t opens a gap (score -0.3), e elongates the gap (-0.2), s and x are 
adjacent on the keyboard (score 0.4) and t opens a gap (-0.3). Total 
score is -0.4. If you compare "test" with "l" you even get a score of 
-0.7. You can make arbitrarily low numbers by comparing long strings of 
"a" characters with a single "l".

What this is telling you is that not only are the letters all different 
(score 0.0), but the strings are not even equally long (hence <0.0). The 
gap-open and gap-elongation penalties can be adjusted in the module. The 
-0.3 and -0.2 values were set after some experimentation to make the 
relative scores of different matches "feel" natural.

-- 
Rob W.W. Hooft  ||  [EMAIL PROTECTED]  ||  http://www.hooft.net/people/rob/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-28 Thread Irmen de Jong
Rob W. W. Hooft wrote:
> After reading this thread, I have wrapped up a different approach, 
> probably not what you were looking for, but it is very good for what I 
> wanted: comparing a command string typed by a user with all possible 
> commands a program can accept, to be able to do typo-correction. The 
> method will return a floating point number between 0.0 (no match) and 
> len(s)+1.0 (if the strings are exactly equal). You can see the score as 
> "the number of equal letters".
[...]
Interesting, but the result is a bit puzzling.
Wouldn't it be easier if it is normalized to a float between 0.0 and 1.0?

Furthermore, I can also produce wrong(?) results:

$ python comparestrings.py
s1: test
s2: x
Score: -0.4

Minus 0.4... Is this a bug?

--Irmen
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-28 Thread Rob W. W. Hooft
After reading this thread, I have wrapped up a different approach, 
probably not what you were looking for, but it is very good for what I 
wanted: comparing a command string typed by a user with all possible 
commands a program can accept, to be able to do typo-correction. The 
method will return a floating point number between 0.0 (no match) and 
len(s)+1.0 (if the strings are exactly equal). You can see the score as 
"the number of equal letters".

I put the module at http://starship.python.net/crew/hooft/
It is called comparestrings.py.

It is a comparison algorithm as implemented by Reinhard Schneider and 
Chris Sander for comparison of protein sequences, but implemented to 
compare two ASCII strings.

The comparison makes use of a similarity matrix for letters: in the 
protein case this is normally a chemical functionality similarity, for 
our case this is a matrix based on keys next to each other on a US 
Qwerty keyboard and on "capital letters are similar to their lowercase 
equivalent"

The algorithm does not cut corners: it is sure to find the absolute best 
match between the two strings.

No attempt has been made to make this efficient in time or memory. Time 
taken and memory used is proportional to the product of the lengths of 
the input strings. Use for strings longer than 25 characters is entirely 
for your own risk.

Regards,

Rob Hooft
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-25 Thread Roger Binns
<[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> ***Check out difflib, it's in the library.***  Perfect package for what
> the OP wants AFAICT.

The method in difflib is okay, but doesn't do that good a job.  It
is also relatively slow.  My need for this was matching records in
BitPim (eg which entries just read from a cell phone correspond to
which entries just read from Outlook)

Here are some links for the OP as well as ready to run code at the
end.

A paper on various algorithms and their effectiveness on various data
sets.

http://www.isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf

They do have a Java library of all the algorithms by the same authors.

http://secondstring.sourceforge.net/

There is a group at the Australian National University with a project named
Febrl - Freely extensible biomedical record linkage.  The idea is to be able
to work out if two records from the same or different sources are the same
person.  All their code is in Python:

http://datamining.anu.edu.au/software/febrl/febrldoc/

The solution I settled on is using Jaro-Winkler. This is a quote from the
paper above:

  A surprisingly good distance metric is a fast heuristic scheme, proposed
  by Jaro (Jaro 1995; 1989) and later extended by Winkler (Winkler 1999).
  This works almost as well as the Monge-Elkan scheme, but
  is an order of magnitude faster

I did find a Python module that implemented this (it is part of the
python-levenshtein package).  Unfortunately there were some issues
at the time dealing with some strings that may be unicode and some
not.  There was one implementation I saw that took copies of both
strings and was replacing characters with asterisks.  My data sets
include meaningful asterisks so that seriously munged things.

Ultimately I wrote my own implementation with the following properties:

 - The arguments can be any combination of unicode and ordinary strings
 - There are no special characters nor ones used internally as replacements
 - A bitset is used for tracking so you use an eigth of the memory
 - There is both Python and C code implementations so you don't have
   to compile anything but can if you want

The code can be grabbed from:

http://cvs.sf.net/viewcvs.py/bitpim/bitpim/native/strings/

There is a detailed description at the begining of the Python implementation:

http://cvs.sf.net/viewcvs.py/bitpim/bitpim/native/strings/jarowpy.py?view=markup

Roger 


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-25 Thread nicolas . lehuen
William Park a écrit :
> How do you compare 2 strings, and determine how much they are "close" to
> each other?  Eg.
> aqwerty
> qwertyb
> are similar to each other, except for first/last char.  But, how do I
> quantify that?
>
> I guess you can say for the above 2 strings that
> - at max, 6 chars out of 7 are same sequence --> 85% max
>
> But, for
> qawerty
> qwerbty
> max correlation is
> - 3 chars out of 7 are the same sequence --> 42% max
>
> (Crossposted to 3 of my favourite newsgroup.)

Hi,

If you want to use phonetic comparison, here are some algorithms that
are reportedly more efficient than Soundex :

Double-Metaphone
NYSIIS
Phonex

Of course, phonetic algorithms have a lot of disadvantages, the main
one being that they know about one way to pronounce words (usually a
rather rigid, anglo-saxon way) which may not be the right way (hence
the examples given before for Gaellic surnames). But these ones are far
"better" than soundex.

Regards,

Nicolas Lehuen

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-23 Thread Oleg Paraschenko
Hello,

Scott David Daniels <[EMAIL PROTECTED]> wrote in message
news:<[EMAIL PROTECTED]>...
> William Park wrote:
> > How do you compare 2 strings, and determine how much they are "close" to
> > each other?  
> 
> Here's a really weird idea:  Measure the size difference between the
> pair of strings compressed together and compressed separately.
> 
  The idea isn't weird. The only problem is that naive approach
failed. compress(a,b) != compress(b,a). Having an assumption
that compressing is a good approximation for the Kolmogorov
complexity, the correct formula is a bit more complicated.

-- 
olpa@  http://datahansa.com/  XML publishing and documentation management
http://uucode.com/blog/  Generative Programming, XML, TeX, Scheme
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-23 Thread Oleg Paraschenko
Hello William,

William Park <[EMAIL PROTECTED]> wrote in message
news:<[EMAIL PROTECTED]>...
> How do you compare 2 strings, and determine how much they are "close" to
> each other?
> ...

  If your strings are between 1 and 16 Kb, look at GetReuse SDK:

http://getreuse.com/sdk/

  It has Perl and Python bindings. You might be interested in the
latter as you posted to comp.lang.python.

  I can't disclosure the algorithm. Here is an excerpt from the
home page: The formula for the calculation of the similarity is
based on the scientific research. Any other "good" method of
calculations should produce results that are equivalent in some
terms to the GetReuse results.

-- 
olpa@  http://datahansa.com/  XML publishing and documentation management
http://uucode.com/blog/  Generative Programming, XML, TeX, Scheme
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-20 Thread Skip Montanaro

Steve> (is this the same as 'Conchobar'?)

No, that's a trendy pub in Key West...



Skip
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-20 Thread Steve Holden
Chris Croughton wrote:
> On Thu, 19 May 2005 06:38:59 +1000, John Machin 
><[EMAIL PROTECTED]> wrote:
> 
> 
>>On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <[EMAIL PROTECTED]>
>>wrote:
>>
>>
>>>William Park wrote:
>>>
>>>
How do you compare 2 strings, and determine how much they are "close" to
each other?  Eg.
aqwerty
qwertyb
are similar to each other, except for first/last char.  But, how do I
quantify that?
>>>
>>>"However you like" is probably the right answer, but one way might be to 
>>>compare their soundex encoding 
>>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out 
>>>percentage difference based on comparing the numeric part.
>>
>>Fantastic suggestion. Here's a tiny piece of real-life test data:
>>
>>compare the surnames "Mousaferiadis" and "McPherson".
> 
> 
> I remember a word processor, MultiMate, which used Soundex to do
> matching for its suggestions for spelling correction.  One of my
> cow-orkers typed the word 'agains' -- it was supposed to be 'against'
> but 'again' would also have been a sensible suggestion.  MultiMate,
> however, suggested neither of those reasonable words, it did suggest
> "iguanas" amd "Utahns"...
> 
> (I wonder what it does with "Talliafero" and "Tolliver", and with
> "Featherstone-Howe" and "Fanshaw"...)
> 
> The answer to the OP, which someone just pointed out to me on
> comp.programming, is "edit distance" or "Levenshtein Distance"[1].  A
> google search on either produces some good descriptions in the first few
> links, including http://www.merriampark.com/ld.htm which has not only a
> description of the algorithm but also source code in Java, C++ and
> Visual Basic (no Awk, thought there are links to pages with
> implementations in Perl, Python, Delphi and many more)...
> 
> [1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein'
> in Schwaebisch (south German) fashion, but apparently the author of the
> algorithm was one Vladimir I. Levenshtein and that is how he is credited
> on the IEEE site.  There are also a number of Google hits under the
> 'stein' spelling, though...
> 
> Chris C

And what's the Levenshtein distance between "levenshtein" and 
"levenstein"? Ironic if it were large ...

regards
  Steve
-- 
Steve Holden+1 703 861 4237  +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-20 Thread Steve Holden
Dennis Lee Bieber wrote:
> On Wed, 18 May 2005 20:03:53 -0500, Ed Morton <[EMAIL PROTECTED]>
> declaimed the following in comp.lang.python:
> 
> 
> 
>>Fantastic test data set. I know how to pronounce McPherson but I'd never 
>>have guessed that Mousaferiadis sounds like it. I suppose non-Celts 
>>probably wouldn't be able to guess how Dalziell, Drumnadrochit, Culzean, 
>>Ceilidh, or Concobarh are pronounced either.
>>
> 
>   Since "soundex" is initial letter (consonant?) and a code for
> the next three syllables (or close to it), really long multi-syllabic
> names are effectively truncated...
> 
>   Howe'er... When Maire Brennan releases an album as "Moya",
> following sister's "Enya" (Eithne, as I seem to recall reading)... I'd
> not attempt to pronounce most of the names you supply... "Dalziell"
> doesn't look Celtic... "Culzean" almost looks Aztec/Mayan... "Ceilidh"
> => kay-lee?
> 
>   Okay, I think I can manage bain sidhe and uisge (after too much
> of the latter, I'll be seeing the former)
> 
Well, as an Englishman who has spent a good deal of time in Scotland I'd 
essay the following. If there are any Scots reading they can chastise me 
with glee for my presumption.

Dalziell:   "Da'y yell"
Drumnadrochit:  "Dru'mnadro'ckit"
(but the Scots would insist you use
a gutteral for the "ch", I'm not sure
how to render that phonetically. It's
a bit like the sound made before spitting,
or the "G" in Guido in Dutch :-).
Culzean:"Ka La'ne"
Ceilidh:"Ca'yli" (once had a border collie called
Ceilidh).
Concobarh:  (is this the same as 'Conchobar'?)
Co'nnahwar

promoting-scottish-english-unity-ly y'rs  - steve
-- 
Steve Holden+1 703 861 4237  +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-20 Thread John Machin
On 20 May 2005 04:09:26 GMT, [EMAIL PROTECTED] (Patrick TJ McPhee)
wrote:

>In article <[EMAIL PROTECTED]>,
>John Machin  <[EMAIL PROTECTED]> wrote:
>
>% None of the other approaches make the mistake of preserving the first
>% letter -- this alone is almost enough reason for jettisoning soundex.
>
>Metaphone does, at least in the case of Aaron and Erin.

You're quite correct; moreover NYSIIS does that too.

Dunno what came over me, must be gamma rays or something :-)


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-19 Thread Patrick TJ McPhee
In article <[EMAIL PROTECTED]>,
John Machin  <[EMAIL PROTECTED]> wrote:

% None of the other approaches make the mistake of preserving the first
% letter -- this alone is almost enough reason for jettisoning soundex.

Metaphone does, at least in the case of Aaron and Erin.


-- 

Patrick TJ McPhee
North York  Canada
[EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-19 Thread John Machin
On Fri, 20 May 2005 01:47:15 +1000, Steven D'Aprano
<[EMAIL PROTECTED]> wrote:

>On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote:
>
>> None of the other approaches make the mistake of preserving the first
>> letter -- this alone is almost enough reason for jettisoning soundex.
>
>Off-topic now, but you've made me curious.
>
>Why is this a bad idea?
>
>How would you handle the case of "barow" and "marow"? (Barrow and
>marrow, naturally.) Without the first letter, they sound identical. Why is
>throwing that information away a good thing?

Sorry if that was unclear. By "preserving the first letter", I meant
that in "standard" soundex, the first letter is not transformed into a
digit.

Karen -> K650 
Kieran -> K650
(R->6, N->5; vowels->0 and then are squeezed out)

Now compare this:
Aaron -> A650
Erin -> E650

Bearing in mind that the usual application of soundex is "all or
nothing", the result is Karen == Kieran, but Aaron !== Erin, which is
at the very least extremely inconsistent.

A better phonetic-key creator would produce the same result for each
of the first pair, and for each of the second pair -- e.g. KARAN and
ARAN respectively.

Also consider Catherine vs Katherine.

Cheers,
John

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-19 Thread Chris Croughton
On Fri, 20 May 2005 01:47:15 +1000, Steven D'Aprano 
   <[EMAIL PROTECTED]> wrote:

> On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote:
> 
>> None of the other approaches make the mistake of preserving the first
>> letter -- this alone is almost enough reason for jettisoning soundex.
> 
> Off-topic now, but you've made me curious.
> 
> Why is this a bad idea?

Why is the first letter any more important than any other?

> How would you handle the case of "barow" and "marow"? (Barrow and
> marrow, naturally.) Without the first letter, they sound identical. Why is
> throwing that information away a good thing?

Well, Soundex will quite possibly throw the information away anyway,
certainly it regards several letters as the same.  But why is the
difference between barrow and marrow more important than that between
help and held?  Or between hatter and hammer?

Regarding 'agains' as similar to 'iguanas' and 'Utahns', but not to
'again' or 'against', is silly...

Chris C
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-19 Thread Dan Bishop
Steven D'Aprano wrote:
> On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote:
>
> > None of the other approaches make the mistake of preserving the
first
> > letter -- this alone is almost enough reason for jettisoning
soundex.
>
> Off-topic now, but you've made me curious.
>
> Why is this a bad idea?

Because of situations like, for example, my mother's last name, which
originally started with "Y" but got anglicized to a name beginning with
"E".  Same name, different Soundex codes, and the problem occurs only
occurs because of Soundex's preservation of the exact initial letter.

> How would you handle the case of "barow" and "marow"? (Barrow and
> marrow, naturally.) Without the first letter, they sound identical.
Why is
> throwing that information away a good thing?

No one's suggesting throwing away the first letter's information, just
removing the special treatment for it.  "Barow" becomes 1600 and
"Marow" becomes 5600.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-19 Thread Steven D'Aprano
On Thu, 19 May 2005 14:09:32 +1000, John Machin wrote:

> None of the other approaches make the mistake of preserving the first
> letter -- this alone is almost enough reason for jettisoning soundex.

Off-topic now, but you've made me curious.

Why is this a bad idea?

How would you handle the case of "barow" and "marow"? (Barrow and
marrow, naturally.) Without the first letter, they sound identical. Why is
throwing that information away a good thing?

Thanks,


-- 
Steven


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-19 Thread Steven D'Aprano
On Thu, 19 May 2005 07:07:56 +1000, John Machin wrote:

> On Wed, 18 May 2005 13:45:30 -0700, Don <[EMAIL PROTECTED]>
> wrote:
> 
>>http://www.personal.psu.edu/staff/i/u/iua1/python/apse/
> 
> The above is broken, not meeting one of the elementary conditions for
> a distance metric:
> 
> distance(a, b) == distance(b, a)

I agree that this makes the edit distance broken in the context of text
strings, but you should be aware that distance is only commutative if
there are no constraints on travel. If you've ever driven around cities
like Sydney that use lots of one-way streets, you will know that the
distance from point A to point B is not necessarily the same as the
distance from B back to A.


-- 
Steven


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-19 Thread Chris Croughton
On Thu, 19 May 2005 06:38:59 +1000, John Machin 
   <[EMAIL PROTECTED]> wrote:

> On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <[EMAIL PROTECTED]>
> wrote:
> 
>>William Park wrote:
>>
>>> How do you compare 2 strings, and determine how much they are "close" to
>>> each other?  Eg.
>>> aqwerty
>>> qwertyb
>>> are similar to each other, except for first/last char.  But, how do I
>>> quantify that?
>>
>>"However you like" is probably the right answer, but one way might be to 
>>compare their soundex encoding 
>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out 
>>percentage difference based on comparing the numeric part.
> 
> Fantastic suggestion. Here's a tiny piece of real-life test data:
> 
> compare the surnames "Mousaferiadis" and "McPherson".

I remember a word processor, MultiMate, which used Soundex to do
matching for its suggestions for spelling correction.  One of my
cow-orkers typed the word 'agains' -- it was supposed to be 'against'
but 'again' would also have been a sensible suggestion.  MultiMate,
however, suggested neither of those reasonable words, it did suggest
"iguanas" amd "Utahns"...

(I wonder what it does with "Talliafero" and "Tolliver", and with
"Featherstone-Howe" and "Fanshaw"...)

The answer to the OP, which someone just pointed out to me on
comp.programming, is "edit distance" or "Levenshtein Distance"[1].  A
google search on either produces some good descriptions in the first few
links, including http://www.merriampark.com/ld.htm which has not only a
description of the algorithm but also source code in Java, C++ and
Visual Basic (no Awk, thought there are links to pages with
implementations in Perl, Python, Delphi and many more)...

[1] I would have spelt it 'Levenstein', and pronounced it 'Levenshtein'
in Schwaebisch (south German) fashion, but apparently the author of the
algorithm was one Vladimir I. Levenshtein and that is how he is credited
on the IEEE site.  There are also a number of Google hits under the
'stein' spelling, though...

Chris C
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-19 Thread [EMAIL PROTECTED]
Could hit a few snags.  Quick out-of-the-library compression using
standards like zlib will have headers that will dilute the difference
on short strings, and on long strings block compression (zlib, bzip2)
will not pick up similarities because the similarities will be in
different blocks.  With blocks of around 100k-1M in these algos by
default (IIRC),  this could work well for strings between oh say
1k-50k.

But I need to underscore Aahz's posting above:

***Check out difflib, it's in the library.***  Perfect package for what
the OP wants AFAICT.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread Scott David Daniels
William Park wrote:
> How do you compare 2 strings, and determine how much they are "close" to
> each other?  

Here's a really weird idea:  Measure the size difference between the
pair of strings compressed together and compressed separately.

--Scott David Daniels
[EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread John Machin
On Wed, 18 May 2005 22:37:21 -0500, Ed Morton <[EMAIL PROTECTED]>
wrote:

>
>
>John Machin wrote:
>
>> On Wed, 18 May 2005 20:03:53 -0500, Ed Morton <[EMAIL PROTECTED]>
>> wrote:
>
>>>I assume you were actually being facetious
>>>and trying to make the point 
>>>that names that don't look the same on paper can have the same soundex 
>>>encoding and that's obviously countered with the fact that soundex is 
>>>just a cheap and cheerful way to find names that probably sound similair 
>>>which can vary tremendously based on ethnicity or accent.
>> 
>> 
>> *If* you want phonetic similarity, there are methods that much better
>> than soundex, in the sense of fewer false positives and fewer false
>> negatives. Google for NYSIIS, dolby, metaphone, caverphone.
>
>And I assume I'd find they all have pros and cons too, otherwise you'd 
>be referring to THE best one rather than a selection.

*ALL* approximate matching methods have pros and cons -- and all
others have fewer than soundex.

> It seems a bit 
>pointless to go browsing through the documentation on them when someone 
>who presumably already has can't just state the best one for the job.

They were listed in roughly increasing order of general rough
effectiveness. It depends on the job. It depends on the language. None
of them would work well with your O Muirchaitaeioughs :-)

>
>> Cheap? You get what you pay for.
>> 
>> Cheerful? What's the relevance?
>
>"Cheap and cheerful" is a colloquial expression meaning cost-effective.

Grossly misapplied to soundex.

>
>> Someone who types "Mousaferiadis" into a customer search screen and
>> gets back several lines of McPherson and MacPherson is unlikely to be
>> cheerful -- even before we factor in the speed [soundex divides the
>> universe into a relative small number of buckets].
>> 
>> Someone who's looking for Erin when they should be looking for Aaron
>> (or vice versa) won't get much cheer out of soundex, either.
>
>That goes back to accent. In [some parts at least of] the USA Erin 
>sounds very much like Aaron wheras in the UK the 2 are very dissimilar. 
>I assume since you apparently consider them similair that you live in 
>the USA

You assume incorrectly. In any case my whereabouts are of sublime
irrelevance. What matters is that some people will as you say think
that Aaron and Erin sound similar in the best of circumstances; they
are prone to be mistaken one for the other by (say) a tired call
centre operative especially when the caller and the callee are from
different backgrounds.

> and so would consider soundex as providing a "false negative" by 
>saying they don't match. Perhaps one of the other approaches you suggest 
>would report that they do match but that wouldn't make it clearly a 
>better choice to everyone.

None of the other approaches make the mistake of preserving the first
letter -- this alone is almost enough reason for jettisoning soundex.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread Ed Morton


John Machin wrote:

> On Wed, 18 May 2005 20:03:53 -0500, Ed Morton <[EMAIL PROTECTED]>
> wrote:

>>I assume you were actually being facetious
>>and trying to make the point 
>>that names that don't look the same on paper can have the same soundex 
>>encoding and that's obviously countered with the fact that soundex is 
>>just a cheap and cheerful way to find names that probably sound similair 
>>which can vary tremendously based on ethnicity or accent.
> 
> 
> *If* you want phonetic similarity, there are methods that much better
> than soundex, in the sense of fewer false positives and fewer false
> negatives. Google for NYSIIS, dolby, metaphone, caverphone.

And I assume I'd find they all have pros and cons too, otherwise you'd 
be referring to THE best one rather than a selection. It seems a bit 
pointless to go browsing through the documentation on them when someone 
who presumably already has can't just state the best one for the job.

> Cheap? You get what you pay for.
> 
> Cheerful? What's the relevance?

"Cheap and cheerful" is a colloquial expression meaning cost-effective.

> Someone who types "Mousaferiadis" into a customer search screen and
> gets back several lines of McPherson and MacPherson is unlikely to be
> cheerful -- even before we factor in the speed [soundex divides the
> universe into a relative small number of buckets].
> 
> Someone who's looking for Erin when they should be looking for Aaron
> (or vice versa) won't get much cheer out of soundex, either.

That goes back to accent. In [some parts at least of] the USA Erin 
sounds very much like Aaron wheras in the UK the 2 are very dissimilar. 
I assume since you apparently consider them similair that you live in 
the USA and so would consider soundex as providing a "false negative" by 
saying they don't match. Perhaps one of the other approaches you suggest 
would report that they do match but that wouldn't make it clearly a 
better choice to everyone.

> 
>>It's a reasonable approach to consider given the very loose requirements 
>>presented.
> 
> 
> Soundex is *NEVER* a reasonable approach to consider. Phonetic
> variation is only one consideration. In any case, the OP didn't appear
> to be concerned with phonetic variations.

The OP didn't say what the application was at all, but you're right that 
from his example he does SEEM more interested in character matches than 
phonetic ones so he'd presumably quickly discard phonetic comparisons if 
that's really not what he wants.

Ed.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread John Machin
On Wed, 18 May 2005 20:03:53 -0500, Ed Morton <[EMAIL PROTECTED]>
wrote:

>
>
>John Machin wrote:
>> On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <[EMAIL PROTECTED]>
>> wrote:
>> 
>> 
>>>
>>>William Park wrote:
>>>
>>>
How do you compare 2 strings, and determine how much they are "close" to
each other?  Eg.
aqwerty
qwertyb
are similar to each other, except for first/last char.  But, how do I
quantify that?

I guess you can say for the above 2 strings that
- at max, 6 chars out of 7 are same sequence --> 85% max

But, for
qawerty
qwerbty
max correlation is
- 3 chars out of 7 are the same sequence --> 42% max

(Crossposted to 3 of my favourite newsgroup.)

>>>
>>>"However you like" is probably the right answer, but one way might be to 
>>>compare their soundex encoding 
>>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out 
>>>percentage difference based on comparing the numeric part.
>>>
>> 
>> 
>> Fantastic suggestion. Here's a tiny piece of real-life test data:
>> 
>> compare the surnames "Mousaferiadis" and "McPherson".
>> 
>
>Fantastic test data set. I know how to pronounce McPherson but I'd never 
>have guessed that Mousaferiadis sounds like it.

If you guessed "moose a ferry ah dis" i.e. phonetically you wouldn't
be far wrong. The point is that the two names neither look similar nor
sound similar. It is highly unlikely that one would be corrupted into
the other during either written or spoken communication. However they
get the same soundex code because the soundex method picks out MSFR
and MCPR and says in effect that S===C (sometimes) and F==P
(sometimes).

>
>I assume you were actually being facetious
> and trying to make the point 
>that names that don't look the same on paper can have the same soundex 
>encoding and that's obviously countered with the fact that soundex is 
>just a cheap and cheerful way to find names that probably sound similair 
>which can vary tremendously based on ethnicity or accent.

*If* you want phonetic similarity, there are methods that much better
than soundex, in the sense of fewer false positives and fewer false
negatives. Google for NYSIIS, dolby, metaphone, caverphone.

Cheap? You get what you pay for.

Cheerful? What's the relevance?

Someone who types "Mousaferiadis" into a customer search screen and
gets back several lines of McPherson and MacPherson is unlikely to be
cheerful -- even before we factor in the speed [soundex divides the
universe into a relative small number of buckets].

Someone who's looking for Erin when they should be looking for Aaron
(or vice versa) won't get much cheer out of soundex, either.

>
>It's a reasonable approach to consider given the very loose requirements 
>presented.

Soundex is *NEVER* a reasonable approach to consider. Phonetic
variation is only one consideration. In any case, the OP didn't appear
to be concerned with phonetic variations.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread Ed Morton


John Machin wrote:
> On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <[EMAIL PROTECTED]>
> wrote:
> 
> 
>>
>>William Park wrote:
>>
>>
>>>How do you compare 2 strings, and determine how much they are "close" to
>>>each other?  Eg.
>>>aqwerty
>>>qwertyb
>>>are similar to each other, except for first/last char.  But, how do I
>>>quantify that?
>>>
>>>I guess you can say for the above 2 strings that
>>>- at max, 6 chars out of 7 are same sequence --> 85% max
>>>
>>>But, for
>>>qawerty
>>>qwerbty
>>>max correlation is
>>>- 3 chars out of 7 are the same sequence --> 42% max
>>>
>>>(Crossposted to 3 of my favourite newsgroup.)
>>>
>>
>>"However you like" is probably the right answer, but one way might be to 
>>compare their soundex encoding 
>>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out 
>>percentage difference based on comparing the numeric part.
>>
> 
> 
> Fantastic suggestion. Here's a tiny piece of real-life test data:
> 
> compare the surnames "Mousaferiadis" and "McPherson".
> 

Fantastic test data set. I know how to pronounce McPherson but I'd never 
have guessed that Mousaferiadis sounds like it. I suppose non-Celts 
probably wouldn't be able to guess how Dalziell, Drumnadrochit, Culzean, 
Ceilidh, or Concobarh are pronounced either.

I assume you were actually being facetious and trying to make the point 
that names that don't look the same on paper can have the same soundex 
encoding and that's obviously countered with the fact that soundex is 
just a cheap and cheerful way to find names that probably sound similair 
which can vary tremendously based on ethnicity or accent.

It's a reasonable approach to consider given the very loose requirements 
presented.

Ed.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread Aahz
[trimmed to c.l.py]

In article <[EMAIL PROTECTED]>,
William Park  <[EMAIL PROTECTED]> wrote:
>
>How do you compare 2 strings, and determine how much they are "close" to
>each other?  Eg.
>aqwerty
>qwertyb
>are similar to each other, except for first/last char.  But, how do I
>quantify that?
>
>I guess you can say for the above 2 strings that
>- at max, 6 chars out of 7 are same sequence --> 85% max
>
>But, for
>qawerty
>qwerbty
>max correlation is
>- 3 chars out of 7 are the same sequence --> 42% max

difflib -- use the string as the sequence
-- 
Aahz ([EMAIL PROTECTED])   <*> http://www.pythoncraft.com/

"And if that makes me an elitist...I couldn't be happier."  --JMS
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread John Machin
On Wed, 18 May 2005 13:45:30 -0700, Don <[EMAIL PROTECTED]>
wrote:

>http://www.personal.psu.edu/staff/i/u/iua1/python/apse/

The above is broken, not meeting one of the elementary conditions for
a distance metric:

distance(a, b) == distance(b, a)

Quoting from its docs:
 |  Note: The definition of the goodness of an approximate match is
the 
 |  number of steps required to bring the string pattern to a form
that is 
 |  entirely contained in the string to which it is being matched.

Note: "entirely contained in", rather than "equal to". Now read on:


 | The mathing 
 |  is not commutative. The pattern that you instantiate the class
with will be 
 |  matched against the input. For example the word "funky" can be
made to 
 |  match the word "funnybone" with an edit distance of one. However,
using 
 |  "funnybone" as a pattern that will be  matched to "funky" the
distance 
 |  will become five.
 |  
 |  Example:
 |  
 |  >>> from Apse import Approx
 |  >>> a = Approx("funky")
 |  >>> a.dist("funnybone")
 |  1
 |  >>> a = Approx("funnybone")
 |  >>> a.dist("funky")
 |  5


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread John Machin
On Wed, 18 May 2005 15:48:32 -0400, William Park
<[EMAIL PROTECTED]> wrote:

>How do you compare 2 strings, and determine how much they are "close" to
>each other?  Eg.
>aqwerty
>qwertyb
>are similar to each other, except for first/last char.  But, how do I
>quantify that?
>
>I guess you can say for the above 2 strings that
>- at max, 6 chars out of 7 are same sequence --> 85% max
>
>But, for
>qawerty
>qwerbty
>max correlation is
>- 3 chars out of 7 are the same sequence --> 42% max


1. Google for such topics as "fuzzy matching", "edit distance",
"approximate comparison".

2. Closer to home, look at the thread in comp.lang.python around
2004-11-18 -- search for "Pingel Hyyro" [and yes you do mean "hyyro",
not "hydro"!!]

3. Steadfastly ignore any (presumably) well-intentioned profferings of
soundex.


HTH,
John
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread Don
William Park wrote:

> How do you compare 2 strings, and determine how much they are "close" to
> each other?  Eg.
> aqwerty
> qwertyb
> are similar to each other, except for first/last char.  But, how do I
> quantify that?
> 
> I guess you can say for the above 2 strings that
> - at max, 6 chars out of 7 are same sequence --> 85% max
> 
> But, for
> qawerty
> qwerbty
> max correlation is
> - 3 chars out of 7 are the same sequence --> 42% max
> 
> (Crossposted to 3 of my favourite newsgroup.)
> 

http://www.personal.psu.edu/staff/i/u/iua1/python/apse/
http://www.nist.gov/dads/HTML/editDistance.html
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread John Machin
On Wed, 18 May 2005 15:06:53 -0500, Ed Morton <[EMAIL PROTECTED]>
wrote:

>
>
>William Park wrote:
>
>> How do you compare 2 strings, and determine how much they are "close" to
>> each other?  Eg.
>> aqwerty
>> qwertyb
>> are similar to each other, except for first/last char.  But, how do I
>> quantify that?
>> 
>> I guess you can say for the above 2 strings that
>> - at max, 6 chars out of 7 are same sequence --> 85% max
>> 
>> But, for
>> qawerty
>> qwerbty
>> max correlation is
>> - 3 chars out of 7 are the same sequence --> 42% max
>> 
>> (Crossposted to 3 of my favourite newsgroup.)
>>
>
>"However you like" is probably the right answer, but one way might be to 
>compare their soundex encoding 
>(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out 
>percentage difference based on comparing the numeric part.
>

Fantastic suggestion. Here's a tiny piece of real-life test data:

compare the surnames "Mousaferiadis" and "McPherson".






-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread Ed Morton


William Park wrote:

> How do you compare 2 strings, and determine how much they are "close" to
> each other?  Eg.
> aqwerty
> qwertyb
> are similar to each other, except for first/last char.  But, how do I
> quantify that?
> 
> I guess you can say for the above 2 strings that
> - at max, 6 chars out of 7 are same sequence --> 85% max
> 
> But, for
> qawerty
> qwerbty
> max correlation is
> - 3 chars out of 7 are the same sequence --> 42% max
> 
> (Crossposted to 3 of my favourite newsgroup.)
>

"However you like" is probably the right answer, but one way might be to 
compare their soundex encoding 
(http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?soundex) and figure out 
percentage difference based on comparing the numeric part.

Ed.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing 2 similar strings?

2005-05-18 Thread Jürgen Kahrs
Hello William,

> How do you compare 2 strings, and determine how much they are "close" to
> each other?  Eg.
> aqwerty
> qwertyb
> are similar to each other, except for first/last char.  But, how do I
> quantify that?

This is a classic problem of computer science.
Watch this:

  http://odur.let.rug.nl/~kleiweg/lev/
  http://www.levenshtein.net/

This solution has application in speech
recognition and also in DNA research.
-- 
http://mail.python.org/mailman/listinfo/python-list


Comparing 2 similar strings?

2005-05-18 Thread William Park
How do you compare 2 strings, and determine how much they are "close" to
each other?  Eg.
aqwerty
qwertyb
are similar to each other, except for first/last char.  But, how do I
quantify that?

I guess you can say for the above 2 strings that
- at max, 6 chars out of 7 are same sequence --> 85% max

But, for
qawerty
qwerbty
max correlation is
- 3 chars out of 7 are the same sequence --> 42% max

(Crossposted to 3 of my favourite newsgroup.)

-- 
William Park <[EMAIL PROTECTED]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
   http://home.eol.ca/~parkw/thinflash.html
-- 
http://mail.python.org/mailman/listinfo/python-list