Re: Aproximative string matching

2005-11-21 Thread Daniel Dittmar
javuchi wrote:
 I'm searching for a library which makes aproximative string matching,
 for example, searching in a dictionary the word motorcycle, but
 returns similar strings like motorcicle.
 
 Is there such a library?
 

agrep (aproximate grep) allows for a certain amount of errors and there 
exist Python bindings (http://www.bio.cam.ac.uk/~mw263/pyagrep.html)

Or google for agrep python.

Daniel

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Aproximative string matching

2005-11-21 Thread Fredrik Lundh
Tim Roberts wrote:

 I'm searching for a library which makes aproximative string matching,
 for example, searching in a dictionary the word motorcycle, but
 returns similar strings like motorcicle.
 
 Is there such a library?

 There is an algorithm called Soundex that replaces each word by a
 4-character string, such that all words that are pronounced similarly
 encode to the same string.

 The algorithm is easy to implement; you can probably find one by Googling.

Python used to ship with a soundex module, but it was removed
in 1.6, for various reasons.  here's a replacement:

http://orca.mojam.com/~skip/python/soundex.py

/F



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Aproximative string matching

2005-11-21 Thread Tim Heaney
javuchi [EMAIL PROTECTED] writes:

 I'm searching for a library which makes aproximative string matching,
 for example, searching in a dictionary the word motorcycle, but
 returns similar strings like motorcicle.

 Is there such a library?

I kind of like the one at 

  http://www.personal.psu.edu/staff/i/u/iua1/python/apse/

which you might use something like

   import Apse
   ap = Apse.Approx('motorcycle', edit=1)
   ap.match(['motorcycle', 'motorcicle', 'motorscooter'])
  ['motorcycle', 'motorcicle']

That page mentions several alternatives, as well.

I hope this helps,

Tim
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Aproximative string matching

2005-11-21 Thread Diez B. Roggisch
Steven D'Aprano wrote:
 [EMAIL PROTECTED] wrote:
 
 This algorithm is called soundex. Here is one implementation example.

 http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213

 here is another:
 http://effbot.org/librarybook/soundex.htm
 
 
 Soundex is *one* particular algorithm for approximate string matching. 
 It is optimised for matching Anglo-American names (like Smith/Smythe), 
 and is considered to be quite old and obsolete for all but the most 
 trivial applications -- or so I'm told.
 
 Soundex will not match arbitrary changes -- it will match both cat and 
 cet, but it won't match cat and mat.
 
 A more sophisticated approximate string matching algorithm will use the 
 Levenshtein distance. You can find a Useless implementation here:
 
 http://www.uselesspython.com/download.php?script_id=108
 
 
 Given a function levenshtein(s1, s2) that returns the distance between 
 two strings, you could use it for approximate matching like this:
 
 def approx_matching(strlist, target, dist=1):
 Matches approximately strings in strlist to
 a target string.
 
 Returns a list of strings, where each string
 matched is no further than an edit distance of
 dist from the target.
 
 found = []
 for s in strlist:
 if levenshtein(s, target) = dist:
 found.append(s)
 return s

I compute a relative metric based on th every same implementation like this:

def relative(a, b):
 
 Computes a relative distance between two strings. Its in the range
 (0-1] where 1 means total equality.
 @type a: string
 @param a: arg one
 @type b: string
 @param b: arg two
 @rtype: float
 @return: the distance
 
 d = levensthein(a,b)
 longer = float(max((len(a), len(b
 shorter = float(min((len(a), len(b
 r = ((longer - d) / longer) * (shorter / longer)
 return r


The idea is that otherwise e.g. cat and hippopothamus have a 
l-distance of only 3, which one would consider good at the first look.


Regards,

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Aproximative string matching

2005-11-21 Thread Steven D'Aprano
On Mon, 21 Nov 2005 19:47:45 +0100, Diez B. Roggisch wrote:

 The idea is that otherwise e.g. cat and hippopothamus have a 
 l-distance of only 3, which one would consider good at the first look.

???

I make it that the L-distance between cat and hippopothamus is twelve, not
three. With len(cat)=3 and len(hippopothamus) = 13, you need ten
insertions just to get the lengths equal, so the L-distance must be at
least ten.


-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Aproximative string matching

2005-11-21 Thread Irmen de Jong
javuchi wrote:
 I'm searching for a library which makes aproximative string matching,
 for example, searching in a dictionary the word motorcycle, but
 returns similar strings like motorcicle.
 
 Is there such a library?
 

Perhaps the get_close_matches function that is presentt in the standard 
library (in the difflib module) could be useful ?

--Irmen
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Aproximative string matching

2005-11-20 Thread elbertlev
This algorithm is called soundex. Here is one implementation example.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213

here is another:
http://effbot.org/librarybook/soundex.htm

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Aproximative string matching

2005-11-20 Thread Steven D'Aprano
[EMAIL PROTECTED] wrote:

 This algorithm is called soundex. Here is one implementation example.
 
 http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213
 
 here is another:
 http://effbot.org/librarybook/soundex.htm

Soundex is *one* particular algorithm for approximate 
string matching. It is optimised for matching 
Anglo-American names (like Smith/Smythe), and is 
considered to be quite old and obsolete for all but the 
most trivial applications -- or so I'm told.

Soundex will not match arbitrary changes -- it will 
match both cat and cet, but it won't match cat and mat.

A more sophisticated approximate string matching 
algorithm will use the Levenshtein distance. You can 
find a Useless implementation here:

http://www.uselesspython.com/download.php?script_id=108


Given a function levenshtein(s1, s2) that returns the 
distance between two strings, you could use it for 
approximate matching like this:

def approx_matching(strlist, target, dist=1):
 Matches approximately strings in strlist to
 a target string.

 Returns a list of strings, where each string
 matched is no further than an edit distance of
 dist from the target.
 
 found = []
 for s in strlist:
 if levenshtein(s, target) = dist:
 found.append(s)
 return s



-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Aproximative string matching

2005-11-20 Thread Tim Roberts
javuchi [EMAIL PROTECTED] wrote:

I'm searching for a library which makes aproximative string matching,
for example, searching in a dictionary the word motorcycle, but
returns similar strings like motorcicle.

Is there such a library?

There is an algorithm called Soundex that replaces each word by a
4-character string, such that all words that are pronounced similarly
encode to the same string.

The algorithm is easy to implement; you can probably find one by Googling.
-- 
- Tim Roberts, [EMAIL PROTECTED]
  Providenza  Boekelheide, Inc.
-- 
http://mail.python.org/mailman/listinfo/python-list