>> I have a side project that needs to "intelligently" know if two strings
>> are contextually similar.
>
> The examples you gave seem heavy on word order and whitespace consideration,
> before applying any algorithms. Here's a quick perl version that does the
> job:
[SNIP]

This is a case where the example was too simple to explain the problem, sorry.

I have an implementation of Oracle's "contains" function for PostgreSQL, and it
does basically what you are doing. In fact, it also has Mohawk Software
Extensions (LOL) that provide metaphone.

The problem is that parsing whitespace really isn't reliable. Sometimes the
input is pinkfloyd-darksideofthemoon.

Also, I have been thinking of other applications. I have a piece of code that
does this:

apps$ ./stratest "pink foyd dark side of the moon money" "money dark side of the moon pink floyd"
Match: dark side of the moon
Match: pink f
Match: money
Match: oyd

apps$ ./stratest "pinkfoyddarksideofthemoonmoney" "moneydarksideofthemoonpinkfloyd"
Match: darksideofthemoon
Match: pinkf
Match: money
Match: oyd

I need to come up with a numerically sane way of taking this information and
understanding overall "similarity."
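
One possible way to turn a list of matched fragments into a single number is sketched below. This is not the stratest code, whose internals are not shown here. It assumes a greedy longest-common-substring extraction (which happens to reproduce the fragments listed above) and scores the result with a Dice-style ratio: twice the matched characters divided by the combined length of both strings. The names common_chunks and similarity, the min_len cutoff of 3, and the use of "\0" and "\1" as placeholders (so the inputs must not contain them) are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def common_chunks(a, b, min_len=3):
    """Greedily extract the longest common substring, then repeat on the rest.

    Matched text is replaced by a placeholder ("\0" in a, "\1" in b) so that it
    cannot match again and the text on either side cannot fuse into a new match.
    """
    chunks = []
    while True:
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if m.size < min_len:
            break
        chunks.append(a[m.a:m.a + m.size])
        a = a[:m.a] + "\0" + a[m.a + m.size:]
        b = b[:m.b] + "\1" + b[m.b + m.size:]
    return chunks


def similarity(a, b, min_len=3):
    """Dice-style score in [0, 1]: 2 * matched characters / total characters."""
    if not a or not b:
        return 0.0
    matched = sum(len(c) for c in common_chunks(a, b, min_len))
    return 2.0 * matched / (len(a) + len(b))


if __name__ == "__main__":
    x = "pinkfoyddarksideofthemoonmoney"
    y = "moneydarksideofthemoonpinkfloyd"
    print(common_chunks(x, y))
    print(round(similarity(x, y), 4))
```

For the two run-together strings above, 30 of the 30 + 31 characters are covered, so the score is 60/61, or about 0.98. A score like this ignores how fragmented the match is; if many small fragments should count for less than one long one, a penalty per chunk or a higher min_len would be one way to adjust it.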