----- Original Message ----- > From: "Jerry Schwartz" <je...@gii.co.jp> > > I'm not sure that I could easily build a dictionary of non-junk > words, since
The traditional way is to build a database of junk words. The list tends to be shorter :-) Think and/or/it/the/with/like/... Percentages of mutual and non-mutual words between two titles should be a reasonable indicator of likeness. You could conceivably even assign value to individual words, so "polypropylbutanate" is more useful than "synergy" for comparison purposes. All very theoretical, though, I haven't actually done much of it to this level. My experience in data mangling is limited to mostly should-be-fixed-format data like sports results. -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/mysql?unsub=arch...@jab.org