I'm actually enjoying this discussion because I have the same type of issue. However, I have given up on full-text search in favor of a table of fields that, taken together, should uniquely identify the group. If I get a dupe, I can clean it up.
However, like you, they don't want me to mess with the original data. So what I have is another table with my good data, which my table of unique data refers to. If a bad record is created, I don't care; I just create my relationship to the table of data I know is good (read "think"; I rarely look at this stuff). I have 4 fields that should be unique for a group: two chars and two ints. If three (or all four) of these match a record in the 'good data' table, there's my relationship. If two or fewer match, I create a new record in my 'good data' table and log the event. (I haven't gotten to the logging part yet, though; it's easy enough just to look, since none of the fields in 'good data' should match.)

I'm thinking you might have to dig deeper than me to find 'good data', but I think it's there. Maybe ISBN, name, publisher + address, price, average pages, name of salesperson, who you guys pay for the material, etc.

On May 3, 2011 10:59 AM, "Johan De Meersman" <vegiv...@tuxera.be> wrote:
> ----- Original Message -----
> > From: "Jerry Schwartz" <je...@gii.co.jp>
> >
> > I'm not sure that I could easily build a dictionary of non-junk
> > words, since
>
> The traditional way is to build a database of junk words. The list tends to be shorter :-)
>
> Think and/or/it/the/with/like/...
>
> Percentages of mutual and non-mutual words between two titles should be a reasonable indicator of likeness. You could conceivably even assign value to individual words, so "polypropylbutanate" is more useful than "synergy" for comparison purposes.
>
> All very theoretical, though, I haven't actually done much of it to this level. My experience in data mangling is limited to mostly should-be-fixed-format data like sports results.
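A rough Python sketch of the word-overlap idea quoted above, in case it helps make it concrete. Everything named here is made up for illustration: the junk-word list, the function names, and the optional weights table are assumptions, and the score is just the (optionally weighted) share of mutual words among all significant words in the two titles. Treat it as a starting point, not a tested matcher.

```python
import re

# Hypothetical junk-word list; a real one would be tuned to the titles at hand.
JUNK_WORDS = {"and", "or", "it", "the", "with", "like", "a", "of", "for", "in"}


def significant_words(title):
    """Lower-case a title, split it into words, and drop the junk words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in JUNK_WORDS}


def likeness(title_a, title_b, weights=None):
    """Weighted share of mutual words among all significant words (0.0 to 1.0).

    `weights` optionally maps a word to its value; unlisted words count as 1.0.
    """
    weights = weights or {}
    a = significant_words(title_a)
    b = significant_words(title_b)
    mutual = a & b
    combined = a | b
    if not combined:
        return 0.0  # nothing left to compare once junk words are removed
    total = sum(weights.get(w, 1.0) for w in combined)
    shared = sum(weights.get(w, 1.0) for w in mutual)
    return shared / total
```

Calling `likeness("The Market for Polypropylene", "Polypropylene Market Report")` compares {market, polypropylene} against {polypropylene, market, report}. Passing something like `{"polypropylene": 3.0}` as `weights` makes the rare word count for more than a generic one, which is the "assign value to individual words" part.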
> --
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
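The three-of-four matching rule from the reply at the top of this message could be sketched roughly as follows. This is an in-memory Python illustration only: the field names (`name`, `code`, `num_a`, `num_b`), the `GoodRecord` class, and the list-based log are all hypothetical, and a real version would run against the actual 'good data' table rather than a Python list.

```python
from dataclasses import astuple, dataclass


@dataclass
class GoodRecord:
    # Hypothetical field names: two character fields and two integer fields.
    name: str
    code: str
    num_a: int
    num_b: int


def count_matches(candidate, record):
    """Count how many of the four identifying fields are equal."""
    pairs = zip(astuple(candidate), astuple(record))
    return sum(1 for left, right in pairs if left == right)


def link_or_create(candidate, good_data, log):
    """Return the index of the 'good data' record the candidate relates to.

    Three or more matching fields: reuse the existing record.
    Two or fewer on every record: append the candidate as a new
    'good data' record and log the event.
    """
    for index, record in enumerate(good_data):
        if count_matches(candidate, record) >= 3:
            return index
    good_data.append(candidate)
    log.append(f"new good-data record created: {candidate}")
    return len(good_data) - 1
```

Note that the first record with three or more matches wins. If two 'good data' records could both match the same candidate on three fields, that is exactly the kind of dupe that would need cleaning up by hand.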