I'm actually enjoying this discussion because I have the same type of issue.
However, I have done away with trying to do a full text search in favor of
making a table with unique fields where all fields should uniquely identify
the group. If I get a dupe, I can clean it up.

However, like you, they don't want me to mess with the original data. So,
what I have is another table with my good data that my table with my unique
data refers to. If a bad record is creased, I don't care I just create my
relationship to the table of data I know (read think - I rarely look at this
stuff) is good.

So, I have 4 fields that should be unique for a group. Two chats and two
ints. If three of these match a record in the 'good data' table - there's my
relationship. If two or less match, I create a new record in my 'good data'
table and log the event. (I haven't gotten to the logging part yet though,
easy enough just to look sense none of the fields in 'good data' should
match)

I'm thinking you might have to dig deeper than me to find 'good data' but I
think its there. Maybe isbn, name, publisher + address, price, average
pages, name of sales person, who you guys pay for the material, etc etc etc.


On May 3, 2011 10:59 AM, "Johan De Meersman" <vegiv...@tuxera.be> wrote:
>
>
> ----- Original Message -----
> > From: "Jerry Schwartz" <je...@gii.co.jp>
> >
> > I'm not sure that I could easily build a dictionary of non-junk
> > words, since
>
> The traditional way is to build a database of junk words. The list tends
to be shorter :-)
>
> Think and/or/it/the/with/like/...
>
> Percentages of mutual and non-mutual words between two titles should be a
reasonable indicator of likeness. You could conceivably even assign value to
individual words, so "polypropylbutanate" is more useful than "synergy" for
comparison purposes.
>
> All very theoretical, though, I haven't actually done much of it to this
level. My experience in data mangling is limited to mostly
should-be-fixed-format data like sports results.
>
>
> --
> Bier met grenadyn
> Is als mosterd by den wyn
> Sy die't drinkt, is eene kwezel
> Hy die't drinkt, is ras een ezel
>
> --
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
> To unsubscribe:    http://lists.mysql.com/mysql?unsub=ag4ve...@gmail.com
>

Reply via email to