RE: Join based upon LIKE

Jerry Schwartz Thu, 05 May 2011 06:55:21 -0700

>-----Original Message-----
>From: Nuno Tavares [mailto:nuno.tava...@dri.pt]
>Sent: Tuesday, May 03, 2011 6:21 PM
>To: mysql@lists.mysql.com
>Subject: Re: Join based upon LIKE
>
>Dear Jerry,
>
>I've been silently following this discussion because I've missed the
>original question.
>
>But from your last explanation, now it really looks you have a "data
>quality" kind of issue, which is by far related with MySQL.
>
[JS] Definitely -- but I have to work with the tools available. This is only 
one part of the process, there is more trouble further on that is not related 
to our database at all.


>Indeed, in Data Quality, there is *never* a ready solution, because the
>source is tipically chaotic....
>
>May I suggest you to explore Google Refine? It seems to be able to
>address all those issues quite nicely, and the clustering might solve
>your problem at once. You shall know, however, how to export the tables
>(or a usable JOIN) as a CSV, see SELECT ... INTO OUTFILE for that.
>
[JS] I never heard of Google Refine. Thanks for bringing to my attention.

>Hope it helps,
>-NT
[JS] Thank you.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com



>
>Em 03-05-2011 21:34, Jerry Schwartz escreveu:
>> My situation is sounds rather simple. All I am doing is matching a
>spreadsheet
>> of products against our database. My job is to find any matches against
>> existing products and determine which ones are new, which ones are
>> replacements for older products, and which ones just need to have the
>> publication date (and page count, price, whatever) refreshed.
>>
>> Publisher is no problem. What I have for each "feed" is a title and (most 
>> of
>> the time) an ISBN or other identification assigned by the publisher.
>>
>> Matching by product ID is easy (assuming there aren't any mistakes in the
>> current or previous feeds); but the publisher might or might not change the
>> product ID when they update a report. That's why I also run a match by 
>> title,
>> and that's where all the trouble comes from.
>>
>> The publisher might or might not include a mix of old and new products in a
>> feed. The publisher might change the title of an existing product, either 
>> on
>> purpose or by accident; they might simply be sloppy about their spelling; 
>> or
>> (and this is where it is critical) the title might include a reference to
>some
>> time period such as a year or a quarter.
>>
>> I think we'd better pull the plug on this discussion. It doesn't seem like
>> there's a ready solution. Fortunately our database is small, and most feeds
>> are only a few hundred products.
>>
>> Regards,
>>
>> Jerry Schwartz
>> Global Information Incorporated
>> 195 Farmington Ave.
>> Farmington, CT 06032
>>
>> 860.674.8796 / FAX: 860.674.8341
>> E-mail: je...@gii.co.jp
>> Web site: www.the-infoshop.com
>>
>>
>>> -----Original Message-----
>>> From: shawn wilson [mailto:ag4ve...@gmail.com]
>>> Sent: Tuesday, May 03, 2011 4:08 PM
>>> Cc: mysql mailing list
>>> Subject: Re: Join based upon LIKE
>>>
>>> I'm actually enjoying this discussion because I have the same type of 
>>> issue.
>>> However, I have done away with trying to do a full text search in favor of
>>> making a table with unique fields where all fields should uniquely 
>>> identify
>>> the group. If I get a dupe, I can clean it up.
>>>
>>> However, like you, they don't want me to mess with the original data. So,
>>> what I have is another table with my good data that my table with my 
>>> unique
>>> data refers to. If a bad record is creased, I don't care I just create my
>>> relationship to the table of data I know (read think - I rarely look at 
>>> this
>>> stuff) is good.
>>>
>>> So, I have 4 fields that should be unique for a group. Two chats and two
>>> ints. If three of these match a record in the 'good data' table - there's 
>>> my
>>> relationship. If two or less match, I create a new record in my 'good 
>>> data'
>>> table and log the event. (I haven't gotten to the logging part yet though,
>>> easy enough just to look sense none of the fields in 'good data' should
>>> match)
>>>
>>> I'm thinking you might have to dig deeper than me to find 'good data' but 
>>> I
>>> think its there. Maybe isbn, name, publisher + address, price, average
>>> pages, name of sales person, who you guys pay for the material, etc etc 
>>> etc.
>>>
>>>
>>> On May 3, 2011 10:59 AM, "Johan De Meersman" <vegiv...@tuxera.be> wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jerry Schwartz" <je...@gii.co.jp>
>>>>>
>>>>> I'm not sure that I could easily build a dictionary of non-junk
>>>>> words, since
>>>>
>>>> The traditional way is to build a database of junk words. The list tends
>>> to be shorter :-)
>>>>
>>>> Think and/or/it/the/with/like/...
>>>>
>>>> Percentages of mutual and non-mutual words between two titles should be a
>>> reasonable indicator of likeness. You could conceivably even assign value 
>>> to
>>> individual words, so "polypropylbutanate" is more useful than "synergy" 
>>> for
>>> comparison purposes.
>>>>
>>>> All very theoretical, though, I haven't actually done much of it to this
>>> level. My experience in data mangling is limited to mostly
>>> should-be-fixed-format data like sports results.
>>>>
>>>>
>>>> --
>>>> Bier met grenadyn
>>>> Is als mosterd by den wyn
>>>> Sy die't drinkt, is eene kwezel
>>>> Hy die't drinkt, is ras een ezel
>>>>
>>>> --
>>>> MySQL General Mailing List
>>>> For list archives: http://lists.mysql.com/mysql
>>>> To unsubscribe:    http://lists.mysql.com/mysql?unsub=ag4ve...@gmail.com
>>>>
>>
>>
>>
>>
>
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe:    http://lists.mysql.com/mysql?unsub=je...@gii.co.jp





-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql?unsub=arch...@jab.org

RE: Join based upon LIKE

Reply via email to