>-----Original Message-----
>From: Nuno Tavares [mailto:nuno.tava...@dri.pt]
>Sent: Tuesday, May 03, 2011 6:21 PM
>To: mysql@lists.mysql.com
>Subject: Re: Join based upon LIKE
>
>Dear Jerry,
>
>I've been silently following this discussion because I missed the
>original question.
>
>But from your last explanation, it now really looks like you have a "data
>quality" kind of issue, which is far from being related to MySQL.
>
[JS] Definitely -- but I have to work with the tools available. This is only
one part of the process; there is more trouble further on that is not related
to our database at all.
>Indeed, in Data Quality, there is *never* a ready solution, because the
>source is typically chaotic....
>
>May I suggest you explore Google Refine? It seems to be able to
>address all those issues quite nicely, and the clustering might solve
>your problem at once. You will need to know, however, how to export the
>tables (or a usable JOIN) as a CSV; see SELECT ... INTO OUTFILE for that.
>
[JS] I had never heard of Google Refine. Thanks for bringing it to my
attention.

>Hope it helps,
>-NT

[JS] Thank you.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com

>
>On 03-05-2011 21:34, Jerry Schwartz wrote:
>> My situation sounds rather simple. All I am doing is matching a spreadsheet
>> of products against our database. My job is to find any matches against
>> existing products and determine which ones are new, which ones are
>> replacements for older products, and which ones just need to have the
>> publication date (and page count, price, whatever) refreshed.
>>
>> Publisher is no problem. What I have for each "feed" is a title and (most
>> of the time) an ISBN or other identification assigned by the publisher.
>>
>> Matching by product ID is easy (assuming there aren't any mistakes in the
>> current or previous feeds); but the publisher might or might not change the
>> product ID when they update a report. That's why I also run a match by
>> title, and that's where all the trouble comes from.
>>
>> The publisher might or might not include a mix of old and new products in a
>> feed. The publisher might change the title of an existing product, either
>> on purpose or by accident; they might simply be sloppy about their
>> spelling; or (and this is where it is critical) the title might include a
>> reference to some time period such as a year or a quarter.
>>
>> I think we'd better pull the plug on this discussion. It doesn't seem like
>> there's a ready solution. Fortunately our database is small, and most feeds
>> are only a few hundred products.
>>
>> Regards,
>>
>> Jerry Schwartz
>> Global Information Incorporated
>> 195 Farmington Ave.
>> Farmington, CT 06032
>>
>> 860.674.8796 / FAX: 860.674.8341
>> E-mail: je...@gii.co.jp
>> Web site: www.the-infoshop.com
>>
>>
>>> -----Original Message-----
>>> From: shawn wilson [mailto:ag4ve...@gmail.com]
>>> Sent: Tuesday, May 03, 2011 4:08 PM
>>> Cc: mysql mailing list
>>> Subject: Re: Join based upon LIKE
>>>
>>> I'm actually enjoying this discussion because I have the same type of
>>> issue. However, I have done away with trying to do a full text search in
>>> favor of making a table with unique fields where all fields should
>>> uniquely identify the group. If I get a dupe, I can clean it up.
>>>
>>> However, like you, they don't want me to mess with the original data. So,
>>> what I have is another table with my good data that my table with my
>>> unique data refers to. If a bad record is created, I don't care; I just
>>> create my relationship to the table of data I know (read: think - I rarely
>>> look at this stuff) is good.
>>>
>>> So, I have 4 fields that should be unique for a group. Two chars and two
>>> ints. If three of these match a record in the 'good data' table - there's
>>> my relationship. If two or fewer match, I create a new record in my 'good
>>> data' table and log the event. (I haven't gotten to the logging part yet
>>> though, easy enough just to look since none of the fields in 'good data'
>>> should match)
>>>
>>> I'm thinking you might have to dig deeper than me to find 'good data' but
>>> I think it's there. Maybe isbn, name, publisher + address, price, average
>>> pages, name of sales person, who you guys pay for the material, etc etc
>>> etc.
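The three-of-four rule shawn describes might be sketched roughly as follows. This is only an illustration under stated assumptions: the field names (`code_a`, `code_b`, `num_a`, `num_b`) are hypothetical, the 'good data' table is modelled as an in-memory list rather than a MySQL table, and the log is a plain list standing in for whatever logging he eventually adds.

```python
from typing import Optional

# Hypothetical key fields: two strings and two integers, as in the description.
KEY_FIELDS = ("code_a", "code_b", "num_a", "num_b")


def count_matches(candidate: dict, good: dict) -> int:
    """Count how many of the key fields are equal in both records."""
    return sum(1 for f in KEY_FIELDS if candidate[f] == good[f])


def link_or_create(candidate: dict, good_rows: list, log: list) -> int:
    """Return the index of the 'good data' row this record relates to.

    If at least three of the four key fields match an existing row, link to
    the best-matching row. Otherwise append a new row and log the event.
    """
    best_idx: Optional[int] = None
    best_score = 0
    for idx, good in enumerate(good_rows):
        score = count_matches(candidate, good)
        if score > best_score:
            best_idx, best_score = idx, score

    if best_idx is not None and best_score >= 3:
        return best_idx

    # Two or fewer fields matched: create a new 'good data' row.
    good_rows.append({f: candidate[f] for f in KEY_FIELDS})
    new_idx = len(good_rows) - 1
    log.append(("created", new_idx, best_score))
    return new_idx
```

The original data is never modified here; only the 'good data' list grows, which mirrors the constraint that the source records must be left alone.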
>>>
>>>
>>> On May 3, 2011 10:59 AM, "Johan De Meersman" <vegiv...@tuxera.be> wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jerry Schwartz" <je...@gii.co.jp>
>>>>>
>>>>> I'm not sure that I could easily build a dictionary of non-junk
>>>>> words, since
>>>>
>>>> The traditional way is to build a database of junk words. The list tends
>>>> to be shorter :-)
>>>>
>>>> Think and/or/it/the/with/like/...
>>>>
>>>> Percentages of mutual and non-mutual words between two titles should be a
>>>> reasonable indicator of likeness. You could conceivably even assign value
>>>> to individual words, so "polypropylbutanate" is more useful than
>>>> "synergy" for comparison purposes.
>>>>
>>>> All very theoretical, though, I haven't actually done much of it to this
>>>> level. My experience in data mangling is limited to mostly
>>>> should-be-fixed-format data like sports results.
>>>>
>>>>
>>>> --
>>>> Bier met grenadyn
>>>> Is als mosterd by den wyn
>>>> Sy die't drinkt, is eene kwezel
>>>> Hy die't drinkt, is ras een ezel
>>>>
>>>> --
>>>> MySQL General Mailing List
>>>> For list archives: http://lists.mysql.com/mysql
>>>> To unsubscribe: http://lists.mysql.com/mysql?unsub=ag4ve...@gmail.com
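Johan's idea of dropping junk words and comparing the share of mutual words between two titles could be sketched like this. It is a minimal illustration, not something anyone in the thread actually ran: the junk-word list is seeded only from the examples in his message, the function names are made up, and the score used is the plain Jaccard ratio (mutual words divided by all distinct words), with no per-word weighting.

```python
import re

# Hypothetical junk-word list, seeded from the examples in the message.
JUNK_WORDS = {"and", "or", "it", "the", "with", "like"}


def significant_words(title: str) -> set:
    """Lowercase the title, split on non-alphanumerics, and drop junk words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in JUNK_WORDS}


def title_likeness(a: str, b: str) -> float:
    """Share of mutual words among all significant words in either title."""
    words_a, words_b = significant_words(a), significant_words(b)
    if not words_a and not words_b:
        return 0.0  # nothing to compare
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    old = "The Global Market for Widgets 2010"
    new = "Global Market for Widgets 2011"
    print(round(title_likeness(old, new), 3))
```

Note that a year embedded in the title counts as an ordinary non-mutual word here, so two editions of the same report score high but not 1.0; whether that counts as a match, a replacement, or a new product is still a threshold decision for a human to make.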