RE: Join based upon LIKE
-Original Message- From: Nuno Tavares [mailto:nuno.tava...@dri.pt] Sent: Tuesday, May 03, 2011 6:21 PM To: mysql@lists.mysql.com Subject: Re: Join based upon LIKE Dear Jerry, I've been silently following this discussion because I've missed the original question. But from your last explanation, now it really looks you have a data quality kind of issue, which is by far related with MySQL. [JS] Definitely -- but I have to work with the tools available. This is only one part of the process, there is more trouble further on that is not related to our database at all. Indeed, in Data Quality, there is *never* a ready solution, because the source is tipically chaotic May I suggest you to explore Google Refine? It seems to be able to address all those issues quite nicely, and the clustering might solve your problem at once. You shall know, however, how to export the tables (or a usable JOIN) as a CSV, see SELECT ... INTO OUTFILE for that. [JS] I never heard of Google Refine. Thanks for bringing to my attention. Hope it helps, -NT [JS] Thank you. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com Em 03-05-2011 21:34, Jerry Schwartz escreveu: My situation is sounds rather simple. All I am doing is matching a spreadsheet of products against our database. My job is to find any matches against existing products and determine which ones are new, which ones are replacements for older products, and which ones just need to have the publication date (and page count, price, whatever) refreshed. Publisher is no problem. What I have for each feed is a title and (most of the time) an ISBN or other identification assigned by the publisher. Matching by product ID is easy (assuming there aren't any mistakes in the current or previous feeds); but the publisher might or might not change the product ID when they update a report. That's why I also run a match by title, and that's where all the trouble comes from. The publisher might or might not include a mix of old and new products in a feed. The publisher might change the title of an existing product, either on purpose or by accident; they might simply be sloppy about their spelling; or (and this is where it is critical) the title might include a reference to some time period such as a year or a quarter. I think we'd better pull the plug on this discussion. It doesn't seem like there's a ready solution. Fortunately our database is small, and most feeds are only a few hundred products. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com -Original Message- From: shawn wilson [mailto:ag4ve...@gmail.com] Sent: Tuesday, May 03, 2011 4:08 PM Cc: mysql mailing list Subject: Re: Join based upon LIKE I'm actually enjoying this discussion because I have the same type of issue. However, I have done away with trying to do a full text search in favor of making a table with unique fields where all fields should uniquely identify the group. If I get a dupe, I can clean it up. However, like you, they don't want me to mess with the original data. So, what I have is another table with my good data that my table with my unique data refers to. If a bad record is creased, I don't care I just create my relationship to the table of data I know (read think - I rarely look at this stuff) is good. So, I have 4 fields that should be unique for a group. Two chats and two ints. If three of these match a record in the 'good data' table - there's my relationship. If two or less match, I create a new record in my 'good data' table and log the event. (I haven't gotten to the logging part yet though, easy enough just to look sense none of the fields in 'good data' should match) I'm thinking you might have to dig deeper than me to find 'good data' but I think its there. Maybe isbn, name, publisher + address, price, average pages, name of sales person, who you guys pay for the material, etc etc etc. On May 3, 2011 10:59 AM, Johan De Meersman vegiv...@tuxera.be wrote: - Original Message - From: Jerry Schwartz je...@gii.co.jp I'm not sure that I could easily build a dictionary of non-junk words, since The traditional way is to build a database of junk words. The list tends to be shorter :-) Think and/or/it/the/with/like/... Percentages of mutual and non-mutual words between two titles should be a reasonable indicator of likeness. You could conceivably even assign value to individual words, so polypropylbutanate is more useful than synergy for comparison purposes. All very theoretical, though, I haven't actually done much of it to this level. My experience in data mangling is limited to mostly should
Re: Join based upon LIKE
http://www.gedpage.com/soundex.html offers a simple explanation of what it does. One possibility would be building a referential table with only a recordID and soundex column, unique over both; and filling that with the soundex of individual nonjunk words. So, from the titles 1 | Rain in Spain 2 | Spain's Rain you'd get 1 | R500 1 | S150 2 | S150 2 | R500 From thereon, you can see that all the same words have been used - ignoring a lot of spelling errors like Spian. Obviously not a magic solution, but it's a start. - Original Message - From: Jerry Schwartz je...@gii.co.jp To: Johan De Meersman vegiv...@tuxera.be Cc: Jim McNeely j...@newcenturydata.com, mysql mailing list mysql@lists.mysql.com Sent: Monday, 2 May, 2011 4:09:36 PM Subject: RE: Join based upon LIKE [JS] I've thought about using soundex(), but I'm not quite sure how. I didn't pursue it much because there are so many odd terms such as chemical names, but perhaps I should give it a try in my infinite free time. [JS] Thanks for your condolences. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
RE: Join based upon LIKE
-Original Message- From: Johan De Meersman [mailto:vegiv...@tuxera.be] Sent: Tuesday, May 03, 2011 5:31 AM To: Jerry Schwartz Cc: Jim McNeely; mysql mailing list; Johan De Meersman Subject: Re: Join based upon LIKE http://www.gedpage.com/soundex.html offers a simple explanation of what it does. One possibility would be building a referential table with only a recordID and soundex column, unique over both; and filling that with the soundex of individual nonjunk words. So, from the titles 1 | Rain in Spain 2 | Spain's Rain you'd get 1 | R500 1 | S150 2 | S150 2 | R500 From thereon, you can see that all the same words have been used - ignoring a lot of spelling errors like Spian. Obviously not a magic solution, but it's a start. [JS] Thanks. I'm not sure that I could easily build a dictionary of non-junk words, since some of these reports have titles like Toluene Diisocyanate Market Outlook 2008, Toluene Market Outlook 2008, and Toluene: 2009 World Market Outlook And Forecast (Special Crisis Edition). I shall ponder this when I am caught up, or (more likely) in the afterlife. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com - Original Message - From: Jerry Schwartz je...@gii.co.jp To: Johan De Meersman vegiv...@tuxera.be Cc: Jim McNeely j...@newcenturydata.com, mysql mailing list mysql@lists.mysql.com Sent: Monday, 2 May, 2011 4:09:36 PM Subject: RE: Join based upon LIKE [JS] I've thought about using soundex(), but I'm not quite sure how. I didn't pursue it much because there are so many odd terms such as chemical names, but perhaps I should give it a try in my infinite free time. [JS] Thanks for your condolences. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Re: Join based upon LIKE
- Original Message - From: Jerry Schwartz je...@gii.co.jp I'm not sure that I could easily build a dictionary of non-junk words, since The traditional way is to build a database of junk words. The list tends to be shorter :-) Think and/or/it/the/with/like/... Percentages of mutual and non-mutual words between two titles should be a reasonable indicator of likeness. You could conceivably even assign value to individual words, so polypropylbutanate is more useful than synergy for comparison purposes. All very theoretical, though, I haven't actually done much of it to this level. My experience in data mangling is limited to mostly should-be-fixed-format data like sports results. -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Re: Join based upon LIKE
I'm actually enjoying this discussion because I have the same type of issue. However, I have done away with trying to do a full text search in favor of making a table with unique fields where all fields should uniquely identify the group. If I get a dupe, I can clean it up. However, like you, they don't want me to mess with the original data. So, what I have is another table with my good data that my table with my unique data refers to. If a bad record is creased, I don't care I just create my relationship to the table of data I know (read think - I rarely look at this stuff) is good. So, I have 4 fields that should be unique for a group. Two chats and two ints. If three of these match a record in the 'good data' table - there's my relationship. If two or less match, I create a new record in my 'good data' table and log the event. (I haven't gotten to the logging part yet though, easy enough just to look sense none of the fields in 'good data' should match) I'm thinking you might have to dig deeper than me to find 'good data' but I think its there. Maybe isbn, name, publisher + address, price, average pages, name of sales person, who you guys pay for the material, etc etc etc. On May 3, 2011 10:59 AM, Johan De Meersman vegiv...@tuxera.be wrote: - Original Message - From: Jerry Schwartz je...@gii.co.jp I'm not sure that I could easily build a dictionary of non-junk words, since The traditional way is to build a database of junk words. The list tends to be shorter :-) Think and/or/it/the/with/like/... Percentages of mutual and non-mutual words between two titles should be a reasonable indicator of likeness. You could conceivably even assign value to individual words, so polypropylbutanate is more useful than synergy for comparison purposes. All very theoretical, though, I haven't actually done much of it to this level. My experience in data mangling is limited to mostly should-be-fixed-format data like sports results. -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=ag4ve...@gmail.com
RE: Join based upon LIKE
My situation is sounds rather simple. All I am doing is matching a spreadsheet of products against our database. My job is to find any matches against existing products and determine which ones are new, which ones are replacements for older products, and which ones just need to have the publication date (and page count, price, whatever) refreshed. Publisher is no problem. What I have for each feed is a title and (most of the time) an ISBN or other identification assigned by the publisher. Matching by product ID is easy (assuming there aren't any mistakes in the current or previous feeds); but the publisher might or might not change the product ID when they update a report. That's why I also run a match by title, and that's where all the trouble comes from. The publisher might or might not include a mix of old and new products in a feed. The publisher might change the title of an existing product, either on purpose or by accident; they might simply be sloppy about their spelling; or (and this is where it is critical) the title might include a reference to some time period such as a year or a quarter. I think we'd better pull the plug on this discussion. It doesn't seem like there's a ready solution. Fortunately our database is small, and most feeds are only a few hundred products. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com -Original Message- From: shawn wilson [mailto:ag4ve...@gmail.com] Sent: Tuesday, May 03, 2011 4:08 PM Cc: mysql mailing list Subject: Re: Join based upon LIKE I'm actually enjoying this discussion because I have the same type of issue. However, I have done away with trying to do a full text search in favor of making a table with unique fields where all fields should uniquely identify the group. If I get a dupe, I can clean it up. However, like you, they don't want me to mess with the original data. So, what I have is another table with my good data that my table with my unique data refers to. If a bad record is creased, I don't care I just create my relationship to the table of data I know (read think - I rarely look at this stuff) is good. So, I have 4 fields that should be unique for a group. Two chats and two ints. If three of these match a record in the 'good data' table - there's my relationship. If two or less match, I create a new record in my 'good data' table and log the event. (I haven't gotten to the logging part yet though, easy enough just to look sense none of the fields in 'good data' should match) I'm thinking you might have to dig deeper than me to find 'good data' but I think its there. Maybe isbn, name, publisher + address, price, average pages, name of sales person, who you guys pay for the material, etc etc etc. On May 3, 2011 10:59 AM, Johan De Meersman vegiv...@tuxera.be wrote: - Original Message - From: Jerry Schwartz je...@gii.co.jp I'm not sure that I could easily build a dictionary of non-junk words, since The traditional way is to build a database of junk words. The list tends to be shorter :-) Think and/or/it/the/with/like/... Percentages of mutual and non-mutual words between two titles should be a reasonable indicator of likeness. You could conceivably even assign value to individual words, so polypropylbutanate is more useful than synergy for comparison purposes. All very theoretical, though, I haven't actually done much of it to this level. My experience in data mangling is limited to mostly should-be-fixed-format data like sports results. -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=ag4ve...@gmail.com -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Re: Join based upon LIKE
Dear Jerry, I've been silently following this discussion because I've missed the original question. But from your last explanation, now it really looks you have a data quality kind of issue, which is by far related with MySQL. Indeed, in Data Quality, there is *never* a ready solution, because the source is tipically chaotic May I suggest you to explore Google Refine? It seems to be able to address all those issues quite nicely, and the clustering might solve your problem at once. You shall know, however, how to export the tables (or a usable JOIN) as a CSV, see SELECT ... INTO OUTFILE for that. Hope it helps, -NT Em 03-05-2011 21:34, Jerry Schwartz escreveu: My situation is sounds rather simple. All I am doing is matching a spreadsheet of products against our database. My job is to find any matches against existing products and determine which ones are new, which ones are replacements for older products, and which ones just need to have the publication date (and page count, price, whatever) refreshed. Publisher is no problem. What I have for each feed is a title and (most of the time) an ISBN or other identification assigned by the publisher. Matching by product ID is easy (assuming there aren't any mistakes in the current or previous feeds); but the publisher might or might not change the product ID when they update a report. That's why I also run a match by title, and that's where all the trouble comes from. The publisher might or might not include a mix of old and new products in a feed. The publisher might change the title of an existing product, either on purpose or by accident; they might simply be sloppy about their spelling; or (and this is where it is critical) the title might include a reference to some time period such as a year or a quarter. I think we'd better pull the plug on this discussion. It doesn't seem like there's a ready solution. Fortunately our database is small, and most feeds are only a few hundred products. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com -Original Message- From: shawn wilson [mailto:ag4ve...@gmail.com] Sent: Tuesday, May 03, 2011 4:08 PM Cc: mysql mailing list Subject: Re: Join based upon LIKE I'm actually enjoying this discussion because I have the same type of issue. However, I have done away with trying to do a full text search in favor of making a table with unique fields where all fields should uniquely identify the group. If I get a dupe, I can clean it up. However, like you, they don't want me to mess with the original data. So, what I have is another table with my good data that my table with my unique data refers to. If a bad record is creased, I don't care I just create my relationship to the table of data I know (read think - I rarely look at this stuff) is good. So, I have 4 fields that should be unique for a group. Two chats and two ints. If three of these match a record in the 'good data' table - there's my relationship. If two or less match, I create a new record in my 'good data' table and log the event. (I haven't gotten to the logging part yet though, easy enough just to look sense none of the fields in 'good data' should match) I'm thinking you might have to dig deeper than me to find 'good data' but I think its there. Maybe isbn, name, publisher + address, price, average pages, name of sales person, who you guys pay for the material, etc etc etc. On May 3, 2011 10:59 AM, Johan De Meersman vegiv...@tuxera.be wrote: - Original Message - From: Jerry Schwartz je...@gii.co.jp I'm not sure that I could easily build a dictionary of non-junk words, since The traditional way is to build a database of junk words. The list tends to be shorter :-) Think and/or/it/the/with/like/... Percentages of mutual and non-mutual words between two titles should be a reasonable indicator of likeness. You could conceivably even assign value to individual words, so polypropylbutanate is more useful than synergy for comparison purposes. All very theoretical, though, I haven't actually done much of it to this level. My experience in data mangling is limited to mostly should-be-fixed-format data like sports results. -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=ag4ve...@gmail.com -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
RE: Join based upon LIKE
-Original Message- From: Johan De Meersman [mailto:vegiv...@tuxera.be] Sent: Sunday, May 01, 2011 4:01 AM To: Jerry Schwartz Cc: Jim McNeely; mysql mailing list Subject: Re: Join based upon LIKE - Original Message - From: Jerry Schwartz je...@gii.co.jp I shove those modified titles into a table and do a JOIN ON `prod_title` LIKE `wild_title`. Roughly what I meant with the shadow fields, yes - keep your own set of data around :-) I have little more to offer, then, I'm afraid. The soundex() algorithm may or may not be of some use to you; it offers comparison based (roughly) on pronounciation instead of spelling. [JS] I've thought about using soundex(), but I'm not quite sure how. I didn't pursue it much because there are so many odd terms such as chemical names, but perhaps I should give it a try in my infinite free time. Apart from that, you have my deepest sympathy. I hope you can wake up from the nightmare soon :-) [JS] Thanks for your condolences. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Re: Join based upon LIKE
- Original Message - From: Jerry Schwartz je...@gii.co.jp I shove those modified titles into a table and do a JOIN ON `prod_title` LIKE `wild_title`. Roughly what I meant with the shadow fields, yes - keep your own set of data around :-) I have little more to offer, then, I'm afraid. The soundex() algorithm may or may not be of some use to you; it offers comparison based (roughly) on pronounciation instead of spelling. Apart from that, you have my deepest sympathy. I hope you can wake up from the nightmare soon :-) -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Re: FW: Join based upon LIKE
2011/04/28 15:28 -0400, Jerry Schwartz No takers? And this is not real taking, because the algorithm of which I am thinking, the edit-distance (Levens(h)tein-distance) algorithm costs too much for you (see the Wikipedia entry). The obvious implementation takes as many steps as the product of the two compared strings s length. On the other hand, a good implementation of LIKE costs the pattern s length added to all the strings against which it matches s length, a sum, not product, of lengths. -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Re: FW: Join based upon LIKE
2011/04/28 15:28 -0400, Jerry Schwartz No takers? And this is not real taking, because the algorithm of which I am thinking, the edit-distance (Levens(h)tein-distance) algorithm costs too much for you (see the Wikipedia entry), but it yields, I believe, much more nearly such answer as you want. The obvious implementation takes as many steps as the product of the two compared strings s length. On the other hand, a good implementation of LIKE costs the pattern s length added to all the strings against which it matches s length, a sum, not product, of lengths. -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Re: Join based upon LIKE
- Original Message - From: Jerry Schwartz je...@gii.co.jp [JS] This isn't the only place I have to deal with fuzzy data. :-( Discretion prohibits further comment. Heh. What you *really* need, is a LART. Preferably one of the spiked variety. A full-text index would work if I were only looking for one title at a time, but I don't know if that would be a good idea if I have a list of 1 titles. That would pretty much require either 1 separate queries or a very, very long WHERE clause. Yes, unfortunately. You should see if you can introduce a form of data normalisation - say, shadow fields with corrected entries, or functionality in the application that suggests correct entries based on what the user typed. Or, if the money's there, you could have a look at Amazon Mechanical Turk (yes, really) for cheap-ish data correction. -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
RE: Join based upon LIKE
-Original Message- From: Johan De Meersman [mailto:vegiv...@tuxera.be] Sent: Friday, April 29, 2011 5:56 AM To: Jerry Schwartz Cc: mysql mailing list Subject: Re: Join based upon LIKE - Original Message - From: Jerry Schwartz je...@gii.co.jp [JS] This isn't the only place I have to deal with fuzzy data. :-( Discretion prohibits further comment. Heh. What you *really* need, is a LART. Preferably one of the spiked variety. [JS] Unless a LART is a demon of some kind, I don't know what it is. A full-text index would work if I were only looking for one title at a time, but I don't know if that would be a good idea if I have a list of 1 titles. That would pretty much require either 1 separate queries or a very, very long WHERE clause. Yes, unfortunately. You should see if you can introduce a form of data normalisation - say, shadow fields with corrected entries, or functionality in the application that suggests correct entries based on what the user typed. [JS] Except for obvious misspellings and non-ASCII characters, I do not have the freedom to muck with the text. If the data were created in-house, I could correct it on the way in; but it comes from myriad other companies. Or, if the money's there, you could have a look at Amazon Mechanical Turk (yes, really) for cheap-ish data correction. [JS] Again, I can't change the data. The titles are assigned by the publishers. Think what would happen if Amazon decided to fix the titles of books. Ain't Misbehavin would, at best, turn into I am not misbehaving. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
RE: Join based upon LIKE
-Original Message- From: Jim McNeely [mailto:j...@newcenturydata.com] Sent: Thursday, April 28, 2011 6:43 PM To: Jerry Schwartz Subject: Re: Join based upon LIKE It just smells wrong, a nicer system would have you joining on ID's of some kind so that spelling wouldn't matter. I don't know the full situation for you though. [JS] That would be nice, wouldn't it. In a nutshell, we sell publications. Publishers send us lists of publications. Some are new, some replace previous editions. (Think of books, almanacs, and newsletters.) Some publishers make do without any product IDs at all, but most do use product IDs of some kind. The problem is that the March edition of a publication might or might not have the same product ID as the February edition. I try to match them both by product ID and by title. Sometimes the title will fuzzy match, but the ID won't; sometimes the ID will match but the title won't; sometimes (if I'm really lucky) they both match; and sometimes the ID matches one product and the title matches another. It's the fuzzy match by title that gives me fits: - The title might have a date in it (Rain in Spain in 2010 Q2), but not necessarily in a uniform way (Rain in Spain Q3 2010). - The title might have differences in wording or punctuation (Rain in Spain - 2010Q2). - The title might have simple misspellings (Rain in Spian - Q2 2010). I've written code that looks for troublesome constructs and replaces them with %: in , -, to , Q2, 2Q, and more and more. So Rain in Spain - 2010 Q2 becomes Rain%Spain%. I shove those modified titles into a table and do a JOIN ON `prod_title` LIKE `wild_title`. This will miss actual misspellings (Spain, Spian). It will also produce a large number of false positives. On the back end, I have other code that compares the new titles against the titles retrieved by that query and decides if they are exact matches, approximate matches (here I do use regular expressions, as well as lists of known bad boys), or false positives. From there on, it's all hand work. Pretty big nut, eh? So that's why I need to use LIKE in my JOIN. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com Jim McNeely On Apr 28, 2011, at 12:28 PM, Jerry Schwartz wrote: No takers? -Original Message- From: Jerry Schwartz [mailto:je...@gii.co.jp] Sent: Monday, April 25, 2011 2:34 PM To: 'Mailing-List mysql' Subject: Join based upon LIKE I have to match lists of new publications against our database, so that I can replace the existing publications in our catalog. For example, The UK Market for Puppies in February 2011 would be a replacement for The UK Market for Puppies in December 2010 Unfortunately, the publishers aren't particularly careful with their titles. One might even say they are perverse. I am likely to get UK Market: Puppies - Feb 2011 as replacement for The UK Market for Puppies in December 2010 You can see that a straight match by title is not going to work. Here's what I've been doing: = SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD'); CREATE TEMPORARY TABLE new_titles ( new_title VARCHAR(255), INDEX (new_title), new_title_like VARCHAR(255), INDEX (new_title_like) ); INSERT INTO new_titles VALUES ('Alternative Energy Monthly Deal Analysis - MA and Investment Trends, April 2011', 'Alternative Energy Monthly Deal Analysis%MA%Investment Trends%'), ('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company Share, Price Trends, Capacity Forecasts of All Active and Planned Plants', 'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price Trends%Capacity Forecasts of All Active%Planned Plants'), ... ('Underground Gas Storage Industry Outlook in North America, 2011 - Details of All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas Storage Industry Outlook%North America%Details of All Operating%Planned Gas Storage Sites to%'), ('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017', 'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%'); SELECT prod.prod_title AS `Title IN Database`, new_titles.new_title AS `Title IN Feed`, prod.prod_num AS `ID` FROM new_titles JOIN prod ON prod.prod_title LIKE (new_titles.new_title_like) AND prod.pub_id = @PUBID AND prod.prod_discont = 0 ORDER BY new_titles.new_title; == (I've written code that substitutes % for certain strings that I specify, and there is some trial and error involved.) Here's how MySQL handles that SELECT: *** 1. row *** id: 1 select_type: SIMPLE table: new_titles type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows
FW: Join based upon LIKE
No takers? -Original Message- From: Jerry Schwartz [mailto:je...@gii.co.jp] Sent: Monday, April 25, 2011 2:34 PM To: 'Mailing-List mysql' Subject: Join based upon LIKE I have to match lists of new publications against our database, so that I can replace the existing publications in our catalog. For example, The UK Market for Puppies in February 2011 would be a replacement for The UK Market for Puppies in December 2010 Unfortunately, the publishers aren't particularly careful with their titles. One might even say they are perverse. I am likely to get UK Market: Puppies - Feb 2011 as replacement for The UK Market for Puppies in December 2010 You can see that a straight match by title is not going to work. Here's what I've been doing: = SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD'); CREATE TEMPORARY TABLE new_titles ( new_title VARCHAR(255), INDEX (new_title), new_title_like VARCHAR(255), INDEX (new_title_like) ); INSERT INTO new_titles VALUES ('Alternative Energy Monthly Deal Analysis - MA and Investment Trends, April 2011', 'Alternative Energy Monthly Deal Analysis%MA%Investment Trends%'), ('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company Share, Price Trends, Capacity Forecasts of All Active and Planned Plants', 'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price Trends%Capacity Forecasts of All Active%Planned Plants'), ... ('Underground Gas Storage Industry Outlook in North America, 2011 - Details of All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas Storage Industry Outlook%North America%Details of All Operating%Planned Gas Storage Sites to%'), ('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017', 'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%'); SELECT prod.prod_title AS `Title IN Database`, new_titles.new_title AS `Title IN Feed`, prod.prod_num AS `ID` FROM new_titles JOIN prod ON prod.prod_title LIKE (new_titles.new_title_like) AND prod.pub_id = @PUBID AND prod.prod_discont = 0 ORDER BY new_titles.new_title; == (I've written code that substitutes % for certain strings that I specify, and there is some trial and error involved.) Here's how MySQL handles that SELECT: *** 1. row *** id: 1 select_type: SIMPLE table: new_titles type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 47 Extra: Using filesort *** 2. row *** id: 1 select_type: SIMPLE table: prod type: ref possible_keys: pub_id key: pub_id key_len: 48 ref: const rows: 19607 Extra: Using where = Here's the important part of the table `prod`: = Table: prod Create Table: CREATE TABLE `prod` ( `prod_id` varchar(15) NOT NULL DEFAULT '', `prod_num` mediumint(6) unsigned DEFAULT NULL, `prod_title` varchar(255) DEFAULT NULL, `prod_type` varchar(2) DEFAULT NULL, `prod_vat_pct` decimal(5,2) DEFAULT NULL, `prod_discont` tinyint(1) DEFAULT NULL, `prod_replacing` mediumint(6) unsigned DEFAULT NULL, `prod_replaced_by` mediumint(6) unsigned DEFAULT NULL, `prod_ready` tinyint(1) DEFAULT NULL, `pub_id` varchar(15) DEFAULT NULL, ... PRIMARY KEY (`prod_id`), UNIQUE KEY `prod_num` (`prod_num`), KEY `prod_pub_prod_id` (`prod_pub_prod_id`), KEY `pub_id` (`pub_id`), KEY `prod_title` (`prod_title`), FULLTEXT KEY `prod_title_fulltext` (`prod_title`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 = This works reasonably well for a small number (perhaps 200-300) of new products; but now I've been handed a list of over 15000 to stuff into the table `new_titles`! This motivates me to wonder if there is a better way, since I expect this to take a very long time. Suggestions? Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Re: Join based upon LIKE
- Original Message - From: Jerry Schwartz je...@gii.co.jp No takers? Not willingly, no :-p This is a pretty complex problem, as SQL itself isn't particularly well-equipped to deal with fuzzy data. One approach that might work is using a fulltext indexing engine (MySQL's built-in ft indices, or an external one like Solr or something) and doing best-fit matches on the keywords of the title you're looking for. -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
RE: Join based upon LIKE
-Original Message- From: Johan De Meersman [mailto:vegiv...@tuxera.be] Sent: Thursday, April 28, 2011 4:18 PM To: Jerry Schwartz Cc: mysql mailing list Subject: Re: Join based upon LIKE - Original Message - From: Jerry Schwartz je...@gii.co.jp No takers? Not willingly, no :-p This is a pretty complex problem, as SQL itself isn't particularly well- equipped to deal with fuzzy data. One approach that might work is using a fulltext indexing engine (MySQL's built-in ft indices, or an external one like Solr or something) and doing best-fit matches on the keywords of the title you're looking for. [JS] This isn't the only place I have to deal with fuzzy data. :-( Discretion prohibits further comment. A full-text index would work if I were only looking for one title at a time, but I don't know if that would be a good idea if I have a list of 1 titles. That would pretty much require either 1 separate queries or a very, very long WHERE clause. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com -- Bier met grenadyn Is als mosterd by den wyn Sy die't drinkt, is eene kwezel Hy die't drinkt, is ras een ezel -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Join based upon LIKE
I have to match lists of new publications against our database, so that I can replace the existing publications in our catalog. For example, The UK Market for Puppies in February 2011 would be a replacement for The UK Market for Puppies in December 2010 Unfortunately, the publishers aren't particularly careful with their titles. One might even say they are perverse. I am likely to get UK Market: Puppies - Feb 2011 as replacement for The UK Market for Puppies in December 2010 You can see that a straight match by title is not going to work. Here's what I've been doing: = SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD'); CREATE TEMPORARY TABLE new_titles ( new_title VARCHAR(255), INDEX (new_title), new_title_like VARCHAR(255), INDEX (new_title_like) ); INSERT INTO new_titles VALUES ('Alternative Energy Monthly Deal Analysis - MA and Investment Trends, April 2011', 'Alternative Energy Monthly Deal Analysis%MA%Investment Trends%'), ('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company Share, Price Trends, Capacity Forecasts of All Active and Planned Plants', 'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price Trends%Capacity Forecasts of All Active%Planned Plants'), ... ('Underground Gas Storage Industry Outlook in North America, 2011 - Details of All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas Storage Industry Outlook%North America%Details of All Operating%Planned Gas Storage Sites to%'), ('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017', 'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%'); SELECT prod.prod_title AS `Title IN Database`, new_titles.new_title AS `Title IN Feed`, prod.prod_num AS `ID` FROM new_titles JOIN prod ON prod.prod_title LIKE (new_titles.new_title_like) AND prod.pub_id = @PUBID AND prod.prod_discont = 0 ORDER BY new_titles.new_title; == (I've written code that substitutes % for certain strings that I specify, and there is some trial and error involved.) Here's how MySQL handles that SELECT: *** 1. row *** id: 1 select_type: SIMPLE table: new_titles type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 47 Extra: Using filesort *** 2. row *** id: 1 select_type: SIMPLE table: prod type: ref possible_keys: pub_id key: pub_id key_len: 48 ref: const rows: 19607 Extra: Using where = Here's the important part of the table `prod`: = Table: prod Create Table: CREATE TABLE `prod` ( `prod_id` varchar(15) NOT NULL DEFAULT '', `prod_num` mediumint(6) unsigned DEFAULT NULL, `prod_title` varchar(255) DEFAULT NULL, `prod_type` varchar(2) DEFAULT NULL, `prod_vat_pct` decimal(5,2) DEFAULT NULL, `prod_discont` tinyint(1) DEFAULT NULL, `prod_replacing` mediumint(6) unsigned DEFAULT NULL, `prod_replaced_by` mediumint(6) unsigned DEFAULT NULL, `prod_ready` tinyint(1) DEFAULT NULL, `pub_id` varchar(15) DEFAULT NULL, ... PRIMARY KEY (`prod_id`), UNIQUE KEY `prod_num` (`prod_num`), KEY `prod_pub_prod_id` (`prod_pub_prod_id`), KEY `pub_id` (`pub_id`), KEY `prod_title` (`prod_title`), FULLTEXT KEY `prod_title_fulltext` (`prod_title`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 = This works reasonably well for a small number (perhaps 200-300) of new products; but now I've been handed a list of over 15000 to stuff into the table `new_titles`! This motivates me to wonder if there is a better way, since I expect this to take a very long time. Suggestions? Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 E-mail: je...@gii.co.jp Web site: www.the-infoshop.com -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org