RE: Join based upon LIKE

2011-05-05 Thread Jerry Schwartz
-Original Message-
From: Nuno Tavares [mailto:nuno.tava...@dri.pt]
Sent: Tuesday, May 03, 2011 6:21 PM
To: mysql@lists.mysql.com
Subject: Re: Join based upon LIKE

Dear Jerry,

I've been silently following this discussion because I've missed the
original question.

But from your last explanation, now it really looks you have a data
quality kind of issue, which is by far related with MySQL.

[JS] Definitely -- but I have to work with the tools available. This is only 
one part of the process, there is more trouble further on that is not related 
to our database at all.

Indeed, in Data Quality, there is *never* a ready solution, because the
source is tipically chaotic

May I suggest you to explore Google Refine? It seems to be able to
address all those issues quite nicely, and the clustering might solve
your problem at once. You shall know, however, how to export the tables
(or a usable JOIN) as a CSV, see SELECT ... INTO OUTFILE for that.

[JS] I never heard of Google Refine. Thanks for bringing to my attention.

Hope it helps,
-NT
[JS] Thank you.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com




Em 03-05-2011 21:34, Jerry Schwartz escreveu:
 My situation is sounds rather simple. All I am doing is matching a
spreadsheet
 of products against our database. My job is to find any matches against
 existing products and determine which ones are new, which ones are
 replacements for older products, and which ones just need to have the
 publication date (and page count, price, whatever) refreshed.

 Publisher is no problem. What I have for each feed is a title and (most 
 of
 the time) an ISBN or other identification assigned by the publisher.

 Matching by product ID is easy (assuming there aren't any mistakes in the
 current or previous feeds); but the publisher might or might not change the
 product ID when they update a report. That's why I also run a match by 
 title,
 and that's where all the trouble comes from.

 The publisher might or might not include a mix of old and new products in a
 feed. The publisher might change the title of an existing product, either 
 on
 purpose or by accident; they might simply be sloppy about their spelling; 
 or
 (and this is where it is critical) the title might include a reference to
some
 time period such as a year or a quarter.

 I think we'd better pull the plug on this discussion. It doesn't seem like
 there's a ready solution. Fortunately our database is small, and most feeds
 are only a few hundred products.

 Regards,

 Jerry Schwartz
 Global Information Incorporated
 195 Farmington Ave.
 Farmington, CT 06032

 860.674.8796 / FAX: 860.674.8341
 E-mail: je...@gii.co.jp
 Web site: www.the-infoshop.com


 -Original Message-
 From: shawn wilson [mailto:ag4ve...@gmail.com]
 Sent: Tuesday, May 03, 2011 4:08 PM
 Cc: mysql mailing list
 Subject: Re: Join based upon LIKE

 I'm actually enjoying this discussion because I have the same type of 
 issue.
 However, I have done away with trying to do a full text search in favor of
 making a table with unique fields where all fields should uniquely 
 identify
 the group. If I get a dupe, I can clean it up.

 However, like you, they don't want me to mess with the original data. So,
 what I have is another table with my good data that my table with my 
 unique
 data refers to. If a bad record is creased, I don't care I just create my
 relationship to the table of data I know (read think - I rarely look at 
 this
 stuff) is good.

 So, I have 4 fields that should be unique for a group. Two chats and two
 ints. If three of these match a record in the 'good data' table - there's 
 my
 relationship. If two or less match, I create a new record in my 'good 
 data'
 table and log the event. (I haven't gotten to the logging part yet though,
 easy enough just to look sense none of the fields in 'good data' should
 match)

 I'm thinking you might have to dig deeper than me to find 'good data' but 
 I
 think its there. Maybe isbn, name, publisher + address, price, average
 pages, name of sales person, who you guys pay for the material, etc etc 
 etc.


 On May 3, 2011 10:59 AM, Johan De Meersman vegiv...@tuxera.be wrote:


 - Original Message -
 From: Jerry Schwartz je...@gii.co.jp

 I'm not sure that I could easily build a dictionary of non-junk
 words, since

 The traditional way is to build a database of junk words. The list tends
 to be shorter :-)

 Think and/or/it/the/with/like/...

 Percentages of mutual and non-mutual words between two titles should be a
 reasonable indicator of likeness. You could conceivably even assign value 
 to
 individual words, so polypropylbutanate is more useful than synergy 
 for
 comparison purposes.

 All very theoretical, though, I haven't actually done much of it to this
 level. My experience in data mangling is limited to mostly
 should

Re: Join based upon LIKE

2011-05-03 Thread Johan De Meersman

http://www.gedpage.com/soundex.html offers a simple explanation of what it does.

One possibility would be building a referential table with only a recordID and 
soundex column, unique over both; and filling that with the soundex of 
individual nonjunk words.

So, from the titles

1 | Rain in Spain
2 | Spain's Rain

you'd get

1 | R500
1 | S150
2 | S150
2 | R500

From thereon, you can see that all the same words have been used - ignoring a 
lot of spelling errors like Spian. Obviously not a magic solution, but it's a 
start.

- Original Message -
 From: Jerry Schwartz je...@gii.co.jp
 To: Johan De Meersman vegiv...@tuxera.be
 Cc: Jim McNeely j...@newcenturydata.com, mysql mailing list 
 mysql@lists.mysql.com
 Sent: Monday, 2 May, 2011 4:09:36 PM
 Subject: RE: Join based upon LIKE
 
 [JS] I've thought about using soundex(), but I'm not quite sure how.
 
 I didn't pursue it much because there are so many odd terms such as
 chemical
 names, but perhaps I should give it a try in my infinite free time.
 
 
 [JS] Thanks for your condolences.
 
 Regards,
 
 Jerry Schwartz
 Global Information Incorporated
 195 Farmington Ave.
 Farmington, CT 06032
 
 860.674.8796 / FAX: 860.674.8341
 E-mail: je...@gii.co.jp
 Web site: www.the-infoshop.com
 

-- 
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



RE: Join based upon LIKE

2011-05-03 Thread Jerry Schwartz
-Original Message-
From: Johan De Meersman [mailto:vegiv...@tuxera.be]
Sent: Tuesday, May 03, 2011 5:31 AM
To: Jerry Schwartz
Cc: Jim McNeely; mysql mailing list; Johan De Meersman
Subject: Re: Join based upon LIKE


http://www.gedpage.com/soundex.html offers a simple explanation of what it
does.

One possibility would be building a referential table with only a recordID 
and
soundex column, unique over both; and filling that with the soundex of
individual nonjunk words.

So, from the titles

1 | Rain in Spain
2 | Spain's Rain

you'd get

1 | R500
1 | S150
2 | S150
2 | R500

From thereon, you can see that all the same words have been used - ignoring a
lot of spelling errors like Spian. Obviously not a magic solution, but it's a
start.

[JS] Thanks.

I'm not sure that I could easily build a dictionary of non-junk words, since 
some of these reports have titles like Toluene Diisocyanate Market Outlook 
2008, Toluene Market Outlook 2008, and Toluene: 2009 World Market Outlook 
And Forecast (Special Crisis Edition).

I shall ponder this when I am caught up, or (more likely) in the afterlife.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com

- Original Message -
 From: Jerry Schwartz je...@gii.co.jp
 To: Johan De Meersman vegiv...@tuxera.be
 Cc: Jim McNeely j...@newcenturydata.com, mysql mailing list
mysql@lists.mysql.com
 Sent: Monday, 2 May, 2011 4:09:36 PM
 Subject: RE: Join based upon LIKE

 [JS] I've thought about using soundex(), but I'm not quite sure how.

 I didn't pursue it much because there are so many odd terms such as
 chemical
 names, but perhaps I should give it a try in my infinite free time.


 [JS] Thanks for your condolences.

 Regards,

 Jerry Schwartz
 Global Information Incorporated
 195 Farmington Ave.
 Farmington, CT 06032

 860.674.8796 / FAX: 860.674.8341
 E-mail: je...@gii.co.jp
 Web site: www.the-infoshop.com


--
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: Join based upon LIKE

2011-05-03 Thread Johan De Meersman

- Original Message -
 From: Jerry Schwartz je...@gii.co.jp
 
 I'm not sure that I could easily build a dictionary of non-junk
 words, since

The traditional way is to build a database of junk words. The list tends to be 
shorter :-)

Think and/or/it/the/with/like/...

Percentages of mutual and non-mutual words between two titles should be a 
reasonable indicator of likeness. You could conceivably even assign value to 
individual words, so polypropylbutanate is more useful than synergy for 
comparison purposes.

All very theoretical, though, I haven't actually done much of it to this level. 
My experience in data mangling is limited to mostly should-be-fixed-format data 
like sports results.


-- 
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: Join based upon LIKE

2011-05-03 Thread shawn wilson
I'm actually enjoying this discussion because I have the same type of issue.
However, I have done away with trying to do a full text search in favor of
making a table with unique fields where all fields should uniquely identify
the group. If I get a dupe, I can clean it up.

However, like you, they don't want me to mess with the original data. So,
what I have is another table with my good data that my table with my unique
data refers to. If a bad record is creased, I don't care I just create my
relationship to the table of data I know (read think - I rarely look at this
stuff) is good.

So, I have 4 fields that should be unique for a group. Two chats and two
ints. If three of these match a record in the 'good data' table - there's my
relationship. If two or less match, I create a new record in my 'good data'
table and log the event. (I haven't gotten to the logging part yet though,
easy enough just to look sense none of the fields in 'good data' should
match)

I'm thinking you might have to dig deeper than me to find 'good data' but I
think its there. Maybe isbn, name, publisher + address, price, average
pages, name of sales person, who you guys pay for the material, etc etc etc.


On May 3, 2011 10:59 AM, Johan De Meersman vegiv...@tuxera.be wrote:


 - Original Message -
  From: Jerry Schwartz je...@gii.co.jp
 
  I'm not sure that I could easily build a dictionary of non-junk
  words, since

 The traditional way is to build a database of junk words. The list tends
to be shorter :-)

 Think and/or/it/the/with/like/...

 Percentages of mutual and non-mutual words between two titles should be a
reasonable indicator of likeness. You could conceivably even assign value to
individual words, so polypropylbutanate is more useful than synergy for
comparison purposes.

 All very theoretical, though, I haven't actually done much of it to this
level. My experience in data mangling is limited to mostly
should-be-fixed-format data like sports results.


 --
 Bier met grenadyn
 Is als mosterd by den wyn
 Sy die't drinkt, is eene kwezel
 Hy die't drinkt, is ras een ezel

 --
 MySQL General Mailing List
 For list archives: http://lists.mysql.com/mysql
 To unsubscribe:http://lists.mysql.com/mysql?unsub=ag4ve...@gmail.com



RE: Join based upon LIKE

2011-05-03 Thread Jerry Schwartz
My situation is sounds rather simple. All I am doing is matching a spreadsheet 
of products against our database. My job is to find any matches against 
existing products and determine which ones are new, which ones are 
replacements for older products, and which ones just need to have the 
publication date (and page count, price, whatever) refreshed.

Publisher is no problem. What I have for each feed is a title and (most of 
the time) an ISBN or other identification assigned by the publisher.

Matching by product ID is easy (assuming there aren't any mistakes in the 
current or previous feeds); but the publisher might or might not change the 
product ID when they update a report. That's why I also run a match by title, 
and that's where all the trouble comes from.

The publisher might or might not include a mix of old and new products in a 
feed. The publisher might change the title of an existing product, either on 
purpose or by accident; they might simply be sloppy about their spelling; or 
(and this is where it is critical) the title might include a reference to some 
time period such as a year or a quarter.

I think we'd better pull the plug on this discussion. It doesn't seem like 
there's a ready solution. Fortunately our database is small, and most feeds 
are only a few hundred products.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com


-Original Message-
From: shawn wilson [mailto:ag4ve...@gmail.com]
Sent: Tuesday, May 03, 2011 4:08 PM
Cc: mysql mailing list
Subject: Re: Join based upon LIKE

I'm actually enjoying this discussion because I have the same type of issue.
However, I have done away with trying to do a full text search in favor of
making a table with unique fields where all fields should uniquely identify
the group. If I get a dupe, I can clean it up.

However, like you, they don't want me to mess with the original data. So,
what I have is another table with my good data that my table with my unique
data refers to. If a bad record is creased, I don't care I just create my
relationship to the table of data I know (read think - I rarely look at this
stuff) is good.

So, I have 4 fields that should be unique for a group. Two chats and two
ints. If three of these match a record in the 'good data' table - there's my
relationship. If two or less match, I create a new record in my 'good data'
table and log the event. (I haven't gotten to the logging part yet though,
easy enough just to look sense none of the fields in 'good data' should
match)

I'm thinking you might have to dig deeper than me to find 'good data' but I
think its there. Maybe isbn, name, publisher + address, price, average
pages, name of sales person, who you guys pay for the material, etc etc etc.


On May 3, 2011 10:59 AM, Johan De Meersman vegiv...@tuxera.be wrote:


 - Original Message -
  From: Jerry Schwartz je...@gii.co.jp
 
  I'm not sure that I could easily build a dictionary of non-junk
  words, since

 The traditional way is to build a database of junk words. The list tends
to be shorter :-)

 Think and/or/it/the/with/like/...

 Percentages of mutual and non-mutual words between two titles should be a
reasonable indicator of likeness. You could conceivably even assign value to
individual words, so polypropylbutanate is more useful than synergy for
comparison purposes.

 All very theoretical, though, I haven't actually done much of it to this
level. My experience in data mangling is limited to mostly
should-be-fixed-format data like sports results.


 --
 Bier met grenadyn
 Is als mosterd by den wyn
 Sy die't drinkt, is eene kwezel
 Hy die't drinkt, is ras een ezel

 --
 MySQL General Mailing List
 For list archives: http://lists.mysql.com/mysql
 To unsubscribe:http://lists.mysql.com/mysql?unsub=ag4ve...@gmail.com





-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: Join based upon LIKE

2011-05-03 Thread Nuno Tavares
Dear Jerry,

I've been silently following this discussion because I've missed the
original question.

But from your last explanation, now it really looks you have a data
quality kind of issue, which is by far related with MySQL.

Indeed, in Data Quality, there is *never* a ready solution, because the
source is tipically chaotic

May I suggest you to explore Google Refine? It seems to be able to
address all those issues quite nicely, and the clustering might solve
your problem at once. You shall know, however, how to export the tables
(or a usable JOIN) as a CSV, see SELECT ... INTO OUTFILE for that.

Hope it helps,
-NT

Em 03-05-2011 21:34, Jerry Schwartz escreveu:
 My situation is sounds rather simple. All I am doing is matching a 
 spreadsheet 
 of products against our database. My job is to find any matches against 
 existing products and determine which ones are new, which ones are 
 replacements for older products, and which ones just need to have the 
 publication date (and page count, price, whatever) refreshed.
 
 Publisher is no problem. What I have for each feed is a title and (most of 
 the time) an ISBN or other identification assigned by the publisher.
 
 Matching by product ID is easy (assuming there aren't any mistakes in the 
 current or previous feeds); but the publisher might or might not change the 
 product ID when they update a report. That's why I also run a match by title, 
 and that's where all the trouble comes from.
 
 The publisher might or might not include a mix of old and new products in a 
 feed. The publisher might change the title of an existing product, either on 
 purpose or by accident; they might simply be sloppy about their spelling; or 
 (and this is where it is critical) the title might include a reference to 
 some 
 time period such as a year or a quarter.
 
 I think we'd better pull the plug on this discussion. It doesn't seem like 
 there's a ready solution. Fortunately our database is small, and most feeds 
 are only a few hundred products.
 
 Regards,
 
 Jerry Schwartz
 Global Information Incorporated
 195 Farmington Ave.
 Farmington, CT 06032
 
 860.674.8796 / FAX: 860.674.8341
 E-mail: je...@gii.co.jp
 Web site: www.the-infoshop.com
 
 
 -Original Message-
 From: shawn wilson [mailto:ag4ve...@gmail.com]
 Sent: Tuesday, May 03, 2011 4:08 PM
 Cc: mysql mailing list
 Subject: Re: Join based upon LIKE

 I'm actually enjoying this discussion because I have the same type of issue.
 However, I have done away with trying to do a full text search in favor of
 making a table with unique fields where all fields should uniquely identify
 the group. If I get a dupe, I can clean it up.

 However, like you, they don't want me to mess with the original data. So,
 what I have is another table with my good data that my table with my unique
 data refers to. If a bad record is creased, I don't care I just create my
 relationship to the table of data I know (read think - I rarely look at this
 stuff) is good.

 So, I have 4 fields that should be unique for a group. Two chats and two
 ints. If three of these match a record in the 'good data' table - there's my
 relationship. If two or less match, I create a new record in my 'good data'
 table and log the event. (I haven't gotten to the logging part yet though,
 easy enough just to look sense none of the fields in 'good data' should
 match)

 I'm thinking you might have to dig deeper than me to find 'good data' but I
 think its there. Maybe isbn, name, publisher + address, price, average
 pages, name of sales person, who you guys pay for the material, etc etc etc.


 On May 3, 2011 10:59 AM, Johan De Meersman vegiv...@tuxera.be wrote:


 - Original Message -
 From: Jerry Schwartz je...@gii.co.jp

 I'm not sure that I could easily build a dictionary of non-junk
 words, since

 The traditional way is to build a database of junk words. The list tends
 to be shorter :-)

 Think and/or/it/the/with/like/...

 Percentages of mutual and non-mutual words between two titles should be a
 reasonable indicator of likeness. You could conceivably even assign value to
 individual words, so polypropylbutanate is more useful than synergy for
 comparison purposes.

 All very theoretical, though, I haven't actually done much of it to this
 level. My experience in data mangling is limited to mostly
 should-be-fixed-format data like sports results.


 --
 Bier met grenadyn
 Is als mosterd by den wyn
 Sy die't drinkt, is eene kwezel
 Hy die't drinkt, is ras een ezel

 --
 MySQL General Mailing List
 For list archives: http://lists.mysql.com/mysql
 To unsubscribe:http://lists.mysql.com/mysql?unsub=ag4ve...@gmail.com

 
 
 
 


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



RE: Join based upon LIKE

2011-05-02 Thread Jerry Schwartz
-Original Message-
From: Johan De Meersman [mailto:vegiv...@tuxera.be]
Sent: Sunday, May 01, 2011 4:01 AM
To: Jerry Schwartz
Cc: Jim McNeely; mysql mailing list
Subject: Re: Join based upon LIKE


- Original Message -
 From: Jerry Schwartz je...@gii.co.jp

 I shove those modified titles into a table and do a JOIN ON
 `prod_title` LIKE
 `wild_title`.

Roughly what I meant with the shadow fields, yes - keep your own set of data
around :-)

I have little more to offer, then, I'm afraid. The soundex() algorithm may or
may not be of some use to you; it offers comparison based (roughly) on
pronounciation instead of spelling.

[JS] I've thought about using soundex(), but I'm not quite sure how.

I didn't pursue it much because there are so many odd terms such as chemical 
names, but perhaps I should give it a try in my infinite free time.

Apart from that, you have my deepest sympathy. I hope you can wake up from 
the
nightmare soon :-)

[JS] Thanks for your condolences.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com



--
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: Join based upon LIKE

2011-05-01 Thread Johan De Meersman

- Original Message -
 From: Jerry Schwartz je...@gii.co.jp
 
 I shove those modified titles into a table and do a JOIN ON
 `prod_title` LIKE
 `wild_title`.

Roughly what I meant with the shadow fields, yes - keep your own set of data 
around :-)

I have little more to offer, then, I'm afraid. The soundex() algorithm may or 
may not be of some use to you; it offers comparison based (roughly) on 
pronounciation instead of spelling.

Apart from that, you have my deepest sympathy. I hope you can wake up from the 
nightmare soon :-)

-- 
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: FW: Join based upon LIKE

2011-04-30 Thread Hal�sz S�ndor
 2011/04/28 15:28 -0400, Jerry Schwartz 
No takers?

And this is not real taking, because the algorithm of which I am thinking, the 
edit-distance (Levens(h)tein-distance) algorithm costs too much for you (see 
the Wikipedia entry). The obvious implementation takes as many steps as the 
product of the two compared strings s length. On the other hand, a good 
implementation of LIKE costs the pattern s length added to all the strings 
against which it matches s length, a sum, not product, of lengths.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: FW: Join based upon LIKE

2011-04-30 Thread Hal�sz S�ndor
 2011/04/28 15:28 -0400, Jerry Schwartz 
No takers?

And this is not real taking, because the algorithm of which I am thinking, the 
edit-distance (Levens(h)tein-distance) algorithm costs too much for you (see 
the Wikipedia entry), but it yields, I believe, much more nearly such answer as 
you want.

The obvious implementation takes as many steps as the product of the two 
compared strings s length. On the other hand, a good implementation of LIKE 
costs the pattern s length added to all the strings against which it matches s 
length, a sum, not product, of lengths.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: Join based upon LIKE

2011-04-29 Thread Johan De Meersman

- Original Message -
 From: Jerry Schwartz je...@gii.co.jp
 
 [JS] This isn't the only place I have to deal with fuzzy data. :-(
 Discretion prohibits further comment.

Heh. What you *really* need, is a LART. Preferably one of the spiked variety.

 A full-text index would work if I were only looking for one title at
 a time, but I don't know if that would be a good idea if I have a list of
 1 titles. That would pretty much require either 1 separate queries
 or a very, very long WHERE clause.

Yes, unfortunately. You should see if you can introduce a form of data 
normalisation - say, shadow fields with corrected entries, or functionality in 
the application that suggests correct entries based on what the user typed.

Or, if the money's there, you could have a look at Amazon Mechanical Turk (yes, 
really) for cheap-ish data correction.

-- 
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



RE: Join based upon LIKE

2011-04-29 Thread Jerry Schwartz
-Original Message-
From: Johan De Meersman [mailto:vegiv...@tuxera.be]
Sent: Friday, April 29, 2011 5:56 AM
To: Jerry Schwartz
Cc: mysql mailing list
Subject: Re: Join based upon LIKE


- Original Message -
 From: Jerry Schwartz je...@gii.co.jp

 [JS] This isn't the only place I have to deal with fuzzy data. :-(
 Discretion prohibits further comment.

Heh. What you *really* need, is a LART. Preferably one of the spiked variety.

[JS] Unless a LART is a demon of some kind, I don't know what it is.

 A full-text index would work if I were only looking for one title at
 a time, but I don't know if that would be a good idea if I have a list of
 1 titles. That would pretty much require either 1 separate queries
 or a very, very long WHERE clause.

Yes, unfortunately. You should see if you can introduce a form of data
normalisation - say, shadow fields with corrected entries, or functionality 
in
the application that suggests correct entries based on what the user typed.

[JS] Except for obvious misspellings and non-ASCII characters, I do not have 
the freedom to muck with the text. If the data were created in-house, I could 
correct it on the way in; but it comes from myriad other companies.

Or, if the money's there, you could have a look at Amazon Mechanical Turk 
(yes,
really) for cheap-ish data correction.

[JS] Again, I can't change the data. The titles are assigned by the 
publishers. Think what would happen if Amazon decided to fix the titles of 
books. Ain't Misbehavin would, at best, turn into I am not misbehaving.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com



--
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



RE: Join based upon LIKE

2011-04-29 Thread Jerry Schwartz
-Original Message-
From: Jim McNeely [mailto:j...@newcenturydata.com]
Sent: Thursday, April 28, 2011 6:43 PM
To: Jerry Schwartz
Subject: Re: Join based upon LIKE

It just smells wrong, a nicer system would have you joining on ID's of some
kind so that spelling wouldn't matter. I don't know the full situation for 
you
though.

[JS] That would be nice, wouldn't it.

In a nutshell, we sell publications. Publishers send us lists of publications. 
Some are new, some replace previous editions. (Think of books, almanacs, and 
newsletters.) Some publishers make do without any product IDs at all, but most 
do use product IDs of some kind.

The problem is that the March edition of a publication might or might not have 
the same product ID as the February edition. I try to match them both by 
product ID and by title. Sometimes the title will fuzzy match, but the ID 
won't; sometimes the ID will match but the title won't; sometimes (if I'm 
really lucky) they both match; and sometimes the ID matches one product and 
the title matches another.

It's the fuzzy match by title that gives me fits:

- The title might have a date in it (Rain in Spain in 2010 Q2), but not 
necessarily in a uniform way (Rain in Spain Q3 2010).
- The title might have differences in wording or punctuation (Rain in Spain - 
2010Q2).
- The title might have simple misspellings (Rain in Spian - Q2 2010).

I've written code that looks for troublesome constructs and replaces them with 
%:  in , -,  to , Q2, 2Q, and more and more. So Rain in Spain - 
2010 Q2 becomes Rain%Spain%.

I shove those modified titles into a table and do a JOIN ON `prod_title` LIKE 
`wild_title`.

This will miss actual misspellings (Spain, Spian). It will also produce a 
large number of false positives.

On the back end, I have other code that compares the new titles against the 
titles retrieved by that query and decides if they are exact matches, 
approximate matches (here I do use regular expressions, as well as lists of 
known bad boys), or false positives. From there on, it's all hand work.

Pretty big nut, eh?

So that's why I need to use LIKE in my JOIN.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com




Jim McNeely

On Apr 28, 2011, at 12:28 PM, Jerry Schwartz wrote:

 No takers?

 -Original Message-
 From: Jerry Schwartz [mailto:je...@gii.co.jp]
 Sent: Monday, April 25, 2011 2:34 PM
 To: 'Mailing-List mysql'
 Subject: Join based upon LIKE

 I have to match lists of new publications against our database, so that I 
 can
 replace the existing publications in our catalog. For example,

 The UK Market for Puppies in February 2011

 would be a replacement for

 The UK Market for Puppies in December 2010

 Unfortunately, the publishers aren't particularly careful with their 
 titles.
 One might even say they are perverse. I am likely to get

 UK Market: Puppies - Feb 2011

 as replacement for

 The UK Market for Puppies in December 2010

 You can see that a straight match by title is not going to work.

 Here's what I've been doing:

 =

 SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD');

 CREATE TEMPORARY TABLE new_titles (
  new_title VARCHAR(255), INDEX (new_title),
  new_title_like VARCHAR(255), INDEX (new_title_like)
  );

 INSERT INTO new_titles
 VALUES

 ('Alternative Energy Monthly Deal Analysis - MA and Investment Trends, 
 April
 2011', 'Alternative Energy Monthly Deal Analysis%MA%Investment Trends%'),
 ('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company
 Share, Price Trends, Capacity Forecasts of All Active and Planned Plants',
 'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price
 Trends%Capacity Forecasts of All Active%Planned Plants'),
 ...
 ('Underground Gas Storage Industry Outlook in North America, 2011 - Details
of
 All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas
Storage
 Industry Outlook%North America%Details of All Operating%Planned Gas Storage
 Sites to%'),
 ('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017',
 'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%');

 SELECT prod.prod_title AS `Title IN Database`,
new_titles.new_title AS `Title IN Feed`,
prod.prod_num AS `ID`
 FROM new_titles JOIN prod ON prod.prod_title LIKE 
 (new_titles.new_title_like)
  AND prod.pub_id = @PUBID AND prod.prod_discont = 0
 ORDER BY new_titles.new_title;
 ==

 (I've written code that substitutes % for certain strings that I specify,
 and there is some trial and error involved.)

 Here's how MySQL handles that SELECT:

 *** 1. row ***
   id: 1
  select_type: SIMPLE
table: new_titles
 type: ALL
 possible_keys: NULL
  key: NULL
  key_len: NULL
  ref: NULL
 rows

FW: Join based upon LIKE

2011-04-28 Thread Jerry Schwartz
No takers?

-Original Message-
From: Jerry Schwartz [mailto:je...@gii.co.jp]
Sent: Monday, April 25, 2011 2:34 PM
To: 'Mailing-List mysql'
Subject: Join based upon LIKE

I have to match lists of new publications against our database, so that I can 
replace the existing publications in our catalog. For example,

The UK Market for Puppies in February 2011

would be a replacement for

The UK Market for Puppies in December 2010

Unfortunately, the publishers aren't particularly careful with their titles. 
One might even say they are perverse. I am likely to get

UK Market: Puppies - Feb 2011

as replacement for

The UK Market for Puppies in December 2010

You can see that a straight match by title is not going to work.

Here's what I've been doing:

=

SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD');

CREATE TEMPORARY TABLE new_titles (
new_title VARCHAR(255), INDEX (new_title),
new_title_like VARCHAR(255), INDEX (new_title_like)
);

INSERT INTO new_titles
VALUES

('Alternative Energy Monthly Deal Analysis - MA and Investment Trends, April 
2011', 'Alternative Energy Monthly Deal Analysis%MA%Investment Trends%'),
('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company 
Share, Price Trends, Capacity Forecasts of All Active and Planned Plants', 
'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price 
Trends%Capacity Forecasts of All Active%Planned Plants'),
...
('Underground Gas Storage Industry Outlook in North America, 2011 - Details of 
All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas Storage 
Industry Outlook%North America%Details of All Operating%Planned Gas Storage 
Sites to%'),
('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017', 
'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%');

SELECT prod.prod_title AS `Title IN Database`,
new_titles.new_title AS `Title IN Feed`,
prod.prod_num AS `ID`
FROM new_titles JOIN prod ON prod.prod_title LIKE (new_titles.new_title_like)
AND prod.pub_id = @PUBID AND prod.prod_discont = 0
ORDER BY new_titles.new_title;
==

(I've written code that substitutes % for certain strings that I specify, 
and there is some trial and error involved.)

Here's how MySQL handles that SELECT:

*** 1. row ***
   id: 1
  select_type: SIMPLE
table: new_titles
 type: ALL
possible_keys: NULL
  key: NULL
  key_len: NULL
  ref: NULL
 rows: 47
Extra: Using filesort
*** 2. row ***
   id: 1
  select_type: SIMPLE
table: prod
 type: ref
possible_keys: pub_id
  key: pub_id
  key_len: 48
  ref: const
 rows: 19607
Extra: Using where
=

Here's the important part of the table `prod`:

=

   Table: prod
Create Table: CREATE TABLE `prod` (
  `prod_id` varchar(15) NOT NULL DEFAULT '',
  `prod_num` mediumint(6) unsigned DEFAULT NULL,
  `prod_title` varchar(255) DEFAULT NULL,
  `prod_type` varchar(2) DEFAULT NULL,
  `prod_vat_pct` decimal(5,2) DEFAULT NULL,
  `prod_discont` tinyint(1) DEFAULT NULL,
  `prod_replacing` mediumint(6) unsigned DEFAULT NULL,
  `prod_replaced_by` mediumint(6) unsigned DEFAULT NULL,
  `prod_ready` tinyint(1) DEFAULT NULL,
  `pub_id` varchar(15) DEFAULT NULL,
...
  PRIMARY KEY (`prod_id`),
  UNIQUE KEY `prod_num` (`prod_num`),
  KEY `prod_pub_prod_id` (`prod_pub_prod_id`),
  KEY `pub_id` (`pub_id`),
  KEY `prod_title` (`prod_title`),
  FULLTEXT KEY `prod_title_fulltext` (`prod_title`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

=

This works reasonably well for a small number (perhaps 200-300) of new 
products; but now I've been handed a list of over 15000 to stuff into the 
table `new_titles`! This motivates me to wonder if there is a better way, 
since I expect this to take a very long time.

Suggestions?

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com





-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: Join based upon LIKE

2011-04-28 Thread Johan De Meersman

- Original Message -
 From: Jerry Schwartz je...@gii.co.jp
 
 No takers?

Not willingly, no :-p

This is a pretty complex problem, as SQL itself isn't particularly 
well-equipped to deal with fuzzy data. One approach that might work is using a 
fulltext indexing engine (MySQL's built-in ft indices, or an external one like 
Solr or something) and doing best-fit matches on the keywords of the title 
you're looking for.
 

-- 
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



RE: Join based upon LIKE

2011-04-28 Thread Jerry Schwartz
-Original Message-
From: Johan De Meersman [mailto:vegiv...@tuxera.be]
Sent: Thursday, April 28, 2011 4:18 PM
To: Jerry Schwartz
Cc: mysql mailing list
Subject: Re: Join based upon LIKE


- Original Message -
 From: Jerry Schwartz je...@gii.co.jp

 No takers?

Not willingly, no :-p

This is a pretty complex problem, as SQL itself isn't particularly well-
equipped to deal with fuzzy data. One approach that might work is using a
fulltext indexing engine (MySQL's built-in ft indices, or an external one 
like
Solr or something) and doing best-fit matches on the keywords of the title
you're looking for.

[JS] This isn't the only place I have to deal with fuzzy data. :-( Discretion 
prohibits further comment.

A full-text index would work if I were only looking for one title at a time, 
but I don't know if that would be a good idea if I have a list of 1 
titles. That would pretty much require either 1 separate queries or a 
very, very long WHERE clause.


Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com




--
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Join based upon LIKE

2011-04-25 Thread Jerry Schwartz
I have to match lists of new publications against our database, so that I can 
replace the existing publications in our catalog. For example,

The UK Market for Puppies in February 2011

would be a replacement for

The UK Market for Puppies in December 2010

Unfortunately, the publishers aren't particularly careful with their titles. 
One might even say they are perverse. I am likely to get

UK Market: Puppies - Feb 2011

as replacement for

The UK Market for Puppies in December 2010

You can see that a straight match by title is not going to work.

Here's what I've been doing:

=

SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD');

CREATE TEMPORARY TABLE new_titles (
new_title VARCHAR(255), INDEX (new_title),
new_title_like VARCHAR(255), INDEX (new_title_like)
);

INSERT INTO new_titles
VALUES

('Alternative Energy Monthly Deal Analysis - MA and Investment Trends, April 
2011', 'Alternative Energy Monthly Deal Analysis%MA%Investment Trends%'),
('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company 
Share, Price Trends, Capacity Forecasts of All Active and Planned Plants', 
'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price 
Trends%Capacity Forecasts of All Active%Planned Plants'),
...
('Underground Gas Storage Industry Outlook in North America, 2011 - Details of 
All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas Storage 
Industry Outlook%North America%Details of All Operating%Planned Gas Storage 
Sites to%'),
('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017', 
'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%');

SELECT prod.prod_title AS `Title IN Database`,
new_titles.new_title AS `Title IN Feed`,
prod.prod_num AS `ID`
FROM new_titles JOIN prod ON prod.prod_title LIKE (new_titles.new_title_like)
AND prod.pub_id = @PUBID AND prod.prod_discont = 0
ORDER BY new_titles.new_title;
==

(I've written code that substitutes % for certain strings that I specify, 
and there is some trial and error involved.)

Here's how MySQL handles that SELECT:

*** 1. row ***
   id: 1
  select_type: SIMPLE
table: new_titles
 type: ALL
possible_keys: NULL
  key: NULL
  key_len: NULL
  ref: NULL
 rows: 47
Extra: Using filesort
*** 2. row ***
   id: 1
  select_type: SIMPLE
table: prod
 type: ref
possible_keys: pub_id
  key: pub_id
  key_len: 48
  ref: const
 rows: 19607
Extra: Using where
=

Here's the important part of the table `prod`:

=

   Table: prod
Create Table: CREATE TABLE `prod` (
  `prod_id` varchar(15) NOT NULL DEFAULT '',
  `prod_num` mediumint(6) unsigned DEFAULT NULL,
  `prod_title` varchar(255) DEFAULT NULL,
  `prod_type` varchar(2) DEFAULT NULL,
  `prod_vat_pct` decimal(5,2) DEFAULT NULL,
  `prod_discont` tinyint(1) DEFAULT NULL,
  `prod_replacing` mediumint(6) unsigned DEFAULT NULL,
  `prod_replaced_by` mediumint(6) unsigned DEFAULT NULL,
  `prod_ready` tinyint(1) DEFAULT NULL,
  `pub_id` varchar(15) DEFAULT NULL,
...
  PRIMARY KEY (`prod_id`),
  UNIQUE KEY `prod_num` (`prod_num`),
  KEY `prod_pub_prod_id` (`prod_pub_prod_id`),
  KEY `pub_id` (`pub_id`),
  KEY `prod_title` (`prod_title`),
  FULLTEXT KEY `prod_title_fulltext` (`prod_title`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

=

This works reasonably well for a small number (perhaps 200-300) of new 
products; but now I've been handed a list of over 15000 to stuff into the 
table `new_titles`! This motivates me to wonder if there is a better way, 
since I expect this to take a very long time.

Suggestions?

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com





-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org