Re: FW: Join based upon LIKE

2011-04-30 Thread Hal�sz S�ndor
 2011/04/28 15:28 -0400, Jerry Schwartz 
No takers?

And this is not real taking, because the algorithm of which I am thinking, the 
edit-distance (Levens(h)tein-distance) algorithm costs too much for you (see 
the Wikipedia entry). The obvious implementation takes as many steps as the 
product of the two compared strings s length. On the other hand, a good 
implementation of LIKE costs the pattern s length added to all the strings 
against which it matches s length, a sum, not product, of lengths.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: FW: Join based upon LIKE

2011-04-30 Thread Hal�sz S�ndor
 2011/04/28 15:28 -0400, Jerry Schwartz 
No takers?

And this is not real taking, because the algorithm of which I am thinking, the 
edit-distance (Levens(h)tein-distance) algorithm costs too much for you (see 
the Wikipedia entry), but it yields, I believe, much more nearly such answer as 
you want.

The obvious implementation takes as many steps as the product of the two 
compared strings s length. On the other hand, a good implementation of LIKE 
costs the pattern s length added to all the strings against which it matches s 
length, a sum, not product, of lengths.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



FW: Join based upon LIKE

2011-04-28 Thread Jerry Schwartz
No takers?

-Original Message-
From: Jerry Schwartz [mailto:je...@gii.co.jp]
Sent: Monday, April 25, 2011 2:34 PM
To: 'Mailing-List mysql'
Subject: Join based upon LIKE

I have to match lists of new publications against our database, so that I can 
replace the existing publications in our catalog. For example,

The UK Market for Puppies in February 2011

would be a replacement for

The UK Market for Puppies in December 2010

Unfortunately, the publishers aren't particularly careful with their titles. 
One might even say they are perverse. I am likely to get

UK Market: Puppies - Feb 2011

as replacement for

The UK Market for Puppies in December 2010

You can see that a straight match by title is not going to work.

Here's what I've been doing:

=

SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD');

CREATE TEMPORARY TABLE new_titles (
new_title VARCHAR(255), INDEX (new_title),
new_title_like VARCHAR(255), INDEX (new_title_like)
);

INSERT INTO new_titles
VALUES

('Alternative Energy Monthly Deal Analysis - MA and Investment Trends, April 
2011', 'Alternative Energy Monthly Deal Analysis%MA%Investment Trends%'),
('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company 
Share, Price Trends, Capacity Forecasts of All Active and Planned Plants', 
'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price 
Trends%Capacity Forecasts of All Active%Planned Plants'),
...
('Underground Gas Storage Industry Outlook in North America, 2011 - Details of 
All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas Storage 
Industry Outlook%North America%Details of All Operating%Planned Gas Storage 
Sites to%'),
('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017', 
'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%');

SELECT prod.prod_title AS `Title IN Database`,
new_titles.new_title AS `Title IN Feed`,
prod.prod_num AS `ID`
FROM new_titles JOIN prod ON prod.prod_title LIKE (new_titles.new_title_like)
AND prod.pub_id = @PUBID AND prod.prod_discont = 0
ORDER BY new_titles.new_title;
==

(I've written code that substitutes % for certain strings that I specify, 
and there is some trial and error involved.)

Here's how MySQL handles that SELECT:

*** 1. row ***
   id: 1
  select_type: SIMPLE
table: new_titles
 type: ALL
possible_keys: NULL
  key: NULL
  key_len: NULL
  ref: NULL
 rows: 47
Extra: Using filesort
*** 2. row ***
   id: 1
  select_type: SIMPLE
table: prod
 type: ref
possible_keys: pub_id
  key: pub_id
  key_len: 48
  ref: const
 rows: 19607
Extra: Using where
=

Here's the important part of the table `prod`:

=

   Table: prod
Create Table: CREATE TABLE `prod` (
  `prod_id` varchar(15) NOT NULL DEFAULT '',
  `prod_num` mediumint(6) unsigned DEFAULT NULL,
  `prod_title` varchar(255) DEFAULT NULL,
  `prod_type` varchar(2) DEFAULT NULL,
  `prod_vat_pct` decimal(5,2) DEFAULT NULL,
  `prod_discont` tinyint(1) DEFAULT NULL,
  `prod_replacing` mediumint(6) unsigned DEFAULT NULL,
  `prod_replaced_by` mediumint(6) unsigned DEFAULT NULL,
  `prod_ready` tinyint(1) DEFAULT NULL,
  `pub_id` varchar(15) DEFAULT NULL,
...
  PRIMARY KEY (`prod_id`),
  UNIQUE KEY `prod_num` (`prod_num`),
  KEY `prod_pub_prod_id` (`prod_pub_prod_id`),
  KEY `pub_id` (`pub_id`),
  KEY `prod_title` (`prod_title`),
  FULLTEXT KEY `prod_title_fulltext` (`prod_title`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

=

This works reasonably well for a small number (perhaps 200-300) of new 
products; but now I've been handed a list of over 15000 to stuff into the 
table `new_titles`! This motivates me to wonder if there is a better way, 
since I expect this to take a very long time.

Suggestions?

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com





-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org