Re: FW: Join based upon LIKE

2011-04-29 Thread Hal�sz S�ndor
 2011/04/28 15:28 -0400, Jerry Schwartz 
No takers?

And this is not real taking, because the algorithm of which I am thinking, the 
edit-distance (Levens(h)tein-distance) algorithm costs too much for you (see 
the Wikipedia entry). The obvious implementation takes as many steps as the 
product of the two compared strings s length. On the other hand, a good 
implementation of "LIKE" costs the pattern s length added to all the strings 
against which it matches s length, a sum, not product, of lengths.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



RE: ERROR 2006 (HY000)

2011-04-29 Thread Jerry Schwartz
>-Original Message-
>From: Larry Martell [mailto:larry.mart...@gmail.com]
>Sent: Friday, April 29, 2011 5:31 PM
>To: Jerry Schwartz
>Cc: mysql mailing list
>Subject: Re: ERROR 2006 (HY000)
>
>On Fri, Apr 29, 2011 at 3:23 PM, Jerry Schwartz  wrote:
>> I'm getting
>>
>> ERROR 2006 (HY000) at line 10: MySQL server has gone away
>>
>> while trying to do an insert through the MySQL CLI. Our database is so 
>> small
>> that I haven't gotten around to doing any tuning, so this came out of the
>> blue. I'm not quite sure where to start.
>>
>> Here's what things look like at my end:
>>
>> =
>>
>> select version();
>> +--+
>> | version()|
>> +--+
>> | 5.1.36-community |
>> +--+
>>
>>
>> DROP TEMPORARY TABLE IF EXISTS new_titles;
>>
>> CREATE TEMPORARY TABLE new_titles (
>>new_title VARCHAR(255), INDEX (new_title),
>>new_title_like VARCHAR(255), INDEX (new_title_like)
>>) ENGINE MYISAM;
>>
>> INSERT INTO new_titles
>> VALUES
>>
>> ('(I.Z) Queenco Ltd. (QNCO) - Financial and Strategic SWOT Analysis 
>> Review',
>> '(I%Z) Queenco Ltd%(QNCO)%Financial%Strategic SWOT Analysis Review'),
>> ('@Comm Corporation (ATCM) - Strategic SWOT Analysis Review', '@Comm
>> Corporation (ATCM)%Strategic SWOT Analysis Review'),
>> ('010017 Telecom GmbH - Strategic SWOT Analysis Review', '010017 Telecom
>> GmbH%Strategic SWOT Analysis Review'),
>> ('1 & 1 Internet AG - Strategic SWOT Analysis Review', '1%1 Internet
>> AG%Strategic SWOT Analysis Review'),
>> ('1,618 STRICT AB (STRI B) - Financial and Strategic SWOT Analysis Review',
>> '1%618 STRICT AB (STRI B)%Financial%Strategic SWOT Analysis Review'),
>> ('1199SEIU Benefit and Pension Funds - Strategic SWOT Analysis Review',
>> '1199SEIU Benefit%Pension Funds%Strategic SWOT Analysis Review'),
>> ('1300 Smiles Limited (ONT) - Financial and Strategic SWOT Analysis 
>> Review',
>> '1300 Smiles Limited (ONT)%Financial%Strategic SWOT Analysis Review'),
>> ('141 Capital Inc. - Strategic SWOT Analysis Review', '141 Capital
>> Inc%Strategic SWOT Analysis Review'),
>> ('1855 SA (AL185) - Financial and Strategic SWOT Analysis Review', '1855 SA
>> (AL185)%Financial%Strategic SWOT Analysis Review'),
>> ('1pm plc (OPM) - Financial and Strategic SWOT Analysis Review', '1pm plc
>> (OPM)%Financial%Strategic SWOT Analysis Review'),
>> ('1st Century Bancshares, Inc. (FCTY) - Financial and Strategic SWOT 
>> Analysis
>> Review', '1st Century Bancshares%Inc%(FCTY)%Financial%Strategic SWOT 
>> Analysis
>> Review'),
>> ('1st NRG Corp. (FNRC) - Financial and Strategic SWOT Analysis Review', 
>> '1st
>> NRG Corp%(FNRC)%Financial%Strategic SWOT Analysis Review'),
>> ('1st RED AG (SXL) - Financial and Strategic SWOT Analysis Review', '1st 
>> RED
>> AG (SXL)%Financial%Strategic SWOT Analysis Review'),
>> ('20 Microns Limited (533022) - Financial and Strategic SWOT Analysis
>Review',
>> '20 Microns Limited (533022)%Financial%Strategic SWOT Analysis Review'),
>> ('21 Holdings Limited (1003) - Financial and Strategic SWOT Analysis 
>> Review',
>> '21 Holdings Limited (1003)%Financial%Strategic SWOT Analysis Review'),
>> ('21LADY Co., Ltd. (3346) - Financial and Strategic SWOT Analysis Review',
>> '21LADY Co%Ltd%(3346)%Financial%Strategic SWOT Analysis Review'),
>>
>> ==
>>
>> ... and so on for about 15000 rows.
>>
>> Before I go through changing this to 15000 separate inserts, I'd like to 
>> know
>> how to "fix" the problem the right way. I looked at all of the system
>> variables, and none of them seemed to apply. I probably missed something
>> obvious.
>
>This is most likely due to your query exceeding the max packet size.
>Break it up into smaller pieces or increase max_allowed_packet.
[JS] You hit the nail on the head.

I wondered about max_allowed_packet, but the documentation says that it 
applies to BLOBs or long strings. It didn't occur to me that a query could, 
itself, be a long string.

My query is a tad more than 2mb, so setting max_allowed_packet to 3mb fixed it 
right up.

Thanks!

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com








-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: ERROR 2006 (HY000)

2011-04-29 Thread Larry Martell
On Fri, Apr 29, 2011 at 3:23 PM, Jerry Schwartz  wrote:
> I'm getting
>
> ERROR 2006 (HY000) at line 10: MySQL server has gone away
>
> while trying to do an insert through the MySQL CLI. Our database is so small
> that I haven't gotten around to doing any tuning, so this came out of the
> blue. I'm not quite sure where to start.
>
> Here's what things look like at my end:
>
> =
>
> select version();
> +--+
> | version()        |
> +--+
> | 5.1.36-community |
> +--+
>
>
> DROP TEMPORARY TABLE IF EXISTS new_titles;
>
> CREATE TEMPORARY TABLE new_titles (
>        new_title VARCHAR(255), INDEX (new_title),
>        new_title_like VARCHAR(255), INDEX (new_title_like)
>        ) ENGINE MYISAM;
>
> INSERT INTO new_titles
> VALUES
>
> ('(I.Z) Queenco Ltd. (QNCO) - Financial and Strategic SWOT Analysis Review',
> '(I%Z) Queenco Ltd%(QNCO)%Financial%Strategic SWOT Analysis Review'),
> ('@Comm Corporation (ATCM) - Strategic SWOT Analysis Review', '@Comm
> Corporation (ATCM)%Strategic SWOT Analysis Review'),
> ('010017 Telecom GmbH - Strategic SWOT Analysis Review', '010017 Telecom
> GmbH%Strategic SWOT Analysis Review'),
> ('1 & 1 Internet AG - Strategic SWOT Analysis Review', '1%1 Internet
> AG%Strategic SWOT Analysis Review'),
> ('1,618 STRICT AB (STRI B) - Financial and Strategic SWOT Analysis Review',
> '1%618 STRICT AB (STRI B)%Financial%Strategic SWOT Analysis Review'),
> ('1199SEIU Benefit and Pension Funds - Strategic SWOT Analysis Review',
> '1199SEIU Benefit%Pension Funds%Strategic SWOT Analysis Review'),
> ('1300 Smiles Limited (ONT) - Financial and Strategic SWOT Analysis Review',
> '1300 Smiles Limited (ONT)%Financial%Strategic SWOT Analysis Review'),
> ('141 Capital Inc. - Strategic SWOT Analysis Review', '141 Capital
> Inc%Strategic SWOT Analysis Review'),
> ('1855 SA (AL185) - Financial and Strategic SWOT Analysis Review', '1855 SA
> (AL185)%Financial%Strategic SWOT Analysis Review'),
> ('1pm plc (OPM) - Financial and Strategic SWOT Analysis Review', '1pm plc
> (OPM)%Financial%Strategic SWOT Analysis Review'),
> ('1st Century Bancshares, Inc. (FCTY) - Financial and Strategic SWOT Analysis
> Review', '1st Century Bancshares%Inc%(FCTY)%Financial%Strategic SWOT Analysis
> Review'),
> ('1st NRG Corp. (FNRC) - Financial and Strategic SWOT Analysis Review', '1st
> NRG Corp%(FNRC)%Financial%Strategic SWOT Analysis Review'),
> ('1st RED AG (SXL) - Financial and Strategic SWOT Analysis Review', '1st RED
> AG (SXL)%Financial%Strategic SWOT Analysis Review'),
> ('20 Microns Limited (533022) - Financial and Strategic SWOT Analysis Review',
> '20 Microns Limited (533022)%Financial%Strategic SWOT Analysis Review'),
> ('21 Holdings Limited (1003) - Financial and Strategic SWOT Analysis Review',
> '21 Holdings Limited (1003)%Financial%Strategic SWOT Analysis Review'),
> ('21LADY Co., Ltd. (3346) - Financial and Strategic SWOT Analysis Review',
> '21LADY Co%Ltd%(3346)%Financial%Strategic SWOT Analysis Review'),
>
> ==
>
> ... and so on for about 15000 rows.
>
> Before I go through changing this to 15000 separate inserts, I'd like to know
> how to "fix" the problem the right way. I looked at all of the system
> variables, and none of them seemed to apply. I probably missed something
> obvious.

This is most likely due to your query exceeding the max packet size.
Break it up into smaller pieces or increase max_allowed_packet.

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



ERROR 2006 (HY000)

2011-04-29 Thread Jerry Schwartz
I'm getting

ERROR 2006 (HY000) at line 10: MySQL server has gone away

while trying to do an insert through the MySQL CLI. Our database is so small 
that I haven't gotten around to doing any tuning, so this came out of the 
blue. I'm not quite sure where to start.

Here's what things look like at my end:

=

select version();
+--+
| version()|
+--+
| 5.1.36-community |
+--+


DROP TEMPORARY TABLE IF EXISTS new_titles;

CREATE TEMPORARY TABLE new_titles (
new_title VARCHAR(255), INDEX (new_title),
new_title_like VARCHAR(255), INDEX (new_title_like)
) ENGINE MYISAM;

INSERT INTO new_titles
VALUES

('(I.Z) Queenco Ltd. (QNCO) - Financial and Strategic SWOT Analysis Review', 
'(I%Z) Queenco Ltd%(QNCO)%Financial%Strategic SWOT Analysis Review'),
('@Comm Corporation (ATCM) - Strategic SWOT Analysis Review', '@Comm 
Corporation (ATCM)%Strategic SWOT Analysis Review'),
('010017 Telecom GmbH - Strategic SWOT Analysis Review', '010017 Telecom 
GmbH%Strategic SWOT Analysis Review'),
('1 & 1 Internet AG - Strategic SWOT Analysis Review', '1%1 Internet 
AG%Strategic SWOT Analysis Review'),
('1,618 STRICT AB (STRI B) - Financial and Strategic SWOT Analysis Review', 
'1%618 STRICT AB (STRI B)%Financial%Strategic SWOT Analysis Review'),
('1199SEIU Benefit and Pension Funds - Strategic SWOT Analysis Review', 
'1199SEIU Benefit%Pension Funds%Strategic SWOT Analysis Review'),
('1300 Smiles Limited (ONT) - Financial and Strategic SWOT Analysis Review', 
'1300 Smiles Limited (ONT)%Financial%Strategic SWOT Analysis Review'),
('141 Capital Inc. - Strategic SWOT Analysis Review', '141 Capital 
Inc%Strategic SWOT Analysis Review'),
('1855 SA (AL185) - Financial and Strategic SWOT Analysis Review', '1855 SA 
(AL185)%Financial%Strategic SWOT Analysis Review'),
('1pm plc (OPM) - Financial and Strategic SWOT Analysis Review', '1pm plc 
(OPM)%Financial%Strategic SWOT Analysis Review'),
('1st Century Bancshares, Inc. (FCTY) - Financial and Strategic SWOT Analysis 
Review', '1st Century Bancshares%Inc%(FCTY)%Financial%Strategic SWOT Analysis 
Review'),
('1st NRG Corp. (FNRC) - Financial and Strategic SWOT Analysis Review', '1st 
NRG Corp%(FNRC)%Financial%Strategic SWOT Analysis Review'),
('1st RED AG (SXL) - Financial and Strategic SWOT Analysis Review', '1st RED 
AG (SXL)%Financial%Strategic SWOT Analysis Review'),
('20 Microns Limited (533022) - Financial and Strategic SWOT Analysis Review', 
'20 Microns Limited (533022)%Financial%Strategic SWOT Analysis Review'),
('21 Holdings Limited (1003) - Financial and Strategic SWOT Analysis Review', 
'21 Holdings Limited (1003)%Financial%Strategic SWOT Analysis Review'),
('21LADY Co., Ltd. (3346) - Financial and Strategic SWOT Analysis Review', 
'21LADY Co%Ltd%(3346)%Financial%Strategic SWOT Analysis Review'),

==

... and so on for about 15000 rows.

Before I go through changing this to 15000 separate inserts, I'd like to know 
how to "fix" the problem the right way. I looked at all of the system 
variables, and none of them seemed to apply. I probably missed something 
obvious.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com





-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



RE: Join based upon LIKE

2011-04-29 Thread Jerry Schwartz
>-Original Message-
>From: Jim McNeely [mailto:j...@newcenturydata.com]
>Sent: Thursday, April 28, 2011 6:43 PM
>To: Jerry Schwartz
>Subject: Re: Join based upon LIKE
>
>It just smells wrong, a nicer system would have you joining on ID's of some
>kind so that spelling wouldn't matter. I don't know the full situation for 
>you
>though.
>
[JS] That would be nice, wouldn't it.

In a nutshell, we sell publications. Publishers send us lists of publications. 
Some are new, some replace previous editions. (Think of books, almanacs, and 
newsletters.) Some publishers make do without any product IDs at all, but most 
do use product IDs of some kind.

The problem is that the March edition of a publication might or might not have 
the same product ID as the February edition. I try to match them both by 
product ID and by title. Sometimes the title will "fuzzy match", but the ID 
won't; sometimes the ID will match but the title won't; sometimes (if I'm 
really lucky) they both match; and sometimes the ID matches one product and 
the title matches another.

It's the fuzzy match by title that gives me fits:

- The title might have a date in it ("Rain in Spain in 2010 Q2"), but not 
necessarily in a uniform way ("Rain in Spain Q3 2010").
- The title might have differences in wording or punctuation ("Rain in Spain - 
2010Q2").
- The title might have simple misspellings ("Rain in Spian - Q2 2010").

I've written code that looks for troublesome constructs and replaces them with 
"%": " in ", "-", " to ", "Q2", "2Q", and more and more. So "Rain in Spain - 
2010 Q2" becomes "Rain%Spain%".

I shove those modified titles into a table and do a JOIN ON `prod_title` LIKE 
`wild_title`.

This will miss actual misspellings ("Spain", "Spian"). It will also produce a 
large number of false positives.

On the back end, I have other code that compares the new titles against the 
titles retrieved by that query and decides if they are exact matches, 
approximate matches (here I do use regular expressions, as well as lists of 
known "bad boys"), or false positives. From there on, it's all hand work.

Pretty big nut, eh?

So that's why I need to use LIKE in my JOIN.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com




>Jim McNeely
>
>On Apr 28, 2011, at 12:28 PM, Jerry Schwartz wrote:
>
>> No takers?
>>
>> -Original Message-
>> From: Jerry Schwartz [mailto:je...@gii.co.jp]
>> Sent: Monday, April 25, 2011 2:34 PM
>> To: 'Mailing-List mysql'
>> Subject: Join based upon LIKE
>>
>> I have to match lists of new publications against our database, so that I 
>> can
>> replace the existing publications in our catalog. For example,
>>
>> "The UK Market for Puppies in February 2011"
>>
>> would be a replacement for
>>
>> "The UK Market for Puppies in December 2010"
>>
>> Unfortunately, the publishers aren't particularly careful with their 
>> titles.
>> One might even say they are perverse. I am likely to get
>>
>> "UK Market: Puppies - Feb 2011"
>>
>> as replacement for
>>
>> "The UK Market for Puppies in December 2010"
>>
>> You can see that a straight match by title is not going to work.
>>
>> Here's what I've been doing:
>>
>> =
>>
>> SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD');
>>
>> CREATE TEMPORARY TABLE new_titles (
>>  new_title VARCHAR(255), INDEX (new_title),
>>  new_title_like VARCHAR(255), INDEX (new_title_like)
>>  );
>>
>> INSERT INTO new_titles
>> VALUES
>>
>> ('Alternative Energy Monthly Deal Analysis - M&A and Investment Trends, 
>> April
>> 2011', 'Alternative Energy Monthly Deal Analysis%M&A%Investment Trends%'),
>> ('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company
>> Share, Price Trends, Capacity Forecasts of All Active and Planned Plants',
>> 'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price
>> Trends%Capacity Forecasts of All Active%Planned Plants'),
>> ...
>> ('Underground Gas Storage Industry Outlook in North America, 2011 - Details
>of
>> All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas
>Storage
>> Industry Outlook%North America%Details of All Operating%Planned Gas Storage
>> Sites to%'),
>> ('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017',
>> 'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%');
>>
>> SELECT prod.prod_title AS `Title IN Database`,
>>new_titles.new_title AS `Title IN Feed`,
>>prod.prod_num AS `ID`
>> FROM new_titles JOIN prod ON prod.prod_title LIKE 
>> (new_titles.new_title_like)
>>  AND prod.pub_id = @PUBID AND prod.prod_discont = 0
>> ORDER BY new_titles.new_title;
>> ==
>>
>> (I've written code that substitutes "%" for certain strings that I specify,
>> and there is some trial and error involved.)
>>
>> Here's how MySQL handles that SELECT:
>>
>> *** 1. row *

RE: Join based upon LIKE

2011-04-29 Thread Jerry Schwartz
>-Original Message-
>From: Johan De Meersman [mailto:vegiv...@tuxera.be]
>Sent: Friday, April 29, 2011 5:56 AM
>To: Jerry Schwartz
>Cc: mysql mailing list
>Subject: Re: Join based upon LIKE
>
>
>- Original Message -
>> From: "Jerry Schwartz" 
>>
>> [JS] This isn't the only place I have to deal with fuzzy data. :-(
>> Discretion prohibits further comment.
>
>Heh. What you *really* need, is a LART. Preferably one of the spiked variety.
>
[JS] Unless a LART is a demon of some kind, I don't know what it is.

>> A full-text index would work if I were only looking for one title at
>> a time, but I don't know if that would be a good idea if I have a list of
>> 1 titles. That would pretty much require either 1 separate queries
>> or a very, very long WHERE clause.
>
>Yes, unfortunately. You should see if you can introduce a form of data
>normalisation - say, shadow fields with corrected entries, or functionality 
>in
>the application that suggests correct entries based on what the user typed.
>
[JS] Except for obvious misspellings and non-ASCII characters, I do not have 
the freedom to muck with the text. If the data were created in-house, I could 
correct it on the way in; but it comes from myriad other companies.

>Or, if the money's there, you could have a look at Amazon Mechanical Turk 
>(yes,
>really) for cheap-ish data correction.
>
[JS] Again, I can't change the data. The titles are assigned by the 
publishers. Think what would happen if Amazon decided to "fix" the titles of 
books. "Ain't Misbehavin" would, at best, turn into "I am not misbehaving".

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: je...@gii.co.jp
Web site: www.the-infoshop.com



>--
>Bier met grenadyn
>Is als mosterd by den wyn
>Sy die't drinkt, is eene kwezel
>Hy die't drinkt, is ras een ezel




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org



Re: Join based upon LIKE

2011-04-29 Thread Johan De Meersman

- Original Message -
> From: "Jerry Schwartz" 
> 
> [JS] This isn't the only place I have to deal with fuzzy data. :-(
> Discretion prohibits further comment.

Heh. What you *really* need, is a LART. Preferably one of the spiked variety.

> A full-text index would work if I were only looking for one title at
> a time, but I don't know if that would be a good idea if I have a list of
> 1 titles. That would pretty much require either 1 separate queries
> or a very, very long WHERE clause.

Yes, unfortunately. You should see if you can introduce a form of data 
normalisation - say, shadow fields with corrected entries, or functionality in 
the application that suggests correct entries based on what the user typed.

Or, if the money's there, you could have a look at Amazon Mechanical Turk (yes, 
really) for cheap-ish data correction.

-- 
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org