Re: [GENERAL] Fuzzy string matching of product names

2010-04-13 Thread Bruce Momjian
George Silva wrote:
 The above is true. For geocoding the same idea is used: the metaphone
 function is used against street names, and searched to a simples column,
 filled with the results of the metaphone function. It works quite well.

I would think an expression index would be better than a separate
column.

---


 
 George
 
 On Mon, Apr 5, 2010 at 4:23 PM, Brian Modra br...@zwartberg.com wrote:
 
  On 05/04/2010, Peter Geoghegan peter.geoghega...@gmail.com wrote:
   Hello,
  
   At the moment, users of my application, which runs on 8.4.3, may
   search for products in a way that is implemented roughly like this:
  
   SELECT * FROM products WHERE description ILIKE '%%usr_string%%';
  
   This works reasonably well. However, I thought it would be a nice
   touch to give my users leeway to spell product names incorrectly when
   searching, or to not have to remember if a product is entered as coca
   cola, CocaCola or Coca-cola. At the moment, they don't have to
   worry about case sensitivity because I use ILIKE - I'd like to
   preserve that. I'd also like to not have it weigh against them heavily
   when they don't search for a specific product, but just a common
   substring. For example, if they search for coca-cola, there may be a
   number of different coca-cola products: CocaCola 330ml can,
   Coca-Cola 2 litre bottle, but no actual plain cocacola. That ought
   to not matter too much - all cocacola products should be returned.
  
   This isn't important enough for me to be willing to add a big
   dependency to my application. I'd really prefer to limit myself to the
   contrib modules. pg_trgm and fuzzystrmatch look very promising, but
   it's not obvious how I can use either to achieve what I want.
   Postgres's built-in regex support may have a role to play too.
  
   I can live with it not being indexable, because typically there are
   only tens of thousands of products in a production system.
  
   Could someone suggest an approach that is reasonably simple and
   reasonably generic ?
 
  What I do is to create another column that has a simplified version of
  the string in it.
  (I created a function to simplify strings, and when the source column
  is changed or inserted, I also update the simplified column.
  Then when searching, I use the same function to simplify the search
  string and use = to test against the simplified column.
 
  E.g.
  if the table has a column called name that you want to search, you
  create a name_simplified column, and fill it as so:
  update your_table set name_simplified=yourSimplifyFunction(name);
 
  Then to search:
  select * from your_table where simplified_name =
  yourSimplifyFunction('Coca-Cola');
 
  This is really fast, because the match is using the index rather than
  a sequential scan.
 
  
   Thanks,
   Peter Geoghegan
  
   --
   Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
   To make changes to your subscription:
   http://www.postgresql.org/mailpref/pgsql-general
  
 
 
  --
  Brian Modra   Land line: +27 23 5411 462
  Mobile: +27 79 69 77 082
  5 Jan Louw Str, Prince Albert, 6930
  Postal: P.O. Box 2, Prince Albert 6930
  South Africa
  http://www.zwartberg.com/
 
  --
  Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-general
 
 
 
 
 -- 
 George R. C. Silva
 
 Desenvolvimento em GIS
 http://blog.geoprocessamento.net

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Fuzzy string matching of product names

2010-04-09 Thread Dimitri Fontaine
Leif Biberg Kristensen l...@solumslekt.org writes:
 On Monday 5. April 2010 22.00.41 Peter Geoghegan wrote:
 similar they sound. How can that actually be applied to get the
 functionality that I've described?

 I've got a similar problem in my 18th century research, when clerks usually 
 took pride in being able to spell a name in any number of ways. I've landed 
 on 
 a solution where I'm sending search strings to SIMILAR TO. I usually get far 
 too many hits, but it's much easier to browse through 100 hits than the 
 entire 
 dataset which is approaching 60,000 records.

In both your cases I'd play with trigram search. The idea is dead simple
and it performs really well. It's the poor man's Full Text Search, but
for catalog look ups it's exactly what you want I suppose:

  http://www.postgresql.org/docs/8.3/static/pgtrgm.html

Regards,
-- 
dim

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Fuzzy string matching of product names

2010-04-06 Thread Peter Geoghegan
 I've got a similar problem in my 18th century research, when clerks usually
 took pride in being able to spell a name in any number of ways. I've landed on
 a solution where I'm sending search strings to SIMILAR TO. I usually get far
 too many hits, but it's much easier to browse through 100 hits than the entire
 dataset which is approaching 60,000 records.

 Optimizing the search strings is based upon a lot of experience.

That sounds like an interesting problem - mine sounds mundane in comparison.

I now seem to be getting reasonable results with pg_trgm coupled with
ILIKE. I ORDER BY description ILIKE '%%usr_str%%' DESC,
similarity(description, 'usr_str') DESC , prioritising products that
actually contain the user specified string. I have tweaked the is
greater than similarity of my queries, to include a bit of rubbish to
be on the safe side, but not too much. I've done this with what I
believe to be a representative dataset. You can create a GiST or GIN
index on text fields for these queries, which is nice.

Have you tried using pg_trgm with your dataset? You might have more
success. It isn't biased towards a particular natural language. Also,
I suggest you avoid SQL regular expressions (which are supported with
SIMILAR TO), and use the ~ operator instead, which gives you the more
powerful POSIX regular expressions, unless portability is a major
concern.

Regards,
Peter Geoghegan

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Fuzzy string matching of product names

2010-04-06 Thread Andy Colson

On 4/5/2010 3:00 PM, Peter Geoghegan wrote:

http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/


Fuzzystrmatch is generally used to compare two single words for how
similar they sound. How can that actually be applied to get the
functionality that I've described?

Regards,
Peter Geoghegan




You could use regexp_split_to_table to split words, then soundx each 
one, and find matching...  assuming you have a table like:


andy=# select * from tmpx;
 id |  cname
+-
  1 | andy corp
  2 | cola corp
  3 | pepsi cola corp
  4 | bob inc
(4 rows)


(I dont have the soundx stuff installed, so I'll uppercase instead)

andy=# select id from  (select id, upper(regexp_split_to_table(cname, 
E'\\s+')) as sndx from  tmpx) as foo where sndx = 'CORP';

 id

  1
  2
  3
(3 rows)


and actually, you probably want select distinct id from...

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Fuzzy string matching of product names

2010-04-05 Thread Peter Geoghegan
Hello,

At the moment, users of my application, which runs on 8.4.3, may
search for products in a way that is implemented roughly like this:

SELECT * FROM products WHERE description ILIKE '%%usr_string%%';

This works reasonably well. However, I thought it would be a nice
touch to give my users leeway to spell product names incorrectly when
searching, or to not have to remember if a product is entered as coca
cola, CocaCola or Coca-cola. At the moment, they don't have to
worry about case sensitivity because I use ILIKE - I'd like to
preserve that. I'd also like to not have it weigh against them heavily
when they don't search for a specific product, but just a common
substring. For example, if they search for coca-cola, there may be a
number of different coca-cola products: CocaCola 330ml can,
Coca-Cola 2 litre bottle, but no actual plain cocacola. That ought
to not matter too much - all cocacola products should be returned.

This isn't important enough for me to be willing to add a big
dependency to my application. I'd really prefer to limit myself to the
contrib modules. pg_trgm and fuzzystrmatch look very promising, but
it's not obvious how I can use either to achieve what I want.
Postgres's built-in regex support may have a role to play too.

I can live with it not being indexable, because typically there are
only tens of thousands of products in a production system.

Could someone suggest an approach that is reasonably simple and
reasonably generic ?

Thanks,
Peter Geoghegan

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Fuzzy string matching of product names

2010-04-05 Thread Bill Moran
In response to Peter Geoghegan peter.geoghega...@gmail.com:

 Hello,
 
 At the moment, users of my application, which runs on 8.4.3, may
 search for products in a way that is implemented roughly like this:
 
 SELECT * FROM products WHERE description ILIKE '%%usr_string%%';
 
 This works reasonably well. However, I thought it would be a nice
 touch to give my users leeway to spell product names incorrectly when
 searching, or to not have to remember if a product is entered as coca
 cola, CocaCola or Coca-cola. At the moment, they don't have to
 worry about case sensitivity because I use ILIKE - I'd like to
 preserve that. I'd also like to not have it weigh against them heavily
 when they don't search for a specific product, but just a common
 substring. For example, if they search for coca-cola, there may be a
 number of different coca-cola products: CocaCola 330ml can,
 Coca-Cola 2 litre bottle, but no actual plain cocacola. That ought
 to not matter too much - all cocacola products should be returned.
 
 This isn't important enough for me to be willing to add a big
 dependency to my application. I'd really prefer to limit myself to the
 contrib modules. pg_trgm and fuzzystrmatch look very promising, but
 it's not obvious how I can use either to achieve what I want.
 Postgres's built-in regex support may have a role to play too.
 
 I can live with it not being indexable, because typically there are
 only tens of thousands of products in a production system.
 
 Could someone suggest an approach that is reasonably simple and
 reasonably generic ?

http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html

-- 
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Fuzzy string matching of product names

2010-04-05 Thread Brian Modra
On 05/04/2010, Peter Geoghegan peter.geoghega...@gmail.com wrote:
 Hello,

 At the moment, users of my application, which runs on 8.4.3, may
 search for products in a way that is implemented roughly like this:

 SELECT * FROM products WHERE description ILIKE '%%usr_string%%';

 This works reasonably well. However, I thought it would be a nice
 touch to give my users leeway to spell product names incorrectly when
 searching, or to not have to remember if a product is entered as coca
 cola, CocaCola or Coca-cola. At the moment, they don't have to
 worry about case sensitivity because I use ILIKE - I'd like to
 preserve that. I'd also like to not have it weigh against them heavily
 when they don't search for a specific product, but just a common
 substring. For example, if they search for coca-cola, there may be a
 number of different coca-cola products: CocaCola 330ml can,
 Coca-Cola 2 litre bottle, but no actual plain cocacola. That ought
 to not matter too much - all cocacola products should be returned.

 This isn't important enough for me to be willing to add a big
 dependency to my application. I'd really prefer to limit myself to the
 contrib modules. pg_trgm and fuzzystrmatch look very promising, but
 it's not obvious how I can use either to achieve what I want.
 Postgres's built-in regex support may have a role to play too.

 I can live with it not being indexable, because typically there are
 only tens of thousands of products in a production system.

 Could someone suggest an approach that is reasonably simple and
 reasonably generic ?

What I do is to create another column that has a simplified version of
the string in it.
(I created a function to simplify strings, and when the source column
is changed or inserted, I also update the simplified column.
Then when searching, I use the same function to simplify the search
string and use = to test against the simplified column.

E.g.
if the table has a column called name that you want to search, you
create a name_simplified column, and fill it as so:
update your_table set name_simplified=yourSimplifyFunction(name);

Then to search:
select * from your_table where simplified_name =
yourSimplifyFunction('Coca-Cola');

This is really fast, because the match is using the index rather than
a sequential scan.


 Thanks,
 Peter Geoghegan

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general



-- 
Brian Modra   Land line: +27 23 5411 462
Mobile: +27 79 69 77 082
5 Jan Louw Str, Prince Albert, 6930
Postal: P.O. Box 2, Prince Albert 6930
South Africa
http://www.zwartberg.com/

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Fuzzy string matching of product names

2010-04-05 Thread George Silva
The above is true. For geocoding the same idea is used: the metaphone
function is used against street names, and searched to a simples column,
filled with the results of the metaphone function. It works quite well.

George

On Mon, Apr 5, 2010 at 4:23 PM, Brian Modra br...@zwartberg.com wrote:

 On 05/04/2010, Peter Geoghegan peter.geoghega...@gmail.com wrote:
  Hello,
 
  At the moment, users of my application, which runs on 8.4.3, may
  search for products in a way that is implemented roughly like this:
 
  SELECT * FROM products WHERE description ILIKE '%%usr_string%%';
 
  This works reasonably well. However, I thought it would be a nice
  touch to give my users leeway to spell product names incorrectly when
  searching, or to not have to remember if a product is entered as coca
  cola, CocaCola or Coca-cola. At the moment, they don't have to
  worry about case sensitivity because I use ILIKE - I'd like to
  preserve that. I'd also like to not have it weigh against them heavily
  when they don't search for a specific product, but just a common
  substring. For example, if they search for coca-cola, there may be a
  number of different coca-cola products: CocaCola 330ml can,
  Coca-Cola 2 litre bottle, but no actual plain cocacola. That ought
  to not matter too much - all cocacola products should be returned.
 
  This isn't important enough for me to be willing to add a big
  dependency to my application. I'd really prefer to limit myself to the
  contrib modules. pg_trgm and fuzzystrmatch look very promising, but
  it's not obvious how I can use either to achieve what I want.
  Postgres's built-in regex support may have a role to play too.
 
  I can live with it not being indexable, because typically there are
  only tens of thousands of products in a production system.
 
  Could someone suggest an approach that is reasonably simple and
  reasonably generic ?

 What I do is to create another column that has a simplified version of
 the string in it.
 (I created a function to simplify strings, and when the source column
 is changed or inserted, I also update the simplified column.
 Then when searching, I use the same function to simplify the search
 string and use = to test against the simplified column.

 E.g.
 if the table has a column called name that you want to search, you
 create a name_simplified column, and fill it as so:
 update your_table set name_simplified=yourSimplifyFunction(name);

 Then to search:
 select * from your_table where simplified_name =
 yourSimplifyFunction('Coca-Cola');

 This is really fast, because the match is using the index rather than
 a sequential scan.

 
  Thanks,
  Peter Geoghegan
 
  --
  Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-general
 


 --
 Brian Modra   Land line: +27 23 5411 462
 Mobile: +27 79 69 77 082
 5 Jan Louw Str, Prince Albert, 6930
 Postal: P.O. Box 2, Prince Albert 6930
 South Africa
 http://www.zwartberg.com/

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general




-- 
George R. C. Silva

Desenvolvimento em GIS
http://blog.geoprocessamento.net


Re: [GENERAL] Fuzzy string matching of product names

2010-04-05 Thread Peter Geoghegan
 http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html

 --
 Bill Moran
 http://www.potentialtech.com
 http://people.collaborativefusion.com/~wmoran/

Fuzzystrmatch is generally used to compare two single words for how
similar they sound. How can that actually be applied to get the
functionality that I've described?

Regards,
Peter Geoghegan

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Fuzzy string matching of product names

2010-04-05 Thread Bill Moran
In response to Peter Geoghegan peter.geoghega...@gmail.com:

  http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html
 
  --
  Bill Moran
  http://www.potentialtech.com
  http://people.collaborativefusion.com/~wmoran/
 
 Fuzzystrmatch is generally used to compare two single words for how
 similar they sound. How can that actually be applied to get the
 functionality that I've described?

Well, it really depends on your particular situation and what you
want to support.  You could break the name down into individual
words and generate metaphones, then use like to match on metaphone:

'The Candlestick Corporation, Limited' - 'TE CDSK CPRN LMTD'

Searching for candlestick - WHERE metaphone column like '%CDSK%'

Or you could create an array column that has all the metaphones in
it and use an ANY() or ALL() match to find ones that apply.

Exactly how you implement depends on how far you want to go.  Do you
want to support OR matches, AND matches, or both?  Can the words be
out of order?

You could also use Levenshtein as a percentage function to find matches,
even on long strings with multiple words.  Since Levenshtein gives you
the number of alterations between two strings, using that as a percentage
of the total string length gives you a pretty good gauge of how close
they are overall, and would allow you to set a threshold, or possibly even
list results by relevance.

-- 
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Fuzzy string matching of product names

2010-04-05 Thread Leif Biberg Kristensen
On Monday 5. April 2010 22.00.41 Peter Geoghegan wrote:
 similar they sound. How can that actually be applied to get the
 functionality that I've described?

I've got a similar problem in my 18th century research, when clerks usually 
took pride in being able to spell a name in any number of ways. I've landed on 
a solution where I'm sending search strings to SIMILAR TO. I usually get far 
too many hits, but it's much easier to browse through 100 hits than the entire 
dataset which is approaching 60,000 records.

Optimizing the search strings is based upon a lot of experience.

It would probably be better to add a column with normalized names, but the 
amount of work involved with that is staggering. I eventually associate most 
of the records to «persons» with normalized names, but the search process can 
sometimes be very frustrating, and it would really help with some kind of 
fuzzy search.

Just in case anyone should suggest it: Both Soundex and Metaphone are useless 
for Norwegian 18th century names.

regards,
-- 
Leif Biberg Kristensen
http://solumslekt.org/

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general