Re: [GENERAL] Fuzzy string matching of product names
George Silva wrote: The above is true. For geocoding the same idea is used: the metaphone function is used against street names, and searched to a simples column, filled with the results of the metaphone function. It works quite well. I would think an expression index would be better than a separate column. --- George On Mon, Apr 5, 2010 at 4:23 PM, Brian Modra br...@zwartberg.com wrote: On 05/04/2010, Peter Geoghegan peter.geoghega...@gmail.com wrote: Hello, At the moment, users of my application, which runs on 8.4.3, may search for products in a way that is implemented roughly like this: SELECT * FROM products WHERE description ILIKE '%%usr_string%%'; This works reasonably well. However, I thought it would be a nice touch to give my users leeway to spell product names incorrectly when searching, or to not have to remember if a product is entered as coca cola, CocaCola or Coca-cola. At the moment, they don't have to worry about case sensitivity because I use ILIKE - I'd like to preserve that. I'd also like to not have it weigh against them heavily when they don't search for a specific product, but just a common substring. For example, if they search for coca-cola, there may be a number of different coca-cola products: CocaCola 330ml can, Coca-Cola 2 litre bottle, but no actual plain cocacola. That ought to not matter too much - all cocacola products should be returned. This isn't important enough for me to be willing to add a big dependency to my application. I'd really prefer to limit myself to the contrib modules. pg_trgm and fuzzystrmatch look very promising, but it's not obvious how I can use either to achieve what I want. Postgres's built-in regex support may have a role to play too. I can live with it not being indexable, because typically there are only tens of thousands of products in a production system. Could someone suggest an approach that is reasonably simple and reasonably generic ? What I do is to create another column that has a simplified version of the string in it. (I created a function to simplify strings, and when the source column is changed or inserted, I also update the simplified column. Then when searching, I use the same function to simplify the search string and use = to test against the simplified column. E.g. if the table has a column called name that you want to search, you create a name_simplified column, and fill it as so: update your_table set name_simplified=yourSimplifyFunction(name); Then to search: select * from your_table where simplified_name = yourSimplifyFunction('Coca-Cola'); This is really fast, because the match is using the index rather than a sequential scan. Thanks, Peter Geoghegan -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general -- Brian Modra Land line: +27 23 5411 462 Mobile: +27 79 69 77 082 5 Jan Louw Str, Prince Albert, 6930 Postal: P.O. Box 2, Prince Albert 6930 South Africa http://www.zwartberg.com/ -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general -- George R. C. Silva Desenvolvimento em GIS http://blog.geoprocessamento.net -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Fuzzy string matching of product names
Leif Biberg Kristensen l...@solumslekt.org writes: On Monday 5. April 2010 22.00.41 Peter Geoghegan wrote: similar they sound. How can that actually be applied to get the functionality that I've described? I've got a similar problem in my 18th century research, when clerks usually took pride in being able to spell a name in any number of ways. I've landed on a solution where I'm sending search strings to SIMILAR TO. I usually get far too many hits, but it's much easier to browse through 100 hits than the entire dataset which is approaching 60,000 records. In both your cases I'd play with trigram search. The idea is dead simple and it performs really well. It's the poor man's Full Text Search, but for catalog look ups it's exactly what you want I suppose: http://www.postgresql.org/docs/8.3/static/pgtrgm.html Regards, -- dim -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Fuzzy string matching of product names
I've got a similar problem in my 18th century research, when clerks usually took pride in being able to spell a name in any number of ways. I've landed on a solution where I'm sending search strings to SIMILAR TO. I usually get far too many hits, but it's much easier to browse through 100 hits than the entire dataset which is approaching 60,000 records. Optimizing the search strings is based upon a lot of experience. That sounds like an interesting problem - mine sounds mundane in comparison. I now seem to be getting reasonable results with pg_trgm coupled with ILIKE. I ORDER BY description ILIKE '%%usr_str%%' DESC, similarity(description, 'usr_str') DESC , prioritising products that actually contain the user specified string. I have tweaked the is greater than similarity of my queries, to include a bit of rubbish to be on the safe side, but not too much. I've done this with what I believe to be a representative dataset. You can create a GiST or GIN index on text fields for these queries, which is nice. Have you tried using pg_trgm with your dataset? You might have more success. It isn't biased towards a particular natural language. Also, I suggest you avoid SQL regular expressions (which are supported with SIMILAR TO), and use the ~ operator instead, which gives you the more powerful POSIX regular expressions, unless portability is a major concern. Regards, Peter Geoghegan -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Fuzzy string matching of product names
On 4/5/2010 3:00 PM, Peter Geoghegan wrote: http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/ Fuzzystrmatch is generally used to compare two single words for how similar they sound. How can that actually be applied to get the functionality that I've described? Regards, Peter Geoghegan You could use regexp_split_to_table to split words, then soundx each one, and find matching... assuming you have a table like: andy=# select * from tmpx; id | cname +- 1 | andy corp 2 | cola corp 3 | pepsi cola corp 4 | bob inc (4 rows) (I dont have the soundx stuff installed, so I'll uppercase instead) andy=# select id from (select id, upper(regexp_split_to_table(cname, E'\\s+')) as sndx from tmpx) as foo where sndx = 'CORP'; id 1 2 3 (3 rows) and actually, you probably want select distinct id from... -Andy -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
[GENERAL] Fuzzy string matching of product names
Hello, At the moment, users of my application, which runs on 8.4.3, may search for products in a way that is implemented roughly like this: SELECT * FROM products WHERE description ILIKE '%%usr_string%%'; This works reasonably well. However, I thought it would be a nice touch to give my users leeway to spell product names incorrectly when searching, or to not have to remember if a product is entered as coca cola, CocaCola or Coca-cola. At the moment, they don't have to worry about case sensitivity because I use ILIKE - I'd like to preserve that. I'd also like to not have it weigh against them heavily when they don't search for a specific product, but just a common substring. For example, if they search for coca-cola, there may be a number of different coca-cola products: CocaCola 330ml can, Coca-Cola 2 litre bottle, but no actual plain cocacola. That ought to not matter too much - all cocacola products should be returned. This isn't important enough for me to be willing to add a big dependency to my application. I'd really prefer to limit myself to the contrib modules. pg_trgm and fuzzystrmatch look very promising, but it's not obvious how I can use either to achieve what I want. Postgres's built-in regex support may have a role to play too. I can live with it not being indexable, because typically there are only tens of thousands of products in a production system. Could someone suggest an approach that is reasonably simple and reasonably generic ? Thanks, Peter Geoghegan -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Fuzzy string matching of product names
In response to Peter Geoghegan peter.geoghega...@gmail.com: Hello, At the moment, users of my application, which runs on 8.4.3, may search for products in a way that is implemented roughly like this: SELECT * FROM products WHERE description ILIKE '%%usr_string%%'; This works reasonably well. However, I thought it would be a nice touch to give my users leeway to spell product names incorrectly when searching, or to not have to remember if a product is entered as coca cola, CocaCola or Coca-cola. At the moment, they don't have to worry about case sensitivity because I use ILIKE - I'd like to preserve that. I'd also like to not have it weigh against them heavily when they don't search for a specific product, but just a common substring. For example, if they search for coca-cola, there may be a number of different coca-cola products: CocaCola 330ml can, Coca-Cola 2 litre bottle, but no actual plain cocacola. That ought to not matter too much - all cocacola products should be returned. This isn't important enough for me to be willing to add a big dependency to my application. I'd really prefer to limit myself to the contrib modules. pg_trgm and fuzzystrmatch look very promising, but it's not obvious how I can use either to achieve what I want. Postgres's built-in regex support may have a role to play too. I can live with it not being indexable, because typically there are only tens of thousands of products in a production system. Could someone suggest an approach that is reasonably simple and reasonably generic ? http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/ -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Fuzzy string matching of product names
On 05/04/2010, Peter Geoghegan peter.geoghega...@gmail.com wrote: Hello, At the moment, users of my application, which runs on 8.4.3, may search for products in a way that is implemented roughly like this: SELECT * FROM products WHERE description ILIKE '%%usr_string%%'; This works reasonably well. However, I thought it would be a nice touch to give my users leeway to spell product names incorrectly when searching, or to not have to remember if a product is entered as coca cola, CocaCola or Coca-cola. At the moment, they don't have to worry about case sensitivity because I use ILIKE - I'd like to preserve that. I'd also like to not have it weigh against them heavily when they don't search for a specific product, but just a common substring. For example, if they search for coca-cola, there may be a number of different coca-cola products: CocaCola 330ml can, Coca-Cola 2 litre bottle, but no actual plain cocacola. That ought to not matter too much - all cocacola products should be returned. This isn't important enough for me to be willing to add a big dependency to my application. I'd really prefer to limit myself to the contrib modules. pg_trgm and fuzzystrmatch look very promising, but it's not obvious how I can use either to achieve what I want. Postgres's built-in regex support may have a role to play too. I can live with it not being indexable, because typically there are only tens of thousands of products in a production system. Could someone suggest an approach that is reasonably simple and reasonably generic ? What I do is to create another column that has a simplified version of the string in it. (I created a function to simplify strings, and when the source column is changed or inserted, I also update the simplified column. Then when searching, I use the same function to simplify the search string and use = to test against the simplified column. E.g. if the table has a column called name that you want to search, you create a name_simplified column, and fill it as so: update your_table set name_simplified=yourSimplifyFunction(name); Then to search: select * from your_table where simplified_name = yourSimplifyFunction('Coca-Cola'); This is really fast, because the match is using the index rather than a sequential scan. Thanks, Peter Geoghegan -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general -- Brian Modra Land line: +27 23 5411 462 Mobile: +27 79 69 77 082 5 Jan Louw Str, Prince Albert, 6930 Postal: P.O. Box 2, Prince Albert 6930 South Africa http://www.zwartberg.com/ -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Fuzzy string matching of product names
The above is true. For geocoding the same idea is used: the metaphone function is used against street names, and searched to a simples column, filled with the results of the metaphone function. It works quite well. George On Mon, Apr 5, 2010 at 4:23 PM, Brian Modra br...@zwartberg.com wrote: On 05/04/2010, Peter Geoghegan peter.geoghega...@gmail.com wrote: Hello, At the moment, users of my application, which runs on 8.4.3, may search for products in a way that is implemented roughly like this: SELECT * FROM products WHERE description ILIKE '%%usr_string%%'; This works reasonably well. However, I thought it would be a nice touch to give my users leeway to spell product names incorrectly when searching, or to not have to remember if a product is entered as coca cola, CocaCola or Coca-cola. At the moment, they don't have to worry about case sensitivity because I use ILIKE - I'd like to preserve that. I'd also like to not have it weigh against them heavily when they don't search for a specific product, but just a common substring. For example, if they search for coca-cola, there may be a number of different coca-cola products: CocaCola 330ml can, Coca-Cola 2 litre bottle, but no actual plain cocacola. That ought to not matter too much - all cocacola products should be returned. This isn't important enough for me to be willing to add a big dependency to my application. I'd really prefer to limit myself to the contrib modules. pg_trgm and fuzzystrmatch look very promising, but it's not obvious how I can use either to achieve what I want. Postgres's built-in regex support may have a role to play too. I can live with it not being indexable, because typically there are only tens of thousands of products in a production system. Could someone suggest an approach that is reasonably simple and reasonably generic ? What I do is to create another column that has a simplified version of the string in it. (I created a function to simplify strings, and when the source column is changed or inserted, I also update the simplified column. Then when searching, I use the same function to simplify the search string and use = to test against the simplified column. E.g. if the table has a column called name that you want to search, you create a name_simplified column, and fill it as so: update your_table set name_simplified=yourSimplifyFunction(name); Then to search: select * from your_table where simplified_name = yourSimplifyFunction('Coca-Cola'); This is really fast, because the match is using the index rather than a sequential scan. Thanks, Peter Geoghegan -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general -- Brian Modra Land line: +27 23 5411 462 Mobile: +27 79 69 77 082 5 Jan Louw Str, Prince Albert, 6930 Postal: P.O. Box 2, Prince Albert 6930 South Africa http://www.zwartberg.com/ -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general -- George R. C. Silva Desenvolvimento em GIS http://blog.geoprocessamento.net
Re: [GENERAL] Fuzzy string matching of product names
http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/ Fuzzystrmatch is generally used to compare two single words for how similar they sound. How can that actually be applied to get the functionality that I've described? Regards, Peter Geoghegan -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Fuzzy string matching of product names
In response to Peter Geoghegan peter.geoghega...@gmail.com: http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/ Fuzzystrmatch is generally used to compare two single words for how similar they sound. How can that actually be applied to get the functionality that I've described? Well, it really depends on your particular situation and what you want to support. You could break the name down into individual words and generate metaphones, then use like to match on metaphone: 'The Candlestick Corporation, Limited' - 'TE CDSK CPRN LMTD' Searching for candlestick - WHERE metaphone column like '%CDSK%' Or you could create an array column that has all the metaphones in it and use an ANY() or ALL() match to find ones that apply. Exactly how you implement depends on how far you want to go. Do you want to support OR matches, AND matches, or both? Can the words be out of order? You could also use Levenshtein as a percentage function to find matches, even on long strings with multiple words. Since Levenshtein gives you the number of alterations between two strings, using that as a percentage of the total string length gives you a pretty good gauge of how close they are overall, and would allow you to set a threshold, or possibly even list results by relevance. -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/ -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Fuzzy string matching of product names
On Monday 5. April 2010 22.00.41 Peter Geoghegan wrote: similar they sound. How can that actually be applied to get the functionality that I've described? I've got a similar problem in my 18th century research, when clerks usually took pride in being able to spell a name in any number of ways. I've landed on a solution where I'm sending search strings to SIMILAR TO. I usually get far too many hits, but it's much easier to browse through 100 hits than the entire dataset which is approaching 60,000 records. Optimizing the search strings is based upon a lot of experience. It would probably be better to add a column with normalized names, but the amount of work involved with that is staggering. I eventually associate most of the records to «persons» with normalized names, but the search process can sometimes be very frustrating, and it would really help with some kind of fuzzy search. Just in case anyone should suggest it: Both Soundex and Metaphone are useless for Norwegian 18th century names. regards, -- Leif Biberg Kristensen http://solumslekt.org/ -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general