[GENERAL] Help with tokenization of age-ranges in full text search

2015-02-25 Thread Mason Hale
Hello, I've got a 9.3 database hosted at Heroku.

I'm full text search to search for group names in part of my application,
and some of my group names are the names of youth sports age groups like
Boys 9-10 or Girls 11-12.

I would like for a search for the terms Boys, Boys 9-10,  9, 10 or
9-10 to match Boys 9-10.

I have created a custom dictionary and search configuration as follows:

CREATE TEXT SEARCH DICTIONARY public.simple_nostem_no_stop (
TEMPLATE = pg_catalog.simple
);

   CREATE TEXT SEARCH CONFIGURATION public.user_search ( COPY =
pg_catalog.simple );

ALTER TEXT SEARCH CONFIGURATION public.user_search
   ALTER MAPPING FOR email, asciiword, asciihword, hword_asciipart,
hword, hword_part, word WITH simple_nostem_no_stop;

Which results in this configuration:

development=# \dF+ public.user_search
Text search
configuration public.user_search

Parser: pg_catalog.default

  Token  | Dictionaries

-+---

 asciihword  | simple_nostem_no_stop

 asciiword   | simple_nostem_no_stop

 blank   | simple

 email   | simple_nostem_no_stop

 file| simple

 float   | simple

 host| simple

 hword   | simple_nostem_no_stop

 hword_asciipart | simple_nostem_no_stop

 hword_numpart   | simple_nostem_no_stop

 hword_part  | simple_nostem_no_stop

 int | simple

 numhword| simple_nostem_no_stop

 numword | simple_nostem_no_stop

 sfloat  | simple

 uint| simple

 url | simple

 url_path| simple

 version | simple

 word| simple_nostem_no_stop



Testing my query, I get the following tokenization:

development=# select alias, token from ts_debug('public.user_search', 'Boys
9-10');

   alias   | token

---+---

 asciiword | Boys

 blank |

 uint  | 9

 int   | -10

(4 rows)


I was expecting / hoping to seek tokens for 9-10 and 10.

With the above a search for 9-10 or 10 would not match the term Boys
9-10.

I was expecting the hnumword or hword_numpart token_types to match, but
they appear to require a leading alpha character to match.

If I add a leading alpha character, it tokenizes the way I would like:

development=# select alias, token from ts_debug('public.user_search', 'Boys
x9-y10');

 alias | token

---+

 asciiword | Boys

 blank |

 numhword  | x9-y10

 hword_numpart | x9

 blank | -

 hword_numpart | y10

(6 rows)



So my question is -- can I get the tokenization that I want out of a
configuration of the stock available token types?

Since I'm hosting my db on Heroku, I assume a custom parser extension is
not a possibility.

I've tried remove the uint and int mappings, hoping then that the other
parsers will pick up the slack, but that just results in the terms being
dropped altogether.

For example, if I run:

 ALTER TEXT SEARCH CONFIGURATION public.user_search
DROP MAPPING IF EXISTS FOR file, float, host, int, uint, sfloat,
url_path, version, blank, url;

Then my configuration now looks like this:

development=# \dF+ public.user_search

Text search configuration public.user_search

Parser: pg_catalog.default

  Token  | Dictionaries

-+---

 asciihword  | simple_nostem_no_stop

 asciiword   | simple_nostem_no_stop

 email   | simple_nostem_no_stop

 hword   | simple_nostem_no_stop

 hword_asciipart | simple_nostem_no_stop

 hword_numpart   | simple_nostem_no_stop

 hword_part  | simple_nostem_no_stop

 numhword| simple_nostem_no_stop

 numword | simple_nostem_no_stop

 word| simple_nostem_no_stop

But now the tokens are just dropped altogether:

development=# select alias, token, lexemes from
ts_debug('public.user_search', 'Boys 9-10');

   alias   | token | lexemes

---+---+-

 asciiword | Boys  | {boys}

 blank |   |

 uint  | 9 |

 int   | -10   |

(4 rows)

Thanks in advance for any advice.

Would love to find a simple solution.

Mason


Re: [GENERAL] Help with tokenization of age-ranges in full text search

2015-02-25 Thread Alvaro Herrera
Mason Hale wrote:
 Hello, I've got a 9.3 database hosted at Heroku.
 
 I'm full text search to search for group names in part of my application,
 and some of my group names are the names of youth sports age groups like
 Boys 9-10 or Girls 11-12.
 
 I would like for a search for the terms Boys, Boys 9-10,  9, 10 or
 9-10 to match Boys 9-10.

Hm, so if there's a sport for Boys 8-10, what will you do when it
doesn't match a query for 9?  Does this matter?  I mean, maybe
tokenization is not the most appropriate thing to do in this case.

 So my question is -- can I get the tokenization that I want out of a
 configuration of the stock available token types?

The tokenizer stuff is not the most configurable part of the FTS stuff,
sadly.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general