Re: [HACKERS] Latin vs non-Latin words in text search parsing

Alvaro Herrera Sun, 21 Oct 2007 15:04:02 -0700

Tom Lane wrote:

> ISTM that perhaps a more generally useful definition would be
> 
> lword         Only ASCII letters
> nlword        Entirely letters per iswalpha(), but not lword
> word          Entirely alphanumeric per iswalnum(), but not nlword
>               (hence, includes at least one digit)
> 
> However, I am no linguist and maybe I'm missing something.


I tend to agree with the need to redefine the categories.  I am not sure
I agree with this particular definition though.  I would think that a
"latin word" should include ASCII letters and accented letters, and a
non-latin word would be one that included only non-ASCII chars.

alvherre=# select * from ts_debug('spanish', 'añadido añadió añadidura');
 Alias |  Description  |   Token   |  Dictionaries  |      Lexized token       
-------+---------------+-----------+----------------+--------------------------
 word  | Word          | añadido   | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             | 
 word  | Word          | añadió    | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             | 
 word  | Word          | añadidura | {spanish_stem} | spanish_stem: {añadidur}
(5 rows)

I would think those would all fit in the "latin word" category.  This
example is more interesting because it shows a word categorized
differently just because the plural loses the accent:

alvherre=# select * from ts_debug('spanish', 'caracteres carácter');
 Alias |  Description  |   Token    |  Dictionaries  |      Lexized token       
-------+---------------+------------+----------------+--------------------------
 lword | Latin word    | caracteres | {spanish_stem} | spanish_stem: {caracter}
 blank | Space symbols |            | {}             | 
 word  | Word          | carácter   | {spanish_stem} | spanish_stem: {caract}
(3 rows)

I am not sure if there are any Western European languages where a word
can consist solely of non-ASCII chars.  At least in Spanish, accents tend
to be rare.  However, I would think this is also wrong:

alvherre=# select * from ts_debug('french', 'à');
 Alias  |  Description   | Token | Dictionaries  |  Lexized token  
--------+----------------+-------+---------------+-----------------
 nlword | Non-latin word | à     | {french_stem} | french_stem: {}
(1 row)

I don't think this is much of a problem, this particular word being
(most likely) a stopword.

So, how about

lword           Entirely letters per iswalpha(), with at least one ASCII letter
nlword          Entirely letters per iswalpha(), but not lword
                (hence, no ASCII letters)
word            Entirely alphanumeric per iswalnum(), but not lword or nlword
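
To make the difference between the two sets of rules concrete, here is a
minimal Python sketch of both.  It is only an illustration, not the parser's
actual code: Python's str.isalpha() and str.isalnum() stand in for iswalpha()
and iswalnum() (they are Unicode-based rather than locale-dependent), and the
function names classify_quoted and classify_proposed are made up for this
example.

```python
def classify_quoted(token: str) -> str:
    """Categories as quoted above: lword means ASCII letters only."""
    if token.isalpha():
        # All letters: lword only if every character is ASCII.
        return "lword" if token.isascii() else "nlword"
    if token.isalnum():
        # Alphanumeric but not all letters, so at least one digit.
        return "word"
    return "other"  # not a word-like token under these rules


def classify_proposed(token: str) -> str:
    """Proposed categories: lword needs at least one ASCII letter."""
    if token.isalpha():
        # Every character is a letter here, so "is ASCII" means "ASCII letter".
        has_ascii_letter = any(ch.isascii() for ch in token)
        return "lword" if has_ascii_letter else "nlword"
    if token.isalnum():
        return "word"
    return "other"


if __name__ == "__main__":
    # Tokens taken from the ts_debug examples, plus one with a digit.
    for tok in ["caracteres", "carácter", "añadido", "à", "abc123"]:
        print(tok, classify_quoted(tok), classify_proposed(tok))
```

Under these assumptions, 'carácter' and 'añadido' move from nlword to lword
with the proposed rules, while 'à' stays nlword under both.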

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
