Tom Lane wrote:

> ISTM that perhaps a more generally useful definition would be
> lword   Only ASCII letters
> nlword  Entirely letters per iswalpha(), but not lword
> word    Entirely alphanumeric per iswalnum(), but not nlword
>         (hence, includes at least one digit)
> However, I am no linguist and maybe I'm missing something.

I tend to agree with the need to redefine the categories.  I am not sure
I agree with this particular definition though.  I would think that a
"latin word" should include ASCII letters and accented letters, and a
non-latin word would be one that included only non-ASCII chars.

alvherre=# select * from ts_debug('spanish', 'añadido añadió añadidura');
 Alias |  Description  |   Token   |  Dictionaries  |      Lexized token       
 word  | Word          | añadido   | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             | 
 word  | Word          | añadió    | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             | 
 word  | Word          | añadidura | {spanish_stem} | spanish_stem: {añadidur}
(5 rows)

I would think those would all fit in the "latin word" category.  This
example is more interesting because it shows a word categorized
differently just because the plural loses the accent:

alvherre=# select * from ts_debug('spanish', 'caracteres carácter');
 Alias |  Description  |   Token    |  Dictionaries  |      Lexized token       
 lword | Latin word    | caracteres | {spanish_stem} | spanish_stem: {caracter}
 blank | Space symbols |            | {}             | 
 word  | Word          | carácter   | {spanish_stem} | spanish_stem: {caract}
(3 rows)

I am not sure if there are any Western European languages where words can
only be formed with non-ASCII chars.  At least in Spanish, accents tend
to be rare.  However, I would think this is also wrong:

alvherre=# select * from ts_debug('french', 'à');
 Alias  |  Description   | Token | Dictionaries  |  Lexized token  
 nlword | Non-latin word | à     | {french_stem} | french_stem: {}
(1 row)

I don't think this is much of a problem, this particular word being
(most likely) a stopword.

So, how about

lword           Entirely letters per iswalpha, with at least one ASCII
nlword          Entirely letters per iswalpha
word            Entirely alphanumeric per iswalnum, but not nlword
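To make the difference between the two definitions concrete, here is a
rough sketch in Python.  It is only an illustration, not parser code: the
function names classify_quoted and classify_proposed are made up for this
example, and str.isalpha()/str.isalnum() merely stand in for
iswalpha()/iswalnum(), whose real behaviour depends on the locale.  It
also assumes the categories are tested in the order listed, so a token
that qualifies as lword is never reported as nlword or word.

```python
def is_ascii(ch: str) -> bool:
    """True if the character is in the 7-bit ASCII range."""
    return ord(ch) < 128


def classify_quoted(token: str) -> str:
    """Definition quoted at the top: lword means only ASCII letters."""
    if token.isalpha():
        # All letters: lword only if every letter is ASCII.
        return "lword" if all(is_ascii(ch) for ch in token) else "nlword"
    if token.isalnum():
        # Alphanumeric but not all letters, so it has at least one digit.
        return "word"
    return "other"


def classify_proposed(token: str) -> str:
    """Definition proposed here: lword needs at least one ASCII letter."""
    if token.isalpha():
        # All letters: lword as soon as one letter is ASCII.
        return "lword" if any(is_ascii(ch) for ch in token) else "nlword"
    if token.isalnum():
        return "word"
    return "other"


if __name__ == "__main__":
    # Tokens taken from the ts_debug examples above, plus one with a digit.
    for tok in ["caracteres", "carácter", "añadido", "à", "abc123"]:
        print(tok, classify_quoted(tok), classify_proposed(tok))
```

With precomposed characters, both versions put "caracteres" in lword and
"à" in nlword; they differ on "carácter" and "añadido", which are nlword
under the quoted definition and lword under the proposed one.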

Alvaro Herrera                      
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
