Re: [HACKERS] Latin vs non-Latin words in text search parsing

Gregory Stark Mon, 22 Oct 2007 03:35:47 -0700

"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:

> Alvaro Herrera wrote:
>> Tom Lane wrote:
>> 
>>> ISTM that perhaps a more generally useful definition would be
>>>
>>> lword               Only ASCII letters
>>> nlword              Entirely letters per iswalpha(), but not lword
>>> word                Entirely alphanumeric per iswalnum(), but not nlword
>>>             (hence, includes at least one digit)
>> ...
>> I am not sure if there are any western european languages where words can
>> only be formed with non-ascii chars.
>
> There are some, at least in Swedish: "ö" (island) and "å" (river). They're
> both a bit special because they're just one letter each.


For what it's worth, I did the same search last night and found three French
words, including "çà" -- which admittedly is likely to be a noise word. Other
dictionaries, such as Italian and Irish, also have one-letter words like this.
The only other language with multi-letter words is actually Faroese, with "íð"
and "óð".
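
For illustration, a search of this kind could be sketched roughly as below.
This is not the script that was actually run; the function name
non_ascii_only_words and the sample list are hypothetical, and Python's
Unicode-aware str.isalpha() is only a stand-in for a locale's iswalpha().

```python
import unicodedata

def non_ascii_only_words(words):
    """Return words made up entirely of letters, none of which is ASCII."""
    found = []
    for w in words:
        # Compose characters first so "o" + combining diaeresis becomes "ö".
        w = unicodedata.normalize("NFC", w)
        if w and w.isalpha() and not any(c.isascii() for c in w):
            found.append(w)
    return found

# Hypothetical sample; a real run would read a dictionary word list instead.
sample = ["ö", "å", "çà", "íð", "óð", "île", "river", "année"]
print(non_ascii_only_words(sample))  # ['ö', 'å', 'çà', 'íð', 'óð']
```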

> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?

I'm not very familiar with the use case here. Is there a good reason to want
to abbreviate these names? I think I would expect "ascii", "word", and "token"
for the three categories Tom describes.
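
To make the three categories concrete, here is a minimal sketch of the
classification Tom describes. It is not the parser's code: classify_token is
a hypothetical name, and str.isalpha()/str.isalnum() are used as approximate
stand-ins for iswalpha()/iswalnum(), whose results depend on the C locale.

```python
def classify_token(tok):
    """Classify a token using the class names from Tom's proposal."""
    if not tok:
        return None
    if tok.isascii() and tok.isalpha():
        return "lword"    # only ASCII letters
    if tok.isalpha():
        return "nlword"   # entirely letters, at least one of them non-ASCII
    if tok.isalnum():
        return "word"     # entirely alphanumeric, so at least one non-letter
    return None           # anything else is outside these three classes

for tok in ["river", "ö", "île", "abc123", "a-b"]:
    print(tok, classify_token(tok))
```

Under this definition a mixed word such as "île" lands in nlword, because the
test is "not lword" rather than "contains no ASCII letters".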

> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.

I also wonder about languages like Arabic and Hindi, which do have words, but
I'm not sure they use whitespace as simply as Latin-script languages do.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
