Hello,

Thanks for your efforts; I still can't get it to work.
I have now tried the Norwegian example. My encoding is ISO-8859 (I never used UTF-8 because I thought it would be slower; the thread name is a bit misleading).

So I am using a LATIN9 (ISO-8859-15) database:

  ~/cvs/ssd% psql -l

     Name    |  Owner   | Encoding
  -----------+----------+----------
   postgres  | postgres | LATIN9
   tstest    | aljoscha | LATIN9

and a Norwegian, ISO-8859-encoded dictionary and aff-file:

  ~% file tsearch/dict/ispell_no/norwegian.dict
  tsearch/dict/ispell_no/norwegian.dict: ISO-8859 C program text
  ~% file tsearch/dict/ispell_no/norwegian.aff
  tsearch/dict/ispell_no/norwegian.aff: ISO-8859 English text

the aff-file contains the lines:

  compoundwords controlled z
  ...
  #            to compounds only:
  flag ~\\:
     [^S]    > S

and the dictionary contains:

  overtrekk/BCW\z

  (meaning: the word can be a compound part, and an intermediary "s" is allowed)
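
As a basic sanity check (query sketch only, output omitted), the base word can be lexized on its own to confirm the dictionary loads it at all:

  tstest=# SELECT tsearch2.lexize('ispell_no','overtrekk');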

My configuration is:

  tstest=# SELECT * FROM tsearch2.pg_ts_cfg;
    ts_name  | prs_name |   locale
  -----------+----------+------------
   simple    | default  | [EMAIL PROTECTED]
   german    | default  | [EMAIL PROTECTED]
   norwegian | default  | [EMAIL PROTECTED]
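
For reference, ispell_no was registered through pg_ts_dict roughly like this (the file paths below are placeholders for my local ones):

  tstest=# INSERT INTO tsearch2.pg_ts_dict
             SELECT 'ispell_no',
                    dict_init,
                    'DictFile="/path/to/norwegian.dict", AffFile="/path/to/norwegian.aff"',
                    dict_lexize,
                    'Norwegian ISpell dictionary'
               FROM tsearch2.pg_ts_dict
              WHERE dict_name = 'ispell_template';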


Now the test:

  tstest=# SELECT tsearch2.lexize('ispell_no','overtrekksgrill');
   lexize
  --------

  (1 row)

BUT:

  tstest=# SELECT tsearch2.lexize('ispell_no','overtrekkgrill');
                 lexize
  ------------------------------------
   {over,trekk,grill,overtrekk,grill}
  (1 row)


So the form with the intermediary "s" simply doesn't work, and no UTF-8 is involved.
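
For completeness, the encodings can be verified directly in the session:

  tstest=# SHOW server_encoding;
  tstest=# SHOW client_encoding;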

Sincerely yours,

Alexander Presber

P.S.: Henning: Sorry for bothering you with the CC; just ignore it if you like.


On 27.01.2006, at 18:17, Teodor Sigaev wrote:

contrib_regression=# insert into pg_ts_dict values (
        'norwegian_ispell',
        (select dict_init from pg_ts_dict where dict_name='ispell_template'),
        'DictFile="/usr/local/share/ispell/norsk.dict",'
        'AffFile="/usr/local/share/ispell/norsk.aff"',
        (select dict_lexize from pg_ts_dict where dict_name='ispell_template'),
        'Norwegian ISpell dictionary'
);
INSERT 16681 1
contrib_regression=# select lexize('norwegian_ispell','politimester');
                  lexize
------------------------------------------
 {politimester,politi,mester,politi,mest}
(1 row)

contrib_regression=# select lexize ('norwegian_ispell','sjokoladefabrikk');
                lexize
--------------------------------------
 {sjokoladefabrikk,sjokolade,fabrikk}
(1 row)

contrib_regression=# select lexize ('norwegian_ispell','overtrekksgrilldresser');
         lexize
-------------------------
 {overtrekk,grill,dress}
(1 row)
% psql -l
           List of databases
        Name        | Owner  | Encoding
--------------------+--------+----------
 contrib_regression | teodor | KOI8
 postgres           | pgsql  | KOI8
 template0          | pgsql  | KOI8
 template1          | pgsql  | KOI8
(4 rows)


I'm afraid that is a UTF-8 problem. We have just committed multibyte support for tsearch2 to CVS HEAD, so you can try it.

Please notice: the dict, aff, and stopword files should be in the server encoding. Snowball sources for German (and other languages) in UTF-8 can be found in http://snowball.tartarus.org/dist/libstemmer_c.tgz
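
For example, converting Latin-1 source files to UTF-8 could look like this (the file names are only an illustration):

% iconv -f ISO-8859-1 -t UTF-8 norsk.dict > norsk.utf8.dict
% iconv -f ISO-8859-1 -t UTF-8 norsk.aff > norsk.utf8.aff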

To all: Maybe we should put all of Snowball's stemmers (for all available languages and encodings) into the tsearch2 directory?

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]   WWW: http://www.sigaev.ru/

