Re: [GENERAL] [to_tsvector] German Compound Words

Sven R. Kunze Mon, 01 Jun 2015 00:24:54 -0700

I actually wanted to minimize the installation effort. Thus, I used the hunspell-de-de package of Debian/Ubuntu.


Give me a second for ispell.

Below, see the hunspell variant for Produktionsintervall/Produktionintervall:


=# select * from ts_debug('public.german_compound', 'Produktionsintervall');
   alias   |   description   |        token         |         dictionaries          | dictionary  |        lexemes
-----------+-----------------+----------------------+-------------------------------+-------------+------------------------
 asciiword | Word, all ASCII | Produktionsintervall | {german_hunspell,german_stem} | german_stem | {produktionsintervall}
(1 row)

=# select * from ts_debug('public.german_compound', 'Produktionintervall');
   alias   |   description   |        token        |         dictionaries          | dictionary  |        lexemes
-----------+-----------------+---------------------+-------------------------------+-------------+-----------------------
 asciiword | Word, all ASCII | Produktionintervall | {german_hunspell,german_stem} | german_stem | {produktionintervall}
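
In both results the dictionary column shows german_stem, which means german_hunspell did not recognize the token and passed it on. One way to confirm this is to query the hunspell dictionary on its own with ts_lexize. This is only a minimal sketch, assuming the dictionary is named german_hunspell as defined earlier in this thread; ts_lexize returns NULL for a word the dictionary does not know.

```sql
-- Ask the hunspell-based dictionary directly, bypassing the configuration.
-- NULL means "unknown word"; the next dictionary in the mapping
-- (german_stem here) would then handle the token.
SELECT ts_lexize('german_hunspell', 'Produktionsintervall');
SELECT ts_lexize('german_hunspell', 'Produktionintervall');

-- For comparison, a word that ts_debug attributed to german_hunspell.
SELECT ts_lexize('german_hunspell', 'schifffahrt');
```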




PS: I am posting your answer to the list as well.

On 28.05.2015 19:42, Oleg Bartunov wrote:

For readability it's better to use

select * from ts_debug

I remember there is a problem with correct support of hunspell files. Did you try ispell files?

Also, I found this message: http://www.postgresql.org/message-id/[email protected]

Try this word - Produktionintervall

On Thu, May 28, 2015 at 6:34 PM, Sven R. Kunze <[email protected]<mailto:[email protected]>> wrote:


    Sure. Here you are:

    =# select ts_debug('public.german_compound', 'wasserkraft');
    ts_debug
    -----------------------------------------------------------------------------------------------------
     (asciiword,"Word, all ASCII",wasserkraft,"{german_hunspell,german_stem}",german_stem,{wasserkraft})

    =# select ts_debug('public.german_compound', 'schifffahrt');
    ts_debug
    ---------------------------------------------------------------------------------------------------------
     (asciiword,"Word, all ASCII",schifffahrt,"{german_hunspell,german_stem}",german_hunspell,{schifffahrt})

    =# select ts_debug('public.german_compound', 'blindflansch');
    ts_debug
    -------------------------------------------------------------------------------------------------------
     (asciiword,"Word, all ASCII",blindflansch,"{german_hunspell,german_stem}",german_stem,{blindflansch})

    That is my testing configuration:

    =# \dF+ german_compound
    Text search configuration "public.german_compound"
    Parser: "pg_catalog.default"
          Token      |        Dictionaries
    -----------------+-----------------------------
     asciihword      | german_hunspell,german_stem
     asciiword       | german_hunspell,german_stem
     email           | simple
     file            | simple
     float           | simple
     host            | simple
     hword           | german_hunspell,german_stem
     hword_asciipart | german_hunspell,german_stem
     hword_numpart   | simple
     hword_part      | german_hunspell,german_stem
     int             | simple
     numhword        | simple
     numword         | simple
     sfloat          | simple
     uint            | simple
     url             | simple
     url_path        | simple
     version         | simple
     word            | german_hunspell,german_stem


    On 28.05.2015 17:24, Oleg Bartunov wrote:

    ts_debug() ?

    =# select * from ts_debug('english', 'messages');
       alias   |   description   |  token   |  dictionaries  |  dictionary  | lexemes
    -----------+-----------------+----------+----------------+--------------+----------
     asciiword | Word, all ASCII | messages | {english_stem} | english_stem | {messag}


    On Thu, May 28, 2015 at 2:05 PM, Sven R. Kunze
    <[email protected] <mailto:[email protected]>> wrote:

        Hi everybody,

        what do I need to do in order to enable compound word
        handling in PostgreSQL tsvector implementation?

        I run an Ubuntu 14.04 machine, PostgreSQL 9.3, have installed
        package hunspell-de-de and already created a new dictionary
        as described here:
        http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY

        CREATE TEXT SEARCH DICTIONARY german_hunspell (
            TEMPLATE = ispell,
            DictFile = de_de,
            AffFile = de_de,
            StopWords = german
        );

        Furthermore, created a new test text search configuration
        (copied from german) and updated all parser parts where the
        german_stem dictionary is used so that it uses
        german_hunspell first and then german_stem.
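
        The setup described above can be sketched roughly as follows.
        This is an illustration, not the exact statements that were
        run: the names public.german_compound and german_hunspell and
        the list of token types are taken from the \dF+ output quoted
        in this thread.

        ```sql
        -- Start from a copy of the built-in German configuration.
        CREATE TEXT SEARCH CONFIGURATION public.german_compound (
            COPY = pg_catalog.german
        );

        -- For every token type that used german_stem, try german_hunspell
        -- first and fall back to german_stem for unrecognized words.
        ALTER TEXT SEARCH CONFIGURATION public.german_compound
            ALTER MAPPING FOR asciihword, asciiword, hword,
                              hword_asciipart, hword_part, word
            WITH german_hunspell, german_stem;
        ```

        The resulting mapping can be inspected in psql with
        \dF+ german_compound.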

        However, to_tsvector still does not split compound words
        such as:

        wasserkraft -> wasserkraft, kraft
        schifffahrt -> schifffahrt, fahrt
        blindflansch -> blindflansch, flansch

        etc.
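
        A minimal way to compare this expectation with what the
        configuration actually produces is to call to_tsvector
        directly. The sketch assumes the configuration name
        public.german_compound used in this thread; no particular
        result is implied.

        ```sql
        -- If compound splitting worked, the parts (kraft, fahrt, flansch)
        -- would appear as lexemes next to the full words.
        SELECT to_tsvector('public.german_compound',
                           'wasserkraft schifffahrt blindflansch');
        ```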


        What have I done wrong here?

        --
        Sven R. Kunze

        TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
        Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
        e-mail: [email protected] <mailto:[email protected]>
        web: www.tbz-pariv.de <http://www.tbz-pariv.de>

        Geschäftsführer: Dr. Reiner Wohlgemuth
        Sitz der Gesellschaft: Chemnitz
        Registergericht: Chemnitz HRB 8543


--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: [email protected]
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543
