Re: [GENERAL] FTS with more than one language in body and with unknown query language?

2016-07-27 Thread Artur Zakirov

On 15.07.2016 21:34, Stefan Keller wrote:

I actually expect that stemming takes place for english and german.
And we will in fact have queries in english and in german as well.
So I think we still have some issues to resolve...?


I performed the following things:
- patch for PostgreSQL:
  https://github.com/select-artur/postgres/tree/join_tsconfig
  It adds new option for FTS dictionary mapping (JOIN). I want to 
propose this patch to -hackers.

- dict_translate dictionary based on dict_xsyn contrib:
  https://github.com/select-artur/dict_translate

This things are made for multilingual purpose and are interesting for 
us. Maybe they will be helpful for you too.


Example:

1 - Create files:

$SHAREDIR/tsearch_data/geo_en.trn:

forest wald forst holz

$SHAREDIR/tsearch_data/geo_de.trn:

wald forest wood
forst forest wood
holz forest wood

2 - Execute queries:

=# CREATE TEXT SEARCH DICTIONARY geo_en (
Template = translate,
DictFile = geo_en,
InputDict = pg_catalog.english_stem);
=# CREATE TEXT SEARCH DICTIONARY geo_de (
Template = translate,
DictFile = geo_de,
InputDict = pg_catalog.german_stem);
=# CREATE TEXT SEARCH CONFIGURATION geo(COPY='simple');
=# ALTER TEXT SEARCH CONFIGURATION geo_ths
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH geo_en (JOIN), english_stem (JOIN), geo_de (JOIN), german_stem 
(JOIN);


=# CREATE TABLE geo (id int, body_en text, body_de text);
=# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald');

3 - Sample queries:

=# SELECT * FROM geo WHERE to_tsvector('geo', body_en) @@ 
to_tsquery('geo', 'forests');

 id | body_en | body_de
+-+-
  1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo', body_de) @@ 
to_tsquery('geo', 'forests');

 id | body_en | body_de
+-+-
  2 | (null)  | wald
  3 | (null)  | forst
(2 rows)

=# SELECT * FROM geo WHERE to_tsvector('geo', body_en) @@ 
to_tsquery('geo', 'walde');

 id | body_en | body_de
+-+-
  1 | forest  | (null)
(1 row)

I will be glad for your comments.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS with more than one language in body and with unknown query language?

2016-07-15 Thread Stefan Keller
Artur

2016-07-15 11:02 GMT+02:00 Artur Zakirov :
...
> This solution does not take into account the fact that queries in german can
> be in various forms (for example, plural forms). It works well for english.
> As a solution of this, you can put into geo.ths various forms of word.
>
> But maybe the solution above is what you need. Otherwise it can serve as a
> concept of solution.

Many thanks for this helpful input!
I actually expect that stemming takes place for english and german.
And we will in fact have queries in english and in german as well.
So I think we still have some issues to resolve...?

:Stefan


2016-07-15 11:02 GMT+02:00 Artur Zakirov :
> Hello, Stefan!
>
> On 15.07.2016 01:54, Stefan Keller wrote:
>>
>> приве́т! Artur
>>
>> Thanks for your explanations.
>>
>> 2016-07-14 17:20 GMT+02:00 Artur Zakirov :
>>>
>>> On 14.07.2016 01:16, Stefan Keller wrote:
>>
>> ...

 * Should I create a synonym dictionary which contains word
 translations en-de instead of synonyms en-en?
>>>
>>>
>>> This synonym dictionary will contain a thousands entries. So it will
>>> require
>>> a great effort to make this dictionary.
>>
>>
>> It's a domain-specific corpus of max. 1000 records of descriptive text
>> (metadata) about geographic data, like topographic map, land use
>> planning, etc.
>
>
> Oh, I understand. If you need word stemming you can use a thesaurus
> dictionary. So you should do the following.
>
> 1 - Create the file $SHAREDIR/tsearch_data/geo.ths. Here is my example of
> entry:
>
> wald : forest
>
> 2 - PostgreSQL already has english_stem dictionary. In this solution you do
> not need german_stem dictionary. Create the dictionary and the configuration
> in PostgreSQL:
>
> =# CREATE TEXT SEARCH DICTIONARY geo_ths (
> Template = thesaurus,
> DictFile = geo,
> Dictionary = pg_catalog.english_stem);
> =# CREATE TEXT SEARCH CONFIGURATION geo_ths(COPY='simple');
> =# ALTER TEXT SEARCH CONFIGURATION geo_ths
> ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> word, hword, hword_part
> WITH geo_ths, english_stem;
>
> For example, we have the following data:
>
> =# CREATE TABLE geo (id int, body_en text, body_de text);
> =# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald');
>
> Then you can execute the following queries:
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@
> to_tsquery('geo_ths', 'forest');
>  id | body_en | body_de
> +-+-
>   1 | forest  | (null)
> (1 row)
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@
> to_tsquery('geo_ths', 'forest');
>  id | body_en | body_de
> +-+-
>   2 | (null)  | wald
> (1 row)
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@
> to_tsquery('geo_ths', 'wald');
>  id | body_en | body_de
> +-+-
>   1 | forest  | (null)
> (1 row)
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@
> to_tsquery('geo_ths', 'wald');
>  id | body_en | body_de
> +-+-
>   2 | (null)  | wald
> (1 row)
>
> This solution does not take into account the fact that queries in german can
> be in various forms (for example, plural forms). It works well for english.
> As a solution of this, you can put into geo.ths various forms of word.
>
> But maybe the solution above is what you need. Otherwise it can serves as a
> concept of solution.
>
>
>>
>> ...

 * How to setup a text search configuration which e.g. stems en and de
 words?
>>
>>
>> I still would like to give FTS a try with synonym dictionary (en-de).
>> Now, I'm wondering how to setup the configuration. I've seen examples
>> to process either english, german or russian alone. But I did not find
>> yet any documentation on how to setup the text search configuration
>> where a corpus contains two (or more) languages at same time in a
>> table (body_en and body_de).
>>
>> :Stefan
>>
>>
>>
>>
>>
>>
>>
>> 2016-07-14 17:20 GMT+02:00 Artur Zakirov :
>>>
>>> Hi,
>>>
>>> On 14.07.2016 01:16, Stefan Keller wrote:


 Hi,

 I have a text corpus which contains either German or English docs and
 I expect queries where I don't know if it's German or English. So I'd
 like e.g. that a query "forest" matches "forest" in body_en but also
 "Wald" in body_de.

 I created a table with attributes body_en and body_de (type "text"). I
 will use ts_vector/ts_query on the fly (don't need yet an index
 (attributes)).

 * Can FTS handle this multilingual situation?
>>>
>>>
>>>
>>> In my opinion, PostgreSQL cant handle it. It cant translate words from
>>> one
>>> language to another, it just stems word from original form to basic form.
>>> First you need to translate word from English to German, then search word
>>> in
>>> the body_de attribute.
>>>
>>> And the issue is complicated by the fact that one word could have
>>> different
>>> meaning in the other language.
>>>
 * How to setup a text search conf

Re: [GENERAL] FTS with more than one language in body and with unknown query language?

2016-07-15 Thread Artur Zakirov

Hello, Stefan!

On 15.07.2016 01:54, Stefan Keller wrote:

приве́т! Artur

Thanks for your explanations.

2016-07-14 17:20 GMT+02:00 Artur Zakirov :

On 14.07.2016 01:16, Stefan Keller wrote:

...

* Should I create a synonym dictionary which contains word
translations en-de instead of synonyms en-en?


This synonym dictionary will contain a thousands entries. So it will require
a great effort to make this dictionary.


It's a domain-specific corpus of max. 1000 records of descriptive text
(metadata) about geographic data, like topographic map, land use
planning, etc.


Oh, I understand. If you need word stemming you can use a thesaurus 
dictionary. So you should do the following.


1 - Create the file $SHAREDIR/tsearch_data/geo.ths. Here is my example 
of entry:


wald : forest

2 - PostgreSQL already has english_stem dictionary. In this solution you 
do not need german_stem dictionary. Create the dictionary and the 
configuration in PostgreSQL:


=# CREATE TEXT SEARCH DICTIONARY geo_ths (
Template = thesaurus,
DictFile = geo,
Dictionary = pg_catalog.english_stem);
=# CREATE TEXT SEARCH CONFIGURATION geo_ths(COPY='simple');
=# ALTER TEXT SEARCH CONFIGURATION geo_ths
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH geo_ths, english_stem;

For example, we have the following data:

=# CREATE TABLE geo (id int, body_en text, body_de text);
=# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald');

Then you can execute the following queries:

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@ 
to_tsquery('geo_ths', 'forest');

 id | body_en | body_de
+-+-
  1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@ 
to_tsquery('geo_ths', 'forest');

 id | body_en | body_de
+-+-
  2 | (null)  | wald
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@ 
to_tsquery('geo_ths', 'wald');

 id | body_en | body_de
+-+-
  1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@ 
to_tsquery('geo_ths', 'wald');

 id | body_en | body_de
+-+-
  2 | (null)  | wald
(1 row)

This solution does not take into account the fact that queries in german 
can be in various forms (for example, plural forms). It works well for 
english. As a solution of this, you can put into geo.ths various forms 
of word.


But maybe the solution above is what you need. Otherwise it can serves 
as a concept of solution.




...

* How to setup a text search configuration which e.g. stems en and de
words?


I still would like to give FTS a try with synonym dictionary (en-de).
Now, I'm wondering how to setup the configuration. I've seen examples
to process either english, german or russian alone. But I did not find
yet any documentation on how to setup the text search configuration
where a corpus contains two (or more) languages at same time in a
table (body_en and body_de).

:Stefan







2016-07-14 17:20 GMT+02:00 Artur Zakirov :

Hi,

On 14.07.2016 01:16, Stefan Keller wrote:


Hi,

I have a text corpus which contains either German or English docs and
I expect queries where I don't know if it's German or English. So I'd
like e.g. that a query "forest" matches "forest" in body_en but also
"Wald" in body_de.

I created a table with attributes body_en and body_de (type "text"). I
will use ts_vector/ts_query on the fly (don't need yet an index
(attributes)).

* Can FTS handle this multilingual situation?



In my opinion, PostgreSQL cant handle it. It cant translate words from one
language to another, it just stems word from original form to basic form.
First you need to translate word from English to German, then search word in
the body_de attribute.

And the issue is complicated by the fact that one word could have different
meaning in the other language.


* How to setup a text search configuration which e.g. stems en and de
words?
* Should I create a synonym dictionary which contains word
translations en-de instead of synonyms en-en?



This synonym dictionary will contain a thousands entries. So it will require
a great effort to make this dictionary.



* Any hints to related work where FTS has been used in a multilingual
context?

:Stefan




--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS with more than one language in body and with unknown query language?

2016-07-14 Thread Stefan Keller
приве́т! Artur

Thanks for your explanations.

2016-07-14 17:20 GMT+02:00 Artur Zakirov :
> On 14.07.2016 01:16, Stefan Keller wrote:
...
>> * Should I create a synonym dictionary which contains word
>> translations en-de instead of synonyms en-en?
>
> This synonym dictionary will contain a thousands entries. So it will require
> a great effort to make this dictionary.

It's a domain-specific corpus of max. 1000 records of descriptive text
(metadata) about geographic data, like topographic map, land use
planning, etc.

...
>> * How to setup a text search configuration which e.g. stems en and de
>> words?

I still would like to give FTS a try with synonym dictionary (en-de).
Now, I'm wondering how to setup the configuration. I've seen examples
to process either english, german or russian alone. But I did not find
yet any documentation on how to setup the text search configuration
where a corpus contains two (or more) languages at same time in a
table (body_en and body_de).

:Stefan







2016-07-14 17:20 GMT+02:00 Artur Zakirov :
> Hi,
>
> On 14.07.2016 01:16, Stefan Keller wrote:
>>
>> Hi,
>>
>> I have a text corpus which contains either German or English docs and
>> I expect queries where I don't know if it's German or English. So I'd
>> like e.g. that a query "forest" matches "forest" in body_en but also
>> "Wald" in body_de.
>>
>> I created a table with attributes body_en and body_de (type "text"). I
>> will use ts_vector/ts_query on the fly (don't need yet an index
>> (attributes)).
>>
>> * Can FTS handle this multilingual situation?
>
>
> In my opinion, PostgreSQL cant handle it. It cant translate words from one
> language to another, it just stems word from original form to basic form.
> First you need to translate word from English to German, then search word in
> the body_de attribute.
>
> And the issue is complicated by the fact that one word could have different
> meaning in the other language.
>
>> * How to setup a text search configuration which e.g. stems en and de
>> words?
>> * Should I create a synonym dictionary which contains word
>> translations en-de instead of synonyms en-en?
>
>
> This synonym dictionary will contain a thousands entries. So it will require
> a great effort to make this dictionary.
>
>
>> * Any hints to related work where FTS has been used in a multilingual
>> context?
>>
>> :Stefan
>>
>>
>
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS with more than one language in body and with unknown query language?

2016-07-14 Thread Artur Zakirov

Hi,

On 14.07.2016 01:16, Stefan Keller wrote:

Hi,

I have a text corpus which contains either German or English docs and
I expect queries where I don't know if it's German or English. So I'd
like e.g. that a query "forest" matches "forest" in body_en but also
"Wald" in body_de.

I created a table with attributes body_en and body_de (type "text"). I
will use ts_vector/ts_query on the fly (don't need yet an index
(attributes)).

* Can FTS handle this multilingual situation?


In my opinion, PostgreSQL cant handle it. It cant translate words from 
one language to another, it just stems word from original form to basic 
form. First you need to translate word from English to German, then 
search word in the body_de attribute.


And the issue is complicated by the fact that one word could have 
different meaning in the other language.



* How to setup a text search configuration which e.g. stems en and de words?
* Should I create a synonym dictionary which contains word
translations en-de instead of synonyms en-en?


This synonym dictionary will contain a thousands entries. So it will 
require a great effort to make this dictionary.



* Any hints to related work where FTS has been used in a multilingual context?

:Stefan




--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] FTS with more than one language in body and with unknown query language?

2016-07-13 Thread Stefan Keller
Hi,

I have a text corpus which contains either German or English docs and
I expect queries where I don't know if it's German or English. So I'd
like e.g. that a query "forest" matches "forest" in body_en but also
"Wald" in body_de.

I created a table with attributes body_en and body_de (type "text"). I
will use ts_vector/ts_query on the fly (don't need yet an index
(attributes)).

* Can FTS handle this multilingual situation?
* How to setup a text search configuration which e.g. stems en and de words?
* Should I create a synonym dictionary which contains word
translations en-de instead of synonyms en-en?
* Any hints to related work where FTS has been used in a multilingual context?

:Stefan


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general