Re: [GENERAL] TSearch queries with multiple languages

2009-02-13 Thread Ivan Sergio Borgonovo
On Fri, 13 Feb 2009 09:55:04 +0300 (MSK)
Oleg Bartunov  wrote:

> contrib/btree_gin, which is under review for 8.4, will allow to
> create composite index like (ts_config, tsvector), so queries
> which specified ts_config (language) will uses this index.

Grass Root protectionism of good programmers ;)

Not only Oleg write very valuable code, but he really cares about
his users base.

I'm still sorry I haven't been able to track down the origin of a
very slow gin index creation I reported months ago.

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] TSearch queries with multiple languages

2009-02-12 Thread Oleg Bartunov

On Thu, 12 Feb 2009, Tom Lane wrote:


Gordon Callan  writes:

Next we create an index on the ts_vector column:
 CREATE INDEX node_ts_body on node USING gin(ts_body);



From the documentation, it seems this index will know what config each row has.


No, actually the index doesn't know and doesn't care.  The tsvector
representation is language-independent --- it contains "just strings".
All the language-dependent processing happens during reduction of the
document text to tsvector (or reduction of a search string to tsquery).
So if words from different languages happen to reduce to the same
string, searches in both languages will find that entry.

Usually this works the way people want; but if not, you could add an
additional WHERE condition to your queries to match only documents in
the desired language.


contrib/btree_gin, which is under review for 8.4, will allow to create
composite index like (ts_config, tsvector), so queries which specified 
ts_config (language) will uses this index.



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] TSearch queries with multiple languages

2009-02-12 Thread Tom Lane
Gordon Callan  writes:
> Next we create an index on the ts_vector column:
>  CREATE INDEX node_ts_body on node USING gin(ts_body);

> From the documentation, it seems this index will know what config each row 
> has.  

No, actually the index doesn't know and doesn't care.  The tsvector
representation is language-independent --- it contains "just strings".
All the language-dependent processing happens during reduction of the
document text to tsvector (or reduction of a search string to tsquery).
So if words from different languages happen to reduce to the same
string, searches in both languages will find that entry.

Usually this works the way people want; but if not, you could add an
additional WHERE condition to your queries to match only documents in
the desired language.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] TSearch queries with multiple languages

2009-02-12 Thread Gordon Callan

Greetings,

I'm implementing full text search at our company, using Tsearch2 and have read 
Chapter 12 (FTS) numerous times and am still unclear about something. 

All our data is stored in Postgresql in Unicode.
The data to be searched can be in a number of different languages.  

I plan to create a ts_vector column for each corresponding data column to be 
searched and use an additional column for the 
regconfig where each row will have a configuration name depending on the 
language of that row of data.

So, for example, we have a table called node with columns node_id, body;  

We add 2 columns, ts_body and ts_config to contain the ts_vector data of body 
and the configuration (langugage) in the 
other.  

ALTER TABLE node ADD column ts_body tsvector, ts_config regconfig;

At install time, the ts_config column will be populated so that it contains the 
language/config for each row.  We will also
provide a means to keep the ts_body column updated each time the underlying 
body data changes.

We then generate the ts_vector column using this configuration:
UPDATE node SET ts_body = to_tsvector(ts_config, body);

Presumably, this will generate tsvector data for every row, using what's in the 
regconfig column to determine the language.

Next we create an index on the ts_vector column:
 CREATE INDEX node_ts_body on node USING gin(ts_body);

>From the documentation, it seems this index will know what config each row 
>has.  

OK, now here's where the documentation is sketchy.

When searching, we will generate SQL like this:

SELECT * 
FROM node
WHERE (ts_body @@ to_tsquery('english','foo & bar'));

Assuming we have 3 different configurations (all contained in various rows and 
defined in the regconfig column), what language(s) will be returned in the 
result set? All 3 languages? Is it based on the default_text_search_config ?


Thanks for your help,
Gordon

_
Windows Liveā„¢: Keep your life in sync. 
http://windowslive.com/howitworks?ocid=TXT_TAGLM_WL_t1_allup_howitworks_022009