Re: [GENERAL] Avoiding empty queries in tsearch

2007-01-15 Thread Teodor Sigaev

contrib_regression=# select numnode( plainto_tsquery('the any') );
NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
 numnode
-
   0
(1 row)

contrib_regression=# select numnode( plainto_tsquery('the table') );
 numnode
-
   1
(1 row)

contrib_regression=# select numnode( plainto_tsquery('long table') );
 numnode
-----
   3
(1 row)


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] Avoiding empty queries in tsearch

2007-01-15 Thread Teodor Sigaev



Doug Cole wrote:
That sounds perfect, but it doesn't seem to exist on either of the 
postgresql installations I have access to (8.1 on ubuntu and fedora 
core).  Is it new to 8.2?  Is there a similar function under 8.1, or at 

Yes, it's new in 8.2


least a decent work-around?  Thanks for the help,
Doug


Not nice workaround but it works:

# create or replace function isvoid(tsquery)
returns bool as $$
select case when $1 is NULL then 't'::bool when length(textin(tsquery_out( $1 
))) = 0 then 't'::bool else 'f'::bool end;

$$ language SQL called on null input;


# select isvoid( plainto_tsquery('the & any') );
NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
 isvoid

 t
(1 row)



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] Stats collector frozen?

2007-01-26 Thread Teodor Sigaev

Apparantly there is a bug lurking somewhere in pgwin32_select(). Because
if I put a #undef select right before the select in pgstat.c, the
regression tests pass. 


May be, problem is related to fixed bug in pgwin32_waitforsinglesocket() ?
WaitForMultipleObjectsEx might sleep indefinitely while waiting socket to write, 
so, may be there is symmetrical problem with read? Or pgwin32_select() is used 
for waiting write too?




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] intarray index vs gin index

2007-02-09 Thread Teodor Sigaev
intarray. My question is whether I still should use intarray for 
indexing (if yes then either I should use GIST or GIN) or maybe GIN 
index is faster than GIST+intarray / GIN+intarray.

Yes, with intarray you can use GiST/GIN indexes which you wish

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] Having performance problems with TSearch2

2007-02-20 Thread Teodor Sigaev

Use GIN index instead of GiST

I have a table of books, with 120 registers. I have created an GIST 
index over the title and subtitle,

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] tsearch2: word position

2007-02-21 Thread Teodor Sigaev
I'm fiddling with to_tsvector() and parse() from tsearch2, trying to get 
the word position from those functions. I'd like to use the tsearch2 
parser and stemmer, but I need to know the exact position of the word as 
well as the original, unstemmed word.


It's not supposed usage... Why do you need that?

And this only tells me a word position, not a character or byte position 
within the string. Is there a way to get this information from tsearch2?


Have a look to headline framework as an example or staring point. hlparsetext() 
returns  parsed text with matched lexemes in tsquery. Small description of 
hlparsetext is placed at 
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html

near the end. Description of HLWORD struct is some out of day, sorry.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] tsearch2: word position

2007-02-22 Thread Teodor Sigaev
to_tsvector() could as well return the character number or a byte 
pointer, I could see advantages for both. But the word number makes 
little sense to me.


Word number is used only in ranking functions. If you don't need a ranking than 
you could safely strip positional information.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] tsearch2: word position

2007-02-22 Thread Teodor Sigaev
Huh? I explicitly *want* positional information. But I find the word 
number to be less useful than a character number or a simple (byte) 
pointer to the position of the word in the string.


Given only the word number, I have to go and parse the string again.


byte offset of word is useless for ranking purpose
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] tsearch2: word position

2007-02-22 Thread Teodor Sigaev



No, the first X aren't more important, but being able to determine
word proximity is very important for partial phrase matching and
ranking.  The closer the words, the "better" the match, all else being
equal.

exactly

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] to_tsvector in 8.2.3

2007-03-21 Thread Teodor Sigaev
I can't reproduce your problem, but I have not Windows box, can anybody 
reproduce that?



contrib_regression=# select version();
 version 

 PostgreSQL 8.2.3 on i386-unknown-freebsd6.2, compiled by GCC gcc (GCC) 3.4.6 
[FreeBSD] 20060305

(1 row)
contrib_regression=#  show server_encoding ;
 server_encoding
-
 UTF8
(1 row)

contrib_regression=# show lc_collate;
 lc_collate

 C
(1 row)

contrib_regression=# show lc_ctype;
 lc_ctype
--
 C
(1 row)

contrib_regression=# select to_tsvector('test text');
to_tsvector
---
 'test':1 'text':2
(1 row)

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] to_tsvector in 8.2.3

2007-03-21 Thread Teodor Sigaev

8.2 has fully rewritten text parser based on POSIX is* functions.

Thomas Pundt wrote:

On Wednesday 21 March 2007 14:25, Teodor Sigaev wrote:
| I can't reproduce your problem, but I have not Windows box, can anybody
| reproduce that?

just a guess in the wild; I once had a similar phenomen and tracked it down
to a "non breaking space character" (0xA0). Since then I'm patching the
tsearch2 lexer:

--- postgresql-8.1.5/contrib/tsearch2/wordparser/parser.l
+++ postgresql-8.1.4/contrib/tsearch2/wordparser/parser.l
@@ -78,8 +78,8 @@
 /* cyrillic koi8 char */
 CYRALNUM   [0-9\200-\377]
 CYRALPHA   [\200-\377]
-ALPHA  [a-zA-Z\200-\377]
-ALNUM  [0-9a-zA-Z\200-\377]
+ALPHA  [a-zA-Z\200-\237\241-\377]
+ALNUM  [0-9a-zA-Z\200-\237\241-\377]
 
 
 HOSTNAME   ([-_[:alnum:]]+\.)+[[:alpha:]]+

@@ -307,7 +307,7 @@
return UWORD; 
 }
 
-[ \r\n\t]+ {

+[ \240\r\n\t]+ {
token = tsearch2_yytext;
tokenlen = tsearch2_yyleng;
return SPACE;


Ciao,
Thomas



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Teodor Sigaev
I'm afraid that fulltext search on multiterabytes set of documents can not be 
implemented on any RDBMS, at least on single box. Specialized fulltext search 
engines (with exact matching and time to search about one second) has practical 
limit near 20 millions of docs, cluster - near 100 millions.  Bigger collections 
require engines like a google.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] multi terabyte fulltext searching

2007-03-21 Thread Teodor Sigaev
I am currently using GIST indexes because I receive about 10GB of new 
data a week (then again I am not deleting any information).  The do not 
expect to be able to stop receiving text for about 5 years, so the data 
is not going to become static any time soon.  The reason I am concerned 
with performance is that I am providing a search system for several 
newspapers since essentially the beginning of time.  Many bibliographer 
etc would like to use this utility but if each search takes too long I 
am not going to be able to support many concurrent users.


Use GiST and GIN indexes together: any data older than one month (which doesn't 
change) with GIN index and new data with GiST. And one time per month moves data 
from GiST to GIN.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] to_tsvector in 8.2.3

2007-03-21 Thread Teodor Sigaev

postgres=# select to_tsvector('test text');
  to_tsvector
---
 'test text':1
(1 row)
Ok. that's related to 
http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/tsearch2/wordparser/parser.c.diff?r1=1.11;r2=1.12;f=h
commit. Thomas pointed that it can be non-breakable space (0xa0) and that commit 
assumes any character with C locale and multibyte encoding and > 0x7f is alpha.

To check theory, pls, apply attached patch.

If so, I'm confused, we can not assume that 0xa0 is a space symbol in any 
multibyte encoding, even in Windows.




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/
*** ./contrib/tsearch2/wordparser/parser.c.orig Wed Mar 21 20:41:23 2007
--- ./contrib/tsearch2/wordparser/parser.c  Wed Mar 21 21:10:39 2007
***
*** 124,130 
--- 124,134 
 * with C-locale is an alpha character
 */
if ( c > 0x7f )
+   {
+   if ( c == 0xa0 )
+   return 0;
return 1;
+   }
  
return isalnum(0xff & c);
}
***
*** 157,163 
--- 161,171 
 * with C-locale is an alpha character
 */
if ( c > 0x7f )
+   {
+   if ( c == 0xa0 )
+   return 0;
return 1;
+   }
  
return isalpha(0xff & c);
}

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] to_tsvector in 8.2.3

2007-03-22 Thread Teodor Sigaev
Solved, see attached patch. I had found old Celeron-300 box and install Windows 
on it, and it was very slow :)




Nope, same result with this patch.

Thank you.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/
*** ./contrib/tsearch2.orig/./wordparser/parser.c   Thu Mar 22 18:39:23 2007
--- ./contrib/tsearch2/./wordparser/parser.cThu Mar 22 18:51:23 2007
***
*** 117,123 
{
if (lc_ctype_is_c())
{
!   unsigned int c = *(unsigned int*)(prs->wstr + 
prs->state->poschar);
  
/*
 * any non-ascii symbol with multibyte encoding
--- 117,123 
{
if (lc_ctype_is_c())
{
!   unsigned int c = *(prs->wstr + prs->state->poschar);
  
/*
 * any non-ascii symbol with multibyte encoding

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [GENERAL] Tsearch2 crashes my backend, ouch !

2007-03-30 Thread Teodor Sigaev

which version of pgsql exactly?

Listmail wrote:


Hello,

I have just ditched Gentoo and installed a brand new kubuntu system 
(was tired of the endless compiles).
I have a problem with crashing tsearch2. This appeared both on 
Gentoo and the brand new kubuntu.


I will describe all my install procedure, maybe I'm doing something 
wrong.


Cluster is newly created and empty.

initdb was done with UNICODE encoding & locales.

# from postgresql.conf

# These settings are initialized by initdb -- they might be changed
lc_messages = 'fr_FR.UTF-8' # locale for system 
error message strings
lc_monetary = 'fr_FR.UTF-8' # locale for monetary 
formatting
lc_numeric = 'fr_FR.UTF-8'  # locale for number 
formatting
lc_time = 'fr_FR.UTF-8' # locale for time 
formatting


[EMAIL PROTECTED]:~$ locale
LANG=fr_FR.UTF-8
LC_CTYPE="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
etc...

First import needed .sql files from contrib and check that the 
default tsearch2 config works for English


$ createdb -U postgres test
$ psql -U postgres test ./configure --prefix=/usr/lib/postgresql/8.2/ 
--datadir=/usr/share/postgresql/8.2 --enable-nls=fr --with-python

cd contrib/tsearch2
make
cd gendict
(copy french stem.c and stem.h from the snowball website)
./config.sh -n fr -s -p french_UTF_8 -i -v -c stem.c -h stem.h 
-C'Snowball stemmer for French'

cd ../../dict_fr
make clean && make
sudo make install

Now we have :

/bin/sh ../../config/install-sh -c -m 644 dict_fr.sql 
'/usr/share/postgresql/8.2/contrib'
/bin/sh ../../config/install-sh -c -m 755  libdict_fr.so.0.0 
'/usr/lib/postgresql/8.2/lib/dict_fr.so'


Okay...

- download and install UTF8 french dictionaries from 
http://www.davidgis.fr/download/tsearch2_french_files.zip and put them 
in contrib directory

(the files delivered by debian package ifrench are ISO8859, bleh)

- import french shared libs
psql -U postgres test < /usr/share/postgresql/8.2/contrib/dict_fr.sql

Then :

test=# select lexize( 'en_stem', 'flying' );
 lexize

 {fli}

And :

test=# select * from pg_ts_dict where dict_name ~ '^(fr|en)';
 dict_name |   dict_init   |   dict_initoption|  
dict_lexize  |dict_comment
---+---+--+---+- 

 en_stem   | snb_en_init(internal) | contrib/english.stop | 
snb_lexize(internal,internal,integer) | English Stemmer. Snowball.
 fr| dinit_fr(internal)|  | 
snb_lexize(internal,internal,integer) | Snowball stemmer for French


test=# select lexize( 'fr', 'voyageur' );
server closed the connection unexpectedly

BLAM ! Try something else :

test=# UPDATE pg_ts_dict SET 
dict_initoption='/usr/share/postgresql/8.2/contrib/french.stop' WHERE 
dict_name = 'fr';

UPDATE 1
test=# select lexize( 'fr', 'voyageur' );
server closed the connection unexpectedly

Try other options :

dict_name   | fr_ispell
dict_init   | spell_init(internal)
dict_initoption | 
DictFile="/usr/share/postgresql/8.2/contrib/french.dict",AffFile="/usr/share/postgresql/8.2/contrib/french.aff",StopFile="/usr/share/postgresql/8.2/contrib/french.stop" 


dict_lexize | spell_lexize(internal,internal,integer)
dict_comment|

test=# select lexize( 'en_stem', 'traveler' ), lexize( 'fr_ispell', 
'voyageur' );

-[ RECORD 1 ]---
lexize | {travel}
lexize | {voyageuse}

Now it works (kinda) but stemming doesn't stem for French (since 
snowball is out). It should return 'voyage' (=travel) instead of 
'voyageuse' (=female traveler)

That's now what I want ; i want to use snowball to stem French words.

I'm going to make a debug build and try to debug it, but if anyone 
can help, you're really, really welcome.






---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] Tsearch2 crashes my backend, ouch !

2007-03-30 Thread Teodor Sigaev

(copy french stem.c and stem.h from the snowball website)
Take french stemmer from 
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/stemmer/stemmer_utf8_french.tar.gz

At least, it works for me.

Sorry, but Snowball's interfaces are changed very quickly and unpredictable and 
Snowball doesn't use version mark or something similar. So, downloaded Snowball 
core and stemmers in different time may be incompatible :(.


Our tsearch_core patch (moving tsearch into core of pgsql) solves that problem - 
it contains all possible snowball stemmers.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Tsearch2 crashes my backend, ouch !

2007-04-02 Thread Teodor Sigaev

Fixed. Thanks for the report.

Anyway, just to signal that tsearch2 crashes if SELECT is not 
granted to pg_ts_dict (other tables give a proper error message when 
not GRANTed).On

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] to_tsvector in 8.2.3

2007-04-02 Thread Teodor Sigaev

Sorry, no - I tested on CVS HEAD, so dll isn't compatible :(
Wait a bit for 8.2.4

richardcraig wrote:

Teodor

As a non-C windows user (yes - throw stones at me :) ) Do you have a fixed
dll for this patch that I can try?

Thanks

Richard


Teodor Sigaev-2 wrote:

Solved, see attached patch. I had found old Celeron-300 box and install
Windows 
on it, and it was very slow :)




Nope, same result with this patch.

Thank you.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
WWW:
http://www.sigaev.ru/

*** ./contrib/tsearch2.orig/./wordparser/parser.c   Thu Mar 22 18:39:23 2007
--- ./contrib/tsearch2/./wordparser/parser.cThu Mar 22 18:51:23 2007
***
*** 117,123 
{
if (lc_ctype_is_c())
{
!   unsigned int c = *(unsigned int*)(prs->wstr + 
prs->state->poschar);
  
  			/*

 * any non-ascii symbol with multibyte encoding
--- 117,123 
{
if (lc_ctype_is_c())
{
!   unsigned int c = *(prs->wstr + prs->state->poschar);
  
  			/*

 * any non-ascii symbol with multibyte encoding


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq






--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] indexing array columns

2007-04-16 Thread Teodor Sigaev

you should be able to index the way you want. In contrib there a module
"cube" which does similar to what you want to 3D, extending it to 12D
shouldn't be too hard...


contrib/cube module implements N dimensional cube representation

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] tsearch2 dictionary that indexes substrings?

2007-04-24 Thread Teodor Sigaev

My colleague who speaks more C than me came up with the code below
which works fine for us. Will the memory allocated for lexeme be freed

Nice, except self-defined utf8 properties. I think it will be much better to use
pg_mblen(char*). In this case your dictionary will work with any supported by 
pgsql encodings.



by the caller?

Yes, of course.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] Postgresql 8.2.4 crash with tsearch2

2007-05-24 Thread Teodor Sigaev

Pls, check your steps or say me where I'm wrong :)
If you still have a problems, I can solve it if I'll have access to your 
developer server...


% cd PGSQL_SRC
% zcat ~/tmp/tsearch_snowball_82-20070504.gz| patch -p0
% cd contrib/tsearch2
% gmake && su -c 'gmake install' && gmake installcheck
% cd gendict
% cp ~/tmp/libstemmer_c/src_c/stem_UTF_8_french.c stem.c
% cp ~/tmp/libstemmer_c/src_c/stem_UTF_8_french.h stem.h
% ./config.sh -n fr -s -p french_UTF_8 -v -C'Snowball stemmer for 
French - UTF8'

% cd ../../dict_fr
% gmake && su -c 'gmake install'
% psql contrib_regression < dict_fr.sql

contrib_regression=# select lexize('fr', 'sortir'), lexize('fr', 
'service'), lexize('fr', 'chose');

 lexize |  lexize  | lexize
+--+
 {sort} | {servic} | {chos}
(1 row)

contrib_regression=# select lexize('fr', 
'as');


 lexize
 

{}



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] crash creating tsearch2 index

2007-05-28 Thread Teodor Sigaev

Could you provide a test suite?

John DeSoi wrote:

Hi,

I'm trying to dump and restore a copy of a database in the same cluster. 
pg_restore would abort when creating a tsearch2 gist index. So I dumped 
to text removed the CREATE INDEX commands and tried to do that at the 
end after the rest of the database was loaded. I still have the same 
problem:


CREATE INDEX song_tsx_title_idx ON song USING gist (tsx_title 
public.gist_tsvector_ops);

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

This is pg 8.0.8 in a shared hosting environment, so I don't have a lot 
of options for tweaking. Is there a known work-around for this?


Thanks,




John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] opclass for real[]

2007-05-30 Thread Teodor Sigaev
ERROR:  data type real[] has no default operator class for access method 
"gist"
HINT:  You must specify an operator class for the index or define a 
default operator class for the data type.

There is operator class for GIN for real[].
http://www.postgresql.org/docs/8.2/static/xindex.html#XINDEX-GIN-ARRAY-STRAT-TABLE



Is there a opclass defined in 8.2 or I have to create one. In either 
case can you please give a link for information on opclasses.


Thanks
Abhang


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] TSEARCH2: disable stemming in indexes and triggers

2007-05-31 Thread Teodor Sigaev
I found out that using 'simple' instead of 'default' when using 
to_tsvector() does excactly that, but I don't know how to change my 
triggers and indexes to keep doing the same (using 'simple'). 


Suppose, your database is initialized with C locale. So, just mark 
simple configuration as default:


# update pg_ts_cfg set locale=null where ts_name='default';
# update pg_ts_cfg set locale='C' where ts_name='simple';

If your locale setting is not C then mark needed configuration with your 
 locale.


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] warm standby server stops doing checkpoints afterawhile

2007-06-01 Thread Teodor Sigaev

<2007-06-01 13:11:29.365 CEST:%> DEBUG:  0: Ressource manager (13)
has partial state information

To me, this points clearly to there being an improperly completed action
in resource manager 13. (GIN) In summary, it appears that there may be
an issue with the GIN code for WAL recovery and this is effecting the
Warm Standby.


Hmm. I found that gin_xlog_cleanup doesn't reset incomplete_splits list. Is it 
possible reason of bug?


Attached patch fixes it.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/
*** ./src/backend/access/gin/ginxlog.c.orig Fri Jun  1 16:47:47 2007
--- ./src/backend/access/gin/ginxlog.c  Fri Jun  1 16:53:47 2007
***
*** 594,599 
--- 594,600 
  
MemoryContextSwitchTo(topCtx);
MemoryContextDelete(opCtx);
+   incomplete_splits = NIL;
  }
  
  bool

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org/


Re: [GENERAL] warm standby server stops doing checkpointsafterawhile

2007-06-01 Thread Teodor Sigaev

<2007-06-01 16:28:51.708 CEST:%> LOG:  GIN incomplete split root:8
l:45303 r:111740 at redo CA/C8243C28

...

<2007-06-01 16:38:23.133 CEST:%> LOG:  GIN incomplete split root:8
l:45303 r:111740 at redo CA/C8243C28


Looks like a bug in GIN. I'll play with it. Can you provide more details about 
your test?



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] warm standby server stops doingcheckpointsafterawhile

2007-06-01 Thread Teodor Sigaev



I'd suggest we throw an error, as shown in the enclosed patch. Frank,
can you give that a whirl to provide Teodor with something more to work
with? Thanks.


I already makes a test suite which reproduce the problem - it leaves incompleted 
splits. But I discover one more problem: deadlock on buffer's lock. So, right 
now I investigate the problem.




Neither GIST nor B-tree seems to throw an error in corresponding
locations also, so the potential for not being able to track this is
high. I'd want to throw errors in those locations also.


Agreed, I'll add more check
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] warm standby server stops doingcheckpointsafterawhile

2007-06-01 Thread Teodor Sigaev


Found a reason: if parent page is fully backuped after child's split then 
forgetIncompleteSplit() isn't called at all.


Hope, attached patch fix that. Pls, test it.

PS I'm going away for weekend, so I'll not be online until Monday.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


patch_wal_gin.gz
Description: Unix tar archive

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] warm standby server stops doingcheckpointsafterawhile

2007-06-04 Thread Teodor Sigaev

<2007-06-01 23:00:00.001 CEST:%> LOG:  GIN incomplete splits=8


Just to be sure: patch fixes *creating* of WAL log, not replaying. So, primary 
db should be patched too.


During weekend I found possible deadlock in locking protocol in GIN between 
concurrent UPDATE and VACUUM queries with the same GIN index involved. Strange, 
but I didn't see it in 8.2 and even now I can't reproduce it. It's easy to 
reproduce оnly on HEAD with recently added ReadBufferWithStrategy() call instead 
of ReadBuffer(). ReadBufferWithStrategy() call was added to implement 
limited-size "ring" of buffers for VACUUM. Nevertheless, it's a possible 
scenario in 8.2.


Attached patch fixes that deadlock bug too. And, previous version of my patch 
has a mistake which is observable on CREATE INDEX .. USING GIN query.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


patch_wal_gin.v6.gz
Description: Unix tar archive

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [GENERAL] warm standby server stops doingcheckpointsafterawhile

2007-06-04 Thread Teodor Sigaev

1. After a certain point, consecutive GIN index splits cause a problem.
The new RHS block numbers are consecutive from 111780+

That's newly created page. Splitted page might have any number



2. The incomplete splits stay around indefinitely after creation and we
aren't trying to remove the wrong split at any point. We're either never
creating an xlog record, or we are ignoring it in recovery, or we are
somehow making multiple entries then not removing all of them.

Agreed



3. The root seems to move, which isn't what I personally was expecting
to see. It seems root refers to the highest parent involved in the
split.
root in this context means parent of splitted page. Actually, there is a lot of 
B-tree in GIN, see http://www.sigaev.ru/gin/GinStructure.pdf


4. We're writing lots of redo in between failed page splits. So *almost*
everything is working correctly.

5. This starts to happen when we have very large indexes. This may be
coincidental but the first relation file is fairly full (900+ MB).


Yes. It seems to me that conditions of error are very rare and B-tree over 
ItemPointers (second level of GIN) has a big capacity, 1000+ items per page. So, 
splits occur rather rare.




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] warm standby server stops doingcheckpointsafterawhile

2007-06-04 Thread Teodor Sigaev

Ooops. Patch doesn't apply cleanly. New version.


Attached patch fixes that deadlock bug too. And, previous version of my 
patch has a mistake which is observable on CREATE INDEX .. USING GIN query.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


patch_wal_gin.v7.gz
Description: Unix tar archive

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org/


Re: [GENERAL] warm standby server stops doingcheckpointsafterawhile

2007-06-04 Thread Teodor Sigaev

After some observation of massive reindexing of some hundred thousand
data sets it seems to me that the slave doesn't skip checkpoints
anymore. (Apart from those skipped because of the CheckpointTimeout thing)
I'll keep an eye on it and report back any news on the issue.


Nice, committed. Thank for your report and testing.
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] If an index is based on 3 columns will a query using

2005-09-12 Thread Teodor Sigaev

Greg Stark suggests here:
http://archives.postgresql.org/pgsql-hackers/2005-05/msg00966.php
that GiST could also be fixed to work with any subset of the index
columns, but it hasn't been done yet, unless Teodor and Oleg snuck
something in during that last round of GiST work.


GiST may work with any subset of index columns too. Even in existing code I 
don't see any problem except NULL in a first column. GiST doesn't store tuples 
with leading NULL value (gist.c lines 174, 326), so index doesn't contained them.


After our work about WAL-lization GiST, it may work with "invalid" tuples 
(possibly occured after crash recovery), so itsn't a big deal to add support 
NULL in a first column. But freeze date is outdated... Should I add or leave it 
to 8.2?




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] TSearch2 snowball version error

2005-09-15 Thread Teodor Sigaev

Snowball changes interfaces, I'll update tsearh2 sources today or tommorow.


William Leite Araújo wrote:

Hi,

I'm trying compile a new brazilian portuguese dictionary to TSearch2 
contrib, but found the errors:


portuguese_stem.c: In function `r_prelude':
portuguese_stem.c:481: error: void value not ignored as it ought to be
portuguese_stem.c:487: error: void value not ignored as it ought to be
portuguese_stem.c: In function `r_postlude':
portuguese_stem.c:610: error: void value not ignored as it ought to be
portuguese_stem.c:616: error: void value not ignored as it ought to be
portuguese_stem.c: In function `r_standard_suffix':
portuguese_stem.c:662: error: void value not ignored as it ought to be
portuguese_stem.c:672: error: void value not ignored as it ought to be
portuguese_stem.c:682: error: void value not ignored as it ought to be
portuguese_stem.c:692: error: void value not ignored as it ought to be
portuguese_stem.c:702: error: void value not ignored as it ought to be
portuguese_stem.c:715: error: void value not ignored as it ought to be
portuguese_stem.c:729: error: void value not ignored as it ought to be
portuguese_stem.c:744: error: void value not ignored as it ought to be
portuguese_stem.c:760: error: void value not ignored as it ought to be
portuguese_stem.c:775: error: void value not ignored as it ought to be
portuguese_stem.c:791: error: void value not ignored as it ought to be
portuguese_stem.c:806: error: void value not ignored as it ought to be
portuguese_stem.c:818: error: void value not ignored as it ought to be
portuguese_stem.c:832: error: void value not ignored as it ought to be
portuguese_stem.c: In function `r_verb_suffix':
portuguese_stem.c:856: error: void value not ignored as it ought to be
portuguese_stem.c: In function `r_residual_suffix':
portuguese_stem.c:880: error: void value not ignored as it ought to be
portuguese_stem.c: In function `r_residual_form':
portuguese_stem.c:902: error: void value not ignored as it ought to be
portuguese_stem.c:929: error: void value not ignored as it ought to be
portuguese_stem.c:935: error: void value not ignored as it ought to be
portuguese_stem.c: In function `portuguese_ISO_8859_1_stem':
portuguese_stem.c:993: error: void value not ignored as it ought to be
make: ** [portuguese_stem.o] Erro 1

 This after change the included file "header.h", the old 
"../runtime/header.h" is invalid.
 I think that is a version trouble, because the return of functions 
"slice_from_s" and "slice_del" are "int" in current snowball portuguese 
files, but on include files of version 8.0.2 and 8.0.3 of Portgresql the 
return type is "void".

 Help-me, please.

--
William Leite Araújo


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] TSearch2 snowball version error

2005-09-15 Thread Teodor Sigaev

Fixed and commited in cvs.

Patches for already existsing versions :
http://www.sigaev.ru/gist/patch_snowball-7.4.gz
http://www.sigaev.ru/gist/patch_snowball-8.0.gz




I'm trying compile a new brazilian portuguese dictionary to TSearch2 
contrib, but found the errors:


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] re : tsearch2 problem

2005-10-18 Thread Teodor Sigaev
because of its way of "Inverse Order OF Idexing",and "The GiST Tree is 
well for larting lan but doesn't for chinese lan(also ineffifent for 
Japanese lan)"。


About japanese look at http://www.oss.ecl.ntt.co.jp/tsearch2j/ (it's on 
Japanese, but you can contact with developer)






 

I only know that GiST is also  "Inverse Order OF Idexing",but of course 
it is a tree NOT as lucene does


Tsearch2 doesn't use inverted index (yet), and so it easy to update (your index 
is always full - online indexing), but it hasn't good performance on large data 
set. The limit is about 10^5 - 10^6 rows in practice.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] really thanks,Teodor Sigaev. HOW ts2 implment that

2005-10-18 Thread Teodor Sigaev

I know the LUCENE " inverted index " is just like follows:

 


keyword   doc_id[frequence]   location
guangzhou 1[2]   3,6
he   2[1]   1
i 1[1]   4
live  1[2],2[1]   2,5,2
shanghai  2[1]   3


Yes


so ,before I  thought that A node In GiST tree is very like the above :

In the tree's node , it has several value for storing its's

turple_id[frequence]   location ,and that is why I thoutht the " 
BUT now , I'm lost again ,how does Gist tree store lexem and how to find 
his correspoding tuples in database( including  frequence   location)?


Tsearch2 use "direct" index: each leaf tuple in tree represent document, 
internal tuples represents union of documents.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Fuzzy text search

2005-11-14 Thread Teodor Sigaev



jennyw wrote:
We've heard that PostgreSQL can do fuzzy search, but haven't had much 
luck. I'm brand new to PostgreSQL, so this might be completely obvious 
for an experienced user.
Are there any how-tos on fuzzy text searching? Someone said to try using 
tsearch2, but it seems that it does full-text searching, but not fuzzy 
search. Ideally, we'd like to use that w/ a fuzzy search option. In 
particular, if someone searches for "imat" we want to return results 
including "immaterial" and "imaterial" (so misspellings plus partial 
match).




Look at contrib/pg_trgm. If you want fulltext search with mispelling correction 
then read "Tsearch2 Integration" in contrib/pg_trgm/README.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] tsearch2: more than one index per table?

2005-11-22 Thread Teodor Sigaev

ERROR:  could not find tsearch config by locale
UPDATE t SET idxA=to_tsvector('default', a);



Is it working
select to_tsvector('foo bar')?

I suppose, no. In that case tsearch can't find configuration for current 
database locale, update pg_ts_cfg.locale in wished row to correct value.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] TSearch2 / German compound words / UTF-8

2005-11-23 Thread Teodor Sigaev
Tsearch/isepll is not able to break this word into parts, because of the 
"s" in "Produktion/s/intervall". Misspelling the word as 
"Produktionintervall" fixes it:


It should be affixes marked as 'affix in middle of compound word',
Flag is '~', example look in norsk dictionary:

flag ~\\:
[^S]   >S  #~ advarsel > advarsels-

BTW, we develop and debug compound word support on norsk (norwegian) dictionary, 
so look for example there. But we don't know Norwegian, norwegians helped us  :)







The second thing is with UTF-8:

I know there is no, or no full support yet, but I need to get it as good 
as it's possible /now/. Is there anything in CVS that I might be able to 
backport to my version or other tips? My setup works, as for the dict 
and the stop word files, but I fear the stemming and mapping of umlauts 
and other special chars does not as it should. I tried recoding the 
german.aff to UTF-8 as well, but that breaks it with an regex error 
sometimes:


Now in CVS it is deep alpha version and now only text parser is UTF-compliant, 
we continue development...





fts=# SELECT ts_debug('dass');
ERROR:  Regex error in '[^sãŸ]$': brackets [] not balanced
CONTEXT:  SQL function "ts_debug" statement 1

This seems while it tries to map ss to ß, but anyway, I fear, I didn't 
anything good with that.


As suggested in the "Tsearch2 and Unicode/UTF-8" article I have a second 
snowball dict. The first lines of the stem.h I used start with:



extern struct SN_env * german_ISO_8859_1_create_env(void);

Can you use ISO-8859-1?

So I guess this will not work exactly well with UTF-8 ;p Is there any 
other stem.h I could use? Google hasn't returned much for me :/


http://snowball.tartarus.org/

Snowball can generate UTF parser:
http://snowball.tartarus.org/runtime/use.html:
F1 [-o[utput] F2]
   [-s[yntax]]
   [-w[idechars]]  [-u[tf8]] < that's it!
   [-j[ava]]  [-n[ame] C]
   [-ep[refix] S1]  [-vp[refix] S2]
   [-i[nclude] D]
   [-r[untime] P]
At least for Russian there is 2 parsers, for KOI8 and UTF, (
http://snowball.tartarus.org/algorithms/russian/stem.sbl
http://snowball.tartarus.org/algorithms/russian/stem-Unicode.sbl
), diff shows that they different only in stringdef section. So you can make UTF 
parser for german.
BUT, I'm afraid that Snowball uses widechar, and postgres use multibyte for UTF 
internally.




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] fts, compond words?

2005-12-07 Thread Teodor Sigaev
That is a long discussed thing. We can't formulate unconflicting rules... For 
example:

1) a  &[dist<=2]  ( b &[dist<=3] c )
2) a  &[dist<=2]  ( b |[dist<=3] c )
3) a  &[dist<=2] !c
4) a  &[dist<=2]  ( b |[dist<=3] !c )
5) a  &[dist<=2] ( b & c )
What does exact they mean? What is tsvectors which should be matched by those 
queries?


The simple solution is : under operation 'phrase search' (ok, it will be '+' 
below) it must be only 'phrase search operations. I.e.:

a | b ( c + ( d + e ) )  - good
a | ( c + ( d & g ) )-  bad.

For example, we have word 'foonish' and after lexize we got two lexemes: 'foo1' 
and 'foo2'. So a good query 'a + foonish' becomes 'a + ( foo1 | foo2 )'...






Mike Rylander wrote:

On 12/6/05, Marcus Engene <[EMAIL PROTECTED]> wrote:

[snip]



 A & (B | (New OperatorTheNextWordMustFollow York))




Actually, I love that idea.  Oleg, would it be possible to create a
tsquery operator that understands proximity?  Or, how allowing a
predicate to the current '&' op, as in '&[dist<=1]' meaning "next
token follows with a max distance of  1".  I imagine that it would
only be useful on unstripped tsvectors, but if the lexem position is
already stored ...

--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [GENERAL] fts, compond words?

2005-12-08 Thread Teodor Sigaev

hrm... that is a problem.  Though, I think that's a case of how the
compiled expression is built from user input.  Unless I'm mistaken

  a + ( foo1 | foo2 )

is exactly equal to

  (a + foo1) | (a + foo2)


Ahhh... but then there is the more complex example of

  a + foonish + bar

becoming

  a + (foo1 | foo2) + bar

 but I guess that could be

(a + foo1 + bar) | (a + foo2 + bar)


That a simple case, what about languages as norwegian or german? They has 
compound words and ispell dictionary can split them to lexemes. But, usialy 
there is more than one variant of separation:


forbruksvaremerkelov
forbruk vare merke lov
forbruk vare merkelov
forbruk varemerke lov
forbruk varemerkelov
forbruksvare merke lov
forbruksvare merkelov
(notice: I don't know translation, just an example. When we working on compound 
word support we found word which has 24 variant of separation!!)


So, query 'a + forbruksvaremerkelov' will be awful:

a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )

Of course, that is examle just from mind, but solution of phrase search should 
work reasonably with such corner cases.




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] TSearch2 / Get all unique lexems

2005-12-08 Thread Teodor Sigaev
Thanks. I hoped for something possible inside a pl/pgsql proc. I'm 
trying to integrate pg_trgm with Tsearch2. I'm still on my UTF-8 
database. Yes I know, there is _NO_ UTF-8 support of any kind in 
Tsearch2 yet, but I got it working to a degree that is OK for my 
application (Created my own stemmer variant, ispell dict, affix file 
etc). The last missing bit is to get a source for pg_trgm. I cannot use 
the the stat() function, because it breaks as soon it sees an UTF-8 char.


I suppose noncompatible with UTF wordparser can produce illegal lexemes (with 
part of multibyte char) and stores it in tsvector. Tsvector hasn't any control 
of breakness lexemes (with a help pg_verifymbstr() call), but stat() makes text 
field and then postgres check it and found incomplete mbchars. Which way I see 
(except waiting UTF support in tsearch2 which we develop now):


1 modify stat() function to check text field and if it fails then remove lexeme 
from output


2 Take from CVS HEAD wordpaser (ts_locale.[ch], wparser_def.c, 
wordparser/parser.[ch]). to_tsvector will works fine, to_tsquery will works 
correct only with quoted string  (for examle, 'foo' & 'bar',  bad: foo & bar). 
But casting 'asasas'::tsvector and dump/reload will not work correct.





--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] TSearch2 / German compound words / UTF-8

2006-01-27 Thread Teodor Sigaev

contrib_regression=# insert into pg_ts_dict values (
 'norwegian_ispell',
  (select dict_init from pg_ts_dict where dict_name='ispell_template'),
  'DictFile="/usr/local/share/ispell/norsk.dict" ,'
  'AffFile ="/usr/local/share/ispell/norsk.aff"',
 (select dict_lexize from pg_ts_dict where dict_name='ispell_template'),
 'Norwegian ISpell dictionary'
   );
INSERT 16681 1
contrib_regression=# select lexize('norwegian_ispell','politimester');
  lexize
--
 {politimester,politi,mester,politi,mest}
(1 row)

contrib_regression=# select lexize('norwegian_ispell','sjokoladefabrikk');
lexize
--
 {sjokoladefabrikk,sjokolade,fabrikk}
(1 row)

contrib_regression=# select lexize('norwegian_ispell','overtrekksgrilldresser');
 lexize
-
 {overtrekk,grill,dress}
(1 row)
% psql -l
   List of databases
Name| Owner  | Encoding
++--
 contrib_regression | teodor | KOI8
 postgres   | pgsql  | KOI8
 template0  | pgsql  | KOI8
 template1  | pgsql  | KOI8
(4 rows)


I'm afraid that UTF-8 problem. We just committed in CVS HEAD multibyte support 
for tsearch2, so you can try it.


Pls, notice, the dict, aff stopword files should be in server encoding. Snowball 
sources for german (and other) in UTF8 can be founded in 
http://snowball.tartarus.org/dist/libstemmer_c.tgz


To all: May be, we should put all snowball's stemmers (for all available 
languages and encodings) to tsearch2 directory?


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Tsearch2 doesn't work

2005-08-04 Thread Teodor Sigaev

You shoud run configure script in a root pgsql's source tree.

Supiah Mustaffa wrote:

Hi,

I want to install Tsearch2 and follow  this command :
1.cd contrib/tsearch2
2. make; make install

It doesn't work and display this messages :
Makefile:5: ../../src/Makefile.global: No such file or directory
Makefile:47: /contrib/contrib-global.mk: No such file or directory
make: *** No rule to make target `/contrib/contrib-global.mk'.  Stop.
Makefile:5: ../../src/Makefile.global: No such file or directory
Makefile:47: /contrib/contrib-global.mk: No such file or directory
make: *** No rule to make target `/contrib/contrib-global.mk'.  Stop.

What it's means? What should I do? Please help me.

Thanks.


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] Regression - Query requires full scan, GIN doesn't support it

2007-06-22 Thread Teodor Sigaev

Is this a permanent limitation of GIN, or is a fix possible?
Permanent. You could check user input by querytree() function --- if it returns 
'T' string then fullscan will be needed. If your tsquery is produced by 
plainto_tsquery() call then it will not find any result, so you can show to user 
void page.



Is a fix being worked on?
If a fix is forthcoming, will it be available in the 8.2 series or only 8.3+?


Possibly, full fix in 8.4. But I will not promise.
8.3 will have protection from queries which doesn't match anything.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] PickSplit method of 2 columns ... error

2007-08-28 Thread Teodor Sigaev
Split page algorithm was rewrited for 8.2 for multicolumn indexes and API for 
user-defined pickSplit function was extended to has better results with index 
creation. But GiST can interact with old functions - and it says about this. 
That isn't mean some real problem or error - index will be the same as in 8.1, 
not better.




Kevin Neufeld wrote:

Has anyone come across this error before?

LOG:  PickSplit method of 2 columns of index 
'asset_position_lines_asset_cubespacetime_idx' doesn't support secondary 
split


This is a multi-column GiST index on an integer and a cube (a data type 
from the postgres cube extension module).


I traced the error to the gistUserPicksplit 
 function in the 
gistsplit.c ... I surmise that this method is called whenever a page 
split is necessary.


So, I know when this error occurs, but I don't know why.

Thoughts anyone?
Cheers,
Kevin



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] tsearch2 anomoly?

2007-09-07 Thread Teodor Sigaev
Usual text hasn't strict syntax rules, so parser tries to recognize most 
probable token.  Something with '.', '-' and alnum characters is often a 
filename, but filename is very rare finished or started by dot.


RC Gobeille wrote:

Thanks and I didn't know about ts_debug, so thanks for that also.

For the record, I see how to use my own processing function (e.g.
dropatsymbol) to get what I need:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro
.html

However, can you explain the logic behind the parsing difference if I just
add a ".s" to a string:


ossdb=# select ts_debug('gallery2-httpd-2.1-conf.');
ts_debug
---
 (default,hword,"Hyphenated word",gallery2-httpd-2,{simple},"'2' 'httpd'
'gallery2' 'gallery2-httpd-2'")
 (default,part_hword,"Part of hyphenated word",gallery2,{simple},'gallery2')
 (default,lpart_hword,"Latin part of hyphenated
word",httpd,{en_stem},'httpd')
 (default,float,"Decimal notation",2.1,{simple},'2.1')
 (default,lpart_hword,"Latin part of hyphenated word",conf,{en_stem},'conf')
(5 rows)

ossdb=# select ts_debug('gallery2-httpd-2.1-conf.s');
  ts_debug
-
 (default,host,Host,gallery2-httpd-2.1-conf.s,{simple},'gallery2-httpd-2.1-c
onf.s')
(1 row)

Thanks again,
Bob


On 9/6/07 11:19 AM, "Oleg Bartunov" <[EMAIL PROTECTED]> wrote:


This is how default parser works.  See output from
select * from ts_debug('gallery2-httpd-conf');
and
select * from ts_debug('httpd-2.2.3-5.src.rpm');

All token type:

select * from token_type();


On Thu, 6 Sep 2007, RC Gobeille wrote:


I'm having trouble understanding to_tsvector.  (PostreSQL 8.1.9 contrib)

In this first case converting 'gallery2-httpd-conf' makes sense to me and is
exactly what I want.  It looks like the entire string is indexed plus the
substrings broken by '-' are indexed.


ossdb=# select to_tsvector('gallery2-httpd-conf');
 to_tsvector
-
'conf':4 'httpd':3 'gallery2':2 'gallery2-httpd-conf':1


However, I'd expect the same to happen in the httpd example - but it does not
appear to.

ossdb=# select to_tsvector('httpd-2.2.3-5.src.rpm');
  to_tsvector
---
'httpd-2.2.3-5.src.rpm':1

Why don't I get: 'httpd', 'src', 'rpm', 'httpd-2.2.3-5.src.rpm' ?

Is this a bug or design?


Thank you!
Bob

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83





--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] Tsearch2 - spanish

2007-09-18 Thread Teodor Sigaev

prueba=# select to_tsvector('espanol','melón');
ERROR:  Affix parse error at 506 line

and

prueba=# select lexize('sp','melón');
 lexize  
-

 {melon}
(1 row)


Looks very strange, can you provide list of dictionaries and configuration map?


I tried many dictionaries with the same results. Also I change the
codeset of files :aff and dict (from "latin1 to utf8" and "utf8 to
iso88591") and got the same error

where  can I investigate for resolve about this problem?

My dictionary at 506 line had:

Where do you take this file? And what is encdoing/locale setting of your db?

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Tsearch2 - spanish

2007-09-19 Thread Teodor Sigaev

prueba1=# select to_tsvector('espanol','melón  perro mordelón');
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 



Hmm, can you provide backtrace?

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] tsearch2 document and word limit

2005-01-27 Thread Teodor Sigaev
Sorry, but no way except patching sources of tsearch2
Tsearch2 (not GiST) has pointed limitations  to save storage size mainly and to 
reduce rank calculation time. Our (oleg and me) expirience in search engines 
shows, that full positions info for long document hasn't a big importance to 
ranking.
Did you try normalize rank by length of document?

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-ref.html:
...
Both of these ranking functions take an integer normalization option that 
specifies whether a document's length should impact its rank. This is often 
desirable, since a hundred-word document with five instances of a search word is 
probably more relevant than a thousand-word document with five instances. The 
option can have the values:
* 0 (the default) ignores document length.
* 1 divides the rank by the logarithm of the length.
* 2 divides the rank by the length itself.
...


David Beavan wrote:
Hi
I have been toying with the implementation of tsearch2 to index some 
large text documents. I have run into problems where I am up against 
limits:

no more than 255 occurrences of a particular word are indexed.
word positions greater than 16384 are added as position 16384 and end up 
as one occurrence.

These are problematic because I need to rank based on number of word 
occurrences, and these limits are preventing this.

Does anybody have any suggestions as to how this could be worked around, 
is the limit due to gist? would openfts help (im guessing not)?

Failing that does anybody have experience of combining another text 
indexing package with postgresql?

Dave

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [GENERAL] multicolumn GIST index question

2005-03-01 Thread Teodor Sigaev
Did anyone get multi-column GIST indexes working using both
the gist_btree and postgis modules?
It must.
fl=# -- sessionid is a text;  the_geom is a GEOMETRY
fl=# create index testidx2 on user_point_features using gist 
(sessionid,the_geom);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
fl=#
fl=#
fl=#
fl=# create table test3(a text, b text, c text);
CREATE TABLE
fl=# select addgeometrycolumn ('','test3','the_geom','-1','POINT',2);
addgeometrycolumn
--
 public.test3.the_geom SRID:-1 TYPE:POINT DIMS:2
 geometry_column fixed:0
(1 row)
fl=# create index idx_text_text on test3 using gist (a,b);
CREATE INDEX
fl=# create index idx_text_geom on test3 using gist (a,the_geom);
CREATE INDEX
fl=#

Is a sessionid from user_point_features table unique?
Pls, try to compile database with --enable-cassert --enable-debug and send gdb 
output.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [GENERAL] ERROR: Gin doesn't support full scan due to it's awful

2006-09-06 Thread Teodor Sigaev

explain analyze
select *
from test.features
where to_tsquery('') @@ features.vector

ERROR:  Gin doesn't support full scan due to it's awful inefficiency


Look:
contrib_regression=# select '{1,2,3}'::int4[] @ '{}';  --contains
 ?column?
--
 t
(1 row)

contrib_regression=# select '{1,2,3}'::int4[] && '{}'; --overlap
 ?column?
--
 f
(1 row)

contrib_regression=# select to_tsvector('asdasd') @@ ''::tsquery;
NOTICE:  query doesn't contain lexeme(s)
 ?column?
--
 f
(1 row)

Semantic of different operation with void (but not NULL) argument is very 
different. If query doesn't contain any entry (returned by extractQuery() index 
support method), then GIN, in any case, doesn't know what it should return: 
whole set of pointers or nothing. But GIN can't return all - it will be 
very-very slow, because there is a lot of pointers in GIN index to each table's row.


It seems to me that message makes confuse about reason of error...





Interestingly this works:

explain analyze
select *
from test.features
where NULL @@ features.vector


That is because @@ is marked as 'returns NULL on NULL input', ie index will not 
be used.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] ERROR: Gin doesn't support full scan due to it's awful

2006-09-06 Thread Teodor Sigaev
I see why it would now.  But it would be nice if the message mentioned 
the root cause of the problem  - that the tsquery value does not contain 
any lexemes and thus is not valid for doing a search.


... doing a index search :(. GIN core doesn't know anything about 
tsvector/tsquery/tsearch - so, error message should mention something like this:

GIN index doesn't support search with void argument.




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] cyclical redundancy checksum algorithm(s)?

2006-09-28 Thread Teodor Sigaev

You sure that's actually what he said?  A change in CRC proves the data
changed, but lack of a change does not prove it didn't.


"To quickly determine if rows have changed, we rely on a cyclic
redundancy checksum (CRC) algorithm.   If the CRC is identical for the

>

"summary" functions, such as an MD5 hash.  I wouldn't trust it at all
with a 32-bit CRC, and not much with a 64-bit CRC.  Too much risk of
collision.


Small example of collisions for crc32:
0x38ee5531
Hundemarke
92294
0x59471e4f
raciner
tranchefiler
0x947bb6c0
Betriebsteile
4245


I had make a lot of work when choosing hash function for tsearch2. Also, I had 
find that popular hash algorithms produce more collision for non-ascii 
languages... CRC32 is more "smooth".
On dictionary with 332296 unique words CRC32 produces 11 collisions, perl's hash 
function - 35, pgsql's hash_any - 12.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] Full Text fuzzy search

2006-09-28 Thread Teodor Sigaev

Play around contrib/tsearch2 and contrib/pg_trgm

Michael Vodep wrote:

Hi

How can i do a full text fuzzy search with PgSQL?

For example: i got a text domain. I search for Postgresql
He should return a result for 'Postgresql' if i enter following values:
PostgreSQL, Postgresql, postgresql
but also
bostgresql, posdgresql, bosdgresql

Is this possible?

Regrads
michael

---(end of broadcast)---
TIP 6: explain analyze is your friend


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [GENERAL] Full Text fuzzy search

2006-09-28 Thread Teodor Sigaev
For the others (bostgresql, posdgresql, bosdgresql) i think u should 
write a function, that's what we did in our application.


contrib/pg_trgm solves it

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] GiST and Gin technical doc

2006-10-02 Thread Teodor Sigaev

GiST:

http://gist.cs.berkeley.edu/
http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/

GIN:
GIN is ordinary inverted index which described in many docs.
Look for some implementation details in my presentation on conference:
http://www.sigaev.ru/gin/Gin.pdf

Yannick Warnier wrote:

Hi there,

I'm having trouble finding technical documentation about GiST, Gin and
TSearch2.

I am particularly interested in the internal data structure of a GiST-ed
or Gin-ed index, and the availability of searches by proximity.

Does anyone know of a good place to find such doc, outside from the
source code (I'm not too much into C/C++)?

Thank you,

Yannick


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] NOTICE: word is too long INSERT 0 3014

2006-10-30 Thread Teodor Sigaev

I am running into this limitation ALOT with Tsearch2. What are my
options to get around it. Do I have to compile PostgreSQL with a
different block size?

If yes, what are the downsides to doing so (outside of not being able to
do straight upgrades)?


If you really need that, your should play around WordEntry definition 
(tsvector.h). Sorry, right now I haven't possibility to look closer, 
just tomorrow.
Limit for word is equal 2KB - I supposed that is long enough to store 
any meaningful words.



---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] NOTICE: word is too long INSERT 0 3014

2006-10-30 Thread Teodor Sigaev

For example, redefine by follow way:

typedef struct
{
uint32
haspos:1,
len:31;
uint32  pos;
}   WordEntry;

/* <= 1Gb */
#define MAXSTRLEN ( 1<<30 )
#define MAXSTRPOS ( 1<<30 )

Teodor Sigaev wrote:

I am running into this limitation ALOT with Tsearch2. What are my
options to get around it. Do I have to compile PostgreSQL with a
different block size?

If yes, what are the downsides to doing so (outside of not being able to
do straight upgrades)?


If you really need that, your should play around WordEntry definition 
(tsvector.h). Sorry, right now I haven't possibility to look closer, 
just tomorrow.
Limit for word is equal 2KB - I supposed that is long enough to store 
any meaningful words.



---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] Index greater than 8k

2006-10-30 Thread Teodor Sigaev

The problem I am after is the 8k index size issue. It is very easy to
get a GIST index (especially when using tsearch2) that is larger than that.

Hmm, tsearch2 GIST index  is specially designed for support huge index entry:
first, every lexemes in tsvectore are transformed to hash value (with a help of 
crc32), second, it's stripped all position infos, third, if size of array is 
greater than TOAST_INDEX_TARGET then tsearch2 will make bit signature of 
tsvector. Signature's length is fixed and equals to 252 bytes by default (+ 8 
bytes for header of datum). All values on internal pages are represented as 
signatures below.


So, tsearch2 guarantees that index entry will be small enough. If it's not true, 
then there is a bug - pls, make test suite demonstrating the problem.



Is recompiling the block size the option there?
What are the downsides, except for the custom build?


Can you send exact error message?

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [HACKERS] [GENERAL] Index greater than 8k

2006-10-31 Thread Teodor Sigaev
The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent a 
self contained test case directly to  Teodor  which shows the error. 


'ERROR:  index row requires 8792 bytes, maximum size is 8191'
Uh, I see. But I'm really surprised why do you use pg_trgm on big text? pg_trgm 
is designed to find similar words and use technique known as trigrams. This will 
 work good on small pieces of text such as words or set expression. But all big 
texts (on the same language) will be similar :(. So, I didn't take care about 
guarantee that index tuple's size limitation. In principle, it's possible to 
modify pg_trgm to have such guarantee, but index becomes lossy - all tuples 
gotten  from index should be checked by table's tuple evaluation.


If you want to search similar documents I can recommend to have a look to 
fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's pretty 
close to trigrams and metrics of similarity is the same, but uses another 
signature calculations. And, there are some tips and trics: removing HTML 
marking,removing punctuation, lowercasing text and so on - it's interesting and 
complex task.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [GENERAL] Index greater than 8k

2006-11-01 Thread Teodor Sigaev

We are trying to get something faster than ~ '%foo%';
Which Tsearch2 does not give us :)


Hmm, why not with GIN? :) Although, right now GIN doesn't support prefix search, 
it's possible to use it.


Brain storm method:

Develop a dictionary which returns all substring for lexeme, for example for 
word foobar it will be 'foobar fooba foob foo fo oobar ooba oob oo obar oba ob 
bar ba ar'. And make GIN functional index over your column (to save disk space).

So, your query will be looked as
select ... where to_tsvector(text_column) @@ 'foo';
Notices:
Time of search in GIN weak depend on number of words (opposite to 
tsearch2/GiST), but insertion of row may be slow enough


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] tsearch2() with data from other table

2006-11-10 Thread Teodor Sigaev

CREATE FUNCTION euits(int)
RETURNS text AS 'select username || \' \' || firstname || \' \' || 
lastname from site_user where objectid = $1;' LANGUAGE SQL;


CREATE TRIGGER site_item_fts
BEFORE UPDATE OR INSERT ON site_item
FOR EACH ROW EXECUTE PROCEDURE
tsearch2(idxfti, name, keywords, keywords_cb, location_country, 
location_city, media_source, description, euits, site_user);


site_user is a table, isn't it?
tsearch2 trigger accepts only column's names and functions with prototype TEXT 
func(TEXT).


For clarify, show your table's definitions.
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] tsearch2() with data from other table

2006-11-10 Thread Teodor Sigaev

CREATE FUNCTION euits(int)
RETURNS text AS 'select username || \' \' || firstname || \' \' || 
lastname from site_user where objectid = $1;' LANGUAGE SQL;


CREATE TRIGGER site_item_fts
BEFORE UPDATE OR INSERT ON site_item
FOR EACH ROW EXECUTE PROCEDURE
tsearch2(idxfti, name, keywords, keywords_cb, location_country, 
location_city, media_source, description, euits, site_user);


So, when updating site_item I want to fetch the names from site_user and 
 give this data to tsearch2() along with other stuff from site_item.


Sorry, current interface of tsvector aggregate doesn't support it.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] Using GIN indexes on 8.2

2006-11-10 Thread Teodor Sigaev



Alexander Staubo wrote:
Two questions about GIN on 8.2. There's not much documentation about 
GIN, but this should be possible:


  create table foo (values text[]);
  create index foo_values_index on foo using gin (text);

However, this then fails saying the operator "@" does not exist:

  select * from foo where values @ '{hello, world}'


Use @>, <@ operations instead of @ and ~
Look for discussions in -hackers for reasons of changing names

Do I need to reference a specific opclass when creating the index? From 
the documentation I got the impression that GIN bundled operators for 
most built-in types.

if there is a default opclass for your datatype - you may do not specify.



Secondly, are GIN indexes immutable and (unlike Tsearch2) non-lossy and 
therefore useful with functional indexes? I would like to do this:


  create table bar (value text);
  create index bar_value_index on bar using gin (analyze(value));

where analyze() is a function of my own that tokenizes, stems and 
filters the text into a text[] array.


Be careful -
select
pg_opclass.opcname,
pg_operator.oprname,
pg_amop.amopreqcheck
from
pg_opclass,
pg_operator,
pg_amop,
pg_am
where
pg_operator.oid = pg_amop.amopopr and
pg_opclass.oid = pg_amop.amopclaid and
pg_opclass.opcamid = pg_am.oid and
pg_am.amname='gin' and pg_opclass.opcname='_text_ops';
  opcname  | oprname | amopreqcheck
---+-+--
 _text_ops | &&  | f
 _text_ops | @>  | f
 _text_ops | <@  | t
 _text_ops | =   | t
(4 rows)

So, operations <@ and = will recheck result with table's row.

Pls, why don't you use tsearch2 with GIN?


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] Using GIN indexes on 8.2

2006-11-10 Thread Teodor Sigaev

Use @>, <@ operations instead of @ and ~
Look for discussions in -hackers for reasons of changing names


Ah, many thanks. How about updating those web pages? :)

Now they are in core:
http://developer.postgresql.org/pgdocs/postgres/functions-array.html




Pls, why don't you use tsearch2 with GIN?


Perhaps -- is there any documentation on this?

Nothing special, just use GIN index instead of GiST.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] PG_MODULE_MAGIC check in 8.2

2006-11-14 Thread Teodor Sigaev

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

solves your problem
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] Functional Index

2006-11-22 Thread Teodor Sigaev
use varchar_pattern_ops operator class, LIKE cannot use varchar_ops for non-C 
locales.


Alexander Presber wrote:

Hello everybody,

I am trying to speed up a query on an integer column by defining an 
index as follows


 > CREATE INDEX idx_main_subject ON pdb.main (lower(main_subject::text) 
using varchar_ops);


on column "main_subject".

I had hoped to get speedups for right-fuzzy LIKE-searches, but EXPLAIN 
ANALYZE yields that the index is not used:


 > EXPLAIN ANALYZE SELECT COUNT(*) FROM pdb.main WHERE 
lower(main_subject::text) LIKE lower('10%'::text);
QUERY 
PLAN
 

Aggregate  (cost=137759.92..137759.93 rows=1 width=0) (actual 
time=3421.696..3421.697 rows=1 loops=1)
  ->  Seq Scan on main  (cost=0.00..137727.17 rows=13096 width=0) 
(actual time=0.036..3300.961 rows=77577 loops=1)

Filter: (lower((main_subject)::text) ~~ '10%'::text)
Total runtime: 3421.751 ms
(4 Zeilen)


Am I misunderstanding the concept of functional indexes? Is there 
another way to achieve

Any help is greatly
appreciated.

Yours,
Alexander Presber

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] tsearch to spellcheck

2006-11-24 Thread Teodor Sigaev
You can use lexize() function from tsearch2 (with properly configured ispell 
dictionary) - if it returns not NULL value the word is ok.

Also have a look to pg_trgm contrib modle.

SunWuKung wrote:

Is it possible to use tsearch2 to check spelling? Something like a
function that takes a single word (and a language id) and returns true
if spelled correctly false otherwise.

Thanks.
Balazs


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] postgresql 8.2 rc1 - crash

2006-11-30 Thread Teodor Sigaev

While I'm downloading your file, pls, do follow:
gdb /usr/local/pgsql/bin/postgres your_core_file

If it's needed, change path to postgres file.

In gdb, type
# bt
and send output

hubert depesz lubaczewski wrote:

hi,
i have been testing 8.2 rc1, while i got this problem.

base data:
linux, 32bit, kernel: 2.6.18.3 <http://2.6.18.3>; debian
postgresql version:
PostgreSQL 8.2rc1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.5 
(Debian 1:3.3.5-13)
problematic table is over 2gigabytes in size, and has several indices - 
one of them is gin-index.


problem:
when i issue vacuum full verbose analyze it works, but then crashes with 
signal 11.

always in the same situation.

i was not able to check what is the reason.
i did:
recompile with debug, set ulimit -c unlimited, and rerun the query.
it crashed.
i bundled:
1. logs
2. core file
3. config of postgresql
4. saved output of vacuum
all of this can be fetched from: 
http://depesz.com/various/crash.data.tar.bz2
unfortunatelly i'm not c programmer, so i dont know gdb, but i hope you 
will be ab le to make any sense out of it.


the bz2 file is > 20mb in size.

any help? is it hardware problem? or a missed bug in code?

if i can provide you with more information - please tell me what i 
should tell you.


depesz

--
http://www.depesz.com/ - nowy, lepszy depesz


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] postgresql 8.2 rc1 - crash

2006-11-30 Thread Teodor Sigaev
#1  0x080bc224 in PageDeletePostingItem (page=0xb28039a0 "\020", 
offset=53719) at gindatapage.c:291
#2  0x080bf558 in ginDeletePage (gvs=0xbfc2ab80, deleteBlkno=29194, 
leftBlkno=29059, parentBlkno=70274, myoff=351, isParentRoot=0 '\0') at 
ginvacuum.c:268


Are you sure about your hardware? myoff in ginDeletePage() and offset in 
PageDeletePostingItem are the same variable...


Pls, send to me postgres file itself - just core isn't very useful for debug.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] postgresql 8.2 rc1 - crash

2006-11-30 Thread Teodor Sigaev

I reproduce a problem with small script:
print <5000 and i<40;
VACUUM FULL ANALYZE qq;

EOT

So, I'm digging now...
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] postgresql 8.2 rc1 - crash

2006-11-30 Thread Teodor Sigaev
Fixed, thank you. Changes are commited in CVS, pls, try it (I think that index 
is corrupted, so you need to recreate it)




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] Tsearch2 - Error using tsquery object in GROUP BBY

2006-12-01 Thread Teodor Sigaev
The query I'm executing has a GROUP BY clause and because I'm using the 
headline function I need my tsquery object in the SELECT clause and 
therefor also in the GROUP BY clause.
BUT when executing the query I get the following error. -> "could not 
identify an ordering operator for type tsquery"


Which version do you use? Only 8.2 will have comparison operator for tsquery.



How do I solve this?

Use subselect:
SELECT *, headline(..., to_tsquery()) FROM
{
SELECT  GROUP BY ...
);

BTW, for performance reason, it might be useful to use limit in inner select -
headline function is slow enough.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Tsearch2 - Error using tsquery object in GROUP BBY

2006-12-01 Thread Teodor Sigaev
I'm using 8.1.4 at the moment but I guess I need to update. The 8.2 is 
looking really promising. So with 8.2 I don't need the subselect?


IMHO, don't need


headline function is slow enough.

You think?! ;)


I known :) - computing headline is a hard task
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] Speed of postgres compared to ms sql, is this

2006-12-05 Thread Teodor Sigaev

These sorts of reports would be far more helpful if they contained some
specifics.  What queries does MSSQL do better than Postgres, exactly?


Our OR-patch was inspired by our customer migrating from MS SQL to postgres. 
Next, index support of IS NULL. And, there is a huge difference in performance 
for queries like

select * from a,b where a.f = b.f or ( a.f is null and b.f is null)

NULL support is fast in MS SQL because MS SQL doesn't follow SQL standard: index 
in MS SQL believes that (NULL = NULL) is true.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] PG 8.2.0 - TSearch2 "Wrong affix file format"

2006-12-21 Thread Teodor Sigaev

Send to me dict and aff file, pls

By the way, tsearch2 is changed significantly, so, the better way to update is a 
restoring only your data&schema. Tsearch2 should be installed from contrib.


Tarabas (Manuel Rorarius) wrote:

Hi!

  I have a problem migrating my Database using TSearch2 with the UTF-8
  Backport from 8.1.3 to a new database with 8.2.0.
  
  I successfully installed postgres and the TSearch2 distributed with

  it and copied the german.aff/german.med/german.stop and
  german.stop.ispell from my old postgres 8.1.3 installation to the
  same location in the 8.2.0 install.

  Then I dumped the old database with
  
  "./pg_dump  -f backup-file" and restored it on the

  8.2.0 database successfully without errors.

  I am using UTF-8 database and files for .aff/.med/.stop/.stop.ispell

  When I now try a TSearch2 Command like

  "SELECT set_curdict('de_ispell');"

  I get the error
  
  "ERROR:  Wrong affix file format" although the file was not changed

  and worked fine on the 8.1.3 Databse with the UTF-8 backport Patch
  from 8.2.0.

  Anyone have any idea how to fix the files so they will work with
  8.2.0 also? The files seem to be ok and are UTF-8 encoded.

Best regards
Manuel


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org/


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] PG 8.2.0 - TSearch2 "Wrong affix file format"

2006-12-21 Thread Teodor Sigaev

Affix file has artifact:
PFX G N 1
   .> GE

which is strange mix of openoffice format and ispell format. Just remove they.

8.2 Ispell code checks format strongly that in previous versions, even 
backported :)

Tarabas (Manuel Rorarius) wrote:

Hi,

TS> Send to me dict and aff file, pls

see attached .aff file, I have not created the file from a dict
myself but taken the .aff/.med/.stop/.stop.ispell from this blog:

http://www.tauceti.net/roller/cetixx/category/Tipps ...

TS> By the way, tsearch2 is changed significantly, so, the better way to update 
is a
TS> restoring only your data&schema. Tsearch2 should be installed from contrib.

That's what I did ... I used the tsearch delivered with 8.2.0 contrib
for the install. Only the Schema and Data for my database was imported with the
restore from the old system, so that should all be setup correctly :-)

I also tested the correct tsearch2 install by removing all lines from the .aff
file and the error then vanishes and the search works, but without the
.aff I guess a key feature is missing :-)

Best regards
Manuel





---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [GENERAL] TSearch2 Changeset 25387

2006-12-21 Thread Teodor Sigaev
Are you trying to convert openoffice (myspell) format to ispell with help of 
my2ispell?


It seems to me, I see the problem. m2ispell doesn't convert prefixes which can 
not be combined with every word ('N' in myspell). So, ispell file will contain 
wrong line begining with PFX...


I'll fixed that.

Hannes Dorbath wrote:

http://projects.commandprompt.com/public/pgsql/changeset/25387

Though I'm probably start going on Oleg's nerves.. :/

I'm still trying to get compound word support for my dictionaries back, 
while migrating from 8.1.5-gin-utf8 to 8.2.


Can someone give me additional information on that change? My affix file 
trigger that oldFormat condition on line 472. Where is the change in 
affix file format documented? What has changed? Any way to convert them?


I found some OpenOffice pages about it, but I failed to find what I'm 
looking for.


IIRC I had TSearch2 with my `oldFormat' files working on an older 
8.2-dev-snapshot.



Thanks for any hint.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] TSearch2 Changeset 25387

2006-12-21 Thread Teodor Sigaev

Hmm, 2.0.1. But what's the difference? I don't watch changes in OpenOffice 
hardly.



Hannes Dorbath wrote:
What version of OpenOffice MySpell dictionaries is supposed to work with 
TSearch in 8.2?


The format used till OpenOffice 2.0.1 or the format starting from 2.0.2?


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] TSearch2 Changeset 25387

2006-12-21 Thread Teodor Sigaev

Oh, I see. So, only 2.0.1 and I can't change that for 8.2 branch. :(

Hannes Dorbath wrote:

On 21.12.2006 18:32, Teodor Sigaev wrote:
Are you trying to convert openoffice (myspell) format to ispell with 
help of my2ispell?


Yes:

http://groups.google.com/group/pgsql.general/browse_thread/thread/c21872aca3754a06/3a909c0e1f05a5af 



I'm really unsure what someone is supposed to do, to get compound word 
support in 8.2 working.


http://projects.commandprompt.com/public/pgsql/changeset/25387

In the comment it is stated that for German one should still use 
my2ispell. I had no luck with that.


One the other hand:

http://wiki.services.openoffice.org/wiki/Dictionaries#German_.28Germany.2C_29 



tells that the new MySpell dicts, starting from OO 2.0.2, should be fine 
for compound word support.


Thanks for your time.




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] Tsearch2 default locale on postgres 8.2

2006-12-22 Thread Teodor Sigaev

set_curcfg() is working only for current session

Tarabas (Manuel Rorarius) wrote:

Hi!

   I am having a tsearch2 problem on postgres 8.2 again ...
   when I try to set the default config for tsearch2 with

   "select set_curcfg('default'); it works fine in the same pgadmin
   session when i use "select show_curcfg();" afterwards. The correct
   OID is shown.

   If i then close the query window and open a new one and then try
   the "select show_curcfg();" again, it states
   "ERROR:  could not find tsearch config by locale"

   Any idea why the configuration is not saved correctly?

Best regards
Manuel ...


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] Segmentation fault with 8.3 FTS ISpell

2008-01-15 Thread Teodor Sigaev

I tryed to reproduce the bug but without success.
Could you provide a dump of text column?


Hannes Dorbath wrote:
Crash happens about 7 minutes after issuing the UPDATE statement with 
current CVS HEAD. The table has around 5 million rows. It's always 
reproducible.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Segmentation fault with 8.3 FTS ISpell

2008-01-16 Thread Teodor Sigaev

Fixes are committed to CVS, hope, they will help you.
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] postgres 8.3 rc-1 ispell installation problem

2008-01-20 Thread Teodor Sigaev

flag *J:# isimo
   E> -E, 'ISIMO  #  grand'isimo  <-- here  432
E   > -E, 'ISIMOS   # grande grand'isimos
E   > -E, 'ISIMA# grande grand'isima
E   > -E, 'ISIMAS   # grande grand'isimas
O   > -O, 'ISIMO# tonto tont'isimo
O   > -O, 'ISIMA# tonto tont'isima


Current implementation doesn't accept any character in ending except alpha ones.

i think 'I..  word is not correct for ispell, 
this should be one Í letter 

That's right, but you should convert dictionary and affix file in UTF8 encoding.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] full text index and most frequently used words

2008-02-08 Thread Teodor Sigaev

What I'd like to know is if there is an easy to way to use the full
text index to generate a list of the most common words.  I could write
this code manually, but I'm hoping there's a better (simpler) way.


For 8.3
http://www.postgresql.org/docs/8.3/static/textsearch-features.html#TEXTSEARCH-STATISTICS

For versions before 8.3 just use stat() function instead of ts_stat().
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Fragments in tsearch2 headline

2008-03-17 Thread Teodor Sigaev



Teodor, Oleg, do we want this?
http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php


I suppose, we want it. But there are a questions/issues:
- Is it needed to introduce new function? may be it will be better to add option 
to existing headline function. I'd like to keep current layout: ts_headline 
provides some common interface to headline generation. Finding and marking 
fragments is deal of parser's headline method and generation of exact pieces of 
text is made by ts_headline.

- Covers may be overlapped. So, overlapped fragments will be looked odd.


In any case, the patch was developed for contrib version of tsearch.
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Fragments in tsearch2 headline

2008-03-31 Thread Teodor Sigaev

The patch takes into account the corner case of overlap. Here is the
code for that
// start check
if (!startHL && *currentpos >= startpos)
   startHL = 1;

The headline generation will not start until currentpos has gone past
startpos. 

Ok



You can also check how this headline function is working at my website
indiankanoon.com. Some example queries are murder, freedom of speech,
freedom of press etc.

Looks good.


Should I develop the patch for the current cvs head of postgres?


I'd like to commit your patch, but if it should be:
 - for current HEAD
 - as extension of existing ts_headline.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Direct access to GIST structure

2008-04-04 Thread Teodor Sigaev



encode spatial proximity. Is there an API (backend C-level is fine) to
access a GIST index?


The best way is to extend existing interface to GiST to support KNN-search. But 
you can see how to get access to index structure from module in gevel module 
(http://www.sigaev.ru/cvsweb/cvsweb.cgi/gevel/). GiST-related functions in this 
module is invented to help to developers, not for production use, so they 
acquire exclusive lock on index.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Direct access to GIST structure

2008-04-04 Thread Teodor Sigaev

I just stumbled on http://www.cs.purdue.edu/spgist/ which seems like
exactly what I need.


It doesn't work with 8.2 and up, because since 8.2 index should take care about 
concurrent access itself and that implementation doesn't do it.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] tsearch2 on-demand dictionary loading & using functions in tsearch2

2008-05-18 Thread Teodor Sigaev
 * Considering the database is loaded separately for each session, does 
this also imply that each running backend has a separate dictionary 
stored in memory?


Yes.


As for downsides, I only really see two:
 * Tracking updates of dictionaries - but it's reasonable to believe 
that new connections get open more often than the dictionary gets 
updated. Also, this might be easily solved by stat()-ing the dictionary 
file before starting up session, and only have the server reload it if 
there's a notified change.

 * Possibly complicated to implement?


Keeping dictionary up to date - it's a most difficult part here. Configuration 
of dictionary might be done by ALTER command - so, parent process (and all 
currently running backends) should get that information to reload dictionary.



As for my second question, is it possible to use functions in tsearch2? 
For example, writing my own stemmer in PL/pgSQL or in C as a postgres 
function.


Yes, of course, you can develop your dictionary (-ies) and parser. Dut only in 
C, because they are critical for performance.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] tsearch2 on-demand dictionary loading & using functions in tsearch2

2008-05-18 Thread Teodor Sigaev



Hmm, good point; I presume "accept the fact that settings change won't 
propagate to other backends until reconnect" would not be acceptable 
behavior, even if documented along with the relevant configuration option?


I suppose so. That was one of the reasons to move tsearch into core and it will 
be too regrettable to lost that feature again.


As for my second question, is it possible to use functions in 
tsearch2? For example, writing my own stemmer in PL/pgSQL or in C as 
a postgres function.
I've had something different in mind. Considering there are already 
facilities to use functions, be it PL/pgSQL, PL/Python or C, why not 
just use those with the condition that the function must accept 
some-arguments and return some-result? Or would using this, even while 
using C as the language used for the actual parser, slow things down too?


API to dictionary and parser intentionally utilizes complex (and nested) 
C-structures to decrease overheads. During parse of text postgres makes two call 
of parser (one call - parser returns word, second - word delimiter. Space is a 
lexeme too! Although it's not a subject to index) and one call of dictionary per 
word. So, if your language can work with C-structures then you can use that 
language with tsearch with more or less performance pay. PL/pgSQL hasn't this 
capability.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Fragments in tsearch2 headline

2008-05-23 Thread Teodor Sigaev

[moved to -hackers, because talk is about implementation details]


I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
(http://archives.postgresql.org/pgsql-general/2007-11/msg00508.php)

Thank you.

1 > diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c
now contrib/tsearch2 is compatibility layer for old applications - they don't
know about new features. So, this part isn't needed.

2 solution to compile function (ts_headline_with_fragments)  into core, but
using it only from contrib module looks very odd. So, new feature can be used
only with compatibility layer for old release :)

3 headline_with_fragments() is hardcoded to use default parser, but what will be
in case when configuration uses another parser? For example, for japanese 
language.

4 I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] )
and function should accept 'NumFragments=>N' for default parser. Another parsers
may use another options.

5 it just doesn't work correctly, because new code doesn't care of parser
specific type of lexemes.
contrib_regression=# select headline_with_fragments('english', 'wow asd-wow
wow', 'asd', '');
 headline_with_fragments
--
 ...wow asd-wowasd-wow wow
(1 row)


So, I incline to use existing framework/infrastructure although it may be a
subject to change.

Some description:
1 ts_headline defines a correct parser to use
2 it calls hlparsetext to split text into structure suitable for both goals:
find the best fragment(s) and concatenate that fragment(s) back to the text
representation
3 it calls parser specific method   prsheadline which works with preparsed text
(parse was done in hlparsetext). Method should mark a needed
words/parts/lexemes etc.
4 ts_headline glues fragments into text and returns that.

We need a parser's headline method because only parser knows all about its 
lexemes.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] multi-word expression full-text searching

2008-06-30 Thread Teodor Sigaev


SELECT id FROM document WHERE to_tsvector('english',text) @@
plainto_tsquery('english','despite this');
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] multi-word expression full-text searching

2008-07-01 Thread Teodor Sigaev

If I understand well the plainto_tsquery behaviour, this query match with:
Despite this, the president went out.
Despite the event, this question arise.


Right, you mean phrase search. Read the thread: 
http://archives.postgresql.org/pgsql-hackers/2008-05/msg0.php


Suggested patch should be made as module, I think.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] changing text search treatment of puncutation

2008-07-03 Thread Teodor Sigaev



In general there seem to be a lot of ways that people wish they
could tweak the text search parser, and telling them to write
their own parser isn't a very helpful response for most folk.
I don't have an idea about how to improve the situation, but
it seems like something that should be thought about.


We (with Oleg) thought hard about it and we don't find a solution yet.
Configurable parser should be:
- fast
- flexible
- not error-prone
- comfortable to use by non-programmer (at least for non-C programmer)

It might be a table-driven state machine (just put TParserStateAction into 
table(s) with some caching for first step) , but it's complex to operate and 
it's needed to prove correctness of changes in states before its become in use.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Using ISpell dictionary - headaches...

2008-07-22 Thread Teodor Sigaev


It *may* be because I'm using psql 8.0.3 and not the latest version (but 
I'm stucked with that version), i'm just hoping that one of you have met 


Upgrade to 8.0.17 - there was a several fixes in ISpell code.
--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] index scan leads to result that is different from sec scan after upgrading to 8.3.4

2008-10-20 Thread Teodor Sigaev

Hmm.  So the problem seems to be statable as "a full-index scan on a
GIST index might fail to return all the rows, if the index has been
modified since creation".  Teodor, can you think of anything you
changed recently in that area?


Only fixing possible duplicates during index scan. Will see.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


  1   2   >