[GENERAL] tsearch2 questions

2007-07-04 Thread Joshua N Pritikin
1. What is the advantage of the tsearch2() trigger? Why can't I write my 
own trigger which does approximately:

  UPDATE manuscript SET manuscript_vector =
    setweight(to_tsvector(manuscript_genre), 'A') ||
    setweight(to_tsvector(manuscript_title), 'B') ||
    to_tsvector(manuscript_abstract);

2. Is there a way to know in advance the maximum return value of the 
rank function? I have lots of other information to include in the 
goodness-of-match score besides the fulltext match rank so I would 
prefer a tsearch2 rank score between 0 and 1. Do I need to write my own 
rank function?

-- 
Make April 15 just another day, visit http://fairtax.org

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [GENERAL] tsearch2 questions

2007-07-04 Thread Oleg Bartunov

On Wed, 4 Jul 2007, Joshua N Pritikin wrote:


> 1. What is the advantage of the tsearch2() trigger? Why can't I write my
> own trigger which does approximately:

No advantage, it's just an example.

>  UPDATE manuscript set manuscript_vector =
>    setweight(to_tsvector(manuscript_genre), 'A') ||
>    setweight(to_tsvector(manuscript_title), 'B') ||
>    to_tsvector(manuscript_abstract);
>
> 2. Is there a way to know in advance the maximum return value of the
> rank function? I have lots of other information to include in the
> goodness-of-match score besides the fulltext match rank so I would
> prefer a tsearch2 rank score between 0 and 1. Do I need to write my own
> rank function?

What about a simple normalization formula, like rank/(rank+1)?


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [GENERAL] tsearch2 questions

2007-07-04 Thread Joshua N Pritikin
On Wed, Jul 04, 2007 at 10:59:46AM +0400, Oleg Bartunov wrote:
> On Wed, 4 Jul 2007, Joshua N Pritikin wrote:
>> 1. What is the advantage of the tsearch2() trigger? Why can't I write my
>> own trigger which does approximately:
>
> No advantage, it's just an example.

Please mention that in the documentation:

tsearch2() trigger used to automatically update vector_column_name, 
my_filter_name is the function name to preprocess text_column_name. 
There are can be many functions and text columns specified in tsearch2() 
trigger. The following rule used: function applied to all subsequent 
text columns until next function occurs. Example, function dropatsymbol 
replaces all entries of @ sign by space.

tsearch2() is an example. You are welcome to write your own trigger.

>> 2. Is there a way to know in advance the maximum return value of the
>> rank function? I have lots of other information to include in the
>> goodness-of-match score besides the fulltext match rank so I would
>> prefer a tsearch2 rank score between 0 and 1. Do I need to write my own
>> rank function?
>
> What about a simple normalization formula, like rank/(rank+1)?

I think you are suggesting that I use the best rank as the denominator 
for the rank column. Yes, I suppose that will work.

Thanks.



Re: [GENERAL] tsearch2 questions

2007-07-04 Thread hubert depesz lubaczewski

On 7/4/07, Joshua N Pritikin wrote:
> Please mention that in the documentation:

Don't you think this is perfectly clear?

"If you want to do something specific with columns, you may write your very
own trigger function using plpgsql or other procedural languages (but not
SQL, unfortunately) and use it instead of tsearch2 trigger."

>> What about a simple normalization formula, like rank/(rank+1)?
> I think you are suggesting that I use the best rank as the denominator
> for the rank column. Yes, I suppose that will work.

Actually, Oleg suggested not using the best rank, but just using the formula
as given, rank/(rank+1), to get a rank in the range of 0 to 1.

depesz


Re: [GENERAL] tsearch2 questions

2007-07-04 Thread Joshua N Pritikin
On Wed, Jul 04, 2007 at 10:40:11AM +0200, hubert depesz lubaczewski wrote:
> On 7/4/07, Joshua N Pritikin wrote:
>> Please mention that in the documentation:
>
> Don't you think this is perfectly clear?
>
> If you want to do something specific with columns, you may write your very
> own trigger function using plpgsql or other procedural languages (but not
> SQL, unfortunately) and use it instead of tsearch2 trigger.

From where are you quoting? I was quoting from:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-ref.html

>>> What about a simple normalization formula, like rank/(rank+1)?
>> I think you are suggesting that I use the best rank as the denominator
>> for the rank column. Yes, I suppose that will work.
>
> Actually, Oleg suggested not using the best rank, but just using the formula
> as given, rank/(rank+1), to get a rank in the range of 0 to 1.

OK, then what does the +1 mean in your formulae? Consider these results 
from [1]. rank/(rank+1): 0.19/.1 = 1.9, .1/.1 = 1, etc. That doesn't 
make sense. The reciprocal also doesn't make sense. So what does Oleg 
mean? I was guessing that Oleg meant to divide the rank column by the 
first rank, that is, by 0.19 so you would get 1, .52, .52, etc.

 id |                       headline                        | rank
----+-------------------------------------------------------+------
  3 | <b>crawling</b> over cobbles in a low <b>passage</b>. | 0.19
  1 | <b>crawl</b> over cobbles leads inward to the west.   |  0.1
  4 | <b>passages</b> lead east, north, and south.          |  0.1
  5 | <b>crawl</b> slants up.                               |  0.1
  7 | <b>passage</b> here is blocked by a recent  cave-in.  |  0.1

Am I being stupid?

[1] 
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html
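
The two readings discussed in this thread, dividing by the best rank versus applying rank/(rank+1) to each row on its own, can be compared on the rank column of the table above. This is only a rough sketch in Python rather than SQL; the function names normalize_by_best and normalize_saturating are made up for illustration and are not part of tsearch2.

```python
# Rank values copied from the example result table in this message.
ranks = [0.19, 0.1, 0.1, 0.1, 0.1]


def normalize_by_best(values):
    """Divide every rank by the largest rank in this result set (hypothetical helper)."""
    best = max(values)
    if best <= 0:
        return [0.0 for _ in values]
    return [v / best for v in values]


def normalize_saturating(values):
    """Apply rank / (rank + 1) to each rank independently (hypothetical helper)."""
    return [v / (v + 1.0) for v in values]


by_best = normalize_by_best(ranks)
saturating = normalize_saturating(ranks)

# Print the raw rank next to both normalized variants.
for raw, a, b in zip(ranks, by_best, saturating):
    print(f"rank={raw:<5} by_best={a:.4f} rank/(rank+1)={b:.4f}")
```

Both variants keep the ordering of the rows. The by-best variant needs the whole result set before any row can be scored, and its values change whenever the top hit changes, while rank/(rank+1) can be computed per row.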



Re: [GENERAL] tsearch2 questions

2007-07-04 Thread hubert depesz lubaczewski

On 7/4/07, Joshua N Pritikin wrote:
> From where are you quoting? I was quoting from:
>
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-ref.html

I was quoting the file
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
or, actually, its copy provided with the PostgreSQL sources in the
contrib/tsearch2/docs directory.


>> Actually, Oleg suggested not using the best rank, but just using the formula
>> as given, rank/(rank+1), to get a rank in the range of 0 to 1.
> OK, then what does the +1 mean in your formulae? Consider these results
> from [1]. rank/(rank+1): 0.19/.1 = 1.9, .1/.1 = 1, etc. That doesn't
> make sense. The reciprocal also doesn't make sense. So what does Oleg
> mean? I was guessing that Oleg meant to divide the rank column by the
> first rank, that is, by 0.19 so you would get 1, .52, .52, etc.



+1 means: add one to the rank.
For example, for rank = 0.1 you get: 0.1/(0.1+1) = 0.1/1.1 = 0.0909
For rank = 0.5 you get: 0.5/(0.5+1) = 0.5/1.5 = 0.3333

I think the notation rank+1 is pretty readable.

Additionally, sorry, but I don't understand your calculations. What is
0.19/.1? How did you get the .1?
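
As a small sketch of why rank/(rank+1) always lands between 0 and 1, the formula can be evaluated for a few ranks, including very large ones. The helper name normalize_rank is hypothetical, and the sketch assumes the rank is never negative.

```python
def normalize_rank(rank: float) -> float:
    """Map a non-negative rank onto [0, 1) using rank / (rank + 1)."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    return rank / (rank + 1.0)


# A rank of 0 stays 0; ever larger ranks approach 1 but never reach it.
for rank in (0.0, 0.1, 0.5, 1.0, 10.0, 1000.0):
    print(f"{rank:>7} -> {normalize_rank(rank):.4f}")
```

Because the mapping is strictly increasing, sorting by the normalized value gives the same order as sorting by the raw rank.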

depesz


Re: [GENERAL] tsearch2 questions

2007-07-04 Thread Joshua N Pritikin
On Wed, Jul 04, 2007 at 11:08:21AM +0200, hubert depesz lubaczewski wrote:
> On 7/4/07, Joshua N Pritikin wrote:
>> From where are you quoting? I was quoting from:
>>
>> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-ref.html
>
> I was quoting the file
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html

So that one is fine. Only the reference could use some clarification.

>>> Actually, Oleg suggested not using the best rank, but just using the formula
>>> as given, rank/(rank+1), to get a rank in the range of 0 to 1.
>> OK, then what does the +1 mean in your formulae? Consider these results
>> from [1]. rank/(rank+1): 0.19/.1 = 1.9, .1/.1 = 1, etc. That doesn't
>> make sense. The reciprocal also doesn't make sense. So what does Oleg
>> mean? I was guessing that Oleg meant to divide the rank column by the
>> first rank, that is, by 0.19 so you would get 1, .52, .52, etc.
>
> +1 means: add one to the rank.
> For example, for rank = 0.1 you get: 0.1/(0.1+1) = 0.1/1.1 = 0.0909
> For rank = 0.5 you get: 0.5/(0.5+1) = 0.5/1.5 = 0.3333

D'oh! I see.

> I think the notation rank+1 is pretty readable.
>
> Additionally, sorry, but I don't understand your calculations. What is
> 0.19/.1? How did you get the .1?

I was imagining that rank+1 was the second row of the rank column.

Sorry for the confusion.



Re: [GENERAL] TSearch2 Questions

2005-11-22 Thread Hannes Dorbath

On 21.11.2005 18:24, Bruno Wolff III wrote:
> On Mon, Nov 21, 2005 at 16:50:00 +0300, Oleg Bartunov oleg@sai.msu.su wrote:
>> On Mon, 21 Nov 2005, Hannes Dorbath wrote:
>>> I'm playing a bit with it ATM. Indexing one gigabyte of plain text worked
>>> well, but with 10 GB I still have some performance problems. I read the
>>> TSearch Tuning Guide and will start optimizing some things, but is it a
>>> realistic goal to index ~90GB plain text and get sub-second response
>>> times on hardware that ~4000 EUR can buy?
>>
>> What's ATM? As for the sub-second response times, it would depend very
>> much on your data and queries. It would certainly be possible with our
>> tsearch daemon, which we postponed because we are inclined to implement
>> inverted indices first and then build the FTS index on top of the
>> inverted index. But this is a long-term plan.
>
> I believe in this context, 'ATM' is an acronym for 'at the moment', which
> has little impact on the meaning of the paragraph.


For whatever reason I cannot find Oleg's reply on this server, so I 
reply to this post instead. Thanks for your time Oleg, your answers 
really helped me. I still have two questions about compound words and 
UTF-8, but I'll create a new specific post.


Thanks again.

--
Regards,
Hannes Dorbath



[GENERAL] TSearch2 Questions

2005-11-21 Thread Hannes Dorbath

A few stupid questions:

Where to get the latest version?

Is http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ a dead site 
and the latest versions are always silently distributed with PG inside 
the contrib dir?


How can I find out what version of TSearch2 I'm running?

Is there active development?

Are the patches provided on the site above for backup still needed, or 
are they already included in the versions that ship with 8.0.x? If not, 
why not? =)


Or the better question, are any of those patches listed under 
Development included in the version that ships with recent PG versions?


I'm playing a bit with it ATM. Indexing one gigabyte of plain text
worked well, but with 10 GB I still have some performance problems. I read
the TSearch Tuning Guide and will start optimizing some things, but is it a
realistic goal to index ~90GB plain text and get sub-second response
times on hardware that ~4000 EUR can buy?


Thanks in advance

--
Regards,
Hannes Dorbath



Re: [GENERAL] TSearch2 Questions

2005-11-21 Thread Oleg Bartunov

On Mon, 21 Nov 2005, Hannes Dorbath wrote:


> A few stupid questions:
>
> Where to get the latest version?
>
> Is http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ a dead site and
> the latest versions are always silently distributed with PG inside the
> contrib dir?

You should always use the tsearch2 distributed with PostgreSQL.
We keep our version for testing purposes. Sometimes we publish backpatches
(from CVS HEAD) for stable releases.




> How can I find out what version of TSearch2 I'm running?
>
> Is there active development?

It's actively developed; see the CVS HEAD commits. The main problem being
addressed is full UTF-8 support. We also plan some other improvements.
See http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo



> Are the patches provided on the site above for backup still needed, or are
> they already included in the versions that ship with 8.0.x? If not, why not?
> =)

All patches are already applied.



> Or the better question, are any of those patches listed under Development
> included in the version that ships with recent PG versions?

Right now, there are no patches you should be aware of. We plan to release
a UTF-8 support patch for the 8.1 release.

> I'm playing a bit with it ATM. Indexing one gigabyte of plain text worked
> well, but with 10 GB I still have some performance problems. I read the
> TSearch Tuning Guide and will start optimizing some things, but is it a
> realistic goal to index ~90GB plain text and get sub-second response times
> on hardware that ~4000 EUR can buy?

What's ATM? As for the sub-second response times, it would depend very
much on your data and queries. It would certainly be possible with our
tsearch daemon, which we postponed because we are inclined to implement
inverted indices first and then build the FTS index on top of the
inverted index. But this is a long-term plan.

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Re: [GENERAL] TSearch2 Questions

2005-11-21 Thread Bruno Wolff III
On Mon, Nov 21, 2005 at 16:50:00 +0300,
  Oleg Bartunov oleg@sai.msu.su wrote:
> On Mon, 21 Nov 2005, Hannes Dorbath wrote:
>
>> I'm playing a bit with it ATM. Indexing one gigabyte of plain text worked
>> well, but with 10 GB I still have some performance problems. I read the
>> TSearch Tuning Guide and will start optimizing some things, but is it a
>> realistic goal to index ~90GB plain text and get sub-second response
>> times on hardware that ~4000 EUR can buy?
>
> What's ATM? As for the sub-second response times, it would depend very
> much on your data and queries. It would certainly be possible with our
> tsearch daemon, which we postponed because we are inclined to implement
> inverted indices first and then build the FTS index on top of the
> inverted index. But this is a long-term plan.

I believe in this context, 'ATM' is an acronym for 'at the moment', which
has little impact on the meaning of the paragraph.
