Re: [GENERAL] FTS uses tsquery directly in the query

2010-01-25 Thread xu fei
Hi, Oleg Bartunov:
First thanks for your quick replay. Could you explain it a little more on it's 
general limitation/feature? I just confuse that to_tsquery('item') function 
will return a tsquery type which is same as 'item'::tsquery, to my 
understanding. 
Let me explain what I want:First Step: extract top K tokensI have a table with 
a column as tsvector type. Some records in this column are too big, which 
contain hundreds tokens. I just want the top K tokens based on the frequency, 
for example top 5. I am not sure there is a direct way to get such kind top K 
tokens. I just read them out in Java and count frequency for each token and 
sort them.
Second Step: generate queryNow I will use these tokens to construct a query to 
search other vectors in the same table. I can not directly use to_tsquery() due 
to two reasons: 1) The default logic operator in to_tsquery() is  but what I 
need it |. 2) Since the tokens are from tsvector, they are already 
normalized. If I use to_tsquery() again, they will be normalized again! For 
example, “course” - “cours” - “cour”. So I just concatenate the top K tokens 
with “|” and directly use ::tsquery . 
Unfortunately, as you say it's general limitation/feature”, I can not do that. 
I checked your manual “Full-Text Search in PostgreSQL A Gentle Introduction”, 
but could not figure out how. So is it possible to implement what I want in 
FTS? If so, how?
Thank! 
Xu
--- On Sun, 1/24/10, Oleg Bartunov o...@sai.msu.su wrote:

From: Oleg Bartunov o...@sai.msu.su
Subject: Re: [GENERAL] FTS uses tsquery directly in the query
To: xu fei auto...@yahoo.com
Cc: pgsql-general@postgresql.org
Date: Sunday, January 24, 2010, 2:11 AM

Xu,

FTS has nothing with your problem, it's general limitation/feature.

Oleg
On Sat, 23 Jan 2010, xu fei wrote:

 Hi, everyone:
 First I can successful run this query:select name, ts_rank_cd(vectors, query) 
 as rank from element, to_tsquery('item') query where query @@ vectors order 
 by rank desc;But actually I want to run this one:select name, 
 ts_rank_cd(vectors, query) as rank from element, 'item'::tsquery query where 
 query @@ vectors order by rank desc;Looks like that FTS does not support 
 directly use ::tsquery  in such query. Do I misunderstand something?  
 Thanks!
 Xu




     Regards,
         Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


  

Re: [GENERAL] FTS uses tsquery directly in the query

2010-01-25 Thread Ivan Sergio Borgonovo
On Mon, 25 Jan 2010 07:19:59 -0800 (PST)
xu fei auto...@yahoo.com wrote:

 Hi, Oleg Bartunov:
 First thanks for your quick replay. Could you explain it a little
 more on it's general limitation/feature? I just confuse that
 to_tsquery('item') function will return a tsquery type which is
 same as 'item'::tsquery, to my understanding. Let me explain what
 I want:First Step: extract top K tokensI have a table with a
 column as tsvector type. Some records in this column are too big,
 which contain hundreds tokens. I just want the top K tokens based
 on the frequency, for example top 5. I am not sure there is a
 direct way to get such kind top K tokens. I just read them out in
 Java and count frequency for each token and sort them. Second
 Step: generate queryNow I will use these tokens to construct a
 query to search other vectors in the same table. I can not
 directly use to_tsquery() due to two reasons: 1) The default logic
 operator in to_tsquery() is  but what I need it |. 2) Since
 the tokens are from tsvector, they are already normalized. If I
 use to_tsquery() again, they will be normalized again! For
 example, “course” - “cours” - “cour”. So I just concatenate the
 top K tokens with “|” and directly use ::tsquery .
 Unfortunately, as you say it's general limitation/feature”, I can
 not do that. I checked your manual “Full-Text Search
 in PostgreSQL A Gentle Introduction”, but could not figure out
 how. So is it possible to implement what I want in FTS? If so,
 how? Thank! Xu --- On Sun, 1/24/10, Oleg Bartunov

You're trying to solve a similar problems than mine.
I'd like to build up a näive similar text search.
I don't have the length problem still I'd like to avoid to
tokenize/lexize a text twice to build up a tsquery.
I've weighted tsvectors stored in a column and once I pick up one
I'd like to look for similar ones in the same column.

There are thousands way to measure text similarity (and Oleg pointed
me to some), still ts_rank should be good enough for me.

I've very short text so I can't use  on the whole tsvector
otherwise there will be very high chances to find just one match.

As you suggested I could pick up a subset of important[1] lexemes
in the tsvector and build up an ed tsquery with them.

Still at least in my case, since I'm dealing with very short texts,
this still looks too risky (just 1 match). Considering that I'm
using weighted tsvectors it seems that | and picking up the ones
with the best rank could be a way to go.

But as you've noted there is no function that turns a tsvector in a
tsquery (including weight possibly) and give you the choice to use
|.

Well... I'm trying to write a couple of helper functions in C.
But I'm pretty new to postgres internals and well I miss a reference
of functions/macro with some examples... and this is a side project
and I haven't been using C for quite a while.

Once I'll have that function I'll have to solve how to return few
rows (since I'll have to use | I expect a lot of returned rows) to
make efficient use of the gin index and avoid to compute ts_rank for
too many rows.

Don't hold your breath waiting... but let me know if you're
interested so I don't have to be the only one posting newbies
questions on pgsql-hackers ;)

[1] ts_stat could give you some hints about what lexemes may be
important... but well deciding what's important is another can of
worms... and as anticipated ts_rank should be good enough for me.

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS uses tsquery directly in the query

2010-01-25 Thread Oleg Bartunov

Do you guys wanted something like:

arxiv=# select and2or(to_tsquery('1  2  3'));
   and2or 
-

 ( '1' | '2' ) | '3'
(1 row)


Oleg
On Mon, 25 Jan 2010, Ivan Sergio Borgonovo wrote:


On Mon, 25 Jan 2010 07:19:59 -0800 (PST)
xu fei auto...@yahoo.com wrote:

 Hi, Oleg Bartunov:
 First thanks for your quick replay.=C2=A0Could you explain it a little
 more on it's general limitation/feature? I just confuse that
 to_tsquery('item') function will return a tsquery type which is
 same as 'item'::tsquery, to my understanding. Let me explain what
 I want:First Step: extract top K tokensI have a table with a
 column as tsvector type. Some records in this column are too big,
 which contain hundreds tokens. I just want the top K tokens based
 on the frequency, for example top 5. I am not sure there is a
 direct way to get such kind top K tokens. I just read them out in
 Java and count frequency for each token and sort them. Second
 Step: generate queryNow I will use these tokens to construct a
 query to search other vectors in the same table. I can not
 directly use to_tsquery() due to two reasons: 1) The default logic
 operator in to_tsquery() is  but what I need it |. 2) Since
 the tokens are from tsvector, they are already normalized. If I
 use to_tsquery() again, they will be normalized again! For
 example, =E2=80=9Ccourse=E2=80=9D - =E2=80=9Ccours=E2=80=9D - =E2=80=9C=
cour=E2=80=9D. So I just concatenate the
 top K tokens with =E2=80=9C|=E2=80=9D and directly use ::tsquery .
 Unfortunately, as you say it's general limitation/feature=E2=80=9D, I can
 not do that. I checked your manual =E2=80=9CFull-Text Search
 in=C2=A0PostgreSQL=C2=A0A Gentle Introduction=E2=80=9D, but could not fig=
ure out
 how. So is it possible to implement what I want in FTS? If so,
 how? Thank! Xu --- On Sun, 1/24/10, Oleg Bartunov

You're trying to solve a similar problems than mine.
I'd like to build up a n=C3=A4ive similar text search.
I don't have the length problem still I'd like to avoid to
tokenize/lexize a text twice to build up a tsquery.
I've weighted tsvectors stored in a column and once I pick up one
I'd like to look for similar ones in the same column.

There are thousands way to measure text similarity (and Oleg pointed
me to some), still ts_rank should be good enough for me.

I've very short text so I can't use  on the whole tsvector
otherwise there will be very high chances to find just one match.

As you suggested I could pick up a subset of important[1] lexemes
in the tsvector and build up an ed tsquery with them.

Still at least in my case, since I'm dealing with very short texts,
this still looks too risky (just 1 match). Considering that I'm
using weighted tsvectors it seems that | and picking up the ones
with the best rank could be a way to go.

But as you've noted there is no function that turns a tsvector in a
tsquery (including weight possibly) and give you the choice to use
|.

Well... I'm trying to write a couple of helper functions in C.
But I'm pretty new to postgres internals and well I miss a reference
of functions/macro with some examples... and this is a side project
and I haven't been using C for quite a while.

Once I'll have that function I'll have to solve how to return few
rows (since I'll have to use | I expect a lot of returned rows) to
make efficient use of the gin index and avoid to compute ts_rank for
too many rows.

Don't hold your breath waiting... but let me know if you're
interested so I don't have to be the only one posting newbies
questions on pgsql-hackers ;)

[1] ts_stat could give you some hints about what lexemes may be
important... but well deciding what's important is another can of
worms... and as anticipated ts_rank should be good enough for me.

--=20
Ivan Sergio Borgonovo
http://www.webthatworks.it


--=20
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS uses tsquery directly in the query

2010-01-25 Thread Ivan Sergio Borgonovo
On Mon, 25 Jan 2010 23:35:12 +0300 (MSK)
Oleg Bartunov o...@sai.msu.su wrote:

 Do you guys wanted something like:
 
 arxiv=# select and2or(to_tsquery('1  2  3'));
 and2or 
 -
   ( '1' | '2' ) | '3'
 (1 row)

Nearly. I'm starting from a weighted tsvector not from text/tsquery.
I would like to:
- keep the weights in the query
- avoid parsing the text to extract lexemes twice (I already have a
  tsvector)

For me extending pg in C is a new science, but I'm actually trying
to write at least a couple of functions that:
- will return a tsvector as a weight int, pos int[], lexeme text
  record
- will turn a tsvector + operator into a tsquery
  'orange':A1,2,3 'banana':B4,5 'tomato':C6,7 -
  'orange':A | 'banana':B | 'tomato':C
  or eventually
  'orange':A  'banana':B  'tomato':C

thanks

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS uses tsquery directly in the query

2010-01-25 Thread xu fei
Hi, Ivan:
I agree with you and also would like to 'hack' into the code. Current FTC is 
the best one in database system and a great building block to support more 
functions. I list some I can think about:choose | or  as an optional 
parameter for to_tsquery, to_tsvector.choose normalization or not for 
to_tsquery, to_tsvector.current two rankings are not enough: the 
default ts_rank, I have not figured out the algorithm. The ts_rank_cd, we have 
the paper but it is designed for short query with 2 or 3 
tokens.The normalization may be similar to Apache Lucene which is really easy 
to modify and build your own tokenizer. I still feel confused after reading 
the annual.. I am not sure current there is a team to help Oleg Bartunov or 
not. If need, I can try to do something rather than just hacking it. I am sure, 
Ivan also will join this. :)Xu
--- On Mon, 1/25/10, Ivan Sergio Borgonovo m...@webthatworks.it wrote:

From: Ivan Sergio Borgonovo m...@webthatworks.it
Subject: Re: [GENERAL] FTS uses tsquery directly in the query
To: pgsql-general@postgresql.org
Date: Monday, January 25, 2010, 4:33 PM

On Mon, 25 Jan 2010 23:35:12 +0300 (MSK)
Oleg Bartunov o...@sai.msu.su wrote:

 Do you guys wanted something like:
 
 arxiv=# select and2or(to_tsquery('1  2  3'));
         and2or 
 -
   ( '1' | '2' ) | '3'
 (1 row)

Nearly. I'm starting from a weighted tsvector not from text/tsquery.
I would like to:
- keep the weights in the query
- avoid parsing the text to extract lexemes twice (I already have a
  tsvector)

For me extending pg in C is a new science, but I'm actually trying
to write at least a couple of functions that:
- will return a tsvector as a weight int, pos int[], lexeme text
  record
- will turn a tsvector + operator into a tsquery
  'orange':A1,2,3 'banana':B4,5 'tomato':C6,7 -
  'orange':A | 'banana':B | 'tomato':C
  or eventually
  'orange':A  'banana':B  'tomato':C

thanks

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



  

[GENERAL] FTS uses tsquery directly in the query

2010-01-23 Thread xu fei
Hi, everyone:
First I can successful run this query:select name, ts_rank_cd(vectors, query) 
as rank from element, to_tsquery('item') query where query @@ vectors order by 
rank desc;But actually I want to run this one:select name, ts_rank_cd(vectors, 
query) as rank from element, 'item'::tsquery query where query @@ vectors order 
by rank desc;Looks like that FTS does not support directly use ::tsquery  in 
such query. Do I misunderstand something?  Thanks!
Xu


  

Re: [GENERAL] FTS uses tsquery directly in the query

2010-01-23 Thread Oleg Bartunov

Xu,

FTS has nothing with your problem, it's general limitation/feature.

Oleg
On Sat, 23 Jan 2010, xu fei wrote:


Hi, everyone:
First I can successful run this query:select name, ts_rank_cd(vectors, query) as rank 
from element, to_tsquery('item') query where query @@ vectors order by rank desc;But 
actually I want to run this one:select name, ts_rank_cd(vectors, query) as rank from 
element, 'item'::tsquery query where query @@ vectors order by rank desc;Looks like that 
FTS does not support directly use ::tsquery  in such query. Do I 
misunderstand something?  Thanks!
Xu





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general