Re: [GENERAL] FTS uses "tsquery" directly in the query

2010-01-25 Thread xu fei
Hi, Ivan:
I agree with you and also would like to 'hack' into the code. Current FTC is 
the best one in database system and a great building block to support more 
functions. I list some I can think about:choose "|" or "&" as an optional 
parameter for to_tsquery, to_tsvector.choose normalization or not for 
to_tsquery, to_tsvector.current two rankings are not enough: the 
default ts_rank, I have not figured out the algorithm. The ts_rank_cd, we have 
the paper but it is designed for short query with 2 or 3 
tokens.The normalization may be similar to Apache Lucene which is really easy 
to modify and build your own tokenizer. I still feel confused after reading 
the annual.. I am not sure current there is a team to help Oleg Bartunov or 
not. If need, I can try to do something rather than just hacking it. I am sure, 
Ivan also will join this. :)Xu
--- On Mon, 1/25/10, Ivan Sergio Borgonovo  wrote:

From: Ivan Sergio Borgonovo 
Subject: Re: [GENERAL] FTS uses "tsquery" directly in the query
To: pgsql-general@postgresql.org
Date: Monday, January 25, 2010, 4:33 PM

On Mon, 25 Jan 2010 23:35:12 +0300 (MSK)
Oleg Bartunov  wrote:

> Do you guys wanted something like:
> 
> arxiv=# select and2or(to_tsquery('1 & 2 & 3'));
>         and2or 
> -
>   ( '1' | '2' ) | '3'
> (1 row)

Nearly. I'm starting from a weighted tsvector not from text/tsquery.
I would like to:
- keep the weights in the query
- avoid parsing the text to extract lexemes twice (I already have a
  tsvector)

For me extending pg in C is a new science, but I'm actually trying
to write at least a couple of functions that:
- will return a tsvector as a weight int, pos int[], lexeme text
  record
- will turn a tsvector + operator into a tsquery
  'orange':A1,2,3 'banana':B4,5 'tomato':C6,7 ->
  'orange':A | 'banana':B | 'tomato':C
  or eventually
  'orange':A & 'banana':B & 'tomato':C

thanks

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



  

Re: [GENERAL] FTS uses "tsquery" directly in the query

2010-01-25 Thread Ivan Sergio Borgonovo
On Mon, 25 Jan 2010 23:35:12 +0300 (MSK)
Oleg Bartunov  wrote:

> Do you guys wanted something like:
> 
> arxiv=# select and2or(to_tsquery('1 & 2 & 3'));
> and2or 
> -
>   ( '1' | '2' ) | '3'
> (1 row)

Nearly. I'm starting from a weighted tsvector not from text/tsquery.
I would like to:
- keep the weights in the query
- avoid parsing the text to extract lexemes twice (I already have a
  tsvector)

For me extending pg in C is a new science, but I'm actually trying
to write at least a couple of functions that:
- will return a tsvector as a weight int, pos int[], lexeme text
  record
- will turn a tsvector + operator into a tsquery
  'orange':A1,2,3 'banana':B4,5 'tomato':C6,7 ->
  'orange':A | 'banana':B | 'tomato':C
  or eventually
  'orange':A & 'banana':B & 'tomato':C

thanks

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS uses "tsquery" directly in the query

2010-01-25 Thread Oleg Bartunov

Do you guys wanted something like:

arxiv=# select and2or(to_tsquery('1 & 2 & 3'));
   and2or 
-

 ( '1' | '2' ) | '3'
(1 row)


Oleg
On Mon, 25 Jan 2010, Ivan Sergio Borgonovo wrote:


On Mon, 25 Jan 2010 07:19:59 -0800 (PST)
xu fei  wrote:

> Hi, Oleg Bartunov:
> First thanks for your quick replay.=C2=A0Could you explain it a little
> more on "it's general limitation/feature"? I just confuse that
> to_tsquery('item') function will return a tsquery type which is
> same as 'item'::tsquery, to my understanding. Let me explain what
> I want:First Step: extract top K tokensI have a table with a
> column as tsvector type. Some records in this column are too big,
> which contain hundreds tokens. I just want the top K tokens based
> on the frequency, for example top 5. I am not sure there is a
> direct way to get such kind top K tokens. I just read them out in
> Java and count frequency for each token and sort them. Second
> Step: generate queryNow I will use these tokens to construct a
> query to search other vectors in the same table. I can not
> directly use to_tsquery() due to two reasons: 1) The default logic
> operator in to_tsquery() is "&" but what I need it "|". 2) Since
> the tokens are from tsvector, they are already normalized. If I
> use to_tsquery() again, they will be normalized again! For
> example, =E2=80=9Ccourse=E2=80=9D -> =E2=80=9Ccours=E2=80=9D -> =E2=80=9C=
cour=E2=80=9D. So I just concatenate the
> top K tokens with =E2=80=9C|=E2=80=9D and directly use "::tsquery ".
> Unfortunately, as you say "it's general limitation/feature=E2=80=9D, I can
> not do that. I checked your manual =E2=80=9CFull-Text Search
> in=C2=A0PostgreSQL=C2=A0A Gentle Introduction=E2=80=9D, but could not fig=
ure out
> how. So is it possible to implement what I want in FTS? If so,
> how? Thank! Xu --- On Sun, 1/24/10, Oleg Bartunov

You're trying to solve a similar problems than mine.
I'd like to build up a n=C3=A4ive similar text search.
I don't have the "length" problem still I'd like to avoid to
tokenize/lexize a text twice to build up a tsquery.
I've weighted tsvectors stored in a column and once I pick up one
I'd like to look for similar ones in the same column.

There are thousands way to measure text similarity (and Oleg pointed
me to some), still ts_rank should be "good enough for me".

I've very short text so I can't use & on the whole tsvector
otherwise there will be very high chances to find just one match.

As you suggested I could pick up a subset of "important"[1] lexemes
in the tsvector and build up an "&"ed tsquery with them.

Still at least in my case, since I'm dealing with very short texts,
this still looks too risky (just 1 match). Considering that I'm
using weighted tsvectors it seems that "|" and picking up the ones
with the best rank could be a way to go.

But as you've noted there is no function that turns a tsvector in a
tsquery (including weight possibly) and give you the choice to use
"|".

Well... I'm trying to write a couple of helper functions in C.
But I'm pretty new to postgres internals and well I miss a reference
of functions/macro with some examples... and this is a side project
and I haven't been using C for quite a while.

Once I'll have that function I'll have to solve how to return few
rows (since I'll have to use | I expect a lot of returned rows) to
make efficient use of the gin index and avoid to compute ts_rank for
too many rows.

Don't hold your breath waiting... but let me know if you're
interested so I don't have to be the only one posting newbies
questions on pgsql-hackers ;)

[1] ts_stat could give you some hints about what lexemes may be
important... but well deciding what's important is another can of
worms... and as anticipated ts_rank should be "good enough for me".

--=20
Ivan Sergio Borgonovo
http://www.webthatworks.it


--=20
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS uses "tsquery" directly in the query

2010-01-25 Thread Ivan Sergio Borgonovo
On Mon, 25 Jan 2010 07:19:59 -0800 (PST)
xu fei  wrote:

> Hi, Oleg Bartunov:
> First thanks for your quick replay. Could you explain it a little
> more on "it's general limitation/feature"? I just confuse that
> to_tsquery('item') function will return a tsquery type which is
> same as 'item'::tsquery, to my understanding. Let me explain what
> I want:First Step: extract top K tokensI have a table with a
> column as tsvector type. Some records in this column are too big,
> which contain hundreds tokens. I just want the top K tokens based
> on the frequency, for example top 5. I am not sure there is a
> direct way to get such kind top K tokens. I just read them out in
> Java and count frequency for each token and sort them. Second
> Step: generate queryNow I will use these tokens to construct a
> query to search other vectors in the same table. I can not
> directly use to_tsquery() due to two reasons: 1) The default logic
> operator in to_tsquery() is "&" but what I need it "|". 2) Since
> the tokens are from tsvector, they are already normalized. If I
> use to_tsquery() again, they will be normalized again! For
> example, “course” -> “cours” -> “cour”. So I just concatenate the
> top K tokens with “|” and directly use "::tsquery ".
> Unfortunately, as you say "it's general limitation/feature”, I can
> not do that. I checked your manual “Full-Text Search
> in PostgreSQL A Gentle Introduction”, but could not figure out
> how. So is it possible to implement what I want in FTS? If so,
> how? Thank! Xu --- On Sun, 1/24/10, Oleg Bartunov

You're trying to solve a similar problems than mine.
I'd like to build up a näive similar text search.
I don't have the "length" problem still I'd like to avoid to
tokenize/lexize a text twice to build up a tsquery.
I've weighted tsvectors stored in a column and once I pick up one
I'd like to look for similar ones in the same column.

There are thousands way to measure text similarity (and Oleg pointed
me to some), still ts_rank should be "good enough for me".

I've very short text so I can't use & on the whole tsvector
otherwise there will be very high chances to find just one match.

As you suggested I could pick up a subset of "important"[1] lexemes
in the tsvector and build up an "&"ed tsquery with them.

Still at least in my case, since I'm dealing with very short texts,
this still looks too risky (just 1 match). Considering that I'm
using weighted tsvectors it seems that "|" and picking up the ones
with the best rank could be a way to go.

But as you've noted there is no function that turns a tsvector in a
tsquery (including weight possibly) and give you the choice to use
"|".

Well... I'm trying to write a couple of helper functions in C.
But I'm pretty new to postgres internals and well I miss a reference
of functions/macro with some examples... and this is a side project
and I haven't been using C for quite a while.

Once I'll have that function I'll have to solve how to return few
rows (since I'll have to use | I expect a lot of returned rows) to
make efficient use of the gin index and avoid to compute ts_rank for
too many rows.

Don't hold your breath waiting... but let me know if you're
interested so I don't have to be the only one posting newbies
questions on pgsql-hackers ;)

[1] ts_stat could give you some hints about what lexemes may be
important... but well deciding what's important is another can of
worms... and as anticipated ts_rank should be "good enough for me".

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FTS uses "tsquery" directly in the query

2010-01-25 Thread xu fei
Hi, Oleg Bartunov:
First thanks for your quick replay. Could you explain it a little more on "it's 
general limitation/feature"? I just confuse that to_tsquery('item') function 
will return a tsquery type which is same as 'item'::tsquery, to my 
understanding. 
Let me explain what I want:First Step: extract top K tokensI have a table with 
a column as tsvector type. Some records in this column are too big, which 
contain hundreds tokens. I just want the top K tokens based on the frequency, 
for example top 5. I am not sure there is a direct way to get such kind top K 
tokens. I just read them out in Java and count frequency for each token and 
sort them.
Second Step: generate queryNow I will use these tokens to construct a query to 
search other vectors in the same table. I can not directly use to_tsquery() due 
to two reasons: 1) The default logic operator in to_tsquery() is "&" but what I 
need it "|". 2) Since the tokens are from tsvector, they are already 
normalized. If I use to_tsquery() again, they will be normalized again! For 
example, “course” -> “cours” -> “cour”. So I just concatenate the top K tokens 
with “|” and directly use "::tsquery ". 
Unfortunately, as you say "it's general limitation/feature”, I can not do that. 
I checked your manual “Full-Text Search in PostgreSQL A Gentle Introduction”, 
but could not figure out how. So is it possible to implement what I want in 
FTS? If so, how?
Thank! 
Xu
--- On Sun, 1/24/10, Oleg Bartunov  wrote:

From: Oleg Bartunov 
Subject: Re: [GENERAL] FTS uses "tsquery" directly in the query
To: "xu fei" 
Cc: pgsql-general@postgresql.org
Date: Sunday, January 24, 2010, 2:11 AM

Xu,

FTS has nothing with your problem, it's general limitation/feature.

Oleg
On Sat, 23 Jan 2010, xu fei wrote:

> Hi, everyone:
> First I can successful run this query:select name, ts_rank_cd(vectors, query) 
> as rank from element, to_tsquery('item') query where query @@ vectors order 
> by rank desc;But actually I want to run this one:select name, 
> ts_rank_cd(vectors, query) as rank from element, 'item'::tsquery query where 
> query @@ vectors order by rank desc;Looks like that FTS does not support 
> directly use "::tsquery " in such query. Do I misunderstand something?  
> Thanks!
> Xu
>
>
>

     Regards,
         Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


  

Re: [GENERAL] FTS uses "tsquery" directly in the query

2010-01-23 Thread Oleg Bartunov

Xu,

FTS has nothing with your problem, it's general limitation/feature.

Oleg
On Sat, 23 Jan 2010, xu fei wrote:


Hi, everyone:
First I can successful run this query:select name, ts_rank_cd(vectors, query) as rank 
from element, to_tsquery('item') query where query @@ vectors order by rank desc;But 
actually I want to run this one:select name, ts_rank_cd(vectors, query) as rank from 
element, 'item'::tsquery query where query @@ vectors order by rank desc;Looks like that 
FTS does not support directly use "::tsquery " in such query. Do I 
misunderstand something?  Thanks!
Xu





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] FTS uses "tsquery" directly in the query

2010-01-23 Thread xu fei
Hi, everyone:
First I can successful run this query:select name, ts_rank_cd(vectors, query) 
as rank from element, to_tsquery('item') query where query @@ vectors order by 
rank desc;But actually I want to run this one:select name, ts_rank_cd(vectors, 
query) as rank from element, 'item'::tsquery query where query @@ vectors order 
by rank desc;Looks like that FTS does not support directly use "::tsquery " in 
such query. Do I misunderstand something?  Thanks!
Xu