Re: [GENERAL] FTS uses tsquery directly in the query
Hi, Oleg Bartunov: First thanks for your quick replay. Could you explain it a little more on it's general limitation/feature? I just confuse that to_tsquery('item') function will return a tsquery type which is same as 'item'::tsquery, to my understanding. Let me explain what I want:First Step: extract top K tokensI have a table with a column as tsvector type. Some records in this column are too big, which contain hundreds tokens. I just want the top K tokens based on the frequency, for example top 5. I am not sure there is a direct way to get such kind top K tokens. I just read them out in Java and count frequency for each token and sort them. Second Step: generate queryNow I will use these tokens to construct a query to search other vectors in the same table. I can not directly use to_tsquery() due to two reasons: 1) The default logic operator in to_tsquery() is but what I need it |. 2) Since the tokens are from tsvector, they are already normalized. If I use to_tsquery() again, they will be normalized again! For example, “course” - “cours” - “cour”. So I just concatenate the top K tokens with “|” and directly use ::tsquery . Unfortunately, as you say it's general limitation/feature”, I can not do that. I checked your manual “Full-Text Search in PostgreSQL A Gentle Introduction”, but could not figure out how. So is it possible to implement what I want in FTS? If so, how? Thank! Xu --- On Sun, 1/24/10, Oleg Bartunov o...@sai.msu.su wrote: From: Oleg Bartunov o...@sai.msu.su Subject: Re: [GENERAL] FTS uses tsquery directly in the query To: xu fei auto...@yahoo.com Cc: pgsql-general@postgresql.org Date: Sunday, January 24, 2010, 2:11 AM Xu, FTS has nothing with your problem, it's general limitation/feature. Oleg On Sat, 23 Jan 2010, xu fei wrote: Hi, everyone: First I can successful run this query:select name, ts_rank_cd(vectors, query) as rank from element, to_tsquery('item') query where query @@ vectors order by rank desc;But actually I want to run this one:select name, ts_rank_cd(vectors, query) as rank from element, 'item'::tsquery query where query @@ vectors order by rank desc;Looks like that FTS does not support directly use ::tsquery in such query. Do I misunderstand something? Thanks! Xu Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Re: [GENERAL] FTS uses tsquery directly in the query
On Mon, 25 Jan 2010 07:19:59 -0800 (PST) xu fei auto...@yahoo.com wrote: Hi, Oleg Bartunov: First thanks for your quick replay. Could you explain it a little more on it's general limitation/feature? I just confuse that to_tsquery('item') function will return a tsquery type which is same as 'item'::tsquery, to my understanding. Let me explain what I want:First Step: extract top K tokensI have a table with a column as tsvector type. Some records in this column are too big, which contain hundreds tokens. I just want the top K tokens based on the frequency, for example top 5. I am not sure there is a direct way to get such kind top K tokens. I just read them out in Java and count frequency for each token and sort them. Second Step: generate queryNow I will use these tokens to construct a query to search other vectors in the same table. I can not directly use to_tsquery() due to two reasons: 1) The default logic operator in to_tsquery() is but what I need it |. 2) Since the tokens are from tsvector, they are already normalized. If I use to_tsquery() again, they will be normalized again! For example, “course” - “cours” - “cour”. So I just concatenate the top K tokens with “|” and directly use ::tsquery . Unfortunately, as you say it's general limitation/feature”, I can not do that. I checked your manual “Full-Text Search in PostgreSQL A Gentle Introduction”, but could not figure out how. So is it possible to implement what I want in FTS? If so, how? Thank! Xu --- On Sun, 1/24/10, Oleg Bartunov You're trying to solve a similar problems than mine. I'd like to build up a näive similar text search. I don't have the length problem still I'd like to avoid to tokenize/lexize a text twice to build up a tsquery. I've weighted tsvectors stored in a column and once I pick up one I'd like to look for similar ones in the same column. There are thousands way to measure text similarity (and Oleg pointed me to some), still ts_rank should be good enough for me. I've very short text so I can't use on the whole tsvector otherwise there will be very high chances to find just one match. As you suggested I could pick up a subset of important[1] lexemes in the tsvector and build up an ed tsquery with them. Still at least in my case, since I'm dealing with very short texts, this still looks too risky (just 1 match). Considering that I'm using weighted tsvectors it seems that | and picking up the ones with the best rank could be a way to go. But as you've noted there is no function that turns a tsvector in a tsquery (including weight possibly) and give you the choice to use |. Well... I'm trying to write a couple of helper functions in C. But I'm pretty new to postgres internals and well I miss a reference of functions/macro with some examples... and this is a side project and I haven't been using C for quite a while. Once I'll have that function I'll have to solve how to return few rows (since I'll have to use | I expect a lot of returned rows) to make efficient use of the gin index and avoid to compute ts_rank for too many rows. Don't hold your breath waiting... but let me know if you're interested so I don't have to be the only one posting newbies questions on pgsql-hackers ;) [1] ts_stat could give you some hints about what lexemes may be important... but well deciding what's important is another can of worms... and as anticipated ts_rank should be good enough for me. -- Ivan Sergio Borgonovo http://www.webthatworks.it -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] FTS uses tsquery directly in the query
Do you guys wanted something like: arxiv=# select and2or(to_tsquery('1 2 3')); and2or - ( '1' | '2' ) | '3' (1 row) Oleg On Mon, 25 Jan 2010, Ivan Sergio Borgonovo wrote: On Mon, 25 Jan 2010 07:19:59 -0800 (PST) xu fei auto...@yahoo.com wrote: Hi, Oleg Bartunov: First thanks for your quick replay.=C2=A0Could you explain it a little more on it's general limitation/feature? I just confuse that to_tsquery('item') function will return a tsquery type which is same as 'item'::tsquery, to my understanding. Let me explain what I want:First Step: extract top K tokensI have a table with a column as tsvector type. Some records in this column are too big, which contain hundreds tokens. I just want the top K tokens based on the frequency, for example top 5. I am not sure there is a direct way to get such kind top K tokens. I just read them out in Java and count frequency for each token and sort them. Second Step: generate queryNow I will use these tokens to construct a query to search other vectors in the same table. I can not directly use to_tsquery() due to two reasons: 1) The default logic operator in to_tsquery() is but what I need it |. 2) Since the tokens are from tsvector, they are already normalized. If I use to_tsquery() again, they will be normalized again! For example, =E2=80=9Ccourse=E2=80=9D - =E2=80=9Ccours=E2=80=9D - =E2=80=9C= cour=E2=80=9D. So I just concatenate the top K tokens with =E2=80=9C|=E2=80=9D and directly use ::tsquery . Unfortunately, as you say it's general limitation/feature=E2=80=9D, I can not do that. I checked your manual =E2=80=9CFull-Text Search in=C2=A0PostgreSQL=C2=A0A Gentle Introduction=E2=80=9D, but could not fig= ure out how. So is it possible to implement what I want in FTS? If so, how? Thank! Xu --- On Sun, 1/24/10, Oleg Bartunov You're trying to solve a similar problems than mine. I'd like to build up a n=C3=A4ive similar text search. I don't have the length problem still I'd like to avoid to tokenize/lexize a text twice to build up a tsquery. I've weighted tsvectors stored in a column and once I pick up one I'd like to look for similar ones in the same column. There are thousands way to measure text similarity (and Oleg pointed me to some), still ts_rank should be good enough for me. I've very short text so I can't use on the whole tsvector otherwise there will be very high chances to find just one match. As you suggested I could pick up a subset of important[1] lexemes in the tsvector and build up an ed tsquery with them. Still at least in my case, since I'm dealing with very short texts, this still looks too risky (just 1 match). Considering that I'm using weighted tsvectors it seems that | and picking up the ones with the best rank could be a way to go. But as you've noted there is no function that turns a tsvector in a tsquery (including weight possibly) and give you the choice to use |. Well... I'm trying to write a couple of helper functions in C. But I'm pretty new to postgres internals and well I miss a reference of functions/macro with some examples... and this is a side project and I haven't been using C for quite a while. Once I'll have that function I'll have to solve how to return few rows (since I'll have to use | I expect a lot of returned rows) to make efficient use of the gin index and avoid to compute ts_rank for too many rows. Don't hold your breath waiting... but let me know if you're interested so I don't have to be the only one posting newbies questions on pgsql-hackers ;) [1] ts_stat could give you some hints about what lexemes may be important... but well deciding what's important is another can of worms... and as anticipated ts_rank should be good enough for me. --=20 Ivan Sergio Borgonovo http://www.webthatworks.it --=20 Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] FTS uses tsquery directly in the query
On Mon, 25 Jan 2010 23:35:12 +0300 (MSK) Oleg Bartunov o...@sai.msu.su wrote: Do you guys wanted something like: arxiv=# select and2or(to_tsquery('1 2 3')); and2or - ( '1' | '2' ) | '3' (1 row) Nearly. I'm starting from a weighted tsvector not from text/tsquery. I would like to: - keep the weights in the query - avoid parsing the text to extract lexemes twice (I already have a tsvector) For me extending pg in C is a new science, but I'm actually trying to write at least a couple of functions that: - will return a tsvector as a weight int, pos int[], lexeme text record - will turn a tsvector + operator into a tsquery 'orange':A1,2,3 'banana':B4,5 'tomato':C6,7 - 'orange':A | 'banana':B | 'tomato':C or eventually 'orange':A 'banana':B 'tomato':C thanks -- Ivan Sergio Borgonovo http://www.webthatworks.it -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] FTS uses tsquery directly in the query
Hi, Ivan: I agree with you and also would like to 'hack' into the code. Current FTC is the best one in database system and a great building block to support more functions. I list some I can think about:choose | or as an optional parameter for to_tsquery, to_tsvector.choose normalization or not for to_tsquery, to_tsvector.current two rankings are not enough: the default ts_rank, I have not figured out the algorithm. The ts_rank_cd, we have the paper but it is designed for short query with 2 or 3 tokens.The normalization may be similar to Apache Lucene which is really easy to modify and build your own tokenizer. I still feel confused after reading the annual.. I am not sure current there is a team to help Oleg Bartunov or not. If need, I can try to do something rather than just hacking it. I am sure, Ivan also will join this. :)Xu --- On Mon, 1/25/10, Ivan Sergio Borgonovo m...@webthatworks.it wrote: From: Ivan Sergio Borgonovo m...@webthatworks.it Subject: Re: [GENERAL] FTS uses tsquery directly in the query To: pgsql-general@postgresql.org Date: Monday, January 25, 2010, 4:33 PM On Mon, 25 Jan 2010 23:35:12 +0300 (MSK) Oleg Bartunov o...@sai.msu.su wrote: Do you guys wanted something like: arxiv=# select and2or(to_tsquery('1 2 3')); and2or - ( '1' | '2' ) | '3' (1 row) Nearly. I'm starting from a weighted tsvector not from text/tsquery. I would like to: - keep the weights in the query - avoid parsing the text to extract lexemes twice (I already have a tsvector) For me extending pg in C is a new science, but I'm actually trying to write at least a couple of functions that: - will return a tsvector as a weight int, pos int[], lexeme text record - will turn a tsvector + operator into a tsquery 'orange':A1,2,3 'banana':B4,5 'tomato':C6,7 - 'orange':A | 'banana':B | 'tomato':C or eventually 'orange':A 'banana':B 'tomato':C thanks -- Ivan Sergio Borgonovo http://www.webthatworks.it -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
[GENERAL] FTS uses tsquery directly in the query
Hi, everyone: First I can successful run this query:select name, ts_rank_cd(vectors, query) as rank from element, to_tsquery('item') query where query @@ vectors order by rank desc;But actually I want to run this one:select name, ts_rank_cd(vectors, query) as rank from element, 'item'::tsquery query where query @@ vectors order by rank desc;Looks like that FTS does not support directly use ::tsquery in such query. Do I misunderstand something? Thanks! Xu
Re: [GENERAL] FTS uses tsquery directly in the query
Xu, FTS has nothing with your problem, it's general limitation/feature. Oleg On Sat, 23 Jan 2010, xu fei wrote: Hi, everyone: First I can successful run this query:select name, ts_rank_cd(vectors, query) as rank from element, to_tsquery('item') query where query @@ vectors order by rank desc;But actually I want to run this one:select name, ts_rank_cd(vectors, query) as rank from element, 'item'::tsquery query where query @@ vectors order by rank desc;Looks like that FTS does not support directly use ::tsquery in such query. Do I misunderstand something? Thanks! Xu Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general