Text search selectivity improvements (was Re: [HACKERS] Google Summer of Code 2008)

2008-03-18 Thread Jan Urbański
OK, here's a more detailed description of the FTS selectivity improvement idea:
=== Write a typanalyze function for column type tsvector
The function would go through the tuples returned by the BlockSampler and compute the number of times each distinct lexeme appears inside the tsvector
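
The counting step described above can be sketched roughly as follows. This is only an illustration in Python, not the proposed C implementation: the function name most_common_lexemes, the top_n parameter, and the representation of each sampled tsvector as a set of lexeme strings are all assumptions made for this example. It counts, for every distinct lexeme, how many sampled rows contain it, and keeps the most frequent ones together with the fraction of rows they appear in.

```python
from collections import Counter


def most_common_lexemes(sampled_tsvectors, top_n=100):
    """Return (lexeme, fraction_of_rows) pairs for the top_n lexemes.

    sampled_tsvectors: iterable of collections of lexeme strings, one per
    sampled row (a hypothetical stand-in for real tsvector values).
    """
    rows = 0
    counts = Counter()
    for lexemes in sampled_tsvectors:
        rows += 1
        # A tsvector stores each distinct lexeme once, so count each
        # lexeme at most once per row.
        counts.update(set(lexemes))
    if rows == 0:
        return []
    return [(lex, n / rows) for lex, n in counts.most_common(top_n)]


# Hypothetical sample of four rows.
sample = [
    {"cat", "dog"},
    {"cat", "fish"},
    {"cat"},
    {"dog", "bird"},
]
stats = most_common_lexemes(sample, top_n=2)
```

Keeping only the top_n entries bounds the size of the stored statistics; how many lexemes are worth keeping is a tuning question the sketch does not answer.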

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Oleg Bartunov
On Sat, 8 Mar 2008, Jan Urbański wrote: Unfortunately, selectivity estimation for a query is much more difficult than just estimating the frequency of an individual word. Sure, given something like 'cats & dogs'::tsquery the frequency of 'cat' and 'dog' won't suffice. But at least it's a starting point and

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Oleg Bartunov
On Sat, 8 Mar 2008, Tom Lane wrote: Oleg Bartunov <[EMAIL PROTECTED]> writes: On Sat, 8 Mar 2008, Jan Urbański wrote: I have a feeling that in many cases identifying the top 50 to 300 lexemes would be enough to talk about text search selectivity with a degree of confidence. At least we wouldn't

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Jan Urbański
Oleg Bartunov wrote: On Sat, 8 Mar 2008, Jan Urbański wrote: OK, after reading through some of the code the idea is to write a custom typanalyze function for tsvector columns. It could look inside such function already exists, it's ts_stat(). The problem with ts_stat() is its performance,

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Tom Lane
Oleg Bartunov <[EMAIL PROTECTED]> writes:
> On Sat, 8 Mar 2008, Jan Urbański wrote:
>> I have a feeling that in many cases identifying the top 50 to 300 lexemes
>> would be enough to talk about text search selectivity with a degree of
>> confidence. At least we wouldn't give overly low estimates f

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Oleg Bartunov
On Sat, 8 Mar 2008, Jan Urbański wrote: Oleg Bartunov wrote: Jan, the problem is known and well requested. From your promotion it's not clear what the idea is. Tom Lane wrote: Jan Urbański <[EMAIL PROTECTED]> writes: 2. Implement better selectivity estimates for FTS. OK,

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Jan Urbański
Oleg Bartunov wrote: Jan, the problem is known and well requested. From your promotion it's not clear what the idea is. Tom Lane wrote: Jan Urbański <[EMAIL PROTECTED]> writes: 2. Implement better selectivity estimates for FTS. OK, after reading through some of the co

Re: [HACKERS] Google Summer of Code 2008

2008-03-04 Thread Jan Urbański
Oleg Bartunov wrote: Jan, the problem is known and well requested. From your promotion it's not clear what the idea is. I guess the first approach could be to populate some more columns in pg_statistic for tables with tsvectors. I see there are some statistics already being gathered (pg_stat

Re: [HACKERS] Google Summer of Code 2008

2008-03-04 Thread Oleg Bartunov
Jan, the problem is known and well requested. From your promotion it's not clear what the idea is. Oleg On Tue, 4 Mar 2008, Jan Urbański wrote: Tom Lane wrote: Jan Urbański <[EMAIL PROTECTED]> writes: 2. Implement better selectivity estimates for FTS. +1 for that one ...

Re: [HACKERS] Google Summer of Code 2008

2008-03-04 Thread Josh Berkus
Jan,
> OK, this one might very well be the one that'd be more useful.
Well, you should submit *both* once SoC opens for applications. The mentors will decide which.
-- Josh Berkus PostgreSQL @ Sun San Francisco
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make ch

Re: [HACKERS] Google Summer of Code 2008

2008-03-04 Thread Dave Page
On Tue, Mar 4, 2008 at 4:47 PM, Jan Urbański <[EMAIL PROTECTED]> wrote:
> Tom Lane wrote:
> > Jan Urbański <[EMAIL PROTECTED]> writes:
> >> 2. Implement better selectivity estimates for FTS.
> >
> > +1 for that one ...
>
> OK, this one might very well be the one that'd be m

Re: [HACKERS] Google Summer of Code 2008

2008-03-04 Thread Jan Urbański
Tom Lane wrote: Jan Urbański <[EMAIL PROTECTED]> writes: 2. Implement better selectivity estimates for FTS. +1 for that one ... OK, this one might very well be the one that'd be more useful. And I can always reuse the other idea for my thesis, after expanding it a bit. S

Re: [HACKERS] Google Summer of Code 2008

2008-03-03 Thread Tom Lane
Jan Urbański <[EMAIL PROTECTED]> writes:
> 2. Implement better selectivity estimates for FTS.
+1 for that one ...
regards, tom lane

[HACKERS] Google Summer of Code 2008

2008-03-03 Thread Jan Urbański
Hi PostgreSQL! Although this year's GSoC is just starting, I thought getting in touch a bit earlier would only be of benefit. I study Computer Science at the Faculty of Mathematics, Informatics and Mechanics of Warsaw University. I'm currently in my fourth year of studies. Having chosen Databases fo