Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-14 Thread Stephen Frost
* Qi Huang (huangq...@hotmail.com) wrote: > Thanks, guys, for your hot discussion. I'll explore the ANALYZE command first > and try make SYSTEM sampling work. Other parts, I'll look at them later. That sounds like the right first steps to me. Reviewing this discussion, it was my thought to crea

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-13 Thread Qi Huang
l.com; > neil.con...@gmail.com; dan...@heroku.com; huangq...@hotmail.com; > f...@phlo.org; pgsql-hackers@postgresql.org; sfr...@snowman.net > Subject: Re: [HACKERS] Gsoc2012 idea, tablesample > > On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote: > > On Thu, May 10, 201

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Greg Stark
On Fri, May 11, 2012 at 6:16 PM, Kevin Grittner wrote: >> MaxHeapTuplesPerPage? > > What about dead line pointers without corresponding tuples? Actually we don't allow there to be more than MaxHeapTuplesPerPage line pointers even if some of them are dead line pointers. I think the argument then

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Kevin Grittner
Robert Haas wrote: > On Fri, May 11, 2012 at 11:36 AM, Kevin Grittner > wrote: >> Robert Haas wrote: >>> The trouble is, AFAICS, that you can't bound M very well without >>> scanning the whole table. I mean, it's bounded by theoretical >>> limit, but that's it. >> >> What would the theoretical

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Robert Haas
On Fri, May 11, 2012 at 11:36 AM, Kevin Grittner wrote: > Robert Haas wrote: >> The trouble is, AFAICS, that you can't bound M very well without >> scanning the whole table.  I mean, it's bounded by theoretical >> limit, but that's it. > > What would the theoretical limit be? MaxHeapTuplesPerPag

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Florian Pflug
On May11, 2012, at 16:03 , Kevin Grittner wrote: >> [more complex alternatives] > > I really think your first suggestion covers it perfectly; these more > complex techniques don't seem necessary to me. The point of the more complex techniques (especially the algorithm in my second mail, the "repl

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Kevin Grittner
Tom Lane wrote: > "Kevin Grittner" writes: >> [ uniformly sample the TID space defined as (1..P, 1..M) ] > >> Shouldn't that get us the randomly chosen sample we're looking >> for? Is there a problem you think this ignores? > > Not sure. The issue that I'm wondering about is that the line > n

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Florian Pflug
On May11, 2012, at 17:20 , Robert Haas wrote: > On Fri, May 11, 2012 at 11:04 AM, Kevin Grittner >> Since T cancels out of that equation, we don't need to worry about >> estimating it. Results will be correct for any value of M which is >> *at least* as large as the maximum tuple index in the tabl

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Tom Lane
"Kevin Grittner" writes: > Robert Haas wrote: >> The trouble is, AFAICS, that you can't bound M very well without >> scanning the whole table. I mean, it's bounded by theoretical >> limit, but that's it. > What would the theoretical limit be? (black size - page header size > - minimum size of

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Kevin Grittner
Robert Haas wrote: > The trouble is, AFAICS, that you can't bound M very well without > scanning the whole table. I mean, it's bounded by theoretical > limit, but that's it. What would the theoretical limit be? (black size - page header size - minimum size of one tuple) / item pointer size?

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Tom Lane
"Kevin Grittner" writes: > [ uniformly sample the TID space defined as (1..P, 1..M) ] > Shouldn't that get us the randomly chosen sample we're looking for? > Is there a problem you think this ignores? Not sure. The issue that I'm wondering about is that the line number part of the space is not

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Robert Haas
On Fri, May 11, 2012 at 11:04 AM, Kevin Grittner wrote: > We > will be probing random page numbers 1..P for random tuple indexes > 1..M.  So how many random probes by ctid does that require? The > chance of a hit on each probe is ((T/P)/M) -- the average number of > tuples per page divided by the

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Kevin Grittner
Tom Lane wrote: > If you're willing to accept that the quality of the results > depends on having up-to-date stats, then I'd suggest (1) use the > planner's existing technology to estimate the number of rows in > the table; (2) multiply by sampling factor you want to get a > desired number of sa

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Tom Lane
"Kevin Grittner" writes: > Florian Pflug wrote: >> Maybe one can get rid of these sorts of problems by factoring in >> the expected density of the table beforehand and simply accepting >> that the results will be inaccurate if the statistics are >> outdated? > Unless I'm missing something, I th

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Tom Lane
Florian Pflug writes: > This all hinges on the ability to produce a sufficient accurate estimate of > the > TID density p_tup/p_tid, of course. I think that's the least of its problems. AFAICS this analysis ignores (1) the problem that the TID space is nonuniform, ie we don't know how many tupl

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Kevin Grittner
Florian Pflug wrote: > Maybe one can get rid of these sorts of problems by factoring in > the expected density of the table beforehand and simply accepting > that the results will be inaccurate if the statistics are > outdated? > > One could, for example, simply pick > > N := SamplingPercent

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Sandro Santilli
On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote: > On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner > wrote: > > Ants Aasma wrote: > >> It seems to me that the simplest thing to do would be to lift the > >> sampling done in analyze.c (acquire_sample_rows) and use that to > >> implement

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Florian Pflug
[ Sorry for the self-reply ] On May11, 2012, at 13:17 , Florian Pflug wrote: > Actually, thinking more about this, if the approach sketched above > turns out to work, then one might be able to get away without remembering > previously computed TIDs, thus removing a lot of complexity. > > Say, for

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-11 Thread Florian Pflug
> Florian Pflug wrote: >> On May10, 2012, at 18:36 , Kevin Grittner wrote: >>> Robert Haas wrote: >>> I wonder if you could do this with something akin to the Bitmap Heap Scan machinery. Populate a TID bitmap with a bunch of randomly chosen TIDs, fetch them all in physical order >

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Kevin Grittner
[point splintered in quoting re-joined] Florian Pflug wrote: > On May10, 2012, at 18:36 , Kevin Grittner wrote: >> Robert Haas wrote: >> >>> I wonder if you could do this with something akin to the Bitmap >>> Heap Scan machinery. Populate a TID bitmap with a bunch of >>> randomly chosen TIDs, f

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Florian Pflug
On May10, 2012, at 18:36 , Kevin Grittner wrote: > Robert Haas wrote: > >> I wonder if you could do this with something akin to the Bitmap >> Heap Scan machinery. Populate a TID bitmap with a bunch of >> randomly chosen TIDs, fetch them all in physical order >> and if you don't get as many rows

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Robert Haas
On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner wrote: > Ants Aasma wrote: >> It seems to me that the simplest thing to do would be to lift the >> sampling done in analyze.c (acquire_sample_rows) and use that to >> implement the SYSTEM sampling method. > > Definitely.  I thought we had all agreed

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Kevin Grittner
Ants Aasma wrote: > It seems to me that the simplest thing to do would be to lift the > sampling done in analyze.c (acquire_sample_rows) and use that to > implement the SYSTEM sampling method. Definitely. I thought we had all agreed on that ages ago. -Kevin -- Sent via pgsql-hackers mail

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Ants Aasma
On Thu, May 10, 2012 at 6:33 PM, Robert Haas wrote: > I'm worried this project is getting so complicated that it will be > beyond the ability of a new hacker to get anything useful done.  Can > we simplify the requirements here to something that is reasonable for > a beginner? It seems to me that

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Kevin Grittner
Robert Haas wrote: > I wonder if you could do this with something akin to the Bitmap > Heap Scan machinery. Populate a TID bitmap with a bunch of > randomly chosen TIDs, fetch them all in physical order It would be pretty hard for any other plan to beat that by very much, so it seems like a g

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Robert Haas
On Thu, May 10, 2012 at 10:28 AM, Kevin Grittner wrote: >> One problem I see with this approach is that its efficiency >> depends on the average tuple length, at least with a naive >> approach to random ctid generator. The simplest way to generate >> those randomly without introducing bias is to g

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Kevin Grittner
Florian Pflug wrote: > One problem I see with this approach is that its efficiency > depends on the average tuple length, at least with a naive > approach to random ctid generator. The simplest way to generate > those randomly without introducing bias is to generate a random > page index between

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Florian Pflug
On May10, 2012, at 10:43 , Qi Huang wrote: > 2. use TIDSCAN to directly access tuples. The below way of using ctid > proposed by Kevin looks good. > > -One technique which might be suitably random without reading the > -whole table would be to figure out a maximum block number and tuple > -ID fo

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-05-10 Thread Qi Huang
Hi, AllThanks for your ideas on the implementation of TABLESAMPLE. I have a summary below of the high level requirements from the -hacker thread till now. Please give further comment and if I missed any point, please fell free to add. 1. Build a new type of node, as I should not use SEQSCAN

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-24 Thread Ants Aasma
On Tue, Apr 24, 2012 at 10:31 AM, Sandro Santilli wrote: > On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote: >> On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote: > >> > SELECT (SELECT reservoir_sample(some_table, 50) AS samples >> >    FROM some_table WHERE ctid =~ ANY (rn

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-24 Thread Sandro Santilli
On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote: > On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote: > > SELECT (SELECT reservoir_sample(some_table, 50) AS samples > >FROM some_table WHERE ctid =~ ANY (rnd_pgtids)) > > FROM random_pages('some_table', 50) AS rnd_pgtids

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-23 Thread Sandro Santilli
On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote: > On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli wrote: > > I'd love to see enhanced CTID operators, to fetch all visible tuples in a > > page > > using a tidscan.  Something like: WHERE ctid =~ '(501,*)' or a ctidrange. > > Among oth

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-23 Thread Ants Aasma
On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli wrote: > I'd love to see enhanced CTID operators, to fetch all visible tuples in a page > using a tidscan.  Something like: WHERE ctid =~ '(501,*)' or a ctidrange. Among other things, this would enable user-space implementation of tablesample. Give

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-23 Thread Sandro Santilli
On Sat, Apr 21, 2012 at 02:28:52PM +0800, Qi Huang wrote: > > Hi, Heikki ... > > Another idea that Robert Haas suggested was to add support doing a TID > > scan for a query like "WHERE ctid< '(501,1)'". That's not enough work > > for GSoC project on its own, but could certainly be a part of it.

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-20 Thread Qi Huang
Hi, Heikki > 1. We probably don't want the SQL syntax to be added to the grammar. > This should be written as an extension, using custom functions as the > API, instead of extra SQL syntax. > > 2. It's not very useful if it's just a dummy replacement for "WHERE > random() < ?". It has to be mo

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-19 Thread Sandro Santilli
On Thu, Apr 19, 2012 at 08:47:51AM -0400, Stephen Frost wrote: > * Sandro Santilli (s...@keybit.net) wrote: > > Actually a random sample would really be representative of the data > > distribution. What the type analyzer gets is a sample and that sample > > is what the estimator looks at to answer

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-19 Thread Stephen Frost
* Sandro Santilli (s...@keybit.net) wrote: > Actually a random sample would really be representative of the data > distribution. What the type analyzer gets is a sample and that sample > is what the estimator looks at to answer the question: That might work if all you have is point data, but lines

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-19 Thread Sandro Santilli
On Mon, 16 Apr 2012 23:17:25 -0700, Heikki Linnakangas wrote: > 1. We probably don't want the SQL syntax to be added to the >grammar. This should be written as an extension, using custom >functions as the API, instead of extra SQL syntax. I can't find the discussion about this, have any p

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-19 Thread Sandro Santilli
On Tue, Apr 17, 2012 at 04:29:52PM -0400, Stephen Frost wrote: > Josh, > > * Josh Berkus (j...@agliodbs.com) wrote: > > FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which > > worked with geographic indexes. This would be tremendously useful for > > constructing low-resolution

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang
> Date: Wed, 18 Apr 2012 02:45:09 +0300 > Subject: Re: [HACKERS] Gsoc2012 idea, tablesample > From: a...@cybertec.at > To: cbbro...@gmail.com > CC: sfr...@snowman.net; pgsql-hackers@postgresql.org > > On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne > wrote: >

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Ants Aasma
On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne wrote: > Well, there may be cases where the quality of the sample isn't > terribly important, it just needs to be "reasonable." > > I browsed an article on the SYSTEM/BERNOULLI representations; they > both amount to simple picks of tuples. > > -

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Stephen Frost
Josh, * Josh Berkus (j...@agliodbs.com) wrote: > FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which > worked with geographic indexes. This would be tremendously useful for > constructing low-resolution "zoom out" tiles on maps and similar. I'm familiar with the concept of 'z

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Josh Berkus
Qi, Hackers: FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which worked with geographic indexes. This would be tremendously useful for constructing low-resolution "zoom out" tiles on maps and similar. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Greg Stark
On Tue, Apr 17, 2012 at 5:33 PM, Christopher Browne wrote: > I get the feeling that this is a somewhat-magical feature (in that > users haven't much hope of understanding in what ways the results are > deterministic) that is sufficiently "magical" that anyone serious > about their result sets is l

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Christopher Browne
On Tue, Apr 17, 2012 at 11:27 AM, Stephen Frost wrote: > Qi, > > * Qi Huang (huangq...@hotmail.com) wrote: >> > Doing it 'right' certainly isn't going to be simply taking what Neil did >> > and updating it, and I understand Tom's concerns about having this be >> > more than a hack on seqscan, so I

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Stephen Frost
Qi, * Qi Huang (huangq...@hotmail.com) wrote: > > Doing it 'right' certainly isn't going to be simply taking what Neil did > > and updating it, and I understand Tom's concerns about having this be > > more than a hack on seqscan, so I'm a bit nervous that this would turn > > into something bigger

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang
> > 2. It's not very useful if it's just a dummy replacement for "WHERE > > random() < ?". It has to be more advanced than that. Quality of the > > sample is important, as is performance. There was also an > > interesting idea of on implementing monetary unit sampling. > > In reviewing this, I go

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Greg Stark
On Tue, Apr 17, 2012 at 2:49 PM, Stephen Frost wrote: > * Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote: >> 1. We probably don't want the SQL syntax to be added to the grammar. >> This should be written as an extension, using custom functions as >> the API, instead of extra SQL sy

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Tom Lane
Stephen Frost writes: > * Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote: >> Another idea that Robert Haas suggested was to add support doing a >> TID scan for a query like "WHERE ctid< '(501,1)'". That's not >> enough work for GSoC project on its own, but could certainly be a >>

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Stephen Frost
* Heikki Linnakangas (heikki.linnakan...@enterprisedb.com) wrote: > 1. We probably don't want the SQL syntax to be added to the grammar. > This should be written as an extension, using custom functions as > the API, instead of extra SQL syntax. Err, I missed that, and don't particularly agree with

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Heikki Linnakangas
On 17.04.2012 14:55, Qi Huang wrote: Hi, Heikki Thanks for your advice.I will change my plan accordingly. But I have a few questions. 1. We probably don't want the SQL syntax to be added to the grammar. This should be written as an extension, using custom functions as the API, instead of

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang
Besides, I saw the Gsoc site editing has been closed. Should I just submit through this mailing list with attachment? Best Regards and ThanksHuang Qi VictorComputer Science of National University of Singapore > Date: Tue, 17 Apr 2012 09:16:29 +0300 > From: heikki.linnakan...@enterprisedb.com >

Re: [HACKERS] Gsoc2012 idea, tablesample

2012-04-17 Thread Qi Huang
Hi, Heikki Thanks for your advice.I will change my plan accordingly. But I have a few questions. > 1. We probably don't want the SQL syntax to be added to the grammar. > This should be written as an extension, using custom functions as the > API, instead of extra SQL syntax. > 1. "Th