On 17/01/15 13:46, Amit Kapila wrote:
On Sun, Jan 11, 2015 at 1:29 AM, Petr Jelinek <p...@2ndquadrant.com> wrote:
 >
 >
 > In the second patch, which implements TABLESAMPLE itself, I changed
 > the implementation of the random generator, because when I looked at
 > the code again I realized the old one would produce wrong results if
 > there were multiple TABLESAMPLE statements in the same query or
 > multiple cursors in the same transaction.
 >

I have looked into this patch and would like to share my
findings with you.

Thanks a lot for this.


1.
+static void
+set_tablesample_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
+                             RangeTblEntry *rte)
{
..
+    /*
+     * There is only one plan to consider but we still need to set
+     * parameters for RelOptInfo.
+     */
+    set_cheapest(rel);
}

It seems we already call set_cheapest(rel) in set_rel_pathlist()
which is a caller of set_tablesample_rel_pathlist(), so why do
we need it inside set_tablesample_rel_pathlist()?

Ah, this changed after I started working on this patch and I didn't notice - previously all the set_<something>_rel_pathlist functions called set_cheapest() individually. I will change the code.


2.
+static void
+set_tablesample_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
+                             RangeTblEntry *rte)
{
..
+    /* We only do sample scan if it was requested */
+    add_path(rel, (Path *) create_samplescan_path(root, rel, required_outer));
}

Do we need add_path() if there is only one path?

Good point, we can probably just set the pathlist directly in this case.
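
Combined with point 1, a rough sketch of how I'd revise the function (the required_outer setup is abbreviated here, so take the details as illustrative):

static void
set_tablesample_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
                             RangeTblEntry *rte)
{
    Relids      required_outer;

    /* Sample scan can only be parameterized by LATERAL references. */
    required_outer = rel->lateral_relids;

    /*
     * There is only one possible plan, so install it directly rather
     * than going through add_path(); set_cheapest() is now left to the
     * caller, set_rel_pathlist().
     */
    rel->pathlist = list_make1(create_samplescan_path(root, rel,
                                                      required_outer));
}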


3.
@@ -332,6 +334,11 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
             /* Foreign table */
             set_foreign_pathlist(root, rel, rte);
         }
+        else if (rte->tablesample != NULL)
+        {
+            /* Build sample scan on relation */
+            set_tablesample_rel_pathlist(root, rel, rte);
+        }

Have you considered having a different RTE for this?


I assume you mean a different RTEKind. Yes, I did, but the fact is that it's still a relation, just with a different scan type, and I didn't want to modify every piece of code which deals with RTE_RELATION to also deal with a new RTE for sample scans, as that seems unnecessary. I am not strongly opinionated about this one though.

4.
SampleNext()
a.
Isn't it better to code SampleNext() similarly to SeqNext() and
IndexNext(), which encapsulate the logic to get the tuple in
another function (tbs_next() or something like that)?

Possibly; my thinking was that, unlike index_getnext() and heap_getnext(), this function would not be called from any other place, so adding one more layer of abstraction didn't seem useful. And it would mean creating a new ScanDesc struct, etc.


b.
Also, don't we want to handle pruning of the page while
scanning (heap_page_prune_opt())? And I observed that in the
heap scan APIs, after the visibility check we also check
for serializable conflicts (CheckForSerializableConflictOut()).

Both good points, will add.
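
Roughly, I'd expect the calls to slot into the per-tuple path of SampleNext() like this (a sketch only - buffer locking and the surrounding loop are elided, and the local names are illustrative):

    /* Prune the page before examining tuples on it, as heapam does. */
    heap_page_prune_opt(scan->ss_currentRelation, buffer);

    ...

    /* (the buffer content lock must be held across this test) */
    visible = HeapTupleSatisfiesVisibility(tuple, snapshot, buffer);

    /*
     * Like the heap scan APIs, check for serializable conflicts after
     * the visibility test, whatever its outcome.
     */
    CheckForSerializableConflictOut(visible, scan->ss_currentRelation,
                                    tuple, buffer, snapshot);

    if (!visible)
        continue;       /* dead tuple, try the next offset */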

Another thing: don't we want to do anything about sync scans
for these methods, since they are doing a heap scan?

I don't follow this one tbh.


c.
For the bernoulli method, it first gets the tupoffset with
the help of the sampling function and then applies the visibility
check. It seems that could reduce the sample set way below what
was expected in case a lot of tuples are not visible. Shouldn't it
be done the other way around (first the visibility check, then call
the function to see if we want to include the tuple in the sample)?
I think something similar is done in acquire_sample_rows, which
seems right to me.

I don't think so. The way bernoulli works is that it returns every tuple with equal probability, so the visible tuples have the same chance of being returned as the invisible ones, and the issue should be smoothed away automatically (IMHO).

acquire_sample_rows has a limit on the number of rows it returns, which is why it has to do the visibility check first - the problem you describe applies there.

The reason for using a probability instead of a tuple limit is that there is no way to accurately guess the number of tuples in the table at the beginning of the scan. This method should actually be better at returning the correct proportion of tuples, without depending on how many of them are visible or how much space in the blocks is used.
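
To make that concrete: if a table has 1000 visible and 1000 dead tuples and you sample at 50%, the coin flip is applied to all ~2000 offsets and on average 500 of the visible tuples pass it - exactly 50% of the visible rows - whereas a method with a fixed target row count would fill part of its quota with tuples that later fail the visibility check, and so undershoot unless it checks visibility first.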


5.

CREATE TABLE test_tablesample (id INT, name text) WITH (fillfactor=10);
INSERT INTO test_tablesample SELECT i, repeat(i::text, 200) FROM
generate_series(0, 9) s(i) ORDER BY i;

postgres=# SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (80);
  id
----
   1
   3
   4
   7
   8
   9
(6 rows)


postgres=# SELECT id FROM test_tablesample TABLESAMPLE BERNOULLI (80);
  id
----
   0
   1
   2
   3
   4
   5
   6
   7
   8
   9
(10 rows)

So the above test yields 60% of the rows the first time and 100% the
next time, when the test requested 80%.
I understand that the sample percentage gives only an approximate
number of rows; however, I just wanted us to check whether the
implementation has a problem or not.

I think this is caused by the random generator not producing a smooth enough random distribution on such a small sample (or possibly in general?).
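
To put rough numbers on it: with only 10 tuples, the number of returned rows is approximately Binomial(10, 0.8), so P(exactly 6 rows) = C(10,6) * 0.8^6 * 0.2^4 ≈ 8.8% and P(all 10 rows) = 0.8^10 ≈ 10.7%. Both of the outcomes above are quite plausible for individual runs; the 80% only holds in expectation.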


In this regard, I have one question related to the code below:

+Datum
+tsm_bernoulli_nexttuple(PG_FUNCTION_ARGS)
+{
..
+    /* Every tuple has percent chance of being returned */
+    while (sampler_random_fract(sampler->randstate) > samplesize)
+    {
+        tupoffset++;
+
+        if (tupoffset > maxoffset)
+            break;
+    }

Why are we not considering the tuple in the above code
if tupoffset is less than maxoffset?


We do consider it. I will rename samplesize to probability or something similar to make it clearer what it actually is, and maybe expand the comment above the loop.

What the loop does is consider each offset on a page and do a "coin flip": it generates a random number using sampler_random_fract, and if the number falls within the probability (i.e. is smaller than or equal to samplesize) the tuple will be returned. The
if (tupoffset > maxoffset)
    break;
is there just because we need to give control back to the scan so it can move to another block.
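
Something along these lines, perhaps (just a sketch of how I might reword it, with samplesize renamed to probability as mentioned; not final code):

    /*
     * Do a "coin flip" for each tuple offset: draw a uniform random
     * number and return the tuple if it falls within the requested
     * probability; offsets that lose the flip are skipped.
     */
    while (sampler_random_fract(sampler->randstate) > probability)
    {
        tupoffset++;

        /*
         * Past the last offset on this page - give control back to the
         * scan so it can advance to the next block.
         */
        if (tupoffset > maxoffset)
            break;
    }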

6.
One larger question about the approach used in the patch: why do you
think it is better to have a new node (SampleScan/SamplePath), as
used in the patch, instead of doing it as part of the sequential scan?
I agree that we would need some changes in different parts of the
sequential scan execution, but it still sounds like a viable way. One
simple thing that occurs to me is that in ExecSeqScan() we could use
something like SampleSeqNext instead of SeqNext to scan the heap in a
slightly different way. Doing it as part of the sequential scan would
probably have the advantage that we could use most of the existing
infrastructure (sequence node, path) and have fewer discrepancies, as
mentioned in point 4.


I originally started from SeqScan, but it requires a completely different state structure, it needs to call the sampling methods in various places (not just for the next tuple), and it needs different handling in EXPLAIN and in the optimizer (costing). If we add all this to the sequential scan, it will not be much different from a new scan node (we'd save a couple of new copy functions and one struct, but I'd rather have a clearly distinct scan).

It would also not help with the discrepancies you mentioned, because those are in heapam, and SampleScan would not use that even if it were merged with SeqScan. I don't think we want to implement the sampling at the heapam level - it's too much of a hotspot for it to be a good idea to add extra cycles there.

--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

