I wrote:
> I have a field whose distribution of frequencies of values is 
> roughly geometric, rather than flat.
> Total rows = 36 million
> relpages=504864
> Distinct field values in use = 169
> 10 values account for 50% of the rows.
> 41 values account for 90% of the rows.
> 
> After setting the statistics target to 1000 for that field and 
> analyzing the table, the statistics row for that field had 75 
> most frequent values and a histogram with 77 entries in it, 
> estimating 152 distinct values in total.

"public";"mytable";"myfield";0;4;152;"{202,179,8,181,173,207,6,118,107,205,182,4,54,247,168,77,169,53,120,159,149,174,167,156,148,150,56,108,66,119,5,99,96,175,97,208,1,130,10,102,228,101,121,50,11,152,32,12,78,221,55,244,241,252,203,116,103,184,154,153,238,65,49,220,83,98,111,85,139,242,240,260,7,109,114}";"{0.0836433,0.0781667,0.0738367,0.0598533,0.04629,0.04447,0.0359833,0.0314267,0.0278333,0.0268,0.0251433,0.0244867,0.02438,0.0223433,0.0207567,0.0189667,0.0168833,0.01582,0.0150267,0.0141767,0.0130467,0.0128933,0.0125767,0.0123567,0.0116567,0.0114967,0.01048,0.01037,0.00994667,0.00987667,0.00977667,0.00965333,0.00916333,0.00828667,0.00732667,0.00712,0.00629,0.00624,0.00576667,0.00558667,0.00477667,0.00475333,0.00410333,0.00405667,0.00371667,0.00334667,0.00334,0.00312667,0.00312667,0.00302,0.00300333,0.00295,0.00287333,0.00271,0.00267,0.00240667,0.00224,0.00221333,0.00215333,0.0021,0.00205667,0.00202667,0.00197333,0.00197333,0.00168667,0.00166,0.00159333,0.00159,0.00154667,0.00150333,0.00149,0.00133333,0.00132,0.00112667,0.00104}";"{2,9,9,9,67,76,84,84,86,87,87,88,95,100,100,100,104,105,105,110,112,112,128,137,137,138,143,144,144,144,151,155,155,155,157,157,158,171,171,183,185,185,185,185,187,194,199,199,200,200,201,204,204,209,209,214,214,214,214,215,217,225,225,225,229,239,239,249,250,250,253,253,255,257,261,262,266}";0.449246

> My problem is frequent over-estimation of rows when restricting by 
> this field with values not known at plan time.

examples:
select * from mytable where myfield = ?;
select * from mytable where myfield in (subquery);
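
(One way to see the first form's estimate with the value unknown at plan 
time is a prepared statement, for example

prepare q(int) as select * from mytable where myfield = $1;
explain execute q(1);

where the statement name and parameter type are just for illustration; the 
plan is built before the actual value is supplied, so it falls back to the 
average per-value estimate.)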

The arithmetic mean of my per-value row counts is 214200.
The geometric mean is 13444.

However, ANALYZE didn't find all my values and thinks there are only 152 of 
them, so the planner uses an arithmetic mean of 36182976 / 152, about 238046.
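
For reference, both means can be computed straight from the table with 
something like this (untested sketch; exp(avg(ln(cnt))) is the geometric 
mean of the per-value counts):

select avg(cnt)          as arithmetic_mean,
       exp(avg(ln(cnt))) as geometric_mean
from (select count(*) as cnt
      from mytable
      group by myfield) as per_value;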


When the subquery is estimated to return three myfield values, the query is 
estimated to return 3 * 238046 = 714138 rows, and the planner chooses a 
sequential scan over mytable (myfield is indexed).

explain select * from mytable where myfield in (values (1),(2),(3));

Hash IN Join  (cost=0.08..1009521.37 rows=714138 width=86)
  Hash Cond: (mytable.myfield = "*VALUES*".column1)
  ->  Seq Scan on mytable  (cost=0.00..866693.76 rows=36182976 width=86)
  ->  Hash  (cost=0.04..0.04 rows=3 width=4)
        ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=4)

I think this query is much more likely to return around 40000 rows 
(3 * 13444), and a Bitmap Index Scan should be used.

explain select * from mytable where myfield in (values (1),(2));

Nested Loop  (cost=4445.11..931383.93 rows=476092 width=86)
  ->  Unique  (cost=0.04..0.04 rows=2 width=4)
        ->  Sort  (cost=0.04..0.04 rows=2 width=4)
              Sort Key: "*VALUES*".column1
              ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=4)
  ->  Bitmap Heap Scan on mytable  (cost=4445.08..462716.37 rows=238046 width=86)
        Recheck Cond: (mytable.myfield = "*VALUES*".column1)
        ->  Bitmap Index Scan on myindex  (cost=0.00..4385.56 rows=238046 width=0)
              Index Cond: (mytable.myfield = "*VALUES*".column1)

The planner's choice follows from its estimates: the expected number of loops 
through the Bitmap Heap Scan (2 here, 3 above) times 462716.37 exceeds 
1009521.37 for three values but not for two. The per-loop cost estimate is 
far too high in the general case, though: adjusted for my expectation of 
13444 rows per loop rather than 238046, it should be closer to 26000 per 
loop. At that cost the subquery would need to return close to 40 myfield 
values before a sequential scan became the cheaper choice.
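
(To check the planner's own costing of the alternative for the three-value 
case, sequential scans can be discouraged for the session, e.g.

set enable_seqscan = off;
explain select * from mytable where myfield in (values (1),(2),(3));
reset enable_seqscan;

which should show the bitmap plan and what it is costed at.)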

Is there any facility already in PostgreSQL to help me here?

Hopefully an index type that I don't know about yet? (Geometric distributions 
are similar to those found in word count distributions).

If not... is there any merit in this idea:

During ANALYZE, calculate the geometric mean of the sampled per-value row 
counts and, if it is significantly different from the arithmetic mean, store 
it in a new stats column. When estimating the number of rows returned by 
queries of the form shown above, use the stored geometric mean, if present, 
instead of the arithmetic mean.
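
In SQL terms (the real work would of course happen in analyze.c, over the 
rows ANALYZE samples rather than the whole table), the "significantly 
different" test could be as simple as a ratio check over the same per-value 
counts as in the earlier query:

-- the factor of 2 is an arbitrary placeholder for "significantly different"
select avg(cnt) > 2 * exp(avg(ln(cnt))) as worth_storing
from (select count(*) as cnt
      from mytable
      group by myfield) as per_value;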

Regards,
Stephen Denne.
