I wrote:
> I have a field whose distribution of frequencies of values is
> roughly geometric, rather than flat.
> Total rows = 36 million
> relpages=504864
> Distinct field values in use = 169
> 10 values account for 50% of the rows.
> 41 values account for 90% of the rows.
>
> After setting statistics target to 1000 for that field, and
> analyzing the table, the statistics row for that field had 75
> most frequent values and a histogram with 77 entries in it.
ANALYZE estimated 152 distinct values in total. Here is the statistics row for the field:
"public";"mytable";"myfield";0;4;152;"{202,179,8,181,173,207,6,118,107,205,182,4,54,247,168,77,169,53,120,159,149,174,167,156,148,150,56,108,66,119,5,99,96,175,97,208,1,130,10,102,228,101,121,50,11,152,32,12,78,221,55,244,241,252,203,116,103,184,154,153,238,65,49,220,83,98,111,85,139,242,240,260,7,109,114}";"{0.0836433,0.0781667,0.0738367,0.0598533,0.04629,0.04447,0.0359833,0.0314267,0.0278333,0.0268,0.0251433,0.0244867,0.02438,0.0223433,0.0207567,0.0189667,0.0168833,0.01582,0.0150267,0.0141767,0.0130467,0.0128933,0.0125767,0.0123567,0.0116567,0.0114967,0.01048,0.01037,0.00994667,0.00987667,0.00977667,0.00965333,0.00916333,0.00828667,0.00732667,0.00712,0.00629,0.00624,0.00576667,0.00558667,0.00477667,0.00475333,0.00410333,0.00405667,0.00371667,0.00334667,0.00334,0.00312667,0.00312667,0.00302,0.00300333,0.00295,0.00287333,0.00271,0.00267,0.00240667,0.00224,0.00221333,0.00215333,0.0021,0.00205667,0.00202667,0.00197333,0.00197333,0.00168667,0.00166,0.00159333,0.00159,0.00154667,0.00150333,0.00149,0.00133333,0.00132,0.00112667,0.00104}";"{2,9,9,9,67,76,84,84,86,87,87,88,95,100,100,100,104,105,105,110,112,112,128,137,137,138,143,144,144,144,151,155,155,155,157,157,158,171,171,183,185,185,185,185,187,194,199,199,200,200,201,204,204,209,209,214,214,214,214,215,217,225,225,225,229,239,239,249,250,250,253,253,255,257,261,262,266}";0.449246
My problem is frequent over-estimation of rows when restricting by this field with values not known at plan time. Examples:
select * from mytable where myfield = ?;
select * from mytable where myfield in (subquery);
The arithmetic mean of the per-value frequencies is 214200 rows; the geometric mean is 13444. However, ANALYZE didn't find all of the values and thinks there are only 152 of them, so the planner uses a mean of 238046.
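For reference, something along these lines computes both means directly from the table (only a sketch of what I mean by the two means; the alias names are arbitrary):

select avg(cnt)          as arithmetic_mean_rows_per_value,
       exp(avg(ln(cnt))) as geometric_mean_rows_per_value
from (select myfield, count(*) as cnt
      from mytable
      group by myfield) as per_value;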
When the subquery is estimated to return three myfield values, the planner estimates 714138 rows and chooses a sequential scan over mytable (myfield is indexed).
explain select * from mytable where myfield in (values (1),(2),(3));
Hash IN Join  (cost=0.08..1009521.37 rows=714138 width=86)
  Hash Cond: (mytable.myfield = "*VALUES*".column1)
  ->  Seq Scan on mytable  (cost=0.00..866693.76 rows=36182976 width=86)
  ->  Hash  (cost=0.04..0.04 rows=3 width=4)
        ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 width=4)
I think this query is much more likely to return around 40000 rows, and a
Bitmap Index Scan should be used.
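One way to check what plan would be chosen for the three-value case without the sequential scan on offer (purely as an experiment, not a fix) is something like:

set enable_seqscan = off;
explain select * from mytable where myfield in (values (1),(2),(3));
reset enable_seqscan;

With only two values the planner already picks the bitmap plan by itself: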
explain select * from mytable where myfield in (values (1),(2));
Nested Loop  (cost=4445.11..931383.93 rows=476092 width=86)
  ->  Unique  (cost=0.04..0.04 rows=2 width=4)
        ->  Sort  (cost=0.04..0.04 rows=2 width=4)
              Sort Key: "*VALUES*".column1
              ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=4)
  ->  Bitmap Heap Scan on mytable  (cost=4445.08..462716.37 rows=238046 width=86)
        Recheck Cond: (mytable.myfield = "*VALUES*".column1)
        ->  Bitmap Index Scan on myindex  (cost=0.00..4385.56 rows=238046 width=0)
              Index Cond: (mytable.myfield = "*VALUES*".column1)
Once the expected number of loops through the Bitmap Heap Scan (2 here, 3 above) times 462716.37 exceeds 1009521.37, the planner switches to the sequential scan; but that per-loop cost estimate is far too high in the general case. Adjusted for my expectation of 13444 rows per loop rather than 238046, it should be closer to 26000 per loop. On that basis the subquery should need to return close to 40 myfield values before a sequential scan becomes the better choice.
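Working through the numbers (taking my 13444 rows-per-value figure as the assumption):

  adjusted cost per loop:  462716.37 * 13444 / 238046  ~= 26000
  break-even loop count:   1009521.37 / 26000          ~= 39

which is roughly where I'd expect the planner to switch to the sequential scan.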
Is there any facility already in PostgreSQL to help me here?
Hopefully an index type that I don't know about yet? (Geometric distributions
are similar to those found in word count distributions).
If not... is there any merit in this idea: during ANALYZE, calculate the geometric mean of the sampled per-value frequencies and, if it is significantly different from the arithmetic mean, store it in a new stats column. When estimating the number of rows returned by queries of the form shown above, use the stored geometric mean instead of the arithmetic mean whenever it is available.
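As a rough sketch of the effect on the example above (using my figures; the planner would of course derive its own from the sample):

  current estimate:   3 values * 238046 rows/value = 714138 rows
  proposed estimate:  3 values * 13444 rows/value  =  40332 rows

which is much closer to the roughly 40000 rows I actually expect.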
Regards,
Stephen Denne.