Re: [GENERAL] How to specify/mock the statistic data of tables in PostgreSQL

2014-01-13 Thread Tom Lane
=?GB2312?B?RmVsaXgu0Ow=?= ygnhz...@gmail.com writes:
 //line 194 : In a most common values slot, staop is the OID of the =
 operator used to decide whether values are the same or not.
 //line 206 : A histogram slot describes the distribution of scalar data.
  staop is the OID of the  operator that describes the sort ordering.

 I don't understand the function of staop here, how is it used in optimizer,

In principle a data type could have more than one sort ordering, and if
we were to collect stats according to multiple orderings, staop would be
needed to identify which ordering a particular set of statistics was
created with.  That flexibility isn't being used right now, at least not
by any built-in code.  There are types with more than one ordering (more
than one btree opclass), but ANALYZE only collects stats for the default
btree opclass.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to specify/mock the statistic data of tables in PostgreSQL

2014-01-13 Thread ygnhzeus
I see, thanks, so columns of staop* are not currently used by the planner by 
default, right? 
The type of staop is oid, which table is related to it?

2014-01-13



ygnhzeus



发件人:Tom Lane t...@sss.pgh.pa.us
发送时间:2014-01-13 23:21
主题:Re: [GENERAL] How to specify/mock the statistic data of tables in PostgreSQL
收件人:Felix.徐ygnhz...@gmail.com
抄送:Amit Langoteamitlangot...@gmail.com,Atri 
Sharmaatri.j...@gmail.com,pgsql-generalpgsql-general@postgresql.org

=?GB2312?B?RmVsaXgu0Ow=?= ygnhz...@gmail.com writes: 
 //line 194 : In a most common values slot, staop is the OID of the = 
 operator used to decide whether values are the same or not. 
 //line 206 : A histogram slot describes the distribution of scalar data. 
  staop is the OID of the  operator that describes the sort ordering. 

 I don't understand the function of staop here, how is it used in optimizer, 

In principle a data type could have more than one sort ordering, and if 
we were to collect stats according to multiple orderings, staop would be 
needed to identify which ordering a particular set of statistics was 
created with.  That flexibility isn't being used right now, at least not 
by any built-in code.  There are types with more than one ordering (more 
than one btree opclass), but ANALYZE only collects stats for the default 
btree opclass. 

regards, tom lane 

Re: [GENERAL] How to specify/mock the statistic data of tables in PostgreSQL

2014-01-12 Thread Felix . 徐
I see, thanks.

I'm looking into the source code of statistic part now, and I'm a little
confused about the column staop presented in table pg_statistic,
in the pg_statisitc.h, the comment says:

/* 
 * To allow keeping statistics on different kinds of datatypes,
 * we do not hard-wire any particular meaning for the remaining
 * statistical fields. Instead, we provide several slots in which
 * statistical data can be placed. Each slot includes:
 * kind integer code identifying kind of data (see below)
 * op OID of associated operator, if needed
 * numbers float4 array (for statistical values)
 * values anyarray (for representations of data values)
 * The ID and operator fields are never NULL; they are zeroes in an
 * unused slot.  The numbers and values fields are NULL in an unused
 * slot, and might also be NULL in a used slot if the slot kind has
 * no need for one or the other.
 * 
 */
And,
//line 194 : In a most common values slot, staop is the OID of the =
operator used to decide whether values are the same or not.
//line 206 : A histogram slot describes the distribution of scalar data.
 staop is the OID of the  operator that describes the sort ordering.


I don't understand the function of staop here, how is it used in optimizer,
is there any example ? thanks!



2014/1/10 Amit Langote amitlangot...@gmail.com

 On Fri, Jan 10, 2014 at 11:19 PM, Atri Sharma atri.j...@gmail.com wrote:
 
 
  Sent from my iPad
 
  On 10-Jan-2014, at 19:42, ygnhzeus ygnhz...@gmail.com wrote:
 
  Thanks for your reply.
  So correlation is not related to the calculation of selectivity right?
 If I
  force PostgreSQL not to optimize the join order (by setting
  join_collapse_limit and from_collapse_limit  to 1) , is there any other
  factor that may affect the structure of execution plan regardless of the
  data access method.
 
  2014-01-10
  
  ygnhzeus
  
  发件人:Amit Langote amitlangot...@gmail.com
  发送时间:2014-01-10 22:00
  主题:Re: [GENERAL] How to specify/mock the statistic data of tables in
  PostgreSQL
  收件人:ygnhzeusygnhz...@gmail.com
  抄送:pgsql-generalpgsql-general@postgresql.org
 
 
 
  AFAIK, correlation is involved in calculation of the costs that are used
 for
  deciding the type of access.If the correlation is low, index scan can
 lead
  to quite some random reads, hence leading to higher costs.
 

 Ah, I forgot to mention this point about how planner uses correlation
 for access method selection.

 And selectivity is a function of statistical distribution of column
 values described in pg_statistic by histograms, most common values
 (with their occurrence frequencies), number of distinct values, etc.
 It has nothing to do with correlation.

 --
 Amit Langote



[GENERAL] How to specify/mock the statistic data of tables in PostgreSQL

2014-01-10 Thread ygnhzeus
Hi all,
I want to use PostgreSQL to help me calculate the cardinality/selectivity of 
some queries, but I do not want to insert any data into these tables(since the 
data size is huge) to PostgreSQL. So I plan to calculate the statistic data by 
myself (not in PostgreSQL) and manually specify the metrics (maybe by modifying 
pg_statistic table) in PostgreSQL, thus PG's optimizer may use these statistic 
to evaluate the query (Explain...). Here comes the problem:
 
1. Is it possible to do what I've described above?
2. I've took a look at the pg_statistic table and pg_stats view, in the view I 
saw that most_common_elems/most_common_elem_freqs/elem_count_histogram were 
empty, and I'm also a little confused about the column called correlation. Is 
there any detailed document about how these metrics are calculated in 
PostgreSQL?

Thanks!
 

Re: [GENERAL] How to specify/mock the statistic data of tables in PostgreSQL

2014-01-10 Thread Amit Langote
On Fri, Jan 10, 2014 at 6:00 PM, ygnhzeus ygnhz...@gmail.com wrote:
 Hi all,

 I want to use PostgreSQL to help me calculate the cardinality/selectivity of
 some queries, but I do not want to insert any data into these tables(since
 the data size is huge) to PostgreSQL. So I plan to calculate the statistic
 data by myself (not in PostgreSQL) and manually specify the metrics (maybe
 by modifying pg_statistic table) in PostgreSQL, thus PG's optimizer may use
 these statistic to evaluate the query (Explain...). Here comes the problem:



 1. Is it possible to do what I've described above?

 2. I've took a look at the pg_statistic table and pg_stats view, in the view
 I saw that most_common_elems/most_common_elem_freqs/elem_count_histogram
 were empty, and I'm also a little confused about the column called
 correlation. Is there any detailed document about how these metrics are
 calculated in PostgreSQL?



About correlation:
As you might know index on some column imparts a logical ordering (for
example, ascending) to table rows based on that column, but remember,
actual rows are not stored in the same physical order in the relation
file as the logical order. So, there's a random disk access penalty
when fetching individual rows from the heap (for example, range scans
that use index). correlation denotes how close these two orderings
are to each other.

A command called CLUSTER can be used to physically reorder a table's
rows to match the logical ordering imposed by some index on that
table. More about CLUSTER here:

http://www.postgresql.org/docs/9.3/static/sql-cluster.html

Consider following example,

postgres=# create table test as select generate_series(1,100) as a
order by random();
SELECT 100

postgres=# create index test_idx on test using a;
postgres=# create index test_idx on test using btree (a);
CREATE INDEX

postgres=# analyze test;
ANALYZE

postgres=# select correlation from pg_stats where tablename = 'test';
 correlation
-
 -0.00164016
(1 row)

postgres=# select count(*) from test where a between 34000 and 68000;
 count
---
 34001
(1 row)

Time: 26.875 ms

Note here that the correlation is pretty close to zero meaning
physical ordering of rows is different than logical ordering imposed
by the index.

postgres=# cluster test using test_idx;
CLUSTER

This should put rows of the table into the same order as the index.

postgres=# analyze test;
ANALYZE

postgres=# select correlation from pg_stats where tablename = 'test';
 correlation
-
   1
(1 row)

postgres=# select count(*) from test where a between 34000 and 68000;
 count
---
 34001
(1 row)

Time: 12.990 ms

Note here that now rows of the table are in almost same physical order
as its index thus reducing random disk accesses. Note how after
CLUSTER, time for same query reduces to half the time of original
unclustered case. This is due to reduced random disk access.

As to how the pg_stats statistics are used by the planner for row
estimation is described here:

http://www.postgresql.org/docs/9.3/static/row-estimation-examples.html

However, to understand how they are generated by ANALYZE (in most
cases, using random sampling), I guess you'd need to go through its
code in the source file src/backend/commands/analyze.c.

--
Amit Langote


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to specify/mock the statistic data of tables in PostgreSQL

2014-01-10 Thread ygnhzeus
Thanks for your reply.
So correlation is not related to the calculation of selectivity right? If I 
force PostgreSQL not to optimize the join order (by setting join_collapse_limit 
and from_collapse_limit  to 1) , is there any other factor that may affect the 
structure of execution plan regardless of the data access method.

2014-01-10



ygnhzeus



发件人:Amit Langote amitlangot...@gmail.com
发送时间:2014-01-10 22:00
主题:Re: [GENERAL] How to specify/mock the statistic data of tables in PostgreSQL
收件人:ygnhzeusygnhz...@gmail.com
抄送:pgsql-generalpgsql-general@postgresql.org

On Fri, Jan 10, 2014 at 6:00 PM, ygnhzeus ygnhz...@gmail.com wrote: 
 Hi all, 
 
 I want to use PostgreSQL to help me calculate the cardinality/selectivity of 
 some queries, but I do not want to insert any data into these tables(since 
 the data size is huge) to PostgreSQL. So I plan to calculate the statistic 
 data by myself (not in PostgreSQL) and manually specify the metrics (maybe 
 by modifying pg_statistic table) in PostgreSQL, thus PG's optimizer may use 
 these statistic to evaluate the query (Explain...). Here comes the problem: 
 
 
 
 1. Is it possible to do what I've described above? 
 
 2. I've took a look at the pg_statistic table and pg_stats view, in the view 
 I saw that most_common_elems/most_common_elem_freqs/elem_count_histogram 
 were empty, and I'm also a little confused about the column called 
 correlation. Is there any detailed document about how these metrics are 
 calculated in PostgreSQL? 
 
 

About correlation: 
As you might know index on some column imparts a logical ordering (for 
example, ascending) to table rows based on that column, but remember, 
actual rows are not stored in the same physical order in the relation 
file as the logical order. So, there's a random disk access penalty 
when fetching individual rows from the heap (for example, range scans 
that use index). correlation denotes how close these two orderings 
are to each other. 

A command called CLUSTER can be used to physically reorder a table's 
rows to match the logical ordering imposed by some index on that 
table. More about CLUSTER here: 

http://www.postgresql.org/docs/9.3/static/sql-cluster.html 

Consider following example, 

postgres=# create table test as select generate_series(1,100) as a 
order by random(); 
SELECT 100 

postgres=# create index test_idx on test using a; 
postgres=# create index test_idx on test using btree (a); 
CREATE INDEX 

postgres=# analyze test; 
ANALYZE 

postgres=# select correlation from pg_stats where tablename = 'test'; 
 correlation 
- 
 -0.00164016 
(1 row) 

postgres=# select count(*) from test where a between 34000 and 68000; 
 count 
--- 
 34001 
(1 row) 

Time: 26.875 ms 

Note here that the correlation is pretty close to zero meaning 
physical ordering of rows is different than logical ordering imposed 
by the index. 

postgres=# cluster test using test_idx; 
CLUSTER 

This should put rows of the table into the same order as the index. 

postgres=# analyze test; 
ANALYZE 

postgres=# select correlation from pg_stats where tablename = 'test'; 
 correlation 
- 
   1 
(1 row) 

postgres=# select count(*) from test where a between 34000 and 68000; 
 count 
--- 
 34001 
(1 row) 

Time: 12.990 ms 

Note here that now rows of the table are in almost same physical order 
as its index thus reducing random disk accesses. Note how after 
CLUSTER, time for same query reduces to half the time of original 
unclustered case. This is due to reduced random disk access. 

As to how the pg_stats statistics are used by the planner for row 
estimation is described here: 

http://www.postgresql.org/docs/9.3/static/row-estimation-examples.html 

However, to understand how they are generated by ANALYZE (in most 
cases, using random sampling), I guess you'd need to go through its 
code in the source file src/backend/commands/analyze.c. 

-- 
Amit Langote 

Re: [GENERAL] How to specify/mock the statistic data of tables in PostgreSQL

2014-01-10 Thread Atri Sharma


Sent from my iPad

 On 10-Jan-2014, at 19:42, ygnhzeus ygnhz...@gmail.com wrote:
 
 Thanks for your reply.
 So correlation is not related to the calculation of selectivity right? If I 
 force PostgreSQL not to optimize the join order (by setting 
 join_collapse_limit and from_collapse_limit  to 1) , is there any other 
 factor that may affect the structure of execution plan regardless of the data 
 access method.
  
 2014-01-10
 ygnhzeus
 发件人:Amit Langote amitlangot...@gmail.com
 发送时间:2014-01-10 22:00
 主题:Re: [GENERAL] How to specify/mock the statistic data of tables in 
 PostgreSQL
 收件人:ygnhzeusygnhz...@gmail.com
 抄送:pgsql-generalpgsql-general@postgresql.org
  
 

AFAIK, correlation is involved in calculation of the costs that are used for 
deciding the type of access.If the correlation is low, index scan can lead to 
quite some random reads, hence leading to higher costs.

Regards,

Atri

Re: [GENERAL] How to specify/mock the statistic data of tables in PostgreSQL

2014-01-10 Thread Amit Langote
On Fri, Jan 10, 2014 at 11:19 PM, Atri Sharma atri.j...@gmail.com wrote:


 Sent from my iPad

 On 10-Jan-2014, at 19:42, ygnhzeus ygnhz...@gmail.com wrote:

 Thanks for your reply.
 So correlation is not related to the calculation of selectivity right? If I
 force PostgreSQL not to optimize the join order (by setting
 join_collapse_limit and from_collapse_limit  to 1) , is there any other
 factor that may affect the structure of execution plan regardless of the
 data access method.

 2014-01-10
 
 ygnhzeus
 
 发件人:Amit Langote amitlangot...@gmail.com
 发送时间:2014-01-10 22:00
 主题:Re: [GENERAL] How to specify/mock the statistic data of tables in
 PostgreSQL
 收件人:ygnhzeusygnhz...@gmail.com
 抄送:pgsql-generalpgsql-general@postgresql.org



 AFAIK, correlation is involved in calculation of the costs that are used for
 deciding the type of access.If the correlation is low, index scan can lead
 to quite some random reads, hence leading to higher costs.


Ah, I forgot to mention this point about how planner uses correlation
for access method selection.

And selectivity is a function of statistical distribution of column
values described in pg_statistic by histograms, most common values
(with their occurrence frequencies), number of distinct values, etc.
It has nothing to do with correlation.

--
Amit Langote


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general