Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2015-03-19 Thread Bruce Momjian
On Sat, Oct 18, 2014 at 02:20:45PM -0400, Bruce Momjian wrote:
 On Sat, Oct 18, 2014 at 06:15:03PM +0200, Marko Tiikkaja wrote:
  On 10/18/14, 5:46 PM, Tom Lane wrote:
  Marko Tiikkaja ma...@joh.to writes:
  Yes, exactly; if I had had the option to disable the index from the
  optimizer's point of view, I'd have seen that it's not used for looking
  up any data by any queries, and thus I would have known that I can
  safely drop it without slowing down queries.  Which was the only thing I
  cared about, and where the stats we provide failed me.
  
  This argument is *utterly* wrongheaded, because it assumes that the
  planner's use of the index provided no benefit to your queries.  If the
  planner was touching the index at all then it was planning queries in
  which knowledge of the extremal value was relevant to accurate selectivity
  estimation.  So it's quite likely that without the index you'd have gotten
  different and inferior plans, whether or not those plans actually chose to
  use the index.
  
  Maybe.  But at the same time that's a big problem: there's no way of
  knowing whether the index is actually useful or not when it's used
  only by the query planner.
 
 That is a good point.  Without an index, the executor is going to do a
 sequential scan, while a missing index to the optimizer just means worse
 statistics.

I have applied the attached patch to document that the optimizer can
increase the index usage statistics.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
new file mode 100644
index afcfb89..71d06ce
*** a/doc/src/sgml/monitoring.sgml
--- b/doc/src/sgml/monitoring.sgml
*** postgres   27093  0.0  0.0  30096  2752
*** 1382,1389 
/para
  
para
!Indexes can be used via either simple index scans or quotebitmap/
!index scans.  In a bitmap scan
 the output of several indexes can be combined via AND or OR rules,
 so it is difficult to associate individual heap row fetches
 with specific indexes when a bitmap scan is used.  Therefore, a bitmap
--- 1382,1389 
/para
  
para
!Indexes can be used by simple index scans, quotebitmap/ index scans,
!and the optimizer.  In a bitmap scan
 the output of several indexes can be combined via AND or OR rules,
 so it is difficult to associate individual heap row fetches
 with specific indexes when a bitmap scan is used.  Therefore, a bitmap
*** postgres   27093  0.0  0.0  30096  2752
*** 1393,1398 
--- 1393,1401 
 structnamepg_stat_all_tables/.structfieldidx_tup_fetch/
 count for the table, but it does not affect
 structnamepg_stat_all_indexes/.structfieldidx_tup_fetch/.
+The optimizer also accesses indexes to check for supplied constants
+whose values are outside the recorded range of the optimizer statistics
+because the optimizer statistics might be stale.
/para
  
note

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-20 Thread Jim Nasby

On 10/18/14, 8:58 AM, Bruce Momjian wrote:

On Fri, Oct 17, 2014 at 11:03:04PM -0400, Tom Lane wrote:

Bruce Momjian br...@momjian.us writes:

On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:

Those stats were perfectly valid: what the planner is looking for is
accurate minimum and maximum values for the index's leading column, and
that's what it got.  You're correct that a narrower index could have given
the same results with a smaller disk footprint, but the planner got the
results it needed from the index you provided for it to work with.



Uh, why is the optimizer looking at the index on a,b,c and not just the
stats on column a, for example?  I am missing something here.


Because it needs up-to-date min/max values in order to avoid being
seriously misled about selectivities of values near the endpoints.
See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.


Oh, I had forgotten we did that.  It is confusing that there is no way
via EXPLAIN to see the access, making the method of consulting pg_stat_*
and using EXPLAIN unreliable.  Should we document this somewhere?


I think we should. The common (mis)conception is that pg_stats shows 
*user-driven* access, not access because of stuff the system is doing.

This is actually a huge problem for anyone who's trying to figure out how 
useful indexes are; they see usage and thing they have queries that are using 
the index when in reality they don't.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-19 Thread Tom Lane
Jim Nasby jim.na...@bluetreble.com writes:
 On 10/17/14, 10:16 PM, Tom Lane wrote:
 BTW, on re-reading that code I notice that it will happily seize upon
 the first suitable index (first in OID order), regardless of how many
 lower-order columns that index has got.  This doesn't make any difference
 I think for get_actual_variable_range's own purposes, because it's only
 expecting to touch the endmost index page regardless.  However, in light
 of Marko's complaint maybe we should teach it to check all the indexes
 and prefer the matching one with fewest columns?

 The real cost here isn't the number of columns, it's the size of the index, 
 no? So shouldn't we look at relpages instead? For example, you'd certainly 
 want to use an index on (field_we_care_about, smallint_field) over an index 
 on (field_we_care_about, big_honking_text_field).

Yeah, perhaps.  I'd been wondering about adding a tie-breaking rule,
but that's a much simpler way to think about it.  OTOH, that approach
could result in some instability in the choice of index: if you've got
both (field_we_care_about, some_int_field) and (field_we_care_about,
some_other_int_field) then it might switch between choosing those two
indexes from day to day depending on basically-chance issues like when
page splits occur.  That would probably annoy Marko even more than the
current behavior :-(, because it would scatter the planner's usage
across multiple indexes for no very good reason.

The coding I'd been imagining at first would basically break ties in
column count according to index OID order, so its choices would be stable
as long as you did not add/drop indexes.  That seems like a good property
to try to preserve.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-19 Thread Jim Nasby

On 10/19/14, 2:09 PM, Tom Lane wrote:

Yeah, perhaps.  I'd been wondering about adding a tie-breaking rule,
but that's a much simpler way to think about it.  OTOH, that approach
could result in some instability in the choice of index: if you've got
both (field_we_care_about, some_int_field) and (field_we_care_about,
some_other_int_field) then it might switch between choosing those two
indexes from day to day depending on basically-chance issues like when
page splits occur.  That would probably annoy Marko even more than the
current behavior:-(, because it would scatter the planner's usage
across multiple indexes for no very good reason.

The coding I'd been imagining at first would basically break ties in
column count according to index OID order, so its choices would be stable
as long as you did not add/drop indexes.  That seems like a good property
to try to preserve.


Maybe a good alternative is:

ORDER BY int( table.reltuples / index.relpages / BLKSZ ) DESC, oid

By comparing on average tuple size throwing away the fraction presumably we'd 
throw away noise from page splits too.

We'd want to use table.reltuples for consistency sake, though theoretically in 
this case I'd think it should be the same for indexes we care about...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-18 Thread Bruce Momjian
On Fri, Oct 17, 2014 at 11:03:04PM -0400, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:
  Those stats were perfectly valid: what the planner is looking for is
  accurate minimum and maximum values for the index's leading column, and
  that's what it got.  You're correct that a narrower index could have given
  the same results with a smaller disk footprint, but the planner got the
  results it needed from the index you provided for it to work with.
 
  Uh, why is the optimizer looking at the index on a,b,c and not just the
  stats on column a, for example?  I am missing something here.
 
 Because it needs up-to-date min/max values in order to avoid being
 seriously misled about selectivities of values near the endpoints.
 See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.

Oh, I had forgotten we did that.  It is confusing that there is no way
via EXPLAIN to see the access, making the method of consulting pg_stat_*
and using EXPLAIN unreliable.  Should we document this somewhere?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-18 Thread Marko Tiikkaja

On 10/18/14, 5:16 AM, Tom Lane wrote:

BTW, on re-reading that code I notice that it will happily seize upon
the first suitable index (first in OID order), regardless of how many
lower-order columns that index has got.  This doesn't make any difference
I think for get_actual_variable_range's own purposes, because it's only
expecting to touch the endmost index page regardless.  However, in light
of Marko's complaint maybe we should teach it to check all the indexes
and prefer the matching one with fewest columns?  It would only take a
couple extra lines of code, and probably not that many added cycles
considering we're going to do an index access of some sort.  But I'm
not sure if it's worth any extra effort --- I think in his example
case, there wasn't any narrower index anyway.


Perhaps accidentally this would have helped in my case, actually, since 
I could have created a new, smaller index CONCURRENTLY and then seen 
that the usage of the other index stopped increasing.  With the pick 
the smallest OID behaviour that was not possible.  Another idea had was 
some way to tell the optimizer not to use that particular index for 
stats lookups, but probably the use case for such a feature would be a 
bit narrow.


All that said, I don't think my struggles justify the change you 
described above.  Not sure if it's a good idea or not.



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-18 Thread Bruce Momjian
On Sat, Oct 18, 2014 at 04:29:41PM +0200, Marko Tiikkaja wrote:
 On 10/18/14, 5:16 AM, Tom Lane wrote:
 BTW, on re-reading that code I notice that it will happily seize upon
 the first suitable index (first in OID order), regardless of how many
 lower-order columns that index has got.  This doesn't make any difference
 I think for get_actual_variable_range's own purposes, because it's only
 expecting to touch the endmost index page regardless.  However, in light
 of Marko's complaint maybe we should teach it to check all the indexes
 and prefer the matching one with fewest columns?  It would only take a
 couple extra lines of code, and probably not that many added cycles
 considering we're going to do an index access of some sort.  But I'm
 not sure if it's worth any extra effort --- I think in his example
 case, there wasn't any narrower index anyway.
 
 Perhaps accidentally this would have helped in my case, actually,
 since I could have created a new, smaller index CONCURRENTLY and
 then seen that the usage of the other index stopped increasing.
 With the pick the smallest OID behaviour that was not possible.
 Another idea had was some way to tell the optimizer not to use that
 particular index for stats lookups, but probably the use case for
 such a feature would be a bit narrow.

Well, if the index is there, why not use it?  I thought the problem was
just that you had no visibility into how those statistics were being
accessed.  Most people think EXPLAIN shows all accesses, but obviously
now it doesn't.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-18 Thread Marko Tiikkaja

On 10/18/14, 4:33 PM, Bruce Momjian wrote:

On Sat, Oct 18, 2014 at 04:29:41PM +0200, Marko Tiikkaja wrote:

Another idea had was some way to tell the optimizer not to use that
particular index for stats lookups, but probably the use case for
such a feature would be a bit narrow.


Well, if the index is there, why not use it?  I thought the problem was
just that you had no visibility into how those statistics were being
accessed.


Yes, exactly; if I had had the option to disable the index from the 
optimizer's point of view, I'd have seen that it's not used for looking 
up any data by any queries, and thus I would have known that I can 
safely drop it without slowing down queries.  Which was the only thing I 
cared about, and where the stats we provide failed me.



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-18 Thread Bruce Momjian
On Sat, Oct 18, 2014 at 04:38:37PM +0200, Marko Tiikkaja wrote:
 On 10/18/14, 4:33 PM, Bruce Momjian wrote:
 On Sat, Oct 18, 2014 at 04:29:41PM +0200, Marko Tiikkaja wrote:
 Another idea had was some way to tell the optimizer not to use that
 particular index for stats lookups, but probably the use case for
 such a feature would be a bit narrow.
 
 Well, if the index is there, why not use it?  I thought the problem was
 just that you had no visibility into how those statistics were being
 accessed.
 
 Yes, exactly; if I had had the option to disable the index from the
 optimizer's point of view, I'd have seen that it's not used for
 looking up any data by any queries, and thus I would have known that
 I can safely drop it without slowing down queries.  Which was the
 only thing I cared about, and where the stats we provide failed me.

How many other cases do we have where the statistics are getting
incremented and there is no user visibility into the operation?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-18 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 How many other cases do we have where the statistics are getting
 incremented and there is no user visibility into the operation?

* system catalog accesses
* vacuum/analyze/cluster/etc

The fact that system-initiated accesses get counted in the statistics
is a feature, not a bug.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-18 Thread Tom Lane
Marko Tiikkaja ma...@joh.to writes:
 On 10/18/14, 4:33 PM, Bruce Momjian wrote:
 Well, if the index is there, why not use it?  I thought the problem was
 just that you had no visibility into how those statistics were being
 accessed.

 Yes, exactly; if I had had the option to disable the index from the 
 optimizer's point of view, I'd have seen that it's not used for looking 
 up any data by any queries, and thus I would have known that I can 
 safely drop it without slowing down queries.  Which was the only thing I 
 cared about, and where the stats we provide failed me.

This argument is *utterly* wrongheaded, because it assumes that the
planner's use of the index provided no benefit to your queries.  If the
planner was touching the index at all then it was planning queries in
which knowledge of the extremal value was relevant to accurate selectivity
estimation.  So it's quite likely that without the index you'd have gotten
different and inferior plans, whether or not those plans actually chose to
use the index.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-18 Thread Marko Tiikkaja

On 10/18/14, 5:46 PM, Tom Lane wrote:

Marko Tiikkaja ma...@joh.to writes:

Yes, exactly; if I had had the option to disable the index from the
optimizer's point of view, I'd have seen that it's not used for looking
up any data by any queries, and thus I would have known that I can
safely drop it without slowing down queries.  Which was the only thing I
cared about, and where the stats we provide failed me.


This argument is *utterly* wrongheaded, because it assumes that the
planner's use of the index provided no benefit to your queries.  If the
planner was touching the index at all then it was planning queries in
which knowledge of the extremal value was relevant to accurate selectivity
estimation.  So it's quite likely that without the index you'd have gotten
different and inferior plans, whether or not those plans actually chose to
use the index.


Maybe.  But at the same time that's a big problem: there's no way of 
knowing whether the index is actually useful or not when it's used only 
by the query planner.



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-18 Thread Bruce Momjian
On Sat, Oct 18, 2014 at 06:15:03PM +0200, Marko Tiikkaja wrote:
 On 10/18/14, 5:46 PM, Tom Lane wrote:
 Marko Tiikkaja ma...@joh.to writes:
 Yes, exactly; if I had had the option to disable the index from the
 optimizer's point of view, I'd have seen that it's not used for looking
 up any data by any queries, and thus I would have known that I can
 safely drop it without slowing down queries.  Which was the only thing I
 cared about, and where the stats we provide failed me.
 
 This argument is *utterly* wrongheaded, because it assumes that the
 planner's use of the index provided no benefit to your queries.  If the
 planner was touching the index at all then it was planning queries in
 which knowledge of the extremal value was relevant to accurate selectivity
 estimation.  So it's quite likely that without the index you'd have gotten
 different and inferior plans, whether or not those plans actually chose to
 use the index.
 
 Maybe.  But at the same time that's a big problem: there's no way of
 knowing whether the index is actually useful or not when it's used
 only by the query planner.

That is a good point.  Without an index, the executor is going to do a
sequential scan, while a missing index to the optimizer just means worse
statistics.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-18 Thread Jim Nasby

On 10/17/14, 10:16 PM, Tom Lane wrote:

I wrote:

Because it needs up-to-date min/max values in order to avoid being
seriously misled about selectivities of values near the endpoints.
See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.


BTW, on re-reading that code I notice that it will happily seize upon
the first suitable index (first in OID order), regardless of how many
lower-order columns that index has got.  This doesn't make any difference
I think for get_actual_variable_range's own purposes, because it's only
expecting to touch the endmost index page regardless.  However, in light
of Marko's complaint maybe we should teach it to check all the indexes
and prefer the matching one with fewest columns?  It would only take a
couple extra lines of code, and probably not that many added cycles
considering we're going to do an index access of some sort.  But I'm
not sure if it's worth any extra effort --- I think in his example
case, there wasn't any narrower index anyway.


The real cost here isn't the number of columns, it's the size of the index, no? 
So shouldn't we look at relpages instead? For example, you'd certainly want to 
use an index on (field_we_care_about, smallint_field) over an index on 
(field_we_care_about, big_honking_text_field).
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-17 Thread Tom Lane
Marko Tiikkaja ma...@joh.to writes:
 This week we had one of the most annoying problems I've ever encountered 
 with postgres.  We had a big index on multiple columns, say,  foo(a, b, 
 c).  According to pg_stat_all_indexes the index was being used *all the 
 time*.  However, after looking into our queries more closely, it turns 
 out that it was only being used to look up statistics for the foo.a 
 column to estimate merge scan viability during planning.  But this took 
 hours for two people to track down.

 So what I'd like to have is a way to be able to distinguish between 
 indexes being used to answer queries, and ones being only used for stats 
 lookups during planning.

Why?  Used is used.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-17 Thread Marko Tiikkaja

On 10/17/14, 11:47 PM, Tom Lane wrote:

Marko Tiikkaja ma...@joh.to writes:

This week we had one of the most annoying problems I've ever encountered
with postgres.  We had a big index on multiple columns, say,  foo(a, b,
c).  According to pg_stat_all_indexes the index was being used *all the
time*.  However, after looking into our queries more closely, it turns
out that it was only being used to look up statistics for the foo.a
column to estimate merge scan viability during planning.  But this took
hours for two people to track down.



So what I'd like to have is a way to be able to distinguish between
indexes being used to answer queries, and ones being only used for stats
lookups during planning.


Why?  Used is used.


Because I don't need a 30GB index on foo(a,b,c) to look up statistics. 
If I ever have a problem, I can replace it with a 5GB one on foo(a).



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-17 Thread Tom Lane
Marko Tiikkaja ma...@joh.to writes:
 On 10/17/14, 11:47 PM, Tom Lane wrote:
 Marko Tiikkaja ma...@joh.to writes:
 So what I'd like to have is a way to be able to distinguish between
 indexes being used to answer queries, and ones being only used for stats
 lookups during planning.

 Why?  Used is used.

 Because I don't need a 30GB index on foo(a,b,c) to look up statistics. 
 If I ever have a problem, I can replace it with a 5GB one on foo(a).

Well, the index might've been getting used in queries too in a way that
really only involved the first column.  I think you're solving the wrong
problem here.  The right problem is how to identify indexes that are
being used in a way that doesn't exploit all the columns.  Which is not
necessarily wrong in itself --- what you'd want is to figure out when the
last column(s) are *never* used.  The existing stats aren't terribly
helpful for that, I agree.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-17 Thread Jim Nasby

On 10/17/14, 4:49 PM, Marko Tiikkaja wrote:

On 10/17/14, 11:47 PM, Tom Lane wrote:

Marko Tiikkaja ma...@joh.to writes:

This week we had one of the most annoying problems I've ever encountered
with postgres.  We had a big index on multiple columns, say,  foo(a, b,
c).  According to pg_stat_all_indexes the index was being used *all the
time*.  However, after looking into our queries more closely, it turns
out that it was only being used to look up statistics for the foo.a
column to estimate merge scan viability during planning.  But this took
hours for two people to track down.



So what I'd like to have is a way to be able to distinguish between
indexes being used to answer queries, and ones being only used for stats
lookups during planning.


Why?  Used is used.


Because I don't need a 30GB index on foo(a,b,c) to look up statistics. If I 
ever have a problem, I can replace it with a 5GB one on foo(a).


That problem can exist with user queries too. Perhaps it would be better to 
find a way to count scans that didn't use all the fields in the index.

I do also see value in differentiating planning use from real query processing; 
not doing that can certainly cause confusion. What I don't know is if the added 
stats bloat is worth it. If we do go down that road, I think it'd be better to 
add an indicator to EState. Aside from allowing stats for all planning access, 
it should make it less likely that someone adds a new access path and forgets 
to mark it as internal (especially if the added field defaults to an invalid 
value).
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-17 Thread Marko Tiikkaja

On 10/17/14, 11:59 PM, Tom Lane wrote:

Marko Tiikkaja ma...@joh.to writes:

On 10/17/14, 11:47 PM, Tom Lane wrote:

Marko Tiikkaja ma...@joh.to writes:

So what I'd like to have is a way to be able to distinguish between
indexes being used to answer queries, and ones being only used for stats
lookups during planning.



Why?  Used is used.



Because I don't need a 30GB index on foo(a,b,c) to look up statistics.
If I ever have a problem, I can replace it with a 5GB one on foo(a).


Well, the index might've been getting used in queries too in a way that
really only involved the first column.  I think you're solving the wrong
problem here.  The right problem is how to identify indexes that are
being used in a way that doesn't exploit all the columns.


I'm not sure I agree with that.  Even if there was some information the 
planner could have extracted out of the index by using all columns (thus 
appearing fully used in these hypothetical new statistics), I still 
would've wanted the index gone.  But in this particular case, an index 
on foo(a) alone was not selective enough and it would have been a bad 
choice for practically every query, so I'm not sure what good those 
statistics were in the first place.


I think there's a big difference between this index was used to look up 
stuff for planning and this index was used to answer queries quickly. 
 In my mind the first one belongs to the category this index was 
considered, and the latter is this index was actually useful.  But 
maybe I'm not seeing the big picture here.



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-17 Thread Tom Lane
Marko Tiikkaja ma...@joh.to writes:
 On 10/17/14, 11:59 PM, Tom Lane wrote:
 Well, the index might've been getting used in queries too in a way that
 really only involved the first column.  I think you're solving the wrong
 problem here.  The right problem is how to identify indexes that are
 being used in a way that doesn't exploit all the columns.

 I'm not sure I agree with that.  Even if there was some information the 
 planner could have extracted out of the index by using all columns (thus 
 appearing fully used in these hypothetical new statistics), I still 
 would've wanted the index gone.  But in this particular case, an index 
 on foo(a) alone was not selective enough and it would have been a bad 
 choice for practically every query, so I'm not sure what good those 
 statistics were in the first place.

Those stats were perfectly valid: what the planner is looking for is
accurate minimum and maximum values for the index's leading column, and
that's what it got.  You're correct that a narrower index could have given
the same results with a smaller disk footprint, but the planner got the
results it needed from the index you provided for it to work with.

 I think there's a big difference between this index was used to look up 
 stuff for planning and this index was used to answer queries quickly. 

I think that's utter nonsense.  Even if there were any validity to the
position, it wouldn't be enough to justify doubling the stats footprint
in order to track system-driven accesses separately from query-driven
ones.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-17 Thread Marko Tiikkaja

On 10/18/14, 12:15 AM, Tom Lane wrote:

Marko Tiikkaja ma...@joh.to writes:

I think there's a big difference between this index was used to look up
stuff for planning and this index was used to answer queries quickly.


I think that's utter nonsense.


Well you probably know a bit more about the optimizer than I do.  But I 
can't see a case where the stats provided by the index would be useful 
for choosing between two (or more) plans that don't use the index in the 
actual query.  If you're saying that there are such cases, then clearly 
I don't know something, and my thinking is in the wrong here.



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-17 Thread Bruce Momjian
On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:
 Marko Tiikkaja ma...@joh.to writes:
  On 10/17/14, 11:59 PM, Tom Lane wrote:
  Well, the index might've been getting used in queries too in a way that
  really only involved the first column.  I think you're solving the wrong
  problem here.  The right problem is how to identify indexes that are
  being used in a way that doesn't exploit all the columns.
 
  I'm not sure I agree with that.  Even if there was some information the 
  planner could have extracted out of the index by using all columns (thus 
  appearing fully used in these hypothetical new statistics), I still 
  would've wanted the index gone.  But in this particular case, an index 
  on foo(a) alone was not selective enough and it would have been a bad 
  choice for practically every query, so I'm not sure what good those 
  statistics were in the first place.
 
 Those stats were perfectly valid: what the planner is looking for is
 accurate minimum and maximum values for the index's leading column, and
 that's what it got.  You're correct that a narrower index could have given
 the same results with a smaller disk footprint, but the planner got the
 results it needed from the index you provided for it to work with.

Uh, why is the optimizer looking at the index on a,b,c and not just the
stats on column a, for example?  I am missing something here.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-17 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 On Fri, Oct 17, 2014 at 06:15:37PM -0400, Tom Lane wrote:
 Those stats were perfectly valid: what the planner is looking for is
 accurate minimum and maximum values for the index's leading column, and
 that's what it got.  You're correct that a narrower index could have given
 the same results with a smaller disk footprint, but the planner got the
 results it needed from the index you provided for it to work with.

 Uh, why is the optimizer looking at the index on a,b,c and not just the
 stats on column a, for example?  I am missing something here.

Because it needs up-to-date min/max values in order to avoid being
seriously misled about selectivities of values near the endpoints.
See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_actual_variable_range vs idx_scan/idx_tup_fetch

2014-10-17 Thread Tom Lane
I wrote:
 Because it needs up-to-date min/max values in order to avoid being
 seriously misled about selectivities of values near the endpoints.
 See commit 40608e7f949fb7e4025c0ddd5be01939adc79eec.

BTW, on re-reading that code I notice that it will happily seize upon
the first suitable index (first in OID order), regardless of how many
lower-order columns that index has got.  This doesn't make any difference
I think for get_actual_variable_range's own purposes, because it's only
expecting to touch the endmost index page regardless.  However, in light
of Marko's complaint maybe we should teach it to check all the indexes
and prefer the matching one with fewest columns?  It would only take a
couple extra lines of code, and probably not that many added cycles
considering we're going to do an index access of some sort.  But I'm
not sure if it's worth any extra effort --- I think in his example
case, there wasn't any narrower index anyway.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers