[ redirecting to -hackers ]

Tomasz Ostrowski <tometzky...@ato.waw.pl> writes:
> I'm routinely bulk inserting data into a PostgreSQL table and then 
> analyzing a single column of the table, because it contains data that 
> significantly changes the histogram of that column's values - for 
> example, adding rows with "todo=true" when all rows before the bulk 
> insert have "todo=false".

> But I've noticed that the standard automatic analyze, which should run 
> in the background, never happens, because this fast analyze of one 
> column resets the pg_stat_user_tables(n_mod_since_analyze) counter.

> I suppose this is a bug - an ANALYZE that does not analyze all 
> columns should not reset pg_stat_user_tables(n_mod_since_analyze). 
> What do you think?

I'm inclined to think that this is a reasonable complaint.  A usage
pattern like that probably hasn't come up before; but now that it has,
it's clear it shouldn't block auto-analyze from happening.

A cheap-and-dirty solution would be to not send a PgStat_MsgAnalyze
message at all, but I think that's probably undesirable: typically it
would be a good thing to accept the new n_live_tuples and n_dead_tuples
estimates.  What we could do instead is add a bool flag to
PgStat_MsgAnalyze saying whether or not to reset changes_since_analyze.

It would be safe enough to back-patch such a change, because the stats
collector messages are private to the backend (in fact, really private
to pgstat.c).

One question here is whether there is any connection between
changes_since_analyze and the tuple-count estimates that would make it
improper to update the latter and not the former.  I can't really see one;
in fact, changes_since_analyze is pretty squishy anyway because changes
made by an ANALYZE's own transaction are already accounted for in the new
pg_statistic entries but will be added to changes_since_analyze at commit
despite that.  So I'd go ahead and update all the fields other than
changes_since_analyze in this case.

Another interesting consideration is that if the table's columns have
different stats targets, a selective-column ANALYZE might possibly sample
fewer rows than an all-columns ANALYZE would.  This might mean that our
new tuple-count estimates are less accurate than autoanalyze would get.
So you could possibly argue that we shouldn't update the tuple-count
estimates after all.  But I don't particularly believe that, because it
disregards the fact that the new estimates are, well, new.  Even if they
have more statistical risk than autoanalyze would have, they could well
be better just by virtue of having seen whatever bulk updates might have
happened since the last autoanalyze.  So my inclination is to disregard
this fine point.  (Though it's interesting to ask whether Tomasz's use
case includes a lower-than-default stats target for his very volatile
column ...)

Also, I'd be a bit inclined to disable the counter reset whenever a column
list is specified, disregarding the corner case where a list is given but
it includes all the table's analyzable columns.  It doesn't really seem
worth the effort to account for that case specially (especially after
you consider that index expressions should count as analyzable columns).

Thoughts?

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)