Re: Analyze on table creation?

2023-06-27 Thread James Coleman
On Mon, Jun 26, 2023 at 4:16 PM James Coleman  wrote:
>
> On Mon, Jun 26, 2023 at 4:00 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-06-26 13:40:49 -0400, James Coleman wrote:
> > > Have we ever discussed running an analyze immediately after creating a 
> > > table?
> >
> > That doesn't make a whole lot of sense to me - we could just insert the
> > constants stats we wanted in that case.
> >
>
> I thought that was implicit in that, but fair enough :)
>
> > > Consider the following:
> > >
> > > create table stats(i int, t text not null);
> > > explain select * from stats;
> > >Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36)
> > > analyze stats;
> > > explain select * from stats;
> > >Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
> > >
> > > Combined with rapidly increasing error margin on row estimates when
> > > adding joins means that a query joining to a bunch of empty tables
> > > when a database first starts up can result in some pretty wild plan
> > > costs.
> >
> > The issue is that the table stats are likely to quickly go out of date in
> > that case; even a handful of inserts (which wouldn't trigger
> > autovacuum analyzing) would lead to the "0 rows" stats causing very bad
> > plans.
> >
>
> It's not obvious to me (as noted elsewhere in the thread) which is
> worse: a bunch of JOINs on empty tables can result in plans with (in one
> specific example) cost=15353020, which then triggers JIT, and here we
> collide with my other thread about JIT [1].
>
> Regards,
> James Coleman
>
> 1: 
> https://www.postgresql.org/message-id/CAAaqYe-g-Q0Mm5H9QLcu8cHeMwok%2BHaxS4-UC9Oj3bK3a5jPvg%40mail.gmail.com

Thinking about this a bit more: it seems like what we're missing is either:

1. A heuristic for "this table will probably remain empty", or
2. A way to invalidate "0 rows" stats more quickly on even a handful of inserts.

I think one of those (ignoring questions about "how" for now) would
solve both cases?
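
To make the second case concrete, here is a minimal psql sketch (assuming PostgreSQL 14 or later, where the "unknown density" state is distinguishable) of what the planner actually sees before and after stats are recorded for an empty table:

```sql
-- Sketch (assumes PostgreSQL 14+): inspect the planner's view of an
-- empty table before and after statistics are recorded.
CREATE TABLE stats(i int, t text NOT NULL);

-- Before any VACUUM/ANALYZE: reltuples = -1 means "tuple density
-- unknown", and the planner falls back to its default size estimate.
SELECT relpages, reltuples FROM pg_class WHERE oid = 'stats'::regclass;

ANALYZE stats;

-- Now relpages = 0, reltuples = 0: the table is known to be empty, and
-- a handful of later inserts will not change these values until the
-- next (auto)analyze runs.
SELECT relpages, reltuples FROM pg_class WHERE oid = 'stats'::regclass;
```

An invalidation scheme along the lines of option 2 would amount to treating those "known empty" stats as stale as soon as n_mod_since_analyze becomes nonzero.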

Regards,
James Coleman




Re: Analyze on table creation?

2023-06-26 Thread James Coleman
On Mon, Jun 26, 2023 at 4:00 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-06-26 13:40:49 -0400, James Coleman wrote:
> > Have we ever discussed running an analyze immediately after creating a 
> > table?
>
> That doesn't make a whole lot of sense to me - we could just insert the
> constants stats we wanted in that case.
>

I thought that was implicit in that, but fair enough :)

> > Consider the following:
> >
> > create table stats(i int, t text not null);
> > explain select * from stats;
> >Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36)
> > analyze stats;
> > explain select * from stats;
> >Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
> >
> > Combined with rapidly increasing error margin on row estimates when
> > adding joins means that a query joining to a bunch of empty tables
> > when a database first starts up can result in some pretty wild plan
> > costs.
>
> The issue is that the table stats are likely to quickly go out of date in
> that case; even a handful of inserts (which wouldn't trigger
> autovacuum analyzing) would lead to the "0 rows" stats causing very bad plans.
>

It's not obvious to me (as noted elsewhere in the thread) which is
worse: a bunch of JOINs on empty tables can result in plans with (in one
specific example) cost=15353020, which then triggers JIT, and here we
collide with my other thread about JIT [1].

Regards,
James Coleman

1: https://www.postgresql.org/message-id/CAAaqYe-g-Q0Mm5H9QLcu8cHeMwok%2BHaxS4-UC9Oj3bK3a5jPvg%40mail.gmail.com




Re: Analyze on table creation?

2023-06-26 Thread Andres Freund
Hi,

On 2023-06-26 13:40:49 -0400, James Coleman wrote:
> Have we ever discussed running an analyze immediately after creating a table?

That doesn't make a whole lot of sense to me - we could just insert the
constants stats we wanted in that case.


> Consider the following:
> 
> create table stats(i int, t text not null);
> explain select * from stats;
>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36)
> analyze stats;
> explain select * from stats;
>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
> 
> Combined with rapidly increasing error margin on row estimates when
> adding joins means that a query joining to a bunch of empty tables
> when a database first starts up can result in some pretty wild plan
> costs.

The issue is that the table stats are likely to quickly go out of date in
that case; even a handful of inserts (which wouldn't trigger
autovacuum analyzing) would lead to the "0 rows" stats causing very bad plans.

Greetings,

Andres Freund




Re: Analyze on table creation?

2023-06-26 Thread Pavel Stehule
>
> >
> > Originally, until the table had at least one row, PostgreSQL
> assumed it occupied 10 pages. That was changed in PostgreSQL 14.
> >
> >
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3d351d916b20534f973eda760cde17d96545d4c4
> >
>
> From that commit message:
> > Historically, we've considered the state with relpages and reltuples
> > both zero as indicating that we do not know the table's tuple density.
> > This is problematic because it's impossible to distinguish "never yet
> > vacuumed" from "vacuumed and seen to be empty".  In particular, a user
> > cannot use VACUUM or ANALYZE to override the planner's normal heuristic
> > that an empty table should not be believed to be empty because it is
> > probably about to get populated.  That heuristic is a good safety
> > measure, so I don't care to abandon it, but there should be a way to
> > override it if the table is indeed intended to stay empty.
>
> So that implicitly provides our reasoning for not analyzing up-front
> on table creation.
>
> I haven't thought about this too deeply yet, but it seems plausible to
> me that the dangers of overestimating row count here (at minimum in
> queries like I described with lots of joins) are higher than the
> dangers of underestimating, which we would do if we believed the table
> was empty. One critical question would be how fast we can assume the
> table will be auto-analyzed (i.e., how fast would the underestimate be
> corrected).
>

I ran into this issue a few years ago. In that application, 40% of tables
had one row or none, 30% were of usual size, and 30% were sometimes really
big. That distribution can be relatively common in OLAP applications.

The estimation was terrible. I don't think there can be a much better
heuristic. Maybe we can introduce a table option for the expected size,
to be used when real statistics are not available.

Something like

CREATE TABLE foo(...) WITH (default_relpages = x)

It is not a perfect solution, but it would allow fixing this issue with one
command.
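
Until an option like that exists, the closest approximation to "insert the constant stats we want" (as suggested upthread) that I'm aware of is a direct catalog update. This is only a sketch of an unsupported hack, not a recommendation:

```sql
-- Unsupported hack (sketch only): seed the planner's size estimate for a
-- table that is expected to stay small. Requires superuser, and the
-- seeded values are overwritten by the next VACUUM or ANALYZE.
UPDATE pg_class
SET relpages  = 1,   -- pretend the table occupies one page
    reltuples = 10   -- ...and holds roughly ten rows
WHERE oid = 'foo'::regclass;
```

A real `default_relpages` reloption would give the same effect declaratively and survive autovacuum.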


> Regards,
> James Coleman
>


Re: Analyze on table creation?

2023-06-26 Thread James Coleman
cc'ing Tom because I'm curious if he's willing to provide some greater
context on the commit in question.

On Mon, Jun 26, 2023 at 2:16 PM Pavel Stehule  wrote:
>
>
>
> po 26. 6. 2023 v 19:48 odesílatel James Coleman  napsal:
>>
>> On Mon, Jun 26, 2023 at 1:45 PM Pavel Stehule  
>> wrote:
>> >
>> >
>> >
>> > po 26. 6. 2023 v 19:43 odesílatel Pavel Stehule  
>> > napsal:
>> >>
>> >> Hi
>> >>
>> >> po 26. 6. 2023 v 19:41 odesílatel James Coleman  napsal:
>> >>>
>> >>> Hello,
>> >>>
>> >>> Have we ever discussed running an analyze immediately after creating a 
>> >>> table?
>> >>>
>> >>> Consider the following:
>> >>>
>> >>> create table stats(i int, t text not null);
>> >>> explain select * from stats;
>> >>>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36)
>> >>> analyze stats;
>> >>> explain select * from stats;
>> >>>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
>> >>>
>> >>> Combined with rapidly increasing error margin on row estimates when
>> >>> adding joins means that a query joining to a bunch of empty tables
>> >>> when a database first starts up can result in some pretty wild plan
>> >>> costs.
>> >>>
>> >>> This feels like a simple idea to me, and so I assume people have
>> >>> considered it before. If so, I'd like to understand why the conclusion
>> >>> was not to do it, or, alternatively if it's a lack of tuits.
>> >>
>> >>
>> >> I like this. On the other hand, the described behaviour is designed to
>> >> ensure backward compatibility.
>> >
>> >
>> > If you break this backward compatibility, then the immediate ANALYZE is not
>> > necessary.
>>
>> I don't follow what backwards compatibility you're referencing. Could
>> you expand on that?
>
>
> Originally, until the table had at least one row, PostgreSQL assumed it
> occupied 10 pages. That was changed in PostgreSQL 14.
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3d351d916b20534f973eda760cde17d96545d4c4
>

From that commit message:
> Historically, we've considered the state with relpages and reltuples
> both zero as indicating that we do not know the table's tuple density.
> This is problematic because it's impossible to distinguish "never yet
> vacuumed" from "vacuumed and seen to be empty".  In particular, a user
> cannot use VACUUM or ANALYZE to override the planner's normal heuristic
> that an empty table should not be believed to be empty because it is
> probably about to get populated.  That heuristic is a good safety
> measure, so I don't care to abandon it, but there should be a way to
> override it if the table is indeed intended to stay empty.

So that implicitly provides our reasoning for not analyzing up-front
on table creation.

I haven't thought about this too deeply yet, but it seems plausible to
me that the dangers of overestimating row count here (at minimum in
queries like I described with lots of joins) are higher than the
dangers of underestimating, which we would do if we believed the table
was empty. One critical question would be how fast we can assume the
table will be auto-analyzed (i.e., how fast would the underestimate be
corrected).
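
On the "how fast" question: with the default autovacuum settings I'm aware of, an analyze is triggered once the modified-row count exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples, which for a just-analyzed empty table works out to 50 + 0.1 * 0 = 50 modified rows, plus up to one naptime cycle of scheduling latency. A sketch for checking the relevant settings:

```sql
-- Sketch: with default settings, the number of row modifications an
-- empty, just-analyzed table needs before autovacuum analyzes it again:
--   analyze threshold = autovacuum_analyze_threshold
--                     + autovacuum_analyze_scale_factor * reltuples
--                     = 50 + 0.1 * 0 = 50 modified rows
SHOW autovacuum_analyze_threshold;     -- 50 by default
SHOW autovacuum_analyze_scale_factor;  -- 0.1 by default
SHOW autovacuum_naptime;               -- 1min by default: added latency
```

So the underestimate would persist for at least dozens of inserts and roughly a minute, during which the bad "0 rows" plans would be in use.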

Regards,
James Coleman




Re: Analyze on table creation?

2023-06-26 Thread Pavel Stehule
po 26. 6. 2023 v 19:48 odesílatel James Coleman  napsal:

> On Mon, Jun 26, 2023 at 1:45 PM Pavel Stehule 
> wrote:
> >
> >
> >
> > po 26. 6. 2023 v 19:43 odesílatel Pavel Stehule 
> napsal:
> >>
> >> Hi
> >>
> >> po 26. 6. 2023 v 19:41 odesílatel James Coleman 
> napsal:
> >>>
> >>> Hello,
> >>>
> >>> Have we ever discussed running an analyze immediately after creating a
> table?
> >>>
> >>> Consider the following:
> >>>
> >>> create table stats(i int, t text not null);
> >>> explain select * from stats;
> >>>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36)
> >>> analyze stats;
> >>> explain select * from stats;
> >>>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
> >>>
> >>> Combined with rapidly increasing error margin on row estimates when
> >>> adding joins means that a query joining to a bunch of empty tables
> >>> when a database first starts up can result in some pretty wild plan
> >>> costs.
> >>>
> >>> This feels like a simple idea to me, and so I assume people have
> >>> considered it before. If so, I'd like to understand why the conclusion
> >>> was not to do it, or, alternatively if it's a lack of tuits.
> >>
> >>
> >> I like this. On the other hand, the described behaviour is designed to
> ensure backward compatibility.
> >
> >
> > If you break this backward compatibility, then the immediate ANALYZE is not
> necessary.
>
> I don't follow what backwards compatibility you're referencing. Could
> you expand on that?
>

Originally, until the table had at least one row, PostgreSQL assumed it
occupied 10 pages. That was changed in PostgreSQL 14.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3d351d916b20534f973eda760cde17d96545d4c4

Regards

Pavel


> Regards,
> James Coleman
>


Re: Analyze on table creation?

2023-06-26 Thread James Coleman
On Mon, Jun 26, 2023 at 1:45 PM Pavel Stehule  wrote:
>
>
>
> po 26. 6. 2023 v 19:43 odesílatel Pavel Stehule  
> napsal:
>>
>> Hi
>>
>> po 26. 6. 2023 v 19:41 odesílatel James Coleman  napsal:
>>>
>>> Hello,
>>>
>>> Have we ever discussed running an analyze immediately after creating a 
>>> table?
>>>
>>> Consider the following:
>>>
>>> create table stats(i int, t text not null);
>>> explain select * from stats;
>>>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36)
>>> analyze stats;
>>> explain select * from stats;
>>>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
>>>
>>> Combined with rapidly increasing error margin on row estimates when
>>> adding joins means that a query joining to a bunch of empty tables
>>> when a database first starts up can result in some pretty wild plan
>>> costs.
>>>
>>> This feels like a simple idea to me, and so I assume people have
>>> considered it before. If so, I'd like to understand why the conclusion
>>> was not to do it, or, alternatively if it's a lack of tuits.
>>
>>
>> I like this. On the other hand, the described behaviour is designed to
>> ensure backward compatibility.
>
>
> If you break this backward compatibility, then the immediate ANALYZE is not
> necessary.

I don't follow what backwards compatibility you're referencing. Could
you expand on that?

Regards,
James Coleman




Re: Analyze on table creation?

2023-06-26 Thread Pavel Stehule
po 26. 6. 2023 v 19:43 odesílatel Pavel Stehule 
napsal:

> Hi
>
> po 26. 6. 2023 v 19:41 odesílatel James Coleman  napsal:
>
>> Hello,
>>
>> Have we ever discussed running an analyze immediately after creating a
>> table?
>>
>> Consider the following:
>>
>> create table stats(i int, t text not null);
>> explain select * from stats;
>>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36)
>> analyze stats;
>> explain select * from stats;
>>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
>>
>> Combined with rapidly increasing error margin on row estimates when
>> adding joins means that a query joining to a bunch of empty tables
>> when a database first starts up can result in some pretty wild plan
>> costs.
>>
>> This feels like a simple idea to me, and so I assume people have
>> considered it before. If so, I'd like to understand why the conclusion
>> was not to do it, or, alternatively if it's a lack of tuits.
>>
>
> I like this. On the other hand, the described behaviour is designed to
> ensure backward compatibility.
>

If you break this backward compatibility, then the immediate ANALYZE is not
necessary.



>
> Regards
>
> Pavel
>
>
>
>> Regards,
>> James Coleman
>>
>>
>>


Re: Analyze on table creation?

2023-06-26 Thread Pavel Stehule
Hi

po 26. 6. 2023 v 19:41 odesílatel James Coleman  napsal:

> Hello,
>
> Have we ever discussed running an analyze immediately after creating a
> table?
>
> Consider the following:
>
> create table stats(i int, t text not null);
> explain select * from stats;
>Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36)
> analyze stats;
> explain select * from stats;
>Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)
>
> Combined with rapidly increasing error margin on row estimates when
> adding joins means that a query joining to a bunch of empty tables
> when a database first starts up can result in some pretty wild plan
> costs.
>
> This feels like a simple idea to me, and so I assume people have
> considered it before. If so, I'd like to understand why the conclusion
> was not to do it, or, alternatively if it's a lack of tuits.
>

I like this. On the other hand, the described behaviour is designed to
ensure backward compatibility.

Regards

Pavel



> Regards,
> James Coleman
>
>
>


Analyze on table creation?

2023-06-26 Thread James Coleman
Hello,

Have we ever discussed running an analyze immediately after creating a table?

Consider the following:

create table stats(i int, t text not null);
explain select * from stats;
   Seq Scan on stats  (cost=0.00..22.70 rows=1270 width=36)
analyze stats;
explain select * from stats;
   Seq Scan on stats  (cost=0.00..0.00 rows=1 width=36)

Combined with rapidly increasing error margin on row estimates when
adding joins means that a query joining to a bunch of empty tables
when a database first starts up can result in some pretty wild plan
costs.

This feels like a simple idea to me, and so I assume people have
considered it before. If so, I'd like to understand why the conclusion
was not to do it, or, alternatively if it's a lack of tuits.

Regards,
James Coleman