Re: Postgresql 14 partitioning advice

2022-08-08 Thread Justin Pryzby
On Mon, Aug 08, 2022 at 03:45:11PM -0700, Slava Mudry wrote:
> Postgres 14 improved partitioning quite a bit. I used it in Postgres 9 and
> there was a lot of locking on partition hierarchy when you add/drop
> partition tables.

Note that postgres 9 didn't have native/declarative partitioning, and most
improvements in native partitioning don't apply to legacy/inheritance
partitioning.

https://www.postgresql.org/docs/devel/ddl-partitioning.html

"Native" partitioning added in v10 tends to require stronger locks for add/drop
than legacy partitioning, since partitions have associated bounds, which cannot
overlap.  The locking is improved in v12 with CREATE+ATTACH and v14 with
DETACH CONCURRENTLY+DROP.
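
For illustration, a minimal sketch of those two patterns, assuming a
hypothetical table `metrics` range-partitioned on a timestamptz column `ts`:
```
-- v12+: build the new partition first, then take the weaker lock needed by
-- ATTACH PARTITION instead of creating the partition in place
CREATE TABLE metrics_2022_08 (LIKE metrics INCLUDING ALL);
ALTER TABLE metrics ATTACH PARTITION metrics_2022_08
    FOR VALUES FROM ('2022-08-01') TO ('2022-09-01');
-- (for a pre-populated table, add a matching CHECK constraint first so
-- ATTACH can skip the validation scan)

-- v14+: detach without blocking concurrent queries, then drop the
-- now-standalone table; note DETACH ... CONCURRENTLY cannot run inside a
-- transaction block
ALTER TABLE metrics DETACH PARTITION metrics_2022_07 CONCURRENTLY;
DROP TABLE metrics_2022_07;
```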

-- 
Justin




Re: Postgresql 14 partitioning advice

2022-08-08 Thread Slava Mudry
> I'm spinning up a new Postgresql 14 database where I'll have to store a
> couple years worth of time series data at the rate of single-digit millions
> of rows per day.
I think you absolutely need to use partitioning for the following reasons:
1. maintenance and roll-off of older data
2. indexes are much smaller
3. performance is predictable (if partition pruning kicks in)

Postgres 14 improved partitioning quite a bit. I used it in Postgres 9 and
there was a lot of locking on partition hierarchy when you add/drop
partition tables.
Having thousands of partitions shouldn't be a problem, BUT you will incur
extra cost at query planning time - usually still under 0.1 seconds on modern
hardware.
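
For reference, a minimal declarative-partitioning sketch for a load like this
(hypothetical table and column names):
```
CREATE TABLE events (
    ts      timestamptz NOT NULL,
    payload jsonb
) PARTITION BY RANGE (ts);

-- one partition per day; old days can later be detached and archived
CREATE TABLE events_2022_08_08 PARTITION OF events
    FOR VALUES FROM ('2022-08-08') TO ('2022-08-09');
```
With range partitions on `ts`, a bare comparison on `ts` in the where clause
lets the planner prune to the matching days.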


On Tue, Aug 2, 2022 at 5:55 AM Rick Otten  wrote:

> [...]


-- 
-slava


Re: Postgresql 14 partitioning advice

2022-08-02 Thread Rick Otten
On Mon, Aug 1, 2022 at 10:16 AM Rick Otten  wrote:

> [...]
>
> I'm still writing up a test case that can demo this without using
> proprietary data.
>

After a bunch of experiments I can explain this now.  :-)

I had a `limit` clause in my test CTE.  When sequentially scanning a bunch of
partitions, if the limit is reached, the remaining partitions are marked
`never executed` and are not scanned.  On the other hand, when no rows are
found, all of the partitions must be scanned.

Therefore, with many millions of rows in the partitions, and being forced
into sequential scans because I put the `at time zone` clause in the `where`,
the case where rows are found is always noticeably faster than the case where
rows aren't found, as long as at least one partition hasn't been scanned yet
when the limit is hit.

I'm now satisfied this is a good thing, and will move on to other
problems.  Thanks for hearing me out.  I was scratching my head for a while
over that one.
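
A sketch of the query shape involved, with hypothetical table and column
names (`events` range-partitioned on `ts`):
```
WITH recent AS (
    SELECT *
    FROM events
    -- applying AT TIME ZONE to the column defeats pruning, forcing
    -- sequential scans of the partitions (see the earlier messages)
    WHERE ts AT TIME ZONE 'UTC' > '2022-07-01 00:00'::timestamp
    LIMIT 1000
)
SELECT * FROM recent ORDER BY ts;
-- When the LIMIT is satisfied early, EXPLAIN ANALYZE marks the remaining
-- partition scans "(never executed)"; a date range matching zero rows
-- scans every partition before the Sort can return.
```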


Re: Postgresql 14 partitioning advice

2022-08-01 Thread Rick Otten
>
>
> The other problem I ran into, which I'm still building a test case for and
> I fear might be a bug if I can easily reproduce it,
> is if I did the original select in a CTE, and then did a sort outside of
> the CTE, even though the CTE found 0 rows, the database
> still spent a _ton_ of time sorting those 0 rows:
> ```
>->  Sort  (cost=70.03..72.53 rows=1000 width=112) (actual
> time=84848.452..84848.453 rows=0 loops=1)
> ```
> [...]
This behavior is definitely consistent.  0 rows end up slower than when I
find some rows in my CTE:
```
   ->  Sort  (cost=109.44..113.19 rows=1500 width=112) (actual
time=87110.841..87110.842 rows=0 loops=1)
   ->  Sort  (cost=109.44..113.19 rows=1500 width=112) (actual
time=25367.867..25367.930 rows=840 loops=1)
```
The only thing I changed in the query was the date range.  It is actually
the CTE scan step inside the Sort block that is slower when no rows are
returned than when rows are returned.  It also only happens when all the
partitions are sequentially scanned instead of being partition pruned.

I'm still writing up a test case that can demo this without using
proprietary data.


Re: Postgresql 14 partitioning advice

2022-07-30 Thread Nathan Ward


> On 30/07/2022, at 9:44 AM, Rick Otten  wrote:
> 
> [...]
> Surprising to me was that if you partition on a `timestamp with time zone`
> column, call it "ts":
> If your where clause looks like
> ```
> where ts at time zone 'UTC' > '2022-07-01 00:00'::timestamp
> ```
> you will NOT get partition pruning and it will do a sequential scan.
> However if you change it to (with an appropriately adjusted right hand side 
> if necessary):
> ```
> where ts > '2022-07-01 00:00'::timestamp
> ```
> It will do partition pruning and will index scan.
> [...]

Hi Rick,

I am working with data with a similar structure. The most recent data is
accessed significantly more often than older data, so my next step will be to
keep very recent data in hourly tables, then daily, and probably monthly
tables for what is effectively archived data. My data is a little different
in that it's stored by the start of an "interval", which means I can do an
`=` comparison on the start of the hour (or day, once data is aggregated).

I found a similar interesting partitioning performance issue recently when
partitioning by timestamp: if the where clause on a timestamp column includes
math, the planner runs very slowly. In my case, something like:
```
select * from my_table where ts = some_time_variable - interval '1 hour';
```
is *much* slower than precomputing the offset first (pseudocode):
```
offset_time_variable := some_time_variable - interval '1 hour';
select * from my_table where ts = offset_time_variable;
```

Everything is `timestamp with time zone`.
I believe it's calculating that offset for each partition - of which there
are a couple hundred - and that was causing the planner to run very slowly.
Pruning works correctly once the planner has run.
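
A self-contained version of that precomputation, as a PL/pgSQL sketch with
hypothetical names:
```
DO $$
DECLARE
    offset_time timestamptz := now() - interval '1 hour';
BEGIN
    -- the WHERE clause compares ts to a plain variable, so the planner is
    -- not repeating the interval arithmetic for each partition
    PERFORM count(*) FROM my_table WHERE ts = offset_time;
END
$$;
```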

This is on postgres 13 - I have yet to try 14 and see if this issue persists in 
14.

Changing my main query to the above structure significantly improved 
performance - I was previously having lots of performance issues when 
aggregation tasks ran and dropped partitions etc.

I posted about this here: 
https://www.postgresql.org/message-id/84101021-8B67-45AD-83F2-A3C8F0AA4BEE%40daork.net

--
Nathan Ward




Re: Postgresql 14 partitioning advice

2022-07-29 Thread Rick Otten
On Wed, Jul 27, 2022 at 8:55 AM Rick Otten  wrote:

> I'm spinning up a new Postgresql 14 database where I'll have to store a
> couple years worth of time series data at the rate of single-digit millions
> of rows per day.  Since this has to run in AWS Aurora, I can't use
> TimescaleDB.
>

I thought I'd report back some of my findings from testing this week:

I took the same real world two week data set and created identical tables
except that I partitioned one by month, one by week, one by day, and one by
hour.  I partitioned a little bit into the past and a little bit into the
future. I did this on a PG 14.2 RDS instance.  This gave me tables with:
3 partitions, 13 partitions, 90 partitions and 2136 partitions, but
otherwise the same data.

Insert times were equivalent.

Then I crafted a query that was one of the main use cases for the data and
ran it a bunch of times.

I noticed a significant degradation in performance as the number of
partitions increased.  The jump from 13 to 90, in particular, was very
steep.  It didn't matter what I set work_mem or other tunables to.  I dug
deeper...

Surprising to me was that if you partition on a `timestamp with time zone`
column, call it "ts":
If your where clause looks like
```
where ts at time zone 'UTC' > '2022-07-01 00:00'::timestamp
```
you will NOT get partition pruning and it will do a sequential scan.
However if you change it to (with an appropriately adjusted right hand side
if necessary):
```
where ts > '2022-07-01 00:00'::timestamp
```
It will do partition pruning and will index scan.

When I made that change the query performance was equivalent regardless of
which number of partitions I had in play.
I did a quick test and this happens on a regular timestamp index on a
regular table as well.
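
If you still need the comparison expressed in UTC, one rewrite that should
keep pruning (a sketch - verify against your data) is to move the conversion
to the constant side, leaving the partition key bare:
```
-- '2022-07-01 00:00'::timestamp AT TIME ZONE 'UTC' yields a timestamptz,
-- so ts is compared directly and pruning/index use still apply
where ts > ('2022-07-01 00:00'::timestamp AT TIME ZONE 'UTC')
```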

The other problem I ran into, which I'm still building a test case for and
I fear might be a bug if I can easily reproduce it,
is if I did the original select in a CTE, and then did a sort outside of
the CTE, even though the CTE found 0 rows, the database
still spent a _ton_ of time sorting those 0 rows:
```
   ->  Sort  (cost=70.03..72.53 rows=1000 width=112) (actual
time=84848.452..84848.453 rows=0 loops=1)
```
Once I can reproduce this on test data I'll be able to pin down more
closely what is happening and tell if I'm just reading
the explain plan wrong or if something is broken.  It was getting mixed up
with the lack of pruning/index usage problem.

I'll report back again next week.  Anyway it is looking to me like it
doesn't really matter (within reason) from a performance
perspective how many partitions we use for our data set and query
patterns.  We should be able to pick the most convenient
from an archiving and data management perspective instead.


Re: Postgresql 14 partitioning advice

2022-07-27 Thread Jeff Janes
On Wed, Jul 27, 2022 at 8:55 AM Rick Otten  wrote:

>
> One person I talked to said "try not to have more than 100 partitions",
> even with the latest postgresql you'll end up with a lot of lock contention
> if you go over 100 partitions.
>
>
It is hard to know how seriously to take the advice of anonymous people
accompanied with such sparse justification.  Meanwhile, people who actually
wrote the code seem to think that this problem has been mostly overcome
with declarative partitioning in the newer versions.

When you do decide to start removing the oldest data, how will you do it?
Your partitioning should probably be designed to align with this.

> Since the data most frequently queried would be recent data (say the past
> month or so)

Is this done specifically with a time clause, or just by criteria which
happen to align with time, but have no formal relationship with it?

Cheers,

Jeff


Re: Postgresql 14 partitioning advice

2022-07-27 Thread Justin Pryzby
On Wed, Jul 27, 2022 at 08:55:14AM -0400, Rick Otten wrote:
> I'm spinning up a new Postgresql 14 database where I'll have to store a
> couple years worth of time series data at the rate of single-digit millions
> of rows per day.  Since this has to run in AWS Aurora, I can't use
> TimescaleDB.

> One person I talked to said "try not to have more than 100 partitions",
> even with the latest postgresql you'll end up with a lot of lock contention
> if you go over 100 partitions.

I'm not familiar with this (but now I'm curious).  We have over 2000 partitions
in some tables.  No locking issue that I'm aware of.  One issue that I *have*
seen is if you have many partitions, you can end up with query plans with a
very large number of planner nodes, and it's hard to set
work_mem*hash_mem_multiplier to account for that.

> This person also recommended manually
> kicking off vacuums on a regular schedule rather than trusting autovacuum
> to work reliably on the partitioned tables.

They must mean *analyze*, which does not run automatically on the partitioned
tables (only the partitions).  The partitioned table is empty, so doesn't need
to be vacuumed.
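
For example (hypothetical table name), a scheduled job covers the gap:
```
-- run periodically from cron/pg_cron; autoanalyze only handles the leaf
-- partitions, so the parent's statistics need a manual ANALYZE
ANALYZE my_partitioned_table;
```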

> I've got several keys, besides the obvious time-key that I could partition
> on.   I could do a multi-key partitioning scheme.  Since the data is
> inbound at a relatively steady rate, if I partition on time, I can adjust
> the partitions to be reasonably similarly sized.  What is a good partition
> size?

Depends on 1) the target number of partitions; and 2) the target size for
indexes on those partitions.  More, finer-grained partitions lead to smaller
per-partition indexes.  Depending on the type of index, and the index keys,
to get good INSERT performance you may need to set shared_buffers to
accommodate the combined size of all the indexes (but maybe not, if the
leading column is the timestamp).

> Since the data most frequently queried would be recent data (say the past
> month or so) would it make sense to build an archiving strategy that rolled
> up older partitions into larger ones?  ie, do daily partitions for the
> first four weeks, then come up with a process that rolled them up into
> monthly partitions for the next few months, then maybe quarterly partitions
> for the data older than a year?  (I'm thinking about ways to keep the
> partition count low - if that advice is justified.)

I think it can make sense.  I do that myself in order to: 1) avoid having a
huge *total* number of tables (which causes pg_attribute to be large, since our
tables are also "wide"); and 2) make our backups of "recent data" smaller; and
3) make autoanalyze a bit more efficient (a monthly partition will be
analyzed numerous times during the 2nd half of the month, even though the
historic data hasn't changed at all).
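
A sketch of one such roll-up step, with assumed names (daily partitions of a
table `metrics` merged into a monthly one):
```
BEGIN;
-- detach the daily partitions for June (repeat for each day)
ALTER TABLE metrics DETACH PARTITION metrics_2022_06_01;

-- build the monthly partition and copy the rows over
CREATE TABLE metrics_2022_06 (LIKE metrics INCLUDING ALL);
INSERT INTO metrics_2022_06 SELECT * FROM metrics_2022_06_01;

ALTER TABLE metrics ATTACH PARTITION metrics_2022_06
    FOR VALUES FROM ('2022-06-01') TO ('2022-07-01');
DROP TABLE metrics_2022_06_01;
COMMIT;
```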

> Or, should I just have a single 7 Trillion row table with a BRIN index on
> the timestamp and not mess with partitions at all?

Are you going to need to DELETE data?  Then this isn't great, and DELETEing
data will inevitably lower the correlation, making BRIN less effective.
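
For reference, the single-table alternative would be something like
(hypothetical names):
```
-- BRIN stores one summary per block range, so it stays tiny, but it only
-- helps while ts remains physically correlated with insertion order
CREATE INDEX events_ts_brin ON events USING brin (ts);
```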

-- 
Justin