Re: Window Functions & Table Partitions

2023-02-09 Thread Benjamin Tingle
Thanks for the helpful response, David! I'll have a shot at getting the
patch to work myself & submitting it to pgsql-hackers.

Ben

On Wed, Feb 8, 2023 at 2:36 PM David Rowley  wrote:

> On Thu, 9 Feb 2023 at 10:45, Benjamin Tingle  wrote:
> > Basically- window partition functions don't take advantage of existing
> table partitions. I use window functions as a more powerful GROUP BY clause
> that preserves row-by-row information- super handy for a lot of things.
> >
> > In particular, I want to use window functions on already partitioned
> tables, like the below example:
> >
> > create table abb (a int, b int, g int) partition by hash(b)
> > /* populate table etc... */
> > select a, b, min(a) over (partition by b) as g from abb
> >
> > Ideally with a query plan like this:
> >
> > Window:
> > Append:
> > Sort on table_p0
> > Sort on table_p1
> > Sort on table_p2
>
> There was some effort [1] in version 12 to take advantage of the order
> defined by the partitioning scheme. The release notes [2] mention:
>
> "Avoid sorting when partitions are already being scanned in the necessary
> order"
>
> However, it's not 100% of what you need as there'd have to be a btree
> index on abb(b) for the planner to notice.
>
> Likely this could be made better so that add_paths_to_append_rel()
> added the pathkeys defined by the partitioned table into
> all_child_pathkeys if they didn't exist already. In fact, I've
> attached a very quickly hacked together patch against master to do
> this.  I've given it very little thought and it comes complete with
> failing regression tests.
>
> If you're interested in pursuing this then feel free to take the patch
> to the pgsql-hackers mailing list and propose it. It's unlikely I'll
> get time to do that for a while, but I will keep a branch locally with
> it to remind me in case I do at some point in the future.
>
> David
>
> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=959d00e9dbe4cfcf4a63bb655ac2c29a5e579246
> [2] https://www.postgresql.org/docs/release/12.0/
>


-- 

Ben(t).


Window Functions & Table Partitions

2023-02-08 Thread Benjamin Tingle
Hello postgres people!

This is not an issue report so much as a gripe. I'm on postgres 12.2, so it
is entirely possible that the issue I describe is fixed in a later version.
If so, it is not described in the docs or any posts I can find archived on
pgsql-performance. (I am not brave enough to delve into pgsql-hackers,
where I'm sure this has been brought up at some point.)

Basically: window function partitions don't take advantage of existing
table partitions. I use window functions as a more powerful GROUP BY clause
that preserves row-by-row information, which is super handy for a lot of
things.

In particular, I want to use window functions on already partitioned
tables, like the below example:

create table abb (a int, b int, g int) partition by hash(b)
/* populate table etc... */
select a, b, min(a) over (partition by b) as g from abb
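
For completeness, a runnable version of that setup might look something
like this (the partition count and names are just for illustration):

create table abb (a int, b int, g int) partition by hash (b);
create table abb_p0 partition of abb for values with (modulus 3, remainder 0);
create table abb_p1 partition of abb for values with (modulus 3, remainder 1);
create table abb_p2 partition of abb for values with (modulus 3, remainder 2);
insert into abb select i, i % 10, null from generate_series(1, 100000) i;
select a, b, min(a) over (partition by b) as g from abb;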

Ideally with a query plan like this:

Window:
  Append:
    Sort on table_p0
    Sort on table_p1
    Sort on table_p2

Instead, I get this:

Window:
  Sort:
    Append:
      Parallel seq scan on table_p0
      Parallel seq scan on table_p1
      Parallel seq scan on table_p2

Which is a BIG no-no, as there could potentially be thousands of partitions
and BILLIONS of rows per table. This can be solved by manually implementing
the first query plan via scripting, e.g.:

do $$
declare i int;
begin
  for i in 0..get_npartitions() loop
    -- format() splices the partition suffix into the query text
    execute format(
      'select a, b, min(a) over (partition by b) as g from abb_p%s', i);
  end loop;
end $$ language plpgsql;

This is not ideal, but it is perfectly workable. I'm sure you are already
aware of this; it just seems like a really simple fix to me: if the window
function's PARTITION BY clause exactly matches the partition scheme of the
table it queries, the planner should take advantage of those partitions.

Thanks,
Ben


Re: Query Planner not taking advantage of HASH PARTITION

2022-04-21 Thread Benjamin Tingle
I'm going to forward my response from an individual thread with Jeff here
in case anyone else can help me out.

I wasn't sure if forwarding the message to the mailing list would be
considered a continuation of the same subject, so I'm just going to paste
it here. I'm a bit of an email noob :P

-

Jeff,

First off, thanks for the thoughtful response.

@ the first point about write locks
I think I had/have a misconception about how inserts work in postgres. It's
my understanding that postgres will never draft a parallel insert plan for
any query (except maybe CREATE TABLE AS?) because the process needs to
acquire an exclusive access write lock to the table it is inserting on. My
thinking was that since the partitions are treated as separate tables that
they can be theoretically inserted to in parallel.

@ the second point about indexes on textfield
I realized my error on this after I sent the email: indexes do not speed up
large joins, just small ones.

@ the third point about hash joins
So this is interesting to me. Your description of how hash joins work
sounds like the behaviour I would want, yet performing huge joins is where
my larger databases have been getting stuck. Upon looking back at my code,
I think I realize perhaps why they were getting stuck. My database doesn't
have just one table; it has three principal tables which relate to one
another. (Full disclosure: these databases/tables are distributed between
multiple machines and can get quite enormous; some tables individually are
200+ GB.)

tab1 (textfield1 text, tf1_id bigint) unique on textfield1
tab2 (textfield2 text, tf2_id bigint) unique on textfield2
tab3 (tf1_id_fk bigint, tf2_id_fk bigint) unique on tf1_id_fk, tf2_id_fk
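
Spelled out as DDL, my reading of that layout is roughly the following
(constraint details are assumed):

create table tab1 (textfield1 text, tf1_id bigint, unique (textfield1));
create table tab2 (textfield2 text, tf2_id bigint, unique (textfield2));
create table tab3 (tf1_id_fk bigint, tf2_id_fk bigint,
                   unique (tf1_id_fk, tf2_id_fk));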

So as I'm uploading new data (in the form of (textfield1, textfield2)
entries) I need to record the ID of each matching record on the join(s) or
null if there was no match (thus a new record). The way I have been
accomplishing this so far has been like so:

1. create temporary table upload (textfield1 text, textfield2 text,
   tf1_id bigint, tf2_id bigint);
2. copy upload (textfield1, textfield2) from :'source';
3. update upload set tf1_id = tab1.tf1_id from tab1
   where upload.textfield1 = tab1.textfield1;
4. create temporary table new_textfield1 (textfield1 text, tf1_id bigint);
5. insert into new_textfield1 (select distinct on (textfield1) textfield1,
   nextval('tf1_id_sequence') as tf1_id from upload where tf1_id is null);
6. update upload u set tf1_id = ntf1.tf1_id from new_textfield1 ntf1
   where u.tf1_id is null and u.textfield1 = ntf1.textfield1;
-- etc. continue the process for tab2 and tab3
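
For what it's worth, a single-statement alternative to steps 3-6 might look
roughly like the below (an untested sketch that uses an aggregate to mint
the new ids during the join itself, with the same table names as above):

update upload u
set tf1_id = ids.tf1_id
from (
    -- one row per distinct textfield1: keep the existing id if tab1 has
    -- one, otherwise draw a fresh value from the sequence
    select u2.textfield1,
           coalesce(min(t.tf1_id), nextval('tf1_id_sequence')) as tf1_id
    from upload u2
    left join tab1 t on t.textfield1 = u2.textfield1
    group by u2.textfield1
) ids
where u.textfield1 = ids.textfield1;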

Now, since I wrote that code I've learned about aggregation & window
functions, so I can generate the new ids during the big table join rather
than after the fact; but the big join "update" statement has been where the
process gets stuck for huge tables. I notice that the query planner
generates a different strategy when just selecting data vs an "insert
select" or "update".

For example, when I write a query to join entries from one huge table to
another using a plain select statement, I get a nice parallel hash join
plan like the one you mention.

> explain select * from hugetable1 ht1 left join hugetable2 ht2 using (some_field);


Results in:

>  Gather
>Workers Planned: 2
>->  Parallel Hash Left Join
>  Hash Cond: ((ht1.some_field)::text = (ht2.some_field)::text)
>  ->  Parallel Append
>->  Parallel Seq Scan on hugetable2 ht2
>

However, when I change this to an update or insert select, I get a very
different plan.

> explain insert into dummy1 (select * from hugetable1 ht1
>   join hugetable2 ht2 using (some_field))
> OR
> explain update hugetable1 ht1 set id = ht2.id from hugetable2 ht2 where
> ht1.some_field = ht2.some_field


Results in:

> Update on hugetable1 ||OR|| Insert on dummy1
>->  Merge Join
>  Merge Cond: ((ht1.some_field)::text = (ht2.some_field)::text)
>  ->  Sort
>Sort Key: ht1.some_field
>->  Seq Scan on hugetable1 ht1
>  ->  Materialize
>->  Sort
>  Sort Key: ht2.some_field
>  ->  Append
>->  Seq Scan on hugetable2 ht2


Maybe this query should perform similarly to the hash join, but the fact
remains that I've had databases stuck for weeks on plans like this. The
partitioning strategy I've adopted has been an attempt to force data
locality during the join operation, and so far has been working reasonably
well. If you have some insight into why these large update/insert
operations go so slowly, it would be much appreciated.

-Ben

On Sun, Apr 17, 2022 at 11:50 AM Alvaro Herrera 
wrote:

> On 2022-Apr-14, Benjamin Tingle wrote:

Re: Query Planner not taking advantage of HASH PARTITION

2022-04-17 Thread Benjamin Tingle
Interesting. Why is it impossible to prune hash partitions? Maybe "prune"
isn't the best word; "use to advantage" is closer to what I mean. At the
very least, it should be possible to utilize a parallel insert against a
table partitioned by hash (partition the query rows, then distribute them
to parallel workers).

On Sun, Apr 17, 2022 at 9:09 AM Tom Lane  wrote:

> Benjamin Tingle  writes:
> > I've recently started taking advantage of the PARTITION BY HASH feature
> for
> > my database system. It's a really great fit since my tables can get quite
> > large (900M+ rows for some) and splitting them up into manageable chunks
> > should let me upload to them without having to update an enormous index
> > every time. What's more, since each partition has a write lock
> independent
> > of the parent table, it should theoretically be possible to perform a
> > parallelized insert operation, provided the data to be added is
> partitioned
> > beforehand.
>
> > What has been disappointing is that the query planner doesn't seem to
> > recognize this potential.
>
> That's because there isn't any.  The hash partitioning rule has
> basically nothing to do with any plausible WHERE condition.  If you're
> hoping to see partition pruning happen, you need to be using list or
> range partitions, with operators compatible with your likely WHERE
> conditions.
>
> (I'm of the opinion that the hash partitioning option is more in the
> category of a dangerous nuisance than a useful feature.  There are some
> around here who will argue otherwise, but they're wrong for exactly the
> reason that it's impossible to prune hash partitions.)
>
> regards, tom lane
>
-- 

Ben(t).


Query Planner not taking advantage of HASH PARTITION

2022-04-17 Thread Benjamin Tingle
Greetings Postgres Developers,

I've recently started taking advantage of the PARTITION BY HASH feature for
my database system. It's a really great fit since my tables can get quite
large (900M+ rows for some) and splitting them up into manageable chunks
should let me upload to them without having to update an enormous index
every time. What's more, since each partition has a write lock independent
of the parent table, it should theoretically be possible to perform a
parallelized insert operation, provided the data to be added is partitioned
beforehand.
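
The sort of manual workaround I have in mind would look roughly like this,
assuming the incoming data has already been split with the same modulus
(partition names are illustrative, and each statement would run from its
own connection):

-- session 1
insert into target_p0 select * from temp_data_p0;
-- session 2
insert into target_p1 select * from temp_data_p1;
-- session 3
insert into target_p2 select * from temp_data_p2;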

What has been disappointing is that the query planner doesn't seem to
recognize this potential. For example, if I have a large list of input
data, and I want to perform a select operation across the target table:

  -- target table is hashed on 'textfield' & has a unique index on 'textfield'
  select * from temp_data td
    left join target tg on td.textfield = tg.textfield;

I would expect to get a query plan like this:

  partition temp_data
  parallel scan on
    target_p0 using target_p0_textfield_uniq_idx against temp_data_p0
    target_p1 using target_p1_textfield_uniq_idx against temp_data_p1
    target_p2 using target_p2_textfield_uniq_idx against temp_data_p2
    ...

Instead, I get a seemingly terrible plan like this:

  hash temp_data
  sequential scan on
    target_p0 against temp_data
    target_p1 against temp_data
    target_p2 against temp_data
    ...

It doesn't even make use of the index on the textfield! Instead, it opts to
hash all of temp_data and perform a sequential scan against it.

It doesn't help if I partition temp_data by textfield beforehand either
(using the same scheme as the target table). It still opts to concatenate
all of temp_data, hash it, then perform a sequential scan against the
target partitions.
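
(A caveat worth noting: partition-wise joins are controlled by a planner
setting that defaults to off, so unless it was enabled for these tests the
planner would not even consider a per-partition join:)

set enable_partitionwise_join = on;
explain select * from temp_data td
  left join target tg on td.textfield = tg.textfield;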

On insert the behaviour is better but it still opts for a sequential insert
instead of a parallel one.

Does the query planner know something I don't? It's my intuition that it
should be faster to do a rough counting sort (partition by hash) first, and
then do N smaller more accurate sorts in parallel afterwards.

Currently I am writing custom scripts to emulate my desired behaviour, but
it would be nice if there were a way to get the query planner to do this
automatically. Any tricks to do this would be much appreciated!

-Ben