[GENERAL] Replication stops under certain circumstances
Hi I have some strange issues with a postgresql read replica that seems to stop replicating under certain circumstances. Whenever we have changes to our views we have script that drops all views and reload them from scratch with the new definitions. The reloading of the views happens in a transaction to avoid confusion for everyone using the database. When this update gets to the slave it seems there is a chance for a deadlock to occur that doesn't get detected. As I was trying to reproduce this behavior, I ran into another weird situation that I don't entirely understand. The symptom is the same that replication stops, but it looks quite different. This example won't reproduce the issue reliably, but after a few hours I get a slave that won't continue to replicate until I restart it. The queries in the example won't make much sense, and I don't know if they can be simplified further and still cause the "desired" effect. Setup: Launch a new RDS psql instance (9.6.2) on AWS (will be referred to as db-master) and create a read replica (will be referred to as db-slave). The following options are changed from AWS default: max_standby_streaming_delay=-1 hot_standby_feedback=1 On the master create 2 dummy tables: create table a (id serial primary key); create table b (id serial primary key); Setup thread 1 to do work on master: while true; do psql -h db-master -U postgres db -c 'begin; drop view if exists view_a cascade; drop view if exists view_b; drop view if exists view_c; create view view_a as select * from a; create view view_b as select * from b; create view view_c as select * from view_a join view_b using (id); insert into a values (default); insert into b values (default); commit;'; done Setup thread 2 to do work on Slave: while true; do psql -h db-slave -U postgres db -c 'begin; select * from view_c order by random() limit 10; select * from view_a order by random() limit 10;'; done Setup thread 3 to do more work on slave: while true; do psql -h db-slave -U postgres db -c 'begin; select * from view_b order by random() limit 10; select * from view_a order by random() limit 10;'; done Every now and then a deadlock is detected and one connection is aborted, this works as expected. But After a while(serveral hours) it becomes impossible to connect to db on db-slave and thread 2 and 3 stops producing output. When trying to connect the psql client just hangs. However it is possible connect to template1 database to get a look on what is going on. template1=> select * from pg_stat_activity; -[ RECORD 1 ]+ datid | 16384 datname | rdsadmin pid | 7891 usesysid | 10 usename | rdsadmin application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | -[ RECORD 2 ]+ datid | 1 datname | template1 pid | 11949 usesysid | 16388 usename | hiper application_name | psql client_addr | 192.168.10.166 client_hostname | client_port | 41002 backend_start | 2017-10-20 16:30:26.032745+02 xact_start | 2017-10-20 16:30:34.306418+02 query_start | 2017-10-20 16:30:34.306418+02 state_change | 2017-10-20 16:30:34.306421+02 wait_event_type | wait_event | state | active backend_xid | backend_xmin | 26891 query | select * from pg_stat_activity; There are no active connection except rdsadmin from aws. template1=> select * from pg_locks; locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath +--+--+--+---++---+-+---+--++---+-+-+-- virtualxid | | | | | 3/929 | | | | | 3/929 | 9640 | ExclusiveLock | t | t relation | 16390 | 2659 | | | | | | | | 4/829 | 9639 | AccessShareLock | t | t relation | 16390 | 1249 | | | | | | | | 4/829 | 9639 | AccessShareLock | t | t virtualxid | | | | | 4/829 | | | | | 4/829 | 9639 | ExclusiveLock | t | t relation | 1 | 11695 | | | | | | | | 5/148 | 11949 | AccessShareLock | t
Re: [GENERAL] Equivalence Classes when using IN
> On 11 Oct 2017, at 21.46, David Rowley wrote: > >> On 12 October 2017 at 08:37, Kim Rose Carlsen wrote: >> >>> Yeah. The ORDER BY creates a partial optimization fence, preventing >>> any such plan from being considered. >>>> >> >> I can see in the general case it semanticly means different things If you >> allow the WHERE to pass through ORDER BY. >> >> A special case can be allowed for WHERE to pass the ORDER BY if the column >> is part of DISTINCT ON. > > Yeah, we do allow predicates to be pushed down in that case. > Let's ignore that it's not a very useful query I have written. Why don't I see that predicate (customer_id) pushed into the outer nested loop so we don't have to sort the whole table on each loop. (See original post and follow up for definitions) QUERY PLAN - Nested Loop Left Join (cost=139.00..10392.96 rows=668 width=16) (actual time=0.528..35.120 rows=200 loops=1) Join Filter: (c.customer_id = product.customer_id) Rows Removed by Join Filter: 199900 -> Nested Loop (cost=0.28..199.21 rows=334 width=12) (actual time=0.075..1.146 rows=100 loops=1) -> Seq Scan on customer (cost=0.00..21.51 rows=334 width=8) (actual time=0.067..0.282 rows=100 loops=1) Filter: (age < 20) Rows Removed by Filter: 901 -> Index Only Scan using customer_pkey on customer c (cost=0.28..0.53 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=100) Index Cond: (customer_id = customer.customer_id) Heap Fetches: 100 -> Materialize (cost=138.73..173.75 rows=2001 width=8) (actual time=0.005..0.130 rows=2001 loops=100) -> Sort (cost=138.73..143.73 rows=2001 width=8) (actual time=0.448..0.588 rows=2001 loops=1) Sort Key: product.customer_id, product.product_id Sort Method: quicksort Memory: 142kB -> Seq Scan on product (cost=0.00..29.01 rows=2001 width=8) (actual time=0.006..0.215 rows=2001 loops=1) Planning time: 0.214 ms Execution time: 35.284 ms -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Equivalence Classes when using IN
> Yeah. The ORDER BY creates a partial optimization fence, preventing > any such plan from being considered. >> I can see in the general case it semanticly means different things If you allow the WHERE to pass through ORDER BY. A special case can be allowed for WHERE to pass the ORDER BY if the column is part of DISTINCT ON. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Equivalence Classes when using IN
> If the only reason that is_simple_subquery() rejects subqueries with > ORDER BY is due to wanting to keep the order by of a view, then > couldn't we make is_simple_subquery() a bit smarter and have it check > if the subquery is going to be joined to something else, which likely > would destroy the order, or at least it would remove any guarantees of > it. > > Something like the attached? I dont know if it makes any difference that the ORDER BY is used in a DISTINCT ON clause. In this case the ORDER BY is important. - Kim
Re: [GENERAL] Equivalence Classes when using IN
> You would benefit from adding the age column to view_customer, or at > least consider having some view which contains all the columns you'll > ever need from those tables and if you need special views with only a > subset of columns due to some software doing "select * from > viewname;", then you could just create some. Joining to the same table > again seems like a bit of a waste of effort for the planner and > executor. I would argue that the anti pattern would be the software that insist on using "select * from viewname;" from a view that has calculated columns that you do not care for. I recommend introducing both lightweight views and heavyweight views, so you can join up probably for what you need. My example is fabricated trying to simplify things, but I seem to create more confusion than clarity in my example. My point was only to see if anything could be added to the fabricated execution path. I agree that the listed example does not make sense. So I will try and give some more context to real use cases. Imagine an invoice entity where you have one relation for invoice base data and a relation for invoice_line. The invoice has some invoice_id, customer_id, due_date, paid_date and invoice_line contains each line with a invoice_id, display_name, amount. A view (view_invoice_with_amount) where you calculate the total. so a query could be SELECT c.customer_id, i.invoice_amount_total FROM view_customer c JOIN view_invoice_with_amount i ON c.customer_id = i.customer_id WHERE c.first_name = 'John'; If you ever need to filter by invoice_amount_total, it might be necesary denormalize the relations and cache the amount in the invoice table. > I'd assume customer_id is the PRIMARY KEY of customer and > is unique. This is a continuation of the previous example, maybe I should have included it all to make it more clear. But customer_id is declared as a primary key. > It's not all that clear what your view is doing here. Confusingly > there's a Sort in the plan, yet nothing in the query asked for that, > so I guess that the view must have an ORDER BY. If you get rid of that > the planner would likely use an index on product (customer_id) to > parameterise the nested loop, at least, it likely would, if you have > one. The view is defined in the original post. What I was trying to illustrate was a DISTINCT ON clause to prioritize multiple products pr customer to a somewhat "main" product for the customer. The ORDER BY on product_id would in this case then map the first product a customer gets to its "main" product. It could also be the most valuable product or newest ordered active product etc. It is just some way of mapping one to many relation to a one to one. Again the example is simplified and fabricated and maybe looses its power to explain its intents. > It's pretty bad practice to have ORDER BY in views. I kinda wish we > didn't even allow it, but that ship sailed many years ago... It is required by DISTINCT ON and as soon as you go into reporting, datawarehouse then it gets difficult to avoid these along with group by. Instead of writing each query from the ground up you get a huge benefit by factorizing each query into meaningful entities that can stand alone and make sense by themself, and from these build up the query to answer your questions. That way you gain lots of re-use of code and definition doesn't change between queries. The down side is it leaves alot of work to the planner. It's a trade off between optimization, readability and simplicity. I hope I make more sense now. - Kim
Re: [GENERAL] Equivalence Classes when using IN
>On 9 October 2017 at 08:01, Kim Rose Carlsen wrote: >> Is this because postgres never consider IN clause when building equivalence >> class's? > >Only btree equality operators are considered at the moment. After good night sleep and reading the previous discussion, I am no longer sure I have reduced my original problem to the right example. If we continue from previous setup and add the following: ALTER TABLE customer ADD COLUMN age INTEGER; UPDATE customer SET age = customer_id / 5; CREATE INDEX ON customer (age); CREATE INDEX ON product (customer_id); SET enable_hashjoin = false; SET enable_mergejoin = false; EXPLAIN ANALYZE SELECT * FROM customer JOIN view_customer ON customer.customer_id = view_customer.customer_id WHERE age < 20; QUERY PLAN - Nested Loop Left Join (cost=139.00..10392.96 rows=668 width=16) (actual time=0.528..35.120 rows=200 loops=1) Join Filter: (c.customer_id = product.customer_id) Rows Removed by Join Filter: 199900 -> Nested Loop (cost=0.28..199.21 rows=334 width=12) (actual time=0.075..1.146 rows=100 loops=1) -> Seq Scan on customer (cost=0.00..21.51 rows=334 width=8) (actual time=0.067..0.282 rows=100 loops=1) Filter: (age < 20) Rows Removed by Filter: 901 -> Index Only Scan using customer_pkey on customer c (cost=0.28..0.53 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=100) Index Cond: (customer_id = customer.customer_id) Heap Fetches: 100 -> Materialize (cost=138.73..173.75 rows=2001 width=8) (actual time=0.005..0.130 rows=2001 loops=100) -> Sort (cost=138.73..143.73 rows=2001 width=8) (actual time=0.448..0.588 rows=2001 loops=1) Sort Key: product.customer_id, product.product_id Sort Method: quicksort Memory: 142kB -> Seq Scan on product (cost=0.00..29.01 rows=2001 width=8) (actual time=0.006..0.215 rows=2001 loops=1) Planning time: 0.214 ms Execution time: 35.284 ms The planner prefer to use hash and merge joins which is ok, when many rows are to be joined, I don't think any condition can be merged to make these case faster. I have disabled merge and hash joins to get to a nested loop join instead, in this case it would be much better if customer_id can be pulled inside the loop, so it can look at only the relevant rows and not all rows for each loop. I somehow inferred that this would be the same as selecting from the view using IN clause, now I'm not so sure anymore. I can see there is a trade off between planner time and how exotic the case is. If you want to be able to hide abstraction through views I guess the nature becomes more OLAP oriented than OLTP. Best Regards Kim Carlsen -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
[GENERAL] Equivalence Classes when using IN
Hi I have this query where I think it's strange that the join doesn't pull the where condition in since RHS is equal to LHS. It might be easier to expain with an example Setup CREATE TABLE customer ( customer_id INTEGER PRIMARY KEY ); CREATE TABLE product ( product_id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL REFERENCES customer (customer_id) ); INSERT INTO customer (SELECT generate_series FROM generate_series(0, 100)); INSERT INTO product (product_id, customer_id) (SELECT generate_series, generate_series / 2 FROM generate_series(0, 2000)); Query EXPLAIN ANALYSE SELECT * FROM customer c JOIN (SELECT DISTINCT ON (customer_id) * FROM product ORDER BY customer_id, product_id) p ON c.customer_id = p.customer_id WHERE c.customer_id IN (500, 501); QUERY PLAN -- Merge Join (cost=172.43..186.25 rows=1 width=12) (actual time=1.350..1.353 rows=2 loops=1) Merge Cond: (c.customer_id = product.customer_id) -> Sort (cost=13.93..13.93 rows=2 width=4) (actual time=0.036..0.036 rows=2 loops=1) Sort Key: c.customer_id Sort Method: quicksort Memory: 25kB -> Bitmap Heap Scan on customer c (cost=8.58..13.92 rows=2 width=4) (actual time=0.026..0.027 rows=2 loops=1) Recheck Cond: (customer_id = ANY ('{500,501}'::integer[])) Heap Blocks: exact=1 -> Bitmap Index Scan on customer_pkey (cost=0.00..8.58 rows=2 width=0) (actual time=0.018..0.018 rows=2 loops=1) Index Cond: (customer_id = ANY ('{500,501}'::integer[])) -> Unique (cost=158.51..169.81 rows=200 width=8) (actual time=0.783..1.221 rows=502 loops=1) -> Sort (cost=158.51..164.16 rows=2260 width=8) (actual time=0.782..0.929 rows=1003 loops=1) Sort Key: product.customer_id, product.product_id Sort Method: quicksort Memory: 142kB -> Seq Scan on product (cost=0.00..32.60 rows=2260 width=8) (actual time=0.015..0.366 rows=2001 loops=1) Planning time: 0.281 ms Execution time: 1.432 ms I would expect that since c.customer_id = p.customer_id then p.customer_id IN (500, 501). If I apply this rule myself, I get a much nicer plan (and it could be even better with an index on product_id). QUERY PLAN -- Merge Join (cost=52.70..53.11 rows=1 width=12) (actual time=0.686..0.693 rows=2 loops=1) Merge Cond: (product.customer_id = c.customer_id) -> Unique (cost=38.77..38.89 rows=22 width=8) (actual time=0.647..0.651 rows=2 loops=1) -> Sort (cost=38.77..38.83 rows=23 width=8) (actual time=0.646..0.647 rows=4 loops=1) Sort Key: product.customer_id, product.product_id Sort Method: quicksort Memory: 25kB -> Seq Scan on product (cost=0.00..38.25 rows=23 width=8) (actual time=0.331..0.632 rows=4 loops=1) Filter: (customer_id = ANY ('{500,501}'::integer[])) Rows Removed by Filter: 1997 -> Sort (cost=13.93..13.93 rows=2 width=4) (actual time=0.033..0.033 rows=2 loops=1) Sort Key: c.customer_id Sort Method: quicksort Memory: 25kB -> Bitmap Heap Scan on customer c (cost=8.58..13.92 rows=2 width=4) (actual time=0.025..0.026 rows=2 loops=1) Recheck Cond: (customer_id = ANY ('{500,501}'::integer[])) Heap Blocks: exact=1 -> Bitmap Index Scan on customer_pkey (cost=0.00..8.58 rows=2 width=0) (actual time=0.018..0.018 rows=2 loops=1) Index Cond: (customer_id = ANY ('{500,501}'::integer[])) Planning time: 0.386 ms Execution time: 0.774 ms (19 rows) Is this because postgres never consider IN clause when building equivalence class's? Are there any interests in adding such rule? My idea is to wrap this in a view CREATE VIEW view_customer AS SELECT c.customer_id, p.product_id FROM customer c LEFT JOIN (SELECT DISTINCT ON (customer_id) * FROM product ORDER BY customer_id, product_id) p ON c.customer_id = p.customer_id Where the LEFT JOIN can be pruned if there is no explicit need for product_id. Here I loose the power to express that both c.customer_id and p.customer_id is the same. Best regards Kim Carlsen
[GENERAL] Remove useless joins (VARCHAR vs TEXT)
Hi It seems there are some difference in VARCHAR vs TEXT when postgres tries to decide if a LEFT JOIN is useful or not. I can't figure out if this is intentional because there are some difference between TEXT and VARCHAR that I dont know about or if it's a bug. I would expect both examples to produce same query plan a) create table a (id varchar primary key); create table b (id varchar primary key); explain select a.* from a left join (select distinct id from b) as b on a.id = b.id; QUERY PLAN -- Hash Right Join (cost=67.60..113.50 rows=1360 width=32) Hash Cond: ((b.id)::text = (a.id)::text) -> HashAggregate (cost=27.00..40.60 rows=1360 width=32) Group Key: b.id -> Seq Scan on b (cost=0.00..23.60 rows=1360 width=32) -> Hash (cost=23.60..23.60 rows=1360 width=32) -> Seq Scan on a (cost=0.00..23.60 rows=1360 width=32) (7 rows) b) create table a (id text primary key); create table b (id text primary key); explain select a.* from a left join (select distinct id from b) as b on a.id = b.id; QUERY PLAN -- Seq Scan on a (cost=0.00..23.60 rows=1360 width=32) - Kim Carlsen
Re: [GENERAL] Strict min and max aggregate functions
> CREATE OR REPLACE FUNCTION strict_min_agg (anyarray,anyelement ) > RETURNS anyarray LANGUAGE sql IMMUTABLE AS $$ > SELECT CASE > WHEN $1 IS NULL THEN ARRAY[$2] > WHEN $1[1] IS NULL THEN $1 > WHEN $2 IS NULL THEN ARRAY[$2] -- use $2 not NULL to preserve > type > ELSE ARRAY[least($1[1],$2)] END ; > $$; > > > CREATE OR REPLACE FUNCTION strict_min_final (anyarray) > RETURNS anyelement LANGUAGE sql IMMUTABLE AS $$ > SELECT CASE WHEN $1 IS NULL THEN NULL ELSE $1[1] END ; > $$; > > CREATE AGGREGATE strict_min (x anyelement) ( > sfunc = strict_min_agg, > stype = anyarray, > finalfunc = strict_min_final > ); > It seems like this should be possible to do in something more close to O(log n). But I'm not sure how to fix the semantics with aggregates. SELECT max() FROM ; SELECT true FROM WHERE IS NULL LIMIT 1; Both these queries can be resolved with a index lookup (if one is available).
Re: [GENERAL] Partial update on an postgres upsert violates constraint
> AFAIK, EXCLUDED is only available in a trigger function: > > https://www.postgresql.org/docs/9.5/static/trigger-definition.html > > You are using EXCLUDED in a regular function so it would not be found. > > Can you also show the failure for your alternate method? >From the manual https://www.postgresql.org/docs/9.5/static/sql-insert.html " conflict_action conflict_action specifies an alternative ON CONFLICT action. It can be either DO NOTHING, or a DO UPDATE clause specifying the exact details of the UPDATE action to be performed in case of a conflict. The SET and WHERE clauses in ON CONFLICT DO UPDATE have access to the existing row using the table's name (or an alias), and to rows proposed for insertion using the special excluded table. SELECT privilege is required on any column in the target table where corresponding excluded columns are read. "
Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
> Hang on -- upthread the context was inner join, and the gripe was join > fast with '=', slow with INDF. When he said the nulls were > 'generated', I didn't follow that they were part of the original > query. If the nulls are generated along with the query, sure, an > index won't help. > > I maintain my earlier point; with respect to the original query, to > get from performance of INDF to =, you have three options: > a) expr index the nulls (assuming they are physically stored) > b) convert to ((a = b) or a is null and b is null) which can help with > a bitmap or plan > c) covert to union all equivalent of "b" > > merlin a) and b) would be workaround that would run an order of magnitude slower. The query starts with a full table scan of a large table. If the planner had started elsewhere it could have reduced the result to 1-2 rows from the start. It won't choose this plan without the help from =. c) could be a acceptable workaround, but it would clutter up if you would want more than one column to be IS NOT DISTINCT FROM. You end up with 2^n unions to simulate IS NOT DISTINCT FROM. Without knowing the work required, I will still argue that having IS NOT DISTINCT FROM use the same transitive rules as equality, would be a better approach. With fear of talking about things I know little(nothing) of, I think the description of EquivalenceClasses in postgres/src/backend/optimizer/README, should be extended to also include EquivalenceClasses of IS NOT DISTINCT FROM.
Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
>> It might raise another problem, that the nulls are generated through LEFT >> JOINS where no rows are defined. Then the 0 or -1 value need to be >> a computed value. Won't this throw off index lookups? (I might be >> more confused in this area). > >Not following this. The nulls are generated by something like this SELECT c.circuit_id, cc.customer_id FROM circuit AS c LEFT JOIN circuit_customer AS cc ON c.circuit_id = cc.circuit_id To make a magic '0' customer we would be required to use COALESCE(cc.customer_id, '0') I dont think the optimizer will do anything clever with the '0' we have computed from null. I could ofc. by default assign all unassigned circuits to '0' in circuit_customer. I'm not a fan though. >BTW, if you want a fast plan over the current >data without consideration of aesthetics, try this: > >CREATE VIEW view_circuit_with_status AS ( >SELECT r.*, > s.circuit_status, > s.customer_id AS s_customer_id, > p.line_speed, > p.customer_id AS p_customer_id > FROM view_circuit r > JOIN view_circuit_product_main s > ON r.circuit_id = s.circuit_id > AND r.customer_id, s.customer_id > JOIN view_circuit_product p > ON r.circuit_id = p.circuit_id > AND r.customer_id, s.customer_id > UNION ALL SELECT r.*, > s.circuit_status, > s.customer_id AS s_customer_id, > p.line_speed, > p.customer_id AS p_customer_id > FROM view_circuit r > JOIN view_circuit_product_main s > ON r.circuit_id = s.circuit_id > AND r.customer_id IS NULL > AND s.customer_id IS NULL > JOIN view_circuit_product p > ON r.circuit_id = p.circuit_id> I will have to figure something out, but this specific case is still problematic since we would like to filter this view using different criteria's, like circuit_no, products or customers. But with all these detours, I assume that a change to IS NOT DISTINCT FROM, is difficult or not wanted?
Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
On Sat, Oct 29, 2016 at 8:27 AM, Kim Rose Carlsen wrote: > > I have tried creating a function called > > zero_if_null(int) : int that just select COALESCE($1, 0) > > and adding a index on (zero_if_null(customer_id)) on table that contains > > customer_id. The only thing I get from is the planner now only knows how to > > compare customer_id, but it still doesn't know that they are of same value, > > only I know that and I want to declare it for the planner. > Well, the *behavior* is mandated by the sql standard. Our > implementation is slow however. Sorry I'm not following, what behavior is mandated by the sql standard? > I'm surprised the attached function > didn't help, it can be inlined and I was able to get bitmap or which > is pretty good. As I said upthread I think INDF could theoretically > run as fast as equality -- it just doesn't today. It might be harsh to say that it doesn't help at all. I does half the running time, but I need it to run an order of magnitude faster. Here is the plan with the empty_if_null (customer_id is actually varchar) https://explain.depesz.com/s/M1LV with empty_if_null + functional index https://explain.depesz.com/s/eOL with indf > As your joins are written you could probably convert this by reserving > a customer_id to the work that you're trying to do with null, say, 0, > or -1. This is a somewhat dubious practice but seems a better fit for > your use case. I don't think INDF is good in this usage. > > merlin This will work well, I think. But I'm not sure I can mentally accept an unfilled value should not be null (eg. 0, '', '-01-01'). But I can see how the equals operator will work well with this. It might raise another problem, that the nulls are generated through LEFT JOINS where now rows are defined. Then the 0 or -1 value need to be a computed value. Won't this throw of index lookups? (I might be more confused in this area).
Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
>> This doesn't do much good. This doesn't tell the planner that the 3 >> customer_ids are actually of same value, and it therefore can't filter them >> as it sees fit. > You do know you can index on a function, and the planner then keeps > stats on it when you run analyze right? Yes, but I don't think it will make any difference. I don't think I can solve this with an index lookup. I think my savior is the inference that the 2 columns are of same value and the planner are free to choose which order to do the filter and join with this extra information. I have tried creating a function called zero_if_null(int) : int that just select COALESCE($1, 0) and adding a index on (zero_if_null(customer_id)) on table that contains customer_id. The only thing I get from is the planner now only knows how to compare customer_id, but it still doesn't know that they are of same value, only I know that and I want to declare it for the planner. I could probably rewrite the whole view in one query, and then fix it with a proper index. But I think I will loose alot of readability. I could also change the structure to save an explicit state, instead of a calculated state. But then I get some redundancy I need to make sure always stays the same. In the end one of these will probably be the solution. I guess the question is more or less, why doesn't IS NOT DISTINCT FROM behave the same way as = operator, are there any alternatives? And a plausible use case for when it would be useful.
Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
> try this :-D > create or replace function indf(anyelement, anyelement) returns anyelement as > $$ > select $1 = $2 or ($1 is null and $2 is null); > $$ language sql; > > CREATE VIEW view_circuit_with_status AS ( >SELECT r.*, > s.circuit_status, > s.customer_id AS s_customer_id, > p.line_speed, > p.customer_id AS p_customer_id > FROM view_circuit r > JOIN view_circuit_product_main s >ON r.circuit_id = s.circuit_id > AND indf(r.customer_id, s.customer_id) > JOIN view_circuit_product p >ON r.circuit_id = p.circuit_id > AND indf(r.customer_id, s.customer_id) > > merlin This doesn't do much good. This doesn't tell the planner that the 3 customer_ids are actually of same value, and it therefore can't filter them as it sees fit.
Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
> > On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen wrote: > > Hi > > > > I was wondering if there is a way to hint that two columns in two different > > tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if > > table_a.key = 'test' THEN table_b.key = 'test' . > > > > The equals operator already does this but it does not handle NULLS very well > > (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and > > doesn't establish the same inference rules as equals. > > The whole idea behing Postgres' query planner is that you don't have > to use any hints. Late model versions of postgres handle nulls fine, > but nulls are never "equal" to anything else. I.e. where xxx is null > works with indexes. Where x=y does not, since null <> null. > > Suggestion for getting help, put a large-ish aka production sized > amount of data into your db, run your queries with explain analyze and > feed them to https://explain.depesz.com/ and post the links here along > with the slow queries. A lot of times the fix is non-obvious if you're > coming from another db with a different set of troubleshooting skills > for slow queries. The problem is how to reduce the problem into its core, without introducing all the unnecessary. Maybe simplifying the problem, also makes it impossible to say where I go wrong. It might be that I try to push too much logic into the SQL layer and Im adding too many layers of abstraction to accomplish what I want. So let me try and elaborate a little more. I have couple a tables describing resources (circuits) and allocation of resources to customers and products. First layer is a view called view_circuit. This view (left) join any table the circuit table reference through a foreign key (it gives exactly the same rows and columns as circuit table + some extra information like customer_id). Second layer is 2 views 1) a view describing if the circuit is active or inactive, lets call it view_circuit_product_main 2) a view describing line_speed about the circuit, lets call it view_circuit_product These views use aggregations (both GROUP BY and SELECT DISTINCT ON (...)) if this has any relevance. Third layer Next step is to add a view that tells both (joins the two views together on circuit_id). lets call the new view view_circuit_with_status This view is defined as CREATE VIEW view_circuit_with_status AS ( SELECT r.*, s.circuit_status, s.customer_id AS s_customer_id, p.line_speed, p.customer_id AS p_customer_id FROM view_circuit r JOIN view_circuit_product_main s ON r.circuit_id = s.circuit_id AND r.customer_id IS NOT DISTINCT FROM s.customer_id JOIN view_circuit_product p ON r.circuit_id = p.circuit_id AND r.customer_id IS NOT DISTINCT FROM s.customer_id ); SELECT * FROM view_circuit_with_status WHERE customer_id = 1; Since customer_id is exposed through view_circuit the planner assumes view_circuit.customer_id = 1 and from there attempts to join view_circuit_product_main and view_circuit_product using circuit_id. This is not running optimal. However if we change our query to allow the inference rule to take place, the query is executed very fast. SELECT * FROM view_circuit_with_status WHERE customer_id = 1 AND s_customer_id = 1 AND p_customer_id = 1; If a circuit is not assigned to any customers customer_id is set to NULL. This is the reason I can't use = operator. If I do use = then I can't find circuit which are unassigned, but the query do run effective. I can see this still ends up being quite abstract, but the point is it would be quite beneficial if IS NOT DISTINCT used the same rules as = operator. I have attached the 2 query plans Bad plan: https://explain.depesz.com/s/SZN Good plan: https://explain.depesz.com/s/61Ro - Kim Carlsen Do you use potatoes for long posts here?
Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
>>JOIN a AS table_b >> ON table_a.id = table_b.id >> AND table_a.key = table_b.key > Anyways, to use an index for that join, you'd need a composite index on id > *AND* key, not two separate indexes. Its not as much as for using the index, but to be able to push the where clause inside both JOINED tables. The = operator already does this. It gives the planner the option to join the table in using either id = id or key = key. It can deduce that if I have a WHERE condition with table_a.key = 'Something', then table_b.key must also be 'Something'. It can then decide to filter table_b on key. When using IS NOT DISTINCT FROM, the planner is not considering the same options. Now its like it doesn't know table_a.key is same the same as table_b.key. I would somehow expect the IS NOT DISTINCT FROM operator to do the same. As it establish the same rules.. If a = 'test' and a IS NOT DISTINCT FROM b then b = 'test' also - Kim Carlsen
Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
> The bigger picture here is that if you've designed a data representation > that requires that a null be considered "equal to" another null, you're > really going to be fighting against the basic semantics of SQL. You'd > be best off to rethink the representation. We've not seen enough info > about your requirements to suggest just how, though. Sometimes I do wake up in the night scared and afraid that I have used NULLs the wrong way my whole life. I usually use NULLs to denote a value has not been provided. In my dreams I fear I should have used empty string instead but if that is true, then I no longer know what is right and wrong :). In this specific case, its a resource management table. If a resource is not allocated to any customers, then customer_id is set to null else the customer_id is set. From: Tom Lane Sent: Friday, October 28, 2016 8:17:01 PM To: Scott Marlowe Cc: Kim Rose Carlsen; pgsql-general@postgresql.org Subject: Re: [GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other Scott Marlowe writes: > On Fri, Oct 28, 2016 at 10:29 AM, Kim Rose Carlsen wrote: >> I was wondering if there is a way to hint that two columns in two different >> tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if >> table_a.key = 'test' THEN table_b.key = 'test' . >> >> The equals operator already does this but it does not handle NULLS very well >> (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and >> doesn't establish the same inference rules as equals. > The whole idea behing Postgres' query planner is that you don't have > to use any hints. Late model versions of postgres handle nulls fine, > but nulls are never "equal" to anything else. I.e. where xxx is null > works with indexes. Where x=y does not, since null <> null. The bigger picture here is that if you've designed a data representation that requires that a null be considered "equal to" another null, you're really going to be fighting against the basic semantics of SQL. You'd be best off to rethink the representation. We've not seen enough info about your requirements to suggest just how, though. regards, tom lane
Re: [GENERAL] How to hint two columns IS NOT DISTINCT FROM each other in a join
Sorry for double post, just ignore this post.. From: pgsql-general-ow...@postgresql.org on behalf of Kim Rose Carlsen Sent: Thursday, October 27, 2016 6:34:58 PM To: pgsql-general@postgresql.org Subject: [GENERAL] How to hint two columns IS NOT DISTINCT FROM each other in a join This sender failed our fraud detection checks and may not be who they appear to be. Learn about spoofing<http://aka.ms/LearnAboutSpoofing> Feedback<http://aka.ms/SafetyTipsFeedback> Hi I was wondering if there is a way to hint that two columns in two different tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if table_a.key = 'test' THEN table_b.key = 'test' . The equals operator already does this but it does not handle NULLS very well (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and doesn't establish the same inference rules as equals. Example: CREATE TABLE a ( id INTEGER PRIMARY KEY, key VARCHAR, value VARCHAR ); CREATE INDEX ON a (key); INSERT INTO a VALUES (1, 'test', 'test'), (2, 'foo', 'bar'), (3, null, null), (4, 'baz', 'qoz'); CREATE VIEW view_a AS ( SELECT table_a.id, table_a.key, table_a.value, table_b.key as b_key FROM a AS table_a JOIN a AS table_b ON table_a.id = table_b.id ); CREATE VIEW view_a_eq AS ( SELECT table_a.id, table_a.key, table_a.value, table_b.key as b_key FROM a AS table_a JOIN a AS table_b ON table_a.id = table_b.id AND table_a.key = table_b.key ); CREATE VIEW view_a_distinct AS ( SELECT table_a.id, table_a.key, table_a.value, table_b.key as b_key FROM a AS table_a JOIN a AS table_b ON table_a.id = table_b.id AND table_a.key IS NOT DISTINCT FROM table_b.key ); EXPLAIN SELECT * FROM view_a WHERE key = 'test'; QUERY PLAN Hash Join (cost=12.69..34.42 rows=4 width=100) Hash Cond: (table_b.id = table_a.id) -> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36) -> Hash (cost=12.64..12.64 rows=4 width=68) -> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) We only get index scan on table_a EXPLAIN SELECT * FROM view_a_eq WHERE key = 'test'; QUERY PLAN Nested Loop (cost=8.36..25.53 rows=1 width=100) Join Filter: (table_a.id = table_b.id) -> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) -> Materialize (cost=4.18..12.66 rows=4 width=36) -> Bitmap Heap Scan on a table_b (cost=4.18..12.64 rows=4 width=36) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) We get index scan on both tables and the where clause is pushed all the way down EXPLAIN SELECT * FROM view_a_distinct WHERE key = 'test'; QUERY PLAN Hash Join (cost=12.69..34.43 rows=1 width=100) Hash Cond: (table_b.id = table_a.id) Join Filter: (NOT ((table_a.key)::text IS DISTINCT FROM (table_b.key)::text)) -> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36) -> Hash (cost=12.64..12.64 rows=4 width=68) -> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) Same as the first example In these examples it really doesn't matter which plan is used. But for larger view it might be a great hint for the optimizer to know that since we are joining on a primary key we could hint that all other columns for the table is actually NOT DISTINCT from each other. This will result in the planner being able to push the WHERE condition down into the other joined tables. It works w
[GENERAL] How to hint 2 coulms IS NOT DISTINCT FROM each other
Hi I was wondering if there is a way to hint that two columns in two different tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if table_a.key = 'test' THEN table_b.key = 'test' . The equals operator already does this but it does not handle NULLS very well (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and doesn't establish the same inference rules as equals. Example: CREATE TABLE a ( id INTEGER PRIMARY KEY, key VARCHAR, value VARCHAR ); CREATE INDEX ON a (key); INSERT INTO a VALUES (1, 'test', 'test'), (2, 'foo', 'bar'), (3, null, null), (4, 'baz', 'qoz'); CREATE VIEW view_a AS ( SELECT table_a.id, table_a.key, table_a.value, table_b.key as b_key FROM a AS table_a JOIN a AS table_b ON table_a.id = table_b.id ); CREATE VIEW view_a_eq AS ( SELECT table_a.id, table_a.key, table_a.value, table_b.key as b_key FROM a AS table_a JOIN a AS table_b ON table_a.id = table_b.id AND table_a.key = table_b.key ); CREATE VIEW view_a_distinct AS ( SELECT table_a.id, table_a.key, table_a.value, table_b.key as b_key FROM a AS table_a JOIN a AS table_b ON table_a.id = table_b.id AND table_a.key IS NOT DISTINCT FROM table_b.key ); EXPLAIN SELECT * FROM view_a WHERE key = 'test'; QUERY PLAN Hash Join (cost=12.69..34.42 rows=4 width=100) Hash Cond: (table_b.id = table_a.id) -> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36) -> Hash (cost=12.64..12.64 rows=4 width=68) -> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) We only get index scan on table_a EXPLAIN SELECT * FROM view_a_eq WHERE key = 'test'; QUERY PLAN Nested Loop (cost=8.36..25.53 rows=1 width=100) Join Filter: (table_a.id = table_b.id) -> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) -> Materialize (cost=4.18..12.66 rows=4 width=36) -> Bitmap Heap Scan on a table_b (cost=4.18..12.64 rows=4 width=36) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) We get index scan on both tables and the where clause is pushed all the way down EXPLAIN SELECT * FROM view_a_distinct WHERE key = 'test'; QUERY PLAN Hash Join (cost=12.69..34.43 rows=1 width=100) Hash Cond: (table_b.id = table_a.id) Join Filter: (NOT ((table_a.key)::text IS DISTINCT FROM (table_b.key)::text)) -> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36) -> Hash (cost=12.64..12.64 rows=4 width=68) -> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) Same as the first example In these examples it really doesn't matter which plan is used. But for larger view it might be a great hint for the optimizer to know that since we are joining on a primary key we could hint that all other columns for the table is actually NOT DISTINCT from each other. This will result in the planner being able to push the WHERE condition down into the other joined tables. It works well for the = operator, but it is easy to see the it fails as soon as we asks about nulls. SELECT * FROM view_a_eq WHERE key IS NULL; id | key | value | b_key +-+---+--- (0 rows) Maybe my examples are too simple, but I hope you can verify the same will hold for a much larger table with 2 different views on top of them.
[GENERAL] How to hint two columns IS NOT DISTINCT FROM each other in a join
Hi I was wondering if there is a way to hint that two columns in two different tables IS NOT DISTINCT FROM each other. So that the optimizer may assume if table_a.key = 'test' THEN table_b.key = 'test' . The equals operator already does this but it does not handle NULLS very well (or not at all). And it seems IS NOT DISTINCT FROM is not indexable and doesn't establish the same inference rules as equals. Example: CREATE TABLE a ( id INTEGER PRIMARY KEY, key VARCHAR, value VARCHAR ); CREATE INDEX ON a (key); INSERT INTO a VALUES (1, 'test', 'test'), (2, 'foo', 'bar'), (3, null, null), (4, 'baz', 'qoz'); CREATE VIEW view_a AS ( SELECT table_a.id, table_a.key, table_a.value, table_b.key as b_key FROM a AS table_a JOIN a AS table_b ON table_a.id = table_b.id ); CREATE VIEW view_a_eq AS ( SELECT table_a.id, table_a.key, table_a.value, table_b.key as b_key FROM a AS table_a JOIN a AS table_b ON table_a.id = table_b.id AND table_a.key = table_b.key ); CREATE VIEW view_a_distinct AS ( SELECT table_a.id, table_a.key, table_a.value, table_b.key as b_key FROM a AS table_a JOIN a AS table_b ON table_a.id = table_b.id AND table_a.key IS NOT DISTINCT FROM table_b.key ); EXPLAIN SELECT * FROM view_a WHERE key = 'test'; QUERY PLAN Hash Join (cost=12.69..34.42 rows=4 width=100) Hash Cond: (table_b.id = table_a.id) -> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36) -> Hash (cost=12.64..12.64 rows=4 width=68) -> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) We only get index scan on table_a EXPLAIN SELECT * FROM view_a_eq WHERE key = 'test'; QUERY PLAN Nested Loop (cost=8.36..25.53 rows=1 width=100) Join Filter: (table_a.id = table_b.id) -> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) -> Materialize (cost=4.18..12.66 rows=4 width=36) -> Bitmap Heap Scan on a table_b (cost=4.18..12.64 rows=4 width=36) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) We get index scan on both tables and the where clause is pushed all the way down EXPLAIN SELECT * FROM view_a_distinct WHERE key = 'test'; QUERY PLAN Hash Join (cost=12.69..34.43 rows=1 width=100) Hash Cond: (table_b.id = table_a.id) Join Filter: (NOT ((table_a.key)::text IS DISTINCT FROM (table_b.key)::text)) -> Seq Scan on a table_b (cost=0.00..18.50 rows=850 width=36) -> Hash (cost=12.64..12.64 rows=4 width=68) -> Bitmap Heap Scan on a table_a (cost=4.18..12.64 rows=4 width=68) Recheck Cond: ((key)::text = 'test'::text) -> Bitmap Index Scan on a_key_idx (cost=0.00..4.18 rows=4 width=0) Index Cond: ((key)::text = 'test'::text) Same as the first example In these examples it really doesn't matter which plan is used. But for larger view it might be a great hint for the optimizer to know that since we are joining on a primary key we could hint that all other columns for the table is actually NOT DISTINCT from each other. This will result in the planner being able to push the WHERE condition down into the other joined tables. It works well for the = operator, but it is easy to see the it fails as soon as we asks about nulls. SELECT * FROM view_a_eq WHERE key IS NULL; id | key | value | b_key +-+---+--- (0 rows) Maybe my examples are too simple, but I hope you can verify the same will hold for a much larger table with 2 different views on top of them.