[GENERAL] How to change the join order of tables in a Hash Join
Hi Experts,

As in the example below, I think the planner should choose the plan that builds the hash table on testtbl2 (the table with fewer tuples), because building the hash table on testtbl2 should be faster. But it did not. I have tried to change the table order by tuning parameters, and even by using pg_hint_plan, but without success. Why does the planner not choose the plan that builds the hash table on testtbl2 (which should take less time)? And how can I change the order?

(I also looked at the planner info by rebuilding PostgreSQL, but found no useful information about the hash table.)

---
postgres=# create table testtbl1(id integer, c1 text, c2 text, c3 text, primary key (c1,c2,c3));
CREATE TABLE
postgres=# create table testtbl2(id integer, c1 text, c2 text, c3 text, primary key (c1,c2,c3));
CREATE TABLE
postgres=# insert into testtbl1 select generate_series(1,1000000), random()::text, random()::text, random()::text;
INSERT 0 1000000
postgres=# insert into testtbl2 select * from testtbl1 where id%7 = 0;
INSERT 0 142857
postgres=# explain analyze select * from testtbl1 inner join testtbl2 using(c1,c2,c3);
                               QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=38775.00..47171.72 rows=1 width=59) (actual time=1120.824..1506.236 rows=142857 loops=1)
   Hash Cond: ((testtbl2.c1 = testtbl1.c1) AND (testtbl2.c2 = testtbl1.c2) AND (testtbl2.c3 = testtbl1.c3))
   ->  Seq Scan on testtbl2  (cost=0.00..3039.57 rows=142857 width=56) (actual time=0.008..27.964 rows=142857 loops=1)
   ->  Hash  (cost=21275.00..21275.00 rows=1000000 width=55) (actual time=1120.687..1120.687 rows=1000000 loops=1)
         Buckets: 131072  Batches: 1  Memory Usage: 89713kB
         ->  Seq Scan on testtbl1  (cost=0.00..21275.00 rows=1000000 width=55) (actual time=0.035..458.522 rows=1000000 loops=1)
 Planning time: 0.922 ms
 Execution time: 1521.258 ms
(8 rows)

postgres=# set pg_hint_plan.enable_hint to on;
SET
postgres=# /*+
postgres*# HashJoin(testtbl1 testtbl2)
postgres*# Leading(testtbl1 testtbl2)
postgres*# */
postgres-# explain analyze select * from testtbl1 inner join testtbl2 using(c1,c2,c3);
                               QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=48541.00..67352.86 rows=1 width=59) (actual time=1220.625..1799.709 rows=142857 loops=1)
   Hash Cond: ((testtbl2.c1 = testtbl1.c1) AND (testtbl2.c2 = testtbl1.c2) AND (testtbl2.c3 = testtbl1.c3))
   ->  Seq Scan on testtbl2  (cost=0.00..3039.57 rows=142857 width=56) (actual time=0.011..58.649 rows=142857 loops=1)
   ->  Hash  (cost=21275.00..21275.00 rows=1000000 width=55) (actual time=1219.295..1219.295 rows=1000000 loops=1)
         Buckets: 8192  Batches: 32  Memory Usage: 2851kB
         ->  Seq Scan on testtbl1  (cost=0.00..21275.00 rows=1000000 width=55) (actual time=0.021..397.583 rows=1000000 loops=1)
 Planning time: 3.971 ms
 Execution time: 1807.710 ms
(8 rows)

postgres=#
---

Thanks and best regards!
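One detail that may explain the failed hint, going from memory of the pg_hint_plan documentation (worth verifying there): `Leading(a b)` constrains only the join *order*, while the doubly-parenthesized form `Leading((a b))` also fixes the join *direction*, with the first relation as the outer one. Since a hash join builds its hash table on the inner relation, something like the following sketch should force the hash to be built on testtbl2:

```sql
/*+
 HashJoin(testtbl1 testtbl2)
 Leading((testtbl1 testtbl2))   -- testtbl1 outer, testtbl2 inner (hashed)
 */
EXPLAIN ANALYZE
SELECT * FROM testtbl1 INNER JOIN testtbl2 USING (c1, c2, c3);
```

This is a sketch under the assumption that the double-paren Leading syntax behaves as described; the pg_hint_plan documentation is the authority here.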
[GENERAL] Partial update on a Postgres upsert violates constraint
Hi,

Basically I wanted to do a partial update inside PG (9.5), but it seems that a partial update fails when not all of the constraints are fulfilled (such as the NOT NULL constraints). Below are the SQL queries I used:

CREATE TABLE jobs (
  id integer PRIMARY KEY,
  employee_name TEXT NOT NULL,
  address TEXT NOT NULL,
  phone_number TEXT);

CREATE OR REPLACE FUNCTION upsert_job(job JSONB) RETURNS VOID AS $$
BEGIN
  INSERT INTO jobs AS origin VALUES(
    (job->>'id')::INTEGER,
    job->>'employee_name'::TEXT,
    job->>'address'::TEXT,
    job->>'phone_number'::TEXT)
  ON CONFLICT (id) DO UPDATE SET
    employee_name = COALESCE(EXCLUDED.employee_name, origin.employee_name),
    address = COALESCE(EXCLUDED.address, origin.address),
    phone_number = COALESCE(EXCLUDED.phone_number, origin.phone_number);
END;
$$ LANGUAGE PLPGSQL SECURITY DEFINER;

--Full insert (OK)
SELECT upsert_job('{"id" : 1, "employee_name" : "AAA", "address" : "City, x street no.y", "phone_number" : "123456789"}'::jsonb);

--Partial update that fulfills constraint (OK)
SELECT upsert_job('{"id" : 1, "employee_name" : "BBB", "address" : "City, x street no.y"}'::jsonb);

--Partial update that doesn't fulfill constraint (FAILS)
SELECT upsert_job('{"id" : 1, "phone_number" : "12345"}'::jsonb);
--ERROR:  null value in column "employee_name" violates not-null constraint
--DETAIL:  Failing row contains (1, null, null, 12345).

I also tried explicitly stating the columns that I wanted to insert, and it also fails. How do I go about doing this? Thank you
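The failure makes sense once you see that the candidate row from VALUES is checked against NOT NULL before the ON CONFLICT arm ever runs, so the COALESCE in the UPDATE comes too late. One possible workaround (a sketch only, not an authoritative answer, and it ignores the race between the SELECT and the INSERT) is to merge against any existing row *before* the INSERT, so the candidate row already satisfies NOT NULL:

```sql
-- Sketch: pre-merge against the existing row so NOT NULL columns are
-- already filled when the INSERT's candidate row is checked.
INSERT INTO jobs AS origin (id, employee_name, address, phone_number)
SELECT (job->>'id')::integer,
       COALESCE(job->>'employee_name', j.employee_name),
       COALESCE(job->>'address',       j.address),
       COALESCE(job->>'phone_number',  j.phone_number)
FROM (VALUES (1)) dummy(x)
LEFT JOIN jobs j ON j.id = (job->>'id')::integer
ON CONFLICT (id) DO UPDATE SET
  employee_name = EXCLUDED.employee_name,
  address       = EXCLUDED.address,
  phone_number  = EXCLUDED.phone_number;
```

This assumes the `job` JSONB variable from the function above; a concurrent delete between the SELECT and the INSERT could still surface the NOT NULL error.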
Re: [GENERAL] Avoiding double-counting in aggregates with more than one join?
On Fri, Nov 18, 2016 at 10:16 AM, Paul Jungwirth <p...@illuminatedcomputing.com> wrote:
> But is there a better way?

Nothing that would be more readable or likely more performant. When performing aggregation it is necessary to limit the scope of the query to only whatever it is you are calculating. Since you wish to compute two things, you need two separate parts plus a third to combine them.

If performance is a concern, you should move the aggregation queries directly into the main query instead of using the optimization-fencing CTE:

SELECT ...
FROM products
LEFT JOIN ( SELECT sum() ... ) s USING (product_id)
LEFT JOIN ( SELECT sum() ... ) r USING (product_id)

If the second "scope" doesn't need to be calculated but simply informs the one-and-only aggregate, you should use a SEMI JOIN (EXISTS) instead of an INNER/LEFT JOIN. But that isn't what you have here.

David J.
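A filled-in version of that sketch, using the table and column names from the original post (ON conditions instead of USING, since products exposes `id` rather than `product_id` — that detail is assumed from the earlier message):

```sql
SELECT p.id,
       COALESCE(s.total_sold, 0)  AS total_sold,
       COALESCE(r.total_spent, 0) AS total_spent
FROM products p
LEFT JOIN (SELECT product_id, SUM(price * qty) AS total_sold
             FROM sales
            GROUP BY product_id) s ON s.product_id = p.id
LEFT JOIN (SELECT product_id, SUM(price * qty) AS total_spent
             FROM resupplies
            GROUP BY product_id) r ON r.product_id = p.id;
```

Each derived table emits at most one row per product, so the join cannot multiply rows and the double-counting problem disappears.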
[GENERAL] Avoiding double-counting in aggregates with more than one join?
Hi All,

I've noticed in the past that doing aggregates while joining to more than one table can sometimes give you unintended results. For example, suppose I have three tables: products, sales, and resupplies. In sales I track what I sell, and in resupplies I track my own purchases to increase inventory. Both have a foreign key to products. Now I want to run a report showing the total dollars sold for each product versus the total dollars spent for each product. I could try this:

SELECT  p.id,
        SUM(s.price * s.qty) AS total_sold,
        SUM(r.price * r.qty) AS total_spent
FROM    products p
LEFT OUTER JOIN sales s ON s.product_id = p.id
LEFT OUTER JOIN resupplies r ON r.product_id = p.id
GROUP BY p.id
;

That seems pretty safe, but actually I get bad answers, for example if some product has this data:

sales
-----
sold 1 @ $2/ea

resupplies
----------
bought 1 @ $1/ea
bought 2 @ $1/ea

Then pre-grouping I have this:

 p.id | s.qty | s.price | r.qty | r.price
------+-------+---------+-------+---------
    1 |     1 |      $2 |     1 |      $1
    1 |     1 |      $2 |     2 |      $1

You can see the problem is that I'm going to double-count my sales. What I really want is this:

 p.id | s.qty | s.price | r.qty | r.price
------+-------+---------+-------+---------
    1 |     1 |      $2 |     1 |      $1
    1 |       |         |     2 |      $1

In the past I've always handled these situations by aggregating each table separately and only then joining things together:

WITH s AS (
  SELECT  product_id,
          SUM(price * qty) AS total_sold
  FROM    sales
  GROUP BY product_id
), r AS (
  SELECT  product_id,
          SUM(price * qty) AS total_spent
  FROM    resupplies
  GROUP BY product_id
)
SELECT  p.id,
        COALESCE(s.total_sold, 0),
        COALESCE(r.total_spent, 0)
FROM    products p
LEFT OUTER JOIN s ON s.product_id = p.id
LEFT OUTER JOIN r ON r.product_id = p.id
;

Since I've guaranteed that each CTE includes at most one row per product, this is safe from double-counting errors. But is there a better way? My approach feels verbose and harder to read. Also you have to type COALESCE a lot. :-) Is there some different way of doing things I haven't thought of yet?
Also I wonder about the performance of merging all these subqueries together. Would the final merge be any faster if I had an ORDER BY in each CTE? It seems like this pattern comes up a lot; what have others done about it?

Thanks,
Paul

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Trim performance on 9.5
I just ran EXPLAIN ANALYZE, please see images attached. The field doesn't have an index.

On Fri, Nov 18, 2016 at 12:16, vinny wrote:
> On 2016-11-18 15:06, William Ivanski wrote:
> > Hi,
> >
> > I recently did major improvements on perfomance on our routines by
> > simply removing the call for trim functions on specific bottlenecks.
> > Please see images attached for a simple example.
> >
> > I'm using PostgreSQL version 9.5.5-1.pgdg80+1 on Debian 8.6. Someone
> > knows if it's a bug on trim function? Thanks in advance.
> >
> > --
> > William Ivanski
>
> Did you run EXPLAIN on these queries?
>
> I'm guessing that you have an index on the field, but not on
> TRIM(field),
> which would mean that the database is forced to seqscan to fetch every
> row value, trim it and then compare it.

--
William Ivanski
Re: [GENERAL] pgbench and scaling
On Thu, Nov 17, 2016 at 8:08 PM, Rakesh Kumar wrote:
> I noticed that as I scale from 5 to 10 to 20 to 40, the TPS starts falling
> almost linearly:
>
> with 5, TPS was doing 639
> with 10 TPS was down to 490
> with 20 TPS was down to 280
> and so on.

Are the TPS numbers per pgbench client? If so, then you're getting 10 x 490 = 4900 TPS system-wide, or 20 x 280 = 5600 TPS system-wide. If those are total TPS numbers then you should check your disk utilization and CPU utilization and see where your bottleneck is.
Re: [GENERAL] Streaming replication failover/failback
On Thu, 17 Nov 2016 08:26:59 -0900 Israel Brewster wrote:
> > On Nov 16, 2016, at 4:24 PM, Adrian Klaver wrote:
> >
> > On 11/16/2016 04:51 PM, Israel Brewster wrote:
> >> I've been playing around with streaming replication, and discovered that
> >> the following series of steps *appears* to work without complaint:
> >>
> >> - Start with master on server A, slave on server B, replicating via
> >>   streaming replication with replication slots.
> >> - Shut down master on A
> >> - Promote slave on B to master
> >> - Create recovery.conf on A pointing to B
> >> - Start (as slave) on A, streaming from B
> >>
> >> After those steps, A comes up as a streaming replica of B, and works as
> >> expected. In my testing I can go back and forth between the two servers
> >> all day using the above steps.
> >>
> >> My understanding from my initial research, however, is that this
> >> shouldn't be possible - I should need to perform a new basebackup from B
> >> to A after promoting B to master before I can restart A as a slave. Is
> >> the observed behavior then just a "lucky fluke" that I shouldn't rely on?
> >
> > You don't say how active the database is, but I'm going to say it is not
> > active enough for the WAL files on B to go out of scope for A in the time
> > it takes you to do the switchover.
>
> Yeah, not very - this was just in testing, so essentially no activity. So
> between your response and the one from Jehan-Guillaume de Rorthais, what I'm
> hearing is that my information about the basebackup being needed was
> obsoleted by the patch he linked to, and that I'm fine as long as I do a
> clean shutdown of the master, and don't do too much activity on the *new*
> master before bringing the old master up as a slave (such that WAL files
> are lost).

Just set up WAL archiving to avoid this (and have PITR backup as a side effect).
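A minimal sketch of the WAL-archiving setup suggested above. The archive path is illustrative, and a shared location reachable by both servers is assumed; `wal_level = replica` is the 9.6+ spelling (`hot_standby` on earlier releases):

```
# postgresql.conf on the master
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'

# recovery.conf on the standby (alongside primary_conninfo)
restore_command = 'cp /mnt/archive/%f %p'
```

With this in place, a standby that has fallen behind the master's retained WAL can still catch up from the archive instead of requiring a fresh basebackup.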
Re: [GENERAL] Trim performance on 9.5
On 2016-11-18 15:06, William Ivanski wrote:
> Hi,
>
> I recently did major improvements on perfomance on our routines by
> simply removing the call for trim functions on specific bottlenecks.
> Please see images attached for a simple example.
>
> I'm using PostgreSQL version 9.5.5-1.pgdg80+1 on Debian 8.6. Someone
> knows if it's a bug on trim function? Thanks in advance.
>
> --
> William Ivanski

Did you run EXPLAIN on these queries?

I'm guessing that you have an index on the field, but not on TRIM(field), which would mean that the database is forced to seqscan to fetch every row value, trim it and then compare it.
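The expression-index idea above can be sketched as follows (table and column names are made up; the real names would come from the attached query plans):

```sql
-- An index on the raw column does not help a query that filters on
-- TRIM(field); an expression index on the trimmed value does.
CREATE INDEX mytable_trim_field_idx ON mytable ((TRIM(field)));

-- This predicate can now use the index instead of seqscanning and
-- trimming every row:
SELECT * FROM mytable WHERE TRIM(field) = 'some value';
```

The extra parentheses around `TRIM(field)` are required, since the indexed item is an expression rather than a plain column.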
[GENERAL] Trim performance on 9.5
Hi,

I recently did major improvements to the performance of our routines by simply removing the calls to trim functions in specific bottlenecks. Please see images attached for a simple example.

I'm using PostgreSQL version 9.5.5-1.pgdg80+1 on Debian 8.6. Does someone know if it's a bug in the trim function? Thanks in advance.

--
William Ivanski
Re: [GENERAL] How the Planner in PGStrom differs from PostgreSQL?
> On Thu, Nov 17, 2016 at 7:09 PM, Mark Anns wrote:
> > Can you explain this statement: "check whether the scan qualifier can be
> > executable on GPU device"?
> >
> > What are the scan qualifiers?
> >
> > How to determine whether they are device-executable or not?
> >
> > The cost estimates are entirely based on the number of rows and the type
> > of scan. Then it will be the same for both CPU and GPU. How can the
> > decision be made for the cheaper one, comparing CPU and GPU estimates?
>
> There is a parameter (call it a "factor" if you will) called cpu_tuple_cost
> (section 19.7.2 Planner Cost Constants in the PostgreSQL documentation;
> link at the bottom), which "sets the planner's estimate of the cost of
> processing each row during a query", as the description on that page says.
>
> With that as the unit of cost of processing rows using the CPU, and
> considering any differences in the processing capabilities between CPU and
> GPU, the optimizer code will cost the portion of the plan that will be
> processed by the GPU (typically a table scan or a join) as some fraction of
> the cost of executing the same portion of the plan using traditional CPU
> processing.
>
> (KaiGai-san will be better able to explain, and to correct if the above
> rough sketch is not exactly accurate.)

That is the right introduction. PG-Strom assumes the GPU can run the functions/operators within a scan qualifier more effectively than the CPU, but it has a larger startup cost (constant) and extra data copies via the PCI-E bus (another factor based on width x number of rows). These factors make differences in the costs of the individual scan/join paths, and the planner then chooses the appropriate one.

Thanks,
PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei
Re: [GENERAL] How the Planner in PGStrom differs from PostgreSQL?
> Can you explain this statement: "check whether the scan qualifier can be
> executable on GPU device"?
>
> What are the scan qualifiers?

SELECT * FROM my_table WHERE x > 20 AND memo LIKE '%abc%';

Here, "x > 20 AND memo LIKE '%abc%'" is the scan qualifier.

> How to determine whether they are device-executable or not?

They are device-executable if all the functions (or the functions behind the operators) can be transformed into GPU source code. Please see codegen.c.

> The cost estimates are entirely based on the number of rows and the type of
> scan. Then it will be the same for both CPU and GPU. How can the decision
> be made for the cheaper one, comparing CPU and GPU estimates?

If a Scan path has a scan qualifier, the cost to evaluate it depends on the device type. PG-Strom assumes the GPU has a larger startup cost but a smaller cost per tuple. So, the GpuScan path tends to be chosen when the number of rows is relatively large.

Thanks,
--
The PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei
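The startup-versus-per-tuple trade-off described above can be sketched numerically. The constants here are invented for illustration only, not PG-Strom's actual cost values:

```sql
-- Illustrative arithmetic only; the cost constants are made up.
-- CPU scan:  no startup, higher per-tuple qual cost:  cost = 0.010 * N
-- GPU scan:  big startup + PCI-E copy, cheap quals:   cost = 5000 + 0.001 * N
-- Break-even: 0.010 * N = 5000 + 0.001 * N  =>  N ~ 555,556 rows,
-- which is why a GpuScan path tends to win only at relatively large
-- row counts, exactly as described above.
```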
[GENERAL] Feature request: separate logging
A glaring weakness in PostgreSQL for production systems is that the administrator has no way of controlling what types of logs go where. There are at least two types of logs: errors and statement logs. (I could also add: connection, syntax error, query duration, audit.) It has become increasingly important in the Linux world, with its over-reliance on systemd, that we admins be able to distinguish between these log outputs.

Systemd wants to control all the daemons and services run on a Linux host. To do this effectively, it needs to capture the stdout and stderr, or possibly parse the external logfile, of the daemons it spawns. The benefits of systemd's journald subsystem include being able to automatically identify daemons that are failing to start properly. (The merits of systemd are beyond the scope of this discussion; it is now the all-but-ubiquitous standard of Linux distributions, and has become nearly inescapable.)

The PostgreSQL startup process could greatly benefit from systemd. As it is now, PGDG distributes PostgreSQL with init scripts which really do very little to check if PostgreSQL bombed shortly after startup. I have improved upon that script to do some smart detection of a failed startup, but the result is having two different files to monitor. On the one hand, I want to use PostgreSQL's built-in file or CSV logging. On the other, I need to check the output of the startup process. These logs have different formats, especially for the smart DBA who wants to customize the log prefix with a more usable timestamp and other fields.

Logging to external files is nice because PostgreSQL rotates the logs for us automatically. (External log rotation is problematic and risky because it requires a configuration reload, which may lead to undesirable side effects. Right?) One alternative is to capture everything into the local logging system, or to send it all to stdout/stderr and capture this with journald (or runit's svlogd or something).
But then you have the following problem: if I enable statement-level logging or set log_min_duration_statement=0, the in-memory journald will quickly be overrun with statement logs. The journald subsystem will then become useless to the other daemons and subsystems. One solution is proposed in these forums:

https://www.postgresql.org/message-id/flat/etPan.53397e77.643c9869.1a0%40palos#etPan.53397e77.643c9869.1a0@palos

While pgbadger has some nice features, it doesn't really solve the problem of allowing PostgreSQL to be used with systemd+journald.

What I do today is to configure PostgreSQL to write csvlogs. Stdout/stderr are captured by journald. A custom Perl script with the Text::CSV module and tail -F semantics continuously processes the csvlog file, ignores query, DML, and detail log lines, and sends the rest via syslog() (which journald then handles). It's not the right way.

I would like to see PostgreSQL have the ability to:

1. Write to a csvlog with one set of selectors.
2. Write to stdout/stderr with a different set of selectors (no statement, no autovacuum, etc.) using a purely line-oriented output that
   2.1. has the kind of detail contained in the CSV. Currently, the log-prefix option does not offer some of the information provided in the CSV logs. Really, the CSV log should simply be an implementation of the log prefix.
   2.2. collapses multi-line queries into one line (newlines and tabs are escaped with backslashes or the x1B character).

Finally, if these changes can be implemented, is it impossible to backport them to prior versions, say 9.1 and up? If I wrote a patch, under what conditions would the patch be accepted for inclusion in official releases of older versions?

--
Otheus
otheus.u...@gmail.com
otheus.shell...@uibk.ac.at
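For reference, a sketch of the closest the current settings get to the setup described above (the GUC names are real; the point is that both destinations receive the *same* message stream, which is exactly the limitation being complained about):

```
# postgresql.conf -- stderr goes to journald, csvlog to rotated files,
# but there is no per-destination selector: both get everything.
log_destination = 'csvlog,stderr'
logging_collector = on
log_min_duration_statement = 0   # floods journald just as it floods the csvlog
```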
Re: [GENERAL] Full text search tsv column aproach vs concat confusion
Hi,

On 16.11.2016 11:54, cen wrote:
> Hi
>
> I am seeking some clarification in regard to full text search across
> multiple tables and what the best approach is. Documentation talks about
> two approaches when it comes to building a document: on-the-fly concat of
> columns and a dedicated tsv column approach.
>
> Let's say I want to perform a search for "txt1 & txt2 & txt3" on columns
> table1.col1, table1.col2 and table2.col1. I see the following solutions:
>
> 1. Concat all three into a document and perform a FTS:
>
> SELECT * FROM (
>   SELECT to_tsvector(table1.col1) || to_tsvector(table1.col2) ||
>          to_tsvector(table2.col1) AS document
>   FROM table1
>   LEFT JOIN table2 ON table1.table2_id = table2.id
> ) subquery
> WHERE subquery.document @@ to_tsquery(unaccent(?));
>
> 2. Create a tsv column in each table, concat the tsv columns and perform
> FTS on that:
>
> SELECT * FROM table1
> LEFT JOIN table2 ON table1.table2_id = table2.id
> WHERE table1.tsv || table2.tsv @@ to_tsquery(unaccent(?));
>
> 3. Have a tsv column only in table1 and insert table2.col1 into the tsv
> via triggers. Works but seems very hacky.
>
> It seems to me that option #2 is fast and easy to implement, but I am not
> sure what the concat of tsvs really means from an index usage and
> performance standpoint. Option #1 is the most flexible, and I'd use that
> all the time if it were not THAT much slower than the tsv column approach.
> Documentation on tsv columns states: "Another advantage is that searches
> will be faster, since it will not be necessary to redo the to_tsvector
> calls to verify index matches."
>
> The question is, how much faster are tsv columns really? Are there any
> benchmarks about this? If the performance difference is negligible, I'd
> advocate that using tsv columns is a waste of time and space in most
> general cases. But since there is no information on how much faster, it's
> hard to decide.

I don't have any such benchmarks. But if you have a real database, you can perform tests on your solutions using it, because it depends on your task and what you need.
By the way, I suppose it is better to use the COALESCE() function if your columns could have NULL values:

SELECT * FROM (
  SELECT to_tsvector(coalesce(table1.col1,'')) ||
         to_tsvector(coalesce(table1.col2,'')) ||
         to_tsvector(coalesce(table2.col1,'')) AS document
  FROM table1
  LEFT JOIN table2 ON table1.table2_id = table2.id
) subquery
WHERE subquery.document @@ to_tsquery(unaccent(?));

And specifying a text search configuration makes queries a little bit faster:

... to_tsvector('english', coalesce(table1.col1,'')) ...

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
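For completeness, the dedicated tsv-column approach (option #2 in the question) can be sketched like this, using the assumed table1/col1/col2 names from the thread:

```sql
-- Maintain a tsvector column and index it; tsvector_update_trigger is
-- the built-in helper for keeping it current on INSERT/UPDATE.
ALTER TABLE table1 ADD COLUMN tsv tsvector;

UPDATE table1
   SET tsv = to_tsvector('english',
                         coalesce(col1,'') || ' ' || coalesce(col2,''));

CREATE INDEX table1_tsv_idx ON table1 USING GIN (tsv);

CREATE TRIGGER table1_tsv_update
  BEFORE INSERT OR UPDATE ON table1
  FOR EACH ROW EXECUTE PROCEDURE
  tsvector_update_trigger(tsv, 'pg_catalog.english', col1, col2);
```

The GIN index on the precomputed column is what delivers the "no need to redo the to_tsvector calls" speedup the documentation mentions; whether that beats option #1 by enough to matter is exactly the benchmarking question, so testing on your own data remains the right advice.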