[PERFORM] Re: OLAP/reporting queries fall into nested loops over seq scans or other horrible planner choices
Laurenz Albe schrieb am 02.11.2017 um 09:30:
> Finally, even though the official line of PostgreSQL is to *not* have
> query hints, and for a number of good reasons, this is far from being
> a unanimous decision. The scales may tip at some point, though I
> personally hope that this point is not too close.

I also think that hints are not the right way to solve problems like that.

I do like Oracle's approach with SQL profiles, where you can force the
optimizer to try harder to find a good execution plan. I _think_ it even
runs the statement with multiple plans and compares the expected outcome
with the actual values. Once a better plan is found, that plan can be
attached to the query and the planner will use it for subsequent
executions. This, however, requires a much bigger infrastructure than
simple hints.

(Unrelated, but: maybe a compromise in the never-ending "hints vs. no
hints" discussion would be to think about integrating the existing
"pg_hint_plan" extension as a standard contrib module.)

Thomas

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
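For readers who don't know the extension mentioned above: pg_hint_plan reads planner hints from special comments placed before the statement. A minimal sketch (table and index names are invented for illustration):

```sql
/*+ IndexScan(orders idx_orders_customer) */
SELECT *
FROM orders
WHERE customer_id = 42;
```

The comment forces an index scan on the (hypothetical) `orders` table instead of whatever access path the planner would otherwise choose; other hints cover join methods and join order.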
Re: [PERFORM] Architectural question
Jim Nasby schrieb am 11.03.2016 um 17:37:
> If the blob is in the database then you have nothing extra to do. It's
> handled just like all your other data.
>
> If it's a file in a file system then you need to:
>
> - Have application code that knows how and where to get at the file
> - Have a way to make those files available on all your webservers
> - Have completely separate backup and recovery plans for those files
>
> That's a lot of extra work. Sometimes it's necessary, but many times it's not.

Don't forget the code you need to write to properly handle transactional
access (writing, deleting) to the files.

You usually also need to distribute the files over many directories:
having millions of files in a single directory is usually not such a
good idea.

In my experience you also need some cleanup job that removes orphaned
files from the file system, because no matter how hard you try to get
updates/writes to the file system right, at some point this fails.

Also, from a security point of view, having this in the database is more
robust than in the file system.

The downside of bytea is that you can't stream it to the client: the
application always needs to read the whole blob into memory before it
can be used. This might put some memory pressure on the application
server.

Thomas
[PERFORM] Why is now()::date so much faster than current_date
Hello,

I stumbled over this answer: http://stackoverflow.com/a/9717125/330315
and it sounded quite strange to me. So I ran this on my Windows laptop
with Postgres 9.4.5, 64bit, and indeed now()::date is much faster than
current_date:

explain analyze
select current_date
from generate_series(1, 1000000);

Function Scan on generate_series  (cost=0.00..6.00 rows=1000 width=0) (actual time=243.878..1451.839 rows=1000000 loops=1)
Planning time: 0.047 ms
Execution time: 1517.881 ms

And:

explain analyze
select now()::date
from generate_series(1, 1000000);

Function Scan on generate_series  (cost=0.00..6.00 rows=1000 width=0) (actual time=244.491..785.819 rows=1000000 loops=1)
Planning time: 0.037 ms
Execution time: 826.612 ms

Running this on a CentOS 6.6 test server (Postgres 9.4.1, 64bit), there
is still a difference, but not as big as on Windows:

explain analyze
select current_date
from generate_series(1, 1000000);

Function Scan on generate_series  (cost=0.00..15.00 rows=1000 width=0) (actual time=233.599..793.032 rows=1000000 loops=1)
Planning time: 0.087 ms
Execution time: 850.198 ms

And:

explain analyze
select now()::date
from generate_series(1, 1000000);

Function Scan on generate_series  (cost=0.00..15.00 rows=1000 width=0) (actual time=198.385..570.171 rows=1000000 loops=1)
Planning time: 0.074 ms
Execution time: 623.211 ms

Any ideas?
Re: [PERFORM] SELECT slows down on sixth execution
Jonathan Rogers schrieb am 17.10.2015 um 04:14:
>>> Yes, I have been looking at both plans and can see where they diverge.
>>> How could I go about figuring out why Postgres fails to see the large
>>> difference in plan execution time? I use exactly the same parameters
>>> every time I execute the prepared statement, so how would Postgres come
>>> to think that those are not the norm?
>>
>> PostgreSQL does not consider the actual query execution time, it only
>> compares its estimates for the generic and the custom plan.
>> Also, it does not keep track of the parameter values you supply,
>> only of the average custom plan query cost estimate.
>
> OK, that makes more sense then. It's somewhat tedious for the purpose of
> testing to execute a prepared statement six times to see the plan which
> needs to be optimized. Unfortunately, there doesn't seem to be any way
> to force use of a generic plan in SQL based on Pavel Stehule's reply.

If you are using JDBC, the threshold can be changed:

https://jdbc.postgresql.org/documentation/94/server-prepare.html
https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/PGStatement.html#setPrepareThreshold%28int%29

As JDBC doesn't use anything "exotic", I would be surprised if this
couldn't be changed in other programming environments as well.

Thomas
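The behaviour under discussion can be reproduced directly in SQL with server-side prepared statements. A sketch (the table and parameter are invented for illustration):

```sql
PREPARE get_orders(int) AS
  SELECT * FROM orders WHERE customer_id = $1;

-- The first five executions use a custom plan built for the
-- supplied parameter value:
EXECUTE get_orders(42);

-- From the sixth execution on, the server may switch to a generic
-- plan if its average estimated cost compares favourably with the
-- custom plans seen so far -- which is why a query can suddenly
-- "slow down on the sixth execution".
EXECUTE get_orders(42);
```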
[PERFORM] Re: Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?
Graeme B. Bell schrieb am 09.07.2015 um 11:44:
> I don't recall seeing a clear statement telling me I should mark pl/pgsql
> functions nonvolatile wherever possible or throw all performance and
> scalability out the window.

From: http://www.postgresql.org/docs/current/static/xfunc-volatility.html

"For best optimization results, you should label your functions with the
strictest volatility category that is valid for them."
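To illustrate the documentation quote: the volatility category is part of the function definition. A sketch (the function itself is made up):

```sql
-- A function that neither reads nor modifies the database can carry
-- the strictest label, IMMUTABLE:
CREATE FUNCTION add_tax(amount numeric) RETURNS numeric AS
$$
BEGIN
  RETURN amount * 1.19;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- A function that only reads (but never modifies) data would be
-- labelled STABLE instead. Without a label it defaults to VOLATILE,
-- which blocks many planner optimizations.
```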
Re: [PERFORM] Performance issues
Tomas Vondra schrieb am 17.03.2015 um 15:43:
> On 17.3.2015 15:19, Thomas Kellerer wrote:
>> Tomas Vondra schrieb am 17.03.2015 um 14:55:
>>> (2) using window functions, e.g. like this:
>>>
>>>   SELECT * FROM (
>>>     SELECT *,
>>>            ROW_NUMBER() OVER (PARTITION BY touchpoint_execution_id
>>>                               ORDER BY max_creation_dt) AS rn
>>>     FROM s_f_touchpoint_execution_status_history
>>>   ) foo WHERE rn = 1
>>>
>>> But estimating this is also rather difficult ...
>>
>> From my experience rewriting something like the above using DISTINCT
>> ON is usually faster.
>
> How do you get the last record (with respect to a timestamp column)
> using a DISTINCT ON?

You need to use "order by ... desc". See here:
http://sqlfiddle.com/#!15/d4846/2

Btw: your row_number() usage wouldn't return the "latest" row either;
it would return the "oldest" row.
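Spelled out, the "order by ... desc" variant of DISTINCT ON that returns the latest row per group (my rendering of the advice, using the column names from the quoted query) would be:

```sql
SELECT DISTINCT ON (touchpoint_execution_id) *
FROM s_f_touchpoint_execution_status_history
ORDER BY touchpoint_execution_id, max_creation_dt DESC;
```

DISTINCT ON keeps the first row of each group in sort order, so sorting the timestamp descending yields the newest row per touchpoint_execution_id.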
Re: [PERFORM] Performance issues
Tomas Vondra schrieb am 17.03.2015 um 14:55:
> (2) using window functions, e.g. like this:
>
>   SELECT * FROM (
>     SELECT *,
>            ROW_NUMBER() OVER (PARTITION BY touchpoint_execution_id
>                               ORDER BY max_creation_dt) AS rn
>     FROM s_f_touchpoint_execution_status_history
>   ) foo WHERE rn = 1
>
> But estimating this is also rather difficult ...

From my experience rewriting something like the above using DISTINCT ON
is usually faster, e.g.:

select distinct on (touchpoint_execution_id) *
from s_f_touchpoint_execution_status_history
order by touchpoint_execution_id, max_creation_dt;
Re: [PERFORM] MusicBrainz postgres performance issues
Robert Kaye schrieb am 16.03.2015 um 13:59:
> However, I am glad to report that our problems are fixed and that our
> server is back to humming along nicely.
>
> And as I said to Josh earlier: "Postgres rocks our world. I’m
> immensely pleased that once again the problems were our own stupidity
> and not PG’s fault. In over 10 years of us using PG, it has never
> been PG’s fault. Not once.”
>
> And thus we’re one tiny bit smarter today. Thank you everyone!

I think it would be nice if you could amend your blog posting to include
the solution that you found. Otherwise this will simply stick around as
yet another unsolved performance problem.

Thomas
Re: [PERFORM] Reverse Key Index
Sven R. Kunze schrieb am 26.02.2015 um 13:23:
> If you think Reverse Key Indexes have no usage here in PostgreSQL, you
> should not support convenience features for easily improving performance
> without breaking the querying API

It's also unclear to me which "performance" you are referring to: insert
performance? Retrieval performance? Concurrency?

The use-case for reverse key indexes in Oracle is pretty small: it is
_only_ about the contention that arises when doing a lot of inserts with
increasing values (because the different transactions will be blocked
when accessing the index blocks in question). As Postgres manages
inserts differently than Oracle, I'm not so sure that this problem
exists in Postgres the same way it does in Oracle. That's why I asked
whether you have a _specific_ problem.

Richard Foote's blog post is mostly about the myth that _if_ you have a
reverse index, it is only used for equality operations. It does not
claim that a reverse index is faster than a regular index _if_ it is
used for a range scan.

The question is: do you think you need a reverse index because you have
a performance problem when doing many, many inserts at the same time
using "close-by" values into a table with a btree index on that column?
Or do you think you need a reverse index to improve the performance of a
range scan?

If it is the latter, you can easily use a gin/gist index with trigram
support to speed up a "LIKE '%abc%'" condition (something Oracle can't
do at all) without having to worry about obfuscation layers (aka ORMs).

Thomas
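The trigram-index approach mentioned above looks roughly like this; a sketch, with table and column names invented for illustration:

```sql
-- pg_trgm ships as a contrib module:
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical table/column:
CREATE INDEX idx_customer_name_trgm
    ON customer USING gin (last_name gin_trgm_ops);

-- The index can then support infix searches such as:
SELECT *
FROM customer
WHERE last_name LIKE '%abc%';
```

A plain btree index cannot be used for a leading-wildcard LIKE, which is exactly the case the trigram index covers.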
Re: [PERFORM] Reverse Key Index
Sven R. Kunze schrieb am 26.02.2015 um 12:04:
> I just thought about btree indexes here mainly because they are
> well-known and well-used in ORM frameworks.

If your ORM framework needs to know about the internals of an index
definition, or even requires a certain index type, then you should ditch
that ORM framework. Apart from indexes supporting business constraints
(e.g. a unique index), neither the application nor the ORM framework
should care about indexes at all.

> does PostgreSQL support the concept of reverse key indexing as described
> here?

The real question is: why do you think you need such an index? Do you
have any performance problems with the existing btree index? If yes,
which problem exactly?

Thomas
Re: [PERFORM] Query Performance question
Magers, James, 14.07.2014 15:18:
> Thank you. I executed the query this morning after disabling the scan
> types.
>
> I am including links to explain.depesz output for each of the three
> variations that I executed.
>
> indexscan and bitmapscan off: http://explain.depesz.com/s/sIx
> seqscan and bitmapscan off: http://explain.depesz.com/s/GfM
> bitmapscan off: http://explain.depesz.com/s/3wna

So the original query (using an "Index Scan" + "Bitmap Index Scan") is
indeed the most efficient one: 4ms vs. 39ms vs. 64ms.
Re: [PERFORM] Query Performance question
Magers, James, 14.07.2014 04:20:
> Thank you for your feedback. I am attaching the requested information.
> While I do not think the query is necessarily inefficient, I believe a
> sequence scan would be more efficient.

You can try

   set enable_indexscan = off;
   set enable_bitmapscan = off;

and then run your query. But I would be very surprised if a seq scan
(which reads through the whole table) was faster than those 4ms you
have now.
Re: [PERFORM] How clustering for scale out works in PostgreSQL
bsreejithin wrote on 29.08.2013 18:13:
> PostgreSQL version was *8.2*.

8.2 has long been deprecated. For a new system you should use 9.2 (or at
least 9.1).

Thomas
Re: [PERFORM] Thinking About Correlated Columns (again)
Shaun Thomas wrote on 15.05.2013 17:31:
> Hi!
>
> This has been a pain point for quite a while. While we've had several
> discussions in the area, it always seems to just kinda trail off and
> eventually vanish every time it comes up.
>
> A really basic example of how bad the planner is here:
>
> CREATE TABLE foo AS
> SELECT a.id, a.id % 1000 AS col_a, a.id % 1000 AS col_b
>   FROM generate_series(1, 1000000) a(id);
>
> CREATE INDEX idx_foo_ab ON foo (col_a, col_b);
>
> Index Scan using idx_foo_ab on foo  (cost=0.00..6.35 rows=1 width=12) (actual time=0.030..3.643 rows=1000 loops=1)
>   Index Cond: ((col_a = 50) AND (col_b = 50))
>
> Hey, look! The row estimate is off by a factor of 1000. This particular
> case doesn't suffer terribly from the mis-estimation, but others do.
> Boy, do they ever.

What happens if you create one index for each column instead of one
combined index? For your example it does not seem to improve the
situation, but maybe things get better with the "bad" queries?
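The per-column-index suggestion, spelled out against the quoted example (a sketch; the index names are my choice):

```sql
-- Replace the combined index with one index per column:
DROP INDEX IF EXISTS idx_foo_ab;

CREATE INDEX idx_foo_a ON foo (col_a);
CREATE INDEX idx_foo_b ON foo (col_b);

-- The planner can then combine both indexes with a BitmapAnd,
-- which at least gives it per-column statistics to work with:
EXPLAIN ANALYZE
SELECT * FROM foo WHERE col_a = 50 AND col_b = 50;
```

Note that this does not remove the correlation problem itself: the planner still multiplies the two selectivities as if the columns were independent.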
Re: [PERFORM] Deterioration in performance when query executed in multi threads
Anne Rosset, 06.05.2013 19:00:
> Postgres version: 9.0.13
>
> Work_mem is set to 64MB
> Shared_buffer to 240MB
> Segment_size is 1GB
> Wal_buffer is 10MB
>
> Artifact table: 251831 rows
> Field_value table: 77378 rows
> Mntr_subscription: 929071 rows
> Relationship: 270478 rows
> Folder: 280356 rows
> Item: 716465 rows
> Sfuser: 5733 rows
> Project: 1817 rows
>
> 8 CPUs
> RAM: 8GB

With 8GB RAM you should be able to increase shared_buffers to 1GB or
maybe even higher, especially if this is a dedicated server. 240MB is
pretty conservative for a server with that amount of RAM (unless you
have many other applications running on that box).

Also, what are the values for

   cpu_tuple_cost
   seq_page_cost
   random_page_cost
   effective_cache_size

What kind of harddisk is in the server? SSD? Regular ones (spinning
disks)?
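The settings asked about can be read in one go from the pg_settings view (a convenience query, not from the thread):

```sql
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers',
               'work_mem',
               'cpu_tuple_cost',
               'seq_page_cost',
               'random_page_cost',
               'effective_cache_size');
```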
Re: [PERFORM] In progress INSERT wrecks plans on table
mark.kirkw...@catalyst.net.nz wrote on 03.05.2013 00:19:
> I think the idea of telling postgres that we are doing a load is
> probably the wrong way to go about this. We have a framework that tries
> to automatically figure out the best plans... I think some more thought
> about how to make that understand some of the more subtle triggers for a
> time-to-do-new-plans moment is the way to go. I understand this is
> probably hard - and may imply some radical surgery to how the stats
> collector and planner interact.

I wonder if "freezing" the statistics of the table with the large number
of rows (analyze, then disable autovacuum) would work.
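The "freeze the statistics" idea could be sketched like this (the table name is hypothetical):

```sql
-- Collect up-to-date statistics once:
ANALYZE big_table;

-- Then stop autovacuum (and with it autoanalyze) from
-- revisiting this table and changing them:
ALTER TABLE big_table
  SET (autovacuum_enabled = false);
```

This only freezes the per-table statistics; manual VACUUM/ANALYZE would still have to be scheduled explicitly once the bulk load is done.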
Re: [PERFORM] Recursive query gets slower when adding an index
Tom Lane wrote on 19.10.2012 16:20:
> Thomas Kellerer writes:
>> This is the execution plan without index: http://explain.depesz.com/s/ecCT
>> When I create an index on parent_id execution time increases to
>> something between 110ms and 130ms
>> This is the execution plan with index: http://explain.depesz.com/s/xiL
>
> The reason you get a bad plan choice here is the severe underestimate of
> the average number of rows coming out of the worktable scan (ie, the
> size of the "recursive" result carried forward in each iteration).
>
> Unfortunately, it's really hard to see how we might make that number
> better. The current rule of thumb is "10 times the size of the
> nonrecursive term", which is why you get 10 here. We could choose
> another multiplier but it'd be just as bogus as the current one (unless
> somebody has some evidence about typical expansion factors?)
>
> I suppose though that there's some argument for discouraging the planner
> from assuming that the carried-forward result is small; so maybe we
> should use something larger than 10.

Thanks for the feedback.

I just noticed this behaviour because we ran the same query on SQL
Server 2008, and that took well over 30 seconds without the index.
SQL Server *really* improved with the index and returned the result in
0.5 seconds with the index in place.

So I was curious how much faster Postgres would be *with* the index ;)

Regards
Thomas
[PERFORM] Recursive query gets slower when adding an index
Hi,

I have a self-referencing table that defines a hierarchy of projects and
sub-projects.

This is the table definition:

CREATE TABLE project
(
   project_id    integer primary key,
   project_name  text,
   pl_name       text,
   parent_id     integer
);

ALTER TABLE project
  ADD CONSTRAINT project_parent_id_fkey FOREIGN KEY (parent_id)
  REFERENCES project (project_id)
  ON UPDATE NO ACTION
  ON DELETE NO ACTION;

The table contains ~11000 rows.

The following statement:

with recursive project_tree as (
   select project_id,
          parent_id,
          pl_name as root_pl,
          pl_name as sub_pl,
          1 as lvl
   from project
   where parent_id is null
   union all
   select c.project_id,
          c.parent_id,
          coalesce(p.root_pl, c.pl_name) as root_pl,
          coalesce(c.pl_name, p.sub_pl) as sub_pl,
          p.lvl + 1
   from project c
     join project_tree p on p.project_id = c.parent_id
)
select count(*), max(lvl)
from project_tree
where root_pl <> sub_pl;

usually runs in something like 60-80ms when the parent_id column is
*not* indexed.

This is the execution plan without index: http://explain.depesz.com/s/ecCT

When I create an index on parent_id, execution time increases to
something between 110ms and 130ms.

This is the execution plan with index: http://explain.depesz.com/s/xiL

As far as I can tell, the choice of the nested loop is the reason for
the (slightly) slower execution.

I increased the statistics for the parent_id column to 10000 (and did an
analyze of course), but that didn't change anything.

I have no problem with that performance, so this is more an "I'm curious
why this happens" type of question. (And I thought you might be
interested in this behaviour as well.)

My environment:

* Windows 7 Professional 64bit
* PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit

Regards
Thomas
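For reference, the index whose creation triggers the plan change is not shown in the post; it would presumably be something like (index name is my choice):

```sql
CREATE INDEX idx_project_parent_id
    ON project (parent_id);
```

With this index available, the planner can replace the hash/merge strategy inside the recursive step with a nested loop driven by index lookups on parent_id.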
Re: [PERFORM] Postgres becoming slow, only full vacuum fixes it
Kiriakos Tsourapas, 25.09.2012 13:01:
> Thank you,
>
> I will take this into consideration, since upgrading to 9 will be much
> harder I assume...

I think an upgrade from 8.3 to 8.4 was "harder" due to the removal of a
lot of implicit type casts. 8.4 to 9.x shouldn't be that problematic
after all (but will take longer due to the required dump/reload).
Re: [PERFORM] Using ctid column changes plan drastically
Tom Lane, 24.07.2012 19:12:
> Well, it would only help if you're running a PG version that's new
> enough to recognize the NOT EXISTS as an anti-join; and even then, it's
> possible that joining on a tid column forecloses enough plan types that
> you don't get any real benefit. But I'm just guessing. Can you show
> exactly what you tried and what EXPLAIN ANALYZE results you got?

I am using 9.1.4 (as I said in my initial post).

I finally found a solution that runs fine:

DELETE FROM dupes a
WHERE EXISTS (SELECT 1
              FROM dupes b
              WHERE b.first_name = a.first_name
                AND b.last_name = a.last_name
                AND b.ctid > a.ctid);

The execution plan for this is:

Delete on public.dupes a  (cost=14575.95..16978.87 rows=25000 width=12) (actual time=2419.334..2419.334 rows=0 loops=1)
  Buffers: shared hit=18029
  ->  Merge Semi Join  (cost=14575.95..16978.87 rows=25000 width=12) (actual time=2043.674..2392.707 rows=17097 loops=1)
        Output: a.ctid, b.ctid
        Merge Cond: ((a.first_name = b.first_name) AND (a.last_name = b.last_name))
        Join Filter: (b.ctid > a.ctid)
        Buffers: shared hit=930
        ->  Sort  (cost=7287.98..7475.48 rows=75000 width=20) (actual time=1024.195..1030.051 rows=75000 loops=1)
              Output: a.ctid, a.first_name, a.last_name
              Sort Key: a.first_name, a.last_name
              Sort Method: quicksort  Memory: 8870kB
              Buffers: shared hit=465
              ->  Seq Scan on public.dupes a  (cost=0.00..1215.00 rows=75000 width=20) (actual time=0.025..23.234 rows=75000 loops=1)
                    Output: a.ctid, a.first_name, a.last_name
                    Buffers: shared hit=465
        ->  Sort  (cost=7287.98..7475.48 rows=75000 width=20) (actual time=1019.148..1028.483 rows=105841 loops=1)
              Output: b.ctid, b.first_name, b.last_name
              Sort Key: b.first_name, b.last_name
              Sort Method: quicksort  Memory: 8870kB
              Buffers: shared hit=465
              ->  Seq Scan on public.dupes b  (cost=0.00..1215.00 rows=75000 width=20) (actual time=0.017..19.133 rows=75000 loops=1)
                    Output: b.ctid, b.first_name, b.last_name
                    Buffers: shared hit=465
Total runtime: 2420.953 ms

Which is a lot better than the plan using "WHERE ctid NOT IN (...)":

Delete on public.dupes  (cost=1777.50..4925055.00 rows=37500 width=6) (actual time=582515.094..582515.094 rows=0 loops=1)
  Buffers: shared hit=18027
  ->  Seq Scan on public.dupes  (cost=1777.50..4925055.00 rows=37500 width=6) (actual time=1038.164..582332.927 rows=17097 loops=1)
        Output: dupes.ctid
        Filter: (NOT (SubPlan 1))
        Buffers: shared hit=930
        SubPlan 1
          ->  Materialize  (cost=1777.50..1890.00 rows=7500 width=20) (actual time=0.001..2.283 rows=35552 loops=75000)
                Output: (min(b.ctid)), b.first_name, b.last_name
                Buffers: shared hit=465
                ->  HashAggregate  (cost=1777.50..1852.50 rows=7500 width=20) (actual time=90.964..120.228 rows=57903 loops=1)
                      Output: min(b.ctid), b.first_name, b.last_name
                      Buffers: shared hit=465
                      ->  Seq Scan on public.dupes b  (cost=0.00..1215.00 rows=75000 width=20) (actual time=0.008..25.515 rows=75000 loops=1)
                            Output: b.ctid, b.first_name, b.last_name
                            Buffers: shared hit=465
Total runtime: 582517.711 ms

Using "WHERE id NOT IN (...)" is the fastest way:

Delete on public.dupes  (cost=1871.25..3273.75 rows=37500 width=6) (actual time=187.949..187.949 rows=0 loops=1)
  Buffers: shared hit=18490
  ->  Seq Scan on public.dupes  (cost=1871.25..3273.75 rows=37500 width=6) (actual time=125.351..171.108 rows=17097 loops=1)
        Output: dupes.ctid
        Filter: (NOT (hashed SubPlan 1))
        Buffers: shared hit=930
        SubPlan 1
          ->  HashAggregate  (cost=1777.50..1852.50 rows=7500 width=18) (actual time=73.131..93.421 rows=57903 loops=1)
                Output: min(b.id), b.first_name, b.last_name
                Buffers: shared hit=465
                ->  Seq Scan on public.dupes b  (cost=0.00..1215.00 rows=75000 width=18) (actual time=0.004..8.515 rows=75000 loops=1)
                      Output: b.id, b.first_name, b.last_name
                      Buffers: shared hit=465
Total runtime: 189.222 ms

Regards
Thomas
Re: [PERFORM] Using ctid column changes plan drastically
Tom Lane wrote on 24.07.2012 17:55:
> Joins on tid columns just aren't supported very well at the moment.
> Partly that's from lack of round tuits, and partly it's because it
> doesn't seem all that wise to encourage people to use them. There are
> gotchas if any of the rows receive concurrent updates.

Thanks for the clarification. I will keep that in mind.

> FWIW, it might be helpful to cast this as a NOT EXISTS rather than
> NOT IN subquery.

Hmm. How would you change that into a NOT EXISTS clause (so that one of
the duplicates remains)? Everything I come up with is in fact slower
than the NOT IN solution.

Regards
Thomas
Re: [PERFORM] Using ctid column changes plan drastically
Tom Lane, 24.07.2012 16:23:
> Thomas Kellerer writes:
>> DELETE FROM dupes
>> WHERE id NOT IN (SELECT min(b.id)
>>                  FROM dupes b
>>                  GROUP BY first_name, last_name
>>                  HAVING count(*) > 1);
>
> Doesn't that kill the non-duplicates too?

Ah right - another good point on how important the correct test data
is ;)

>> Why does the usage of the CTID column change the plan so drastically?
>
> IIRC, type tid doesn't have any hash support.

So the "bad" plan is expected?

Regards
Thomas
[PERFORM] Using ctid column changes plan drastically
Hi,

I was testing a query to delete duplicates to see how well using ctid
works if the table doesn't have a unique identifier available.

The table definition is:

create table dupes
(
   id          integer primary key,
   first_name  text,
   last_name   text
);

My test table has 100.000 rows, with ~13000 being actually unique.

The following statement:

DELETE FROM dupes
WHERE id NOT IN (SELECT min(b.id)
                 FROM dupes b
                 GROUP BY first_name, last_name
                 HAVING count(*) > 1);

produces a quite nice execution plan:

Delete on public.dupes  (cost=2770.00..4640.00 rows=50000 width=6) (actual time=299.809..299.809 rows=0 loops=1)
  Buffers: shared hit=88100
  ->  Seq Scan on public.dupes  (cost=2770.00..4640.00 rows=50000 width=6) (actual time=150.113..211.340 rows=86860 loops=1)
        Output: dupes.ctid
        Filter: (NOT (hashed SubPlan 1))
        Buffers: shared hit=1240
        SubPlan 1
          ->  HashAggregate  (cost=2620.00..2745.00 rows=10000 width=18) (actual time=115.739..143.004 rows=13140 loops=1)
                Output: min(b.id), b.first_name, b.last_name
                Filter: (count(*) > 1)
                Buffers: shared hit=620
                ->  Seq Scan on public.dupes b  (cost=0.00..1620.00 rows=100000 width=18) (actual time=0.006..15.563 rows=100000 loops=1)
                      Output: b.id, b.first_name, b.last_name
                      Buffers: shared hit=620
Total runtime: 301.241 ms

Now assume I do not have a unique value in the table. In that case I
would revert to using the ctid to identify individual rows:

DELETE FROM dupes
WHERE ctid NOT IN (SELECT min(b.ctid)
                   FROM dupes b
                   GROUP BY first_name, last_name
                   HAVING count(*) > 1);

Which has a completely different execution plan:

Delete on public.dupes  (cost=2620.00..10004490.00 rows=50000 width=6) (actual time=269966.623..269966.623 rows=0 loops=1)
  Buffers: shared hit=88720
  ->  Seq Scan on public.dupes  (cost=2620.00..10004490.00 rows=50000 width=6) (actual time=176.107..269582.651 rows=86860 loops=1)
        Output: dupes.ctid
        Filter: (NOT (SubPlan 1))
        Buffers: shared hit=1240
        SubPlan 1
          ->  Materialize  (cost=2620.00..2795.00 rows=10000 width=20) (actual time=0.002..0.799 rows=12277 loops=100000)
                Output: (min(b.ctid)), b.first_name, b.last_name
                Buffers: shared hit=620
                ->  HashAggregate  (cost=2620.00..2745.00 rows=10000 width=20) (actual time=131.162..164.941 rows=13140 loops=1)
                      Output: min(b.ctid), b.first_name, b.last_name
                      Filter: (count(*) > 1)
                      Buffers: shared hit=620
                      ->  Seq Scan on public.dupes b  (cost=0.00..1620.00 rows=100000 width=20) (actual time=0.005..29.531 rows=100000 loops=1)
                            Output: b.ctid, b.first_name, b.last_name
                            Buffers: shared hit=620
Total runtime: 269968.515 ms

This is Postgres 9.1.4 64bit on Windows 7.

Why does the usage of the CTID column change the plan so drastically?

Regards
Thomas
Re: [PERFORM] The need for clustered indexes to boost TPC-V performance
Craig Ringer, 04.07.2012 07:43:
> I'm not sure what the best option for getting a 9.2 beta build for
> Windows is.

Download the ZIP from here:
http://www.enterprisedb.com/products-services-training/pgbindownload

Unzip, initdb, pg_ctl start.

Regards
Thomas
[PERFORM] Re: Could synchronous streaming replication really degrade the performance of the primary?
MauMau, 10.05.2012 13:34:
> Today, they told me that they ran the test on two virtual machines on a
> single physical machine.

Which means that both databases shared the same I/O system (harddisks).
Therefore it's not really surprising that the overall performance goes
down if you increase the I/O load.

A more realistic test (at least in my opinion) would have been to use
two separate computers with two separate I/O systems.
Re: [PERFORM] Result Set over Network Question
Robert Klemme, 07.05.2012 15:44:
>>> http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html
>>
>> SQL Developer does not support PostgreSQL
>
> Last time I checked (quite a while ago) you could use arbitrary JDBC
> drivers. There's also
> http://docs.oracle.com/cd/E25259_01/appdev.31/e24285/intro.htm#sthref306
>
> And this seems to indicate that it's still the case: "[...] or another
> third-party driver. [...] JDBC URL (Other Third Party Driver): URL for
> connecting directly from Java to the database; overrides any other
> connection type specification."
> http://docs.oracle.com/cd/E25259_01/appdev.31/e24285/dialogs.htm#BACDGCIA
>
> I assume Oracle is not interested in aggressively advertizing this
> feature though.

That seems to be a documentation bug. I tried it, and it definitely does
not work (or I am missing something).

Their release notes at
http://www.oracle.com/technetwork/developer-tools/sql-developer/sqldev31-ea-relnotes-487612.html
state:

   Third Party Databases

   SQL Developer supports IBM DB2 UDB LUW, Microsoft SQL Server and
   Microsoft Access, MySQL, Sybase Adaptive Server and Teradata. See
   Supported Platforms for details on all third party database releases
   supported.

Regards
Thomas
Re: [PERFORM] Result Set over Network Question
Robert Klemme, 07.05.2012 14:03:
> Alternative tools for JDBC tests:
> http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html

SQL Developer does not support PostgreSQL.

This page:
http://wiki.postgresql.org/wiki/Community_Guide_to_PostgreSQL_GUI_Tools
also lists several JDBC based tools.

Thomas
Re: [PERFORM] Tuning Postgres 9.1 on Windows
Walker, James Les wrote on 01.05.2012 16:44: I installed the enterprisedb distribution and immediately saw a 400% performance increase. What exactly is "the enterprisedb distribution"? Are you talking about the Advanced Server? I would be very surprised if the code base differed so much as to allow such a performance gain. Could it be that the default settings of the Advanced Server are different from those of the "community edition"? And what did you have installed before that? (The Windows binaries are always distributed by EnterpriseDB.) Thomas
Re: [PERFORM] Tuning Postgres 9.1 on Windows
Merlin Moncure wrote on 30.04.2012 23:43: Try turning off fsync in postgresql.conf to be sure. This is dangerous advice. Turning off fsync can corrupt the database in case of a system failure (e.g. a power outage).
Re: [PERFORM] query optimization
Tom Lane wrote on 26.04.2012 21:17: Richard Kojedzinszky writes: Dear list, We have a database schema, which looks the same as the attached script. When filling the tables with data, and skipping analyze on the table (so pg_stats contains no records for table 'a'), the first select in the script runs fast, but after an analyze the planner decides to sequence scan tables b and c, thus making the query much slower. Can somebody help me solve this issue, or tune our installation not to use sequence scans in this case? Um ... did you analyze all the tables, or just some of them? I get sub-millisecond runtimes if all four tables have been analyzed, but it does seem to pick lousy plans if, say, only a and b have been analyzed.

Here it's similar to Richard's experience: before analyzing the four tables, the first statement yields this plan:

Merge Left Join (cost=504.89..2634509.91 rows=12500 width=16) (actual time=0.103..0.108 rows=1 loops=1)
  Merge Cond: (a.b = b.id)
  ->  Sort (cost=504.89..506.14 rows=500 width=8) (actual time=0.043..0.043 rows=1 loops=1)
        Sort Key: a.b
        Sort Method: quicksort  Memory: 17kB
        ->  Bitmap Heap Scan on a (cost=12.14..482.47 rows=500 width=8) (actual time=0.028..0.029 rows=1 loops=1)
              Recheck Cond: (id = 4)
              ->  Bitmap Index Scan on a_idx1 (cost=0.00..12.01 rows=500 width=0) (actual time=0.021..0.021 rows=1 loops=1)
                    Index Cond: (id = 4)
  ->  Materialize (cost=0.00..884002.52 rows=5000 width=8) (actual time=0.041..0.057 rows=5 loops=1)
        ->  Merge Join (cost=0.00..759002.52 rows=5000 width=8) (actual time=0.037..0.051 rows=5 loops=1)
              Merge Cond: (b.id = c.id)
              ->  Index Scan using b_idx1 on b (cost=0.00..4376.26 rows=10 width=4) (actual time=0.016..0.018 rows=5 loops=1)
              ->  Materialize (cost=0.00..4626.26 rows=10 width=4) (actual time=0.017..0.022 rows=5 loops=1)
                    ->  Index Scan using c_idx1 on c (cost=0.00..4376.26 rows=10 width=4) (actual time=0.014..0.017 rows=5 loops=1)
Total runtime: 0.209 ms

This stays the plan for about 10-15 repetitions, then it turns into this plan:

Hash Right Join (cost=2701.29..6519.30 rows=1 width=16) (actual time=79.604..299.227 rows=1 loops=1)
  Hash Cond: (b.id = a.b)
  ->  Hash Join (cost=2693.00..6136.00 rows=10 width=8) (actual time=79.550..265.251 rows=10 loops=1)
        Hash Cond: (b.id = c.id)
        ->  Seq Scan on b (cost=0.00..1443.00 rows=10 width=4) (actual time=0.011..36.158 rows=10 loops=1)
        ->  Hash (cost=1443.00..1443.00 rows=10 width=4) (actual time=79.461..79.461 rows=10 loops=1)
              Buckets: 16384  Batches: 1  Memory Usage: 2735kB
              ->  Seq Scan on c (cost=0.00..1443.00 rows=10 width=4) (actual time=0.010..32.930 rows=10 loops=1)
  ->  Hash (cost=8.28..8.28 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 1kB
        ->  Index Scan using a_idx1 on a (cost=0.00..8.28 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1)
              Index Cond: (id = 4)
Total runtime: 299.564 ms

(I guess autovacuum kicked in, because that's the same plan I get when running analyze on all four tables right after populating them.)

And the second statement yields this plan (regardless of analyze or not):

QUERY PLAN
Nested Loop Left Join (cost=0.00..16.89 rows=1 width=16) (actual time=0.027..0.031 rows=1 loops=1)
  ->  Nested Loop Left Join (cost=0.00..16.57 rows=1 width=12) (actual time=0.020..0.022 rows=1 loops=1)
        ->  Index Scan using a_idx1 on a (cost=0.00..8.28 rows=1 width=8) (actual time=0.011..0.012 rows=1 loops=1)
              Index Cond: (id = 4)
        ->  Index Scan using b_idx1 on b (cost=0.00..8.28 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=1)
              Index Cond: (a.b = id)
  ->  Index Scan using c_idx1 on c (cost=0.00..0.31 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=1)
        Index Cond: (b.id = id)
Total runtime: 0.104 ms

My version: PostgreSQL 9.1.3, compiled by Visual C++ build 1500, 32-bit, running on Windows XP SP3

shared_buffers = 768MB
work_mem = 24MB
effective_cache_size = 1024MB

All other (relevant) settings are at their defaults.

Regards Thomas
Re: [PERFORM] Update join performance issues
Kevin Kempter wrote on 03.04.2012 19:29: Hi All; I have a query that wants to update a table based on a join like this: update test_one set f_key = t.f_key from upd_temp1 t, test_one t2 where t.id_number = t2.id_number upd_temp1 has 248,762 rows test_one has 248,762 rows To extend on what Kevin has already answered, a quote from the manual: "Note that the target table must not appear in the from_list, unless you intend a self-join"
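Because test_one appears both as the update target and in the FROM list above, each target row is joined against the table itself rather than just against upd_temp1. A minimal sketch of the intended statement (reusing the table and column names from the quoted query) leaves the target table out of the FROM list and references it directly in the WHERE clause:

```sql
-- the target table must not be repeated in the FROM list
UPDATE test_one
   SET f_key = t.f_key
  FROM upd_temp1 t
 WHERE t.id_number = test_one.id_number;
```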
[PERFORM] Re: PG as in-memory db? How to warm up and re-populate buffers? How to read in all tuples into memory?
Stefan Keller wrote on 26.02.2012 01:16: 2. Are there any hints on how to tell Postgres to read in all table contents into memory? What about creating a tablespace on a RAM filesystem (tmpfs), then creating a second schema in your database where all tables are located in that "temp" tablespace? Then upon startup (or using triggers) you can copy all data from the persistent tables to the memory tables. It would probably make sense to change the value of random_page_cost for that tablespace to 1. I'm not sure though how PostgreSQL handles a system restart with tables on a tablespace that might not be there. Thomas
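A minimal sketch of that setup (the mount point, tablespace, schema, and table names are all made up; it assumes /mnt/pg_ram is an existing tmpfs mount writable by the postgres user):

```sql
-- tablespace backed by a tmpfs mount (contents vanish on reboot!)
CREATE TABLESPACE ram_space LOCATION '/mnt/pg_ram/tblspc';

-- lower the random-access cost estimate for the in-memory tablespace
ALTER TABLESPACE ram_space SET (random_page_cost = 1.0);

-- second schema holding the in-memory copies
CREATE SCHEMA ram;
CREATE TABLE ram.my_table (LIKE public.my_table INCLUDING ALL)
    TABLESPACE ram_space;

-- populate on startup (or from triggers on the persistent table)
INSERT INTO ram.my_table SELECT * FROM public.my_table;
```

How cleanly the server comes back up when the tmpfs mount is empty after a reboot is exactly the open question from the post.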
Re: [PERFORM] PostgreSQL Parallel Processing !
sridhar bamandlapally, 27.01.2012 05:31:

SQL> explain plan for select * from hr.emp ;
Explained.

PLAN
---------------------------------------------------------------------------
| Id | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|  0 | SELECT STATEMENT  |      | 7444K |  944M | 16077  (4) | 00:03:13 |
|  1 | TABLE ACCESS FULL | EMP  | 7444K |  944M | 16077  (4) | 00:03:13 |
---------------------------------------------------------------------------

*WITH PARALLEL Option*

SQL> explain plan for select /*+parallel(emp,4)*/ * from hr.emp ;
Explained.

PLAN
---------------------------------------------------------------------------
| Id | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|  0 | SELECT STATEMENT     |      | 7444K |  944M |  4442  (3) | 00:00:54 |
|  1 |  PX COORDINATOR      |      |       |       |            |          |
|  2 |  PX SEND QC (RANDOM) | :TQ1 | 7444K |  944M |  4442  (3) | 00:00:54 |
|  3 |   PX BLOCK ITERATOR  |      | 7444K |  944M |  4442  (3) | 00:00:54 |
|  4 |    TABLE ACCESS FULL | EMP  | 7444K |  944M |  4442  (3) | 00:00:54 |
---------------------------------------------------------------------------

In the above plan (WITH PARALLEL Option):
1. "Cost" has been reduced to nearly 1/4th
2. "CPU" has been reduced
3. "Time" has been reduced to nearly 1/3rd

I have *never* seen the "time" column in the explain plan output come anywhere near the actual execution time in Oracle.
[PERFORM] Re: Hash index use presently(?) discouraged since 2005: revive or bury it?
Robert Klemme, 19.09.2011 13:13: On Sun, Sep 18, 2011 at 9:31 PM, Stefan Keller wrote: I'm simply referring to literature (like the intro Ramakrishnan & Gehrke). I just know that Oracle an Mysql actually do have them too and use it without those current implementation specific restrictions in Postgres. Where exactly do you take that from that Oracle has hash indexes? I can't seem to find them: http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/indexiot.htm#sthref293 Are you mixing this up with hash partitioning? http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/schemaob.htm#sthref443 Or am I missing something? Maybe he was referring to a hash cluster: http://download.oracle.com/docs/cd/E11882_01/server.112/e17118/statements_5001.htm This is a storage option where you can store related rows (e.g. in a parent/child relationship) in the same physical database block based on a hash value. That enables the database to read parent and child rows with just a single IO. In the background Oracle probably has something like a hash index to support that. Thomas
[PERFORM] Re: PostgreSQL-related topics of theses and seminary works sought (Was: Hash index use presently(?) discouraged...)
Stefan Keller, 17.09.2011 22:01: I'm also interested in such proposals or ideas! Here's some list of topics: * Time in PostgreSQL * Fast Bulk Data Inserting in PostgreSQL with Unlogged tables I don't understand these two items. Postgres does have a time data type, and it has had unlogged tables since 9.1. Regards Thomas
Re: [PERFORM] Why Index is not used
Adarsh Sharma, 25.03.2011 07:51: Thanks Andreas, I was about to print the output but it takes too much time. Below is the output of the explain analyze command:

pdc_uima=# explain analyze select c.clause, s.* from clause2 c, svo2 s where c.clause_id=s.clause_id and s.doc_id=c.source_id and c.sentence_id=s.sentence_id ;

QUERY PLAN
Merge Join (cost=5673831.05..34033959.87 rows=167324179 width=2053) (actual time=216281.162..630721.636 rows=30473117 loops=1)
  Merge Cond: ((s.clause_id = c.clause_id) AND (s.doc_id = c.source_id) AND (s.sentence_id = c.sentence_id))
  ->  Index Scan using idx_svo2 on svo2 s (cost=0.00..24489343.65 rows=27471560 width=1993) (actual time=0.130..177599.310 rows=27471560 loops=1)
  ->  Materialize (cost=5673828.74..6071992.29 rows=31853084 width=72) (actual time=216280.596..370507.452 rows=52037763 loops=1)
        ->  Sort (cost=5673828.74..5753461.45 rows=31853084 width=72) (actual time=216280.591..324707.956 rows=31853083 loops=1)
              Sort Key: c.clause_id, c.source_id, c.sentence_id
              Sort Method: external merge  Disk: 2616520kB
              ->  Seq Scan on clause2 c (cost=0.00..770951.84 rows=31853084 width=72) (actual time=0.025..25018.665 rows=31853083 loops=1)
Total runtime: 647804.037 ms
(9 rows)

How many rows are there in clause2 in total? 31853084 rows are returned from that table, which sounds like the whole table qualifies for the join condition. Regards Thomas
Re: [PERFORM] Disabling nested loops - worst case performance
Anssi Kääriäinen, 18.03.2011 08:15: Hello list, I am working on a Entity-Attribute-Value (EAV) database using PostgreSQL 8.4.7. The basic problem is that when joining multiple times different entities the planner thinks that there is vastly less rows to join than there is in reality and decides to use multiple nested loops for the join chain. Did you consider using hstore instead? I think in the PostgreSQL world, this is a better alternative than EAV and most probably faster as well. Regards Thomas
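For illustration, a minimal sketch of what the hstore alternative could look like (table name and attribute keys are made up; hstore ships as a contrib module in 8.4):

```sql
-- one row per entity; all attribute/value pairs live in a single hstore column
CREATE TABLE entity (
    id    integer PRIMARY KEY,
    attrs hstore
);

INSERT INTO entity (id, attrs)
VALUES (1, 'color => red, weight => 10');

-- filter on an attribute; a GiST or GIN index on attrs can support this
SELECT id
FROM entity
WHERE attrs -> 'color' = 'red';
```

The point is that the attribute lookups that would be extra joins in an EAV schema become simple column expressions on one row.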
Re: [PERFORM] Which gives good performance? separate database vs separate schema
Divakar Singh, 25.11.2010 12:37: Hello Friends, I have many instances of my software running on a server (Solaris SPARC). Each software instance requires some DB tables (same DDL for all instances' tables) to store data. It essentially means that some processes from each instance of the software connect to these tables. Now, should I put these tables in 1 Database's different schemas or in separate databases itself for good performance? I am using libpq for connection. I don't think it will make a big difference in performance. The real question is: do you need queries that "cross boundaries"? If that is the case you have to use schemas, because Postgres does not support cross-database queries. Regards Thomas
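To illustrate the difference: with one database and a schema per software instance, a query can join tables from two instances directly (the schema and table names here are made up); with separate databases, no single query can do this:

```sql
-- compare data of two software instances living in two schemas
SELECT a.id, a.value AS instance_one_value, b.value AS instance_two_value
FROM instance_one.readings a
JOIN instance_two.readings b ON b.id = a.id;
```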
Re: [PERFORM] Select * is very slow
Kevin Grittner, 08.11.2010 18:01: "shaiju.ck" wrote: The table have 200 records now. Select * from employee takes 15 seconds to fetch the data!!! Which seems to be very slow. But when I say select id,name from employee it executes in 30ms. Same performance if I say select count(*) from employee. You haven't given nearly enough information for anyone to diagnose the issues with any certainty. Earlier responses have asked for some particularly important information, and I would add a request to see the output from `VACUUM VERBOSE employee;`. Beyond that, you might want to review this page for checks you can make yourself, and information which you could provide to allow people to give more informed advice: Do you really think that VACUUM is the problem? If the OP only selects two columns it is apparently fast. If he selects all columns it's slow, so I wouldn't suspect dead tuples here. My bet is that there are some really large text columns in there... He has asked the same question here: http://forums.devshed.com/postgresql-help-21/select-is-very-slow-761130.html but has also failed to answer the question about the table details... Regards Thomas
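One quick way to test the large-column theory would be to measure the stored size of each column directly (the wide column name here is hypothetical, since the OP never posted the table definition):

```sql
-- average on-disk size per column value; a huge number for one column would
-- explain why SELECT * is slow while selecting id and name is fast
SELECT avg(pg_column_size(id))    AS id_bytes,
       avg(pg_column_size(name))  AS name_bytes,
       avg(pg_column_size(photo)) AS photo_bytes  -- hypothetical wide column
FROM employee;
```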
Re: [PERFORM] Select count(*), the sequel
Kenneth Marshall, 27.10.2010 22:41: Different algorithms have been discussed before. A quick search turned up: quicklz - GPL or commercial fastlz - MIT works with BSD okay zippy - Google - no idea about the licensing lzf - BSD-type lzo - GPL or commercial zlib - current algorithm Of these lzf can compress at almost 3.7X of zlib and decompress at 1.7X and fastlz can compress at 3.1X of zlib and decompress at 1.9X. The same comparison put lzo at 3.0X for compression and 1.8X decompress. The block design of lzf/fastlz may be useful to support substring access to toasted data among other ideas that have been floated here in the past. Just keeping the hope alive for faster compression. What about a dictionary based compression (like DB2 does)? In a nutshell: it creates a list of "words" in a page. For each word, the occurrences in the db block are stored and the actual word is removed from the page/block itself. This covers all rows on a page and can give a very impressive overall compression. This compression is not done only on disk but in memory as well (the page is loaded with the dictionary into memory). I believe Oracle 11 does something similar. Regards Thomas
Re: [PERFORM] Sorted group by
Matthew Wakeling wrote on 10.08.2010 18:03: On Tue, 10 Aug 2010, Thomas Kellerer wrote: No. It's built in (8.4) and it's called Windowing functions: http://www.postgresql.org/docs/8.4/static/tutorial-window.html http://www.postgresql.org/docs/8.4/static/functions-window.html SELECT group, last_value(value) over(ORDER BY number) FROM table I may be mistaken, but as I understand it, a windowing function doesn't reduce the number of rows in the results? Yes, you are right, a bit too quick on my side ;) But this might be what you are after then:

select group_, value_
from (
  select group_, value_, number_,
         row_number() over (partition by group_ order by value_ desc) as row_num
  from numbers
) t
where row_num = 1
order by group_ desc
Re: [PERFORM] Sorted group by
Matthew Wakeling wrote on 10.08.2010 17:40: Currently, I do this in my application by ordering by the number and only using the last value. I imagine that this is something that can be done in the new Postgres 9, with a sorted group by - something like this: SELECT group, LAST(value, ORDER BY number) FROM table GROUP BY group Is this something that is already built in, or would I have to write my own LAST aggregate function? No. It's built in (8.4) and it's called Windowing functions: http://www.postgresql.org/docs/8.4/static/tutorial-window.html http://www.postgresql.org/docs/8.4/static/functions-window.html SELECT group, last_value(value) over(ORDER BY number) FROM table You don't need the GROUP BY then (but you can apply e.g. an ORDER BY group). Thomas
Re: [PERFORM] Does FILTER in SEQSCAN short-circuit AND?
Craig James wrote on 27.05.2010 23:13: It would be nice if Postgres had a way to assign a cost to every function. Isn't that what the COST parameter is intended to be: http://www.postgresql.org/docs/current/static/sql-createfunction.html Thomas
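A sketch of what that looks like (the function name and body are made up; declaring a high cost makes the planner evaluate cheaper filter conditions first):

```sql
-- declare a function as far more expensive than a default built-in,
-- so the planner postpones it behind cheaper predicates in a filter
CREATE FUNCTION is_valid_structure(smiles text) RETURNS boolean
AS $$
  SELECT length($1) > 0  -- placeholder body
$$ LANGUAGE sql COST 10000;
```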
Re: [PERFORM] autovacuum strategy / parameters
akp geek, 28.04.2010 16:37: We have 8.4, which of the AUTOVACUUM PARAMETERS can be set to handle an individual table? They are all documented here: http://www.postgresql.org/docs/current/static/sql-createtable.html
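For example, the per-table autovacuum settings are storage parameters set through ALTER TABLE (the table name and values here are made up):

```sql
-- make autovacuum much more aggressive on one large, frequently updated table
ALTER TABLE big_table SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_vacuum_threshold     = 1000
);
```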
Re: [PERFORM] autovacuum strategy / parameters
Rick, 22.04.2010 22:42: So, in a large table, the scale_factor is the dominant term. In a small table, the threshold is the dominant term. But both are taken into account. The default values are set for small tables; it is not being run for large tables. With 8.4 you can adjust the autovacuum settings per table...
Re: [PERFORM] PG optimization question
Nickolay wrote on 09.01.2010 11:24: it would be pretty hard with one table because of blocking What do you mean by "because of blocking"? Thomas
Re: [PERFORM] 8.4.1 ubuntu karmic slow createdb
Michael Clemmons wrote on 11.12.2009 23:52: Thanks, all, this has been a good help. I don't have control (or easy control) over unit tests creating/deleting databases since I'm using the Django framework for this job. Createdb takes 12 seconds on my system (9.10, pg 8.4 and ext4) which is impossibly slow for running 200 unit tests. I wonder if you could simply create one database, and then a new schema for each of the tests. After creating the schema you could alter the search_path for the "unit test user" and it would look like a completely new database. Thomas
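A sketch of the schema-per-test idea (the schema and role names are made up):

```sql
-- instead of CREATE DATABASE per test run: one schema per test run
CREATE SCHEMA test_0042;

-- the test user now sees test_0042 as if it were a fresh database
ALTER ROLE unit_test_user SET search_path = test_0042, public;

-- ... run the tests ...

-- tearing down is a single, fast statement
DROP SCHEMA test_0042 CASCADE;
```

Creating and dropping a schema is far cheaper than createdb, which has to copy a whole template database.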
Re: [PERFORM] Modeling a table with arbitrary columns
Andreas Hartmann wrote on 29.10.2009 21:52: Hi everyone, I want to model the following scenario for an online marketing application: Users can create mailings. The list of recipients can be uploaded as spreadsheets with arbitrary columns (each row is a recipient). I expect the following maximum quantities the DB will contain: * up to 5000 mailings * 0-10'000 recipients per mailing, average maybe 2000 * approx. 20 columns per spreadsheet [...] I have the feeling that the second approach is cleaner. But since the recipient_value table will contain approx. 20 times more rows than the recipient table in approach A, I'd expect a performance degradation. Is there a limit to the number of rows that should be stored in a table? With approach B the maximum number of rows could be about 200'000'000, which sounds quite a lot … I don't think the number of rows is that critical (it sure doesn't hit any "limits"). The question is how you want to access that information and how quick that has to be. If you need sub-second response time for aggregates over that, you'll probably have to throw quite some hardware at it. You could also check out the hstore contrib module which lets you store key/value pairs in a single column, which might actually be what you are looking for (note that I have never used it, so I cannot tell how fast it actually is): http://www.postgresql.org/docs/current/static/hstore.html So something like:

CREATE TABLE recipient
(
   mailing           integer NOT NULL,
   row               integer NOT NULL,
   recipient_values  hstore,
   PRIMARY KEY (mailing, row),
   FOREIGN KEY (mailing) REFERENCES mailing (id)
)

Btw: I would get rid of a column named "row", this is more of a "recipient_id", but that is just personal taste. Regards Thomas
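With a layout like the one above, each uploaded spreadsheet column becomes a key in the hstore, and a single column can be pulled back out per recipient, e.g. (the key 'email' is of course made up):

```sql
-- read one "spreadsheet column" for all recipients of one mailing
SELECT row, recipient_values -> 'email' AS email
FROM recipient
WHERE mailing = 42;
```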
Re: [PERFORM] Getting a random row
Shaul Dar, 13.10.2009 17:17: Also PG does not have a concept of an auto-increment pseudo-column like Oracle's "rownum". Any suggestions? Yes it does (at least in 8.4):

SELECT row_number() over(), the_other_columns...
FROM your_table

So you could do something like:

SELECT *
FROM (
  SELECT row_number() over() as rownum, the_other_columns...
  FROM your_table
) t
WHERE t.rownum = a_random_integer_value_lower_than_rowcount;

Thomas
Re: [PERFORM] Postgres Clustering
Alan McKay wrote on 27.05.2009 19:57: > What options are available? I guess a good starting point is the Postgres Wiki: http://wiki.postgresql.org/wiki/Replication%2C_Clustering%2C_and_Connection_Pooling
[PERFORM] L
ramasubramanian, 27.05.2009 08:42: How to insert or update a file in a table using a query in Postgres?

CREATE TABLE excel_file_upload
(
   user_id     integer,
   excel_file  bytea
)

Example:

insert into excel_file_upload values(1, file1)

file1 can be any file: *.doc, *.xls. How can I do this (without using Java or any other language) using a query?

If you are restricted to psql then I guess the only way is the solution shown by Albe - but I guess that only works if the file is on the server. Some of the GUI SQL tools can handle blob "uploads" from the client. So if you are willing to switch to a different SQL client, this could be done without programming. My own tool, available at http://www.sql-workbench.net, can either do that through a GUI dialog or as part of an extended SQL syntax that is of course specific to that application. Thomas
Re: [PERFORM] performance for high-volume log insertion
Stephen Frost wrote on 22.04.2009 23:51: What about 4 individual prepared inserts? Just curious about it. 4 inserts, one prepared statement each (constructing the prepared statement only once), in a single transaction: 1.68s I'm surprised that there's any win here at all. For a single column table, I wouldn't expect much either. With more columns I think it would be a larger improvement. Out of curiosity I did some tests through JDBC. Using a single-column (integer) table, re-using a prepared statement took about 7 seconds to insert 10 rows with JDBC's batch interface and a batch size of 1000. Using a prepared statement that had 1000 "(?)" groups after the INSERT (insert into foo values (?), (?), ...), the insert took about 0.8 seconds. Quite an improvement, I'd say. Then I switched to a three-column table (int, varchar(500), varchar(500)). Insert using a PreparedStatement with batching (first scenario) now took ~8.5 seconds, whereas the multi-value insert now took ~3 seconds. So the difference got smaller, but still was quite substantial. This was inserting relatively small strings (~20 characters) into the table. When increasing the size of the inserted strings, things began to change. When I bumped the length of the strings to 70 and 200 characters, the multi-value insert slowed down considerably. Both solutions now took around 9 seconds. The multi-value solution ranged between 7 and 9 seconds, whereas the "regular" insert syntax was pretty constant at roughly 9 seconds (I ran it about 15 times). So it seems that as the size of the row increases, the multi-value insert loses its head start compared to the "regular" insert. I also played around with batch size. Going beyond 200 didn't make a big difference. For the JDBC batch, the batch size was the number of rows after which I called executeBatch(); for the multi-value insert, this was the number of tuples I sent in a single INSERT statement. The multi-value statement seems to perform better with lower "batch" sizes (~10-50) whereas the JDBC batching seems to be fastest with about 200 statements per batch. Thomas
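For reference, the multi-value INSERT syntax compared above looks like this (table and values are made up; in the JDBC test each literal would be a ? placeholder of one large prepared statement):

```sql
-- one statement inserting several rows at once
INSERT INTO foo (id, col1, col2)
VALUES (1, 'a', 'b'),
       (2, 'c', 'd'),
       (3, 'e', 'f');
```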