Re: [PERFORM] Tons of free RAM. Can't make it go away.
Sorry for the late response, but maybe you are still struggling. It can be that some queries use a lot of work memory, either because of a high work_mem setting or because of a planner error. In that case, the moment the query runs it needs memory that is later returned and becomes free. Usually this can be seen as a spike of active memory with a lot of free memory afterwards.

2012/10/22 Shaun Thomas > Hey everyone! > > This is pretty embarrassing, but I've never seen this before. This is our > system's current memory allocation from 'free -m': > > total used free buffers cached > Mem: 72485 58473 14012 3 34020 > -/+ buffers/cache: 24449 48036 > > So, I've got 14GB of RAM that the OS is just refusing to use for disk or > page cache. Does anyone know what might cause that? > > Our uname -sir, for reference: > > Linux 3.2.0-31-generic x86_64 > > --

-- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Paged Query
On Monday, July 9, 2012, Misa Simic wrote: > > > 2012/7/9 Gregg Jaskiewicz >> >> Use cursors. >> By far the most flexible. offset/limit have their down sides. > > > Well, I am not aware what down sides there are in LIMIT OFFSET what does not exist in any other solutions for paged queries...

'where key > last-value order by key limit N' performs much better than a large OFFSET, since the skipped rows never have to be read.

p.s. Sorry for the previous email - hit send too early.

-- Best regards, Vitalii Tymchyshyn
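The keyset approach described above can be sketched side by side with OFFSET paging (table and column names are hypothetical):

```sql
-- OFFSET paging: the server still reads and discards the first
-- 100000 rows before returning the page, so deep pages get slow.
SELECT id, title FROM items ORDER BY id LIMIT 50 OFFSET 100000;

-- Keyset ("seek") paging: remember the last key of the previous page
-- and start the index scan right after it.
SELECT id, title FROM items
 WHERE id > 100050   -- last id seen on the previous page
 ORDER BY id LIMIT 50;
```

The second form stays fast for any page depth because it is a plain range scan on the ordering key's index.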
Re: [PERFORM] Paged Query
On Monday, July 9, 2012, Misa Simic wrote: > > > 2012/7/9 Gregg Jaskiewicz >> >> Use cursors. >> By far the most flexible. offset/limit have their down sides. > > > Well, I am not aware what down sides there are in LIMIT OFFSET what does not exist in any other solutions for paged queries...

where key > last-previous-key order by key

-- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Paged Query
What language are you using? Usually there is an iterator with a chunked-fetch option (like setFetchSize in Java JDBC). You pass the query without a limit and then read only as many results as you need. Note that the query plan in this case won't be optimized for your limit, and I don't remember whether Postgres has an "optimize for N rows" statement option. Also, if your statement is ordered by some key, you can use the general paging technique of rerunning the query with a "key > max_prev_value" filter to get the next chunk.

On Wednesday, July 4, 2012, Hermann Matthes wrote: > I want to implement a "paged Query" feature, where the user can enter in a dialog, how much rows he want to see. After displaying the first page of rows, he can can push a button to display the next/previous page. > On database level I could user "limit" to implement this feature. My problem now is, that the user is not permitted to view all rows. For every row a permission check is performed and if permission is granted, the row is added to the list of rows sent to the client. > If for example the user has entered a page size of 50 and I use "limit 50" to only fetch 50 records, what should I do if he is only permitted to see 20 of these 50 records? There may be more records he can view. > But if I don't use "limit", what happens if the query would return 5,000,000 rows? Would my result set contain 5,000,000 rows or would the performance of the database go down? > > Thanks in advance > Hermann > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance >

-- Best regards, Vitalii Tymchyshyn
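The cursor approach Gregg suggested, combined with the chunked reads described above, can be sketched in plain SQL (names are hypothetical):

```sql
BEGIN;

-- Declare a cursor over the unlimited query; rows are produced on demand,
-- so the 5,000,000-row worst case is never materialized at once.
DECLARE page_cur CURSOR FOR
    SELECT * FROM documents ORDER BY id;

-- Fetch a chunk, apply the application-side permission check, and keep
-- fetching until 50 visible rows have been collected (or rows run out).
FETCH 50 FROM page_cur;
FETCH 50 FROM page_cur;

CLOSE page_cur;
COMMIT;
```

One caveat: a cursor only lives inside its transaction unless declared WITH HOLD, so long-lived interactive paging across requests usually falls back to the keyset filter instead.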
Re: [PERFORM] correlated exists with join is slow.
Glad to hear postgresql becomes better and better :)

2012/6/18 Tom Lane > Віталій Тимчишин writes: > > Today I've found a query that I thought would be fast turned out to be > slow. > > The problem is correlated exists with join - it does not want to make > > correlated nested loop to make exists check. > > 9.2 will make this all better. These are exactly the type of case where > you need the "parameterized path" stuff. > > regards, tom lane >

-- Best regards, Vitalii Tymchyshyn
[PERFORM] correlated exists with join is slow.
Hello.

Today I've found a query that I thought would be fast turned out to be slow. The problem is a correlated EXISTS with a join - the planner does not want to use a correlated nested loop to make the EXISTS check. Even if I force it to use a nested loop, it materializes the join uncorrelated and then filters it. It's OK when the EXISTS does not contain a join. Also, the good old LEFT JOIN ... WHERE x IS NULL form works fast. Note that I could see the same problem for both EXISTS and NOT EXISTS. Below is a test case (tested on 9.1.4) with explains.

create temporary table o(o_val,c_val) as select v, v/2 from generate_series(1,100) v;
create temporary table i(o_ref, l_ref) as select generate_series(1,100), generate_series(1,10);
create temporary table l(l_val, l_name) as select v, 'n_' || v from generate_series(1,10) v;
create index o_1 on o(o_val);
create index o_2 on o(c_val);
create index i_1 on i(o_ref);
create index i_2 on i(l_ref);
create index l_1 on l(l_val);
create index l_2 on l(l_name);
analyze o;
analyze i;
analyze l;
explain analyze select 1 from o where not exists (select 1 from i join l on l_ref = l_val where l_name='n_2' and o_ref=o_val) and c_val=33;
-- http://explain.depesz.com/s/Rvw
explain analyze select 1 from o where not exists (select 1 from i join l on l_ref = l_val where l_val=2 and o_ref=o_val) and c_val=33;
-- http://explain.depesz.com/s/fVHw
explain analyze select 1 from o where not exists (select 1 from i where l_ref=2 and o_ref=o_val) and c_val=33;
-- http://explain.depesz.com/s/HgN
explain analyze select 1 from o left join i on o_ref=o_val left join l on l_ref = l_val and l_name='n_2' where o_ref is null and c_val=33;
-- http://explain.depesz.com/s/mLA
set enable_hashjoin=false;
explain analyze select 1 from o where not exists (select 1 from i join l on l_ref = l_val where l_name='n_2' and o_ref=o_val) and c_val=33;
-- http://explain.depesz.com/s/LYu
rollback;

-- Best regards, Vitalii Tymchyshyn
[PERFORM] SSD selection
Hello, all.

We've reached the point where we would like to try SSDs. We've got a central DB, currently 414 GB in size and growing. The working set does not fit into our 96GB-RAM server anymore. So, the main question is what to buy. Here is what we've got:
1) Intel 320. Good, but slower than current-generation SandForce drives.
2) Intel 330. Looks like a cheap 520 without a capacitor.
3) Intel 520. Faster than the 320, but no capacitor.
4) OCZ Vertex 3 Pro - not available, even on the OCZ site.
5) OCZ Deneva - can't find it in my country :)

We are using an Areca controller with BBU. So for me the question is: can the 520 series be set up to handle fsyncs correctly? We've got the Areca to handle buffering.

-- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Maximum number of sequences that can be created
2012/5/13 Robert Klemme > On Sun, May 13, 2012 at 10:12 AM, Віталій Тимчишин > wrote: > > 2012/5/11 Robert Klemme > > >> On the contrary: what would be the /advantage/ of being able to create > >> millions of sequences? What's the use case? > > > > We are using sequences as statistics counters - they produce almost no > > performance impact and we can tolerate their non-transactional nature. I > can > > imagine someone who wants to have a sequence per user or other relation > > row. > > I can almost see the point. But my natural choice in that case would > be a table with two columns. Would that actually be so much less > efficient? Of course you'd have fully transactional behavior and thus > locking. >

We've had concurrency problems with the table solution (a counter that is updated by many concurrent queries), so we traded transactionality for speed. We are actually using this data to draw pretty graphs in nagios, so it's quite OK. But we have only ~10 sequences, not millions :)

-- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Maximum number of sequences that can be created
2012/5/11 Robert Klemme > On Fri, May 11, 2012 at 12:50 PM, Vidhya Bondre > wrote: > > Is there any max limit set on sequences that can be created on the > database > > ? Also would like to know if we create millions of sequences in a single > db > > what is the downside of it.

The sequences AFAIK are accounted as relations. A large list of relations may slow down various system utilities, such as vacuuming (or may not, depending on the queries and the indexes on pg_class).

> > On the contrary: what would be the /advantage/ of being able to create > millions of sequences? What's the use case? > >

We are using sequences as statistics counters - they produce almost no performance impact and we can tolerate their non-transactional nature. I can imagine someone who wants to have a sequence per user or other relation row.

-- Best regards, Vitalii Tymchyshyn
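The counter pattern described above can be sketched like this (the sequence name is hypothetical):

```sql
-- One sequence per counter. nextval() is cheap and never blocks other
-- sessions - which is exactly why it is non-transactional.
CREATE SEQUENCE page_hits_counter;

-- Each event bumps the counter:
SELECT nextval('page_hits_counter');

-- The monitoring job reads the current value without advancing it
-- (pre-10 syntax; sequences are readable like single-row relations):
SELECT last_value FROM page_hits_counter;
```

The trade-off stated in the post applies: increments made by rolled-back transactions are still counted, which is acceptable for graphing but not for accounting.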
Re: [PERFORM] bad planning with 75% effective_cache_size
How about:

with par_ as (select * from product_parent where parent_name like 'aa%')
select distinct product_code from product p_
inner join par_ on p_.parent_id=par_.id
limit 2

?

2012/4/3 Istvan Endredy > Hi, > > i've ran into a planning problem. > > > select distinct product_code from product p_ > inner join product_parent par_ on p_.parent_id=par_.id > where par_.parent_name like 'aa%' limit 2 > > > If effective_cache_size is smaller (32MB), planning is ok and query is > fast. (10ms) > In the worst case (effective_cache_size=6GB) the speed depends on the > value of 'limit' (in select): if it is smaller, query is slower. (12ms) > >

-- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] SeqScan with full text search
2012/4/16 Tomek Walkuski > Hello group! > > I have query like this: > > SELECT > employments.candidate_id AS candidate_id, > SUM(TS_RANK(employers.search_vector, TO_TSQUERY('simple', 'One:* | > Two:* | Three:* | Four:*'), 2)) AS ts_rank > FROM > employments > INNER JOIN > employers ON employments.employer_id = employers.id > AND > employers.search_vector @@ TO_TSQUERY('simple', 'One:* | Two:* | > Three:* | Four:*') > GROUP BY > candidate_id; > > And it results with this: > > http://explain.depesz.com/s/jLM > > The JOIN between employments and employers is the culprit. I'm unable > to get rid of the seq scan, and setting enable_seqscan to off makes > things even worse. > > Is there any way to get rid of this JOIN? > >

Have you got an index on employments.employer_id? It seems to me that only some employments come out of the join, so an index would help here. What's the plan with enable_seqscan off?

P.S. I don't see why all employments are needed. Maybe I am reading something wrong? For me it's at most 2616 employments out of 1606432.

-- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] database slowdown while a lot of inserts occur
A few words regarding small inserts and a lot of fsyncs: if that is your problem, you can fix it with a battery-backed RAID card. A similar effect can be reached by turning synchronous commit off. Note that the latter may lose the last few commits in case of a sudden reboot, but you can at least use it to test whether moving to a BBU will help. (I don't know if this setting can be changed with SIGHUP, without a restart.)

Note that this may still be a lot of random writes, and in the case of RAID5, a lot of random reads too.

I don't think batching will help other applications; it is a tool to help the application that does the batching. If you have random writes, look at HOT updates - they may help you if you follow their requirements.

Check your checkpoints - the application writes to the commit log first (sequential writes), then during checkpoints the data is written to the tables (random writes); longer checkpoint intervals may make your life easier. Try increasing checkpoint_segments. If you write a lot of data, try moving your commit logs to another drive/partition.

If you have a good RAID card with memory and BBU, you may try disabling its read cache (leaving only the write cache). Read caching is usually better done at the OS level (with much more memory), while fast writes need the BBU-protected write cache.

Best regards, Vitalii Tymchyshyn

2012/3/29 Campbell, Lance > PostgreSQL 9.0.x > > We have around ten different applications that use the same database. > When one particular application is active it does an enormous number of > inserts. Each insert is very small. During this time the database seems > to slow down in general. The application in question is inserting into a > particular table that is not used by the other applications. > > 1) What should I do to confirm that the database is the issue > and not the applications? > > 2) How can I identify where the bottle neck is occurring if the > issue happens to be with the database? > > I have been using PostgreSQL for eight years. 
It is an amazing database. > > Thanks, > > Lance Campbell > > Software Architect > > Web Services at Public Affairs > > 217-333-0382 > -- Best regards, Vitalii Tymchyshyn
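The mitigations above can be tried with ordinary SQL and configuration settings; a minimal sketch (the values are illustrative assumptions, not tuned recommendations):

```sql
-- Per-session (or per-transaction) trade of durability for speed:
-- a crash can lose the last few commits, but cannot corrupt the database.
SET synchronous_commit = off;

-- postgresql.conf knobs mentioned above (9.0-era names; reload/restart
-- as appropriate):
--   checkpoint_segments = 32    -- fewer, larger checkpoints
-- The WAL ("commit log") can be moved to a separate drive by relocating
-- the pg_xlog directory onto its own partition.
```

Testing with synchronous_commit = off is a quick way to estimate what a BBU write cache would buy, since both remove the per-commit fsync wait from the insert path.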
Re: [PERFORM] Subquery flattening causing sequential scan
On 27.12.2011 20:13, "Tom Lane" wrote: > > Jim Crate writes: > > My question is why does it do a seq scan when it flattens this > > subquery into a JOIN? > > Because it thinks there will be 3783 rows out of the msg scan, which if > true would make your desired nestloop join a serious loser.

But the second plan is costed cheaper by EXPLAIN. I thought this should make it the one used, unless it was never evaluated. Could it be a collapse-limit problem or something similar?
Re: [PERFORM] will the planner ever use an index when the condition is <> ?
On 17.12.2011 18:25, "Filip Rembiałkowski" wrote: > > Normally there is no chance it could work, > because (a) the planner does not know all possible values of a column, > and (b) btree indexes cannot search on "not equal" operator. >

Why so? a<>b is the same as (a<b OR a>b), so the planner could check this option.
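The proposed rewrite can already be done by hand today; a sketch (table and values are hypothetical):

```sql
-- The btree-unfriendly form: a plain <> cannot use an index range scan.
SELECT count(*) FROM orders WHERE status <> 5;

-- Equivalent form the planner *can* satisfy with two index range scans:
SELECT count(*) FROM orders WHERE status < 5 OR status > 5;
```

This only pays off when the excluded value covers most of the table, so the two ranges are small; otherwise a seq scan is the right plan anyway.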
Re: [PERFORM] Performance die when COPYing to table with bigint PK
In my tests it greatly depends on whether the index writes are random or sequential. My test time goes down from a few hours to seconds if I append to the end of the index. As for me, the best comparison would be to make two int4 columns with the same data as the int8 one, build two indexes, then perform the test. My bet is that it will be slower than int8.

On Thursday, August 4, 2011, Robert Ayrapetyan < robert.ayrapet...@comodo.com> wrote: > All you are saying disproves following: > > in experiment I replaces bigint index: > > CREATE INDEX ix_t_big ON test.t USING btree (id_big) TABLESPACE tblsp_ix; > > with 4 (!) other indexes: > >>> If you look at the rest of my mail - you would notice 50 times >>> difference in performance. >>> What you would say? >> >> That accessing a page from RAM is more than 50 times as fast as a >> random access of that page from disk. >> >> -Kevin >> > > > > -- > Ayrapetyan Robert, > Comodo Anti-Malware Data Processing Analysis and Management System (CAMDPAMS) > http://repo-qa.camdpams.odessa.office.comodo.net/mediawiki/index.php >

-- Best regards, Vitalii Tymchyshyn
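The suggested comparison could be set up roughly like this (table and index names are hypothetical; this is a sketch of the experiment, not a benchmark harness):

```sql
-- One bigint key vs. two int4 columns carrying the same information.
CREATE TABLE t8 (id_big bigint);
CREATE TABLE t4 (id_hi int4, id_lo int4);

CREATE INDEX ix_t8    ON t8 (id_big);
CREATE INDEX ix_t4_hi ON t4 (id_hi);
CREATE INDEX ix_t4_lo ON t4 (id_lo);

-- Random keys force random index insertion in both layouts;
-- compare \timing of the two inserts.
INSERT INTO t8 SELECT (random() * 1e15)::bigint
  FROM generate_series(1, 1000000);
INSERT INTO t4 SELECT (random() * 1e7)::int4, (random() * 1e7)::int4
  FROM generate_series(1, 1000000);
```

The point of the experiment: if maintaining two int4 indexes is slower than one int8 index, the original slowdown is about random index I/O, not the bigint type itself.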
Re: [PERFORM] big distinct clause vs. group by
2011/4/23 Robert Haas > On Apr 18, 2011, at 1:13 PM, Uwe Bartels wrote: > > Hi Robert, > > > > thanks for your answer. > > the aggregate function I was talking about is the function I need to use > for the non-group by columns like min() in my example. > > There are of course several function to choose from, and I wanted to know > which causes as less as possible resources. > > Oh, I see. min() is probably as good as anything. You could also create a > custom aggregate that just always returns its first input. I've occasionally > wished we had such a thing as a built-in. > >

I've once written a "single" grouping function - it checks that all of its (non-null) input values are equal and returns the value, or raises an error if there are two different values.

Best regards, Vitalii Tymchyshyn

-- Best regards, Vitalii Tymchyshyn
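Such a "single" aggregate can be sketched as follows (names are hypothetical; this is not the author's original code):

```sql
-- Transition function: keep the first non-null value seen,
-- raise an error if a different non-null value appears later.
CREATE FUNCTION single_sfunc(anyelement, anyelement) RETURNS anyelement
LANGUAGE plpgsql IMMUTABLE AS $$
BEGIN
  IF $1 IS NULL THEN RETURN $2; END IF;
  IF $2 IS NULL OR $1 = $2 THEN RETURN $1; END IF;
  RAISE EXCEPTION 'single(): got two different values: % and %', $1, $2;
END;
$$;

CREATE AGGREGATE single(anyelement) (
  SFUNC = single_sfunc,
  STYPE = anyelement
);

-- Usage: fails loudly if a group is not actually single-valued.
-- SELECT dept_id, single(manager_name) FROM emp GROUP BY dept_id;
```

Unlike min(), this documents the intent ("this column is functionally dependent on the GROUP BY key") and catches data errors instead of silently picking a value.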
Re: [PERFORM] Shouldn't we have a way to avoid "risky" plans?
2011/3/23 Tom Lane > Claudio Freire writes: > > On Wed, Mar 23, 2011 at 5:29 PM, Josh Berkus wrote: > >> On 3/23/11 10:35 AM, Claudio Freire wrote: > >>> * consider plan bailout: execute a tempting plan, if it takes too > >>> long or its effective cost raises well above the expected cost, bail > >>> to a safer plan > > >> That would actually solve this particular case. It would still require > >> us to have some definition of "safer" though. > > > In my head, safer = better worst-case performance. > > If the planner starts operating on the basis of worst case rather than > expected-case performance, the complaints will be far more numerous than > they are today. > >

This could be GUC-controllable, like plan_safety=0..1 with a low default value. It could influence the costs of plans whose cost changes dramatically with small table changes and/or where the statistics are uncertain. It could also be used as a direct "hint" for such dangerous queries, by changing the GUC for a session/single query.

-- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Talking about optimizer, my long dream
On February 27, 2011, at 19:59, Robert Haas wrote: > 2011/2/4 Віталій Тимчишин : > > Hi, all. > > All this optimizer vs hint thread reminded me about crazy idea that got > to > > my head some time ago. > > I currently have two problems with the postgresql optimizer > > 1) Dictionary tables. Very usual thing is something like "select * from > > big_table where dictionary_id = (select id from dictionary where > > name=value)". This works awful if dictionary_id distribution is not > uniform. > > Does it work better if you write it as a join? > > SELECT b.* FROM big_table b, dictionary d WHERE b.dictionary_id = d.id > AND d.name = 'value' > > I would like to see a concrete example of this not working well, > because I've been writing queries like this (with MANY tables) for > years and it's usually worked very well for me. >

Here you are:

PostgreSQL 8.4.7 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit

create table a(dict int4, val int4);
create table b(dict int4, name text);
create index c on a(dict);
insert into b values (1, 'small'), (2, 'large');
insert into a values (1,1);
insert into a select 2, generate_series(1,10000);
analyze a;
analyze b;

test=# explain analyze select * from a where dict=1;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using c on a  (cost=0.00..8.27 rows=1 width=8) (actual time=0.014..0.016 rows=1 loops=1)
   Index Cond: (dict = 1)
 Total runtime: 0.041 ms
(3 rows)

test=# explain analyze select * from a where dict=2;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Seq Scan on a  (cost=0.00..170.01 rows=10000 width=8) (actual time=0.014..6.876 rows=10000 loops=1)
   Filter: (dict = 2)
 Total runtime: 13.419 ms
(3 rows)

test=# explain analyze select * from a,b where a.dict=b.dict and b.name ='small';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Hash Join  (cost=1.04..233.55 rows=5000 width=18) (actual time=0.047..13.159 rows=1 loops=1)
   Hash Cond: (a.dict = b.dict)
   ->  Seq Scan on a  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.009..6.633 rows=10001 loops=1)
   ->  Hash  (cost=1.02..1.02 rows=1 width=10) (actual time=0.011..0.011 rows=1 loops=1)
         ->  Seq Scan on b  (cost=0.00..1.02 rows=1 width=10) (actual time=0.006..0.008 rows=1 loops=1)
               Filter: (name = 'small'::text)
 Total runtime: 13.197 ms
(7 rows)

test=# explain analyze select * from a,b where a.dict=b.dict and b.name ='large';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Hash Join  (cost=1.04..233.55 rows=5000 width=18) (actual time=0.074..21.476 rows=10000 loops=1)
   Hash Cond: (a.dict = b.dict)
   ->  Seq Scan on a  (cost=0.00..145.01 rows=10001 width=8) (actual time=0.012..7.085 rows=10001 loops=1)
   ->  Hash  (cost=1.02..1.02 rows=1 width=10) (actual time=0.021..0.021 rows=1 loops=1)
         ->  Seq Scan on b  (cost=0.00..1.02 rows=1 width=10) (actual time=0.015..0.016 rows=1 loops=1)
               Filter: (name = 'large'::text)
 Total runtime: 28.293 ms
(7 rows)

It simply doesn't know that small=1 and large=2, so it never uses nested loop + index scan:

test=# set enable_hashjoin=false;
SET
test=# explain analyze select * from a,b where a.dict=b.dict and b.name ='small';
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..253.28 rows=5000 width=18) (actual time=0.041..0.047 rows=1 loops=1)
   ->  Seq Scan on b  (cost=0.00..1.02 rows=1 width=10) (actual time=0.010..0.012 rows=1 loops=1)
         Filter: (name = 'small'::text)
   ->  Index Scan using c on a  (cost=0.00..189.75 rows=5000 width=8) (actual time=0.021..0.023 rows=1 loops=1)
         Index Cond: (a.dict = b.dict)
 Total runtime: 0.089 ms
(6 rows)

-- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Why we don't want hints Was: Slow count(*) again...
2011/2/10 Tobias Brox > On 4 February 2011 04:46, Josh Berkus wrote: > > "Optimizer hints are used to work around problems in the optimizer and > > introduce upgrade and maintenance issues. We would rather have the > > problems reported and fixed. We have discussed a more sophisticated > > system of per-class cost adjustment instead, but a specification remains > > to be developed." > > I have no clue about how hints works in Oracle ... I've never been > working "enterprise level" on anything else than Postgres. Anyway, > today I just came over an interesting problem in our production > database today - and I think it would be a benefit to be able to > explicitly tell the planner what index to use (the dev team is adding > redundant attributes and more indexes to solve the problem - which > worries me, because we will run into serious problems as soon as there > won't be enough memory for all the frequently-used indexes). > > We have users and transactions, and we have transaction types. The > transaction table is huge. The users are able to interactively check > their transaction listings online, and they have some simple filter > options available as well. Slightly simplified, the queries done > looks like this: > > select * from account_transaction where account_id=? order by > created desc limit 25; > > select * from account_transaction where trans_type_id in ( ... > long, hard-coded list ...) and account_id=? order by created desc > limit 25; > > and we have indexes on: > > account_transaction(account_id, created) > > account_transaction(account_id, trans_type_id, created) > > If the list is hard-coded, you can create partial index on account_transaction(account_id, created desc) where trans_type_id in ( ... long, hard-coded list ...) -- Best regards, Vitalii Tymchyshyn
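Spelled out, the partial index suggested above might look like this (the IN list shown is hypothetical; it must match the application's hard-coded list exactly):

```sql
-- Matches the second query's filter, so the planner can answer it with a
-- single index scan that is already ordered by created DESC.
CREATE INDEX account_transaction_filtered_idx
    ON account_transaction (account_id, created DESC)
    WHERE trans_type_id IN (3, 7, 9);  -- the long, hard-coded list
```

The catch: the index is only usable while the query's IN list is textually equivalent to the index predicate, so it couples the schema to that hard-coded list.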
Re: [PERFORM] How to best use 32 15k.7 300GB drives?
2011/2/4 Robert Haas > On Fri, Feb 4, 2011 at 4:19 AM, Vitalii Tymchyshyn > wrote: > > Why do you expect such a invasive code changes? I know little about > > postgresql code layering, but what I propose (with changing delete to > > truncate) is: > > 1) Leave tuple addressing as it is now > > i.e. a block number and a slot position within the block? > > Seems like you'd need . >

No, that's what I mean. Leave it as it is. You would have the file's logical length (fixed for all but the last one, currently 1GB) and the file's actual length, which can be less (if the file is truncated). In the latter case you still have these "empty" blocks that don't exist at all. Actually, the simplest implementation could be to tell the file system "drop this part of the file and pretend it's all zeros", but I don't think many FSs (OSes?) support this. So each file still has its fixed N blocks, and the file number is still blocknumber / N.

-- Best regards, Vitalii Tymchyshyn
[PERFORM] Talking about optimizer, my long dream
Hi, all.

All this optimizer-vs-hints thread reminded me about a crazy idea that got into my head some time ago. I currently have two problems with the postgresql optimizer:

1) Dictionary tables. A very usual thing is something like "select * from big_table where dictionary_id = (select id from dictionary where name=value)". This works awfully if the dictionary_id distribution is not uniform. The thing that helps is to retrieve the subselect value first and then simply do "select * from big_table where dictionary_id=id_value".

2) Complex queries. If there are over 3 levels of subselects, the optimizer's row estimates often become less and less correct as we go up the levels. At the ~3rd level this often leads to wrong choices. The thing that helps is to create temporary tables from the subselects, analyze them, and then run the main select using these temporary tables.

While the first one could be fixed by introducing some correlation statistics, I don't think there is any simple way to fix the second one. But what if the optimizer could, in some cases, say "fetch this and this, and then I'll plan the other part of the query based on the statistics of what you've fetched"?

-- Best regards, Vitalii Tymchyshyn
Re: [HACKERS] [PERFORM] Slow count(*) again...
On February 4, 2011, 09:32, wrote: > > > when a copy command is issued, I assume that there is some indication of > how much data is going to follow. I know that it's not just 'insert > everything until the TCP connection terminates' because that would give you > no way of knowing if the copy got everything in or was interrupted part way > through. think about what happens with ftp if the connection drops, you get > a partial file 'successfully' as there is no size provided, but with HTTP > you get a known-bad transfer that you can abort or resume. > >

I don't think so, since you can do 'cat my_large_copy.sql | psql'. AFAIR it simply looks for an end-of-data marker, either in the protocol or in the stream itself (run COPY FROM STDIN in psql and it will tell you what the marker is).

-- Best regards, Vitalii Tymchyshyn
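In text-mode COPY the end-of-data marker is a backslash and a period on a line by themselves; a minimal sketch (table name hypothetical, columns tab-separated):

```sql
CREATE TABLE t (id int, name text);

COPY t (id, name) FROM stdin;
1	one
2	two
\.
```

psql itself prints the hint "End with a backslash and a period on a line by itself." when you run COPY ... FROM STDIN interactively, which is exactly how a piped dump can signal a complete (rather than truncated) data stream.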
Re: [HACKERS] [PERFORM] Slow count(*) again...
2011/2/4 Mladen Gogala > Josh Berkus wrote: > >> However, since this system wasn't directly compatible with Oracle Hints, >> folks pushing for hints dropped the solution as unsatisfactory. This is >> the discussion we have every time: the users who want hints specifically >> want hints which work exactly like Oracle's, and aren't interested in a >> system designed for PostgreSQL. It's gotten very boring; it's like the >> requests to support MySQL-only syntax. >> >> > Actually, I don't want Oracle hints. Oracle hints are ugly and cumbersome. > I would prefer something like this: > > > http://dev.mysql.com/doc/refman/5.0/en/index-hints.html > >

As far as I can see, this would have to be embedded into the query, wouldn't it? You can achieve something like this by setting variables right before the query (usually even in the same call, by embedding multiple statements into the execute-query call). E.g. "set random_page_cost=1; select something that needs the index; set random_page_cost to default;". Yes, this is as ugly as a hack can look and can't be used on a per-table basis in a complex statement, but you have it.

-- Best regards, Vitalii Tymchyshyn
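Spelled out, the session-level workaround looks like this (the query in the middle is hypothetical):

```sql
-- Temporarily make index scans look cheap for this one statement.
SET random_page_cost = 1;

SELECT * FROM big_table WHERE indexed_col = 42;

-- Restore the server default so later queries are planned normally.
RESET random_page_cost;
```

Inside a transaction, SET LOCAL random_page_cost = 1 is tidier: the change reverts automatically at commit or rollback, so a failure cannot leave the session with the skewed setting.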
Re: [HACKERS] [PERFORM] Slow count(*) again...
2011/2/3 > > If the table is not large enough to fit in ram, then it will compete for > I/O, and the user will have to wait. > > what I'm proposing is that as the records are created, the process doing > the creation makes copies of the records (either all of them, or some of > them if not all are needed for the analysis, possibly via shareing memory > with the analysis process), this would be synchronous with the load, not > asynchronous. > > this would take zero I/O bandwidth, it would take up some ram, memory > bandwidth, and cpu time, but a load of a large table like this is I/O > contrained. > > it would not make sense for this to be the default, but as an option it > should save a significant amount of time. > > I am making the assumption that an Analyze run only has to go over the data > once (a seqential scan of the table if it's >> ram for example) and gathers > stats as it goes. > > with the current code, this is a completely separate process that knows > nothing about the load, so if you kick it off when you start the load, it > makes a pass over the table (competing for I/O), finishes, you continue to > update the table, so it makes another pass, etc. As you say, this is a bad > thing to do. I am saying to have an option that ties the two togeather, > essentially making the data feed into the Analyze run be a fork of the data > comeing out of the insert run going to disk. So the Analyze run doesn't do > any I/O and isn't going to complete until the insert is complete. At which > time it will have seen one copy of the entire table. > >

Actually those are two different problems. One is to make analyze more automatic, so that a select right after an insert is smarter because statistics are available to it. The other is to make analyze take fewer IO resources. I don't like the idea of embedding it into insert (unless the threshold can be determined before the inserts start), simply because it adds CPU/memory work that slows down every insert. And if you add a knob that is disabled by default, it will be no better than a manual analyze.

-- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] How to best use 32 15k.7 300GB drives?
2011/1/28 Scott Carey > > > On 1/28/11 9:28 AM, "Stephen Frost" wrote: > > >* Scott Marlowe (scott.marl...@gmail.com) wrote: > >> There's nothing wrong with whole table updates as part of an import > >> process, you just have to know to "clean up" after you're done, and > >> regular vacuum can't fix this issue, only vacuum full or reindex or > >> cluster. > > > >Just to share my experiences- I've found that creating a new table and > >inserting into it is actually faster than doing full-table updates, if > >that's an option for you. > > I wonder if postgres could automatically optimize that, if it thought that > it was going to update more than X% of a table, and HOT was not going to > help, then just create a new table file for XID's = or higher than the one > making the change, and leave the old one for old XIDs, then regular VACUUM > could toss out the old one if no more transactions could see it. > >

I was thinking about whether a table file could be deleted if it has not a single live row, and whether this could be done by vacuum. In that case a vacuum on a table that was fully updated recently could be almost as good as cluster - any scan would skip such non-existent files really fast, and almost no disk space would be wasted.

-- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] hashed subplan 5000x slower than two sequential operations
2011/1/18 masterchief > > > Tom Lane wrote: > > > > The only really effective way the planner knows to optimize an > > "IN (sub-SELECT)" is to turn it into a semi-join, which is not possible > > here because of the unrelated OR clause. You might consider replacing > > this with a UNION of two scans of "contexts". (And yes, I know it'd be > > nicer if the planner did that for you.) > > In moving our application from Oracle to Postgres, we've discovered that a > large number of our reports fall into this category. If we rewrite them as > a UNION of two scans, it would be quite a big undertaking. Is there a way > to tell the planner explicitly to use a semi-join (I may not grasp the > concepts here)? If not, would your advice be to hunker down and rewrite > the > queries? > >

You can try "exists" instead of "in" - PostgreSQL likes EXISTS better. Alternatively, you can do something like "set enable_seqscan=false". Note that such a setting is more of a hammer, so it should be avoided; if it is the only thing that helps, it can be set right before the query and reset to the default afterwards.

-- Best regards, Vitalii Tymchyshyn
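The IN-to-EXISTS rewrite can be sketched like this (table and column names are hypothetical, not the poster's schema):

```sql
-- Original shape: an uncorrelated IN (sub-SELECT).
SELECT * FROM reports r
 WHERE r.context_id IN (SELECT c.id FROM contexts c WHERE c.kind = 'x');

-- EXISTS form: the subquery is correlated on the outer row, which
-- PostgreSQL can execute as a semi-join or a parameterized inner lookup.
SELECT * FROM reports r
 WHERE EXISTS (SELECT 1 FROM contexts c
                WHERE c.id = r.context_id AND c.kind = 'x');
```

The two forms are equivalent as long as the subquery's join column is not nullable in a way that matters; EXISTS simply gives the planner a more direct path to the semi-join.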
Re: [PERFORM] Performance under contention
2010/12/7 Robert Haas > On Tue, Dec 7, 2010 at 1:08 PM, Ivan Voras wrote: > > > I'm not very familiar with PostgreSQL code but if we're > > brainstorming... if you're only trying to protect against a small > > number of expensive operations (like DROP, etc.) that don't really > > happen often, wouldn't an atomic reference counter be good enough for > > the purpose (e.g. the expensive operations would spin-wait until the > > counter is 0)? > > No, because (1) busy-waiting is only suitable for locks that will only > be held for a short time, and an AccessShareLock on a table might be > held while we read 10GB of data in from disk, and (2) that wouldn't > allow for deadlock detection. >

As far as I understand this thread, the talk is about contention - a large number of processors wanting a single partition lock in order to take a high-level shared lock. As far as I can see from the source, a lot of code is executed under the partition lock's protection, like two hash searches (and possibly allocations). What could be done is to increase the number of locks: one could use spin locks for the hash-table manipulations, e.g. one lock preventing rehashing (the number of buckets being changed) and one lock for the required bucket. In that case only a small range of code would be protected by the partition lock. As for me, this would make the locking process more cpu-intensive (more locks acquired/freed during the execution), but it would decrease contention (since all but one lock can be spin locks working on atomic counters, and hash searches can be done in parallel), wouldn't it? The thing I am not sure about is how much spinlocks on atomic counters cost today.

-- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Slow query to get last created row using CURRVAL
2010/12/4 Mathieu De Zutter > > For each page load I first create an entry in that table, e.g.: > > INSERT INTO log_event (user_id, ip, action_id, object1_id, object2_id, > event_timestamp, comments) VALUES (1, '127.0.0.1', 96, null, null, > NOW(), 'TEST'); > > After that, I want to retrieve the data stored in log_event from a > trigger, e.g.: > > SELECT user_id FROM log_event WHERE id = CURRVAL('log_event_id_seq'); > > This way my insert-trigger knows who is creating the new row, while > using only one pg-user to query the database. > > Please note that you can use the following query to perform both the insert and the select in a single statement: INSERT INTO log_event (user_id, ip, action_id, object1_id, object2_id, event_timestamp, comments) VALUES (1, '127.0.0.1', 96, null, null, NOW(), 'TEST') returning user_id; -- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] anti-join chosen even when slower than old plan
2010/11/11 Robert Haas > > But thinking over what you've written here, I'm reminded of something > Peter said years ago, also about the optimizer. He was discussing the > ratio of the estimated cost to the actual cost and made an off-hand > remark that efforts had been made over the years to make that ratio > more consistent (i.e. improve the quality of the cost estimates) but > that they'd been abandoned because they didn't necessarily produce > better plans. Applying that line of thinking to this problem, maybe > we should give up on trying to make the estimates truly model reality, > and focus more on assigning them values which work well in practice. > For example, in your case, it would be sufficient to estimate the > amount of data that a given query is going to grovel through and then > applying some heuristic to choose values for random_page_cost and > seq_page_cost based on the ratio of that value to, I don't know, > effective_cache_size. > As for me, the simplest solution would be to allow setting costs on a per-relation basis. E.g. I know that this relation is in memory most of the time, while another one (an archive) is on disk. This could work like a charm together with buffer pools (portions of the shared cache): tables (or indexes) that are required to stay cached can be assigned to a buffer pool large enough to hold all their data, and archive ones to a small buffer pool. This can guarantee that after a query on the archive data, the cached tables are still cached. This solution, however, does not help for tables where only some portion of the table is actively used. The solution there could be to allow setting costs via partial indexes, e.g. "for any table access using this index, use these cost values". This, BTW, would also make table access via the given index more preferable. -- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Slow count(*) again...
2010/10/10 Neil Whelchel > On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote: > > On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel > wrote: > > > I know that there haven been many discussions on the slowness of > count(*) > > > even when an index is involved because the visibility of the rows has > to > > > be checked. In the past I have seen many suggestions about using > > > triggers and tables to keep track of counts and while this works fine > in > > > a situation where you know what the report is going to be ahead of > time, > > > this is simply not an option when an unknown WHERE clause is to be used > > > (dynamically generated). I ran into a fine example of this when I was > > > searching this mailing list, "Searching in 856,646 pages took 13.48202 > > > seconds. Site search powered by PostgreSQL 8.3." Obviously at some > point > > > count(*) came into play here because the site made a list of pages (1 2 > > > 3 4 5 6 > next). I very commonly make a list of pages from search > > > results, and the biggest time killer here is the count(*) portion, even > > > worse yet, I sometimes have to hit the database with two SELECT > > > statements, one with OFFSET and LIMIT to get the page of results I need > > > and another to get the amount of total rows so I can estimate how many > > > pages of results are available. The point I am driving at here is that > > > since building a list of pages of results is such a common thing to do, > > > there need to be some specific high speed ways to do this in one query. > > > Maybe an estimate(*) that works like count but gives an answer from the > > > index without checking visibility? I am sure that this would be good > > > enough to make a page list, it is really no big deal if it errors on > the > > > positive side, maybe the list of pages has an extra page off the end. I > > > can live with that. What I can't live with is taking 13 seconds to get > a > > > page of results from 850,000 rows in a table. 
> > > > 99% of the time in the situations you don't need an exact measure, and > > assuming analyze has run recently, select reltuples from pg_class for > > a given table is more than close enough. I'm sure wrapping that in a > > simple estimated_rows() function would be easy enough to do. > > This is a very good approach and it works very well when you are counting > the > entire table, but when you have no control over the WHERE clause, it > doesn't > help. IE: someone puts in a word to look for in a web form. > > From my perspective, this issue is the biggest problem there is when using > Postgres to create web pages, and it is so commonly used, I think that > there > should be a specific way to deal with it so that you don't have to run the > same WHERE clause twice. > IE: SELECT count(*) FROM <table> WHERE <condition>; to get the total amount of > items to make page navigation links, then: > SELECT <columns> FROM <table> WHERE <condition> LIMIT <items_per_page> OFFSET > <(page_no-1)*items_per_page>; to get the actual page contents. > > How about select * from (select *, count(*) over () as total_count from <table> where <condition>) a limit <items_per_page> offset <(page_no-1)*items_per_page>? It will return a total_count column with the same value in every row. You may have problems if no rows are returned (e.g. the page number is too high). -- Best regards, Vitalii Tymchyshyn
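A concrete version of that single-query approach. The table and search condition below are hypothetical, since the thread only gives placeholders:

```sql
-- One round trip returns both the requested page and the total row
-- count; total_count repeats the same value on every returned row.
SELECT *
  FROM (SELECT t.*,
               count(*) OVER () AS total_count
          FROM items t
         WHERE t.title ILIKE '%search-term%') a
 ORDER BY a.id
 LIMIT 50            -- items_per_page
OFFSET 100;          -- (page_no - 1) * items_per_page, for page_no = 3
```

As the message notes, when the page number is past the end of the result set, no rows come back at all, so the total count is lost in that case and needs a fallback query.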
[PERFORM] Exists, limit and alternate plans
Hello. Today I've found out strange results for query below. select version(); version -- PostgreSQL 8.4.2 on amd64-portbld-freebsd8.0, compiled by GCC cc (GCC) 4.2.1 20070719 [FreeBSD], 64-bit --Original query: explain analyze select exists(select * from investor i where i.company_id = this_.id) from COMPANY this_ order by this_.rank desc, this_.id asc limit 10; Limit (cost=0.00..50.67 rows=10 width=16) (actual time=144.489..144.556 rows=10 loops=1) -> Index Scan using comp_rank_id on company this_ (cost=0.00..34616009.08 rows=6831169 width=16) (actual time=144.484..144.524 rows=10 loops=1) SubPlan 1 -> Index Scan using company_invs on investor i (cost=0.00..9.52 rows=2 width=0) (never executed) Index Cond: ((company_id)::bigint = $0) SubPlan 2 -> Seq Scan on investor i (cost=0.00..1836.17 rows=41717 width=8) (actual time=0.006..72.364 rows=41722 loops=1) Total runtime: 144.975 ms (8 rows) --set enable_seqscan=false; explain analyze select exists(select * from investor i where i.company_id = this_.id) from COMPANY this_ order by this_.rank desc, this_.id asc limit 10; Limit (cost=0.00..50.67 rows=10 width=16) (actual time=0.045..0.177 rows=10 loops=1) -> Index Scan using comp_rank_id on company this_ (cost=0.00..34616009.08 rows=6831169 width=16) (actual time=0.041..0.146 rows=10 loops=1) SubPlan 1 -> Index Scan using company_invs on investor i (cost=0.00..9.52 rows=2 width=0) (actual time=0.007..0.007 rows=1 loops=10) Index Cond: ((company_id)::bigint = $0) SubPlan 2 -> Seq Scan on investor i (cost=100.00..1001836.17 rows=41717 width=8) (never executed) Total runtime: 0.253 ms (8 rows) --limit inside exists explain analyze select exists(select * from investor i where i.company_id = this_.id limit 1) from COMPANY this_ order by this_.rank desc, this_.id asc limit 10; Limit (cost=0.00..50.67 rows=10 width=16) (actual time=0.052..0.219 rows=10 loops=1) -> Index Scan using comp_rank_id on company this_ (cost=0.00..34616009.08 rows=6831169 width=16) (actual 
time=0.049..0.189 rows=10 loops=1) SubPlan 1 -> Limit (cost=0.00..4.76 rows=1 width=422) (actual time=0.011..0.011 rows=1 loops=10) -> Index Scan using company_invs on investor i (cost=0.00..9.52 rows=2 width=422) (actual time=0.007..0.007 rows=1 loops=10) Index Cond: ((company_id)::bigint = $0) Total runtime: 0.291 ms (7 rows) So, my questions: 1) Do we really have alternative plans for a SubPlan that are selected at runtime? Wow. 2) Why is the "Seq Scan" plan selected by default? Is it because the outer limit is not applied when calculating costs for subplans at runtime? 3) Why does the limit inside exists help? Is it simply because the new "alternative" logic is not applied in the "complex" case? -- Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Slow Bulk Delete
2010/5/17 Jon Nelson > > On Mon, May 17, 2010 at 5:10 AM, Pierre C wrote: > > - or use a JOIN delete with a virtual VALUES table > > - or fill a temp table with ids and use a JOIN DELETE > > What is a virtual VALUES table? Can you give me an example of using a > virtual table with selects, joins, and also deletes? > > > delete from a using (values (1),(2),(5),(8)) b(x) where a.id=b.x See http://www.postgresql.org/docs/8.4/static/sql-values.html -- Best regards, Vitalii Tymchyshyn
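The same idea extended to a temp table, for cases where the id list is too large for an inline VALUES clause. The original message only shows the VALUES form; the temp-table names below are hypothetical:

```sql
-- VALUES form, as in the message:
DELETE FROM a USING (VALUES (1),(2),(5),(8)) b(x) WHERE a.id = b.x;

-- Temp-table form for many ids: fill it, analyze it, then JOIN-delete.
CREATE TEMP TABLE ids_to_delete (id int PRIMARY KEY) ON COMMIT DROP;
INSERT INTO ids_to_delete VALUES (1), (2), (5), (8);  -- or COPY in bulk
ANALYZE ids_to_delete;   -- give the planner row counts for the join
DELETE FROM a USING ids_to_delete d WHERE a.id = d.id;
```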
Re: [PERFORM] Benchmark with FreeBSD 8.0 and pgbench
Hi. Not strictly connected to your tests, but: as for ZFS, our experience is that it degrades over time after random updates, because files become non-linear and sequential reads turn into random ones. There are also questions about the ZFS block size: setting it to 8K makes the first problem worse, while setting it to higher values means that an 8K write needs a read to recreate the whole block in its new place. Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first
On 16 April 2010 17:19, Tom Lane wrote: > Віталій Тимчишин writes: > > I've thought and someone in this list've told me that this should be done > > automatically. > > As was pointed out, even if we had such logic it wouldn't apply in this > example, because the equality conditions aren't real equalities but > OUTER JOIN conditions. > > In this case you can copy the condition into the ON clause rather than the WHERE clause, and it works correctly, e.g. "select something from a join b on a.x=b.y where a.x > n" <=> "select something from a join b on a.x=b.y and b.y > n where a.x > n". As for making the planner more clever, maybe it is possible to introduce a division into "fast queries" and "long queries", so that if the cost after fast planning is greater than some configurable threshold, advanced planning techniques (or settings) are used. As far as I have seen on this list, many techniques are not used simply because they are too complex and could make planning take too much time for really fast queries, yet they are vital for long ones. The same (or a similar) threshold could also be used to enable replanning for each run of a prepared query; another frequent complaint is that a prepared query's plan is not as fast as it could be. -- Best regards, Vitalii Tymchyshyn
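The ON-clause rewrite described in this reply, spelled out with the message's own placeholder names (a, b, x, y, n):

```sql
-- Slow: the bound on a.x is not propagated to b.y in this situation.
SELECT something
  FROM a JOIN b ON a.x = b.y
 WHERE a.x > n;

-- Fast: copy the bound into the join condition explicitly, so the
-- scan of b can use an index range on y instead of reading everything.
SELECT something
  FROM a JOIN b ON a.x = b.y AND b.y > n
 WHERE a.x > n;
```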
Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first
On 16 April 2010 16:21, Yeb Havinga wrote: > Віталій Тимчишин wrote: > >> >> BTW: Changing slow query to inner joins do not make it fast >> > I'm interested to see the query and plan of the slow query with inner joins. > > > Here you are. The query: select * from company this_ inner join company_tag this_1_ on this_.id=this_1_.company_id inner join company_measures companymea2_ on this_.id=companymea2_.company_id inner join company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>5000 order by this_.id asc limit 1000 ; Total runtime: 14088.942 ms (plan is attached) Best regards, Vitalii Tymchyshyn Limit (cost=227.15..883.22 rows=1000 width=1209) (actual time=14062.106..14087.375 rows=1000 loops=1) -> Merge Join (cost=227.15..4355277.70 rows=6638046 width=1209) (actual time=14062.101..14084.577 rows=1000 loops=1) Merge Cond: (this_.id = (this_1_.company_id)::bigint) -> Merge Join (cost=185.14..4025978.59 rows=6757358 width=1193) (actual time=10692.975..10708.923 rows=1054 loops=1) Merge Cond: ((companymea2_.company_id)::bigint = this_.id) -> Merge Join (cost=0.00..1784574.44 rows=6821672 width=570) (actual time=0.111..9138.804 rows=1097096 loops=1) Merge Cond: ((companymea2_.company_id)::bigint = (ces3_.company_id)::bigint) -> Index Scan using pk_comp_ms on company_measures companymea2_ (cost=0.00..456350.36 rows=6821672 width=68) (actual time=0.066..1747.291 rows=1097096 loops=1) -> Index Scan using cws_company_descr_unique on company_descr ces3_ (cost=0.00..1225899.00 rows=6821672 width=502) (actual time=0.033..1822.085 rows=1097096 loops=1) -> Index Scan using pk_comp_m on company this_ (cost=0.00..2139324.53 rows=6886598 width=623) (actual time=0.026..1.853 rows=1054 loops=1) Index Cond: (this_.id > 5000) -> Index Scan using company_tag_company_id_idx on company_tag this_1_ (cost=0.00..229167.56 rows=6765004 width=16) (actual time=0.028..1859.967 rows=1075634 loops=1) Filter: ((this_1_.tag_id)::bigint = 7) Total runtime: 14088.942
ms -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first
On 16 April 2010 11:25, Hannu Krosing wrote: > On Fri, 2010-04-16 at 11:02 +0300, Віталій Тимчишин wrote: > > Hello. > > > > > > I have a query that performs very poor because there is a limit on > > join column that is not applied to other columns: > > > > > > select * from company this_ left outer join company_tag this_1_ on > > this_.id=this_1_.company_id left outer join company_measures > > companymea2_ on this_.id=companymea2_.company_id left outer join > > company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id > > = 7 and this_.id>5000 > > and this_1_.company_id>5000 > > order by this_.id asc limit 1000; > > > > > > (plan1.txt) > > Total runtime: 7794.692 ms > > > > > > At the same time if I apply the limit (>5000) to other columns in > > query itself it works like a charm: > > > > > > select * from company this_ left outer join company_tag this_1_ on > > this_.id=this_1_.company_id left outer join company_measures > > companymea2_ on this_.id=companymea2_.company_id left outer join > > company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id > > = 7 and this_.id>5000 > > and this_1_.company_id>5000 > > and companymea2_.company_id>5000 and ces3_.company_id>5000 > > order by this_.id asc limit 1000; > > The queries are not the same.
> > The 2nd variant will not return the rows where there are no matching rows > in this_1_, companymea2_ or ces3_.company_id > > A query equivalent to the first one would be: > > > select * from company this_ > left outer join company_tag this_1_ > on (this_.id=this_1_.company_id >and this_1_.company_id>5000) > left outer join company_measures companymea2_ > on (this_.id=companymea2_.company_id >and companymea2_.company_id>5000) > left outer join company_descr ces3_ > on (this_.id=ces3_.company_id >and ces3_.company_id>5000) > where this_1_.tag_id = 7 > and this_.id>5000 > order by this_.id asc > limit 1000; > And it's still fast (see the plan in another mail), while the "inner join" variant of the original query is still slow. > > > I'm not sure that planner considers the above form of plan rewrite, nor > that it would make much sense to do so unless there was a really small > number of rows where x_.company_id>5000 > > Actually no: "select id > 5000, count(*) from company group by 1" gives f,1096042 and t,5725630. I don't know why the planner wants to perform a few merges of 1000 rows against a million records (and the merges are what takes the time) instead of taking 1000 records from the main table and then doing a nested loop. And it must read all the records that DO NOT match the criteria for the secondary tables before getting to the correct records if it does not filter the secondary tables with an index on retrieval.
set enable_mergejoin=false helps the original query, but that is a separate problem; the first solution is simpler and could be applied by the planner automatically, while the second requires rethinking/rewriting the LIMIT estimation logic (plan of the nested loop attached) Limit (cost=0.00..2369.36 rows=1000 width=1209) (actual time=0.179..41.155 rows=1000 loops=1) -> Nested Loop (cost=0.00..15727940.41 rows=6638046 width=1209) (actual time=0.174..38.312 rows=1000 loops=1) -> Nested Loop (cost=0.00..11165483.75 rows=6701224 width=1141) (actual time=0.134..26.421 rows=1000 loops=1) -> Nested Loop (cost=0.00..5763844.03 rows=6765004 width=639) (actual time=0.066..14.389 rows=1000 loops=1) -> Index Scan using pk_comp_m on company this_ (cost=0.00..1152936.77 rows=6886598 width=623) (actual time=0.038..1.908 rows=1054 loops=1) Index Cond: (id > 5000) -> Index Scan using company_tag_company_id_idx on company_tag this_1_ (cost=0.00..0.66 rows=1 width=16) (actual time=0.004..0.006 rows=1 loops=1054) Index Cond: ((this_1_.company_id)::bigint = this_.id) Filter: ((this_1_.tag_id)::bigint = 7) -> Index Scan using cws_company_descr_unique on company_descr ces3_ (cost=0.00..0.79 rows=1 width=502) (actual time=0.004..0.006 rows=1 loops=1000) Index Cond: ((ces3_.company_id)::bigint = this_.id) -> Index Scan using pk_comp_ms on company_measures companymea2_ (cost=0.00..0.67 rows=1 width=68) (actual time=0.004..0.006 rows=1 loops=1000) Index Cond: ((companymea2_.company_id)::bigint = this_.id) Total runtime: 42.940 ms
Re: [PERFORM] Planner not using column limit specified for one column for another column equal to first
On 16 April 2010 11:31, Yeb Havinga wrote: > Віталій Тимчишин wrote: > >> Hello. >> >> I have a query that performs very poor because there is a limit on join >> column that is not applied to other columns: >> >> select * from company this_ left outer join company_tag this_1_ on >> this_.id=this_1_.company_id left outer join company_measures companymea2_ on >> this_.id=companymea2_.company_id left outer join company_descr ces3_ on >> this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>5000 and >> this_1_.company_id>5000 >> order by this_.id asc limit 1000; >> >> (plan1.txt) >> Total runtime: 7794.692 ms >> >> At the same time if I apply the limit (>5000) to other columns in >> query itself it works like a charm: >> >> select * from company this_ left outer join company_tag this_1_ on >> this_.id=this_1_.company_id left outer join company_measures companymea2_ on >> this_.id=companymea2_.company_id left outer join company_descr ces3_ on >> this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>5000 and >> this_1_.company_id>5000 >> and companymea2_.company_id>5000 and ces3_.company_id>5000 >> order by this_.id asc limit 1000; >> >> (plan2.txt) >> Total runtime: 27.547 ms >> >> I've thought and someone in this list've told me that this should be done >> automatically. >> > Yes, if you have in a query a=b and b=c, then the optimizer figures out > that a=c as well. (a,b and c are then member of the same equivalence class). > > However both queries are not the same, since the joins you're using are > outer joins. In the first it's possible that records are returned for > company records with no matching ces3_ records, the ces3_ records is null in > that case. In the second query no NULL ces3_ information may be returned.
> OK, but when I move the limit to the join condition the query is still fast: select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id left outer join company_measures companymea2_ on this_.id=companymea2_.company_id and companymea2_.company_id>5000 left outer join company_descr ces3_ on this_.id=ces3_.company_id and ces3_.company_id>5000 where this_1_.tag_id = 7 and this_.id>5000 and this_1_.company_id>5000 order by this_.id asc limit 1000; (plan3.txt), Total runtime: 26.327 ms BTW: Changing the slow query to inner joins does not make it fast > > Another thing is it seems that the number of rows guessed is far off from > the actual number of rows, is the number 500 artificial or are your > statistics old or too small histogram/mcv's? > Nope, I suppose this is because of the limit. If I remove the limit, the estimations are quite correct. There are ~6 million rows in each table. Limit (cost=293.40..1028.60 rows=1000 width=1209) (actual time=0.128..24.751 rows=1000 loops=1) -> Merge Left Join (cost=293.40..4197731.11 rows=5709243 width=1209) (actual time=0.124..21.968 rows=1000 loops=1) Merge Cond: (this_.id = (companymea2_.company_id)::bigint) -> Merge Left Join (cost=246.59..3681230.10 rows=5709243 width=1141) (actual time=0.099..15.284 rows=1000 loops=1) Merge Cond: (this_.id = (ces3_.company_id)::bigint) -> Merge Join (cost=37.87..2435536.00 rows=5709243 width=639) (actual time=0.074..8.487 rows=1000 loops=1) Merge Cond: (this_.id = (this_1_.company_id)::bigint) -> Index Scan using pk_comp_m on company this_ (cost=0.00..2139324.53 rows=6886598 width=623) (actual time=0.037..1.860 rows=1054 loops=1) Index Cond: (id > 5000) -> Index Scan using company_tag_company_id_idx on company_tag this_1_ (cost=0.00..207678.85 rows=5709243 width=16) (actual time=0.027..1.758 rows=1000 loops=1) Index Cond: ((this_1_.company_id)::bigint > 5000) Filter: ((this_1_.tag_id)::bigint = 7) -> Index Scan using cws_company_descr_unique on company_descr ces3_
(cost=0.00..1169511.92 rows=5757068 width=502) (actual time=0.020..1.788 rows=1054 loops=1) Index Cond: ((ces3_.company_id)::bigint > 5000) -> Index Scan using pk_comp_ms on company_measures companymea2_ (cost=0.00..440945.79 rows=565 width=68) (actual time=0.019..1.729 rows=1054 loops=1) Index Cond: ((companymea2_.company_id)::bigint > 5000) Total runtime: 26.327 ms
[PERFORM] Planner not using column limit specified for one column for another column equal to first
Hello. I have a query that performs very poorly because there is a limit on the join column that is not applied to the other columns: select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id left outer join company_measures companymea2_ on this_.id=companymea2_.company_id left outer join company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>5000 and this_1_.company_id>5000 order by this_.id asc limit 1000; (plan1.txt) Total runtime: 7794.692 ms At the same time if I apply the limit (>5000) to the other columns in the query itself it works like a charm: select * from company this_ left outer join company_tag this_1_ on this_.id=this_1_.company_id left outer join company_measures companymea2_ on this_.id=companymea2_.company_id left outer join company_descr ces3_ on this_.id=ces3_.company_id where this_1_.tag_id = 7 and this_.id>5000 and this_1_.company_id>5000 and companymea2_.company_id>5000 and ces3_.company_id>5000 order by this_.id asc limit 1000; (plan2.txt) Total runtime: 27.547 ms I've thought, and someone on this list has told me, that this should be done automatically. But I have a pretty recent server: PostgreSQL 8.4.2 on amd64-portbld-freebsd8.0, compiled by GCC cc (GCC) 4.2.1 20070719 [FreeBSD], 64-bit and it still does not work. Do I misunderstand something, or does this feature not work for such a query?
Best regards, Vitalii Tymchyshyn Limit (cost=497.46..1464.50 rows=1000 width=693) (actual time=7767.721..7793.047 rows=1000 loops=1) -> Merge Left Join (cost=497.46..5521612.64 rows=5709243 width=693) (actual time=7767.717..7790.274 rows=1000 loops=1) Merge Cond: (this_.id = (companymea2_.company_id)::bigint) -> Merge Left Join (cost=404.31..4544508.54 rows=5709243 width=625) (actual time=4211.501..4227.215 rows=1000 loops=1) Merge Cond: (this_.id = (ces3_.company_id)::bigint) -> Merge Join (cost=37.87..2435536.00 rows=5709243 width=123) (actual time=0.069..8.584 rows=1000 loops=1) Merge Cond: (this_.id = (this_1_.company_id)::bigint) -> Index Scan using pk_comp_m on company this_ (cost=0.00..2139324.53 rows=6886598 width=107) (actual time=0.031..1.876 rows=1054 loops=1) Index Cond: (id > 5000) -> Index Scan using company_tag_company_id_idx on company_tag this_1_ (cost=0.00..207678.85 rows=5709243 width=16) (actual time=0.027..1.757 rows=1000 loops=1) Index Cond: ((this_1_.company_id)::bigint > 5000) Filter: ((this_1_.tag_id)::bigint = 7) -> Index Scan using company_descrs on company_descr ces3_ (cost=0.00..2073526.89 rows=1996612 width=502) (actual time=0.014..2576.013 rows=1097096 loops=1) -> Index Scan using pk_comp_ms on company_measures companymea2_ (cost=0.00..889427.81 rows=6821672 width=68) (actual time=0.020..1946.255 rows=1097096 loops=1) Total runtime: 7794.692 ms Limit (cost=330.57..4888.97 rows=1000 width=693) (actual time=0.125..26.011 rows=1000 loops=1) -> Merge Join (cost=330.57..5260651.29 rows=1153986 width=693) (actual time=0.122..23.215 rows=1000 loops=1) Merge Cond: (this_.id = (this_1_.company_id)::bigint) -> Merge Join (cost=292.71..5023728.99 rows=1391960 width=677) (actual time=0.090..16.615 rows=1054 loops=1) Merge Cond: ((companymea2_.company_id)::bigint = this_.id) -> Merge Join (cost=0.00..2846769.89 rows=1685017 width=570) (actual time=0.063..9.534 rows=1054 loops=1) Merge Cond: ((companymea2_.company_id)::bigint = 
(ces3_.company_id)::bigint) -> Index Scan using pk_comp_ms on company_measures companymea2_ (cost=0.00..848312.00 rows=565 width=68) (actual time=0.033..1.973 rows=1054 loops=1) Index Cond: ((company_id)::bigint > 5000) -> Index Scan using company_descrs on company_descr ces3_ (cost=0.00..1963172.96 rows=1685017 width=502) (actual time=0.020..2.152 rows=1054 loops=1) Index Cond: ((ces3_.company_id)::bigint > 5000) -> Index Scan using pk_comp_m on company this_ (cost=0.00..2139324.53 rows=6886598 width=107) (actual time=0.022..1.855 rows=1054 loops=1) Index Cond: (this_.id > 5000) -> Index Scan using company_tag_company_id_idx on company_tag this_1_ (cost=0.00..207678.85 rows=5709243 width=16) (actual time=0.025..1.742 rows=1000 loops=1) Index Cond: ((this_1_.company_id)::bigint > 5000) Filter: ((this_1_.tag_id)::bigint = 7) Total runtime: 27.547 ms
Re: [PERFORM] Deferred constraint and delete performance
2010/2/10 Tom Lane > Franck Routier writes: > > I am wondering if deferring foreign key constraints (instead of > > disabling them) would increase performance, compared to non deferred > > constraints > > No, it wouldn't make any noticeable difference AFAICS. It would > postpone the work from end-of-statement to end-of-transaction, > but not make the work happen any more (or less) efficiently. > > What about disk access? Wouldn't working with one table, then the other, be faster than working with both at the same time?
Re: [PERFORM] Queries within a function
2010/2/2 Mridula Mahadevan > Hi, > > I am running a bunch of queries within a function, creating some temp > tables and populating them. When the data exceeds say, 100k the queries > start getting really slow and timeout (30 min). when these are run outside > of a transaction(in auto commit mode), they run in a few seconds. Any ideas > on what may be going on and any postgresql.conf parameters etc that might > help? > > Thanks > Have you tried analyzing the temp tables after you've populated them? Because AFAIK PostgreSQL won't do it automatically for tables that are created, filled and then used in the same transaction.
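A minimal sketch of that advice inside a single transaction (table and column names are hypothetical):

```sql
-- Inside one transaction (e.g. a plpgsql function body), autoanalyze
-- has no chance to see the freshly filled temp table, so analyze it
-- manually before running the heavy queries against it.
CREATE TEMP TABLE tmp_candidates ON COMMIT DROP AS
    SELECT id, payload
      FROM big_table
     WHERE state = 'new';

ANALYZE tmp_candidates;  -- give the planner real row counts and stats

UPDATE target t
   SET payload = c.payload
  FROM tmp_candidates c
 WHERE t.id = c.id;
```

Without the ANALYZE, the planner falls back to default estimates for tmp_candidates, which is a common cause of nested-loop plans that collapse once the table grows past a few thousand rows.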
Re: [PERFORM] Constraint propagating for equal fields
On 30 January 2010 04:30, Greg Stark wrote: > 2010/1/28 Віталій Тимчишин > > > > I've always thought that PostgreSQL would propagate constraint from > field1 to field2 if condition says field1=field2, but this does not seem the > case: > > version? > > PostgreSQL 8.3.7 on amd64-portbld-freebsd7.2, compiled by GCC cc (GCC) 4.2.1 20070719 [FreeBSD]
[PERFORM] Constraint propagating for equal fields
Hello. I've always thought that PostgreSQL would propagate constraint from field1 to field2 if condition says field1=field2, but this does not seem the case: dict=# explain select * from domain_list,title.domains where processed_at is not null and key=groupid and key < 100 and groupid < 100; QUERY PLAN -- Hash Join (cost=2179918.87..4529994.61 rows=4616 width=318) Hash Cond: (domain_list.key = domains.groupid) -> Bitmap Heap Scan on domain_list (cost=26253.02..2310541.55 rows=870759 width=123) Recheck Cond: (key < 100) -> Bitmap Index Scan on domain_list_new_pkey (cost=0.00..26035.33 rows=870759 width=0) Index Cond: (key < 100) -> Hash (cost=2119232.34..2119232.34 rows=864201 width=195) -> Bitmap Heap Scan on domains (cost=16674.34..2119232.34 rows=864201 width=195) Recheck Cond: (groupid < 100) Filter: (processed_at IS NOT NULL) -> Bitmap Index Scan on dgroup (cost=0.00..16458.29 rows=890154 width=0) Index Cond: (groupid < 100) (12 rows) dict=# explain select * from domain_list,title.domains where processed_at is not null and key=groupid and key < 100 ; QUERY PLAN Hash Join (cost=2337583.04..18222634.81 rows=845372 width=318) Hash Cond: (domains.groupid = domain_list.key) -> Seq Scan on domains (cost=0.00..5423788.20 rows=158280964 width=195) Filter: (processed_at IS NOT NULL) -> Hash (cost=2310541.55..2310541.55 rows=870759 width=123) -> Bitmap Heap Scan on domain_list (cost=26253.02..2310541.55 rows=870759 width=123) Recheck Cond: (key < 100) -> Bitmap Index Scan on domain_list_new_pkey (cost=0.00..26035.33 rows=870759 width=0) Index Cond: (key < 100) (9 rows) dict=# explain select * from domain_list,title.domains where processed_at is not null and key=groupid and groupid < 100; QUERY PLAN Hash Join (cost=2153665.85..16943819.35 rows=862710 width=318) Hash Cond: (domain_list.key = domains.groupid) -> Seq Scan on domain_list (cost=0.00..6887257.54 rows=162753054 width=123) -> Hash (cost=2119232.34..2119232.34 rows=864201 width=195) -> Bitmap Heap Scan on domains 
(cost=16674.34..2119232.34 rows=864201 width=195) Recheck Cond: (groupid < 100) Filter: (processed_at IS NOT NULL) -> Bitmap Index Scan on dgroup (cost=0.00..16458.29 rows=890154 width=0) Index Cond: (groupid < 100) (9 rows) The first query is the fastest one, but it is equivalent to both 2 and 3, and I thought PostgreSQL could perform such propagation by itself. Best regards, Vitalii Tymchyshyn.
Re: [PERFORM] Should the optimiser convert a CASE into a WHERE if it can?
On 27 January 2010 19:01, Matthew Wakeling wrote: > On Wed, 27 Jan 2010, Віталій Тимчишин wrote: > >> How about SELECT SUM (case when id > 120 and id < 121 then 1 end) >> from tbl_tracker; >> > > That is very interesting. > > > * All the functions should be noop for null input >> > > Alas, not true for COUNT(*), AVG(), etc. > > select avg(b), count(b), count(*) from (values (2),(null))a(b) gives (2.0, 1, 2) for me, so AVG is in the game. Sure, it won't work for count(*), but the optimizer already knows which aggregates are strict and which are not, so no new information is needed. Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Should the optimiser convert a CASE into a WHERE if it can?
2010/1/26 Matthew Wakeling > On Tue, 26 Jan 2010, Richard Neill wrote: > >> SELECT SUM (case when id > 120 and id < 121 then 1 else 0 end) >> from tbl_tracker; >> >> Explain shows that this does a sequential scan. >> > > I'd defer to Tom on this one, but really, for Postgres to work this out, it > would have to peer deep into the mysterious SUM function, and realise that > the number zero is a noop. I suppose it would be possible, but you'd have to > define noops for each of the different possible functions, *and* make the > planner clever enough to spot the noop-matching number in the else and > convert the WHEN into a WHERE. > > Hello. How about SELECT SUM (case when id > 120 and id < 121 then 1 end) from tbl_tracker; It gives the same result (except perhaps when there are no records at all), and the optimizer already knows it need not call the function for null input. Such an optimization would cover many more cases. It would look like: * Check only for aggregate subselects * All the functions should be noops for null input * Add an ORed constraint that every function input is not null (in this example (case when id > A1 and id < B1 then 1 end is not null) or (case when id > A2 and id < B2 then 1 end is not null) or ... or (case when id > An and id < Bn then 1 end is not null) * Know the special case (case when id > A1 and id < B1 then 1 end is not null) <=> (id > A1 and id < B1) by ORing all the "when" conditions: case when C1 then D1 when C2 then D2 ... when Cm then Dm end is not null <=> C1 or C2 or ... or Cm. Even without the last part it could give benefits, e.g. for "select count(field) from table" transformed into "select count(field) from table where field is not null" and using [partial] indexes. As for the last point, replacing COUNT with SUM(CASE()) is used often enough when multiple count calculations are needed. Best regards, Vitalii Tymchyshyn
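Side by side, the variants discussed in this thread (tbl_tracker is the table from the quoted messages; only the last form exposes the condition as a WHERE the planner can drive with an index):

```sql
-- Original: the ELSE 0 pushes every row through the aggregate.
SELECT SUM(CASE WHEN id > 120 AND id < 121 THEN 1 ELSE 0 END) FROM tbl_tracker;

-- Dropping ELSE yields NULL for non-matching rows; SUM ignores NULL
-- inputs, so the result is the same except on an empty/non-matching
-- set (NULL instead of 0).
SELECT SUM(CASE WHEN id > 120 AND id < 121 THEN 1 END) FROM tbl_tracker;

-- Hand-rewritten form the thread wishes the planner produced itself:
SELECT count(*) FROM tbl_tracker WHERE id > 120 AND id < 121;
```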
Re: [PERFORM] View based upon function won't use index on joins
On 20 November 2009 at 17:01, Jonathan Foy wrote: > This seems to result in the same problem; should I attempt to pull for a > specific id_nbr/id_qfr, postgres uses the index without a problem. If I try > to join the two tables/views however, it insists on doing a sequential scan > (actually two in this case) and will not use the index. Any other > ideas/explanations? > Have you tried doing the same join when not using the views or converting columns into records? Maybe the problem is not in the conversion, but in something simpler, like statistics or index bloat? Best regards, Vitalii Tymchyshyn
Re: [PERFORM] View based upon function won't use index on joins
How about CREATE OR REPLACE VIEW value_codes_view AS select * from ( SELECT value_codes.id_nbr, value_codes.id_qfr, (ARRAY[val_1_cd_1, ... , val_2_cd_12])[i] as value_code, (ARRAY[val_1_amt_1, ... , val_2_amt_12])[i] as value_amount FROM value_codes, generate_series(1,24) i) a where value_code is not null and value_code != ''; ?
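The view above unpivots 24 (code, amount) column pairs into rows via the ARRAY[...][i] / generate_series trick. A minimal sketch of the same unpivot in plain Python; the column names are illustrative stand-ins, not the poster's real schema:

```python
# Unpivot a wide row of paired columns into (value_code, value_amount) rows,
# keeping only pairs where the code is present -- mirroring the view's
# WHERE value_code IS NOT NULL AND value_code != '' filter.
def unpivot(row, code_cols, amount_cols):
    for code_col, amount_col in zip(code_cols, amount_cols):
        code, amount = row[code_col], row[amount_col]
        if code not in (None, ""):
            yield code, amount

# Hypothetical row with three of the paired columns filled in.
row = {"val_1_cd_1": "A1", "val_1_amt_1": 10.0,
       "val_1_cd_2": None, "val_1_amt_2": None,
       "val_2_cd_1": "B1", "val_2_amt_1": 20.0}
codes = ["val_1_cd_1", "val_1_cd_2", "val_2_cd_1"]
amounts = ["val_1_amt_1", "val_1_amt_2", "val_2_amt_1"]
print(list(unpivot(row, codes, amounts)))  # [('A1', 10.0), ('B1', 20.0)]
```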
Re: [PERFORM] Modeling a table with arbitrary columns
2009/10/29 Andreas Hartmann > Hi everyone, > > I want to model the following scenario for an online marketing application: > > Users can create mailings. The list of recipients can be uploaded as > spreadsheets with arbitrary columns (each row is a recipient). I expect the > following maximum quantities the DB will contain: > > I see basically two approaches to store the recipients: > > A) A single table with a fixed number of generic columns. If the > spreadsheet has less columns than the table, the values will be null. > > B) Two tables, one for the recipients and one for the values: > One more option is to use arrays (and a single table).
Re: [PERFORM] Getting a random row
2009/10/14 Scott Marlowe > On Wed, Oct 14, 2009 at 1:20 AM, Pavel Stehule > wrote: > > 2009/10/14 Thom Brown : > >> 2009/10/14 Scott Marlowe : > >> Why not just do something like: > >> > >> SELECT thisfield, thatfield > >> FROM my_table > >> WHERE thisfield IS NOT NULL > >> ORDER BY RANDOM() > >> LIMIT 1; > >> > > > > this works well on small tables. On large tables this query is extremely > slow. > > Exactly. If you're running that query over and over your "performance > test" is on how well pgsql can run that very query. :) Anything else > you do is likely to be noise by comparison. > > What I often use to get a set of random rows is SELECT thisfield, thatfield FROM my_table WHERE random() < rowsneeded::float8/(select count(*) from my_table); Of course it does not give an exact number of rows, but it is close enough for me. For taking a single row I'd try: select * from ( SELECT thisfield, thatfield FROM my_table WHERE random() < 100.0/(select count(*) from my_table)) a order by random() limit 1 I'd say the probability of returning no rows is quite low, and the query can be extended to return the first row of the table in this rare case.
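The `random() < rowsneeded/total` filter gives a binomially distributed row count centered on the target rather than an exact count. A seeded sketch of the same idea in plain Python:

```python
import random

# Approximate-sample `needed` rows out of len(rows) by keeping each row with
# probability needed/len(rows), as in the WHERE random() < ... query above.
def sample_rows(rows, needed):
    p = needed / len(rows)
    return [r for r in rows if random.random() < p]

random.seed(42)  # fixed seed so the run is repeatable
rows = list(range(100_000))
picked = sample_rows(rows, 100)
print(len(picked))  # close to 100, but not exactly 100
```

With n = 100,000 and p = 0.001 the standard deviation is about 10, so the result lands near the target but almost never on it exactly, which is the "not exact, but close enough" trade-off described above.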
Re: [PERFORM] possible wrong query plan on pg 8.3.5,
Maybe you have very bad disk access times (e.g. slow random access)? In that case everything would be OK while the data is in cache and awful when it is not. Could you check disk IO speed and IO wait while running the slow and the fast query? BTW: In this case, increasing shared buffers may help. At least this will prevent other applications and, AFAIK, sequential scans from evicting your index data from the cache. Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Query tuning
2009/8/19 Kevin Kempter > > We do have an index on url_hits.time > > not sure why timestamps were not used, I was not here for the design phase. > What's the type of the time column? I don't like that explain shows it being cast to double. If it is integer, maybe you need to change and time >= extract ('epoch' from timestamp '2009-08-12') and time < extract ('epoch' from timestamp '2009-08-13' ) to and time >= extract ('epoch' from timestamp '2009-08-12')::int4 and time < extract ('epoch' from timestamp '2009-08-13' )::int4 for the index to be used?
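The point of the ::int4 cast is to compare the integer column against integer epoch bounds, so the half-open range `time >= lo AND time < hi` can use the index. A sketch of computing such day bounds (assuming the stored epochs are UTC, which the original post does not state):

```python
import calendar

# Integer epoch bounds for one calendar day, interpreted as UTC midnight
# to the next midnight; suitable for time >= lo AND time < hi filters
# against an integer epoch column.
def day_bounds(year, month, day):
    lo = calendar.timegm((year, month, day, 0, 0, 0))
    return lo, lo + 86400

lo, hi = day_bounds(2009, 8, 12)
print(lo, hi)  # 1250035200 1250121600
```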
Re: [PERFORM] Can Postgres use an INDEX over an OR?
On 27 July 2009 at 17:18, Tom Lane wrote: > Віталій Тимчишин writes: > > Actually what I am talking about is to make OR with UNION (or UNION-like > > because it's a little different depending on input rows uniqueness) as an > > option. All of OR parts can use/not use different strategies (including > > multiple different indexes or hash joins). > > AFAICS you're proposing re-inventing the old implementation of OR'd > indexscans. We took that out when we added bitmap scans because it > didn't have any performance advantage over BitmapOr. > It's not tied to indexscans at all. Different parts can do (as in UNION) totally different strategies - e.g. perform two hash joins, or a merge join for one part and a nested loop for another, or ... As for performance - see above in this thread. UNION now often provides much better performance when different parts of the OR expression involve different additional tables.
Re: [PERFORM] Can Postgres use an INDEX over an OR?
On 27 July 2009 at 15:02, Robert Haas wrote: > > The problem, though, is that it won't ALWAYS be right to implement OR > using UNION, so you have to have some way of deciding which is better. > That's easy: you propose both plans to the planner and it's up to it to decide. Yes, it can decide the wrong way, but then we are back to the statistics problem. At least one can tune costs and enable_ settings. Right now one has to rewrite the query, which may be impossible or too complex.
Re: [PERFORM] Can Postgres use an INDEX over an OR?
On 27 July 2009 at 13:53, Robert Haas wrote: > > Hmm. What you're suggesting here is that we could consider > implementing OR conditions by rescanning the inner side for each index > qual and then unique-ifying the results on the index column. That's > probably possible, but it doesn't sound easy, especially since our > selectivity-estimation code for OR conditions is not very good, so we > might choose to do it this way when that's not actually the best plan. > > ...Robert > Actually what I am talking about is to make OR with UNION (or UNION-like, because it's a little different depending on input rows uniqueness) an option. All of the OR parts can use or not use different strategies (including multiple different indexes or hash joins). In cases when conditions are complex this can drastically increase performance by beating a sequential scan. As for selectivity, I'd say this is a general problem - sometimes it is estimated OK, sometimes not, but this should not prevent trying different plans. (From my current work: it makes wrong estimates of filter selectivity, introduces a HASH join and kills the server with OOM). Best regards, Vitaliy Tymchyshyn.
Re: [PERFORM] Can Postgres use an INDEX over an OR?
On 20 July 2009 at 11:02, Chris wrote: > Віталій Тимчишин wrote: > >> >> >> 2009/7/20 Robert James > srobertja...@gmail.com>> >> >> >>Hi. I notice that when I do a WHERE x, Postgres uses an index, and >>when I do WHERE y, it does so as well, but when I do WHERE x OR y, >>it doesn't. Why is this so? >> >> It's not clever enough. >> > > Of course it is - for simple cases > > I'm running 8.3.7. > > create table t1(id int primary key); > insert into t1(id) select a from generate_series(1, 50) as s(a); > analyze t1; > explain analyze select * from t1 where id < 1 "Index Scan using t1_pkey on t1 (cost=0.00..322.51 rows=9612 width=4) (actual time=0.030..3.700 rows= loops=1)" " Index Cond: (id < 1)" "Total runtime: 4.835 ms" explain analyze select * from t1 where id in (select (random() * 50)::int4 from generate_series(0,10)) "Nested Loop (cost=32.50..1341.49 rows=200 width=4) (actual time=15.353..67.014 rows=11 loops=1)" " -> HashAggregate (cost=32.50..34.50 rows=200 width=4) (actual time=0.028..0.043 rows=11 loops=1)" "-> Function Scan on generate_series (cost=0.00..20.00 rows=1000 width=0) (actual time=0.014..0.020 rows=11 loops=1)" " -> Index Scan using t1_pkey on t1 (cost=0.00..6.52 rows=1 width=4) (actual time=6.083..6.084 rows=1 loops=11)" "Index Cond: (t1.id = (((random() * 50::double precision))::integer))" "Total runtime: 67.070 ms" explain analyze select * from t1 where id in (select (random() * 50)::int4 from generate_series(0,10)) or id < 1 "Seq Scan on t1 (cost=22.50..9735.50 rows=254806 width=4) (actual time=0.049..148.947 rows=10010 loops=1)" " Filter: ((hashed subplan) OR (id < 1))" " SubPlan" "-> Function Scan on generate_series (cost=0.00..20.00 rows=1000 width=0) (actual time=0.014..0.019 rows=11 loops=1)" "Total runtime: 150.123 ms" explain analyze select * from t1 where id in (select (random() * 50)::int4 from generate_series(0,10)) union select * from t1 where id < 1 "Unique (cost=2412.68..2461.74 rows=9812 width=4) (actual time=89.190..95.014 
rows=10010 loops=1)" " -> Sort (cost=2412.68..2437.21 rows=9812 width=4) (actual time=89.189..91.167 rows=10010 loops=1)" "Sort Key: public.t1.id" "Sort Method: quicksort Memory: 854kB" "-> Append (cost=32.50..1762.13 rows=9812 width=4) (actual time=16.641..76.338 rows=10010 loops=1)" " -> Nested Loop (cost=32.50..1341.49 rows=200 width=4) (actual time=16.641..70.051 rows=11 loops=1)" "-> HashAggregate (cost=32.50..34.50 rows=200 width=4) (actual time=0.033..0.049 rows=11 loops=1)" " -> Function Scan on generate_series (cost=0.00..20.00 rows=1000 width=0) (actual time=0.020..0.026 rows=11 loops=1)" "-> Index Scan using t1_pkey on t1 (cost=0.00..6.52 rows=1 width=4) (actual time=6.359..6.361 rows=1 loops=11)" " Index Cond: (public.t1.id = (((random() * 50::double precision))::integer))" " -> Index Scan using t1_pkey on t1 (cost=0.00..322.51 rows=9612 width=4) (actual time=0.023..4.075 rows= loops=1)" "Index Cond: (id < 1)" "Total runtime: 112.694 ms" So, as soon as it runs into anything complex, it sadly falls back to a sequential scan.
Re: [PERFORM] Can Postgres use an INDEX over an OR?
2009/7/20 Robert James > > Hi. I notice that when I do a WHERE x, Postgres uses an index, and when I > do WHERE y, it does so as well, but when I do WHERE x OR y, it doesn't. Why > is this so? It's not clever enough. And how can I shut this off? Use UNION/UNION ALL if possible in your case.
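The UNION rewrite suggested above returns the same rows as the OR form: UNION deduplicates rows that match both branches, just as OR counts them once. A quick equivalence check, sketched with Python's built-in sqlite3 rather than PostgreSQL (the set semantics are the same):

```python
import sqlite3

# WHERE a < 10 OR b = 3 versus the UNION of the two indexable conditions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, i % 7) for i in range(100)])
conn.execute("CREATE INDEX t_a ON t(a)")
conn.execute("CREATE INDEX t_b ON t(b)")

with_or = conn.execute(
    "SELECT a, b FROM t WHERE a < 10 OR b = 3 ORDER BY a").fetchall()
with_union = conn.execute(
    "SELECT a, b FROM t WHERE a < 10 "
    "UNION SELECT a, b FROM t WHERE b = 3 ORDER BY a").fetchall()

print(len(with_or), with_or == with_union)
```

UNION ALL is only safe when the branches are known to be disjoint (or duplicates are acceptable); here the row with a = 3 satisfies both branches, so plain UNION is needed.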
Re: [PERFORM] - Slow Query
Sorry, there was an error in my previous letter. On 3 July 2009 at 14:22, Віталій Тимчишин wrote: > > > 2009/7/1 Mike Ivanov > >> >> >> > LEFT outer JOIN ville ON ville.uid = bien.ref_ville >> > LEFT outer JOIN freguesia_ville ON freguesia_ville.ref_ville = ville.uid >> >> This is not enough. You have to add this condition as well: >> >> AND bien.ref_ville = freguesia_ville.ref_ville >> >> In other words, when you link three tables by a common field, all three >> relationships should be explicitly expressed, otherwise you'll have this >> type of explosive row multiplication. >> > > Why so? Isn't changing "freguesia_ville.ref_ville = ville.uid" to > "freguesia_ville.ref_ville = bien.ref_ville" enough (to prevent cases when > ville.uid is null as a result of the join)?
Re: [PERFORM] - Slow Query
2009/7/1 Mike Ivanov > > > > LEFT outer JOIN ville ON ville.uid = bien.ref_ville > > LEFT outer JOIN freguesia_ville ON freguesia_ville.ref_ville = ville.uid > > This is not enough. You have to add this condition as well: > > AND bien.ref_ville = freguesia_ville.ref_ville > > In other words, when you link three tables by a common field, all three > relationships should be explicitly expressed, otherwise you'll have this > type of explosive row multiplication. > Why so? Isn't changing "freguesia_ville.ref_ville = ville.uid" to "freguesia_ville.ref_ville = bien.uid" enough (to prevent cases when ville.uid is null as a result of the join)?
Re: [PERFORM] Pointers needed on optimizing slow SQL statements
I'd prefer ALTER VIEW SET ANALYZE=true; or CREATE/DROP ANALYZE ; Also it should be possible to change the statistics target for the analyzed columns. Such a statement would allow analyzing multi-table correlations. Note that for a view, the planner should be able to use the correlation information even for queries that do not use the view but may benefit from the information.
Re: [PERFORM] Unexpected query plan results
2009/6/2 Robert Haas > On Mon, Jun 1, 2009 at 4:53 PM, Anne Rosset wrote: > >> On Mon, Jun 1, 2009 at 2:14 PM, Anne Rosset wrote: > >>> SELECT SUM(1) FROM item WHERE is_deleted = 'f'; sum - 1824592 > (1 > >>> row) > >>> SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641 > >>> '; sum > > >>> 122412 (1 row) > >>> SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641 > >>> ' AND > is_deleted > >>> = > >>> 'f'; sum - 71 (1 row) > >>> SELECT SUM(1) FROM item WHERE folder_id = 'tracker3641 > >>> ' AND > is_deleted > >>> = > >>> 't'; sum 122341 (1 row) > > > > The item table has 2324829 rows > > So 1824592/2324829 = 78.4% of the rows have is_deleted = false, and > 0.06709% of the rows have the relevant folder_id. Therefore the > planner assumes that there will be 2324829 * 78.4% * 0.06709% =~ > 96,000 rows that satisfy both criteria (the original explain had > 97,000; there's some variability due to the fact that the analyze only > samples a random subset of pages), but the real number is 71, leading > it to make a very bad decision. This is a classic "hidden > correlation" problem, where two columns are correlated but the planner > doesn't notice, and you get a terrible plan. > > Unfortunately, I'm not aware of any real good solution to this > problem. The two obvious approaches are multi-column statistics and > planner hints; PostgreSQL supports neither. > How about a partial index (create index idx on item(folder_id) where not is_deleted)? Won't it have the required statistics (even if it is not used in the plan)?
Re: [PERFORM] Very inefficient query plan with disjunction in WHERE clause
2009/6/1 Koen Martens > > Now, when I split up the OR in two distinct queries, everything is nice and > fast. Both queries run in sub-second time. Hi. PostgreSQL simply does not like ORs (it often can't use indexes in such cases), so UNION/UNION ALL is your friend. Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Nested query performance issue
OK, got to my postgres. Here you are: create or replace function explode_array(in_array anyarray) returns setof anyelement as $$ select ($1)[s] from generate_series(1,array_upper($1, 1)) as s; $$ language sql immutable; SELECT s.* FROM score s WHERE s.id IN ( select -- Get the high scoring score ID for each game: explode_array(ARRAY( -- Get the high score for game g: SELECT s2.id FROM score s2 WHERE s2.game_id = g.id ORDER BY s2.score DESC LIMIT 5 )) FROM game g ); It takes ~64ms for me Best regards, Vitaliy Tymchyshyn
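The explode_array query above fetches the top 5 score ids per game. The same per-group top-N logic can be sketched in plain Python (toy data, hypothetical ids) to make the intended result explicit:

```python
import heapq
from collections import defaultdict

# Top-N score ids per game, equivalent in spirit to the
# explode_array(ARRAY(... ORDER BY score DESC LIMIT 5)) query above.
def top_n_per_group(rows, n):
    by_game = defaultdict(list)
    for score_id, game_id, score in rows:
        by_game[game_id].append((score, score_id))
    # nlargest on (score, id) tuples keeps the n best scores per game
    return {g: [sid for _, sid in heapq.nlargest(n, vals)]
            for g, vals in by_game.items()}

scores = [  # (score_id, game_id, score) -- toy data
    (1, "g1", 10), (2, "g1", 30), (3, "g1", 20),
    (4, "g2", 5), (5, "g2", 50),
]
print(top_n_per_group(scores, 5))
# g1 -> ids by score desc: 2 (30), 3 (20), 1 (10); g2 -> 5, 4
```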
Re: [PERFORM] Nested query performance issue
2009/4/9 Glenn Maynard > (This is related to an earlier post on -sql.) > > I'm querying for the N high scores for each game, with two tables: > scores and games. > > CREATE TABLE game (id SERIAL NOT NULL PRIMARY KEY); > CREATE TABLE score (id SERIAL NOT NULL PRIMARY KEY, score REAL, > game_id INTEGER REFERENCES game (id)); > -- test data: 1000 games, 10 scores > INSERT INTO game (id) select generate_series(1,1000); > INSERT INTO score (game_id, score) select game.id, random() from game, > generate_series(1,100); > CREATE INDEX score_idx1 ON score (game_id, score desc); > ANALYZE; > How about select s1.* from score s1 join score s2 on s1.game_id=s2.game_id and s2.score >= s1.score group by s1.* having count(s2.*) <= N Note: you can have problems if you have equal scores - you will lose rows where ties overlap the Nth place. In any case, you don't need to join game, since all you need is game_id, which you already have in score. P.S. EXPLAIN ANALYZE could help Best regards, Vitalii Tymchyshyn
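The counting approach above, and its tie caveat, can be sketched in plain Python: a score is in the top N of its game if at most N scores in that game are greater than or equal to it, but duplicate scores at the boundary inflate that count and can drop rows:

```python
# Self-join counting approach to top-N per group, plus the tie problem.
def top_n_by_count(rows, n):
    # rows: list of (score_id, game_id, score)
    out = []
    for sid, gid, sc in rows:
        rank = sum(1 for _, g2, s2 in rows if g2 == gid and s2 >= sc)
        if rank <= n:
            out.append(sid)
    return out

rows = [(1, "g1", 10), (2, "g1", 30), (3, "g1", 30), (4, "g1", 20)]
print(top_n_by_count(rows, 2))  # both 30s rank 2nd-or-better: [2, 3]
print(top_n_by_count(rows, 1))  # the tied 30s both "rank 2", so: []
```

The second call shows the caveat: with N=1 and two scores tied at 30, each counts the other, both get rank 2, and the group vanishes from the result.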
Re: [PERFORM] Very specialised query
> > > The outer nested join has the VALUES as the main loop, and the complicated > join as the leaf. So, the complicated overlap-finding join gets run twice. That's weird. What do you have as your statistics target? The planner is off by a few orders of magnitude, so increasing it may help. BTW: One of the constraints is redundant: l1.start <= l2.start implies l1.start <= l2.end, so the latter can be removed, as far as I can tell. > > > Oh, there's also the great big sort and unique, but I think I can get rid > of that. > As far as I can see, duplicates will occur if and only if l1.start == l2.start && l1.end == l2.end. That can be easily filtered by adding "where n=1 or l1.start != l2.start or l1.end != l2.end" to the outer select.
Re: [PERFORM] Very specialised query
> > > Yeah, that's nice. > > However, it is still the case that we can't trust the database to choose > the correct plan. It is currently only choosing the correct plan now by > chance, and some time later it may by chance switch to one that takes 40 > minutes. What is the bad plan? Is it like the first plan from your first message? You can sometimes tweak the optimizer to make sure it chooses the correct plan. E.g. when your database fits in memory, you can tweak page access costs. Also don't forget to raise the statistics target. BTW: About aggregates: they can return arrays, but I can't imagine what you could group by... Maybe windowing functions from 8.4 could help. Also, if your maximum length (select max(end-start) from location) is low enough, you can try adding some more constraints to make the optimizer happy (give it a more precise row count so it selects the correct plan).
Re: [PERFORM] Very specialised query
Hi. Look, what I meant by "symmetric" is that you don't need the second part of the query, because you get the same results simply by select case when n = 1 then id1 else id2 end, case when n = 2 then id1 else id2 end from ( SELECT l1.id AS id1, l2.id AS id2 FROM location l1, location l2 WHERE l1.objectid = 22893 AND l2.objectid = 22893 AND l1.id <> l2.id AND l1.start < l2.end AND l1.end > l2.start AND l1.start < l2.start) a, (values (1),(2)) b(n) (I may be missing some border cases, like when l1.start = l2.start and/or l1.end = l2.end, but this can be fixed by adding "=" to the query). Look, you can have 4 types of intersections: a) 1s 2s 2e 1e - 2 inside 1 b) 2s 1s 1e 2e - 1 inside 2 (symmetric to (a): if you have 1,2 from (a) you can generate 2,1 for (b)) c) 1s 2s 1e 2e - 1 to the left of 2 d) 2s 1s 2e 1e - 2 to the left of 1 (symmetric to (c): if you have 1,2 from (c) you can generate 2,1 for (d)) The query above gives you results for (a) and (c), and you don't need any second part - simply add the "symmetric" results. Correct me if I miss something. Best Regards, Vitalii Tymchyshyn
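The symmetry argument can be sanity-checked in a few lines of Python: find each overlapping pair once using the `l1.start < l2.start` tie-break, emit both orientations, and compare against the naive all-pairs scan (the equal-starts border case mentioned above is avoided in the toy data):

```python
# Overlap pairs are symmetric: if (l1, l2) overlap, so do (l2, l1).
# Find each pair once with the start < start tie-break, then mirror --
# the role of the (values (1),(2)) cross join in the query above.
def overlaps(a, b):
    return a[0] < b[1] and a[1] > b[0]

def overlap_pairs(locs):
    found = [(i, j) for i in locs for j in locs
             if i != j and overlaps(locs[i], locs[j])
             and locs[i][0] < locs[j][0]]
    return sorted(found + [(j, i) for i, j in found])

def overlap_pairs_naive(locs):
    return sorted((i, j) for i in locs for j in locs
                  if i != j and overlaps(locs[i], locs[j]))

locs = {1: (0, 10), 2: (5, 15), 3: (12, 20), 4: (30, 40)}
print(overlap_pairs(locs))  # [(1, 2), (2, 1), (2, 3), (3, 2)]
```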
Re: [PERFORM] Very specialised query
Hello. You could try adding "AND l2.start > l1.start" to the first query. This will drop the symmetric half of the intersections (the ones that remain are those where l2 is inside or to the right of l1), but you can restore the full results by id1,id2 union all id2,id1, and it may allow using the start index for the "between". For my "like" test this looks like the following: " -> Index Scan using location__start on location l2 (cost=0.00..756.34 rows=37787 width=12)" "Index Cond: ((l2.start < l1.end) AND (l2.start > l1.start))" Also an index on (objectid, start) would help, resulting in: " -> Index Scan using lt on location l2 (cost=0.00..0.84 rows=20 width=16)" "Index Cond: ((l2.objectid = l1.objectid) AND (l2.start < l1.end) AND (l2.start > l1.start))" Best regards, Vitalii Tymchyshyn
Re: [PERFORM] Query much slower when run from postgres function
2009/3/14 decibel > On Mar 10, 2009, at 12:20 PM, Tom Lane wrote: > >> f...@redhat.com (Frank Ch. Eigler) writes: >> >>> For a prepared statement, could the planner produce *several* plans, >>> if it guesses great sensitivity to the parameter values? Then it >>> could choose amongst them at run time. >>> >> >> We've discussed that in the past. "Choose at runtime" is a bit more >> easily said than done though --- you can't readily flip between plan >> choices part way through, if you've already emitted some result rows. >> > > True, but what if we planned for both high and low cardinality cases, > assuming that pg_stats indicated both were a possibility? We would have to > store multiple plans for one prepared statement, which wouldn't work well > for more complex queries (if you did high and low cardinality estimates for > each table you'd end up with 2^r plans, where r is the number of relations), > so we'd need a way to cap it somehow. Of course, whether that's easier than > having the ability to throw out a current result set and start over with a > different plan is up for debate... > > On a related note, I wish there was a way to tell plpgsql not to pre-plan a > query. Sure, you can use EXECUTE, but building the query plan is a serious > pain in the rear. > I'd say it would be great for PostgreSQL to replan each execution of a query automatically if the plan estimates that execution will take some factor (say x100, configurable) more time than planning did. In that case it would not spend much time planning small queries, but would use the most efficient plan possible for long queries. And even if a query can't be run better, it would spend only 1/factor more time (1% more time for a factor of 100).
Re: [PERFORM] PostgreSQL NOT IN performance
2008/11/19 DANIEL CRISTIAN CRUZ <[EMAIL PROTECTED]> > Something weird with your example which doesn't have the same result, see > row count with explain analyze: > My fault. EXCEPT ALL would not work here, so this method with EXCEPT can be used only when either operation is done on unique key on t1 or result is going to be made unique. > cruz=# SELECT version(); > version > > PostgreSQL 8.3.5 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Debian > 4.3.2-1) 4.3.2 > (1 registro) > > cruz=# EXPLAIN ANALYZE select * from t1 where id not in (select id from t2); > QUERY PLAN > -- > Seq Scan on t1 (cost=1643.00..4928.00 rows=10 width=4) (actual > time=256.687..585.774 rows=73653 loops=1) >Filter: (NOT (hashed subplan)) >SubPlan > -> Seq Scan on t2 (cost=0.00..1393.00 rows=10 width=4) (actual > time=0.052..86.867 rows=10 loops=1) > Total runtime: 625.471 ms > (5 registros) > > cruz=# EXPLAIN ANALYZE select * from t1 except all (select id from t2); >QUERY PLAN > - > SetOp Except All (cost=34469.90..35969.90 rows=3 width=4) (actual > time=2598.574..3663.712 rows=126733 loops=1) >-> Sort (cost=34469.90..35219.90 rows=30 width=4) (actual > time=2598.550..3178.387 rows=30 loops=1) > Sort Key: "*SELECT* 1".id > Sort Method: external merge Disk: 5864kB > -> Append (cost=0.00..7178.00 rows=30 width=4) (actual > time=0.037..1026.367 rows=30 loops=1) >-> Subquery Scan "*SELECT* 1" (cost=0.00..4785.00 > rows=20 width=4) (actual time=0.035..439.507 rows=20 loops=1) > -> Seq Scan on t1 (cost=0.00..2785.00 rows=20 > width=4) (actual time=0.029..161.355 rows=20 loops=1) >-> Subquery Scan "*SELECT* 2" (cost=0.00..2393.00 > rows=10 width=4) (actual time=0.107..255.160 rows=10 loops=1) > -> Seq Scan on t2 (cost=0.00..1393.00 rows=10 > width=4) (actual time=0.097..110.639 rows=10 loops=1) > Total runtime: 3790.831 ms > (10 registros) > > Sometimes I got a better result (on older versions) with this kind of > query, but in this case it doesn't: > > cruz=# EXPLAIN ANALYZE SELECT * FROM t1 
LEFT JOIN t2 ON t1.id = t2.id WHERE > t2.id IS NULL; > QUERY PLAN > --- > Merge Right Join (cost=30092.86..35251.53 rows=155304 width=8) (actual time=850.232..1671.091 rows=73653 loops=1) >Merge Cond: (t2.id = t1.id) >Filter: (t2.id IS NULL) >-> Sort (cost=9697.82..9947.82 rows=10 width=4) (actual time=266.501..372.560 rows=10 loops=1) > Sort Key: t2.id > Sort Method: quicksort Memory: 4392kB > -> Seq Scan on t2 (cost=0.00..1393.00 rows=10 width=4) (actual time=0.029..78.087 rows=10 loops=1) >-> Sort (cost=20394.64..20894.64 rows=20 width=4) (actual time=583.699..855.427 rows=273364 loops=1) > Sort Key: t1.id > Sort Method: quicksort Memory: 8784kB > -> Seq Scan on t1 (cost=0.00..2785.00 rows=20 width=4) (actual time=0.087..155.665 rows=20 loops=1) > Total runtime: 1717.062 ms > (12 registros) > > Yes, your method is even better on the 8.3.3 I have. I will try to update to 8.3.5 to see if there were optimizer improvements. You could try increasing the values in the table-filling step, say by a factor of 10, to see if NOT IN switches to the "slow" version (for me it starts being slow from some magic row count in t2). I suppose it is the moment it switches from "hashed subplan" to "subplan". For me for 1 values it is "hashed subplan" (and it is fast), for 10 - it is "subplan" and it is slow. BTW: Which (memory?) configuration variable can affect such a switch?
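The three formulations compared in this thread - NOT IN, the LEFT JOIN anti-join, and EXCEPT - return the same rows when t1.id is unique and NULL-free. A quick cross-check, sketched with Python's built-in sqlite3 rather than PostgreSQL (the result-set semantics are the same, though the plans of course differ):

```python
import sqlite3

# t1 = 0..19, t2 = even numbers 0..18; "rows in t1 not in t2" three ways.
conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE t1(id INTEGER); CREATE TABLE t2(id INTEGER);")
conn.executemany("INSERT INTO t1 VALUES (?)", [(i,) for i in range(20)])
conn.executemany("INSERT INTO t2 VALUES (?)", [(i,) for i in range(0, 20, 2)])

not_in = conn.execute(
    "SELECT id FROM t1 WHERE id NOT IN (SELECT id FROM t2) "
    "ORDER BY id").fetchall()
anti_join = conn.execute(
    "SELECT t1.id FROM t1 LEFT JOIN t2 ON t1.id = t2.id "
    "WHERE t2.id IS NULL ORDER BY t1.id").fetchall()
except_q = conn.execute(
    "SELECT id FROM t1 EXCEPT SELECT id FROM t2 ORDER BY id").fetchall()

print(not_in == anti_join == except_q, len(not_in))  # True, the 10 odd ids
```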
Re: [PERFORM] PostgreSQL NOT IN performance
2008/11/19 Stephan Szabo <[EMAIL PROTECTED]> > > On Wed, 19 Nov 2008, Віталій Тимчишин wrote: > > > Query 1: > > select * from t1 where id not in (select id from t2); > > > > Query 2 (gives same result as Q1): > > select * from t1 except all (select id from t2); > > It gives the same result as long as no nulls are in either table. If > either table can have a null, the conversion changes the results. > > In addition, a conversion like the above only happens to work because t1 > only has an id column. If t1 had two columns you'd get an error because > the two sides of except all must have the same number of columns. > Actually it can be done even in multi-column mode if the selection is done on a unique key. It would look like: select * from t1 inner join ( select id from t1 except select id from t2) talias on t1.id = talias.id And it would produce better performance than "not in" for large counts in t1 and t2.
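Stephan's NULL caveat is easy to demonstrate: a single NULL in the inner list makes `x NOT IN (...)` unknown for every row, so NOT IN returns nothing, while EXCEPT still subtracts row-wise. A sketch using Python's built-in sqlite3, which follows the same SQL three-valued logic:

```python
import sqlite3

# t1 = {1,2,3}; t2 = {1, NULL}.  NOT IN and EXCEPT now disagree.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1(id INTEGER)")
conn.execute("CREATE TABLE t2(id INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO t2 VALUES (?)", [(1,), (None,)])

not_in = conn.execute(
    "SELECT id FROM t1 WHERE id NOT IN (SELECT id FROM t2)").fetchall()
except_q = conn.execute(
    "SELECT id FROM t1 EXCEPT SELECT id FROM t2 ORDER BY id").fetchall()

print(not_in, except_q)  # [] [(2,), (3,)]
```

This is why the planner cannot blindly rewrite NOT IN as EXCEPT: the two are equivalent only when NULLs are provably absent (e.g. the column is NOT NULL).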
[PERFORM] PostgreSQL NOT IN performance
Hello. This is the second query rewrite PostgreSQL seems not to handle - making an EXCEPT from a NOT IN. Here is an example: Preparation: drop table if exists t1; drop table if exists t2; create temporary table t1(id) as select (random()*10)::int from generate_series(1,20) a(id); create temporary table t2(id) as select (random()*10)::int from generate_series(1,10) a(id); analyze t1; analyze t2; Query 1: select * from t1 where id not in (select id from t2); Plan: "Seq Scan on t1 (cost=1934.00..164105319.00 rows=10 width=4)" " Filter: (NOT (subplan))" " SubPlan" "-> Materialize (cost=1934.00..3325.00 rows=10 width=4)" " -> Seq Scan on t2 (cost=0.00..1443.00 rows=10 width=4)" Query 2 (gives same result as Q1): select * from t1 except all (select id from t2); Plan: "SetOp Except All (cost=38721.90..40221.90 rows=3 width=4)" " -> Sort (cost=38721.90..39471.90 rows=30 width=4)" "Sort Key: "*SELECT* 1".id" "-> Append (cost=0.00..7328.00 rows=30 width=4)" " -> Subquery Scan "*SELECT* 1" (cost=0.00..4885.00 rows=20 width=4)" "-> Seq Scan on t1 (cost=0.00..2885.00 rows=20 width=4)" " -> Subquery Scan "*SELECT* 2" (cost=0.00..2443.00 rows=10 width=4)" "-> Seq Scan on t2 (cost=0.00..1443.00 rows=10 width=4)" If I am correct, the planner simply does not know that it can rewrite NOT IN as an "EXCEPT ALL" operator, so any NOT IN with a long list of values to remove takes very much time. So the question is: I am willing to participate in PostgreSQL development, because it may be easier to fix the planner than to rewrite all my queries :). How can I? (I mean to work on query planner enhancements by providing new options of query rewrite - not to work on other things, nor on enhancing the planner in other ways, like better estimations of known plans).
Re: [PERFORM] PostgreSQL OR performance
2008/11/15 Tom Lane <[EMAIL PROTECTED]> > "Віталій Тимчишин" <[EMAIL PROTECTED]> writes: > > I am not. I can't see how materialize can multiply number of rows it gets > > from sort by 100. > > Is it the right-hand input of a merge join? If so you're looking at > mark/restore rescans, ie, repeated fetches of the same tuples. There > must be a huge number of duplicate join keys in that relation to make > for such an increase though. Normally the planner avoids putting a > table with lots of duplicates as the RHS of a merge, but if it doesn't > have good statistics for the join key then it might not realize the > problem. > OK, thanks for clearing up some of the mystery. But, returning to the original question: Does anyone know why it chooses the plan from *OR-plan.txt* instead of *union-plan.txt*? The first is cost=4588.13..960900482668.95, the latter is cost=266348.42..272953.14 according to the statistics postgres has, so I suppose the planner would select it if it could evaluate it.
Re: [PERFORM] PostgreSQL OR performance
2008/11/7 Richard Huxton <[EMAIL PROTECTED]> > But it's this materialize that's taking the biggest piece of the time. > > > " -> Materialize (cost=469981.13..498937.42 rows=2316503 width=30) > > (actual time=15915.639..391938.338 rows=242752539 loops=1)" > > 15.9 seconds to 391.9 seconds. That's half your time right there. The > fact that it's ending up with 242 million rows isn't promising - are you > sure the query is doing what you think it is? I am not. I can't see how materialize can multiply the number of rows it gets from sort by 100. > > > "-> Sort (cost=469981.13..475772.39 rows=2316503 width=30) > (actual > > time=15915.599..19920.912 rows=2316503 loops=1)" > > " Sort Key: production.company.run_id" > > " Sort Method: external merge Disk: 104896kB" > > By contrast, this on-disk sort of 104MB is comparatively fast. >
Re: [PERFORM] PostgreSQL OR performance
Sorry for the delayed response - it was a very busy week. 2008/11/7 David Wilson <[EMAIL PROTECTED]> > On Fri, Nov 7, 2008 at 4:14 AM, Віталій Тимчишин <[EMAIL PROTECTED]> wrote: > > "Merge Join (cost=518771.07..62884559.80 rows=1386158171 width=32) > (actual > > time=30292.802..755751.242 rows=34749 loops=1)" > > Have you tried increasing the default_statistics_target? The planner > is expecting 1.3 billion rows to be produced from a query that's only > actually producing 35k, which probably indicates some very bad > statistics. The planner seems to think that every second pair from the company<->company join will succeed with this join expression (1386158171 ~ 52648^2 / 2). That is not true. Anyway, I've tried to set default_statistics_target to 1000, then analyze. Nothing has changed. At the same time, the materialize step produces 242 > million rows when the planner only expects to produce 2.3, indicating > a similar problem in the opposite direction. This probably means that > the planner is choosing plans that would be optimal if it was making > good guesses but are decidedly sub-optimal for your actual data. > > That is even more strange, because the materialize step must produce exactly the rows it takes from sort, that is 2316503, so I don't get how table scan + sort + materialize can multiply the number of rows by 100.
Re: [PERFORM] PostgreSQL OR performance
> Yes, the query should output exactly the same result as in the "Union" plan.
> I will run the "slow" explain analyze now and will repost after it completes
> (tomorrow?).
> BTW: I'd say the planner should estimate the rows as the sum of the "OR"
> estimations minus the intersection, but no more than the sum of the ORs (if
> the intersection is 0). For the first condition it has rows=525975, for the
> second it has rows=2403 (with other plans, of course), so it's strange that
> it has such a high estimation. It's exactly 50% of the full cartesian join
> of the merge, so it thinks that every second pair would succeed, which is
> not true.

I am sorry, I've emptied the atom_match table, so one part produces 0 results, but anyway here is the explain:

"Merge Join  (cost=518771.07..62884559.80 rows=1386158171 width=32) (actual time=30292.802..755751.242 rows=34749 loops=1)"
"  Merge Cond: (production.run.id = (production.company.run_id)::bigint)"
"  Join Filter: (((production.company.name)::text = (production.company.name)::text) OR (hashed subplan))"
"  ->  Sort  (cost=45474.92..45606.54 rows=52648 width=38) (actual time=562.928..595.128 rows=15507 loops=1)"
"        Sort Key: production.run.id"
"        Sort Method:  external sort  Disk: 880kB"
"        ->  Nested Loop  (cost=1184.82..39904.24 rows=52648 width=38) (actual time=90.571..530.925 rows=15507 loops=1)"
"              ->  HashAggregate  (cost=1.55..1.56 rows=1 width=8) (actual time=3.077..3.078 rows=1 loops=1)"
"                    ->  Seq Scan on run  (cost=0.00..1.55 rows=1 width=8) (actual time=3.066..3.068 rows=1 loops=1)"
"                          Filter: ((name)::text = 'test'::text)"
"              ->  Nested Loop  (cost=1183.27..39376.19 rows=52648 width=30) (actual time=87.489..484.605 rows=15507 loops=1)"
"                    ->  HashAggregate  (cost=1.55..1.56 rows=1 width=8) (actual time=0.016..0.019 rows=1 loops=1)"
"                          ->  Seq Scan on run  (cost=0.00..1.55 rows=1 width=8) (actual time=0.009..0.011 rows=1 loops=1)"
"                                Filter: ((name)::text = 'test'::text)"
"                    ->  Bitmap Heap Scan on company  (cost=1181.72..38592.03 rows=62608 width=30) (actual time=87.465..441.014 rows=15507 loops=1)"
"                          Recheck Cond: ((production.company.run_id)::bigint = production.run.id)"
"                          Filter: ((production.company.status)::text = 'unprocessed'::text)"
"                          ->  Bitmap Index Scan on comp_run  (cost=0.00..1166.07 rows=62608 width=0) (actual time=65.828..65.828 rows=15507 loops=1)"
"                                Index Cond: ((production.company.run_id)::bigint = production.run.id)"
"  ->  Materialize  (cost=469981.13..498937.42 rows=2316503 width=30) (actual time=15915.639..391938.338 rows=242752539 loops=1)"
"        ->  Sort  (cost=469981.13..475772.39 rows=2316503 width=30) (actual time=15915.599..19920.912 rows=2316503 loops=1)"
"              Sort Key: production.company.run_id"
"              Sort Method:  external merge  Disk: 104896kB"
"              ->  Seq Scan on company  (cost=0.00..58808.03 rows=2316503 width=30) (actual time=22.244..7476.954 rows=2316503 loops=1)"
"  SubPlan"
"    ->  Nested Loop  (cost=2267.65..3314.94 rows=22 width=1038) (actual time=0.009..0.009 rows=0 loops=1)"
"          ->  Hash Join  (cost=2267.65..3141.36 rows=22 width=523) (actual time=0.006..0.006 rows=0 loops=1)"
"                Hash Cond: ((atom_match.atom1_id)::integer = s1.id)"
"                ->  Seq Scan on atom_match  (cost=0.00..30.38 rows=1630 width=8) (actual time=0.002..0.002 rows=0 loops=1)"
"                      Filter: ((match_function_id)::integer = 2)"
"                ->  Hash  (cost=1292.04..1292.04 rows=12209 width=523) (never executed)"
"                      ->  Index Scan using atomstr_typ on atoms_string s1  (cost=0.00..1292.04 rows=12209 width=523) (never executed)"
"                            Index Cond: ((atom_type_id)::integer = (-1))"
"          ->  Index Scan using pk_atoms_string on atoms_string s2  (cost=0.00..7.88 rows=1 width=523) (never executed)"
"                Index Cond: (s2.id = (atom_match.atom2_id)::integer)"
"Total runtime: 755802.686 ms"

P.S. Maybe I've chosen the wrong list and my question better belongs on -hackers?
Re: [PERFORM] PostgreSQL OR performance
2008/11/6 Richard Huxton <[EMAIL PROTECTED]>

> Віталій Тимчишин wrote:
> > As you can see from other plans, it do have all the indexes to perform it's
> > work fast (when given part by part). It simply do not wish to use them. My
> > question: Is this a configuration problem or postgresql optimizer simply
> > can't do such a query rewrite?
>
> I must admit, I haven't managed to figure out what your query is trying
> to do, but then that's a common problem with autogenerated queries.

That's easy: I am looking for duplicates in a subset of companies. Two companies are equal when their names are simply equal, or when there is an entry in the "match" table for the names.

> The main question that needs answering is why the planner thinks you're
> going to get 1.3 billion rows in the "or" query:
>
> "Nested Loop  (cost=4588.13..960900482668.95 rows=1386158171 width=32)"
>
> You don't show "explain analyse" for this query, so there's no way of
> knowing how many rows get returned but presumably you're expecting
> around 88000. What does "explain analyse" return?

Yes, the query should output exactly the same result as in the "Union" plan. I will run the "slow" explain analyze now and will repost after it completes (tomorrow?).

BTW: I'd say the planner should estimate the rows as the sum of the "OR" estimations minus the intersection, but no more than the sum of the ORs (if the intersection is 0). For the first condition it has rows=525975, for the second it has rows=2403 (with other plans, of course), so it's strange that it has such a high estimation. It's exactly 50% of the full cartesian join of the merge, so it thinks that every second pair would succeed, which is not true.
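The estimation argument above can be written out explicitly. A sketch of the inclusion-exclusion bound, using the row counts quoted in this thread:

```latex
% rows(A) and rows(B) are the planner's estimates for the two
% individual conditions, taken from the single-condition plans.
\begin{align*}
\mathrm{rows}(A \lor B) &= \mathrm{rows}(A) + \mathrm{rows}(B) - \mathrm{rows}(A \land B) \\
                        &\le \mathrm{rows}(A) + \mathrm{rows}(B) \\
                        &= 525975 + 2403 = 528378
\end{align*}
% Yet the planner estimated rows=1386158171 for the OR query,
% more than 2600 times this upper bound.
```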
Re: [PERFORM] PostgreSQL OR performance
2008/11/6 Helio Campos Mello de Andrade <[EMAIL PROTECTED]>

> For what i see in four OR-plan.txt tou are doing too much "sequencial scan"
> . Create some indexes for those tables using the fields that you use an it
> may help you.
>
> OBS: If you already have lots of indexes in your tables it may be a good
> time for you re-think your strategy because it´s ot working.
> Tips:
>   1 - create indexes for the tables with the fields that you will use in
> the query if it is your most important query. If you have others querys that
> are used please post those here and we can help you to desing a better plan.

As you can see from the other plans, it does have all the indexes needed to perform its work fast (when given part by part). It simply does not wish to use them. My question: is this a configuration problem, or can the postgresql optimizer simply not do such a query rewrite?

Actually, I did rewrite the query to work properly, as you can see from union-plan.txt. My question is whether postgresql can do this automatically, because such a rewrite is not always easy/possible (esp. for generated queries).
Re: [PERFORM] PostgreSQL OR performance
My main message is that I can see this in many queries and many times. But OK, I can present an exact example.

2008/11/5 Jeff Davis <[EMAIL PROTECTED]>

> On Wed, 2008-11-05 at 13:12 +0200, Віталій Тимчишин wrote:
> > For a long time already I can see very poor OR performance in
> > postgres.
> > If one have query like "select something from table where condition1
> > or condition2" it may take ages to execute while
> > "select something from table where condition1" and "select something
> > from table where condition2" are executed very fast and
> > "select something from table where condition1 and not condition2 union
> > all select something from table where condition2" gives required
> > results fast
>
> What version are you using?

Server version 8.3.3

> Have you run "VACUUM ANALYZE"?

I have autovacuum, but for this example I did a VACUUM ANALYZE of the whole DB.

The real-life query (autogenerated) looks like the following:

select t0.id as pk1, t1.id as pk2, t0.run_id as f1_run_id, t1.run_id as f2_run_id
from tmpv_unproc_null_production_company_dup_cons_company as t0,
  (select * from production.company
   where run_id in (select id from production.run where name='test')) as t1
where t0.name = t1.name
   or (t0.name, t1.name) in (
     select s1.name, s2.name
     from atom_match
     inner join atoms_string s1 on atom_match.atom1_id = s1.id
     inner join atoms_string s2 on atom_match.atom2_id = s2.id
     where s1.atom_type_id = -1 and match_function_id = 2)

with tmpv_unproc_null_production_company_dup_cons_company:

create temporary view tmpv_unproc_null_production_company_dup_cons_company as
select * from production.company
where 1=1 and status='unprocessed'
  and run_id in (select id from production.run where name='test')

> Next, do:
>
> EXPLAIN ANALYZE select something from table where condition1 or
> condition2;

The plan without ANALYZE is in OR-plan.txt. Plans for only condition1, only condition2, and the union are also attached:

"Nested Loop  (cost=4588.13..960900482668.95 rows=1386158171 width=32)"
"  Join Filter: (((production.company.name)::text = (production.company.name)::text) OR (subplan))"
"  ->  Hash IN Join  (cost=1.56..73814.22 rows=52648 width=30)"
"        Hash Cond: ((production.company.run_id)::bigint = production.run.id)"
"        ->  Seq Scan on company  (cost=0.00..64599.29 rows=2316503 width=30)"
"              Filter: ((status)::text = 'unprocessed'::text)"
"        ->  Hash  (cost=1.55..1.55 rows=1 width=8)"
"              ->  Seq Scan on run  (cost=0.00..1.55 rows=1 width=8)"
"                    Filter: ((name)::text = 'test'::text)"
"  ->  Nested Loop  (cost=1183.27..39219.67 rows=52648 width=30)"
"        ->  HashAggregate  (cost=1.55..1.56 rows=1 width=8)"
"              ->  Seq Scan on run  (cost=0.00..1.55 rows=1 width=8)"
"                    Filter: ((name)::text = 'test'::text)"
"        ->  Bitmap Heap Scan on company  (cost=1181.72..38435.51 rows=62608 width=30)"
"              Recheck Cond: ((production.company.run_id)::bigint = production.run.id)"
"              ->  Bitmap Index Scan on comp_run  (cost=0.00..1166.07 rows=62608 width=0)"
"                    Index Cond: ((production.company.run_id)::bigint = production.run.id)"
"  SubPlan"
"    ->  Materialize  (cost=3403.29..4005.74 rows=35745 width=28)"
"          ->  Hash Join  (cost=928.57..3122.55 rows=35745 width=28)"
"                Hash Cond: ((atom_match.atom1_id)::integer = s1.id)"
"                ->  Hash Join  (cost=445.80..1880.19 rows=35745 width=18)"
"                      Hash Cond: ((atom_match.atom2_id)::integer = s2.id)"
"                      ->  Seq Scan on atom_match  (cost=0.00..674.81 rows=35745 width=8)"
"                            Filter: ((match_function_id)::integer = 2)"
"                      ->  Hash  (cost=260.91..260.91 rows=14791 width=18)"
"                            ->  Seq Scan on atoms_string s2  (cost=0.00..260.91 rows=14791 width=18)"
"                ->  Hash  (cost=297.89..297.89 rows=14791 width=18)"
"                      ->  Seq Scan on atoms_string s1  (cost=0.00..297.89 rows=14791 width=18)"
"                            Filter: ((atom_type_id)::integer = (-1))"

"Merge Join  (cost=89373.23..97526.15 rows=525975 width=32) (actual time=276.869..523.669 rows=34749 loops=1)"
"  Merge Cond: ((production.company.name)::text = (production.company.name)::text)"
"  ->  Sort  (cost=44764.87..44896.49 rows=52648 width=30
[PERFORM] PostgreSQL OR performance
Hello.

For a long time already I have seen very poor OR performance in postgres. If one has a query like "select something from table where condition1 or condition2", it may take ages to execute, while "select something from table where condition1" and "select something from table where condition2" are executed very fast, and "select something from table where condition1 and not condition2 union all select something from table where condition2" gives the required results fast.

For example, in my current query the optimizer gives a cost of 88252 for "condition1", 88258 for "condition1 and not condition2", and 99814 for "condition2". For "condition1 or condition2" it is 961499627680. And it does perform this way.

All is more or less good when the "select" part is easy and the query can be easily rewritten. But for complex queries it looks ugly, and if the query is autogenerated, moving the autogeneration mechanism from creating a simple clean "where" to unions is not an easy task.

So the question is: am I missing something? Can this be optimized?
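The rewrite described above can be sketched on a toy schema (the table and column names here are illustrative, not the ones from my real query):

```sql
-- Slow form: one query with an OR between two conditions.
SELECT id, name
FROM companies
WHERE name = 'acme'            -- condition1
   OR status = 'unprocessed';  -- condition2

-- Hand-rewritten form: each branch can use its own index, and the
-- "AND NOT condition2" in the first branch prevents duplicate rows.
SELECT id, name
FROM companies
WHERE name = 'acme' AND NOT (status = 'unprocessed')
UNION ALL
SELECT id, name
FROM companies
WHERE status = 'unprocessed';
```

Note that the rewrite is only exactly equivalent when condition2 cannot evaluate to NULL for rows matching condition1; otherwise the NOT drops rows the OR would have kept.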
Re: [PERFORM] Index bloat, reindex weekly, suggestions etc?
2008/10/17 Tory M Blue <[EMAIL PROTECTED]>

> The real issue is my index growth and my requirement for weekly
> re-indexing (which blocks and therefore is more or less a manual
> process in a live production environment (fail over, changing vips
> etc).

BTW: Can't you simply recreate the indexes online? Since postgresql accepts multiple indexes with the same definition, this may look like:
1) create index concurrently index_alt
2) analyze index_alt
3) drop index_orig
with index_alt and index_orig having the same definition.
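A concrete sketch of this online re-create, on a hypothetical table and index (names are illustrative; note that ANALYZE operates on the table rather than the index):

```sql
-- Step 1: build a duplicate index without blocking writes.
-- CONCURRENTLY lets inserts/updates/deletes proceed during the build.
CREATE INDEX CONCURRENTLY idx_orders_customer_alt
    ON orders (customer_id);

-- Step 2: refresh planner statistics for the table.
ANALYZE orders;

-- Step 3: drop the bloated original; queries switch to the new index.
DROP INDEX idx_orders_customer;
```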
Re: [PERFORM] Delete performance again
BTW: I have just tried the "clean" (without any foreign key constraints) performance of

delete from tbl where field not in (select);

vs

create temporary table tmp(id) as select distinct field from tbl;
delete from tmp where id in (select);
delete from tbl where field in (select id from tmp);

Both tbl and the select are huge: tbl cardinality is ~5 million, the select is ~1 million. The number of records to delete is small. The select is simply "select id from table2". The first (simple) one could not finish in a night; the second did it in a few seconds.
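A sketch of the two approaches side by side (tbl, table2, and the column names are illustrative):

```sql
-- Approach 1 (slow): anti-join via NOT IN over a huge subselect.
DELETE FROM tbl
WHERE field NOT IN (SELECT id FROM table2);

-- Approach 2 (fast): materialize the distinct keys first, subtract
-- the keys to keep, then delete by the small surviving key set.
CREATE TEMPORARY TABLE tmp(id) AS
    SELECT DISTINCT field FROM tbl;

DELETE FROM tmp
WHERE id IN (SELECT id FROM table2);  -- tmp now holds only keys to remove

DELETE FROM tbl
WHERE field IN (SELECT id FROM tmp);
```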
Re: [PERFORM] Delete performance again
OK, I did try your proposal and a correlated subselect. I have a database with ~90 companies. The first try was to remove 1000 randomly selected companies:

Uncorrelated subselect: 65899ms
Correlated subselect: 97467ms
using: 9605ms
my way: 104979ms (without constraints recreate)

Mine is the worst because it is oriented at massive deletes. So I thought USING would perform better there, so I did try 1 companies:

my way: 190527ms (without constraints recreate)
using: 694144ms

I was a little shocked, but I checked the plans and found out that it had switched from Nested Loop to Hash Join. I disabled Hash Join, and it then showed Merge Join. This was also disabled, and I got 747253ms. Then I tried combinations: without hash join it was the best result of 402629ms; without merge join it was 1096116ms.

My conclusion: until the optimizer takes into account the additional actions needed (like constraint checks / cascade deletes / triggers), it cannot make a good plan.
Re: [PERFORM] Delete performance again
2008/10/2 Tom Lane <[EMAIL PROTECTED]>

> "Віталій Тимчишин" <[EMAIL PROTECTED]> writes:
> > delete from company where id not in (select company_id from
> > company_descr);
> > I've tried to analyze command, but unlike to other RDBM I've used it did
> > not include cascade deletes/checks into query plan. That is first problem.
> > It was SLOW.
>
> Usually the reason for that is having forgotten to make an index on the
> referencing column(s) ?

Not at all. As you can see below in the original message, simply "extending" the query to what should have been done by the optimizer helps. I'd say the optimizer always uses a fixed plan, not taking into account that this is a massive update, and it does an index lookup of child records for each parent record, while it would be much more effective to remove all child records in a single table scan. It's like a trigger "for each record" instead of "for each statement".
[PERFORM] Delete performance again
Hello.

I have a database with a company table that has a bunch of related (delete=cascade) tables. It also has a 1<->M relation to the company_descr table. We found that ~half of our companies do not have any description, and we would like to remove them.

The first thing I tried was

delete from company where id not in (select company_id from company_descr);

I've tried to analyze the command, but unlike other RDBMSs I've used, it did not include the cascade deletes/checks in the query plan. That is the first problem. It was SLOW. To make it faster I did the next thing:

create temporary table comprm(id) as select id from company;
delete from comprm where id in (select company_id from company_descr);
delete from company where id in (select id from comprm);

That was much better. So the question is why postgresql can't do such a thing itself.

But it was better only while the "removing" dataset was small (~5% of the whole table). As soon as I tried to remove 50%, I got speed problems. I've ensured I have indexes on both ends of the foreign keys. I've tried to remove all cascaded entries myself, e.g.:

create temporary table comprm(id) as select id from company;
delete from comprm where id in (select company_id from company_descr);
delete from company_alias where company_id in (select id from comprm);
...
delete from company where id in (select id from comprm);

It did not help until I dropped all constraints before and recreated all constraints after. Now it runs in 15 minutes, while previously it could not finish in a day. Is that OK? I'd say some (if not all) of these optimizations could be done by the postgresql optimizer.
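The "drop constraints, delete, recreate constraints" step mentioned above can be sketched like this (the constraint name and the ON DELETE CASCADE definition are hypothetical, for illustration only):

```sql
-- Drop the FK so each deleted parent row does not trigger a per-row
-- check/cascade against the child table.
ALTER TABLE company_alias
    DROP CONSTRAINT company_alias_company_id_fkey;

-- Remove children and parents with set-based statements.
CREATE TEMPORARY TABLE comprm(id) AS SELECT id FROM company;
DELETE FROM comprm WHERE id IN (SELECT company_id FROM company_descr);
DELETE FROM company_alias WHERE company_id IN (SELECT id FROM comprm);
DELETE FROM company WHERE id IN (SELECT id FROM comprm);

-- Recreate the constraint; the referential check then runs once,
-- over the whole table, instead of once per deleted row.
ALTER TABLE company_alias
    ADD CONSTRAINT company_alias_company_id_fkey
    FOREIGN KEY (company_id) REFERENCES company(id) ON DELETE CASCADE;
```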