[PERFORM] Can postgresql plan a query using multiple CPU cores?
Hi, I have a letancy-sensitive legacy application, where the time consumed by query planning was always causing some headaches. Currently it is running on postgresql-8.4 - will postgresql-10 support generating plans using multiple CPU cores to reduce the time required to generate a single plan? Thank you in advance and best regards, Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any disadvantages of using =ANY(ARRAY()) instead of IN?
Hello Noah, Thanks a lot for your feedback and explanations. Since you have 15+ tables at the top level, the genetic query optimizer should be kicking in and delivering a plan in reasonable time, albeit with plan quality hazards. There's a danger zone when the deterministic planner is still in effect but {from,join}_collapse_limit have limited the scope of its investigation. If you're in that zone and have not hand-tailored your explicit join order, poor plans are unsurprising. What exact configuration changes are you using? Basically only the changes, suggested here a year ago, which made the problem go away for less complex queries: geqo_threshold = 20 from_collapse_limit = 13 join_collapse_limit = 13 Hundreds of rows, no. Consider this example: IN(...): Total runtime: 2200.767 ms ANY(ARRAY(...)): Total runtime: 11748.348 ms In case there is an index on C, the resulting index scan is, even with 1000 elements, 3 times faster on my Notebook. However, both queries execute in next-to-no time (15 vs 5ms). Filing a bug report with the content you've already posted would not add much, but a self-contained test case could prove useful. Many of the deficiencies that can make ANY(ARRAY(...)) win do represent unimplemented planner intelligence more than bugs. Incidentally, you can isolate whether ANY(ARRAY(...))'s advantage comes solely from suppressing the subquery collapse. Keep IN but tack OFFSET 0 onto the subquery. If this gives the same performance as ANY(ARRAY(...)), then the subquery-collapse suppression was indeed the source of advantage. I see your point, some dumb logic to replace IN with ANY(ARRAY wouldn't always yield better results. I'll try to come up with a self-containing testcase. Thanks again, Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any disadvantages of using =ANY(ARRAY()) instead of IN?
Hi, I would be really grateful for feedback regardding this issue. Tom? Should Ifile a bug-report about the optimizer trying too hard to collapse the subquery and therefor generating a bad plan? Its my understanding that a IN shouldn't perform any worse than ANY on an ARRAY, right? Thank you in advance, Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any disadvantages of using =ANY(ARRAY()) instead of IN?
Hi again, That doesn't sound like a tremendously good idea to me. Could you elaborate on the downsides of this approach a bit? Any other thoughts about the pro/cons replacing IN(subquery) with =ANY(ARRAY(subquery))? Are there patological cases, except when the subquery returns a huge amount of rows? Should Ifile a bug-report about the optimizer trying too hard to collapse the subquery and therefor generating a bad plan? Thank you in advance, Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Any disadvantages of using =ANY(ARRAY()) instead of IN?
Hi, I am using postgresql as database for a hibernate based java oltp project and as in previous projects am totally impressed by postgresql's robustness, performance and feature-richness. Thanks for this excellent piece of software. Quite often Hibernate ends up generating queries with a lot of joins which usually works well, except for queries which load some additional data based on a previous query (SUBSELECT collections), which look like: select . from table1 ... left outer join table 15 WHERE table1.id IN (select id join table16 ... join table20 WHERE table20.somevalue=?) Starting with some amount of joins, the optimizer starts to do quite suboptimal things like hash-joining huge tables where selctivity would very low. I already raised join_collapse_limit and from_collapse_limit, but after a certain point query planning starts to become very expensive. However, when using =ANY(ARRAY(select ...)) instead of IN the planner seems to do a lot better, most likely because it treats the subquery as a black-box that needs to be executed independently. I've hacked hibernate a bit to use ANY+ARRAY, and it seems to work a lot better than using IN. However, I am a bit uncertain: - Is it safe to use ANY(ARRAY(select ...)) when I know the sub-query will only return a small amount (0-100s) of rows? - Shouldn't the optimizer be a bit smarter avoiding optimizing this case in the first place, instead of bailing out later? Should I file a bug-report about this problem? Thank you in advance, Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Any disadvantages of using =ANY(ARRAY()) instead of IN?
Hi Tom, Thanks for your reply. What PG version are we talking about here? For development I use 9.1.3, on the production server is 8.4.7 - happens with both cases. That doesn't sound like a tremendously good idea to me. Could you elaborate on the downsides of this approach a bit? But with so few details, it's hard to comment intelligently. Can you provide a concrete test case? A self contained testcase would take some time to create (and list members willing to configure and run), so I hope a query as well as an explain-analyze run will provide more information (done with 9.1.3): http://pastebin.com/BGRdAPg2 Its kind of the worst-worst case which I will improve later (way too much relations loaded through join-fetching), but its quite a good way to show the issue. Replacing the IN with a ANY(ARRAY()) already yields a way better plan. Thank you in advance, Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] hash semi join caused by IN (select ...)
Hi, Does anybody know why the planner treats = ANY(ARRAY(select ...)) differently than IN(select ...)? Which one is preferable, when I already have a lot of joins? Thanks, Clemens 2011/5/17 Clemens Eisserer linuxhi...@gmail.com: Hi, select from t1 left join t2 WHERE id IN (select ) Does it work as expected with one less join? If so, try increasing join_collapse_limit ... That did the trick - thanks a lot. I only had to increase join_collapse_limit a bit and now get an almost perfect plan. Instead of hash-joining all the data, the planner generates nested-loop-joins with index only on the few rows I fetch. Using = ANY(array(select... )) also seems to work, I wonder which one works better. Does ANY(ARRAY(...)) force the optimizer to plan the subquery seperated from the main query? Thanks, Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] hash semi join caused by IN (select ...)
Hi, select from t1 left join t2 WHERE id IN (select ) Does it work as expected with one less join? If so, try increasing join_collapse_limit ... That did the trick - thanks a lot. I only had to increase join_collapse_limit a bit and now get an almost perfect plan. Instead of hash-joining all the data, the planner generates nested-loop-joins with index only on the few rows I fetch. Using = ANY(array(select... )) also seems to work, I wonder which one works better. Does ANY(ARRAY(...)) force the optimizer to plan the subquery seperated from the main query? Thanks, Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] hash semi join caused by IN (select ...)
Hi, I have a quite complex, performance sensitive query in a system with a few (7) joins: select from t1 left join t2 WHERE id IN (select ) For this query the planner evaluates the IN with a hash semi join last, and all the joining is done by hash joins for all rows contained in t1. However when I specify the ids manually (IN (1, 2, 3, 4, 5) the planner first does an index lookup on the primary key column id, and subsequently does nested loop joins using an index on t2 - which gives way better results. Is there any way to guide the planner to evaluate the IN condition first, instead of last? Why is the planner behaving this way? (postgresql 8.4.??) Thank you in advance, Clemens Query plan with IN(select): Sort (cost=165.77..165.77 rows=2 width=16974) (actual time=13.459..13.460 rows=2 loops=1) Sort Key: this_.id Sort Method: quicksort Memory: 26kB - Hash Semi Join (cost=123.09..165.76 rows=2 width=16974) (actual time=12.741..13.432 rows=2 loops=1) Hash Cond: (this_.id = kladdenent0_.id) - Hash Left Join (cost=119.17..160.90 rows=348 width=16974) (actual time=8.765..13.104 rows=342 loops=1) Hash Cond: (flugzeug2_.flugzeugtyp_id = flugzeugty3_.id) - Hash Left Join (cost=118.10..155.08 rows=348 width=16454) (actual time=8.724..12.412 rows=342 loops=1) Hash Cond: (flugzeug2_.zaehlertyp_id = bmintype4_.id) - Hash Left Join (cost=117.06..152.71 rows=348 width=15934) (actual time=8.660..11.786 rows=342 loops=1) Hash Cond: (this_.lehrerid = pilot5_.id) - Hash Left Join (cost=96.66..130.46 rows=348 width=8912) (actual time=6.395..8.899 rows=342 loops=1) Hash Cond: (this_.nachid = flugplatz6_.id) - Hash Left Join (cost=93.89..122.90 rows=348 width=8370) (actual time=6.354..8.429 rows=342 loops=1) Hash Cond: (this_.flugzeugid = flugzeug2_.id) - Hash Left Join (cost=23.17..47.04 rows=348 width=7681) (actual time=1.992..3.374 rows=342 loops=1) Hash Cond: (this_.pilotid = pilot7_.id) - Hash Left Join (cost=2.78..22.04 rows=348 width=659) (actual time=0.044..0.548 rows=342 loops=1) Hash Cond: (this_.vonid = flugplatz8_.id) - Seq Scan on startkladde this_ (cost=0.00..14.48 rows=348 width=117) (actual time=0.004..0.074 rows=342 loops=1) - Hash (cost=1.79..1.79 rows=79 width=542) (actual time=0.032..0.032 rows=79 loops=1) - Seq Scan on flugplatz flugplatz8_ (cost=0.00..1.79 rows=79 width=542) (actual time=0.003..0.010 rows=79 loops=1) - Hash (cost=15.73..15.73 rows=373 width=7022) (actual time=1.938..1.938 rows=375 loops=1) - Seq Scan on pilot pilot7_ (cost=0.00..15.73 rows=373 width=7022) (actual time=0.006..0.769 rows=375 loops=1) - Hash (cost=51.43..51.43 rows=1543 width=689) (actual time=4.351..4.351 rows=1543 loops=1) - Seq Scan on flugzeug flugzeug2_ (cost=0.00..51.43 rows=1543 width=689) (actual time=0.006..1.615 rows=1543 loops=1) - Hash (cost=1.79..1.79 rows=79 width=542) (actual time=0.031..0.031 rows=79 loops=1) - Seq Scan on flugplatz flugplatz6_ (cost=0.00..1.79 rows=79 width=542) (actual time=0.003..0.011 rows=79 loops=1) - Hash (cost=15.73..15.73 rows=373 width=7022) (actual time=2.236..2.236 rows=375 loops=1) - Seq Scan on pilot pilot5_ (cost=0.00..15.73 rows=373 width=7022) (actual time=0.005..0.781 rows=375 loops=1) - Hash (cost=1.02..1.02 rows=2 width=520) (actual time=0.005..0.005 rows=2 loops=1) - Seq Scan on bmintype bmintype4_ (cost=0.00..1.02 rows=2 width=520) (actual time=0.003..0.004 rows=2 loops=1) - Hash (cost=1.03..1.03 rows=3 width=520) (actual time=0.004..0.004 rows=3 loops=1) - Seq Scan on flugzeugtype flugzeugty3_ (cost=0.00..1.03 rows=3 width=520) (actual time=0.002..0.002 rows=3 loops=1) - Hash (cost=3.90..3.90 rows=2 width=4) (actual time=0.239..0.239 rows=2 loops=1) - Limit (cost=0.00..3.88 rows=2 width=4) (actual time=0.202..0.236 rows=2 loops=1) - Index Scan using startkladde_pkey on startkladde kladdenent0_ (cost=0.00..56.24 rows=29 width=4) (actual time=0.200..0.233 rows=2 loops=1)
Re: [PERFORM] Slow query execution over high latency network
Hi Andrej, Thanks a lot for taking a loot at the tcpdump data. I just had a brief glance over your tcpdump data ... are you sure hibernate isn't using a cursor to fetch each row individually? Pretty sure, yes. I get the same performance when executing the hibernate-generated query using JDBC, even setting a large fetch-size doesn't improve the situation: st.setFetchSize(100); st.setFetchDirection(ResultSet.FETCH_FORWARD); Could it be jdbc driver struggles with the huge number of columns (~500)? Thanks, Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Slow query execution over high latency network
Hello, When executing huge (10kb), hibernate-generated queries I noticed that when executed remotly over high-latency network (ping to server 200-400ms), the query takes a lot longer to complete. When the query is executed remotly (psql or jdbc) it takes 1800ms to execute, when I issue the query in an ssh terminal, I see the results almost immediatly. So although I should see the same latency over ssh , its way faster over ssh. The transmitted data is small (the wireshard-file has 22kb, attached), and even though the umts-network is high-latency its relativly high bandwith (~512kbit/s up, ~2mbit/s down). Any idea whats causing this? Maybe too small buffers somewhere? For me it poses problem, because I am working on a 2-Tier java application which should connect to postgres remotly - however with every more complex query taking 2s its almost unuseable over wireless networks (umts). Thank you in advance, Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Slow query execution over high latency network
Hi Pierre, Thanks a lot for your reply. Your attached file didn't come through. Hmm, ok. I uploaded the wireshark-log to: http://93.190.88.182/psql_large_query.bin - different execution plan between your app and ssh+psql, which can happen if the planning uses/doesn't use your specific parameters, Its both times (ssh and remote psql) exactly the same query - I copied the SQL generated by hibernate and executed it in psql. And although it has many columns (~210) the result-set is only about 5 rows and am sure not larger than a few kb. - dumb client versus smart client : smart client : use the protocol which sends the query text + parameters + prepare + execute in 1 TCP message, 1 ping, postgres works, 1 ping, get reply So are both psql and the jdbc driver dumb clients? Or are there only buffers somewhere too small and therefor data is sent in many smal batches. I thought one query would more or less equal to one roundtrip, right? Maybe I should ask on the pgsql-jdbc list. If you want to ensure the fastest response time you need to ensure than one user action (click) needs one and only one roundtrip to the server before all the results are displayed One solution could be to put the database handling stuff inside an appserver, make your app communicate to it with a low-overhead RPC protocol (ie, not raw uncompressed XML) that minimizes the number of roudtrips, and compresses data thoroughly. I use well tuned hibernate fetch profiles to ensure fewest possible roundtrips, however I am not getting paid well enough to create an appserver tier ;) Thanks, Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Performance on new 64bit server compared to my 32bit desktop
Hi, This isn't an older Opteron, its 6 core, 6MB L3 cache Istanbul. Its not the newer stuff either. Everything before Magny Cours is now an older Opteron from my perspective. The 6-cores are identical to Magny Cours (except that Magny Cours has two of those beast in one package). - Clemens -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Are Indices automatically generated for primary keys?
Hi, Are indices for columns marked with PRIMARY KEY automatically generated by postgresql, or do I have to do it manually? The question might seem dumb, I ask because I remember from working with MySQL it generates indices automatically in this case. Thank you in advance, Clemens
Re: [PERFORM] Are Indices automatically generated for primary keys?
Hi, they are generated automatically. Thanks depesz! The reason why I asked was because pgAdmin doesn't display the automatically created indices, which confused me. Thanks, Clemens PS: If you look at the documentation page for CREATE TABLE, you'll see the following . but if you use your browser to search for PRIMARY KEY within the page, it's not too hard to find. Its quite harsh to imply I didn't look for documentation. I looked at the Indexes and ORDER BY which doesn't mention it, or I've overlook it. Doesn't make a difference anyway. Also, if you create a primary key or a unique constraint on a table, you should see a notice informing you of the creation of the index, and its name. I use Hibernate, and it generates the DDL for me. Even with debug/DDL/SQL-output enabled, I don't get any hint that an index was created. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Question about explain-command...
Hello, I just discovered the explain command and well ... have some (for you of course very stupid) questions. I do a quite large (for my taste) join, the query looks like the following: SELECT DISTINCT customer.email AS cemail, customer.key AS ckey, customer.anrede AS canrede, customer.strasse AS cstrasse, customer.plz AS cplz, customer.ort AS cort, customer.vorname AS cvorname, customer.nachname AS cnachname , custtype.name AS tname, customer.land AS cland, customer.datanotvalid AS cdatanvalid FROM customer LEFT JOIN sells ON customer.key=sells.custid LEFT JOIN goods ON sells.goodsid=goods.key LEFT JOIN custtype ON customer.custgroup=custtype.key LEFT JOIN prodtype ON prodtype.key=goods.prodgroup WHERE customer.nachname LIKE '%name%'; All primary keys are indixed, and this is what explain tells me: Unique (cost=15.67..16.69 rows=34 width=115) - Sort (cost=15.67..15.75 rows=34 width=115) Sort Key: customer.email, customer.key, customer.anrede, customer.str asse, customer.plz, customer.ort, customer.vorname, customer.nachname, custtype. name, customer.land, customer.datanotvalid - Hash Left Join (cost=6.16..14.80 rows=34 width=115) Hash Cond: (outer.prodgroup = inner.key) - Hash Left Join (cost=4.97..13.10 rows=34 width=119) Hash Cond: (outer.custgroup = inner.key) - Hash Left Join (cost=3.88..11.49 rows=34 width=111) Hash Cond: (outer.goodsid = inner.key) - Hash Left Join (cost=1.98..9.08 rows=34 width=111) Hash Cond: (outer.key = inner.custid) - Seq Scan on customer (cost=0.00..6.10 rows=34 width=107) Filter: ((nachname)::text ~~ '%au%'::text) - Hash (cost=1.78..1.78 rows=78 width=8) - Seq Scan on sells (cost=0.00..1.78 rows=78 width=8) - Hash (cost=1.72..1.72 rows=72 width=8) - Seq Scan on goods (cost=0.00..1.72 rows=72 width=8) - Hash (cost=1.08..1.08 rows=8 width=16) - Seq Scan on custtype (cost=0.00..1.08 rows=8 width=16) - Hash (cost=1.15..1.15 rows=15 width=4) - Seq Scan on prodtype (cost=0.00..1.15 rows=15 width=4) What does the hash-lines mean, does that mean my query does not use the indices at all? Why are some table-names and some column-names surrounded by ' '? Are they threated as text-columns? I have to admit that the tables are just filled with test-data so the analyzer may take just a very simple way since almost no data is in... lg Clemens ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings