[HACKERS] Table and Index compression
With the talk about adding compression to pg_dump lately, I've been wondering if tables and indexes could be compressed too. So I've implemented a quick on-the-fly compression patch for postgres. Sorry for the long email, but I hope you find this interesting.

Why compress?

1- To save disk space? Disks are so cheap now that this alone is not a valid reason. Besides, anyone with a database big enough to fill a modern hard disk probably also has a monster RAID to get more IO throughput.

2- To make the database faster? This would be a valid reason. If the database is made smaller through compression, this can have several interesting effects:
- you need less RAM to cache it entirely, so random IO hits the cache more often than before
- if the CPU can decompress faster than the disk can read, even seq scans can be faster (this will not be the case on monster RAID setups, but for cheap servers, it could be)

So, why not? I coded it.

I've benchmarked lzo, lzf, quicklz, lzjb, and fastLZ. The best for this purpose is lzo: very fast decompression, a good compression ratio on a sample of postgres tables and indexes, and a license that could work. QuickLZ compresses faster and better, but is not free.

Compression works on each page independently, so a single page is fast to decompress. This means 8k pages are a bit small; I used 32k pages to get better compression.

I've checked various table and index files:
- your average table with many columns, some TEXT, etc., compresses about 2x
- when the row header is significant relative to the row data, it can reach 3x
- indexes compress well too, 2-4x
- gist indexes could sometimes be compressed up to 6x

Roughly, 2-3x compression is achievable on a complete database.

Implementation

This lives in md.c. The implementation is about 100 lines of code (!). Instead of calling FileWrite and FileRead, md.c calls special functions that read and write compressed blocks.

* Writing: The block is compressed into a temporary memory buffer.
A header (prepended to the compressed data) gives the length of this data. The block is then written at its location in the disk file.

* Reading: The first 4k of compressed data is read. From the length header, we know how much more to read. The rest of the data (if any) is read, and the data is decompressed.

Since a compressed block can be larger than the original block, I have enlarged the block size in the files by 4k, so that there is a block every 40k instead of every 32k (with 32k postgres pages). That's it, very simple.

Now, the reason it works is that the underlying file is handled as sparse by the OS. The holes between compressed blocks are removed: not recorded on disk, and never cached either. However, sparse files can have big performance problems if you do this:
- write a small block in the middle of the file, surrounded by holes
- later enlarge this block

When the block is enlarged and needs an extra filesystem page, that page will not be allocated contiguously. When it is read later, it will need an extra seek, which is really bad.

So, looking at the compression statistics:
- gist index for geometric coordinates search: a 32k page compresses to 1-2 4k-pages, very rarely 3
- btree indexes: a 32k page compresses to 2-3 4k-pages
- large table: a 32k page compresses to 2-4 4k-pages

Therefore, on write, I pre-allocate some space in the sparse file by writing more than needed: currently I write 5 4k-blocks. Whatever is written after the compressed data is garbage previously in the buffer; it is ignored on reads. This means the disk space savings are less than full compression would give, but access is much smoother, in fact much like a regular non-sparse file, since the blocks between the holes almost never need to grow. Without pre-allocating, performance is abysmal, not even worth talking about.

Pre-allocated but never used blocks are never read, except perhaps by OS readahead during a seq scan.
On a heavily random-access database they will not be touched, so they waste no space in the OS cache. shared_buffers thus contains decompressed blocks: a row that is updated very often will not go through decompression-compression cycles each time. The OS cache contains compressed data.

Some tests: It appears to behave as expected. It hasn't crashed (yet...). Basically it looks like RAM has doubled and CPU speed has halved.

Random access queries are faster, even on a cold cache, and of course much better cached afterwards, since the amount of data the OS cache can hold is at least doubled.

Seq scans on a huge table, reading data from disk, are a tiny bit slower, which is strange: on my test box, the disks are slow (50 MB/s) and lzo can decompress much faster than that. At least it isn't slower.

Seq scans on a cached table that would fit in RAM anyway are slower, because the data needs to be decompressed.
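The write and read paths described above can be sketched in a few lines. This is a hypothetical stand-in, not the patch itself: it uses Python and zlib in place of C and lzo, assumes a little-endian 4-byte length header, and uses the same 40k on-disk stride and 5x4k pre-allocation the post describes.

```python
import struct, zlib

PAGE = 32 * 1024          # uncompressed postgres page size used in the patch
SLOT = 40 * 1024          # on-disk stride: 32k page + 4k of slack per block
PREALLOC = 5 * 4096       # pre-allocate 5 4k-blocks to avoid fragmentation

def write_page(f, blockno, page):
    z = zlib.compress(page)               # zlib stands in for lzo here
    data = struct.pack("<I", len(z)) + z  # 4-byte length header, then payload
    if len(data) < PREALLOC:
        # pad out to the pre-allocated size; the padding is never read back
        data = data.ljust(PREALLOC, b"\0")
    f.seek(blockno * SLOT)
    f.write(data)

def read_page(f, blockno):
    f.seek(blockno * SLOT)
    head = f.read(4096)                   # first 4k: header plus start of data
    (clen,) = struct.unpack("<I", head[:4])
    data = head[4:4 + clen]
    if len(data) < clen:                  # the header tells how much more to read
        data += f.read(clen - len(data))
    return zlib.decompress(data)
```

A real implementation would punch holes (or simply never write) between slots so the OS stores the file sparsely, as the post explains.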
Re: [HACKERS] More thoughts on sorting
PFC writes: - for short strings (average 12 bytes), sort is CPU-bound in strcoll() - for longer strings (average 120 bytes), sort is even more CPU-bound in strcoll() No news there. If you are limited by the speed of text comparisons, consider using C locale. regards, tom lane Actually, I think (see the bottom of my last email) that this would be a good argument for the per-column COLLATE patch... -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] More thoughts on sorting
There was a thread some time ago about sorting... it kind of died... I did some tests on a desktop (Postgres 8.3.7, kubuntu, Core 2 dual core, 4GB RAM, RAID1 of 2 SATA disks).

Quick conclusions:
- grabbing the stuff to sort can be IO-bound, of course (not here)
- for short strings (average 12 bytes), sort is CPU-bound in strcoll()
- for longer strings (average 120 bytes), sort is even more CPU-bound in strcoll()
- strcoll() time seems to depend on the length of the strings, not on where the first difference occurs (grokking the glibc source code confirms this)

See the detailed test procedure below. Locale is fr_FR.UTF-8 and the database is UNICODE. All strings are ASCII; they are mostly alphanumeric references. There are 391469 strings: min length 6 chars, max length 80 chars, avg length 11.82 chars.

We have a table test with (id INTEGER PRIMARY KEY, TEXT, BYTEA), and the contents of TEXT and BYTEA are identical. We have a table test2 which contains the same thing as test, except that the id and a 100-character constant are appended to the strings to make them longer.

Test procedure: Grab test data from: http://home.peufeu.com/pg/test_data_references.copy.gz

Sorting with Python:
- sorting all strings converted to unicode (from utf8) using strcoll() and the correct locale => 5.8 s
- with longer strings (as in table test2 below) => 8 s

Postgres: To get query timings, I use \t and "SELECT * FROM test ORDER BY id OFFSET 391468;" which avoids EXPLAIN ANALYZE overhead; it just prints the last row of the results. Timings are a bit shorter than what EXPLAIN ANALYZE gives, and I checked the plans, they are all sorts.
-- Create test table and load it
BEGIN;
CREATE TABLE test1 (t TEXT NOT NULL);
\copy test1 FROM test_data_references.copy
CREATE TABLE test (id SERIAL PRIMARY KEY, t TEXT NOT NULL, b BYTEA NOT NULL );
INSERT INTO test (t,b) SELECT t,t::BYTEA FROM test1;
DROP TABLE test1;
ALTER TABLE test DROP CONSTRAINT test_pkey;
CREATE TABLE test2 (id INTEGER NOT NULL, t TEXT NOT NULL, b BYTEA NOT NULL );
INSERT INTO test2 SELECT id,
    (t || id || 'This is a dummy text of length 100 bytes') AS t,
    (t || id || 'This is a dummy text of length 100 bytes')::BYTEA AS b
    FROM test;
COMMIT;
\d test
SHOW work_mem;              --> 16MB
SHOW maintenance_work_mem;  --> 512MB
\timing

-- cache it really well
SELECT count(*) FROM test;
SELECT count(*) FROM test;
SELECT count(*) FROM test;
--> 391469
--> Temps : 87,033 ms

SELECT * FROM test ORDER BY id OFFSET 391468;
--> Temps : 918,893 ms
SELECT id FROM test ORDER BY id OFFSET 391468;
--> Temps : 948,015 ms

Interpretation:
- Time for hauling around extra data (SELECT * instead of SELECT id) is not significant.
- Sorting by integers is quite fast (not THAT fast though; MySQL below is 3x faster when selecting just 'id' and 2x slower on SELECT *, hum.)

SELECT * FROM test ORDER BY b OFFSET 391468;
--> Temps : 2145,555 ms
SELECT id FROM test ORDER BY b OFFSET 391468;
--> Temps : 2152,273 ms

Interpretation:
- Time for hauling around extra data is not significant.
- Sorting by BYTEA is just a memcmp(); it is strange that it is 2x slower than ints. Probably the varlena stuff, I guess.
- See the ridiculous MySQL results using a BLOB below, which are 10x slower.

SELECT * FROM test ORDER BY t OFFSET 391468;
--> Temps : 7305,373 ms
SELECT id FROM test ORDER BY t OFFSET 391468;
--> Temps : 7345,234 ms

Interpretation:
- Time for hauling around extra data is not significant.
- Sorting localized TEXT really is SLOW!
- The little test above calling strcoll() from Python confirms the slowness is in strcoll().
- MySQL (see below) seems to be much faster (about equal to postgres) on VARCHAR, and 2x slower on TEXT (hum...)

BEGIN;
CREATE INDEX test_id ON test( id );
--> Temps : 555,718 ms
CREATE INDEX test_b ON test( b );
--> Temps : 1762,263 ms
CREATE INDEX test_t ON test( t );
--> Temps : 6274,624 ms
ROLLBACK;

Interpretation:
- maintenance_work_mem is much higher than work_mem, so the sorts are faster, but the slowness of localized text sorting persists...

SELECT count(*) FROM test2;
--> 391469
--> Temps : 114,669 ms

SELECT * FROM test2 ORDER BY id OFFSET 391468;
--> Temps : 1788,246 ms
SELECT id FROM test2 ORDER BY id OFFSET 391468;
--> Temps : 989,238 ms

Interpretation:
- Time for hauling around extra data (SELECT * instead of SELECT id) IS significant this time, due to the extra string lengths.

SELECT * FROM test2 ORDER BY b OFFSET 391468;
--> Temps : 2906,108 ms
SELECT id FROM test2 ORDER BY b OFFSET 391468;
--> Temps : 2554,931 ms
SELECT * FROM test2 ORDER BY t OFFSET 391468;
--> Temps : 10637,649 ms
SELECT id FROM test2 ORDER BY t OFFSET 391468;
--> Temps : 10322,480 ms

Interpretation:
- Note: the strings are longer, however they are sorta
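The finding that the sort is CPU-bound in strcoll() suggests the classic remedy of computing a collation key once per string with strxfrm(), so that the O(n log n) comparisons become plain byte comparisons. This is only a sketch of that general technique, not something the post claims postgres does; it uses the "C" locale for portability, whereas the tests above used fr_FR.UTF-8:

```python
import functools, locale, random, string

# the post's tests used fr_FR.UTF-8; "C" keeps this sketch portable
locale.setlocale(locale.LC_COLLATE, "C")

random.seed(0)
# short alphanumeric references, roughly like the post's test data
refs = ["".join(random.choices(string.ascii_uppercase + string.digits, k=12))
        for _ in range(1000)]

# comparison-based sort: strcoll() is called O(n log n) times
by_coll = sorted(refs, key=functools.cmp_to_key(locale.strcoll))

# key-based sort: strxfrm() is called once per string,
# after which comparisons are memcmp-like
by_xfrm = sorted(refs, key=locale.strxfrm)
```

The trade-off is the memory for the strxfrm keys, which for some locales are several times longer than the source strings.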
Re: [HACKERS] Protection from SQL injection
Sure, modifying the WHERE clause is still possible, but the attacker is a lot more limited in what he can do if he can't tack on a whole new command.

I hacked into a site like that one day to show a guy that you shouldn't trust magicquotes (especially when you switch hosting providers and it's not installed at the new one, lol). Binary search on the password field by adding some stuff to the WHERE... You could still wipe out tables (just add a "' OR 1;--" to the id in the url to delete something)... But it's true that preventing multi-statements adds a layer of idiot-proofness... a rather thin layer...

The important aspects of this that I see are: 1. Inexpensive to implement; 2. Unlikely to break most applications; 3. Closes off a fairly large class of injection attacks. The cost/benefit ratio looks pretty good (unlike the idea that started this thread...) regards, tom lane
Re: [HACKERS] Protection from SQL injection
Could we also get a mode where PREPARE would only be allowed for queries of the form "SELECT * FROM func(?,?,?,?,?);"? :)

Actually, that is similar to the concept of "global prepared statements" that I proposed some time ago, but I will not have time to write the patch, alas... The idea was that the DBA could create a list of SQL statements (with privileges about who can execute them, just like functions) which are prepared on demand at the first EXECUTE by the client. This would enhance performance (though for performance I like the idea of caching plans better). It would be pretty cumbersome, though, to execute dynamic SQL like the typical search query...
Re: [HACKERS] Protection from SQL injection
For example, some applications need to replace whole phrases: $criteria = "WHERE $var1 = '$var2'". This is a very common approach for dynamic search screens, and really not covered by placeholder approaches.

Python, again:

params = {
    'column1': 10,
    'column2': "a st'ring",
}
where = " AND ".join( "%s=%%(%s)s" % (key, key) for key in params )
cursor.execute( "SELECT * FROM table WHERE " + where, params )

I use the same approach (albeit more complicated) in PHP. For complex expressions you can play with arrays etc.; it is not that difficult. Or you just do:

$criteria = db_quote_query( "WHERE $var1 = %s", array( $var2 ));

using the function I posted earlier. This supposes, of course, that $var1, which is the column name, comes from a known source and not from user input. In that case, $var1 will probably be the form field name, which means it is specified by the programmer a few lines earlier in the code.
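The dict-driven WHERE builder above can be fleshed out with two common safety refinements: named placeholders, so the values still travel as parameters, and a column whitelist, since column names cannot be parameterized. A sketch under those assumptions; the ALLOWED set, build_where name, and table1 are illustrative, not from the original post:

```python
# whitelist: column names can't be sent as parameters, so they must
# come from a known set, never from user input
ALLOWED = {"column1", "column2"}

def build_where(criteria):
    for col in criteria:
        if col not in ALLOWED:
            raise ValueError("unknown column: %r" % col)
    # named pyformat placeholders: the values travel as parameters,
    # never pasted into the SQL string
    return " AND ".join("%s = %%(%s)s" % (col, col) for col in criteria)

params = {"column1": 10, "column2": "a st'ring"}
sql = "SELECT * FROM table1 WHERE " + build_where(params)
# cursor.execute(sql, params)   # with a DB-API driver such as psycopg2
```

Only the column names are interpolated, and only after the whitelist check; the quoting of values is left entirely to the driver.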
Re: [HACKERS] Protection from SQL injection
zero developer pain

Actually it's not zero pain, but the main problem is: there is no way to enforce using it.

Sure, there is no way to enforce it (apart from grepping the source for pg_query() and flogging someone if it is found), but is it really necessary when the right solution is easier to use than the wrong one? Capitalizing on developer laziness is a win IMHO, lol.

The problem is not only quotes. The problem is all kinds of user input. For example: sql = "SELECT * FROM ORDERS WHERE ORDER_ID = " + orderId; This is not a problem if orderId is a number. But what if it's a String? For example "1 AND (SELECT * FROM USERS WHERE NAME='admin' AND PASSWORD LIKE 'm%')". An attacker could then retrieve the admin password quite quickly.

IMHO this is an example of what should never be done.

// very bad (especially in PHP where you never know the type of your variables)
sql = "SELECT * FROM ORDERS WHERE ORDER_ID = " + orderId;

// slightly better (and safe)
sql = "SELECT * FROM ORDERS WHERE ORDER_ID = " + int( orderId );

// correct (PHP syntax)
pg_query_params( "SELECT * FROM ORDERS WHERE ORDER_ID = $1", array( $orderId ));
db_query( "SELECT * FROM ORDERS WHERE ORDER_ID = %s", array( $orderId ));

// correct (Python syntax)
cursor.execute( "SELECT * FROM ORDERS WHERE ORDER_ID = %s", ( orderId, ))

The last two don't complain if orderId is a string: it will be correctly quoted, and postgres will complain only if it is a string that does not contain a number. This is useful in PHP, where you never know what type you actually have. The little function in my previous mail is also useful for mysql, which has no support for parameterized queries.
Re: [HACKERS] Protection from SQL injection
On Tue, 29 Apr 2008 01:03:33 +0200, Brendan Jurd <[EMAIL PROTECTED]> wrote:
On Tue, Apr 29, 2008 at 7:00 AM, PFC <[EMAIL PROTECTED]> wrote:

I have found that the little bit of code posted afterwards did eliminate SQL holes in my PHP applications with zero developer pain; actually it is MORE convenient to use than randomly pasting strings into queries. You just call:

db_query( "SELECT * FROM table WHERE column1=%s AND column2=%s", array( $var1, $var2 ));

Implementing this for yourself is crazy; PHP's Postgres extension has already done this for you since 5.1.0:

$result = pg_query_params("SELECT foo FROM bar WHERE baz = $1", array($baz));

http://www.php.net/manual/en/function.pg-query-params.php
Cheers, BJ

pg_query_params is actually quite a bit slower...
Re: [HACKERS] Protection from SQL injection
As you know, "SQL injection" is the main security problem of databases today. I think I found a solution: 'disabling literals'. Or you may call it 'enforcing the use of parameterized statements'. This means that SQL statements with embedded user input are rejected at runtime. My solution goes beyond saying "developers ~should~ use parameterized statements". That is not a solution, because developers are lazy. My solution is: "developers MUST use parameterized statements". It goes like this: literals are disabled using the SQL statement:

I have found that the little bit of code posted afterwards did eliminate SQL holes in my PHP applications with zero developer pain; actually it is MORE convenient to use than randomly pasting strings into queries. You just call:

db_query( "SELECT * FROM table WHERE column1=%s AND column2=%s", array( $var1, $var2 ));

It is inspired by the Python interface, which does the same (but slightly more elegantly). I have removed the logging features for clarity.

function db_quote_query( $sql, $params=false )
{
    // if no params, send query raw
    if( $params === false )
        return $sql;
    if( !is_array( $params ))
        $params = array( $params );
    // quote params
    foreach( $params as $key => $val )
    {
        if( is_array( $val ))
            $params[$key] = implode( ', ', array_map( 'intval', $val ));
        else
            $params[$key] = is_null( $val ) ? 'NULL' : ("'".pg_escape_string( $val )."'");
    }
    return vsprintf( $sql, $params );
}

function db_query( $sql, $params=false )
{
    // it's already a query
    if( is_resource( $sql ))
        return $sql;
    $sql = db_quote_query( $sql, $params );
    $r = pg_query( $sql );
    if( !$r )
    {
        echo "PostgreSQL error:<br />".htmlspecialchars(pg_last_error())."<br />Query:<br />".$sql."<br />Traceback:";
        foreach( debug_backtrace() as $t )
            xdump( $t );
        die();
    }
    return $r;
}

SET ALLOW_LITERALS NONE;

Afterwards, SQL statements containing literals are not allowed any more for this session.
That means an SQL statement of the form "SELECT * FROM USERS WHERE PASSWORD='qerkllkj'" will fail with the exception 'Literals are not allowed, please use parameters'. It is as if the database did not know what ='qerkllkj' means. Only statements of the secure form are allowed, for example "SELECT * FROM USERS WHERE PASSWORD=?". This solves the problem, because SQL injection is almost impossible if user input is not directly embedded in SQL statements.

The 'ALLOW_LITERALS NONE' mode is enabled by the developers themselves, or by an administrator. It is still possible to generate SQL statements dynamically, and to use the same APIs as before, as long as the SQL statements don't include literals. Literals can still be used in query tools, or in applications considered 'safe'. To ease converting an application to parameterized queries, there should be a second mode where number literals are allowed: SET ALLOW_LITERALS NUMBERS. To allow all literals, execute SET ALLOW_LITERALS ALL (this is the default setting).

So far this feature is implemented in my little database H2. More information about this feature is described here: http://www.h2database.com/html/advanced.html#sql_injection

I know about the Perl taint mode, but that is only for Perl. I also know about disabling multi-statement commands (this only solves part of the problem). PostgreSQL should also support database-level 'constants' similar to constants in other programming languages, otherwise application-level constants (such as 'active') can't be used in queries directly (I propose adding new SQL statements CREATE CONSTANT ... VALUE ... and DROP CONSTANT ..., for example: CREATE CONSTANT STATE_ACTIVE VALUE 'active').

I also know that the 'disabling literals' feature does not solve SQL injection completely: for example, 'ORDER BY injection', where an application dynamically adds the column to sort on based on a hidden 'sort column' field in a web app.
To solve that, I suggest supporting a parameterized ORDER BY: ORDER BY ?, where ? is an integer. Also, instead of using SET ALLOW_LITERALS NONE, the use of literals should probably be two access rights (REVOKE LITERAL_TEXT, LITERAL_NUMBER FROM APP_ROLE). Those are details that still need to be discussed.

What do you think about it? Do you think it makes sense to implement this security feature in PostgreSQL as well? If not, why not? Does PostgreSQL have another solution or a plan to solve the SQL injection problem?

Regards, Thomas

P.S. I sent this proposal to [EMAIL PROTECTED] first and got replies, but I would like to get some feedback from the PostgreSQL developers as well.
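As a toy illustration of what ALLOW_LITERALS enforcement might look like, here is a naive checker in Python. This is emphatically not H2's actual implementation (a real database would enforce this on the parser's token stream, not with regexes), and the function and constant names are made up for this sketch:

```python
import re

# crude approximations of SQL literal tokens; a real implementation
# would work on proper parser tokens, not regexes
STRING_LIT = re.compile(r"'(?:[^']|'')*'")
NUMBER_LIT = re.compile(r"(?<![\w$])\d+(?:\.\d+)?(?![\w$])")  # skips $1-style params

def check_no_literals(sql, allow="NONE"):
    """Reject statements embedding literals. allow: NONE | NUMBERS | ALL."""
    if allow == "ALL":
        return True
    if STRING_LIT.search(sql):
        raise ValueError("Literals are not allowed, please use parameters")
    if allow == "NONE" and NUMBER_LIT.search(sql):
        raise ValueError("Literals are not allowed, please use parameters")
    return True
```

Even this toy shows the migration path the proposal describes: start at ALL, move to NUMBERS while converting string literals, and finish at NONE.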
Re: [HACKERS] Per-table random_page_cost for tables that we know are always cached
Example: let's imagine a "cache priority" setting.

Which we can presume the DBA will set incorrectly, because the tools needed to set it right aren't easy to use.

LOL, yes.

Jim threw out that you can just look at the page hit percentages instead. That's not completely true. If you've had some nasty query blow out your buffer cache, or if the server has been up a looong time and the total stats don't really reflect recent reality, what's in the buffer cache and what the stats say has historically been cached can diverge.

Yes:
- perform a huge query on table A
- table A is now in cache
- perform a huge query on table B
- table B is now in cache, A isn't
- perform a huge query on table A again
- postgres still thinks table A is cached and chooses a bad plan

This would not examine whatever is in the OS' cache, though.

Yeah, but now that shared_buffers can be set to a large part of physical RAM, does that still matter? The point is, postgres knows what is in shared_buffers, so it can make a good decision. Postgres doesn't know what the OS has in cache, so it could only make a wild guess. I would rather err on the side of safety...

I don't know that it's too unrealistic to model the OS cache as just an extrapolated bigger version of the buffer cache. I can think of a couple of ways the two can diverge:
1) Popular pages that get high usage counts can end up with a higher representation in shared_buffers than in the OS cache
2) If you've been doing something like a bulk update, you can have lots of recently written pages in the OS cache that aren't really accounted for in shared_buffers, because they never get a high enough usage count to stay there (used only once), but they can fill the OS cache as they're spooled up to be written. Especially on CHECKPOINT.
[HACKERS] Per-table random_page_cost for tables that we know are always cached
It started with this query:

EXPLAIN ANALYZE SELECT * FROM relations r JOIN nodes n ON (n.id=r.child_id) WHERE r.parent_id=16330;

Hash Join (cost=370.96..496.29 rows=543 width=273) (actual time=18.887..21.164 rows=543 loops=1)
  Hash Cond: (r.child_id = n.id)
  -> Index Scan using relations_unique on relations r (cost=0.00..111.75 rows=543 width=58) (actual time=0.022..0.519 rows=543 loops=1)
       Index Cond: (parent_id = 16330)
  -> Hash (cost=243.76..243.76 rows=10176 width=215) (actual time=18.830..18.830 rows=10176 loops=1)
       -> Seq Scan on nodes n (cost=0.00..243.76 rows=10176 width=215) (actual time=0.006..5.135 rows=10176 loops=1)
Total runtime: 21.453 ms

SET enable_hashjoin TO 0;
EXPLAIN ANALYZE SELECT * FROM relations r JOIN nodes n ON (n.id=r.child_id) WHERE r.parent_id=16330;

Nested Loop (cost=0.00..514.50 rows=543 width=273) (actual time=0.037..4.412 rows=543 loops=1)
  -> Index Scan using relations_unique on relations r (cost=0.00..111.75 rows=543 width=58) (actual time=0.023..0.476 rows=543 loops=1)
       Index Cond: (parent_id = 16330)
  -> Index Scan using nodes_pkey on nodes n (cost=0.00..0.73 rows=1 width=215) (actual time=0.004..0.005 rows=1 loops=543)
       Index Cond: (n.id = r.child_id)
Total runtime: 4.638 ms

In order to have it use the fast plan, I must set random_page_cost to 1, which I absolutely don't want to do. Setting effective_cache_size to huge values has no effect. If I select a value of parent_id that has many fewer children, the index will be used, but in this case I think the threshold is misplaced; it should be slightly higher. Here we have about 5% of the values selected. Hash join becomes better at about 15%, because the table is cached.

This is 8.3. Perhaps there is a need for a per-object setting (object = table, index, partition) to alter the aggressiveness/laziness of the page flushing and how long the pages for this object are kept in shared_buffers...
This would be used to modify random_page_cost on a per-table/index/partition basis.

Example: let's imagine a "cache priority" setting.
- "cache priority" set to the minimum means this table is mostly write-only
- "cache priority" set to the default gives the current behaviour (which is correct in most cases)
- "cache priority" set to a high value tells Postgres "I know this table/index/partition is small and often accessed rather randomly, so I want you to keep it in shared_buffers; purge it if you must, but otherwise keep it in memory, and flush something else instead which has a lower cache_priority"

The optimizer could then use a different (much lower) value of random_page_cost for tables with the highest "cache priority", since it would know they are cached.

An alternative would be for the background writer to keep some stats and do the work for us:
- begin bgwriter scan
- set up a hashtable of [relid => page count]
- for each page that is scanned, increment the "page count" for its relation (uses very little CPU)
- end bgwriter scan
- for each relation, compare the number of pages found in shared_buffers with the number of pages in the relation, and draw conclusions about how well cached the relation is
- update random_page_cost accordingly for this relation

This would not examine whatever is in the OS' cache, though.
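The bgwriter accounting steps above can be modeled in a few lines. This is purely illustrative: the linear interpolation between a fully-cached cost of 1.0 and the default 4.0 is an assumption of this sketch, since the post only says "update random_page_cost accordingly", and the function names are invented:

```python
from collections import Counter

def scan_buffers(buffer_relids):
    """One pass over shared_buffers: count cached pages per relation.
    buffer_relids is the relation id of each buffer page seen."""
    pages_in_cache = Counter()
    for relid in buffer_relids:
        pages_in_cache[relid] += 1
    return pages_in_cache

def derived_random_page_cost(cached_pages, total_pages,
                             cached_cost=1.0, default_cost=4.0):
    """Interpolate the per-relation random_page_cost by cache fraction:
    fully cached -> cached_cost, not cached at all -> default_cost."""
    frac = min(cached_pages / total_pages, 1.0)
    return default_cost - (default_cost - cached_cost) * frac
```

As the post notes, this only sees shared_buffers; pages cached by the OS are invisible to the scan, so the derived cost would err toward the default.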
Re: [HACKERS] Plan targetlists in EXPLAIN output
On Thu, 17 Apr 2008 20:42:49 +0200, Simon Riggs <[EMAIL PROTECTED]> wrote:
On Thu, 2008-04-17 at 12:34 -0400, Tom Lane wrote:

I'm tempted to propose redefining the currently-nearly-useless EXPLAIN VERBOSE option as doing this.

Yes please. Sounds like a good home for other useful things also. I'd like to have an EXPLAIN mode that displayed the plan without *any* changeable info (i.e. no costs, row counts etc). This would then allow more easy determination of whether plans had changed over time. (But EXPLAIN TERSE sounds silly).

Plan = Tree, Tree = XML?

EXPLAIN ANALYZE SELECT * FROM test NATURAL JOIN test2 WHERE id =ANY('{3,666,975,521}');

Nested Loop (cost=17.04..65.13 rows=1 width=8) (actual time=51.835..51.835 rows=0 loops=1)
  Join Filter: (test.value = test2.value)
  -> Bitmap Heap Scan on test (cost=17.04..31.96 rows=4 width=8) (actual time=16.622..16.631 rows=4 loops=1)
       Recheck Cond: (id = ANY ('{3,666,975,521}'::integer[]))
       -> Bitmap Index Scan on test_pkey (cost=0.00..17.04 rows=4 width=0) (actual time=16.613..16.613 rows=4 loops=1)
            Index Cond: (id = ANY ('{3,666,975,521}'::integer[]))
  -> Index Scan using test2_pkey on test2 (cost=0.00..8.28 rows=1 width=8) (actual time=8.794..8.795 rows=1 loops=4)
       Index Cond: (test2.id = test.id)

EXPLAIN XML ...

Nicely parsable, and displayable in all its glory in pgadmin ;)
Re: [HACKERS] count(*) performance improvement ideas
My wife has a snake phobia; besides, I've just started learning Scala.

Just had a look at Scala, it looks nice. Slightly Lispish (like all good languages)...

txid_current()

No... hold on, it is per session, and a session can't have two or more transactions active at once, can it? It could be used to detect rollback.

So the problem is that other functions may be using GD themselves, and your own code is at the mercy of those other functions. Conversely, you shouldn't clear GD, as some other function may be using it.

Exactly.

So you're better off using a single function for everything, and using SD within it?

Since the purpose is to store counts for rows matching certain criteria in a set of tables, you could build a hashtable of hashtables, like:

GD[table name][criteria name][criteria value] = count

This would add complexity: about half a line of code.

But you'd have to create lots of plpgsql trigger functions to wrap it. There isn't any way of telling whether the function is being called for the first time in a transaction. You don't know when to clear it.

The first time in a session, GD will be empty. Clearing it at the start of a transaction would not be useful (clearing it at ROLLBACK would).

It is updating the "real" summary table with the contents of this hash that is the problem, also.

So, basically, if you connect, do one insert, and disconnect, this would be useless. But if you do a zillion inserts, caching the count deltas in RAM would be faster. And if you use persistent connections, you could update the counts in the real table only every N minutes, for instance, but this would need some complicity from the backend.

Regards, Stephen Denne.
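The GD[table name][criteria name][criteria value] idea maps naturally onto nested dictionaries. A sketch in plain Python, with a plain dict standing in for plpythonu's session-global GD; flush_deltas is a hypothetical helper for the "update the real summary table" step discussed above, not code from the thread:

```python
from collections import defaultdict

# stands in for plpythonu's GD: one session-global, non-transactional dict
GD = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

def update_count(table, criteria, value, delta):
    """Accumulate a count delta for rows of `table` matching criteria=value."""
    GD[table][criteria][value] += delta
    return GD[table][criteria][value]

def flush_deltas():
    """Hypothetical COMMIT-time step: drain the deltas that would be
    applied to the real summary table."""
    deltas = [(t, c, v, n)
              for t, crit in GD.items()
              for c, vals in crit.items()
              for v, n in vals.items() if n]
    GD.clear()
    return deltas
```

Note the caveat from the thread applies here too: nothing clears this cache on ROLLBACK, so the deltas could include changes from aborted transactions unless the backend cooperates.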
Re: [HACKERS] count(*) performance improvement ideas
On Thu, 17 Apr 2008 02:48:37 +0200, Stephen Denne <[EMAIL PROTECTED]> wrote:
PFC wrote:

Let's try this quick & dirty implementation of a local count-delta cache using a local in-memory hashtable (ie. {}).

CREATE OR REPLACE FUNCTION update_count( key TEXT, delta INTEGER ) RETURNS INTEGER AS $$
    if key in GD:
        GD[key] += delta
    else:
        GD[key] = delta
    return GD[key]
$$ LANGUAGE plpythonu;

Thanks for the code, this seems to be very much what I was looking for. I don't know plpythonu (nor python), just read a few docs now:

Learn Python, it is a really useful language ;)

"The global dictionary SD is available to store data between function calls. This variable is private static data. The global dictionary GD is public data, available to all Python functions within a session. Use with care."

Does session == transaction or connection? I don't understand the difference between SD and GD, private and public. Where are the context boundaries?

There is no sharing between processes, so:
- both SD and GD are limited to the current session (connection, postgres process); no shared memory is involved
- GD is global between all python functions (global)
- SD is specific to each python function (static)

The big gotcha is that these are all non-transactional: if you roll back, GD and SD stay the same, and when you issue a query, you should assume the state of SD and GD is random (due to previous queries) unless you initialize them to a known value.
Re: [HACKERS] count(*) performance improvement ideas
The whole thing is a bit of an abuse of what the mechanism was intended for, and so I'm not sure we should rejigger GUC's behavior to make it more pleasant, but on the other hand if we're not ready to provide a better substitute ...

In my experiments with materialized views, I identified these problems as "minor" difficulties. Resolving them would allow further abuse ;)

Let's try this quick & dirty implementation of a local count-delta cache using a local in-memory hashtable (ie. {}). Writing the results to stable storage in an ON COMMIT trigger is left as an exercise to the reader ;) Performance isn't that bad: calling the trigger takes about 50 us. An oldskool implementation with a table is at the end; it's about 10x slower.

Example :

    INSERT INTO victim1 (key) VALUES ('one'),('two'),('two');
    INSERT 0 3
    Time : 1.320 ms

    test=# SELECT * FROM get_count();
     key | cnt
    -----+-----
     two |   2
     one |   1

    CREATE OR REPLACE FUNCTION clear_count( ) RETURNS VOID AS $$
        GD.clear()
    $$ LANGUAGE plpythonu;

    CREATE OR REPLACE FUNCTION update_count( key TEXT, delta INTEGER ) RETURNS INTEGER AS $$
        if key in GD:
            GD[key] += delta
        else:
            GD[key] = delta
        return GD[key]
    $$ LANGUAGE plpythonu;

    CREATE TYPE count_data AS ( key TEXT, cnt INTEGER );

    CREATE OR REPLACE FUNCTION get_count( ) RETURNS SETOF count_data AS $$
        return GD.iteritems()
    $$ LANGUAGE plpythonu;

    CREATE TABLE victim( id SERIAL PRIMARY KEY, key TEXT NOT NULL );
    INSERT INTO victim (key) SELECT (random() * 300)::INTEGER::TEXT FROM generate_series( 1,10 );

    CREATE TABLE victim1( id SERIAL PRIMARY KEY, key TEXT NOT NULL );

    \timing

    INSERT INTO victim1 SELECT * FROM victim;
    TRUNCATE TABLE victim1;

    SELECT clear_count();
    INSERT INTO victim1 SELECT * FROM victim RETURNING update_count( key, 1 );
    SELECT * FROM get_count();
    TRUNCATE TABLE victim1;

    CREATE OR REPLACE FUNCTION counter_trigger_f() RETURNS TRIGGER LANGUAGE plpgsql AS $$
    DECLARE
    BEGIN
        IF TG_OP = 'INSERT' THEN
            PERFORM update_count( NEW.key, 1 );
            RETURN NEW;
        ELSEIF TG_OP = 'UPDATE' THEN
            -- update topic
            IF NEW.key != OLD.key THEN
                PERFORM update_count( OLD.key, -1 ), update_count( NEW.key, 1 );
            END IF;
            RETURN NEW;
        ELSE -- DELETE
            PERFORM update_count( OLD.key, -1 );
            RETURN OLD;
        END IF;
    END;
    $$;

    CREATE TRIGGER count_trigger BEFORE INSERT OR UPDATE OR DELETE ON victim1
        FOR EACH ROW EXECUTE PROCEDURE counter_trigger_f();

    SELECT clear_count();
    INSERT INTO victim1 SELECT * FROM victim;
    SELECT * FROM get_count();

    SELECT clear_count();
    TRUNCATE TABLE victim1;
    INSERT INTO victim1 (key) VALUES ('one'),('two'),('two');
    SELECT * FROM get_count();
    DELETE FROM victim1 WHERE key='two';
    SELECT * FROM get_count();
    UPDATE victim1 SET key='three' WHERE key='one';
    SELECT * FROM get_count();
    DELETE FROM victim1;
    SELECT * FROM get_count();

    CREATE TABLE counts( key TEXT PRIMARY KEY, total INTEGER NOT NULL DEFAULT 0 );

    CREATE OR REPLACE FUNCTION table_counter_trigger_f() RETURNS TRIGGER LANGUAGE plpgsql AS $$
    DECLARE
    BEGIN
        IF TG_OP = 'INSERT' THEN
            UPDATE counts SET total=total+1 WHERE key=NEW.key;
            IF NOT FOUND THEN
                INSERT INTO counts (key,total) VALUES (NEW.key,1);
            END IF;
            RETURN NEW;
        ELSEIF TG_OP = 'UPDATE' THEN
            -- update topic
            IF NEW.key != OLD.key THEN
                UPDATE counts SET total=total-1 WHERE key=OLD.key;
                UPDATE counts SET total=total+1 WHERE key=NEW.key;
                IF NOT FOUND THEN
                    INSERT INTO counts (key,total) VALUES (NEW.key,1);
                END IF;
            END IF;
            RETURN NEW;
        ELSE -- DELETE
            UPDATE counts SET total=total-1 WHERE key=OLD.key;
            RETURN OLD;
        END IF;
    END;
    $$;

    CREATE TABLE victim2( id SERIAL PRIMARY KEY, key TEXT NOT NULL );

    CREATE TRIGGER table_count_trigger BEFORE INSERT OR UPDATE OR DELETE ON victim2
        FOR EACH ROW EXECUTE PROCEDURE table_counter_trigger_f();

    SELECT * FROM counts;
    TRUNCATE TABLE victim2;
    INSERT INTO victim2 SELECT * FROM victim;

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
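For illustration, the same count-delta idea can be sketched in plain Python (this is a hypothetical standalone model, not the plpythonu code above; the CountDeltaCache class and its commit() merge step are invented names): per-key deltas accumulate in a dict during the transaction, then get merged into the stable totals "on commit".

```python
from collections import defaultdict

class CountDeltaCache:
    """In-memory per-transaction count deltas, merged into totals on commit."""
    def __init__(self):
        self.deltas = defaultdict(int)

    def update(self, key, delta):
        # What the INSERT/UPDATE/DELETE trigger would call.
        self.deltas[key] += delta
        return self.deltas[key]

    def commit(self, totals):
        # Merge the deltas into the stable counts (here: a plain dict),
        # dropping keys whose count falls to zero.
        for key, d in self.deltas.items():
            totals[key] = totals.get(key, 0) + d
            if totals[key] == 0:
                del totals[key]
        self.deltas.clear()

totals = {}
cache = CountDeltaCache()
for key in ('one', 'two', 'two'):
    cache.update(key, 1)          # INSERT trigger: +1 per row
cache.commit(totals)
print(totals)                     # {'one': 1, 'two': 2}
```

The merge-on-commit step is exactly what the "write the results to stable storage in an ON COMMIT trigger" exercise would do against the counts table.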
Re: [HACKERS] Cached Query Plans (was: global prepared statements)
On Mon, 14 Apr 2008 16:17:18 +0200, Csaba Nagy <[EMAIL PROTECTED]> wrote:

On Mon, 2008-04-14 at 16:10 +0200, Csaba Nagy wrote: ... or plan the query with the actual parameter value you get, and also record the range of parameter values you expect the plan to be valid for. If at execution time the parameter happens to be out of that range, replan, and possibly add a new subplan covering the extra range. This could still work with prepared queries (where you don't get any parameter values to start with) by estimating the most probable parameter range (whatever that could mean), and planning for that.

More on that: recording the presumptions under which the (cached!) plan is thought to be valid would also facilitate setting up dependencies against statistics, to be checked when you analyze tables... and if the key value which your query depends on changed, the analyze process could possibly replan it in the background. LOL, it started with the idea to make small queries faster, and now the brain juice is pouring.

Those "Decision" nodes could potentially lead to lots of decisions (ahem). What if you have 10 conditions in the WHERE, plus some joined ones ? That would make lots of possibilities...

Consider several types of queries :
- The small, quick query which returns one or a few rows : in this case, planning overhead is large relative to execution time, but I would venture to guess that the plans always end up being the same.
- The query that takes a while : in this case, planning overhead is nil compared to execution time; better to replan every time with the params.
- The complex query that still executes fast because it doesn't process a lot of rows and postgres finds a good plan (for instance, a well optimized search query). These would benefit from reduced planning overhead, but they also typically end up having many different plans depending on the search parameters. Besides, such queries are likely to be dynamically generated.
So, would it be worth it to add all those features just to optimize those ? I don't know...
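Csaba's replan-on-out-of-range idea above can be sketched as a toy model (everything here is invented for illustration: the RangePlanCache class, the planner callback, and the ranges themselves; a real planner would derive the validity range from the statistics it used):

```python
class RangePlanCache:
    """Cache a plan together with the parameter range it was planned for;
    replan when a later parameter falls outside the recorded range."""
    def __init__(self, planner):
        self.planner = planner   # planner(param) -> (plan, (lo, hi))
        self.entries = {}        # sql -> (plan, lo, hi)
        self.replans = 0

    def get_plan(self, sql, param):
        entry = self.entries.get(sql)
        if entry is None or not (entry[1] <= param <= entry[2]):
            # Cache miss, or parameter outside the plan's validity range.
            plan, (lo, hi) = self.planner(param)
            self.entries[sql] = (plan, lo, hi)
            self.replans += 1
            return plan
        return entry[0]

# Toy planner: an index scan for selective parameters, a seq scan otherwise,
# each plan assumed valid for a band around the given value.
def planner(param):
    if param < 100:
        return 'index scan', (0, 99)
    return 'seq scan', (100, 10**9)

cache = RangePlanCache(planner)
assert cache.get_plan('q', 5) == 'index scan'
assert cache.get_plan('q', 50) == 'index scan'   # in range: cached plan reused
assert cache.get_plan('q', 500) == 'seq scan'    # out of range: replanned
assert cache.replans == 2
```

The "add a new subplan covering the extra range" refinement would keep several (plan, range) entries per query instead of overwriting the single one.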
Re: [HACKERS] Cached Query Plans (was: global prepared statements)
Bind message behaviour was modified : if the user asks for execution of a named prepared statement, and the named statement does not exist in PG's prepared statements cache, instead of issuing an error and borking the transaction, it Binds to an empty statement that takes no parameters and returns no result. Parameters sent by the user are consumed but not used.

You mentioned the need for a wire protocol change to allow this. Why can't this be controlled with a server variable, like SET auto_prepare = 'true'?

Actually, thanks to the hack, the wire protocol doesn't change. Explanation :

- Send Parse(SQL) to unnamed statement + Bind unnamed statement => works as usual (no cache)
- Send only Bind (named statement) with a statement name that is not found in the cache => doesn't raise an error; instead it informs the application that the statement does not exist. The application can then prepare the statement (send a Parse message with SQL and a name) and give it a name. I used the SQL itself as the name, but you can use anything else. The application can then send the Bind again, which will (hopefully) work.

So, here, the information ("cache" or "don't cache") is passed from the client to the server in a hidden way : it depends on what function you use to send the query (unnamed statements are not cached, named statements are cached). There is no protocol change, but new information is provided to the server nonetheless.

The downside is that the application needs to be modified (only a little, though), and applications that expect exceptions on "Statement does not exist" will break, hence the necessity of a GUC to control it. It was just a quick & dirty test to see if this way of doing it was an option to consider or not. Apparently it works, but whether it is The Right Way remains to be seen...
Re: [HACKERS] Cached Query Plans (was: global prepared statements)
If cached plans would be implemented, the dependence on parameter values could be solved too: use special "fork" nodes in the plan which execute different sub-plans depending on special parameter values/ranges, possibly looking up the stats at runtime, so that the plan is in a compiled state with the "decision points" wired in. This of course would mean a lot heavier planning and possibly a lot bigger plans, but you could afford that if you cache the plan. You could even have a special command to plan a query this way.

And, the "fork node" could mutter to itself "Strange, I'm getting 1 row instead of the 2 for which I was planned, perhaps I should switch to a different plan..."

I have made another very simple hack to test another option :

Bind message behaviour was modified : if the user asks for execution of a named prepared statement, and the named statement does not exist in PG's prepared statements cache, instead of issuing an error and borking the transaction, it Binds to an empty statement that takes no parameters and returns no result. Parameters sent by the user are consumed but not used.

The application was modified thusly : calls to pg_query_params were changed to calls to the following function :

    function pg_query_cached( $sql, $params )
    {
        // Try to execute it, using the query string as statement name.
        $q = pg_execute( $sql, $params );
        if( !$q ) die( pg_last_error() );

        // If it worked, return result to caller.
        if( pg_result_status( $q, PGSQL_STATUS_STRING ) != "" )
            return $q;

        // If we got an empty query result (not a result with 0 rows, which is
        // valid) then prepare the query, using the query string as statement name
        $q = pg_prepare( $sql, $sql );
        if( !$q ) die( pg_last_error() );

        // and execute it again
        $q = pg_execute( $sql, $params );
        if( !$q ) die( pg_last_error() );
        return $q;
    }

Pros :
- It works
- It is very very simple
- The user can choose between caching plans or not by calling pg_query_params() (no cached plans) or pg_query_cached() (cached plans)
- It works with persistent connections

Cons :
- It is too simple
- Plans are cached locally, so memory use is proportional to the number of connections
- It is still vulnerable to search_path problems
Re: [HACKERS] Cached Query Plans (was: global prepared statements)
Why limit ourselves to Oracle? This is how all major proprietary RDBMSs do it.

Thanks for the links. Very interesting. The DB2 document especially mentions an important point : in order to make their planner/optimizer smarter, they had to make it slower, hence it became crucial to cache the plans. Contrast this with MySQL, where using prepared statements gains nothing : the "optimizer" does so little work that it actually doesn't matter.

So, basically, Oracle :
- Parses the query every time (identifies tables, permissions etc) (soft parse)
- From that parsed query it looks up a cached plan (the lookup key could then be different depending on the schema etc)
- If there is none, it must plan the query (hard parse).

Also the Oracle doc mentions that the soft parsing should be avoided by using prepared statements in the application (ie Parse once and Bind lots of times). So, Oracle will redo the parsing + permissions check each time, unless prepared statements are used, in which case it's direct execution.

And DB2 : er, the document is not very clear about what it actually does, but the stats look nice ;)

I liked your global prepared statements idea much better. Naming the statements is no problem: DB frontends do that for you anyway sometimes.

Hm. The "global statements" and the cache would complement each other actually. Why not. When the user wants to name the statements, he can do so (and perhaps control who can execute what, etc, like with stored procs). Permission checking overhead will be there at each execution. Should the plan be cached locally ? (RAM consumption times N backends...) Cached per user once permissions have been checked ? (avoids the overhead of rechecking permissions) What about the search path ? (I'd force the global statements to use the default search path no matter what; being explicit is better than "why does it stop working ?") Can the application or the database library name the statements ? I'm not so sure.
This could work for compiled languages (but what about when you run several applications ? or several versions of the same application ? do we need uniqueness of statement names from all developers all over the world ?) Solution : make each application use a different user name, with global prepared statements only visible to the user that created them, perhaps. This conflicts with some desirable features, though. It needs more thinking.

What about non-compiled languages ? It will not be possible to generate a list of statements beforehand... And queries are also constructed dynamically by frameworks such as Rails, which makes naming them impossible, but caching the plans would work well. So, some situations would benefit from a plan cache.

Frankly, I think you're better off storing them in a table. Shared memory is a limited resource and you cannot change how much you've allocated after the server has started.

I'd say that unless you have a perverse application that will try all the permutations of column names just to make sure the query is different every time, how many different queries would you want to cache ?... probably less than 1000... so it wouldn't take more than a couple of megabytes...

It does mean you'll have to serialise/deserialise them, but this will be cheaper than replanning, right?

What would be the overhead of a catalog lookup to get a cached plan for a statement that returns 1 row ? Would the catalog cache make it fast enough ? And what about deserialization ?...

I am not too sure that plans and statistical counters should be stored together... Not sure either. Probably plans should go in one place, and statistics should go to the stats collector (I know he's not quite ready for this ;)). That's the problem...

Hm, a limit on how much memory can be used for plans (query_plan_cache_size GUC?), and a LRU/LFU expiration of old plans? Now it gets hairy ;) Yes, memory size should be limited.
But how to make a LRU cleaner which doesn't create lots of contention ?... Luckily, with a hash having a fixed number of buckets, it is easier (clean a bucket every N seconds, for instance).

Perhaps a GUC for controlling the query cache should have three values:
- none -- don't cache any statement
- smart -- use heuristics for deciding whether to cache it
- all -- force caching of all queries (for uncommon/statistical/testing purposes)

I would not volunteer to write that heuristic ;) Although there would be a very simple solution : if time to parse > some percentage of time to execute, then cache. The hairiness is in the plan's dependence (or independence) on parameter values; ideally we only want to cache plans that would be good for all parameter values, and only the user knows that precisely. Although it could be possible to examine the column histograms...

(like mysql, /* flags */ SELECT blah ) I don't like the hint flags. They tend to ha
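The "clean a bucket every N seconds" eviction idea could look roughly like this minimal sketch (the class, bucket count, and max_age parameter are all invented; a real shared-memory version would take one lock per bucket per sweep instead of a global LRU lock):

```python
import time

class BucketedPlanCache:
    """Fixed number of buckets; instead of maintaining a global LRU list
    (a contention point), sweep one bucket per call and evict entries
    that have not been used recently."""
    NBUCKETS = 8

    def __init__(self, max_age=300.0):
        self.buckets = [dict() for _ in range(self.NBUCKETS)]
        self.max_age = max_age          # seconds before an unused plan expires
        self.next_bucket = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % self.NBUCKETS]

    def put(self, key, plan, now=None):
        self._bucket(key)[key] = (plan, now if now is not None else time.time())

    def get(self, key, now=None):
        b = self._bucket(key)
        if key in b:
            plan, _ = b[key]
            # Refresh the last-used timestamp on every hit.
            b[key] = (plan, now if now is not None else time.time())
            return plan
        return None

    def sweep(self, now=None):
        """Called periodically; only touches (and would only lock) one bucket."""
        now = now if now is not None else time.time()
        b = self.buckets[self.next_bucket]
        for key in [k for k, (_, ts) in b.items() if now - ts > self.max_age]:
            del b[key]
        self.next_bucket = (self.next_bucket + 1) % self.NBUCKETS

cache = BucketedPlanCache(max_age=300.0)
cache.put('SELECT 1', 'plan1', now=0.0)
assert cache.get('SELECT 1', now=10.0) == 'plan1'
for _ in range(BucketedPlanCache.NBUCKETS):     # one full sweep cycle
    cache.sweep(now=400.0)
assert cache.get('SELECT 1', now=400.0) is None  # expired and evicted
```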
Re: [HACKERS] Cached Query Plans (was: global prepared statements)
On Fri, Apr 11, 2008 at 12:34 PM, PFC <[EMAIL PROTECTED]> wrote: Well, I realized the idea of global prepared statements actually sucked, so I set on another approach thanks to ideas from this list, this is caching query plans. Well, that's a blatantly bad realization. Perhaps you should do more research. No, what I meant is that the "global prepared statements" as I tried to implement them before weren't that good... I think simple caching based on the query text itself is preferable to having to name each of your queries, extract them from your programs and replace them by executes, issue a "create statement" command for each of them, etc. Few people would actually use that feature because it would mean lots of modifications to the application, so all the applications that have to be compatible with other databases would not use the feature (*) It could be useful for permissions and fine access control, though, but views and stored procs already provide that functionality... (*) = Note that caching the plans based on the query text (with $ params) from a parse message will not provide caching for oldskool queries with params inside in the form of escaped strings. This is good, because it means the safer solution (using $-quoted params) will also be the faster solution. And in the application, only a very small part of the code needs to be changed, that's the DB abstraction layer. Doesn't Oracle do this now transparently to clients? Of course it does, and it has since the late 80's I believe. Oracle keeps a statement/plan cache in its shared memory segment (SGA) that greatly improves its performance at running queries that don't change very often. Can we have more details on how Oracle does it ? For "inspiration"... 
Here is what I'm thinking about : don't flame me too much about implementation issues, this is just throwing ideas in the air to see where they'll fall ;)

* global plan cache in shared memory, implemented as a hashtable, the hash key being (search_path, query_string)

Doubt : can a plan be stored in shared memory ? Will it have to be copied to local memory before being executed ?

This stores :
- the plans (not for all keys, see below)
- the stats :
  - number of times this query has been executed
  - total, min and max wallclock time and CPU time spent planning this query
  - total, min and max wallclock time, CPU time and RAM spent executing this query
  - total, min and max number of rows returned
  - last timestamp of execution of this query

There should be separate GUCs to control this :
- should the whole thing be activated ?
- should the cache be active ? or just the stats ? and what stats ?

There should also be a way to query this to display the statistics (ie "what query is killing my server ?"), and a way to purge old plans.

* every time a Parse message comes up :
- look if the (search_path, query_string) is in the cache
- if it is in the cache :
  - if there is a cached plan, make the unnamed statement point to it, and we're done.
  - if there is no cached plan, prepare the query, and put it in the unnamed statement. Now, the query has been parsed, so we can decide if it is cacheable. Should this be done in Parse, in Bind, or somewhere else ? I have no idea. For instance, queries which contain VALUES() or IN( list of consts ) should not be cached, since the IN() is likely to change all the time; it would just thrash the cache. Using =ANY( $1 ) instead will work with cached plans.

Also, will a plan to be cached have to be prepared with or without the parameters ? That's also an interesting question... Perhaps the user should also be able to specify whether to cache a plan or not, or whether to use the params or not, with hint flags in the query string ?
(like mysql, /* flags */ SELECT blah )

Now, if the query is cacheable, store it in the cache, and update the stats. If we decided to store the plan, do that too. For instance, we might decide to store the plan only if this query has been executed a certain number of times, etc.

* In the Execute message, if a cached plan was used, execute it and update the stats (time spent, etc).

Now, about contention : since this is one shared hashtable for everyone, it will be fought for... However, the lock on it is likely to be held for a very small time (much less than a microsecond), so would it be that bad ? Also, GUCs can be used to mitigate the contention; for instance, if the user is not interested in the stats, the thing becomes mostly read-only.
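The per-entry statistics listed above could be modelled like this (a hypothetical sketch: the entry fields mirror the stats enumerated in the proposal, but the class and function names are mine, and a real implementation would live in shared memory, not a Python dict):

```python
import dataclasses

@dataclasses.dataclass
class PlanCacheEntry:
    """One entry of the proposed plan cache, keyed on
    (search_path, query_string): a plan plus execution stats."""
    plan: object = None
    n_executions: int = 0
    total_time: float = 0.0
    min_time: float = float('inf')
    max_time: float = 0.0

    def record(self, elapsed):
        # Update the running totals after each Execute message.
        self.n_executions += 1
        self.total_time += elapsed
        self.min_time = min(self.min_time, elapsed)
        self.max_time = max(self.max_time, elapsed)

cache = {}

def on_execute(search_path, sql, elapsed):
    entry = cache.setdefault((search_path, sql), PlanCacheEntry())
    entry.record(elapsed)
    return entry

e = on_execute('public', 'SELECT * FROM test WHERE id=$1', 0.000067)
on_execute('public', 'SELECT * FROM test WHERE id=$1', 0.000125)
assert e.n_executions == 2
assert e.min_time == 0.000067 and e.max_time == 0.000125
```

A "what query is killing my server ?" view would then just be a sort of these entries by total_time.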
Re: [HACKERS] Cached Query Plans
Well if you're caching per-connection then it doesn't really matter whether you do it on the client side or the server side, it's pretty much exactly the same problem.

Actually I thought about doing it on the server since it would then also work with connection pooling. Doing it on the client means the client has to maintain state, which is not possible in a pool...

Unsurprisingly most drivers do precisely what you're describing. In Perl DBI for example you just change $dbh->prepare("") into $dbh->prepare_cached("") and it does exactly what you want. I would expect the PHP drivers to have something equivalent.

Well, you clearly have "expectations" about PHP, lol. PHP takes pride in always aiming below your expectations, not above ;) It has no such feature. Also pg_query_params() is SLOWER than pg_query(), which makes you choose between clean & slow, and string quoting hell. Perhaps I should patch PHP instead... Or perhaps this feature should be implemented in pgpool or pgbouncer.

But, using prepared statements with persistent connections is messy, because you never know if the connection is new or not. If you were to fix *that* then both this problem and others (such as setting up desired SET-parameter values) would go away.

True. Languages that keep a long-running context (like application servers etc) can do this easily. Although in the newer versions of PHP it's not so bad; pconnect seems to work (ie. it will issue ROLLBACKs when the script dies, reset session variables like enable_indexscan, etc), so the only remaining problem seems to be prepared statements. And again, adding a method for the application to know if the persistent connection is new or not will not work in a connection pool...

Perhaps a GUC flag saying EXECUTE should raise an error but not kill the current transaction if the requested prepared statement does not exist ? Then the application would issue a PREPARE.
It could also raise a non-fatal error when the tables have changed (column added, for instance) so the application can re-issue a PREPARE. But I still think it would be cleaner to do it in the server. Also, I rethought about what Gregory Stark said : The contention on the shared cache is likely to negate much of the planning savings but I think it would still be a win. If a shared plan cache is implemented, it will mostly be read-only, ie. when the application is started, new queries will come, so the plans will have to be written to the cache, but then once the cache contains everything it needs, it will not be modified that often, so I wouldn't think contention would be such a problem... It's not so easy as all that. Consider search_path. Consider temp tables. Temp tables : I thought plan revalidation took care of this ? (After testing, it does work, if a temp table is dropped and recreated, PG finds it, although of course if a table is altered by adding a column for instance, it logically fails). search_path: I suggested to either put the search_path in the cache key along with the SQL string, or force queries to specify schema.table for all tables. It is also possible to shoot one's foot with the current PREPARE (ie. search_path is used to PREPARE but of course not for EXECUTE), and also with plpgsql functions (ie. the search path used to compile the function is the one that is active when it is compiled, ie at its first call in the current connection, and not the search path that was active when the function was defined)... 
    SET search_path TO DEFAULT;
    CREATE SCHEMA a;
    CREATE SCHEMA b;
    CREATE TABLE a.test( v TEXT );
    CREATE TABLE b.test( v TEXT );
    INSERT INTO a.test VALUES ('This is schema a');
    INSERT INTO b.test VALUES ('This is schema b');

    CREATE OR REPLACE FUNCTION test_search_path() RETURNS SETOF TEXT LANGUAGE plpgsql AS $$
    DECLARE
        x TEXT;
    BEGIN
        FOR x IN SELECT v FROM test LOOP
            RETURN NEXT x;
        END LOOP;
    END;
    $$;

    test=> SET search_path TO a,public;
    test=> SELECT * FROM test_search_path();
     test_search_path
    ------------------
     This is schema a

    test=> \q
    $ psql test
    test=> SET search_path TO b,public;
    test=> SELECT * FROM test_search_path();
     test_search_path
    ------------------
     This is schema b

    test=> SET search_path TO a,public;
    test=> SELECT * FROM test_search_path();
     test_search_path
    ------------------
     This is schema b
Re: [HACKERS] Cached Query Plans
I think what he's referring to is persistently caching plans so that new connections can use them. That makes a lot more sense if you have lots of short-lived connections like a stock php server without persistent connections turned on or a connection pooler. You can prepare queries but they only live for a single web page so you don't get any benefit. Let me explain a little further. Persistent database connections are the way to go for web applications, because the connection is only going to get used for a few queries, and the time needed to start the postgres process and establish the connection is often significant compared to the time used for the actual queries. Connection pooling can also be used, you get the idea. So, using persistent database connections, it makes sense to use prepared statements to speed up execution of simple queries, like those returning a few rows with a few joins and no complicated WHERE clauses, which is actually most of the queries on your average website. As shown in my previous message, the CPU time spent planning the query can be as much or even a lot more than CPU time spent actually executing the query. But, using prepared statements with persistent connections is messy, because you never know if the connection is new or not, if it contains already prepared statements or not, you'd have to maintain a list of those statements (named) for every query in your application, and when someone changes a query, it's a mess, not to mention queries generated by the ORM like Rails etc. The idea in this "proof of concept" was : Wouldn't it be nice if Postgres could just say "Hey, I already planned that query, I'll reuse that plan". And it is very easy to recognize a query we've seen before, since $-params takes the parameters out of the equation, and eliminates parsing time and string quoting hell. 
Storing the cached plans as prepared statements in the connection-local hashtable makes sense : it doesn't use that much memory anyway, and there are no locking and contention problems. Just like PREPARE and EXECUTE. Personally I would like to see this, not primarily for the performance gains, but for the possibility of managing when plans change -- ie, plan stability. Unfortunately, this isn't compatible with a non-shared memory approach... But there is resistance from other quarters about the reliability hit of having the plan data structures in shared memory. I agree. Hence the idea to put them in non-shared memory, local to a process. Perfectly useless when using non-persistent connections, but very powerful when using persistent connections. I still don't see why you would need a wire protocol change. Because I'd think that sometimes the client will not want to use a cached plan, when the query is rarely used (no need to waste memory to cache the plan), or it is complex and needs to be replanned according to parameter values every time. Sure, the client could use the oldskool "send query as text with parameters inside" but that's back to string escaping hell, and it's ugly. It would be nicer to have a bool "cache_plan". You would just have clients prepare plans normally and stash them in shared memory for other backends in a hash table keyed by, well, something, perhaps the original query text. Query text seems to be the simplest, better not ask the user to come up with distinct names when the query text will be a perfect key. Besides, hand-generated names might turn out not to be so distinct after all... Then whenever you're asked to prepare a query you go check if someone else has already done it for you and find an already generated plan in the shared memory hash table. The contention on the shared cache is likely to negate much of the planning savings but I think it would still be a win. 
But what's really interesting to me is then providing an interface to see and manipulate that cache. Then you could see what plans other backends are using for queries, mark plans as being acceptable or not, and even revoke users' permissions to execute queries which aren't already present and marked as being acceptable.

If it can be made to work with a shared cache, why not, but that would be more complex. You'd also have to deal with permissions, different users with different privileges, etc. But local would probably be simplest (and faster). Also, there will be problems with the schema search path. Perhaps a query should be required to specify the fully qualified table names (schema.table) for all tables in order to be cacheable.
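The connection-local cache keyed on query text amounts to something like this toy model (the Connection class and execute_cached wrapper are illustrative stand-ins, not a real driver API; this is the same idea as Perl DBI's prepare_cached):

```python
class Connection:
    """Toy connection: statements must be prepared (planned) before execute."""
    def __init__(self):
        self.prepared = {}    # statement name -> "plan"
        self.plan_count = 0   # how many times we paid for parse + plan

    def prepare(self, name, sql):
        self.plan_count += 1              # the expensive part: parse + plan
        self.prepared[name] = ('plan for', sql)

    def execute(self, name, params):
        return (self.prepared[name], params)

def execute_cached(conn, sql, params):
    """Use the query text itself as the statement name, so repeated
    executions of the same $-parameterized text skip planning."""
    if sql not in conn.prepared:
        conn.prepare(sql, sql)
    return conn.execute(sql, params)

conn = Connection()
execute_cached(conn, 'SELECT * FROM test WHERE id=$1', (1,))
execute_cached(conn, 'SELECT * FROM test WHERE id=$1', (2,))
assert conn.plan_count == 1   # planned once, executed twice
```

Because the $-params are outside the query text, the same text recurs exactly, which is what makes it a usable cache key with no naming scheme needed.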
[HACKERS] Cached Query Plans (was: global prepared statements)
Well, I realized the idea of global prepared statements actually sucked, so I set off on another approach thanks to ideas from this list: caching query plans.

First, let's see if there is low hanging fruit with the typical small, often-executed queries that are so frequent on websites. Tables test, test2 and test3 contain id (integer primary key) and another integer field. There are 100K rows in each.

First, the simplest query :

    SELECT * FROM test WHERE id = $1

    110 us : Send query as text (PHP:pg_query -> PQexec)
    125 us : Parse+Bind (PHP:pg_query_params -> PQexecParams)
     67 us : Execute a previously prepared statement (PHP:pg_execute -> PQexecPrepared)

A slightly more complex one, but still pretty classic :

    SELECT * FROM (SELECT * FROM test WHERE id>$1 ORDER BY id LIMIT 5) AS a
    NATURAL LEFT JOIN test2 NATURAL LEFT JOIN test3 ORDER BY id

    523 us : Send query as text (PHP:pg_query -> PQexec)
    580 us : Parse+Bind (PHP:pg_query_params -> PQexecParams)
    148 us : Execute a previously prepared statement (PHP:pg_execute -> PQexecPrepared)

OK, so there is low hanging fruit, since the parsing+planning time of those is longer than doing the query itself. Since the Parse message includes a $-parameterized query that is to be prepared, it seems logical to put the caching logic there : the query string (without parameters) makes a nice cache key.

So I made a really quick and really dirty experiment without changing the wire protocol between client and server. This is only a "proof of concept".

Try #1 : in exec_parse_message(), if the statement is named, look it up in the prepared statements cache; if it is found, return at once and do nothing else. To exploit this, I issue a pg_prepare() followed by pg_execute() for every query, whether or not the statement exists. If it already exists, pg_prepare() now does nothing (except losing a little time).

Results :

     88 us : simple query
    173 us : complex query

So, the timings are between a simple execute and a plan+execute.
It provides a nice performance gain versus replanning every time, but it's not perfect.

Try #2 : again, in exec_parse_message(), if the statement is unnamed, I use the query string as the statement name and search for the plan in the prepared statements hash table. If it is not found, it is prepared. Then I make the unnamed statement plan point to this. Of course, this is dangerous since it probably introduces a dozen crash bugs, but for this proof of concept, it's OK. Client code is unchanged; PQexecParams will benefit from the plan caching, since it always sends a Parse+Bind message using the unnamed statement.

Results are identical to executing an execute on a prepared statement, modulo a few microseconds. This means the overhead of sending the Parse message, and of the server ignoring it when the statement is cached, is negligible.

So, where to go from that ? I don't see a way to implement this without a (backwards-compatible) change to the wire protocol, because the clients will want to specify when a plan should be cached or not. Since the user should not have to name each and every one of the statements for which they want plan caching, I see the following choices :

- Add a new Parse+Bind command, which gets the $-parameterized SQL and the parameters. If the plan is cached, grab it and execute, else prepare and execute. Add a flag to allow the client to specify if he wants caching or not.
  Pros : only one message, faster.
  Cons : SQL is transmitted in full, useless most of the time, but this overhead is rather small.
- Send the SQL with Bind as statement name, add a flag to Bind telling it to report a cache miss instead of raising an error, then have the client send a Parse and Bind again.
- Should there be one common hashtable for named prepared statements and cached plans, or two hashtables ? Using the SQL string as the statement name is not clean.
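The first option (a combined Parse+Bind carrying a caching flag) might behave like this sketch (a toy server model; the Session class is invented, and is_cacheable is a deliberately crude stand-in for the real "no VALUES(), no IN( list of consts )" test the message describes):

```python
class Session:
    """Toy backend session with a plan cache."""
    def __init__(self):
        self.cache = {}
        self.plans_built = 0

    def plan(self, sql):
        self.plans_built += 1         # the hard parse
        return ('plan', sql)

    def execute(self, plan, params):
        return (plan, params)

def is_cacheable(sql):
    # Crude heuristic: queries with inlined constant lists would just
    # thrash the cache; =ANY($1) works with cached plans, IN (1,2,3) doesn't.
    u = sql.upper()
    return ' IN (' not in u and 'VALUES' not in u

def exec_parse_bind(session, sql, params, want_cache):
    """Combined Parse+Bind: look the $-parameterized SQL up in the plan
    cache; on a miss, plan it and cache it if allowed."""
    plan = session.cache.get(sql) if want_cache else None
    if plan is None:
        plan = session.plan(sql)
        if want_cache and is_cacheable(sql):
            session.cache[sql] = plan
    return session.execute(plan, params)

s = Session()
exec_parse_bind(s, 'SELECT * FROM t WHERE id=$1', (1,), True)
exec_parse_bind(s, 'SELECT * FROM t WHERE id=$1', (2,), True)
assert s.plans_built == 1             # second call hit the cache
exec_parse_bind(s, 'SELECT * FROM t WHERE id IN (1,2,3)', (), True)
exec_parse_bind(s, 'SELECT * FROM t WHERE id IN (1,2,3)', (), True)
assert s.plans_built == 3             # not cacheable: replanned each time
```

The second option (Bind reporting a cache miss) differs only in that the miss path goes back to the client instead of being resolved server-side.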
[HACKERS] Dumb Micro-Optimization
* Dumb Optimization #1:

- Add an executorFunc function pointer to struct PlanState
- In ExecProcNode.c -> ExecProcNode() :
  - upon first execution, set executorFunc to the function corresponding to the node type
  - next calls use the function pointer

Effect : removes a switch (nodeTag(node)) which otherwise executes for every tuple returned by every node.

Gain :
- 4% CPU time on SELECT sum(an integer column) FROM a table of one million rows
- nil on selects returning few rows, obviously
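A Python analogue of the trick (resolve the per-node dispatch once, cache the function pointer on the node, and call it directly thereafter) looks like this; the node layout and executor function names are invented for illustration:

```python
def exec_seq_scan(node):
    return ('seqscan tuple', node['relation'])

def exec_index_scan(node):
    return ('indexscan tuple', node['relation'])

# The equivalent of the switch (nodeTag(node)) table.
DISPATCH = {'SeqScan': exec_seq_scan, 'IndexScan': exec_index_scan}

def exec_proc_node(node):
    func = node.get('executor_func')
    if func is None:
        # First execution: resolve the node type once and cache the pointer.
        func = DISPATCH[node['type']]
        node['executor_func'] = func
    # Every later call is a direct call, no per-tuple type dispatch.
    return func(node)

node = {'type': 'SeqScan', 'relation': 'test'}
assert exec_proc_node(node) == ('seqscan tuple', 'test')
assert node['executor_func'] is exec_seq_scan   # resolved and cached
```

The win comes from executing the type lookup once per node instead of once per returned tuple, which is why it only shows up on queries that pump millions of tuples through the executor.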
Re: [HACKERS] Free Space Map data structure
PFC wrote: About the FSM : Would it be possible to add a flag marking pages where all tuples are visible to all transactions ? (kinda like frozen I think)

Ah, the visibility map. That's another line of discussion. The current plan is to not tie that to the FSM, but implement it separately. There are some infrastructure changes that are needed for both, like the "map forks" (see recent FSM discussions), which is why we need to have a design for the FSM as well before we start implementing the visibility map. It's definitely something I want to do for 8.4.

Ahh, yes, yes, yes ;) yes !

Here's my rough plan:
1. Common "map forks" support
2. Rewrite FSM
3. Implement visibility map, to allow partial vacuums
4. Implement index-only scans, using the visibility map.

Throwing in another idea that is related to partial vacuums (perhaps ?) :

Consider a table that is often inserted into (archive, forum posts, log, whatever) and that we want to CLUSTER. In that case it would be beneficial to only cluster the "tail" of the table, where all the recently inserted rows are.

Example :
- Table is clustered.
- Insert rows in random order, update some, delete some, etc, supposing inserts happen at the end.
- Table now looks like "head":[clustered part with some holes] plus "tail":[rows in random order]
- As long as the "tail" fits in disk cache, the random order isn't a problem.
- So, when the "tail" reaches a certain size :
  - Grab it, sort it by cluster order, write it again in the heap
  - Update the indexes in a manner similar to VACUUM (ie. bulk update)
- Table now looks like "head":[clustered part with some holes] plus "tail":[clustered]

This does not remove the holes in the "head", but is this really a problem ? In this usage scenario, I don't think so. Regular CLUSTER could also be run, much less frequently than before, and it will also be much faster since the rows are approximately in order already.
This approach is complementary to the "auto-cluster" approach where the index is asked "where should I insert that row ?" (this will be used to fill the holes). Auto-cluster will work well on tables that are updated very often. But starting from an empty table, or an already clustered table, or in a mostly-insert scenario, the index will have no idea where to put that row...

The goodness of this approach is that :
- As long as the "tail" fits in RAM, sorting it is going to be very fast (unlike the current CLUSTER).
- Bulk index updates will also be fast as long as the list of changes to apply to the index fits in memory.
- Therefore it will block the table for much less time than good old CLUSTER.
- Therefore it will get more use ;)

How to make it non-locking ? By doing something like this in pseudo-SQL :

INSERT INTO table
SELECT * FROM (DELETE FROM table WHERE date > last time we did this RETURNING *)
ORDER BY cluster_columns;
VACUUM;

That is, take the "tail" of the table (as above), sort it, insert it back in big chunks, and mark the old rows as deleted just like a regular delete would have done. Then VACUUM. In this case you now have :

"head":[clustered part with some holes] + "big hole" + "tail":[clustered rows]

Is the "big hole" a problem ? Probably not, it will be marked as free space by VACUUM and used for new inserts. A week later we get this :

"head":[clustered part with some holes] + [rows in random order] + "tail":[clustered rows]

Repeating the process above will make the "tail" grow again and the "hole" will stay more or less in the same place.

Another way to do it is to use partitions :
- "archive" table
- "current" table
Periodically the rows from "current" are transferred to the "archive" table and sorted in the process. Then "current" is truncated. This works, but it is blocking, and you have the overhead from partitioning...
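The key property of the scheme — only the tail ever needs sorting, the clustered head stays put — is easy to state in code. A toy sketch, assuming rows are represented by their cluster-key values in a plain array; the real operation would of course rewrite heap pages and bulk-update indexes.

```c
#include <assert.h>
#include <stdlib.h>

/* Ordering for the cluster key (plain ints in this toy model). */
static int cmp_key(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

/*
 * Re-cluster only rows[tail_start .. n-1]: the head of the heap is
 * already in cluster order from the previous pass, so a full CLUSTER
 * (sorting all n rows) is unnecessary.
 */
void recluster_tail(int *rows, size_t tail_start, size_t n)
{
    qsort(rows + tail_start, n - tail_start, sizeof(int), cmp_key);
}
```

Since the tail is the only part sorted, the work (and the lock time) is proportional to the recently inserted data, not the whole table.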
Re: [HACKERS] Free Space Map data structure
About the FSM : Would it be possible to add a flag marking pages where all tuples are visible to all transactions ? (kinda like frozen I think) This could be useful to implement index-only scans, for count(), or to quickly skip rows when OFFSET is used, or to use only the index when the selected columns are all in the index. Of course if the page is flagged as "may contain updated tuples", then it would have to look in the heap. But, for tables that are not randomly updated (for instance tables that are mostly appended to, like forum posts, or logs, or the huge archive table, etc) it could save a lot of heap lookups and IO.
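The proposed flag amounts to a per-page bitmap consulted before touching the heap. A minimal sketch of the idea, with made-up names (`all_visible`, `check_visibility`) and a counter standing in for the heap I/O that gets skipped:

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

/* One "all tuples visible to everyone" bit per heap page. */
static bool all_visible[NPAGES];
static int  heap_fetches;   /* heap page reads the scan had to do */

/*
 * An index scan asking "is this tuple visible?" can answer from the
 * map alone when the page is flagged; otherwise it must charge one
 * heap fetch to check tuple visibility the usual way.
 */
bool tuple_is_visible(int page)
{
    if (all_visible[page])
        return true;      /* index-only: heap never touched */
    heap_fetches++;       /* may contain updated tuples: look in heap */
    return true;          /* (toy model: assume visible after the check) */
}
```

On an append-mostly table where almost every page ends up flagged, nearly all of those heap fetches disappear, which is exactly the IO saving described above.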
Re: [HACKERS] modules
On Sat, 05 Apr 2008 02:17:10 +0100 Gregory Stark <[EMAIL PROTECTED]> wrote: I was inclined to dismiss it myself but I think the point that's come up here is interesting. The ISP has to not just install an RPM or type make install in some source tree -- but actually log into each customer's database and run an SQL script. That does seem like more work and more risk than a lot of ISPs will be willing to take on. On (k)Ubuntu you can apt-get install postgresql-contrib-8.3 which puts everything in the right places, all you have to do then is to run the sql scripts in /usr/share/postgresql/8.3/contrib/ as user postgres... But of course you need the ISP to do it for you if you are not superuser. Some will bother to run a few commands for a user, some won't...
Re: [HACKERS] COPY Transform support
INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date, '0000-00-00' ), ... FROM mydump WHERE (FKs check and drop the broken records);

What do we gain against the current way of doing it, which is :

COPY loadtable FROM 'dump.txt' WITH ...
INSERT INTO destination_table(...) SELECT ... FROM loadtable;

You read and write the data only once instead of twice (faster) if you want to import all of it. If you just want to compute some aggregates and store the results in a table, you just read the data once and don't write it at all. The advantages are the same as your proposed transformations to COPY, except I feel this way of doing it opens more options (like, you can combine columns, check FKs at load, do queries on data without loading it, don't necessarily have to insert the data in a table, don't have to invent a new syntax to express the transformations, etc).
Re: [HACKERS] COPY Transform support
On Thu, 03 Apr 2008 16:57:53 +0200, Csaba Nagy <[EMAIL PROTECTED]> wrote: On Thu, 2008-04-03 at 16:44 +0200, PFC wrote: CREATE FLATFILE READER mydump ( id INTEGER, date TEXT, ... ) FROM file 'dump.txt' (followed by delimiter specification syntax identical to COPY, etc) ;

Very cool idea, but why would you need to create a reader object first ? You should be able to use COPY directly with the target table being omitted, meaning the copy will not pump its result into the target but be equivalent to a select... and be usable in any place where a select can be used. This would have absolutely no new syntax, just the rules changed...

Now that I had a second look you actually need the field definitions to meaningfully interpret the file,

Yeah, you need to tell Postgres the field names, types, and NULLness before it can parse them...

or else it's just a plain flat text file which makes no sense... but then why not use a record specification instead of the table in the normal COPY command ? I'm not sure if there's any existing syntax for that but I would guess yes...

Hm, yeah, that's even simpler : just create a type for the row (or just use table%ROWTYPE if you have a table that fits the description), and tell COPY to parse according to the row type definition... smart... Like :

CREATE TYPE import_rowtype AS (id INTEGER, date TEXT);
INSERT INTO mytable (id, date, ...)
SELECT id, NULLIF( date, '0000-00-00' )::DATE
FROM (COPY AS import_rowtype FROM 'mysql_trash.txt') AS foo
WHERE (FKs check and drop the broken records);

Looks clean... Obviously, in this case (and also in my proposal's case) you must use COPY and not \copy since it is the database server which will be reading the file. This could probably be hacked so the client sends the file via the \copy interface, too...

In any case, such a feature would help a lot in processing input files based also on other existing data in the DB.

Yeah, it would be cool.
Also, since COPY TO can use a SELECT as a data source, you could use postgres to read from a file/pipe, process data, and write to a file/pipe (kinda better than sed, lol)
Re: [HACKERS] COPY Transform support
Data transformation while doing a data load is a requirement now and then. Considering that users will have to do mass updates *after* the load completes to mend the data to their liking should be reason enough to do this while the loading is happening. I think to go about it the right way we should support the following:

* The ability to provide per-column transformation expressions
* The ability to use any kind of expression while doing the transformation. The transformation expression should be any expression (basically ExecEvalExpr) that can be evaluated to give a resulting value and obviously a corresponding is_null value too. It could use built-in functions (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER etc.) or user defined functions too
* The transformation expression can refer to other columns involved in the load, so that when the current row is extracted from the input file, the current values are used to generate the new resultant values before doing a heap_form_tuple. E.g. (col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform "UPPER(col1 || col3)", ...)

I have spent some thought on how to do this and will be happy to share it if the list is interested. Personally, I think data transformation using such expressions is a pretty powerful and important activity while doing the data load itself.

Well, since COPY is about as fast as INSERT INTO ... SELECT plus the parsing overhead, I suggest adding a special SELECT form that can read from a file instead of a table, which returns tuples, and which therefore can be used and abused to the user's liking. This is a much more powerful feature because :
- there is almost no new syntax
- it is much simpler for the user
- lots of existing stuff can be leveraged

EXAMPLE : Suppose I want to import a MySQL dump file (gasp !) which obviously contains lots of crap like 0000-00-00 dates, '' instead of NULL, broken foreign keys, etc.
Let's have a new command :

CREATE FLATFILE READER mydump ( id INTEGER, date TEXT, ... ) FROM file 'dump.txt' (followed by delimiter specification syntax identical to COPY, etc) ;

This command would create a set-returning function which is basically a wrapper around the existing parser in COPY. The column definition gives a name and type to the fields in the text file, and tells the parser what to expect and what to return. It looks like a table definition, and this is actually pretty normal : it is, after all, very close to a table.

INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date, '0000-00-00' ), ... FROM mydump WHERE (FKs check and drop the broken records);

Now I can import data and transform it at will using a simple SELECT. The advantage is that everybody will know what to do without learning a new command, no awkward syntax (transform...), you can combine columns in expressions, JOIN to check FKs, use ORDER to get a clustered table, anything you want, without any extension to the Postgres engine besides the creation of this file-parsing set-returning function, which should be pretty simple.

Or, if I have a few gigabytes of logs, but I am absolutely not interested in inserting them into a table, instead I want to make some statistics, or perhaps I want to insert into my table some aggregate computation from this data, I would just :

CREATE FLATFILE READER accesses_dump ( date TEXT, ip INET, ... ) FROM file 'web_server_logtxt';

And I can do some stats without even loading the data :

SELECT ip, count(*) FROM accesses_dump GROUP BY ip HAVING count(*) > 1000 ORDER BY count(*);

Much better than having to load those gigabytes just to make a query on them...
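The row-parsing core that such a FLATFILE READER would wrap is simple: split each delimited line into fields according to the declared column list, applying a NULLIF-style rule for MySQL's zero dates. A toy sketch in C; `ImportRow`, `parse_row`, and the two-column layout are illustrative assumptions, not anything from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One row of the hypothetical reader: ( id INTEGER, date TEXT ). */
typedef struct {
    int  id;
    char date[11];     /* "YYYY-MM-DD" */
    int  date_is_null; /* set when NULLIF(date, '0000-00-00') fires */
} ImportRow;

/* Parse "id<TAB>date" into a typed row; returns 0 on success. */
int parse_row(const char *line, ImportRow *row)
{
    char buf[64];
    strncpy(buf, line, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    char *tab = strchr(buf, '\t');
    if (tab == NULL)
        return -1;               /* malformed line: wrong field count */
    *tab = '\0';

    row->id = atoi(buf);
    /* MySQL dumps use '0000-00-00' where they mean NULL. */
    row->date_is_null = (strcmp(tab + 1, "0000-00-00") == 0);
    if (!row->date_is_null) {
        strncpy(row->date, tab + 1, sizeof row->date - 1);
        row->date[sizeof row->date - 1] = '\0';
    }
    return 0;
}
```

A set-returning function built around this loop would hand each `ImportRow` back as a tuple, and the surrounding SELECT does all the filtering and transformation.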
Re: [HACKERS] first time hacker ;) messing with prepared statements
The MAJOR benefit of Microsoft's approach is that it works on existing application, Yes, that is a nice benefit ! Is there a way to turn it on/off ? Or is it smart enough to only cache plans for cases where it is relevant ? For instance, I absolutely want some queries to be planned according to real parameters (makes huge difference on some search queries, as expected), whereas most simple queries like the proverbial select by ID etc could be cached without problems... and, most importantly makes NO assumptions on the "volatile" server state. A few cases where the Microsoft solution works, while yours will fail is: * Server restart and assorted like failover (you need to redo a global prepare). * Cleanup and instantiation of a prepared statement. Hehe, actually, mine does work after restart since the statements are stored in a database-specific system catalog which is persistent. Actually, what I store is not the result of PREPARE (a plan) but the text of the SQL query "PREPARE foo.", that is I just cut the GLOBAL from "GLOBAL PREPARE" and store the rest. The actual PREPARE is realized by each connection when it encounters an EXECUTE request and doesn't find the cached plan. It is actually extremely simple ;) did you expect a fancy shared memory cache (ahem...) ? No, no, it's very basic. This way, if a table was dropped and recreated, or whatever other stuff that can invalidate a plan since the GLOBAL PREPARE was issued, no problem, since there was no global stored plan anyway, just some SQL text. Also if a needed table was dropped, the user will get the same error message as he would have got issuing a PREPARE for the associated SQL query string. The overhead of each connection doing its own PREPARE is negligible, since, if you use that feature, you intend to issue this query many, many times during the life of the persistent connection. 
What you are doing for a global query cache is already under consideration, and having a plan invalidation mechanism on schema changes or, maybe, statistics updates was a step in that direction. Your code mostly contributed the other parts already.

As I said it is much simpler than that : I store no plans ;) Of course this means it only works with persistent connections.

Another consideration is whether most tasks are getting CPU bound or IO bound. A better, per query, plan might reduce IO load due to better use of statistics on that single case, while for CPU bound it is very nice to reduce the planning overhead significantly.

Well, if it is IO bound, then this thing is useless. However, since the purpose is to optimize often-used, simple queries, the likes of which abound in web applications, it is relevant... because, if this kind of simple select becomes IO bound, and you have a few on each page, you're really in trouble...

Another possible implementation would be to use a connection pooler which, when opening a new connection, can be configured to send a SQL script containing all the PREPARE statements. This is, IMHO, an application side feature that might be a good addition to PHP and other languages that provide the "persistent connection" feature. On second thought, if it is not in Postgres, I'd rather put this in the connection pooler, because this way it can be used by different applications.

But then you have to use a connection pooler. Also, by putting it in Postgres, statements are only prepared as needed, whereas the pooler would have to issue a lot of PREPAREs at connection startup, making new connection startup slower.
Re: [JDBC] [HACKERS] How embarrassing: optimization of a one-shot query doesn't work
On Tue, 01 Apr 2008 16:06:01 +0200, Tom Lane <[EMAIL PROTECTED]> wrote: Dave Cramer <[EMAIL PROTECTED]> writes: Was the driver ever changed to take advantage of the above strategy? Well, it's automatic as long as you use the unnamed statement. About all that might need to be done on the client side is to use unnamed statements more often in preference to named ones, and I believe that something like that did get done in JDBC. regards, tom lane

PHP is also affected if you use pg_query_params... Syntax : pg_query_params( "SQL with $ params", array( parameters ) ). Note that value is TEXT, indexed, and there are 100K rows in the table.

pg_query( "SELECT * FROM test WHERE id = 12345" );
1 rows in 0.15931844711304 ms
pg_query( "SELECT * FROM test WHERE value LIKE '1234%'" );
11 rows in 0.26795864105225 ms
pg_query_params( "SELECT * FROM test WHERE id = $1", array( 12345 ) );
1 rows in 0.16618013381958 ms
pg_query_params( "SELECT * FROM test WHERE value LIKE $1", array( '1234%' ));
11 rows in 40.66633939743 ms

The last query does not use the index. However, since no one uses pg_query_params in PHP (PHP coders just LOVE to manually escape their strings, or worse, use magic_quotes), no one should notice ;)
Re: [HACKERS] first time hacker ;) messing with prepared statements
* Server restart and assorted like failover (you need to redo a global prepare).

Hmm? He's proposing storing the info in a system catalog. That hardly seems "volatile"; it'll certainly survive a server restart.

Yes, it's in a system catalog.

I agree with the point that this isn't completely transparent to applications, but if an app is already using named prepared statements it would surely be a pretty small matter to make it use this feature. The app code would likely get simpler instead of more complex, since you'd stop worrying about whether a given statement had been prepared yet in the current session.

Thanks. That was the idea behind this hack...

I'm having a problem with the terminology here, since AFAICT what your patch does is exactly not a global "prepare" --- there is no permanently stored cached plan. That's a good thing probably, but it seems like the feature needs to be described differently.

Sure, but I couldn't come up with a suitable name at the time... perhaps CREATE STATEMENT (and DROP STATEMENT) ? This would describe it better (especially the DROP, because GLOBAL DEALLOCATE is a rather bad name, since it doesn't actually deallocate anything...)

I'm also pretty dubious about storing raw text in that catalog. In the first place, while I've not looked at your patch, I expect you are pulling the raw text from debug_query_string. That won't work in cases where multiple SQL commands were submitted in one query string.

LOL, you are right : I had tested with multiple queries on the same line from psql, but psql apparently splits the queries ; when I feed multiple queries from PHP, one of them being GLOBAL PREPARE, it fails.

In the second place, raw-text SQL commands will be subject to a whole lot of ambiguity at parse time. If for instance another session tries to use the command with a different search_path or standard_conforming_strings setting, it'll get different results.
While I can think of use-cases for that sort of behavior, it seems like mostly a bad idea.

You're right.

I'm thinking that a more appropriate representation would use stored parse trees, the same as we do in pg_rewrite, and with the same dependency information so that a stored statement couldn't outlive the objects it depends on.

Does the parse tree store fully qualified "schema.table" or "schema.function" ? I mean, if table T is mentioned in a parse tree which is stored, and the table is later dropped and recreated... or a column dropped... what happens ? Dropping the statement would seem more logical, since it would probably no longer be valid...

Another area that could do with more thought is the hard-wired association between statement ownership and accessibility. That's likely to be pretty inconvenient in a lot of cases, particularly systems that use role membership heavily.

Yes, I need to think about that. I also wonder whether statements should belong to schemas... Since they are basically an extremely simple form of a function, why not ? (but since part of the goodness of prepared statements is that they are stored in a fast hash cache, wouldn't that add too much overhead ?)

Thanks for the helpful advice.
[HACKERS] Patch : Global Prepared Statements
Here is the result of my little experiment, for your viewing pleasure, lol. Now it works. Well, it hasn't crashed yet... so I guess I can show it to people ;)

- Purpose : Allow PHP (or other languages) users to use prepared statements (pg_exec()) together with persistent connections, while not bothering about preparing all those statements when the persistent connection is first created (you have no way of knowing, from PHP, if your persistent connection is new or second-hand, by the way).

- What it does : New commands :
GLOBAL PREPARE does the same as PREPARE, except for all current and future sessions FROM THE SAME USER instead of just the current session.
GLOBAL DEALLOCATE does almost what you'd expect.

- Results : Here is a comparison between sending the query as a string (oldskool) with parameters in it, and executing a prepared statement via pg_exec() (which emits a BIND, not a SQL EXECUTE).

CREATE TABLE test( id SERIAL PRIMARY KEY, value INTEGER );
INSERT INTO test (value) SELECT random()*100 FROM generate_series( 1,10 );
CREATE INDEX test_value ON test( value );
CREATE TABLE test2 (id INTEGER PRIMARY KEY REFERENCES test( id ) ON DELETE CASCADE, value2 TEXT );
INSERT INTO test2 SELECT id, (random()*100)::TEXT FROM test;
CREATE TABLE test3 (id INTEGER PRIMARY KEY REFERENCES test( id ) ON DELETE CASCADE, value3 INTEGER );
INSERT INTO test3 SELECT id, random()*100 FROM test;

- simple SELECT : SELECT * FROM test WHERE id = $1
=> Server CPU load : -46% (almost cut in half)
=> Queries/s from PHP including PHP overhead : +71%

- three table JOIN with order by/limit : SELECT * FROM test a NATURAL JOIN test2 b NATURAL JOIN test3 c WHERE a.value < $1 ORDER BY value DESC LIMIT 1
=> Server CPU load : -84% (ie uses 6 times less CPU)
=> Queries/s from PHP including PHP overhead : +418% (ie 5 times faster)

(By the way, with pg_exec on those simple queries, Postgres beats InnoDB and MyISAM thoroughly in both queries/s (seen by client) and server CPU used by query).
Note that the patch has no influence on these results whatsoever. It just allows a PHP coder to easily use the pure goodness of pg_exec() without hassle.

- Implementation :

* a new system catalog (I snipped the query texts, but they are stored entirely).

test=# SELECT * FROM pg_global_prepared ;
 stmt_owner | stmt_name         | stmt_sql
------------+-------------------+---------------------------------------------
         10 | test_plan_order_3 | PREPARE test_plan_order_3 (INTEGER) AS SEL
         10 | test_plan_pk_3    | PREPARE test_plan_pk_3 (INTEGER) AS SELECT
         10 | test_plan_pk      | PREPARE test_plan_pk (INTEGER) AS SELECT *
         10 | test_plan_order   | PREPARE test_plan_order (INTEGER) AS SELEC

* GLOBAL PREPARE foo AS SELECT ...
Checks syntax, does a PREPARE, and if it works, inserts a row in pg_global_prepared with the user OID, statement name, and PREPARE command.

* GLOBAL DEALLOCATE foo
Deletes the line from pg_global_prepared. Note that this does not make all other open connections forget the plan ;)

* In the code : FetchPreparedStatement() gets a new sister, FetchOrCreatePreparedStatement() :
- try to find the statement in the session cache (usual behaviour) ; found -> ok, we're done
- if the transaction is aborted, byebye
- look in pg_global_prepared for the requested statement name and the current user ID ; not found -> byebye
- SPI_execute the PREPARE statement
- return the prepared statement to the caller

SQL EXECUTE, the BIND handler in postgres.c, and some others got calls to FetchOrCreatePreparedStatement(). So if you pg_exec() or EXECUTE or EXPLAIN EXECUTE, etc, a statement that does not exist, the backend will PREPARE it for you if it can find it in pg_global_prepared.

The reason for the user OID is that independent users must not be able to modify each other's prepared statements...

Patch is here (against 8.3.1) : http://home.peufeu.com/pg/

This is certainly not ready for production, lol. But since I had fun writing this, I'd rather share it.
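The lookup logic of FetchOrCreatePreparedStatement() can be sketched outside the backend. This is a toy model, not the patch itself: fixed-size arrays stand in for the session hash table and the pg_global_prepared catalog, and "preparing" is reduced to flagging an entry.

```c
#include <assert.h>
#include <string.h>

#define MAXSTMT 8

typedef struct { char name[32]; } SessionStmt;            /* session cache */
typedef struct { char name[32]; char sql[128]; } CatalogRow; /* stored PREPARE text */

static SessionStmt session[MAXSTMT]; static int n_session;
static CatalogRow  catalog[MAXSTMT]; static int n_catalog;

/*
 * Fetch-or-create: returns 1 if the statement is usable after the
 * call, 0 if it is unknown in both the session cache and the catalog.
 */
int fetch_or_create(const char *name)
{
    /* 1. usual behaviour: look in the session cache */
    for (int i = 0; i < n_session; i++)
        if (strcmp(session[i].name, name) == 0)
            return 1;

    /* 2. miss: look for a stored template for this statement name */
    for (int i = 0; i < n_catalog && n_session < MAXSTMT; i++)
        if (strcmp(catalog[i].name, name) == 0) {
            /* 3. found: run its PREPARE text, then cache it locally */
            strncpy(session[n_session].name, name,
                    sizeof session[n_session].name - 1);
            n_session++;
            return 1;
        }

    return 0;   /* no session entry and no template: genuine error */
}
```

Note that only SQL text lives in the "catalog": each session re-prepares on first use, which is what makes dropped-and-recreated tables harmless here.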
Regards, Pierre
Re: [HACKERS] first time hacker ;) messing with prepared statements
PFC wrote: Hello, So, I embarked (yesterday) on a weekend project to add a new feature to Postgres... I use PHP with persistent connections and always have been bothered that those very small AJAX queries (usually simple selects returning 1 row) take more CPU in postgres to parse & plan than to actually execute. Microsoft's answer to this issue with SQLServer appears to have been to introduce a smart cache for all statement plans. It seems to be very effective. I guess you're doing much the same thing but with more user intervention, in effect. Actually, the main purpose was to 1) have fun hacking Postgres, and 2) perhaps something useful would come of it... And I did find it very interesting, probably due to the fact that Postgres source code is so... impressively clean... well organized... readable... it took two hours from downloading the source to having an extra functional system catalog, and it worked at the first compile, all due to the concise but to the point comments in include/catalog, I couldn't believe it. Anyway, Microsoft's solution is cool, too, but you need to reparse the entire query to then detect "I've planned this query before, with other parameters, so I'll reuse that prepared plan", so it adds another parsing step, which is less efficient. Postgres could also do that with a small modification, by the way : like by using the entire string (with $1 style parameters) instead of the statement name, use that as a cache key, and send parameters separately, but I think it would be less clean than, say, a statement called "get_user_by_id" or something. Also I like the idea of named prepared queries, which feel a bit like procedures, because many of those small, often-used queries would end up being defined in the same place, which makes schema changes (and having to modify queries in your application) slightly less painful. 
PHP can also use pg_exec() which sends the parameters separately, automagically converted to postgres format, so you save time and hassle on quoting and de-quoting versus a SQL EXECUTE. Since the performance of pg_exec() is almost twice as fast as plain SQL, and PHP scripts tend to use quite a lot of these queries, it also means a free database server performance upgrade (ie. the same DB server can handle more webservers, for instance). Another possible implementation would be to use a connection pooler which, when opening a new connection, can be configured to send a SQL script containing all the PREPARE statements. Are you sure that you application wouldn't benefit more from a MOM solution with persisted database connections? Have you looked at http://safmq.sourceforge.net/? Dunno. Is this related to Postgres ?
[HACKERS] first time hacker ;) messing with prepared statements
Hello, So, I embarked (yesterday) on a weekend project to add a new feature to Postgres... I use PHP with persistent connections and always have been bothered that those very small AJAX queries (usually simple selects returning 1 row) take more CPU in postgres to parse & plan than to actually execute. Since those small queries usually come in great numbers, I would like to PREPARE them beforehand and use php's pg_exec() (faster than SQL EXECUTE). This saves about 50% CPU time on the server for those small queries. However, with persistent connections there is a problem : you never know if the query has already been prepared or not. Ideally a PHP process would open a persistent connection and find all queries already prepared, ready to execute... So :

- Added a system catalog "pg_global_prepared" (one per database actually) which contains :
  - oid of the user who created the row
  - name of the statement
  - SQL command for preparing the statement

example :

test=# SELECT * FROM pg_global_prepared ;
 stmt_owner | stmt_name       | stmt_sql
------------+-----------------+---------------------------------------------
         10 | test            | PREPARE test (INTEGER) AS SELECT $1+3;
         10 | test_plan_pk    | PREPARE test_plan_pk (INTEGER) AS SELECT * FROM test WHERE id = $1;
         10 | test_plan_order | PREPARE test_plan_order (INTEGER) AS SELECT * FROM test WHERE value < $1 ORDER BY value DESC LIMIT 1;

- Added sql command GLOBAL PREPARE foo (arg types) AS sql query
  This inserts a row in the above catalog after having run a standard "prepare" on the query to test its validity
- Added sql command GLOBAL DEALLOCATE
  This removes row(s) from the above catalog (only those owned by the current user)
- Messed with EXECUTE (ExecuteQuery) so that :
  - if the requested statement is found in the session cache, use it (as usual)
  - if not, look into pg_global_prepared to see if there is one of the same name created by the same user
  - if found, use this to PREPARE, then store in the session cache, then execute it

After that I put this logic in FetchPreparedStatement instead, so if it is asked to fetch a non-existing
statement for which there is a row in pg_global_prepared, it will create it.

test=# EXPLAIN ANALYZE EXECUTE test_plan_pk(1);
NOTICE: prepared statement "test_plan_pk" does not exist in local session cache, now searching pg_global_prepared for a template to create it.
NOTICE: found template for requested statement, executing : "test_plan_pk" :
NOTICE: PREPARE test_plan_pk (INTEGER) AS SELECT * FROM test WHERE id = $1;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Index Scan using test_pkey on test  (cost=0.00..8.28 rows=1 width=8) (actual time=19.476..19.478 rows=1 loops=1)
   Index Cond: (id = $1)
 Total runtime: 0.079 ms
(3 rows)

So, you take whatever persistent connection from a pool and issue an EXECUTE without worries.

* Now, the problem :
- EXECUTE, EXPLAIN EXECUTE, EXPLAIN ANALYZE EXECUTE all work
- pg_exec() from php makes it crash

Actually pg_exec() does not use SQL EXECUTE ; I think it uses the new extended query protocol and just sends a message to execute a named prepared query. In that case, my code in FetchPreparedStatement crashes :

NOTICE: prepared statement "test_plan_pk" does not exist in local session cache, now searching pg_global_prepared for a template to create it.
LOG: server process (PID 30692) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing

GDB says it is because CurrentResourceOwner is NULL. Did I forget to initialize something ? lol. I'll post more details and a complete traceback this afternoon, but here is the problematic bit of code ; this is the code that finds the SQL to prepare a statement.
Thanks for any suggestion ;)

    Relation        mycatalog;
    HeapTuple       tup;
    TupleDesc       dsc;
    NameData        stmt_name_data;
    ScanKeyData     skey[2];
    SysScanDesc     scan;
    Datum           datum;
    bool            found = false;
    bool            isnull;
    const char     *sql = "";

    namestrcpy(&stmt_name_data, stmt_name);
    mycatalog = heap_open(GlobalPreparedRelationId, RowExclusiveLock);   <---- crashes here
    dsc = RelationGetDescr(mycatalog);
    ScanKeyInit(&skey[0],
                Anum_pg_global_prepared_stmt_owner,
                BTEqualStrategyNumber, F_OIDEQ,
                GetUserId());
    ScanKeyInit(&skey[1],
                Anum_pg_global_prepared_stmt_name,
                BTEqualStrategyNumber, F_NAMEEQ,
                NameGetDatum(&stmt_name_data));
    scan = systable_beginscan(mycatalog, GlobalPreparedIndexId, true,
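[Aside: setting the crash itself apart, the lookup order the patch gives FetchPreparedStatement — session cache first, then pg_global_prepared as a template to recreate the statement in this session — can be modeled in a few lines of Python. Names and the stand-in "plan" object are invented for illustration only:]

```python
# Toy model of the patched FetchPreparedStatement lookup order.
# (Invented names; a real backend runs PREPARE instead of building a tuple.)

session_cache = {}                       # statement name -> prepared plan
pg_global_prepared = {                   # (owner oid, name) -> PREPARE sql
    (10, "test_plan_pk"):
        "PREPARE test_plan_pk (INTEGER) AS SELECT * FROM test WHERE id = $1;",
}

def fetch_prepared_statement(user, name):
    # 1. found in session cache: use it, as usual
    if name in session_cache:
        return session_cache[name]
    # 2. not found: search pg_global_prepared for a template (same user)
    sql = pg_global_prepared.get((user, name))
    if sql is None:
        raise KeyError("prepared statement %r does not exist" % name)
    # 3. found a template: PREPARE it, store in session cache, return it
    plan = ("plan-for", sql)             # stand-in for an actual plan
    session_cache[name] = plan
    return plan

plan = fetch_prepared_statement(10, "test_plan_pk")   # prepared from template
assert fetch_prepared_statement(10, "test_plan_pk") is plan  # now a cache hit
```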
Re: [HACKERS] Worries about delayed-commit semantics
On Fri, 22 Jun 2007 16:43:00 +0200, Bruce Momjian <[EMAIL PROTECTED]> wrote:

> Simon Riggs wrote:
>> On Fri, 2007-06-22 at 14:29 +0100, Gregory Stark wrote:
>>> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
>>>> Tom Lane wrote:
>>>>> untrustworthy disk hardware, for instance. I'd much rather use names
>>>>> derived from "deferred commit" or "delayed commit" or some such.
>>>> Honestly, I prefer these names as well as it seems directly related
>>>> versus transaction guarantee which sounds to be more like us saying,
>>>> if we turn it off our transactions are bogus.
>> That was the intention..., but name change accepted.
>>> Hm, another possibility: "synchronous_commit = off"
>> Ooo, I like that. Any other takers?
> Yea, I like that too but I am now realizing that we are not really
> deferring or delaying the "COMMIT" command but rather the recovery of
> the commit. GUC as full_commit_recovery?

commit_waits_for_fsync =
    force_yes : makes all commits "hard"
    yes       : commits are "hard" unless specified otherwise [default]
    no        : commits are "soft" unless specified otherwise [should replace fsync=off use case]
    force_no  : makes all commits "soft" (controller with write cache "emulator")

The force_yes and force_no are for benchmarking purposes mostly, ie. once your app is tuned to specify which commits have to be guaranteed ("hard") and which don't ("soft"), you can then bench it with force_yes and force_no to see how much you gained, and how much you'd gain by buying a write cache controller... ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Worries about delayed-commit semantics
So now we're poking a hole in that but we certainly have to ensure that any transactions that do see the results of our deferred commit themselves don't record any visible effects until both their commit and ours hit WAL. The essential point in Simon's approach that guarantees that is that when you fsync you fsync all work that came before you. So committing a transaction also commits all deferred commits that you might depend on.

BTW: I really dislike the name "transaction guarantee" for the feature; it sounds like marketing-speak, not to mention overpromising what we can deliver. Postgres can't "guarantee" anything in the face of untrustworthy disk hardware, for instance. I'd much rather use names derived from "deferred commit" or "delayed commit" or some such.

Well, from an implementation point of view we're delaying or deferring the commit. But from a user's point of view the important thing for them to realize is that a committed record could be lost. Perhaps we should just not come up with a new name and reuse the fsync variable. That way users of old installs which have fsync=off silently get this new behaviour. I'm not sure I like that idea since I use fsync=off to run cpu overhead tests here. But from a user's point of view it's probably the "right" thing. This is really what fsync=off should always have been doing.

Say you call them SOFT COMMIT and HARD COMMIT... HARD COMMIT fsyncs, obviously. Does SOFT COMMIT fflush() the WAL (so it's postgres-crash-safe) or not ? (just in case some user C function misbehaves and crashes) Do we get a config param to set default_commit_mode=hard or soft ?

By the way, InnoDB has a similar option where you set innodb_flush_log_on_commit (or something). However you cannot set it on a per-transaction basis. So, on an e-commerce site, for instance, most transactions will be "unimportant" (ie.
no need to fsync, ACI only, like incrementing products view counts, add to cart, etc) but some transactions will have to be guaranteed (full ACID) like recording that an order has been submitted / paid / shipped. But with InnoDB you can't choose this on a per-transaction basis, so it's all or nothing. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
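[The key property from the discussion above — an fsync flushes all WAL that precedes it, so one hard commit also makes every earlier soft (deferred) commit durable — can be sketched with a minimal Python model. Everything here is invented for illustration; a real WAL tracks byte positions (LSNs), not record counts:]

```python
# Minimal model: soft commits append to the WAL without flushing;
# a hard commit's fsync flushes the whole WAL tail, covering every
# soft commit that was appended before it.

class WAL:
    def __init__(self):
        self.records = []        # committed xids, in append order
        self.flushed_upto = 0    # count of records known durable on disk

    def commit(self, xid, hard):
        self.records.append(xid)
        if hard:
            # fsync: everything appended so far becomes durable at once
            self.flushed_upto = len(self.records)

    def durable(self, xid):
        return xid in self.records[:self.flushed_upto]

wal = WAL()
wal.commit("t1", hard=False)     # soft: acknowledged, not yet durable
wal.commit("t2", hard=False)
assert not wal.durable("t1")
wal.commit("t3", hard=True)      # hard: its fsync also covers t1 and t2
assert wal.durable("t1") and wal.durable("t2") and wal.durable("t3")
```

This is exactly why a transaction that saw the results of a deferred commit cannot become durable without the deferred commit becoming durable too.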
Re: [HACKERS] Load Distributed Checkpoints test results
On Fri, 15 Jun 2007 22:28:34 +0200, Gregory Maxwell <[EMAIL PROTECTED]> wrote: On 6/15/07, Gregory Stark <[EMAIL PROTECTED]> wrote: While in theory spreading out the writes could have a detrimental effect I think we should wait until we see actual numbers. I have a pretty strong suspicion that the effect would be pretty minimal. We're still doing the same amount of i/o total, just with a slightly less chance for the elevator algorithm to optimize the pattern. ..and the sort patching suggests that the OS's elevator isn't doing a great job for large flushes in any case. I wouldn't be shocked to see load distributed checkpoints cause an unconditional improvement since they may do better at avoiding the huge burst behavior that is overrunning the OS elevator in any case. ...also consider that if someone uses RAID5, sorting the writes may produce more full-stripe writes, which don't need the read-then-write RAID5 performance killer... ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Controlling Load Distributed Checkpoints
> If we extended relations by more than 8k at a time, we would know a lot
> more about disk layout, at least on filesystems with a decent amount of
> free space.

I doubt it makes that much difference. If there was a significant amount of fragmentation, we'd hear more complaints about seq scan performance. The issue here is that we don't know which relations are on which drives and controllers, how they're striped, mirrored etc.

Actually, isn't pre-allocation one of the tricks that Greenplum uses to get its seqscan performance?

My tests here show that, at least on reiserfs, after a few hours of benchmark torture (this represents several million write queries), table files become significantly fragmented. I believe the table and index files get extended more or less simultaneously and end up somehow a bit mixed up on disk. Seq scan perf suffers. reiserfs doesn't have excellent fragmentation behaviour... NTFS is worse than hell in this respect. So, pre-alloc could be a good idea. Brutal defrag (cp /var/lib/postgresql to somewhere and back) gets seq scan perf back to disk throughput.

Also, by the way, InnoDB uses a BTree-organized table. The advantage is that data is always clustered on the primary key (which means you have to use something as your primary key that isn't necessarily "natural"; you have to choose it to get good clustering, and you can't always do it right, so it somehow, in the end, sucks rather badly). Anyway, seq scan on InnoDB is very slow because, as the btree grows (just like postgres indexes), pages are split and scanning the pages in btree order becomes a mess of seeks. So, seq scan in InnoDB is very very slow unless periodic OPTIMIZE TABLE is applied. (caveat to the postgres TODO item "implement automatic table clustering"...) ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] vacuum, performance, and MVCC
My idea is that if an UPDATE places the new tuple on the same page as the old tuple, it will not create new index entries for any indexes where the key doesn't change.

Basically the idea behind preventing index bloat by updates is to have one index tuple point to several actual tuples having the same value. So:

Index entry -> list of tuples having the same value -> actual tuples
(-> represents an indirection)

I proposed to put the list of tuples in the index; you propose to put it in data pages. I think both solutions have pros and cons:

* List of tuples in the index:
  + reduces index size, makes cacheability in RAM more likely
  + speeds up index scans
  - complexity
  - slows down modifications to the index (a bit)

* List of tuples in the page:
  + simpler to implement
  + reduces index size, but less so than the previous solution
  - useless if UPDATE puts the new tuple on a different page

I guess the best solution would be a mix of both.

Also, I insist (again) that there is a lot to gain by using a bit of compression on the data pages, even if it's very simple compression like storing the new version of a row as a difference from the previous version (ie. only store the columns that changed). I think DB2 stores the latest version entirely, and stores the previous versions as deltas. This is more efficient.

In the case of tables containing TEXT values, these could also get TOASTed. When an update does not modify the TOASTed columns, it would be nice to simply be able to keep the reference to the TOASTed data instead of decompressing it and recompressing it. Or is it already the case ? ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] vacuum, performance, and MVCC, and compression
What about increasing the size of an existing index entry? Can that be done easily when a new row is added?

I'd say it looks pretty much like inserting a new index tuple... Say "value" is the indexed column. Find the first page in the index featuring "value".

1. If there is space on the page, add the tuple id to the list of the corresponding index entry (just like creating a new index tuple, but it uses less space);
   else look at the next page. If the next page has an index tuple with the same indexed value, goto 1; else insert a new page and create an index tuple on it.

I would be worried about the overhead of doing that on compression and decompression.

The compression methods mentioned in the article which was passed on the list seemed pretty fast. From IO-limited, the test database became CPU-limited (and a lot faster). ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
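[For illustration, a toy Python model of the insertion rule above. The structure is entirely invented (dict-based pages, a tiny capacity); a real index page holds many more entries and stores tuple ids compactly:]

```python
# Toy model: index pages map indexed value -> list of tuple ids.
# A tuple id is appended to an existing entry when the page has room;
# otherwise we try the next page with the same value, or start a new page.

PAGE_CAPACITY = 4  # max tuple ids per page, kept tiny for illustration

def index_insert(pages, value, tid):
    for page in pages:
        if value in page:
            if sum(len(tids) for tids in page.values()) < PAGE_CAPACITY:
                page[value].append(tid)   # room on this page: extend entry
                return
            # page full: fall through to the next page with the same value
    pages.append({value: [tid]})          # no room anywhere: new page

pages = []
for tid in range(6):
    index_insert(pages, "apple", tid)
# 6 tuple ids with a capacity of 4 per page -> entries spill to a second page
assert pages[0]["apple"] == [0, 1, 2, 3]
assert pages[1]["apple"] == [4, 5]
```

Note how six tuples with the same indexed value need only two index entries instead of six — this is where the index-size reduction comes from.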
Re: [HACKERS] vacuum, performance, and MVCC, and compression
There were some talks lately about compression. With a bit of lateral thinking I guess this can be used to contain the bloat induced by updates. Of course this is just my hypothesis. Compression in indexes : Instead of storing (value, tuple identifier) keys in the indexes, store (value, [tuple identifier list]) ; ie. all tuples which have the same indexed value are referenced by the same index tuple, instead of having one index tuple per actual tuple. The length of the list would of course be limited to the space actually available on an index page ; if many rows have the same indexed value, several index tuples would be generated so that index tuples fit on index pages. This would make the index smaller (more likely to fit in RAM) at the cost of a little CPU overhead for index modifications, but would make the index scans actually use less CPU (no need to compare the indexed value on each table tuple). Compression in data pages : The article that circulated on the list suggested several types of compression, offset, dictionary, etc. The point is that several row versions on the same page can be compressed well because these versions probably have similar column values. Just a thought... ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] vacuum, performance, and MVCC
What I see in this discussion is a huge amount of "the grass must be greener on the other side" syndrome, and hardly any recognition that every technique has its downsides and complications.

Sure ;)

MVCC generates dead rows, by its very nature; however I see two trends in this:

1* A large transaction that updates/deletes many rows.
For instance, suppose you UPDATE an entire table whose size is larger than memory. Old row versions have to be kept somewhere until commit, be it in the table itself or in some accessory undo-log. So, there will be a lot of harddisk grinding anyway, be it MVCC or Oracle-style, or whatever. MVCC will bloat the table and indexes, then VACUUM will shrink them. Update-in-place systems will bloat an undo log. It seems to me the current MVCC+VACUUM is the right tool for this job, requiring about the same amount of IO as the others. Vacuum scans sequentially, so it's the best way to process large volumes of data.

2* Many short transactions update rows in a table.
Like the sessions problem, for instance. Current VACUUM sucks for this case, I guess that's known.

So, we have two different problems, and one tool which is adapted to one problem only. Should the tool (Vacuum) be fixed to handle both problems, making it more complex and difficult to maintain, or should another tool be created specifically for the second problem? Problem 2 is very different from problem 1. The only case where they meet is when there is a continuous stream of small updates running concurrently with a long transaction.

So, what is the ideal tool for case 2? We'd want a vacuuming machine that can be run very often, or even better, continuously. The table can be large, larger than the disk cache, so scanning it is not an option. The updates are probably randomly distributed in the table. Therefore, VACUUMing a long time after these transactions are committed and the pages are no longer in cache would require a lot of random seeks, which is also bad.
Besides, we want this vacuum to be continuous and transparent. The best time to vacuum pages is thus when they are still in the background writer's memory, or in the disk cache, waiting to be flushed to disk. There, they can be re-read, vacuumed and re-written with no seek penalty, only additional WAL traffic. However, the amount of WAL traffic in bytes/s is less important than the frequency of WAL syncs. Emitting more WAL data shouldn't be a problem if those sync writes are coalesced with the sync writes of current transactions.

So, I guess the best solution for case 2 is to have the background writer perform on-the-fly VACUUM:

An UPDATE or DELETE transaction hands over dirty pages to be written to the bgwriter. It also tells the bgwriter the ID of the current transaction, and flags specifying if the pages contain candidate dead rows. The bgwriter should have a sufficiently large buffer in order to be able to keep these pages in memory until all the transactions that can see the dead rows in these pages are finished. Then, the pages are vacuumed and written.

The key is the size of the buffer. It should be large enough to contain enough pages so that it is actually possible to vacuum something out of them before writing them. However, if the buffer capacity is exceeded (for instance, because there is a long running transaction), this is not a problem: the pages are simply written to disk normally, they will contain dead rows, which will need to be handled later by the standard VACUUM.

I think this would maximize code reuse by using the current bgwriter's architecture... did I miss something ? ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
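[A rough sketch of the bgwriter behaviour proposed above, in pure Python. The page representation and the "oldest active xid" visibility rule are invented simplifications — real visibility checks involve snapshots, not a single xid comparison:]

```python
# Sketch: the bgwriter buffers dirty pages along with the xid that
# dirtied them; when it flushes, a page whose dead rows are no longer
# visible to any active transaction is vacuumed in memory first.

def flush(buffer, oldest_active_xid, disk):
    """buffer: list of (page_xid, rows); each row is (value, dead_since_xid)
    where dead_since_xid is None for a live row."""
    for page_xid, rows in buffer:
        if page_xid < oldest_active_xid:
            # no live transaction can still see the dead rows:
            # vacuum while the page is in memory, no seek penalty
            rows = [(v, d) for v, d in rows if d is None]
        disk.append(rows)  # else write as-is; standard VACUUM handles it later
    buffer.clear()

disk = []
buffer = [
    (5, [("a", None), ("old-a", 5)]),   # dead row created by xid 5
    (9, [("b", None), ("old-b", 9)]),   # dead row created by xid 9
]
flush(buffer, oldest_active_xid=7, disk=disk)
assert disk[0] == [("a", None)]                  # vacuumed on the fly
assert disk[1] == [("b", None), ("old-b", 9)]    # possibly still visible
```

The overflow case in the text maps onto the `else` path: pages the buffer cannot hold long enough are simply written with their dead rows intact.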
Re: [HACKERS] vacuum, performance, and MVCC
Well, then please help me find a better design cause I can't see one... what we have here is a big "membership" table of email lists. When there's a sendout then the memberships of the affected group are heavily read/updated, otherwise they are idle. None of the memberships is archive data, they are all active data... the only problem is that they are so many. Is it so hard to believe that >100 million rows is all active data, but only used in bursts once per week (that's an example, some groups are more active, others less) ? I suppose you have a table memberships (user_id, group_id) or something like it ; it should have as few columns as possible ; then try regularly clustering on group_id (maybe once a week) so that all the records for a particular group are close together. Getting the members of a group to send them an email should be faster (less random seeks). For tables with very few small fields (like a few integers) the 26-something bytes row overhead is significant ; MySQL can be faster because MyISAM tables have no transaction support and thus have very little things to store besides actual row data, and the table can then fit in RAM... ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] vacuum, performance, and MVCC
What you seem not to grasp at this point is a large web-farm, about 10 or more servers running PHP, Java, ASP, or even perl. The database is usually the most convenient and, aside from the particular issue we are talking about, best suited. The answer is sticky sessions : each user is assigned to one and only one webserver in the cluster and his session is maintained locally, in RAM. No locks, no need to manage distributed session... I actually have a good number of years of experience in this topic, and memcached or file system files are NOT the best solutions for a server farm. If sessions are distributed, certainly, but if sessions are sticky to their own server ? ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] vacuum, performance, and MVCC
As you can see, in about a minute at high load, this very simple table lost about 10% of its performance, and I've seen worse based on update frequency. Before you say this is an obscure problem, I can tell you it isn't. I have worked with more than a few projects that had to switch away from PostgreSQL because of this behavior. Obviously this is not a problem with small sites, but this is a real problem with an enterprise level web site with millions of visitors and actions a day. Quite frankly it is a classic example of something that does not scale. The more and more updates there are, the higher the load becomes. You can see it on "top" as the footest program runs.

I believe sessions should not be stored in a SQL database. It makes no sense. Updates and Inserts to the database should only be done where there is an interesting thing to record, when the user does an action like posting to a forum, making a purchase, sending a message, etc.

I believe sessions should be stored in the memory of the application server, as native objects of whatever language the application is written in. This way, sessions incur no serializing overhead and can be quite large and complex, which allows storage of interesting things, like the result of a complicated search query which is to be later paginated, for instance. It really makes sense to use native language objects too, as these have a lot more power and versatility than a database row. Think about rights management, for instance.

When the framework used lacks this power (most do and this is sad), then sessions incur serializing overhead; but they should be serialized to filesystem files, or better, to memory using memcached, for instance. It makes no sense to pay the performance penalty of a COMMIT (disk seek delay etc) and the database overhead for sessions, which are by nature volatile data changing all the time. I don't think postgres should be tweaked to allow better handling of this.
It would only make a better foot-gun. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Rethinking stats communication mechanisms
Great minds think alike ;-) ... I just committed exactly that protocol. I believe it is correct, because AFAICS there are only four possible risk cases: Congrats ! For general culture you might be interested in reading this : http://en.wikipedia.org/wiki/Software_transactional_memory http://libcmt.sourceforge.net/ ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Rethinking stats communication mechanisms
It strikes me that we are using a single communication mechanism to handle what are really two distinct kinds of data:

Interesting. I recently read a paper on how to get rid of locks for this kind of pattern.

* For the Command String:
- Problem: need to display the currently executing command in the ps list.
- Will only be of use if the command is taking a long, long time. So, it need not be realtime; no problem if the data comes with a little delay, or not at all if the command executes quickly. People are only interested in the currently executing command to answer questions like "what query has my server grinding ?"

Point: the currently executing query is only interesting to display if it's currently executing. If it's not, it's in the log (if enabled).

So, the proposal:

Each backend has a shared memory area in which to store:
- the currently executing command (like in your proposal)
- a timestamp
- a counter

On executing a command, the backend stores the command string, then overwrites the counter with (counter + 1) and the timestamp with the timestamp of command start.

Periodically, like every N seconds, a separate process reads the counter, then reads the data, then reads the counter again. If the counter value changed, the process is repeated. If the counter value did not change, the command string did not change either, so it's valid, and can be used.

Other thoughts:

If the backend process itself should update its process title, and this operation is costly, it should only be done if the current query has been running for more than T seconds. However, syscalls for getting the current time are costly. A separate process can update a counter in shared memory with the current time every N seconds, and the backend can check it.

The main point is that if this value is written every few seconds but read often by only one process, or written often but read seldom, there will not be a lot of interprocessor cache thrashing on it.
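[The read protocol above is essentially a sequence counter ("seqlock"). A minimal sequential model in Python — no real shared memory or concurrency here, and all names are invented for illustration:]

```python
# Sequential model of the counter protocol: the reader retries until
# it observes the same counter value before and after reading the data.

class SharedSlot:
    def __init__(self):
        self.counter = 0
        self.command = ""

    def write(self, command):            # backend side
        self.command = command
        self.counter += 1                # bump once the data is in place

    def read(self):                      # stats-reader side
        while True:
            before = self.counter
            data = self.command
            if self.counter == before:   # unchanged: data is consistent
                return data
            # counter moved: a write raced us, retry

slot = SharedSlot()
slot.write("SELECT 1")
assert slot.read() == "SELECT 1"
```

A real shared-memory implementation would additionally need memory barriers, and would typically bump the counter both before and after the write so a reader can detect an in-progress update, not just a completed one.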
---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] CSV mode option for pg_dump
From what I gather, the CSV format dump would only contain data.

I think pg_dump is the friend of pg_restore. It dumps everything including user defined functions, types, schemas etc. CSV does not fit with this. Besides, people will probably want to dump into CSV the result of any query, to load it into excel, not just the full contents of a table.

So, why not create a separate tool; someone suggested pg_query for that, I second it. This tool would take a query and format options, and would output a file in whatever format chosen by the user (CSV, COPY format, xml, whatever).

A script language (python) can be used, which will significantly shorten development times and allow easy modularity, as it is easier to add a module to a python program than to a C program. I would vote for Python because I love it and it has a very good postgres adapter (psycopg2) which knows how to convert every postgres type to a native language type (yes, even multidimensional arrays of BOX get converted). And it's really fast at retrieving large volumes of data.

So you have a stable, fast tool for backup and restore (pg_dump) and a rapidly evolving, user-friendly and extendable tool for exporting data, and everyone is happy.

Mr Momjian talks about adding modular functionality to pg_dump. Is it really necessary? What is the objective? Is it to reuse code in pg_dump? I guess not; if a user wants to dump, for instance, all the tables in a schema, implementing this logic in python is only a few lines of code (select from information_schema...).

To be realistic, output format modules should be written in script languages. No one sane is eager to do string manipulation in C. Thus these modules would have to somehow fit with pg_dump, maybe with a pipe or something. This means designing another protocol. Reimplementing in a scripting language the parts of pg_dump which will be reused by this project (mainly, enumerating tables and stuff) will be far easier. Just look.
Python 2.4.2 (#1, Mar 30 2006, 14:34:35)
[GCC 3.4.4 (Gentoo 3.4.4-r1, ssp-3.4.4-1.0, pie-8.7.8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
...opens a db connection...
>>> c.execute( "SELECT * FROM test.csv" )
>>> data = c.fetchall()
>>> data
[[1, datetime.date(2006, 6, 13), 'this\tcontains\ttabulations'],
 [2, datetime.date(2006, 6, 13), "this'contains'quotes"],
 [3, datetime.date(2006, 6, 13), 'this"contains"double quotes']]
>>> import csv, sys
>>> c = csv.writer( sys.stdout, dialect = csv.excel )
>>> c.writerows( data )
1,2006-06-13,this	contains	tabulations
2,2006-06-13,this'contains'quotes
3,2006-06-13,"this""contains""double quotes"

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] Faster Updates
Hello,

Sometimes people complain that UPDATE is slow in postgres. UPDATE...
- generates dead tuples which must be vacuumed.
- needs to hit all indexes even if only one column was modified.

From what I know, UPDATE creates a new copy of the old row with the relevant C/TIDs, then indexes it. On COMMIT the old version becomes dead but stays in the table and indexes until VACUUM.

I propose a simple idea, which may be idiotic, but who knows.

When a row is UPDATEd, instead of storing a new copy of the entire row, only a differential is stored. The old row stays in the page anyway, so we might as well only store the binary encoded equivalent of "Use the row version number X and change column A to value Y". This is possible only if the differential fits in the free space on the page. In this case, a lot less dead space is generated. VACUUM would consolidate the differentials for committed transactions into a new base value for this row.

While reading the page looking for a specific version of a row, all differences would need to be consolidated. This adds overhead, but it might be a win.

With this method, it could be possible to avoid updating the indexes for unmodified columns. This is a big win.

What do you think ? ---(end of broadcast)--- TIP 6: explain analyze is your friend
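[A toy Python model of the differential idea above. The representation is entirely invented — the real thing would be binary-encoded in the page's free space — but it shows how a reader reconstructs a row version by replaying diffs on the base row:]

```python
# Toy model: a page stores a base row plus a list of differentials
# ("start from version X, set these columns"); reading version N
# replays the first N diffs onto the base.

def update(page, base_version, changes):
    # an UPDATE appends only the changed columns, not a full row copy
    page["diffs"].append((base_version, changes))

def read(page, version):
    row = dict(page["base"])
    for _base, changes in page["diffs"][:version]:
        row.update(changes)              # consolidate differences
    return row

page = {"base": {"id": 1, "a": "old", "b": 10}, "diffs": []}
update(page, 0, {"a": "new"})            # only column a changed
update(page, 1, {"b": 20})               # only column b changed
assert read(page, 0) == {"id": 1, "a": "old", "b": 10}
assert read(page, 2) == {"id": 1, "a": "new", "b": 20}
```

The VACUUM consolidation step in the text would amount to folding the committed diffs into `page["base"]` and emptying `page["diffs"]`, and the `id` column never changing is what lets the index entry be left alone.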
Re: [HACKERS] COPY (query) TO file
MySQL already does this for INSERT : INSERT INTO x (a,b) VALUES (1,2), (3,4), (5,6)...; Does MySQL really let you stream that? Trying to do syntax like that in Postgres wouldn't work because the parser would try to build up a parse tree for the whole statement before running the command. Hehe, I don't know, but I suppose it's parsed in one-shot then executed, and not streamed, because : - you can append modifiers at the end of the statement (IGNORE...), - mysql barfs if the complete SQL including data is larger than the query buffer specified in the config file. The second point leads to an interesting fact, ie. dumps generated by phpmyadmin and mysqldump need a parameter specifying how long, in bytes, the insert commands can be ; so that hopefully they can be reloaded later. If one of the inserted values violates a "constraint", it is substituted by "some other default value". Still, it's useful ; and one interesting part is that everything happens in the same SQL command (wrt concurrency). ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] COPY (query) TO file
I was also vaguely pondering whether all the DDL commands could be generalized to receive or send COPY formatted data for repeated execution. It would be neat to be able to prepare an UPDATE with placeholders and stream data in COPY format as parameters to the UPDATE to execute it thousands or millions of times without any protocol overhead or network pipeline stalls. MySQL already does this for INSERT : INSERT INTO x (a,b) VALUES (1,2), (3,4), (5,6)...; allowing arbitrary SELECT statements as a COPY source seems much more powerful and flexible than just supporting COPY FROM VIEW. MySQL already does this : SELECT INTO OUTFILE blah FROM table... Now in both cases the MySQL syntax sucks but it's still quite practical, and the INSERT saves some overhead (parsing, acquiring locks...) and is quite a bit faster than regular INSERT. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Generalized concept of modules
Think about version API compatibility. Suppose you have a working database on server A which uses module foo version 1. Some time passes, you buy another server B and install postgres on it. Meanwhile the module foo has evolved into version 2 which is cooler, but has some minor API incompatibilities. You dump the database on server A and reload it on server B. pg_dump issues an INSTALL MODULE which installs foo version 2 on the new server. Due to the "minor API incompatibilities", your database breaks. It's really cool not to pollute the dumps (and the global namespace...) with all the module functions, however implementing module functionality can be tricky. So don't forget about versions and possible incompatibilities ; also versions means you might need an UPGRADE MODULE which does more than uninstall + reinstall. Suppose a module has created some tables for its use, these shouldn't be dumped when upgrading to a new version ; however maybe the new version will want to add a column... Think gentoo portage, for instance. This excellent package system is a lot more evolved than the module system needs to be, but having a look at the feature list would be a good inspiration maybe. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] pg_proc probin misuse
Hm, thinking again, I guess Tom Lane is right:

> Surely the initialization code would have to be run anyway ... and if the
> function does import a pile of modules, do you really want to cache all
> that in its pg_proc entry? What happens if some of the modules get
> updated later?

Besides, what happens if you store compiled bytecode in a table, then upgrade the python interpreter to a new version... would it be compatible ? I suppose so, but I don't really know... Persistent connections should be used anyway; this makes the RAM caching good... ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] pg_proc probin misuse
If it were really expensive to derive bytecode from source text then maybe it'd make sense to do what you're doing, but surely that's not all that expensive. Everyone else manages to parse prosrc on the fly and cache the result in memory; why isn't plpython doing that? It depends on the number of imported modules in the function. If it imports a lot of modules, it can take some time to compile a python function (especially if the modules have some initialisation code which must be run on import). ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] [HACKERS] Big IN() clauses etc : feature proposal
> 0.450 ms INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20
> 0.443 ms ANALYZE tmp
> 0.365 ms SELECT * FROM tmp
> 0.310 ms DROP TABLE tmp
> 32.918 ms COMMIT

The 32 milliseconds for commit can hardly be catalog related. It seems the file is fsynced before it is dropped.

I'd hope that wasn't what's happening... is the backend smart enough to know not to fsync anything involved with the temp table? ISTM that that transaction shouldn't actually be creating any WAL traffic at all. Though on the other hand there's no reason that DROP should be in the transaction at all; maybe that's gumming things up during the commit.

I included the DROP to make it clear that the time was spent in COMMITting, not in DROPping the table. Also, you can't use CREATE TEMP TABLE AS SELECT ... and at the same time make it ON COMMIT DROP. You have to CREATE and INSERT. With an ON COMMIT DROP temp table, the global timings are the same whether or not it is dropped before commit: it is always the COMMIT which takes all the milliseconds. I still bet on system catalog updates being the main cause of the time spent in COMMIT... (because ANALYZE changes this time) ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] [HACKERS] Big IN() clauses etc : feature proposal
> On Tue, May 09, 2006 at 06:29:31PM +0200, PFC wrote:
> > You mean the cursors' storage is in fact the same internal machinery as a temporary table ?
> Use the source, Luke...

LOL, yeah, I should have, sorry.

> See tuplestore_begin_heap in backend/utils/sort/tuplestore.c and heap_create_with_catalog in backend/catalog/heap.c. You'll find that creating a tuplestore is far easier than creating a temp table.

I had used intuition (instead of the source) to come to the same conclusion regarding the level of complexity of these two... But I'll look at the source ;)

> Perhaps it would be worth creating a class of temporary tables that used a tuplestore, although that would greatly limit what could be done with that temp table.

Just selecting from it I guess, but that's all that's needed. Anything more would duplicate the functionality of a temp table. I find cursors awkward. The application can FETCH from them, but postgres itself can't do it in SQL, unless using FOR.. IN in plpgsql... It would be a powerful addition to be able to split queries, factor out common parts between multiple queries, etc, using this system ; it can even be used to execute an inner part of a query, then plan the rest according to the results and execute it... without the overhead of a temp table.
Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal
> Have you tried getting a profile of what exactly PostgreSQL is doing that takes so long when creating a temp table?

Nope, I'm not proficient in the use of these tools (I stopped using C some time ago).

> BTW, I suspect catalogs might be the answer,

Probably, because :
- Temp tables don't use fsync (I hope)
- Catalogs do
- fsync=off makes COMMIT fast
- fsync=on makes COMMIT slow
- fsync=on and using ANALYZE makes COMMIT slower (more updates to the catalogs I guess)

> which is why Oracle has you define a temp table once (which does all the work of putting it in the catalog) and then you just use it accordingly in each individual session.

Interesting (except for the ANALYZE bit...)
Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal
> Speaking of which, if a temp table is defined as ON COMMIT DROP or DELETE ROWS, there shouldn't be any need to store xmin/xmax, only cmin/cmax, correct?

Yes, that's the type of table I was thinking about... You can't ROLLBACK a transaction on such a table. You can however rollback a savepoint and use "INSERT INTO tmp SELECT FROM tmp", which implies MVCC (I think ?)

I was suggesting to be able to use FETCH (from a cursor) in the same way as SELECT, effectively using a named cursor (DECLARE...) as a simpler, faster version of a temporary table ; but there is another (better ?) option : if rowcount estimates for functions are implemented, then a set-returning function can be written, which takes as argument a named cursor, and returns its rows. It would have accurate rowcount estimation (if the cursor is WITH SCROLL, which is the case here, rows are stored, so we know their number). Then you could do :

DECLARE my_cursor ... AS (query that we only want to do once)
SELECT ... FROM table1 JOIN fetch_cursor( my_cursor ) ON ...
SELECT ... FROM table2 JOIN fetch_cursor( my_cursor ) ON ...
SELECT ... FROM table3 JOIN fetch_cursor( my_cursor ) ON ...

No need to redefine the FETCH keyword. An interesting functionality with minimal hassle.
Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal
The problem is that you need a set-returning function to retrieve the values. SRFs don't have rowcount estimates, so the plans suck.

What about adding some way of rowcount estimation to SRFs, in the way of :

CREATE FUNCTION foo (para, meters) RETURNS SETOF bar AS
$$ ... function code ... $$ LANGUAGE plpgsql
ROWCOUNT_ESTIMATOR
$$ ... estimation code ... $$ ;

Internally, this could create two functions, foo (para, meters) and estimate_foo(para, meters), that are the same language and coupled together (just like a SERIAL column and its sequence). The estimator functions have an implicit return parameter of int8. Parameters may be NULL when they are not known at query planning time. What do you think about this idea?

It would be very useful. A few thoughts... You need to do some processing to know how many rows the function would return. Often, this processing will be repeated in the function itself. Sometimes it's very simple (i.e. the function will RETURN NEXT each element in an array, you know the array length...). Sometimes, for functions returning few rows, it might be faster to compute the entire result set in the cost estimator. So, it might be a bit hairy to find a good compromise.

Ideas on how to do this (clueless hand-waving mode) :

1- Add new attributes to set-returning functions ; basically a list of functions, each returning an estimation parameter (rowcount, cpu tuple cost, etc). This is just like you said.

2- Add an "estimator" to a function, which would just be another function, returning one row, a record, containing the estimations in several columns (rowcount, cpu tuple cost, etc).
Pros : only one function call to estimate, easier and faster ; the estimator just leaves the unknown columns NULL. The estimator need not be in the same language as the function itself. It's just another function.
3- The estimator could be a set-returning function itself, which would return rows mimicking pg_statistics.
Pros : planner-friendly ; the planner would SELECT from the SRF instead of looking in pg_statistics, and the estimator could tell the planner that, for instance, the function will return unique values.
Cons : complex, maybe slow

4- Add simple flags to a function, like :
- returns unique values
- returns sorted values (no need to sort my results)
- please execute me and store my results in a temporary storage, count the rows returned, and plan the outer query accordingly
- etc.
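Option 2 above can be sketched outside the backend in ordinary Python, to show how the coupling would work. Everything here (the names, the registry dict, the fallback default) is hypothetical illustration, not backend code or a proposed pg_proc layout :

```python
# Sketch of option 2: each set-returning function is paired with an
# estimator that returns whatever statistics it can figure out, leaving
# the unknown fields None (like NULL columns).

def unnest_array(arr):
    """A set-returning function: yields each element of an array."""
    for x in arr:
        yield x

def estimate_unnest_array(arr):
    """Estimator: one call, one record of estimates."""
    return {"rowcount": len(arr) if arr is not None else None,
            "cpu_tuple_cost": None}

# Planner-side lookup: function -> its estimator, coupled together
# (just like a SERIAL column and its sequence).
ESTIMATORS = {unnest_array: estimate_unnest_array}

def planned_rowcount(func, *args, default=1000):
    """What the planner would do: ask the estimator if there is one,
    fall back to a fixed default (today's SRF behaviour) otherwise."""
    est = ESTIMATORS.get(func)
    if est is None:
        return default
    rc = est(*args)["rowcount"]
    return rc if rc is not None else default
```

With this, `planned_rowcount(unnest_array, [1, 2, 3])` yields an exact 3 instead of a blind default, which is the whole point of the proposal.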
Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal
> SELECT * FROM somewhere WHERE id IN (SELECT id FROM result)

Well, you can equivalently write :

SELECT * FROM somewhere JOIN (SELECT id FROM result GROUP BY id) AS a USING (id);

It's the same thing (and postgres knows it).

> You might want to use PL to store values, say PLperl, or even C, say:

I tried. The problem is that you need a set-returning function to retrieve the values. SRFs don't have rowcount estimates, so the plans suck.

> Should work faster than an in-application solution :)

Should, but doesn't, because of what I said above... With the version in CVS tip, supporting a fast =ANY( array ), this should be doable, though.
Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal
Does the time for commit change much if you leave out the analyze?

Yes, when I don't ANALYZE the temp table, commit time changes from 30 ms to about 15 ms ; but the queries get horrible plans (see below). Fun thing is, the rowcount from a temp table (which is the problem here) should be available without ANALYZE ; as the temp table is not concurrent, it would be simple to inc/decrement a counter on INSERT/DELETE... I like the temp table approach : it can replace a large, complex query with a batch of smaller and easier to optimize queries...

EXPLAIN ANALYZE SELECT a.* FROM tmp t, annonces_display a WHERE a.id=t.id ORDER BY t.sort;

 Sort (cost=3689.88..3693.15 rows=1310 width=940) (actual time=62.327..62.332 rows=85 loops=1)
   Sort Key: t.sort
   ->  Merge Join (cost=90.93..3622.05 rows=1310 width=940) (actual time=5.595..61.373 rows=85 loops=1)
         Merge Cond: ("outer".id = "inner".id)
         ->  Index Scan using annonces_pkey on annonces (cost=0.00..3451.39 rows=10933 width=932) (actual time=0.012..6.620 rows=10916 loops=1)
         ->  Sort (cost=90.93..94.20 rows=1310 width=12) (actual time=0.098..0.105 rows=85 loops=1)
               Sort Key: t.id
               ->  Seq Scan on tmp t (cost=0.00..23.10 rows=1310 width=12) (actual time=0.004..0.037 rows=85 loops=1)
 Total runtime: 62.593 ms

EXPLAIN ANALYZE SELECT * FROM contacts WHERE id IN (SELECT contact_id FROM tmp);

 Hash Join (cost=28.88..427.82 rows=200 width=336) (actual time=0.156..5.019 rows=45 loops=1)
   Hash Cond: ("outer".id = "inner".contact_id)
   ->  Seq Scan on contacts (cost=0.00..349.96 rows=9396 width=336) (actual time=0.009..3.373 rows=9396 loops=1)
   ->  Hash (cost=28.38..28.38 rows=200 width=4) (actual time=0.082..0.082 rows=46 loops=1)
         ->  HashAggregate (cost=26.38..28.38 rows=200 width=4) (actual time=0.053..0.064 rows=46 loops=1)
               ->  Seq Scan on tmp (cost=0.00..23.10 rows=1310 width=4) (actual time=0.001..0.015 rows=85 loops=1)
 Total runtime: 5.092 ms

ANALYZE tmp;

annonces=> EXPLAIN ANALYZE SELECT a.* FROM tmp t, annonces_display a WHERE a.id=t.id ORDER BY t.sort;

 Sort (cost=508.63..508.84 rows=85 width=940) (actual time=1.830..1.832 rows=85 loops=1)
   Sort Key: t.sort
   ->  Nested Loop (cost=0.00..505.91 rows=85 width=940) (actual time=0.040..1.188 rows=85 loops=1)
         ->  Seq Scan on tmp t (cost=0.00..1.85 rows=85 width=12) (actual time=0.003..0.029 rows=85 loops=1)
         ->  Index Scan using annonces_pkey on annonces (cost=0.00..5.89 rows=1 width=932) (actual time=0.003..0.004 rows=1 loops=85)
               Index Cond: (annonces.id = "outer".id)
 Total runtime: 2.053 ms
(7 lines)

annonces=> EXPLAIN ANALYZE SELECT * FROM contacts WHERE id IN (SELECT contact_id FROM tmp);

 Nested Loop (cost=2.06..139.98 rows=36 width=336) (actual time=0.072..0.274 rows=45 loops=1)
   ->  HashAggregate (cost=2.06..2.51 rows=45 width=4) (actual time=0.052..0.065 rows=46 loops=1)
         ->  Seq Scan on tmp (cost=0.00..1.85 rows=85 width=4) (actual time=0.003..0.016 rows=85 loops=1)
   ->  Index Scan using contacts_pkey on contacts (cost=0.00..3.04 rows=1 width=336) (actual time=0.003..0.004 rows=1 loops=46)
         Index Cond: (contacts.id = "outer".contact_id)
 Total runtime: 0.341 ms
Re: [PERFORM] [HACKERS] Big IN() clauses etc : feature proposal
> Creating cursors for a simple plan like a single sequential scan is fast because it's using the original data from the table.

I used the following query :

SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20

It's a backward index scan + limit... not a seq scan. And it's damn fast :

0.042 ms BEGIN
0.153 ms DECLARE tmp SCROLL CURSOR WITHOUT HOLD FOR SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20
0.246 ms FETCH ALL FROM tmp
0.048 ms MOVE FIRST IN tmp
0.246 ms FETCH ALL FROM tmp
0.048 ms CLOSE tmp
0.084 ms COMMIT

> But your example was predicated on this part of the job being a complex query. If it's a complex query involving joins and groupings, etc, then it will have to be materialized and there's no (good) reason for that to be any faster than a temporary table which is effectively the same thing.

You mean the cursors' storage is in fact the same internal machinery as a temporary table ? In that case, this raises an interesting question : why is the cursor faster ?

Let's try a real-life example from my website : it is a search query (quite complex) which is then joined to a lot of tables to resolve FKeys. To that query I must add an application-made join using a big IN() clause extracted from the data. Timings include the time to fetch the results into Python. The "running total" column is the sum of all timings since the BEGIN.

query_time  running_total  rows  query
 0.061 ms    0.061 ms       -1   BEGIN
23.420 ms   23.481 ms       85   SELECT * FROM (huge query with a lot of joins)
 4.318 ms   27.799 ms        2   SELECT l.*, u.login, u.bg_color FROM annonces_log l, users u WHERE u.id=l.user_id AND l.annonce_id IN (list of ids from previous query) ORDER BY annonce_id, added
 0.241 ms   28.040 ms       -1   COMMIT

(Just in case you want to hurt yourself, here's the EXPLAIN ANALYZE output : http://peufeu.com/temp/big_explain.txt)

Using a cursor takes about the same time. Also, doing just the search query takes about 12 ms ; the joins take up the rest.
Now, I'll rewrite my query eliminating the joins and using a temp table. Storing the whole result in the temp table would be too slow, because there are too many columns. Therefore I will only store the primary and foreign key columns, and join again to the main table to get the full records.

query_time  running_total  rows  query
 0.141 ms    0.141 ms       -1   BEGIN

Do the search :
 8.229 ms    8.370 ms       -1   CREATE TEMPORARY TABLE tmp AS SELECT id, city_id, zipcode, contact_id, contact_group_id, price/terrain as sort FROM (stripped down search query)
 0.918 ms    9.287 ms       -1   ANALYZE tmp

Fetch the main data to display :
 7.663 ms   16.951 ms       85   SELECT a.* FROM tmp t, annonces_display a WHERE a.id=t.id ORDER BY t.sort

Fetch log entries associated with each row (one row to many log entries) :
 1.021 ms   17.972 ms        2   SELECT l.*, u.login, u.bg_color FROM annonces_log l, users u, tmp t WHERE u.id=l.user_id AND l.annonce_id = t.id ORDER BY annonce_id, added
 3.468 ms   21.440 ms      216   SELECT annonce_id, array_accum(list_id) AS list_ids, array_accum(COALESCE(user_id,0)) AS list_added_by, max(added) AS added_to_list FROM bookmarks GROUP BY annonce_id

Resolve foreign key relations :
 1.034 ms   22.474 ms       37   SELECT r.annonce_id FROM read_annonces r, tmp t WHERE r.annonce_id = t.id
 0.592 ms   23.066 ms        9   SELECT * FROM cities_dist_zipcode WHERE zipcode IN (SELECT zipcode FROM tmp)
 0.716 ms   23.782 ms       11   SELECT * FROM cities_dist WHERE id IN (SELECT city_id FROM tmp)
 1.125 ms   24.907 ms       45   SELECT * FROM contacts WHERE id IN (SELECT contact_id FROM tmp)
 0.799 ms   25.705 ms       42   SELECT * FROM contact_groups WHERE id IN (SELECT contact_group_id FROM tmp)
 0.463 ms   26.169 ms       -1   DROP TABLE tmp
32.208 ms   58.377 ms       -1   COMMIT

From this we see : using a temporary table is FASTER than doing the large query with all the joins (26 ms versus 28 ms). It's also nicer and cleaner. However the COMMIT takes as much time as all the queries together !
Let's run with fsync=off :

query_time  running_total  rows  query
 0.109 ms    0.109 ms       -1   BEGIN
 8.321 ms    8.430 ms       -1   CREATE TEMPORARY TABLE tmp AS SELECT id, city_id, zipcode, contact_id, contact_group_id, price/terrain as sort FROM (stripped down search query)
 0.849 ms    9.280 ms       -1   ANALYZE tmp
 7.360 ms   16.640 ms       85   SELECT a.* FROM tmp t, annonces_display a WHERE a.id=t.id ORDER BY t.sort
 1.067 ms   17.707 ms        2   SELECT l.*, u.login, u.bg_color FROM annonces_log l, users u, t
Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal
> It would be interesting to know what the bottleneck is for temp tables for you. They do not go via the buffer-cache, they are stored in private memory in the backend, they are not xlogged. Nor flushed to disk on backend exit. They're about as close to in-memory tables as you're going to get...

Hum... Timings are a mean over 100 queries, including roundtrip to localhost, via a python script.

0.038 ms BEGIN
0.057 ms SELECT 1
0.061 ms COMMIT

0.041 ms BEGIN
0.321 ms SELECT count(*) FROM bookmarks
0.080 ms COMMIT

(this test table contains about 250 rows)

0.038 ms BEGIN
0.378 ms SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20
0.082 ms COMMIT

(the ORDER BY uses an index)

0.042 ms BEGIN
0.153 ms DECLARE tmp SCROLL CURSOR WITHOUT HOLD FOR SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20
0.246 ms FETCH ALL FROM tmp
0.048 ms MOVE FIRST IN tmp
0.246 ms FETCH ALL FROM tmp
0.048 ms CLOSE tmp
0.084 ms COMMIT

(the CURSOR is about as fast as a simple query)

0.101 ms BEGIN
1.451 ms CREATE TEMPORARY TABLE tmp ( a INTEGER NOT NULL, b INTEGER NOT NULL, c TIMESTAMP NOT NULL, d INTEGER NOT NULL ) ON COMMIT DROP
0.450 ms INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20
0.443 ms ANALYZE tmp
0.365 ms SELECT * FROM tmp
0.310 ms DROP TABLE tmp
32.918 ms COMMIT

CREATING the table is OK, but what happens on COMMIT ? I hear the disk seeking frantically. With fsync=off, I get this :

0.090 ms BEGIN
1.103 ms CREATE TEMPORARY TABLE tmp ( a INTEGER NOT NULL, b INTEGER NOT NULL, c TIMESTAMP NOT NULL, d INTEGER NOT NULL ) ON COMMIT DROP
0.439 ms INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20
0.528 ms ANALYZE tmp
0.364 ms SELECT * FROM tmp
0.313 ms DROP TABLE tmp
0.688 ms COMMIT

Getting closer ? I'm betting on system catalog updates. I get the same timings with ROLLBACK instead of COMMIT. Temp tables have a row in pg_class...
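For reference, per-statement timings like the ones above come from a small python script issuing one statement at a time and measuring the round trip. A minimal version of such a harness might look like this ; it is shown against the stdlib sqlite3 module purely so the sketch runs standalone, while the measurements in this mail were of course made against PostgreSQL over a localhost connection :

```python
import sqlite3
import time

def timed(cur, sql):
    """Execute one statement and report its wall-clock time in ms,
    in the spirit of the per-query timings quoted above."""
    t0 = time.perf_counter()
    cur.execute(sql)
    # Fetch results only for statements that return rows.
    rows = cur.fetchall() if cur.description else None
    dt = (time.perf_counter() - t0) * 1000.0
    print(f"{dt:8.3f} ms  {sql}")
    return rows

con = sqlite3.connect(":memory:")
cur = con.cursor()
timed(cur, "CREATE TEMPORARY TABLE tmp (a INTEGER NOT NULL, b INTEGER NOT NULL)")
timed(cur, "INSERT INTO tmp VALUES (1, 2), (3, 4)")
rows = timed(cur, "SELECT * FROM tmp")
timed(cur, "DROP TABLE tmp")
con.commit()
```

The real script averaged each statement over 100 runs ; that is just a loop around `timed()`.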
Another temporary table wart :

BEGIN;
CREATE TEMPORARY TABLE tmp ( a INTEGER NOT NULL, b INTEGER NOT NULL, c TIMESTAMP NOT NULL, d INTEGER NOT NULL ) ON COMMIT DROP;
INSERT INTO tmp SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20;

EXPLAIN ANALYZE SELECT * FROM tmp;
 Seq Scan on tmp (cost=0.00..25.10 rows=1510 width=20) (actual time=0.003..0.006 rows=20 loops=1)
 Total runtime: 0.030 ms
(2 lines)

ANALYZE tmp;
EXPLAIN ANALYZE SELECT * FROM tmp;
 Seq Scan on tmp (cost=0.00..1.20 rows=20 width=20) (actual time=0.003..0.008 rows=20 loops=1)
 Total runtime: 0.031 ms

We see that the temp table has a very wrong estimated rowcount until it has been ANALYZEd. However, temporary tables do not support concurrent access (obviously) ; and in the case of on-commit-drop tables, inserts can't be rolled back (obviously), so an accurate rowcount could be maintained via a simple counter...
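The counter idea above is simple enough to sketch : since an on-commit-drop temp table sees no concurrent access, an exact rowcount can be kept as a plain counter bumped on INSERT/DELETE, so the planner never needs a stale estimate. This is a pure-Python illustration of the bookkeeping, not backend code :

```python
# Hypothetical illustration: a single-session table whose rowcount is
# maintained inline, making the "estimate" always exact without ANALYZE.

class TempTable:
    def __init__(self):
        self.rows = []
        self.rowcount = 0          # maintained on every INSERT/DELETE

    def insert(self, row):
        self.rows.append(row)
        self.rowcount += 1

    def delete(self, pred):
        kept = [r for r in self.rows if not pred(r)]
        self.rowcount -= len(self.rows) - len(kept)
        self.rows = kept

    def estimated_rows(self):
        # What the planner would read instead of the default guess
        # (the rows=1510 seen in the un-ANALYZEd plan above).
        return self.rowcount

t = TempTable()
for i in range(20):
    t.insert((i,))
t.delete(lambda r: r[0] % 2 == 0)   # delete the even half
```

After the delete, `t.estimated_rows()` is exactly 10, with no ANALYZE pass needed.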
Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal
Additionally to your query you are already transferring the whole result set multiple times. First you copy it to the result table. Then you read it again. Your subsequent queries will also have to read over all the unneeded tuples just to get your primary key. Considering that the result set is not very large and will be cached in RAM, this shouldn't be a problem. then why useth thy not the DISTINCT clause when building thy result table and thou shalt have no duplicates. Because the result table contains no duplicates ;) I need to remove duplicates in this type of queries : -- get object owners info SELECT * FROM users WHERE id IN (SELECT user_id FROM results); And in this case I find IN() easier to read than DISTINCT (what I posted was a simplification of my real use case...) which is a perfect reason to use a temp table. Another variation on the temp table scheme is use a result table and add a query_id. True. Doesn't solve my problem though : it's still complex, doesn't have good rowcount estimation, bloats a table (I only need these records for the duration of the transaction), etc. We do something like this in our web application when users submit complex queries. For each query we store tuples of (query_id,result_id) in a result table. It's then easy for the web application to page the result set. Yes, that is about the only sane way to page big result sets. A cleaner solution usually pays off in the long run whereas a hackish or overly complex solution will bite you in the behind for sure as time goes by. Yes, but in this case temp tables add too much overhead. I wish there were RAM based temp tables like in mysql. However I guess the current temp table slowness comes from the need to mark their existence in the system catalogs or something. That's why I proposed using cursors... 
Re: [HACKERS] [PERFORM] Big IN() clauses etc : feature proposal
> You might consider just selecting your primary key or a set of primary keys to involved relations in your search query. If you currently use "select *" this can make your result set very large. Copying all the result set to the temp table costs you additional IO that you probably don't need.

It is a bit of a catch : I need this information, because the purpose of the query is to retrieve these objects. I can first store the ids, then retrieve the objects, but it's one more query.

> Also you might try: SELECT * FROM somewhere JOIN result USING (id) Instead of: SELECT * FROM somewhere WHERE id IN (SELECT id FROM result)

Yes, you're right in this case ; however the query to retrieve the owners needs to eliminate duplicates, which IN() does.

> On the other hand if your search query runs in 10ms it seems to be fast enough for you to run it multiple times. There's probably no point in optimizing anything in such case.

I don't think so :
- 10 ms is a mean time ; sometimes it can take much more time, sometimes it's faster.
- Repeating the query might yield different results if records were added or deleted in the meantime.
- Complex search queries have imprecise rowcount estimates ; hence the joins that I would add to them will get suboptimal plans.

Using a temp table is really the cleanest solution now ; but it's too slow, so I reverted to generating big IN() clauses in the application.
[HACKERS] Big IN() clauses etc : feature proposal
> The moral of the story is that you're probably better off running a bunch of small selects than in trying to optimize things with one gargantuan select. Ever experiment with loading the parameters into a temp table and joining to that? Also, it might be worth re-testing that conclusion with PG CVS tip (or 8.2 when it comes out). The reimplementation of IN as = ANY that I did a couple months ago might well change the results.

Long mail, but I think it's interesting... I think this is a generic problem, which is often encountered : selecting a bunch of records based on a list of primary keys (or other indexed, unique field) ; said list being anything from very short to quite large.

Here are a few occurrences of this need :

1- The application supplies a list of id's (the case of the OP of this thread)

2- A query Q1 yields a list of selected objects, that we wish to use in several subsequent queries. And Q1 is a query we don't wish to do several times, either because it's slow, complicated (advanced search, for instance), or it acts on a constantly moving dataset, so the results would be different each time. So we store the result of Q1 in the application, or in a temp table, or in an array in a plpgsql variable, whatever, to reuse them. Then, for each of these objects, often we will make more queries to resolve foreign keys (get category name, owner name, from categories and users tables, etc).

I have encountered both cases quite often, and they both pose a few problems. I think it would be a good opportunity for a new feature (see below).

A typical use case for point 2 : consider an "objects" table. Each object...
- is related to one or several rows from the "categories" table via an "objects_categories" link table.
- has an owner_id referencing the "users" table

I do an "advanced search" query on "objects", which returns a list of objects.
I can join directly to "users" to get the owner's name, but joining to "categories" is already problematic because of the many-to-many relationship. I wish to do this : fetch all objects matching the search criteria ; fetch the owner users ; fetch the categories ; build in my application object space a clean and straightforward data representation for all this.

Also :
- I do not wish to complicate the search query.
- The row estimates for the search query results are likely to be "not so good" (because it's a complex query) ; so the joins to users and categories are likely to use suboptimal plans based on "not so good" estimates.
- The rows from "objects" are large ; so moving them around through a lot of small joins hurts performance.

The obvious solution is this :

BEGIN;
CREATE TEMPORARY TABLE results ON COMMIT DROP AS SELECT * FROM advanced search query;
ANALYZE results;
-- get the results to the application
SELECT * FROM results;
-- get object owners info
SELECT * FROM users WHERE id IN (SELECT user_id FROM results);
-- get category info
SELECT * FROM categories WHERE id IN (SELECT category_id FROM objects_to_categories WHERE object_id IN (SELECT id FROM results));
-- get object/category relations (the ORM will use this to link objects in the application)
SELECT * FROM objects_to_categories WHERE object_id IN (SELECT id FROM results);
COMMIT;

You might wonder why I do it this way on the "categories" table. This is because I use an Object-Relational mapper which will instantiate a User or Category class object for each row I fetch from these tables. I do not want to fetch just the username, using a simple join, but I want the full object, because :
- I want to instantiate these objects (they have useful methods to process rights etc)
- I do not want to mix columns from "objects" and "users"

And I do not wish to instantiate each category more than once.
This would waste memory, but more importantly, it is a lot cleaner to have only one instance per row, because my ORM then translates the foreign key relations into object relations (pointers). Each instantiated category will contain a list of Object instances ; each Object instance will contain a list of the categories it belongs to, and point to its owner user.

Back to the point : I can't use the temp table method, because temp tables are too slow. Creating a temp table, filling it, analyzing it and then dropping it takes about 100 ms. The search query, on average, takes 10 ms. So I have to move this logic to the application, or to plpgsql, and jump through hoops and use big IN() clauses ; which has the following drawbacks :
- slow
- ugly
- very hard for the ORM to auto-generate

*** Feature proposal :

A way to store query results in a named buffer and reuse them in the next queries. This should be as fast as possible, store results in RAM
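The access pattern behind the proposal is easy to show in plain Python : run the expensive search once, keep its rows in RAM under a name, and let every follow-up lookup read from the buffer. This is a pure illustration of the pattern ; the names and the dict-based "buffer" are hypothetical, not a proposed backend design :

```python
# Hypothetical sketch of the "named buffer" proposal.

buffers = {}

def store(name, rows):
    """Materialize a result set once (like a WITH SCROLL cursor)."""
    buffers[name] = list(rows)
    return buffers[name]

def fetch(name):
    """Reuse the stored rows in any later query, no re-execution."""
    return buffers[name]

# The expensive "advanced search" runs exactly once...
results = store("results", [{"id": 1, "user_id": 10, "category_id": 7},
                            {"id": 2, "user_id": 10, "category_id": 8}])

# ...then each follow-up query resolves foreign keys from the buffer,
# with duplicates eliminated just as the IN() subqueries would do.
owner_ids = sorted({r["user_id"] for r in fetch("results")})
category_ids = sorted({r["category_id"] for r in fetch("results")})
```

The point of the feature is that `fetch("results")` costs a buffer read, not a catalog entry, a disk file, and a 30 ms COMMIT.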
Re: [HACKERS] [PERFORM] A Better External Sort?
> Bulk loading speed is irrelevant here - that is dominated by parsing, which we have covered copiously (har har) previously and have sped up by 500%, which still makes Postgres < 1/2 the loading speed of MySQL.

Let's ask MySQL 4.0 :

LOAD DATA INFILE blah
0 errors, 666 warnings

SHOW WARNINGS;
not implemented. upgrade to 4.1

duh
Re: [HACKERS] [PERFORM] A Better External Sort?
Just to add a little anarchy in your nice debate...

Who really needs all the results of a sort on your terabyte table ? I guess not many people do a SELECT from such a table and want all the results. So, this leaves :
- Really wanting all the results, to fetch using a cursor,
- CLUSTER type things, where you really want everything in order,
- Aggregates (Sort->GroupAggregate), which might really need to sort the whole table,
- Complex queries where the whole dataset needs to be examined, in order to return a few values,
- Joins (again, the whole table is probably not going to be selected),
- And the ones I forgot.

However, most likely you only want to SELECT N rows, in some ordering :
- the first N (ORDER BY x LIMIT N)
- last N (ORDER BY x DESC LIMIT N)
- WHERE x>value ORDER BY x LIMIT N
- WHERE x<value ORDER BY x DESC LIMIT N

Or, you are doing a Merge JOIN against some other table ; in that case, yes, you might need the whole sorted terabyte table, but most likely there are WHERE clauses in the query that restrict the set, and thus, maybe we can get some conditions or limit values on the column to sort. Also the new, optimized hash join, which is more memory efficient, might cover this case.

Point is, sometimes, you only need part of the results of your sort. And the bigger the sort, the more likely it becomes that you only want part of the results. So, while we're in the fun hand-waving, new-algorithm-trying mode, why not consider this right from the start ? (I know I'm totally in hand-waving mode right now, so slap me if needed.)

I'd say your new, fancy sort algorithm needs a few more input values :
- Range of values that must appear in the final result of the sort : none, minimum, maximum, both, or even a set of values from the other side of the join, hashed, or sorted.
- LIMIT information (first N, last N, none)
- Enhanced LIMIT information (first/last N values of the second column to sort, for each value of the first column) (the infamous "top 10 by category" query)
- etc.
With this, the amount of data that needs to be kept in memory is dramatically reduced, from the whole table (even using your compressed keys, that's big) to something more manageable, which will be closer to the size of the final result set returned to the client, and a lot of effort is avoided. So, this would not be useful in all cases, but when it applies, it would be really useful.

Regards !
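The "first N with bounded memory" case above can be sketched with a bounded heap : instead of sorting the whole input, only N candidates are ever held in memory, which is exactly ORDER BY x LIMIT N without the full sort. A generic illustration of the idea (using the stdlib heapq), not the proposed executor change :

```python
import heapq
import random

# Keep only the N smallest rows by key: memory is O(N), not O(table).
def top_n(rows, n, key):
    return heapq.nsmallest(n, rows, key=key)

# A stand-in for a big table of (sort_key, payload) rows.
random.seed(0)
rows = [(random.randrange(10**6), "payload") for _ in range(10000)]

# ORDER BY sort_key LIMIT 20, without materializing a full sort.
first_20 = top_n(rows, 20, key=lambda r: r[0])
```

`heapq.nsmallest(n, it, key)` is documented as equivalent to `sorted(it, key=key)[:n]`, but it streams the input while retaining only n items, which is the memory win the mail is pointing at.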
Re: [HACKERS] [PERFORM] Avoiding tuple construction/deconstruction during joining
On my machine (Laptop with Pentium-M 1.6 GHz and 512MB DDR333) I get the following timings :

Big join query with all the fields and no ORDER BY (I just put a SELECT * on the first table), yielding about 6k rows :
=> 12136.338 ms

Replacing the SELECT * from the table with many fields by just a SELECT of the foreign key columns :
=> 1874.612 ms

I felt like playing a bit, so I implemented a hash join in python (download the file, it works on Miroslav's data). All timings do not include time to fetch the data from the database. Fetching all the tables takes about 1.1 secs.

* With something that looks like the current implementation (copying tuples around) and fetching all the fields from the big table :
=> Joining : 4.3 secs

* Fetching only the integer fields (fetching all the tables : 0.4 secs) :
=> Joining : 1.7 secs

* A smarter join which copies nothing and updates the rows as they are processed, adding fields :
=> Joining : 0.4 secs

With the just-in-time compiler activated, it goes down to about 0.25 seconds.

First thing, this confirms what Tom said. It also means that doing this query in the application can be a lot faster than doing it in postgres, including fetching all of the tables. There's a problem somewhere ! It should be the other way around ! The python mappings (dictionaries : { key : value }) are optimized like crazy, but they store column names for each row. And it's a dynamic script language ! Argh.

Note : run the program like this :
python test.py | less -S
so that the time spent scrolling your terminal does not spoil the measurements.

Download test program : http://boutiquenumerique.com/pf/miroslav/test.py
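For readers who don't want to download test.py, the general technique is simple : build a hash on the smaller table's key, probe it with the bigger table, and extend the probe rows in place instead of copying tuples around. This is a minimal generic sketch of such a hash join (with hypothetical sample data), not the actual test.py :

```python
# Minimal in-memory hash join: build on the small side, probe with the
# big side, merge fields into the probe row (the "smarter join" variant
# that adds fields instead of copying whole tuples).

def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = {}
    for row in build_rows:                      # build phase
        table.setdefault(row[build_key], []).append(row)
    out = []
    for row in probe_rows:                      # probe phase
        for match in table.get(row[probe_key], ()):
            joined = dict(row)                  # start from the probe row
            joined.update(match)                # add the build side's fields
            out.append(joined)
    return out

# Hypothetical sample data shaped like the annonces_log/users join above.
users = [{"user_id": 1, "login": "alice"}, {"user_id": 2, "login": "bob"}]
logs = [{"annonce_id": 7, "user_id": 1},
        {"annonce_id": 8, "user_id": 2},
        {"annonce_id": 9, "user_id": 1}]

joined = hash_join(users, logs, "user_id", "user_id")
```

One pass over each input, O(1) lookups on the probe side ; which is why, as noted above, hash joins stay fast even in an interpreted language.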
Re: [HACKERS] [PERFORM] Avoiding tuple construction/deconstruction during joining
I have asked him for the data and played with his queries, and obtained massive speedups with the following queries :

http://boutiquenumerique.com/pf/miroslav/query.sql
http://boutiquenumerique.com/pf/miroslav/query2.sql
http://boutiquenumerique.com/pf/miroslav/materialize.sql

Note that my optimized version of the Big Joins is not much faster than the materialized view without index (hash joins are damn fast in postgres), but of course using an index...