Re: [HACKERS] Odd out of memory problem.
On Thu, Mar 29, 2012 at 7:38 PM, Peter Eisentraut pete...@gmx.net wrote: On tis, 2012-03-27 at 00:53 +0100, Greg Stark wrote: Hm. So my original plan was dependent on adding the state-merge function we've talked about in the past. Not all aggregate functions can necessarily support such a function, but I think all or nearly all the built-in aggregates can. Certainly min, max, count, sum, avg, stddev, and array_agg can, which are most of what people do. That would be a function which can take two state variables and produce a new state variable. This information could also be useful to have in PL/Proxy (or similar FDWs) to be able to integrate aggregate computation into the language. Currently, you always have to do the state merging yourself.

I don't know exactly how PL/Proxy or pgpool accomplish multi-phase aggregation, but the proposal above is a state-merge function, so it doesn't apply to aggregate results that have already passed through the final function. Of course, for aggregates that don't have a final function, it is fine to call the state-merge function on the results.

Thanks, -- Hitoshi Harada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Tab completion of double quoted identifiers broken
Hi, I just spotted that tab completion of double quoted identifiers seems to be broken in 9.2devel. For example things like this which worked in 9.1 no longer work: UPDATE "foo bar" <tab> It looks like the problem is in get_previous_words() here:

    if (buf[start] == '"')
        inquotes = !inquotes;
    else if (!inquotes)
    {
        ... test for start of word ...

which fails to account for the fact that the double quote itself might be the start of the word. I think the solution is just to remove the else:

    if (buf[start] == '"')
        inquotes = !inquotes;
    if (!inquotes)
    {
        ... test for start of word ...

to allow it to find a word-breaking character immediately before the double quote. Regards, Dean
Re: [HACKERS] measuring lwlock-related latency spikes
On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas robertmh...@gmail.com wrote: which means, if I'm not confused here, that every single lwlock-related stall > 1s happened while waiting for a buffer content lock. Moreover, each event affected a different buffer. I find this result so surprising that I have a hard time believing that I haven't screwed something up, so if anybody can check over the patch and this analysis and suggest what that thing might be, I would appreciate it.

Possible candidates are 1) pages on the RHS of the PK index on accounts. When the page splits, a new buffer will be allocated and the contention will move to the new buffer. Given so few stalls, I'd say this was the block one above leaf level. 2) Buffer writes hold the content lock in shared mode, so a delayed I/O during checkpoint on a page requested by another backend for write would show up as a wait for a content lock. That might happen to updates where the checkpoint write occurs between the search and write portions of the update.

The next logical step in measuring lock waits is to track the reason for the lock wait, not just the lock wait itself. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Http Frontend implemented using pgsql?
I had a thought that it might be interesting to have a simple C frontend that converts HTTP to and from some pgsql friendly structure and delegates all the core logic to a stored procedure in the database. This might make it easier to hack on the API without worrying about memory management and buffer overflow vulnerabilities. Is this a brain wave or a brain fart?
Re: [HACKERS] measuring lwlock-related latency spikes
On Sat, Mar 31, 2012 at 4:53 AM, Simon Riggs si...@2ndquadrant.com wrote: The next logical step in measuring lock waits is to track the reason for the lock wait, not just the lock wait itself. I had the same thought. I'm not immediately sure what the best way to do that is, but I'll see if I can figure something out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Odd out of memory problem.
On fre, 2012-03-30 at 22:59 -0700, Hitoshi Harada wrote: I don't know exactly how PL/Proxy or pgpool accomplish the multi-phase aggregate,

They don't.

but in theory the proposal above is state-merge function, so it doesn't apply to general aggregate results that passed through the final function. Of course some functions that don't have final functions are ok to call state-merge function on the results.

You're right, it's not quite the same thing. But perhaps it could be kept in mind if someone wants to develop things in this area.
Re: [HACKERS] Patch pg_is_in_backup()
Hi Gilles, first and foremost, sorry for jumping in this thread so late. I read all previous discussions and I'd be happy to help you with this patch. Agreed and sorry for the response delay. I've attached 2 patches here, the first one is the same as before with just the renaming of the function into pg_is_in_exclusive_backup(). My quick response: I really like the idea of having a function that simply returns a boolean value. To be honest, I would have called it pg_is_in_backup() as you originally did - after all, I do not execute pg_start_exclusive_backup() or pg_stop_exclusive_backup(). It is more intuitive and pairs with pg_is_in_recovery(). I have never found any mention whatsoever in the documentation that talks about exclusive backup, and I am afraid it would generate confusion. However, I leave this up to the rest of the community's judgement (here is my opinion). I agree also that a function that returns the timestamp of the start of the backup might be useful. My opinion with the 'exclusive' keyword applies here too (I would simply name it pg_backup_start_time()). Finally, I tried to apply the patch but it looks like we need a new version that can apply with current HEAD. If you can do that, I am happy to assist with the review. Have a good weekend. Thank you, Gabriele -- Gabriele Bartolini - 2ndQuadrant Italia PostgreSQL Training, Services and Support gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it
Re: [HACKERS] measuring lwlock-related latency spikes
On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas robertmh...@gmail.com wrote: But I didn't think we were ever supposed to hold content locks for that long. Isn't that lock held while doing visibility checks? What about I/O waiting for a clog page to be read? -- greg
Re: [HACKERS] Http Frontend implemented using pgsql?
On Sat, Mar 31, 2012 at 6:27 AM, Dobes Vandermeer dob...@gmail.com wrote: I had a thought that it might be interesting to have a simple C frontend that converts HTTP to and from some pgsql friendly structure and delegates all the core logic to a stored procedure in the database. This might make it easier to hack on the API without worrying about memory management and buffer overflow vulnerabilities. Is this a brain wave or a brain fart? Something along the lines of a stripped down mod_libpq? http://asmith.id.au/mod_libpq.html If we had something along the lines of JSON -> Row/setof in core, I could see this being a very nice RPC mechanism for PostgreSQL. Plain HTTP still gives you the session/transaction control problem of stateless clients, but maybe coupled with PgPool you could cobble something together... a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)
On Sat, Mar 31, 2012 at 1:44 AM, Daniel Farina dan...@heroku.com wrote: On Fri, Mar 30, 2012 at 10:21 AM, Daniel Farina dan...@heroku.com wrote: Any enhancement here that can't be used with libpq via, say, drop-in .so seems unworkable to me, and that's why any solution that is basically proxying to the database is basically a non-starter outside the very earliest prototyping stages. The tuple scanning and protocol semantics can and even should remain the same, especially at first. I should add: proxying could work well if libpq had all the right hooks. The server could remain ignorant. Regardless, upstream changes result. Just to be clear, are you saying that writing a process that accepts requests by HTTP and translates them into requests using the existing protocol to send to the server would have unacceptable performance? Or is there something else about it that is a non-starter?
Re: [HACKERS] measuring lwlock-related latency spikes
On Sat, Mar 31, 2012 at 1:58 PM, Greg Stark st...@mit.edu wrote: On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas robertmh...@gmail.com wrote: But I didn't think we were ever supposed to hold content locks for that long. Isn't that lock held while doing visibility checks? What about I/O waiting for a clog page to be read? So what we should be logging is the list of lwlocks held when the lock wait occurred. That would differentiate call paths somewhat better than just looking at the current lock request. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] narwhal versus gnu_printf
I noticed that the build logs for buildfarm member narwhal (a mingw critter) contain an awful lot of occurrences of

    ../../src/include/utils/elog.h:159: warning: `gnu_printf' is an unrecognized format function type

Evidently, the following hunk in pg_config_manual.h failed to consider mingw. Is there a simple fix?

    /*
     * Set the format style used by gcc to check printf type functions. We really
     * want the gnu_printf style set, which includes what glibc uses, such
     * as %m for error strings and %lld for 64 bit long longs. But not all gcc
     * compilers are known to support it, so we just use printf which all
     * gcc versions alive are known to support, except on Windows where
     * using gnu_printf style makes a dramatic difference. Maybe someday
     * we'll have a configure test for this, if we ever discover use of more
     * variants to be necessary.
     */
    #ifdef WIN32
    #define PG_PRINTF_ATTRIBUTE gnu_printf
    #else
    #define PG_PRINTF_ATTRIBUTE printf
    #endif

regards, tom lane
Re: [HACKERS] Tab completion of double quoted identifiers broken
Dean Rasheed dean.a.rash...@gmail.com writes: I just spotted that tab completion of double quoted identifiers seems to be broken in 9.2devel. Yeah, looks like I broke it :-( --- I think I missed the fact that the last check with WORD_BREAKS was looking at the previous character not the current one, so I thought adding the else was a safe optimization. Patch applied, thanks for the report! regards, tom lane
Re: [HACKERS] narwhal versus gnu_printf
On 03/31/2012 11:01 AM, Tom Lane wrote: I noticed that the build logs for buildfarm member narwhal (a mingw critter) contain an awful lot of occurrences of ../../src/include/utils/elog.h:159: warning: `gnu_printf' is an unrecognized format function type Evidently, the following hunk in pg_config_manual.h failed to consider mingw. Is there a simple fix? [...] No, we did consider mingw - very much so, since this attribute isn't used under MSVC. Narwhal is using an ancient version of gcc, however. Compare this to frogmouth, which has no such warnings. (I know frogmouth hasn't been reporting for 13 days - I'm fixing that.) cheers andrew
Re: [HACKERS] Speed dblink using alternate libpq tuple storage
Marko Kreen mark...@gmail.com writes: On Thu, Mar 29, 2012 at 06:56:30PM -0400, Tom Lane wrote: Yeah. Perhaps we should tweak the row-processor callback API so that it gets an explicit notification that this is a new resultset. Duplicating PQexec's behavior would then involve having the dblink row processor throw away any existing tuplestore and start over when it gets such a call. There's multiple ways to express that but the most convenient thing from libpq's viewpoint, I think, is to have a callback that occurs immediately after collecting a RowDescription message, before any rows have arrived. So maybe we could express that as a callback with valid res but columns set to NULL?

A different approach would be to add a row counter to the arguments provided to the row processor; then you'd know a new resultset had started if you saw rowcounter == 0. This might have another advantage of not requiring the row processor to count the rows for itself, which I think many row processors would otherwise have to do.

I had been leaning towards the second approach with a row counter, because it seemed cleaner, but I thought of another consideration that makes the first way seem better. Suppose that your row processor has to do some setup work at the start of a result set, and that work needs to see the resultset properties (eg number of columns) so it can't be done before starting PQgetResult. In the patch as submitted, the only way to manage that is to keep enough state to recognize that the current row processor call is the first one, which we realized is inadequate for multiple-result-set cases. With a row counter argument you can do the setup whenever rowcount == 0, which fixes that. But neither of these methods deals correctly with an empty result set! To make that work, you need to add extra logic after the PQgetResult call to do the setup work the row processor should have done but never got a chance to. So that's ugly, and it makes for an easy-to-miss bug.
A call that occurs when we receive RowDescription, independently of whether the result set contains any rows, makes this a lot cleaner. regards, tom lane
Re: [HACKERS] [COMMITTERS] pgsql: Add PGDLLIMPORT to ScanKeywords and NumScanKeywords.
Andrew Dunstan and...@dunslane.net writes: On 03/31/2012 10:56 AM, Tom Lane wrote: Add PGDLLIMPORT to ScanKeywords and NumScanKeywords. Per buildfarm, this is now needed by contrib/pg_stat_statements. It seems to have broken mingw earlier now :-( Ugh. It looks like ecpg (and also pg_dump) are defining symbols named ScanKeywords and NumScanKeywords, but relying on the backend's keywords.h to provide global declarations for those. And this doesn't work once we PGDLLIMPORT-decorate those declarations. The only simple fix I can see is to rename the symbols in ecpg and pg_dump to something else. This is probably a good thing anyway to reduce confusion. Anybody have another idea? regards, tom lane
Re: [HACKERS] [GENERAL] pg_dump incredibly slow dumping a single schema from a large db
Mike Roest mike.ro...@replicon.com writes: The file is 6 megs so I've dropped it here. That was doing perf for the length of the pg_dump command and then a perf report -n http://dl.dropbox.com/u/13153/output.txt

Hmm ... that's a remarkably verbose output format, but the useful part of this info seems to be just

    # Events: 2M cpu-clock
    #
    # Overhead  Samples  Command  Shared Object  Symbol
    # ........  .......  .......  .............  ......
    #
        65.96%  1635392  pg_dump  pg_dump  [.] findLoop
                |
                --- findLoop
                   |
                   |--69.66%-- findDependencyLoops
                   |           sortDumpableObjects
                   |           main
                   |           __libc_start_main
                   |
                   |--30.34%-- findLoop
                   |           findDependencyLoops
                   |           sortDumpableObjects
                   |           main
                   |           __libc_start_main
                    --0.00%-- [...]

        10.16%   251955  pg_dump  pg_dump  [.] getTables
                |
                --- getTables
                    getSchemaData
                    main
                    __libc_start_main

The file also shows that findObjectByDumpId() accounts for only a negligible percentage of runtime, which means that the recursion path in findLoop() is being executed hardly at all. After staring at that for awhile I realized what the O(N^2) part is: the initial "is the object in the workspace?" test. The actual dependency chains are probably never very long, but as we run through the collection of objects we gradually add all of them to the front of the workspace. So this is dumb; we should manage the "is the object already processed" component of that with an O(1) check, like a bool array or some such, rather than an O(N) search loop.

As for the getTables slowdown, the only part of that I can see that looks to be both significant and entirely contained in getTables() itself is the nested loop near the end that's trying to copy the dumpable flags for owned sequences from their owning tables. Do you have a whole lot of owned sequences? Maybe we could postpone that work until we have the fast table lookup array constructed, which should reduce this from O(M*N) to O(M*logN).
regards, tom lane
Re: [HACKERS] [COMMITTERS] pgsql: Add PGDLLIMPORT to ScanKeywords and NumScanKeywords.
On 03/31/2012 11:59 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 03/31/2012 10:56 AM, Tom Lane wrote: Add PGDLLIMPORT to ScanKeywords and NumScanKeywords. Per buildfarm, this is now needed by contrib/pg_stat_statements. It seems to have broken mingw earlier now :-( Ugh. It looks like ecpg (and also pg_dump) are defining symbols named ScanKeywords and NumScanKeywords, but relying on the backend's keywords.h to provide global declarations for those. And this doesn't work once we PGDLLIMPORT-decorate those declarations. The only simple fix I can see is to rename the symbols in ecpg and pg_dump to something else. This is probably a good thing anyway to reduce confusion. Anybody have another idea? Seems the sane thing to do. cheers andrew
Re: [HACKERS] [COMMITTERS] pgsql: Add PGDLLIMPORT to ScanKeywords and NumScanKeywords.
Andrew Dunstan and...@dunslane.net writes: On 03/31/2012 11:59 AM, Tom Lane wrote: The only simple fix I can see is to rename the symbols in ecpg and pg_dump to something else. This is probably a good thing anyway to reduce confusion. Anybody have another idea? Seems the sane thing to do. Done, we'll see how the buildfarm likes this version ... regards, tom lane
Re: [HACKERS] [GENERAL] pg_dump incredibly slow dumping a single schema from a large db
I wrote: So this is dumb; we should manage the "is the object already processed" component of that with an O(1) check, like a bool array or some such, rather than an O(N) search loop. As for the getTables slowdown, the only part of that I can see that looks to be both significant and entirely contained in getTables() itself is the nested loop near the end that's trying to copy the dumpable flags for owned sequences from their owning tables. Do you have a whole lot of owned sequences? Maybe we could postpone that work until we have the fast table lookup array constructed, which should reduce this from O(M*N) to O(M*logN). I've committed fixes for both these issues. If you are in a position to test with 9.1 branch tip from git, it'd be nice to have confirmation that these patches actually cure your problem. For both of them, the issue seems to only show up in a subset of cases, which may explain why we'd not identified the problem before. regards, tom lane
[HACKERS] Re: [GENERAL] pg_dump incredibly slow dumping a single schema from a large db
I've committed fixes for both these issues. If you are in a position to test with 9.1 branch tip from git, it'd be nice to have confirmation that these patches actually cure your problem. For both of them, the issue seems to only show up in a subset of cases, which may explain why we'd not identified the problem before. regards, tom lane

Cool, I've grabbed your changes and they seem to have completely cured the issue. Reading user-defined tables has gone down to 9 seconds from 2.5 minutes; reading dependency data has gone down to 20 seconds from 5.5 minutes. Thanks so much Tom, this is perfect. I'm just pulling another backup using the stock 9.1.1 pg_dump to ensure the backups are equivalent. Any idea when 9.1.4 with this change will be out so we can pull the cluster up? With regards to your previous question about sequences, there are 61K in the DB; looks like our schema currently has about 115 sequences per tenant. --Mike
[HACKERS] Re: [GENERAL] pg_dump incredibly slow dumping a single schema from a large db
I'm just pulling another backup using the stock 9.1.1 pg_dump to ensure the backups are equivalent. Schema data are identical between the 2 backups. The new backup passes all our tests for validating a tenant. Thank you again for the quick response! --Mike
Re: [HACKERS] measuring lwlock-related latency spikes
On Sat, Mar 31, 2012 at 8:58 AM, Greg Stark st...@mit.edu wrote: On Sat, Mar 31, 2012 at 4:41 AM, Robert Haas robertmh...@gmail.com wrote: But I didn't think we were ever supposed to hold content locks for that long. Isn't that lock held while doing visibility checks?

Nope. heap_update() and friends do a very complicated little dance to avoid that. Heikki articulated that rule when he installed the visibility map in 8.4, and I had to work pretty hard to preserve it in 9.2 when I did the work to make the visibility map crash-safe, but now I'm glad I did.

What about I/O waiting for a clog page to be read?

I'm pretty sure that can happen, because TransactionIdIsCommitted() can get called from HeapTupleSatisfies*(), which pretty much only gets called while holding the page lock. I don't know whether it's the cause of these particular stalls, but it's plausible if the CLOG cache is getting thrashed hard enough.

I did discover one systematic error in my testing methodology: I only instrumented LWLockAcquire(), not LWLockAcquireOrWait(). The latter turns out to be an important case in this instance since we use that when flushing WAL. Just running the regression tests on my laptop suggests that with that oversight corrected, WALWriteLock is going to pop out as a huge source of latency spikes. But I will know for sure after I do a more formal test run. I also modified the code to print out debugging output every time we have to wait for a long time (I think I've got it set to 10ms right now, but I might raise that if it's too verbose in a real test run) with the file and line number attempting the lock acquisition that blocked, and the file and line number that had most recently acquired the lock at the time we first discovered we needed to wait.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Publish checkpoint timing and sync files summary data to pg_stat_bgwriter
On 28 March 2012 15:23, Robert Haas robertmh...@gmail.com wrote: At any rate, I strongly agree that counting the number of strategy allocations is not really a viable proxy for counting the number of backend writes. You can't know how many of those actually got dirtied. Sure. Since any backend write is necessarily the result of that backend trying to allocate a buffer, I think maybe we should just count whether the number of times it was trying to allocate a buffer *using a BAS* vs. the number of times it was trying to allocate a buffer *not using a BAS*. That is, decide whether or not it's a strategy write not based on whether the buffer came in via a strategy allocation, but rather based on whether it's going out as a result of a strategy allocation. I'm not quite sure I understand what you mean here. Are you suggesting that I produce a revision that bumps a counter beside FlushBuffer() in BufferAlloc(), as a dirty page is evicted/written, while breaking the figure out into != BAS_NORMAL and == BAS_NORMAL figures? Would both figures be presented as separate columns within pg_stat_bgwriter? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Re: [GENERAL] pg_dump incredibly slow dumping a single schema from a large db
Mike Roest mike.ro...@replicon.com writes: Any idea when 9.1.4 with this change will be out so we can pull the cluster up. Well, we just did some releases last month, so unless somebody finds a really nasty security or data-loss issue, I'd think it will be a couple of months before the next set. regards, tom lane
Re: [HACKERS] measuring lwlock-related latency spikes
On Sat, Mar 31, 2012 at 10:14 PM, Robert Haas robertmh...@gmail.com wrote: Isn't that lock held while doing visibility checks? Nope. heap_update() and friends do a very complicated little dance to avoid that. ... What about I/O waiting for a clog page to be read? I'm pretty sure that can happen I'm confused because I thought these two sentences were part of describing the same case. because TransactionIdIsCommitted() can get called from HeapTupleSatisfies*() which pretty much only gets called while holding the page lock. I don't know whether it's the cause of these particular stalls, but it's plausible if the CLOG cache is getting thrashed hard enough. I wonder if it would make sense to, if we come across an xid that isn't in the slru, release the lock while we read in the clog page. When we reobtain it we can check if the LSN has changed, and if it has, restart the visibility checks. If it hasn't, pick up where we left off. -- greg
[HACKERS] new group commit behavior not helping?
Hoping to demonstrate the wonders of our new group commit code, I ran some benchmarks on the IBM POWER7 machine with synchronous_commit = on. But, it didn't come out much better than 9.1. pgbench, scale factor 300, median of 3 30-minute test runs, # clients = # threads, shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = on, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms. By number of clients:

master:
01 tps = 118.968446 (including connections establishing)
02 tps = 120.666865 (including connections establishing)
04 tps = 209.624437 (including connections establishing)
08 tps = 377.387029 (including connections establishing)
16 tps = 695.172899 (including connections establishing)
32 tps = 1318.468375 (including connections establishing)

REL9_1_STABLE:
01 tps = 117.037056 (including connections establishing)
02 tps = 119.393871 (including connections establishing)
04 tps = 205.958750 (including connections establishing)
08 tps = 365.464735 (including connections establishing)
16 tps = 673.379394 (including connections establishing)
32 tps = 1101.324865 (including connections establishing)

Is this expected behavior? Is this not the case where it's supposed to help? I thought Peter G. posted results showing a huge improvement on this kind of workload, and I thought Heikki reproduced them on a different server, so I'm confused why I can't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] new group commit behavior not helping?
On 1 April 2012 01:10, Robert Haas robertmh...@gmail.com wrote: Hoping to demonstrate the wonders of our new group commit code, I ran some benchmarks on the IBM POWER7 machine with synchronous_commit = on. But, it didn't come out much better than 9.1. pgbench, scale factor 300, median of 3 30-minute test runs, # clients = # threads, shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = on, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms.

Why the low value for wal_writer_delay?

master:
01 tps = 118.968446 (including connections establishing)
02 tps = 120.666865 (including connections establishing)
04 tps = 209.624437 (including connections establishing)
08 tps = 377.387029 (including connections establishing)
16 tps = 695.172899 (including connections establishing)
32 tps = 1318.468375 (including connections establishing)

REL9_1_STABLE:
01 tps = 117.037056 (including connections establishing)
02 tps = 119.393871 (including connections establishing)
04 tps = 205.958750 (including connections establishing)
08 tps = 365.464735 (including connections establishing)
16 tps = 673.379394 (including connections establishing)
32 tps = 1101.324865 (including connections establishing)

(presumably s/tps/clients/ was intended here)

Is this expected behavior? Is this not the case where it's supposed to help? I thought Peter G. posted results showing a huge improvement on this kind of workload, and I thought Heikki reproduced them on a different server, so I'm confused why I can't.

The exact benchmark that I ran was the update.sql pgbench-tools benchmark, on my laptop. The idea was to produce a sympathetic benchmark with a workload that was maximally commit-bound. Heikki reproduced similar numbers on his laptop, iirc. Presumably the default TPC-B-like transaction test has been used here. You didn't mention what kind of disks this server has - I'm not sure if that information is available elsewhere.
That could be highly pertinent. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] new group commit behavior not helping?
On Sat, Mar 31, 2012 at 8:31 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: Why the low value for wal_writer_delay?

A while back I was getting a benefit from cranking that down. I could try leaving it out and see if it matters.

master:
01 tps = 118.968446 (including connections establishing)
02 tps = 120.666865 (including connections establishing)
04 tps = 209.624437 (including connections establishing)
08 tps = 377.387029 (including connections establishing)
16 tps = 695.172899 (including connections establishing)
32 tps = 1318.468375 (including connections establishing)

REL9_1_STABLE:
01 tps = 117.037056 (including connections establishing)
02 tps = 119.393871 (including connections establishing)
04 tps = 205.958750 (including connections establishing)
08 tps = 365.464735 (including connections establishing)
16 tps = 673.379394 (including connections establishing)
32 tps = 1101.324865 (including connections establishing)

(presumably s/tps/clients/ was intended here)

The number at the beginning of each line is the number of clients. Everything after the first space is the output of pgbench for the median of three runs with that number of clients.

Is this expected behavior? Is this not the case where it's supposed to help? I thought Peter G. posted results showing a huge improvement on this kind of workload, and I thought Heikki reproduced them on a different server, so I'm confused why I can't. The exact benchmark that I ran was the update.sql pgbench-tools benchmark, on my laptop. The idea was to produce a sympathetic benchmark with a workload that was maximally commit-bound. Heikki reproduced similar numbers on his laptop, iirc. Presumably the default TPC-B-like transaction test has been used here.

Yes.

You didn't mention what kind of disks this server has - I'm not sure if that information is available elsewhere. That could be highly pertinent.
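The "median of three runs" reduction that both sets of numbers rely on is simple to reproduce. A minimal sketch in Python (the tps figures for the extra runs are hypothetical; only the medians appear in the quoted results):

```python
import statistics

def median_tps(runs_by_clients):
    """For each client count, reduce repeated pgbench runs to the median
    tps, mirroring how the quoted numbers were produced (median of 3)."""
    return {clients: statistics.median(tps_list)
            for clients, tps_list in runs_by_clients.items()}

# Hypothetical raw tps from three repeated 30-minute runs per client count.
raw = {1: [118.2, 118.968446, 119.5],
       32: [1290.0, 1318.468375, 1325.1]}
print(median_tps(raw))
```

Using the median rather than the mean keeps one outlier run (say, one that straddled a checkpoint) from skewing the reported figure.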
I'm a little fuzzy on that, but I think it's a collection of 600GB 10K RPM SAS SFF disk drives with LVM sitting on top. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] measuring lwlock-related latency spikes
On Sat, Mar 31, 2012 at 6:01 PM, Greg Stark st...@mit.edu wrote: On Sat, Mar 31, 2012 at 10:14 PM, Robert Haas robertmh...@gmail.com wrote: Isn't that lock held while doing visibility checks? Nope. heap_update() and friends do a very complicated little dance to avoid that. ... What about I/O waiting for a clog page to be read? I'm pretty sure that can happen

I'm confused because I thought these two sentences were part of describing the same case.

Oh, I thought you were talking about the visibility *map*. Sorry.

because TransactionIdIsCommitted() can get called from HeapTupleSatisfies*() which pretty much only gets called while holding the page lock.

I don't know whether it's the cause of these particular stalls, but it's plausible if the CLOG cache is getting thrashed hard enough.

I wonder if it would make sense to, if we come across an xid that isn't in the slru, release the lock while we read in the clog page. When we reobtain it we can check if the LSN has changed and if it has, restart the visibility checks. If it hasn't, pick up where we left off.

I've discovered a bug in my code that was causing it to print at most 2 histogram buckets per lwlock, which obviously means that my previous results were totally inaccurate. Ah, the joys of benchmarking. I found the problem when I added code to log a message any time an lwlock wait exceeded a certain time threshold, and it fired far more often than the previous results would have indicated. In particular, it turns out that long waits for WALInsertLock are extremely common and not, as the previous results appeared to indicate, unheard-of. I'm rerunning my tests now and will post the updated, hopefully-accurate results when that's done. /me attempts to remove egg from face. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
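The cross-check that exposed the histogram bug (log any lock wait above a threshold, independently of the histogram code) can be mimicked in a few lines. This is a hedged Python sketch, not the actual C instrumentation of LWLockAcquire; the lock, threshold, and log structure here are all illustrative:

```python
import threading
import time

def timed_acquire(lock, threshold_s, log):
    """Acquire `lock`; if the wait exceeded `threshold_s`, record it.
    A simple independent check like this fires whenever the more
    elaborate histogram code should have counted a long wait."""
    start = time.monotonic()
    lock.acquire()
    waited = time.monotonic() - start
    if waited >= threshold_s:
        log.append(waited)
    return waited

log = []
lock = threading.Lock()
lock.acquire()                                 # simulate another holder
threading.Timer(0.2, lock.release).start()     # released after ~0.2s
timed_acquire(lock, threshold_s=0.05, log=log)
```

If the threshold logger fires more often than the histogram reports long waits, the histogram code is the suspect, which is exactly the reasoning used above.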
Re: [HACKERS] Command Triggers patch v18
On Fri, Mar 30, 2012 at 11:19 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: Yeah context is not explicit, we could call that toplevel: the command tag of the command that the user typed. When toplevel is null, the event trigger is fired on a command the user sent, when it's not null, the trigger is fired on some inner command operation.

How about calling it command tag? I think both context and toplevel are inconsistent with other uses of those terms: context is an error-reporting field, among other things; and we don't care about the toplevel command, just the command-tag of the one we're executing - e.g. if DROP fires a command trigger which invokes CREATE which fires another command trigger, the inner one is going to get CREATE not DROP. Or at least so I presume.

I think for a drop trigger I would want the function to receive this information: type of object dropped, OID of object dropped, column number in the case of a column drop, flag indicating whether it's a toplevel drop or a cascaded drop. I wouldn't object to also making the currently-in-context toplevel command tag available, but I think most drop triggers wouldn't really care, so I wouldn't personally spend much implementation effort on it if it turns out to be hard.

I'm not sure it would be hard as I'm only seeing a single depth possible here, so a single per-backend string static variable would do.

See above example: I am pretty sure you need a stack.

But in general, I don't really know what a proper subcommand is or why some subcommands should be more proper than others, or why we should even be concerned about whether something is a subcommand at all. I think it's fine and useful to have triggers that fire in those kinds of places, but I don't see why we should limit ourselves to that. For applications like replication, auditing, and enhanced security, the parse tree and subcommand/non-subcommand status of a particular operation are irrelevant. What you need is an exact

Not really.
When replicating you could perfectly say that you only replicate the toplevel DROP because the replica will also do the cascade dance and you might have decided not to replicate all related objects on the other side.

I would not want my replication system issuing cascaded drops, because if the sides don't match it might cascade to something on the remote side that it doesn't cascade to on the local side, which exceeds my tolerance for scary behavior.

The information you really want not to miss is when only the cascaded object is part of the replication, not the main one. That was not covered by my previous patch but now we have a way to cover it.

Also true.

description of the operation that got performed (e.g. the default on table X column Y got dropped); you might be able to reverse-engineer that from the parse tree, but it's much better to have the system pass you the information you need more directly. Certainly, there are cases where you might want to have the parse tree, or even the raw command text, available, but I'm not even convinced that those cases will be the most commonly used ones unless, of course, they're the only ones we offer, in which case everyone will go down that path by necessity.

There are far too many variants and cases of our commands to be able to extract their parameters in a flat way (a bunch of variables compared to a nested description a la json or xml), and I don't think such a flat representation is going to be much better than the parse tree.

I strongly disagree. I think we'll find that with the right choice of hook points, the number of variables that need to be exposed is quite compact. Indeed, I'd venture to say that needing to pass lots and lots of information is evidence that you've made a poor choice of hook point.
Now, we will later be able to offer a normalized rewritten command string from the parse tree to the user, but I don't see us adding support for that from cascaded drops, one other reason why I like to expose them as sub commands.

Again, I'm not understanding the distinction between toplevel events and sub-events. I don't see any need for such a distinction. I think there are just events, and some of them happen at command start/end and others happen somewhere in the middle. As long as it's a safe and useful place to fire a trigger, who cares?

I guess your head is too heavily in the code side of things as opposed to the SQL user view point. Maybe my attempt to conciliate both views is not appropriate, but I really do think it is.

Given the scope of this mini expression language, we can easily bypass calling the executor in v1 here, and reconsider later if we want to allow calling a UDF in the WHEN clause…

I don't think it's an easy feature to add in, though. Or a necessary one. AFAICS, the main benefit of WHEN clauses on

Exactly.

regular triggers is that you can prevent the AFTER trigger queue from
Re: [HACKERS] measuring lwlock-related latency spikes
On Sat, Mar 31, 2012 at 9:29 PM, Robert Haas robertmh...@gmail.com wrote: I've discovered a bug in my code that was causing it to print at most 2 histogram buckets per lwlock, which obviously means that my previous results were totally inaccurate. Ah, the joys of benchmarking. I found the problem when I added code to log a message any time an lwlock wait exceeded a certain time threshold, and it fired far more often than the previous results would have indicated. In particular, it turns out that long waits for WALInsertLock are extremely common and not, as the previous results appeared to indicate, unheard-of. I'm rerunning my tests now and will post the updated, hopefully-accurate results when that's done. /me attempts to remove egg from face.

All right, so with the aforementioned bug fixed (see attached, revised patch), there are now massive latency spikes popping out all over the place: on my latest run, there were 377 distinct lwlocks that took >= 1s to acquire on at least one occasion during this 30-minute run. Some of those locks, of course, had more than one problem event. In total, somebody waited >= 1 s for a lock 4897 times during this test. These break down as follows. Note that the "blocked by" is the person who had most recently acquired the lock as of the start of the wait, and is not necessarily solely responsible for the full duration of the wait due to shared locks and queue jumping.
1 waited at heapam.c:1651 blocked by bufmgr.c:2475
1 waited at heapam.c:2844 blocked by heapam.c:2758
1 waited at hio.c:335 blocked by heapam.c:1651
1 waited at hio.c:336 blocked by hio.c:336
1 waited at indexam.c:521 blocked by hio.c:345
1 waited at xlog.c:2090 blocked by xlog.c:2090
2 waited at bufmgr.c:1671 blocked by bufmgr.c:2475
2 waited at indexam.c:521 blocked by heapam.c:3464
2 waited at nbtpage.c:650 blocked by nbtinsert.c:124
2 waited at xlog.c:1502 blocked by xlog.c:2090
2 waited at xlog.c:2241 blocked by xlog.c:1502
3 waited at slru.c:310 blocked by slru.c:404
4 waited at indexam.c:521 blocked by hio.c:335
4 waited at indexam.c:521 blocked by hio.c:336
4 waited at xlog.c:2241 blocked by xlog.c:2090
6 waited at hio.c:336 blocked by heapam.c:2758
12 waited at indexam.c:521 blocked by bufmgr.c:2475
20 waited at xlog.c:2090 blocked by xlog.c:2241
26 waited at heapam.c:2758 blocked by indexam.c:521
29 waited at heapam.c:2758 blocked by heapam.c:2758
80 waited at xlog.c:1502 blocked by xlog.c:2241
89 waited at indexam.c:521 blocked by heapam.c:2758
115 waited at varsup.c:65 blocked by varsup.c:65
1540 waited at slru.c:310 blocked by slru.c:526
2948 waited at xlog.c:909 blocked by xlog.c:909

xlog.c:909 is the LWLockAcquire of WALInsertLock from within XLogInsert. slru.c:310 is in SimpleLruWaitIO(), where we attempt to grab the SLRU buffer lock in shared mode after releasing the control lock. slru.c:526 is in SlruInternalWritePage(), where we hold the buffer lock while writing the page. This is commit 194b5ea3d0722f94e8a6ba9cec03b858cc8c9370, if you want to look up any of the other line numbers.

If I filter for waits greater than 8s, a somewhat different picture emerges:

2 waited at indexam.c:521 blocked by bufmgr.c:2475
212 waited at slru.c:310 blocked by slru.c:526

In other words, some of the waits for SLRU pages to be written are... really long. There were 126 that exceeded 10 seconds and 56 that exceeded 12 seconds. Painful is putting it mildly.
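The breakdown above is just a grouped count of long waits keyed by (waiter location, blocker location), sorted by count. A minimal sketch of that reduction, with made-up input records since the patch's actual trace format isn't shown here:

```python
from collections import Counter

def summarize(waits, threshold_s=1.0):
    """Count waits at or above the threshold, grouped by
    (wait site, blocked-by site), sorted ascending by count
    as in the quoted report."""
    counts = Counter((w, b) for w, b, dur in waits if dur >= threshold_s)
    return sorted((n, w, b) for (w, b), n in counts.items())

waits = [  # (waiter site, last holder site, seconds) -- hypothetical records
    ("xlog.c:909", "xlog.c:909", 2.3),
    ("xlog.c:909", "xlog.c:909", 1.1),
    ("slru.c:310", "slru.c:526", 12.7),
    ("indexam.c:521", "hio.c:335", 0.4),   # below threshold, dropped
]
for n, w, b in summarize(waits):
    print(f"{n} waited at {w} blocked by {b}")
```

Raising the threshold (the "greater than 8s" filter above) is just a different `threshold_s` over the same records.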
I suppose one interesting question is to figure out if there's a way I can optimize the disk configuration in this machine, or the Linux I/O scheduler, or something, so as to reduce the amount of time it spends waiting for the disk. But the other thing is why we're waiting for SLRU page writes to begin with. My guess based on previous testing is that what's happening here is (1) we examine a tuple on an old page and decide we must look up its XID, (2) the relevant CLOG page isn't in cache so we decide to read it, but (3) the page we decide to evict happens to be dirty, so we have to write it first. That sure seems like something that a smart background writer ought to be able to fix for us.

Simon previously posted a patch for that: http://archives.postgresql.org/pgsql-hackers/2012-01/msg00571.php

...but the testing I did at the time didn't seem to show a real clear benefit: http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

The obvious question here is: was that a problem with the patch, or a problem with my testing methodology, or is it just that the performance characteristics of the machine I used for that test (Nate Boley's 32-core AMD box) were different from this one (IBM POWER7)? I don't know, and I think I'm out of time to play with this for this weekend, but I'll investigate further when time permits. Other thoughts welcome. -- Robert Haas EnterpriseDB:
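The suspected stall pattern, a CLOG miss that must evict a dirty SLRU page and pay a synchronous write before the read can even start, can be modeled with a toy buffer cache. This is a hedged sketch only: the real slru.c uses a clock-like cheapest-victim scan with per-buffer locks, not the plain LRU used here.

```python
from collections import OrderedDict

class ToySlru:
    """LRU cache of page -> dirty flag. A miss that evicts a dirty page
    must 'write' it first; that write is where the foreground backend
    stalls. A background writer that cleans pages ahead of time would
    make evictions free."""
    def __init__(self, nbuffers):
        self.buf = OrderedDict()      # page -> dirty flag, LRU order
        self.nbuffers = nbuffers
        self.stall_writes = 0         # synchronous writes taken on miss

    def access(self, page, dirty=False):
        if page in self.buf:
            self.buf.move_to_end(page)
            self.buf[page] = self.buf[page] or dirty
            return
        if len(self.buf) >= self.nbuffers:
            victim, victim_dirty = self.buf.popitem(last=False)
            if victim_dirty:
                self.stall_writes += 1   # must write before we can read
        self.buf[page] = dirty

slru = ToySlru(nbuffers=2)
slru.access(1, dirty=True)   # e.g. a recently-updated CLOG page
slru.access(2)
slru.access(3)               # miss: evicts dirty page 1, one stall
```

In this model, a background cleaner is equivalent to flipping dirty flags to clean before the foreground backend's miss arrives, which is the effect the referenced patch aims for.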
[HACKERS] new group commit behavior not helping?
On Saturday, March 31, 2012, Robert Haas robertmh...@gmail.com wrote: Hoping to demonstrate the wonders of our new group commit code, I ran some benchmarks on the IBM POWER7 machine with synchronous_commit = on. But, it didn't come out much better than 9.1.

Where I would expect (and have seen) much improvement is where #clients > #CPU. Or cores, whatever the term of art is. Of course I've mostly seen this where CPU=1.

It looks like in your case tps was still scaling with clients when you gave up, so clients was probably too small.

Jeff
Re: [HACKERS] new group commit behavior not helping?
On Sat, Mar 31, 2012 at 8:31 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: The exact benchmark that I ran was the update.sql pgbench-tools benchmark, on my laptop. The idea was to produce a sympathetic benchmark with a workload that was maximally commit-bound. Heikki reproduced similar numbers on his laptop, iirc. Presumably the default TPC-B-like transaction test has been used here.

OK, I ran pgbench-tools with your configuration file. Graphs attached. Configuration otherwise as in my standard pgbench runs, except max_connections=1000 to accommodate the needs of the test. I now see the roughly order-of-magnitude increase you measured earlier.

There seem to be two relevant differences between your test and mine: (1) your test is just a single insert per transaction, whereas mine is pgbench's usual update, select, update, update, insert and (2) it seems that, to really see the benefit of this patch, you need to pound the server with a very large number of clients. On this test, 250 clients was the sweet spot.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

attachment: commit-scaling-power-mini.png
attachment: commit-scaling-power.png
Re: [HACKERS] new group commit behavior not helping?
On Sun, Apr 1, 2012 at 1:40 AM, Jeff Janes jeff.ja...@gmail.com wrote: On Saturday, March 31, 2012, Robert Haas robertmh...@gmail.com wrote: Hoping to demonstrate the wonders of our new group commit code, I ran some benchmarks on the IBM POWER7 machine with synchronous_commit = on. But, it didn't come out much better than 9.1. Where I would expect (and have seen) much improvement is where #clients > #CPU. Or cores, whatever the term of art is.

It seems you are right; see the email I just sent.

Of course I've mostly seen this where CPU=1 It looks like in your case tps was still scaling with clients when you gave up, so clients was probably too small.

What is kind of weird is that it actually seems to scale at almost exactly half of linear. Clients/tps on 9.2, with the pgbench-tools test Peter recommended:

1 140
2 143
4 289
8 585
16 1157
32 2317
50 3377
150 9511
250 12721
350 12582
450 11370
700 6972

You'll notice that at 2 clients we get basically no improvement. But 4 gets twice the single-client throughput; 8 gets about four times the single-client throughput; 16 gets about eight times the single-client throughput; 32 gets about sixteen times the single-client throughput; and 50 gets nearly 25 times the single-client throughput.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
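The "half of linear" observation can be checked directly: divide each tps figure by (clients x single-client tps). A quick sketch over the numbers above:

```python
# Clients -> tps figures quoted above (pgbench-tools test on 9.2).
tps = {1: 140, 2: 143, 4: 289, 8: 585, 16: 1157, 32: 2317,
       50: 3377, 150: 9511, 250: 12721, 350: 12582, 450: 11370, 700: 6972}

def efficiency(tps_by_clients):
    """Fraction of perfectly linear scaling achieved at each client count,
    relative to the single-client throughput."""
    base = tps_by_clients[1]
    return {c: t / (c * base) for c, t in tps_by_clients.items()}

for c, e in efficiency(tps).items():
    print(f"{c:4d} clients: {e:.2f} of linear")
```

From 4 up through 50 clients the ratio hovers right around 0.5, which is the half-of-linear pattern; past 250 clients it falls off as throughput plateaus and then degrades.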