Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 3, 2012 at 4:40 AM, Noah Misch n...@leadboat.com wrote: On Tue, Jan 03, 2012 at 01:18:41AM +, Simon Riggs wrote: Just for the record, yes we do run multiple catalog scans in some parts of the code. So I can see how we might trigger 4 nested scans, using cache replacement while scanning, so best assume more, with no guarantee of them being neatly stacked for pop/push type access. Yeah, I wouldn't want to commit to a nesting limit. Agreed However, I _would_ have expected that a stack would suffice; PushActiveSnapshot()/PopActiveSnapshot() is adequate for a great deal of the backend, after all. In what sort of situation do catalog scans not strictly nest? It's possible. Making assumptions about what is possible bit me before, as you showed. I've seen code where we are explicitly running two concurrent scans. I don't want to put unnecessary and subtle assumptions into catalog scan code so I'm working on the assumption that endscan may not be called in strict FIFO order. The difference is fairly minor and doesn't restrict us in other ways. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] review: CHECK FUNCTION statement
Pavel Stehule wrote: here is new version of CHECK FUNCTION patch I won't be able to review that one because I'll be in California from Jan 6 to Jan 29. Yours, Laurenz Albe
[HACKERS] Add SPI results constants available for PL/*
I'm using PL/Python, and when getting the result object from a plpy.execute(), I can access result.status(). E.g.: the result.status() is 4. But to know what 4 corresponds to, I must open the spi.h file from the sources to see:

#define SPI_OK_CONNECT 1
#define SPI_OK_FINISH 2
#define SPI_OK_FETCH 3
#define SPI_OK_UTILITY 4
#define SPI_OK_SELECT 5
#define SPI_OK_SELINTO 6
#define SPI_OK_INSERT 7
#define SPI_OK_DELETE 8
#define SPI_OK_UPDATE 9
#define SPI_OK_CURSOR 10
#define SPI_OK_INSERT_RETURNING 11
#define SPI_OK_DELETE_RETURNING 12
#define SPI_OK_UPDATE_RETURNING 13
#define SPI_OK_REWRITTEN 14

Is there a way to have access to these constants from the PL/Python code and other PL/* (Tcl, Perl, etc.)? Thanks, -- Samuel PHAN
Re: [HACKERS] Review of VS 2010 support patches
On 12/31/2011 06:10 PM, Brar Piening wrote: Brar Piening wrote: Andrew Dunstan wrote: Can you narrow down exactly what in that commit broke VS 2010? Are there any compiler warnings? I was able to nail down the problem. In the absence of reaction, to keep my promise, I'm sending the attached patch which restores the previous working behaviour for Visual Studio 2010. Note however that it also restores the previous conflicts with errno.h which aren't necessarily a problem, but might be in future. OK, committed with a minor change to remove another compiler warning. cheers andrew
Re: [HACKERS] Add SPI results constants available for PL/*
On Tue, Jan 3, 2012 at 5:55 AM, Samuel PHAN sam...@nomao.com wrote: I'm using PL/Python, and when getting the result object from a plpy.execute(), I can access to the result.status(). E.g.: the result.status() is 4. But to know what 4 corresponds to, I must open the spi.h file from the sources to see : #define SPI_OK_CONNECT 1 #define SPI_OK_FINISH 2 #define SPI_OK_FETCH 3 #define SPI_OK_UTILITY 4 #define SPI_OK_SELECT 5 #define SPI_OK_SELINTO 6 #define SPI_OK_INSERT 7 #define SPI_OK_DELETE 8 #define SPI_OK_UPDATE 9 #define SPI_OK_CURSOR 10 #define SPI_OK_INSERT_RETURNING 11 #define SPI_OK_DELETE_RETURNING 12 #define SPI_OK_UPDATE_RETURNING 13 #define SPI_OK_REWRITTEN 14 Is there a way to have access to these constants from the PL/python code and other PL/* (Tcl, Perl, etc.) ? I'd suppose it would be interesting to add a table to pg_catalog containing this data. That would be one of the easier ways to allow all languages to have access to the constants. It might be a SMOP (Simple Matter Of Programming) to write a script to transform the .h file into a suitable INSERT statement for such a table. I wonder if there are other sets of constants worth having. I'd think that the various forms of command OK codes would also be interesting to have as a table like this. -- When confronted by a difficult problem, solve it by reducing it to the question, "How would the Lone Ranger handle this?"
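The transformation Christopher describes really is a SMOP; a rough sketch in Python follows. Note the target table name `pg_spi_status` is hypothetical, chosen here only for illustration:

```python
import re

# Match lines like "#define SPI_OK_UTILITY 4" from spi.h.
DEFINE_RE = re.compile(r'#define\s+(SPI_(?:OK|ERROR)_\w+)\s+(-?\d+)')

def defines_to_inserts(header_text, table="pg_spi_status"):
    """Turn SPI status #defines into INSERT statements for a catalog table.

    `table` is a hypothetical catalog name; no such table exists today.
    """
    inserts = []
    for name, value in DEFINE_RE.findall(header_text):
        inserts.append(
            "INSERT INTO %s (code, name) VALUES (%s, '%s');" % (table, value, name)
        )
    return inserts

if __name__ == "__main__":
    sample = """
    #define SPI_OK_CONNECT 1
    #define SPI_OK_FINISH  2
    #define SPI_OK_UTILITY 4
    """
    for stmt in defines_to_inserts(sample):
        print(stmt)
```

Running the real spi.h through this would produce one row per status code, ready to load at initdb time.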
Re: [HACKERS] Add SPI results constants available for PL/*
2012/1/3 Christopher Browne cbbro...@gmail.com: On Tue, Jan 3, 2012 at 5:55 AM, Samuel PHAN sam...@nomao.com wrote: I'm using PL/Python, and when getting the result object from a plpy.execute(), I can access to the result.status(). E.g.: the result.status() is 4. But to know what 4 corresponds to, I must open the spi.h file from the sources to see : #define SPI_OK_CONNECT 1 #define SPI_OK_FINISH 2 #define SPI_OK_FETCH 3 #define SPI_OK_UTILITY 4 #define SPI_OK_SELECT 5 #define SPI_OK_SELINTO 6 #define SPI_OK_INSERT 7 #define SPI_OK_DELETE 8 #define SPI_OK_UPDATE 9 #define SPI_OK_CURSOR 10 #define SPI_OK_INSERT_RETURNING 11 #define SPI_OK_DELETE_RETURNING 12 #define SPI_OK_UPDATE_RETURNING 13 #define SPI_OK_REWRITTEN 14 Is there a way to have access to these constants from the PL/python code and other PL/* (Tcl, Perl, etc.) ? I'd suppose it interesting to add a table to pg_catalog containing this data. That would be one of the easier ways to allow all languages to have access to the constants. - It is useless overhead, I expect; defining some constants in Perl or Python is simple. Regards Pavel It might be a SMOP (Simple Matter Of Programming) to write a script to transform the .h file into a suitable INSERT statement for such a table. I wonder if there are other sets of constants worth having. I'd think that the various forms of command OK codes would also be interesting to have as a table like this. -- When confronted by a difficult problem, solve it by reducing it to the question, How would the Lone Ranger handle this?
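Pavel's point that defining the constants on the PL side is simple can be seen directly; a user can already mirror the spi.h values quoted above in a small helper module (values copied from this thread — re-check them against your server's headers):

```python
# SPI return codes mirrored from src/include/executor/spi.h, as quoted
# earlier in this thread. These values are maintained by hand here; if
# the server's header changes, this copy must change too.
SPI_CODES = {
    "SPI_OK_CONNECT": 1,
    "SPI_OK_FINISH": 2,
    "SPI_OK_FETCH": 3,
    "SPI_OK_UTILITY": 4,
    "SPI_OK_SELECT": 5,
    "SPI_OK_SELINTO": 6,
    "SPI_OK_INSERT": 7,
    "SPI_OK_DELETE": 8,
    "SPI_OK_UPDATE": 9,
    "SPI_OK_CURSOR": 10,
    "SPI_OK_INSERT_RETURNING": 11,
    "SPI_OK_DELETE_RETURNING": 12,
    "SPI_OK_UPDATE_RETURNING": 13,
    "SPI_OK_REWRITTEN": 14,
}

# Reverse map: numeric status -> symbolic name.
SPI_STATUS_NAMES = {code: name for name, code in SPI_CODES.items()}

def status_name(code):
    """Human-readable name for a plpy.execute() result status code."""
    return SPI_STATUS_NAMES.get(code, "unknown (%d)" % code)
```

With this, Samuel's `result.status()` of 4 resolves to `SPI_OK_UTILITY` without opening the sources — though having the languages expose the constants themselves would of course be cleaner.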
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Mon, Jan 2, 2012 at 2:53 PM, Simon Riggs si...@2ndquadrant.com wrote: Get rid of the freelist? Once shared buffers are full, it's just about useless anyway. But you'd need to think about the test cases that you pay attention to, as there might be scenarios where it remains useful. Agree freelist is mostly useless, but startup and dropping objects requires it. Not really. If someone drops an object, we must invalidate all the buffers immediately, but adding them to the free list is just an optimization to make sure we reuse the invalidated buffers in preference to evicting buffers that still have valid contents. But I suspect that in practice this is not a very important optimization, because (1) most people probably don't drop permanent tables or databases very often and (2) buffers that are being heavily used should have a positive reference count, in which case the clock sweep will pass over them and land on one of the newly-invalidated buffers anyway. When its full its just an integer 0 test, which is cheap and its on the same cacheline as the nextVictimBuffer anyway, so we have to fetch it. The clock sweep is where all the time goes, in its current form. ...but I agree with this. In its current form, the clock sweep has to acquire a spinlock for every buffer it touches. That's really expensive, and I think we need to either get rid of it altogether or at least find some way to move it into the background. Ideally, in the common case, a backend that wants to allocate a buffer shouldn't need to do more than acquire a spinlock, pop a buffer off some sort of linked list of allocation targets, and release the spinlock. Whatever other computation is needed should get pushed out of foreground processing. We have 2 problems 1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so bgwriter has to fight with backends to find out where it should clean. As a result it spends lots of time waiting and insufficient time cleaning. 
I have trouble accepting that this is really the problem we should be solving. There's only one background writer process, and that process is waking up every 200ms by default. Many people probably tune that down quite a bit, but unless I'm quite mistaken, it should still be a drop in the bucket next to what the backends are doing. 2. When a backend can't find a free buffer, it spins for a long time while holding the lock. This makes the buffer strategy O(N) in its worst case, which slows everything down. Notably, while this is happening the bgwriter sits doing nothing at all, right at the moment when it is most needed. The Clock algorithm is an approximation of an LRU, so is already suboptimal as a perfect cache. Tweaking to avoid worst case behaviour makes sense. How much to tweak? Well,... I generally agree with this analysis, but I don't think the proposed patch is going to solve the problem. It may have some merit as a way of limiting the worst case behavior. For example, if every shared buffer has a reference count of 5, the first buffer allocation that misses is going to have a lot of work to do before it can actually come up with a victim. But I don't think it's going to provide good scaling in general. Even if the background writer only spins through, on average, ten or fifteen buffers before finding one to evict, that still means we're acquiring ten or fifteen spinlocks while holding BufFreelistLock. I don't currently have the measurements to prove that's too expensive, but I bet it is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
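For readers following the thread, the clock sweep being discussed works roughly as below. This is a toy model for illustration only, not the actual bufmgr.c code: per-buffer spinlocks, pinning protocol, and the "no unpinned buffer" error path are all omitted, and names are invented:

```python
class Buffer:
    def __init__(self):
        self.usage_count = 0   # bumped on each access; PostgreSQL caps it at 5
        self.pinned = False

def clock_sweep(buffers, next_victim):
    """Advance the clock hand until a buffer with usage_count == 0 is found.

    Each full pass decrements every unpinned buffer's usage_count, so the
    loop terminates (assuming at least one unpinned buffer). But if every
    buffer starts at the usage cap, the first allocation that misses pays
    for several complete passes over shared_buffers -- the O(N) worst case
    Simon describes. Returns the victim and the new clock-hand position.
    """
    n = len(buffers)
    while True:
        buf = buffers[next_victim % n]
        next_victim += 1
        if buf.pinned:
            continue
        if buf.usage_count == 0:
            return buf, next_victim
        buf.usage_count -= 1
```

In the real implementation, each iteration of that loop takes a buffer-header spinlock while BufFreelistLock is held, which is the cost Robert is objecting to.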
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 3, 2012 at 5:47 AM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Jan 3, 2012 at 4:40 AM, Noah Misch n...@leadboat.com wrote: On Tue, Jan 03, 2012 at 01:18:41AM +, Simon Riggs wrote: Just for the record, yes we do run multiple catalog scans in some parts of the code. So I can see how we might trigger 4 nested scans, using cache replacement while scanning, so best assume more, with no guarantee of them being neatly stacked for pop/push type access. Yeah, I wouldn't want to commit to a nesting limit. Agreed However, I _would_ have expected that a stack would suffice; PushActiveSnapshot()/PopActiveSnapshot() is adequate for a great deal of the backend, after all. In what sort of situation do catalog scans not strictly nest? It's possible. Making assumptions about what is possible bit me before, as you showed. I've seen code where we are explicitly running two concurrent scans. I don't want to put unnecessary and subtle assumptions into catalog scan code so I'm working on the assumption that endscan may not be called in strict FIFO order. The difference is fairly minor and doesn't restrict us in other ways. I feel like the first thing we should be doing here is some benchmarking. If we change just the scans in dependency.c and then try the test case Tom suggested (dropping a schema containing a large number of functions), we can compare the patched code with master and get an idea of whether the performance is acceptable. If it is, changing everything else is a mostly mechanical process that we can simply grind through. If it's not, I'd rather learn that before we start grinding. I started to do this before Christmas, but then I ... went on vacation. Here's Perl script that can be used to generate an SQL script to create as many functions as you'd like to try dropping at once, in case it's helpful. The idea is to run the resulting SQL script through psql and then test the speed of DROP SCHEMA lots_of_functions CASCADE;. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Attachment: lots_of_functions.pl
Re: [HACKERS] patch: ALTER TABLE IF EXISTS
On Mon, Jan 2, 2012 at 12:01 PM, Pavel Stehule pavel.steh...@gmail.com wrote: here is updated patch I think the comments in parse_utilcmd.c probably need a bit of adjustment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Add protransform for numeric, varbit, and temporal types
On Sat, Dec 31, 2011 at 7:36 PM, Noah Misch n...@leadboat.com wrote: Building on commit 8f9fe6edce358f7904e0db119416b4d1080a83aa, this adds protransform functions to the length coercions for numeric, varbit, timestamp, timestamptz, time, timetz and interval. This mostly serves to make more ALTER TABLE ALTER TYPE operations avoid a rewrite, including numeric(10,2) -> numeric(12,2), varbit(4) -> varbit(8) and timestamptz(2) -> timestamptz(4). The rules for varbit are exactly the same as for varchar. Numeric is slightly more complex: * Flatten calls to our length coercion function that solely represent * increases in allowable precision. Scale changes mutate every datum, so * they are unoptimizable. Some values, e.g. 1E-1001, can only fit into an * unconstrained numeric, so a change from an unconstrained numeric to any * constrained numeric is also unoptimizable. time{,stamp}{,tz} are similar to varchar for these purposes, except that, for example, plain timestamptz is equivalent to timestamptz(6). interval has a vastly different typmod format, but the principles applicable to length coercion remain the same. Under --disable-integer-datetimes, I'm not positive that timestamp_scale() is always a no-op when one would logically expect as much. Does there exist a timestamp such that v::timestamp(2) differs from v::timestamp(2)::timestamp(4) due to floating point rounding? Even if so, I'm fairly comfortable calling it a feature rather than a bug to avoid perturbing values that way. After these patches, the only core length coercion casts not having protransform functions are those for bpchar and bit. For those, we could only optimize trivial cases of no length change. I'm not planning to do so. This is cool stuff. I will plan to review this once the CF starts.
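Noah's numeric rule can be sketched as a toy model using numeric's typmod encoding, ((precision << 16) | scale) + VARHDRSZ with -1 meaning unconstrained. This is an illustration of the quoted comment, not the patch's actual C code:

```python
VARHDRSZ = 4  # numeric typmod stores ((precision << 16) | scale) + VARHDRSZ

def numeric_typmod(precision, scale):
    return ((precision << 16) | scale) + VARHDRSZ

def length_coercion_is_noop(old_typmod, new_typmod):
    """Mirror of the rule quoted above: only constrained -> constrained
    changes that keep the scale and do not lower the precision can skip
    rewriting every datum. A typmod < VARHDRSZ (i.e. -1) means an
    unconstrained numeric, which is unoptimizable in either direction."""
    if old_typmod < VARHDRSZ or new_typmod < VARHDRSZ:
        return False
    old_prec = (old_typmod - VARHDRSZ) >> 16
    old_scale = (old_typmod - VARHDRSZ) & 0xFFFF
    new_prec = (new_typmod - VARHDRSZ) >> 16
    new_scale = (new_typmod - VARHDRSZ) & 0xFFFF
    return new_scale == old_scale and new_prec >= old_prec
```

So numeric(10,2) -> numeric(12,2) avoids a rewrite, while any scale change or a change from unconstrained numeric does not.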
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] patch: ALTER TABLE IF EXISTS
Hello 2012/1/3 Robert Haas robertmh...@gmail.com: On Mon, Jan 2, 2012 at 12:01 PM, Pavel Stehule pavel.steh...@gmail.com wrote: here is updated patch I think the comments in parse_utilcmd.c probably need a bit of adjustment. I don't see it - there is only one comment, and it is adjusted by the if statement. Please show it. Regards Pavel -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas robertmh...@gmail.com wrote: The clock sweep is where all the time goes, in its current form. ...but I agree with this. In its current form, the clock sweep has to acquire a spinlock for every buffer it touches. That's really expensive, and I think we need to either get rid of it altogether or at least find some way to move it into the background. Ideally, in the common case, a backend that wants to allocate a buffer shouldn't need to do more than acquire a spinlock, pop a buffer off some sort of linked list of allocation targets, and release the spinlock. Whatever other computation is needed should get pushed out of foreground processing. So you don't think a freelist is worth having, but you want a list of allocation targets. What is the practical difference? We have 2 problems 1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so bgwriter has to fight with backends to find out where it should clean. As a result it spends lots of time waiting and insufficient time cleaning. I have trouble accepting that this is really the problem we should be solving. There's only one background writer process, and that process is waking up every 200ms by default. Many people probably tune that down quite a bit, but unless I'm quite mistaken, it should still be a drop in the bucket next to what the backends are doing. That doesn't follow. Forcing the bgwriter to wait makes no sense. Yes, most of the contention is caused by the other backends, but the bgwriter experiences that contention currently when it has no need to do so. Presumably if you did have an allocation list maintained in the background by the bgwriter you wouldn't expect the bgwriter to wait behind a lock for no reason when it could be doing useful work. 2. When a backend can't find a free buffer, it spins for a long time while holding the lock. This makes the buffer strategy O(N) in its worst case, which slows everything down. 
Notably, while this is happening the bgwriter sits doing nothing at all, right at the moment when it is most needed. The Clock algorithm is an approximation of an LRU, so is already suboptimal as a perfect cache. Tweaking to avoid worst case behaviour makes sense. How much to tweak? Well,... I generally agree with this analysis, but I don't think the proposed patch is going to solve the problem. It may have some merit as a way of limiting the worst case behavior. For example, if every shared buffer has a reference count of 5, the first buffer allocation that misses is going to have a lot of work to do before it can actually come up with a victim. But I don't think it's going to provide good scaling in general. Even if the background writer only spins through, on average, ten or fifteen buffers before finding one to evict, that still means we're acquiring ten or fifteen spinlocks while holding BufFreelistLock. I don't currently have the measurements to prove that's too expensive, but I bet it is. I think it's worth reducing the cost of scanning, but that has little to do with solving the O(N) problem. I think we need both. I've left the way open for you to redesign freelist management in many ways. Please take the opportunity and go for it, though we must realise that major overhauls require significantly more testing to prove their worth. Reducing spinlocking only sounds like a good way to proceed for this release. If you don't have time in 9.2, then these two small patches are worth having. The bgwriter locking patch needs less formal evidence to show its worth. We simply don't need to have a bgwriter that just sits waiting doing nothing. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] patch: ALTER TABLE IF EXISTS
On Tue, Jan 3, 2012 at 10:38 AM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello 2012/1/3 Robert Haas robertmh...@gmail.com: On Mon, Jan 2, 2012 at 12:01 PM, Pavel Stehule pavel.steh...@gmail.com wrote: here is updated patch I think the comments in parse_utilcmd.c probably need a bit of adjustment. I don't see it - there is only one comment and it is adjusted with if statement. please, show it Well, basically, the comment preceding the part you altered says "the lock level requested here", but "here" is getting spread out quite a bit more with this code change. Maybe that doesn't matter. However, on further examination, this is a pretty awkward way to write the code. Why not something like this: rel = relation_openrv_extended(stmt->relation, lockmode, stmt->missing_ok); if (rel == NULL) { ereport(...); return NIL; } Maybe the intent of sticking heap_openrv_extended() into the upper branch of the if statement is to try to bounce relations that aren't tables, but that's not actually what that function does (e.g. a foreign table will slip through). And I think if we're going to have IF EXISTS support for ALTER TABLE at all, we ought to have it for the whole family of related statements: ALTER VIEW, ALTER SEQUENCE, ALTER FOREIGN TABLE, etc., not just ALTER TABLE itself. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 3, 2012 at 3:24 PM, Robert Haas robertmh...@gmail.com wrote: I feel like the first thing we should be doing here is some benchmarking. If we change just the scans in dependency.c and then try the test case Tom suggested (dropping a schema containing a large number of functions), we can compare the patched code with master and get an idea of whether the performance is acceptable. Yes, I've done this and it takes 2.5s to drop 10,000 functions using an MVCC snapshot. That was acceptable to *me*, so I didn't try measuring using just SnapshotNow. We can do a lot of tests but at the end it's a human judgement. Is 100% correct results from catalog accesses worth having when the real world speed of it is not substantially very good? (Whether it's 100x slower or not is not relevant if it is still fast enough). Anybody with that many DDL statements probably cares about doing various operations with lower lock levels. Fastpath locking will slow down DDL but we didn't measure the performance slow down there. We understood the benefit and were willing to pay the price. So I'll proceed for now with the patch, which isn't as simple as you think. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Patch to allow users to kill their own queries
On Tue, Dec 20, 2011 at 02:30:08PM +0100, Magnus Hagander wrote: That said - can someone who knows the translation stuff better than me comment on if this is actually going to be translatable, or if it violates too many translation rules?

+pg_signal_backend(int pid, int sig, bool allow_same_role, const char *actionstr, const char *hint)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser to %s other server processes", actionstr),
+ errhint("%s", hint)));
+ PG_RETURN_BOOL(pg_signal_backend(PG_GETARG_INT32(0), SIGTERM, false,
+ gettext_noop("terminate"),
+ gettext_noop("You can cancel your own processes with pg_cancel_backend().")));
}

You need errhint("%s", _(hint)) or errhint(hint) to substitute the translation at runtime; only the printf-pattern string gets an automatic message catalog lookup. Regarding the other message, avoid composing a translated message from independently-translated parts. The translator sees this:

#: utils/adt/misc.c:110
#, c-format
msgid "must be superuser to %s other server processes"
msgstr ""

#: utils/adt/misc.c:166
msgid "terminate"
msgstr ""

#: utils/adt/misc.c:167
msgid "You can cancel your own processes with pg_cancel_backend()."
msgstr ""

He'll probably need to read the code to see how the strings go together. If we add an alternative to "terminate", not all languages will necessarily have a translation of the outer message that works for both inner fragments.
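Noah's advice against splicing translated fragments generalizes beyond C; the same pattern can be shown with any gettext binding. A generic illustration (not PostgreSQL code — the function names here are invented):

```python
from gettext import gettext as _

# Fragile: the translator sees the verb "terminate" as an isolated msgid
# with no context, and some languages cannot build a grammatical sentence
# by substituting a verb into the shared template.
def denied_fragment(action):
    return _("must be superuser to %s other server processes") % _(action)

# Robust: each alternative is a complete sentence with its own msgid, so
# every language can translate each message independently.
FULL_MESSAGES = {
    "terminate": _("must be superuser to terminate other server processes"),
    "cancel": _("must be superuser to cancel other server processes"),
}

def denied_full(action):
    return FULL_MESSAGES[action]
```

With no message catalog installed, gettext falls back to the msgid, so both functions appear to work in English; the difference only bites translators.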
Re: [HACKERS] [v9.2] Fix Leaky View Problem
2011/12/23 Robert Haas robertmh...@gmail.com: On Fri, Dec 23, 2011 at 5:56 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote: I'd like the regression test on select_view test being committed also to detect unexpected changes in the future. How about it? Can you resend that as a separate patch? I remember there were some things I didn't like about it, but I don't remember what they were at the moment... Sorry for this late response. The attached one is a patch of the regression test that checks the scenario of a malicious function with/without the security_barrier option. I guess you were concerned that expected/select_views_1.out is patched, not expected/select_views.out. I'm not sure why the regression test script tries to diff results/select_views against expected/select_views_1.out. Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp Attachment: pgsql-regtest-leaky-views.patch
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Tue, Jan 3, 2012 at 10:56 AM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas robertmh...@gmail.com wrote: The clock sweep is where all the time goes, in its current form. ...but I agree with this. In its current form, the clock sweep has to acquire a spinlock for every buffer it touches. That's really expensive, and I think we need to either get rid of it altogether or at least find some way to move it into the background. Ideally, in the common case, a backend that wants to allocate a buffer shouldn't need to do more than acquire a spinlock, pop a buffer off some sort of linked list of allocation targets, and release the spinlock. Whatever other computation is needed should get pushed out of foreground processing. So you don't think a freelist is worth having, but you want a list of allocation targets. What is the practical difference? I think that our current freelist is practically useless, because it is almost always empty, and the cases where it's not empty (startup, and after a table or database drop) are so narrow that we don't really get any benefit out of having it. However, I'm not opposed to the idea of a freelist in general: I think that if we actually put in some effort to keep the freelist in a non-empty state it would help a lot, because backends would then have much less work to do at buffer allocation time. I have trouble accepting that this is really the problem we should be solving. There's only one background writer process, and that process is waking up every 200ms by default. Many people probably tune that down quite a bit, but unless I'm quite mistaken, it should still be a drop in the bucket next to what the backends are doing. That doesn't follow. Forcing the bgwriter to wait makes no sense. Yes, most of the contention is caused by the other backends, but the bgwriter experiences that contention currently when it has no need to do so. 
Sure, but if that contention is a negligible fraction of the total and doesn't cost anything measurable, then it doesn't make sense to add code to eliminate it. There are all sorts of places in the system where we have contention at a level that doesn't affect performance. Many of those could probably be fixed by adding more complicated locking, but that would just complicate the code to no purpose. If there's a demonstrable performance benefit to this change then it's worth a harder look, but my gut feeling is that there won't be. 2. When a backend can't find a free buffer, it spins for a long time while holding the lock. This makes the buffer strategy O(N) in its worst case, which slows everything down. Notably, while this is happening the bgwriter sits doing nothing at all, right at the moment when it is most needed. The Clock algorithm is an approximation of an LRU, so is already suboptimal as a perfect cache. Tweaking to avoid worst case behaviour makes sense. How much to tweak? Well,... I generally agree with this analysis, but I don't think the proposed patch is going to solve the problem. It may have some merit as a way of limiting the worst case behavior. For example, if every shared buffer has a reference count of 5, the first buffer allocation that misses is going to have a lot of work to do before it can actually come up with a victim. But I don't think it's going to provide good scaling in general. Even if the background writer only spins through, on average, ten or fifteen buffers before finding one to evict, that still means we're acquiring ten or fifteen spinlocks while holding BufFreelistLock. I don't currently have the measurements to prove that's too expensive, but I bet it is. I think its worth reducing the cost of scanning, but that has little to do with solving the O(N) problem. I think we need both. Maybe. 
I would really like a unified solution to both problems, but I'd be willing to accept a solution for one of them if we're confident that we've actually solved it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 3, 2012 at 11:17 AM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, Jan 3, 2012 at 3:24 PM, Robert Haas robertmh...@gmail.com wrote: I feel like the first thing we should be doing here is some benchmarking. If we change just the scans in dependency.c and then try the test case Tom suggested (dropping a schema containing a large number of functions), we can compare the patched code with master and get an idea of whether the performance is acceptable. Yes, I've done this and it takes 2.5s to drop 10,000 functions using an MVCC snapshot. That was acceptable to *me*, so I didn't try measuring using just SnapshotNow. We can do a lot of tests but at the end its a human judgement. Is 100% correct results from catalog accesses worth having when the real world speed of it is not substantially very good? (Whether its x100 times slower or not is not relevant if it is still fast enough). Sure. But I don't see why that means it wouldn't be nice to know whether or not it is in fact a million times slower. If Tom's artificial worst case is a million times slower, then probably there are some cases we care about more that are going to be measurably impacted, and we're going to want to think about what to do about that. We can wait until you've finished the patch before we do that testing, or we can do it now and maybe get some idea whether the approach is likely to be viable or whether it's going to require some adjustment before we actually go trawl through all that code. On my laptop, dropping a schema with 10,000 functions using commit d5448c7d31b5af66a809e6580bae9bd31448bfa7 takes 400-500 ms. If my laptop is the same speed as your laptop, that would mean a 5-6x slowdown, but of course that's comparing apples and oranges... in any event, if the real number is anywhere in that ballpark, it's probably a surmountable obstacle, but I'd like to know rather than guessing. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] sorting operators in pg_dump
On Sun, Jan 1, 2012 at 11:50 PM, Peter Eisentraut pete...@gmx.net wrote: Currently, pg_dump sorts operators by name, but operators with the same name come out in random order. A few releases ago we adjusted this for functions, so that they are emitted in order of increasing number of arguments. I'd like to do this for operators as well, so that they come out in order, say, prefix, postfix, infix. (It might be surprising that something like this is necessary, but it happens. ip4r for example contains operators with different fixnesses (fixities?).) Patch attached, and a little test case. Seems like a good idea. The patch looks good, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
Simon Riggs si...@2ndquadrant.com writes: That was acceptable to *me*, so I didn't try measuring using just SnapshotNow. We can do a lot of tests but at the end it's a human judgement. Are 100% correct results from catalog accesses worth having when the real-world speed of it is not substantially worse? (Whether it's 100 times slower or not is not relevant if it is still fast enough.) That argument is just nonsense AFAICT. Yes, 2.5 s to drop 10,000 functions is probably fine, but that is an artificial test case whose only real interest is to benchmark what a change in SnapshotNow scans might cost us. In the real world it's hard to guess what fraction of a real workload might consist of such scans, but I suspect there are some where a significant increase in that cost might hurt. So in my mind the point of the exercise is to find out how much the cost increased, and I'm just baffled as to why you won't use a benchmark case to, um, benchmark. Another point that requires some thought is that switching SnapshotNow to be MVCC-based will presumably result in a noticeable increase in each backend's rate of wanting to acquire snapshots. Hence, more contention in GetSnapshotData can be expected. A single-threaded test case doesn't prove anything at all about what that might cost under load. regards, tom lane
Re: [HACKERS] Patch to allow users to kill their own queries
Noah Misch n...@leadboat.com writes: Regarding the other message, avoid composing a translated message from independently-translated parts. Yes. I haven't looked at the patch, but I wonder whether it wouldn't be better to dodge both of these problems by having the subroutine return a success/failure result code, so that the actual ereport can be done at an outer level where the full message (and hint) can be written directly. regards, tom lane
Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?
On Fri, Sep 9, 2011 at 11:02 PM, Daniel Farina dan...@heroku.com wrote: On Wed, Aug 24, 2011 at 1:04 PM, Daniel Farina dan...@heroku.com wrote: On Wed, Aug 24, 2011 at 12:38 PM, Tom Lane t...@sss.pgh.pa.us wrote: Assuming the issue really is the physical unlinks (which I agree I'd like to see some evidence for), I wonder whether the problem could be addressed by moving smgrDoPendingDeletes() to after locks are released, instead of before, in CommitTransaction/AbortTransaction. There does not seem to be any strong reason why we have to do that before lock release, since incoming potential users of a table should not be trying to access the old physical storage after that anyway. Alright, since this concern about confirming the expensive part of index dropping has come up a few times but otherwise the waters are warm, I'll go ahead and do some work to pin things down a bit before we continue working on those assumptions. This suspicion seems to be proven correct; an opportunity arose where we were removing some indexes on a live system, and I took it to carefully control and time the process. There's not much relationship between the size of the index and the delay, but the pauses are still very real. On the other hand, the first time this was noticed there was significantly higher load. I'd still like to do something to solve this problem, though: even if the time-consuming part of the process is not file unlinking, it's clearly something after the AccessExclusiveLock is acquired based on our other measurements. This could well be related to the fact that DropRelFileNodeBuffers() does a scan of shared_buffers, which is an O(N) approach no matter the size of the index. On top of that, taking what Robert Haas mentioned on another thread, InvalidateBuffer currently calls StrategyFreeBuffer(), which waits for an ExclusiveLock on the BufFreelistLock.
On a busy system this will be heavily contended, so adding blocks to the freelist only if the lock is free seems warranted. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3e62448..af7215f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -218,12 +218,21 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 }
 
 /*
- * StrategyFreeBuffer: put a buffer on the freelist
+ * StrategyFreeBuffer: put a buffer on the freelist, unless we're busy
  */
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	/*
+	 * The buffer is already invalidated and is now an allocation target.
+	 * Adding buffers back onto the freelist is an optimisation only,
+	 * so we can decide to skip this step if the lock is busy.
+	 * This improves the speed of dropping indexes and tables on a busy system.
+	 * If the system is busy the newly invalidated buffers will be reallocated
+	 * within one clock sweep.
+	 */
+	if (!LWLockConditionalAcquire(BufFreelistLock, LW_EXCLUSIVE))
+		return;
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
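The LWLockConditionalAcquire() change above is an instance of a general pattern: when a lock protects only an optimisation, a failed try-lock can simply skip the work rather than wait. A minimal stand-alone sketch of that pattern (all names hypothetical; PostgreSQL's real freelist and LWLock machinery are considerably more involved), using a C11 atomic_flag as the try-lock:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for BufFreelistLock. */
static atomic_flag freelist_lock = ATOMIC_FLAG_INIT;

/* Try to take the lock without blocking; returns true on success. */
static bool lock_conditional_acquire(atomic_flag *lock)
{
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

static void lock_release(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

static int freelist_len = 0;    /* protected by freelist_lock */

/*
 * Best-effort enqueue: putting the buffer back on the freelist is only
 * an optimisation, so if the lock is contended we skip it, exactly as
 * the patch does.  Returns true if the buffer was actually added.
 */
bool strategy_free_buffer_sketch(int buf_id)
{
    if (!lock_conditional_acquire(&freelist_lock))
        return false;           /* busy: correctness is unaffected */
    freelist_len++;             /* stand-in for the real list manipulation */
    (void) buf_id;
    lock_release(&freelist_lock);
    return true;
}
```

The design trade-off is the one the patch comment states: on a busy system a skipped buffer is simply found again by the clock sweep within one pass.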
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 3, 2012 at 12:55 PM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: That was acceptable to *me*, so I didn't try measuring using just SnapshotNow. We can do a lot of tests but at the end it's a human judgement. Are 100% correct results from catalog accesses worth having when the real-world speed of it is not substantially worse? (Whether it's 100 times slower or not is not relevant if it is still fast enough.) That argument is just nonsense AFAICT. Yes, 2.5 s to drop 10,000 functions is probably fine, but that is an artificial test case whose only real interest is to benchmark what a change in SnapshotNow scans might cost us. In the real world it's hard to guess what fraction of a real workload might consist of such scans, but I suspect there are some where a significant increase in that cost might hurt. So in my mind the point of the exercise is to find out how much the cost increased, and I'm just baffled as to why you won't use a benchmark case to, um, benchmark. Ditto. Another point that requires some thought is that switching SnapshotNow to be MVCC-based will presumably result in a noticeable increase in each backend's rate of wanting to acquire snapshots. Hence, more contention in GetSnapshotData can be expected. A single-threaded test case doesn't prove anything at all about what that might cost under load. This is obviously true at some level, but I'm not sure that it really matters. It's not that difficult to construct a test case where we have lots of people concurrently reading a table, or reading many tables, or writing a table, or writing many tables, but what kind of realistic test case involves enough DDL for any of this to matter? If you're creating or dropping tables, for example, the filesystem costs are likely to be a much bigger problem than GetSnapshotData(), to the point where you probably can't get enough concurrency for GetSnapshotData() to matter.
Maybe you could find a problem case involving creating or dropping lots and lots of functions concurrently, or something like that, but who does that? You'd have to be performing operations on hundreds of non-table SQL objects per second, and it is hard for me to imagine why anyone would be doing that. Maybe I'm not imaginative enough? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
Excerpts from Robert Haas's message of mar ene 03 12:24:52 -0300 2012: I feel like the first thing we should be doing here is some benchmarking. If we change just the scans in dependency.c and then try the test case Tom suggested (dropping a schema containing a large number of functions), we can compare the patched code with master and get an idea of whether the performance is acceptable. If it is, changing everything else is a mostly mechanical process that we can simply grind through. If it's not, I'd rather learn that before we start grinding. If there are many call sites, maybe it'd be a good idea to use a semantic patcher tool such as Coccinelle instead of doing it one by one. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
Robert Haas robertmh...@gmail.com writes: On Tue, Jan 3, 2012 at 12:55 PM, Tom Lane t...@sss.pgh.pa.us wrote: Another point that requires some thought is that switching SnapshotNow to be MVCC-based will presumably result in a noticeable increase in each backend's rate of wanting to acquire snapshots. Hence, more contention in GetSnapshotData can be expected. A single-threaded test case doesn't prove anything at all about what that might cost under load. This is obviously true at some level, but I'm not sure that it really matters. It's not that difficult to construct a test case where we have lots of people concurrently reading a table, or reading many tables, or writing a table, or writing many tables, but what kind of realistic test case involves enough DDL for any of this to matter? Um ... you're supposing that only DDL uses SnapshotNow, which is wrong. I refer you to the parser, the planner, execution functions for arrays, records, enums, any sort of relcache reload, etc etc etc. Yes, some of that is masked by backend-internal caching, some of the time, but it's folly to just assume that there are no SnapshotNow scans during normal queries. None of this is necessarily grounds to reject a patch along the proposed lines. I'm just asking for some benchmarking effort to establish what the costs might be, rather than naively hoping they are negligible. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 3, 2012 at 1:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: Um ... you're supposing that only DDL uses SnapshotNow, which is wrong. I refer you to the parser, the planner, execution functions for arrays, records, enums, any sort of relcache reload, etc etc etc. Yes, some of that is masked by backend-internal caching, some of the time, but it's folly to just assume that there are no SnapshotNow scans during normal queries. Hmm. That's unfortunate, because it seems difficult to construct a test case that will exercise every feature in the system. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
Robert Haas robertmh...@gmail.com wrote: Tom Lane t...@sss.pgh.pa.us wrote: Um ... you're supposing that only DDL uses SnapshotNow, which is wrong. I refer you to the parser, the planner, execution functions for arrays, records, enums, any sort of relcache reload, etc etc etc. Yes, some of that is masked by backend-internal caching, some of the time, but it's folly to just assume that there are no SnapshotNow scans during normal queries. Hmm. That's unfortunate, because it seems difficult to construct a test case that will exercise every feature in the system. Would the result of IsMVCCSnapshot() change for these snapshots? If so, it might require work in the SSI code to avoid a performance hit there. In early profiling and stepping through execution I noticed that we had overhead in serializable transactions for the planner checks for the actual values at the beginning or end of an index. This went away when we avoided SSI work for reads using a non-MVCC snapshot. If we're going to start using MVCC snapshots for such things, we need some other way to avoid unnecessary work in this area (and possibly others). At a minimum, some comparative benchmarks at the serializable isolation level would be in order when considering a patch like this. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
I wrote: Another point that requires some thought is that switching SnapshotNow to be MVCC-based will presumably result in a noticeable increase in each backend's rate of wanting to acquire snapshots. BTW, I wonder if this couldn't be ameliorated by establishing some ground rules about how up-to-date a snapshot really needs to be. Arguably, it should be okay for successive SnapshotNow scans to use the same snapshot as long as we have not acquired a new lock in between. If not, reusing an old snap doesn't introduce any race condition that wasn't there already. regards, tom lane
[HACKERS] improve pg_restore warning on text dump input
From time to time there are complaints because people mistakenly feed a text format dump to pg_restore and get back a somewhat cryptic message about the file not being a valid archive. It's been suggested that we should have pg_restore run the file through psql, but that would involve more work than I at least care to give the problem. However, I think we should give a nicer message, suggesting the user try feeding the file to psql instead. The attached small patch does that. cheers andrew

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 7d895c4..9918c4d 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -77,6 +77,9 @@ typedef struct _parallel_slot
 
 #define NO_SLOT (-1)
 
+#define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
+#define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
+
 /* state needed to save/restore an archive's output target */
 typedef struct _outputContext
 {
@@ -1872,7 +1875,19 @@ _discoverArchiveFormat(ArchiveHandle *AH)
 			die_horribly(AH, modulename, "input file does not appear to be a valid archive (too short?)\n");
 
 		if (!isValidTarHeader(AH->lookahead))
-			die_horribly(AH, modulename, "input file does not appear to be a valid archive\n");
+		{
+			if (strncmp(AH->lookahead, TEXT_DUMP_HEADER, strlen(TEXT_DUMP_HEADER)) == 0 ||
+				strncmp(AH->lookahead, TEXT_DUMPALL_HEADER, strlen(TEXT_DUMPALL_HEADER)) == 0)
+			{
+				/* looks like it's probably a text format dump. so suggest they try psql */
+				die_horribly(AH, modulename, "input file appears to be a text format dump. Please use psql.\n");
+			}
+			else
+			{
+				/* we have no idea what this is */
+				die_horribly(AH, modulename, "input file does not appear to be a valid archive\n");
+			}
+		}
 
 		AH->format = archTar;
 	}
Re: [HACKERS] improve pg_restore warning on text dump input
Andrew Dunstan and...@dunslane.net writes: From time to time there are complaints because people mistakenly feed a text format dump to pg_restore and get back a somewhat cryptic message about the file not being a valid archive. It's been suggested that we should have pg_restore run the file through psql, but that would involve more work than I at least care to give the problem. However, I think we should give a nicer message, suggesting the user try feeding the file to psql instead. The attached small patch does that. It would probably be better if you put this test before the one that insists the file is at least 512 bytes. regards, tom lane
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 3, 2012 at 1:42 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Robert Haas robertmh...@gmail.com wrote: Tom Lane t...@sss.pgh.pa.us wrote: Um ... you're supposing that only DDL uses SnapshotNow, which is wrong. I refer you to the parser, the planner, execution functions for arrays, records, enums, any sort of relcache reload, etc etc etc. Yes, some of that is masked by backend-internal caching, some of the time, but it's folly to just assume that there are no SnapshotNow scans during normal queries. Hmm. That's unfortunate, because it seems difficult to construct a test case that will exercise every feature in the system. Would the result of IsMVCCSnapshot() change for these snapshots? If so, it might require work in the SSI code to avoid a performance hit there. In early profiling and stepping through execution I noticed that we had overhead in serializable transactions for the planner checks for the actual values at the beginning or end of an index. This went away when we avoided SSI work for reads using a non-MVCC snapshot. If we're going to start using MVCC snapshots for such things, we need some other way to avoid unnecessary work in this area (and possibly others). At a minimum, some comparative benchmarks at the serializable isolation level would be in order when considering a patch like this. Ugh. Yeah, that sounds like a problem. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improve pg_restore warning on text dump input
On 01/03/2012 01:55 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: From time to time there are complaints because people mistakenly feed a text format dump to pg_restore and get back a somewhat cryptic message about the file not being a valid archive. It's been suggested that we should have pg_restore run the file through psql, but that would involve more work than I at least care to give the problem. However, I think we should give a nicer message, suggesting the user try feeding the file to psql instead. The attached small patch does that. It would probably be better if you put this test before the one that insists the file is at least 512 bytes. Hmm, yeah. I guess we're pretty much certain that these markers can't reasonably appear at the start of a tar archive. cheers andrew
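The reordering Tom suggests amounts to sniffing for the text-dump header before enforcing the 512-byte tar-block minimum, so that even a very short text dump gets the friendly hint. A self-contained sketch of that check (the function name and signature are hypothetical; the header strings are the ones defined in the patch):

```c
#include <string.h>
#include <stdbool.h>

/* Header strings from the proposed patch. */
#define TEXT_DUMP_HEADER    "--\n-- PostgreSQL database dump\n--\n\n"
#define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"

/*
 * Return true if the first 'len' bytes read from the input look like a
 * text-format dump from pg_dump or pg_dumpall.  Because this needs only
 * ~35 bytes of lookahead, it can run before the "at least 512 bytes"
 * archive check, catching short text dumps too.
 */
bool looks_like_text_dump(const char *buf, size_t len)
{
    if (len >= strlen(TEXT_DUMP_HEADER) &&
        memcmp(buf, TEXT_DUMP_HEADER, strlen(TEXT_DUMP_HEADER)) == 0)
        return true;
    if (len >= strlen(TEXT_DUMPALL_HEADER) &&
        memcmp(buf, TEXT_DUMPALL_HEADER, strlen(TEXT_DUMPALL_HEADER)) == 0)
        return true;
    return false;
}
```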
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 3, 2012 at 1:42 PM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: Another point that requires some thought is that switching SnapshotNow to be MVCC-based will presumably result in a noticeable increase in each backend's rate of wanting to acquire snapshots. BTW, I wonder if this couldn't be ameliorated by establishing some ground rules about how up-to-date a snapshot really needs to be. Arguably, it should be okay for successive SnapshotNow scans to use the same snapshot as long as we have not acquired a new lock in between. If not, reusing an old snap doesn't introduce any race condition that wasn't there already. Is that likely to help much? I think our usual pattern is to lock the catalog, scan it, and then release the lock, so there will normally be an AcceptInvalidationMessages() just before the scan. Or at least, I think there will. Another thought is that it should always be safe to reuse an old snapshot if no transactions have committed or aborted since it was taken (possibly we could narrow that to no transactions have committed since it was taken, but I think that might cause some headaches with RecentGlobalXmin). I wonder if we could come up with a cheap, hopefully lock-free method of determining whether or not that's the case. For example, suppose we store the last XID to commit or abort in shared memory. In GetSnapshotData(), we check whether that value has changed (probably, after enforcing a memory barrier of some kind) before acquiring the lock. If it has, we proceed normally, but if not, we just reuse the results from the previous GetSnapshotData(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
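Robert's last-XID idea can be sketched in miniature: keep a shared counter that transaction end bumps, and rebuild the snapshot only when that counter has moved. This is a toy model with hypothetical names — the real GetSnapshotData() must scan the proc array under ProcArrayLock, and the memory-barrier subtleties Robert mentions are glossed over here by sequentially consistent atomics:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared counter bumped whenever any transaction commits or aborts. */
static _Atomic uint64_t last_xact_end = 0;

void note_transaction_end(void)     /* called at commit/abort */
{
    atomic_fetch_add(&last_xact_end, 1);
}

/* Hypothetical per-backend snapshot cache. */
typedef struct { uint64_t taken_at; int build_count; } Snapshot;
static Snapshot cached;

/*
 * Return a snapshot, doing the expensive rebuild (the lock-taking proc
 * array scan in the real GetSnapshotData) only if some transaction has
 * ended since the cached one was taken; otherwise the old snapshot is
 * provably still valid and can be reused.
 */
Snapshot *get_snapshot_data_sketch(void)
{
    uint64_t now = atomic_load(&last_xact_end);
    if (cached.build_count == 0 || cached.taken_at != now)
    {
        /* ... acquire ProcArrayLock and scan the proc array here ... */
        cached.taken_at = now;
        cached.build_count++;
    }
    return &cached;
}
```

On the fast path this costs one atomic load instead of a lock acquisition, which is exactly the contention reduction being discussed.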
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
Robert Haas robertmh...@gmail.com writes: On Tue, Jan 3, 2012 at 1:42 PM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, I wonder if this couldn't be ameliorated by establishing some ground rules about how up-to-date a snapshot really needs to be. Arguably, it should be okay for successive SnapshotNow scans to use the same snapshot as long as we have not acquired a new lock in between. If not, reusing an old snap doesn't introduce any race condition that wasn't there already. Is that likely to help much? I think our usual pattern is to lock the catalog, scan it, and then release the lock, so there will normally be an AcceptInvalidationMessages() just before the scan. Or at least, I think there will. Um, good point. Those locks aren't meant to avoid race conditions, but the mechanism doesn't know that. Another thought is that it should always be safe to reuse an old snapshot if no transactions have committed or aborted since it was taken Yeah, that might work better. And it'd be a win for all MVCC snaps, not just the ones coming from promoted SnapshotNow ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: ALTER TABLE IF EXISTS
Hello 2012/1/3 Robert Haas robertmh...@gmail.com: On Tue, Jan 3, 2012 at 10:38 AM, Pavel Stehule pavel.steh...@gmail.com wrote: Hello 2012/1/3 Robert Haas robertmh...@gmail.com: On Mon, Jan 2, 2012 at 12:01 PM, Pavel Stehule pavel.steh...@gmail.com wrote: here is updated patch I think the comments in parse_utilcmd.c probably need a bit of adjustment. I don't see it - there is only one comment and it is adjusted with if statement. please, show it Well, basically, the comment preceding the part you altered says "the lock level requested here", but "here" is getting spread out quite a bit more with this code change. Maybe that doesn't matter. However, on further examination, this is a pretty awkward way to write the code. Why not something like this:

rel = relation_openrv_extended(stmt->relation, lockmode, stmt->missing_ok);
if (rel == NULL)
{
    ereport(...);
    return NIL;
}

Maybe the intent of sticking heap_openrv_extended() into the upper branch of the if statement is to try to bounce relations that aren't tables, but that's not actually what that function does (e.g. a foreign table will slip through). And I think if we're going to have IF EXISTS support for ALTER TABLE at all, we ought to have it for the whole family of related statements: ALTER VIEW, ALTER SEQUENCE, ALTER FOREIGN TABLE, etc., not just ALTER TABLE itself. Yup, we can continue enhancing step by step. I changed the patch, and now ALTER TABLE, ALTER INDEX, ALTER SEQUENCE and ALTER VIEW have an IF EXISTS clause. Regards Pavel -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

*** ./doc/src/sgml/ref/alter_index.sgml.orig	2012-01-02 17:01:00.0 +0100
--- ./doc/src/sgml/ref/alter_index.sgml	2012-01-03 19:45:24.210189185 +0100
***************
*** 21,30 ****
  <refsynopsisdiv>
  <synopsis>
! ALTER INDEX <replaceable class="PARAMETER">name</replaceable> RENAME TO <replaceable class="PARAMETER">new_name</replaceable>
! ALTER INDEX <replaceable class="PARAMETER">name</replaceable> SET TABLESPACE <replaceable class="PARAMETER">tablespace_name</replaceable>
! ALTER INDEX <replaceable class="PARAMETER">name</replaceable> SET ( <replaceable class="PARAMETER">storage_parameter</replaceable> = <replaceable class="PARAMETER">value</replaceable> [, ... ] )
! ALTER INDEX <replaceable class="PARAMETER">name</replaceable> RESET ( <replaceable class="PARAMETER">storage_parameter</replaceable> [, ... ] )
  </synopsis>
  </refsynopsisdiv>
--- 21,30 ----
  <refsynopsisdiv>
  <synopsis>
! ALTER INDEX [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable> RENAME TO <replaceable class="PARAMETER">new_name</replaceable>
! ALTER INDEX [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable> SET TABLESPACE <replaceable class="PARAMETER">tablespace_name</replaceable>
! ALTER INDEX [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable> SET ( <replaceable class="PARAMETER">storage_parameter</replaceable> = <replaceable class="PARAMETER">value</replaceable> [, ... ] )
! ALTER INDEX [ IF EXISTS ] <replaceable class="PARAMETER">name</replaceable> RESET ( <replaceable class="PARAMETER">storage_parameter</replaceable> [, ... ] )
  </synopsis>
  </refsynopsisdiv>
***************
*** 38,43 ****
--- 38,53 ----
   <variablelist>
    <varlistentry>
+    <term><literal>IF EXISTS</literal></term>
+    <listitem>
+     <para>
+      Do not throw an error if the index does not exist. A notice is issued
+      in this case.
+     </para>
+    </listitem>
+   </varlistentry>
+ 
+   <varlistentry>
     <term><literal>RENAME</literal></term>
     <listitem>
      <para>
*** ./doc/src/sgml/ref/alter_sequence.sgml.orig	2012-01-02 17:01:00.0 +0100
--- ./doc/src/sgml/ref/alter_sequence.sgml	2012-01-03 18:44:14.397429013 +0100
***************
*** 23,37 ****
  <refsynopsisdiv>
  <synopsis>
! ALTER SEQUENCE <replaceable class="parameter">name</replaceable> [ INCREMENT [ BY ] <replaceable class="parameter">increment</replaceable> ]
!     [ MINVALUE <replaceable class="parameter">minvalue</replaceable> | NO MINVALUE ] [ MAXVALUE <replaceable class="parameter">maxvalue</replaceable> | NO MAXVALUE ]
!     [ START [ WITH ] <replaceable class="parameter">start</replaceable> ] [ RESTART [ [ WITH ] <replaceable class="parameter">restart</replaceable> ] ]
!     [ CACHE <replaceable class="parameter">cache</replaceable> ] [ [ NO ] CYCLE ]
!     [ OWNED BY { <replaceable class="parameter">table</replaceable>.<replaceable class="parameter">column</replaceable> | NONE } ]
! ALTER SEQUENCE <replaceable class="parameter">name</replaceable> OWNER TO <replaceable class="PARAMETER">new_owner</replaceable>
! ALTER SEQUENCE <replaceable class="parameter">name</replaceable> RENAME TO <replaceable class="parameter">new_name</replaceable>
! ALTER SEQUENCE <replaceable class="parameter">name</replaceable> SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
  </synopsis>
  </refsynopsisdiv>
--- 23,37 ----
  <refsynopsisdiv>
  <synopsis>
! ALTER SEQUENCE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> [ INCREMENT [ BY ] <replaceable class="parameter">increment</replaceable> ] [
Re: [HACKERS] spinlocks on powerpc
On 2012-01-03 04:44, Robert Haas wrote: On read-only workloads, you get spinlock contention, because everyone who wants a snapshot has to take the LWLock mutex to increment the shared lock count and again (just a moment later) to decrement it. Does the LWLock protect anything but the shared lock count? If not then the usually quickest manipulation is along the lines of:

loop:   lwarx   r5,0,r3     # load and reserve
        add     r0,r4,r5    # increment word
        stwcx.  r0,0,r3     # store new value if still reserved
        bne-    loop        # loop if lost reservation

(per IBM's software ref manual, https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF778525699600719DF2 ) The same sort of thing generally holds on other instruction-sets also. Also, heavy-contention locks should be placed in cache lines away from other data (to avoid thrashing the data cache lines when processors are fighting over the lock cache lines). -- Jeremy
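For reference, the load-reserve/store-conditional loop above is what a PowerPC compiler typically emits for a C11 atomic fetch-and-add, so the shared-count manipulation can be written portably (a sketch with hypothetical names; the real LWLock critical section protects more than the count):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared lock count, i.e. the word the lwarx/stwcx. loop manipulates. */
static _Atomic uint32_t shared_count = 0;

/*
 * Portable equivalent of the lwarx/add/stwcx./bne- loop: on PowerPC,
 * atomic_fetch_add compiles to exactly such a load-reserve /
 * store-conditional retry loop.  Returns the new count.
 */
uint32_t count_increment(void)
{
    return atomic_fetch_add_explicit(&shared_count, 1,
                                     memory_order_acq_rel) + 1;
}

uint32_t count_decrement(void)
{
    return atomic_fetch_sub_explicit(&shared_count, 1,
                                     memory_order_acq_rel) - 1;
}
```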
Re: [HACKERS] Collect frequency statistics for arrays
Hi! Thanks for your great work on reviewing this patch. Now I'm trying to find the memory corruption bug. Unfortunately it doesn't appear on my system. Can you check whether the bug remains in the attached version of the patch? If so, please send me information about the system you're running (processor, OS, etc.). -- With best regards, Alexander Korotkov. [Attachment: arrayanalyze-0.8.patch.gz]
Re: [HACKERS] spinlocks on powerpc
On Tue, Jan 3, 2012 at 3:05 PM, Jeremy Harris j...@wizmail.org wrote: On 2012-01-03 04:44, Robert Haas wrote: On read-only workloads, you get spinlock contention, because everyone who wants a snapshot has to take the LWLock mutex to increment the shared lock count and again (just a moment later) to decrement it. Does the LWLock protect anything but the shared lock count? If not then the usually quickest manipulation is along the lines of: loop: lwarx r5,0,r3 #load and reserve add r0,r4,r5 #increment word stwcx. r0,0,r3 #store new value if still reserved bne- loop #loop if lost reservation (per IBM's software ref manual, https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF778525699600719DF2 ) The same sort of thing generally holds on other instruction-sets also. Sure, but the actual critical section is not that simple. You might look at the code for LWLockAcquire() if you're curious. Also, heavy-contention locks should be placed in cache lines away from other data (to avoid thrashing the data cache lines when processors are fighting over the lock cache lines). Yep. This is possibly a problem, and has been discussed before, but I don't think we have any firm evidence that it's a problem, or how much padding helps. The heavily contended LWLocks are mostly non-consecutive, except perhaps for the buffer mapping locks. It's been suggested to me that we should replace our existing LWLock implementation with a CAS-based implementation that crams all the relevant details into a single 8-byte word. The pointer to the head of the wait queue, for example, could be stored as an offset into the allProcs array rather than a pointer value, which would allow it to be stored in 24 bits rather than 8 bytes. But there's not quite enough bit space to make it work without making compromises -- most likely, reducing the theoretical upper limit on MaxBackends from 2^24 to, say, 2^22. 
Even if we were willing to do that, the performance benefits of using atomics here are so far unproven... which doesn't mean they don't exist, but someone's going to have to do some work to show that they do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote: Thanks for your great work on reviewing this patch. Now I'm trying to find the memory corruption bug. Unfortunately it doesn't appear on my system. Can you check whether this bug remains in the attached version of the patch? If so, please provide information about the system you're running (processor, OS, etc.). I get the same diagnostic from this version. Opteron processor, operating system is Ubuntu 8.04 (64-bit). You're using --enable-cassert, right? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] spinlocks on powerpc
Robert Haas robertmh...@gmail.com writes: On Tue, Jan 3, 2012 at 3:05 PM, Jeremy Harris j...@wizmail.org wrote: Also, heavy-contention locks should be placed in cache lines away from other data (to avoid thrashing the data cache lines when processors are fighting over the lock cache lines). Yep. This is possibly a problem, and has been discussed before, but I don't think we have any firm evidence that it's a problem, or how much padding helps. The heavily contended LWLocks are mostly non-consecutive, except perhaps for the buffer mapping locks. We are in fact already padding and aligning LWLocks on (if memory serves) 16 or 32 byte boundaries depending on platform. So there might be 2 to 4 LWLocks in the same cache line, depending on platform. It's been suggested before that we pad more to reduce this number, but nobody's demonstrated any performance win from doing so. It's been suggested to me that we should replace our existing LWLock implementation with a CAS-based implementation that crams all the relevant details into a single 8-byte word. I'm under the impression that the main costs are associated with trading cache line ownership between CPUs, which makes me think that this'd be essentially a waste of effort, quite aside from the portability problems involved. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Collect frequency statistics for arrays
On Wed, Jan 4, 2012 at 12:33 AM, Noah Misch n...@leadboat.com wrote: On Wed, Jan 04, 2012 at 12:09:16AM +0400, Alexander Korotkov wrote: Thanks for your great work on reviewing this patch. Now I'm trying to find the memory corruption bug. Unfortunately it doesn't appear on my system. Can you check whether this bug remains in the attached version of the patch? If so, please provide information about the system you're running (processor, OS, etc.). I get the same diagnostic from this version. Opteron processor, operating system is Ubuntu 8.04 (64-bit). You're using --enable-cassert, right? Oh, actually no. Thanks for pointing that out. -- With best regards, Alexander Korotkov.
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 3, 2012 at 6:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Jan 3, 2012 at 12:55 PM, Tom Lane t...@sss.pgh.pa.us wrote: Another point that requires some thought is that switching SnapshotNow to be MVCC-based will presumably result in a noticeable increase in each backend's rate of wanting to acquire snapshots. Hence, more contention in GetSnapshotData can be expected. A single-threaded test case doesn't prove anything at all about what that might cost under load. This is obviously true at some level, but I'm not sure that it really matters. It's not that difficult to construct a test case where we have lots of people concurrently reading a table, or reading many tables, or writing a table, or writing many tables, but what kind of realistic test case involves enough DDL for any of this to matter? Um ... you're supposing that only DDL uses SnapshotNow, which is wrong. I refer you to the parser, the planner, execution functions for arrays, records, enums, any sort of relcache reload, etc etc etc. Yes, some of that is masked by backend-internal caching, some of the time, but it's folly to just assume that there are no SnapshotNow scans during normal queries. None of this is necessarily grounds to reject a patch along the proposed lines. I'm just asking for some benchmarking effort to establish what the costs might be, rather than naively hoping they are negligible. All of which is reasonable doubt. So far, all I'm saying is that on a couple of heavy-duty cases that I've run, I haven't noticed a problem. That is sufficient for me to push forwards with a patch that does very close to what we want, so we can judge acceptability with wider tests. I'm not saying it will be acceptable a priori, only that it is worth trying because the benefits are high. The number of call points is also high.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On Tue, Jan 3, 2012 at 6:14 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: If there are many call sites, maybe it'd be a good idea to use a semantic patcher tool such as Coccinelle instead of doing it one by one. Thanks for the suggestion, regrettably I've already made those changes. After examining the call sites, I identified 35 that might need changing. Of those, about 30 were changed to use systable_beginscan, while a few others use declared snapshots instead. So not a great effort and worth doing the by-hand inspection. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
Excerpts from Simon Riggs's message of mar ene 03 17:57:56 -0300 2012: On Tue, Jan 3, 2012 at 6:14 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: If there are many call sites, maybe it'd be a good idea to use a semantic patcher tool such as Coccinelle instead of doing it one by one. Thanks for the suggestion, regrettably I've already made those changes. After examining the call sites, I identified 35 that might need changing. Of those, about 30 were changed to use systable_beginscan, while a few others use declared snapshots instead. So not a great effort and worth doing the by-hand inspection. Yeah, for 35 changes I wouldn't have gone all the length required to learn the new tool either. If they had been 200, OTOH ... -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] spinlocks on powerpc
Robert Haas robertmh...@gmail.com writes: I'm unconvinced by these numbers. There is a measurable change but it is pretty small. The Itanium changes resulted in an enormous gain at higher concurrency levels. Yeah, that was my problem with it also: I couldn't measure enough gain to convince me it was a real effect, and not an artifact of the specific machine being tested. It occurred to me though that we already know that pgbench itself is a bottleneck in tests like this, and there's an easy way to take it out of the picture: move the selects into a plpgsql function that iterates multiple times per client query. The attached testing script reduces the client interaction costs by a thousandfold compared to plain pgbench -S, and also takes parse/plan time out of the loop, so that it becomes easier to see the effects of contention. With this script, I still see a loss of 1% or so from adding the unlocked test in TAS_SPIN at moderate contention levels, but there's a very clear jump when the machine is saturated. So this convinces me that Manabu-san's results are reproducible, and I've committed the TAS_SPIN addition. 
git head as of this morning, on 8-core IBM 8406-71Y:

pgbench -c 1 -j 1 -f bench.script -T 300 bench
tps = 50.142878 (including connections establishing)
pgbench -c 2 -j 1 -f bench.script -T 300 bench
tps = 97.179234 (including connections establishing)
pgbench -c 8 -j 4 -f bench.script -T 300 bench
tps = 341.731982 (including connections establishing)
pgbench -c 16 -j 8 -f bench.script -T 300 bench
tps = 402.114111 (including connections establishing)
pgbench -c 32 -j 16 -f bench.script -T 300 bench
tps = 371.338156 (including connections establishing)
pgbench -c 64 -j 32 -f bench.script -T 300 bench
tps = 359.785087 (including connections establishing)
pgbench -c 96 -j 48 -f bench.script -T 300 bench
tps = 363.879950 (including connections establishing)
pgbench -c 128 -j 64 -f bench.script -T 300 bench
tps = 376.794934 (including connections establishing)

after re-adding TAS_SPIN macro:

pgbench -c 1 -j 1 -f bench.script -T 300 bench
tps = 50.182676 (including connections establishing)
pgbench -c 2 -j 1 -f bench.script -T 300 bench
tps = 96.751910 (including connections establishing)
pgbench -c 8 -j 4 -f bench.script -T 300 bench
tps = 327.108510 (including connections establishing)
pgbench -c 16 -j 8 -f bench.script -T 300 bench
tps = 395.425611 (including connections establishing)
pgbench -c 32 -j 16 -f bench.script -T 300 bench
tps = 444.291852 (including connections establishing)
pgbench -c 64 -j 32 -f bench.script -T 300 bench
tps = 486.151168 (including connections establishing)
pgbench -c 96 -j 48 -f bench.script -T 300 bench
tps = 496.379981 (including connections establishing)
pgbench -c 128 -j 64 -f bench.script -T 300 bench
tps = 494.058124 (including connections establishing)

For Itanium, I was able to find some fairly official-looking documentation that said this is how you should do it. It would be nice to find something similar for PPC64, instead of testing every machine and reinventing the wheel ourselves.
You are aware that our spinlock code is pretty much verbatim from the PPC ISA spec, no? The issue here is that the official documentation has been a moving target over the decades the ISA has been in existence. regards, tom lane

#! /bin/sh

psql bench <<'EOF'
create or replace function bench(scale int, reps int) returns void
language plpgsql as $$
declare
  naccounts int := 100000 * scale;
  v_aid int;
  v_abalance int;
begin
  for i in 1 .. reps loop
    v_aid := round(random() * naccounts + 1);
    SELECT abalance INTO v_abalance FROM pgbench_accounts WHERE aid = v_aid;
  end loop;
end;
$$;
EOF

cat >bench.script <<'EOF'
select bench(:scale, 1000);
EOF

# warm the caches a bit
pgbench -c 1 -j 1 -f bench.script -T 30 bench >/dev/null

echo pgbench -c 1 -j 1 -f bench.script -T 300 bench
pgbench -c 1 -j 1 -f bench.script -T 300 bench | grep including
echo pgbench -c 2 -j 1 -f bench.script -T 300 bench
pgbench -c 2 -j 1 -f bench.script -T 300 bench | grep including
echo pgbench -c 8 -j 4 -f bench.script -T 300 bench
pgbench -c 8 -j 4 -f bench.script -T 300 bench | grep including
echo pgbench -c 16 -j 8 -f bench.script -T 300 bench
pgbench -c 16 -j 8 -f bench.script -T 300 bench | grep including
echo pgbench -c 32 -j 16 -f bench.script -T 300 bench
pgbench -c 32 -j 16 -f bench.script -T 300 bench | grep including
echo pgbench -c 64 -j 32 -f bench.script -T 300 bench
pgbench -c 64 -j 32 -f bench.script -T 300 bench | grep including
echo pgbench -c 96 -j 48 -f bench.script -T 300 bench
pgbench -c 96 -j 48 -f bench.script -T 300 bench | grep including
echo pgbench -c 128 -j 64 -f bench.script -T 300 bench
pgbench -c 128 -j 64 -f bench.script -T 300 bench | grep including

-- Sent via pgsql-hackers mailing list
[HACKERS] [patch] Improve documentation around FreeBSD Kernel Tuning
Hi, I have a patch that improves the documentation for FreeBSD Kernel Tuning:

- Show a # prompt instead of $ to indicate a root shell is needed
- Remove the -w flag to sysctl since it is not needed anymore and just silently ignored
- Encourage the user to set the read-only sysctls in /boot/loader.conf, instead of setting them manually in the loader.

I have put these in a github fork of the repo, but I am new to git. So I apologize if this is incorrect. https://github.com/so14k/postgres/commit/12c03bdb2967346e7ad9ce0bdd3db8dfcf81507e Regards, Brad Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [patch] Improve documentation around FreeBSD Kernel Tuning
On 01/03/2012 04:49 PM, Brad Davis wrote: Hi, I have a patch that improves the documentation for FreeBSD Kernel Tuning: - Show a # prompt instead of $ to indicate a root shell is needed - Remove the -w flag to sysctl since it is not needed anymore and just silently ignored - Encourage the user to set the read-only sysctls in /boot/loader.conf, instead of setting them manually in the loader. I have put these in a github fork of the repo, but I am new to git. So I apologize if this is incorrect. https://github.com/so14k/postgres/commit/12c03bdb2967346e7ad9ce0bdd3db8dfcf81507e Instead of a URL, please just email us the diff as an attachment. Normally we prefer these in context diff format, although it doesn't matter so much for such a small patch. See cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] spinlocks on powerpc
On Tue, Jan 3, 2012 at 4:17 PM, Tom Lane t...@sss.pgh.pa.us wrote: For Itanium, I was able to find some fairly official-looking documentation that said this is how you should do it. It would be nice to find something similar for PPC64, instead of testing every machine and reinventing the wheel ourselves. You are aware that our spinlock code is pretty much verbatim from the PPC ISA spec, no? The issue here is that the official documentation has been a moving target over the decades the ISA has been in existence. I wasn't aware of that, but I think my basic point still stands: it's gonna be painful if we have to test large numbers of different PPC boxes to figure all this out... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
Simon Riggs si...@2ndquadrant.com wrote: v2 of checksum patch, using a conditional copy if checksumming is enabled, so locking is removed. Thanks to Andres for thwacking me with the cluestick, though I have used a simple copy rather than a copy-and-calc. Tested using make installcheck with parameter on/off, then restart and vacuumdb to validate all pages. Reviews, objections, user interface tweaks all welcome. I'm happy with how this looks, except (as noted in a code comment) that there seems to be room for optimization of the calculation itself. Details below: (1) I like the choice of Fletcher-16. It should be very good at detecting problems while being a lot less expensive than an official CRC calculation. The fact that it was developed at Lawrence Livermore Labs and has been subjected to peer review in the IEEE Transactions on Communications generates confidence in the technique. According to what I've read, though, the technique is conceptually based around the modulus of the sums. Doing the modulus calculation for each addition is often done to prevent overflow, but if we define sum1 and sum2 as uint I don't see how we can get an overflow with 8k bytes, so I suggest we declare both of these local variables as uint and leave the modulus to the end. It can't hurt to leave off 16000 division operations per checksum generation. (2) I'm not sure about doing this in three parts, to skip the checksum itself and the hole in the middle of the page. Is this because the hole might not have predictable data? Why would that matter, as long as it is read back the same? Is it expected that this will reduce the cost by checking fewer bytes? That we could tolerate the read of an actual corrupted disk sector in the middle of a page if it didn't contain any data? If we can set the checksum to zero before starting, might it be faster to load four bytes at a time, and iterate fewer times? (Like I said, I'm not sure about any of this, but it seemed worth asking the questions.)
(3) Rather than having PageSetVerificationInfo() use memcpy, followed by pass through the copied data to calculate the checksum, might it be better to have a copy and calculate version of the function (as in VMware's original patch) to save an extra pass over the page image? Other than these performance tweaks around the calculation phase, I didn't spot any problems. I beat up on it a bit on a couple machines without hitting any bugs or seeing any regression test failures. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] grants vs. inherited tables
On Mon, Jan 02, 2012 at 12:31:13PM +0100, Dimitri Fontaine wrote: Marko Kreen mark...@gmail.com writes: I tried to generalize a function that creates partitions for a table and found out it's impossible to do it for grants. Basically, what I want is a child table that takes its grants from the parent table. IMHO quite a reasonable request. But I don't see a way to do it in pl/pgsql. (Writing a parser in plpgsql for aclitemout() output does not count.) We solved that manually in https://github.com/slardiere/PartMgr, maybe you will find it useful for pre-9.2 releases. See function partition.grant() and partition.setgrant() in part_api.sql. Thanks, that's interesting. Here is my current state: https://github.com/markokr/skytools/blob/master/sql/dispatch/create_partition.sql which uses info-schema for grants, which seems nicer than parsing, but still not as nice as inheriting them all. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Jan 3, 2012, at 4:21 PM, Kevin Grittner wrote: (2) I'm not sure about doing this in three parts, to skip the checksum itself and the hole in the middle of the page. Is this because the hole might not have predictable data? Why would that matter, as long as it is read back the same? IMO not checksumming the free space would be a really bad idea. It's entirely possible to have your hardware crapping on your free space, and I'd still want to know that that was happening. Now, it would be interesting if the free space could be checksummed separately, since there's no reason to refuse to read the page if only the free space is screwed up... But given the choice, I'd rather get an error when the free space is corrupted and be forced to look into things rather than blissfully ignore corrupted free space only to be hit later with real data loss. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?
On Jan 3, 2012, at 12:11 PM, Simon Riggs wrote: This could well be related to the fact that DropRelFileNodeBuffers() does a scan of shared_buffers, which is an O(N) approach no matter the size of the index. On top of that, taking what Robert Haas mentioned on another thread, InvalidateBuffer currently calls StrategyFreeBuffer(), which waits for an ExclusiveLock on the BufFreelistLock. On a busy system this will be heavily contended, so adding blocks to the freelist only if the lock is free seems warranted. Couldn't we just leave the buffers alone? Once an index is dropped and that's pushed out through the catalog then nothing should be trying to access them and they'll eventually just get aged out. In fact, IIRC the function that scans for buffers actually checks to see if a rel still exists before it returns the buffer... -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [patch] Improve documentation around FreeBSD Kernel Tuning
On Tue, Jan 03, 2012 at 05:02:57PM -0500, Andrew Dunstan wrote: On 01/03/2012 04:49 PM, Brad Davis wrote: Hi, I have a patch that improves the documentation for FreeBSD Kernel Tuning: - Show a # prompt instead of $ to indicate a root shell is needed - Remove the -w flag to sysctl since it is not needed anymore and just silently ignored - Encourage the user to set the read-only sysctls in /boot/loader.conf, instead of setting them manually in the loader. I have put these in a github fork of the repo, but I am new to git. So I apologize if this is incorrect. https://github.com/so14k/postgres/commit/12c03bdb2967346e7ad9ce0bdd3db8dfcf81507e Instead of a URL, please just email us the diff as an attachment. Normally we prefer these in context diff format, although it doesn't matter so much for such a small patch. Sorry about that.. it is attached below. Thanks, Brad Davis

diff against doc/src/sgml/runtime.sgml

797,799c797,799
< <prompt>$</prompt> <userinput>sysctl -w kern.ipc.shmall=32768</userinput>
< <prompt>$</prompt> <userinput>sysctl -w kern.ipc.shmmax=134217728</userinput>
< <prompt>$</prompt> <userinput>sysctl -w kern.ipc.semmap=256</userinput>
---
> <prompt>#</prompt> <userinput>sysctl kern.ipc.shmall=32768</userinput>
> <prompt>#</prompt> <userinput>sysctl kern.ipc.shmmax=134217728</userinput>
> <prompt>#</prompt> <userinput>sysctl kern.ipc.semmap=256</userinput>
807,815c807,815
< <command>sysctl</command> is concerned, but can be changed
< before boot using the <command>loader</command> prompt:
< <screen>
< <prompt>(loader)</prompt> <userinput>set kern.ipc.semmni=256</userinput>
< <prompt>(loader)</prompt> <userinput>set kern.ipc.semmns=512</userinput>
< <prompt>(loader)</prompt> <userinput>set kern.ipc.semmnu=256</userinput>
< </screen>
< Similarly these can be saved between reboots in
< <filename>/boot/loader.conf</filename>.
---
> <command>sysctl</command> is concerned, but can be set in
> <filename>/boot/loader.conf</filename>:
> <programlisting>
> kern.ipc.semmni=256
> kern.ipc.semmns=512
> kern.ipc.semmnu=256
> </programlisting>
> After modifying these values a reboot is required for the new
> settings to take effect.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Jan 3, 2012, at 11:15 AM, Robert Haas wrote: So you don't think a freelist is worth having, but you want a list of allocation targets. What is the practical difference? I think that our current freelist is practically useless, because it is almost always empty, and the cases where it's not empty (startup, and after a table or database drop) are so narrow that we don't really get any benefit out of having it. However, I'm not opposed to the idea of a freelist in general: I think that if we actually put in some effort to keep the freelist in a non-empty state it would help a lot, because backends would then have much less work to do at buffer allocation time. This is exactly what the FreeBSD VM system does (which is at least one of the places where the idea of a clock sweep for PG came from ages ago). There is a process that does nothing but attempt to keep X amount of memory on the free list, where it can immediately be grabbed by anything that needs memory. Pages on the freelist are guaranteed to be clean (as in not dirty), but not zero'd. In fact, IIRC if a page on the freelist gets referenced again it can be pulled back out of the free list and put back into an active state. The one downside I see to this is that we'd need some heuristic to determine how many buffers we want to keep on the free list. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] controlling the location of server-side SSL files
On mån, 2012-01-02 at 23:42 -0500, Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: On mån, 2012-01-02 at 15:47 +0100, Magnus Hagander wrote: Were you thinking one option pointing to a directory or one option per file? One option per file: That seems like serious overkill. Why not one option specifying the directory? What use-case is there for letting them be in different directories, let alone letting the DBA choose random names for each one?

Several consistency reasons:

- We provide the same per-file options in libpq.
- Indeed, if you use something like dblink or plproxy, these might even point to the same files.
- We also have settings for hba_file and ident_file that point to a file.
- All other daemons with SSL support known to me, such as mail and web servers, have per-file options.

Also some practical reasons:

- Yes, choosing random names is important. We have local naming schemes. And I would rather have a descriptive file name for the CA than having them all named root.crt, and if I want to know which one it is I have to look inside the file.
- You might not want all of the files in the same directory. CA and CRL might live elsewhere, for example.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?
Jim Nasby j...@nasby.net writes: On Jan 3, 2012, at 12:11 PM, Simon Riggs wrote: This could well be related to the fact that DropRelFileNodeBuffers() does a scan of shared_buffers, which is an O(N) approach no matter the size of the index. Couldn't we just leave the buffers alone? Once an index is dropped and that's pushed out through the catalog then nothing should be trying to access them and they'll eventually just get aged out. No, we can't, because if they're still dirty then the bgwriter would first try to write them to the no-longer-existing storage file. It's important that we kill the buffers immediately during relation drop. I'm still thinking that it might be sufficient to mark the buffers invalid and let the clock sweep find them, thereby eliminating the need for a freelist. Simon is after a different solution involving getting rid of the clock sweep, but he has failed to explain how that's not going to end up being the same type of contention-prone coding that we got rid of by adopting the clock sweep, some years ago. Yeah, the sweep takes a lot of spinlocks, but that only matters if there is contention for them, and the sweep approach avoids the need for a centralized data structure. (BTW, do we have a separate clock sweep hand for each backend? If not, there might be some low hanging fruit there.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: pg_regress: Replace exit_nicely() with exit() plus atexit() hook
On mån, 2012-01-02 at 17:27 -0500, Andrew Dunstan wrote: On 01/02/2012 04:37 PM, Peter Eisentraut wrote: On mån, 2012-01-02 at 15:55 -0500, Andrew Dunstan wrote: On 01/02/2012 03:12 PM, Peter Eisentraut wrote: pg_regress: Replace exit_nicely() with exit() plus atexit() hook This appears to have broken the buildfarm. I think you mean it has caused the build to fail on some buildfarm members. AFAICT, the buildfarm itself is still intact. This is hardly the first use of this idiom. But it's about as helpful as a bug report saying help it's broken. I think there is some room for improvement there: - Why is the isolation test not part of check-world/installcheck-world? The buildfarm does not use the -world targets, for several reasons, including: That was not my question. I run check-world/installcheck-world locally, and if the isolation test had been part of this, this problem wouldn't have happened. - Could we get the buildfarm server to send out emails whenever a build fails? This facility has been there for years Cool, maybe it could be advertised on buildfarm.postgresql.org. I couldn't see any link leading from there to anywhere near the mailing lists. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [patch] Improve documentation around FreeBSD Kernel Tuning
On 01/03/2012 06:15 PM, Brad Davis wrote: On Tue, Jan 03, 2012 at 05:02:57PM -0500, Andrew Dunstan wrote: On 01/03/2012 04:49 PM, Brad Davis wrote: Hi, I have a patch that improves the documentation for FreeBSD Kernel Tuning: - Show a # prompt instead of $ to indicate a root shell is needed - Remove the -w flag to sysctl since it is not needed anymore and just silently ignored - Encourage the user to set the read-only sysctls in /boot/loader.conf, instead of setting them manually in the loader. I have put these in a github fork of the repo, but I am new to git. So I apologize if this is incorrect. https://github.com/so14k/postgres/commit/12c03bdb2967346e7ad9ce0bdd3db8dfcf81507e Instead of a URL, please just email us the diff as an attachment. Normally we prefer these in context diff format, although it doesn't matter so much for such a small patch. Sorry about that.. it is attached below. But you didn't :-) You just cut and pasted it. And if you're not going to send a context diff, it should be a unidiff such as git normally produces. Since I didn't actually tell you that, I've made a context diff for you, and it's attached. I'll let someone with more FBSD-fu than me actually comment on it. cheers andrew

*** a/doc/src/sgml/runtime.sgml
--- b/doc/src/sgml/runtime.sgml
*************** options SEMMNS=240
*** 794,802 ****
  <command>loader</command> interfaces.  The following
  parameters can be set using <command>sysctl</command>:
  <screen>
! <prompt>$</prompt> <userinput>sysctl -w kern.ipc.shmall=32768</userinput>
! <prompt>$</prompt> <userinput>sysctl -w kern.ipc.shmmax=134217728</userinput>
! <prompt>$</prompt> <userinput>sysctl -w kern.ipc.semmap=256</userinput>
  </screen>
  To have these settings persist over reboots, modify
  <filename>/etc/sysctl.conf</filename>.
--- 794,802 ----
  <command>loader</command> interfaces.  The following
  parameters can be set using <command>sysctl</command>:
  <screen>
! <prompt>#</prompt> <userinput>sysctl kern.ipc.shmall=32768</userinput>
! <prompt>#</prompt> <userinput>sysctl kern.ipc.shmmax=134217728</userinput>
! <prompt>#</prompt> <userinput>sysctl kern.ipc.semmap=256</userinput>
  </screen>
  To have these settings persist over reboots, modify
  <filename>/etc/sysctl.conf</filename>.
*************** options SEMMNS=240
*** 804,818 ****
  <para>
  The remaining semaphore settings are read-only as far as
! <command>sysctl</command> is concerned, but can be changed
! before boot using the <command>loader</command> prompt:
! <screen>
! <prompt>(loader)</prompt> <userinput>set kern.ipc.semmni=256</userinput>
! <prompt>(loader)</prompt> <userinput>set kern.ipc.semmns=512</userinput>
! <prompt>(loader)</prompt> <userinput>set kern.ipc.semmnu=256</userinput>
! </screen>
! Similarly these can be saved between reboots in
! <filename>/boot/loader.conf</filename>.
  </para>

  <para>
--- 804,818 ----
  <para>
  The remaining semaphore settings are read-only as far as
! <command>sysctl</command> is concerned, but can be set in
! <filename>/boot/loader.conf</filename>:
! <programlisting>
! kern.ipc.semmni=256
! kern.ipc.semmns=512
! kern.ipc.semmnu=256
! </programlisting>
! After modifying these values a reboot is required for the new
! settings to take effect.
  </para>

  <para>

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: pg_regress: Replace exit_nicely() with exit() plus atexit() hook
On 01/03/2012 06:29 PM, Peter Eisentraut wrote: I think there is some room for improvement there: - Why is the isolation test not part of check-world/installcheck-world? The buildfarm does not use the -world targets, for several reasons, including: That was not my question. I run check-world/installcheck-world locally, and if the isolation test had been part of this, this problem wouldn't have happened. I have no idea why. It was probably an oversight when the isolation tests were added. I guess something like this would fix it? diff --git a/GNUmakefile.in b/GNUmakefile.in index 50fae41..5976832 100644 --- a/GNUmakefile.in +++ b/GNUmakefile.in @@ -68,7 +68,7 @@ check installcheck installcheck-parallel: $(call recurse,check-world,src/test src/pl src/interfaces/ecpg contrib,check) -$(call recurse,installcheck-world,src/test src/pl src/interfaces/ecpg contrib,installcheck) +$(call recurse,installcheck-world,src/test src/test/isolation src/pl src/interfaces/ecpg contrib,installcheck) $(call recurse,maintainer-check,doc src config contrib) The isolation tests only run installcheck, not plain check. See the archives for details of why. - Could we get the buildfarm server to send out emails whenever a build fails? This facility has been there for years Cool, maybe it could be advertised on buildfarm.postgresql.org. I couldn't see any link leading from there to anywhere near the mailing lists. OK, I'll look at providing links from the web site to the lists. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [patch] Improve documentation around FreeBSD Kernel Tuning
Andrew Dunstan and...@dunslane.net writes: Since I didn't actually tell you that, I've made a context diff for you, and it's attached. I'll let someone with more FBSD-fu than me actually comment on it. I have no FBSD-fu whatever, but the question this patch raises in my mind is whether /boot/loader.conf exists in every version of FBSD. If not, we probably need to say something like do this in versions >= whatever, and do the other in versions before that. Likewise, has it always been true that -w is unnecessary? For other systems such as Mac OS X, we have recommendations covering quite ancient OS releases, and I don't see why we'd not hold the FreeBSD section to the same standard. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?
On Jan 3, 2012, at 5:28 PM, Tom Lane wrote: Jim Nasby j...@nasby.net writes: On Jan 3, 2012, at 12:11 PM, Simon Riggs wrote: This could well be related to the fact that DropRelFileNodeBuffers() does a scan of shared_buffers, which is an O(N) approach no matter the size of the index. Couldn't we just leave the buffers alone? Once an index is dropped and that's pushed out through the catalog then nothing should be trying to access them and they'll eventually just get aged out. No, we can't, because if they're still dirty then the bgwriter would first try to write them to the no-longer-existing storage file. It's important that we kill the buffers immediately during relation drop. I'm still thinking that it might be sufficient to mark the buffers invalid and let the clock sweep find them, thereby eliminating the need for a freelist. Simon is after a different solution involving getting rid of the clock sweep, but he has failed to explain how that's not going to end up being the same type of contention-prone coding that we got rid of by adopting the clock sweep, some years ago. Yeah, the sweep takes a lot of spinlocks, but that only matters if there is contention for them, and the sweep approach avoids the need for a centralized data structure. Yeah, but the problem we run into is that with every backend trying to run the clock on its own we end up with high contention again... it's just in a different place than when we had a true LRU. The clock sweep might be cheaper than the linked list was, but it's still awfully expensive. I believe our best bet is to have a free list that is actually useful in normal operations, and then optimize the cost of pulling buffers out of that list as much as possible (and let the bgwriter deal with keeping enough pages in that list to satisfy demand). Heh, it occurs to me that the SQL analogy for how things work right now is that backends currently have to run a SeqScan (or 5) to find a free page... 
what we need to do is CREATE INDEX free ON buffers(buffer_id) WHERE count = 0;. (BTW, do we have a separate clock sweep hand for each backend? If not, there might be some low hanging fruit there.) No... having multiple clock hands is an interesting idea, but I'm worried that it could potentially get us into trouble if scores of backends were suddenly decrementing usage counts all over the place. For example, what if 5 backends all had their hands in basically the same place, all pointing at a very heavily used buffer. All 5 backends go for free space, they each grab the spinlock on that buffer in succession and suddenly this highly used buffer that started with a count of 5 has now been freed. We could potentially use more than one hand, but I think the relation between the number of hands and the maximum usage count has to be tightly controlled. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
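The multiple-hands hazard Jim describes is easier to see with a toy model. The sketch below is a deliberate simplification with invented names (it is not PostgreSQL's actual StrategyGetBuffer() code, and it ignores pins, locks, and dirty state): each buffer carries a usage count saturating at 5, and a victim search advances the hand, decrementing counts until it finds a zero.

```python
# Toy model of the clock-sweep victim search discussed above.
# Hypothetical simplification; real buffers also have pin counts,
# spinlocks, and dirty bits.
MAX_USAGE_COUNT = 5

class ClockSweep:
    def __init__(self, nbuffers):
        self.usage = [0] * nbuffers
        self.hand = 0

    def touch(self, buf):
        # A buffer access bumps the usage count, saturating at the cap.
        self.usage[buf] = min(self.usage[buf] + 1, MAX_USAGE_COUNT)

    def get_victim(self):
        # Sweep forward, decrementing nonzero counts, until a buffer
        # with usage count 0 is found; that buffer is the victim.
        while True:
            buf = self.hand
            self.hand = (self.hand + 1) % len(self.usage)
            if self.usage[buf] == 0:
                return buf
            self.usage[buf] -= 1
```

With one hand, a buffer at count 5 survives five full sweeps; with five uncoordinated hands parked just before it, five back-to-back victim searches can drain it to zero almost instantly, which is exactly the risk raised above.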
Re: [HACKERS] information schema/aclexplode doesn't know about default privileges
On Jan 1, 2012, at 10:43 PM, Peter Eisentraut wrote: I figured the best and most flexible way to address this is to export acldefault() as an SQL function and replace aclexplode(proacl) with aclexplode(coalesce(proacl, acldefault('f', proowner))) It would be nice to provide a convenience function that does the coalesce for you. End users sometimes need this stuff as well as info_schema. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?
Jim Nasby j...@nasby.net writes: Yeah, but the problem we run into is that with every backend trying to run the clock on its own we end up with high contention again... it's just in a different place than when we had a true LRU. The clock sweep might be cheaper than the linked list was, but it's still awfully expensive. I believe our best bet is to have a free list that is actually useful in normal operations, and then optimize the cost of pulling buffers out of that list as much as possible (and let the bgwriter deal with keeping enough pages in that list to satisfy demand). Well, maybe, but I think the historical evidence suggests that that approach will be a loser, simply because no matter how cheap, the freelist will remain a centralized and heavily contended data structure. IMO we need to be looking for a mechanism that has no single point of contention, and modifying the clock sweep rules looks like the best way to get there. Still, he who wants to do the work can try whatever approach he feels like. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page Checksums
On Dec 28, 2011, at 3:31 AM, Simon Riggs wrote: On Wed, Dec 28, 2011 at 9:00 AM, Robert Haas robertmh...@gmail.com wrote: What I'm not too clear about is whether a 16-bit checksum meets the needs of people who want checksums. We need this now, hence the gymnastics to get it into this release. 16-bits of checksum is way better than zero bits of checksum, probably about a million times better (numbers taken from papers quoted earlier on effectiveness of checksums). The strategy I am suggesting is 16-bits now, 32/64 later. What about allowing for an initdb option? That means that if you want binary compatibility so you can pg_upgrade then you're stuck with 16 bit checksums. If you can tolerate replicating all your data then you can get more robust checksumming. In either case, it seems that we're quickly approaching the point where we need to start putting resources into binary page upgrading... -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
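For concreteness, here is what a 16-bit per-page checksum amounts to. Fletcher-16 is used purely as an illustration — it is not the algorithm proposed in the patch or ultimately adopted by PostgreSQL — but it shows how a small checksum still catches single-byte corruption of an 8 kB page image.

```python
# Illustrative 16-bit checksum (Fletcher-16) over a page image.
# NOT PostgreSQL's page-checksum algorithm; shown only to make the
# "16 bits is way better than 0 bits" point concrete.
def fletcher16(data: bytes) -> int:
    sum1 = sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1

page = bytes(8192)                 # an all-zeros 8 kB page
corrupted = b"\x01" + page[1:]     # flip one byte
```

A 16-bit value misses a random corruption with probability on the order of 1 in 65536, versus certainty of missing it with no checksum at all, which is the rough "about a million times better" framing (the exact figure depends on the error model and algorithm).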
Re: [HACKERS] [patch] Improve documentation around FreeBSD Kernel Tuning
On Tue, Jan 03, 2012 at 06:43:52PM -0500, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: Since I didn't actually tell you that I've made a context diff for you, and it's attached. I'll let someone with more FBSD-fu than me actually comment on it. I have no FBSD-fu whatever, but the question this patch raises in my mind is whether /boot/loader.conf exists in every version of FBSD. If not, we probably need to say something like do this in versions = whatever, and do the other in versions before that. Likewise, has it always been true that -w is unnecessary? For other systems such as Mac OS X, we have recommendations covering quite ancient OS releases, and I don't see why we'd not hold the FreeBSD section to the same standard. Well.. The man page appeared somewhere between FreeBSD 3.0 and 4.0.. and 4.0 was released March 14, 2000. Regards, Brad Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Setting -Werror in CFLAGS
During the talk "How To Get Your PostgreSQL Patch Accepted" during PgCon last year, I raised the idea of making a -Werror build option easily available. I think it was Robert that pointed out that the problem with that was that there is a warning due to an upstream Flex bug that we can't do anything about. He was right about that; I subsequently tried fairly hard to get the Flex guys to fix it a few months back, but it seems that Flex isn't that well maintained, so even though I convinced someone to write a patch to fix it, it ended up going nowhere. Looking back through the commit log, I can see that this is an old problem. I thought that we could get most of the benefit of a -Werror option by excluding the relevant class of error in CFLAGS; gcc allows for this (pity it doesn't have MSVC's much more granular ability to discriminate against diagnostic messages). This should work: ./configure --prefix=/home/peter/pgsql CFLAGS="-Werror -Wno-error=unused-but-set-variable" However, it does not (with or without the -Wno-error): checking whether getpwuid_r takes a fifth argument... no checking whether strerror_r returns int... no checking test program... ok checking whether long int is 64 bits... no checking whether long long int is 64 bits... no configure: error: Cannot find a working 64-bit integer type. Obviously it breaks this one configure test. However, I got it to work in 2 minutes by hacking up config.log. It would be nice if someone could fix the test (and possibly later ones) so that it doesn't produce a warning/error when invoking the compiler, and it finds a 64-bit int type as expected. That way, it could sort of become folk wisdom that you should build Postgres with those flags during development, and maybe even, on some limited basis, on the build farm, and we'd all be slightly better off. 
Given that many people build with CFLAGS="-O0 -g" on a day-to-day basis, which doesn't show some warnings, the folk wisdom might end up being: Just before you submit your patch, see how an -O2 -Werror build goes. There is an entry in the parser Makefile to support this sort of use, which makes it look like my -Wno-error=unused-but-set-variable use would be redundant if it worked: # Latest flex causes warnings in this file. ifeq ($(GCC),yes) gram.o: CFLAGS += -Wno-error endif (although we could discriminate better within the parser here by using my flag). As if to prove my point, I found this warning which I didn't previously notice, just from 5 minutes of playing around: hashovfl.c: In function ‘_hash_freeovflpage’: hashovfl.c:394:10: error: variable ‘bucket’ set but not used [-Werror=unused-but-set-variable] cc1: all warnings being treated as errors (plus others) Yes, I know that these only appeared in GCC 4.6+ and as such are a relatively recent phenomenon, but there has been some effort to eliminate them, and if I could get a non-hacked -Werror build I'd feel happy enough about excluding them as already outlined. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_internal.init and an index file have the same inode
I'm not sure if this is an XFS problem, or Postgres. There's enough suspicious evidence that it's too hard to say. Today, I get an interesting issue raised whereby a reasonably simple query fails on a system that does take successful pg_dumps regularly. To make a short story shorter, I end up with the following situation on an index relation file: $ sha1sum 16587.8 pg_internal.init 4a0f94285c182b67175d6a68669c6b9fd46fa11e 16587.8 4a0f94285c182b67175d6a68669c6b9fd46fa11e pg_internal.init And, even more interestingly, $ stat 16587.8 pg_internal.init File: `16587.8' Size: 98532 Blocks: 200IO Block: 4096 regular file Device: fe00h/65024dInode: 1073741952 Links: 1 Access: (0600/-rw---) Uid: ( 107/postgres) Gid: ( 115/postgres) Access: 2011-12-14 23:35:35.630043643 + Modify: 2011-12-14 23:35:35.630043643 + Change: 2011-12-14 23:35:35.630043643 + File: `pg_internal.init' Size: 98532 Blocks: 200IO Block: 4096 regular file Device: fe00h/65024dInode: 1073741952 Links: 1 Access: (0600/-rw---) Uid: ( 107/postgres) Gid: ( 115/postgres) Access: 2011-12-14 23:35:35.630043643 + Modify: 2011-12-14 23:35:35.630043643 + Change: 2011-12-14 23:35:35.630043643 + Most notably, the inode numbers are the same. At first, I thought this was a file descriptor race in PG, but then I noticed the file system only reports *one* link: that doesn't look like a valid state for XFS. Should I raise this on their mailing list, and would pgsql-hackers like to know more about this? Also, consider this an advisory, as this was found on a vanilla Ubuntu Lucid 10.04 machine. As such, I do think this probably could have happened to a non-index relation file. This database has had uptime since mid-December, and that inode seems to have been modified at the time of the last reboot. The database was born in October via point-in-time recovery. ...Happy New Year! 
-- fdr -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] our buffer replacement strategy is kind of lame
On Tue, Jan 3, 2012 at 6:22 PM, Jim Nasby j...@nasby.net wrote: On Jan 3, 2012, at 11:15 AM, Robert Haas wrote: So you don't think a freelist is worth having, but you want a list of allocation targets. What is the practical difference? I think that our current freelist is practically useless, because it is almost always empty, and the cases where it's not empty (startup, and after a table or database drop) are so narrow that we don't really get any benefit out of having it. However, I'm not opposed to the idea of a freelist in general: I think that if we actually put in some effort to keep the freelist in a non-empty state it would help a lot, because backends would then have much less work to do at buffer allocation time. This is exactly what the FreeBSD VM system does (which is at least one of the places where the idea of a clock sweep for PG came from ages ago). There is a process that does nothing but attempt to keep X amount of memory on the free list, where it can immediately be grabbed by anything that needs memory. Pages on the freelist are guaranteed to be clean (as in not dirty), but not zero'd. In fact, IIRC if a page on the freelist gets referenced again it can be pulled back out of the free list and put back into an active state. The one downside I see to this is that we'd need some heuristic to determine how many buffers we want to keep on the free list. Fortuitously, I believe the background writer already has most of the necessary logic: it attempts to predict how many buffers are about to be needed - I think based on a decaying average. Actually, I think that logic could use some improvement, because I believe I've heard Greg Smith comment that it's often necessary to tune bgwriter_delay downward. It'd be nice to make the delay adaptive somehow, to avoid the need for manual tuning (and unnecessary wake-ups when the system goes idle). But possibly the existing logic is good enough for a first cut. 
However, in the interest of full disclosure, I'll admit that I've done no testing in this area at all and am talking mostly out of my posterior. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
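The "decaying average" prediction mentioned above can be sketched as an exponential moving average over recent allocation counts. This is a hedged illustration, not the bgwriter's actual code in BgBufferSync(); the smoothing factor is an arbitrary assumption.

```python
# Sketch of predicting upcoming buffer demand with a decaying
# (exponential moving) average, in the spirit of the bgwriter
# logic discussed above.  The 0.25 smoothing factor is an
# assumption for illustration, not PostgreSQL's constant.
def update_estimate(prev_estimate, recent_allocs, smoothing=0.25):
    # Move the estimate a fraction of the way toward the latest
    # observation; old observations decay geometrically.
    return prev_estimate + smoothing * (recent_allocs - prev_estimate)

est = 0.0
for allocs in [100, 100, 100, 100]:   # four intervals of steady demand
    est = update_estimate(est, allocs)
# est converges toward 100 as demand stays steady
```

In the free-list variant, the background process would then try to keep roughly `est` clean buffers queued up ahead of the backends' requests for the next interval.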
Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?
On Tue, Jan 3, 2012 at 7:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jim Nasby j...@nasby.net writes: Yeah, but the problem we run into is that with every backend trying to run the clock on it's own we end up with high contention again... it's just in a different place than when we had a true LRU. The clock sweep might be cheaper than the linked list was, but it's still awfully expensive. I believe our best bet is to have a free list that is actually useful in normal operations, and then optimize the cost of pulling buffers out of that list as much as possible (and let the bgwriter deal with keeping enough pages in that list to satisfy demand). Well, maybe, but I think the historical evidence suggests that that approach will be a loser, simply because no matter how cheap, the freelist will remain a centralized and heavily contended data structure. IMO we need to be looking for a mechanism that has no single point of contention, and modifying the clock sweep rules looks like the best way to get there. Still, he who wants to do the work can try whatever approach he feels like. It might be possible to partition the free list. So imagine, for example, 8 free lists. The background writer runs the clock sweep and finds some buffers that are about ready to be reallocated and distributes one-eighth of them to each free list. Then, when a backend wants to allocate a buffer, it picks a free list somehow (round robin?) and pulls a buffer off it. If the buffer turns out to have been used since it was added to the free list, we give up on it and try again. This hopefully shouldn't happen too often, though, as long as we only add enough buffers to the free list to satisfy the requests that we expect to get over the next small-fraction-of-a-second. Of course you have to think about what happens if you find that your chosen free list is empty. 
In that case you probably want to try a different one, and if in the worst case where they're all empty, run the clock sweep in the foreground. You probably also want to kick the background writer in the pants if even a single one is empty (or nearly empty?) and tell it to hurry up and add some more buffers to the freelist. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
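Robert's partitioned scheme might be sketched like this. All names here are hypothetical; a real implementation would live in the buffer manager and would also have to handle locking, the kick-the-bgwriter signal, and the used-since-freed recheck he mentions.

```python
# Sketch of the partitioned free-list idea described above:
# the background writer refills N lists; backends pick one
# round-robin and fall back to the next list when theirs is empty.
from collections import deque

class PartitionedFreeList:
    def __init__(self, nlists=8):
        self.lists = [deque() for _ in range(nlists)]
        self.next_list = 0

    def refill(self, buffers):
        # Background writer distributes reclaimed buffers evenly
        # across the partitions.
        for i, buf in enumerate(buffers):
            self.lists[i % len(self.lists)].append(buf)

    def get_buffer(self):
        # Round-robin choice; on an empty list, try the others.
        for attempt in range(len(self.lists)):
            idx = (self.next_list + attempt) % len(self.lists)
            if self.lists[idx]:
                self.next_list = (idx + 1) % len(self.lists)
                return self.lists[idx].popleft()
        # All partitions empty: in the real design, the caller would
        # run the clock sweep in the foreground here.
        return None
```

The point of the partitioning is that contention on any one list's lock is divided by the number of partitions, at the cost of the occasional empty-list fallback.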
Re: [HACKERS] SEGFAULT on SELECT * FROM view
On Mon, Jan 2, 2012 at 1:42 AM, chris r. chri...@gmx.net wrote: I ported the entire schema to my test DB server and could not reproduce the error there. Note that probably recreating the view solves this issue. Given this, how should I proceed to create a test case? Any tutorial on this? (I'm not too familiar with all this yet.) It's possibly statistics-dependent; make sure you have the same stats targets on both DBs, and try re-analyzing a few times. Check other planner parameters are the same, too. In addition to the schema, I now also copied parts of the data in our production system to the testing DB. (I cannot copy all data, as it's too large for the testing DB.) After a couple of VACUUM FULL ANALYZEs, the bug still doesn't reproduce. In our production environment, I'm working with the re-created view that doesn't show the SEGFAULT behavior. I can live with this workaround, as only one particular view seems to be affected, but if you want me to debug more on the error please let me know. It would be awfully nice to figure out what was going on. Another thing that would help, if you can still reproduce the error, would be to get a stack trace. The easiest way would probably be to enable core dumps (pg_ctl -c start, or ulimit -c unlimited followed by pg_ctl start). Then, when you get a core, which might end up inside $PGDATA or perhaps elsewhere depending on your system configuration, do: gdb /path/to/postgres /path/to/corefile and then from within gdb: bt ...to get a backtrace. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add SPI results constants available for PL/*
On Tue, Jan 3, 2012 at 9:33 AM, Pavel Stehule pavel.steh...@gmail.com wrote: I'd suppose it interesting to add a table to pg_catalog containing this data. - it is useless overhead I tend to agree. I expect that defining some constants in Perl or Python is simple Presumably one could instead write a script to transform the list of constants into a .pm file that could be loaded into the background, or whatever PL/python's equivalent of that concept is. Not sure if there's a better way to do it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
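A generator along the lines Robert suggests could be as simple as scraping the header. A hedged sketch follows — the regex mirrors the #define layout shown in the original post, and a small inlined sample stands in for the real src/include/executor/spi.h:

```python
import re

# Sketch: scrape the SPI_* status codes out of spi.h text so a PL
# could expose them as named constants.  The sample header below is
# inlined for illustration instead of reading the real file.
def parse_spi_constants(header_text):
    pattern = re.compile(r'#define\s+(SPI_\S+)\s+\(?(-?\d+)\)?')
    return {name: int(value) for name, value in pattern.findall(header_text)}

header = """
#define SPI_OK_CONNECT 1
#define SPI_OK_FINISH 2
#define SPI_ERROR_CONNECT (-1)
"""
constants = parse_spi_constants(header)
```

The resulting dict could then be emitted as a .pm file for PL/Perl, injected into PL/Python's `plpy` module namespace, or whatever each PL's equivalent is.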
Re: [HACKERS] 16-bit page checksums for 9.2
On Fri, Dec 30, 2011 at 11:58 AM, Jeff Janes jeff.ja...@gmail.com wrote: On 12/29/11, Ants Aasma ants.aa...@eesti.ee wrote: Unless I'm missing something, double-writes are needed for all writes, not only the first page after a checkpoint. Consider this sequence of events: 1. Checkpoint 2. Double-write of page A (DW buffer write, sync, heap write) 3. Sync of heap, releasing DW buffer for new writes. ... some time goes by 4. Regular write of page A 5. OS writes one part of page A 6. Crash! Now recovery comes along, page A is broken in the heap with no double-write buffer backup nor anything to recover it by in the WAL. Isn't 3 the very definition of a checkpoint, meaning that 4 is not really a regular write as it is the first one after a checkpoint? I think you nailed it. But it doesn't seem safe to me replace a page from the DW buffer and then apply WAL to that replaced page which preceded the age of the page in the buffer. That's what LSNs are for. If we write the page to the checkpoint buffer just once per checkpoint, recovery can restore the double-written versions of the pages and then begin WAL replay, which will restore all the subsequent changes made to the page. Recovery may also need to do additional double-writes if it encounters pages that for which we wrote WAL but never flushed the buffer, because a crash during recovery can also create torn pages. When we reach a restartpoint, we fsync everything down to disk and then nuke the double-write buffer. Similarly, in normal running, we can nuke the double-write buffer at checkpoint time, once the fsyncs are complete. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
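The LSN rule Robert describes — restore the double-written copy first, then replay only WAL records newer than the page — can be illustrated with invented structures. This is a toy sketch, not PostgreSQL recovery code; real pages and WAL records are binary and carry much more state.

```python
# Toy illustration of LSN-gated recovery with a double-write buffer:
# step 1 repairs possibly-torn pages from their double-written
# copies, step 2 replays WAL, skipping records the restored page
# already contains.  All data structures are invented for the example.
def recover(pages, double_writes, wal):
    # Step 1: overwrite (possibly torn) pages with the copies saved
    # in the double-write buffer at the last checkpoint.
    for page_id, saved in double_writes.items():
        pages[page_id] = dict(saved)
    # Step 2: a WAL record applies only if the page is older than
    # the record (page LSN < record LSN); older records are no-ops.
    for rec in wal:
        page = pages[rec["page"]]
        if page["lsn"] < rec["lsn"]:
            page["data"] = rec["data"]
            page["lsn"] = rec["lsn"]
    return pages
```

Note how a WAL record predating the double-written image is skipped by the LSN test, which is why restoring a page version newer than some WAL records is safe.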
Re: [HACKERS] Setting -Werror in CFLAGS
On Tue, Jan 3, 2012 at 7:39 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: Yes, I know that these only appeared in GCC 4.6+ and as such are a relatively recent phenomenon, but there has been some effort to eliminate them, and if I could get a non-hacked -Werror build I'd feel happy enough about excluding them as already outlined. I just do this: echo COPT=-Werror > src/Makefile.custom ...which seems to work reasonably well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: pg_regress: Replace exit_nicely() with exit() plus atexit() hook
On Tue, Jan 3, 2012 at 6:29 PM, Peter Eisentraut pete...@gmx.net wrote: On mån, 2012-01-02 at 17:27 -0500, Andrew Dunstan wrote: On 01/02/2012 04:37 PM, Peter Eisentraut wrote: On mån, 2012-01-02 at 15:55 -0500, Andrew Dunstan wrote: On 01/02/2012 03:12 PM, Peter Eisentraut wrote: pg_regress: Replace exit_nicely() with exit() plus atexit() hook This appears to have broken the buildfarm. I think you mean it has caused the build to fail on some buildfarm members. AFAICT, the buildfarm itself is still intact. This is hardly the first use of this idiom. But it's about as helpful as a bug report saying help it's broken. I don't see why. Normally you can just go to buildfarm.postgresql.org and see which machines are failing and at what stage, and the view the stage logs to see the specific errors. It's not the best web interface I've ever seen, but it's not *that* bad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] controlling the location of server-side SSL files
On Tue, Jan 3, 2012 at 6:25 PM, Peter Eisentraut pete...@gmx.net wrote: On mån, 2012-01-02 at 23:42 -0500, Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: On mån, 2012-01-02 at 15:47 +0100, Magnus Hagander wrote: Were you thinking one option pointing to a directory or one option per file? One option per file: That seems like serious overkill. Why not one option specifying the directory? What use-case is there for letting them be in different directories, let alone letting the DBA choose random names for each one? [ reasons ] I agree with these reasons. We don't get charged $0.50 per GUC, so there's no particular reason to contort things to have fewer of them. It's nice where it's possible, of course, but not when it makes people contort things to support the way we think our users should lay out their filesystem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add SPI results constants available for PL/*
On 01/03/2012 08:40 PM, Robert Haas wrote: On Tue, Jan 3, 2012 at 9:33 AM, Pavel Stehule pavel.steh...@gmail.com wrote: I'd suppose it interesting to add a table to pg_catalog containing this data. - it is useless overhead I tend to agree. I expect that defining some constants in Perl or Python is simple Presumably one could instead write a script to transform the list of constants into a .pm file that could be loaded into the background, or whatever PL/python's equivalent of that concept is. Not sure if there's a better way to do it. Yeah, I'm with you and Pavel. Here's my quick perl one-liner to produce a set of SPI_* constants for pl/perl. I'm looking at the best way to include this in the bootstrap code. perl -ne 'BEGIN { print "use constant\n{\n"; } END { print "};\n"; } print "\t$1 => $2,\n" if /#define (SPI_\S+)\s+\(?(-?\d+)\)?/;' src/include/executor/spi.h cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ordering op for WHERE
On Tue, Dec 27, 2011 at 10:48 PM, YAMAMOTO Takashi y...@mwd.biglobe.ne.jp wrote: does it make sense to teach the planner (and the executor?) to use an ordering op to optimize queries like the following? select * from t where a <-> 1000 < 10 Seems useful to me. I'm not sure how hard it is, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: pg_regress: Replace exit_nicely() with exit() plus atexit() hook
On 01/03/2012 09:00 PM, Robert Haas wrote: Normally you can just go to buildfarm.postgresql.org and see which machines are failing and at what stage, and the view the stage logs to see the specific errors. It's not the best web interface I've ever seen, but it's not *that* bad. And if someone wants to improve it they are welcome. The code is at https://github.com/PGBuildFarm/server-code. It helps if you know Template Toolkit :-) cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Standalone synchronous master
On Tue, Dec 27, 2011 at 6:39 AM, Alexander Björnhagen alex.bjornha...@gmail.com wrote: And so we get back to the three likelihoods in our two-node setup : 1.The master fails - Okay, promote the standby 2.The standby fails - Okay, the system still works but you no longer have data redundancy. Deal with it. 3.Both fail, together or one after the other. It seems to me that if you are happy with #2, you don't really need to enable sync rep in the first place. At any rate, even without multiple component failures, this configuration makes it pretty easy to lose durability (which is the only point of having sync rep in the first place). Suppose the NIC card on the master is the failing component. If it happens to drop the TCP connection to the clients just before it drops the connection to the standby, the standby will have all the transactions, and you can fail over just fine. If it happens to drop the TCP connection to the standby just before it drops the connection to the clients, the standby will not have all the transactions, and failover will lose some transactions - and presumably you enabled this feature in the first place precisely to prevent that sort of occurrence. I do think that it might be useful to have this if there's a configurable timeout involved - that way, people could say, well, I'm OK with maybe losing transactions if the standby has been gone for X seconds. But if the only possible behavior is equivalent to a zero-second timeout I don't think it's too useful. It's basically just going to lead people to believe that their data is more secure than it really is, which IMHO is not helpful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Setting -Werror in CFLAGS
Robert Haas robertmh...@gmail.com writes: On Tue, Jan 3, 2012 at 7:39 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: Yes, I know that these only appeared in GCC 4.6+ and as such are a relatively recent phenomenon, but there has been some effort to eliminate them, and if I could get a non-hacked -Werror build I'd feel happy enough about excluding them as already outlined. I just do this: echo COPT=-Werror > src/Makefile.custom ...which seems to work reasonably well. I see no point in -Werror whatsoever. If you aren't examining the make output for warnings, you're not following proper development practice IMO. gcc is not the only tool we use in the build process, so if you are relying on -Werror to call attention to everything you should be worrying about, you lost already. I'm also less than thrilled with the idea that whatever the gcc boys decide to make a warning tomorrow will automatically become a MUST FIX NOW for us. If you don't see why this is a problem, try building any PG release more than a few months old on latest and greatest gcc. (Of note here, latest-and-greatest is changing again this week, at least for Fedora, and I fully expect 4.7 to start whinging about things we never heard of before.) Moving-target warnings are one thing, moving-target hard errors are another thing entirely. Personally I tend to do something like make -j4 >make.out 2>make.err; cat make.err regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_internal.init and an index file have the same inode
Daniel Farina dan...@heroku.com writes: And, even more interestingly:

$ stat 16587.8 pg_internal.init
  File: `16587.8'
  Size: 98532      Blocks: 200        IO Block: 4096   regular file
Device: fe00h/65024d    Inode: 1073741952  Links: 1
Access: (0600/-rw-------)  Uid: ( 107/postgres)   Gid: ( 115/postgres)
Access: 2011-12-14 23:35:35.630043643 +
Modify: 2011-12-14 23:35:35.630043643 +
Change: 2011-12-14 23:35:35.630043643 +
  File: `pg_internal.init'
  Size: 98532      Blocks: 200        IO Block: 4096   regular file
Device: fe00h/65024d    Inode: 1073741952  Links: 1
Access: (0600/-rw-------)  Uid: ( 107/postgres)   Gid: ( 115/postgres)
Access: 2011-12-14 23:35:35.630043643 +
Modify: 2011-12-14 23:35:35.630043643 +
Change: 2011-12-14 23:35:35.630043643 +

Most notably, the inode numbers are the same. At first, I thought this was a file descriptor race in PG, but then I noticed the file system only reports *one* link: that doesn't look like a valid state for XFS. Yeah. In any case it wouldn't be a PG bug, because we don't issue any link(2) calls AFAIR. File an XFS bug. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
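[Editor's note] The aliasing Daniel describes can be checked mechanically: two paths are hard links to one another exactly when they report the same device and inode. A minimal POSIX sketch (same_inode is a made-up helper for illustration, not a PostgreSQL function):

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Report whether two paths name the same inode on the same device,
 * i.e. whether they are hard links to one another -- the state the
 * stat(1) output above shows for 16587.8 and pg_internal.init.
 * Returns 1 (same), 0 (different), or -1 if either stat() fails.
 */
static int
same_inode(const char *a, const char *b)
{
	struct stat sa,
				sb;

	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return -1;
	return (sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino) ? 1 : 0;
}
```

Note that a legitimate pair of hard links also reports st_nlink == 2 in each stat; the output above shows matching inodes with Links: 1, which is what makes the state look corrupt rather than like a stray link(2) call.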
Re: [HACKERS] controlling the location of server-side SSL files
Robert Haas robertmh...@gmail.com writes: On Tue, Jan 3, 2012 at 6:25 PM, Peter Eisentraut pete...@gmx.net wrote: [ reasons ] I agree with these reasons. We don't get charged $0.50 per GUC, so there's no particular reason to contort things to have fewer of them. Well, there definitely is a distributed cost to each additional GUC. Peter's given what are probably adequate reasons to add several of them here, but that doesn't mean we should not ask the question whether each new GUC is really necessary. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SQL:2011 features
I assume you guys know where to go to get the complete SQL:2011 spec: http://www.iso.org/iso/search.htm?qt=9075&searchSubmit=Search&sort=rel&type=simple&published=true But if time/money is an issue the following seems to be the best publicly available description of the temporal features: http://metadata-standards.org/Document-library/Documents-by-number/WG2-N1501-N1550/WG2_N1536_koa046-Temporal-features-in-SQL-standard.pdf If/as you all make progress with this we would love to hear about it in the Temporal Data group on LinkedIn, where we are discussing temporal data in general and the SQL:2011 extensions in particular. http://www.linkedin.com/groups?home=&gid=3885228 We have had some excellent input from Jeff Davis in the past. Craig -- View this message in context: http://postgresql.1045698.n5.nabble.com/SQL-2011-features-tp5114317p5116875.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?
On Jan 3, 2012, at 7:34 PM, Robert Haas wrote: On Tue, Jan 3, 2012 at 7:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Jim Nasby j...@nasby.net writes: Yeah, but the problem we run into is that with every backend trying to run the clock on its own we end up with high contention again... it's just in a different place than when we had a true LRU. The clock sweep might be cheaper than the linked list was, but it's still awfully expensive. I believe our best bet is to have a free list that is actually useful in normal operations, and then optimize the cost of pulling buffers out of that list as much as possible (and let the bgwriter deal with keeping enough pages in that list to satisfy demand). Well, maybe, but I think the historical evidence suggests that that approach will be a loser, simply because no matter how cheap, the freelist will remain a centralized and heavily contended data structure. IMO we need to be looking for a mechanism that has no single point of contention, and modifying the clock sweep rules looks like the best way to get there. Still, he who wants to do the work can try whatever approach he feels like. It might be possible to partition the free list. So imagine, for example, 8 free lists. The background writer runs the clock sweep and finds some buffers that are about ready to be reallocated and distributes one-eighth of them to each free list. Then, when a backend wants to allocate a buffer, it picks a free list somehow (round robin?) and pulls a buffer off it. If the buffer turns out to have been used since it was added to the free list, we give up on it and try again. This hopefully shouldn't happen too often, though, as long as we only add enough buffers to the free list to satisfy the requests that we expect to get over the next small-fraction-of-a-second. Of course you have to think about what happens if you find that your chosen free list is empty. 
In that case you probably want to try a different one, and, in the worst case where they're all empty, run the clock sweep in the foreground. You probably also want to kick the background writer in the pants if even a single one is empty (or nearly empty?) and tell it to hurry up and add some more buffers to the freelist. If it comes down to it, we can look at partitioning the free list. But here's the thing: this is the strategy that FreeBSD (and I think now Linux as well) uses to service memory requests, be they for free memory or for reading data from disk. If it's good enough for an OS to use, I would expect we could make it work as well. I would expect that pulling a page off the free list would be an extremely fast operation... lock the read lock on the list (we should probably have separate locks for putting stuff on the list vs. taking it off), pin the buffer (which shouldn't be contentious), update the read pointer and drop the read lock. Even in C code that's probably less than 100 machine instructions, and no looping. Compared to the cost of running a clock sweep it should be significantly faster. So while there will be contention on the free list read lock, none of that contention should last very long. Do we have any other places where we take extremely short locks like this and still run into problems? -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
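[Editor's note] The partitioned-free-list scheme Robert and Jim describe can be sketched as below. This is a deliberately toy, single-threaded model (no locks, fixed-size stacks, a bare counter standing in for the clock sweep); every name here is invented for illustration and bears no relation to the actual bufmgr code.

```c
#include <assert.h>
#include <stdbool.h>

#define NFREELISTS 8
#define NBUFFERS   64

/*
 * Toy model of partitioned buffer free lists: the bgwriter pushes
 * reusable buffer ids onto the lists, and allocating backends pop from
 * them round-robin, falling back to a foreground "clock sweep" (here
 * just a counter) when every list turns out to be empty.
 */
typedef struct
{
	int			ids[NBUFFERS];
	int			top;			/* number of entries on this list */
} FreeList;

static FreeList freelists[NFREELISTS];
static int	next_list = 0;		/* round-robin cursor for allocation */
static int	clock_hand = 0;		/* stand-in for the foreground sweep */

/* bgwriter side: distribute a reusable buffer onto one of the lists */
static void
freelist_push(int list, int buf)
{
	FreeList   *fl = &freelists[list % NFREELISTS];

	fl->ids[fl->top++] = buf;
}

/* backend side: try one list; returns false if it was empty */
static bool
freelist_pop(int list, int *buf)
{
	FreeList   *fl = &freelists[list % NFREELISTS];

	if (fl->top == 0)
		return false;
	*buf = fl->ids[--fl->top];
	return true;
}

/* Allocate a buffer id, per the scheme described in the thread. */
static int
alloc_buffer(void)
{
	int			buf;
	int			i;

	for (i = 0; i < NFREELISTS; i++)
	{
		int			list = (next_list + i) % NFREELISTS;

		if (freelist_pop(list, &buf))
		{
			next_list = (list + 1) % NFREELISTS;
			return buf;
		}
	}
	/* worst case: every list empty, run the clock sweep ourselves */
	return clock_hand++ % NBUFFERS;
}
```

The contention argument in the thread is about what the real version of freelist_pop would cost under a lock: the pop itself is a handful of instructions, but the lock protecting each list is still a shared point of contention, just one that is held very briefly and split eight ways.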
[HACKERS] PL/Perl Does not Like vstrings
Is this perhaps by design? Oy, this doesn’t look good: $ do LANGUAGE plperl $$ elog(NOTICE, $^V) $$; ERROR: server conn crashed? ERROR: server conn crashed? The connection to the server was lost. Attempting reset: Succeeded. (pgxn@localhost:5900) 06:44:42 [pgxn] $ Best, David
[HACKERS] BGWriter latch, power saving
As part of the ongoing effort to reduce wake-ups (and thus power consumption) when idle, the attached patch modifies the background writer to hibernate in ten second bursts once the bgwriter laps the clock sweep. It's fairly well commented, so a description of how it works here would probably be redundant. The most controversial aspect of this patch is the fact that it adds a SetLatch() call to MarkBufferDirty(), though I went to some lengths to ensure that that call will very probably find the latch already set when it actually matters, so it only checks a single flag. Thoughts? I can produce benchmarks, if that helps, but I think it's fairly unlikely that the patch introduces a measurable performance regression. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
new file mode 100644
index e3ae92d..0fcaf01
*** a/src/backend/bootstrap/bootstrap.c
--- b/src/backend/bootstrap/bootstrap.c
*** AuxiliaryProcessMain(int argc, char *arg
*** 416,421 ****
--- 416,422 ----
  		case BgWriterProcess:
  			/* don't set signals, bgwriter has its own agenda */
+ 			ProcGlobal->bgwriterLatch = &MyProc->procLatch;		/* Expose */
  			BackgroundWriterMain();
  			proc_exit(1);		/* should never return */
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
new file mode 100644
index fc1a579..a5248ba
*** a/src/backend/port/unix_latch.c
--- b/src/backend/port/unix_latch.c
***************
*** 50,55 ****
--- 50,56 ----
  #include "miscadmin.h"
  #include "postmaster/postmaster.h"
  #include "storage/latch.h"
+ #include "storage/proc.h"
  #include "storage/shmem.h"

  /* Are we currently in WaitLatch? The signal handler would like to know.
 */
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
new file mode 100644
index 1f8d2d6..ad8a056
*** a/src/backend/postmaster/bgwriter.c
--- b/src/backend/postmaster/bgwriter.c
***************
*** 51,56 ****
--- 51,58 ----
  #include "storage/ipc.h"
  #include "storage/lwlock.h"
  #include "storage/pmsignal.h"
+ #include "storage/proc.h"
+ #include "storage/procsignal.h"
  #include "storage/shmem.h"
  #include "storage/smgr.h"
  #include "storage/spin.h"
*** static volatile sig_atomic_t shutdown_re
*** 73,78 ****
--- 75,82 ----
  /*
   * Private state
   */
+ #define BGWRITER_HIBERNATE_MS 1
+ 
  static bool am_bg_writer = false;

  /* Prototypes for private functions */
*** BackgroundWriterMain(void)
*** 123,129 ****
  	pqsignal(SIGQUIT, bg_quickdie);		/* hard crash time */
  	pqsignal(SIGALRM, SIG_IGN);
  	pqsignal(SIGPIPE, SIG_IGN);
! 	pqsignal(SIGUSR1, SIG_IGN);			/* reserve for ProcSignal */
  	pqsignal(SIGUSR2, SIG_IGN);

  	/*
--- 127,133 ----
  	pqsignal(SIGQUIT, bg_quickdie);		/* hard crash time */
  	pqsignal(SIGALRM, SIG_IGN);
  	pqsignal(SIGPIPE, SIG_IGN);
! 	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
  	pqsignal(SIGUSR2, SIG_IGN);

  	/*
*** BackgroundWriterMain(void)
*** 238,278 ****
  	for (;;)
  	{
  		/*
! 		 * Emergency bailout if postmaster has died. This is to avoid the
! 		 * necessity for manual cleanup of all postmaster children.
  		 */
! 		if (!PostmasterIsAlive())
! 			exit(1);
! 
! 		if (got_SIGHUP)
  		{
! 			got_SIGHUP = false;
! 			ProcessConfigFile(PGC_SIGHUP);
! 			/* update global shmem state for sync rep */
  		}
! 		if (shutdown_requested)
  		{
  			/*
! 			 * From here on, elog(ERROR) should end with exit(1), not send
! 			 * control back to the sigsetjmp block above
  			 */
! 			ExitOnAnyError = true;
! 			/* Normal exit from the bgwriter is here */
! 			proc_exit(0);		/* done */
! 		}
! 
! 		/*
! 		 * Do one cycle of dirty-buffer writing.
! 		 */
! 		BgBufferSync();
! 
! 		/* Nap for the configured time. */
! 		BgWriterNap();
  	}
  }

  /*
   * BgWriterNap -- Nap for the configured time or until a signal is received.
   */
  static void
  BgWriterNap(void)
--- 242,337 ----
  	for (;;)
  	{
  		/*
! 		 * Do one cycle of dirty-buffer writing, potentially hibernating if
! 		 * there have been no buffers to write.
  		 */
! 		if (!BgBufferSync())
  		{
! 			/* Clock sweep was not lapped - nap for the configured time. */
! 			BgWriterNap();
  		}
! 		else
  		{
+ 			int			res = 0;

  			/*
! 			 * Initiate hibernation by arming the process latch. This usage
! 			 * differs from the standard pattern for latches outlined in
! 			 * latch.h, where the latch is generally reset immediately after
! 			 * WaitLatch returns, in the next iteration of an infinite loop.
! 			 *
! 			 * It should only be possible to *really* set the latch from client
! 			 * backends while the bgwriter is idle, and not during productive
! 			 * cycles where buffers are written, or shortly thereafter. It's
! 			 * important that the SetLatch() call within the buffer manager
! 			 * usually inexpensively returns having found that the latch is
! 			 * already
Re: [HACKERS] 16-bit page checksums for 9.2
Kevin Grittner wrote: if we define sum1 and sum2 as uint I don't see how we can get an overflow with 8k bytes I feel the need to amend that opinion. While sum1 only needs to hold a maximum of (BLCKSZ * 255), which would be adequate for a BLCKSZ up to 16 MB, sum2 needs to hold a maximum of ((BLCKSZ * (BLCKSZ + 1) / 2) * 255), so a 32-bit unsigned isn't even good for an 8 KB block size. A uint64 for sum2 can handle BLCKSZ up to 256 MB. So we could define sum1 as uint (which is presumably either uint32 or uint64) and sum2 as uint64, and for a BLCKSZ up to 16 MB we still don't need to find the modulo 255 numbers until we're done adding things up. Does anyone think we need to support anything bigger than that? Making them both uint64 would support a BLCKSZ up to 256 MB without needing to do the division inside the loop, but it might be slightly slower on 32-bit builds. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
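[Editor's note] Kevin's arithmetic can be checked directly. The sketch below is a generic Fletcher-style sum with the mod-255 reduction deferred to the end of the block; it is an illustration of the overflow bound under discussion, not the exact checksum algorithm proposed in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192

/*
 * Fletcher-style checksum with the mod-255 reduction deferred to the
 * end.  sum1 is bounded by len * 255, but sum2 accumulates sum1 at
 * every byte, so its worst case is (len * (len + 1) / 2) * 255 --
 * which already overflows a 32-bit unsigned at len = 8192.  Hence
 * uint64 for sum2, as Kevin notes.
 */
static uint16_t
block_checksum(const uint8_t *data, size_t len)
{
	uint64_t	sum1 = 0;
	uint64_t	sum2 = 0;
	size_t		i;

	for (i = 0; i < len; i++)
	{
		sum1 += data[i];		/* max: len * 255 */
		sum2 += sum1;			/* max: len * (len + 1) / 2 * 255 */
	}
	sum1 %= 255;
	sum2 %= 255;
	return (uint16_t) ((sum2 << 8) | sum1);
}
```

For an 8 KB block of all-0xFF bytes, sum2 peaks at 8,557,424,640 - just past the 4,294,967,295 ceiling of uint32, which is exactly why a 32-bit sum2 falls short even at the default block size.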
Re: [HACKERS] BGWriter latch, power saving
On 04.01.2012 07:58, Peter Geoghegan wrote: As part of the ongoing effort to reduce wake-ups when idle/power consumption, the attached patch modifies the background writer to hibernate in ten second bursts once the bgwriter laps the clock sweep. It's fairly well commented, so a description of how it works here would probably be redundant. The most controversial aspect of this patch is the fact that it adds a SetLatch() call to MarkBufferDirty(), though I went to some lengths to ensure that that call will very probably find the latch already set when it actually matters, so it only checks a single flag. I think SetBufferCommitInfoNeedsSave() needs the same treatment as MarkBufferDirty(). And it would probably be good to only set the latch if the buffer wasn't dirty already. Setting a latch that's already set is fast, but surely it's even faster to not even try. Thoughts? I can produce benchmarks, if that helps, but I think it's fairly unlikely that the patch introduces a measurable performance regression. Yeah, I'd like to see a micro-benchmark of a worst-case scenario. I'm a bit worried about the impact on systems with a lot of CPUs. If you have a lot of CPUs writing to the same cache line that contains the latch's flag, that might get expensive. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
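[Editor's note] Heikki's "don't even try" suggestion amounts to guarding the SetLatch() call on both the clean-to-dirty transition and the latch flag. A toy model of that guard, with simplified stand-in types rather than the real bufmgr/latch structures:

```c
#include <assert.h>
#include <stdbool.h>

#define BM_DIRTY 0x01

/* Simplified stand-ins for the real Latch and buffer-header types. */
typedef struct
{
	volatile bool is_set;
	int			wakeups;		/* counts trips through the slow path */
} Latch;

typedef struct
{
	int			flags;
} BufferDesc;

static void
SetLatchCounted(Latch *latch)
{
	latch->is_set = true;
	latch->wakeups++;			/* models the expensive cross-CPU write */
}

/*
 * Sketch of the guarded wake-up: only poke the bgwriter's latch on a
 * clean->dirty transition, and even then only when the latch flag is
 * not already set, so the common case writes to no shared cache line.
 */
static void
mark_buffer_dirty_sketch(BufferDesc *buf, Latch *bgwriter_latch)
{
	bool		was_dirty = (buf->flags & BM_DIRTY) != 0;

	buf->flags |= BM_DIRTY;
	if (!was_dirty && !bgwriter_latch->is_set)
		SetLatchCounted(bgwriter_latch);
}
```

The point of Heikki's concern survives even with the guard: on a many-CPU box, the read of is_set itself touches a cache line that the bgwriter writes, which is why a worst-case micro-benchmark is worth running.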