Re: [HACKERS] Speed dblink using alternate libpq tuple storage
Hello, This is a new version of the patch formerly known as 'alternative storage for libpq'. - Changed the concept from 'Storage handler' to 'Alternative Row Processor'. Symbol names are also changed. - The callback function is modified following the comment. - Due to time constraints, I did only minimal checking of this patch. The purpose of this patch is to show the new implementation. - Performance is not measured for this patch for the same reason. I will do that next Monday. - The meaning of PGresAttValue is changed. The field 'value' now contains a value withOUT a terminating zero. This change seems to have no effect on any other part of the whole PostgreSQL source tree from what I've seen. I would like to propose a better one-shot API with: void *(*RowStoreHandler)(PGresult *res, PGresAttValue *columns); ... 1) Pass-through processing does not need to care about unnecessary per-row allocations. 2) Handlers that want a copy of the row (like regular libpq) can optimize allocations by having a global view of the row. (E.g. one allocation for row header + data.) I expect the new implementation is far better than the original.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt index 1af8df6..c47af3a 100644 --- a/src/interfaces/libpq/exports.txt +++ b/src/interfaces/libpq/exports.txt @@ -160,3 +160,6 @@ PQconnectStartParams 157 PQping 158 PQpingParams 159 PQlibVersion 160 +PQregisterRowProcessor 161 +PQgetRowProcessorParam 163 +PQsetRowProcessorErrMes 164 diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index d454538..93803d5 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -2692,6 +2692,7 @@ makeEmptyPGconn(void) conn->allow_ssl_try = true; conn->wait_ssl_try = false; #endif + conn->rowProcessor = NULL; /* * We try to send at least 8K at a time, which is the usual size of pipe @@ -5076,3 +5077,10 @@ PQregisterThreadLock(pgthreadlock_t newhandler) return prev; } + +void +PQregisterRowProcessor(PGconn *conn, RowProcessor func, void *param) +{ + conn->rowProcessor = func; + conn->rowProcessorParam = param; +} diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index b743566..5d78b39 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -66,7 +66,7 @@ static PGresult *PQexecFinish(PGconn *conn); static int PQsendDescribe(PGconn *conn, char desc_type, const char *desc_target); static int check_field_number(const PGresult *res, int field_num); - +static void *pqAddTuple(PGresult *res, PGresAttValue *columns); /* * Space management for PGresult.
@@ -160,6 +160,9 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status) result->curBlock = NULL; result->curOffset = 0; result->spaceLeft = 0; + result->rowProcessor = pqAddTuple; + result->rowProcessorParam = NULL; + result->rowProcessorErrMes = NULL; if (conn) { @@ -194,6 +197,12 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status) } result->nEvents = conn->nEvents; } + + if (conn->rowProcessor) + { + result->rowProcessor = conn->rowProcessor; + result->rowProcessorParam = conn->rowProcessorParam; + } } else { @@ -445,7 +454,7 @@ PQsetvalue(PGresult *res, int tup_num, int field_num, char *value, int len) } /* add it to the array */ - if (!pqAddTuple(res, tup)) + if (pqAddTuple(res, tup) == NULL) return FALSE; } @@ -701,7 +710,6 @@ pqClearAsyncResult(PGconn *conn) if (conn->result) PQclear(conn->result); conn->result = NULL; - conn->curTuple = NULL; } /* @@ -756,7 +764,6 @@ pqPrepareAsyncResult(PGconn *conn) */ res = conn->result; conn->result = NULL; /* handing over ownership to caller */ - conn->curTuple = NULL; /* just in case */ if (!res) res = PQmakeEmptyPGresult(conn, PGRES_FATAL_ERROR); else @@ -829,12 +836,17 @@ pqInternalNotice(const PGNoticeHooks *hooks, const char *fmt,...) /* * pqAddTuple - * add a row pointer to the PGresult structure, growing it if necessary - * Returns TRUE if OK, FALSE if not enough memory to add the row + * add a row to the PGresult structure, growing it if necessary + * Returns the pointer to the new tuple if OK, NULL if not enough + * memory to add the row. */ -int -pqAddTuple(PGresult *res, PGresAttValue *tup) +void * +pqAddTuple(PGresult *res, PGresAttValue *columns) { + PGresAttValue *tup; + int nfields = res->numAttributes; + int i; + if (res->ntups >= res->tupArrSize) { /* @@ -858,13 +870,39 @@ pqAddTuple(PGresult *res, PGresAttValue *tup) newTuples = (PGresAttValue **) realloc(res->tuples, newSize * sizeof(PGresAttValue *)); if (!newTuples) - return NULL; /* malloc or realloc failed */
Re: [HACKERS] Inline Extension
On Thu, Jan 26, 2012 at 3:48 PM, David E. Wheeler da...@justatheory.com wrote: On Jan 26, 2012, at 9:40 AM, Dimitri Fontaine wrote: Not for 9.2, but I can't help thinking that if we could manage to host the .so module itself in the catalogs, we could solve updating it in a transactional way and, more importantly, host it per-database, rather than having the modules work per major version (not even per cluster) and the extension mechanism work per-database inside each cluster. But that's work for another release. +1 Cloud vendors will *love* this. Confirmed. Let me share my perspective. I'll begin by describing the current state of runtime code dependency management for comparison. In Ruby, any user can push an application to our platform which relies on any/every ruby gem ever released (give or take). These gems may require exact releases of other gems, have elaborate system dependencies, and/or include natively compiled code components. This is thanks to the rubygems.org repository, the gem system, and, recently but crucially, the bundler system for resolving and isolating dependencies. Releasing a new gem takes moments, and I have personally released a half dozen of no real consequence to the world, which I use from time to time. In contrast, the idea that any person or team of people could possibly review the literally hundreds of gems released each day is no longer plausible. The only feasible approach for providing a robust service is to engineer a system which can be operated from inside the cluster to install any library whatsoever. Our aim is, put simply, to be able to support every extension in the world, at once, under cascading replication, across major catalogue upgrades. We hope this ideal is shared by the community at large, since our problems are generally the same as other users', just writ large.
-pvh PS: As an aside, because of the many problems with in-cluster multi-tenancy (to pick just one example, resource isolation between users) I have no security concerns with giving users every ability to execute code as the cluster owner's UNIX user. On our service we do not restrict our users' access to superuser out of spite, but to reduce the available surface area for self-destruction. -- Peter van Hardenberg San Francisco, California Everything was beautiful, and nothing hurt. -- Kurt Vonnegut -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch: ALTER TABLE IF EXISTS
On 23 January 2012 20:14, Pavel Stehule pavel.steh...@gmail.com wrote: Hello 2012/1/23 Robert Haas robertmh...@gmail.com: On Tue, Jan 3, 2012 at 2:49 PM, Pavel Stehule pavel.steh...@gmail.com wrote: jup, we can continue in enhancing step by step. I changed the patch and now ALTER TABLE, ALTER INDEX, ALTER SEQUENCE and ALTER VIEW have an IF EXISTS clause ALTER FOREIGN TABLE should be parallel as well, I think. refreshed + ALTER FOREIGN TABLE IF EXISTS ... support I just noticed this copy-and-paste error in the ALTER FOREIGN TABLE docs: IF EXISTS: Do not throw an error if the sequence does not exist. A notice is issued in this case. That should be "foreign table", not "sequence". Regards, Dean
Re: [HACKERS] Inline Extension
So I'm going to prepare the next version of the patch with this design: - in catalog extension scripts for inline extension pg_extension_script(extoid, oldversion, version, script) oldversion is null when create extension is used, except when using the create extension from 'unpackaged' form Would you keep all the migration scripts used over time to upgrade from one version to another? yes, that is the idea. You can then replay the whole stack of SQL needed to go from version foo to version bar (we don't care about version format, here again). - same as we have -t, add -e --extension to pg_dump so that you can choose to dump only a given extension Also --exclude-extension? It might be the default. We need something to dump the content of pg_catalog.pg_extension_script (or whatever table is going to contain SQL code), per extension or all. -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation
Re: [HACKERS] patch: ALTER TABLE IF EXISTS
On 27.01.2012 11:57, Dean Rasheed wrote: I just noticed this copy-and-paste error in the ALTER FOREIGN TABLE docs: IF EXISTS: Do not throw an error if the sequence does not exist. A notice is issued in this case. That should be "foreign table", not "sequence". Thanks, fixed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] 16-bit page checksums for 9.2
On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales sca...@vmware.com wrote: I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite(). If there were ever another storage backend, it would have to duplicate the checksum check, right? Is there a disadvantage to putting it in smgrwrite()? The smgr and md layers don't currently know anything about the page format, and I imagine we want to keep it that way. It seems like the right place for this is in some higher layer, like bufmgr. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Group commit, revised
On 26.01.2012 04:10, Robert Haas wrote: On Wed, Jan 25, 2012 at 3:11 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Attached is a patch to do that. It adds a new mode to LWLockConditionalAcquire(), LW_EXCLUSIVE_BUT_WAIT. If the lock is free, it is acquired and the function returns immediately. However, unlike normal LWLockConditionalAcquire(), if the lock is not free it goes to sleep until it is released. But unlike normal LWLockAcquire(), when the lock is released, the function returns *without* acquiring the lock. ... I think you should break this off into a new function, LWLockWaitUntilFree(), rather than treating it as a new LWLockMode. Also, instead of adding lwWaitOnly, I would suggest that we generalize lwWaiting and lwExclusive into a field lwWaitRequest, which can be set to 1 for exclusive, 2 for shared, 3 for wait-for-free, etc. That way we don't have to add another boolean every time someone invents a new type of wait - not that that should hopefully happen very often. A side benefit of this is that you can simplify the test in LWLockRelease(): keep releasing waiters until you come to an exclusive waiter, then stop. This possibly saves one shared memory fetch and subsequent test instruction, which might not be trivial - all of this code is extremely hot. Makes sense. Attached is a new version, doing exactly that. We probably want to benchmark this carefully to make sure that it doesn't cause a performance regression even just from this: + if (!proc->lwWaitOnly) + lock->releaseOK = false; I know it sounds crazy, but I'm not 100% sure that that additional test there is cheap enough not to matter. Assuming it is, though, I kind of like the concept: I think there must be other places where these semantics are useful. Yeah, we have to be careful with any overhead in there, it can be a hot spot. I wouldn't expect any measurable difference from the above, though. Could I ask you to rerun the pgbench tests you did recently with this patch?
Or can you think of a better test for this? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com *** a/src/backend/access/transam/twophase.c --- b/src/backend/access/transam/twophase.c *** *** 327,333 MarkAsPreparing(TransactionId xid, const char *gid, proc->databaseId = databaseid; proc->roleId = owner; proc->lwWaiting = false; ! proc->lwExclusive = false; proc->lwWaitLink = NULL; proc->waitLock = NULL; proc->waitProcLock = NULL; --- 327,333 proc->databaseId = databaseid; proc->roleId = owner; proc->lwWaiting = false; ! proc->lwWaitMode = 0; proc->lwWaitLink = NULL; proc->waitLock = NULL; proc->waitProcLock = NULL; *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 2118,2140 XLogFlush(XLogRecPtr record) /* initialize to given target; may increase below */ WriteRqstPtr = record; ! /* read LogwrtResult and update local state */ { /* use volatile pointer to prevent code rearrangement */ volatile XLogCtlData *xlogctl = XLogCtl; SpinLockAcquire(xlogctl->info_lck); if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write)) WriteRqstPtr = xlogctl->LogwrtRqst.Write; LogwrtResult = xlogctl->LogwrtResult; SpinLockRelease(xlogctl->info_lck); - } ! /* done already? */ ! if (!XLByteLE(record, LogwrtResult.Flush)) ! { ! /* now wait for the write lock */ ! LWLockAcquire(WALWriteLock, LW_EXCLUSIVE); LogwrtResult = XLogCtl->Write.LogwrtResult; if (!XLByteLE(record, LogwrtResult.Flush)) { --- 2118,2160 /* initialize to given target; may increase below */ WriteRqstPtr = record; ! /* ! * Now wait until we get the write lock, or someone else does the ! * flush for us. ! */ ! for (;;) { /* use volatile pointer to prevent code rearrangement */ volatile XLogCtlData *xlogctl = XLogCtl; + /* read LogwrtResult and update local state */ SpinLockAcquire(xlogctl->info_lck); if (XLByteLT(WriteRqstPtr, xlogctl->LogwrtRqst.Write)) WriteRqstPtr = xlogctl->LogwrtRqst.Write; LogwrtResult = xlogctl->LogwrtResult; SpinLockRelease(xlogctl->info_lck); ! /* done already? */ !
if (XLByteLE(record, LogwrtResult.Flush)) ! break; ! ! /* ! * Try to get the write lock. If we can't get it immediately, wait ! * until it's released, and recheck if we still need to do the flush ! * or if the backend that held the lock did it for us already. This ! * helps to maintain a good rate of group committing when the system ! * is bottlenecked by the speed of fsyncing. ! */ ! if (!LWLockWaitUntilFree(WALWriteLock, LW_EXCLUSIVE)) ! { ! /* ! * The lock is now free, but we didn't acquire it yet. Before we ! * do, loop back to check if someone else flushed the record for ! * us already. ! */ ! continue; ! } ! /* Got the
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Thu, Jan 26, 2012 at 11:36 PM, Peter Geoghegan pe...@2ndquadrant.com wrote: I'm not surprised that you weren't able to measure a performance regression from the binary bloat. Any such regression is bound to be very small and probably quite difficult to notice most of the time; it's really the cumulative effect of many binary-size-increasing changes we're worried about, not each individual one. Certainly, trying to shrink the binary is micro-optimization at its finest, but then so is inlining comparators. I don't think it can realistically be argued that increasing the size of the binary arbitrarily will never get us in trouble, much like (for a typical American family) spending $30 to have dinner at a cheap restaurant will never break the budget. But if you do it every day, it gets expensive (and fattening). Sure. At the risk of stating the obvious, and of repeating myself, I will point out that the true cost of increasing the size of the binary is not necessarily linear - it's a complex equation. I hope that this doesn't sound flippant, but if some naive person were to look at just the increasing binary size of Postgres and its performance in each successive release, they might conclude that there was a positive correlation between the two (since we didn't add flab to the binary, but muscle that pulls its own weight and then some). I completely agree. So the point is that, when faced with a patch that adds an atypically large number of CPU instructions, we ought to ask ourselves whether those instructions are pulling their weight. By way of comparison, the latest posted version of Tom's generalized index-only paths patch adds 14kB to the resulting executable (on my Mac). Yours adds 55kB. We might ask ourselves whether the benefits of this patch are four times greater than the benefits of Tom's patch. It's pretty hard to compare them directly since they're doing very different things, and all of this is completely subjective, but I doubt it.
On the other hand, there is no rule that every byte of code that gets committed must be made of solid gold, either. So, once again: judgement call. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Group commit, revised
On Fri, Jan 27, 2012 at 8:35 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Yeah, we have to be careful with any overhead in there, it can be a hot spot. I wouldn't expect any measurable difference from the above, though. Could I ask you to rerun the pgbench tests you did recently with this patch? Or can you think of a better test for this? I can't do so immediately, because I'm waiting for Nate Boley to tell me I can go ahead and start testing on that machine again. But I will do it once I get the word. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Progress on fast path sorting, btree index creation time
Uh, obviously I meant causal relationship and not correlation. On 27 January 2012 13:37, Robert Haas robertmh...@gmail.com wrote: I completely agree. So the point is that, when faced with a patch that adds an atypically large number of CPU instructions, we ought to ask ourselves whether those instructions are pulling their weight. By way of comparison, the latest posted version of Tom's generalized index-only paths patch adds 14kB to the resulting executable (on my Mac). Yours adds 55kB. We might ask ourselves whether the benefits of this patch are four times greater than the benefits of Tom's patch. It's pretty hard to compare them directly since they're doing very different things, and all of this is completely subjective, but I doubt it. On the other hand, there is no rule that every byte of code that gets committed must be made of solid gold, either. So, once again: judgement call. Well, I don't think it's all that subjective - it's more the case that it is just difficult, or it gets that way as you consider more specialisations. As for what types/specialisations may not make the cut, I'm increasingly convinced that floats (in the following order: float4, float8) should be the first to go. Aside from the fact that we cannot use their specialisations for anything like dates and timestamps, floats are just way less useful than integers in the context of database applications, or at least those that I've been involved with. As important as floats are in the broad context of computing, it's usually only acceptable to store data in a database as floats within scientific applications, and only then when their limitations are well-understood and acceptable. I think we've all heard anecdotes at one time or another, involving their limitations not being well understood.
-- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan pe...@2ndquadrant.com wrote: Well, I don't think it's all that subjective - it's more the case that it is just difficult, or it gets that way as you consider more specialisations. Sure it's subjective. Two well-meaning people could have different opinions without either of them being wrong. If you do a lot of small, in-memory sorts, more of this stuff is going to seem worthwhile than if you don't. As for what types/specialisations may not make the cut, I'm increasingly convinced that floats (in the following order: float4, float8) should be the first to go. Aside from the fact that we cannot use their specialisations for anything like dates and timestamps, floats are just way less useful than integers in the context of database applications, or at least those that I've been involved with. As important as floats are in the broad context of computing, it's usually only acceptable to store data in a database as floats within scientific applications, and only then when their limitations are well-understood and acceptable. I think we've all heard anecdotes at one time or another, involving their limitations not being well understood. While we're waiting for anyone else to weigh in with an opinion on the right place to draw the line here, do you want to post an updated patch with the changes previously discussed? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Dry-run mode for pg_archivecleanup
On Sun, Jan 15, 2012 at 5:05 PM, Josh Kupershmidt schmi...@gmail.com wrote: On Sun, Jan 15, 2012 at 3:02 PM, Gabriele Bartolini gabriele.bartol...@2ndquadrant.it wrote: My actual intention was to have the filename as output of the command, in order to easily pipe it to another script. Hence my first choice was to use the stdout channel, considering also that pg_archivecleanup in dry-run mode is harmless and does not touch the content of the directory. Oh, right - I should have re-read your initial email before diving into the patch. That all makes sense given your intended purpose. I guess your goal of constructing some simple way to pass the files which would be removed on to another script is a little different than what I initially thought the patch would be useful for, namely as a testing/debugging aid for an admin. Perhaps both goals could be met by making use of '--debug' together with '--dry-run'. If they are both on, then an additional message like pg_archivecleanup: would remove file ... would be printed to stderr, along with just the filename printed to stdout you already have. This email thread seems to have trailed off without reaching a conclusion. The patch is marked as Waiting on Author in the CommitFest application, but I'm not sure that's accurate. Can we try to nail this down? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Caching for stable expressions with constant arguments v6
On Mon, Jan 16, 2012 at 12:06 PM, Marti Raudsepp ma...@juffo.org wrote: Here's v6 of my expression caching patch. The patch is not attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Speed dblink using alternate libpq tuple storage
On Fri, Jan 27, 2012 at 2:57 AM, Kyotaro HORIGUCHI horiguchi.kyot...@oss.ntt.co.jp wrote: Hello, This is a new version of the patch formerly known as 'alternative storage for libpq'. I took a quick look at the patch and the docs. Looks good, and I agree with the rationale and implementation. I see you covered the PQsetvalue case, which is nice. I expect libpq C api clients coded for performance will immediately gravitate to this api. - The meaning of PGresAttValue is changed. The field 'value' now contains a value withOUT a terminating zero. This change seems to have no effect on any other part of the whole PostgreSQL source tree from what I've seen. This is a minor point of concern. This structure was exposed to support libpqtypes (which your stuff complements very nicely, by the way) and I quickly confirmed removal of the null terminator didn't cause any problems there. I doubt anyone else is inspecting the structure directly (also searched the archives and didn't find anything). This needs to be advertised very loudly in the docs -- I understand why this was done, but it's a pretty big change in the way the api works. merlin
Re: [HACKERS] Speed dblink using alternate libpq tuple storage
On Fri, Jan 27, 2012 at 05:57:01PM +0900, Kyotaro HORIGUCHI wrote: Hello, This is a new version of the patch formerly known as 'alternative storage for libpq'. - Changed the concept from 'Storage handler' to 'Alternative Row Processor'. Symbol names are also changed. - The callback function is modified following the comment. - Due to time constraints, I did only minimal checking of this patch. The purpose of this patch is to show the new implementation. - Performance is not measured for this patch for the same reason. I will do that next Monday. - The meaning of PGresAttValue is changed. The field 'value' now contains a value withOUT a terminating zero. This change seems to have no effect on any other part of the whole PostgreSQL source tree from what I've seen. I think we have the general structure in place. Good. Minor notes: = rowhandler api = * It returns bool, so void* is wrong. Instead libpq style is to use int, with 1=OK, 0=Failure. Seems that was also the old pqAddTuple() convention. * Drop PQgetRowProcessorParam(), instead give param as argument. * PQsetRowProcessorErrMes() should strdup() the message. That gets rid of allocator requirements in the API. This also makes it safe to pass static strings there. If strdup() fails, fall back to a generic no-mem message. * Create a new struct to replace PGresAttValue for rowhandler usage. The RowHandler API is pretty unique and self-contained. It should have its own struct. The main reason is that it allows us to properly document it. Otherwise the minor details get lost, as they are different from libpq-internal usage. Also, this allows the two structs to be improved separately. (PGresRawValue?) * Stop storing null_value into ->value. It's a libpq internal detail. Instead, ->value should always point into the buffer where the value info is located, even for NULL. This makes it safe to simply subtract pointers to get a row size estimate. Seems pqAddTuple() already does the null_value logic, so no need to do it in the rowhandler api.
= libpq = Currently it's confusing whether rowProcessor can be NULL, and what should be done if so. I think it's better to fix usage so that it is always set. * PQregisterRowProcessor() should install the default handler if func == NULL. * Never set rowProcessor directly, always via PQregisterRowProcessor() * Drop all if(rowProcessor) checks. = dblink = * There are malloc failure checks missing in initStoreInfo() and storeHandler(). -- marko PS. You did not hear it from me, but most raw values are actually nul-terminated in the protocol. Think big-endian. And those which are not, you can make so, as the data is not touched anymore. You cannot do it for the last value, as the next byte may not be allocated. But you could memmove() it to a lower address so you can null-terminate. I'm not suggesting it for the official patch, but it would be fun to know if such a hack is benchmarkable, and benchmarkable on a realistic load.
[HACKERS] Unreliable pg_ctl -w start again
Hello, Last year, I asked for your opinions about how to fix the bug of unreliable pg_ctl -w start, as in the thread: http://archives.postgresql.org/pgsql-hackers/2011-05/msg01407.php The phenomenon was that pg_ctl -w start did not return for 60 seconds when postgresql.conf contained a wrong parameter specification. Recently, I've encountered another problem of pg_ctl -w start, which I cannot reliably avoid. I found the cause in pg_ctl.c. I'm willing to create a patch, but I'm concerned about the correctness of the fix. I hope this bug can be eliminated as soon as possible. I'd like to ask your opinions. [Problem] I use PostgreSQL 8.3.12 embedded in a packaged application. The problem occurred on RHEL5 when the operating system was starting up. The packaged application is started from /etc/init.d/myapp. That application internally executes pg_ctl -w start and checks its return value. The application does not start unless the return value is 0. The problematic phenomenon is that pg_ctl -w start fails with return value 1 in only two seconds, without waiting until 60 seconds pass. That is, -w did not work. However, the database server started successfully. The timeline was as follows: 18:09:45 the application executed pg_ctl -w start 18:09:47 pg_ctl -w start returned with 1 PostgreSQL's server log (dates are intentionally eliminated) 18:10:01 JST 22995 LOG: database system was interrupted; last known up at 2012-01-21 02:24:59 JST 18:10:32 JST 22995 LOG: database system was not properly shut down; automatic recovery in progress 18:10:34 JST 22995 LOG: record with zero length at 0/23E35D4 18:10:34 JST 22995 LOG: redo is not required 18:11:38 JST 22893 LOG: database system is ready to accept connections 18:11:38 JST 23478 LOG: autovacuum launcher started PostgreSQL took a long time to start. This is probably because the system load was high, with many processes booting up concurrently during OS boot.
[Cause] The following part in do_start() of pg_ctl.c contains a bug: if (old_pid != 0) { pg_usleep(1000000); pid = get_pgpid(); if (pid == old_pid) { write_stderr(_("%s: could not start server\nExamine the log output.\n"), progname); exit(1); } } This part assumes that postmaster will overwrite postmaster.pid within a second. This assumption is not correct under heavy load such as OS startup. In PostgreSQL 9.1, the wait processing is largely modified. However, the same assumption seems to still remain, though the duration is 5 seconds. 5 seconds of wait is probably insufficient for my case. I think no fixed duration is appropriate. [Solution] So, what is the reliable solution? The pipe-based one, which I proposed in the past thread, would be reliable. However, it is not simple enough to back-port to 8.3. How about inserting postmaster_is_alive() as below? I know this is not perfect, but it will work in most cases. I need some solution that practically helps. if (old_pid != 0) { pg_usleep(1000000); pid = get_pgpid(); if (pid == old_pid && postmaster_is_alive(pid)) { write_stderr(_("%s: could not start server\nExamine the log output.\n"), progname); exit(1); } } Regards MauMau
Re: [HACKERS] Speed dblink using alternate libpq tuple storage
On Fri, Jan 27, 2012 at 09:35:04AM -0600, Merlin Moncure wrote:
> On Fri, Jan 27, 2012 at 2:57 AM, Kyotaro HORIGUCHI wrote:
>> - The meaning of PGresAttValue is changed. The field 'value' now contains a value withOUT a terminating zero. This change seems to have no effect on any other portion within the whole source tree of postgresql from what I've seen.
>
> This is a minor point of concern. This function was exposed to support libpqtypes (which your stuff complements very nicely, by the way), and I quickly confirmed that removal of the null terminator didn't cause any problems there. I doubt anyone else is inspecting the structure directly (I also searched the archives and didn't find anything). This needs to be advertised very loudly in the docs -- I understand why this was done, but it's a pretty big change in the way the API works.

Note that the non-NUL-terminated PGresAttValue is only used for the row handler, so no existing usage is affected. But I agree that using the same struct in different situations is confusing, hence the request for a separate struct for row-handler usage.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
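Since the row-handler variant of PGresAttValue carries a length-delimited value with no terminating zero, a handler that wants a conventional C string has to copy it explicitly. A minimal sketch of that copy (the struct and function names here are hypothetical illustrations, not the patch's actual API):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of the row-handler PGresAttValue: len bytes, no NUL. */
typedef struct
{
    int         len;            /* value length in bytes, -1 for SQL NULL */
    const char *value;          /* NOT NUL-terminated in the row-handler path */
} RawAttValue;

/*
 * Copy a length-delimited column value into a freshly malloc'd,
 * NUL-terminated buffer.  Returns NULL for SQL NULL columns (and on
 * out-of-memory).  The caller must free() the result.
 */
static char *
att_to_cstring(const RawAttValue *att)
{
    char       *buf;

    if (att->len < 0)
        return NULL;            /* SQL NULL */
    buf = malloc(att->len + 1);
    if (buf == NULL)
        return NULL;            /* out of memory */
    memcpy(buf, att->value, att->len);
    buf[att->len] = '\0';
    return buf;
}
```

The key point is that strcpy()/strdup() on 'value' would read past the column's bytes; every consumer has to honor 'len'.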
Re: [HACKERS] patch for parallel pg_dump
On Sun, Jan 15, 2012 at 1:01 PM, Joachim Wieland j...@mcknight.de wrote:
> So this is the parallel pg_dump patch, generalizing the existing parallel restore and allowing parallel dumps for the directory archive format; the patch works on Windows and Unix.

This patch introduces a large amount of notational churn, as perhaps well-illustrated by this hunk:

  static int
! dumpBlobs(Archive *AHX, void *arg)
  {
+	/*
+	 * This is a data dumper routine, executed in a child for parallel backup,
+	 * so it must not access the global g_conn but AH->connection instead.
+	 */
+	ArchiveHandle *AH = (ArchiveHandle *) AHX;

It seems pretty grotty to have a static function that gets an argument of the wrong type, and then just immediately turns around and casts it to something else. It's not exactly obvious that that's even safe, especially if you don't know that ArchiveHandle is a struct whose first element is an Archive. But even if you do know that subclassing is intended, that doesn't prove that the particular Archive object is always going to be an ArchiveHandle under the hood. If it is, why not just pass it as an ArchiveHandle to begin with? I think this needs to be revised in some way.

At least in the few cases I checked, the only point of getting at the ArchiveHandle this way was to find AH->connection, which suggests to me either that AH->connection should be in the public section, inside Archive rather than ArchiveHandle, or else that we ought to just pass the connection object to this function (and all of its friends who have similar issues) as a separate argument. Either way, I think that would make this patch both cleaner and less invasive. In fact we might want to pull out just that change and commit it separately to simplify review of the remaining work.

It's not clear to me why fmtQualifiedId needs to move to dumputils.c. The way you have it, fmtQualifiedId() is now with fmtId(), but no longer with fmtCopyColumnList(), the only other similarly named function in that directory.
That may be more logical, or it may not, but rearranging the code like this makes it a lot harder to review, and I would prefer that we make such changes as separate commits if we're going to do them, so that diff can do something sensible with the changes that are integral to the patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for parallel pg_dump
On Fri, Jan 27, 2012 at 10:57 AM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 15, 2012 at 1:01 PM, Joachim Wieland j...@mcknight.de wrote: So this is the parallel pg_dump patch, generalizing the existing parallel restore and allowing parallel dumps for the directory archive format, the patch works on Windows and Unix. This patch introduces a large amount of notational churn, as perhaps well-illustrated by this hunk: static int ! dumpBlobs(Archive *AHX, void *arg) { + /* + * This is a data dumper routine, executed in a child for parallel backup, + * so it must not access the global g_conn but AH-connection instead. + */ + ArchiveHandle *AH = (ArchiveHandle *) AHX; It seems pretty grotty to have a static function that gets an argument of the wrong type, and then just immediately turns around and casts it to something else. It's not exactly obvious that that's even safe, especially if you don't know that ArchiveHandle is a struct whose first element is an Archive. But even if you do know that subclassing is intended, that doesn't prove that the particular Archive object is always going to be an ArchiveHandle under the hood. If it is, why not just pass it as an ArchiveHandle to begin with? I think this needs to be revised in some way. At least in the few cases I checked, the only point of getting at the ArchiveHandle this way was to find AH-connection, which suggests to me either that AH-connection should be in the public section, inside Archive rather than ArchiveHandle, or else that we ought to just pass the connection object to this function (and all of its friends who have similar issues) as a separate argument. Either way, I think that would make this patch both cleaner and less-invasive. In fact we might want to pull out just that change and commit it separately to simplify review of the remaining work. It's not clear to me why fmtQualifiedId needs to move to dumputils.c. 
The way you have it, fmtQualifiedId() is now with fmtId(), but no longer with fmtCopyColumnList(), the only other similarly named function in that directory. That may be more logical, or it may not, but rearranging the code like this makes it a lot harder to review, and I would prefer that we make such changes as separate commits if we're going to do them, so that diff can do something sensible with the changes that are integral to the patch. Woops, hit send a little too soon there. I'll try to make some more substantive comments after looking at this more. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] TS: Limited cover density ranking
Hello,

I have developed a variation of the cover density ranking functions that counts only covers that are smaller than a specified limit. It is useful for finding combinations of terms that appear near one another. Here is an example of usage:

-- normal cover density ranking: not changed
luben=> select ts_rank_cd(to_tsvector('a b c d e g h i j k'), to_tsquery('a & d'));
 ts_rank_cd
------------
      0.033
(1 row)

-- limited to 2
luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'), to_tsquery('a & d'));
 ts_rank_cd
------------
          0
(1 row)

luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a & d'));
 ts_rank_cd
------------
        0.1
(1 row)

-- limited to 3
luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'), to_tsquery('a & d'));
 ts_rank_cd
------------
      0.033
(1 row)

luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a & d'));
 ts_rank_cd
------------
       0.13
(1 row)

Find attached a patch against the 9.1.2 sources. I preferred to make a patch, not a separate extension, because it is only a one-statement change in the calc_rank_cd function. If I had to make an extension, a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension.

I have some questions:

1. Is it interesting to develop it further (documentation, cleanup, etc.) for inclusion in one of the next versions? If so, there are some further questions:
- should I overload ts_rank_cd (as in the examples above and the patch), or should I define a new set of functions, for example ts_rank_lcd?
- should I define these new SQL-level functions in core, or should I go only with the two-line change in calc_rank_cd() and define the new functions as an extension? If we prefer the latter, can I overload core functions with functions defined in extensions?
- and finally there is always the possibility of duplicating the code and making an independent extension.

2.
If I run the patched version on a cluster that was initialized with an unpatched server, is there a way to register the new functions in the system catalog without reinitializing the cluster?

Best regards
luben

-- 
Luben Karavelov
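The behavioral change described above can be modeled outside the backend: given the covers (spans of matching query terms) found in a document, skip any cover wider than the limit before accumulating 1/coversize. The sketch below is a toy model with made-up types, not the actual tsrank.c code; as in the patch, a limit of 0 means "no limit".

```c
/*
 * Toy model of the patched calc_rank_cd() accumulation: each cover is
 * the span [begin_pos, end_pos] of query terms in the document.  Covers
 * wider than 'limit' are skipped; the rest contribute 1/coversize.
 * limit == 0 reproduces the unpatched behavior (count every cover).
 */
typedef struct
{
    int begin_pos;
    int end_pos;
} Cover;

static float
limited_cover_rank(const Cover *covers, int ncovers, int limit)
{
    float       rank = 0.0f;
    int         i;

    for (i = 0; i < ncovers; i++)
    {
        int     width = covers[i].end_pos - covers[i].begin_pos;

        if (limit > 0 && width > limit)
            continue;           /* cover too wide: ignore it entirely */
        rank += 1.0f / (width + 1);
    }
    return rank;
}
```

The real function also applies lexeme weights and normalization, so its numbers differ from this toy model; only the "skip wide covers" filter is the point here.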
Re: [HACKERS] TS: Limited cover density ranking
And here is the patch that I forgot to attach.

> Hello,
> I have developed a variation of the cover density ranking functions that counts only covers that are smaller than a specified limit. It is useful for finding combinations of terms that appear near one another. Here is an example of usage:
> ...
> Find attached a patch against the 9.1.2 sources. I preferred to make a patch, not a separate extension, because it is only a one-statement change in the calc_rank_cd function. If I had to make an extension, a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension.

-- 
Luben Karavelov

diff -pur postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c /usr/src/postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c
--- postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c	2011-12-01 23:47:20.0 +0200
+++ /usr/src/postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c	2012-01-27 07:45:34.558028176 +0200
@@ -724,7 +724,7 @@ get_docrep(TSVector txt, QueryRepresenta
 }
 
 static float4
-calc_rank_cd(float4 *arrdata, TSVector txt, TSQuery query, int method)
+calc_rank_cd(int limit, float4 *arrdata, TSVector txt, TSQuery query, int method)
 {
 	DocRepresentation *doc;
 	int			len,
@@ -768,6 +768,9 @@ calc_rank_cd(float4 *arrdata, TSVector t
 		int			nNoise;
 		DocRepresentation *ptr = ext.begin;
 
+		if (limit > 0 && ext.end->pos - ext.begin->pos > limit)
+			continue;
+
 		while (ptr <= ext.end)
 		{
 			InvSum += invws[ptr->wclass];
@@ -834,7 +837,7 @@ ts_rankcd_wttf(PG_FUNCTION_ARGS)
 	int			method = PG_GETARG_INT32(3);
 	float		res;
 
-	res = calc_rank_cd(getWeights(win), txt, query, method);
+	res = calc_rank_cd(0, getWeights(win), txt, query, method);
 
 	PG_FREE_IF_COPY(win, 0);
 	PG_FREE_IF_COPY(txt, 1);
@@ -850,7 +853,7 @@ ts_rankcd_wtt(PG_FUNCTION_ARGS)
 	TSQuery		query = PG_GETARG_TSQUERY(2);
 	float		res;
 
-	res = calc_rank_cd(getWeights(win), txt, query, DEF_NORM_METHOD);
+	res = calc_rank_cd(0, getWeights(win), txt, query, DEF_NORM_METHOD);
 
 	PG_FREE_IF_COPY(win, 0);
 	PG_FREE_IF_COPY(txt, 1);
@@ -866,7 +869,7 @@ ts_rankcd_ttf(PG_FUNCTION_ARGS)
 	int			method = PG_GETARG_INT32(2);
 	float		res;
 
-	res = calc_rank_cd(getWeights(NULL), txt, query, method);
+	res = calc_rank_cd(0, getWeights(NULL), txt, query, method);
 
 	PG_FREE_IF_COPY(txt, 0);
 	PG_FREE_IF_COPY(query, 1);
@@ -880,9 +883,75 @@ ts_rankcd_tt(PG_FUNCTION_ARGS)
 	TSQuery		query = PG_GETARG_TSQUERY(1);
 	float		res;
 
-	res = calc_rank_cd(getWeights(NULL), txt, query, DEF_NORM_METHOD);
+	res = calc_rank_cd(0, getWeights(NULL), txt, query, DEF_NORM_METHOD);
 
 	PG_FREE_IF_COPY(txt, 0);
 	PG_FREE_IF_COPY(query, 1);
 
 	PG_RETURN_FLOAT4(res);
 }
+
+Datum
+ts_rankcd_lwttf(PG_FUNCTION_ARGS)
+{
+	int			limit = PG_GETARG_INT32(0);
+	ArrayType  *win = (ArrayType *) PG_DETOAST_DATUM(PG_GETARG_DATUM(1));
+	TSVector	txt = PG_GETARG_TSVECTOR(2);
+	TSQuery		query = PG_GETARG_TSQUERY(3);
+	int			method = PG_GETARG_INT32(4);
+	float		res;
+
+	res = calc_rank_cd(limit, getWeights(win), txt, query, method);
+
+	PG_FREE_IF_COPY(win, 1);
+	PG_FREE_IF_COPY(txt, 2);
+	PG_FREE_IF_COPY(query, 3);
+	PG_RETURN_FLOAT4(res);
+}
+
+Datum
+ts_rankcd_lwtt(PG_FUNCTION_ARGS)
+{
+	int			limit = PG_GETARG_INT32(0);
+	ArrayType  *win = (ArrayType *) PG_DETOAST_DATUM(PG_GETARG_DATUM(1));
+	TSVector	txt = PG_GETARG_TSVECTOR(2);
+	TSQuery		query = PG_GETARG_TSQUERY(3);
+	float		res;
+
+	res = calc_rank_cd(limit, getWeights(win), txt, query, DEF_NORM_METHOD);
+
+	PG_FREE_IF_COPY(win, 1);
+	PG_FREE_IF_COPY(txt, 2);
+	PG_FREE_IF_COPY(query, 3);
+	PG_RETURN_FLOAT4(res);
+}
+
+Datum
+ts_rankcd_lttf(PG_FUNCTION_ARGS)
+{
+	int			limit = PG_GETARG_INT32(0);
+	TSVector	txt = PG_GETARG_TSVECTOR(1);
+	TSQuery		query = PG_GETARG_TSQUERY(2);
+	int			method = PG_GETARG_INT32(3);
+	float		res;
+
+	res = calc_rank_cd(limit, getWeights(NULL), txt, query, method);
+
+	PG_FREE_IF_COPY(txt, 1);
+	PG_FREE_IF_COPY(query, 2);
+	PG_RETURN_FLOAT4(res);
+}
+
+Datum
+ts_rankcd_ltt(PG_FUNCTION_ARGS)
+{
+	int			limit = PG_GETARG_INT32(0);
+	TSVector	txt = PG_GETARG_TSVECTOR(1);
+	TSQuery		query = PG_GETARG_TSQUERY(2);
+	float		res;
+
+	res = calc_rank_cd(limit, getWeights(NULL), txt, query, DEF_NORM_METHOD);
+
+	PG_FREE_IF_COPY(txt, 1);
+	PG_FREE_IF_COPY(query, 2);
+	PG_RETURN_FLOAT4(res);
+}
diff -pur postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h /usr/src/postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h
--- postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h	2011-12-01 23:47:20.0 +0200
+++ /usr/src/postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h	2012-01-27 05:45:53.944979678 +0200
@@ -4159,6 +4159,15 @@
 DATA(insert OID = 3709 ( ts_rank_cd PGN
 DESCR(relevance);
 DATA(insert OID = 3710 ( ts_rank_cd PGNSP PGUID 12 1 0 0 f f f t f i 2 0 700 3614 3615 _null_ _null_ _null_ _null_ ts_rankcd_tt _null_ _null_ _null_ ));
 DESCR(relevance);
+DATA(insert OID = 3675 ( ts_rank_cd PGNSP PGUID 12 1 0 0 f f f t f i 5 0 700 23 1021 3614
Re: [HACKERS] patch for parallel pg_dump
On Fri, Jan 27, 2012 at 10:58 AM, Robert Haas robertmh...@gmail.com wrote: It's not clear to me why fmtQualifiedId needs to move to dumputils.c. The way you have it, fmtQualifiedId() is now with fmtId(), but no longer with fmtCopyColumnList(), the only other similarly named function in that directory. That may be more logical, or it may not, but rearranging the code like this makes it a lot harder to review, and I would prefer that we make such changes as separate commits if we're going to do them, so that diff can do something sensible with the changes that are integral to the patch. Woops, hit send a little too soon there. I'll try to make some more substantive comments after looking at this more. And maybe retract some of the bogus ones, like the above: I see why you moved this, now - parallel.c needs it. Still looking... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TS: Limited cover density ranking
The rank counts 1/coversize, so bigger covers will not have much impact anyway. What is the need of the patch?

-Sushant.

On Fri, 2012-01-27 at 18:06 +0200, karave...@mail.bg wrote:
> Hello,
> I have developed a variation of the cover density ranking functions that counts only covers that are smaller than a specified limit. It is useful for finding combinations of terms that appear near one another. Here is an example of usage:
>
> -- normal cover density ranking: not changed
> luben=> select ts_rank_cd(to_tsvector('a b c d e g h i j k'), to_tsquery('a & d'));
>  ts_rank_cd
> ------------
>       0.033
> (1 row)
>
> -- limited to 2
> luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'), to_tsquery('a & d'));
>  ts_rank_cd
> ------------
>           0
> (1 row)
>
> luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a & d'));
>  ts_rank_cd
> ------------
>         0.1
> (1 row)
>
> -- limited to 3
> luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'), to_tsquery('a & d'));
>  ts_rank_cd
> ------------
>       0.033
> (1 row)
>
> luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a & d'));
>  ts_rank_cd
> ------------
>        0.13
> (1 row)
>
> Find attached a patch against the 9.1.2 sources. I preferred to make a patch, not a separate extension, because it is only a one-statement change in the calc_rank_cd function. If I had to make an extension, a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension.
>
> I have some questions:
>
> 1. Is it interesting to develop it further (documentation, cleanup, etc.) for inclusion in one of the next versions? If so, there are some further questions:
> - should I overload ts_rank_cd (as in the examples above and the patch), or should I define a new set of functions, for example ts_rank_lcd?
> - should I define these new SQL-level functions in core, or should I go only with the two-line change in calc_rank_cd() and define the new functions as an extension? If we prefer the latter, can I overload core functions with functions defined in extensions?
> - and finally there is always the possibility of duplicating the code and making an independent extension.
>
> 2. If I run the patched version on a cluster that was initialized with an unpatched server, is there a way to register the new functions in the system catalog without reinitializing the cluster?
>
> Best regards
> luben
> --
> Luben Karavelov

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for parallel pg_dump
On Sun, Jan 15, 2012 at 1:01 PM, Joachim Wieland j...@mcknight.de wrote:
> So this is the parallel pg_dump patch, generalizing the existing parallel restore and allowing parallel dumps for the directory archive format; the patch works on Windows and Unix.

It seems a little unfortunate that we are using threads here on Windows and processes on Linux. Maybe it's too late to revisit that decision, but it seems like a recipe for subtle bugs.

> In parallel restore, the master closes its own connection to the database before forking off worker processes, just as it does now. In parallel dump, however, we need to hold the master's connection open so that we can detect deadlocks. The issue is that somebody could have requested an exclusive lock after the master has initially requested a shared lock on all tables. Therefore, the worker process also requests a shared lock on the table with NOWAIT, and if this fails, we know that there is a conflicting lock in between and that we need to abort the dump.

I think this is an acceptable limitation, but the window where it can happen seems awfully wide right now. As things stand, it seems like we don't try to lock the table in the child until we're about to access it, which means that, on a large database, we could dump out 99% of the database and then be forced to abort the dump because of a conflicting lock on the very last table. We could fix that by having every child lock every table right at the beginning, so that all possible failures of this type would happen before we do any work, but that will eat up a lot of lock table space. It would be nice if the children could somehow piggyback on the parent's locks, but I don't see any obvious way to make that work. Maybe we just have to live with it the way it is, but I worry that people whose dumps fail 10 hours into a 12-hour parallel dump are going to be grumpy.

> The connections of the parallel dump use the synchronized snapshot feature.
> However, there's also an option --no-synchronized-snapshots which can be used to dump from an older PostgreSQL version.

Right now, you have it set up so that --no-synchronized-snapshots is ignored even if synchronized snapshots are supported, which doesn't make much sense to me. I think you should allow --no-synchronized-snapshots with any release, and error out if it's not specified and the version is too old to work without it. It's probably not a good idea to run with --no-synchronized-snapshots ever, and doubly so if they're not available, but I'd rather leave that decision to the user. (Imagine, for example, that we discovered a bug in our synchronized snapshot implementation.) I am tempted to advocate calling the flag --unsynchronized-snapshots, because to me that underscores the danger a little more clearly, but perhaps that is too clever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
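The NOWAIT probe discussed in this thread amounts to each worker issuing a statement of the form LOCK TABLE ... IN ACCESS SHARE MODE NOWAIT and treating failure as "a conflicting lock arrived after the master's initial locks". A small sketch of building that probe (plain C; the real patch goes through pg_dump's own connection and quoting machinery, and the function name here is made up):

```c
#include <stdio.h>

/*
 * Build the per-table probe a worker would send: take the same shared
 * (ACCESS SHARE) lock the master already holds, but with NOWAIT so that
 * a conflicting exclusive lock taken in the meantime makes the command
 * fail immediately instead of blocking the dump indefinitely.
 *
 * 'qualified_name' is assumed to be already quoted and schema-qualified
 * (e.g. by fmtQualifiedId).  Returns the number of bytes that would be
 * written, as snprintf does.
 */
static int
make_lock_probe(char *buf, size_t buflen, const char *qualified_name)
{
    return snprintf(buf, buflen,
                    "LOCK TABLE %s IN ACCESS SHARE MODE NOWAIT",
                    qualified_name);
}
```

If the server rejects the statement (the command does not return success), the worker knows a conflicting lock slipped in between the master's lock and its own, which is exactly the abort condition described above.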
Re: [HACKERS] TS: Limited cover density ranking
----- Quote from Sushant Sinha (sushant...@gmail.com), on 27.01.2012 at 18:32 -----
> The rank counts 1/coversize. So bigger covers will not have much impact anyway. What is the need of the patch?
> -Sushant.

If you want to find only combinations of words that are close to one another, with the patch you could use something like:

WITH a AS (
    SELECT to_tsvector('a b c d e g h i j k') AS vec,
           to_tsquery('a & d') AS query
)
SELECT * FROM a WHERE vec @@ query AND ts_rank_cd(3, vec, query) > 0;

I could not find another way to make this type of query. If there is an alternative, I am open to suggestions.

Best regards
-- 
Luben Karavelov
Re: [HACKERS] patch for parallel pg_dump
On 27.01.2012 18:46, Robert Haas wrote: On Sun, Jan 15, 2012 at 1:01 PM, Joachim Wielandj...@mcknight.de wrote: In parallel restore, the master closes its own connection to the database before forking of worker processes, just as it does now. In parallel dump however, we need to hold the masters connection open so that we can detect deadlocks. The issue is that somebody could have requested an exclusive lock after the master has initially requested a shared lock on all tables. Therefore, the worker process also requests a shared lock on the table with NOWAIT and if this fails, we know that there is a conflicting lock in between and that we need to abort the dump. I think this is an acceptable limitation, but the window where it can happen seems awfully wide right now. As things stand, it seems like we don't try to lock the table in the child until we're about to access it, which means that, on a large database, we could dump out 99% of the database and then be forced to abort the dump because of a conflicting lock on the very last table. We could fix that by having every child lock every table right at the beginning, so that all possible failures of this type would happen before we do any work, but that will eat up a lot of lock table space. It would be nice if the children could somehow piggyback on the parent's locks, but I don't see any obvious way to make that work. Maybe we just have to live with it the way it is, but I worry that people whose dumps fail 10 hours into a 12 hour parallel dump are going to be grumpy. If the master process keeps the locks it acquires in the beginning, you could fall back to dumping those tables where the child lock fails using the master connection. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TS: Limited cover density ranking
----- Quote from karave...@mail.bg, on 27.01.2012 at 18:48 -----
> ----- Quote from Sushant Sinha (sushant...@gmail.com), on 27.01.2012 at 18:32 -----
>> The rank counts 1/coversize. So bigger covers will not have much impact anyway. What is the need of the patch?
>> -Sushant.
>
> If you want to find only combinations of words that are close to one another, with the patch you could use something like:
>
> WITH a AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('a & d') AS query)
> SELECT * FROM a WHERE vec @@ query AND ts_rank_cd(3, vec, query) > 0;

Another example: if you want to match 'b c d' only, you could use:

WITH a AS (
    SELECT to_tsvector('a b c d e g h i j k') AS vec,
           to_tsquery('b & c & d') AS query
)
SELECT * FROM a WHERE vec @@ query AND ts_rank_cd(2, vec, query) > 0;

The catch is that it will also match 'b d c', 'd c b', 'd b c', 'c d b' and 'c b d', so it is not a replacement for exact phrase match, but something that I find useful.

-- 
Luben Karavelov
Re: [HACKERS] pg_statistic, lack of documentation
On Sat, Jan 14, 2012 at 7:34 AM, Sergey Konoplev gray...@gmail.com wrote: Hi, http://www.postgresql.org/docs/9.1/interactive/catalog-pg-statistic.html It specifies that entries are created by ANALYZE, but does not mention that if a table is empty the entry for it is not created. The actual behavior, in somewhat more detail, is that the existing statistics entries, if any, are retained. I've added a note to that effect to the documentation for ANALYZE, which seems like a more appropriate place than the pg_statistic documentation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for parallel pg_dump
On Fri, Jan 27, 2012 at 11:53 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: If the master process keeps the locks it acquires in the beginning, you could fall back to dumping those tables where the child lock fails using the master connection. Hmm, that's a thought. Another idea I just had is to allow a transaction that has imported a snapshot to jump the queue when attempting to acquire a lock that the backend from which the snapshot was imported already holds. We don't want to allow queue-jumping in general because there are numerous places in the code where we rely on the current behavior to avoid starving strong lockers, but it seems like it might be reasonable to allow it in this special case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables
On Wed, Jan 11, 2012 at 6:43 AM, Ashutosh Bapat ashutosh.ba...@enterprisedb.com wrote:
> Hi,
> After running the regression tests, I ran EXPLAIN on one of the queries in regression (test create_misc) and got the following output:
>
> regression=# explain verbose select * into table ramp from road where name ~ '.*Ramp';
>                                   QUERY PLAN
> -------------------------------------------------------------------------------
>  Result  (cost=0.00..154.00 rows=841 width=67)
>    Output: public.road.name, public.road.thepath
>    ->  Append  (cost=0.00..154.00 rows=841 width=67)
>          ->  Seq Scan on public.road  (cost=0.00..135.05 rows=418 width=67)
>                Output: public.road.name, public.road.thepath
>                Filter: (public.road.name ~ '.*Ramp'::text)
>          ->  Seq Scan on public.ihighway road  (cost=0.00..14.99 rows=367 width=67)
>                              ^^^^^^^^^^^^^^^^
>                Output: public.road.name, public.road.thepath
>                        ^^^^^^^^^^^        ^^^^^^^^^^^
>                Filter: (public.road.name ~ '.*Ramp'::text)
>                         ^^^^^^^^^^^
>          ->  Seq Scan on public.shighway road  (cost=0.00..3.96 rows=56 width=67)
>                Output: public.road.name, public.road.thepath
>                Filter: (public.road.name ~ '.*Ramp'::text)
> (12 rows)
>
> regression=# \d+ road
>                       Table "public.road"
>  Column  | Type | Modifiers | Storage  | Stats target | Description
> ---------+------+-----------+----------+--------------+-------------
>  name    | text |           | extended |              |
>  thepath | path |           | extended |              |
> Indexes:
>     "rix" btree (name)
> Child tables: ihighway, shighway
> Has OIDs: no
>
> Table road has children ihighway and shighway, as seen in the \d+ output above. The EXPLAIN output of the Seq Scan nodes on the children has public.road as the prefix for variables. public.road could imply the parent table road and thus cause confusion as to what is being referred to: the columns of the parent table or of the child table. In the EXPLAIN output the child tables have road as an alias (as against public.road). The alias comes from RangeTblEntry->eref->aliasname. It might be better to have road as the prefix in the variable names rather than public.road.

It's a feature, not a bug, that we schema-qualify names when VERBOSE is specified. That was done on purpose for the benefit of external tools that might need this information to disambiguate which object is being referenced.
Table *aliases*, of course, should not be schema-qualified, but I don't think that's what we're doing. You could make it more clear by including an alias in the query, like this: explain verbose select * into table ramp from road hwy where name ~ '.*Ramp'; -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] get_fn_expr_argtype() vs. internal calls
On Thu, Dec 29, 2011 at 4:17 PM, Noah Misch n...@leadboat.com wrote: We document that a polymorphic C-language function may identify the concrete data type of each argument using calls to get_fn_expr_argtype(). That relies on FmgrInfo.fn_expr, which only the executor sets. Calls of internal origin, by way of {Direct,,Oid}FunctionCall*(), don't cons up an fn_expr, so get_fn_expr_argtype() just returns InvalidOid every time. (Indeed, we couldn't easily do better in many cases.) To what extent is it safe to rely on this situation remaining as it is? I ask on account of some second thoughts I had about CheckIndexCompatible(). When writing it, I did not explicitly consider operator classes having polymorphic opcintype. If get_fn_expr_argtype() were to work in a function called from the btree search code, CheckIndexCompatible() should impose stricter checks on indexes having opclasses of polymorphic opcintype. If that's not too likely to happen, I might just add a comment instead. [ found this while reviewing emails I hadn't read carefully enough when they were first posted ] I think the way you actually chose to tighten this up is probably safer, and it seems to have minimal downside. Thanks for pursuing this despite the lack of response to your initial query. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unreliable pg_ctl -w start again
MauMau maumau...@gmail.com writes:
> In PostgreSQL 9.1, the wait processing is largely modified. However, the same assumption seems to still remain, though the duration is 5 seconds. 5 seconds of wait is probably insufficient for my case. I think no fixed duration is appropriate.

Well, feel free to increase that duration if you want. The reason it's there is to not wait for a long time if the postmaster falls over instantly at startup, but in a non-interactive situation you might not care.

> How about inserting postmaster_is_alive() as below?

Looks like complete nonsense to me, if the goal is to behave sanely when postmaster.pid hasn't been created yet. Where do you think get_pgpid gets the PID from?

If you want to do something useful about this, the correct hint is buried in start_postmaster():

	/*
	 * Since there might be quotes to handle here, it is easier simply to pass
	 * everything to a shell to process them.
	 *
	 * XXX it would be better to fork and exec so that we would know the child
	 * postmaster's PID directly; then test_postmaster_connection could use
	 * the PID without having to rely on reading it back from the pidfile.
	 */

If we had the postmaster's PID a priori, we could detect postmaster death directly instead of having to make assumptions about how long is reasonable to wait for the pidfile to appear. The problem is that we don't want to write a complete replacement for the shell's command-line parser and I/O redirection logic; it doesn't look like a small project. (But maybe we could bypass that by doing a fork() and then having the child exec() the shell, telling it to exec the postmaster in turn?) And of course Windows as usual makes things twice as hard, since we couldn't make such a change unless start_postmaster could return the proper PID in that case too.

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
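The parenthetical fork-then-shell idea can be sketched as follows: fork(), have the child run the command line through /bin/sh, and make the command itself start with "exec" so the shell replaces itself with the target program. The PID returned by fork() is then the target's own PID, usable later with kill(pid, 0) to test liveness. (This is a hedged illustration with a placeholder command, not pg_ctl's actual start_postmaster.)

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork, then let the child hand the command line to the shell, which
 * still handles all quoting and I/O redirection.  If 'cmdline' begins
 * with "exec", the shell replaces itself with the target program, so
 * the PID we return is the target's PID -- no need to read it back
 * from a pidfile.  Returns the child PID, or -1 if fork() fails.
 */
static pid_t
spawn_via_shell(const char *cmdline)
{
    pid_t       pid = fork();

    if (pid == 0)
    {
        /* child: the shell does the command-line parsing for us */
        execl("/bin/sh", "sh", "-c", cmdline, (char *) NULL);
        _exit(127);             /* exec failed */
    }
    return pid;
}
```

This sidesteps reimplementing the shell's parser while still yielding a directly monitorable PID; as Tom notes, Windows would need separate treatment since this relies on POSIX fork()/exec().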
Re: [HACKERS] Intermittent regression test failures from index-only plan changes
On Sat, Jan 7, 2012 at 12:30 PM, Tom Lane t...@sss.pgh.pa.us wrote: I feel like this is a trick question, but I'll ask anyway: Can't we just ignore ANALYZE? AFAICS, no. ANALYZE will run user-defined code: not only user-supplied stats collection functions, but user-defined index expressions. We cannot assume that none of that ever requires a snapshot. The question is: Why would it matter if we expunged tuples from table A while ANALYZE was running on table B? I guess the problem is that the index on B might involve a user-defined function which (under the covers) peeks at table A, possibly now seeing an inconsistent view of the database. It's pretty unfortunate to have to cater to that situation, though, because most of the time an ANALYZE on table A is only going to look at table A and the system catalogs. In fact, it wouldn't even be disastrous (in most cases) if we removed tuples from the table being analyzed - we're engaged in an inherently statistical process anyway, so who really cares if things change on us in medias res? Could we easily detect the cases where user code is being run and ignore ANALYZE when none is? A probably crazy idea is to add an option to vacuum that would cause it, upon discovering that it can't set PD_ALL_VISIBLE on a page because the global xmin is too old, to wait for all of the virtual transaction IDs who might not be able to see every tuple on the page. This would allow us to get into a state where all the PD_ALL_VISIBLE bits are known to be set. But that seems a bit complex for something that we probably don't care about much outside of the regression tests. If none of the above is feasible (and I suspect it isn't), we might just want to tweak the queries to do something that will preclude using an index-only scan, like including tableoid::regclass in the target list. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] rewriteheap.c bug: toast rows don't get XIDs matching their parents
On Thu, Jan 12, 2012 at 4:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: While working on bug #6393 I was reminded of the truth of $SUBJECT: any rows inserted into the new toast table will have the xmin of the CLUSTER or VACUUM FULL operation, and invalid xmax, whereas their parent heap rows will have xmin/xmax copied from the previous instance of the table. This does not matter much for ordinary live heap rows, but it's also necessary for CLUSTER/VACUUM FULL to copy recently-dead, insert-in-progress, and delete-in-progress rows. In such cases, a later plain VACUUM might reap the parent heap rows and not the toast rows, leading to a storage leak that won't be recovered short of another CLUSTER/VACUUM FULL. I can't remember if we discussed this risk when the heap rewrite code was written. I'm not sure it's worth fixing, but at the least it ought to be documented in the comments in rewriteheap.c. People run CLUSTER and VACUUM FULL to recover wasted storage, so it's a bit unfortunate if those operations can themselves introduce a storage leak. So I think it would be nice to fix this, but if that's more than we can manage right now, then I agree we should at least add a code comment so that it has a better chance of getting fixed later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_statistic, lack of documentation
On Fri, Jan 27, 2012 at 9:14 PM, Robert Haas robertmh...@gmail.com wrote: On Sat, Jan 14, 2012 at 7:34 AM, Sergey Konoplev gray...@gmail.com wrote: I've added a note to that effect to the documentation for ANALYZE, which seems like a more appropriate place than the pg_statistic documentation. Thank you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sergey Konoplev Blog: http://gray-hemp.blogspot.com LinkedIn: http://ru.linkedin.com/in/grayhemp JID/GTalk: gray...@gmail.com Skype: gray-hemp
Re: [HACKERS] Multithread Query Planner
Not to mention palloc, another extremely fundamental and non-reentrant subsystem. Possibly we could work on making all that stuff re-entrant, but it would be a huge amount of work for a distant and uncertain payoff. Right. I think it makes more sense to try to get parallelism working first with the infrastructure we have. Converting to use threading, if we ever do it at all, should be something we view as a later performance optimization. But I suspect we won't want to do it anyway; I think there will be easier ways to get where we want to be. Multithreading got fashionable with the arrival of the Dual-core CPU a few years ago. However, multithreading as it is used currently has a huge problem : usually, threads share all of their memory. This opens the door to an infinite number of hard to find bugs, and more importantly, defeats the purpose. Re-entrant palloc() is nonsense. Suppose you can make a reentrant palloc() which scales OK at 2 threads thanks to a cleverly placed atomic instruction. How is it going to scale on 64 cores ? On HP's new 1000-core ARM server with non-uniform memory access ? Probably it would suck very very badly... not to mention the horror of multithreaded exception-safe deallocation when 1 thread among many blows up on an error... For the ultimate in parallelism, ask a FPGA guy. Is he using shared memory to wire together his 12000 DSP blocks ? Nope, he's using isolated Processes which share nothing and communicate through FIFOs and hardware message passing. Like shell pipes, basically. Or Erlang. Good parallelism = reduce shared state and communicate through data/message channels. Shared-everything multithreading is going to be in a lot of trouble on future many-core machines. Incidentally, Postgres, with its Processes, sharing only what is needed, has a good head start... 
With more and more cores coming, you guys are going to have to fight to reduce the quantity of shared state between processes, not augment it by using shared memory threads !... Say you want to parallelize sorting. Sorting is a black-box with one input data pipe and one output data pipe. Data pipes are good for parallelism, just like FIFOs. FPGA guys love black boxes with FIFOs between them. Say you manage to send tuples through a FIFO like zeromq. Now you can even run the sort on another machine and allow it to use all the RAM if you like. Now split the black box in two black boxes (qsort and merge), instantiate as many qsort boxes as necessary, and connect that together with pipes. Run some boxes on some of this machine's cores, some other boxes on another machine, etc. That would be very flexible (and scalable). Of course the black box has a small backdoor : some comparison functions can access shared state, which is basically *the* issue (not reentrant stuff, which you do not need).
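The share-nothing, FIFO-connected style argued for above can be sketched with nothing more than fork() and a pair of pipes. This is a toy stand-in for the qsort/merge "black boxes" described, not anything from the PostgreSQL tree: the parent streams integers to a child process (a separate address space, no shared memory), which sorts them and streams them back.

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

/* Sort up to 256 ints in a child process, communicating only via pipes. */
int sort_via_child(int *vals, int n)
{
    int down[2], up[2];

    if (n <= 0 || n > 256 || pipe(down) != 0 || pipe(up) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0)               /* child: read, sort, write back, exit */
    {
        int buf[256];

        close(down[1]);
        close(up[0]);
        ssize_t got = read(down[0], buf, (size_t) n * sizeof(int));
        if (got > 0)
        {
            qsort(buf, (size_t) got / sizeof(int), sizeof(int), cmp_int);
            write(up[1], buf, (size_t) got);
        }
        _exit(0);
    }

    /* parent: send the data down, read the sorted result back */
    close(down[0]);
    close(up[1]);
    write(down[1], vals, (size_t) n * sizeof(int));
    close(down[1]);
    read(up[0], vals, (size_t) n * sizeof(int));
    close(up[0]);
    waitpid(pid, NULL, 0);
    return 0;
}
```

The child could just as easily live on another machine behind a socket; nothing in the protocol depends on shared memory, which is exactly the property being advocated.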
Re: [HACKERS] 16-bit page checksums for 9.2
The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c) Also, I missed this before: don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well? Otherwise, you won't be checksumming a bunch of the new pages. Dan - Original Message - From: Robert Haas robertmh...@gmail.com To: Dan Scales sca...@vmware.com Cc: Noah Misch n...@leadboat.com, Heikki Linnakangas heikki.linnakan...@enterprisedb.com, Andres Freund and...@anarazel.de, Kevin Grittner kevin.gritt...@wicourts.gov, da...@fetter.org, ai...@highrise.ca, st...@mit.edu, pgsql-hackers@postgresql.org, Simon Riggs si...@2ndquadrant.com Sent: Friday, January 27, 2012 5:19:32 AM Subject: Re: [HACKERS] 16-bit page checksums for 9.2 On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales sca...@vmware.com wrote: I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite(). If there were ever another storage backend, it would have to duplicate the checksum check, right? Is there a disadvantage to putting it in smgrwrite()? The smgr and md layers don't currently know anything about the page format, and I imagine we want to keep it that way. It seems like the right place for this is in some higher layer, like bufmgr. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
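The "stamp the checksum on every write path" point can be illustrated with a toy page layout. This is not PostgreSQL's page format or checksum algorithm (PageSetVerificationInfo in the patch is more involved); the sketch only shows why every path that produces a page image — write and extend alike — must pass through the stamping step, or verification will later fail on unstamped pages.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 8192

/* Toy page: a 16-bit checksum field followed by the payload. */
typedef struct
{
    uint16_t    checksum;
    char        data[PAGE_SIZE - sizeof(uint16_t)];
} Page;

/* Fletcher-style toy sum over the payload, skipping the field itself. */
static uint16_t page_checksum(const Page *p)
{
    uint32_t    s1 = 0, s2 = 0;

    for (size_t i = 0; i < sizeof(p->data); i++)
    {
        s1 = (s1 + (uint8_t) p->data[i]) % 255;
        s2 = (s2 + s1) % 255;
    }
    return (uint16_t) ((s2 << 8) | s1);
}

/* Must be called on every path that hands a page to storage. */
static void page_stamp(Page *p)
{
    p->checksum = page_checksum(p);
}

/* Returns nonzero if the stored checksum matches the payload. */
static int page_verify(const Page *p)
{
    return p->checksum == page_checksum(p);
}
```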
Re: [HACKERS] Simulating Clog Contention
On Thu, Jan 12, 2012 at 4:31 AM, Simon Riggs si...@2ndquadrant.com wrote: The following patch adds a pgbench option -I to load data using INSERTs, so that we can begin benchmark testing with rows that have large numbers of distinct un-hinted transaction ids. With a database pre-created using this we will be better able to simulate and thus more easily measure clog contention. Note that current clog has space for 1 million xids, so a scale factor of greater than 10 is required to really stress the clog. Running with this patch with a non-default scale factor generates the spurious notice: Scale option ignored, using pgbench_branches table count = 10 In fact the scale option is not being ignored, because it was used to initialize the pgbench_branches table count earlier in this same invocation. I think that even in normal (non-initialization) usage, this message should be suppressed when the provided scale factor is equal to the pgbench_branches table count. Cheers, Jeff
Re: [HACKERS] CLOG contention, part 2
On Sat, Jan 21, 2012 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote: Yes, it was. Sorry about that. New version attached, retesting while you read this. In my hands I could never get this patch to do anything. The new cache was never used. I think that that was because RecentXminPageno never budged from -1. I think that that, in turn, is because the comparison below can never return true, because the comparison is casting both sides to uint, and -1 cast to uint is very large:

/* When we commit advance ClogCtl's shared RecentXminPageno if needed */
if (ClogCtl->shared->RecentXminPageno < TransactionIdToPage(RecentXmin))
    ClogCtl->shared->RecentXminPageno = TransactionIdToPage(RecentXmin);

Also, I think the general approach is wrong. The only reason to have these pages in shared memory is that we can control access to them to prevent write/write and read/write corruption. Since these pages are never written, they don't need to be in shared memory. Just read each page into backend-local memory as it is needed, either palloc/pfree each time or using a single reserved block for the lifetime of the session. Let the kernel worry about caching them so that the above mentioned reads are cheap. Cheers, Jeff
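The promotion trap Jeff describes is easy to reproduce in miniature. The names below are illustrative, not the patch's: when one operand is unsigned, the usual arithmetic conversions turn -1 into UINT32_MAX, so the "is this page behind?" test can never fire.

```c
#include <stdint.h>
#include <stdbool.h>

/* The buggy shape: the signed -1 sentinel is converted to unsigned,
 * becoming 4294967295, so the comparison is always false. */
static bool page_is_behind_unsigned(int32_t recent_pageno, uint32_t target_pageno)
{
    return (uint32_t) recent_pageno < target_pageno;
}

/* The intended shape: compare in the signed domain, where the -1
 * sentinel really is smaller than any valid page number. */
static bool page_is_behind_signed(int32_t recent_pageno, uint32_t target_pageno)
{
    return recent_pageno < (int32_t) target_pageno;
}
```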
Re: [HACKERS] Dry-run mode for pg_archivecleanup
On Fri, Jan 27, 2012 at 9:47 AM, Robert Haas robertmh...@gmail.com wrote: On Sun, Jan 15, 2012 at 5:05 PM, Josh Kupershmidt schmi...@gmail.com wrote: On Sun, Jan 15, 2012 at 3:02 PM, Gabriele Bartolini gabriele.bartol...@2ndquadrant.it wrote: My actual intention was to have the filename as output of the command, in order to easily pipe it to another script. Hence my first choice was to use the stdout channel, considering also that pg_archivecleanup in dry-run mode is harmless and does not touch the content of the directory. Oh, right - I should have re-read your initial email before diving into the patch. That all makes sense given your intended purpose. I guess your goal of constructing some simple way to pass the files which would be removed on to another script is a little different than what I initially thought the patch would be useful for, namely as a testing/debugging aid for an admin. Perhaps both goals could be met by making use of '--debug' together with '--dry-run'. If they are both on, then an additional message like "pg_archivecleanup: would remove file ..." would be printed to stderr, along with just the filename printed to stdout you already have. This email thread seems to have trailed off without reaching a conclusion. The patch is marked as Waiting on Author in the CommitFest application, but I'm not sure that's accurate. Can we try to nail this down? Perhaps my last email was a bit wordy. The only real change I am suggesting for Gabriele's patch is that the message printed to stderr when debug + dryrun are activated be changed to "would remove file ..." from "removing file", i.e. around line 124:

if (debug)
    fprintf(stderr, "%s: %s file \"%s\"\n", progname,
            (dryrun ? "would remove" : "removing"), WALFilePath);

Other than that little quibble, I thought the patch was fine. Josh
Re: [HACKERS] Dry-run mode for pg_archivecleanup
Excerpts from Josh Kupershmidt's message of vie ene 27 19:43:51 -0300 2012: On Fri, Jan 27, 2012 at 9:47 AM, Robert Haas robertmh...@gmail.com wrote: This email thread seems to have trailed off without reaching a conclusion. The patch is marked as Waiting on Author in the CommitFest application, but I'm not sure that's accurate. Can we try to nail this down? Perhaps my last email was a bit wordy. The only real change I am suggesting for Gabriele's patch is that the message printed to stderr when debug + dryrun are activated be changed to "would remove file ..." from "removing file", i.e. around line 124:

if (debug)
    fprintf(stderr, "%s: %s file \"%s\"\n", progname,
            (dryrun ? "would remove" : "removing"), WALFilePath);

Other than that little quibble, I thought the patch was fine. If you do that, please make sure you use two complete separate strings instead of building one from spare parts. This bit is missing the _() stuff though. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
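Álvaro's request — keep each translatable message one complete string rather than splicing "would remove"/"removing" into a shared format — can be sketched as follows. This is a hypothetical shape, not the actual pg_archivecleanup code; _() stands in for the usual gettext macro and is stubbed out so the sketch is self-contained.

```c
#include <stdio.h>
#include <string.h>

#define _(x) (x)        /* stand-in for gettext(); stub for the sketch */

/* Two complete, independently translatable messages. A translator sees
 * a whole sentence either way, instead of a fragment like "would remove"
 * that may not compose grammatically in another language. */
static const char *cleanup_message_fmt(int dryrun)
{
    return dryrun
        ? _("%s: file \"%s\" would be removed\n")
        : _("%s: removing file \"%s\"\n");
}
```

At the call site this would become something like `fprintf(stderr, cleanup_message_fmt(dryrun), progname, WALFilePath);`, with both whole strings visible to xgettext.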
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 27, 2012 at 4:05 PM, Jeff Janes jeff.ja...@gmail.com wrote: Also, I think the general approach is wrong. The only reason to have these pages in shared memory is that we can control access to them to prevent write/write and read/write corruption. Since these pages are never written, they don't need to be in shared memory. Just read each page into backend-local memory as it is needed, either palloc/pfree each time or using a single reserved block for the lifetime of the session. Let the kernel worry about caching them so that the above mentioned reads are cheap. right -- exactly. but why stop at one page? merlin
Re: [HACKERS] Configuring Postgres to Add A New Source File
Indeed, I'm a beginner in Make, but I read a few tutorials and was able to do what I wanted outside of PG using a simple makefile. Now, when moving to PG, I found the Make structure much more complicated and didn't know where to add my configuration. I'm looking only for this file to run in PG (the required functionality is done already). My code will be a part of the backend, so I want to keep it there. You're saying it's going to be a bit tricky to do what I want with make (as I'm a novice in it).. so can you please point out some references I can start from? Do you think reposting this on the novice mailing list would be a good idea? Thanks On Thu, Jan 26, 2012 at 12:54 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Jan 25, 2012 at 7:06 AM, Tareq Aljabban tareq.aljab...@gmail.com wrote: Hi, I'm doing some development on the storage manager module of Postgres. I have added a few source files already to the smgr folder, and I was able to have the Make system include them by adding their names to the OBJS list in the Makefile inside the smgr folder. (i.e. When I add A.c, I would add A.o to the OBJS list.) That was working fine. Now I'm trying to add a new file hdfs_test.c to the project. The problem with this file is that it requires some extra directives in its compilation command (-I and -L directives). Using gcc the file is compiled with the command:

gcc hdfs_test.c -I/HDFS_HOME/hdfs/src/c++/libhdfs -I/usr/lib/jvm/default-java/include -L/HDFS_HOME/hdfs/src/c++/libhdfs -L/HDFS_HOME/build/c++/Linux-i386-32/lib -L/usr/lib/jvm/default-java/jre/lib/i386/server -ljvm -lhdfs -o hdfs_test

I was told that in order to add the extra directives, I need to modify $LDFLAGS in configure.in and $LIBS in Makefile. I read about autoconf and checked configure.in of Postgres but still wasn't able to know exactly what I should be doing. This is really a general question about make that has nothing to do with PostgreSQL specifically; you might want to get a book on make.
Makefiles in subdirectories of src/backend are only intended to be able to build the backend itself, so there's probably no particularly easy way to do what you want there. You likely want to put this code someplace like src/test/hdfs_test and copy one of the other Makefiles into the new directory, perhaps src/test/isolation/Makefile. But even if you do that for starters, you're going to need to make some modifications, which may be a bit tricky if you have no experience at all using make. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] initdb and fsync
It looks like initdb doesn't fsync all the files it creates, e.g. the PG_VERSION file. While it's unlikely that it would cause any real data loss, it can be inconvenient in some testing scenarios involving VMs. Thoughts? Would a patch to add a few fsync calls to initdb be accepted? Is a platform-independent fsync available at initdb time? Regards, Jeff Davis
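On POSIX systems, the per-file work initdb would need might look like the sketch below. Note the second step: making the *directory entry* durable requires fsyncing the containing directory as well, not just the file. This is a POSIX-only sketch, not a proposed patch; Windows would need a different call such as _commit().

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Create a file with the given contents and make both the data and its
 * directory entry durable. Returns 0 on success, -1 on failure. */
static int write_file_durably(const char *dir, const char *path,
                              const char *contents, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, contents, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    /* fsync the directory so the new entry itself survives a crash */
    int dfd = open(dir, O_RDONLY);

    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);

    close(dfd);
    return rc;
}
```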
Re: [HACKERS] CLOG contention, part 2
On Fri, Jan 27, 2012 at 3:16 PM, Merlin Moncure mmonc...@gmail.com wrote: On Fri, Jan 27, 2012 at 4:05 PM, Jeff Janes jeff.ja...@gmail.com wrote: Also, I think the general approach is wrong. The only reason to have these pages in shared memory is that we can control access to them to prevent write/write and read/write corruption. Since these pages are never written, they don't need to be in shared memory. Just read each page into backend-local memory as it is needed, either palloc/pfree each time or using a single reserved block for the lifetime of the session. Let the kernel worry about caching them so that the above mentioned reads are cheap. right -- exactly. but why stop at one page? If you have more than one, you need code to decide which one to evict (just free) every time you need a new one. And every process needs to be running this code, while the kernel is still going to need to make its own decisions for the entire system. It seems simpler to just let the kernel do the job for everyone. Are you worried that a read syscall is going to be slow even when the data is presumably cached in the OS? Cheers, Jeff
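Jeff's single-reserved-buffer idea can be sketched as below. The file layout and names are hypothetical, not the actual slru code: one backend-local page buffer, filled on demand with pread(), with repeat reads of the same page served from the buffer and everything else left to the kernel's page cache.

```c
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>

#define CLOG_PAGE_SIZE 8192

/* One buffer reserved for the lifetime of the session. */
static char local_page[CLOG_PAGE_SIZE];
static int  local_pageno = -1;

/* Return a pointer to the requested (read-only, never-rewritten) page,
 * reading it from the file only when the cached page number differs. */
static const char *read_clog_page(int fd, int pageno)
{
    if (pageno != local_pageno)
    {
        if (pread(fd, local_page, CLOG_PAGE_SIZE,
                  (off_t) pageno * CLOG_PAGE_SIZE) != CLOG_PAGE_SIZE)
            return NULL;
        local_pageno = pageno;
    }
    return local_page;
}
```

Because the pages are immutable once written, no locking is needed at all; the read either returns the bytes the kernel already has cached or faults them in once.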
[HACKERS] Temp file missing during large pgbench data set
Hi, I'm using latest git master (latest entry 0816fad6eebddb8f1f0e21635e46625815d690b9) and I'm getting an error when trying to create a large data set with pgbench:

thom@swift:~/Development$ createdb pgbench
thom@swift:~/Development$ pgbench -i -s 100 pgbench
NOTICE: table "pgbench_branches" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_history" does not exist, skipping
creating tables...
1 tuples done.
2 tuples done.
3 tuples done.
4 tuples done.
snip
997 tuples done.
998 tuples done.
999 tuples done.
1000 tuples done.
set primary key...
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "pgbench_branches_pkey" for table "pgbench_branches"
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "pgbench_tellers_pkey" for table "pgbench_tellers"
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "pgbench_accounts_pkey" for table "pgbench_accounts"
LOG: could not stat file "base/pgsql_tmp/pgsql_tmp8056.0": Success
STATEMENT: alter table pgbench_accounts add primary key (aid)
vacuum...done.

I've tried this multiple times and the result is the same. I'm not sure what this message means, but the primary key is created successfully. If I reduce the scale to 20 the error doesn't occur. The following changes were made to my postgresql.conf file:

max_connections = 300
shared_buffers = 3900MB
temp_buffers = 16MB
work_mem = 16MB
maintenance_work_mem = 256MB
checkpoint_segments = 32
random_page_cost = 1.1
effective_cache_size = 12GB

All other values are at default. Is this anything to worry about? -- Thom
Re: [HACKERS] Unreliable pg_ctl -w start again
From: Tom Lane t...@sss.pgh.pa.us Well, feel free to increase that duration if you want. The reason it's there is to not wait for a long time if the postmaster falls over instantly at startup, but in a non-interactive situation you might not care. Yes, just lengthening the wait duration causes unnecessary long wait when we run pg_ctl interactively. Therefore, the current wait approach is not correct. How about inserting postmaster_is_alive() as below? Looks like complete nonsense to me, if the goal is to behave sanely when postmaster.pid hasn't been created yet. Where do you think get_pgpid gets the PID from? Yes, I understand that get_pgpid() gets the pid from postmaster.pid, which may be the pid of the previous postmaster that did not stop cleanly. I think my simple fix makes sense to solve the problem as follows. Could you point out what might not be good?

1. The previous postmaster was terminated abruptly due to OS shutdown, machine failure, etc., leaving postmaster.pid.
2. Run pg_ctl -w start to start new postmaster.
3. do_start() of pg_ctl reads the pid of previously running postmaster from postmaster.pid. Say, let it be pid-1 (old_pid in code) here.

old_pid = get_pgpid();

4. Anyway, try to start postmaster by calling start_postmaster().
5. If postmaster.pid existed at step 3, it means either of:
(a) Previous postmaster did not stop cleanly and left postmaster.pid.
(b) Another postmaster is already running in the data directory (since before running pg_ctl -w start this time.)
But we can't distinguish between them. Then, we read postmaster.pid again to judge the situation. Let it be pid-2 (pid in code).

if (old_pid != 0)
{
    pg_usleep(100);
    pid = get_pgpid();

6. If pid-1 != pid-2, it means that the situation (a) applies and the newly started postmaster overwrote old postmaster.pid. Then, try to connect to postmaster. If pid-1 == pid-2, it means either of:
(a') Previous postmaster did not stop cleanly and left postmaster.pid.
Newly started postmaster will complete startup, but hasn't overwritten postmaster.pid yet.
(b) Another postmaster is already running in the data directory (since before running pg_ctl -w start this time.)
The current comparison logic cannot distinguish between them. In my problem situation, situation a' happened, and pg_ctl mistakenly exited.

if (pid == old_pid)
{
    write_stderr(_("%s: could not start server\nExamine the log output.\n"), progname);
    exit(1);
}

7. To distinguish between a' and b, check if pid-1 is alive. If pid-1 is alive, it means situation b. Otherwise, that is situation a'.

if (pid == old_pid && postmaster_is_alive(old_pid))

However, the pid of newly started postmaster might match the one of old postmaster. To deal with that situation, it may be better to check the modified timestamp of postmaster.pid in addition. What do you think? If we had the postmaster's PID a priori, we could detect postmaster death directly instead of having to make assumptions about how long is reasonable to wait for the pidfile to appear. The problem is that we don't want to write a complete replacement for the shell's command line parser and I/O redirection logic. It doesn't look like a small project. Yes, I understand this. I don't think we can replace shell's various work. (But maybe we could bypass that by doing a fork() and then having the child exec() the shell, telling it to exec postmaster in turn?) Possibly. I hope this works. Then, we can pass unnamed pipe file descriptors to postmaster via environment variables from pg_ctl's forked child. And of course Windows as usual makes things twice as hard, since we couldn't make such a change unless start_postmaster could return the proper PID in that case too. Well, we can make start_postmaster() return the pid of the newly created postmaster. CreateProcess() sets the process handle in the structure passed to it. We can pass the process handle to WaitForSingleObject() to know whether postmaster is alive.
Regards MauMau
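The liveness probe MauMau refers to is conventionally built on kill(pid, 0), which checks whether a process exists without delivering any signal. This is a hedged sketch of that idiom, not pg_ctl's actual postmaster_is_alive(): ESRCH means no such process, while EPERM means the process exists but belongs to another user (so it is alive).

```c
#include <signal.h>
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>
#include <sys/types.h>

/* Probe for process existence with signal 0 (no signal is delivered). */
static bool probe_process_alive(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return true;                /* exists and we may signal it */
    return errno == EPERM;          /* exists, but not ours to signal */
}
```

As the thread notes, a bare PID check can be fooled by PID reuse after a reboot, which is why MauMau also suggests cross-checking the pidfile's modification timestamp.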
Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables
Robert Haas robertmh...@gmail.com writes: It's a feature, not a bug, that we schema-qualify names when VERBOSE is specified. That was done on purpose for the benefit of external tools that might need this information to disambiguate which object is being referenced. Table *aliases*, of course, should not be schema-qualified, but I don't think that's what we're doing. You could make it more clear by including an alias in the query, like this: explain verbose select * into table ramp from road hwy where name ~ '.*Ramp'; I think you are both focusing on the wrong thing. There is a lot of squishiness in what EXPLAIN prints out, since SQL notation is not always well suited to what an execution plan actually does. But this code has a hard and fast requirement that it dump view definitions correctly, else pg_dump doesn't work. And after looking at this I think Ashutosh has in fact found a bug. Consider this example:

regression=# create schema s1;
CREATE SCHEMA
regression=# create schema s2;
CREATE SCHEMA
regression=# create table s1.t1 (f1 int);
CREATE TABLE
regression=# create table s2.t1 (f1 int);
CREATE TABLE
regression=# create view v1 as
regression-# select * from s1.t1 where exists (
regression(# select 1 from s2.t1 where s2.t1.f1 = s1.t1.f1
regression(# );
CREATE VIEW
regression=# \d+ v1
              View "public.v1"
 Column |  Type   | Modifiers | Storage | Description
--------+---------+-----------+---------+-------------
 f1     | integer |           | plain   |
View definition:
 SELECT t1.f1
   FROM s1.t1
  WHERE (EXISTS ( SELECT 1
           FROM s2.t1
          WHERE t1.f1 = s1.t1.f1));
regression=# alter table s2.t1 rename to tx;
ALTER TABLE
regression=# \d+ v1
              View "public.v1"
 Column |  Type   | Modifiers | Storage | Description
--------+---------+-----------+---------+-------------
 f1     | integer |           | plain   |
View definition:
 SELECT t1.f1
   FROM s1.t1
  WHERE (EXISTS ( SELECT 1
           FROM s2.tx t1
          WHERE t1.f1 = s1.t1.f1));

Both of the above displays of the view are formally correct, in that the variables will be taken to refer to the correct upper or lower RTE.
But let's change that back and rename the other table:

regression=# alter table s2.tx rename to t1;
ALTER TABLE
regression=# alter table s1.t1 rename to tx;
ALTER TABLE
regression=# \d+ v1
              View "public.v1"
 Column |  Type   | Modifiers | Storage | Description
--------+---------+-----------+---------+-------------
 f1     | integer |           | plain   |
View definition:
 SELECT t1.f1
   FROM s1.tx t1
  WHERE (EXISTS ( SELECT 1
           FROM s2.t1
          WHERE t1.f1 = s1.t1.f1));

This is just plain wrong, as you'll see if you try to execute that query:

regression=# SELECT t1.f1
regression-#    FROM s1.tx t1
regression-#  WHERE (EXISTS ( SELECT 1
regression(#           FROM s2.t1
regression(#          WHERE t1.f1 = s1.t1.f1));
ERROR: invalid reference to FROM-clause entry for table "t1"
LINE 5: WHERE t1.f1 = s1.t1.f1));
        ^
HINT: There is an entry for table "t1", but it cannot be referenced from this part of the query.

(The HINT is a bit confused here, but the query is certainly invalid.) So what we have here is a potential failure to dump and reload view definitions, which is a lot more critical in my book than whether EXPLAIN's output is confusing. If we stick with the existing rule for attaching a fake alias to renamed RTEs, I think that Ashutosh's patch or something like it is probably appropriate, because the variable-printing code ought to be in step with the RTE-printing code. Unfortunately, I think the hack to attach a fake alias to renamed RTEs creates some issues of its own. Consider

select * from s1.t1 where exists (select 1 from s2.t2 t1 where t1.f1 = s1.t1.f1);

If s1.t1 is now renamed to s1.tx, it is still possible to express the same semantics:

select * from s1.tx where exists (select 1 from s2.t2 t1 where t1.f1 = s1.tx.f1);

But when we attach a fake alias, it's broken:

select * from s1.tx t1 where exists (select 1 from s2.t2 t1 where t1.f1 = ?.f1);

There is no way to reference the outer RTE anymore from the subquery, because the conflicting lower alias masks it.
We may be between a rock and a hard place though, because it's not that
hard to demonstrate cases where not adding a fake alias breaks it too:

	select * from s1.t1 tx where exists
	  (select 1 from s2.t1 where s2.t1.f1 = tx.f1);

If s2.t1 is renamed to s2.tx, there's no longer any way to reference
the upper alias tx, unless you alias the lower RTE to some different
name.  I think that when we put in the fake-alias behavior, we made a
value judgment that this type of situation was more common than the
other, but I'm not really sure why.

Maybe what we need to do instead is create totally-made-up, unique
aliases when something like this
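To make the unique-alias idea concrete, here is one hypothetical deparse of the earlier problem case. The alias name t1_0 is invented for illustration; this is not what ruleutils.c currently emits:

```sql
-- s1.t1 has been renamed to s1.tx.  Instead of reusing the old name
-- "t1" as the fake alias (which the lower alias would then mask),
-- generate an alias that collides with nothing else in the query:
SELECT t1_0.f1
  FROM s1.tx t1_0
 WHERE (EXISTS ( SELECT 1
                   FROM s2.t2 t1
                  WHERE t1.f1 = t1_0.f1));
```

Both the correlated outer reference and the lower alias stay expressible, at the cost of output that resembles the original query text even less.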
Re: [HACKERS] Unreliable pg_ctl -w start again
MauMau maumau...@gmail.com writes:
> From: Tom Lane t...@sss.pgh.pa.us
>> Looks like complete nonsense to me, if the goal is to behave sanely
>> when postmaster.pid hasn't been created yet.  Where do you think
>> get_pgpid gets the PID from?
> Yes, I understand that get_pgpid() gets the pid from postmaster.pid,
> which may be the pid of the previous postmaster that did not stop
> cleanly.  [ convoluted reasoning about what to do if that's the case ]

I don't see any point in worrying about that case when you can't handle
the basic case that the postmaster hasn't created postmaster.pid yet.

In any case, this does nothing at all to answer the question you posed,
which was how long is it reasonable to wait for the postmaster to
produce a new postmaster.pid file.  We really need to know the PID of
the process we started in order to make any real improvement here.

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Temp file missing during large pgbench data set
Thom Brown t...@linux.com writes:
> I'm using latest git master (latest entry
> 0816fad6eebddb8f1f0e21635e46625815d690b9) and I'm getting an error
> when trying to create a large data set with pgbench:
> LOG:  could not stat file "base/pgsql_tmp/pgsql_tmp8056.0": Success
> STATEMENT:  alter table pgbench_accounts add primary key (aid)

I'll bet lunch this is due to some bug in commit
bc3347484a7bf9eddb98e4352d84599cae9a31c6.

			regards, tom lane
Re: [HACKERS] cursors FOR UPDATE don't return most recent row
Alvaro Herrera alvhe...@alvh.no-ip.org writes:
> I expected the FETCH to return one row, with the latest data, i.e.
> (1, 3), but instead it's returning empty.

This is the same thing I was complaining about in the bug #6123 thread,
http://archives.postgresql.org/message-id/9698.1327266...@sss.pgh.pa.us
It looks a bit ticklish to fix.

			regards, tom lane
Re: [HACKERS] initdb and fsync
On Fri, Jan 27, 2012 at 04:19:41PM -0800, Jeff Davis wrote:
> It looks like initdb doesn't fsync all the files it creates, e.g. the
> PG_VERSION file.  While it's unlikely that it would cause any real
> data loss, it can be inconvenient in some testing scenarios involving
> VMs.  Thoughts?  Would a patch to add a few fsync calls to initdb be
> accepted?

+1.  If I'm piloting "strace -f" right, initdb currently issues *no*
syncs.  We'd probably, then, want a way to re-disable the fsyncs for
hacker benefit.  Is a platform-independent fsync available at initdb
time?  Not sure.

Thanks,
nm
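The few fsync calls being proposed could look something like the helper below. This is a minimal sketch assuming POSIX open()/fsync(); the function name fsync_file is invented for illustration and is not initdb's actual code:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Open a file and force its contents to stable storage.
 * Returns 0 on success, -1 on failure (with a message on stderr).
 * Hypothetical helper: initdb would call this on files such as
 * PG_VERSION after writing them.
 */
static int
fsync_file(const char *fname)
{
	int		fd = open(fname, O_RDONLY);

	if (fd < 0)
	{
		perror(fname);
		return -1;
	}
	if (fsync(fd) != 0)
	{
		perror(fname);
		close(fd);
		return -1;
	}
	return close(fd);
}
```

A command-line switch could then skip these calls entirely, matching the "re-disable for hacker benefit" wish above.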
Re: [HACKERS] Unreliable pg_ctl -w start again
From: Tom Lane t...@sss.pgh.pa.us
> I don't see any point in worrying about that case when you can't
> handle the basic case that the postmaster hasn't created
> postmaster.pid yet.  In any case, this does nothing at all to answer
> the question you posed, which was how long is it reasonable to wait
> for the postmaster to produce a new postmaster.pid file.  We really
> need to know the PID of the process we started in order to make any
> real improvement here.

I'm sorry, but I may be missing the point of discussion. Let me clarify
my point again.

1. To be honest, I wish for a real solution that can eliminate the
problem completely, for example, by using the SIGCHLD approach Tom-san
proposed before, or by using unnamed pipes as I proposed last year.
However, the real solution seems complicated to back-port to past
versions.

2. As you said, we need any real improvement, if not a perfect
solution, that can lower the possibility of the problem. Doing
something which can somehow mitigate the problem is better than doing
nothing and leaving the problem.

3. How long is it reasonable to wait for the postmaster to produce a
new postmaster.pid file? No duration is reasonable.

4. Then, what can we do to mitigate the problem in the past versions
(8.3, 8.4, 9.0)? My very simple fix works well in the cases where an
old postmaster.pid file exists. But it does not make any improvement in
the cases where an old postmaster.pid is not left (i.e., running pg_ctl
after a clean shutdown; the basic case).

Am I missing anything? Does the fix result in degradation in some
cases? Otherwise, is Tom-san saying:

"We will be able to solve the remaining problems completely if we can
get the pid of the postmaster started in pg_ctl. We thought that it was
not easy to get the pid during 9.1 development. However, we might be
able to get it more easily than we thought, such as by fork-execing and
making the child process exec the shell to make the new shell exec the
postmaster. So let's pursue the real solution here."

If so, I agree.
But if we find the real solution difficult, I hope my imperfect but
somewhat effective solution will be accepted.

Regards
MauMau
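For what it's worth, the fork-exec idea quoted above can be sketched as below. This is a minimal illustration assuming POSIX fork()/execv(); spawn_direct is an invented name, not pg_ctl's actual code. The point is that, unlike system(), a direct fork gives the parent the child's PID immediately, which it can then wait on or poll instead of guessing a timeout for postmaster.pid:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Launch a program directly with fork/exec instead of going through
 * system(), so the caller learns the child's PID right away.
 * Returns the child PID, or -1 if fork() fails.
 * Hypothetical sketch: pg_ctl would exec the postmaster here
 * (possibly via a shell that exec's it in turn, keeping the same PID).
 */
static pid_t
spawn_direct(const char *path, char *const argv[])
{
	pid_t	pid = fork();

	if (pid == 0)
	{
		execv(path, argv);
		_exit(127);			/* reached only if exec failed */
	}
	return pid;
}
```

With the PID in hand, pg_ctl could distinguish "postmaster still starting" from "postmaster died" by checking whether the process is alive, rather than waiting on the pid file alone.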