Re: [HACKERS] [v9.5] Custom Plan API
> On 7 May 2014 02:05, Kouhei Kaigai wrote:
> > Prior to the development cycle towards v9.5, I'd like to reopen the discussion of the custom-plan interface. Even though we had lots of discussion during the last three commit-fests, several issues are still under discussion. So, I'd like to clarify the direction of the implementation prior to the first commit-fest.
> >
> > (1) DDL support and system catalog
> >
> > Simon suggested that a DDL command should be supported to track custom-plan providers being installed, and to avoid nonsense hook calls when it is obvious that no custom-plan provider can help. It also makes sense to give a chance to load extensions once installed. (In the previous design, I assumed modules are loaded by the LOAD command or *_preload_libraries parameters.)
> >
> > I tried to implement the following syntax:
> >
> >     CREATE CUSTOM PLAN FOR (scan|join|any) HANDLER ;
>
> Thank you for exploring that thought and leading the way on this research. I've been thinking about this also.
>
> What I think we need is a declarative form that expresses the linkage between base table(s) and related data structures that can be used to optimize a query, while still providing accurate results.
>
> In other DBMSs, we have concepts such as a JoinIndex or a MatView which allow some kind of lookaside behaviour. Just for clarity, a concrete example is Oracle's Materialized Views, which can be set using ENABLE QUERY REWRITE so that the MatView can be used as an alternative path for a query. We do already have this concept in PostgreSQL, where an index can be used to perform an IndexOnlyScan rather than accessing the heap itself.
>
> We have considerable evidence that the idea of alternate data structures results in performance gains:
> * KaiGai's work - https://wiki.postgresql.org/wiki/PGStrom
> * http://www.postgresql.org/message-id/52c59858.9090...@garret.ru
> * http://citusdata.github.io/cstore_fdw/
> * University of Manchester - exploring GPUs as part of the AXLE project
> * Barcelona SuperComputer Centre - exploring FPGAs, as part of the AXLE project
> * Some other authors have also cited gains using GPU technology in databases
>
> So I would like to have a mechanism that provides a *generic* lookaside for a table or foreign table.
>
> Tom and Kevin have previously expressed that MatViews would represent a planning problem in the general case. One way to solve that planning issue is to link structures directly together, in the same way that an index and a table are linked. We can then process the lookaside in the same way we handle a partial index: check prerequisites and, if usable, calculate a cost for the alternate path. We need not add planning time other than for the tables that might benefit from it.
>
> Roughly, I'm thinking of this...
>
>     CREATE LOOKASIDE ON foo
>        TO foo_mat_view;
>
> and also this...
>
>     CREATE LOOKASIDE ON foo
>        TO foo_as_a_foreign_table /* e.g. PGStrom */
>
> This would allow the planner to consider alternate plans for foo_mat_view during set_plain_rel_pathlist(), similarly to the way it considers index paths, in one of the common cases that the mat view covers just one table.
>
> This concept is similar to ENABLE QUERY REWRITE in Oracle, but this thought goes much further, to include any generic user-defined data structure or foreign table.

Let me clarify: this mechanism allows adding alternative scan/join paths, including built-in ones, not only custom enhanced plan/exec nodes, doesn't it? Probably, it is a variation of the above proposition if we install a handler function that proposes built-in path nodes in response to the request for a scan/join.

> Do we need this? For MVs, we *might* be able to deduce that the MV is rewritable for "foo", but that is not deducible for Foreign Tables, by current definition, so I prefer the explicit definition of objects that are linked - since doing this for indexes is already familiar to people.
>
> Having an explicit linkage between data structures allows us to enhance an existing application by transparently adding new structures, just as we already do with indexes. Specifically, that we allow more than one lookaside structure on any one table.

Not only alternative data structures: an alternative method to scan/join the same data structure is also important, isn't it?

> Forget the exact name, that's not important. But I think the requirements here are...
>
> * Explicit definition that we are attaching an alternate path onto a table (conceptually similar to adding an index)

I think the syntax should allow "tables", not only a particular table. It will inform the core planner that this lookaside/custom-plan (the name is not important; anyway, this feature...) can provide an alternative path for the set of relations being considered. So, it allows reducing the number of function calls at the planning stage.

> * Ability to check that the alternate path is via
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 6 May 2014 17:55, Andres Freund wrote:

>> All this changes is the cost of IndexScans that would use more than 25% of shared_buffers worth of data. Hopefully not many of those in your workload. Changing the cost doesn't necessarily prevent index scans either. And if there are many of those in your workload AND you run more than one at the same time, then the larger setting will work against you. So the benefit window for such a high setting is slim, at best.
>
> Why? There's many workloads where indexes are larger than shared buffers but fit into the operating system's cache. And that's precisely what effective_cache_size is about.
> Especially on bigger machines shared_buffers can't be set big enough to actually use all the machine's memory. It's not uncommon to have 4GB shared buffers on a machine with 512GB RAM... It'd be absolutely disastrous to set effective_cache_size to 1GB for an analytics workload.

In this case, a setting of effective_cache_size > (4 * shared_buffers) could be appropriate, as long as we are certain we have the memory. We don't have any stats on peak memory usage to be certain - although in that case it's pretty clear. If we had stats on how effective the indexscan was at multiply hitting earlier-read blocks, we'd be able to autotune, but I accept that without that we do still need the parameter.

>> I specifically picked 25% of shared_buffers because that is the point at which sequential scans become more efficient and use the cache more efficiently. If our cost models are correct, then switching away from index scans shouldn't hurt at all.
>
> More often than not indexes are smaller than the table size, so this argument doesn't seem to make much sense.

If we believe that 25% of shared_buffers worth of heap blocks would flush the cache doing a SeqScan, why should we allow 400% of shared_buffers worth of index blocks? In your example, that would be 1GB of heap blocks, or 16GB of index blocks.

If our table is 100GB with a 32GB index, then yes, that is 1% of the heap and 50% of the index. But that doesn't matter, since I am discussing the point at which we prevent the cache being churned.

Given your example, we do not allow a SeqScan of a table larger than 1GB to flush cache, since we use BAS_BULKREAD. If we allow an indexscan plan that will touch 16GB of an index, that will very clearly flush out our 4GB of shared_buffers, increasing time for later queries even if they only have to read from OS buffers back into shared_buffers. That will still show itself as a CPU spike, which is what people say they are seeing.

I think I'm arguing myself towards using a BufferAccessStrategy of BAS_BULKREAD for large IndexScans, BitmapIndexScans and BitmapHeapScans. Yes, we can make plans assuming we can use the OS cache, but we shouldn't be churning shared_buffers when we execute those plans. "Large" here means the same thing as it does for SeqScans: a scan that seems likely to touch more than 25% of shared buffers. I'll work up a patch.

Perhaps it would also be useful to consider using a sequential scan of the index relation for less selective BitmapIndexScans, just as we do very effectively during VACUUM. Maybe that is a better idea than bitmap indexes.

--
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
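[For readers who want to see what that patch would mean mechanically, here is a minimal sketch (an editorial illustration, not the promised patch): GetAccessStrategy() and ReadBufferExtended() are the real buffer APIs, while the placement and the "is this scan large?" test are assumptions.]

    /*
     * Sketch only: give a large index scan a BAS_BULKREAD ring buffer so it
     * recycles a small set of buffers instead of churning shared_buffers.
     * Where the strategy is created, and how estimated_blocks is obtained,
     * are assumptions.
     */
    BufferAccessStrategy strategy = NULL;

    if (estimated_blocks > NBuffers / 4)    /* same 25% rule as seqscans */
        strategy = GetAccessStrategy(BAS_BULKREAD);

    buf = ReadBufferExtended(indexRelation, MAIN_FORKNUM, blkno,
                             RBM_NORMAL, strategy);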
[HACKERS] 9.4 checksum errors in recovery with gin index
When recovering from a crash (with injection of a partial page write at the time of crash) against 7c7b1f4ae5ea3b1b113682d4d I get a checksum verification failure. 16396 is a gin index. If I have it ignore checksum failures, there is no apparent misbehavior. I'm trying to bisect it, but it could take a while and I thought someone might have some theories based on the log:

29075 2014-05-06 23:29:51.411 PDT:LOG:  00000: database system was not properly shut down; automatic recovery in progress
29075 2014-05-06 23:29:51.411 PDT:LOCATION:  StartupXLOG, xlog.c:6361
29075 2014-05-06 23:29:51.412 PDT:LOG:  00000: redo starts at 11/323FE1C0
29075 2014-05-06 23:29:51.412 PDT:LOCATION:  StartupXLOG, xlog.c:6600
29075 2014-05-06 23:29:51.471 PDT:WARNING:  01000: page verification failed, calculated checksum 35967 but expected 7881
29075 2014-05-06 23:29:51.471 PDT:CONTEXT:  xlog redo Delete list pages (16), node: 1663/16384/16396 blkno: 0
29075 2014-05-06 23:29:51.471 PDT:LOCATION:  PageIsVerified, bufpage.c:145
29075 2014-05-06 23:29:51.471 PDT:FATAL:  XX001: invalid page in block 28486 of relation base/16384/16396
29075 2014-05-06 23:29:51.471 PDT:CONTEXT:  xlog redo Delete list pages (16), node: 1663/16384/16396 blkno: 0
29075 2014-05-06 23:29:51.471 PDT:LOCATION:  ReadBuffer_common, bufmgr.c:483
27799 2014-05-06 23:29:51.473 PDT:LOG:  00000: startup process (PID 29075) exited with exit code 1
27799 2014-05-06 23:29:51.473 PDT:LOCATION:  LogChildExit, postmaster.c:3281
27799 2014-05-06 23:29:51.473 PDT:LOG:  00000: aborting startup due to startup process failure

Cheers,

Jeff
Re: [HACKERS] 9.4 checksum errors in recovery with gin index
Hi,

On 2014-05-07 00:35:35 -0700, Jeff Janes wrote:
> When recovering from a crash (with injection of a partial page write at time of crash) against 7c7b1f4ae5ea3b1b113682d4d I get a checksum verification failure.
>
> 16396 is a gin index.

Over which type? What was the load? make check?

> If I have it ignore checksum failures, there is no apparent misbehavior. I'm trying to bisect it, but it could take a while and I thought someone might have some theories based on the log:

If you have the WAL, a pg_xlogdump grepping for everything referring to that block would be helpful.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
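[Something along these lines should pull out the relevant records, assuming 9.4's pg_xlogdump options; the path and the grep pattern are illustrative, not from the thread.]

    pg_xlogdump -p $PGDATA/pg_xlog -s 11/323FE1C0 -r Gin | grep 16396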
Re: [HACKERS] postgresql.auto.conf read from wrong directory
On Tue, May 6, 2014 at 7:04 PM, Christoph Berg wrote:
> Hi,
>
> if you split configuration and data by setting data_directory, postgresql.auto.conf is written to the data directory (/var/lib/postgresql/9.4/main in Debian), but tried to be read from the etc directory (/etc/postgresql/9.4/main).
>
> One place to fix it would be in ProcessConfigFile in src/backend/utils/misc/guc-file.l:162 by always setting CallingFileName = NULL, but that breaks later in AbsoluteConfigLocation() when data_directory is NULL. (As the comment in ProcessConfigFile says.)

This problem occurs because we don't have the value of data_directory set in postgresql.conf by the time we want to parse the .auto.conf file during server start. The value of data_directory is only available after processing of the config files. To fix it, we need to store the value of data_directory during the parse of the postgresql.conf file so that we can use it until data_directory is actually set. The attached patch fixes the problem. Could you please confirm whether it fixes the problem in your environment/scenario?

Another way to fix it could be to set the config value of data_directory directly during the parse, but I didn't prefer that way because the parse routine is called from multiple paths which later on process the values separately.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

fix_processing_auto_conf-v1.patch
Description: Binary data
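[The shape of that idea, as a heavily hedged sketch rather than the attached patch: while guc-file.l parses postgresql.conf, remember data_directory so postgresql.auto.conf can be located before the real GUC has been assigned. ConfigVariable and PG_AUTOCONF_FILENAME are real; saved_data_directory and the placement are invented.]

    /* Sketch only, not the attached patch. */
    static char *saved_data_directory = NULL;   /* hypothetical */

    for (item = head; item != NULL; item = item->next)
    {
        if (strcmp(item->name, "data_directory") == 0)
            saved_data_directory = pstrdup(item->value);
    }

    /* ... later, resolve PG_AUTOCONF_FILENAME against saved_data_directory
     * whenever the data_directory GUC has not been set yet ... */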
Re: [HACKERS] [v9.5] Custom Plan API
On 7 May 2014 08:17, Kouhei Kaigai wrote:

> Let me clarify: this mechanism allows adding alternative scan/join paths, including built-in ones, not only custom enhanced plan/exec nodes, doesn't it? Probably, it is a variation of the above proposition if we install a handler function that proposes built-in path nodes in response to the request for a scan/join.

Yes, I am looking for a way to give you the full extent of your requirements, within the Postgres framework. I have time and funding to assist you in achieving this in a general way that all may make use of.

> Not only alternative data structures: an alternative method to scan/join the same data structure is also important, isn't it?

Agreed. My proposal is that if the planner allows the lookaside to an FDW then we pass the query for full execution on the FDW. That means that the scan, aggregate and join could take place via the FDW. i.e.

    "Custom Plan" == lookaside + FDW

Or put another way, if we add lookasides then we can just plug in the PGStrom FDW directly and we're done. And everybody else's FDW will work as well, so Citus etc. will not need to recode.

--
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On 05/06/2014 11:30 PM, Peter Geoghegan wrote:
> On Tue, May 6, 2014 at 12:48 PM, Andres Freund wrote:
>> Enthusiastically seconded. I've asked for that about three times without much success. If it had been my decision the patch wouldn't have been merged without that and other adjustments.
>
> I'm almost certain that the only feedback of yours that I didn't incorporate was that I didn't change the name of JsonbValue, a decision I stand by, and also that I didn't add ASCII art to illustrate the on-disk format. I can write a patch that adds the latter soon.

That would be great.

I found the serialization routine, convertJsonb(), to be a bit of a mess. It's maintaining a custom stack of levels, which can be handy if you need to avoid recursion, but it's also relying on the native stack. And I didn't understand the point of splitting it into the "walk" and "put" functions; the division of labor between the two was far from clear IMHO. I started refactoring that, and ended up with the attached.

One detail that I found scary is that the estSize field in JsonbValue is not just any rough estimate. It's used in the allocation of the output buffer for convertJsonb(), so it has to be large enough or you hit an assertion or buffer overflow. I believe it was correct as it was, but that kind of programming is always scary. I refactored the convertJsonb() function to use a StringInfo buffer instead, and removed estSize altogether.

This is still work-in-progress, but I thought I'd post this now to let people know I'm working on it. For example, the StringInfo isn't actually very well suited for this purpose; it might be better to have a custom buffer that's enlarged when needed.

For my own sanity, I started writing some docs on the on-disk format. See the comments in jsonb.h for my understanding of it. I moved around the structs a bit in jsonb.h, to make the format clearer, but the actual on-disk format is unchanged.
- Heikki

diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index cf5d6f2..8413da7 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -65,7 +65,7 @@ jsonb_recv(PG_FUNCTION_ARGS)
 	if (version == 1)
 		str = pq_getmsgtext(buf, buf->len - buf->cursor, &nbytes);
 	else
-		elog(ERROR, "Unsupported jsonb version number %d", version);
+		elog(ERROR, "unsupported jsonb version number %d", version);
 
 	return jsonb_from_cstring(str, nbytes);
 }
@@ -249,7 +249,6 @@ jsonb_in_object_field_start(void *pstate, char *fname, bool isnull)
 	v.type = jbvString;
 	v.val.string.len = checkStringLen(strlen(fname));
 	v.val.string.val = pnstrdup(fname, v.val.string.len);
-	v.estSize = sizeof(JEntry) + v.val.string.len;
 
 	_state->res = pushJsonbValue(&_state->parseState, WJB_KEY, &v);
 }
@@ -290,8 +289,6 @@ jsonb_in_scalar(void *pstate, char *token, JsonTokenType tokentype)
 	JsonbInState *_state = (JsonbInState *) pstate;
 	JsonbValue	v;
 
-	v.estSize = sizeof(JEntry);
-
 	switch (tokentype)
 	{
@@ -300,7 +297,6 @@ jsonb_in_scalar(void *pstate, char *token, JsonTokenType tokentype)
 			v.type = jbvString;
 			v.val.string.len = checkStringLen(strlen(token));
 			v.val.string.val = pnstrdup(token, v.val.string.len);
-			v.estSize += v.val.string.len;
 			break;
 		case JSON_TOKEN_NUMBER:
@@ -312,7 +308,6 @@ jsonb_in_scalar(void *pstate, char *token, JsonTokenType tokentype)
 			v.type = jbvNumeric;
 			v.val.numeric = DatumGetNumeric(DirectFunctionCall3(numeric_in, CStringGetDatum(token), 0, -1));
-			v.estSize += VARSIZE_ANY(v.val.numeric) + sizeof(JEntry) /* alignment */ ;
 			break;
 		case JSON_TOKEN_TRUE:
 			v.type = jbvBool;
diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index 1caaa4a..49a1d4d 100644
--- a/src/backend/utils/adt/jsonb_util.c
+++ b/src/backend/utils/adt/jsonb_util.c
@@ -37,49 +37,16 @@
 #define JSONB_MAX_PAIRS (Min(MaxAllocSize / sizeof(JsonbPair), \
 							 JENTRY_POSMASK))
 
-/*
- * State used while converting an arbitrary JsonbValue into a Jsonb value
- * (4-byte varlena uncompressed representation of a Jsonb)
- *
- * ConvertLevel: Bookkeeping around particular level when converting.
- */
-typedef struct convertLevel
-{
-	uint32		i;			/* Iterates once per element, or once per pair */
-	uint32	   *header;		/* Pointer to current container header */
-	JEntry	   *meta;		/* This level's metadata */
-	char	   *begin;		/* Pointer into convertState.buffer */
-} convertLevel;
-
-/*
- * convertState: Overall bookkeeping state for conversion
- */
-typedef struct convertState
-{
-	/* Preallocated buffer in which to form varlena/Jsonb value */
-	Jsonb	   *buffer;
-	/* Pointer into buffer */
-	char	   *ptr;
-
-	/* State for */
-	convertLevel *allState,	/* Overall state array */
-			   *contPtr;	/* Cur container pointer (in allState) */
-
-	/* Current size of buffer containing allState array */
-	Size		levelSz;
-
-} convertState;
-
 static int	compareJsonbScalarValue(JsonbValue *a, Json
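[For readers unfamiliar with the buffer Heikki switches to: the StringInfo pattern grows the buffer on demand, which is what makes the up-front estSize estimate unnecessary. A sketch, not code from the patch; header, data and datalen are placeholders.]

    StringInfoData buf;

    initStringInfo(&buf);
    appendBinaryStringInfo(&buf, (char *) &header, sizeof(header));
    appendBinaryStringInfo(&buf, data, datalen);
    /* buf.data / buf.len now hold the serialized value */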
[HACKERS] PGDLLEXPORTing all GUCs?
Hi all,

As part of development on BDR, the issue of GUC globals not being PGDLLEXPORTed has been run into a few times. Is there any reason _not_ to PGDLLEXPORT all GUCs, other than cosmetic concerns?

Barring objections I'll post a patch to do this tomorrow.

--
Craig Ringer    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [v9.5] Custom Plan API
> -----Original Message-----
> From: Simon Riggs [mailto:si...@2ndquadrant.com]
> Sent: Wednesday, May 07, 2014 5:02 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Tom Lane; Robert Haas; Andres Freund; PgHacker; Stephen Frost; Shigeru Hanada; Jim Mlodgenski; Peter Eisentraut; Kohei KaiGai
> Subject: Re: [v9.5] Custom Plan API
>
> On 7 May 2014 08:17, Kouhei Kaigai wrote:
>
> > Let me clarify: this mechanism allows adding alternative scan/join paths, including built-in ones, not only custom enhanced plan/exec nodes, doesn't it? Probably, it is a variation of the above proposition if we install a handler function that proposes built-in path nodes in response to the request for a scan/join.
>
> Yes, I am looking for a way to give you the full extent of your requirements, within the Postgres framework. I have time and funding to assist you in achieving this in a general way that all may make use of.
>
> > Not only alternative data structures: an alternative method to scan/join the same data structure is also important, isn't it?
>
> Agreed. My proposal is that if the planner allows the lookaside to an FDW then we pass the query for full execution on the FDW. That means that the scan, aggregate and join could take place via the FDW. i.e.
> "Custom Plan" == lookaside + FDW
>
> Or put another way, if we add lookasides then we can just plug in the PGStrom FDW directly and we're done. And everybody else's FDW will work as well, so Citus etc. will not need to recode.

Hmm. That sounds to me like you intend to make the FDW perform as a central facility to host pluggable plan/exec stuff. Even though we have several things to be clarified, I also think it's a direction worth investigating.

Let me list up the things to be clarified / developed, in no particular order:

* Join replacement by FDW: we still don't have consensus about join replacement by FDW. Probably, it will be designed for remote-join implementation primarily; however, the things to do are similar. We may need to revisit Hanada-san's proposition from the past.

* Lookaside for ANY relations: I want the planner to try GPU-scan for any relations once installed, to reduce the user's administration cost. It needs the lookaside to allow specifying a particular foreign-server, not a foreign-table, and then create a ForeignScan node that is not associated with a particular foreign-table.

* ForeignScan node that is not associated with a particular foreign-table: once we try to apply a ForeignScan node instead of Sort or Aggregate, the existing FDW implementation needs to be improved. These nodes scan a materialized relation (generated on the fly); however, existing FDW code assumes a ForeignScan node is always associated with a particular foreign-table. We need to eliminate this restriction.

* FDW method for MultiExec: in case we can stack multiple ForeignScan nodes, it's helpful to support exchanging scanned tuples in their own data format. Let's assume two ForeignScan nodes are stacked: one performs like Sort, another performs like Scan. If they internally handle a column-oriented data format, TupleTableSlot is not the best way for data exchange.

* Lookaside on INSERT/UPDATE/DELETE: probably, it can be implemented using the writable FDW feature. Not a big issue, but don't forget it...

How about your opinion?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei
Re: [HACKERS] [v9.5] Custom Plan API
On 7 May 2014 10:06, Kouhei Kaigai wrote:

> Let me list up the things to be clarified / developed, in no particular order:
>
> * Join replacement by FDW: we still don't have consensus about join replacement by FDW. Probably, it will be designed for remote-join implementation primarily; however, the things to do are similar. We may need to revisit Hanada-san's proposition from the past.

Agreed. We need to push down joins into FDWs and we need to push down aggregates also, so they can be passed to FDWs. I'm planning to look at aggregate push-down.

> * Lookaside for ANY relations: I want the planner to try GPU-scan for any relations once installed, to reduce the user's administration cost. It needs the lookaside to allow specifying a particular foreign-server, not a foreign-table, and then create a ForeignScan node that is not associated with a particular foreign-table.

IMHO we would not want to add indexes to every column on every table, nor would we wish to use lookaside for all tables. It is a good thing to be able to add optimizations for individual tables. GPUs are not good for everything; it is good to be able to leverage their strengths, yet avoid their weaknesses. If you do want that, you can write an Event Trigger that automatically adds a lookaside for any table.

> * ForeignScan node that is not associated with a particular foreign-table: once we try to apply a ForeignScan node instead of Sort or Aggregate, the existing FDW implementation needs to be improved. These nodes scan a materialized relation (generated on the fly); however, existing FDW code assumes a ForeignScan node is always associated with a particular foreign-table. We need to eliminate this restriction.

I don't think we need to do that, given the above.

> * FDW method for MultiExec: in case we can stack multiple ForeignScan nodes, it's helpful to support exchanging scanned tuples in their own data format. Let's assume two ForeignScan nodes are stacked: one performs like Sort, another performs like Scan. If they internally handle a column-oriented data format, TupleTableSlot is not the best way for data exchange.

I agree TupleTableSlot may not be the best way for bulk data movement. We probably need to look at buffering/bulk movement between executor nodes in general, which would be of benefit for the FDW case also. This would be a problem even for Custom Scans as originally presented, so I don't see much change there.

> * Lookaside on INSERT/UPDATE/DELETE: probably, it can be implemented using the writable FDW feature. Not a big issue, but don't forget it...

Yes, possible.

I hope these ideas make sense. This is early days and there may be other ideas and much detail yet to come.

--
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
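[A concrete sketch of that event-trigger idea, hypothetical on two counts: CREATE LOOKASIDE is itself only a proposal, and the trigger body is a stub. The CREATE EVENT TRIGGER syntax shown is the real 9.3/9.4 syntax.]

    -- fire after every CREATE TABLE and attach a default lookaside
    CREATE FUNCTION add_default_lookaside() RETURNS event_trigger AS $$
    BEGIN
        -- would issue CREATE LOOKASIDE ... for the newly created table here
        RAISE NOTICE 'would add lookaside for new table';
    END;
    $$ LANGUAGE plpgsql;

    CREATE EVENT TRIGGER lookaside_on_create
        ON ddl_command_end
        WHEN TAG IN ('CREATE TABLE')
        EXECUTE PROCEDURE add_default_lookaside();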
Re: [HACKERS] postgresql.auto.conf read from wrong directory
Re: Amit Kapila 2014-05-07
> This problem occurs because we don't have the value of data_directory set in postgresql.conf by the time we want to parse the .auto.conf file during server start. The value of data_directory is only available after processing of the config files. To fix it, we need to store the value of data_directory during the parse of the postgresql.conf file so that we can use it until data_directory is actually set. The attached patch fixes the problem. Could you please confirm whether it fixes the problem in your environment/scenario?

Hi Amit,

the patch fixes the problem. Thanks for looking into this.

Christoph
--
c...@df7cb.de | http://www.df7cb.de/
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
So, apart from cleaning up the code, we really need to take a close look at the on-disk format now. The code can be cleaned up later, too, but we're going to be stuck with the on-disk format forever, so it's critical to get that right.

First, a few observations:

* JENTRY_ISFIRST is redundant. Whenever you deal with the JEntry struct, you know from the context which element in the array it is.

* JENTRY_ISNEST is set but never used.

* JENTRY_ISBOOL is defined as (JENTRY_ISNUMERIC | JENTRY_ISNEST), which seems confusing.

I'm going to proceed refactoring those things, which will change the on-disk format. It's late in the release cycle - these things really should've been cleaned up earlier - but it's important to get the on-disk format right. Shout if you have any objections.

- Heikki
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On Wed, May 7, 2014 at 8:20 PM, Heikki Linnakangas wrote:
> So, apart from cleaning up the code, we really need to take a close look at the on-disk format now. The code can be cleaned up later, too, but we're going to be stuck with the on-disk format forever, so it's critical to get that right.
>
> First, a few observations:
>
> * JENTRY_ISFIRST is redundant. Whenever you deal with the JEntry struct, you know from the context which element in the array it is.
>
> * JENTRY_ISNEST is set but never used.
>
> * JENTRY_ISBOOL is defined as (JENTRY_ISNUMERIC | JENTRY_ISNEST), which seems confusing.
>
> I'm going to proceed refactoring those things, which will change the on-disk format. It's late in the release cycle - these things really should've been cleaned up earlier - but it's important to get the on-disk format right. Shout if you have any objections.

+1. It is saner to do that now than never.

--
Michael
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
Hi,

On 2014-05-07 14:20:19 +0300, Heikki Linnakangas wrote:
> So, apart from cleaning up the code, we really need to take a close look at the on-disk format now. The code can be cleaned up later, too, but we're going to be stuck with the on-disk format forever, so it's critical to get that right.

+1

> First, a few observations:

Agreed. I'd like to add that:

* Imo we need space in jsonb on-disk values to indicate a format version. We won't fully get this right.

* The JEntry representation should be changed so it's possible to get the type of an entry without checking individual type bits. That'll make code like findJsonbValueFromSuperHeader() (well, whatever you've renamed it to) much easier to read. Preferably so it can have the same values (after shifting/masking) as the JBE variants. And it needs space for further types (or representations thereof).

* I wonder if the hash/object pair representation is wise and if it shouldn't be keys combined with offsets first, and then the values. That will make access much faster. And that's what jsonb essentially is about.

* I think both arrays and hashes should grow individual structs. With space for additional flags.

* I have doubts about the wisdom of allowing jbvBinary values to be embedded in JsonbValues. Although that can be changed later since it's not on disk.

> I'm going to proceed refactoring those things, which will change the on-disk format. It's late in the release cycle - these things really should've been cleaned up earlier - but it's important to get the on-disk format right. Shout if you have any objections.

I don't think it's likely that beta1 will be binary compatible with the final version at this point.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
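[To make the second point concrete, here is a sketch of a directly maskable type field. The bit assignments are invented for illustration; they are not the real jsonb.h definitions nor a proposal for the actual values.]

    /* Sketch only: a 3-bit type code in the high bits lets callers switch
     * on JBE_TYPE(je) instead of testing individual flag bits. */
    #define JENTRY_POSMASK      0x0FFFFFFF
    #define JENTRY_TYPEMASK     0x70000000

    #define JENTRY_TYPE_STRING  0x00000000
    #define JENTRY_TYPE_NUMERIC 0x10000000
    #define JENTRY_TYPE_BOOL    0x20000000
    #define JENTRY_TYPE_NULL    0x30000000
    #define JENTRY_TYPE_NESTED  0x40000000

    #define JBE_TYPE(je)        ((je).header & JENTRY_TYPEMASK)
    #define JBE_OFF(je)         ((je).header & JENTRY_POSMASK)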
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 3:18 AM, Simon Riggs wrote:
> If we believe that 25% of shared_buffers worth of heap blocks would flush the cache doing a SeqScan, why should we allow 400% of shared_buffers worth of index blocks?

I think you're comparing apples and oranges. The 25% threshold is answering the question "How big does a sequential scan have to be before it's likely to flush so much unrelated data out of shared_buffers that it hurts the performance of other things running on the system?". So it's not really about whether or not things will *fit* in the cache, but rather a judgement about at what point caching that stuff is going to be of less value than continuing to cache other things. Also, it's specifically a judgement about shared_buffers, not system memory.

But effective_cache_size is used to estimate the likelihood that an index scan which accesses the same heap or index block twice will still be in cache on the second hit, and thus need to be faulted in only once. So this *is* a judgment about what will fit - generally over a very short time scale. And, since bringing a page into shared_buffers from the OS cache is much less expensive than bringing a page into memory from disk, it's really about what will fit in overall system memory, not just shared_buffers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
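[Concretely, effective_cache_size feeds the Mackert-Lohman approximation in costsize.c's index_pages_fetched(). A condensed sketch of that logic, simplified from the real function, which also pro-rates b across all the tables used by the query:]

    #include <math.h>

    /*
     * Simplified from costsize.c: estimate pages actually fetched when an
     * index scan touches some blocks repeatedly.  T = pages in the table,
     * b = the table's share of effective_cache_size, in pages; repeat
     * visits to still-cached pages are assumed free.
     */
    static double
    index_pages_fetched_sketch(double tuples_fetched, double T, double b)
    {
        double      pages_fetched;

        if (T <= b)
        {
            /* everything is assumed to stay cached */
            pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
            if (pages_fetched > T)
                pages_fetched = T;
        }
        else
        {
            /* cache smaller than table: fetches beyond lim miss again */
            double      lim = (2.0 * T * b) / (2.0 * T - b);

            if (tuples_fetched <= lim)
                pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
            else
                pages_fetched = b + (tuples_fetched - lim) * (T - b) / T;
        }
        return ceil(pages_fetched);
    }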
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On Wed, May 7, 2014 at 7:40 AM, Andres Freund wrote:
>> I'm going to proceed refactoring those things, which will change the on-disk format. It's late in the release cycle - these things really should've been cleaned up earlier - but it's important to get the on-disk format right. Shout if you have any objections.

+1.

> I don't think it's likely that beta1 will be binary compatible with the final version at this point.

I rather think we're not ready for beta1 at this point (but I expect to lose that argument).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On Wed, May 7, 2014 at 1:20 PM, Heikki Linnakangas wrote:
> So, apart from cleaning up the code, we really need to take a close look at the on-disk format now. The code can be cleaned up later, too, but we're going to be stuck with the on-disk format forever, so it's critical to get that right.
>
> First, a few observations:
>
> * JENTRY_ISFIRST is redundant. Whenever you deal with the JEntry struct, you know from the context which element in the array it is.
>
> * JENTRY_ISNEST is set but never used.
>
> * JENTRY_ISBOOL is defined as (JENTRY_ISNUMERIC | JENTRY_ISNEST), which seems confusing.
>
> I'm going to proceed refactoring those things, which will change the on-disk format. It's late in the release cycle - these things really should've been cleaned up earlier - but it's important to get the on-disk format right. Shout if you have any objections.

+1. It's now or never. If the on-disk format needs changing, not doing it now is going to leave us with a "jsonc" in the future... Better bite the bullet now.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On 2014-05-07 08:50:33 -0400, Robert Haas wrote:
> On Wed, May 7, 2014 at 7:40 AM, Andres Freund wrote:
>> I don't think it's likely that beta1 will be binary compatible with the final version at this point.
>
> I rather think we're not ready for beta1 at this point (but I expect to lose that argument).

Well, I guess it depends on what we define 'beta1' to be. Imo evaluating problematic pieces of new code and locating unfinished pieces is part of that. I don't see much point in forbidding incompatible changes in beta1 personally. That robs the development cycle of the only period where users can actually test the new version in a halfway sane manner and report back with things that are apparently broken.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On Wed, May 7, 2014 at 2:56 PM, Andres Freund wrote:
> On 2014-05-07 08:50:33 -0400, Robert Haas wrote:
>> On Wed, May 7, 2014 at 7:40 AM, Andres Freund wrote:
>>> I don't think it's likely that beta1 will be binary compatible with the final version at this point.
>>
>> I rather think we're not ready for beta1 at this point (but I expect to lose that argument).
>
> Well, I guess it depends on what we define 'beta1' to be. Imo evaluating problematic pieces of new code and locating unfinished pieces is part of that. I don't see much point in forbidding incompatible changes in beta1 personally. That robs the development cycle of the only period where users can actually test the new version in a halfway sane manner and report back with things that are apparently broken.

We need to be very careful to tell people about it though. Preferably, if we *know* a dump/reload will be needed to go beta1->beta2, we should actually document that in the release notes of beta1 already, so people can make proper plans.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On 2014-05-07 15:00:01 +0200, Magnus Hagander wrote:
> On Wed, May 7, 2014 at 2:56 PM, Andres Freund wrote:
>> Well, I guess it depends on what we define 'beta1' to be. [...] That robs the development cycle of the only period where users can actually test the new version in a halfway sane manner and report back with things that are apparently broken.
>
> We need to be very careful to tell people about it though. Preferably, if we *know* a dump/reload will be needed to go beta1->beta2, we should actually document that in the release notes of beta1 already, so people can make proper plans.

Yes, I think it actually makes sense to add that to *all* beta release notes. Even in beta2, although slightly weakened.

That's not a new thing btw. E.g. 9.3 had a catversion bump between beta1/2:

    git diff 09bd2acbe5ac866ce9..817a89423f429a6a8b -- src/include/catalog/catversion.h

The more interesting note probably is that there quite possibly won't be pg_upgrade'ability...

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On Wed, May 7, 2014 at 3:04 PM, Andres Freund wrote:
> On 2014-05-07 15:00:01 +0200, Magnus Hagander wrote:
>> We need to be very careful to tell people about it though. [...]
>
> Yes, I think it actually makes sense to add that to *all* beta release notes. Even in beta2, although slightly weakened.
>
> That's not a new thing btw. E.g. 9.3 had a catversion bump between beta1/2: git diff 09bd2acbe5ac866ce9..817a89423f429a6a8b -- src/include/catalog/catversion.h
>
> The more interesting note probably is that there quite possibly won't be pg_upgrade'ability...

Yeah, that's the big thing really. Requiring pg_upgrade between betas might even be "good", in the sense that then we get more testing of pg_upgrade :) But requiring a dump/reload is going to hurt people more.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] PGDLLEXPORTing all GUCs?
Craig Ringer writes:
> Is there any reason _not_ to PGDLLEXPORT all GUCs, other than cosmetic concerns?

That seems morally equivalent to "is there a reason not to make every static variable global?".

IOW, no, I don't accept this proposition. Every time we DLLEXPORT some variable, we lose the freedom to redefine it later. So DLLEXPORT'ing GUCs should be on a case-by-case basis, just as for any other variable. In some cases it might be smarter to export a wrapper function.

regards, tom lane
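[The wrapper-function alternative would look roughly like this, for a hypothetical GUC. On Windows, backend *function* symbols are already reachable from extensions, unlike data symbols, which is part of what makes the wrapper attractive.]

    /* in the owning module: the variable itself stays unexported */
    int     my_feature_limit = 100;     /* hypothetical GUC */

    /* extensions call this instead of touching the variable, leaving us
     * free to change the underlying representation later */
    int
    GetMyFeatureLimit(void)
    {
        return my_feature_limit;
    }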
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
Magnus Hagander writes:
> On Wed, May 7, 2014 at 2:56 PM, Andres Freund wrote:
>> Well, I guess it depends on what we define 'beta1' to be. Imo evaluating problematic pieces of new code and locating unfinished pieces is part of that. I don't see much point in forbidding incompatible changes in beta1 personally. That robs the development cycle of the only period where users can actually test the new version in a halfway sane manner and report back with things that are apparently broken.
>
> We need to be very careful to tell people about it though. Preferably, if we *know* a dump/reload will be needed to go beta1->beta2, we should actually document that in the release notes of beta1 already, so people can make proper plans.

This seems like much ado about very little. The policy will be the same as it ever was: once beta1 is out, we prefer to avoid forcing an initdb, but we'll do it if we have to.

In any case, +1 for fixing whatever needs to be fixed now; I expect to have a fix for the limited-GIN-index-length issue later today, and that really is also an on-disk format change, though it won't affect short index entries. ("Short" is TBD; I was thinking of hashing keys longer than say 128 bytes.)

regards, tom lane
Re: [HACKERS] PGDLLEXPORTing all GUCs?
On 2014-05-07 09:35:06 -0400, Tom Lane wrote:
> Craig Ringer writes:
>> Is there any reason _not_ to PGDLLEXPORT all GUCs, other than cosmetic concerns?
>
> That seems morally equivalent to "is there a reason not to make every static variable global?".
>
> IOW, no, I don't accept this proposition. Every time we DLLEXPORT some variable, we lose the freedom to redefine it later. So DLLEXPORT'ing GUCs should be on a case-by-case basis, just as for any other variable. In some cases it might be smarter to export a wrapper function.

I think what Craig actually tries to propose is to mark all GUCs currently exported in headers PGDLLIMPORT. Currently it's easy to have extensions that work on sane systems but not Windows. If they're already exposed in headers, I don't think changes get any harder just because they also can get used on Windows...

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On 2014-05-07 09:44:36 -0400, Tom Lane wrote:
> Magnus Hagander writes:
>> We need to be very careful to tell people about it though. Preferably, if we *know* a dump/reload will be needed to go beta1->beta2, we should actually document that in the release notes of beta1 already, so people can make proper plans.
>
> This seems like much ado about very little. The policy will be the same as it ever was: once beta1 is out, we prefer to avoid forcing an initdb, but we'll do it if we have to.

I think Magnus' point is that we should tell users that we'll try but won't guarantee anything. -hackers knowing about it doesn't mean users will know.

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 7 May 2014 13:31, Robert Haas wrote:
> On Wed, May 7, 2014 at 3:18 AM, Simon Riggs wrote:
>> If we believe that 25% of shared_buffers worth of heap blocks would flush the cache doing a SeqScan, why should we allow 400% of shared_buffers worth of index blocks?
>
> I think you're comparing apples and oranges.

I understood the distinction, which is why I changed the direction of my thinking to say

> Yes, we can make plans assuming we can use the OS cache, but we shouldn't be churning shared_buffers when we execute those plans.

and hence why I proposed

> I think I'm arguing myself towards using a BufferAccessStrategy of BAS_BULKREAD for large IndexScans, BitmapIndexScans and BitmapHeapScans.

which I hope will be effective in avoiding churn in shared_buffers even though we may use much larger memory from the OS.

--
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
Simon Riggs writes:
> I think I'm arguing myself towards using a BufferAccessStrategy of BAS_BULKREAD for large IndexScans, BitmapIndexScans and BitmapHeapScans.

As soon as you've got some hard evidence to present in favor of such changes, we can discuss it. I've got other things to do besides hypothesize.

In the meantime, it seems like there is an emerging consensus that nobody much likes the existing auto-tuning behavior for effective_cache_size, and that we should revert that in favor of just increasing the fixed default value significantly. I see no problem with a value of say 4GB; that's very unlikely to be worse than the pre-9.4 default (128MB) on any modern machine.

Votes for or against?

regards, tom lane
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 9:07 AM, Tom Lane wrote:
> Simon Riggs writes:
>> I think I'm arguing myself towards using a BufferAccessStrategy of BAS_BULKREAD for large IndexScans, BitmapIndexScans and BitmapHeapScans.
>
> As soon as you've got some hard evidence to present in favor of such changes, we can discuss it. I've got other things to do besides hypothesize.

Let me throw out one last point: it's pretty likely that s_b is going to be raised higher as a percentage of RAM. I never really bought into the conventional wisdom of 25% and have had to set it lower many times. Nevertheless, it was a documented suggestion. The core issues are:

1) There is no place to enter the total system memory available to the database in postgresql.conf
2) Memory settings (except for the above) are given as absolute amounts, not percentages

It would be a lot easier to standardize configurations, particularly if there were a way to electronically support #1 with auto-detection. Then e_c_s, s_b, work_mem, and various other settings could be given using standard (and perhaps somewhat conservative) percentages based on the best and hopefully factually supported recommendations. I oversee dozens of servers in a virtualized environment (as most enterprise shops are these days). Everything is 'right sized', often on demand, and often nobody bothers to adjust the various settings.

> In the meantime, it seems like there is an emerging consensus that nobody much likes the existing auto-tuning behavior for effective_cache_size, and that we should revert that in favor of just increasing the fixed default value significantly. I see no problem with a value of say 4GB; that's very unlikely to be worse than the pre-9.4 default (128MB) on any modern machine.

In lieu of something fancy like the above, adjusting the defaults seems a better way to go (so I vote to revert).

merlin
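[As a sketch of what Merlin is describing - purely hypothetical syntax, nothing like this exists in any released version:]

    # hypothetical postgresql.conf
    available_memory = 64GB          # (1) declare memory available to the database
    shared_buffers = 25%             # (2) express sizes relative to it
    effective_cache_size = 75%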
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 2014-05-07 10:07:07 -0400, Tom Lane wrote:
> In the meantime, it seems like there is an emerging consensus that nobody much likes the existing auto-tuning behavior for effective_cache_size, and that we should revert that in favor of just increasing the fixed default value significantly. I see no problem with a value of say 4GB; that's very unlikely to be worse than the pre-9.4 default (128MB) on any modern machine.
>
> Votes for or against?

+1 for increasing it to 4GB and removing the autotuning. I don't like the current integration into guc.c much, and a new static default doesn't seem to be worse than the current autotuning.

Greetings,

Andres Freund

--
Andres Freund    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 10:12 AM, Andres Freund wrote:
> On 2014-05-07 10:07:07 -0400, Tom Lane wrote:
>> In the meantime, it seems like there is an emerging consensus that nobody much likes the existing auto-tuning behavior for effective_cache_size, and that we should revert that in favor of just increasing the fixed default value significantly. I see no problem with a value of say 4GB; that's very unlikely to be worse than the pre-9.4 default (128MB) on any modern machine.
>>
>> Votes for or against?
>
> +1 for increasing it to 4GB and removing the autotuning. I don't like the current integration into guc.c much, and a new static default doesn't seem to be worse than the current autotuning.

It was my proposal originally, so I assume I'd be counted as in favor, but for the sake of clarity: +1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 4:12 PM, Andres Freund wrote:
> On 2014-05-07 10:07:07 -0400, Tom Lane wrote:
>> In the meantime, it seems like there is an emerging consensus that nobody much likes the existing auto-tuning behavior for effective_cache_size, and that we should revert that in favor of just increasing the fixed default value significantly. I see no problem with a value of say 4GB; that's very unlikely to be worse than the pre-9.4 default (128MB) on any modern machine.
>>
>> Votes for or against?
>
> +1 for increasing it to 4GB and removing the autotuning. I don't like the current integration into guc.c much, and a new static default doesn't seem to be worse than the current autotuning.

+1. If we can't make the autotuning better than that, we're better off holding off on that one until we can actually figure out something better. (At which point perhaps we can reach the level where we can just remove it... But that's all handwaving about the future, of course.)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
Robert Haas writes:
> On Wed, May 7, 2014 at 3:18 AM, Simon Riggs wrote:
>> If we believe that 25% of shared_buffers worth of heap blocks would flush the cache doing a SeqScan, why should we allow 400% of shared_buffers worth of index blocks?
>
> I think you're comparing apples and oranges. The 25% threshold is answering the question "How big does a sequential scan have to be before it's likely to flush so much unrelated data out of shared_buffers that it hurts the performance of other things running on the system?". So it's not really about whether or not things will *fit* in the cache, but rather a judgement about at what point caching that stuff is going to be of less value than continuing to cache other things. Also, it's specifically a judgement about shared_buffers, not system memory.
>
> But effective_cache_size is used to estimate the likelihood that an index scan which accesses the same heap or index block twice will still be in cache on the second hit, and thus need to be faulted in only once. So this *is* a judgment about what will fit - generally over a very short time scale. And, since bringing a page into shared_buffers from the OS cache is much less expensive than bringing a page into memory from disk, it's really about what will fit in overall system memory, not just shared_buffers.

Another point is that the 25% seqscan threshold actually controls some specific caching decisions, which effective_cache_size does not. Raising effective_cache_size "too high" is unlikely to result in cache thrashing; in fact I'd guess the opposite. What that would do is cause the planner to prefer indexscans over seqscans in more cases involving large tables. But if you've got a table+index that's bigger than RAM, seqscans are probably going to be worse for the OS cache than indexscans, because they're going to require bringing in more data.

So I still think this whole argument is founded on shaky hypotheses, with a complete lack of hard data showing that a smaller default for effective_cache_size would be better. The evidence we have points in the other direction.

regards, tom lane
Re: [HACKERS] bgworker crashed or not?
On 07/05/14 02:25, Petr Jelinek wrote: On 06/05/14 19:05, Robert Haas wrote: What I'm inclined to do is change the logic so that: (1) After a crash-and-restart sequence, zero rw->rw_crashed_at, so that anything which is still registered gets restarted immediately. Yes, that's quite an obvious change which I missed completely :). (2) If a shmem-connected backend fails to release the deadman switch or exits with an exit code other than 0 or 1, we crash-and-restart. A non-shmem-connected backend never causes a crash-and-restart. +1 (3) When a background worker exits without triggering a crash-and-restart, an exit code of precisely 0 causes the worker to be unregistered; any other exit code has no special effect, so bgw_restart_time controls. +1 Ok, attached patch is my try at the proposed changes. I don't like the reset_bgworker_crash_state function name, maybe you'll come up with something better... It passes my tests for the desired behavior, hopefully I didn't miss some scenario. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index a6b25d8..e913395 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -472,13 +472,13 @@ bgworker_quickdie(SIGNAL_ARGS)
 	on_exit_reset();

 	/*
-	 * Note we do exit(0) here, not exit(2) like quickdie. The reason is that
-	 * we don't want to be seen this worker as independently crashed, because
-	 * then postmaster would delay restarting it again afterwards. If some
-	 * idiot DBA manually sends SIGQUIT to a random bgworker, the "dead man
-	 * switch" will ensure that postmaster sees this as a crash.
+	 * Note we do exit(2) here, not exit(0) for same reasons quickdie does.
+	 * Another reason is that bgworker is not restarted after exit(0).
+	 *
+	 * The bgw_restart_time does not apply here and the bgworker will be
+	 * restarted immediately.
 	 */
-	exit(0);
+	exit(2);
 }

 /*
@@ -832,8 +832,9 @@ RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
  * running but is no longer.
  *
  * In the latter case, the worker may be stopped temporarily (if it is
- * configured for automatic restart, or if it exited with code 0) or gone
- * for good (if it is configured not to restart and exited with code 1).
+ * configured for automatic restart and exited with code 1) or gone for
+ * good (if it exited with code other than 1 or if it is configured not
+ * to restart).
  */
 BgwHandleStatus
 GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pidp)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6d09887..99d5e85 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -396,6 +396,7 @@ static void TerminateChildren(int signal);
 static int	CountChildren(int target);
 static int	CountUnconnectedWorkers(void);
+static void reset_bgworker_crash_state(void);
 static void maybe_start_bgworker(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static pid_t StartChildProcess(AuxProcType type);
@@ -2616,7 +2617,10 @@ reaper(SIGNAL_ARGS)
 	if (PgStatPID == 0)
 		PgStatPID = pgstat_start();

-	/* some workers may be scheduled to start now */
+	/* reset the state of crashed bgworkers so their restart is not delayed */
+	reset_bgworker_crash_state();
+
+	/* workers may be scheduled to start now */
 	maybe_start_bgworker();

 	/* at this point we are really open for business */
@@ -2845,14 +2849,8 @@ CleanupBackgroundWorker(int pid,
 	snprintf(namebuf, MAXPGPATH, "%s: %s", _("worker process"),
 			 rw->rw_worker.bgw_name);

-	/* Delay restarting any bgworker that exits with a nonzero status. */
-	if (!EXIT_STATUS_0(exitstatus))
-		rw->rw_crashed_at = GetCurrentTimestamp();
-	else
-		rw->rw_crashed_at = 0;
-
 	/*
-	 * Additionally, for shared-memory-connected workers, just like a
+	 * For shared-memory-connected workers, just like a
 	 * backend, any exit status other than 0 or 1 is considered a crash
 	 * and causes a system-wide restart.
 	 */
 	{
 		if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
 		{
-			rw->rw_crashed_at = GetCurrentTimestamp();
+			rw->rw_crashed_at = 0;
 			HandleChildCrash(pid, exitstatus, namebuf);
 			return true;
 		}
 	}

-	if (!ReleasePostmasterChildSlot(rw->rw_child_slot))
-	{
-		/*
-		 * Uh-oh, the child failed to clean itself up. Treat as a crash
-		 * after all.
-		 */
+	/* Mark any bgworker that exits with a nonzero status or fails to clean
+	 * up after itself as crashed and ready for restart at bgw_restart_time. */
+	if (!EXIT_STATUS_0(exitstatus) || !ReleasePostmasterChildSlot(rw->rw_child_slot))
 		rw->rw_crashed_at = GetCurrentTimestamp();
-		HandleChildCrash(pid, exitstatus, namebuf);
-		return tr
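To make the proposed exit-code contract concrete, here is a minimal sketch of a worker main written against it. do_the_work() is a hypothetical helper, and proc_exit() is the normal way for a backend process to exit:

#include "postgres.h"
#include "storage/ipc.h"

static bool do_the_work(void);	/* hypothetical; defined elsewhere */

/*
 * Under the semantics proposed above: exit(0) means "done for good,
 * unregister me"; exit(1) means "let bgw_restart_time schedule a
 * restart"; any other code is treated as a crash for shmem-connected
 * workers.
 */
void
my_worker_main(Datum main_arg)
{
	if (!do_the_work())
		proc_exit(1);		/* transient failure: restart later */

	proc_exit(0);			/* finished: unregister */
}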
Re: [HACKERS] PGDLLEXPORTing all GUCs?
On 05/07/2014 09:45 PM, Andres Freund wrote: > I think what Craig actually tries to propose is to mark all GUCs > currently exported in headers PGDLLIMPORT. Currently it's easy to have > extensions that work on sane systems but not windows. If they're already > exposed in headers I don't think changes get any harder just because they > also can get used on windows... Yes, rather. Exporting GUCs that're currently static wouldn't make sense. I'm just talking about making what works on !windows work on Windows. If a GUC is declared extern in a header, it should be PGDLLIMPORT. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
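For anyone unfamiliar with the mechanism under discussion: c.h defines PGDLLIMPORT roughly as below (a simplified sketch; the real definition also handles the BUILDING_DLL/dllexport side), so the proposal amounts to attaching the macro to GUC declarations that are already extern:

#ifdef WIN32
#define PGDLLIMPORT __declspec(dllimport)	/* sketch; the real rule lives in c.h */
#else
#define PGDLLIMPORT							/* no-op everywhere else */
#endif

/* e.g., in a header that already exposes the GUC: */
extern PGDLLIMPORT int work_mem;	/* instead of: extern int work_mem; */

On non-Windows builds the macro expands to nothing, which is why the status quo only bites extension authors on Windows.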
Re: [HACKERS] PGDLLEXPORTing all GUCs?
Andres Freund writes: > On 2014-05-07 09:35:06 -0400, Tom Lane wrote: >> Craig Ringer writes: >>> Is there any reason _not_ to PGDLLEXPORT all GUCs, other than cosmetic >>> concerns? >> That seems morally equivalent to "is there a reason not to make every >> static variable global?". > I think what Craig actually tries to propose is to mark all GUCs > currently exported in headers PGDLLIMPORT. There are few if any GUCs that aren't exposed in headers, just so that guc.c can communicate with the owning modules. That doesn't mean that we want everybody in the world messing with them. To my mind, we PGDLLEXPORT some variable only after deciding that yeah, we're okay with having third-party modules touching that. Craig's proposal is to remove human judgement from that process. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 05/07/2014 10:12 AM, Andres Freund wrote: On 2014-05-07 10:07:07 -0400, Tom Lane wrote: In the meantime, it seems like there is an emerging consensus that nobody much likes the existing auto-tuning behavior for effective_cache_size, and that we should revert that in favor of just increasing the fixed default value significantly. I see no problem with a value of say 4GB; that's very unlikely to be worse than the pre-9.4 default (128MB) on any modern machine. Votes for or against? +1 for increasing it to 4GB and remove the autotuning. I don't like the current integration into guc.c much and a new static default doesn't seem to be worse than the current autotuning. +1. If we ever want to implement an auto-tuning heuristic it seems we're going to need some hard empirical evidence to support it, and that doesn't seem likely to appear any time soon. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 7 May 2014 15:07, Tom Lane wrote: > Simon Riggs writes: >> I think I'm arguing myself towards using a BufferAccessStrategy of >> BAS_BULKREAD for large IndexScans, BitMapIndexScans and >> BitMapHeapScans. > > As soon as you've got some hard evidence to present in favor of such > changes, we can discuss it. I've got other things to do besides > hypothesize. Now we have a theory to test, I'll write a patch and we can collect evidence for, or against. > In the meantime, it seems like there is an emerging consensus that nobody > much likes the existing auto-tuning behavior for effective_cache_size, > and that we should revert that in favor of just increasing the fixed > default value significantly. I see no problem with a value of say 4GB; > that's very unlikely to be worse than the pre-9.4 default (128MB) on any > modern machine. > > Votes for or against? +1 for fixed 4GB and remove the auto-tuning code. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 7 May 2014 15:10, Merlin Moncure wrote: > The core issues are: > 1) There is no place to enter total system memory available to the > database in postgresql.conf > 2) Memory settings (except for the above) are given as absolute > amounts, not percentages. Those sound like useful starting points. The key issue for me is that effective_cache_size is a USERSET. It applies per-query, just like work_mem (though work_mem is per query node). If we had "total system memory" we wouldn't know how to divide it up amongst users since we have no functionality for "workload management". It would be very nice to be able to tell Postgres that "I have 64GB RAM, use it wisely". At present, any and all users can set effective_cache_size and work_mem to any value they please, any time they wish, and thus overuse available memory. Which is why I've had to write plugins to manage the memory allocations better in userspace. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PGDLLEXPORTing all GUCs?
Tom Lane wrote: > Andres Freund writes: >> On 2014-05-07 09:35:06 -0400, Tom Lane wrote: >>> Craig Ringer writes: >>>> Is there any reason _not_ to PGDLLEXPORT all GUCs, other than cosmetic >>>> concerns? >>> That seems morally equivalent to "is there a reason not to make every >>> static variable global?". >> I think what Craig actually tries to propose is to mark all GUCs >> currently exported in headers PGDLLIMPORT. > There are few if any GUCs that aren't exposed in headers, just so that > guc.c can communicate with the owning modules. That doesn't mean that > we want everybody in the world messing with them. > > To my mind, we PGDLLEXPORT some variable only after deciding that yeah, > we're okay with having third-party modules touching that. Craig's > proposal is to remove human judgement from that process. So third-party modules that use GUC's that are not PGDLLEXPORT are doing so improperly - even if it works for them because they only care/test non-Windows platforms? David J. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PGDLLEXPORTing all GUCs?
On 2014-05-07 10:29:36 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2014-05-07 09:35:06 -0400, Tom Lane wrote: > >> Craig Ringer writes: > >>> Is there any reason _not_ to PGDLLEXPORT all GUCs, other than cosmetic > >>> concerns? > > >> That seems morally equivalent to "is there a reason not to make every > >> static variable global?". > > > I think what Craig actually tries to propose is to mark all GUCs > > currently exported in headers PGDLLIMPORT. > > There are few if any GUCs that aren't exposed in headers, just so that > guc.c can communicate with the owning modules. That doesn't mean that > we want everybody in the world messing with them. > > To my mind, we PGDLLEXPORT some variable only after deciding that yeah, > we're okay with having third-party modules touching that. Craig's > proposal is to remove human judgement from that process. It's just levelling the playing field between platforms. If I had an idea how, I'd still like to just PGDLLIMPORT *all* 'export'ed variables on windows. The problem is that there's a lot of variables which aren't exported and which we'll only discover after the release. Just look at what e.g. postgres_fdw needed. It's not particularly unlikely that other fdws need some of those as well. But they can't change the release at the same time. If we want to declare variables off limits to extension/external code we need a solution that works on !windows as well. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PGDLLEXPORTing all GUCs?
Andres Freund writes: > On 2014-05-07 10:29:36 -0400, Tom Lane wrote: >> To my mind, we PGDLLEXPORT some variable only after deciding that yeah, >> we're okay with having third-party modules touching that. Craig's >> proposal is to remove human judgement from that process. > It's just levelling the playing field between platforms. If I had an idea > how, I'd still like to just PGDLLIMPORT *all* 'export'ed variables on > windows. > The problem is that there's a lot of variables which aren't exported and > which we'll only discover after the release. Just look at what > e.g. postgres_fdw needed. It's not particularly unlikely that other > fdws need some of those as well. But they can't change the release at > the same time. [ shrug... ] That problem is uncorrelated with GUC status, however. If that's your argument for a patch, then the patch should DLLEXPORT *every single non-static variable*. Which is a discussion we've already had, and rejected. I'd not be against an automatic mechanism for that, and indeed put considerable work into trying to make it happen a few months ago. But I'll resist wholesale cluttering of the source code with those markers. As long as we have to have them, I think we should use them in the way I outlined, that we mark only variables that are "considered okay to access". In fact, GUCs are exactly the *last* variables that should get marked that way automatically, because so many of them are global only because of the need for guc.c to communicate with the owning module, not because we want anything else touching them. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Schizophrenic coding in gin_extract_jsonb(_hash)
Peter Geoghegan writes: > On Tue, May 6, 2014 at 8:08 PM, Tom Lane wrote: >> The early-exit code path supposes that JB_ROOT_COUNT is absolutely >> reliable as an indicator that there's nothing in the jsonb value. >> On the other hand, the realloc logic inside the iteration loop implies >> that JB_ROOT_COUNT is just an untrustworthy estimate. Which theory is >> correct? And why is there not a comment to be seen anywhere? If the code >> is correct then this logic is certainly worthy of a comment or three. > JsonbIteratorNext() is passed "false" as its skipNested argument. It's > recursive. And? I think you're just proving the point that this code is woefully underdocumented. If there were, somewhere, some comment explaining what the heck JB_ROOT_COUNT actually counts, maybe I wouldn't be asking this question. jsonb.h is certainly not divulging any such information. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PGDLLEXPORTing all GUCs?
On Wed, May 7, 2014 at 11:19 AM, Tom Lane wrote: > Andres Freund writes: >> On 2014-05-07 10:29:36 -0400, Tom Lane wrote: >>> To my mind, we PGDLLEXPORT some variable only after deciding that yeah, >>> we're okay with having third-party modules touching that. Craig's >>> proposal is to remove human judgement from that process. > >> It's just levelling the planefield between platforms. If I had an idea >> how I'd still like to just PGDDLIMPORT *all* 'export'ed variables on >> windows. >> The problem is that there's lot of variables which aren't exported and >> which we'll only discover after the release. Just look at what >> e.g. postgres_fdw needed. It's not particularly unlikely that others >> fdws need some of those as well. But they can't change the release at >> the same time. > > [ shrug... ] That problem is uncorrelated with GUC status, however. > If that's your argument for a patch, then the patch should DLLEXPORT > *every single non-static variable*. Which is a discussion we've already > had, and rejected. > > I'd not be against an automatic mechanism for that, and indeed put > considerable work into trying to make it happen a few months ago. But > I'll resist wholesale cluttering of the source code with those markers. > As long as we have to have them, I think we should use them in the way > I outlined, that we mark only variables that are "considered okay to > access". In fact, GUCs are exactly the *last* variables that should get > marked that way automatically, because so many of them are global only > because of the need for guc.c to communicate with the owning module, > not because we want anything else touching them. I don't accept this argument. In EnterpriseDB's Advanced Server fork of PostgreSQL, we've marked a bunch of extra things PGDLLEXPORT precisely because we have external modules that need access to them. Of course, what that means is that when PostgreSQL changes things around in a future release, we have to go revise those external modules as well. However, that just doesn't happen often enough to actually pose a significant problem for us. It's not really accurate to think that people are only going to rely on the things that we choose to PGDLLEXPORT. It's more accurate to think that when we don't mark things PGDLLEXPORT, we're forcing people to decide between (1) giving up on writing their extension, (2) having that extension not work on Windows, (3) submitting a patch to add a PGDLLEXPORT marking and waiting an entire release cycle for that to go G.A., and then still not being able to support older versions, or (4) forking PostgreSQL. That's an unappealing list of options. I would not go so far as to say that we should PGDLLEXPORT absolutely every non-static variable. But I think adding those markings to every GUC we've got is perfectly reasonable. I am quite sure that the 2ndQuadrant folks know that they'll be on the hook to update any code they write that uses those variables if and when a future version of PostgreSQL whacks things around. But that's not a new problem - the same thing happens when a function signature changes, or when a variable that does happen to have a PGDLLEXPORT marking changes. And at least in my experience it's also not a large problem. The amount of time EnterpriseDB spends updating our (large!) amount of proprietary code in response to such changes is a small percentage of our overall development time. 
Enabling extensibility is a far more important goal than keeping people from committing to interfaces that may change in the future, especially since the latter is a losing battle anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PGDLLEXPORTing all GUCs?
Robert Haas writes: > I don't accept this argument. In EnterpriseDB's Advanced Server fork > of PostgreSQL, we've marked a bunch of extra things PGDLLEXPORT > precisely because we have external modules that need access to them. Well, that's an argument for marking every darn global variable as PGDLLEXPORT. But it's *not* an argument for marking GUCs in particular that way. In particular, you are conveniently ignoring the point that GUCs are much more likely to be global as an artifact of the way guc.c is modularized than because we actually think they should be globally accessible. If Craig has a concrete argument why all GUCs should be accessible to external modules, then let's see it (after which we'd better debate exposing the few that are in fact static in guc.c). Or if you want to hang your hat on the platform-leveling argument, then we should be re-debating exporting *all* global variables. But as far as the actually proposed patch goes, all I'm hearing is very confused thinking. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.5] Custom Plan API
* Simon Riggs (si...@2ndquadrant.com) wrote: > Agreed. My proposal is that if the planner allows the lookaside to an > FDW then we pass the query for full execution on the FDW. That means > that the scan, aggregate and join could take place via the FDW. i.e. > "Custom Plan" == lookaside + FDW How about we get that working for FDWs to begin with and then we can come back to this idea..? We're pretty far from join-pushdown or aggregate-pushdown to FDWs, last I checked, and having those would be a massive win for everyone using FDWs. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Schizophrenic coding in gin_extract_jsonb(_hash)
On 05/07/2014 06:27 PM, Tom Lane wrote: I think you're just proving the point that this code is woefully underdocumented. If there were, somewhere, some comment explaining what the heck JB_ROOT_COUNT actually counts, maybe I wouldn't be asking this question. jsonb.h is certainly not divulging any such information. After having reverse-engineered the convertJsonb code, I think I can explain what JB_ROOT_COUNT is. If the root of the Jsonb datum is an array, it's the number of elements in that top-level array. If it's an object, it's the number of key/value pairs in that top-level object. Some of the elements of that array (or values of the object) can be arrays or objects themselves. gin_extract_jsonb recursively extracts all the elements, keys and values of any sub-object too, but JB_ROOT_COUNT only counts the top-level elements. (I hope this is made a bit more clear in the comments I added in the patch I posted this morning) - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PGDLLEXPORTing all GUCs?
On 2014-05-07 12:21:55 -0400, Tom Lane wrote: > Robert Haas writes: > > I don't accept this argument. In EnterpriseDB's Advanced Server fork > > of PostgreSQL, we've marked a bunch of extra things PGDLLEXPORT > > precisely because we have external modules that need access to them. > > Well, that's an argument for marking every darn global variable as > PGDLLEXPORT. But it's *not* an argument for marking GUCs in particular > that way. In particular, you are conveniently ignoring the point that > GUCs are much more likely to be global as an artifact of the way guc.c > is modularized than because we actually think they should be globally > accessible. GUCs in general are user configurable things. So it's not particularly unreasonable to assume that a significant fraction of them are of interest to extensions. And it's not like exporting them gives rise to many additional dangers - they already can be overwritten. In fact, I am pretty sure that nearly all of these cases are about *reading* the underlying variable, not writing them. It's pretty darn less convenient and far slower to get the config variable as text and then convert it to the underlying type. > (after which we'd better debate exposing the few that are in fact > static in guc.c). I plan to do that for most of them - completely independently of this topic. I think 'export'ing a variable in several files is a horrible idea. > Or if you want > to hang your hat on the platform-leveling argument, then we should be > re-debating exporting *all* global variables. Yes. I am wondering whether that's not the most sensible course. It's a pita, but essentially we'd have to do a global s/export/pg_export/ in the headers. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.5] Custom Plan API
* Simon Riggs (si...@2ndquadrant.com) wrote: > IMHO we would not want to add indexes to every column, on every table, > nor would we wish to use lookaside for all tables. It is a good thing > to be able to add optimizations for individual tables. GPUs are not > good for everything; it is good to be able to leverage their > strengths, yet avoid their weaknesses. It's the optimizer's job to figure out which path to pick though, based on which will have the lowest cost. > If you do want that, you can write an Event Trigger that automatically > adds a lookaside for any table. This sounds terribly ugly and like we're pushing optimization decisions on to the user instead of just figuring out what the best answer is. > I agree TupleTableSlot may not be best way for bulk data movement. We > probably need to look at buffering/bulk movement between executor > nodes in general, which would be of benefit for the FDW case also. > This would be a problem even for Custom Scans as originally presented > also, so I don't see much change there. Being able to do bulk movement would be useful, but (as I proposed months ago) being able to do asynchronous returns would be extremely useful also, when you consider FDWs and Append() - the main point there being that you want to keep the FDWs busy and working in parallel. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] PGDLLEXPORTing all GUCs?
On Wed, May 7, 2014 at 12:21 PM, Tom Lane wrote: > Robert Haas writes: >> I don't accept this argument. In EnterpriseDB's Advanced Server fork >> of PostgreSQL, we've marked a bunch of extra things PGDLLEXPORT >> precisely because we have external modules that need access to them. > > Well, that's an argument for marking every darn global variable as > PGDLLEXPORT. But it's *not* an argument for marking GUCs in particular > that way. In particular, you are conveniently ignoring the point that > GUCs are much more likely to be global as an artifact of the way guc.c > is modularized than because we actually think they should be globally > accessible. > > If Craig has a concrete argument why all GUCs should be accessible > to external modules, then let's see it (after which we'd better debate > exposing the few that are in fact static in guc.c). Or if you want > to hang your hat on the platform-leveling argument, then we should be > re-debating exporting *all* global variables. But as far as the actually > proposed patch goes, all I'm hearing is very confused thinking. I think there's actually a very good reason to think that GUCs are good candidates for this treatment, which is that, by definition, the GUC is a public interface: you can change it with a SET command. It's certainly easier to imagine an extension wanting access to update_process_title than, say, criticalRelcachesBuilt. But maybe you're right and we should revisit the idea of exposing everything. A quick grep through src/include suggests that GUCs are a big percentage of what's not marked PGDLLEXPORT anyway, and the other stuff that's in there is stuff like PgStartTime and PostmasterPid which hardly seem like silly things to expose. I certainly think we should err on the side of exposing stuff that people think might be useful rather than pretending that we can stop them from using symbols by refusing to PGDLLEXPORT them. We can't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Schizophrenic coding in gin_extract_jsonb(_hash)
Heikki Linnakangas writes: > gin_extract_jsonb recursively extracts all the elements, keys and values > of any sub-object too, but JB_ROOT_COUNT only counts the top-level elements. Got it. So if the top level is empty, we can exit early, but otherwise we use its length * 2 as a guess at how big the output will be; which will be right if it's an object without further substructure, and otherwise might need enlargement. > (I hope this is made a bit more clear in the comments I added in the > patch I posted this morning) Didn't read that yet, but will incorporate this info into the jsonb_gin patch I'm working on. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
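Condensed into a sketch (not the verbatim function), the shape Tom describes looks like the following; the entry-building inside the loop is elided, and names follow the 9.4-era jsonb headers:

#include "postgres.h"
#include "utils/jsonb.h"

Datum
gin_extract_jsonb_sketch(PG_FUNCTION_ARGS)
{
	Jsonb	   *jb = PG_GETARG_JSONB(0);
	int32	   *nentries = (int32 *) PG_GETARG_POINTER(1);
	int			total = 2 * JB_ROOT_COUNT(jb);	/* exact only for a flat top level */
	int			i = 0;
	Datum	   *entries;

	/* Exact use: an empty top level really does mean nothing to index. */
	if (total == 0)
	{
		*nentries = 0;
		PG_RETURN_POINTER(NULL);
	}

	/* Estimate use: nested sub-objects can blow past this initial size. */
	entries = (Datum *) palloc(sizeof(Datum) * total);

	/*
	 * ... iterate with JsonbIteratorNext(..., false) so nested levels are
	 * visited too; whenever i >= total, double 'total' and repalloc
	 * 'entries' ...
	 */

	*nentries = i;
	PG_RETURN_POINTER(entries);
}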
Re: [HACKERS] PGDLLEXPORTing all GUCs?
Robert Haas writes: > On Wed, May 7, 2014 at 12:21 PM, Tom Lane wrote: >> If Craig has a concrete argument why all GUCs should be accessible >> to external modules, then let's see it (after which we'd better debate >> exposing the few that are in fact static in guc.c). > I think there's actually a very good reason to think that GUCs are > good candidates for this treatment, which is that, by definition, the > GUC is a public interface: you can change it with a SET command. Sure, and we provide public APIs for accessing/setting GUCs. The SET side of that is most emphatically *not* "just set the C variable". Yeah, you can get away with reading them like that, assuming you want the internal representation not the user-visible one. In any case, I've not heard the use-case why all (and only) GUCs might need to be readable in that way. Again, I'm not arguing against a proposal that we should automatically export all globally-declared variables for platform-levelling reasons. I *am* saying that I find a proposal to do that just to GUCs to be unsupported by any argument made so far. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PGDLLEXPORTing all GUCs?
On 2014-05-07 13:08:52 -0400, Tom Lane wrote: > Robert Haas writes: > > On Wed, May 7, 2014 at 12:21 PM, Tom Lane wrote: > >> If Craig has a concrete argument why all GUCs should be accessible > >> to external modules, then let's see it (after which we'd better debate > >> exposing the few that are in fact static in guc.c). > > > I think there's actually a very good reason to think that GUCs are > > good candidates for this treatment, which is that, by definition, the > > GUC is a public interface: you can change it with a SET command. > > Sure, and we provide public APIs for accessing/setting GUCs. As strings. Not a useful representation for C code... And to avoid units getting tacked on you need to first get the config option number, then allocate an array on the stack, call GetConfigOptionByNum(), then parse the result into the underlying type. Meh. > The SET > side of that is most emphatically *not* "just set the C variable". > Yeah, you can get away with reading them like that, assuming you want > the internal representation not the user-visible one. In any case, > I've not heard the use-case why all (and only) GUCs might need to be > readable in that way. I think you're making up the 'and only' part. There's lots of variables that should/need to be exported. Just look at the amount of mess you cleaned up with various extensions not actually working on windows... Last time round you argued against exporting all variables. So Craig seems to have chosen a subset that's likely to be needed. > Again, I'm not arguing against a proposal that we should automatically > export all globally-declared variables for platform-levelling reasons. > I *am* saying that I find a proposal to do that just to GUCs to be > unsupported by any argument made so far. Well, then let's start discussing that proposal then. I propose defining a 'pg_export' macro that's suitably defined by the buildsystem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
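The contrast Andres is drawing, sketched for effective_cache_size. The signatures here reflect the 9.4-era guc.h and have shifted between releases, so treat this as illustrative rather than definitive:

#include "postgres.h"
#include "utils/guc.h"

/*
 * With the symbol exported, an extension can simply read the planner's
 * own variable (declared in optimizer/cost.h):
 *
 *		int blocks = effective_cache_size;	-- internal form, 8kB pages
 *
 * Without it, the only portable route is text plus re-parsing:
 */
static int
ecs_via_string_api(void)
{
	const char *str = GetConfigOption("effective_cache_size", false, false);
	int			blocks;

	if (!parse_int(str, &blocks, GUC_UNIT_BLOCKS, NULL))
		elog(ERROR, "unexpected effective_cache_size value: %s", str);
	return blocks;		/* same number a direct read would have given */
}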
Re: [HACKERS] 9.4 checksum errors in recovery with gin index
On Wed, May 7, 2014 at 12:48 AM, Andres Freund wrote: > Hi, > > On 2014-05-07 00:35:35 -0700, Jeff Janes wrote: > > When recovering from a crash (with injection of a partial page write at > > time of crash) against 7c7b1f4ae5ea3b1b113682d4d I get a checksum > > verification failure. > > > > 16396 is a gin index. > > Over which type? What was the load? make check? > A gin index on text[]. The load is a variation of the crash recovery tester I've been using the last few years, this time adapted to use a gin index in a rather unnatural way. I just increment a counter on a random row repeatedly via a unique key, but for this purpose that unique key is stuffed into text[] along with a bunch of cruft. The cruft is text representations of negative integers, the actual key is text representation of nonnegative integers. The test harness (patch to induce crashes, and two driving programs) and a preserved data directory are here: https://drive.google.com/folderview?id=0Bzqrh1SO9FcESDZVeFk5djJaeHM&usp=sharing (role jjanes, database jjanes) As far as I can tell, this problem goes back to the beginning of page checksums. > > If I have it ignore checksum failures, there is no apparent misbehavior. > > I'm trying to bisect it, but it could take a while and I thought someone > > might have some theories based on the log: > > If you have the WAL a pg_xlogdump grepping for everything referring to > that block would be helpful. > The only record which mentions block 28486 by name is this one: rmgr: Gin len (rec/tot): 1576/ 1608, tx: 77882205, lsn: 11/30F4C2C0, prev 11/30F4C290, bkp: , desc: Insert new list page, node: 1663/16384/16396 blkno: 28486 However, I think that that record precedes the recovery start point. Cheers, Jeff
Re: [HACKERS] PGDLLEXPORTing all GUCs?
On Wed, May 7, 2014 at 1:08 PM, Tom Lane wrote: > Robert Haas writes: >> On Wed, May 7, 2014 at 12:21 PM, Tom Lane wrote: >>> If Craig has a concrete argument why all GUCs should be accessible >>> to external modules, then let's see it (after which we'd better debate >>> exposing the few that are in fact static in guc.c). > >> I think there's actually a very good reason to think that GUCs are >> good candidates for this treatment, which is that, by definition, the >> GUC is a public interface: you can change it with a SET command. > > Sure, and we provide public APIs for accessing/setting GUCs. The SET > side of that is most emphatically *not* "just set the C variable". > Yeah, you can get away with reading them like that, assuming you want > the internal representation not the user-visible one. In any case, > I've not heard the use-case why all (and only) GUCs might need to be > readable in that way. My experience is that GUCs are a common thing to want expose to extensions, and that C code usually wants the internal form, not the string form. I'm not arguing that nothing else needs to be exposed, but if there's a better argument possible for exposing the GUC variables than the fact that a bunch of people with experience developing PostgreSQL extensions view that as a real need, I can't think what it is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.5] Custom Plan API
On 7 May 2014 17:43, Stephen Frost wrote: > * Simon Riggs (si...@2ndquadrant.com) wrote: >> IMHO we would not want to add indexes to every column, on every table, >> nor would we wish to use lookaside for all tables. It is a good thing >> to be able to add optimizations for individual tables. GPUs are not >> good for everything; it is good to be able to leverage their >> strengths, yet avoid their weaknesses. > > It's the optimizer's job to figure out which path to pick though, based > on which will have the lowest cost. Of course. I'm not suggesting otherwise. >> If do you want that, you can write an Event Trigger that automatically >> adds a lookaside for any table. > > This sounds terribly ugly and like we're pushing optimization decisions > on to the user instead of just figuring out what the best answer is. I'm proposing that we use a declarative approach, just like we do when we say CREATE INDEX. The idea is that we only consider a lookaside when a lookaside has been declared. Same as when we add an index, the optimizer considers whether to use that index. What we don't want to happen is that the optimizer considers a GIN plan, even when a GIN index is not available. I'll explain it more at the developer meeting. It probably sounds a bit weird at first. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.4 checksum errors in recovery with gin index
Hi, On 2014-05-07 10:21:26 -0700, Jeff Janes wrote: > On Wed, May 7, 2014 at 12:48 AM, Andres Freund wrote: > > > Hi, > > > > On 2014-05-07 00:35:35 -0700, Jeff Janes wrote: > > > When recovering from a crash (with injection of a partial page write at > > > time of crash) against 7c7b1f4ae5ea3b1b113682d4d I get a checksum > > > verification failure. > > > > > > 16396 is a gin index. > > > > Over which type? What was the load? make check? > > > > A gin index on text[]. > > The load is a variation of the crash recovery tester I've been using the > last few years, this time adapted to use a gin index in a rather unnatural > way. I just increment a counter on a random row repeatedly via a unique > key, but for this purpose that unique key is stuffed into text[] along with > a bunch of cruft. The cruft is text representations of negative integers, > the actual key is text representation of nonnegative integers. > > The test harness (patch to induce crashes, and two driving programs) and a > preserved data directory are here: > > https://drive.google.com/folderview?id=0Bzqrh1SO9FcESDZVeFk5djJaeHM&usp=sharing > > (role jjanes, database jjanes) > > As far as I can tell, this problem goes back to the beginning of page > checksums. Interesting. > > > If I have it ignore checksum failures, there is no apparent misbehavior. > > > I'm trying to bisect it, but it could take a while and I thought someone > > > might have some theories based on the log: > > > > If you have the WAL a pg_xlogdump grepping for everything referring to > > that block would be helpful. > > > > The only record which mentions block 28486 by name is this one: Hm, try running it with -b specified. > rmgr: Gin len (rec/tot): 1576/ 1608, tx: 77882205, lsn: > 11/30F4C2C0, prev 11/30F4C290, bkp: , desc: Insert new list page, node: > 1663/16384/16396 blkno: 28486 > > However, I think that that record precedes the recovery start point. If that's the case it seems likely that a PageSetLSN() or PageSetDirty() are missing somewhere... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
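For reference, the invariant Andres suspects is violated somewhere: every change to a shared buffer must be paired with marking the buffer dirty and stamping the page with the WAL record's LSN, inside a critical section. A generic sketch of the 9.4-era pattern follows (record construction elided; 'rdata', 'info', 'recptr' and 'buffer' are assumed set up by the surrounding function):

	START_CRIT_SECTION();

	/* ... apply the page modification ... */

	MarkBufferDirty(buffer);
	recptr = XLogInsert(RM_GIN_ID, info, rdata);
	PageSetLSN(BufferGetPage(buffer), recptr);

	END_CRIT_SECTION();

If a path skips the MarkBufferDirty() or the PageSetLSN(), a page image can survive crash recovery stale or half-updated - exactly the kind of defect a checksum failure on an old page would suggest.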
Re: [HACKERS] [v9.5] Custom Plan API
* Simon Riggs (si...@2ndquadrant.com) wrote: > On 7 May 2014 17:43, Stephen Frost wrote: > > It's the optimizer's job to figure out which path to pick though, based > > on which will have the lowest cost. > > Of course. I'm not suggesting otherwise. > > >> If do you want that, you can write an Event Trigger that automatically > >> adds a lookaside for any table. > > > > This sounds terribly ugly and like we're pushing optimization decisions > > on to the user instead of just figuring out what the best answer is. > > I'm proposing that we use a declarative approach, just like we do when > we say CREATE INDEX. There's quite a few trade-offs when it comes to indexes though. I'm trying to figure out when you wouldn't want to use a GPU, if it's available to you and the cost model says it's faster? To me, that's kind of like saying you want a declarative approach for when to use a HashJoin. > The idea is that we only consider a lookaside when a lookaside has > been declared. Same as when we add an index, the optimizer considers > whether to use that index. What we don't want to happen is that the > optimizer considers a GIN plan, even when a GIN index is not > available. Yes, I understood your proposal- I just don't agree with it. ;) For MatViews and/or Indexes, there are trade-offs to be had as it relates to disk space, insert speed, etc. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 05/07/2014 07:31 AM, Andrew Dunstan wrote: > +1. If we ever want to implement an auto-tuning heuristic it seems we're > going to need some hard empirical evidence to support it, and that > doesn't seem likely to appear any time soon. 4GB default it is, then. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
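For concreteness, a sketch of what reverting to a fixed default could look like in guc.c's ConfigureNamesInt[]. The field layout follows the existing config_int entries, with 524288 eight-kB pages = 4GB; the description string and NULL hooks here are placeholders, not the committed text:

	{
		{"effective_cache_size", PGC_USERSET, QUERY_TUNING_COST,
			gettext_noop("Sets the planner's assumption about the size of the disk cache."),
			NULL,
			GUC_UNIT_BLOCKS
		},
		&effective_cache_size,
		524288, 1, INT_MAX,		/* boot_val (4GB), min, max */
		NULL, NULL, NULL		/* check, assign, show hooks */
	},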
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On Wed, May 7, 2014 at 4:40 AM, Andres Freund wrote: > * Imo we need space in jsonb ondisk values to indicate a format > version. We won't fully get this right. That would be nice. > * The jentry representation should be changed so it's possible to get the type > of an entry without checking individual types. That'll make code like > findJsonbValueFromSuperHeader() (well, whatever you've renamed it to) > much easier to read. Preferably so it can have the same values (after > shifting/masking) as the JBE variants. And it needs space for further > types (or representations thereof). I don't know what you mean. Every place where macros like JBE_ISNUMERIC() are used subsequently involves access to (say) numeric union fields. At best, you could have those functions check that the types match, and then handle each case in a switch that only looked at (say) the "candidate", but that doesn't really save much at all. It wouldn't take much to have the macros give enum constant values back as you suggest, though. > * I wonder if the hash/object pair representation is wise and if it > shouldn't be keys combined with offsets first, and then the > values. That will make access much faster. And that's what jsonb > essentially is about. I like that the physical layout reflects the layout of the original JSON document. Besides, I don't think this is obviously the case. Are you sure that it won't be more useful to make children as close to their parents as possible? Particularly given idiomatic usage. Jsonb, in point of fact, *is* fast. > * I think both arrays and hashes should grow individual structs. With > space for additional flags. Why? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 05/06/2014 10:35 PM, Peter Geoghegan wrote: > +1. In my view, we probably should have set it to a much higher > absolute default value. The main problem with setting it to any > multiple of shared_buffers that I can see is that shared_buffers is a > very poor proxy for what effective_cache_size is supposed to > represent. In general, the folk wisdom around sizing shared_buffers > has past its sell-by date. Unfortunately nobody has the time/resources to do the kind of testing required for a new recommendation for shared_buffers. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 11:04 AM, Josh Berkus wrote: > Unfortunately nobody has the time/resources to do the kind of testing > required for a new recommendation for shared_buffers. I meant to suggest that the buffer manager could be improved to the point that the old advice becomes obsolete. Right now, it's much harder to analyze shared_buffers than it should be, presumably because of the problems with the buffer manager. I think that if we could formulate better *actionable* advice around what we have right now, that would have already happened. We ought to be realistic about the fact that the current recommendations around sizing shared_buffers are nothing more than folk wisdom. That's the best we have right now, but that seems quite unsatisfactory to me. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On 2014-05-07 10:55:28 -0700, Peter Geoghegan wrote: > On Wed, May 7, 2014 at 4:40 AM, Andres Freund wrote: > > * The jentry representation should be changed so it's possible to get the > > type of an entry without checking individual types. That'll make code like > > findJsonbValueFromSuperHeader() (well, whatever you've renamed it to) > > much easier to read. Preferably so it can have the same values (after > > shifting/masking) as the JBE variants. And it needs space for further > > types (or representations thereof). > > I don't know what you mean. Every place where macros like > JBE_ISNUMERIC() are used subsequently involves access to (say) numeric > union fields. At best, you could have those functions check that the > types match, and then handle each case in a switch that only looked at > (say) the "candidate", but that doesn't really save much at all. It > wouldn't take much to have the macros give enum constant values back > as you suggest, though. Compare

for (i = 0; i < count; i++)
{
	JEntry	   *e = array + i;

	if (JBE_ISNULL(*e) && key->type == jbvNull)
	{
		result->type = jbvNull;
		result->estSize = sizeof(JEntry);
	}
	else if (JBE_ISSTRING(*e) && key->type == jbvString)
	{
		result->type = jbvString;
		result->val.string.val = data + JBE_OFF(*e);
		result->val.string.len = JBE_LEN(*e);
		result->estSize = sizeof(JEntry) + result->val.string.len;
	}
	else if (JBE_ISNUMERIC(*e) && key->type == jbvNumeric)
	{
		result->type = jbvNumeric;
		result->val.numeric = (Numeric) (data + INTALIGN(JBE_OFF(*e)));
		result->estSize = 2 * sizeof(JEntry) + VARSIZE_ANY(result->val.numeric);
	}
	else if (JBE_ISBOOL(*e) && key->type == jbvBool)
	{
		result->type = jbvBool;
		result->val.boolean = JBE_ISBOOL_TRUE(*e) != 0;
		result->estSize = sizeof(JEntry);
	}
	else
		continue;

	if (compareJsonbScalarValue(key, result) == 0)
		return result;
}

with

for (i = 0; i < count; i++)
{
	JEntry	   *e = array + i;

	if (!JBE_TYPE_IS_SCALAR(*e))
		continue;
	if (JBE_TYPE(*e) != key->type)
		continue;

	result = getJsonbValue(e);
	if (compareJsonbScalarValue(key, result) == 0)
		return result;
}

Yes, it's not a fair comparison because I made up getJsonbValue(). But it's *much* more readable regardless. And there's more places that could use it. Like the second half of findJsonbValueFromSuperHeader(). FWIW, that's one of the requests I definitely made before. > > * I wonder if the hash/object pair representation is wise and if it > > shouldn't be keys combined with offsets first, and then the > > values. That will make access much faster. And that's what jsonb > > essentially is about. > > I like that the physical layout reflects the layout of the original > JSON document. Don't see much value in that. This is a binary representation and it'd be bijective. > Besides, I don't think this is obviously the case. Are > you sure that it won't be more useful to make children as close to > their parents as possible? Particularly given idiomatic usage. Because - if done right - it would allow elementwise access without scanning previous values. > > * I think both arrays and hashes should grow individual structs. With > > space for additional flags. > Why? Because a) it will make the code more readable b) it'll allow for adding different representations of hashes/arrays. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
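A sketch of the uniform type extraction being requested. The mask, and the idea of sharing codes with the jbvType enum, are what Andres describes; the specific bit values here are hypothetical, not jsonb's actual on-disk assignments:

/* hypothetical layout: three type bits in each JEntry header word */
#define JENTRY_TYPEMASK		0x70000000

/*
 * Yields a code directly comparable to a JsonbValue's type field,
 * provided the stored codes are defined to match the jbvType enum.
 */
#define JBE_TYPE(je)		((je).header & JENTRY_TYPEMASK)

With that in place, the per-type if/else ladder collapses to the single JBE_TYPE(*e) != key->type comparison shown above.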
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Tue, May 6, 2014 at 9:55 AM, Andres Freund wrote: > On 2014-05-06 17:43:45 +0100, Simon Riggs wrote: > > > All this changes is the cost of > > IndexScans that would use more than 25% of shared_buffers worth of > > data. Hopefully not many of those in your workload. Changing the cost > > doesn't necessarily prevent index scans either. And if there are many > > of those in your workload AND you run more than one at same time, then > > the larger setting will work against you. So the benefit window for > > such a high setting is slim, at best. > Not only do you need to run more than one at a time, but they also must use mostly disjoint sets of data, in order for the larger estimate to be bad. > > Why? There's many workloads where indexes are larger than shared buffers > but fit into the operating system's cache. And that's precisely what > effective_cache_size is about. > It is more about the size of the table referenced by the index, rather than the size of the index. The point is that doing a large index scan might lead you to visit the same table blocks repeatedly within quick succession. (If a small index scan is on the inner side of a nested loop, then you might access the same index leaf blocks and the same table blocks repeatedly--that is why it is only mostly about the table size, rather than exclusively). Cheers, Jeff
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 1:13 PM, Peter Geoghegan wrote: > On Wed, May 7, 2014 at 11:04 AM, Josh Berkus wrote: >> Unfortunately nobody has the time/resources to do the kind of testing >> required for a new recommendation for shared_buffers. > > I meant to suggest that the buffer manager could be improved to the > point that the old advice becomes obsolete. Right now, it's much > harder to analyze shared_buffers than it should be, presumably because > of the problems with the buffer manager. I think that if we could > formulate better *actionable* advice around what we have right now, > that would have already happened. > > We ought to be realistic about the fact that the current > recommendations around sizing shared_buffers are nothing more than > folk wisdom. That's the best we have right now, but that seems quite > unsatisfactory to me. I think the stock advice is worse than nothing because it is A. based on obsolete assumptions and B. doesn't indicate what the tradeoffs are or what kinds of symptoms adjusting the setting could alleviate. The documentation should be reduced to things that are known, for example:

*) raising shared buffers does not 'give more memory to postgres for caching' -- it can only reduce it via double paging
*) reads from shared buffers are generally somewhat faster than faulting to o/s buffers
*) s_b larger than the working dataset size can be a good configuration, especially for read-only loads
*) large settings have bad interplay with the o/s in some configurations
*) shared buffers can reduce write i/o in certain workloads
*) interplay with checkpoint
*) shared buffers have different mechanisms for managing contention than o/s buffers

merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] making bgworkers without shmem access actually not have shmem access
I've complained about this problem a few times before: there's nothing to prevent a background worker which doesn't request shared memory access from calling InitProcess() and then accessing shared memory anyway. The attached patch is a first crack at fixing it. Unfortunately, there's still a window between when we fork() and when we detach shared memory, but that's also true for processes like syslogger, whose death nevertheless does not cause a crash-and-restart. Also unfortunately, in the EXEC_BACKEND case, we actually have to attach shared memory first, to get our background worker entry details. This is undesirable, but I'm not sure there's any good way to fix it at all, and certainly not in time for 9.4. Hopefully, the window when shared memory is attached with this patch is short enough to make things relatively safe. At a minimum, it's got to be better than the status quo, where shared memory is accessible throughout the entire lifetime of non-shmem-access background workers. Attached is also a new background worker called say_hello which I used for testing this patch. That's obviously not for commit. Comments? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index a6b25d8..8b8ae89 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -20,9 +20,11 @@
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/postmaster.h"
 #include "storage/barrier.h"
+#include "storage/dsm.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
 #include "storage/lwlock.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
@@ -400,12 +402,16 @@ BackgroundWorkerStopNotifications(pid_t pid)
 BackgroundWorker *
 BackgroundWorkerEntry(int slotno)
 {
+	static BackgroundWorker myEntry;
 	BackgroundWorkerSlot *slot;

 	Assert(slotno < BackgroundWorkerData->total_slots);
 	slot = &BackgroundWorkerData->slot[slotno];
 	Assert(slot->in_use);
-	return &slot->worker;	/* can't become free while we're still here */
+
+	/* must copy this in case we don't intend to retain shmem access */
+	memcpy(&myEntry, &slot->worker, sizeof myEntry);
+	return &myEntry;
 }
 #endif
@@ -542,6 +548,19 @@ StartBackgroundWorker(void)
 	snprintf(buf, MAXPGPATH, "bgworker: %s", worker->bgw_name);
 	init_ps_display(buf, "", "", "");

+	/*
+	 * If we're not supposed to have shared memory access, then detach from
+	 * shared memory.  If we didn't request shared memory access, the
+	 * postmaster won't force a cluster-wide restart if we exit unexpectedly,
+	 * so we'd better make sure that we don't mess anything up that would
+	 * require that sort of cleanup.
+	 */
+	if ((worker->bgw_flags & BGWORKER_SHMEM_ACCESS) == 0)
+	{
+		dsm_detach_all();
+		PGSharedMemoryDetach();
+	}
+
 	SetProcessingMode(InitProcessing);

 	/* Apply PostAuthDelay */
@@ -616,19 +635,29 @@ StartBackgroundWorker(void)
 	/* We can now handle ereport(ERROR) */
 	PG_exception_stack = &local_sigjmp_buf;

-	/* Early initialization */
-	BaseInit();
-
 	/*
-	 * If necessary, create a per-backend PGPROC struct in shared memory,
-	 * except in the EXEC_BACKEND case where this was done in
-	 * SubPostmasterMain. We must do this before we can use LWLocks (and in
-	 * the EXEC_BACKEND case we already had to do some stuff with LWLocks).
+	 * If the background worker request shared memory access, set that up now;
+	 * else, detach all shared memory segments.
 	 */
-#ifndef EXEC_BACKEND
 	if (worker->bgw_flags & BGWORKER_SHMEM_ACCESS)
+	{
+		/*
+		 * Early initialization. Some of this could be useful even for
+		 * background workers that aren't using shared memory, but they can
+		 * call the individual startup routines for those subsystems if needed.
+		 */
+		BaseInit();
+
+		/*
+		 * Create a per-backend PGPROC struct in shared memory, except in the
+		 * EXEC_BACKEND case where this was done in SubPostmasterMain. We must
+		 * do this before we can use LWLocks (and in the EXEC_BACKEND case we
+		 * already had to do some stuff with LWLocks).
+		 */
+#ifndef EXEC_BACKEND
 		InitProcess();
 #endif
+	}

 	/*
 	 * If bgw_main is set, we use that value as the initial entrypoint.
diff --git a/contrib/say_hello/Makefile b/contrib/say_hello/Makefile
new file mode 100644
index 000..2da89d7
--- /dev/null
+++ b/contrib/say_hello/Makefile
@@ -0,0 +1,17 @@
+# contrib/say_hello/Makefile
+
+MODULES = say_hello
+
+EXTENSION = say_hello
+DATA = say_hello--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/say_hello
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/say_hello/say_hello--1.0.sql b/contrib/say_hello/say_hello--1.0.sql
new file mode 100644
index 000..c82b7df
--- /d
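For context, a sketch of the registration this patch polices: a worker that deliberately opts out of shared memory by leaving BGWORKER_SHMEM_ACCESS out of bgw_flags. my_nonshmem_main is a placeholder, and field usage follows the 9.4 bgworker API:

#include "postgres.h"
#include "postmaster/bgworker.h"

static void my_nonshmem_main(Datum main_arg);	/* placeholder body elsewhere */

void
_PG_init(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	snprintf(worker.bgw_name, BGW_MAXLEN, "no-shmem worker");
	worker.bgw_flags = 0;		/* note: no BGWORKER_SHMEM_ACCESS */
	worker.bgw_start_time = BgWorkerStart_PostmasterStart;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	worker.bgw_main = my_nonshmem_main;
	RegisterBackgroundWorker(&worker);
}

With the patch applied, such a worker is detached from the main shared memory segment (and any DSM segments) before its entry point runs, so it can no longer quietly touch shared state the postmaster would never clean up after.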
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 2014-05-07 13:32:41 -0500, Merlin Moncure wrote: > On Wed, May 7, 2014 at 1:13 PM, Peter Geoghegan wrote: > > On Wed, May 7, 2014 at 11:04 AM, Josh Berkus wrote: > >> Unfortunately nobody has the time/resources to do the kind of testing > >> required for a new recommendation for shared_buffers. > > > > I meant to suggest that the buffer manager could be improved to the > > point that the old advice becomes obsolete. Right now, it's much > > harder to analyze shared_buffers than it should be, presumably because > > of the problems with the buffer manager. I think that if we could > > formulate better *actionable* advice around what we have right now, > > that would have already happened. > > > > We ought to be realistic about the fact that the current > > recommendations around sizing shared_buffers are nothing more than > > folk wisdom. That's the best we have right now, but that seems quite > > unsatisfactory to me. > > I think the stock advice is worse then nothing because it is A. based > on obsolete assumptions and B. doesn't indicate what the tradeoffs are > or what kinds of symptoms adjusting the setting could alleviate. The > documentation should be reduced to things that are known, for example: > > *) raising shared buffers does not 'give more memory to postgres for > caching' -- it can only reduce it via double paging That's absolutely not a necessary consequence. If pages are in s_b for a while the OS will be perfectly happy to throw them away. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 05/07/2014 11:13 AM, Peter Geoghegan wrote: > We ought to be realistic about the fact that the current > recommendations around sizing shared_buffers are nothing more than > folk wisdom. That's the best we have right now, but that seems quite > unsatisfactory to me. So, as one of several people who put literally hundreds of hours into the original benchmarking which established the sizing recommendations for shared_buffers (and other settings), I find the phrase "folk wisdom" personally offensive. So, can we stop with this? Otherwise, I don't think I can usefully participate in this discussion. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] making bgworkers without shmem access actually not have shmem access
Robert Haas writes: > I've complained about this problem a few times before: there's nothing > to prevent a background worker which doesn't request shared memory > access from calling InitProcess() and then accessing shared memory > anyway. The attached patch is a first crack at fixing it. > Comments? Looks reasonable to me. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 11:38 AM, Andres Freund wrote: >> *) raising shared buffers does not 'give more memory to postgres for >> caching' -- it can only reduce it via double paging > > That's absolutely not a necessary consequence. If pages are in s_b for a > while the OS will be perfectly happy to throw them away. The biggest problem with double buffering is not that it wastes memory. Rather, it's that it wastes memory bandwidth. I think that lessening that problem will be the major benefit of making larger shared_buffers settings practical. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
Andres Freund writes: > * The jentry representation should be changed so it's possible to get the type > of an entry without checking individual types. That'll make code like > findJsonbValueFromSuperHeader() (well, whatever you've renamed it to) > much easier to read. Preferably so it can have the same values (after > shifting/masking) as the JBE variants. And it needs space for further > types (or representations thereof). Further types? What further types? JSON seems unlikely to grow any other value types than what it's got. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 2:40 PM, Josh Berkus wrote: > On 05/07/2014 11:13 AM, Peter Geoghegan wrote: >> We ought to be realistic about the fact that the current >> recommendations around sizing shared_buffers are nothing more than >> folk wisdom. That's the best we have right now, but that seems quite >> unsatisfactory to me. > > So, as one of several people who put literally hundreds of hours into > the original benchmarking which established the sizing recommendations > for shared_buffers (and other settings), I find the phrase "folk wisdom" > personally offensive. So, can we stop with this? > > Otherwise, I don't think I can usefully participate in this discussion. +1. I think it is quite accurate to say that we can't predict precisely what value of shared_buffers will perform best for a particular workload and on a particular system. There are people out there using very large values and very small ones, according to what they have found most effective. But that does not mean, as the phrase "folk wisdom" might be taken to imply, that we don't know anything at all about what actually works well in practice. Because we do know quite a bit about that. I and people I work with have been able to improve performance greatly on many systems by providing guidance based on what this community has been able to understand on this topic, and dismissing it as rubbish is wrong. Also, I seriously doubt that a one-size-fits-all guideline about setting shared_buffers will ever be right for every workload. Workloads, by their nature, are complex beasts. The size of the workload varies, and which portions of it are how hot vary, and the read-write mix varies, and those are not problems with PostgreSQL; those are problems with data. That is not to say that we can't do anything to make PostgreSQL work better across a wider range of settings for shared_buffers, but it is to say that no matter how much work we do on the code, setting this optimally for every workload will probably remain complex. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 2014-05-07 11:45:04 -0700, Peter Geoghegan wrote: > On Wed, May 7, 2014 at 11:38 AM, Andres Freund wrote: > >> *) raising shared buffers does not 'give more memory to postgres for > >> caching' -- it can only reduce it via double paging > > > > That's absolutely not a necessary consequence. If pages are in s_b for a > > while the OS will be perfectly happy to throw them away. > > The biggest problem with double buffering is not that it wastes > memory. Rather, it's that it wastes memory bandwidth. Doesn't match my experience. Even with the current buffer manager there's usually enough locality to keep important pages in s_b for a meaningful time. I *have* seen workloads that should have fit into memory not fit because of double buffering. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 2:49 PM, Andres Freund wrote: > On 2014-05-07 11:45:04 -0700, Peter Geoghegan wrote: >> On Wed, May 7, 2014 at 11:38 AM, Andres Freund >> wrote: >> >> *) raising shared buffers does not 'give more memory to postgres for >> >> caching' -- it can only reduce it via double paging >> > >> > That's absolutely not a necessary consequence. If pages are in s_b for a >> > while the OS will be perfectly happy to throw them away. >> >> The biggest problem with double buffering is not that it wastes >> memory. Rather, it's that it wastes memory bandwidth. > > Doesn't match my experience. Even with the current buffer manager > there's usually enough locality to keep important pages in s_b for a > meaningful time. I *have* seen workloads that should have fit into > memory not fit because of double buffering. Same here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
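As an aside for readers who want to measure this on their own systems: the contrib module pg_buffercache exposes the current contents of shared_buffers. A rough sketch, assuming the extension is installed and the default 8kB block size:

-- Which relations currently occupy the most shared_buffers?
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS approx_size  -- assumes 8kB blocks
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;

Comparing snapshots of this over time gives a sense of how sticky the working set actually is in s_b.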
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On 2014-05-07 14:48:51 -0400, Tom Lane wrote: > Andres Freund writes: > > * The jentry representation should be changed so it's possible to get the > > type > > of an entry without checking individual types. That'll make code like > > findJsonbValueFromSuperHeader() (well, whatever you've renamed it to) > > much easier to read. Preferably so it can have the same values (after > > shifting/masking) as the JBE variants. And it needs space for further > > types (or representations thereof). > > Further types? What further types? JSON seems unlikely to grow any > other value types than what it's got. I am not thinking about user level exposed types, but e.g. a hash/object representation that allows duplicated keys and keeps the original order. Because I am pretty sure that'll come. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
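To make the proposal concrete: what Andres appears to be asking for is a layout in which an entry's type is a single mask away. All names below are hypothetical — this is not what jsonb.h defines, just the shape of the idea:

/* Hypothetical sketch of a mask-based jentry layout (uint32 as in c.h). */
typedef uint32 JEntry;

#define JENTRY_OFFLENMASK	0x0FFFFFFF	/* offset/length of the value */
#define JENTRY_TYPEMASK		0x70000000	/* 3 type bits = 8 possible types */

#define JENTRY_ISSTRING		0x00000000
#define JENTRY_ISNUMERIC	0x10000000
#define JENTRY_ISBOOL		0x20000000
#define JENTRY_ISNULL		0x30000000
#define JENTRY_ISCONTAINER	0x40000000	/* array or object */
/* three spare values left for future representations */

#define JBE_TYPE(je)	((je) & JENTRY_TYPEMASK)	/* one mask, no flag tests */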
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 11:40 AM, Josh Berkus wrote: > So, as one of several people who put literally hundreds of hours into > the original benchmarking which established the sizing recommendations > for shared_buffers (and other settings), I find the phrase "folk wisdom" > personally offensive. So, can we stop with this? I have also put a lot of time into benchmarking. No personal offence was intended, and I'm glad that we have some advice to give to users, but the fact of the matter is that current *official* recommendations are very vague. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] making bgworkers without shmem access actually not have shmem access
On Wed, May 7, 2014 at 2:44 PM, Tom Lane wrote: > Robert Haas writes: >> I've complained about this problem a few times before: there's nothing >> to prevent a background worker which doesn't request shared memory >> access from calling InitProcess() and then accessing shared memory >> anyway. The attached patch is a first crack at fixing it. > >> Comments? > > Looks reasonable to me. Thanks for the fast review. Committed, after fixing one further bug I spotted. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 11:50 AM, Robert Haas wrote: > But that does not mean, as the phrase "folk > wisdom" might be taken to imply, that we don't know anything at all > about what actually works well in practice. Folk wisdom doesn't imply that. It implies that we think this works, and we may well be right, but there isn't all that much rigor behind some of it. I'm not blaming anyone for this state of affairs. I've heard plenty of people repeat the "don't exceed 8GB" rule - I regularly repeated it myself. I cannot find any rigorous defense of this, though. If you're aware of one, please point it out to me. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] making bgworkers without shmem access actually not have shmem access
On 07/05/14 20:37, Robert Haas wrote: At a minimum, it's got to be better than the status quo, where shared memory is accessible throughout the entire lifetime of non-shmem-access background workers. Seems reasonable to me, it might need to be revisited to at least try to figure out if we can make EXEC_BACKEND safer, but it's definitely better than the current state. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On 05/07/2014 02:52 PM, Andres Freund wrote: On 2014-05-07 14:48:51 -0400, Tom Lane wrote: Andres Freund writes: * The jentry representation should be changed so it's possible to get the type of an entry without checking individual types. That'll make code like findJsonbValueFromSuperHeader() (well, whatever you've renamed it to) much easier to read. Preferably so it can have the same values (after shifting/masking) as the JBE variants. And it needs space for further types (or representations thereof). Further types? What further types? JSON seems unlikely to grow any other value types than what it's got. I am not thinking about user level exposed types, but e.g. a hash/object representation that allows duplicated keys and keeps the original order. Because I am pretty sure that'll come. That makes one of you. We've had hstore for years and nobody that I recall has asked for preservation of key order or duplicates. And any app that relies on either in JSON is still, IMNSHO, broken by design. OTOH, there are suggestions of "superjson" types that support other scalar types such as timestamps. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 05/07/2014 11:52 AM, Peter Geoghegan wrote: > On Wed, May 7, 2014 at 11:40 AM, Josh Berkus wrote: >> So, as one of several people who put literally hundreds of hours into >> the original benchmarking which established the sizing recommendations >> for shared_buffers (and other settings), I find the phrase "folk wisdom" >> personally offensive. So, can we stop with this? > > I have also put a lot of time into benchmarking. No personal offence > was intended, and I'm glad that we have some advice to give to users, > but the fact of the matter is that current *official* recommendations > are very vague. Well, they should be vague; the only hard data we have is rather out-of-date (I think 8.2 was our last set of tests). If we gave users specific, detailed recommendations, we'd be misleading them. For that matter, our advice on shared_buffers ... and our design for it ... is going to need to change radically soon, since Linux is getting an ARC with a frequency cache as well as a recency cache, and FreeBSD and OpenSolaris already have them. FWIW, if someone could fund me for a month, I'd be happy to create a benchmarking setup where we could test these kinds of things; I have pretty clear ideas how to build one. I imagine some of our other consultants could make the same offer. However, it's too much work for anyone to get done "in their spare time". -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
btw ... in jsonb.h there is this comment: * Jsonb Keys and string array elements are treated equivalently when * serialized to text index storage. One day we may wish to create an opclass * that only indexes values, but for now keys and values are stored in GIN * indexes in a way that doesn't really consider their relationship to each * other. Is this an obsolete speculation about writing jsonb_hash_ops, or is there still something worth preserving there? If so, what? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 11:50 AM, Robert Haas wrote: >> Doesn't match my experience. Even with the current buffer manager >> there's usually enough locality to keep important pages in s_b for a >> meaningful time. I *have* seen workloads that should have fit into >> memory not fit because of double buffering. > > Same here. I think that it depends on whether or not you're thinking about the worst case. Most people are not going to be in the category you describe here. Plenty of people in the Postgres community run with very large shared_buffers settings, on non i/o bound workloads, and report good results - often massive, quickly apparent improvements. I'm mostly concerned with obsoleting the 8GB hard ceiling rule here. It probably doesn't matter whether and by how much one factor is worse than the other, though. I found the section "5.2 Temporal Control: Buffering" in the following paper, that speaks about the subject quite interesting: http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On Wed, May 7, 2014 at 12:08 PM, Tom Lane wrote: > * Jsonb Keys and string array elements are treated equivalently when > * serialized to text index storage. One day we may wish to create an opclass > * that only indexes values, but for now keys and values are stored in GIN > * indexes in a way that doesn't really consider their relationship to each > * other. > > Is this an obsolete speculation about writing jsonb_hash_ops, or is there > still something worth preserving there? If so, what? This is not obsolete. It would equally apply to a GiST opclass that wanted to support our current definition of existence. Array elements are keys simply by virtue of being strings, but otherwise are treated as values. See the large comment block within gin_extract_jsonb(). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
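A brief illustration of the behaviour that comment block describes, as of the 9.4 development code:

-- String array elements are treated like keys by existence:
SELECT '["foo", "bar"]'::jsonb ? 'foo';   -- true
-- Non-string elements are values, so they don't match:
SELECT '[1, 2]'::jsonb ? '1';             -- false
-- Object keys match as expected:
SELECT '{"foo": 1}'::jsonb ? 'foo';       -- true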
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
And while I'm looking at it ... The jsonb_ops storage format for values is inherently lossy, because it cannot distinguish the string values "n", "t", "f" from JSON null or boolean true, false respectively; nor does it know the difference between a number and a string containing digits. This appears to not quite be a bug because the consistent functions force recheck for all queries that care about values (as opposed to keys). But if it's documented anywhere I don't see where. And in any case, is it a good idea? We could fairly easily change things so that these cases are guaranteed distinguishable. We're using an entire byte to convey one bit of information (key or value); I'm inclined to redefine the flag byte so that it tells not just that but which JSON datatype is involved. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
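For concreteness, a sketch of the collision being described (the table name is arbitrary); both rows below produce the same lossy index key for the value of "a", and only the forced heap recheck keeps the result correct:

CREATE TABLE jdoc (doc jsonb);
INSERT INTO jdoc VALUES ('{"a": "t"}'),   -- string "t"
                        ('{"a": true}');  -- boolean true
CREATE INDEX ON jdoc USING gin (doc);

SET enable_seqscan = off;
-- Without the recheck, the string row could leak into this result:
SELECT * FROM jdoc WHERE doc @> '{"a": true}';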
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 2:58 PM, Peter Geoghegan wrote: > On Wed, May 7, 2014 at 11:50 AM, Robert Haas wrote: >> But that does not mean, as the phrase "folk >> wisdom" might be taken to imply, that we don't know anything at all >> about what actually works well in practice. > > Folk wisdom doesn't imply that. It implies that we think this works, > and we may well be right, but there isn't all that much rigor behind > some of it. I'm not blaming anyone for this state of affairs. I've > heard plenty of people repeat the "don't exceed 8GB" rule - I > regularly repeated it myself. I cannot find any rigorous defense of > this, though. If you're aware of one, please point it out to me. I'm not sure the level of rigor you'd like to see is going to be available here. Complex systems have complex behavior; that's life. At any rate, I'm not aware of any rigorous defense of the "don't exceed 8GB" rule. But, #1, I'd never put it that simply. What I've found is more like this: If it's possible to size shared_buffers so that the working set fits entirely within shared_buffers, that configuration is worthy of strong consideration. Otherwise, you probably want to keep shared_buffers low in order to avoid checkpoint-related I/O spikes and minimize double buffering; try 25% of system memory up to 512MB on Windows or up to 2GB on 32-bit Linux or up to 8GB on 64-bit Linux for starters, and then tune based on your workload. And #2, I think the origin of the 8GB number on 64-bit non-Windows systems is that people found that checkpoint-related I/O spikes became intolerable when you went too much above that number. On some systems, the threshold is lower than that - for example, I believe Merlin and others have reported numbers more like 2GB than 8GB - and on other systems, the threshold is higher - indeed, some people go way higher and never hit it at all. I agree that it would be nice to better-characterize why different users hit it at different levels, but it's probably highly dependent on hardware, workload, and kernel version, so I tend to doubt it can be characterized very simply. If I had to guess, I'd bet that fixing Linux's abominable behavior around the fsync() call would probably go a long way toward making higher values of shared_buffers more practical. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
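Translated into a configuration sketch, that heuristic might start out like this on a dedicated 64GB 64-bit Linux machine — illustrative numbers only, to be tuned against the actual workload:

# postgresql.conf -- illustrative starting point, not a recommendation
shared_buffers = 8GB           # ~25% of RAM, capped per the heuristic above
effective_cache_size = 48GB    # what the OS is expected to cache; not an allocation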
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
On Wed, May 7, 2014 at 12:27 PM, Tom Lane wrote: > The jsonb_ops storage format for values is inherently lossy, because it > cannot distinguish the string values "n", "t", "f" from JSON null or > boolean true, false respectively; nor does it know the difference between > a number and a string containing digits. This appears to not quite be a > bug because the consistent functions force recheck for all queries that > care about values (as opposed to keys). But if it's documented anywhere > I don't see where. The fact that we *don't* set the reset flag for JsonbExistsStrategyNumber, and why that's okay is prominently documented. So I'd say that it is. > And in any case, is it a good idea? We could fairly > easily change things so that these cases are guaranteed distinguishable. > We're using an entire byte to convey one bit of information (key or > value); I'm inclined to redefine the flag byte so that it tells not just > that but which JSON datatype is involved. It seemed simpler to do it that way. As I've said before, jsonb_ops is mostly concerned with hstore-style indexing. It could also be particularly useful for expressional indexes on "tags" arrays of strings, which is a common use-case. jsonb_hash_ops on the other hand is for testing containment, which is useful for querying heavily nested documents, where typically there is a very low selectivity for keys. It's not the default largely because I was concerned about not supporting all indexable operators by default, and because I suspected that it would be preferred to have a default closer to hstore. Anyway, doing things that way for values won't obviate the need to set the reset flag, unless you come up with a much more sophisticated scheme. Existence (of keys) is only tested in respect of the top-level. Containment (where values are tested) is more complicated. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
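For reference, the choice between the two opclasses is made at index definition time. A hedged sketch, using the jsonb_hash_ops name as it stands in this thread and a hypothetical docs table:

-- Default opclass (jsonb_ops): supports ? / ?| / ?& and @>,
-- indexing keys and values as separate entries.
CREATE INDEX docs_gin_idx ON docs USING gin (doc);

-- jsonb_hash_ops: @> only, hashing whole paths; typically smaller and
-- faster for containment over heavily nested documents.
CREATE INDEX docs_gin_path_idx ON docs USING gin (doc jsonb_hash_ops);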
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
Peter Geoghegan writes: > On Wed, May 7, 2014 at 12:08 PM, Tom Lane wrote: >> Is this an obsolete speculation about writing jsonb_hash_ops, or is there >> still something worth preserving there? If so, what? > This is not obsolete. It would equally apply to a GiST opclass that > wanted to support our current definition of existence. Array elements > are keys simply by virtue of being strings, but otherwise are treated > as values. See the large comment block within gin_extract_jsonb(). It's not that aspect I'm complaining about, it's the bit about "one day we may wish to write". This comment should restrict itself to describing what jsonb_ops does, not make already-or-soon-to-be-obsolete statements about what other opclasses might or might not do. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.4 checksum errors in recovery with gin index
On Wed, May 7, 2014 at 10:34 AM, Andres Freund wrote: > Hi, > > On 2014-05-07 10:21:26 -0700, Jeff Janes wrote: > > On Wed, May 7, 2014 at 12:48 AM, Andres Freund >wrote: > > > If you have the WAL a pg_xlogdump grepping for everything referring to > > > that block would be helpful. > > > > > > > The only record which mentions block 28486 by name is this one: > > Hm, try running it with -b specified. > Still nothing more. > > > rmgr: Gin len (rec/tot): 1576/ 1608, tx: 77882205, lsn: > > 11/30F4C2C0, prev 11/30F4C290, bkp: , desc: Insert new list page, > node: > > 1663/16384/16396 blkno: 28486 > > > > However, I think that that record precedes the recovery start point. > > If that's the case it seems likely that a PageSetLSN() or PageSetDirty() > are missing somewhere... > Wouldn't that result in corrupted data after crash recovery when checksum failures are ignored? I haven't seen any of that. Cheers, Jeff
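(For anyone reproducing this, the kind of invocation under discussion is roughly the following; the data directory and LSN range are placeholders:)

pg_xlogdump -b -p $PGDATA/pg_xlog -s 11/30000000 -e 11/31000000 \
    | grep -C2 'blkno: 28486'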
Re: [HACKERS] Wanted: jsonb on-disk representation documentation
Peter Geoghegan writes: > On Wed, May 7, 2014 at 12:27 PM, Tom Lane wrote: >> The jsonb_ops storage format for values is inherently lossy, because it >> cannot distinguish the string values "n", "t", "f" from JSON null or >> boolean true, false respectively; nor does it know the difference between >> a number and a string containing digits. This appears to not quite be a >> bug because the consistent functions force recheck for all queries that >> care about values (as opposed to keys). But if it's documented anywhere >> I don't see where. > The fact that we *don't* set the reset flag for > JsonbExistsStrategyNumber, and why that's okay is prominently > documented. So I'd say that it is. Meh. Are you talking about the large comment block in gin_extract_jsonb? The readability of that comment starts to go downhill with its use of "reset" to refer to what everything else calls a "recheck" flag, and in any case it's claiming that we *don't* need a recheck for exists (a statement I suspect to be false, but more later). It entirely fails to explain that other query types *do* need a recheck because of arbitrary decisions about not representing JSON datatype information fully. There's another comment in gin_consistent_jsonb that's just as misleading, because it mentions (vaguely) some reasons why recheck is necessary without mentioning this one. A larger issue here is that, to the extent that this comment even has the information I'm after, the comment is in the wrong place. It is not attached to the code that's actually making the lossy representational choices (ie, make_scalar_key), nor to the code that decides whether or not a recheck is needed (ie, gin_consistent_jsonb). I don't think that basic architectural choices like these should be relegated to comment blocks inside specific functions to begin with. A README file would be better, perhaps, but there's not a specific directory associated with the jsonb code; so I think this sort of info belongs either in jsonb.h or in the file header comment for jsonb_gin.c. > Anyway, doing things that way for values won't obviate the need to set > the reset flag, unless you come up with a much more sophisticated > scheme. Existence (of keys) is only tested in respect of the > top-level. Containment (where values are tested) is more complicated. I'm not expecting that it will make things better for the current query operators. I am thinking that exact rather than lossy storage might help for future operators wanting to use this same index representation. Once we ship 9.4, it's going to be very hard to change the index representation, especially for the default opclass (cf the business about which opclass is default for type inet). So it behooves us to not throw information away needlessly. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bgworker crashed or not?
On Tue, May 6, 2014 at 8:25 PM, Petr Jelinek wrote: >> What I'm inclined to do is change the logic so that: >> >> (1) After a crash-and-restart sequence, zero rw->rw_crashed_at, so >> that anything which is still registered gets restarted immediately. > > Yes, that's quite obvious change which I missed completely :). I've committed the portion of your patch which does this, with pretty extensive changes. I moved the function which resets the crash times to bgworker.c, renamed it, and made it just reset all of the crash times unconditionally; there didn't seem to be any point in skipping the irrelevant ones, so it seemed best to keep things simple. I also moved the call from reaper() where you had it to PostmasterStateMachine(), because the old placement did not work for me when I carried out the following steps: 1. Kill a background worker with a 60-second restart time using SIGTERM, so that it does exit(1). 2. Before it restarts, kill a regular backend with SIGQUIT. Let me know if you think I've got it wrong. >> (2) If a shmem-connected backend fails to release the deadman switch >> or exits with an exit code other than 0 or 1, we crash-and-restart. A >> non-shmem-connected backend never causes a crash-and-restart. > > +1 I did this as a separate commit, e2ce9aa27bf20eff2d991d0267a15ea5f7024cd7, just moving the check for ReleasePostmasterChildSlot inside the if statement. Then I realized that was bogus, so I just pushed eee6cf1f337aa488a20e9111df446cdad770e645 to fix that. Hopefully it's OK now. >> (3) When a background worker exits without triggering a >> crash-and-restart, an exit code of precisely 0 causes the worker to be >> unregistered; any other exit code has no special effect, so >> bgw_restart_time controls. > > +1 This isn't done yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
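Once item (3) is in place, the worker-side contract would look roughly like this — a sketch, with do_one_cycle() standing in for whatever work the module actually does:

/* Sketch of the exit-status semantics being settled in this thread. */
static void
my_worker_main(Datum main_arg)
{
	if (!do_one_cycle())	/* hypothetical work function */
		proc_exit(1);	/* nonzero, non-crash exit: bgw_restart_time governs */

	proc_exit(0);		/* exactly 0: unregister the worker, never restart */
}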
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 11:04 AM, Josh Berkus wrote: > On 05/06/2014 10:35 PM, Peter Geoghegan wrote: > > +1. In my view, we probably should have set it to a much higher > > absolute default value. The main problem with setting it to any > > multiple of shared_buffers that I can see is that shared_buffers is a > > very poor proxy for what effective_cache_size is supposed to > > represent. In general, the folk wisdom around sizing shared_buffers > > has passed its sell-by date. > > Unfortunately nobody has the time/resources to do the kind of testing > required for a new recommendation for shared_buffers. > I think it is worse than that. I don't think we know what such testing would even look like. SSD? BBU? max_connections=2 with 256 cores? pgbench -N? capture and replay of Amazon's workload? If we could spell out/agree upon what kind of testing we would find convincing, that would probably go a long way to getting some people to work on carrying out the tests. Unless the conclusion was "please have 3TB of RAM and a 50-disk RAID", then there might be few takers. Cheers, Jeff
[HACKERS] jsonb existence queries are misimplemented by jsonb_ops
I wrote: > The readability of that comment starts to go downhill with its use of > "reset" to refer to what everything else calls a "recheck" flag, and in > any case it's claiming that we *don't* need a recheck for exists (a > statement I suspect to be false, but more later). And, indeed, it's false:

regression=# create table j (f1 jsonb);
CREATE TABLE
regression=# insert into j values ('{"foo": {"bar": "baz"}}');
INSERT 0 1
regression=# insert into j values ('{"foo": {"blah": "baz"}}');
INSERT 0 1
regression=# insert into j values ('{"fool": {"bar": "baz"}}');
INSERT 0 1
regression=# create index on j using gin(f1);
CREATE INDEX
regression=# select * from j where f1 ? 'bar';
 f1
----
(0 rows)

regression=# set enable_seqscan to 0;
SET
regression=# select * from j where f1 ? 'bar';
            f1
--------------------------
 {"foo": {"bar": "baz"}}
 {"fool": {"bar": "baz"}}
(2 rows)

The indexscan is incorrectly returning rows where the queried key exists but isn't at top-level. We could fix this either by giving up on no-recheck for existence queries, or by changing the way that non-top-level keys get indexed. However I suspect the latter would break containment queries, or at least make their lives a lot more difficult. Another idea would be to change the definition of the exists operator so that it *does* look into sub-objects. It seems rather random to me that containment looks into sub-objects but exists doesn't. However, possibly there are good reasons for the non-orthogonality. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.4 checksum errors in recovery with gin index
On 05/07/2014 10:35 AM, Jeff Janes wrote: When recovering from a crash (with injection of a partial page write at time of crash) against 7c7b1f4ae5ea3b1b113682d4d I get a checksum verification failure. 16396 is a gin index. If I have it ignore checksum failures, there is no apparent misbehavior. I'm trying to bisect it, but it could take a while and I thought someone might have some theories based on the log:

29075 2014-05-06 23:29:51.411 PDT:LOG: 0: database system was not properly shut down; automatic recovery in progress
29075 2014-05-06 23:29:51.411 PDT:LOCATION: StartupXLOG, xlog.c:6361
29075 2014-05-06 23:29:51.412 PDT:LOG: 0: redo starts at 11/323FE1C0
29075 2014-05-06 23:29:51.412 PDT:LOCATION: StartupXLOG, xlog.c:6600
29075 2014-05-06 23:29:51.471 PDT:WARNING: 01000: page verification failed, calculated checksum 35967 but expected 7881
29075 2014-05-06 23:29:51.471 PDT:CONTEXT: xlog redo Delete list pages

A-ha. The WAL record of list page deletion doesn't create full-page images of the deleted pages. That's pretty sensible, as a deleted page doesn't contain any data that needs to be retained. However, if a full-page image is not taken, then the page should be completely recreated from scratch at replay. It's not doing that. Thanks for the testing! I'll have a stab at fixing that tomorrow. Basically, ginRedoDeleteListPages() needs to re-initialize the deleted pages. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
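A rough sketch of the shape such a fix could take, assuming the 9.4-era redo interfaces and the ginxlogDeleteListPages record layout (the committed fix may well differ):

/* In ginRedoDeleteListPages(): recreate each deleted page from scratch,
 * rather than trusting its possibly-torn prior contents. */
for (i = 0; i < data->ndeleted; i++)
{
	/* third argument 'true': initialize the buffer, don't read the page */
	Buffer		buffer = XLogReadBuffer(data->node, data->toDelete[i], true);
	Page		page = BufferGetPage(buffer);

	GinInitBuffer(buffer, GIN_DELETED);
	PageSetLSN(page, lsn);
	MarkBufferDirty(buffer);
	UnlockReleaseBuffer(buffer);
}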
Re: [HACKERS] jsonb existence queries are misimplemented by jsonb_ops
I wrote: > Another idea would be to change the definition of the exists operator > so that it *does* look into sub-objects. It seems rather random to me > that containment looks into sub-objects but exists doesn't. However, > possibly there are good reasons for the non-orthogonality. No, wait, containment *doesn't* look into sub-objects:

regression=# select * from j where f1 @> '{"foo": {"bar": "baz"}}';
           f1
-------------------------
 {"foo": {"bar": "baz"}}
(1 row)

regression=# select * from j where f1 @> '{"bar": "baz"}';
 f1
----
(0 rows)

This is rather surprising in view of the way that section 8.14.4 goes on about nesting. But I guess the user-facing docs for jsonb are in little better shape than the internal docs. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 05/07/2014 01:36 PM, Jeff Janes wrote: > On Wed, May 7, 2014 at 11:04 AM, Josh Berkus wrote: >> Unfortunately nobody has the time/resources to do the kind of testing >> required for a new recommendation for shared_buffers. > I think it is worse than that. I don't think we know what such testing > would even look like. SSD? BBU? max_connections=2 with 256 cores? > pgbench -N? capture and replay of Amazon's workload? > > If we could spell out/agree upon what kind of testing we would find > convincing, that would probably go a long way to getting some people to > work on carrying out the tests. Unless the conclusion was "please have 3TB > or RAM and a 50 disk RAID", then there might be few takers. Well, step #1 would be writing some easy-to-run benchmarks which carry out selected workloads and measure response times. The minimum starting set would include one OLTP/Web benchmark, and one DW benchmark. I'm not talking about the software to run the workload; we have that, in several varieties. I'm talking about the actual database generator and queries to run. That's the hard work. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On Wed, May 7, 2014 at 11:38 AM, Andres Freund wrote: > On 2014-05-07 13:32:41 -0500, Merlin Moncure wrote: > > > > *) raising shared buffers does not 'give more memory to postgres for > > caching' -- it can only reduce it via double paging > > That's absolutely not a necessary consequence. If pages are in s_b for a > while the OS will be perfectly happy to throw them away. > Is that an empirical observation? I ran some simulations a couple of years ago, and also wrote some instrumentation to test that theory under favorably engineered (but still plausible) conditions, and couldn't get more than a small fraction of s_b to be so tightly bound into shared_buffers that the kernel could forget about those pages. Unless of course the entire workload, or close to it, fits in s_b. Cheers, Jeff
Re: [HACKERS] jsonb existence queries are misimplemented by jsonb_ops
On Wed, May 7, 2014 at 1:47 PM, Tom Lane wrote:
> No, wait, containment *doesn't* look into sub-objects:
>
> regression=# select * from j where f1 @> '{"foo": {"bar": "baz"}}';
>            f1
> -------------------------
>  {"foo": {"bar": "baz"}}
> (1 row)
>
> regression=# select * from j where f1 @> '{"bar": "baz"}';
>  f1
> ----
> (0 rows)

Yes it does. It's just not intended to work like that. You have to "give a path" to what you're looking for. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bgworker crashed or not?
On 07/05/14 22:32, Robert Haas wrote: On Tue, May 6, 2014 at 8:25 PM, Petr Jelinek wrote: I've committed the portion of your patch which does this, with pretty extensive changes. I moved the function which resets the crash times to bgworker.c, renamed it, and made it just reset all of the crash times unconditionally; there didn't seem to be any point in skipping the irrelevant ones, so it seemed best to keep things simple. I also moved the call from reaper() where you had it to PostmasterStateMachine(), because the old placement did not work for me when I carried out the following steps: 1. Kill a background worker with a 60-second restart time using SIGTERM, so that it does exit(1). 2. Before it restarts, kill a regular backend with SIGQUIT. Let me know if you think I've got it wrong. No I think it's fine, I didn't try that combination and just wanted to put it as deep in the call as possible. (2) If a shmem-connected backend fails to release the deadman switch or exits with an exit code other than 0 or 1, we crash-and-restart. A non-shmem-connected backend never causes a crash-and-restart. +1 I did this as a separate commit, e2ce9aa27bf20eff2d991d0267a15ea5f7024cd7, just moving the check for ReleasePostmasterChildSlot inside the if statement. Then I realized that was bogus, so I just pushed eee6cf1f337aa488a20e9111df446cdad770e645 to fix that. Hopefully it's OK now. The fixed one looks ok to me. (3) When a background worker exits without triggering a crash-and-restart, an exit code of precisely 0 causes the worker to be unregistered; any other exit code has no special effect, so bgw_restart_time controls. +1 This isn't done yet. Unless I am missing something this change was included in every patch I sent - setting rw->rw_terminate = true; in CleanupBackgroundWorker for zero exit code + comment changes. Or do you have objections to this approach? Anyway missing parts attached. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index fd32d6c..5f86100 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -166,10 +166,12 @@ typedef struct BackgroundWorker
 
-  Background workers are expected to be continuously running; if they exit
-  cleanly, postgres will restart them immediately.  Consider doing
-  interruptible sleep when they have nothing to do; this can be achieved by
-  calling WaitLatch(). Make sure the
+  Background workers are normally expected to be running continuously;
+  they get restarted automaticaly in case of crash (see
+  bgw_restart_time documentation for details),
+  but if they exit cleanly, postgres will not restart them.
+  Consider doing interruptible sleep when they have nothing to do; this can be
+  achieved by calling WaitLatch(). Make sure the
   WL_POSTMASTER_DEATH flag is set when calling that function, and verify
   the return code for a prompt exit in the emergency case that postgres
   itself has terminated.
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 64c9722..b642f37 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -884,8 +884,9 @@ RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
  * running but is no longer.
  *
  * In the latter case, the worker may be stopped temporarily (if it is
- * configured for automatic restart, or if it exited with code 0) or gone
- * for good (if it is configured not to restart and exited with code 1).
+ * configured for automatic restart and exited with code 1) or gone for
+ * good (if it exited with code other than 1 or if it is configured not
+ * to restart).
  */
 BgwHandleStatus
 GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pidp)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 79d1c50..a0331e6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2849,7 +2849,11 @@ CleanupBackgroundWorker(int pid,
 		if (!EXIT_STATUS_0(exitstatus))
 			rw->rw_crashed_at = GetCurrentTimestamp();
 		else
+		{
+			/* Zero exit status means terminate */
 			rw->rw_crashed_at = 0;
+			rw->rw_terminate = true;
+		}
 
 		/*
 		 * Additionally, for shared-memory-connected workers, just like a
diff --git a/src/include/postmaster/bgworker.h b/src/include/postmaster/bgworker.h
index c9550cc..d3dd1f2 100644
--- a/src/include/postmaster/bgworker.h
+++ b/src/include/postmaster/bgworker.h
@@ -16,10 +16,11 @@
 * that the failure can only be transient (fork failure due to high load,
 * memory pressure, too many processes, etc); more permanent problems, like
 * failure to connect to a database, are detected later in the worker and dealt
- * with just by having the worker exit normally. A worker which exits with a
- * return code of 0 will be immediately re
Re: [HACKERS] jsonb existence queries are misimplemented by jsonb_ops
On Wed, May 7, 2014 at 1:51 PM, Peter Geoghegan wrote: > Yes it does. It's just not intended to work like that. You have to > "give a path" to what you're looking for. There is exactly one exception explicitly noted as such in the user-visible docs, which is arrays and the containment of single elements. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
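Concretely, the documented behaviour and its one exception look like this (illustration):

-- Containment requires giving the full path to nested values:
SELECT '{"foo": {"bar": "baz"}}'::jsonb @> '{"foo": {"bar": "baz"}}';  -- true
SELECT '{"foo": {"bar": "baz"}}'::jsonb @> '{"bar": "baz"}';           -- false
-- The exception: an array may contain a bare scalar element:
SELECT '["a", "b"]'::jsonb @> '"a"';                                   -- true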
Re: [HACKERS] proposal: Set effective_cache_size to greater of .conf value, shared_buffers
On 2014-05-07 13:51:57 -0700, Jeff Janes wrote: > On Wed, May 7, 2014 at 11:38 AM, Andres Freund wrote: > > > On 2014-05-07 13:32:41 -0500, Merlin Moncure wrote: > > > > > > *) raising shared buffers does not 'give more memory to postgres for > > > caching' -- it can only reduce it via double paging > > > > That's absolutely not a necessary consequence. If pages are in s_b for a > > while the OS will be perfectly happy to throw them away. > > > > Is that an empirical observation? Yes. > I've run some simulations a couple years > ago, and also wrote some instrumentation to test that theory under > favorably engineered (but still plausible) conditions, and couldn't get > more than a small fraction of s_b to be so tightly bound in that the kernel > could forget about them. Unless of course the entire workload or close to > it fits in s_b. I think it depends on your IO access patterns. If the whole working set fits into the kernel's page cache and there's no other demand for pages it will stay in. If you constantly rewrite most all your pages they'll also stay in the OS cache because they'll get written out. If the churn in shared_buffers is so high (because it's so small in comparison to the core hot data set) that there'll be dozens if not hundreds clock sweeps a second you'll also have no locality. It's also *hugely* kernel version specific :( Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers