Re: [HACKERS] strncpy is not a safe version of strcpy
Noah Misch writes:
> On Wed, Aug 13, 2014 at 10:21:50AM -0400, Tom Lane wrote:
>> I believe that we deal with this by the expedient of checking the lengths
>> of tablespace paths in advance, when the tablespace is created.

> The files under scrutiny here are not located in a tablespace.  Even if they
> were, isn't the length of $PGDATA/pg_tblspc the important factor?

The length of $PGDATA is of no relevance whatsoever; we chdir into that
directory at startup, and subsequently all paths are implicitly relative to
there.  If there is any backend code that's prepending $PGDATA to something
else, it's wrong to start with.

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
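To make the point concrete, here is a minimal standalone sketch (not PostgreSQL source; the file names are only examples) of why the length of the data directory path stops mattering once the process has changed into it:

#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *datadir = argv[1];	/* e.g. an arbitrarily long $PGDATA */

	if (argc < 2 || chdir(datadir) != 0)
		return 1;

	/*
	 * From here on, every path is resolved relative to the data directory,
	 * so only the relative part ("pg_tblspc/16384", "postmaster.pid", ...)
	 * counts against any fixed-size path buffer.
	 */
	FILE *f = fopen("postmaster.pid", "r");
	if (f)
		fclose(f);
	return 0;
}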
Re: [HACKERS] [patch] pg_copy - a command for reliable WAL archiving
I fixed some minor mistakes.

Regards
MauMau

pg_copy_v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Compute attr_needed for child relations (was Re: [HACKERS] inherit support for foreign tables)
(2014/08/08 18:51), Etsuro Fujita wrote: >>> (2014/06/30 22:48), Tom Lane wrote: I wonder whether it isn't time to change that. It was coded like that originally only because calculating the values would've been a waste of cycles at the time. But this is at least the third place where it'd be useful to have attr_needed for child rels. > I've revised the patch. There was a problem with the previous patch, which will be described below. Attached is the updated version of the patch addressing that. The previous patch doesn't cope with some UNION ALL cases properly. So, e.g., the server will crash for the following query: postgres=# create table ta1 (f1 int); CREATE TABLE postgres=# create table ta2 (f2 int primary key, f3 int); CREATE TABLE postgres=# create table tb1 (f1 int); CREATE TABLE postgres=# create table tb2 (f2 int primary key, f3 int); CREATE TABLE postgres=# explain verbose select f1 from ((select f1, f2 from (select f1, f2, f3 from ta1 left join ta2 on f1 = f2 limit 1) ssa) union all (select f1, f2 from (select f1, f2, f3 from tb1 left join tb2 on f1 = f2 limit 1) ssb)) ss; With the updated version, we get the right result: postgres=# explain verbose select f1 from ((select f1, f2 from (select f1, f2, f3 from ta1 left join ta2 on f1 = f2 limit 1) ssa) union all (select f1, f2 from (select f1, f2, f3 from tb1 left join tb2 on f1 = f2 limit 1) ssb)) ss; QUERY PLAN Append (cost=0.00..0.05 rows=2 width=4) -> Subquery Scan on ssa (cost=0.00..0.02 rows=1 width=4) Output: ssa.f1 -> Limit (cost=0.00..0.01 rows=1 width=4) Output: ta1.f1, (NULL::integer), (NULL::integer) -> Seq Scan on public.ta1 (cost=0.00..34.00 rows=2400 width=4) Output: ta1.f1, NULL::integer, NULL::integer -> Subquery Scan on ssb (cost=0.00..0.02 rows=1 width=4) Output: ssb.f1 -> Limit (cost=0.00..0.01 rows=1 width=4) Output: tb1.f1, (NULL::integer), (NULL::integer) -> Seq Scan on public.tb1 (cost=0.00..34.00 rows=2400 width=4) Output: tb1.f1, NULL::integer, NULL::integer Planning time: 0.453 ms (14 rows) While thinking to address this problem, Ashutosh also expressed concern about the UNION ALL handling in the previous patch in a private email. Thank you for the review, Ashutosh! Thanks, Best regards, Etsuro Fujita *** a/contrib/file_fdw/file_fdw.c --- b/contrib/file_fdw/file_fdw.c *** *** 799,806 check_selective_binary_conversion(RelOptInfo *baserel, } /* Collect all the attributes needed for joins or final output. */ ! pull_varattnos((Node *) baserel->reltargetlist, baserel->relid, ! &attrs_used); /* Add all the attributes used by restriction clauses. */ foreach(lc, baserel->baserestrictinfo) --- 799,810 } /* Collect all the attributes needed for joins or final output. */ ! for (i = baserel->min_attr; i <= baserel->max_attr; i++) ! { ! if (!bms_is_empty(baserel->attr_needed[i - baserel->min_attr])) ! attrs_used = bms_add_member(attrs_used, ! i - FirstLowInvalidHeapAttributeNumber); ! } /* Add all the attributes used by restriction clauses. */ foreach(lc, baserel->baserestrictinfo) *** a/src/backend/optimizer/path/allpaths.c --- b/src/backend/optimizer/path/allpaths.c *** *** 577,588 set_append_rel_size(PlannerInfo *root, RelOptInfo *rel, childrel->has_eclass_joins = rel->has_eclass_joins; /* ! * Note: we could compute appropriate attr_needed data for the child's ! * variables, by transforming the parent's attr_needed through the ! * translated_vars mapping. However, currently there's no need ! * because attr_needed is only examined for base relations not ! * otherrels. 
So we just leave the child's attr_needed empty. */ /* * Compute the child's size. --- 577,585 childrel->has_eclass_joins = rel->has_eclass_joins; /* ! * Compute the child's attr_needed. */ + adjust_appendrel_attr_needed(rel, childrel, appinfo); /* * Compute the child's size. *** *** 2173,2178 remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel) --- 2170,2176 { Bitmapset *attrs_used = NULL; ListCell *lc; + int i; /* * Do nothing if subquery has UNION/INTERSECT/EXCEPT: in principle we *** *** 2193,2204 remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel) * Collect a bitmap of all the output column numbers used by the upper * query. * ! * Add all the attributes needed for joins or final output. Note: we must ! * look at reltargetlist, not the attr_needed data, because attr_needed ! * isn't computed for inheritance child rels, cf set_append_rel_size(). ! * (XXX might be worth changi
[HACKERS] proposal: allow to specify result tupdesc and mode in SPI API
Hello,

there is currently no simple way to specify the expected result types in the
SPI API functions. The planner has this functionality - see the
transformInsertRow function - but it is not exposed through SPI. A new
function could look like:

SPIPlanPtr SPI_prepare_params_rettupdesc(const char *src,
                                         ParserSetupHook parserSetup,
                                         void *parserSetupArg,
                                         int cursorOptions,
                                         TupDesc *retTupDesc,
                                         int CoercionMode)

CoercionMode would be one of:

COERCION_MODE_SQL .. same as INSERT or UPDATE does
COERCION_MODE_SQL_NOERROR .. same as above, with a possible IO cast
COERCION_MODE_EXPLICIT .. same as using explicit casting
COERCION_MODE_EXPLICIT_NOERROR .. same as previous, with a possible IO cast

Benefits:

* simplify life for SPI users - no late casting necessary
* possible small simplification of plpgsql, with two benefits:
** reduce the performance impact of the hidden IO cast
** reduce possible issues with type transformation via the hidden IO cast

Comments, notes?

Regards

Pavel Stehule
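For illustration only, a hypothetical caller of the proposed API might look roughly like this; neither the function nor the COERCION_MODE_SQL constant exists today, and the exact argument types are just a sketch of the signature above:

#include "postgres.h"
#include "access/tupdesc.h"
#include "catalog/pg_type.h"
#include "executor/spi.h"

static void
fetch_one_int_column(const char *query)
{
	TupleDesc	want = CreateTemplateTupleDesc(1, false);
	SPIPlanPtr	plan;

	/* the caller states the row shape it wants: a single int4 column */
	TupleDescInitEntry(want, (AttrNumber) 1, "val", INT4OID, -1, 0);

	/* hypothetical: the function proposed in this mail */
	plan = SPI_prepare_params_rettupdesc(query,
										 NULL,	/* parserSetup */
										 NULL,	/* parserSetupArg */
										 0,		/* cursorOptions */
										 &want,	/* desired result tupdesc */
										 COERCION_MODE_SQL);

	if (SPI_execute_plan(plan, NULL, NULL, true, 0) != SPI_OK_SELECT)
		elog(ERROR, "query failed");

	/* SPI_tuptable rows would now already match 'want'; no later IO cast */
}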
Re: [HACKERS] strncpy is not a safe version of strcpy
On Wed, Aug 13, 2014 at 10:21:50AM -0400, Tom Lane wrote: > Kevin Grittner writes: > > I am concerned that failure to check for truncation could allow > > deletion of unexpected files or directories. > > I believe that we deal with this by the expedient of checking the lengths > of tablespace paths in advance, when the tablespace is created. The files under scrutiny here are not located in a tablespace. Even if they were, isn't the length of $PGDATA/pg_tblspc the important factor? $PGDATA can change between runs if the DBA moves the data directory or reaches it via different symlinks, so any DDL-time defense would be incomplete. > > Some might consider it overkill, but I tend to draw a pretty hard > > line on deleting or executing random files, even if the odds seem > > to be that the mangled name won't find a match. Granted, those > > problems exist now, but without checking for truncation it seems to > > me that we're just deleting *different* incorrect filenames, not > > really fixing the problem. I share your (Kevin's) discomfort with our use of strlcpy(). I wouldn't mind someone replacing most strlcpy()/snprintf() calls with calls to wrappers that ereport(ERROR) on truncation. Though as reliability problems go, this one has been minor. David's specific patch has no concrete problem: On Wed, Aug 13, 2014 at 10:26:01PM +1200, David Rowley wrote: > --- a/contrib/pg_archivecleanup/pg_archivecleanup.c > +++ b/contrib/pg_archivecleanup/pg_archivecleanup.c > @@ -108,7 +108,7 @@ CleanupPriorWALFiles(void) > { > while (errno = 0, (xlde = readdir(xldir)) != NULL) > { > - strncpy(walfile, xlde->d_name, MAXPGPATH); > + strlcpy(walfile, xlde->d_name, MAXPGPATH); The code proceeds to check strlen(walfile) == XLOG_DATA_FNAME_LEN, so a long name can't trick it. > TrimExtension(walfile, additional_ext); > > /* > diff --git a/src/backend/access/transam/xlogarchive.c > b/src/backend/access/transam/xlogarchive.c > index 37745dc..0c9498a 100644 > --- a/src/backend/access/transam/xlogarchive.c > +++ b/src/backend/access/transam/xlogarchive.c > @@ -459,7 +459,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname) > xlogfpath, oldpath))); > } > #else > - strncpy(oldpath, xlogfpath, MAXPGPATH); > + strlcpy(oldpath, xlogfpath, MAXPGPATH); This one never overflows, because it's copying from one MAXPGPATH buffer to another. Plain strcpy() would be fine, too. -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
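For what it's worth, the kind of wrapper suggested above could be as small as the following backend-side sketch; nothing like this exists in the tree, and the function name and errcode choice are illustrative only:

#include "postgres.h"

static void
strlcpy_or_error(char *dst, const char *src, size_t dstsize)
{
	/*
	 * strlcpy() returns the length it tried to create; a value >= dstsize
	 * means the copy was truncated, which we treat as an error instead of
	 * silently proceeding with a mangled name.
	 */
	if (strlcpy(dst, src, dstsize) >= dstsize)
		ereport(ERROR,
				(errcode(ERRCODE_NAME_TOO_LONG),
				 errmsg("file name too long: \"%s\"", src)));
}

/* e.g.  strlcpy_or_error(oldpath, xlogfpath, MAXPGPATH); */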
Re: [HACKERS] replication commands and log_statements
On Thu, Aug 14, 2014 at 5:56 AM, Stephen Frost wrote: > * Amit Kapila (amit.kapil...@gmail.com) wrote: > Oh, to be clear- I agree that logging of queries executed through SPI is > desirable; I'd certainly like to have that capability without having to > go through the auto-explain module or write my own module. That doesn't > mean we should consider them either "internal" (which I don't really > agree with- consider anonymous DO blocks...) or somehow related to > replication logging. > > > >Could we enable logging of both with a single GUC? > > > > > > I don't see those as the same at all. I'd like to see improved > > > flexibility in this area, certainly, but don't want two independent > > > considerations like these tied to one GUC.. > > > > Agreed, I also think both are different and shouldn't be logged > > with one GUC. Here the important thing to decide is which is > > the better way to proceed for allowing logging of replication > > commands such that it can allow us to extend it for more > > things. We have discussed below options: > > > > a. Make log_statement a list of comma separated options > > b. Have a separate parameter something like log_replication* > > c. Log it when user has mentioned option 'all' in log_statement. > > Regarding this, I'm generally in the camp that says to just include it > in 'all' and be done with it- for now. Okay, but tomorrow if someone wants to implement a patch to log statements executed through SPI (statements inside functions), then what will be your suggestion, does those also can be allowed to log with 'all' option or you would like to suggest him to wait for a proper auditing system? Wouldn't allowing to log everything under 'all' option can start confusing some users without having individual (ddl, mod, replication, ...) options for different kind of statements. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
[HACKERS] pg_dump bug in 9.4beta2 and HEAD
I'm seeing an assertion failure with "pg_dump -c --if-exists" which is not
ready to handle BLOBs it seems:

pg_dump: pg_backup_archiver.c:472: RestoreArchive: Assertion `mark != ((void *)0)' failed.

To reproduce:

$ createdb test
$ pg_dump -c --if-exists test
  (works, dumps empty database)
$ psql test -c "select lo_create(1);"
$ pg_dump -c --if-exists test
  (fails, with the above mentioned assertion)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Aug 13, 2014 at 4:25 PM, Andres Freund wrote: > > On 2014-08-13 09:51:58 +0530, Amit Kapila wrote: > > Overall, the main changes required in patch as per above feedback > > are: > > 1. add an additional counter for the number of those > > allocations not satisfied from the free list, with a > > name like buffers_alloc_clocksweep. > > 2. Autotune the low and high threshold values for buffers > > in freelist. In the patch, I have kept them as hard-coded > > values. > > 3. For populating freelist, have a separate process (bgreclaim) > > instead of doing it by bgwriter. > > I'm not convinced that 3) is the right way to go to be honest. Seems > like a huge bandaid to me. Doing both (populating freelist and flushing dirty buffers) via bgwriter isn't the best way either because it might not be able to perform both the jobs as per need. One example is it could take much longer time to flush a dirty buffer than to move it into free list, so if there are few buffers which we need to flush, then I think task of maintaining buffers in freelist will get hit even though there are buffers in list which can be moved to free list(non-dirty buffers). Another is maintaining the current behaviour of bgwriter which is to scan the entire buffer pool every few mins (assuming default configuration). We can attempt to solve this problem as suggested by Robert upthread but I am not completely sure if that can guarantee that the current behaviour will be retained as it is. I am not telling that having a separate process won't have any issues, but I think we can tackle them without changing or complicating current bgwriter behaviour. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] jsonb format is pessimal for toast compression
Andrew Dunstan writes: > On 08/13/2014 09:01 PM, Tom Lane wrote: >>> That's a fair question. I did a very very simple hack to replace the item >>> offsets with item lengths -- turns out that that mostly requires removing >>> some code that changes lengths to offsets ;-). > What does changing to lengths do to the speed of other operations? This was explicitly *not* an attempt to measure the speed issue. To do a fair trial of that, you'd have to work a good bit harder, methinks. Examining each of N items would involve O(N^2) work with the patch as posted, but presumably you could get it down to less in most or all cases --- in particular, sequential traversal could be done with little added cost. But it'd take a lot more hacking. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication commands and log_statements
On Thu, Aug 14, 2014 at 10:40 AM, Fujii Masao wrote:
> On Thu, Aug 14, 2014 at 9:26 AM, Stephen Frost wrote:
>> * Amit Kapila (amit.kapil...@gmail.com) wrote:
>>> On Wed, Aug 13, 2014 at 4:24 AM, Stephen Frost wrote:
>>> > Not entirely sure what you're referring to as 'internally generated'
>>> > here..
>>>
>>> Here 'internally generated' means that user doesn't execute those
>>> statements, rather the replication/backup code form these statements
>>> (IDENTIFY_SYSTEM, TIMELINE_HISTORY, BASE_BACKUP, ...)
>>> and send to server to get the appropriate results.
>>
>> You could argue the same about pg_dump.. I'd not thought of it before,
>> but it might be kind of neat to have psql support "connect in
>> replication mode" and then allow the user to run replication commands.
>
> You can do that by specifying "replication=1" as the conninfo when
> exexcuting psql. For example,
>
> $ psql -d "replication=1"
> psql (9.5devel)
> Type "help" for help.
>
> postgres=# IDENTIFY_SYSTEM;
>       systemid       | timeline |  xlogpos  | dbname
> ---------------------+----------+-----------+--------
>  6047222920639525794 |        1 | 0/1711678 |
> (1 row)

More details here:
http://www.postgresql.org/docs/9.4/static/protocol-replication.html

Note as well that not all the commands work though:
=# START_REPLICATION PHYSICAL 0/0;
unexpected PQresultStatus: 8
=# START_REPLICATION PHYSICAL 0/0;
PQexec not allowed during COPY BOTH

We may do something to improve about that but I am not sure it is worth it.

Regards,
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] option -T in pg_basebackup doesn't work on windows
On Thu, Aug 14, 2014 at 12:50 AM, MauMau wrote:
> The code change appears correct, but the patch application failed against
> the latest source code.  I don't know why.  Could you confirm this?
>
> patching file src/bin/pg_basebackup/pg_basebackup.c
> Hunk #1 FAILED at 1119.
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/bin/pg_basebackup/pg_basebackup.c.rej

This conflict is caused by f25e0bf.

> On the following line, I think %d must be %u, because Oid is an unsigned
> integer.
> char *linkloc = psprintf("%s/pg_tblspc/%d", basedir, oid);

That's correct.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On 08/13/2014 09:01 PM, Tom Lane wrote: I wrote: That's a fair question. I did a very very simple hack to replace the item offsets with item lengths -- turns out that that mostly requires removing some code that changes lengths to offsets ;-). I then loaded up Larry's example of a noncompressible JSON value, and compared pg_column_size() which is just about the right thing here since it reports datum size after compression. Remembering that the textual representation is 12353 bytes: json: 382 bytes jsonb, using offsets: 12593 bytes jsonb, using lengths: 406 bytes Oh, one more result: if I leave the representation alone, but change the compression parameters to set first_success_by to INT_MAX, this value takes up 1397 bytes. So that's better, but still more than a 3X penalty compared to using lengths. (Admittedly, this test value probably is an outlier compared to normal practice, since it's a hundred or so repetitions of the same two strings.) What does changing to lengths do to the speed of other operations? cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication commands and log_statements
On Thu, Aug 14, 2014 at 9:26 AM, Stephen Frost wrote: > * Amit Kapila (amit.kapil...@gmail.com) wrote: >> On Wed, Aug 13, 2014 at 4:24 AM, Stephen Frost wrote: >> > Not entirely sure what you're referring to as 'internally generated' >> > here.. >> >> Here 'internally generated' means that user doesn't execute those >> statements, rather the replication/backup code form these statements >> (IDENTIFY_SYSTEM, TIMELINE_HISTORY, BASE_BACKUP, ...) >> and send to server to get the appropriate results. > > You could argue the same about pg_dump.. I'd not thought of it before, > but it might be kind of neat to have psql support "connect in > replication mode" and then allow the user to run replication commands. You can do that by specifying "replication=1" as the conninfo when exexcuting psql. For example, $ psql -d "replication=1" psql (9.5devel) Type "help" for help. postgres=# IDENTIFY_SYSTEM; systemid | timeline | xlogpos | dbname -+--+---+ 6047222920639525794 |1 | 0/1711678 | (1 row) >> Agreed, I also think both are different and shouldn't be logged >> with one GUC. Here the important thing to decide is which is >> the better way to proceed for allowing logging of replication >> commands such that it can allow us to extend it for more >> things. We have discussed below options: >> >> a. Make log_statement a list of comma separated options >> b. Have a separate parameter something like log_replication* >> c. Log it when user has mentioned option 'all' in log_statement. > > Regarding this, I'm generally in the camp that says to just include it > in 'all' and be done with it- for now. This is just another example > of where we really need a much more granular and configurable framework > for auditing and we're not going to get through GUCs, so let's keep the > GUC based approach simple and familiar to our users and build a proper > auditing system. +1 Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
I wrote: > That's a fair question. I did a very very simple hack to replace the item > offsets with item lengths -- turns out that that mostly requires removing > some code that changes lengths to offsets ;-). I then loaded up Larry's > example of a noncompressible JSON value, and compared pg_column_size() > which is just about the right thing here since it reports datum size after > compression. Remembering that the textual representation is 12353 bytes: > json: 382 bytes > jsonb, using offsets: 12593 bytes > jsonb, using lengths: 406 bytes Oh, one more result: if I leave the representation alone, but change the compression parameters to set first_success_by to INT_MAX, this value takes up 1397 bytes. So that's better, but still more than a 3X penalty compared to using lengths. (Admittedly, this test value probably is an outlier compared to normal practice, since it's a hundred or so repetitions of the same two strings.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
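For reference, a sketch of what "changing the compression parameters" means here: the knob lives in the PGLZ_Strategy struct. The header location and the default field values below are quoted from memory and should be double-checked against the tree; only first_success_by is the point of the example.

#include <limits.h>

#include "postgres.h"
#include "utils/pg_lzcompress.h"

/*
 * Like the default strategy, except that pglz never gives up just because
 * the first kilobyte of input (e.g. a jsonb offset header) didn't compress.
 */
static const PGLZ_Strategy no_early_bailout = {
	32,			/* min_input_size */
	INT_MAX,	/* max_input_size */
	25,			/* min_comp_rate, in percent */
	INT_MAX,	/* first_success_by: scan the whole input before giving up */
	128,		/* match_size_good */
	10			/* match_size_drop */
};

/* usage (backend code): pglz_compress(source, slen, dest, &no_early_bailout); */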
Re: [HACKERS] jsonb format is pessimal for toast compression
Bruce Momjian writes: > Seems we have two issues: > 1) the header makes testing for compression likely to fail > 2) use of pointers rather than offsets reduces compression potential > I understand we are focusing on #1, but how much does compression reduce > the storage size with and without #2? Seems we need to know that answer > before deciding if it is worth reducing the ability to do fast lookups > with #2. That's a fair question. I did a very very simple hack to replace the item offsets with item lengths -- turns out that that mostly requires removing some code that changes lengths to offsets ;-). I then loaded up Larry's example of a noncompressible JSON value, and compared pg_column_size() which is just about the right thing here since it reports datum size after compression. Remembering that the textual representation is 12353 bytes: json: 382 bytes jsonb, using offsets: 12593 bytes jsonb, using lengths: 406 bytes So this confirms my suspicion that the choice of offsets not lengths is what's killing compressibility. If it used lengths, jsonb would be very nearly as compressible as the original text. Hack attached in case anyone wants to collect more thorough statistics. We'd not actually want to do it like this because of the large expense of recomputing the offsets on-demand all the time. (It does pass the regression tests, for what that's worth.) regards, tom lane diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c index 04f35bf..2297504 100644 *** a/src/backend/utils/adt/jsonb_util.c --- b/src/backend/utils/adt/jsonb_util.c *** convertJsonbArray(StringInfo buffer, JEn *** 1378,1385 errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", JENTRY_POSMASK))); - if (i > 0) - meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } --- 1378,1383 *** convertJsonbObject(StringInfo buffer, JE *** 1430,1437 errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", JENTRY_POSMASK))); - if (i > 0) - meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); --- 1428,1433 *** convertJsonbObject(StringInfo buffer, JE *** 1445,1451 errmsg("total size of jsonb array elements exceeds the maximum of %u bytes", JENTRY_POSMASK))); - meta = (meta & ~JENTRY_POSMASK) | totallen; copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry)); metaoffset += sizeof(JEntry); } --- 1441,1446 *** uniqueifyJsonbObject(JsonbValue *object) *** 1592,1594 --- 1587,1600 object->val.object.nPairs = res + 1 - object->val.object.pairs; } } + + uint32 + jsonb_get_offset(const JEntry *ja, int index) + { + uint32 off = 0; + int i; + + for (i = 0; i < index; i++) + off += JBE_LEN(ja, i); + return off; + } diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h index 5f2594b..c9b18e1 100644 *** a/src/include/utils/jsonb.h --- b/src/include/utils/jsonb.h *** typedef uint32 JEntry; *** 153,162 * Macros for getting the offset and length of an element. Note multiple * evaluations and access to prior array element. */ ! #define JBE_ENDPOS(je_) ((je_) & JENTRY_POSMASK) ! #define JBE_OFF(ja, i) ((i) == 0 ? 0 : JBE_ENDPOS((ja)[i - 1])) ! #define JBE_LEN(ja, i) ((i) == 0 ? JBE_ENDPOS((ja)[i]) \ ! : JBE_ENDPOS((ja)[i]) - JBE_ENDPOS((ja)[i - 1])) /* * A jsonb array or object node, within a Jsonb Datum. --- 153,163 * Macros for getting the offset and length of an element. 
Note multiple * evaluations and access to prior array element. */ ! #define JBE_LENFLD(je_) ((je_) & JENTRY_POSMASK) ! #define JBE_OFF(ja, i) jsonb_get_offset(ja, i) ! #define JBE_LEN(ja, i) JBE_LENFLD((ja)[i]) ! ! extern uint32 jsonb_get_offset(const JEntry *ja, int index); /* * A jsonb array or object node, within a Jsonb Datum. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] replication commands and log_statements
* Amit Kapila (amit.kapil...@gmail.com) wrote: > On Wed, Aug 13, 2014 at 4:24 AM, Stephen Frost wrote: > > Not entirely sure what you're referring to as 'internally generated' > > here.. > > Here 'internally generated' means that user doesn't execute those > statements, rather the replication/backup code form these statements > (IDENTIFY_SYSTEM, TIMELINE_HISTORY, BASE_BACKUP, ...) > and send to server to get the appropriate results. You could argue the same about pg_dump.. I'd not thought of it before, but it might be kind of neat to have psql support "connect in replication mode" and then allow the user to run replication commands. Still, this isn't the same. > Yes, few days back I have seen one of the user expects > such functionality. Refer below mail: > http://www.postgresql.org/message-id/caa4ek1lvuirqnmjv1vtmrg_f+1f9e9-8rdgnwidscqvtps1...@mail.gmail.com Oh, to be clear- I agree that logging of queries executed through SPI is desirable; I'd certainly like to have that capability without having to go through the auto-explain module or write my own module. That doesn't mean we should consider them either "internal" (which I don't really agree with- consider anonymous DO blocks...) or somehow related to replication logging. > >Could we enable logging of both with a single GUC? > > > > I don't see those as the same at all. I'd like to see improved > > flexibility in this area, certainly, but don't want two independent > > considerations like these tied to one GUC.. > > Agreed, I also think both are different and shouldn't be logged > with one GUC. Here the important thing to decide is which is > the better way to proceed for allowing logging of replication > commands such that it can allow us to extend it for more > things. We have discussed below options: > > a. Make log_statement a list of comma separated options > b. Have a separate parameter something like log_replication* > c. Log it when user has mentioned option 'all' in log_statement. Regarding this, I'm generally in the camp that says to just include it in 'all' and be done with it- for now. This is just another example of where we really need a much more granular and configurable framework for auditing and we're not going to get through GUCs, so let's keep the GUC based approach simple and familiar to our users and build a proper auditing system. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
Tom, I appreciate your patience and explanation. (I am new to PostgreSQL hacking. I have read many old posts but not all of it sticks, sorry). I know QNX support is not high on your TODO list, so I am trying to keep the effort moving without being a distraction. Couldn't backend "random code" corrupt any file in the PGDATA dir? Perhaps the new fcntl lock file could be kept outside PGDATA directory tree to make likelihood of backend "random code" interference remote. This could be present and used only on systems without System V shared memory (QNX), leaving existing platforms unaffected. I know this falls short of perfect, but perhaps is good enough to get the QNX port off the ground. I would rather have a QNX port with reasonable restrictions than no port at all. Also, I will try to experiment with named pipe locking as Robert had suggested. Thanks again for your feedback, I really do appreciate it. -Keith Baker > -Original Message- > From: Tom Lane [mailto:t...@sss.pgh.pa.us] > Sent: Wednesday, August 13, 2014 7:05 PM > To: Baker, Keith [OCDUS Non-J&J] > Cc: Robert Haas; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL > > "Baker, Keith [OCDUS Non-J&J]" writes: > > I assume you guys are working on other priorities, so I did some locking > experiments on QNX. > > > I know fcntl() locking has downsides, but I think it deserves a second look: > > - it is POSIX, so should be fairly consistent across platforms (at > > least more consistent than lockf and flock) > > - the "accidental" open/close lock release can be easily avoided > > (simply don't add new code which touches the new, unique lock file) > > I guess you didn't read the previous discussion. Asserting that it's "easy to > avoid" an accidental unlock doesn't make it true. In the case of a PG > backend, we have to expect that people will run random code inside, say, > plperlu or plpythonu functions. And it doesn't seem unlikely that someone > might scan the entire PGDATA directory tree as part of, for example, a > backup or archiving operation. If we had full control of everything that ever > happens in a PG backend process then *maybe* we could have adequate > confidence that we'd never lose the lock, but we don't. > > regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] What happened to jsonb's JENTRY_ISFIRST?
On Wed, Aug 13, 2014 at 3:47 PM, Tom Lane wrote: > If the bit is unused now, should we be worrying about reclaiming it for > better use? Like say allowing jsonb's to be larger than just a quarter > of the maximum datum size? Commit d9daff0e0cb15221789e6c50d9733c8754c054fb removed it. This is an obsolete comment. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
"Baker, Keith [OCDUS Non-J&J]" writes: > I assume you guys are working on other priorities, so I did some locking > experiments on QNX. > I know fcntl() locking has downsides, but I think it deserves a second look: > - it is POSIX, so should be fairly consistent across platforms (at least more > consistent than lockf and flock) > - the "accidental" open/close lock release can be easily avoided (simply > don't add new code which touches the new, unique lock file) I guess you didn't read the previous discussion. Asserting that it's "easy to avoid" an accidental unlock doesn't make it true. In the case of a PG backend, we have to expect that people will run random code inside, say, plperlu or plpythonu functions. And it doesn't seem unlikely that someone might scan the entire PGDATA directory tree as part of, for example, a backup or archiving operation. If we had full control of everything that ever happens in a PG backend process then *maybe* we could have adequate confidence that we'd never lose the lock, but we don't. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] What happened to jsonb's JENTRY_ISFIRST?
jsonb.h claims that the high order bit of a JEntry word is set on the first element of a JEntry array. However, AFAICS, JBE_ISFIRST() is not used anywhere, which is a good thing because it refers to a constant JENTRY_ISFIRST that's not defined anywhere. Is this comment just a leftover from a convention that's been dropped, or is it still implemented but not via this macro? If the bit is unused now, should we be worrying about reclaiming it for better use? Like say allowing jsonb's to be larger than just a quarter of the maximum datum size? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
Robert and Tom, I assume you guys are working on other priorities, so I did some locking experiments on QNX. I know fcntl() locking has downsides, but I think it deserves a second look: - it is POSIX, so should be fairly consistent across platforms (at least more consistent than lockf and flock) - the "accidental" open/close lock release can be easily avoided (simply don't add new code which touches the new, unique lock file) - don't know if it will work on NFS, but that is not a priority for me (is that really a requirement for a QNX port?) Existing System V shared memory locking can be left in place for all existing platforms (so nothing lost), while fcntl()-style locking could be limited to platforms which lack System V shared memory (like QNX). Experimental patch is attached, but logic is basically this: a. postmaster obtains exclusive lock on data dir file "postmaster.fcntl" (or FATAL) b. postmaster then downgrades to shared lock (or FATAL) c. all other backend processes obtain shared lock on this file (or FATAL) A quick test on QNX 6.5 appeared to behave well (orphan backends left behind after kill -9 of postmaster held their locks, thus database restart was prevented as desired). Let me know if there are other test scenarios to consider. Thanks! -Keith Baker > -Original Message- > From: Robert Haas [mailto:robertmh...@gmail.com] > Sent: Thursday, July 31, 2014 12:58 PM > To: Tom Lane > Cc: Baker, Keith [OCDUS Non-J&J]; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL > > On Wed, Jul 30, 2014 at 11:02 AM, Tom Lane wrote: > > So it seems like we could possibly go this route, assuming we can > > think of a variant of your proposal that's race-condition-free. A > > disadvantage compared to a true file lock is that it would not protect > > against people trying to start postmasters from two different NFS > > client machines --- but we don't have protection against that now. > > (Maybe we could do this *and* do a regular file lock to offer some > > protection against that case, even if it's not bulletproof?) > > That's not a bad idea. By the way, it also wouldn't be too hard to test at > runtime whether or not flock() has first-close semantics. Not that we'd want > this exact design, but suppose you configure shmem_interlock=flock in > postgresql.conf. On startup, we test whether flock is reliable, determine > that it is, and proceed accordingly. > Now, you move your database onto an NFS volume and the semantics > change (because, hey, breaking userspace assumptions is fun) and try to > restart up your database, and it says FATAL: flock() is broken. > Now you can either move the database back, or set shmem_interlock to > some other value. > > Now maybe, as you say, it's best to use multiple locking protocols and hope > that at least one will catch whatever the dangerous situation is. > I'm just trying to point out that we need not blindly assume the semantics we > want are there (or that they are not); we can check. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL > Company fcntl_lock_20140813.patch Description: fcntl_lock_20140813.patch -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
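A minimal sketch of the locking scheme described above (exclusive-then-downgrade in the postmaster, shared in every backend); the file name and error handling are illustrative only, and this is not the attached patch:

#include <stdbool.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Take (or convert to) a whole-file fcntl lock without blocking; returns
 * false if a conflicting lock is held by another process.
 */
static bool
set_fcntl_lock(int fd, bool exclusive)
{
	struct flock lk;

	lk.l_type = exclusive ? F_WRLCK : F_RDLCK;
	lk.l_whence = SEEK_SET;
	lk.l_start = 0;
	lk.l_len = 0;				/* 0 = lock the whole file */

	return fcntl(fd, F_SETLK, &lk) == 0;
}

/*
 * Postmaster startup:
 *   fd = open("postmaster.fcntl", O_RDWR | O_CREAT, 0600);
 *   if (!set_fcntl_lock(fd, true))   -> FATAL: orphaned backends still hold shared locks
 *   set_fcntl_lock(fd, false);        -> downgrade to shared (atomic with fcntl)
 *
 * Every backend (on its own descriptor for the same file):
 *   if (!set_fcntl_lock(fd, false))  -> FATAL
 *
 * The usual fcntl caveat remains: if the process ever close()s any other
 * descriptor for this file, its lock is silently released.
 */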
Re: [HACKERS] how to implement selectivity injection in postgresql
Jeff Janes writes:
> On Wed, Aug 13, 2014 at 9:33 AM, Rajmohan C wrote:
>> I need to implement Selectivity injection as shown in above query in
>> PostgreSQL by which we can inject selectivity of each predicate or at least
>> selectivity at relation level directly as part of query.

> My plan was to create a boolean operator which always returns true, but
> estimates its own selectivity as 0.001 (or better yet, parameterize that
> selectivity estimate, if that is possible) which can be inserted into the
> place where lower selectivity estimate is needed with an "AND".

That doesn't seem especially helpful/convenient, especially not if you're
trying to affect the estimation of a join clause.

The last discussion I remember on this subject was to invent a special
dummy function that would be understood by the planner and would work
sort of like __builtin_expect() in gcc:

	selectivity(condition bool, probability float8) returns bool

Semantically the function would just return its first argument (and the
function itself would disappear at runtime) but the planner would take the
value of the second argument as a selectivity estimate overriding whatever
it might've otherwise deduced about the "condition".

So you'd use it like

	SELECT ... WHERE selectivity(id = 42, 0.0001)

and get functionally the same results as for

	SELECT ... WHERE id = 42

but with a different selectivity estimate for that WHERE condition.

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas wrote: > Here's a full version of this refactoring patch, all the rmgr's have > now been updated to use XLogReplayBuffer(). I think this is a > worthwhile change on its own, even if we drop the ball on the rest > of the WAL format patch, because it makes the redo-routines more > readable. I propose to commit this as soon as someone has reviewed > it, and we agree on a for what's currently called > XLogReplayBuffer(). I have tested this with my page-image comparison > tool. What's with XLogReplayLSN and XLogReplayRecord? IMO the xlog code has more than enough global variables already, it'd be good to avoid two more if possible. Is there no other good way to get this info down to whatever it is that needs them? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
Heikki Linnakangas wrote: > Here's a full version of this refactoring patch, all the rmgr's have > now been updated to use XLogReplayBuffer(). I think this is a > worthwhile change on its own, even if we drop the ball on the rest > of the WAL format patch, because it makes the redo-routines more > readable. I propose to commit this as soon as someone has reviewed > it, and we agree on a for what's currently called > XLogReplayBuffer(). I have tested this with my page-image comparison > tool. XLogReadBufferForReplay ? ReadBufferForXLogReplay ? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] how to implement selectivity injection in postgresql
On Wed, Aug 13, 2014 at 9:33 AM, Rajmohan C wrote: > SELECT c1, c2, c3, FROM T1, T2, T3 > WHERE T1.x = T2.x AND > T2.y=T3.y AND > T1.x >= ? selectivity 0.1 AND > T2.y > ? selectivity 0.5 AND > T3.z = ? selectivity 0.2 AND > T3.w = ? > > I need to implement Selectivity injection as shown in above query in > PostgreSQL by which we can inject selectivity of each predicate or at least > selectivity at relation level directly as part of query. Is there any > on-going work on this front? If there is no ongoing work on this, How > should I start implementing this feature? > My plan was to create a boolean operator which always returns true, but estimates its own selectivity as 0.001 (or better yet, parameterize that selectivity estimate, if that is possible) which can be inserted into the place where lower selectivity estimate is needed with an "AND". And another one that always returns false, but has a selectivity estimate near 1, for use in OR conditions when the opposite change is needed. I think that will be much easier to do than to extent the grammar. And probably more acceptable to the core team. I think this could be done simply in an extension module without even needing to change the core code, but I never got around to investigating exactly how. Cheers, Jeff
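As a sketch of the extension-module route, the C side could be as small as the function below. The operator name and the use of contsel (a built-in restriction estimator that returns a small fixed selectivity) are assumptions for illustration, not an existing facility:

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(always_true_int4);

/*
 * Comparison that ignores its arguments and always succeeds; the planner
 * still consults the operator's RESTRICT estimator, which is where the
 * artificially low selectivity comes from.
 */
Datum
always_true_int4(PG_FUNCTION_ARGS)
{
	PG_RETURN_BOOL(true);
}

/*
 * SQL side (also illustrative); deliberately not STRICT, so adding the
 * extra qual never filters out rows with NULLs:
 *
 *   CREATE FUNCTION always_true_int4(int4, int4) RETURNS bool
 *     AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE;
 *   CREATE OPERATOR === (LEFTARG = int4, RIGHTARG = int4,
 *                        PROCEDURE = always_true_int4,
 *                        RESTRICT = contsel);
 *
 * Then "WHERE t1.x >= 10 AND t1.x === 0" returns the same rows as the
 * first qual alone, but with a much lower row-count estimate.
 */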
Re: [HACKERS] PostrgeSQL vs oracle doing 1 million sqrts am I doing it wrong?
2014-08-08 2:13 GMT+02:00 Josh Berkus:
> On 08/07/2014 04:48 PM, Tom Lane wrote:
> > plpgsql is not efficient at all about coercions performed as a side
> > effect of assignments; if memory serves, it always handles them by
> > converting to text and back. So basically the added cost here came
> > from float8out() and float4in(). There has been some talk of trying
> > to do such coercions via SQL casts, but nothing's been done for fear
> > of compatibility problems.
>
> Yeah, that's a weeks-long project for someone. And would require a lot
> of tests ...

It is not a trivial task. There are two possible directions, and neither is
trivial (and I am not sure about the practical benefit for users - maybe for
some PostGIS users, or for some very synthetic benchmarks):

a) we can enhance plpgsql's exec_assign_value to accept a pointer to a cached
tuple map - relatively invasive in plpgsql, and without benefit to other PLs;

b) we can enhance the SPI API to accept a target TupleDesc (reusing
transformInsertRow) - a little less invasive in plpgsql, but it requires an
SPI API change. This path seems much preferable: it can be used by SQL/PSM
and by a lot of C extensions, and it is simpler to specify the expected
TupleDesc than to enforce casting by manipulating the SQL string. I have
missed this functionality more than once. I designed PL/pgPSM with the same
type strictness that PostgreSQL has, and this functionality would simplify
its code.

PLpgSQL uses spi_prepare_params .. we can enhance this function or we can
introduce a new spi_prepare_params_enforce_result_type.

Regards

Pavel

> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
Re: [HACKERS] how to implement selectivity injection in postgresql
On 13-08-2014 15:28, Rajmohan C wrote: > Yeah. I have to do it for my academic research. Is it available in > catalogs? It is to be computed at run time from the predicates in the query > right? > The selectivity information is available at runtime. See backend/optimizer/path/costsize.c. -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] how to implement selectivity injection in postgresql
On 13-08-2014 13:33, Rajmohan C wrote: > SELECT c1, c2, c3, FROM T1, T2, T3 > WHERE T1.x = T2.x AND > T2.y=T3.y AND > T1.x >= ? selectivity 0.1 AND > T2.y > ? selectivity 0.5 AND > T3.z = ? selectivity 0.2 AND > T3.w = ? > > I need to implement Selectivity injection as shown in above query in > PostgreSQL by which we can inject selectivity of each predicate or at least > selectivity at relation level directly as part of query. Is there any > on-going work on this front? If there is no ongoing work on this, How > should I start implementing this feature? > Do you want to force a selectivity? Why don't you let the optimizer do it for you? Trust me it can do it better than you. If you want to force those selectivities for an academic exercise, that information belongs to catalog or could be SET before query starts. Start reading backend/optimizer/README. -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Memory-bounded HashAgg
On 13.8.2014 12:31, Tomas Vondra wrote: > On 13 Srpen 2014, 7:02, Jeff Davis wrote: >> On Tue, 2014-08-12 at 14:58 +0200, Tomas Vondra wrote: >>> >>>(b) bad estimate of required memory -> this is common for aggregates >>>passing 'internal' state (planner uses some quite high defaults) >> >> Maybe some planner hooks? Ideas? > > My plan is to add this to the CREATE AGGREGATE somehow - either as a > constant parameter (allowing to set a custom constant size) or a callback > to a 'sizing' function (estimating the size based on number of items, > average width and ndistinct in the group). In any case, this is > independent of this patch. FWIW, the constant parameter is already implemented for 9.4. Adding the function seems possible - the most difficult part seems to be getting all the necessary info before count_agg_clauses() is called. For example now dNumGroups is evaluated after the call (and tuples/group seems like a useful info for sizing). While this seems unrelated to the patch discussed here, it's true that: (a) good estimate of the memory is important for initial estimate of batch count (b) dynamic increase of batch count alleviates issues from underestimating the amount of memory necessary for states But let's leave this out of scope for the current patch. regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb format is pessimal for toast compression
On Tue, Aug 12, 2014 at 8:00 PM, Bruce Momjian wrote: > On Mon, Aug 11, 2014 at 01:44:05PM -0700, Peter Geoghegan wrote: >> On Mon, Aug 11, 2014 at 1:01 PM, Stephen Frost wrote: >> > We've got a clear example of someone, quite reasonably, expecting their >> > JSONB object to be compressed using the normal TOAST mechanism, and >> > we're failing to do that in cases where it's actually a win to do so. >> > That's the focus of this discussion and what needs to be addressed >> > before 9.4 goes out. >> >> Sure. I'm not trying to minimize that. We should fix it, certainly. >> However, it does bear considering that JSON data, with each document >> stored in a row is not an effective target for TOAST compression in >> general, even as text. > > Seems we have two issues: > > 1) the header makes testing for compression likely to fail > 2) use of pointers rather than offsets reduces compression potential I do think the best solution for 2 is what's been proposed already, to do delta-coding of the pointers in chunks (ie, 1 pointer, 15 deltas, repeat). But it does make binary search quite more complex. Alternatively, it could be somewhat compressed as follows: Segment = 1 pointer head, 15 deltas Pointer head = pointers[0] delta[i] = pointers[i] - pointers[0] for i in 1..15 (delta to segment head, not previous value) Now, you can have 4 types of segments. 8, 16, 32, 64 bits, which is the size of the deltas. You achieve between 8x and 1x compression, and even when 1x (no compression), you make it easier for pglz to find something compressible. Accessing it is also simple, if you have a segment index (tough part here). Replace the 15 for something that makes such segment index very compact ;) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
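A toy sketch of the segment layout described above; nothing like this exists in the jsonb code, and the names and the 16-entry grouping are just the example from this mail:

#include <stdint.h>

#define SEG_ENTRIES 16				/* 1 absolute head + 15 deltas */

typedef struct OffsetSegment
{
	uint32_t	head;				/* absolute offset of the segment's first entry */
	uint8_t		delta_width;		/* 1, 2, 4 or 8 bytes per stored delta */
	uint64_t	delta[SEG_ENTRIES - 1];	/* deltas relative to head (packed on disk) */
} OffsetSegment;

/* Pick the narrowest width able to hold the largest delta in the segment. */
static uint8_t
segment_delta_width(const uint32_t *offsets, int n)
{
	uint64_t	max_delta = 0;
	int			i;

	for (i = 1; i < n; i++)
	{
		uint64_t	d = offsets[i] - offsets[0];

		if (d > max_delta)
			max_delta = d;
	}
	if (max_delta <= UINT8_MAX)
		return 1;
	if (max_delta <= UINT16_MAX)
		return 2;
	if (max_delta <= UINT32_MAX)
		return 4;
	return 8;
}

/* Random access stays cheap: one head lookup plus one delta. */
static uint32_t
segment_offset(const OffsetSegment *seg, int idx)
{
	return (idx == 0) ? seg->head : seg->head + (uint32_t) seg->delta[idx - 1];
}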
Re: [HACKERS] Proposal: Incremental Backup
On Tue, Aug 12, 2014 at 8:26 PM, Stephen Frost wrote: > * Claudio Freire (klaussfre...@gmail.com) wrote: >> I'm not talking about malicious attacks, with big enough data sets, >> checksum collisions are much more likely to happen than with smaller >> ones, and incremental backups are supposed to work for the big sets. > > This is an issue when you're talking about de-duplication, not when > you're talking about testing if two files are the same or not for > incremental backup purposes. The size of the overall data set in this > case is not relevant as you're only ever looking at the same (at most > 1G) specific file in the PostgreSQL data directory. Were you able to > actually produce a file with a colliding checksum as an existing PG > file, the chance that you'd be able to construct one which *also* has > a valid page layout sufficient that it wouldn't be obviously massivly > corrupted is very quickly approaching zero. True, but only with a strong hash, not an adler32 or something like that. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks
On 13.8.2014 17:52, Tom Lane wrote: > "Tomas Vondra" writes: >> So after 83 days, the regression tests on barnacle completed, and it >> smells like another memory leak in CacheMemoryContext, similar to those >> fixed in 078b2ed on May 18. > > I've pushed fixes for the issues I was able to identify by running the > create_view test. I definitely don't have the patience to run all the > tests this way, but if you do, please start a new run. > > A couple thoughts about barnacle's configuration: > > * It's not necessary to set -DCLOBBER_FREED_MEMORY or > -DMEMORY_CONTEXT_CHECKING > in CPPFLAGS; those are already selected by --enable-cassert. Removing > those -D switches would suppress a whole boatload of compiler warnings. I know - I already fixed this two months ago, but barnacle was still running with the old settings (and I decided to let it run). > * I'm a bit dubious about testing -DRANDOMIZE_ALLOCATED_MEMORY in the > same build as -DCLOBBER_CACHE_RECURSIVELY, because each of these is > darned expensive and it's not clear you'd learn anything by running > them both together. I think you might be better advised to run two > separate buildfarm critters with those two options, and thereby perhaps > get turnaround in something less than 80 days. OK, I removed this for barnacle/addax/mite, let's see what's the impact. FWIW We have three other animals running with CLOBBER_CACHE_ALWAYS + RANDOMIZE_ALLOCATED_MEMORY, and it takes ~20h per branch. But maybe the price when combined with CLOBBER_CACHE_RECURSIVELY is much higher. > * It'd likely be a good idea to take out the TestUpgrade and TestDecoding > modules from the config too. Otherwise, we won't be seeing barnacle's > next report before 2015, judging from the runtime of the check step > compared to some of the other slow buildfarm machines. (I wonder whether > there's an easy way to skip the installcheck step, as that's going to > require a much longer run than it can possibly be worth too.) OK, I did this too. I stopped the already running test on addax and started the test on barnacle again. Let's see in a few days/weeks/months what is the result. regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 08/13/2014 04:15 PM, Heikki Linnakangas wrote: Hmm, thinking about this some more, there is one sensible way to split this patch: We can add the XLogReplayBuffer() function and rewrite all the redo routines to use it, without changing any WAL record formats or anything in the way the WAL records are constructed. In the patch, XLogReplayBuffer() takes one input arument, the block reference ID, and it fetches the RelFileNode and BlockNumber of the block based on that. Without the WAL format changes, the information isn't there in the record, but we can require the callers to pass the RelFileNode and BlockNumber. The final patch will remove those arguments from every caller, but that's a very mechanical change. As in the attached patch. I only modified the heapam redo routines to use the new XLogReplayBuffer() idiom; the idea is to do that for every redo routine. After applying such a patch, the main WAL format changing patch becomes much smaller, and makes it easier to see from the redo routines where significant changes to the WAL record formats have been made. This also allows us to split the bikeshedding; we can discuss the name of XLogReplayBuffer() first :-). Here's a full version of this refactoring patch, all the rmgr's have now been updated to use XLogReplayBuffer(). I think this is a worthwhile change on its own, even if we drop the ball on the rest of the WAL format patch, because it makes the redo-routines more readable. I propose to commit this as soon as someone has reviewed it, and we agree on a for what's currently called XLogReplayBuffer(). I have tested this with my page-image comparison tool. - Heikki commit 4c6c7571e91f34919aff2c2981fe474537dc557b Author: Heikki Linnakangas Date: Wed Aug 13 15:39:08 2014 +0300 Refactor redo routines to use XLogReplayBuffer diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c index ffabf4e..acbb840 100644 --- a/src/backend/access/gin/ginxlog.c +++ b/src/backend/access/gin/ginxlog.c @@ -20,25 +20,23 @@ static MemoryContext opCtx; /* working memory for operations */ static void -ginRedoClearIncompleteSplit(XLogRecPtr lsn, RelFileNode node, BlockNumber blkno) +ginRedoClearIncompleteSplit(XLogRecPtr lsn, int block_index, + RelFileNode node, BlockNumber blkno) { Buffer buffer; Page page; - buffer = XLogReadBuffer(node, blkno, false); - if (!BufferIsValid(buffer)) - return; /* page was deleted, nothing to do */ - page = (Page) BufferGetPage(buffer); - - if (lsn > PageGetLSN(page)) + if (XLogReplayBuffer(block_index, node, blkno, &buffer) == BLK_NEEDS_REDO) { + page = (Page) BufferGetPage(buffer); + GinPageGetOpaque(page)->flags &= ~GIN_INCOMPLETE_SPLIT; PageSetLSN(page, lsn); MarkBufferDirty(buffer); } - - UnlockReleaseBuffer(buffer); + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); } static void @@ -351,26 +349,14 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record) rightChildBlkno = BlockIdGetBlockNumber((BlockId) payload); payload += sizeof(BlockIdData); - if (record->xl_info & XLR_BKP_BLOCK(0)) - (void) RestoreBackupBlock(lsn, record, 0, false, false); - else - ginRedoClearIncompleteSplit(lsn, data->node, leftChildBlkno); + ginRedoClearIncompleteSplit(lsn, 0, data->node, leftChildBlkno); } - /* If we have a full-page image, restore it and we're done */ - if (record->xl_info & XLR_BKP_BLOCK(isLeaf ? 0 : 1)) + if (XLogReplayBuffer(isLeaf ? 0 : 1, + data->node, data->blkno, &buffer) == BLK_NEEDS_REDO) { - (void) RestoreBackupBlock(lsn, record, isLeaf ? 
0 : 1, false, false); - return; - } - - buffer = XLogReadBuffer(data->node, data->blkno, false); - if (!BufferIsValid(buffer)) - return; /* page was deleted, nothing to do */ - page = (Page) BufferGetPage(buffer); + page = BufferGetPage(buffer); - if (lsn > PageGetLSN(page)) - { /* How to insert the payload is tree-type specific */ if (data->flags & GIN_INSERT_ISDATA) { @@ -386,8 +372,8 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record) PageSetLSN(page, lsn); MarkBufferDirty(buffer); } - - UnlockReleaseBuffer(buffer); + if (BufferIsValid(buffer)) + UnlockReleaseBuffer(buffer); } static void @@ -476,12 +462,7 @@ ginRedoSplit(XLogRecPtr lsn, XLogRecord *record) * split */ if (!isLeaf) - { - if (record->xl_info & XLR_BKP_BLOCK(0)) - (void) RestoreBackupBlock(lsn, record, 0, false, false); - else - ginRedoClearIncompleteSplit(lsn, data->node, data->leftChildBlkno); - } + ginRedoClearIncompleteSplit(lsn, 0, data->node, data->leftChildBlkno); flags = 0; if (isLeaf) @@ -607,29 +588,19 @@ ginRedoVacuumDataLeafPage(XLogRecPtr lsn, XLogRecord *record) Buffer buffer; Page page; - /* If we have a full-page image, restore it and we're done */ - if (record->xl_info & XLR_BKP_BLOCK(0)) + if (XLogReplayBuffer(0, xlrec->node, xlrec->blkno, &buffer) == BLK_NEEDS_REDO) { - (void) RestoreBackupBlock(lsn, record, 0, false, false); -
[HACKERS] how to implement selectivity injection in postgresql
SELECT c1, c2, c3 FROM T1, T2, T3 WHERE T1.x = T2.x AND T2.y = T3.y AND T1.x >= ? selectivity 0.1 AND T2.y > ? selectivity 0.5 AND T3.z = ? selectivity 0.2 AND T3.w = ? I need to implement selectivity injection in PostgreSQL, as shown in the query above, so that the selectivity of each predicate (or at least a per-relation selectivity) can be supplied directly as part of the query. Is there any ongoing work on this front? If not, how should I start implementing this feature?
Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks
"Tomas Vondra" writes: > So after 83 days, the regression tests on barnacle completed, and it > smells like another memory leak in CacheMemoryContext, similar to those > fixed in 078b2ed on May 18. I've pushed fixes for the issues I was able to identify by running the create_view test. I definitely don't have the patience to run all the tests this way, but if you do, please start a new run. A couple thoughts about barnacle's configuration: * It's not necessary to set -DCLOBBER_FREED_MEMORY or -DMEMORY_CONTEXT_CHECKING in CPPFLAGS; those are already selected by --enable-cassert. Removing those -D switches would suppress a whole boatload of compiler warnings. * I'm a bit dubious about testing -DRANDOMIZE_ALLOCATED_MEMORY in the same build as -DCLOBBER_CACHE_RECURSIVELY, because each of these is darned expensive and it's not clear you'd learn anything by running them both together. I think you might be better advised to run two separate buildfarm critters with those two options, and thereby perhaps get turnaround in something less than 80 days. * It'd likely be a good idea to take out the TestUpgrade and TestDecoding modules from the config too. Otherwise, we won't be seeing barnacle's next report before 2015, judging from the runtime of the check step compared to some of the other slow buildfarm machines. (I wonder whether there's an easy way to skip the installcheck step, as that's going to require a much longer run than it can possibly be worth too.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] option -T in pg_basebackup doesn't work on windows
From: "Amit Kapila" During my recent work on pg_basebackup, I noticed that -T option doesn't seem to work on Windows. The reason for the same is that while updating symlinks it doesn't consider that on Windows, junction points can be directories due to which it is not able to update the symlink location. Fix is to make the code work like symlink removal code in destroy_tablespace_directories. Attached patch fixes problem. I could reproduce the problem on my Windows machine. The code change appears correct, but the patch application failed against the latest source code. I don't know why. Could you confirm this? patching file src/bin/pg_basebackup/pg_basebackup.c Hunk #1 FAILED at 1119. 1 out of 1 hunk FAILED -- saving rejects to file src/bin/pg_basebackup/pg_basebackup.c.rej On the following line, I think %d must be %u, because Oid is an unsigned integer. char*linkloc = psprintf("%s/pg_tblspc/%d", basedir, oid); Regards MauMau -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
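For clarity, the format-string point above amounts to a one-character change; a sketch of the corrected line (context from pg_basebackup.c abbreviated):

/* Oid is an unsigned 32-bit type, so %u is the correct conversion;
 * %d can misprint OIDs above INT32_MAX. */
char       *linkloc = psprintf("%s/pg_tblspc/%u", basedir, oid);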
Re: [HACKERS] strncpy is not a safe version of strcpy
Tom Lane wrote: > Kevin Grittner writes: > >> I am concerned that failure to check for truncation could allow >> deletion of unexpected files or directories. > > I believe that we deal with this by the expedient of checking the > lengths of tablespace paths in advance, when the tablespace is > created. As long as it is covered. I would point out that when strlcpy is used, it returns a size_t which can be compared directly to one of the arguments passed in (in this case MAXPGPATH) to detect whether the name was truncated, at the cost of a single integer compare (probably in registers). No additional scan of the data is needed. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
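The truncation check Kevin describes is a one-line test on strlcpy's return value; a minimal sketch:

char        path[MAXPGPATH];

/*
 * strlcpy() returns the length of the string it tried to create, so a
 * result >= the buffer size means the copy was truncated.
 */
if (strlcpy(path, name, MAXPGPATH) >= MAXPGPATH)
{
    /* name was truncated; handle it instead of silently using path */
}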
Re: [HACKERS] strncpy is not a safe version of strcpy
Kevin Grittner writes: > I am concerned that failure to check for truncation could allow > deletion of unexpected files or directories. I believe that we deal with this by the expedient of checking the lengths of tablespace paths in advance, when the tablespace is created. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strncpy is not a safe version of strcpy
On 08/13/2014 04:31 PM, Kevin Grittner wrote: David Rowley wrote: I had a quick look at the usages of strncpy in master tonight and I've really just picked out the obviously broken ones for now. The other ones, on first look, either look safe, or require some more analysis to see what's actually done with the string. Does anyone disagree with the 2 changes in the attached? I am concerned that failure to check for truncation could allow deletion of unexpected files or directories. While this is probably not as dangerous as *executing* unexpected files, it seems potentially problematic. At the very least, a code comment explaining why calling unlink on something which is not what appears to be expected is not a problem there. Some might consider it overkill, but I tend to draw a pretty hard line on deleting or executing random files, even if the odds seem to be that the mangled name won't find a match. Granted, those problems exist now, but without checking for truncation it seems to me that we're just deleting *different* incorrect filenames, not really fixing the problem. strlcpy is clearly better than strncpy here, but I wonder if we should have yet another string copying function that throws an error instead of truncating, if the buffer is too small. What you really want in these cases is a "path too long" error. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
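A minimal sketch of the error-throwing copy function Heikki suggests; the function name and error message are hypothetical, since nothing like this exists in core at this point:

/*
 * Hypothetical strlcpy wrapper that raises an error instead of silently
 * truncating.  Sketch only; not part of any committed patch.
 */
static void
strlcpy_or_error(char *dst, const char *src, size_t dstsize)
{
    if (strlcpy(dst, src, dstsize) >= dstsize)
        ereport(ERROR,
                (errcode(ERRCODE_NAME_TOO_LONG),
                 errmsg("path too long: \"%s\"", src)));
}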
Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries
2014-08-13 15:22 GMT+02:00 MauMau : > From: "Pavel Stehule" > >> 2014-08-13 13:59 GMT+02:00 MauMau : >> >>> Are you concerned about the impactof collection overhead on the queries >>> >>> diagnosed? Maybe not light, but I'm optimistic. Oracle has the track >>> record of long use, and MySQL provides performance schema starting from >>> 5.6. >>> >> >> >> partially, I afraid about total performance (about impact on IO) - when we >> use a usual tables, then any analyses without indexes are slow, so you >> need >> a indexes, and we cannot deferred index update. You should thinking about >> retention policy - and without partitioning you got massive deletes. So I >> cannot to imagine a usage of table based solution together with some >> higher >> load. Our MVCC storage is not practical for storing only inserted data, >> and >> some custom storage has no indexes - so this design is relative big >> project. >> >> I prefer a possibility to read log via SQL (maybe some FDW) than use >> tables >> for storing log. These tables can be relative very large in few days - and >> we cannot to write specialized engine like MySQL simply. >> > > I didn't mean performance statistics data to be stored in database tables. > I just meant: > > * pg_stat_system_events is a view to show data on memory, which returns > one row for each event across the system. This is similar to > V$SYSTEM_EVENT in Oracle. > > * pg_stat_session_events is a view to show data on memory, which returns > one row for each event on one session. This is similar to V$SESSION_EVENT > in Oracle. > > * The above views represent the current accumulated data like other > pg_stat_xxx views. > > * EXPLAIN ANALYZE and auto_explain shows all events for one query. The > lock waits you are trying to record in the server log is one of the events. > I am little bit sceptic about only memory based structure. Is it this concept acceptable for commiters? Pavel > > Regards > MauMau > >
Re: [HACKERS] ALTER TABLESPACE MOVE command tag tweak
Stephen Frost wrote: > * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: > > Stephen Frost wrote: > > > Alright, sounds like this is more-or-less the consensus. I'll see about > > > making it happen shortly. > > > > Were you able to work on this? > > Apologies, I've been gone more-or-less all of July. I'm back now and > have time to spend on this. Ping? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strncpy is not a safe version of strcpy
David Rowley wrote: > I had a quick look at the usages of strncpy in master tonight and > I've really just picked out the obviously broken ones for now. > The other ones, on first look, either look safe, or require some > more analysis to see what's actually done with the string. > > Does anyone disagree with the 2 changes in the attached? I am concerned that failure to check for truncation could allow deletion of unexpected files or directories. While this is probably not as dangerous as *executing* unexpected files, it seems potentially problematic. At the very least, a code comment explaining why calling unlink on something which is not what appears to be expected is not a problem there. Some might consider it overkill, but I tend to draw a pretty hard line on deleting or executing random files, even if the odds seem to be that the mangled name won't find a match. Granted, those problems exist now, but without checking for truncation it seems to me that we're just deleting *different* incorrect filenames, not really fixing the problem. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries
From: "Pavel Stehule" 2014-08-13 13:59 GMT+02:00 MauMau : Are you concerned about the impactof collection overhead on the queries diagnosed? Maybe not light, but I'm optimistic. Oracle has the track record of long use, and MySQL provides performance schema starting from 5.6. partially, I afraid about total performance (about impact on IO) - when we use a usual tables, then any analyses without indexes are slow, so you need a indexes, and we cannot deferred index update. You should thinking about retention policy - and without partitioning you got massive deletes. So I cannot to imagine a usage of table based solution together with some higher load. Our MVCC storage is not practical for storing only inserted data, and some custom storage has no indexes - so this design is relative big project. I prefer a possibility to read log via SQL (maybe some FDW) than use tables for storing log. These tables can be relative very large in few days - and we cannot to write specialized engine like MySQL simply. I didn't mean performance statistics data to be stored in database tables. I just meant: * pg_stat_system_events is a view to show data on memory, which returns one row for each event across the system. This is similar to V$SYSTEM_EVENT in Oracle. * pg_stat_session_events is a view to show data on memory, which returns one row for each event on one session. This is similar to V$SESSION_EVENT in Oracle. * The above views represent the current accumulated data like other pg_stat_xxx views. * EXPLAIN ANALYZE and auto_explain shows all events for one query. The lock waits you are trying to record in the server log is one of the events. Regards MauMau -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 08/13/2014 02:07 PM, Andres Freund wrote: On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote: On 08/13/2014 01:04 AM, Andres Freund wrote: * The patch mixes the API changes around WAL records with changes of how individual actions are logged. That makes it rather hard to review - and it's a 500kb patch already. I realize it's hard to avoid because the new API changes which information about blocks is available, but it does make it really hard to see whether the old/new code is doing something equivalent. It's hard to find more critical code than this :/ Yeah, I hear you. I considered doing this piecemeal, just adding the new functions first so that you could still use the old XLogRecData API, until all the functions have been converted. But in the end, I figured it's not worth it, as sooner or later we'd want to convert all the functions anyway. I think it might be worthwile anyway. I'd be very surprised if there aren't several significant bugs in the conversion. Your full page checking tool surely helps to reduce the number, but it's not foolproof. I can understand not wanting to do it though, it's a significant amount of work. Would you ask somebody else to do it in two steps? Hmm, thinking about this some more, there is one sensible way to split this patch: We can add the XLogReplayBuffer() function and rewrite all the redo routines to use it, without changing any WAL record formats or anything in the way the WAL records are constructed. In the patch, XLogReplayBuffer() takes one input arument, the block reference ID, and it fetches the RelFileNode and BlockNumber of the block based on that. Without the WAL format changes, the information isn't there in the record, but we can require the callers to pass the RelFileNode and BlockNumber. The final patch will remove those arguments from every caller, but that's a very mechanical change. As in the attached patch. I only modified the heapam redo routines to use the new XLogReplayBuffer() idiom; the idea is to do that for every redo routine. After applying such a patch, the main WAL format changing patch becomes much smaller, and makes it easier to see from the redo routines where significant changes to the WAL record formats have been made. This also allows us to split the bikeshedding; we can discuss the name of XLogReplayBuffer() first :-). - Heikki commit 1a770baa3a3f293e8c592f0419d279b7b8bf7b66 Author: Heikki Linnakangas Date: Wed Aug 13 15:39:08 2014 +0300 Refactor heapam.c redo routines to use XLogReplayBuffer diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index d731f98..bf863af 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -7137,15 +7137,13 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record) { xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record); Buffer buffer; - Page page; - OffsetNumber *end; - OffsetNumber *redirected; - OffsetNumber *nowdead; - OffsetNumber *nowunused; - int nredirected; - int ndead; - int nunused; - Size freespace; + Size freespace = 0; + RelFileNode rnode; + BlockNumber blkno; + XLogReplayResult rc; + + rnode = xlrec->node; + blkno = xlrec->block; /* * We're about to remove tuples. In Hot Standby mode, ensure that there's @@ -7156,65 +7154,62 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record) * latestRemovedXid is invalid, skip conflict processing. 
*/ if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid)) - ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, - xlrec->node); + ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode); /* * If we have a full-page image, restore it (using a cleanup lock) and * we're done. */ - if (record->xl_info & XLR_BKP_BLOCK(0)) - { - (void) RestoreBackupBlock(lsn, record, 0, true, false); - return; - } + rc = XLogReplayBufferExtended(0, rnode, MAIN_FORKNUM, blkno, + RBM_NORMAL, true, &buffer); + if (rc == BLK_NEEDS_REDO) + { + Page page = (Page) BufferGetPage(buffer); + OffsetNumber *end; + OffsetNumber *redirected; + OffsetNumber *nowdead; + OffsetNumber *nowunused; + int nredirected; + int ndead; + int nunused; + + nredirected = xlrec->nredirected; + ndead = xlrec->ndead; + end = (OffsetNumber *) ((char *) xlrec + record->xl_len); + redirected = (OffsetNumber *) ((char *) xlrec + SizeOfHeapClean); + nowdead = redirected + (nredirected * 2); + nowunused = nowdead + ndead; + nunused = (end - nowunused); + Assert(nunused >= 0); + + /* Update all item pointers per the record, and repair fragmentation */ + heap_page_prune_execute(buffer, +redirected, nredirected, +nowdead, ndead, +nowunused, nunused); + + freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */ - buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL); - if (!BufferIsValid(buffer
Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries
2014-08-13 13:59 GMT+02:00 MauMau : > From: "Pavel Stehule" > > isn't it too heavy? >> > > Are you concerned about the impactof collection overhead on the queries > diagnosed? Maybe not light, but I'm optimistic. Oracle has the track > record of long use, and MySQL provides performance schema starting from 5.6. partially, I afraid about total performance (about impact on IO) - when we use a usual tables, then any analyses without indexes are slow, so you need a indexes, and we cannot deferred index update. You should thinking about retention policy - and without partitioning you got massive deletes. So I cannot to imagine a usage of table based solution together with some higher load. Our MVCC storage is not practical for storing only inserted data, and some custom storage has no indexes - so this design is relative big project. > > > I have just terrible negative experience with Vertica, where this design >> is >> used - almost all information about queries are available, but any query >> to >> related tables are terrible slow, so I am inclined to more simple design >> oriented to log based solution. Table based solutions is not practical >> when >> you exec billions queries per day. I understand to motivation, but I >> afraid >> so it can be very expensive and slow on highly load servers. >> > > Which do you mean by "query related to tables", the queries from > applications being diagnosed, or the queries that diagnose the performance > using statistics views? > > Could you elaborate on your experience with Vertica? That trouble may be > just because Vertica's implementation is not refined. > > sure - Vertica is not mature database. More it has only one storage type optimized for OLAP, what is wrong for long catalog, and for working with performance events too. > I understand the feeling of being inclined to log based solution for its > implementation simplicity. However, the server log is difficult or > impossible to access using SQL queries. This prevents the development of > performance diagnostics functionality in GUI administration tools. Also, > statistics views allow for easy access on PAAS like Amazon RDS and Heroku. > I prefer a possibility to read log via SQL (maybe some FDW) than use tables for storing log. These tables can be relative very large in few days - and we cannot to write specialized engine like MySQL simply. Pavel > > Regards > MauMau > >
Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries
From: "Pavel Stehule" isn't it too heavy? Are you concerned about the impactof collection overhead on the queries diagnosed? Maybe not light, but I'm optimistic. Oracle has the track record of long use, and MySQL provides performance schema starting from 5.6. I have just terrible negative experience with Vertica, where this design is used - almost all information about queries are available, but any query to related tables are terrible slow, so I am inclined to more simple design oriented to log based solution. Table based solutions is not practical when you exec billions queries per day. I understand to motivation, but I afraid so it can be very expensive and slow on highly load servers. Which do you mean by "query related to tables", the queries from applications being diagnosed, or the queries that diagnose the performance using statistics views? Could you elaborate on your experience with Vertica? That trouble may be just because Vertica's implementation is not refined. I understand the feeling of being inclined to log based solution for its implementation simplicity. However, the server log is difficult or impossible to access using SQL queries. This prevents the development of performance diagnostics functionality in GUI administration tools. Also, statistics views allow for easy access on PAAS like Amazon RDS and Heroku. Regards MauMau -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote: > On 08/13/2014 01:04 AM, Andres Freund wrote: > >* I'm distinctly not a fan of passing around both the LSN and the struct > > XLogRecord to functions. We should bit the bullet and use something > > encapsulating both. > > You mean, the redo functions? Seems fine to me, and always it's been like > that... Meh. By that argument we could just not do this patch :) > >* XLogReplayBuffer imo isn't a very good name - the buffer isn't > > replayed. If anything it's sometimes replaced by the backup block, but > > the real replay still happens outside of that function. > > I agree, but haven't come up with a better name. XLogRestoreBuffer(), XLogRecoverBuffer(), XLogReadBufferForReplay(), ...? > >* Why do we need XLogBeginInsert(). Right now this leads to uglyness > > like duplicated if (RelationNeedsWAL()) wal checks and > > XLogCancelInsert() of inserts that haven't been started. > > I like clearly marking the beginning of the insert, with XLogBeginInsert(). > We certainly could design it so that it's not needed, and you could just > start registering stuff without calling XLogBeginInsert() first. But I > believe it's more robust this way. For example, you will get an error at the > next XLogBeginInsert(), if a previous operation had already registered some > data without calling XLogInsert(). I'm coming around to that view - especially as accidentally nesting xlog insertions is a real concern with the new API. > >And if we have Begin, why do we also need Cancel? > > We could just automatically "cancel" any previous insertion when > XLogBeginInsert() is called, but again it seems like bad form to leave > references to buffers and data just lying there, if an operation bails out > after registering some stuff and doesn't finish the insertion. > > >* "XXX: do we need to do this for every page?" - yes, and it's every > > tuple, not every page... And It needs to be done *after* the tuple has > > been put on the page, not before. Otherwise the location will be wrong. > > That comment is in heap_multi_insert(). All the inserted tuples have the > same command id, seems redundant to log multiple NEW_CID records for them. Yea, I know it's in heap_multi_insert(). They tuples have the same cid, but they don't have the same tid. Or well, wouldn't if log_heap_new_cid() were called after the PutHeapTuple(). Note that this won't trigger for any normal user defined relations. > >* The patch mixes the API changes around WAL records with changes of how > > individual actions are logged. That makes it rather hard to review - > > and it's a 500kb patch already. > > > > I realize it's hard to avoid because the new API changes which > > information about blocks is available, but it does make it really hard > > to see whether the old/new code is doing something > > equivalent. It's hard to find more critical code than this :/ > > Yeah, I hear you. I considered doing this piecemeal, just adding the new > functions first so that you could still use the old XLogRecData API, until > all the functions have been converted. But in the end, I figured it's not > worth it, as sooner or later we'd want to convert all the functions anyway. I think it might be worthwile anyway. I'd be very surprised if there aren't several significant bugs in the conversion. Your full page checking tool surely helps to reduce the number, but it's not foolproof. I can understand not wanting to do it though, it's a significant amount of work. Would you ask somebody else to do it in two steps? 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
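For readers following the API discussion, the caller-side insertion sequence being debated looks roughly as follows. This sketch is modeled on the XLogSaveBufferForHint() fragment quoted elsewhere in this thread; the function names are those of the patch under review and may still change, as the naming discussion above shows.

/*
 * Sketch of the new-style WAL insertion sequence from the patch.
 * buffer and buffer_std are assumed to be supplied by the caller.
 */
int         flags = REGBUF_FORCE_IMAGE;
XLogRecPtr  recptr;

XLogBeginInsert();              /* errors out at the next insertion if a
                                 * previous one was started but never
                                 * finished */
if (buffer_std)
    flags |= REGBUF_STANDARD;
XLogRegisterBuffer(0, buffer, flags);

recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);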
Re: [HACKERS] Scaling shared buffer eviction
On 2014-08-13 09:51:58 +0530, Amit Kapila wrote: > Overall, the main changes required in patch as per above feedback > are: > 1. add an additional counter for the number of those > allocations not satisfied from the free list, with a > name like buffers_alloc_clocksweep. > 2. Autotune the low and high threshold values for buffers > in freelist. In the patch, I have kept them as hard-coded > values. > 3. For populating freelist, have a separate process (bgreclaim) > instead of doing it by bgwriter. I'm not convinced that 3) is the right way to go to be honest. Seems like a huge bandaid to me. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Enable WAL archiving even in standby
Hi, I'd propose the attached WIP patch which allows us to enable WAL archiving even in standby. The patch adds "always" as the valid value of archive_mode. If it's set to "always", the archiver is started when the server is in standby mode and all the WAL files that walreceiver wrote to the disk are archived by using archive_command. Then, even after the server is promoted to master, the archiver keeps archiving WAL files. The patch doesn't change the meanings of the setting values "on" and "off" of archive_mode. I think that this feature is useful for the case, e.g., where large database needs to be replicated between remote servers. Imagine the situation where the replicated database gets corrupted completely in the remote standby. How should we address this problematic situation and restart the standby? One approach is to take a fresh backup from the master and restore it onto the standby. But since the database is large and there is long distance between two servers, this approach might take a surprisingly long time. Another approach is to restore the backup which was taken from the standby before. But most of many WAL files which the backup needs might exist only in the master (because WAL archiving cannot be enabled in the standby) and they need to be transfered from the master to the standby via long-distance network. So I think that this approach also would take a fairly long time. To shorten that time, you may think that archive_command in the master can be set so that it transfers WAL files from the master to the standby's archival storage. I agree that this setting can accelerate the database restore process. But this causes every WAL files to be transfered between remote servers twice (one is by streaming replication, another is by archive_command), and which is a waste of network bandwidth. Enabling WAL archiving in standby is one of solutions for this situation. We can expect that most of WAL files that are required for the backup taken from the standby would exist in the standby's archival storage. Back to the patch. If archive_mode is set to "always", archive_command is always used to archive WAL files even during recovery. Do we need to separate the command into two for master and standby, respectively? We can add something like standby_archive_command parameter which is used to archive only WAL files walreceiver writes. The other WAL files are archived by archive_command. I'm not sure if it's really worth separating the command that way. Is there any use case? The patch doesn't allow us to enable WAL archiving *only* during recovery. Should we support yet another archive_mode like "standby" which allows the archiver to be running only during recovery, but makes it end just after the server is promoted to master? I'm not sure if there is really use case for that. I've not included the update of document in the patch yet. If we agree to support this feature, I will do the remaining work. Regards, -- Fujii Masao *** a/src/backend/access/transam/xlog.c --- b/src/backend/access/transam/xlog.c *** *** 81,87 int CheckPointSegments = 3; int wal_keep_segments = 0; int XLOGbuffers = -1; int XLogArchiveTimeout = 0; ! bool XLogArchiveMode = false; char *XLogArchiveCommand = NULL; bool EnableHotStandby = false; bool fullPageWrites = true; --- 81,87 int wal_keep_segments = 0; int XLOGbuffers = -1; int XLogArchiveTimeout = 0; ! 
int XLogArchiveMode = ARCHIVE_MODE_OFF; char *XLogArchiveCommand = NULL; bool EnableHotStandby = false; bool fullPageWrites = true; *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *** *** 89,94 --- 89,95 #include "access/transam.h" #include "access/xlog.h" + #include "access/xlogarchive.h" #include "bootstrap/bootstrap.h" #include "catalog/pg_control.h" #include "lib/ilist.h" *** *** 823,831 PostmasterMain(int argc, char *argv[]) write_stderr("%s: max_wal_senders must be less than max_connections\n", progname); ExitPostmaster(1); } ! if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL) ereport(ERROR, ! (errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\", \"hot_standby\", or \"logical\""))); if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL) ereport(ERROR, (errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\""))); --- 824,832 write_stderr("%s: max_wal_senders must be less than max_connections\n", progname); ExitPostmaster(1); } ! if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL) ereport(ERROR, ! (errmsg("WAL archival (archive_mode=on/always) requires wal_level \"archive\", \"hot_standby\", or \"logical\""))); if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL) ereport(ERROR, (errmsg("WAL streaming (max_wal_senders > 0) requires w
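The fragment above turns archive_mode from a boolean into an integer-valued enum. Conceptually, the archiver-enable check becomes something like the sketch below; ARCHIVE_MODE_OFF appears in the visible fragment, while the other enum values and the helper function are assumptions based on the description of "on" and "always".

/* Sketch of the archive_mode values implied by the patch fragment above. */
typedef enum ArchiveMode
{
    ARCHIVE_MODE_OFF,           /* archiving disabled */
    ARCHIVE_MODE_ON,            /* archive only in normal (non-recovery) mode */
    ARCHIVE_MODE_ALWAYS         /* archive in standby/recovery mode too */
} ArchiveMode;

/* Hypothetical helper: should the archiver run right now? */
static bool
should_archive_now(bool in_recovery)
{
    if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
        return true;
    return XLogArchiveMode == ARCHIVE_MODE_ON && !in_recovery;
}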
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On 11 August 2014 10:29, Amit kapila wrote, 1. I have fixed all the review comments except few, and modified patch is attached. 2. For not fixed comments, find inline reply in the mail.. >1. >+Number of parallel connections to perform the operation. This option >will enable the vacuum >+operation to run on parallel connections, at a time one table will be >operated on one connection. > >a. How about describing w.r.t asynchronous connections >instead of parallel connections? >b. It is better to have line length as lesser than 80. >c. As you are using multiple connections to achieve parallelism, >I suggest you add a line in your description indicating user should >verify max_connections parameters. Something similar to pg_dump: > >2. >+So at one time as many tables will be vacuumed parallely as number of >jobs. > >can you briefly mention about the case when number of jobs >is more than number of tables? 1 and 2 ARE FIXED in DOC. >3. >+ /* When user is giving the table list, and list is smaller then >+ * number of tables >+ */ >+ if (tbl_count && (parallel > tbl_count)) >+ parallel = tbl_count; >+ > >Again here multiline comments are wrong. > >Some other instances are as below: >a. >/* This will give the free connection slot, if no slot is free it will >* wait for atleast one slot to get free. >*/ >b. >/* if table list is not provided then we need to do vaccum for whole DB >* get the list of all tables and prpare the list >*/ >c. >/* Some of the slot are free, Process the results for slots whichever >* are free >*/ COMMENTS are FIXED >4. >src/bin/scripts/vacuumdb.c:51: indent with spaces. >+ bool analyze_only, bool freeze, PQExpBuffer sql); >src/bin/scripts/vacuumdb.c:116: indent with spaces. >+int parallel = 0; >src/bin/scripts/vacuumdb.c:198: indent with spaces. >+optind++; >src/bin/scripts/vacuumdb.c:299: space before tab in indent. >+ vacuum_one_database(dbname, full, verbose, and_analyze, > >There are lot of redundant whitespaces, check with >git diff --check and fix them. ALL are FIXED > >5. >res = executeQuery(conn, >"select relname, nspname from pg_class c, pg_namespace ns" >" where (relkind = \'r\' or relkind = \'m\')" >" and c.relnamespace = ns.oid order by relpages desc", >progname, echo); > >a. Here you need to use SQL keywords in capital letters, refer one >of the other caller of executeQuery() in vacuumdb.c >b. Why do you need this condition c.relnamespace = ns.oid in above >query? IT IS POSSIBLE THAT, TWO NAMESPACE HAVE THE SAME TABLE NAME, SO WHEN WE ARE SENDING COMMAND FROM CLIENT WE NEED TO GIVE NAMESPACE WISE BECAUSE WE NEED TO VACUUM ALL THE TABLES.. (OTHERWISE TWO TABLE WITH SAME NAME FROM DIFFERENT NAMESPACE WILL BE TREATED SAME.) >I think to get the list of required objects from pg_class, you don't >need to have a join with pg_namespace. DONE >6. >vacuum_parallel() >{ >.. >if (!tables || !tables->head) >{ >.. >tbl_count++; >} >.. >} > >a. Here why you need a separate variable (tbl_count) to verify number >asynchronous/parallel connections you want, why can't we use ntuple? >b. there is a warning in code (I have compiled it on windows) as well >related to this variable. >vacuumdb.c(695): warning C4700: uninitialized local variable 'tbl_count' used Variable REMOVED > >7. >Fix for one of my previous comment is as below: >GetQueryResult() >{ >.. >if (!r && !completedb) >.. >} > >Here I think some generic errors like connection broken or others >will also get ignored. Is it possible that we can ignore particular >error which we want to ignore without complicating the code? 
> >Also in anycase add comments to explain why you are ignoring >error for particular case. Here we are getting message string, I think if we need to find the error code then we need to parse the string, and after that we need to compare with error codes. Is there any other way to do this ? Comments are added >8. >+ fprintf(stderr, _("%s: Number of parallel \"jobs\" should be at least 1\n"), >+ progname); >formatting of 2nd line progname is not as per standard (you can refer other >fprintf in the same file). DONE >9. + int parallel = 0; >I think it is better to name it as numAsyncCons or something similar. CHANGED as PER SUGGESTION >10. It is better if you can add function header for newly added >functions. ADDED >> >> Test2: (as per your scenario, where actual vacuum time is very less) >> >> Vacuum done for complete DB >> >> 8 tables all with 1 records and few dead tuples > >I think this test is missing in attached file. Few means? >Can you try with 0.1%, 1% of dead tuples in table and try to >take time in milliseconds if it is taking less time to complete >the test. TESTED with 1%, 0.1% and 0.01 % and results are as follows 1. With 1% (file test1%.sql) Base Code : 22.26 s 2 Threads : 12.82 s 4 Threads : 9.
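For reviewers not familiar with the pattern the patch relies on, the "wait for a free connection slot" logic is essentially a select() over the sockets of the busy libpq connections. The sketch below is a simplification under assumed data structures (the conns/busy arrays are hypothetical); the libpq calls themselves (PQsocket, PQconsumeInput, PQisBusy, PQgetResult) are the standard asynchronous-query API.

/*
 * Simplified sketch: block until at least one of several asynchronous
 * connections has finished its command, then return its slot index.
 * Error handling is abbreviated.
 */
static int
wait_for_free_slot(PGconn **conns, bool *busy, int nconns)
{
    for (;;)
    {
        fd_set      input_mask;
        int         maxfd = -1;
        int         i;

        FD_ZERO(&input_mask);
        for (i = 0; i < nconns; i++)
        {
            if (!busy[i])
                return i;       /* a slot is already free */
            FD_SET(PQsocket(conns[i]), &input_mask);
            if (PQsocket(conns[i]) > maxfd)
                maxfd = PQsocket(conns[i]);
        }

        if (select(maxfd + 1, &input_mask, NULL, NULL, NULL) < 0)
            return -1;          /* error; caller should bail out */

        for (i = 0; i < nconns; i++)
        {
            if (busy[i] && PQconsumeInput(conns[i]) && !PQisBusy(conns[i]))
            {
                PGresult   *res;

                /* drain results; the command on this connection is done */
                while ((res = PQgetResult(conns[i])) != NULL)
                    PQclear(res);
                busy[i] = false;
                return i;
            }
        }
    }
}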
Re: [HACKERS] 9.5: Memory-bounded HashAgg
On 13 Srpen 2014, 7:02, Jeff Davis wrote: > On Tue, 2014-08-12 at 14:58 +0200, Tomas Vondra wrote: >> CREATE AGGREGATE myaggregate ( >> ... >> SERIALIZE_FUNC = 'dump_data', >> DESERIALIZE_FUNC = 'read_data', >> ... >> ); > > Seems reasonable. > >> I don't see why it should get messy? In the end, you have a chunk of >> data and a hash for it. > > Perhaps it's fine; I'd have to see the approach. > >> It just means you need to walk through the hash table, look at the >> hashes and dump ~50% of the groups to a file. > > If you have fixed-size states, why would you *want* to remove the group? > What is gained? You're right that for your batching algorithm (based on lookups), that's not really needed, and keeping everything in memory is a good initial approach. My understanding of the batching algorithm (and I may be wrong on this one) is that once you choose the number of batches, it's pretty much fixed. Is that the case? But what will happen in case of significant cardinality underestimate? I.e. what will happen if you decide to use 16 batches, and then find out 256 would be more appropriate? I believe you'll end up with batches 16x the size you'd want, most likely exceeding work_mem. Do I understand that correctly? But back to the removal of aggregate states from memory (irrespectedly of the size) - this is what makes the hashjoin-style batching possible, because it: (a) makes the batching decision simple (peeking at hash) (b) makes it possible to repeatedly increase the number of batches (c) provides a simple trigger for the increase of batch count Some of this might be achievable even with keeping the states in memory. I mean, you can add more batches on the fly, and handle this similarly to hash join, while reading tuples from the batch (moving the tuples to the proper batch, if needed). The problem is that once you have the memory full, there's no trigger to alert you that you should increase the number of batches again. > One thing I like about my simple approach is that it returns a good > number of groups after each pass, and then those are completely finished > (returned to the operator above, even). That's impossible with HashJoin > because the hashing all needs to be done before the probe phase begins. The hash-join approach returns ~1/N groups after each pass, so I fail to see how this is better? > The weakness of my approach is the array_agg case that you mention, > because this approach doesn't offer a way to dump out transition states. > It seems like that could be added later, but let me know if you see a > problem there. Right. Let's not solve this in the first version of the patch. >> I think you're missing the point, here. You need to compute the hash in >> both cases. And then you either can do a lookup or just peek at the >> first >> few bits of the hash to see whether it's in the current batch or not. > > I understood that. The point I was trying to make (which might or might > not be true) was that: (a) this only matters for a failed lookup, > because a successful lookup would just go in the hash table anyway; and > (b) a failed lookup probably doesn't cost much compared to all of the > other things that need to happen along that path. OK. I don't have numbers proving otherwise at hand, and you're probably right that once the batching kicks in, the other parts are likely more expensive than this. > I should have chosen a better example though. 
For instance: if the > lookup fails, we need to write the tuple, and writing the tuple is sure > to swamp the cost of a failed hash lookup. > >> is much faster than a lookup. Also, as the hash table grows (beyond L3 >> cache size, which is a few MBs today), it becomes much slower in my >> experience - that's one of the lessons I learnt while hacking on the >> hashjoin. And we're dealing with hashagg not fitting into work_mem, so >> this seems to be relevant. > > Could be, but this is also the path that goes to disk, so I'm not sure > how significant it is. It may or may not go to the disk, actually. The fact that you're doing batching means it's written to a temporary file, but with large amounts of RAM it may not get written to disk. That's because the work_mem is only a very soft guarantee - a query may use multiple work_mem buffers in a perfectly legal way. So the users ten to set this rather conservatively. For example we have >256GB of RAM in each machine, usually <24 queries running at the same time and yet we have only work_mem=800MB. On the few occasions when a hash join is batched, it usually remains in page cache and never actually gets writte to disk. Or maybe it gets written, but it's still in the page cache so the backend never notices that. It's true there are other costs though - I/O calls, etc. So it's not free. > >> > Because I suspect there are costs in having an extra file around that >> > I'm not accounting for directly. We are implicitly assuming that the >> OS >> > will keep around e
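The "peek at the first few bits of the hash" step referred to above is the same batch selection hashjoin uses; a sketch of the computation (mirroring the scheme in nodeHash.c, with illustrative parameter names):

/*
 * Pick a batch from a group's hash value.  nbatch must be a power of
 * two; doubling nbatch leaves each group either in its current batch or
 * in the new "partner" batch, which is what makes repeatedly increasing
 * the batch count cheap.
 */
static inline int
batch_for_hash(uint32 hashvalue, int log2_nbuckets, int nbatch)
{
    if (nbatch == 1)
        return 0;
    /* use bits above those already consumed for the bucket number */
    return (hashvalue >> log2_nbuckets) & (nbatch - 1);
}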
Re: [HACKERS] strncpy is not a safe version of strcpy
On Wed, Aug 13, 2014 at 3:19 PM, Noah Misch wrote: > On Sat, Nov 16, 2013 at 12:53:10PM +1300, David Rowley wrote: > > I went on a bit of a strncpy cleanup rampage this morning and ended up > > finding quite a few places where strncpy is used wrongly. > > I'm not quite sure if I have got them all in this patch, but I' think > I've > > got the obvious ones at least. > > > > For the hash_search in jsconfuncs.c after thinking about it a bit more... > > Can we not just pass the attname without making a copy of it? I see > keyPtr > > in hash_search is const void * so it shouldn't get modified in there. I > > can't quite see the reason for making the copy. > > +1 for the goal of this patch. Another commit took care of your > jsonfuncs.c > concerns, and the patch for CVE-2014-0065 fixed several of the others. > Plenty > remain, though. > > Thanks for taking interest in this. I had a quick look at the usages of strncpy in master tonight and I've really just picked out the obviously broken ones for now. The other ones, on first look, either look safe, or require some more analysis to see what's actually done with the string. I think this is likely best tackled in small increments anyway. Does anyone disagree with the 2 changes in the attached? Regards David Rowley strncpy_fixes_pass1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries
2014-08-13 11:14 GMT+02:00 MauMau : > From: "Pavel Stehule" > > There are two relative independent tasks >> >> a) monitor and show total lock time of living queries >> >> b) monitor and log total lock time of executed queries. >> >> I am interested by @b now. When we work with slow query log, then we would >> to identify reason for long duration. Locks are important source of these >> queries on some systems. >> > > I'm interested in b, too. I was thinking of proposing a performance > diagnostics feature like Oracle's wait events (V$SYSTEM_EVENT and > V$SESSION_EVENT). So, if you do this, I'd like to contribute to the > functional design, code and doc review, and testing. > isn't it too heavy? I have just terrible negative experience with Vertica, where this design is used - almost all information about queries are available, but any query to related tables are terrible slow, so I am inclined to more simple design oriented to log based solution. Table based solutions is not practical when you exec billions queries per day. I understand to motivation, but I afraid so it can be very expensive and slow on highly load servers. > > The point is to collect as much information about bottlenecks as possible, > including lock waits. The rough sketch is: > > What info to collect: > * heavyweight lock waits shown by pg_locks > * lightweight lock waits > * latch waits > * socket waits (mainly for client input) > > > How the info is delivered: > * pg_stat_system_events shows the accumulated total accross the server > instance > * pg_stat_session_events shows the accumulated total for each session > * EXPLAIN ANALYZE and auto_explain shows the accumulated total for each > query > > We need to describe in the manual how to diagnose and tne the system with > these event info. > > Regards > MauMau > >
Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries
From: "Pavel Stehule" There are two relative independent tasks a) monitor and show total lock time of living queries b) monitor and log total lock time of executed queries. I am interested by @b now. When we work with slow query log, then we would to identify reason for long duration. Locks are important source of these queries on some systems. I'm interested in b, too. I was thinking of proposing a performance diagnostics feature like Oracle's wait events (V$SYSTEM_EVENT and V$SESSION_EVENT). So, if you do this, I'd like to contribute to the functional design, code and doc review, and testing. The point is to collect as much information about bottlenecks as possible, including lock waits. The rough sketch is: What info to collect: * heavyweight lock waits shown by pg_locks * lightweight lock waits * latch waits * socket waits (mainly for client input) How the info is delivered: * pg_stat_system_events shows the accumulated total accross the server instance * pg_stat_session_events shows the accumulated total for each session * EXPLAIN ANALYZE and auto_explain shows the accumulated total for each query We need to describe in the manual how to diagnose and tne the system with these event info. Regards MauMau -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback
> I don't think that it's a good idea to control that behavior by using > --status-interval. I'm sure that there are some users who want both that > behavior and to set the maximum interval between feedback messages sent > back to the server, because these settings are available in walreceiver. > But your version of --status-interval doesn't allow those settings at > all. That is, if set to -1, the maximum interval between each feedback > cannot be set. OTOH, if set to a positive number, > feedback-as-soon-as-fsync behavior cannot be enabled. Therefore, I'm > thinking that it's better to introduce a new separate option for that > behavior. Thanks for the review! As you suggested, the behavior is now controlled by a separate option. If the -r option is specified, a status packet is sent to the server as soon as the WAL is fsynced. Otherwise, status packets are sent at the interval given by --status-interval. Regards, -- Furuya Osamu pg_receivexlog-fsync-feedback-v2.patch Description: pg_receivexlog-fsync-feedback-v2.patch -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Wed, Aug 13, 2014 at 4:49 PM, Michael Paquier wrote: > Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c, > it is not weird. This approach has the merit to make XLogRecData > completely bundled with the insertion and construction of the WAL > records. Then for the name xloginsert.c is fine for me too. At the same time, renaming XLogInsert to XLogCompleteInsert or XLogFinishInsert would be nice to distinguish it from XLogBeginInsert. This could include renaming XLogCancel to XLogCancelInsert. Appending the prefix XLogInsert* to those functions would make things more consistent as well. But feel free to discard those ideas if you do not like that. Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Aug 12, 2014 at 6:44 PM, Heikki Linnakangas wrote: > On 08/05/2014 03:50 PM, Michael Paquier wrote: >> >> - When testing pg_xlogdump I found an assertion failure for record >> XLOG_GIN_INSERT: >> Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) != >> ((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0, function >> gin_desc, file gindesc.c, line 127. > > > I could not reproduce this. Do you have a test case? Indeed. I am not seeing anything now for this record. Perhaps I was using an incorrect version of pg_xlogdump. >> - Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint(): >> + int flags = REGBUF_FORCE_IMAGE; >> + if (buffer_std) >> + flags |= REGBUF_STANDARD; >> + XLogRegisterBuffer(0, buffer, flags); >> [...] >> - recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata); >> + recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI); > > > Indeed, fixed. With that, "initdb -k" works, I had not tested the patch with > page checksums before. Thanks for the fix. Yes now this works. >> - XLogRecGetBlockRefIds needing an already-allocated array for *out is not >> user-friendly. Cannot we just do all the work inside this function? > > > I agree it's not a nice API. Returning a palloc'd array would be nicer, but > this needs to work outside the backend too (e.g. pg_xlogdump). It could > return a malloc'd array in frontend code, but it's not as nice. On the > whole, do you think that be better than the current approach? A palloc'd array returned is nicer for the user. And I would rather rewrite the API like that, having it return the array instead: int *XLogRecGetBlockRefIds(XLogRecord *record, int *num_array) Alvaro has also outlined that in the FRONTEND context pfree and palloc would work as pg_free and pg_malloc (didn't recall it before he mentioned it btw). >> - All the functions in xlogconstruct.c could be defined in a separate >> header xlogconstruct.h. What they do is rather independent from xlog.h. >> The >> same applies to all the structures and functions of xlogconstruct.c in >> xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble, >> InitXLogRecordConstruct. (worth noticing as well that the only reason >> XLogRecData is defined externally of xlogconstruct.c is that it is being >> used in XLogInsert()). > > > Hmm. I left the defines for xlogconstruct.c functions that are used to build > a WAL record in xlog.h, so that it's not necessary to include both xlog.h > and xlogconstruct.h in all files that write a WAL record (XLogInsert() is > defined in xlog.h). > > But perhaps it would be better to move the prototype for XLogInsert to > xlogconstruct.h too? That might be a better division; everything needed to > insert a WAL record would be in xlogconstruct.h, and xlog.h would left for > more internal functions related to WAL. And perhaps rename the files to > xloginsert.c and xloginsert.h. Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c, it is not weird. This approach has the merit to make XLogRecData completely bundled with the insertion and construction of the WAL records. Then for the name xloginsert.c is fine for me too. Among the things changed since v5, v6 is introducing a safety limit to the maximum number of backup blocks within a single WAL record: #define XLR_MAX_BKP_BLOCKS 256 XLogRecordBlockData is updated to use uint8 instead of char to identify the block id, and XLogEnsureRecordSpace fails if more than XLR_MAX_BKP_BLOCKS are wanted by the caller. I am fine with this change, just mentioning it. 
Regards, -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
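To make the API question concrete, the palloc-returning variant suggested above would be used like the sketch below; only the signature quoted in the mail is assumed, and in frontend code such as pg_xlogdump the palloc/pfree calls map onto pg_malloc/pg_free as noted in the thread.

/* Sketch of the suggested caller pattern for XLogRecGetBlockRefIds(). */
int         nblocks;
int        *blockids;
int         i;

blockids = XLogRecGetBlockRefIds(record, &nblocks);
for (i = 0; i < nblocks; i++)
    printf("block reference id: %d\n", blockids[i]);
pfree(blockids);                /* pg_free() in frontend code */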
Re: [HACKERS] Support for N synchronous standby servers
On Wed, Aug 13, 2014 at 2:10 PM, Fujii Masao wrote: > I sent the SIGSTOP signal to the walreceiver process in one of sync standbys, > and then ran write transactions again. In this case, they must not be > completed > because their WAL cannot be replicated to the standby that its walreceiver > was stopped. But they were successfully completed. At the end of SyncRepReleaseWaiters, SYNC_REP_WAIT_WRITE and SYNC_REP_WAIT_FLUSH in walsndctl were able to update with only one wal sender in sync, making backends wake up even if other standbys did not catch up. But we need to scan all the synchronous wal senders and find the minimum write and flush positions and update walsndctl with those values. Well that's a code path I forgot to cover. Attached is an updated patch fixing the problem you reported. Regards, -- Michael *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 2586,2597 include_dir 'conf.d' Specifies a comma-separated list of standby names that can support synchronous replication, as described in . ! At any one time there will be at most one active synchronous standby; ! transactions waiting for commit will be allowed to proceed after ! this standby server confirms receipt of their data. ! The synchronous standby will be the first standby named in this list ! that is both currently connected and streaming data in real-time ! (as shown by a state of streaming in the pg_stat_replication view). Other standby servers appearing later in this list represent potential --- 2586,2598 Specifies a comma-separated list of standby names that can support synchronous replication, as described in . ! At any one time there will be at a number of active synchronous standbys ! defined by synchronous_standby_num; transactions waiting ! for commit will be allowed to proceed after those standby servers ! confirms receipt of their data. The synchronous standbys will be ! the first entries named in this list that are both currently connected ! and streaming data in real-time (as shown by a state of ! streaming in the pg_stat_replication view). Other standby servers appearing later in this list represent potential *** *** 2627,2632 include_dir 'conf.d' --- 2628,2652 + + synchronous_standby_num (integer) + +synchronous_standby_num configuration parameter + + + + + Specifies the number of standbys that support + synchronous replication, as described in + , and listed as the first + elements of . + + + Default value is 1. + + + + vacuum_defer_cleanup_age (integer) *** a/doc/src/sgml/high-availability.sgml --- b/doc/src/sgml/high-availability.sgml *** *** 1081,1092 primary_slot_name = 'node_a_slot' WAL record is then sent to the standby. The standby sends reply messages each time a new batch of WAL data is written to disk, unless wal_receiver_status_interval is set to zero on the standby. ! If the standby is the first matching standby, as specified in ! synchronous_standby_names on the primary, the reply ! messages from that standby will be used to wake users waiting for ! confirmation that the commit record has been received. These parameters ! allow the administrator to specify which standby servers should be ! synchronous standbys. Note that the configuration of synchronous replication is mainly on the master. Named standbys must be directly connected to the master; the master knows nothing about downstream standby servers using cascaded replication. --- 1081,1092 WAL record is then sent to the standby. 
The standby sends reply messages each time a new batch of WAL data is written to disk, unless wal_receiver_status_interval is set to zero on the standby. ! If the standby is the first synchronous_standby_num matching ! standbys, as specified in synchronous_standby_names on the ! primary, the reply messages from that standby will be used to wake users ! waiting for confirmation that the commit record has been received. These ! parameters allow the administrator to specify which standby servers should ! be synchronous standbys. Note that the configuration of synchronous replication is mainly on the master. Named standbys must be directly connected to the master; the master knows nothing about downstream standby servers using cascaded replication. *** *** 1169,1177 primary_slot_name = 'node_a_slot' The best solution for avoiding data loss is to ensure you don't lose your last remaining synchronous stan
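The fix Michael describes, advancing walsndctl only to the minimum positions confirmed by all synchronous standbys, amounts to something like the following sketch. The per-walsender write/flush fields follow the existing walsender shared state; the is_sync_standby() helper is hypothetical and locking is omitted for brevity.

/*
 * Sketch: find the minimum write/flush positions across the currently
 * connected synchronous standbys.  walsndctl->lsn[SYNC_REP_WAIT_WRITE]
 * and [SYNC_REP_WAIT_FLUSH] may only be advanced to these minima.
 */
XLogRecPtr  min_write = InvalidXLogRecPtr;
XLogRecPtr  min_flush = InvalidXLogRecPtr;
int         i;

for (i = 0; i < max_wal_senders; i++)
{
    volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];

    if (!is_sync_standby(walsnd))       /* hypothetical check */
        continue;

    if (XLogRecPtrIsInvalid(min_write) || walsnd->write < min_write)
        min_write = walsnd->write;
    if (XLogRecPtrIsInvalid(min_flush) || walsnd->flush < min_flush)
        min_flush = walsnd->flush;
}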