Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
>I am sorry but I can't understand the above results due to wrapping.
>Are you saying compression was twice as slow?

CPU usage at user level (in seconds) for compression set 'on' is 562 secs, while that for compression set 'off' is 354 secs. As per the readings, compression takes a little less than double the CPU time. However, the total time taken to run 25 transactions for each scenario is as follows:

compression = 'on'  : 1838 secs
compression = 'off' : 1701 secs

The difference is around 140 secs.

Thank you,
Rahila Syed

On Wed, Dec 10, 2014 at 7:55 PM, Bruce Momjian wrote:
> On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
> > The tests ran for around 30 mins. A manual checkpoint was run before each test.
> >
> > Compression | WAL generated | %compression | Latency-avg | CPU usage (seconds)                    | TPS    | Latency stddev
> > on          | 1531.4 MB     | ~35 %        | 7.351 ms    | user diff: 562.67s system diff: 41.40s | 135.96 | 13.759 ms
> > off         | 2373.1 MB     |              | 6.781 ms    | user diff: 354.20s system diff: 39.67s | 147.40 | 14.152 ms
> >
> > The compression obtained is quite high, close to 35 %.
> > CPU usage at user level when compression is on is quite noticeably high as
> > compared to that when compression is off. But the gain in terms of reduction
> > of WAL is also high.
>
> I am sorry but I can't understand the above results due to wrapping.
> Are you saying compression was twice as slow?
>
> --
> Bruce Momjian  http://momjian.us
> EnterpriseDB   http://enterprisedb.com
>
> + Everyone has their own god. +
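The tradeoff measured above (substantially smaller WAL at roughly double the compression-side CPU) is easy to reproduce in miniature with any general-purpose compressor. A purely illustrative sketch using Python's zlib — not pglz, which is what the patch actually uses — on a page-sized block of repetitive data:

```python
import time
import zlib

# A compressible 8 kB "page", standing in for a full-page image in WAL.
page = (b"tuple-data-" * 800)[:8192]

start = time.perf_counter()
compressed = zlib.compress(page, 1)   # fastest level, in the spirit of pglz
cpu_cost = time.perf_counter() - start

saved = 100.0 * (1 - len(compressed) / len(page))
print(f"{len(page)} -> {len(compressed)} bytes ({saved:.0f}% saved)")
assert zlib.decompress(compressed) == page  # round-trips losslessly
```

The absolute numbers depend entirely on the compressor and data, but the shape of the tradeoff — bytes saved versus CPU spent — is the same one the benchmark above is quantifying.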
Re: [HACKERS] Commitfest problems
On 12/10/2014 10:35 PM, Bruce Momjian wrote:
> I have heard repeated concerns about the commitfest process in the past
> few months. The fact we have been in a continual commitfest since
> August also is concerning.
>
> I think we are reaching the limits on how much we can do with our
> existing commitfest structure, and it might be time to consider changes.
> While the commitfest process hasn't changed much and was very successful
> in the first few years, a few things have changed externally:
>
> 1. more new developers involved in contributing small patches
> 2. more full-time developers creating big patches
> 3. increased time demands on experienced Postgres developers

I will add:

4. commitfest managers have burned out and refuse to do it again

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Commitfest problems
I have heard repeated concerns about the commitfest process in the past few months. The fact we have been in a continual commitfest since August also is concerning.

I think we are reaching the limits on how much we can do with our existing commitfest structure, and it might be time to consider changes. While the commitfest process hasn't changed much and was very successful in the first few years, a few things have changed externally:

1. more new developers involved in contributing small patches
2. more full-time developers creating big patches
3. increased time demands on experienced Postgres developers

These changes are all driven by increased Postgres popularity. In an ideal world, as 1 and 2 increase, the time experienced developers have to work on commitfest items would also increase, but the opposite has happened. Many of our most experienced developers (3) hold responsible positions in their organizations which demand their time. You would think that more full-time developers (2) would help with the commitfest, but often their paid work is in a specific subsystem of Postgres, preventing them from effectively helping with other complex patches. We have always known that we can create novice developers faster than experienced ones, but it seems this difference is accelerating.

A good example is the UPSERT patch, which, while receiving extensive feedback on the user interface and syntax, has not had a full review, though it has been posted for several months.

It seems like a good time to consider options for restructuring our process.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5 release scheduling (was Re: logical column ordering)
On 12/10/2014 09:35 PM, Tom Lane wrote: > Josh Berkus writes: >> On 12/10/2014 05:14 PM, Stephen Frost wrote: >>> * Andres Freund (and...@2ndquadrant.com) wrote: But the scheduling of commits with regard to the 9.5 schedule actually opens a relevant question: When are we planning to release 9.5? Because If we try ~ one year from now it's a whole different ballgame than if we try to go back to september. And I think there's pretty good arguments for both. > >>> This should really be on its own thread for discussion... I'm leaning, >>> at the moment at least, towards the September release schedule. I agree >>> that having a later release would allow us to get more into it, but >>> there's a lot to be said for the consistency we've kept up over the past >>> few years with a September (our last non-September release was 8.4). > >> Can we please NOT discuss this in the thread about someone's patch? Thanks. > > Quite. So, here's a new thread. > > MHO is that, although 9.4 has slipped more than any of us would like, > 9.5 development launched right on time in August. So I don't see a > good reason to postpone 9.5 release just because 9.4 has slipped. > I think we should stick to the schedule agreed to in Ottawa last spring. > > Comments? If anything, it seems like a great reason to try to get 9.5 out BEFORE we open the tree for 9.6/10.0/Cloud++. While there were technical issues, 9.4 dragged a considerable amount because most people were ignoring it in favor of 9.5 development. So far, I haven't seen any features for 9.5 which would delay a more timely release the way we did for 9.4. Anybody know of a bombshell someone's going to drop on us for CF5? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Mon, Dec 8, 2014 at 8:16 PM, Peter Geoghegan wrote:
> Attached revision, v1.6, slightly tweaks the ordering of per-statement
> trigger execution.

Right now, there is no way for a before row insert/update trigger to determine whether it was called as part of an INSERT ... ON CONFLICT UPDATE or not. It's also not possible for a DO INSTEAD trigger on a view (a before row insert trigger) to determine that it was called specifically due to an INSERT ... IGNORE (which I think ought to imply that any corresponding, "redirected" insertion into a table should also use IGNORE; that's at least going to be something that a certain number of apps will need to be made robust against).

The question is: Do we want to expose this distinction to triggers? The natural way to do so would probably be to add a TG_SPECULATIVE special variable to plpgsql (and equivalent variables in other PLs). This text variable would be either "UPSERT" or "IGNORE"; it would be NULL when it was not applicable (e.g. with traditional INSERTs).

How do people feel about this? Is it important to include this in our initial cut of the feature? I thought that I'd avoid that kind of thing until prompted to address it by others, since it probably won't end up being a common concern, but I'd like to hear a few opinions.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
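For illustration, if the TG_SPECULATIVE idea were adopted, a DO INSTEAD trigger on a view might use it like this. Everything here is hypothetical: TG_SPECULATIVE is only the proposal above, the table and view names are invented, and the ON CONFLICT IGNORE spelling is the patch's proposed syntax, not anything released:

```sql
-- Hypothetical sketch: TG_SPECULATIVE is the proposed variable, not an
-- existing one; ON CONFLICT IGNORE is the syntax from the patch under review.
CREATE FUNCTION redirect_ins() RETURNS trigger AS $$
BEGIN
    IF TG_SPECULATIVE = 'IGNORE' THEN
        -- preserve IGNORE semantics when redirecting into the base table
        INSERT INTO base_tbl VALUES (NEW.*) ON CONFLICT IGNORE;
    ELSE
        INSERT INTO base_tbl VALUES (NEW.*);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER redirect INSTEAD OF INSERT ON some_view
    FOR EACH ROW EXECUTE PROCEDURE redirect_ins();
```

Without some such variable, the trigger body has no way to tell the two cases apart and must pick one behavior for both.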
Re: [HACKERS] inherit support for foreign tables
I marked this as ready for committer. On Thu, Dec 11, 2014 at 8:39 AM, Etsuro Fujita wrote: > Hi Ashutosh, > > Thanks for the review! > > (2014/12/10 14:47), Ashutosh Bapat wrote: > >> We haven't heard anything from Horiguchi-san and Hanada-san for almost a >> week. So, I am fine marking it as "ready for committer". What do you say? >> > > ISTM that both of them are not against us, so let's ask for the > committer's review! > > > Thanks, > > Best regards, > Etsuro Fujita > -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
[HACKERS] 9.5 release scheduling (was Re: logical column ordering)
Josh Berkus writes: > On 12/10/2014 05:14 PM, Stephen Frost wrote: >> * Andres Freund (and...@2ndquadrant.com) wrote: >>> But the scheduling of commits with regard to the 9.5 schedule actually >>> opens a relevant question: When are we planning to release 9.5? Because >>> If we try ~ one year from now it's a whole different ballgame than if we >>> try to go back to september. And I think there's pretty good arguments >>> for both. >> This should really be on its own thread for discussion... I'm leaning, >> at the moment at least, towards the September release schedule. I agree >> that having a later release would allow us to get more into it, but >> there's a lot to be said for the consistency we've kept up over the past >> few years with a September (our last non-September release was 8.4). > Can we please NOT discuss this in the thread about someone's patch? Thanks. Quite. So, here's a new thread. MHO is that, although 9.4 has slipped more than any of us would like, 9.5 development launched right on time in August. So I don't see a good reason to postpone 9.5 release just because 9.4 has slipped. I think we should stick to the schedule agreed to in Ottawa last spring. Comments? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] double vacuum in initdb
Peter Eisentraut writes:
> initdb currently does
>
> PG_CMD_PUTS("ANALYZE;\nVACUUM FULL;\nVACUUM FREEZE;\n");
>
> FREEZE is now part of FULL, so this seems redundant. Also, ANALYZE can
> be run as part of VACUUM. So this could be
>
> PG_CMD_PUTS("VACUUM FULL ANALYZE;\n");

Merging the ANALYZE step would save few cycles, and it'd probably result in the contents of pg_statistic being at least partly unfrozen at the end of the process, so please don't go that far.

> There has been some concerns about time spent in initdb in test suites,
> which is why I looked into this. In testing, this change can shave off
> between 10% and 20% of the run time of initdb, so it would be kind of
> useful.
>
> The last change to this was
>
> commit 66cd8150636e48a8f143560136a25ec5eb355d8c
> Author: Tom Lane
> Date: Mon Nov 29 03:05:03 2004 +
>
>     Clean up initdb's error handling so that it prints something more
>     useful than just 'failed' when there's a problem. Per gripe from
>     Chris Albertson.
>
>     In an unrelated change, use VACUUM FULL; VACUUM FREEZE; rather than
>     a single VACUUM FULL FREEZE command, to respond to my worries of a
>     couple days ago about the reliability of doing this in one go.
>
> That was a long time ago. Is that still applicable?

Probably not; what I was on about was
http://www.postgresql.org/message-id/12179.1101591...@sss.pgh.pa.us
which certainly isn't a case that exists anymore since we got rid of the old VACUUM FULL implementation. So I think we could go to

PG_CMD_PUTS("ANALYZE;\nVACUUM FULL FREEZE;\n");

without any degradation of the intended results.

Another idea would be to drop the FULL part and make this

PG_CMD_PUTS("ANALYZE;\nVACUUM FREEZE;\n");

which would speed up initdb, but it would also lose the testing angle of being sure that we can VACUUM FULL all system catalogs. OTOH, I don't immediately see why we shouldn't test that somewhere in the regression tests rather than in every initdb.
(I think part of the argument for the forced FULL was to make sure we broke anything that thought system catalogs had fixed relfilenodes, but that should be pretty well a done deal by now.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Wed, Dec 10, 2014 at 11:51 PM, Robert Haas wrote:
>
> On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila wrote:
> > Yeah and also how would user specify the values, as an example
> > assume that table is partitioned on monthly_salary, so partition
> > definition would look:
> >
> > PARTITION BY LIST(monthly_salary)
> > (
> >     PARTITION salary_less_than_thousand VALUES(300, 900),
> >     PARTITION salary_less_than_two_thousand VALUES (500, 1000, 1500),
> >     ...
> > )
> >
> > Now if user wants to define multi-column Partition based on
> > monthly_salary and annual_salary, how do we want him to
> > specify the values. Basically how to distinguish which values
> > belong to first column key and which one's belong to second
> > column key.
>
> I assume you just add some parentheses.
>
> PARTITION BY LIST (colA, colB) (PARTITION VALUES ((valA1, valB1),
> (valA2, valB2), (valA3, valB3)))
>
> Multi-column list partitioning may or may not be worth implementing,
> but the syntax is not a real problem.

Yeah, either this way or what Josh suggested upthread. The main point was that if at all we want to support multi-column list partitioning then we need a slightly different syntax; however, I feel that we can leave multi-column list partitioning out of the first version.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On partitioning
On Wed, Dec 10, 2014 at 7:52 PM, Alvaro Herrera wrote:
>
> Amit Langote wrote:
> > On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila wrote:
> > > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> > >> FWIW in my original proposal I was rejecting some things that after
> > >> further consideration turn out to be possible to allow; for instance
> > >> directly referencing individual partitions in COPY. We could allow
> > >> something like
> > >>
> > >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> > >> or maybe
> > >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT
> > >
> > > or
> > > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT
> > > COPY [TABLE] lineitems PARTITION TO STDOUT
> > >
> > > I think we should try to support operations on partitions via the main
> > > table wherever it is required.
>
> Um, I think the only difference is that you added the noise word TABLE
> which we currently don't allow in COPY,

Yeah, we could eliminate the TABLE keyword from this syntax. The reason I kept it was for easier understanding of the syntax: currently we don't have a concept of PARTITION in the COPY syntax, but if we want to introduce such a concept, then it might be better to have the TABLE keyword for the purpose of syntax clarity.

> and that you added the
> possibility of using named partitions, about which see below.
>
> > > We can also allow to explicitly name a partition
> > >
> > > COPY [TABLE] lineitems PARTITION lineitems_2001 TO STDOUT;
>
> The problem with naming partitions is that the user has to pick names
> for every partition, which is tedious and doesn't provide any
> significant benefit.
> The input I had from users of other partitioning
> systems was that they very much preferred not to name the partitions at
> all,

It seems to me both Oracle and DB2 support named partitions, so even though there are users who don't prefer named partitions, I suspect an equal or greater number of users will prefer them, both for the sake of migration and because they are already used to such a syntax.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WRITE_UINT_FIELD used where WRITE_OID_FIELD likely intended
On Thu, Dec 11, 2014 at 7:44 AM, Mark Dilger wrote: > At line 1787 of outfuncs.c, the line: > > WRITE_UINT_FIELD(reltablespace) > > should probably say > > WRITE_OID_FIELD(reltablespace) > > since that variable is of type Oid, not uint32. > Granted, these two macros are interchangeable, > but they won't be if we ever move to 64-bit Oids. > For correctness you are right. Looks like you spent quite some time looking at that.. -- Michael
[HACKERS] Too strict check when starting from a basebackup taken off a standby
Hi,

A customer recently reported getting "backup_label contains data inconsistent with control file" after taking a basebackup from a standby and starting it with a typo in primary_conninfo.

When starting postgres from a basebackup, StartupXLOG() has the following code to deal with backup labels:

if (haveBackupLabel)
{
    ControlFile->backupStartPoint = checkPoint.redo;
    ControlFile->backupEndRequired = backupEndRequired;
    if (backupFromStandby)
    {
        if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY)
            ereport(FATAL,
                    (errmsg("backup_label contains data inconsistent with control file"),
                     errhint("This means that the backup is corrupted and you will "
                             "have to use another backup for recovery.")));
        ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
    }
}

While I'm not enthusiastic about the error message, that bit of code looks sane at first glance. We certainly expect the control file to indicate we're in recovery. Since we're unlinking the backup label shortly afterwards we'd normally not expect to hit that case after a shutdown in recovery.

The problem is that after reading the backup label we also have to read the corresponding checkpoint from pg_xlog. If primary_conninfo and/or restore_command are misconfigured and can't restore files, that can only be fixed by shutting down the cluster and fixing up recovery.conf - which sets DB_SHUTDOWNED_IN_RECOVERY in the control file.

The easiest solution seems to be to simply also allow that as a state in the above check. It might be nicer to not allow a ShutdownXLOG to modify the control file et al at that stage, but I think that'd end up being more invasive.

A short search shows that that also looks like a credible explanation for #12128...

I plan to relax that check unless somebody comes up with a different & better plan.
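The proposed relaxation — accepting a clean shutdown that occurred during recovery as a second valid state — might look like this (a sketch only, assuming the check simply gains one more accepted value):

```c
/* Sketch: also accept a clean shutdown that happened during recovery */
if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY &&
    dbstate_at_startup != DB_SHUTDOWNED_IN_RECOVERY)
    ereport(FATAL,
            (errmsg("backup_label contains data inconsistent with control file"),
             errhint("This means that the backup is corrupted and you will "
                     "have to use another backup for recovery.")));
```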
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] inherit support for foreign tables
Hi Ashutosh,

Thanks for the review!

(2014/12/10 14:47), Ashutosh Bapat wrote:
> We haven't heard anything from Horiguchi-san and Hanada-san for almost a
> week. So, I am fine marking it as "ready for committer". What do you say?

ISTM that both of them are not against us, so let's ask for the committer's review!

Thanks,

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical column ordering
On 12/09/2014 09:11 PM, Tom Lane wrote: > Josh Berkus writes: >> Question on COPY, though: there's reasons why people would want COPY to >> dump in either physical or logical order. If you're doing COPY to >> create CSV files for output, then you want the columns in logical order. >> If you're doing COPY for pg_dump, then you want them in physical order >> for faster dump/reload. So we're almost certainly going to need to have >> an option for COPY. > > This is complete nonsense, Josh, or at least it is until you can provide > some solid evidence to believe that column ordering would make any > noticeable performance difference in COPY. I know of no reason to believe > that the existing user-defined-column-ordering option makes any difference. Chill, dude, chill. When we have a patch, I'll do some performance testing, and we'll see if it's something we care about or not. There's no reason to be belligerent about it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical column ordering
On 12/10/2014 05:14 PM, Stephen Frost wrote: > * Andres Freund (and...@2ndquadrant.com) wrote: >> > But the scheduling of commits with regard to the 9.5 schedule actually >> > opens a relevant question: When are we planning to release 9.5? Because >> > If we try ~ one year from now it's a whole different ballgame than if we >> > try to go back to september. And I think there's pretty good arguments >> > for both. > This should really be on its own thread for discussion... I'm leaning, > at the moment at least, towards the September release schedule. I agree > that having a later release would allow us to get more into it, but > there's a lot to be said for the consistency we've kept up over the past > few years with a September (our last non-September release was 8.4). Can we please NOT discuss this in the thread about someone's patch? Thanks. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [v9.5] Custom Plan API
On Tue, Dec 9, 2014 at 3:24 AM, Simon Riggs wrote:
> Feedback I am receiving is that the API is unusable. That could be
> because it is impenetrable, or because it is unusable. I'm not sure it
> matters which.

It would be nice to hear what someone is trying to use it for and what problems that person is encountering. Without that, it's pretty much impossible for anyone to fix anything. As for sample code, KaiGai had a working example, which of course got broken when Tom changed the API, but it didn't look to me like Tom's changes would have made anything impossible that was possible before. I'm frankly kind of astonished by the tenor of this entire conversation; there is certainly plenty of code in the backend that is less self-documenting than this is; and KaiGai did already put up a wiki page with documentation as you requested. From his response, it sounds like he has updated the ctidscan code, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Lockless StrategyGetBuffer() clock sweep
On Mon, Dec 8, 2014 at 2:51 PM, Andres Freund wrote:
> On 2014-10-30 07:55:08 -0400, Robert Haas wrote:
>> On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund wrote:
>> >> But if it is, then how about
>> >> adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
>> >> and just checking the flag instead of checking bgwriterLatch itself?
>> >
>> > Yea, that'd be nicer. I didn't do it because it made the patch slightly
>> > more invasive... The complexity really is only needed because we're not
>> > guaranteed that 64bit reads are atomic. Although we actually can be
>> > sure, because there's no platform with nonatomic intptr_t reads - so
>> > 64bit platforms actually *do* have atomic 64bit reads/writes.
>> >
>> > So if we just have a integer 'setBgwriterLatch' somewhere we're good. We
>> > don't even need to take a lock to set it. Afaics the worst that can
>> > happen is that several processes set the latch...
>>
>> OK, that design is fine with me.
>
> There's a slight problem with this, namely restarts of bgwriter. If it
> crashes, the reference to the relevant latch will currently be reset via
> StrategyNotifyBgWriter(). In reality that's not a problem because
> sizeof(void*) writes are always atomic, but currently we don't assume
> that. We'd sometimes write to an old latch, but that's harmless because
> they're never deallocated.
>
> So I can see several solutions right now:
> 1) Redefine our requirements so that aligned sizeof(void*) writes are
>    always atomic. That doesn't affect any currently supported platform
>    and won't ever affect any future platform either. E.g. linux has had
>    that requirement for a long time.
> 2) Use an explicitly defined latch for the bgwriter instead of using the
>    PGPROC->procLatch. That way it never has to be reset and all
>    SetLatch()s will eventually go to the right process.
> 3) Continue requiring the spinlock when setting the latch.

Maybe you could store the pgprocno instead of the PROC *.
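Robert's suggestion amounts to replacing the stored pointer with a stable integer index, so a restarted bgwriter's latch is always found via its PGPROC slot. A sketch (field and variable names assumed, not from any patch):

```c
/* In StrategyControl, instead of a "Latch *bgwriterLatch" pointer: */
int         bgwprocno;      /* pgprocno of bgwriter, or -1 if not running */

/*
 * Waking the bgwriter: an aligned int read is atomic on all supported
 * platforms, and PGPROC slots are never deallocated, so a stale value
 * at worst sets a latch that nobody is waiting on, which is harmless.
 */
int         procno = StrategyControl->bgwprocno;

if (procno >= 0)
    SetLatch(&ProcGlobal->allProcs[procno].procLatch);
```

This sidesteps the non-atomic-pointer-write concern without widening the platform requirements or keeping the spinlock.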
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On Mon, Dec 08, 2014 at 02:23:39AM +0100, Petr Jelinek wrote:
> On 08/12/14 00:56, Noah Misch wrote:
> >The commit_ts test suite gives me the attached diff on a 32-bit MinGW build
> >running on 64-bit Windows Server 2003. I have not checked other Windows
> >configurations; the suite does pass on GNU/Linux.
>
> Hmm I wonder if "< now()" needs to be changed to "<= now()" in those queries
> to make them work correctly on that platform, I don't have a machine with that
> environment handy right now, so I would appreciate it if you could try that; in
> case you don't have time for that, I will try to set up something later...

I will try that, though perhaps not until next week.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
On Sat, Dec 6, 2014 at 10:08 PM, Tomas Vondra wrote: > select a.i, b.i from a join b on (a.i = b.i); I think the concern is that the inner side might be something more elaborate than a plain table scan, like an aggregate or join. I might be all wet, but my impression is that you can make rescanning arbitrarily expensive if you work at it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
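A hypothetical example of such an inner side (table and column names invented): if the relation being hashed is itself an aggregate, anything that forces it to be recomputed pays for the whole aggregation, not just a heap scan:

```sql
-- The hashed side here is a GROUP BY over a large table; recomputing it
-- is far more expensive than rescanning a plain table like "b" above.
SELECT a.i, b.total
FROM a
JOIN (SELECT i, sum(j) AS total
      FROM big_table
      GROUP BY i) b ON a.i = b.i;
```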
Re: [HACKERS] double vacuum in initdb
On Wed, Dec 10, 2014 at 8:50 PM, Peter Eisentraut wrote: > In an unrelated change, use VACUUM FULL; VACUUM FREEZE; rather than > a single VACUUM FULL FREEZE command, to respond to my worries of a > couple days ago about the reliability of doing this in one go. > > That was a long time ago. Is that still applicable? Gosh, I hope not. Note that that was back when we still had old-style VACUUM FULL, which was significantly more fragile than what we've got now, I think... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] WRITE_UINT_FIELD used where WRITE_OID_FIELD likely intended
At line 1787 of outfuncs.c, the line: WRITE_UINT_FIELD(reltablespace) should probably say WRITE_OID_FIELD(reltablespace) since that variable is of type Oid, not uint32. Granted, these two macros are interchangeable, but they won't be if we ever move to 64-bit Oids. mark -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS
> "Tom" == Tom Lane writes: More comment on this later, but I want to highlight these specific points since we need clear answers here to avoid wasting unnecessary time and effort: Tom> I've not spent any real effort looking at gsp2.patch yet, but it Tom> seems even worse off comment-wise: if there's any explanation in Tom> there at all of what a "chained aggregate" is, I didn't find it. (Maybe "stacked" would have been a better term.) What that code does is produce plans that look like this: GroupAggregate -> Sort -> ChainAggregate -> Sort -> ChainAggregate in much the same way that WindowAgg nodes are generated. Where would you consider the best place to comment this? The WindowAgg equivalent seems to be discussed primarily in the header comment of nodeWindowAgg.c. Tom> I'd also counsel you to find some other way to do it than Tom> putting bool chain_head fields in Aggref nodes; There are no chain_head fields in Aggref nodes. Agg.chain_head is true for the Agg node at the top of the chain (the GroupAggregate node in the above example), while AggState.chain_head is set on the ChainAggregate nodes to point to the AggState of the GroupAggregate node. What we need to know before doing any further work on this is whether this idea of stacking up aggregate and sort nodes is a viable one. (The feedback I've had so far suggests that the performance is acceptable, even if there are still optimization opportunities that can be tackled later, like adding HashAggregate support.) Tom> I took a quick look at gsp-u.patch. It seems like that approach Tom> should work, with of course the caveat that using CUBE/ROLLUP as Tom> function names in a GROUP BY list would be problematic. I'm not Tom> convinced by the commentary in ruleutils.c suggesting that extra Tom> parentheses would help disambiguate: aren't extra parentheses Tom> still going to contain grouping specs according to the standard? 
The spec is of minimal help here since it does not allow expressions in GROUP BY at all, last I looked; only column references. The extra parens do actually disambiguate because CUBE(x) and (CUBE(x)) are not equivalent anywhere; while CUBE(x) can appear inside GROUPING SETS (...), it cannot appear inside a (...) list nested inside a GROUPING SETS list (or anywhere else). As the comments in gram.y explain, the productions used are intended to follow the spec with the exception of using a_expr where the spec requires . So CUBE and ROLLUP are recognized as special only as part of a group_by_item ( in the spec), and as soon as we see a paren that isn't part of the "GROUPING SETS (" opener, we're forced into parsing an a_expr, in which CUBE() would become a function call. (The case of upgrading from an old pg version seems to require the use of --quote-all-identifiers in pg_dump) Tom> Forcibly schema-qualifying such function names seems like a less Tom> fragile answer on that end. That I guess would require keeping more state, unless you applied it everywhere to any function with a keyword for a name? I dunno. The question that needs deciding here is less whether the approach _could_ work but whether we _want_ it. The objection has been made that we are in effect introducing a new category of "unreserved almost everywhere" keyword, which I think has a point; on the other hand, reserving CUBE is a seriously painful prospect. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
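To illustrate the disambiguation described above (table and column names hypothetical): CUBE is special only where a grouping element can appear, while inside extra parentheses the parser is committed to an a_expr, so the same token sequence would resolve to an ordinary function call:

```sql
-- CUBE as a grouping construct: groups by (a, b), (a), (b), and ()
SELECT a, b, sum(c) FROM t GROUP BY CUBE(a, b);

-- Inside extra parentheses we are parsing an a_expr, so this would
-- instead be a call to a function named cube(), if one exists
SELECT a, b, sum(c) FROM t GROUP BY (CUBE(a, b));
```

This is why the two spellings are never equivalent, and why the only genuinely problematic case is a user-defined function actually named cube or rollup.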
Re: [HACKERS] Fix typo um vacuumdb tests
On 12/10/14 4:55 PM, Fabrízio de Royes Mello wrote: > Hi all, > > A little typo in vacuumdb tests. Fixed, thanks. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] double vacuum in initdb
initdb currently does

PG_CMD_PUTS("ANALYZE;\nVACUUM FULL;\nVACUUM FREEZE;\n");

FREEZE is now part of FULL, so this seems redundant. Also, ANALYZE can be run as part of VACUUM. So this could be

PG_CMD_PUTS("VACUUM FULL ANALYZE;\n");

There have been some concerns about time spent in initdb in test suites, which is why I looked into this. In testing, this change can shave off between 10% and 20% of the run time of initdb, so it would be kind of useful.

The last change to this was

commit 66cd8150636e48a8f143560136a25ec5eb355d8c
Author: Tom Lane
Date: Mon Nov 29 03:05:03 2004 +

    Clean up initdb's error handling so that it prints something more
    useful than just 'failed' when there's a problem. Per gripe from
    Chris Albertson.

    In an unrelated change, use VACUUM FULL; VACUUM FREEZE; rather than
    a single VACUUM FULL FREEZE command, to respond to my worries of a
    couple days ago about the reliability of doing this in one go.

That was a long time ago. Is that still applicable?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GSSAPI, SSPI - include_realm default
* Peter Eisentraut (pete...@gmx.net) wrote: > On 12/9/14 5:40 PM, Stephen Frost wrote: > > I agree with this but I don't really see why we wouldn't say "hey, this > > is going to change in 9.5." > > Well, for one thing, we don't even know if it's going to be called 9.5. ;-) Now that is certainly a very good point. > And there is always a chance for a technical reason popping up that we > might not make the change after all in 9.5. I suppose. > I'd be fine with something more along the lines of "the default might > change in the future ... if you want to be future-proof, set it explicitly". Sure, that'd work for me. > Let's see an actual patch. Will do. Might be a few days before I get to it but I don't think there's any cause for rush anyway. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] On partitioning
On Wed, Dec 10, 2014 at 7:25 PM, Amit Langote wrote: > In heap_create(), do we create storage for a top level partitioned table > (say, RELKIND_PARTITIONED_TABLE)? How about a partition that is further > sub-partitioned? We might allocate storage for a partition at some point and > then later choose to sub-partition it. In such a case, perhaps, we would have > to move existing data to the storage of subpartitions and deallocate the > partition's storage. In other words only leaf relations in a partition > hierarchy would have storage. Is there such a notion within code for some > other purpose or we'd have to invent it for partitioning scheme? I think it would be advantageous to have storage only for the leaf partitions, because then you don't need to waste time doing a zero-block sequential scan of the root as part of the append-plan, an annoyance of the current system. We have no concept for this right now; in fact, right now, the relkind fully determines whether a given relation has storage. One idea is to make the leaves relkind = 'r' and the interior nodes some new relkind. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical column ordering
* Andres Freund (and...@2ndquadrant.com) wrote: > But the scheduling of commits with regard to the 9.5 schedule actually > opens a relevant question: When are we planning to release 9.5? Because > If we try ~ one year from now it's a whole different ballgame than if we > try to go back to september. And I think there's pretty good arguments > for both. This should really be on its own thread for discussion... I'm leaning, at the moment at least, towards the September release schedule. I agree that having a later release would allow us to get more into it, but there's a lot to be said for the consistency we've kept up over the past few years with a September release (our last non-September release was 8.4). I'd certainly vote against planning for a mid-December release as, at least in my experience, it's a bad idea to try and do December (or January 1..) major releases. October might work, but that's not much of a change from September. Late January or February would probably work but that's quite a shift from September and I don't think it'd be particularly better. Worse, I'm nervous we might get into a habit of longer and longer releases. Having yearly releases, imv at least, is really good for the project and those who depend on it. New features are available pretty quickly to end-users and people can plan around our schedule pretty easily (eg- plan to do DB upgrades in January/February). Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] On partitioning
> From: Robert Haas [mailto:robertmh...@gmail.com] > On Mon, Dec 8, 2014 at 2:56 PM, Andres Freund > wrote: > >> I don't think that's mutually exclusive with the idea of > >> partitions-as-tables. I mean, you can add code to the ALTER TABLE > >> path that says if (i_am_not_the_partitioning_root) ereport(ERROR, ...) > >> wherever you want. > > > > That'll be a lot of places you'll need to touch. More fundamentally: Why > > should we name something a table that's not one? > > Well, I'm not convinced that it isn't one. And adding a new relkind > will involve a bunch of code churn, too. But I don't much care to > pre-litigate this: when someone has got a patch, we can either agree > that the approach is OK or argue that it is problematic because X. I > think we need to hammer down the design in broad strokes first, and > I'm not sure we're totally there yet. > In heap_create(), do we create storage for a top level partitioned table (say, RELKIND_PARTITIONED_TABLE)? How about a partition that is further sub-partitioned? We might allocate storage for a partition at some point and then later choose to sub-partition it. In such a case, perhaps, we would have to move existing data to the storage of subpartitions and deallocate the partition's storage. In other words only leaf relations in a partition hierarchy would have storage. Is there such a notion within code for some other purpose or we'd have to invent it for partitioning scheme? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] thinko in convertToJsonb()
Michael Paquier writes: > On Tue, Dec 9, 2014 at 11:11 AM, Mark Dilger wrote: >> Perhaps the code really meant to say: >> reserveFromBuffer(&buffer, VARHDRSZ) > Good catch! The code is indeed incorrect. Attached is a one-line patch > addressing that, I guess someone is going to pick up that sooner or > later. Pushed, thanks! regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] advance local xmin more aggressively
On Wed, Dec 10, 2014 at 3:28 PM, Heikki Linnakangas wrote: >> Care to code it up? > > Here you are. That was quick. You need to add a semicolon to the end of line 20 in pairingheap.c. I wonder what the worst case for this is. I tried it on your destruction test case from before and it doesn't seem to cost anything material: your patch: 0m38.714s 0m34.424s 0m35.191s master: 0m35.260s 0m34.643s 0m34.945s my patch: 0m34.728s 0m34.619s 0m35.015s The first measurement with your patch is somewhat of an outlier, but that was also the first measurement I took, which might have thrown off the results. Basically, either your patch or my patch looks free from here. I'm kind of suspicious about that: it seems like the additional linked-list manipulation you're doing in RegisterSnapshotOnOwner ought to cost something, but maybe that function just isn't called enough for it to matter, at least on this test case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Casting issues with domains
Kevin Grittner writes: > Tom Lane wrote: >> As far as that goes, I think the OP was unhappy about the performance >> of the information_schema views, which in our implementation do exactly >> that so that the exposed types of the view columns conform to the SQL >> standard, even though the underlying catalogs use PG-centric types. >> >> I don't believe that that's the only reason why the performance of the >> information_schema views tends to be sucky, but it's certainly a reason. > Is that schema too "edge case" to justify some functional indexes > on the cast values on the underlying catalogs? (I'm inclined to > think so, but it seemed like a question worth putting out > there) We don't support functional indexes on system catalogs, so whether it'd be justified is sorta moot. On the whole though I'm inclined to agree that the information_schema views aren't used enough to justify adding overhead to system-catalog updates, even if the pieces for that all existed. > Or, since these particular domains are known, is there any sane way > to "special-case" these to allow the underlying types to work? I don't particularly care for a kluge solution here. I notice that recent versions of the SQL spec contain the notion of a "distinct type", which is a user-defined type that is representationally identical to some base type but has its own name, and comes equipped with assignment-grade casts to and from the base type (which in PG terms would be binary-compatible casts, though the spec doesn't require that). It seems like this might be intended to be the sort of "zero cost type alias" we were talking about, except that the SQL committee seems to have got it wrong by not specifying the cast-to-base-type as being implicit. Which ISTM you would want so that operators/functions on the base type would apply automatically to the distinct type. But perhaps we could extend the spec with some option to CREATE TYPE to allow the cast to come out that way. 
Or in short, maybe we should try to replace the domains used in the current information_schema with distinct types. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLESAMPLE patch
On 11/12/14 00:24, Petr Jelinek wrote: Hello, Attached is a basic implementation of TABLESAMPLE clause. It's SQL standard clause and couple of people tried to submit it before so I think I don't need to explain in length what it does - basically returns "random" sample of a table using a specified sampling method. I implemented both SYSTEM and BERNOULLI sampling as specified by SQL standard. The SYSTEM sampling does block level sampling using same algorithm as ANALYZE, BERNOULLI scans whole table and picks tuple randomly. There is API for sampling methods which consists of 4 functions at the moment - init, end, nextblock and nexttuple. I added catalog which maps the sampling method to the functions implementing this API. The grammar creates new TableSampleRange struct that I added for sampling. Parser then uses the catalog to load information about the sampling method into TableSampleClause which is then attached to RTE. Planner checks for if this parameter is present in the RTE and if it finds it it will create plan with just one path - SampleScan. SampleScan implements standard executor API and calls the sampling method API as needed. It is possible to write custom sampling methods. The sampling method parameters are not limited to just percent number as in standard but dynamic list of expressions which is checked against the definition of the init function in a similar fashion (although much simplified) as function calls are. Notable lacking parts are: - proper costing and returned row count estimation - given the dynamic nature of parameters I think for we'll need to let the sampling method do this, so there will have to be fifth function in the API. - ruleutils support (it needs a bit of code in get_from_clause_item function) - docs are sparse at the moment Forgot the obligatory: The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633. 
-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
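Usage per the syntax described above would look roughly like this (a sketch; orders is a hypothetical table, and the single argument is the percentage of rows to return):

```sql
-- SYSTEM: block-level sampling; fast, but the granularity is whole blocks.
SELECT * FROM orders TABLESAMPLE SYSTEM (10);

-- BERNOULLI: scans the whole table, keeping each row with 1.5% probability.
-- REPEATABLE fixes the random seed so the same sample comes back each run.
SELECT * FROM orders TABLESAMPLE BERNOULLI (1.5) REPEATABLE (42);
```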
[HACKERS] TABLESAMPLE patch
Hello, Attached is a basic implementation of TABLESAMPLE clause. It's an SQL standard clause and a couple of people tried to submit it before, so I think I don't need to explain at length what it does - basically it returns a "random" sample of a table using a specified sampling method. I implemented both SYSTEM and BERNOULLI sampling as specified by the SQL standard. The SYSTEM sampling does block level sampling using the same algorithm as ANALYZE; BERNOULLI scans the whole table and picks tuples randomly. There is an API for sampling methods which consists of 4 functions at the moment - init, end, nextblock and nexttuple. I added a catalog which maps the sampling method to the functions implementing this API. The grammar creates a new TableSampleRange struct that I added for sampling. The parser then uses the catalog to load information about the sampling method into TableSampleClause, which is then attached to the RTE. The planner checks if this parameter is present in the RTE and if it finds it, it will create a plan with just one path - SampleScan. SampleScan implements the standard executor API and calls the sampling method API as needed. It is possible to write custom sampling methods. The sampling method parameters are not limited to just a percent number as in the standard but are a dynamic list of expressions which is checked against the definition of the init function in a similar fashion (although much simplified) as function calls are. Notable lacking parts are: - proper costing and returned row count estimation - given the dynamic nature of parameters I think we'll need to let the sampling method do this, so there will have to be a fifth function in the API. 
- ruleutils support (it needs a bit of code in get_from_clause_item function) - docs are sparse at the moment -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml index 01d24a5..250ae29 100644 --- a/doc/src/sgml/ref/select.sgml +++ b/doc/src/sgml/ref/select.sgml @@ -49,7 +49,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionwhere from_item can be one of: -[ ONLY ] table_name [ * ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ] +[ ONLY ] table_name [ * ] [ TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ] ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ] [ LATERAL ] ( select ) [ AS ] alias [ ( column_alias [, ...] ) ] with_query_name [ [ AS ] alias [ ( column_alias [, ...] ) ] ] [ LATERAL ] function_name ( [ argument [, ...] ] ) @@ -317,6 +317,38 @@ TABLE [ ONLY ] table_name [ * ] + TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ] + + +Table sample clause after +table_name indicates that +a sampling_method should +be used to retrieve subset of rows in the table. +The sampling_method can be +one of: + + + SYSTEM + + + BERNOULLI + + +Both of those sampling methods currently accept only single argument +which is the percent (floating point from 0 to 100) of the rows to +be returned. +The SYSTEM sampling method does block level +sampling with each block having same chance of being selected and +returns all rows from each selected block. +The BERNOULLI scans whole table and returns +individual rows with equal probability. +The optional numeric parameter REPEATABLE is used +as random seed for sampling. + + + + + alias diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile index 21721b4..595737c 100644 --- a/src/backend/access/Makefile +++ b/src/backend/access/Makefile @@ -8,6 +8,7 @@ subdir = src/backend/access top_builddir = ../../.. 
include $(top_builddir)/src/Makefile.global -SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist transam +SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \ + transam tsm include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/tsm/Makefile b/src/backend/access/tsm/Makefile new file mode 100644 index 000..73bbbd7 --- /dev/null +++ b/src/backend/access/tsm/Makefile @@ -0,0 +1,17 @@ +#- +# +# Makefile-- +#Makefile for access/tsm +# +# IDENTIFICATION +#src/backend/access/tsm/Makefile +# +#- + +subdir = src/backend/access/tsm +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global + +OBJS = tsm_system.o tsm_bernoulli.o + +include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/tsm/tsm_bernoulli.c b/src/backend/access/tsm/tsm_bernoulli.c new file mode 100644 index 000..c273ca6 --- /de
Re: [HACKERS] Casting issues with domains
Tom Lane wrote: > Kevin Grittner writes: >> It's kinda hard for me to visualize where it makes sense to define >> the original table column as the bare type but use a domain when >> referencing it in the view. > > As far as that goes, I think the OP was unhappy about the performance > of the information_schema views, which in our implementation do exactly > that so that the exposed types of the view columns conform to the SQL > standard, even though the underlying catalogs use PG-centric types. > > I don't believe that that's the only reason why the performance of the > information_schema views tends to be sucky, but it's certainly a reason. Is that schema too "edge case" to justify some functional indexes on the cast values on the underlying catalogs? (I'm inclined to think so, but it seemed like a question worth putting out there) Or, since these particular domains are known, is there any sane way to "special-case" these to allow the underlying types to work? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Casting issues with domains
Kevin Grittner writes: > It's kinda hard for me to visualize where it makes sense to define > the original table column as the bare type but use a domain when > referencing it in the view. As far as that goes, I think the OP was unhappy about the performance of the information_schema views, which in our implementation do exactly that so that the exposed types of the view columns conform to the SQL standard, even though the underlying catalogs use PG-centric types. I don't believe that that's the only reason why the performance of the information_schema views tends to be sucky, but it's certainly a reason. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Fix typo um vacuumdb tests
Hi all, A little typo in vacuumdb tests. Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL >> Timbira: http://www.timbira.com.br >> Blog: http://fabriziomello.github.io >> Linkedin: http://br.linkedin.com/in/fabriziomello >> Twitter: http://twitter.com/fabriziomello >> Github: http://github.com/fabriziomello diff --git a/src/bin/scripts/t/100_vacuumdb.pl b/src/bin/scripts/t/100_vacuumdb.pl index d6555f5..ac160ba 100644 --- a/src/bin/scripts/t/100_vacuumdb.pl +++ b/src/bin/scripts/t/100_vacuumdb.pl @@ -29,4 +29,4 @@ issues_sql_like( issues_sql_like( [ 'vacuumdb', '-Z', 'postgres' ], qr/statement: ANALYZE;/, - 'vacuumdb -z'); + 'vacuumdb -Z'); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GiST kNN search queue (Re: KNN-GiST with recheck)
On 12/10/2014 10:59 PM, Arthur Silva wrote: It may be better to replace the lib/binaryheap altogether as it offers comparable/better performance. It's not always better. A binary heap is more memory-efficient, for starters. There are only two uses of lib/binaryheap: reorderbuffer.c and merge append. Reorderbuffer isn't that performance critical, although a binary heap may well be better there, because the comparison function is very cheap. For merge append, it might be a win, especially if the comparison function is expensive. (That's on the assumption that the overall number of comparisons needed with a pairing heap is smaller - I'm not sure how true that is). That would be worth testing. I'd love to test some other heap implementation in in tuplesort.c. It has a custom binary heap implementation that's used in the final merge phase of an external sort, and also when doing a so-called bounded sort, i.e. "ORDER BY x LIMIT Y". But that would be difficult to replace, because tuplesort.c collects tuples in an array in memory, and then turns that into a heap. An array is efficient to turn into a binary heap, but to switch to another data structure, you'd suddenly need extra memory. And we do the switch when we run out of work_mem, so allocating more isn't really an option. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] libpq pipelining
On Friday, December 05, 2014 12:22:38 PM Heikki Linnakangas wrote: > Oh, that's what the PQgetLastQuery/PQgetNextQuery functions work! I > didn't understand that before. I'd suggest renaming them to something > like PQgetSentQuery() and PQgetResultQuery(). The first/last/next names > made me think that they're used to iterate a list of queries, but in > fact they're supposed to be used at very different stages. > > - Heikki Okay, I have renamed them with your suggestions, and added PQsetPipelining/PQgetPipelining, defaulting to pipelining off. There should be no behavior change unless pipelining is enabled. Documentation should be mostly complete except the possible addition of an example and maybe a general pipelining overview paragraph. I have implemented async query support (that takes advantage of pipelining) in Qt, along with a couple test cases. This led to me discovering a bug with my last patch where a PGquery object could be reused twice in a row. I have fixed that. I contemplated not reusing the PGquery objects at all, but that wouldn't solve the problem because it's very possible that malloc will return a recent free of the same size anyway. Making the guarantee that a PGquery won't be reused twice in a row should be sufficient, and the only alternative is to add a unique id, but that will add further complexity that I don't think is warranted. Feedback is very welcome and appreciated. Thanks, Matt Newell diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index d829a4b..4e0431e 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -3947,9 +3947,14 @@ int PQsendQuery(PGconn *conn, const char *command); After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the - results. PQsendQuery cannot be called again - (on the same connection) until PQgetResult - has returned a null pointer, indicating that the command is done. + results. 
If pipelining is enabled PQsendQuery + may be called multiple times before reading the results. See + PQsetPipelining and PQisPipelining. + Call PQgetSentQuery to get a PGquery + which can be used to identify which results obtained from + PQgetResult belong to each pipelined query. + If only one query is dispatched at a time, you can call PQgetResult + until a NULL value is returned to indicate the end of the query. @@ -4133,8 +4138,8 @@ PGresult *PQgetResult(PGconn *conn); PQgetResult must be called repeatedly until - it returns a null pointer, indicating that the command is done. - (If called when no command is active, + it returns a null pointer, indicating that all dispatched commands + are done. (If called when no command is active, PQgetResult will just return a null pointer at once.) Each non-null result from PQgetResult should be processed using the @@ -4144,14 +4149,17 @@ PGresult *PQgetResult(PGconn *conn); PQgetResult will block only if a command is active and the necessary response data has not yet been read by PQconsumeInput. + If query pipelining is being used, PQgetResultQuery + can be called after PQgetResult to match the result to the query. Even when PQresultStatus indicates a fatal -error, PQgetResult should be called until it -returns a null pointer, to allow libpq to -process the error information completely. +error, PQgetResult should be called until the +query has no more results (null pointer return if not using query +pipelining, otherwise see PQgetResultQuery), +to allow libpq to process the error information completely. @@ -4385,6 +4393,158 @@ int PQflush(PGconn *conn); read-ready and then read the response as described above. + + + +PQsetPipelining + + PQsetPipelining + + + + + + Enables or disables query pipelining. + +int PQsetPipelining(PGconn *conn, int arg); + + + + + Enables pipelining for the connectino if arg is 1, or disables it + if arg is 0. 
When pipelining is enabled multiple async queries can + be sent before processing the results of the first. If pipelining + is disabled an error will be raised if an async query is attempted + while another is active. + + + + + + +PQisPipelining + + PQisPipelining + + + + + + Returns the pipelining status of the connection + +int PQisPipelining(PGconn *conn); + + + + + Returns 1 if pipelining is enabled, or 0 if pipelining is disabled. + Query pipelining is disabled unless enabled with a call to + PQsetPipelining. + + + + + + +PQgetQueryCommand + + PQgetQueryCommand + + + + + + Returns the query string associated with th
Re: [HACKERS] GiST kNN search queue (Re: KNN-GiST with recheck)
On Wed, Dec 10, 2014 at 1:50 PM, Heikki Linnakangas wrote: > On 01/28/2014 04:12 PM, Alexander Korotkov wrote: > >> >3. A binary heap would be a better data structure to buffer the rechecked >>> >values. A Red-Black tree allows random insertions and deletions, but in >>> >this case you need to insert arbitrary values but only remove the >>> minimum >>> >item. That's exactly what a binary heap excels at. We have a nice binary >>> >heap implementation in the backend that you can use, see >>> >src/backend/lib/binaryheap.c. >>> > >>> >> Hmm. For me binary heap would be a better data structure for KNN-GiST at >> all :-) >> > > I decided to give this a shot, replacing the red-black tree in GiST with > the binary heap we have in lib/binaryheap.c. It made the GiST code somewhat > simpler, as the binaryheap interface is simpler than the red-black tree > one. Unfortunately, performance was somewhat worse. That was quite > surprising, as insertions and deletions are both O(log N) in both data > structures, but the red-black tree implementation is more complicated. > > I implemented another data structure called a Pairing Heap. It's also a > fairly simple data structure, but insertions are O(1) instead of O(log N). > It also performs fairly well in practice. > > With that, I got a small but measurable improvement. To test, I created a > table like this: > > create table gisttest (id integer, p point); > insert into gisttest select id, point(random(), random()) from > generate_series(1, 100) id; > create index i_gisttest on gisttest using gist (p); > > And I ran this query with pgbench: > > select id from gisttest order by p <-> '(0,0)' limit 1000; > > With unpatched master, I got about 650 TPS, and with the patch 720 TPS. > That's a nice little improvement, but perhaps more importantly, the pairing > heap implementation consumes less memory. To measure that, I put a > MemoryContextStats(so->queueCtx) call into gistendscan. 
With the above > query, but without the "limit" clause, on master I got: > > GiST scan context: 2109752 total in 10 blocks; 2088456 free (24998 > chunks); 21296 used > > And with the patch: > > GiST scan context: 1061160 total in 9 blocks; 1040088 free (12502 chunks); > 21072 used > > That's 2MB vs 1MB. While that's not much in absolute terms, it'd be nice > to reduce that memory consumption, as there is no hard upper bound on how > much might be needed. If the GiST tree is really disorganized for some > reason, a query might need a lot more. > > > So all in all, I quite like this patch, even though it doesn't do anything > too phenomenal. It adds a some code, in the form of the new pairing heap > implementation, but it makes the GiST code a little bit simpler. And it > gives a small performance gain, and reduces memory usage a bit. > > - Heikki > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > > It may be better to replace the lib/binaryheap altogether as it offers comparable/better performance.
Re: [HACKERS] advance local xmin more aggressively
On 12/10/2014 08:35 PM, Robert Haas wrote: On Wed, Dec 10, 2014 at 12:57 PM, Heikki Linnakangas wrote: Clever. Could we use that method in ResourceOwnerReleaseInternal and ResourceOwnerDelete, too? Might be best to have a ResourceOwnerWalk(resowner, callback) function for walking all resource owners in a tree, instead of one for walking the snapshots in them. Sure. It would be a little more complicated there since you want to stop when you get back to the starting point, but not too bad. But is that solving any actual problem? I thought that a transaction commit or abort in some special circumstances might call ResourceOwnerReleaseInternal on the top level, but I can't make it happen. The machinery in xact.c is too clever, and always releases the resource owners from the bottom up. And I can't find a way to create a deep resource owner tree in any other way. So I guess it's fine as it is. MemoryContextCheck and MemoryContextPrint also recurse, however. MemoryContextCheck is only enabled with --enable-cassert, but MemoryContextPrint is called when you run out of memory. That could turn a plain "out of memory" error into a stack overrun, triggering a server crash and restart. It occurs to me that the pairing heap I just posted in another thread (http://www.postgresql.org/message-id/54886bb8.9040...@vmware.com) would be a good fit for this. It's extremely cheap to insert to and to find the minimum item (O(1)), and the delete operation is O(log N), amortized. I didn't implement a delete operation, for removing a particular node, I only did delete-min, but it's basically the same code. Using a pairing heap for this might be overkill, but if we have that implementation anyway, the code in snapmgr.c to use it would be very simple, so I see little reason not to. It might even be simpler than your patch, because you wouldn't need to have the heuristics on whether to attempt updating the xmin; it would be cheap enough to always try it. Care to code it up? Here you are. 
- Heikki

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 327a1bc..b24ece6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = ilist.o binaryheap.o stringinfo.o
+OBJS = ilist.o binaryheap.o pairingheap.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/pairingheap.c b/src/backend/lib/pairingheap.c
new file mode 100644
index 000..06cd117
--- /dev/null
+++ b/src/backend/lib/pairingheap.c
@@ -0,0 +1,237 @@
+/*-
+ *
+ * pairingheap.c
+ *		A Pairing Heap implementation
+ *
+ * Portions Copyright (c) 2012-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/backend/lib/pairingheap.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include 
+
+#include "lib/pairingheap.h"
+
+static pairingheap_node *merge(pairingheap *heap, pairingheap_node *a,
+							   pairingheap_node *b);
+static pairingheap_node *merge_children(pairingheap *heap,
+										pairingheap_node *children);
+
+/*
+ * pairingheap_allocate
+ *
+ * Returns a pointer to a newly-allocated heap, with the heap property
+ * defined by the given comparator function, which will be invoked with the
+ * additional argument specified by 'arg'.
+ */
+pairingheap *
+pairingheap_allocate(pairingheap_comparator compare, void *arg)
+{
+	pairingheap *heap;
+
+	heap = (pairingheap *) palloc(sizeof(pairingheap));
+	heap->ph_compare = compare;
+	heap->ph_arg = arg;
+
+	heap->ph_root = NULL;
+
+	return heap;
+}
+
+/*
+ * pairingheap_free
+ *
+ * Releases memory used by the given pairingheap.
+ *
+ * Note: The items in the heap are not released!
+ */
+void
+pairingheap_free(pairingheap *heap)
+{
+	pfree(heap);
+}
+
+/* A helper function to merge two subheaps into one. */
+static pairingheap_node *
+merge(pairingheap *heap, pairingheap_node *a, pairingheap_node *b)
+{
+	if (a == NULL)
+		return b;
+	if (b == NULL)
+		return a;
+
+	/* Put the larger of the items as a child of the smaller one */
+	if (heap->ph_compare(a, b, heap->ph_arg) < 0)
+	{
+		pairingheap_node *tmp;
+
+		tmp = a;
+		a = b;
+		b = tmp;
+	}
+
+	if (a->first_child)
+		a->first_child->prev_or_parent = b;
+	b->prev_or_parent = a;
+	b->next_sibling = a->first_child;
+	a->first_child = b;
+	return a;
+}
+
+/*
+ * pairingheap_add
+ *
+ * Adds the given datum to the heap in O(1) time.
+ */
+void
+pairingheap_add(pairingheap *heap, pairingheap_node *d)
+{
+	d->first_child = NULL;
+
+	/* Link the new item as a new tree */
+	heap->ph_root = merge(heap, heap->ph_root, d);
+}
+
+/*
+ * pairingheap_first
+ *
+ * Returns a pointer to the first (root, topmost) node in the heap
+ * without modifying the heap. The caller must ensure that this
+ * routine is not used on an empty heap. Always O(1)
Re: [HACKERS] GSSAPI, SSPI - include_realm default
On 12/9/14 5:40 PM, Stephen Frost wrote: > I agree with this but I don't really see why we wouldn't say "hey, this > is going to change in 9.5." Well, for one thing, we don't even know if it's going to be called 9.5. ;-) And there is always a chance for a technical reason popping up that we might not make the change after all in 9.5. I'd be fine with something more along the lines of "the default might change in the future ... if you want to be future-proof, set it explicitly". Let's see an actual patch. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Casting issues with domains
Thomas Reiss wrote:
> postgres=# create table test1 (a text);
> CREATE TABLE
> postgres=# insert into test1 select generate_series(1,10);
> INSERT 0 10
> postgres=# create index idx1 on test1(a);
> CREATE INDEX
> postgres=# analyze test1 ;
> ANALYZE
> postgres=# explain select * from test1 where a = 'toto';
>                               QUERY PLAN
> ---
>  Index Only Scan using idx1 on test1  (cost=0.29..8.31 rows=1 width=5)
>    Index Cond: (a = 'toto'::text)
> (2 lignes)
>
> Now we create a tstdom domain and cast the a column to tstdom in the
> view definition:
> postgres=# create domain tstdom as text;
> CREATE DOMAIN
> postgres=# create view test2 as select a::tstdom from test1 ;
> CREATE VIEW
> postgres=# explain select * from test2 where a='toto';
>                         QUERY PLAN
> --
>  Seq Scan on test1  (cost=0.00..1693.00 rows=500 width=5)
>    Filter: (((a)::tstdom)::text = 'toto'::text)
> (2 lignes)
>
> As you can see, a is cast to tstdom and then again to text. These casts
> prevent the optimizer from choosing an index scan to retrieve the data.
> The casts are however strictly equivalent and should not prevent the
> optimizer from using indexes.

You can create an index to be used for searching using the domain.
Following the steps in your example, you can run this:

postgres=# create index idx2 on test1 ((a::tstdom));
CREATE INDEX
postgres=# vacuum analyze test1;
VACUUM
postgres=# explain select * from test2 where a='toto';
                            QUERY PLAN
--
 Index Scan using idx2 on test1  (cost=0.29..8.31 rows=1 width=5)
   Index Cond: (((a)::tstdom)::text = 'toto'::text)
(2 rows)

It's even easier if "a" is defined to be a member of the domain in the
original table:

postgres=# create domain tstdom as text;
CREATE DOMAIN
postgres=# create table test1 (a tstdom);
CREATE TABLE
postgres=# insert into test1 select generate_series(1,10);
INSERT 0 10
postgres=# create index idx1 on test1(a);
CREATE INDEX
postgres=# analyze test1 ;
ANALYZE
postgres=# explain select * from test1 where a = 'toto';
                              QUERY PLAN
---
 Index Only Scan using idx1 on test1  (cost=0.29..8.31 rows=1 width=5)
   Index Cond: (a = 'toto'::text)
(2 rows)
postgres=# create view test2 as select a::tstdom from test1 ;
CREATE VIEW
postgres=# explain select * from test2 where a='toto';
                              QUERY PLAN
---
 Index Only Scan using idx1 on test1  (cost=0.29..8.31 rows=1 width=5)
   Index Cond: (a = 'toto'::text)
(2 rows)

It's kinda hard for me to visualize where it makes sense to define the
original table column as the bare type but use a domain when referencing
it in the view.

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS
Andrew Gierth writes:
> And here is that recut patch set.

I started looking over this patch, but eventually decided that it needs more work to be committable than I'm prepared to put in right now.

My single biggest complaint is about the introduction of struct GroupedVar. If we stick with that, we're going to have to teach an extremely large number of places that know about Vars to also know about GroupedVars. This will result in code bloat and errors of omission. If you think the latter concern is hypothetical, note that you can't get 40 lines into gsp1.patch without finding such an omission, namely the patch fails to teach pg_stat_statements.c about GroupedVars. (That also points up that some of the errors of omission will be in third-party code that we can't fix easily.) I think you should get rid of that concept and instead implement the behavior by having nodeAgg.c set the relevant fields of the representative tuple slot to NULL, so that a regular Var does the right thing.

I'm also not happy about the quality of the internal documentation. The big problem here is the seriously lacking documentation of the new parse node types, eg

+/*
+ * Node representing substructure in GROUPING SETS
+ *
+ * This is not actually executable, but it's used in the raw parsetree
+ * representation of GROUP BY, and in the groupingSets field of Query, to
+ * preserve the original structure of rollup/cube clauses for readability
+ * rather than reducing everything to grouping sets.
+ */
+
+typedef enum
+{
+	GROUPING_SET_EMPTY,
+	GROUPING_SET_SIMPLE,
+	GROUPING_SET_ROLLUP,
+	GROUPING_SET_CUBE,
+	GROUPING_SET_SETS
+} GroupingSetKind;
+
+typedef struct GroupingSet
+{
+	Expr		xpr;
+	GroupingSetKind kind;
+	List	   *content;
+	int			location;
+} GroupingSet;

The only actual documentation there is a long-winded excuse for having put the struct declaration in the wrong place. (Since it's not an executable expression, it should be in parsenodes.h not primnodes.h.)
Good luck figuring out what "content" is a list of, or indeed anything at all except that this has got something to do with grouping sets. If one digs around in the patch long enough, some useful information can be found in the header comments for various functions --- but there should be a spec for what this struct means, what its fields are, what the relevant invariants are *in the .h file*. Poking around in parsenodes.h, eg the description of SortGroupClause, should give you an idea of the standard here.

I'm not too happy about struct Grouping either. If one had to guess, one would probably guess that this was part of the representation of a GROUP BY clause; a guess led on by the practice of the patch of dealing with this and struct GroupingSet together, as in eg pg_stat_statements.c and nodes.h. Reading enough of the patch will eventually clue you that this is the representation of a call of the GROUPING() pseudo-function, but that's not exactly clear from either the name of the struct or its random placement between Var and Const in primnodes.h. And the comment is oh so helpful:

+/*
+ * Grouping
+ */

I'd be inclined to call it GroupingFunc and put it after Aggref/WindowFunc. Also please note that there is an attempt throughout the system to order code stanzas that deal with assorted node types in an order matching the order in which they're declared in the *nodes.h files. You should never be flipping a coin to decide where to add such code, and "put it at the end of the existing list" is usually not the best answer either.

Some other random examples of inadequate attention to commenting:

@@ -243,7 +243,7 @@ typedef struct AggStatePerAggData
 	 * rest.
 	 */
-	Tuplesortstate *sortstate;	/* sort object, if DISTINCT or ORDER BY */
+	Tuplesortstate **sortstate; /* sort object, if DISTINCT or ORDER BY */

This change didn't even bother to pluralize the comment, let alone explain the length of the array or what it's indexed according to, let alone explain why we now need multiple tuplesort objects in what is still apparently a "per aggregate" state struct. (BTW, as a matter of good engineering I think it's useful to change a field's name when you change its meaning and representation so fundamentally. In this case, renaming to "sortstates" would have been clearer and would have helped ensure that you didn't miss fixing any referencing code.)

@@ -338,81 +339,101 @@ static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
 static void
 initialize_aggregates(AggState *aggstate, AggStatePerAgg peragg,
-					  AggStatePerGroup pergroup)
+					  AggStatePerGroup pergroup,
+					  int numReinitialize)
 {
 	int			aggno;

I wonder what numReinitialize is, or why it's
Re: [HACKERS] advance local xmin more aggressively
On Wed, Dec 10, 2014 at 12:57 PM, Heikki Linnakangas wrote: > Clever. Could we use that method in ResourceOwnerReleaseInternal and > ResourceOwnerDelete, too? Might be best to have a > ResourceOwnerWalk(resowner, callback) function for walking all resource > owners in a tree, instead of one for walking the snapshots in them. Sure. It would be a little more complicated there since you want to stop when you get back to the starting point, but not too bad. But is that solving any actual problem? >> 2. Instead of traversing the tree until we find an xmin equal to the >> one we're currently advertising, the code now traverses the entire >> tree each time it runs. However, it also keeps a record of how many >> times the oldest xmin occurred in the tree, which is decremented each >> time we unregister a snapshot with that xmin; the traversal doesn't >> run again until that count reaches 0. That fixed the performance >> regression on your test case. >> >> With a million subtransactions: >> >> master 34.464s 33.742s 34.317s >> advance-xmin 34.516s 34.069s 34.196s > > Well, you can still get the slowness back by running other stuff in the > background. I admit that it's a very obscure case, probably fine in > practice. I would still feel better if snapmgr.c did its own bookkeeping, > from an aesthetic point of view. In a heap, or even just a linked list, if > the performance characteristics of that is acceptable. > > It occurs to me that the pairing heap I just posted in another thread > (http://www.postgresql.org/message-id/54886bb8.9040...@vmware.com) would be > a good fit for this. It's extremely cheap to insert to and to find the > minimum item (O(1)), and the delete operation is O(log N), amortized. I > didn't implement a delete operation, for removing a particular node, I only > did delete-min, but it's basically the same code. 
Using a pairing heap for > this might be overkill, but if we have that implementation anyway, the code > in snapmgr.c to use it would be very simple, so I see little reason not to. > It might even be simpler than your patch, because you wouldn't need to have > the heuristics on whether to attempt updating the xmin; it would be cheap > enough to always try it. Care to code it up? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 10:59 PM, Amit Kapila wrote:
> Yeah and also how would user specify the values, as an example
> assume that table is partitioned on monthly_salary, so partition
> definition would look:
>
> PARTITION BY LIST(monthly_salary)
> (
> PARTITION salary_less_than_thousand VALUES(300, 900),
> PARTITION salary_less_than_two_thousand VALUES (500,1000,1500),
> ...
> )
>
> Now if user wants to define multi-column Partition based on
> monthly_salary and annual_salary, how do we want him to
> specify the values. Basically how to distinguish which values
> belong to first column key and which one's belong to second
> column key.

I assume you just add some parentheses.

PARTITION BY LIST (colA, colB)
(PARTITION VALUES ((valA1, valB1), (valA2, valB2), (valA3, valB3)))

Multi-column list partitioning may or may not be worth implementing, but the syntax is not a real problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Mon, Dec 8, 2014 at 5:05 PM, Jim Nasby wrote: > Agreed, but it's possible to keep a block/CTID interface while doing > something different on the disk. Objection: hand-waving. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
On Wed, Dec 10, 2014 at 9:22 AM, Alvaro Herrera wrote:
> The problem with naming partitions is that the user has to pick names
> for every partition, which is tedious and doesn't provide any
> significant benefit. The input I had from users of other partitioning
> systems was that they very much preferred not to name the partitions at
> all, which is why I chose the PARTITION FOR VALUE syntax (not sure if
> this syntax is exactly what other systems use; it just seemed the
> natural choice.)

FWIW, Oracle does name partitions. It generates the names automatically if you don't care to specify them, and the partition names for a given table live in their own namespace that is separate from the toplevel object namespace. For example:

CREATE TABLE sales
  ( invoice_no NUMBER,
    sale_year  INT NOT NULL,
    sale_month INT NOT NULL,
    sale_day   INT NOT NULL )
  STORAGE (INITIAL 100K NEXT 50K) LOGGING
  PARTITION BY RANGE ( sale_year, sale_month, sale_day)
    ( PARTITION sales_q1 VALUES LESS THAN ( 1999, 04, 01 )
        TABLESPACE tsa STORAGE (INITIAL 20K, NEXT 10K),
      PARTITION sales_q2 VALUES LESS THAN ( 1999, 07, 01 )
        TABLESPACE tsb,
      PARTITION sales_q3 VALUES LESS THAN ( 1999, 10, 01 )
        TABLESPACE tsc,
      PARTITION sales_q4 VALUES LESS THAN ( 2000, 01, 01 )
        TABLESPACE tsd)
  ENABLE ROW MOVEMENT;

I don't think this practice has much to recommend it. We're going to need a way to refer to individual partitions by name, and I don't see much benefit in making that name something other than what is stored in pg_class.relname.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] advance local xmin more aggressively
On 12/10/2014 06:56 PM, Robert Haas wrote:
> On Wed, Dec 10, 2014 at 9:49 AM, Robert Haas wrote:
>> I guess this bears some further thought. I certainly don't like the
>> fact that this makes the whole system crap out at a lower number of
>> subtransactions than presently. The actual performance numbers don't
>> bother me very much; I'm comfortable with the possibility that closing
>> a cursor will be some modest percentage slower if you've got thousands
>> of active savepoints.
>
> Here's a new version with two changes:
>
> 1. I changed the traversal of the resource owner tree to iterate
> instead of recurse. It now does a depth-first, pre-order traversal of
> the tree; when we reach the last child of a node, we follow its parent
> pointer to get back to where we were. That way, we don't need to keep
> anything on the stack. That fixed the crash at 100k cursors, but it
> was still 4x slower.

Clever. Could we use that method in ResourceOwnerReleaseInternal and
ResourceOwnerDelete, too? Might be best to have a
ResourceOwnerWalk(resowner, callback) function for walking all resource
owners in a tree, instead of one for walking the snapshots in them.

> 2. Instead of traversing the tree until we find an xmin equal to the
> one we're currently advertising, the code now traverses the entire
> tree each time it runs. However, it also keeps a record of how many
> times the oldest xmin occurred in the tree, which is decremented each
> time we unregister a snapshot with that xmin; the traversal doesn't
> run again until that count reaches 0. That fixed the performance
> regression on your test case.
>
> With a million subtransactions:
>
> master 34.464s 33.742s 34.317s
> advance-xmin 34.516s 34.069s 34.196s

Well, you can still get the slowness back by running other stuff in the
background. I admit that it's a very obscure case, probably fine in
practice. I would still feel better if snapmgr.c did its own bookkeeping,
from an aesthetic point of view. In a heap, or even just a linked list,
if the performance characteristics of that is acceptable.

It occurs to me that the pairing heap I just posted in another thread
(http://www.postgresql.org/message-id/54886bb8.9040...@vmware.com) would
be a good fit for this. It's extremely cheap to insert to and to find the
minimum item (O(1)), and the delete operation is O(log N), amortized. I
didn't implement a delete operation, for removing a particular node, I
only did delete-min, but it's basically the same code. Using a pairing
heap for this might be overkill, but if we have that implementation
anyway, the code in snapmgr.c to use it would be very simple, so I see
little reason not to. It might even be simpler than your patch, because
you wouldn't need to have the heuristics on whether to attempt updating
the xmin; it would be cheap enough to always try it.

- Heikki

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical column ordering
On 2014-12-10 12:06:11 -0500, Robert Haas wrote:
> Ultimately, I think this is mostly going to be a question of what
> Alvaro feels comfortable with; he's presumably going to have a better
> sense of where and to what extent there might be bugs lurking than any
> of the rest of us. But the point is worth considering, because I
> think we would probably all agree that having a release that is stable
> and usable right out of the gate is more important than having any
> single feature get into that release.

Sure, 9.4 needs to be out of the gate. I don't think anybody doubts that. But the scheduling of commits with regard to the 9.5 schedule actually opens a relevant question: when are we planning to release 9.5? Because if we try ~ one year from now it's a whole different ballgame than if we try to go back to September. And I think there are pretty good arguments for both.

Greetings,

Andres Freund

-- 
Andres Freund	                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical column ordering
On Wed, Dec 10, 2014 at 11:22 AM, Stephen Frost wrote: > I'm not quite sure that I see how that's relevant. Bugs will happen, > unfortunately, no matter how much review is done of a given patch. That > isn't to say that we shouldn't do any review, but it's a trade-off. > This change, at least, strikes me as less likely to have subtle bugs > in it as compared to the dropped column case. Yeah, that's possible. They seem similar to me because they both introduce new ways for the tuple as it is stored on disk to be different from what must be shown to the user. But I don't really know how well that translates to what needs to be changed on a code level. If we're basically going back over all the same places that needed special handling for attisdropped, then the likelihood of bugs may be quite a bit lower than it was for that patch, because now we know (mostly!) which places require attisdropped handling and we can audit them all to make sure they handle this, too. But if it's a completely different set of places that need to be touched, then I think there's a lively possibility for bugs of omission. Ultimately, I think this is mostly going to be a question of what Alvaro feels comfortable with; he's presumably going to have a better sense of where and to what extent there might be bugs lurking than any of the rest of us. But the point is worth considering, because I think we would probably all agree that having a release that is stable and usable right out of the gate is more important than having any single feature get into that release. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] advance local xmin more aggressively
On Wed, Dec 10, 2014 at 9:49 AM, Robert Haas wrote: > I guess this bears some further thought. I certainly don't like the > fact that this makes the whole system crap out at a lower number of > subtransactions than presently. The actual performance numbers don't > bother me very much; I'm comfortable with the possibility that closing > a cursor will be some modest percentage slower if you've got thousands > of active savepoints. Here's a new version with two changes: 1. I changed the traversal of the resource owner tree to iterate instead of recurse. It now does a depth-first, pre-order traversal of the tree; when we reach the last child of a node, we follow its parent pointer to get back to where we were. That way, we don't need to keep anything on the stack. That fixed the crash at 100k cursors, but it was still 4x slower. 2. Instead of traversing the tree until we find an xmin equal to the one we're currently advertising, the code now traverses the entire tree each time it runs. However, it also keeps a record of how many times the oldest xmin occurred in the tree, which is decremented each time we unregister a snapshot with that xmin; the traversal doesn't run again until that count reaches 0. That fixed the performance regression on your test case. 
With a million subtransactions:

master 34.464s 33.742s 34.317s
advance-xmin 34.516s 34.069s 34.196s

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 0955bcc..529209f 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -21,6 +21,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "miscadmin.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "utils/memutils.h"
@@ -113,6 +114,7 @@ typedef struct ResourceOwnerData
 ResourceOwner CurrentResourceOwner = NULL;
 ResourceOwner CurTransactionResourceOwner = NULL;
 ResourceOwner TopTransactionResourceOwner = NULL;
+ResourceOwner FirstRootResourceOwner = NULL;
 
 /*
  * List of add-on callbacks for resource releasing
@@ -167,6 +169,11 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 		owner->nextchild = parent->firstchild;
 		parent->firstchild = owner;
 	}
+	else
+	{
+		owner->nextchild = FirstRootResourceOwner;
+		FirstRootResourceOwner = owner;
+	}
 
 	return owner;
 }
@@ -442,6 +449,8 @@ ResourceOwnerDelete(ResourceOwner owner)
 	 * the owner tree.  Better a leak than a crash.
 	 */
 	ResourceOwnerNewParent(owner, NULL);
+	Assert(FirstRootResourceOwner == owner);
+	FirstRootResourceOwner = owner->nextchild;
 
 	/* And free the object. */
 	if (owner->buffers)
@@ -502,6 +511,14 @@ ResourceOwnerNewParent(ResourceOwner owner,
 			}
 		}
 	}
+	else
+	{
+		ResourceOwner *link = &FirstRootResourceOwner;
+
+		while (*link != owner)
+			link = &((*link)->nextchild);
+		*link = owner->nextchild;
+	}
 
 	if (newparent)
 	{
@@ -513,7 +530,8 @@ ResourceOwnerNewParent(ResourceOwner owner,
 	else
 	{
 		owner->parent = NULL;
-		owner->nextchild = NULL;
+		owner->nextchild = FirstRootResourceOwner;
+		FirstRootResourceOwner = owner;
 	}
 }
@@ -1161,6 +1179,59 @@ ResourceOwnerForgetSnapshot(ResourceOwner owner, Snapshot snapshot)
 }
 
 /*
+ * Invoke a caller-supplied function on every snapshot registered with any
+ * resource owner in the system.  The callback can abort the traversal by
+ * returning true.
+ */
+bool
+ResourceOwnerWalkAllSnapshots(ResourceWalkSnapshotCallback callback, void *arg)
+{
+	ResourceOwner owner = FirstRootResourceOwner;
+	bool		visited = false;
+
+	/*
+	 * We do this traversal non-recursively in order to avoid running out of
+	 * stack space, which can otherwise happen with large numbers of nested
+	 * subtransactions.  The easiest way to do that is to search depth-first,
+	 * so that we visit all of a given node's descendents before reaching its
+	 * right sibling.  When we've visited all of the node's descendents, we'll
+	 * follow the last child's parent pointer back to that node, but visited
+	 * will be set to true at that point, so we'll advance to the right
+	 * sibling instead of traversing it again.
+	 */
+	while (owner != NULL)
+	{
+		if (!visited)
+		{
+			int			i;
+
+			for (i = 0; i < owner->nsnapshots; ++i)
+				if (callback(owner->snapshots[i], arg))
+					return true;
+
+			if (owner->firstchild != NULL)
+			{
+				owner = owner->firstchild;
+				continue;
+			}
+		}
+
+		if (owner->nextchild != NULL)
+		{
+			owner = owner->nextchild;
+			visited = false;
+		}
+		else
+		{
+			owner = owner->parent;
+			visited = true;
+		}
+	}
+
+	return false;
+}
+
+/*
  * Debugging subroutine
  */
 static void
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index d601efe..1fd73c8 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -122,14 +122,28 @@ typedef struct ActiveSnapshotElt
 
 /* Top
Re: [HACKERS] logical column ordering
Robert, all,

* Robert Haas (robertmh...@gmail.com) wrote:
> To the extent that I have any concern about the patch at this point,
> it's around stability. I would awfully rather see something like this
> get committed at the beginning of a development cycle than the end.

I tend to agree with this; we have a pretty bad habit of bouncing around big patches until the end and then committing them. That's not as bad when the patch has been getting reviews and feedback over a few months (or years) but this particular patch isn't even code-complete at this point, aiui.

> It's quite possible that I'm being more nervous than is justified, but
> given that we're *still* fixing bugs related to dropped-column
> handling (cf. 9b35ddce93a2ef336498baa15581b9d10f01db9c from July of
> this year) which was added in July 2002, maybe not.

I'm not quite sure that I see how that's relevant. Bugs will happen, unfortunately, no matter how much review is done of a given patch. That isn't to say that we shouldn't do any review, but it's a trade-off. This change, at least, strikes me as less likely to have subtle bugs in it as compared to the dropped column case.

Thanks,

Stephen
Re: [HACKERS] logical column ordering
On Wed, Dec 10, 2014 at 9:25 AM, Alvaro Herrera wrote:
> FWIW I have no intention to add options for physical/logical ordering
> anywhere. All users will see is that tables will follow the same
> (logical) order everywhere.

Just to be clear, I wasn't in any way intending to say that the patch had a problem in this area. I was just expressing concern about the apparent rush to judgement on whether converting between physical and logical column ordering would be expensive. I certainly think that's something that we should test - for example, we might want to consider whether there are cases where you could maybe convince the executor to spend a lot of time pointlessly reorganizing tuples in ways that wouldn't happen today. But I have no particular reason to think that any issues we hit there won't be solvable.

To the extent that I have any concern about the patch at this point, it's around stability. I would awfully rather see something like this get committed at the beginning of a development cycle than the end. It's quite possible that I'm being more nervous than is justified, but given that we're *still* fixing bugs related to dropped-column handling (cf. 9b35ddce93a2ef336498baa15581b9d10f01db9c from July of this year) which was added in July 2002, maybe not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] GiST kNN search queue (Re: KNN-GiST with recheck)
On 01/28/2014 04:12 PM, Alexander Korotkov wrote:
>> 3. A binary heap would be a better data structure to buffer the rechecked
>> values. A Red-Black tree allows random insertions and deletions, but in
>> this case you need to insert arbitrary values but only remove the minimum
>> item. That's exactly what a binary heap excels at. We have a nice binary
>> heap implementation in the backend that you can use, see
>> src/backend/lib/binaryheap.c.
>
> Hmm. For me binary heap would be a better data structure for KNN-GiST at
> all :-)

I decided to give this a shot, replacing the red-black tree in GiST with the binary heap we have in lib/binaryheap.c. It made the GiST code somewhat simpler, as the binaryheap interface is simpler than the red-black tree one. Unfortunately, performance was somewhat worse. That was quite surprising, as insertions and deletions are both O(log N) in both data structures, but the red-black tree implementation is more complicated.

I implemented another data structure called a Pairing Heap. It's also a fairly simple data structure, but insertions are O(1) instead of O(log N). It also performs fairly well in practice. With that, I got a small but measurable improvement.

To test, I created a table like this:

create table gisttest (id integer, p point);
insert into gisttest select id, point(random(), random()) from generate_series(1, 100) id;
create index i_gisttest on gisttest using gist (p);

And I ran this query with pgbench:

select id from gisttest order by p <-> '(0,0)' limit 1000;

With unpatched master, I got about 650 TPS, and with the patch 720 TPS. That's a nice little improvement, but perhaps more importantly, the pairing heap implementation consumes less memory. To measure that, I put a MemoryContextStats(so->queueCtx) call into gistendscan.
With the above query, but without the "limit" clause, on master I got:

GiST scan context: 2109752 total in 10 blocks; 2088456 free (24998 chunks); 21296 used

And with the patch:

GiST scan context: 1061160 total in 9 blocks; 1040088 free (12502 chunks); 21072 used

That's 2MB vs 1MB. While that's not much in absolute terms, it'd be nice to reduce that memory consumption, as there is no hard upper bound on how much might be needed. If the GiST tree is really disorganized for some reason, a query might need a lot more.

So all in all, I quite like this patch, even though it doesn't do anything too phenomenal. It adds some code, in the form of the new pairing heap implementation, but it makes the GiST code a little bit simpler. And it gives a small performance gain, and reduces memory usage a bit.

- Heikki

diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 7a8692b..52b2c53 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -18,6 +18,7 @@
 #include "access/relscan.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "lib/pairingheap.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -243,8 +244,6 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 	GISTPageOpaque opaque;
 	OffsetNumber maxoff;
 	OffsetNumber i;
-	GISTSearchTreeItem *tmpItem = so->tmpTreeItem;
-	bool		isNew;
 	MemoryContext oldcxt;
 
 	Assert(!GISTSearchItemIsHeap(*pageItem));
@@ -275,18 +274,15 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 	oldcxt = MemoryContextSwitchTo(so->queueCxt);
 
 	/* Create new GISTSearchItem for the right sibling index page */
-	item = palloc(sizeof(GISTSearchItem));
-	item->next = NULL;
+	item = palloc(SizeOfGISTSearchItem(scan->numberOfOrderBys));
 	item->blkno = opaque->rightlink;
 	item->data.parentlsn = pageItem->data.parentlsn;
 
 	/* Insert it into the queue using same distances as for this page */
-	tmpItem->head = item;
-	tmpItem->lastHeap = NULL;
-	memcpy(tmpItem->distances, myDistances,
+	memcpy(item->distances, myDistances,
 		   sizeof(double) * scan->numberOfOrderBys);
-	(void) rb_insert(so->queue, (RBNode *) tmpItem, &isNew);
+	pairingheap_add(so->queue, &item->fbNode);
 
 	MemoryContextSwitchTo(oldcxt);
 }
@@ -348,8 +344,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 	oldcxt = MemoryContextSwitchTo(so->queueCxt);
 
 	/* Create new GISTSearchItem for this item */
-	item = palloc(sizeof(GISTSearchItem));
-	item->next = NULL;
+	item = palloc(SizeOfGISTSearchItem(scan->numberOfOrderBys));
 
 	if (GistPageIsLeaf(page))
 	{
@@ -372,12 +367,10 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 	}
 
 	/* Insert it into the queue using new distance data */
-	tmpItem->head = item;
-	tmpItem->lastHeap = GISTSearchItemIsHeap(*item) ? item : NULL;
-	memcpy(tmpItem->distances, so->distances,
+	memcpy(item->distances, so->distances,
 		   sizeof(double) * scan->numberOfOrderBys);
-	(void) rb_insert(so->queue,
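[Editor's note: for readers unfamiliar with the data structure the patch introduces, here is a minimal, self-contained pairing-heap sketch showing the two operations the GiST scan queue needs — insert and remove-minimum. The names (`ph_node`, `ph_add`, `ph_remove_first`) are illustrative only and deliberately do not mirror PostgreSQL's actual lib/pairingheap API; the delete-min below uses a single left-to-right fold of the children, a simplification of the textbook two-pass scheme.]

```c
#include <assert.h>
#include <stddef.h>

/* Intrusive pairing-heap node: first-child + next-sibling links. */
typedef struct ph_node
{
    int             key;
    struct ph_node *first_child;
    struct ph_node *next_sibling;
} ph_node;

/* Link two roots; the smaller key becomes the parent.  O(1). */
static ph_node *
ph_meld(ph_node *a, ph_node *b)
{
    if (a == NULL) return b;
    if (b == NULL) return a;
    if (b->key < a->key) { ph_node *t = a; a = b; b = t; }
    b->next_sibling = a->first_child;
    a->first_child = b;
    return a;
}

/* Insert is just a meld with a one-node heap.  O(1). */
static ph_node *
ph_add(ph_node *root, ph_node *n)
{
    n->first_child = NULL;
    n->next_sibling = NULL;
    return ph_meld(root, n);
}

/* Remove the minimum: pairwise-merge the root's children and meld the
 * results together (simplified single-pass fold; amortized cheap). */
static ph_node *
ph_remove_first(ph_node *root)
{
    ph_node *merged = NULL;
    ph_node *c = root->first_child;

    while (c != NULL)
    {
        ph_node *a = c;
        ph_node *b = c->next_sibling;

        if (b == NULL)
        {
            a->next_sibling = NULL;
            merged = ph_meld(merged, a);
            break;
        }
        c = b->next_sibling;
        a->next_sibling = NULL;
        b->next_sibling = NULL;
        merged = ph_meld(merged, ph_meld(a, b));
    }
    return merged;
}
```

The appeal over the red-black tree it replaces is that insertion is a couple of pointer assignments with no rebalancing, which matches the access pattern of a scan queue: many inserts, always popping the minimum.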
Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps
On 10/12/14 04:26, Michael Paquier wrote:

On Thu, Dec 4, 2014 at 9:26 PM, Heikki Linnakangas wrote:

Yeah, it was raised. I don't think it was really addressed. There was lengthy discussion on whether to include LSN, node id, and/or some other information, but there was no discussion on: * What is a node ID? * How is it used? * Who assigns it? etc. None of those things are explained in the documentation nor code comments. The node ID sounds suspiciously like the replication identifiers that have been proposed a couple of times. I don't remember if I like replication identifiers or not, but I'd rather discuss that issue explicitly and separately. I don't want the concept of a replication/node ID to fly under the radar like this.

Replication identifiers are a bigger thing, but yes, there is overlap, as the nodeid itself makes it possible to implement replication identifiers outside of core if needed.

What would the SQL API look like, and what would it be used for?

It would probably mirror the C one; from my POV it's not needed, but maybe some replication solution would prefer to use this without having to write C components...

I can't comment on that, because without any documentation, I don't even know what the C API is.

BTW, why is it OK that the node-ID of a commit is logged as a separate WAL record, after the commit record? That means that it's possible that a transaction commits, but you crash just before writing the SETTS record, so that after replay the transaction appears committed but with the default node ID. (Maybe that's OK given the intended use case, but it's hard to tell without any documentation. See a pattern here? ;-) )

This is actually a good point; it's not OK to have just a commit record but no nodeid record if the nodeid was set.

Could it be possible to get a refresh on that before it bitrots too much or we'll simply forget about it?
In the current state, there is no documentation explaining the point of the nodeid stuff, how to use it, and what use cases it is aimed at. The API in committs.c should be part of it as well. If that's not done, I think it would be fair to remove this portion and simply WAL-log the commit timestamp when its GUC is activated.

I have this on my list so it's not being forgotten, and I don't see it bitrotting unless we are changing the XLog API again. I have a bit of a busy schedule right now though, can it wait till next week? - I will post some code documentation patches then.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] advance local xmin more aggressively
On Wed, Dec 10, 2014 at 7:34 AM, Heikki Linnakangas wrote:
>>> 1. I don't really see why resource owners shouldn't be traversed like
>>> that. They are clearly intended to form a hierarchy, and there's
>>> existing code that recurses through the hierarchy from a given level
>>> downward. What's ugly about that?
>
> I can't exactly put my finger on it, but it just feels wrong from a
> modularity point of view. Their purpose is to make sure that we don't leak
> resources on abort, by allowing an easy "release everything" operation. It's
> not designed for finding objects based on some other criteria.
>
> There is similar double bookkeeping of many other things that are tracked by
> resource owners. Heavy-weight locks are tracked by LOCALLOCK structs, buffer
> pins in the PrivateRefCount array, etc. Those things existed before resource
> owners were invented, but if we were starting from scratch, that design
> would still make sense, as different objects have different access criteria.
> fd.c needs to be able to find the least-recently-used open file, for
> example, and you need to find the snapshot with the lowest xmin.

I think all of these are performance trade-offs rather than questions of style. LOCALLOCK structs and private buffer pin tracking existed before resource owners because we need to do lookups of locks by lock tag or buffers by buffer number frequently enough that, if we had to grovel through all the locks or buffer pins in the system without any indexing, it would be slow. We *could* do it that way now that we have resource owners, but it's pretty obvious that it would suck.

That's less obvious here. If we throw all of the registered snapshots into a binary heap, finding the lowest xmin will be cheaper every time we do it, but we'll have the overhead of maintaining that binary heap even if it never gets used.

> I think you're confusing the N and the M above.
> It's true that deleting a resource owner is O(M), where the M is the
> number of children of that resource owner. It does not follow that
> releasing N resource owners is O(N^2), where N is the number of resource
> owners released. Calling ResourceOwnerDelete(x) will only visit each
> resource owner in that tree once, so it's O(N), where N is the total
> number of resource owners in the tree.

Not if the portals are dropped in separate commands with an unpredictable ordering. Then you have N separate calls to ResourceOwnerDelete.

> I did some testing of the worst case scenario. The attached script first
> creates a lot of cursors, then a lot of savepoints, and finally closes the
> cursors in FIFO order. When the number of savepoints is high enough, this
> actually segfaults with your patch, because you run out of stack space when
> recursing the subxact resource owners. That's hardly this patch's fault, I'm
> actually surprised it doesn't crash without it, because we recurse into all
> resource owners in ResourceOwnerRelease too. Apparently the subxacts are
> closed in LIFO order at commit, but there might be other cases where you
> could trigger that. In any case, a stack-depth check would be nice.

Thanks for the test case; that's helpful. I added a stack depth check, but it doesn't help much:

ERROR: stack depth limit exceeded
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
STATEMENT: CLOSE cur0;
ERROR: stack depth limit exceeded
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
WARNING: AbortSubTransaction while in ABORT state
ERROR: stack depth limit exceeded
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
WARNING: AbortSubTransaction while in ABORT state
ERROR: stack depth limit exceeded
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
WARNING: AbortSubTransaction while in ABORT state
ERROR: stack depth limit exceeded
HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.
PANIC: ERRORDATA_STACK_SIZE exceeded

That's at 100k subtransactions. I took some performance data for lower numbers of subtransactions:

-- at 1000 subtransactions --
master        0.156s 0.157s 0.154s
advance-xmin  0.177s 0.175s 0.177s

-- at 1 subtransactions --
master        0.458s 0.457s 0.456s
advance-xmin  0.639s 0.637s 0.633s

I guess this bears some further thought. I certainly don't like the fact that this makes the whole system crap out at a lower number of subtransactions than presently. The actual performance numbers don't bother me very much; I'm comfortable with the possibility that closing a cursor will be some modest percentage slower if you've got thousands of active savepoints.
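[Editor's note: the stack-depth failures above come from recursing down the resource-owner tree. As an illustration of the alternative, here is a hedged sketch of walking a first-child/next-sibling tree iteratively, with the same early-abort behaviour the patch uses (stop once the advertised xmin cannot advance). The struct and names are illustrative stand-ins, not PostgreSQL's actual ResourceOwnerData or snapmgr code.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mirrors the link shape of a resource-owner-style tree
 * (parent / firstchild / nextchild); names are illustrative only. */
typedef struct owner
{
    struct owner *parent;
    struct owner *firstchild;
    struct owner *nextchild;
    int           xmin;         /* stand-in for the payload of interest */
} owner;

/* Visit every node under 'root' without recursion, so the C stack stays
 * O(1) deep no matter how deep the tree is.  Returns the minimum xmin
 * seen; bails out early once 'cutoff' is reached, analogous to aborting
 * the traversal as soon as the advertised xmin can't be advanced. */
static int
min_xmin_iterative(owner *root, int cutoff)
{
    owner *node = root;
    int    best = root->xmin;

    while (node != NULL)
    {
        if (node->xmin < best)
            best = node->xmin;
        if (best <= cutoff)
            break;              /* can't do any better; stop early */

        if (node->firstchild != NULL)
            node = node->firstchild;
        else
        {
            /* climb until we find an unvisited sibling, or reach root */
            while (node != root && node->nextchild == NULL)
                node = node->parent;
            node = (node == root) ? NULL : node->nextchild;
        }
    }
    return best;
}
```

Because the climb uses the existing parent links instead of the call stack, a chain of hundreds of thousands of nested owners traverses fine where the recursive version overflows.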
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Wed, Dec 10, 2014 at 12:10 PM, Rahila Syed wrote:
> >What I would suggest is instrument the backend with getrusage() at
> >startup and shutdown and have it print the difference in user time and
> >system time. Then, run tests for a fixed number of transactions and
> >see how the total CPU usage for the run differs.
>
> Following are the numbers obtained on tests with absolute CPU usage, a
> fixed number of transactions, and longer duration with the latest FPW
> compression patch.
>
> pgbench command: pgbench -r -t 25 -M prepared
>
> To ensure that data is not highly compressible, empty filler columns were
> altered using
>
> alter table pgbench_accounts alter column filler type text using
> gen_random_uuid()::text
>
> checkpoint_segments = 1024
> checkpoint_timeout = 5min
> fsync = on
>
> The tests ran for around 30 mins. Manual checkpoint was run before each test.
>
> Compression | WAL generated | %compression | Latency-avg | CPU usage (seconds)                    | TPS    | Latency stddev
> on          | 1531.4 MB     | ~35%         | 7.351 ms    | user diff: 562.67s system diff: 41.40s | 135.96 | 13.759 ms
> off         | 2373.1 MB     |              | 6.781 ms    | user diff: 354.20s system diff: 39.67s | 147.40 | 14.152 ms
>
> The compression obtained is quite high, close to 35%.
> CPU usage at user level when compression is on is quite noticeably high as
> compared to that when compression is off. But the gain in terms of reduction
> of WAL is also high.
>
> Server specifications:
> Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> RAM: 32GB
> Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
>
> Thank you,
> Rahila Syed
>
> On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas wrote:
>
>> On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed wrote:
>> >>If that's really true, we could consider having no configuration any
>> >>time, and just compressing always. But I'm skeptical that it's
>> >>actually true.
>> > >> > I was referring to this for CPU utilization: >> > >> http://www.postgresql.org/message-id/1410414381339-5818552.p...@n5.nabble.com >> > >> > >> > The above tests were performed on machine with configuration as follows >> > Server specifications: >> > Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 >> nos >> > RAM: 32GB >> > Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos >> > 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm >> >> I think that measurement methodology is not very good for assessing >> the CPU overhead, because you are only measuring the percentage CPU >> utilization, not the absolute amount of CPU utilization. It's not >> clear whether the duration of the tests was the same for all the >> configurations you tried - in which case the number of transactions >> might have been different - or whether the number of operations was >> exactly the same - in which case the runtime might have been >> different. Either way, it could obscure an actual difference in >> absolute CPU usage per transaction. It's unlikely that both the >> runtime and the number of transactions were identical for all of your >> tests, because that would imply that the patch makes no difference to >> performance; if that were true, you wouldn't have bothered writing >> it >> >> What I would suggest is instrument the backend with getrusage() at >> startup and shutdown and have it print the difference in user time and >> system time. Then, run tests for a fixed number of transactions and >> see how the total CPU usage for the run differs. >> >> Last cycle, Amit Kapila did a bunch of work trying to compress the WAL >> footprint for updates, and we found that compression was pretty darn >> expensive there in terms of CPU time. So I am suspicious of the >> finding that it is free here. 
It's not impossible that there's some >> effect which causes us to recoup more CPU time than we spend >> compressing in this case that did not apply in that case, but the >> projects are awfully similar, so I tend to doubt it. >> >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> > > This can be improved in the future by using other algorithms.
Re: [HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
Hi,

Am 04.09.2014 um 17:50 schrieb Jehan-Guillaume de Rorthais :
> Since few months, we occasionally see .ready files appearing on some slave
> instances from various context. The two I have in mind are under 9.2.x.
[…]
> So it seems for some reasons, these old WALs were "forgotten" by the
> restartpoint mechanism when they should have been recycled/deleted.

Am 08.10.2014 um 11:54 schrieb Heikki Linnakangas :
> 1. Where do the FF files come from? In 9.2, FF-segments are not supposed to be created, ever.
[…]
> 2. Why are the .done files sometimes not being created?

We’ve encountered behaviour which seems to match what has been described here: On Streaming Replication slaves, there is an odd piling up of old WALs and .ready files in pg_xlog, going back several months. The fine people on IRC have pointed me to this thread, and have encouraged me to revive it with our observations, so here we go:

Environment:
Master, 9.2.9
|- Slave S1, 9.2.9, on the same network as the master
'- Slave S2, 9.2.9, some 100 km away (occasional network hiccups; *not* a cascading replication)

wal_keep_segments    M=100 S1=100 S2=30
checkpoint_segments  M=100 S1=30 S2=30
wal_level            hot_standby (all)
archive_mode         on (all)
archive_command      on both slaves: /bin/true
archive_timeout      600s (all)

- On both slaves, we have „ghost“ WALs and corresponding .ready files (currently >600 of each on S2, slowly becoming a disk space problem)
- There’s always gaps in the ghost WAL names, often roughly 0x20, but not always
- The slave with the „bad“ network link has significantly more of these files, which suggests that disturbances of the Streaming Replication increase the chances of triggering this bug; OTOH, the presence of a name-gap pattern suggests the opposite
- We observe files named *FF as well

As you can see in the directory listings below, this setup is *very* low traffic, which may explain the pattern in WAL name gaps (?).
I’ve listed the entries by time, expecting to easily match WALs to their .ready files. There sometimes is an interesting delay between the WAL’s mtime and the .ready file — especially for *FF, where there’s several days between the WAL and the .ready file. - Master: http://pgsql.privatepaste.com/52ad612dfb - Slave S1: http://pgsql.privatepaste.com/58b4f3bb10 - Slave S2: http://pgsql.privatepaste.com/a693a8d7f4 I’ve only skimmed through the thread; my understanding is that there were several patches floating around, but nothing was committed. If there’s any way I can help, please let me know. - D. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] logical column ordering
Robert Haas wrote: > On Wed, Dec 10, 2014 at 12:17 AM, Tom Lane wrote: > > Alvaro Herrera writes: > >> Andrew Dunstan wrote: > >>> I seriously doubt it, although I could be wrong. Unless someone can show a > >>> significant performance gain from using physical order, which would be a > >>> bit > >>> of a surprise to me, I would just stick with logical ordering as the > >>> default. > > > >> Well, we have an optimization that avoids a projection step IIRC by > >> using the "physical tlist" instead of having to build a tailored one. I > >> guess the reason that's there is because somebody did measure an > >> improvement. Maybe it *is* worth having as an option for pg_dump ... > > > > The physical tlist thing is there because it's demonstrable that > > ExecProject() takes nonzero time. COPY does not go through ExecProject > > though. What's more, it already has code to deal with a user-specified > > column order, and nobody's ever claimed that that code imposes a > > measurable performance overhead. > > Also, if we're adding options to use the physical rather than the > logical column ordering in too many places, that's probably a sign > that we need to rethink this whole concept. The concept of a logical > column ordering doesn't have much meaning if you're constantly forced > to fall back to some other column ordering whenever you want good > performance. FWIW I have no intention to add options for physical/logical ordering anywhere. All users will see is that tables will follow the same (logical) order everywhere. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
> The tests ran for around 30 mins. Manual checkpoint was run before each test.
>
> Compression | WAL generated | %compression | Latency-avg | CPU usage (seconds)                    | TPS    | Latency stddev
> on          | 1531.4 MB     | ~35%         | 7.351 ms    | user diff: 562.67s system diff: 41.40s | 135.96 | 13.759 ms
> off         | 2373.1 MB     |              | 6.781 ms    | user diff: 354.20s system diff: 39.67s | 147.40 | 14.152 ms
>
> The compression obtained is quite high, close to 35%.
> CPU usage at user level when compression is on is quite noticeably high as
> compared to that when compression is off. But the gain in terms of reduction
> of WAL is also high.

I am sorry but I can't understand the above results due to wrapping.
Are you saying compression was twice as slow?

-- Bruce Momjian http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Amit Langote wrote: > On Wed, Dec 10, 2014 at 12:46 PM, Amit Kapila wrote: > > On Tue, Dec 9, 2014 at 7:21 PM, Alvaro Herrera > > wrote: > >> FWIW in my original proposal I was rejecting some things that after > >> further consideration turn out to be possible to allow; for instance > >> directly referencing individual partitions in COPY. We could allow > >> something like > >> > >> COPY lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT > >> or maybe > >> COPY PARTITION FOR VALUE '2000-01-01' ON TABLE lineitems TO STDOUT > >> > > or > > COPY [TABLE] lineitems PARTITION FOR VALUE '2000-01-01' TO STDOUT > > COPY [TABLE] lineitems PARTITION TO STDOUT > > > > I think we should try to support operations on partitions via main > > table whereever it is required. Um, I think the only difference is that you added the noise word TABLE which we currently don't allow in COPY, and that you added the possibility of using named partitions, about which see below. > We can also allow to explicitly name a partition > > COPY [TABLE ] lineitems PARTITION lineitems_2001 TO STDOUT; The problem with naming partitions is that the user has to pick names for every partition, which is tedious and doesn't provide any significant benefit. The input I had from users of other partitioning systems was that they very much preferred not to name the partitions at all, which is why I chose the PARTITION FOR VALUE syntax (not sure if this syntax is exactly what other systems use; it just seemed the natural choice.) -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
>What I would suggest is instrument the backend with getrusage() at
>startup and shutdown and have it print the difference in user time and
>system time. Then, run tests for a fixed number of transactions and
>see how the total CPU usage for the run differs.

Following are the numbers obtained on tests with absolute CPU usage, a fixed number of transactions, and longer duration with the latest FPW compression patch.

pgbench command: pgbench -r -t 25 -M prepared

To ensure that data is not highly compressible, empty filler columns were altered using

alter table pgbench_accounts alter column filler type text using gen_random_uuid()::text

checkpoint_segments = 1024
checkpoint_timeout = 5min
fsync = on

The tests ran for around 30 mins. Manual checkpoint was run before each test.

Compression | WAL generated | %compression | Latency-avg | CPU usage (seconds)                    | TPS    | Latency stddev
on          | 1531.4 MB     | ~35%         | 7.351 ms    | user diff: 562.67s system diff: 41.40s | 135.96 | 13.759 ms
off         | 2373.1 MB     |              | 6.781 ms    | user diff: 354.20s system diff: 39.67s | 147.40 | 14.152 ms

The compression obtained is quite high, close to 35%. CPU usage at user level when compression is on is quite noticeably high as compared to that when compression is off. But the gain in terms of reduction of WAL is also high.

Server specifications:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,
Rahila Syed

On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas wrote:

> On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed wrote:
> >>If that's really true, we could consider having no configuration any
> >>time, and just compressing always. But I'm skeptical that it's
> >>actually true.
> > > > I was referring to this for CPU utilization: > > > http://www.postgresql.org/message-id/1410414381339-5818552.p...@n5.nabble.com > > > > > > The above tests were performed on machine with configuration as follows > > Server specifications: > > Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos > > RAM: 32GB > > Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos > > 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm > > I think that measurement methodology is not very good for assessing > the CPU overhead, because you are only measuring the percentage CPU > utilization, not the absolute amount of CPU utilization. It's not > clear whether the duration of the tests was the same for all the > configurations you tried - in which case the number of transactions > might have been different - or whether the number of operations was > exactly the same - in which case the runtime might have been > different. Either way, it could obscure an actual difference in > absolute CPU usage per transaction. It's unlikely that both the > runtime and the number of transactions were identical for all of your > tests, because that would imply that the patch makes no difference to > performance; if that were true, you wouldn't have bothered writing > it > > What I would suggest is instrument the backend with getrusage() at > startup and shutdown and have it print the difference in user time and > system time. Then, run tests for a fixed number of transactions and > see how the total CPU usage for the run differs. > > Last cycle, Amit Kapila did a bunch of work trying to compress the WAL > footprint for updates, and we found that compression was pretty darn > expensive there in terms of CPU time. So I am suspicious of the > finding that it is free here. 
It's not impossible that there's some > effect which causes us to recoup more CPU time than we spend > compressing in this case that did not apply in that case, but the > projects are awfully similar, so I tend to doubt it. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
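[Editor's note: the getrusage() instrumentation Robert suggests — snapshot at startup and shutdown, print the difference in user and system time — can be sketched as below. This is a stand-alone illustration of the POSIX call, not the actual patch used to produce the "user diff" / "system diff" figures quoted above.]

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds as a double. */
static double
tv_seconds(struct timeval tv)
{
    return (double) tv.tv_sec + (double) tv.tv_usec / 1e6;
}

/* Capture resource usage for the current process. */
static struct rusage
usage_snapshot(void)
{
    struct rusage ru;

    getrusage(RUSAGE_SELF, &ru);
    return ru;
}

/* Report user/system CPU consumed between two snapshots, mirroring the
 * "user diff: ...s system diff: ...s" format quoted in the results. */
static void
usage_report(struct rusage start, struct rusage end)
{
    printf("user diff: %.2fs system diff: %.2fs\n",
           tv_seconds(end.ru_utime) - tv_seconds(start.ru_utime),
           tv_seconds(end.ru_stime) - tv_seconds(start.ru_stime));
}
```

Taking the first snapshot at process start and the second at exit measures absolute CPU consumed, independent of wall-clock duration or transaction count — which is exactly why it answers the percentage-utilization objection.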
Re: [HACKERS] Small TRUNCATE glitch
Bruce Momjian writes: > On Wed, Dec 10, 2014 at 10:32:42AM +0200, Heikki Linnakangas wrote: >> >I don't think we need to have 2PC files survive a pg_upgrade. It seems >> >perfectly okay to remove them from the new cluster at some appropriate >> >time, *if* they are copied from the old cluster at all (I don't think >> >they should be.) >> >> I think pg_upgrade should check if there are any prepared >> transactions pending, and refuse to upgrade if there are. It could >> be made to work, but it's really not worth the trouble. If there are >> any pending prepared transactions in the system when you run >> pg_upgrade, it's more likely to be a mistake or oversight in the >> first place, than on purpose. > > pg_upgrade already checks for prepared transactions and errors out if > they exist; see check_for_prepared_transactions(). Alright, that's good to know. So I'm just adding a new bool field to the 2PC pgstat record instead of the bit magic. Attached is v0.2, now with a regression test included. -- Alex >From 4c8fae27ecd9c94e7c3da0997f03099045a152d9 Mon Sep 17 00:00:00 2001 From: Alex Shulgin Date: Tue, 9 Dec 2014 16:35:14 +0300 Subject: [PATCH] WIP: track TRUNCATEs in pgstat transaction stats. The n_live_tup and n_dead_tup counters need to be set to 0 after a TRUNCATE on the relation. We can't issue a special message to the stats collector because the xact might be later aborted, so we track the fact that the relation was truncated during the xact (and reset this xact's insert/update/delete counters). When xact is committed, we use the `truncated' flag to reset the n_live_tup and n_dead_tup counters. 
--- src/backend/commands/tablecmds.c | 2 + src/backend/postmaster/pgstat.c| 52 - src/include/pgstat.h | 3 + src/test/regress/expected/truncate.out | 136 + src/test/regress/sql/truncate.sql | 98 5 files changed, 288 insertions(+), 3 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c new file mode 100644 index 1e737a0..192d033 *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** ExecuteTruncate(TruncateStmt *stmt) *** 1224,1229 --- 1224,1231 */ reindex_relation(heap_relid, REINDEX_REL_PROCESS_TOAST); } + + pgstat_count_heap_truncate(rel); } /* diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c new file mode 100644 index c7f41a5..88c83d2 *** a/src/backend/postmaster/pgstat.c --- b/src/backend/postmaster/pgstat.c *** typedef struct TwoPhasePgStatRecord *** 201,206 --- 201,207 PgStat_Counter tuples_deleted; /* tuples deleted in xact */ Oid t_id; /* table's OID */ bool t_shared; /* is it a shared catalog? */ + bool t_truncated; /* was the relation truncated? 
*/ } TwoPhasePgStatRecord; /* *** pgstat_count_heap_delete(Relation rel) *** 1864,1869 --- 1865,1894 } /* + * pgstat_count_heap_truncate - update tuple counters due to truncate + */ + void + pgstat_count_heap_truncate(Relation rel) + { + PgStat_TableStatus *pgstat_info = rel->pgstat_info; + + if (pgstat_info != NULL) + { + /* We have to log the effect at the proper transactional level */ + int nest_level = GetCurrentTransactionNestLevel(); + + if (pgstat_info->trans == NULL || + pgstat_info->trans->nest_level != nest_level) + add_tabstat_xact_level(pgstat_info, nest_level); + + pgstat_info->trans->tuples_inserted = 0; + pgstat_info->trans->tuples_updated = 0; + pgstat_info->trans->tuples_deleted = 0; + pgstat_info->trans->truncated = true; + } + } + + /* * pgstat_update_heap_dead_tuples - update dead-tuples count * * The semantics of this are that we are reporting the nontransactional *** AtEOXact_PgStat(bool isCommit) *** 1927,1932 --- 1952,1959 tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted; if (isCommit) { + tabstat->t_counts.t_truncated = trans->truncated; + /* insert adds a live tuple, delete removes one */ tabstat->t_counts.t_delta_live_tuples += trans->tuples_inserted - trans->tuples_deleted; *** AtEOSubXact_PgStat(bool isCommit, int ne *** 1991,1999 { if (trans->upper && trans->upper->nest_level == nestDepth - 1) { ! trans->upper->tuples_inserted += trans->tuples_inserted; ! trans->upper->tuples_updated += trans->tuples_updated; ! trans->upper->tuples_deleted += trans->tuples_deleted; tabstat->trans = trans->upper; pfree(trans); } --- 2018,2037 { if (trans->upper && trans->upper->nest_level == nestDepth - 1) { ! if (trans->truncated) ! { ! trans->upper->truncated = true; ! /* replace upper xact stats with ours */ ! trans->upper->tuples_inserted = trans->tuples_inserted; !
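[Editor's note: the transactional behaviour the patch implements — TRUNCATE zeroes the current xact's insert/update/delete counts and sets a flag that only takes effect at commit, so an aborted truncate changes nothing — can be modeled in a few lines. The structure below is a deliberately simplified sketch of that bookkeeping, not the pgstat API; names are illustrative.]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified per-table transaction counters, modeled on the patch. */
typedef struct xact_tabstat
{
    long tuples_inserted;
    long tuples_deleted;
    bool truncated;
    long n_live_tup;            /* the collector's persistent counter */
} xact_tabstat;

static void count_insert(xact_tabstat *ts) { ts->tuples_inserted++; }
static void count_delete(xact_tabstat *ts) { ts->tuples_deleted++; }

/* TRUNCATE: discard this xact's counts so far and remember the fact. */
static void
count_truncate(xact_tabstat *ts)
{
    ts->tuples_inserted = 0;
    ts->tuples_deleted = 0;
    ts->truncated = true;
}

/* At commit, a truncate resets the live-tuple count before the deltas
 * accumulated after the truncate are applied; an abort changes nothing. */
static void
at_eoxact(xact_tabstat *ts, bool is_commit)
{
    if (is_commit)
    {
        if (ts->truncated)
            ts->n_live_tup = 0;
        ts->n_live_tup += ts->tuples_inserted - ts->tuples_deleted;
    }
    ts->tuples_inserted = 0;
    ts->tuples_deleted = 0;
    ts->truncated = false;
}
```

The key point is that no message is sent at TRUNCATE time: the flag rides along with the xact's stats and is only honored if the xact actually commits, which is why the flag also has to survive into the two-phase record.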
Re: [HACKERS] PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Tomas Vondra wrote:
> back when we were discussing the hashjoin patches (now committed),
> Robert proposed that maybe it'd be a good idea to sometimes increase the
> number of tuples per bucket instead of batching.
>
> That is, while initially sizing the hash table - if the hash table with
> enough buckets to satisfy NTUP_PER_BUCKET load factor does not fit into
> work_mem, try a bit higher load factor before starting to batch.
>
> Attached patch is an initial attempt to implement this - it's a bit
> rough on the edges, but hopefully enough to judge the benefit of this.
>
> The default load factor is 1. The patch tries to degrade this to 2, 4 or
> 8 in attempt to fit the hash table into work mem. If it doesn't work, it
> starts batching with the default load factor. If the batching is
> required while the hashjoin is running, the load factor is switched back
> to the default one (if we're batching, there's no point in keeping the
> slower hash table).
>
> The patch also modifies explain output, to show the load factor.
>
> The testing I've done so far are rather disappointing, though.
>
> create table a as select i from generate_series(1,100) s(i);
> create table b as select i from generate_series(1,100) s(i);
>
> analyze a;
> analyze b;
>
> select a.i, b.i from a join b on (a.i = b.i);
>
> work_mem | batches | tuples per bucket | duration
> ---------+---------+-------------------+---------
> 64 MB    | 1       | 1                 | 585 ms
> 46 MB    | 1       | 2                 | 639 ms
> 43 MB    | 1       | 4                 | 794 ms
> 40 MB    | 1       | 8                 | 1075 ms
> 39 MB    | 2       | 1                 | 623 ms
>
> So, even increasing the load factor to 2 is slower than batching.

Right, I saw the same thing. I tried pretty hard to create a case where increasing the load factor from 1 to 2 was faster than going to a second batch, and was unable to do so. I hypothesized that the second batch was cached by the OS and flipping the data in and out of the OS cache was faster than chasing through the links.
I expect that if you have a large enough hashtable to barely exceed your machine's ability to cache, you *might* see a benefit in the hash table access by increasing the load factor. I think it would be incredibly hard to accurately identify those cases, and every time a hash table was used there would be a cost to trying to figure it out. I just can't see this being a win.

> I'm not sure what's the best way forward - clearly, for cases that fit
> into RAM (temp files into page cache), batching is faster. For tables
> large enough to cause a lot of I/O, it may make a difference - but I'm
> not sure how to identify these cases.
>
> So ISTM implementing this requires a reliable way to identify which case
> we're dealing with - if the outer table is large enough (prefer
> increasing load factor) or not (prefer batching). Until then keeping the
> current simple/predictible approach is probably better.
>
> Of course, this also depends on additional variables (e.g. is the system
> memory-stressed?).

All that is on-target, but my take-away is that increasing the load factor to avoid a second batch was an interesting idea that turns out to not really be a good one. If someone can craft a reproducible test case that demonstrates a win for increasing the load factor that doesn't have significant cost for cases where it isn't a win, I might change my opinion; but count me as a skeptic.

-- Kevin Grittner EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
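[Editor's note: the trade-off being measured above — more tuples per bucket means longer chains to walk on every probe — shows up even in a toy model: halving the bucket count doubles the expected chain length. The sketch below is illustrative only and has nothing to do with PostgreSQL's actual hashjoin code; names are made up.]

```c
#include <assert.h>
#include <stdlib.h>

/* Toy chained hash table over int keys; 'nbuckets' plays the role of
 * the hashjoin bucket array sized from NTUP_PER_BUCKET. */
typedef struct entry
{
    int           key;
    struct entry *next;
} entry;

typedef struct
{
    entry **buckets;
    int     nbuckets;
} htab;

static htab *
htab_create(int nbuckets)
{
    htab *h = malloc(sizeof(htab));

    h->nbuckets = nbuckets;
    h->buckets = calloc(nbuckets, sizeof(entry *));
    return h;
}

/* Push an entry onto the head of its bucket's chain. */
static void
htab_insert(htab *h, entry *e)
{
    int b = (int) ((unsigned) e->key % (unsigned) h->nbuckets);

    e->next = h->buckets[b];
    h->buckets[b] = e;
}

/* Probe for 'key', counting chain links followed -- the per-probe cost
 * that grows with the load factor. */
static int
htab_probe_cost(htab *h, int key)
{
    int cost = 0;

    for (entry *e = h->buckets[(unsigned) key % (unsigned) h->nbuckets];
         e != NULL; e = e->next)
    {
        cost++;
        if (e->key == key)
            break;
    }
    return cost;
}
```

With N tuples over N buckets every probe is about one hop; over N/2 buckets the average rises to about 1.5-2 hops, which is the extra pointer-chasing the measurements attribute to the higher load factor, to be weighed against the cost of writing and re-reading a second batch.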
Re: [HACKERS] logical column ordering
On Wed, Dec 10, 2014 at 12:17 AM, Tom Lane wrote: > Alvaro Herrera writes: >> Andrew Dunstan wrote: >>> I seriously doubt it, although I could be wrong. Unless someone can show a >>> significant performance gain from using physical order, which would be a bit >>> of a surprise to me, I would just stick with logical ordering as the >>> default. > >> Well, we have an optimization that avoids a projection step IIRC by >> using the "physical tlist" instead of having to build a tailored one. I >> guess the reason that's there is because somebody did measure an >> improvement. Maybe it *is* worth having as an option for pg_dump ... > > The physical tlist thing is there because it's demonstrable that > ExecProject() takes nonzero time. COPY does not go through ExecProject > though. What's more, it already has code to deal with a user-specified > column order, and nobody's ever claimed that that code imposes a > measurable performance overhead. Also, if we're adding options to use the physical rather than the logical column ordering in too many places, that's probably a sign that we need to rethink this whole concept. The concept of a logical column ordering doesn't have much meaning if you're constantly forced to fall back to some other column ordering whenever you want good performance. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
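The distinction being drawn here — a physical storage order versus a user-visible logical order — amounts to a permutation applied at output time, which is the projection step ExecProject would otherwise be doing. A minimal sketch (names are illustrative, not PostgreSQL's):

```python
def to_logical(physical_row, logical_to_physical):
    """Map a row stored in physical column order into logical order.
    logical_to_physical[i] is the physical slot of logical column i."""
    return tuple(physical_row[p] for p in logical_to_physical)

# Columns declared as (name text, id int) but stored int-first,
# e.g. to pack fixed-width attributes together:
row_physical = (42, "alice")
print(to_logical(row_physical, [1, 0]))  # logical order: ("alice", 42)
```

The "physical tlist" optimization skips this permutation entirely when the output order happens to match storage order, which is why it only helps if the two orders coincide.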
Re: [HACKERS] advance local xmin more aggressively
On 12/09/2014 10:35 PM, Robert Haas wrote: On Mon, Dec 8, 2014 at 9:31 AM, Robert Haas wrote: On Mon, Dec 8, 2014 at 4:56 AM, Heikki Linnakangas wrote: I don't immediately see the problem either, but I have to say that grovelling through all the resource owners seems ugly anyway. Resource owners are not meant to be traversed like that. And there could be a lot of them, and you have to visit every one of them. That could get expensive if there are a lot of resource owners. 1. I don't really see why resource owners shouldn't be traversed like that. They are clearly intended to form a hierarchy, and there's existing code that recurses through the hierarchy from a given level downward. What's ugly about that? I can't exactly put my finger on it, but it just feels wrong from a modularity point of view. Their purpose is to make sure that we don't leak resources on abort, by allowing an easy "release everything" operation. It's not designed for finding objects based on some other criteria. There is similar double bookkeeping of many other things that are tracked by resource owners. Heavy-weight locks are tracked by LOCALLOCK structs, buffer pins in the PrivateRefCount array, etc. Those things existed before resource owners were invented, but if we were starting from scratch, that design would still make sense, as different objects have different access criteria. fd.c needs to be able to find the least-recently-used open file, for example, and you need to find the snapshot with the lowest xmin. Upthread, I suggested keeping a tally of the number of snapshots with the advertised xmin and recomputing the xmin to advertise only when it reaches 0. This patch doesn't implement that optimization, but it does have code that aborts the traversal of the resource owner hierarchy as soon as we see an xmin that will preclude advancing our advertised xmin.
Releasing N resource owners could therefore cost O(N^2) in the worst case, but note that releasing N resource owners is *already* an O(N^2) operation in the worst case, because the list of children of a particular parent resource owner is singly linked, and thus deleting a resource owner is O(N). It's been that way for an awfully long time without anyone complaining, probably because (a) it's not particularly common to have large numbers of cursors open simultaneously and (b) even if you do have that, the constant factor is pretty low. I think you're confusing the N and the M above. It's true that deleting a resource owner is O(M), where M is the number of children of that resource owner. It does not follow that releasing N resource owners is O(N^2), where N is the number of resource owners released. Calling ResourceOwnerDelete(x) will only visit each resource owner in that tree once, so it's O(N), where N is the total number of resource owners in the tree. I did some testing of the worst-case scenario. The attached script first creates a lot of cursors, then a lot of savepoints, and finally closes the cursors in FIFO order. When the number of savepoints is high enough, this actually segfaults with your patch, because you run out of stack space when recursing into the subxact resource owners. That's hardly this patch's fault; I'm actually surprised it doesn't crash without it, because we recurse into all resource owners in ResourceOwnerRelease too. Apparently the subxacts are closed in LIFO order at commit, but there might be other cases where you could trigger that. In any case, a stack-depth check would be nice. - Heikki cursors.sh Description: Bourne shell script
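The traversal being debated — walk the resource-owner tree inspecting snapshot xmins, and abort early once advancing is ruled out — can be sketched with an iterative walk, which also sidesteps the stack-depth crash that deep recursion hit. Class and function names here are illustrative stand-ins, not the backend's actual structures:

```python
class ResourceOwner:
    """Illustrative stand-in for a backend ResourceOwner node."""
    def __init__(self, snapshots=(), children=()):
        self.snapshot_xmins = list(snapshots)
        self.children = list(children)

def min_snapshot_xmin(owner, advertised_xmin):
    """Walk the resource-owner tree collecting snapshot xmins; abort as soon
    as one equals the advertised xmin, since the advertised value then
    cannot advance anyway. An explicit stack replaces recursion so that
    deep trees (e.g. many nested subxacts) cannot overflow the call stack."""
    best = None
    stack = [owner]
    while stack:
        o = stack.pop()
        for x in o.snapshot_xmins:
            if x == advertised_xmin:
                return advertised_xmin  # early abort: no advance possible
            if best is None or x < best:
                best = x
        stack.extend(o.children)
    return best
```

Each node is visited at most once, so a full walk is O(N) in the number of resource owners, matching Heikki's point about ResourceOwnerDelete.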
Re: [HACKERS] GSSAPI, SSPI - include_realm default
* Bruce Momjian (br...@momjian.us) wrote: > On Tue, Dec 9, 2014 at 05:40:35PM -0500, Stephen Frost wrote: > > > I thought the idea was to backpatch documentation saying "it's a good idea > > > to change this value to x because of y". Not actually referring to the > > > upcoming change directly. And I still think that part is a good idea, as > > > it > > > helps people avoid potential security pitfalls. > > > > I agree with this but I don't really see why we wouldn't say "hey, this > > is going to change in 9.5." Peter's argument sounds like he'd rather we > > not make any changes to the existing documentation, and I don't agree > > with that, and if we're making changes then, imv, we might as well > > comment that the default is changed in 9.5. > > I agree with Peter --- it is unwise to reference a future released > feature in a backbranch doc patch. Updating the backbranch docs to add > a recommendation is fine. Alright, I don't agree but it's not worth the argument. I'll work on the doc-update patch for the back-branches. Thanks, Stephen
Re: [HACKERS] Small TRUNCATE glitch
On Wed, Dec 10, 2014 at 10:32:42AM +0200, Heikki Linnakangas wrote: > >I don't think we need to have 2PC files survive a pg_upgrade. It seems > >perfectly okay to remove them from the new cluster at some appropriate > >time, *if* they are copied from the old cluster at all (I don't think > >they should be.) > > I think pg_upgrade should check if there are any prepared > transactions pending, and refuse to upgrade if there are. It could > be made to work, but it's really not worth the trouble. If there are > any pending prepared transactions in the system when you run > pg_upgrade, it's more likely to be a mistake or oversight in the > first place, than on purpose. pg_upgrade already checks for prepared transactions and errors out if they exist; see check_for_prepared_transactions(). -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] GSSAPI, SSPI - include_realm default
On Tue, Dec 9, 2014 at 05:40:35PM -0500, Stephen Frost wrote: > > I thought the idea was to backpatch documentation saying "it's a good idea > > to change this value to x because of y". Not actually referring to the > > upcoming change directly. And I still think that part is a good idea, as it > > helps people avoid potential security pitfalls. > > I agree with this but I don't really see why we wouldn't say "hey, this > is going to change in 9.5." Peter's argument sounds like he'd rather we > not make any changes to the existing documentation, and I don't agree > with that, and if we're making changes then, imv, we might as well > comment that the default is changed in 9.5. I agree with Peter --- it is unwise to reference a future released feature in a backbranch doc patch. Updating the backbranch docs to add a recommendation is fine. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] Compression of full-page-writes
On Tue, Dec 9, 2014 at 2:15 PM, Amit Kapila wrote: > On Mon, Dec 8, 2014 at 3:17 PM, Simon Riggs wrote: > > > > On 8 December 2014 at 11:46, Michael Paquier > wrote: > > > I don't really like those new names, but I'd prefer > > > wal_compression_level if we go down that road with 'none' as default > > > value. We may still decide in the future to support compression at the > > > record level instead of context level, particularly if we have an API > > > able to do palloc_return_null_at_oom, so the idea of WAL compression > > > is not related only to FPIs IMHO. > > > > We may yet decide, but the pglz implementation is not effective on > > smaller record lengths. Nor has any testing been done to show that is > > even desirable. > > > > It's even much worse for non-compressible (or less-compressible) > WAL data. I am not clear how a simple on/off switch > could address such cases, because the benefit can depend on which > tables the user is operating on (the schema or data in some tables > are more amenable to compression, in which case it can give us > benefits). I think maybe we should consider something along the lines > of what Robert touched on in one of his e-mails (a context-aware > compression strategy).
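One cheap guard against the non-compressible case raised here is to compress a page image speculatively and keep the result only if it actually shrinks, which is roughly the bail-out behavior pglz has internally. A sketch using zlib as a stand-in for pglz (the function name is invented for illustration):

```python
import os
import zlib

def maybe_compress_page(page: bytes):
    """Speculatively compress a full-page image; fall back to the raw
    bytes when compression does not shrink the data. Returns the bytes
    to store plus a flag recording which representation was kept."""
    compressed = zlib.compress(page)
    if len(compressed) < len(page):
        return compressed, True
    return page, False

# A zeroed 8 kB page compresses dramatically...
body, compressed = maybe_compress_page(b"\x00" * 8192)
# ...while random-looking data comes back verbatim, so only the
# (wasted) compression CPU is lost, never storage space.
body2, compressed2 = maybe_compress_page(os.urandom(8192))
```

This guard bounds the storage downside of incompressible WAL, but as Amit notes it does not recover the CPU spent attempting compression, which is where a schema- or context-aware policy would have to come in.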
> So, I have been doing some measurements using the patch compressing FPWs and had a look at the transaction latency using pgbench -P 1 with those parameters on my laptop: shared_buffers=512MB checkpoint_segments=1024 checkpoint_timeout = 5min fsync=off A checkpoint was executed just before each 20-min run, so at least 3 checkpoints kicked in during each measurement, roughly like this: pgbench -i -s 100 psql -c 'checkpoint;' date > ~/report.txt pgbench -P 1 -c 16 -j 16 -T 1200 2>> ~/report.txt & 1) Compression of FPW: latency average: 9.007 ms latency stddev: 25.527 ms tps = 1775.614812 (including connections establishing) Here is the latency when a checkpoint that wrote 28% of the buffers began (570s): progress: 568.0 s, 2000.9 tps, lat 8.098 ms stddev 23.799 progress: 569.0 s, 1873.9 tps, lat 8.442 ms stddev 22.837 progress: 570.2 s, 1622.4 tps, lat 9.533 ms stddev 24.027 progress: 571.0 s, 1633.4 tps, lat 10.302 ms stddev 27.331 progress: 572.1 s, 1588.4 tps, lat 9.908 ms stddev 25.728 progress: 573.1 s, 1579.3 tps, lat 10.186 ms stddev 25.782 All the other checkpoints have the same profile, meaning that the transaction latency increases by roughly 1.5~2 ms, to 10.5~11 ms.
2) No compression of FPW: latency average: 8.507 ms latency stddev: 25.052 ms tps = 1870.368880 (including connections establishing) Here is the latency for a checkpoint that wrote 28% of buffers: progress: 297.1 s, 1997.9 tps, lat 8.112 ms stddev 24.288 progress: 298.1 s, 1990.4 tps, lat 7.806 ms stddev 21.849 progress: 299.0 s, 1986.9 tps, lat 8.366 ms stddev 22.896 progress: 300.0 s, 1648.1 tps, lat 9.728 ms stddev 25.811 progress: 301.0 s, 1806.5 tps, lat 8.646 ms stddev 24.187 progress: 302.1 s, 1810.9 tps, lat 8.960 ms stddev 24.201 progress: 303.0 s, 1831.9 tps, lat 8.623 ms stddev 23.199 progress: 304.0 s, 1951.2 tps, lat 8.149 ms stddev 22.871 Here is another one that began around 600s (20% of buffers): progress: 594.0 s, 1738.8 tps, lat 9.135 ms stddev 25.140 progress: 595.0 s, 893.2 tps, lat 18.153 ms stddev 67.186 progress: 596.1 s, 1671.0 tps, lat 9.470 ms stddev 25.691 progress: 597.1 s, 1580.3 tps, lat 10.189 ms stddev 26.430 progress: 598.0 s, 1570.9 tps, lat 10.089 ms stddev 23.684 progress: 599.2 s, 1657.0 tps, lat 9.385 ms stddev 23.794 progress: 600.0 s, 1665.5 tps, lat 10.280 ms stddev 25.857 progress: 601.1 s, 1571.7 tps, lat 9.851 ms stddev 25.341 progress: 602.1 s, 1577.7 tps, lat 10.056 ms stddev 25.331 progress: 603.0 s, 1600.1 tps, lat 10.329 ms stddev 25.429 progress: 604.0 s, 1593.8 tps, lat 10.004 ms stddev 26.816 Not sure what happened here, the burst has been a bit higher. However roughly the latency was never higher than 10.5ms for the non-compression case. With those measurements I am getting more or less 1ms of latency difference between the compression and non-compression cases when checkpoint show up. Note that fsync is disabled. Also, I am still planning to hack a patch able to compress directly records with a scratch buffer up 32k and see the difference with what I got here. For now, the results are attached. Comments welcome. 
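Eyeballing the two runs line-by-line is error-prone; the per-second progress lines are regular enough to compare mechanically. A small sketch that parses pgbench -P output into numeric tuples (the helper names are my own):

```python
import re

_PROGRESS = re.compile(
    r"progress: ([\d.]+) s, ([\d.]+) tps, lat ([\d.]+) ms stddev ([\d.]+)")

def parse_progress(lines):
    """Return (seconds, tps, latency_ms, stddev_ms) tuples from the
    per-second progress lines emitted by pgbench -P 1."""
    rows = []
    for line in lines:
        m = _PROGRESS.search(line)
        if m:
            rows.append(tuple(float(g) for g in m.groups()))
    return rows

sample = [
    "progress: 570.2 s, 1622.4 tps, lat 9.533 ms stddev 24.027",
    "progress: 571.0 s, 1633.4 tps, lat 10.302 ms stddev 27.331",
]
peak_lat = max(r[2] for r in parse_progress(sample))  # 10.302
```

With both runs parsed this way, the checkpoint windows can be aligned by timestamp and the latency delta computed directly rather than read off by eye.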
-- Michael fpw_results.tar.gz Description: GNU Zip compressed data
Re: [HACKERS] Small TRUNCATE glitch
On 12/10/2014 03:04 AM, Alvaro Herrera wrote: Alex Shulgin wrote: The 2PC part requires extending bool flag to fit the trunc flag, is this approach sane? Given that 2PC transaction should survive server restart, it's reasonable to expect it to also survive the upgrade, so I see no clean way of adding another bool field to the TwoPhasePgStatRecord struct (unless some would like to add checks on record length, etc.). I don't think we need to have 2PC files survive a pg_upgrade. It seems perfectly okay to remove them from the new cluster at some appropriate time, *if* they are copied from the old cluster at all (I don't think they should be.) I think pg_upgrade should check if there are any prepared transactions pending, and refuse to upgrade if there are. It could be made to work, but it's really not worth the trouble. If there are any pending prepared transactions in the system when you run pg_upgrade, it's more likely to be a mistake or oversight in the first place, than on purpose. - Heikki