Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Tue, Jan 14, 2014 at 2:29 PM, Tom Lane wrote:
> One reason I'm thinking this is that whatever we do to ameliorate
> the semantic issues is going to slow down the forward transition
> function --- to no benefit unless the aggregate is being used as a
> window function in a moving window. So I'm less than convinced
> that we *should* implement any of these designs in the default
> aggregates, even if we get to the point where we arguably *could*
> do it with little risk of functional differences.
>
> regards, tom lane

I was thinking about this earlier today and came up with an idea: perhaps aggregates could support two transition functions, one for normal aggregation and for windows with UNBOUNDED PRECEDING, and a second for use when there is a possibility that we would need to perform an inverse transition. This second transition function could do all the extra tracking it needs without having to worry about slowing down normal aggregation. With numeric, that might mean tracking each value's scale as it enters and exits the window frame.

It might even be possible to perform inverse transitions for float and double here: we could just use numeric internally and have the final function convert back to the original type. Though perhaps that would have the side effect of making floating point calculations too accurate, or at least not matching the IEEE standards.

Of course, I'm not suggesting this could be part of this patch, but I thought I'd post the idea while all this is fresh in people's heads.

David
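For what it's worth, the mechanics can be sketched in a few lines. This is a plain Python model with invented names, not PostgreSQL's implementation (the real transition functions are C functions driven by the executor); it shows a second, window-only transition function carrying extra bookkeeping (here, the largest scale seen) that plain forward-only aggregation never pays for:

```python
from decimal import Decimal

class InvertibleSum:
    """Model of the proposed second transition function: it keeps extra
    state (here, the largest scale seen) that the plain forward-only
    transition function would not need to track."""
    def __init__(self):
        self.total = Decimal(0)
        self.max_scale = 0

    def transition(self, value):
        # forward transition: a value enters the window frame
        d = Decimal(str(value))
        self.total += d
        self.max_scale = max(self.max_scale, -d.as_tuple().exponent)

    def inverse(self, value):
        # inverse transition: a value leaves the window frame
        self.total -= Decimal(str(value))

    def result(self):
        return self.total

def moving_sum(values, width):
    """Sliding-window sum using one forward and one inverse transition per
    row, instead of re-aggregating the whole frame for every row."""
    agg, out = InvertibleSum(), []
    for i, v in enumerate(values):
        agg.transition(v)
        if i >= width:
            agg.inverse(values[i - width])
        out.append(agg.result())
    return out
```

Note that tracking only the maximum scale is not enough to let the scale shrink again once a high-scale value leaves the frame; that is exactly why the extra per-value tracking suggested above would be needed.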
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Thu, Jan 16, 2014 at 9:45 AM, Florian Pflug wrote:
> BTW, AVG() and STDDEV() have the same issue. The problem is just partially
> masked by the division by N (or N-1) at the end, because we always emit at
> least 16 fractional digits when dividing. So you have to have an input
> value with a larger scale than that to trigger it.
>
> For the following query
>
> select avg(x) over (order by i rows between current row and 1 following)
> from (values
> (1,1), (2,1), (3,0.1), (4,1), (5,1)
> ) t(i,x);
>
> 9.3 returns
> avg
> -
> 1.
> 0.50001
> 0.50001
> 1.
> 1.
>
> but HEAD+patch returns
> avg
> -
> 1.
> 0.50001
> 0.50001
> 1.0
> 1.0

Uhhh, that's bad news indeed. That means I'll need to remove not only all inverse transition functions for all aggregates on numeric types, but also avg for int types and the stddev* functions for everything, since they internally use numeric. I guess that only leaves SUM for smallint, int, bigint, cash and interval, along with count(exp) and count(*)...

> I have to admit that I'm *very* tempted to suggest we simply ignore this -
> but that *would* mean accepting that windowed aggregates are non-
> deterministic in the sense that their result (even if only in the number
> of trailing zeros) depends on values outside of the frame. Which, I guess,
> is a box that best stays closed...

Yeah, I can understand the temptation, but I agree we can't go changing results.

> I'm currently thinking the best way forward is to get a basic patch
> without any NUMERIC stuff committed, and to revisit this after that's done.

Agreed... I'll warm up my delete key.

Regards

David Rowley

> best regards,
> Florian Pflug
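The trailing-zeros effect is easy to reproduce outside the database with Python's Decimal, which carries a scale the way numeric does. This is only a model, not PostgreSQL's numeric code, and the digit strings differ from the archive's truncated output, but the shape of the failure is the same: the last two frames contain only 1s, yet the inverse-transition path remembers the scale of the 0.1 that already left the frame:

```python
from decimal import Decimal

vals = [Decimal('1'), Decimal('1'), Decimal('0.1'), Decimal('1'), Decimal('1')]

def avg_rescan(i):
    # recompute the frame (current row and 1 following) from scratch
    frame = vals[i:i + 2]
    return sum(frame, Decimal(0)) / len(frame)

def avgs_inverse():
    # slide one running sum: subtract the leaving value, add the entering one
    out, total = [], vals[0] + vals[1]
    for i in range(len(vals)):
        frame = vals[i:i + 2]
        out.append(total / len(frame))
        total -= vals[i]                  # value leaving the frame
        if i + 2 < len(vals):
            total += vals[i + 2]          # value entering the next frame
    return out
```

Rescanning yields ['1', '0.55', '0.55', '1', '1'], while the sliding sum yields ['1', '0.55', '0.55', '1.0', '1.0']: the results differ only in trailing zeros, but they do differ, and the difference depends on a value outside the frame.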
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Wed, Jan 15, 2014 at 8:23 PM, Peter Geoghegan wrote:
> I have an idea of what I could do to fix this, but I don't have time
> to make sure that my hunch is correct.

It might just be a matter of:

@@ -186,6 +186,13 @@ ExecLockHeapTupleForUpdateSpec(EState *estate,
 	switch (test)
 	{
 		case HeapTupleInvisible:
+			/*
+			 * Tuple may have originated from this command, in which case it's
+			 * already locked
+			 */
+			if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple.t_data)) &&
+				HeapTupleHeaderGetCmin(tuple.t_data) == estate->es_output_cid)
+				return true;
 			/* Tuple became invisible; try again */
 			if (IsolationUsesXactSnapshot())
 				ereport(ERROR,

-- 
Peter Geoghegan

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Thu, Jan 16, 2014 at 3:47 PM, Florian Pflug wrote:
> BTW, as it stands, the patch currently uses the inverse transition
> function only when *all* the aggregates belonging to one window clause
> provide one. After working with the code a bit, I think that a bit of
> reshuffling would allow that distinction to be made on a per-aggregate
> basis. The question is, is it worth it?

I didn't think it was all that worth it, since we'd still need to process every tuple in the frame's scope for each aggregate that has no inverse transition function; of course, there would be slightly less work to do in such cases. I just assumed that the workloads which benefit from inverse transitions would have many rows, rather than many aggregates over few rows. To implement it, the "aggregated up to" marker variables would just need to be kept per aggregate rather than per window. If you think it would be worth it, I can modify the patch to work that way.

Regards

David Rowley

> best regards,
> Florian Pflug
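The per-aggregate distinction can be sketched like this (a plain Python model with invented names; in the patch, the state and the "aggregated up to" markers live in the window aggregate machinery, not in a dict):

```python
def eval_window(values, width, aggs):
    """aggs maps a name to (forward, inverse_or_None, initial).  Aggregates
    that provide an inverse slide their state forward row by row; the
    others rescan their whole frame for every row."""
    out = {name: [] for name in aggs}
    state = {name: spec[2] for name, spec in aggs.items()}
    for i, v in enumerate(values):
        for name, (fwd, inv, init) in aggs.items():
            if inv is not None:
                state[name] = fwd(state[name], v)
                if i - width >= 0:                   # a row left the frame
                    state[name] = inv(state[name], values[i - width])
                out[name].append(state[name])
            else:
                s = init                             # no inverse: rescan
                for x in values[max(0, i - width + 1): i + 1]:
                    s = fwd(s, x)
                out[name].append(s)
    return out
```

With the all-or-nothing behaviour, one aggregate without an inverse (max, say) forces every other aggregate in the window clause to rescan too; per-aggregate markers would let an invertible sum keep sliding alongside it.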
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
On 01/16/2014 02:53 AM, Josh Berkus wrote: > Hackers, > > ALTER SYSTEM SET has been committed and recovery.conf GUCs are being > reviewed. I'm going to make a last case for conf.d in relation to these > two patches before 9.4 goes out the door. Personally, I'd find conf.d greatly more useful when enabled by default. The need to hack postgresql.conf to use it eliminates half the point of having it. I haven't seen any compelling reason why conf.d should NOT be enabled by default. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
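For context, opting in today is a one-line change in postgresql.conf (the include_dir directive exists since 9.3; the directory name shown is just the conventional one, and the proposal here is that it ship uncommented):

```
# postgresql.conf
include_dir = 'conf.d'   # read every *.conf file in this directory, in file-name order
```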
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
On Thu, Jan 16, 2014 at 1:31 AM, Josh Berkus wrote: > On 01/15/2014 11:10 AM, Stephen Frost wrote: > >> Independent of the above, I don't agree with that. postgresql.auto.conf >> is a special thing and should have its own special place. For one >> thing, when putting configuration files in a separate directory >> structure (/etc/ vs /var), then postgresql.auto.conf should stay in the >> data directory. > > Ah, I'd forgotten about that line of argument. Where is auto.conf now? Data directory. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
Hi MauMau,

Ah, sorry, I missed that part. As NTFS junctions and symbolic links are different (although they behave similarly), there seems to be only a minor inconvenience related to the misleading error message, i.e.

+ #ifdef WIN32
+ if (rmdir(linkloc) < 0 && errno != ENOENT)
+ #else
  if (unlink(linkloc) < 0 && errno != ENOENT)
+ #endif
  ereport(ERROR,
  (errcode_for_file_access(),
   errmsg("could not remove symbolic link \"%s\": %m",

What is your opinion about it; is it not worth changing?

Thanks.

On Wed, Jan 15, 2014 at 7:42 PM, MauMau wrote:
> From: "Asif Naeem"
>> As you have followed the destroy_tablespace_directories() function, is there any
>> specific reason not to use the same logic to detect the type of the file/link,
>> i.e. "(lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))"? It also seems to have
>> a more appropriate error message, i.e.
>
> Thanks for reviewing and testing the patch. Yes, at first I did what you
> mentioned, but modified the patch according to some advice in the mail
> thread. During redo, create_tablespace_directories() needs to handle the
> case where $PGDATA/pg_tblspc/xxx is not a symlink but a directory even
> on UNIX/Linux. Please see TablespaceCreateDbspace().
> destroy_tablespace_directories() doesn't have to handle such a situation.
>
> Regards
> MauMau
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Thu, Jan 16, 2014 at 12:49 AM, Robert Haas wrote:
> On Wed, Jan 15, 2014 at 7:28 AM, Amit Kapila wrote:
>> Unpatched
>> ---
>> testname | wal_generated | duration
>> --+--+--
>> one short and one long field, no change | 1054923224 | 33.101135969162
>>
>> After pgrb_delta_encoding_v4
>> -
>> testname | wal_generated | duration
>> --+--+--
>> one short and one long field, no change | 877859144 | 30.6749138832092
>>
>> Temporary Changes
>> (Revert Max Chunksize = 4 and logic of finding longer match)
>> -
>> testname | wal_generated | duration
>> --+--+--
>> one short and one long field, no change | 677337304 | 25.4048750400543
>
> Sure, but watch me not care.
>
> If we're interested in taking advantage of the internal
> compressibility of tuples, we can do a lot better than this patch. We
> can compress the old tuple and the new tuple. We can compress
> full-page images. We can compress inserted tuples. But that's not
> the point of this patch.
>
> The point of *this* patch is to exploit the fact that the old and new
> tuples are likely to be very similar, NOT to squeeze out every ounce
> of compression from other sources.

Okay, got your point.

One other minor thing: in the latest patch, which I sent yesterday, I modified chunk formation so that if there is data at the end of the string which doesn't contain the special pattern and is less than the max chunk size, we still consider it a chunk. The reason for doing this is that if we have, say, a 104-byte string containing no special bit pattern, it would otherwise produce just one 64-byte chunk and leave the remaining bytes, which might miss the chance of compressing that data.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
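The end-of-string case can be modelled as follows. This is a simplified Python sketch, not the patch's code: the real patch decides chunk boundaries with a bit-pattern rule over raw tuple bytes, and the `is_special` predicate here is just a stand-in for that rule:

```python
MAX_CHUNK = 64

def split_chunks(data, is_special=lambda b: b == 0x00):
    """Cut a byte string into chunks at 'special' bytes or at MAX_CHUNK,
    and keep any short tail as a final chunk so it can still be matched
    against the old tuple during delta encoding."""
    chunks, start = [], 0
    for i, b in enumerate(data):
        if is_special(b) or i - start + 1 == MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])   # remainder shorter than MAX_CHUNK
    return chunks
```

With the tail kept, a patternless 104-byte string yields chunks of 64 and 40 bytes; without the final step it would yield only the 64-byte chunk, and the remaining 40 bytes could never participate in a match.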
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Tue, Jan 14, 2014 at 3:07 AM, Heikki Linnakangas wrote:
>> Right, but with your approach, can you really be sure that you have
>> the right rejecting tuple ctid (not reject)? In other words, as you
>> wait for the exclusion constraint to conclusively indicate that there
>> is a conflict, minutes may have passed in which time other conflicts
>> may emerge in earlier unique indexes. Whereas with an approach where
>> values are locked, you are guaranteed that earlier unique indexes have
>> no conflicting values. Maintaining that property seems useful, since
>> we check in a well-defined order, and we're still projecting a ctid.
>> Unlike when row locking is involved, we can make no assumptions or
>> generalizations around where conflicts will occur. Although that may
>> also be a general concern with your approach when row locking, for
>> multi-master replication use-cases. There may be some value in knowing
>> it cannot have been earlier unique indexes (and so the existing values
>> for those unique indexes in the locked row should stay the same -
>> don't many conflict resolution policies work that way?).
>
> I don't understand what you're saying. Can you give an example?
>
> In the use case I was envisioning above, ie. you insert N rows, and if any
> of them violate a constraint, you still want to insert the non-violating
> ones instead of rolling back the whole transaction, you don't care. You don't
> care what existing rows the new rows conflicted with.
>
> Even if you want to know what you conflicted with, I can't make sense of
> what you're saying. In the btreelock approach, the value locks are
> immediately released once you discover that there's a conflict. So by the time
> you get to do anything with the ctid of the existing tuple you conflicted
> with, new conflicting tuples might've appeared.

That's true, but at least the timeframe in which an additional conflict may occur on just-locked index values is bound to more or less an instant.
In any case how important this is is an interesting question, and perhaps one that Andres can weigh in on as someone that knows a lot about multi-master replication. This issue is particularly interesting because this testcase appears to make both patches livelock, for reasons that I believe are related: https://github.com/petergeoghegan/upsert/blob/master/torture.sh I have an idea of what I could do to fix this, but I don't have time to make sure that my hunch is correct. I'm travelling tomorrow to give a talk at PDX pug, so I'll have limited access to e-mail. -- Peter Geoghegan
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
* Peter Eisentraut (pete...@gmx.net) wrote: > In my mind, a conf.d directory is an extension of a single-file > configuration, and so it should be handled that way. I'm apparently out on some funny limb with this thought, but I'll throw it out there anyway: in my head, the 'postgresql.auto.conf' thing that essentially ends up included as part of 'postgresql.conf' should be handled the same way a single 'postgresql.conf' or 'conf.d' directory is. Now, I've never particularly agreed with it, but at least on Debian/Ubuntu, the /etc conf directories are owned by the postgres user by default. I dislike the idea of separating the PG config into things in /etc and things in PGDATA as it will make life more difficult for the poor sysadmins trying to figure out what's going on. Thanks, Stephen
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan16, 2014, at 02:32 , Florian Pflug wrote: > On Jan14, 2014, at 17:39 , Florian Pflug wrote: >> On Jan14, 2014, at 11:06 , David Rowley wrote: >>> Here's a patch which removes sum(numeric) and changes the documents a >>> little to remove a reference to using sum(numeric) to workaround the fact >>> that there's no inverse transitions for sum(float). I also made a small >>> change in the aggregates.sql tests to check that the aggfnoid <= . >> >> I've looked over the patch, here a few comments. >> >> For STRICT pairs of transfer and inverse transfer functions we should >> complain if any of them ever return NULL. That can never be correct anyway, >> since a STRICT function cannot undo a non-NULL -> NULL transition. >> >> The same goes for non-STRICT, at least if we ever want to allow an inverse >> transfer function to indicate "Sorry, cannot undo in this case, please >> rescan". If we allowed NULL as just another state value now, we couldn't >> easily undo that later, so we'd rob ourselves of the obvious way for the >> inverse transfer function to indicate this condition to its caller. >> >> The notnullcount machinery seems to apply to both STRICT and non-STRICT >> transfer function pairs. Shouldn't that be constrained to STRICT transfer >> function pairs? For non-STRICT pairs, it's up to the transfer functions to >> deal with NULL inputs however they please, no? > > I modified the patch to do this, and ran into a problem. Currently, > aggregates with state type "internal" cannot have a strict transfer function, > even if they behave like one, i.e. ignore NULL inputs. This is > because the only possible initial value for state type "internal" is NULL, > and it's up to the transfer function to create the state - usually upon > seeing the first non-NULL input. Now, currently that isn't a huge problem - > the transfer function simply has to check for NULL input values itself, and > if it's indeed conceptually strict, it just returns in this case. 
> > With inverse transfer functions, however, each such pair of forward and > inverse transfer functions would also need to duplicate the NULL-counting > logic from nodeWindowAgg.c if it wants to be conceptually strict, i.e. > ignore NULL inputs, but return NULL if there are *only* NULL inputs (or no > inputs at all). That seems like a lot of duplicated code in the long run. > > In essence, what we'd want for strict pairs of forward and inverse transfer > functions is for the forward transfer function to be strict in all arguments > except the state, and the inverse to be strict according to the usual > definition. We can't express that property of the forward transfer function > within pg_proc, but we don't really have to - a local hack in nodeWindowAgg.c > suffices. So what I'm proposing is: > > We allow the forward transfer function to be non-strict even if the inverse > is strict, but only if the initial value is NULL. In that case we behave as > if the forward transfer function was strict, except that upon seeing the > first non-NULL input we call it with a NULL state. The return value must > still be non-NULL in all cases. Ok, I tried this and it worked out quite OK. Updated patch is attached. It passes regression tests, but those currently don't seem to include any aggregates which *don't* ignore NULL values, so that case is probably untested. Also, it allows non-strict forward transfer functions together with strict inverse transfer functions even for non-NULL initial states now. It seemed rather pedantic to forbid this. BTW, as it stands, the patch currently uses the inverse transition function only when *all* the aggregates belonging to one window clause provide one. After working with the code a bit, I think that a bit of reshuffling would allow that distinction to be made on a per-aggregate basis. The question is, is it worth it? 
best regards, Florian Pflug inverse_transition_functions_v2.2_fgp.patch.gz Description: GNU Zip compressed data
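The proposed central handling for conceptually strict pairs can be modelled as follows (a Python sketch with None standing in for NULL; the real logic would live in nodeWindowAgg.c, and the names here are invented):

```python
def conceptually_strict_agg(values, forward, final=lambda s: s):
    """NULL (None) inputs are skipped centrally, so the transition function
    never has to count them itself.  The forward function is called with a
    None state exactly once, for the first non-NULL input, and must return
    a non-None state.  If the frame held no non-NULL inputs at all, the
    result is NULL."""
    state, notnull = None, 0
    for v in values:
        if v is None:
            continue                    # central NULL-skipping
        state = forward(state, v)       # state is None only the first time
        notnull += 1
    return final(state) if notnull else None
```

With this in the window machinery, a pair like sum's forward/inverse functions needs no NULL-counting of its own: it only has to handle the one call where the state is still NULL.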
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 8:41 PM, Jan Kara wrote: > On Wed 15-01-14 10:12:38, Robert Haas wrote: >> On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara wrote: >> > Filesystems could in theory provide facility like atomic write (at least up >> > to a certain size say in MB range) but it's not so easy and when there are >> > no strong usecases fs people are reluctant to make their code more complex >> > unnecessarily. OTOH without widespread atomic write support I understand >> > application developers have similar stance. So it's kind of chicken and egg >> > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place >> > due to its data=journal mode so if someone on the PostgreSQL side wanted to >> > research on this, knitting some experimental ext4 patches should be doable. >> >> Atomic 8kB writes would improve performance for us quite a lot. Full >> page writes to WAL are very expensive. I don't remember what >> percentage of write-ahead log traffic that accounts for, but it's not >> small. > OK, and do you need atomic writes on per-IO basis or per-file is enough? > It basically boils down to - is all or most of IO to a file going to be > atomic or it's a smaller fraction? The write-ahead log wouldn't need it, but data files writes would. So we'd need it a lot, but not for absolutely everything. For any given file, we'd either care about writes being atomic, or we wouldn't. > As Dave notes, unless there is HW support (which is coming with newest > solid state drives), ext4/xfs will have to implement this by writing data > to a filesystem journal and after transaction commit checkpointing them to > a final location. Which is exactly what you do with your WAL logs so > it's not clear it will be a performance win. But it is easy enough to code > for ext4 that I'm willing to try... Yeah, hardware support would be great. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan14, 2014, at 17:39 , Florian Pflug wrote: > On Jan14, 2014, at 11:06 , David Rowley wrote: >> Here's a patch which removes sum(numeric) and changes the documents a little >> to remove a reference to using sum(numeric) to workaround the fact that >> there's no inverse transitions for sum(float). I also made a small change in >> the aggregates.sql tests to check that the aggfnoid <= . > > I've looked over the patch, here a few comments. > > For STRICT pairs of transfer and inverse transfer functions we should > complain if any of them ever return NULL. That can never be correct anyway, > since a STRICT function cannot undo a non-NULL -> NULL transition. > > The same goes for non-STRICT, at least if we ever want to allow an inverse > transfer function to indicate "Sorry, cannot undo in this case, please > rescan". If we allowed NULL as just another state value now, we couldn't > easily undo that later, so we'd rob ourselves of the obvious way for the > inverse transfer function to indicate this condition to its caller. > > The notnullcount machinery seems to apply to both STRICT and non-STRICT > transfer function pairs. Shouldn't that be constrained to STRICT transfer > function pairs? For non-STRICT pairs, it's up to the transfer functions to > deal with NULL inputs however they please, no? I modified the patch to do this, and ran into a problem. Currently, aggregates with state type "internal" cannot have a strict transfer function, even if they behave like one, i.e. ignore NULL inputs. This is because the only possible initial value for state type "internal" is NULL, and it's up to the transfer function to create the state - usually upon seeing the first non-NULL input. Now, currently that isn't a huge problem - the transfer function simply has to check for NULL input values itself, and if it's indeed conceptually strict, it just returns in this case. 
With inverse transfer functions, however, each such pair of forward and inverse transfer functions would also need to duplicate the NULL-counting logic from nodeWindowAgg.c if it wants to be conceptually strict, i.e. ignore NULL inputs, but return NULL if there are *only* NULL inputs (or no inputs at all). That seems like a lot of duplicated code in the long run. In essence, what we'd want for strict pairs of forward and inverse transfer functions is for the forward transfer function to be strict in all arguments except the state, and the inverse to be strict according to the usual definition. We can't express that property of the forward transfer function within pg_proc, but we don't really have to - a local hack in nodeWindowAgg.c suffices. So what I'm proposing is: We allow the forward transfer function to be non-strict even if the inverse is strict, but only if the initial value is NULL. In that case we behave as if the forward transfer function was strict, except that upon seeing the first non-NULL input we call it with a NULL state. The return value must still be non-NULL in all cases. Thoughts? best regards, Florian Pflug
Re: [HACKERS] 9.2.1 & index-only scans : abnormal heap fetches after VACUUM FULL
On 28 November 2013 22:17, Robert Haas wrote: > The fact that you've needed to modify SetHintBits() to make this work > is pretty nasty. I'm not exactly sure what to do about that, but it > doesn't seem good. That makes me feel bad also. I think we should be looking for some special case routines rather than piggybacking there. Wonderful to see you personally and directly addressing this concern and very supportive of this solution for 9.4. I wonder about a user option to let these routines wait until they are the oldest snapshot, so they can then set every page as all-visible without even bothering to check individual items. Or maybe a user option to set them all-visible even without actually being the oldest, as is possible with COPY FREEZE. Full lock on the table is a big thing, so a shame to waste the opportunity to set everything visible just because some long running task is still executing somewhere (but clearly not here otherwise we wouldn't have the lock). -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
On Tue, Jan 14, 2014 at 3:25 PM, Heikki Linnakangas wrote: > Attached is a patch doing that, to again demonstrate what I mean. I'm not > sure if setting xmin to invalid is really the best way to mark the tuple > dead; I don't think a tuple's xmin can currently be InvalidTransaction under > any other circumstances, so there might be some code out there that's not > prepared for it. So using an infomask bit might indeed be better. Or > something else entirely. Have you thought about the implications for other snapshot types (or other tqual.c routines)? My concern is that a client of that infrastructure (either current or future) could spuriously conclude that a heap tuple satisfied it, when in fact only a promise tuple satisfied it. It wouldn't necessarily follow that the promise would be fulfilled, nor that there would be some other proper heap tuple equivalent to that fulfilled promise tuple as far as those clients are concerned. heap_delete() will not call HeapTupleSatisfiesUpdate() when you're deleting a promise tuple, which on the face of it is fine - it's always going to technically be instantaneously invisible, because it's always created by the same command id (i.e. HeapTupleSatisfiesUpdate() would just return HeapTupleInvisible if called). So far so good, but we are technically doing something else quite new - deleting a would-be instantaneously invisible tuple. So like your concern about setting xmin to invalid, my concern is that code may exist that treats cmin < cmax as an invariant. Now, you might think that that would be a manageable concern, and to be fair a look at the ComboCids code that mostly arbitrates that stuff seems to indicate that it's okay, but it's still worth noting. I think you should consider breaking off the relcache parts of my patch and committing them, because they're independently useful. 
If we are going to have a lot of conflicts that need to be handled by a heap_delete(), there is no point in inserting non-unique index tuples for what is not yet conclusively a proper (non-promise) tuple. Those should always come last. And even without upsert, strictly inserting into unique indexes first seems like a useful thing relative to the cost. Unique violations are the cause of many aborted transactions, and there is no need to ever bloat non-unique indexes of the same slot when that happens. -- Peter Geoghegan
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote: > Dave Chinner writes: > > On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote: > >> What we'd really like for checkpointing is to hand the kernel a boatload > >> (several GB) of dirty pages and say "how about you push all this to disk > >> over the next few minutes, in whatever way seems optimal given the storage > >> hardware and system situation. Let us know when you're done." > > > The issue there is that the kernel has other triggers for needing to > > clean data. We have no infrastructure to handle variable writeback > > deadlines at the moment, nor do we have any infrastructure to do > > roughly metered writeback of such files to disk. I think we could > > add it to the infrastructure without too much perturbation of the > > code, but as you've pointed out that still leaves the fact there's > > no obvious interface to configure such behaviour. Would it need to > > be persistent? > > No, we'd be happy to re-request it during each checkpoint cycle, as > long as that wasn't an unduly expensive call to make. I'm not quite > sure where such requests ought to "live" though. One idea is to tie > them to file descriptors; but the data to be written might be spread > across more files than we really want to keep open at one time. It would be a property of the inode, as that is how writeback is tracked and timed. Set and queried through a file descriptor, though - it's basically the same context that fadvise works through. > But the only other idea that comes to mind is some kind of global sysctl, > which would probably have security and permissions issues. (One thing > that hasn't been mentioned yet in this thread, but maybe is worth pointing > out now, is that Postgres does not run as root, and definitely doesn't > want to. So we don't want a knob that would require root permissions > to twiddle.) I have assumed all along that requiring root to do stuff would be a bad thing. 
:) > We could probably live with serially checkpointing data > in sets of however-many-files-we-can-have-open, if file descriptors are > the place to keep the requests. Inodes live longer than file descriptors, but there's no guarantee that they live from one fd context to another. Hence my question about persistence ;) Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 10:12:38AM -0500, Robert Haas wrote: > On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara wrote: > > Filesystems could in theory provide facility like atomic write (at least up > > to a certain size say in MB range) but it's not so easy and when there are > > no strong usecases fs people are reluctant to make their code more complex > > unnecessarily. OTOH without widespread atomic write support I understand > > application developers have similar stance. So it's kind of chicken and egg > > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place > > due to its data=journal mode so if someone on the PostgreSQL side wanted to > > research on this, knitting some experimental ext4 patches should be doable. > > Atomic 8kB writes would improve performance for us quite a lot. Full > page writes to WAL are very expensive. I don't remember what > percentage of write-ahead log traffic that accounts for, but it's not > small. Essentially, the "atomic writes" will be journalled data, so initially there is not going to be any difference in performance between journalling the data in userspace and journalling it in the filesystem journal. Indeed, it could be worse, because the filesystem journal is typically much smaller than a database WAL file, and it will flush much more frequently and without the database having any say in when that occurs. AFAICT, we're stuck with sucky WAL until the block layer and hardware support atomic writes. FWIW, I've certainly considered adding per-file data journalling capabilities to XFS in the past. If we decide that this is the way to proceed (i.e. as a stepping stone towards hardware atomic write support), then I can go back to my notes from a few years ago and see what still needs to be done to support it. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 07:13:27PM -0500, Tom Lane wrote: > Dave Chinner writes: > > On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote: > >> And most importantly, "Also, please don't freeze up everything else in the > >> process" > > > If you hand writeback off to the kernel, then writeback for memory > > reclaim needs to take precedence over "metered writeback". If we are > > low on memory, then cleaning dirty memory quickly to avoid ongoing > > allocation stalls, failures and potentially OOM conditions is far more > > important than anything else. > > I think you're in violent agreement, actually. Jeff's point is exactly > that we'd rather the checkpoint deadline slid than that the system goes > to hell in a handbasket for lack of I/O cycles. Here "metered" really > means "do it as a low-priority task". No, I meant the opposite - in low memory situations, the system is going to go to hell in a handbasket because we are going to cause a writeback IO storm cleaning memory regardless of these IO priorities. i.e. there is no way we'll let "low priority writeback to avoid IO storms" cause OOM conditions to occur. That is, in OOM conditions, cleaning dirty pages becomes one of the highest priority tasks of the system. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote: > On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane wrote: > > > Heikki Linnakangas writes: > > > On 01/15/2014 07:50 AM, Dave Chinner wrote: > > >> FWIW [and I know you're probably sick of hearing this by now], but > > >> the blk-io throttling works almost perfectly with applications that > > >> use direct IO. > > > > > For checkpoint writes, direct I/O actually would be reasonable. > > > Bypassing the OS cache is a good thing in that case - we don't want the > > > written pages to evict other pages from the OS cache, as we already have > > > them in the PostgreSQL buffer cache. > > > > But in exchange for that, we'd have to deal with selecting an order to > > write pages that's appropriate depending on the filesystem layout, > > other things happening in the system, etc etc. We don't want to build > > an I/O scheduler, IMO, but we'd have to. > > > > > Writing one page at a time with O_DIRECT from a single process might be > > > quite slow, so we'd probably need to use writev() or asynchronous I/O to > > > work around that. > > > > Yeah, and if the system has multiple spindles, we'd need to be issuing > > multiple O_DIRECT writes concurrently, no? > > > > writev effectively does do that, doesn't it? But they do have to be on the > same file handle, so that could be a problem. I think we need something > like sorted checkpoints sooner or later, anyway. No, it doesn't. writev() allows you to supply multiple user buffers for a single IO at a fixed offset. If the file is contiguous, then it will be issued as a single IO. If you want concurrent DIO, then you need to use multiple threads or AIO. > > What we'd really like for checkpointing is to hand the kernel a boatload > > (several GB) of dirty pages and say "how about you push all this to disk > > over the next few minutes, in whatever way seems optimal given the storage > > hardware and system situation. Let us know when you're done." 
> > And most importantly, "Also, please don't freeze up everything else in the > process" If you hand writeback off to the kernel, then writeback for memory reclaim needs to take precedence over "metered writeback". If we are low on memory, then cleaning dirty memory quickly to avoid ongoing allocation stalls, failures and potentially OOM conditions is far more important than anything else. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
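[Editorial note: for readers following the writev() point above, here is a minimal C sketch of the semantics Dave describes — pwritev() gathers several user buffers into ONE write at a single file offset; it is not a way to issue concurrent writes to different offsets. The helper name and file path are invented for the demo, and O_DIRECT alignment requirements are omitted for simplicity.]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Gathered write demo: two separate user buffers, one file offset.
 * pwritev() coalesces them into a single sequential IO, exactly as
 * described above; concurrent IO to different offsets would instead
 * need multiple threads (each calling pwrite) or AIO. */
int demo_pwritev(void)
{
    char a[] = "hello ";
    char b[] = "world";
    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = 6 },
        { .iov_base = b, .iov_len = 5 },
    };
    char path[] = "/tmp/pwritev_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* One syscall: an 11-byte gathered write at offset 0. */
    if (pwritev(fd, iov, 2, 0) != 11)
        return -1;

    char buf[12] = {0};
    if (pread(fd, buf, 11, 0) != 11)
        return -1;

    int ok = (strcmp(buf, "hello world") == 0) ? 0 : -1;
    close(fd);
    unlink(path);
    return ok;
}
```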
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote: > Heikki Linnakangas writes: > > On 01/15/2014 07:50 AM, Dave Chinner wrote: > >> FWIW [and I know you're probably sick of hearing this by now], but > >> the blk-io throttling works almost perfectly with applications that > >> use direct IO. > > > For checkpoint writes, direct I/O actually would be reasonable. > > Bypassing the OS cache is a good thing in that case - we don't want the > > written pages to evict other pages from the OS cache, as we already have > > them in the PostgreSQL buffer cache. > > But in exchange for that, we'd have to deal with selecting an order to > write pages that's appropriate depending on the filesystem layout, > other things happening in the system, etc etc. We don't want to build > an I/O scheduler, IMO, but we'd have to. I don't see that as necessary - nobody else needs to do this with direct IO. Indeed, if the application does ascending offset order writeback from within a file, then it's replicating exactly what the kernel page cache writeback does. If what the kernel does is good enough for you, then I can't see how doing the same thing with a background thread doing direct IO is going to need any special help. > > Writing one page at a time with O_DIRECT from a single process might be > > quite slow, so we'd probably need to use writev() or asynchronous I/O to > > work around that. > > Yeah, and if the system has multiple spindles, we'd need to be issuing > multiple O_DIRECT writes concurrently, no? > > What we'd really like for checkpointing is to hand the kernel a boatload > (several GB) of dirty pages and say "how about you push all this to disk > over the next few minutes, in whatever way seems optimal given the storage > hardware and system situation. Let us know when you're done." The issue there is that the kernel has other triggers for needing to clean data. 
We have no infrastructure to handle variable writeback deadlines at the moment, nor do we have any infrastructure to do roughly metered writeback of such files to disk. I think we could add it to the infrastructure without too much perturbation of the code, but as you've pointed out that still leaves the fact there's no obvious interface to configure such behaviour. Would it need to be persistent? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
Robert Haas writes: > I don't see that as a problem. What we're struggling with today is > that, until we fsync(), the system is too lazy about writing back > dirty pages. And then when we fsync(), it becomes very aggressive and > system-wide throughput goes into the tank. What we're aiming to do > here is to start the writeback sooner than it would otherwise > start so that it is spread out over a longer period of time. Yeah. It's sounding more and more like the right semantics are to give the kernel a hint that we're going to fsync these files later, so it ought to get on with writing them anytime the disk has nothing better to do. I'm not sure if there's value in being specific about how much later; that would probably depend on details of the scheduler that I don't know. regards, tom lane
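[Editorial note: one existing Linux primitive comes close to the hint Tom sketches — sync_file_range() with SYNC_FILE_RANGE_WRITE starts writeback of a file's dirty pages immediately but returns without waiting, leaving the eventual fsync() with little left to do. The sketch below is illustrative only, not a proposal from the thread; the helper name is invented.]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to start writing back a file's
 * dirty pages now, without waiting for completion, so that a later
 * fsync() finds most of the work already done. */
int start_writeback_hint(int fd)
{
#ifdef __linux__
    /* offset 0, nbytes 0 means "from offset to end of file";
     * SYNC_FILE_RANGE_WRITE initiates writeback but does not block. */
    return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#else
    (void) fd;          /* no equivalent hint; rely on fsync() alone */
    return 0;
#endif
}
```

Note that this is only a hint: as discussed later in the thread, the design still ends with an fsync() on the file, so a forgotten or ineffective hint costs performance, not durability.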
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
Dave Chinner writes: > On Wed, Jan 15, 2014 at 07:08:18PM -0500, Tom Lane wrote: >> No, we'd be happy to re-request it during each checkpoint cycle, as >> long as that wasn't an unduly expensive call to make. I'm not quite >> sure where such requests ought to "live" though. One idea is to tie >> them to file descriptors; but the data to be written might be spread >> across more files than we really want to keep open at one time. > It would be a property of the inode, as that is how writeback is > tracked and timed. Set and queried through a file descriptor, > though - it's basically the same context that fadvise works > through. Ah, got it. That would be fine on our end, I think. >> We could probably live with serially checkpointing data >> in sets of however-many-files-we-can-have-open, if file descriptors are >> the place to keep the requests. > Inodes live longer than file descriptors, but there's no guarantee > that they live from one fd context to another. Hence my question > about persistence ;) I plead ignorance about what an "fd context" is. However, if what you're saying is that there's a small chance of the kernel forgetting the request during normal system operation, I think we could probably tolerate that, if the API is designed so that we ultimately do an fsync on the file anyway. The point of the hint would be to try to ensure that the later fsync had little to do. If sometimes it didn't work, well, that's life. We're ahead of the game as long as it usually works. regards, tom lane
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 7:22 PM, Dave Chinner wrote: > No, I meant the opposite - in low memory situations, the system is > going to go to hell in a handbasket because we are going to cause a > writeback IO storm cleaning memory regardless of these IO > priorities. i.e. there is no way we'll let "low priority writeback > to avoid IO storms" cause OOM conditions to occur. That is, in OOM > conditions, cleaning dirty pages becomes one of the highest priority > tasks of the system I don't see that as a problem. What we're struggling with today is that, until we fsync(), the system is too lazy about writing back dirty pages. And then when we fsync(), it becomes very aggressive and system-wide throughput goes into the tank. What we're aiming to do here is to start the writeback sooner than it would otherwise start, so that it is spread out over a longer period of time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] CREATE FOREIGN TABLE ( ... LIKE ... )
On 01/16/2014 01:21 AM, Tom Lane wrote: > Vik Fearing writes: >>> I am marking this patch as 'returned with feedback' in the commitfest app. >> It looks like this patch got left behind in the previous commitfest. >> What is the policy for moving it? Is it too late and will have to wait >> until the next commitfest? >> https://commitfest.postgresql.org/action/patch_view?id=1254 > I think you were in error to mark it "returned with feedback", as that > caused everyone to stop paying attention to it in that commitfest. > (And David dropped the ball too, as he should have done something to > bring it back from that state, if it was committable or nearly so.) I see. Sorry about that. > I see no reason why you shouldn't move it to the new fest; perhaps > mark it as waiting on author, since really it's his responsibility > to take the next step, ie comment on your version of the patch. Done. -- Vik
Re: [HACKERS] CREATE FOREIGN TABLE ( ... LIKE ... )
Vik Fearing writes: >> I am marking this patch as 'returned with feedback' in the commitfest app. > It looks like this patch got left behind in the previous commitfest. > What is the policy for moving it? Is it too late and will have to wait > until the next commitfest? > https://commitfest.postgresql.org/action/patch_view?id=1254 I think you were in error to mark it "returned with feedback", as that caused everyone to stop paying attention to it in that commitfest. (And David dropped the ball too, as he should have done something to bring it back from that state, if it was committable or nearly so.) I see no reason why you shouldn't move it to the new fest; perhaps mark it as waiting on author, since really it's his responsibility to take the next step, ie comment on your version of the patch. regards, tom lane
Re: [HACKERS] CREATE FOREIGN TABLE ( ... LIKE ... )
On Thu, Jan 16, 2014 at 01:07:50AM +0100, Vik Fearing wrote: > On 11/24/2013 02:03 AM, Vik Fearing wrote: > > On 10/15/2013 07:50 AM, David Fetter wrote: > >> On Mon, Oct 07, 2013 at 11:16:56PM -0700, David Fetter wrote: > >>> Folks, > >>> > >>> Please find attached a patch implementing and documenting, to some > >>> extent, $subject. I did this in aid of being able to import SQL > >>> standard catalogs and other entities where a known example could > >>> provide a template for a foreign table. > >>> > >>> Should there be errhint()s, too? Should we pile up all such errors > >>> and mention them at the end rather than simply bailing on the first > >>> one? > >>> > >>> TBD: regression tests. > >> Now included: regression tests for disallowed LIKE options. > > I like this patch, but I don't like its implementation at all. > > > > First of all, the documentation doesn't compile: > > > > openjade:ref/create_foreign_table.sgml:124:17:E: end tag for "LISTITEM" > > omitted, but OMITTAG NO was specified > > openjade:ref/create_foreign_table.sgml:119:4: start tag was here > > > > > > I fixed that, and then noticed that like_option is not explained like it > > is in CREATE TABLE. > > > > Then I got down to the description of the LIKE clause in both pages, and > > I noticed the last line of CREATE TABLE, which is "Inapplicable options > > (e.g., INCLUDING INDEXES from a view) are ignored.". This is > > inconsistent with the behavior of this patch to throw errors for > > inapplicable options. > > > > Attached is a patch which corrects and completes the documentation > > issues noted above, and also silently ignores inapplicable options. In > > addition to reducing patch size, this also allows the use of INCLUDING > > ALL. Because these options no longer produce errors, and that's all the > > regression tests were looking for, I have removed those tests > > (unfortunately leaving none). > > > > Aside from this difference in behavior, I see no fault in this patch. 
> > > > I am marking this patch as 'returned with feedback' in the commitfest app. > > > > It looks like this patch got left behind in the previous commitfest. > What is the policy for moving it? Is it too late and will have to wait > until the next commitfest? > > https://commitfest.postgresql.org/action/patch_view?id=1254 I think we should still be OK putting it in the current one. Cheers, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] Linux kernel impact on PostgreSQL performance
On 1/15/14, 12:00 AM, Claudio Freire wrote: My completely unproven theory is that swapping is overwhelmed by near-misses. Ie: a process touches a page, and before it's actually swapped in, another process touches it too, blocking on the other process' read. But the second process doesn't account for that page when evaluating predictive models (ie: read-ahead), so the next I/O by process 2 is unexpected to the kernel. Then the same with 1. Etc... In essence, swap, by a fluke of its implementation, fails utterly to predict the I/O pattern, and results in far sub-optimal reads. Explicit I/O is free from that effect, all read calls are accountable, and that makes a difference. Maybe, if the kernel could be fixed in that respect, you could consider mmap'd files as a suitable form of temporary storage. But that would depend on the success and availability of such a fix/patch. Another option is to consider some of the more "radical" ideas in this thread, but only for temporary data. Our write sequencing and other needs are far less stringent for this stuff. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
Dave Chinner writes: > On Wed, Jan 15, 2014 at 02:29:40PM -0800, Jeff Janes wrote: >> And most importantly, "Also, please don't freeze up everything else in the >> process" > If you hand writeback off to the kernel, then writeback for memory > reclaim needs to take precedence over "metered writeback". If we are > low on memory, then cleaning dirty memory quickly to avoid ongoing > allocation stalls, failures and potentially OOM conditions is far more > important than anything else. I think you're in violent agreement, actually. Jeff's point is exactly that we'd rather the checkpoint deadline slid than that the system goes to hell in a handbasket for lack of I/O cycles. Here "metered" really means "do it as a low-priority task". regards, tom lane
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
Dave Chinner writes: > On Wed, Jan 15, 2014 at 10:12:38AM -0500, Tom Lane wrote: >> What we'd really like for checkpointing is to hand the kernel a boatload >> (several GB) of dirty pages and say "how about you push all this to disk >> over the next few minutes, in whatever way seems optimal given the storage >> hardware and system situation. Let us know when you're done." > The issue there is that the kernel has other triggers for needing to > clean data. We have no infrastructure to handle variable writeback > deadlines at the moment, nor do we have any infrastructure to do > roughly metered writeback of such files to disk. I think we could > add it to the infrastructure without too much perturbation of the > code, but as you've pointed out that still leaves the fact there's > no obvious interface to configure such behaviour. Would it need to > be persistent? No, we'd be happy to re-request it during each checkpoint cycle, as long as that wasn't an unduly expensive call to make. I'm not quite sure where such requests ought to "live" though. One idea is to tie them to file descriptors; but the data to be written might be spread across more files than we really want to keep open at one time. But the only other idea that comes to mind is some kind of global sysctl, which would probably have security and permissions issues. (One thing that hasn't been mentioned yet in this thread, but maybe is worth pointing out now, is that Postgres does not run as root, and definitely doesn't want to. So we don't want a knob that would require root permissions to twiddle.) We could probably live with serially checkpointing data in sets of however-many-files-we-can-have-open, if file descriptors are the place to keep the requests. regards, tom lane
Re: [HACKERS] CREATE FOREIGN TABLE ( ... LIKE ... )
On 11/24/2013 02:03 AM, Vik Fearing wrote: > On 10/15/2013 07:50 AM, David Fetter wrote: >> On Mon, Oct 07, 2013 at 11:16:56PM -0700, David Fetter wrote: >>> Folks, >>> >>> Please find attached a patch implementing and documenting, to some >>> extent, $subject. I did this in aid of being able to import SQL >>> standard catalogs and other entities where a known example could >>> provide a template for a foreign table. >>> >>> Should there be errhint()s, too? Should we pile up all such errors >>> and mention them at the end rather than simply bailing on the first >>> one? >>> >>> TBD: regression tests. >> Now included: regression tests for disallowed LIKE options. > I like this patch, but I don't like its implementation at all. > > First of all, the documentation doesn't compile: > > openjade:ref/create_foreign_table.sgml:124:17:E: end tag for "LISTITEM" > omitted, but OMITTAG NO was specified > openjade:ref/create_foreign_table.sgml:119:4: start tag was here > > > I fixed that, and then noticed that like_option is not explained like it > is in CREATE TABLE. > > Then I got down to the description of the LIKE clause in both pages, and > I noticed the last line of CREATE TABLE, which is "Inapplicable options > (e.g., INCLUDING INDEXES from a view) are ignored.". This is > inconsistent with the behavior of this patch to throw errors for > inapplicable options. > > Attached is a patch which corrects and completes the documentation > issues noted above, and also silently ignores inapplicable options. In > addition to reducing patch size, this also allows the use of INCLUDING > ALL. Because these options no longer produce errors, and that's all the > regression tests were looking for, I have removed those tests > (unfortunately leaving none). > > Aside from this difference in behavior, I see no fault in this patch. > > I am marking this patch as 'returned with feedback' in the commitfest app. > It looks like this patch got left behind in the previous commitfest. 
What is the policy for moving it? Is it too late and will have to wait until the next commitfest? https://commitfest.postgresql.org/action/patch_view?id=1254 -- Vik
Re: [HACKERS] CREATE TABLESPACE WITH
On 01/15/2014 03:07 AM, Michael Paquier wrote: > I just had a quick look at this patch, no testing at all. Thank you for looking at it. > I am seeing that regression tests are still missing here, they should be > added in > src/test/regress/input/tablespace.source. New patch attached, with regression tests. > Then, for the use of either > WITH or SET... For example CREATE FUNCTION uses SET for a > configuration parameter, and ALTER FUNCTION is consistent with that. > So SET makes more sense to be consistent with CREATE TABLESPACE? This > patch should not be modified once again as long as there are no more > opinions though... Based on your analysis of current behavior elsewhere, and the messier patch with SET, I have switched my vote from ambivalent to preferring WITH. -- Vik *** a/doc/src/sgml/ref/create_tablespace.sgml --- b/doc/src/sgml/ref/create_tablespace.sgml *** *** 21,27 PostgreSQL documentation ! CREATE TABLESPACE tablespace_name [ OWNER user_name ] LOCATION 'directory' --- 21,30 ! CREATE TABLESPACE tablespace_name ! [ OWNER user_name ] ! LOCATION 'directory' ! [ WITH ( tablespace_option = value [, ... ] ) ] *** *** 87,92 CREATE TABLESPACE tablespace_name [ --- 90,113 + + + tablespace_option + + + A tablespace parameter to be set or reset. Currently, the only + available parameters are seq_page_cost and + random_page_cost. Setting either value for a particular + tablespace will override the planner's usual estimate of the cost of + reading pages from tables in that tablespace, as established by + the configuration parameters of the same name (see + , + ). This may be useful if one + tablespace is located on a disk which is faster or slower than the + remainder of the I/O subsystem. 
+ + + *** a/src/backend/commands/tablespace.c --- b/src/backend/commands/tablespace.c *** *** 234,239 CreateTableSpace(CreateTableSpaceStmt *stmt) --- 234,240 Oid tablespaceoid; char *location; Oid ownerId; + Datum newOptions; /* Must be super user */ if (!superuser()) *** *** 317,323 CreateTableSpace(CreateTableSpaceStmt *stmt) values[Anum_pg_tablespace_spcowner - 1] = ObjectIdGetDatum(ownerId); nulls[Anum_pg_tablespace_spcacl - 1] = true; ! nulls[Anum_pg_tablespace_spcoptions - 1] = true; tuple = heap_form_tuple(rel->rd_att, values, nulls); --- 318,333 values[Anum_pg_tablespace_spcowner - 1] = ObjectIdGetDatum(ownerId); nulls[Anum_pg_tablespace_spcacl - 1] = true; ! ! /* Generate new proposed spcoptions (text array) */ ! newOptions = transformRelOptions((Datum) 0, ! stmt->options, ! NULL, NULL, false, false); ! (void) tablespace_reloptions(newOptions, true); ! if (newOptions != (Datum) 0) ! values[Anum_pg_tablespace_spcoptions - 1] = newOptions; ! else ! nulls[Anum_pg_tablespace_spcoptions - 1] = true; tuple = heap_form_tuple(rel->rd_att, values, nulls); *** a/src/backend/nodes/copyfuncs.c --- b/src/backend/nodes/copyfuncs.c *** *** 3370,3375 _copyCreateTableSpaceStmt(const CreateTableSpaceStmt *from) --- 3370,3376 COPY_STRING_FIELD(tablespacename); COPY_STRING_FIELD(owner); COPY_STRING_FIELD(location); + COPY_NODE_FIELD(options); return newnode; } *** a/src/backend/nodes/equalfuncs.c --- b/src/backend/nodes/equalfuncs.c *** *** 1610,1615 _equalCreateTableSpaceStmt(const CreateTableSpaceStmt *a, const CreateTableSpace --- 1610,1616 COMPARE_STRING_FIELD(tablespacename); COMPARE_STRING_FIELD(owner); COMPARE_STRING_FIELD(location); + COMPARE_NODE_FIELD(options); return true; } *** a/src/backend/parser/gram.y --- b/src/backend/parser/gram.y *** *** 3588,3599 opt_procedural: * */ ! 
CreateTableSpaceStmt: CREATE TABLESPACE name OptTableSpaceOwner LOCATION Sconst { CreateTableSpaceStmt *n = makeNode(CreateTableSpaceStmt); n->tablespacename = $3; n->owner = $4; n->location = $6; $$ = (Node *) n; } ; --- 3588,3600 * */ ! CreateTableSpaceStmt: CREATE TABLESPACE name OptTableSpaceOwner LOCATION Sconst opt_reloptions { CreateTableSpaceStmt *n = makeNode(CreateTableSpaceStmt); n->tablespacename = $3; n->owner = $4; n->location = $6; + n->options = $7; $$ = (Node *) n; } ; *** a/src/include/nodes/parsenodes.h --- b/src/include/nodes/parsenodes.h *** *** 1669,1674 typedef struct CreateTableSpaceStmt --- 1669,1675 char *tabl
Re: [HACKERS] trgm regex index peculiarity
Alexander Korotkov writes: > Revised version of patch with necessary comments. I looked at this patch a bit. It seems like this: + *BLANK_COLOR_SIZE - How much blank character is more frequent than + * other character in average + #define BLANK_COLOR_SIZE 32 is a drastic overestimate. Using the program Erik provided to generate some sample data, I find that blanks in trigrams are maybe about three times more common than other characters. Possibly Erik's data isn't very representative of real text, but if that's the case we'd better get a more representative sample case before we start optimizing ... Now maybe the above number is okay for this purpose anyway, but if so it needs some different justification; maybe call it a "penalty"? But nonetheless it seems like a darn large penalty, particularly for " x" type trigrams where the value would effectively be squared. Compared to the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like you might as well forget the "sizing" approach and just flat out reject trigrams containing COLOR_BLANK, because they'll never get through the size filter. I'm inclined to think you need a larger MAX_TRGM_SIZE; and WISH_TRGM_SIZE seems mighty small as well, compared to what the code would have done before. Also, surely this bit: ! trgmNFA->totalTrgmCount = (int) totalTrgmSize; is flat out wrong? expandColorTrigrams() expects that trgmNFA->totalTrgmCount is the truth, not some penalty-bloated abstraction. The fact that the patch hasn't failed on you completely demonstrates that you've not tested any cases where blank-containing trigrams got through the filter, reinforcing my feeling that it's probably too strict now. I wonder if it would be more appropriate to continue to measure the limit MAX_TRGM_COUNT in terms of actual trigram counts, and just use the penalized "size" as the sort key while determining which color trigrams to discard first. 
Another thought is that it's not clear that you should apply the same penalty to blanks in all three positions. Because of the padding settings, any one word will normally lead to padded strings " a", " ab", "yz ", so that spaces in the first position are about twice as common as spaces in the other positions. (It's a little less than that because single-letter words produce only " a" and " a "; but I'd think that such words aren't very common in text that trigrams are effective for.) So I'm thinking the penalty ought to take that into account. I'm also inclined to think that it might be worth adding a size field to ColorTrgmInfo rather than repeatedly calculating the size estimate. We only allow a thousand and some ColorTrgmInfos at most, so the extra space wouldn't be that large, and I'm concerned by the expense the current patch adds to the sort comparator. Another point is that this comment: * Note: per-color-trigram counts cannot overflow an int so long as * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about * 1290. However, the grand total totalTrgmCount might conceivably * overflow an int, so we use int64 for that within this routine. is no longer valid, or at least it fails to demonstrate that the result of getColorTrigramSize() can't overflow an int; indeed I rather fear it can. The patch failed to even update the variable name in this comment, let alone address the substantive question. There are some other cosmetic things I didn't like, for instance colorTrgmInfoCountCmp() is no longer comparing counts but you didn't rename it. That's about it for substantive comments though. regards, tom lane
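[Editorial note: for concreteness, the penalized-size notion being debated can be sketched as follows. A color trigram expands to roughly the product of the number of characters in each position's color, and a blank position (which otherwise counts as a single character) is weighted by a penalty reflecting how much more frequent blanks are. The struct, names, and penalty value below are invented for illustration and differ from the patch's actual code.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical penalty for a blank position; the patch under review
 * used 32, which Tom argues above is a drastic overestimate. */
#define BLANK_PENALTY 4

typedef struct
{
    int  char_count;     /* characters in this position's color */
    bool is_blank;       /* position is the blank "color" */
} trgm_position;

/* Penalized size estimate: product over the three positions, with a
 * blank position contributing BLANK_PENALTY instead of 1. Used only
 * as a sort key for deciding which color trigrams to discard first,
 * per Tom's suggestion, not as the actual expanded-trigram count. */
long penalized_trgm_size(const trgm_position pos[3])
{
    long size = 1;
    for (int i = 0; i < 3; i++)
        size *= pos[i].is_blank ? BLANK_PENALTY : pos[i].char_count;
    return size;
}
```

This makes Tom's squaring observation visible: a " x"-style trigram with two blank positions gets the penalty applied twice.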
Re: [HACKERS] Backup throttling
Alvaro Herrera escribió: > Antonin Houska escribió: > > Thanks for checking. The new version addresses your findings. > > I gave this patch a look. BTW I also moved the patch to the commitfest currently running, and set it waiting-on-author. Your move. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 7:12 AM, Tom Lane wrote: > Heikki Linnakangas writes: > > On 01/15/2014 07:50 AM, Dave Chinner wrote: > >> FWIW [and I know you're probably sick of hearing this by now], but > >> the blk-io throttling works almost perfectly with applications that > >> use direct IO. > > > For checkpoint writes, direct I/O actually would be reasonable. > > Bypassing the OS cache is a good thing in that case - we don't want the > > written pages to evict other pages from the OS cache, as we already have > > them in the PostgreSQL buffer cache. > > But in exchange for that, we'd have to deal with selecting an order to > write pages that's appropriate depending on the filesystem layout, > other things happening in the system, etc etc. We don't want to build > an I/O scheduler, IMO, but we'd have to. > > > Writing one page at a time with O_DIRECT from a single process might be > > quite slow, so we'd probably need to use writev() or asynchronous I/O to > > work around that. > > Yeah, and if the system has multiple spindles, we'd need to be issuing > multiple O_DIRECT writes concurrently, no? > writev effectively does do that, doesn't it? But they do have to be on the same file handle, so that could be a problem. I think we need something like sorted checkpoints sooner or later, anyway. > What we'd really like for checkpointing is to hand the kernel a boatload > (several GB) of dirty pages and say "how about you push all this to disk > over the next few minutes, in whatever way seems optimal given the storage > hardware and system situation. Let us know when you're done." And most importantly, "Also, please don't freeze up everything else in the process" Cheers, Jeff
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 15 January 2014 16:47, Robert Haas wrote: >>> There would be a postgresql.conf parameter prune_cost_limit, as well >>> as a table level parameter that would prevent pruning except via >>> VACUUM. >>> >>> This will help in these ways >>> * Reduce write I/O from SELECTs and pg_dump - improving backups and BI >>> queries >>> * Allow finer grained control over Hot Standby conflicts >>> * Potentially allow diagnostic inspection of older data via SeqScan >> Patch attached, implemented to reduce writes by SELECTs only. > > I am still not sure whether we want this, but I think it's definitely > an improvement over the previous version. Assorted comments: > > - Naming consistency seems to me to dictate that there should be more > similarity between the reloption name (allow_buffer_cleanup) and the > GUC (prune_page_dirty_limit). Now that I've written the patch, I'm seeing those as two different things, but YMMV and I am very open to naming suggestions. > - The documentation doesn't describe the use case where suppressing > cleanup on a per-table basis would be desirable, and I can't think of > one, either. We already know that HOT is ineffective in areas of high contention (previous thread by me). Prior experience was that smaller tables didn't show much apparent benefit from using HOT either; its effectiveness was limited to medium and large tables being updated. The two already stated use cases that would apply are these ones * Allow finer grained control over Hot Standby conflicts * Potentially allow diagnostic inspection of older data via SeqScan So the use cases for the two parameters seem quite different and we may decide we want one but not the other. > - There are a variety of ways to limit pruning; here, you've chosen to > limit it to a particular number of pruning operations per executor > invocation. 
But the flag is global, not part of the executor state, > so a query that calls a PL/pgsql function during execution will reset > the counter for the parent query also, which doesn't seem very > principled. That is a subtle thing in this patch and I agree that a potential problem exists. The current limit is set according to the currently executing statement, but the running total is not reset until the start of the top-level statement. So the behaviour is not reset during statements executed within a PL/pgSQL function. > In a patch I posted a few years ago to set hint bits only sometimes, I > settled on an algorithm where I dirtied the first 50 pages per scan > and then skipped the next 950, or something like that. The idea was > that you wanted the pages that did get dirtied to be clustered > together to avoid random I/O; and also that you wanted tables of > arbitrary size to get hinted within a certain number of scans (e.g. > 20). The limiting here is much more aggressive, so on large tables it > will amount to basically no pruning at all. I dunno whether that's a > good idea or not. But if the idea of making this an integer rather > than a boolean is to allow some pruning to still happen while keeping > it checked within reasonable bounds, I'm not sure it will succeed. It sounds like you're in favour of the overall concept of limiting writes, which is good. The behaviour I think we need, based on listening to everybody so far, is:

* OLTP is unaffected
* Large SELECTs and pg_dump don't cause lots of write I/O

and hence why "prune_page_dirty_limit" offers a change in behaviour at a certain point. Reducing cleanup to "only 5%" just reduces but doesn't remove the problem. If the data is stored on very poor I/O infrastructure, any significant volume of writes can adversely affect performance. As we reduce the percentage, we also reduce the benefit from inducing writes in the first place, and so I would question why bother at all using a percentage.
For me, a parameter that gives you absolute rather than relative control is more desirable. The current behaviour assumes it is OK for the first/next user to touch the data to be the one that won't mind re-writing everything. In time critical applications, the first/next user could well have a very urgent need to access the data quickly and doesn't want to have to pay this price. In seldom-accessed data applications, VACUUM has lots of time to run out of hours, so users are OK to defer this work. Some applications exist where we literally want zero I/O. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
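The absolute limit being argued for here is easy to sketch. The following Python toy models a read-only scan that prunes at most prune_page_dirty_limit pages and leaves the rest for VACUUM (the parameter name mirrors the GUC under discussion; the logic is illustrative, not the patch's actual C code):

```python
def scan_with_prune_budget(pages, prune_page_dirty_limit):
    """Toy model of a read-only scan with an absolute dirty-page budget.

    `pages` is a sequence of booleans: True means the page holds prunable
    (dead) tuples.  Returns (pages_pruned, pages_skipped).
    """
    pruned = skipped = 0
    for has_dead_tuples in pages:
        if not has_dead_tuples:
            continue
        if pruned < prune_page_dirty_limit:
            pruned += 1      # prune now: the page is dirtied, costing write I/O
        else:
            skipped += 1     # over budget: leave the page for VACUUM
    return pruned, skipped

# A 10,000-page sequential scan where 30% of pages need pruning:
pages = [i % 10 < 3 for i in range(10_000)]
assert scan_with_prune_budget(pages, 100) == (100, 2900)    # absolute cap
assert scan_with_prune_budget(pages, 10_000) == (3000, 0)   # today's behaviour
```

The contrast with a percentage-based scheme is visible here: the absolute cap bounds the write I/O of the SELECT regardless of table size, which is the property Simon argues for above.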
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
On 1/15/14, 3:01 PM, Josh Berkus wrote: > Three issues: > > a) if postgresql is still going to look for a recovery.conf file in the > usual place, but we are changing the names and meaning of some of the > parameters, then aren't we making the upgrade problem much worse? That assumes that we are changing the names and meanings of some of the parameters, which I don't see a reason for. > b) what if the admin *does* want to disable reading recovery.conf in > order to prevent old utilities from mistakenly including one? How will > they do that? That assumes that there is a reason for doing that, which goes away if point (a) is addressed. > c) would this be in the configdir, datadir, or both? That might depend on the parameter and what a tool wants to do with it. There is also the consideration of whether some of those tools couldn't be changed to use ALTER SYSTEM. > I'd thought that part of the idea of the merger was to remove the > "magic" status of recovery.conf. Well, clearly, everyone has their own ideas about that. I have several non-overlapping ones of my own. ;-) But my point is that we should look what actually comes out of that discussion before we start designing other facilities that interact with it. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
On 1/15/14, 4:35 PM, Tom Lane wrote: > Peter Eisentraut writes: >> On 1/15/14, 1:53 PM, Josh Berkus wrote: >>> Yes, I'm also arguing that postgresql.auto.conf should go into conf.d. >>> I said I'd bring that up again after ALTER SYSTEM SET was committed, and >>> here it is. > >> Independent of the above, I don't agree with that. postgresql.auto.conf >> is a special thing and should have its own special place. For one >> thing, when putting configuration files in a separate directory >> structure (/etc/ vs /var), then postgresql.auto.conf should stay in the >> data directory. > > It seems like we still aren't all on the same page as to whether the > conf.d directory (and contained files) should be expected to be writable > by the postgres server or not. I think it's hopeless to proceed further > unless there's a strong consensus on that. You can turn this around and ask, why should it be writable? The server has no need to write anything there. In my mind, a conf.d directory is an extension of a single-file configuration, and so it should be handled that way. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
Peter Eisentraut writes: > On 1/15/14, 1:53 PM, Josh Berkus wrote: >> Yes, I'm also arguing that postgresql.auto.conf should go into conf.d. >> I said I'd bring that up again after ALTER SYSTEM SET was committed, and >> here it is. > Independent of the above, I don't agree with that. postgresql.auto.conf > is a special thing and should have its own special place. For one > thing, when putting configuration files in a separate directory > structure (/etc/ vs /var), then postgresql.auto.conf should stay in the > data directory. It seems like we still aren't all on the same page as to whether the conf.d directory (and contained files) should be expected to be writable by the postgres server or not. I think it's hopeless to proceed further unless there's a strong consensus on that. My vote would be that the server should *not* be expected to be able to write these files. It might physically be able to, in a quick-and-dirty installation, but in a setup prepared by a packaging system I'd not expect that the config files would be postgres-owned. Given that assumption, it's clear that postgresql.auto.conf can't live in conf.d. However, if you reject that assumption then maybe it should. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
Florian Pflug writes: > I'm currently thinking the best way forward is to get a basic patch > without any NUMERIC stuff committed, and to revisit this after that's done. +1. I liked Robert's suggestion that the "fast" aggregates be implemented as a contrib module rather than directly in core, too. If we do that then it's perfectly reasonable to handle that as a separate patch. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Jan15, 2014, at 19:56 , Robert Haas wrote: > It strikes me that for numeric what you really need is to just tell > the sum operation, whether through a parameter or otherwise, how many > decimal places to show in the output. Obviously that's not a > practical change for sum() itself, but if we're inventing new stuff it > can be done. You can already do that, just cast the result of SUM(numeric) to an appropriately constrained NUMERIC type, i.e. to NUMERIC(prec, scale). BTW, AVG() and STDDEV() have the same issue. The problem is just partially masked by the division by N (or N-1) at the end, because we always emit at least 16 fractional digits when dividing. So you have to have an input value with a larger scale than that to trigger it.

For the following query

  select avg(x) over (order by i rows between current row and 1 following)
  from (values
    (1,1), (2,1), (3,0.1), (4,1), (5,1)
  ) t(i,x);

9.3 returns

  avg
  -
  1.
  0.50001
  0.50001
  1.
  1.

but HEAD+patch returns

  avg
  -
  1.
  0.50001
  0.50001
  1.0
  1.0

I have to admit that I'm *very* tempted to suggest we simply ignore this - but that *would* mean accepting that windowed aggregates are non-deterministic in the sense that their result (even if only in the number of trailing zeros) depends on values outside of the frame. Which, I guess, is a box that best stays closed... I'm currently thinking the best way forward is to get a basic patch without any NUMERIC stuff committed, and to revisit this after that's done. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
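The effect Florian describes is not specific to PostgreSQL's numeric: any decimal arithmetic that widens the result scale to the larger operand's scale behaves this way. Python's decimal module makes a convenient sketch of why an add-then-subtract (inverse transition) cannot restore the original scale:

```python
from decimal import Decimal

# Forward-only aggregation over {1, 1}: the state keeps scale 0.
assert str(Decimal("1") + Decimal("1")) == "2"

# Once a value with scale 1 enters the window, the state's scale grows...
state = Decimal("1") + Decimal("1") + Decimal("0.1")
assert str(state) == "2.1"

# ...and the inverse transition (subtracting it back out) keeps that scale:
assert str(state - Decimal("0.1")) == "2.0"    # not "2"
```

So the same frame {1, 1} prints "2" when summed from scratch but "2.0" when reached by add-then-remove, which is exactly the trailing-zero discrepancy in the avg() output above.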
Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)
Hi, On 2014-01-15 13:28:25 -0500, Robert Haas wrote: > - I think you should just regard ReplicationSlotCtlLock as protecting > the "name" and "active" flags of every slot. ReplicationSlotCreate() > would then not need to mess with the spinlocks at all, and > ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I > think. Functions like ReplicationSlotsComputeRequiredXmin() can > acquire this lock in shared mode and only lock the slots that are > actually active, instead of all of them. I first thought you meant that we should get rid of the spinlock, but after rereading I think you just mean that ->name, ->active, ->in_use are only allowed to change while holding the lwlock exclusively so we don't need to spinlock in those cases? If so, yes, that works for me. > - If you address /* FIXME: apply sanity checking to slot name */, then > I think that also addresses /* XXX: do we want to use truncate > identifier instead? */. In other words, let's just error out if the > name is too long. I'm not sure what other sanity checking is needed > here; maybe just restrict it to 7-bit characters to avoid problems > with encodings for individual databases varying. Yea, erroring out seems like a good idea. But I think we need to restrict slot names a bit more than that, given they are used as filenames... We could instead name the files using the slot's offset, but I'd prefer to not go that route. > - ReplicationSlotAcquire probably needs to ignore slots that are not active. Not sure what you mean? If the slot isn't in_use we'll skip it in the loop. > - If there's a coding rule that slot->database can't be changed while > the slot is active, then the check to make sure that the user isn't > trying to bind to a slot with a mis-matching database could be done > before the code described in the previous point, avoiding the need to > go back and release the resource. I don't think slot->database should be allowed to change at all... 
> - I think the critical section in ReplicationSlotDrop is bogus. If > DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't > gone. Well, if DeleteSlot() fails, we don't really know at which point it failed, which means that the on-disk state might not correspond to the in-memory state. I don't see a point in adding code trying to handle that case correctly... > - cancel_before_shmem_exit is only guaranteed to remove the > most-recently-added callback. Yea :(. I think that's safe for the current usages but seems mighty fragile... Not sure what to do about it. Just register KillSlot once and keep it registered? > - Why does ReplicationSlotsComputeRequiredXmin() need to hold > ProcArrayLock at all? There's reasoning, but I just noticed that its basis might be flawed anyway :(. When starting changeset extraction in a new slot we need to guarantee that we only start decoding records we know the catalog tuples haven't been removed for. So, when creating the slot I've so far done a GetOldestXmin() and used that to check against xl_running_xact->oldestRunningXid. But GetOldestXmin() can go backwards... I'll think a bit and try to simplify this. > - ReplicationSlotsComputeRequiredXmin scarcely needs to hold the > spinlock while it does all of those gyrations. It can just acquire > the spinlock, copy the three fields needed into local variables, and > release the spinlock. The rest can be worked out afterwards. > Similarly in ReplicationSlotsComputeRequiredXmin. Yea, will change. > - A comment in KillSlot wonders whether locking is required. I say > yes. It's safe to take lwlocks and spinlocks during shmem exit, and > failing to do so seems like a recipe for subtle corner-case bugs. I agree that it's safe to use spinlocks, but lwlocks? What if we are erroring out while holding an lwlock? Code that runs inside a TransactionCommand is protected against that, but if not, ProcKill(), which invokes LWLockReleaseAll(), runs pretty late in the teardown process...
> - pg_get_replication_slots() wonders what permissions we require. I > don't know that any special permissions are needed here; the data > we're exposing doesn't appear to be sensitive. Unless I'm missing > something? I don't see a problem either, but it seems others have - pg_stat_replication only displays minor amounts of information if one doesn't have the replication privilege... Not sure what the reasoning there is, and whether it applies here as well. > - There seems to be no interface to acquire or release slots from > either SQL or the replication protocol, nor any way for a client of > this code to update its slot details. I don't think either ever needs to do that - START_TRANSACTION SLOT slot ...; and decoding_slot_*changes will acquire/release for them while active. What would the usecase be to allow them to be acquired from SQL? The slot details get updates by the respective replication code. For streaming rep, that should happen via reply and feedback messages. For changeset ext
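Robert's suggestion for ReplicationSlotsComputeRequiredXmin() — hold each slot's lock just long enough to copy the fields, then do all comparison work outside it — is the standard pattern. A rough sketch in Python, with threading.Lock standing in for the per-slot spinlock (field names are illustrative, not the patch's actual structs):

```python
import threading

class ReplicationSlot:
    """Minimal stand-in for a slot; `mutex` plays the per-slot spinlock."""
    def __init__(self, xmin, active):
        self.mutex = threading.Lock()
        self.xmin = xmin
        self.active = active

def compute_required_xmin(slots):
    required = None
    for slot in slots:
        with slot.mutex:                 # short critical section: copy only
            xmin, active = slot.xmin, slot.active
        if active and (required is None or xmin < required):
            required = xmin              # comparisons happen outside the lock
    return required

slots = [ReplicationSlot(700, True),
         ReplicationSlot(650, True),
         ReplicationSlot(600, False)]    # inactive slot is ignored
assert compute_required_xmin(slots) == 650
```

Keeping the critical section to a plain field copy is what makes a spinlock appropriate at all; any loop or comparison inside the locked region would argue for an lwlock instead.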
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
Josh, * Josh Berkus (j...@agliodbs.com) wrote: > However, Debian is *never* going to add conf.d to the packages if we > don't recommend it as an upstream project. And, frankly, I use the > apt.postgresql.org packages anyway, which would certainly include > anything which was decided on this list. Those are both categorically false claims. Debian certainly does not ship with 'trust' auth, nor do our PGDG packages. They also move the conf files out of the data directory. Last, but not least, they absolutely enable conf.d directories even when that is not the default upstream. Additionally, I fully expect and hope the apt.postgresql.org packages to follow the Debian/Ubuntu package management; having those diverge would absolutely be a horrible idea and cause a great deal of trouble for our users. Ideally, we can all agree, but this notion that the PGDG must follow what happens on -hackers is simply wrong; we need to include and coordinate with the OS package maintainers. > It's more than an interesting thought. It's the difference between > having 20 lines of backwards compatibility code, and having 150 plus a > bunch of additional user documentation and setup. If I was writing the tool, I'm pretty sure that I'd be writing all that code either way. Thanks, Stephen
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
On 01/15/2014 11:10 AM, Stephen Frost wrote: > I don't buy this argument one bit- certainly, on Debian, the defaults > are overridden for a number of variables already and could be done to > enable a conf.d directory as well. Directory creation would, of > course, also be able to be handled by the Debian package. Any tool > which is being packaged for Debian would simply have to Depend on the > version of the PG package that enabled the conf.d option by default. However, Debian is *never* going to add conf.d to the packages if we don't recommend it as an upstream project. And, frankly, I use the apt.postgresql.org packages anyway, which would certainly include anything which was decided on this list. > This doesn't deal with upgrade cases or where the user has disabled the > feature and so the tool would need to check for a directory's > existence, but changing our default isn't going to completely address > that issue either. There's a HUGE difference between "This tool depends on conf.d, so please don't disable it" and "please edit postgresql.conf in the following way, and create this directory with these permissions ..." >> I'm particularly thinking about this in relation to the merger of >> recovery.conf and postgresql.conf. There are several tools already >> (RepMgr, OmniPITR, HandyRep, pgPool, etc.) which manage recovery.conf >> separately from postgresql.conf. These tools will want to continue >> managing recovery.conf as a separate file, even if it's /included in >> postgresql.conf now. > > Certainly an interesting thought. It's more than an interesting thought. It's the difference between having 20 lines of backwards compatibility code, and having 150 plus a bunch of additional user documentation and setup. >> If conf.d exists by default, and is enabled in postgresql.conf by >> default, this is easy: the tool just drops a recovery.conf file into >> conf.d. Changing file locations and variable names is a fairly simple >> exercise in backwards compatibility.
> > This isn't quite that simple on at least Debian, and I believe RH, as > there's going to be multiple directories involved (one for each cluster > which exists on the system). Also, there are rules which packages need > to follow on Debian regarding conf file changes, which these certainly > would be. Again, from the perspective of the tool (assuming it already supports Debian), this is just a directory change. ("if version == 9.4, directory = directory + "/conf.d/") > My recommendation would be to argue the point with the package > maintainers. Even if we *had* a default, I'm not convinced that the > package maintainers would keep it that way until and unless they update > their scripts and whatnot to handle it. This is disingenuous. The package maintainers are never, ever going to add conf.d and enable it by default unless it's the official recommendation of the PostgreSQL project that they do so. Exactly how far would I get with "I couldn't get pgsql-hackers to agree to this, so I'm asking you"? On 01/15/2014 11:40 AM, Peter Eisentraut wrote: > That seems like a very general argument. We'd need to consider exactly > what these tools want to do, and how that affects the recovery.conf > merger. My personal proposal was to allow postgresql.conf to contain > recovery parameters, but read recovery.conf last anyway. That would > solve that problem. Three issues:

a) if postgresql is still going to look for a recovery.conf file in the usual place, but we are changing the names and meaning of some of the parameters, then aren't we making the upgrade problem much worse?

b) what if the admin *does* want to disable reading recovery.conf in order to prevent old utilities from mistakenly including one? How will they do that?

c) would this be in the configdir, datadir, or both?

I'd thought that part of the idea of the merger was to remove the "magic" status of recovery.conf.
>> If conf.d exists by default, and is enabled in postgresql.conf by >> default, this is easy: the tool just drops a recovery.conf file into >> conf.d. > > That would just give false hope, I think. My standard operating > procedure is to delete the default postgresql.conf and write a new one > from scratch (or have deployment tools do it). I don't want to > subscribe to a policy that the only endorsed way to maintain > postgresql.conf is to start with the default one and edit around carefully. > > The only way you could achieve your goal is to read the configuration > directory automatically, but that would come with its own set of problems. Frankly, I'm not concerned about people who rewrite their postgresql.conf from scratch already; it's *easy* to tell those people to re-add the conf.d reference to that file. I'm talking about the vast majority of our users who never edit pg.conf by hand. > Independent of the above, I don't agree with that. postgresql.auto.conf > is a special thing and should have its own special place. For one > thing, when putting configuration files in a separate directory > structure (/etc/ vs /var), then postgresql.auto.conf should stay in the > data directory.
Re: [HACKERS] tests for client programs
On Wed, January 15, 2014 06:30, Peter Eisentraut wrote: > As we all know, the client programs (src/bin/) don't have any real test > > So I wrote something. > > I chose to use Perl-based tools, prove and Test::More, because those are > [ 0001-Add-TAP-tests-for-client-programs.patch ] 32 k I gave this a quick try. Centos 6.5 final / perl 5.18.2 As mentioned earlier I had to install IPC::Run. 2 tests stumbled: 1. One test ( pg_ctl/t/001_start_stop.pl ) failed because I had PGDATA set. I unset all PG* vars after that. Not a big problem, but nonetheless it might be better if the test suite removes/controls the variables before running. 2. The pg_isready test failed in command_fails() ('fails with no server running') because it defaults to the compiled-in server port (and that server was running). I added the test-designated port (65432, as defined in TestLib.pm). This simple change is in the attached patch. With these two changes the whole test suite passed. Thanks, Erik Rijkers

--- src/bin/scripts/t/080_pg_isready.pl.orig	2014-01-15 20:08:16.325916223 +0100
+++ src/bin/scripts/t/080_pg_isready.pl	2014-01-15 20:18:24.705927054 +0100
@@ -7,7 +7,7 @@
 program_version_ok('pg_isready');
 program_options_handling_ok('pg_isready');
 
-command_fails(['pg_isready'], 'fails with no server running');
+command_fails(['pg_isready', '-p65432'], 'fails with no server running');
 
 my $tempdir = tempdir;
 start_test_server $tempdir;

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add %z support to elog/ereport?
On 2013-12-08 01:32:45 +0100, Andres Freund wrote: > On 2013-12-06 09:54:59 -0500, Peter Eisentraut wrote: > > On 11/11/13, 12:01 PM, Tom Lane wrote: > > > I do recall Peter saying that the infrastructure knows how to > > > verify conversion specs in translated strings, so it would have to be > > > aware of 'z' flags for this to work. > > > > It just checks that the same conversion placeholders appear in the > > translation. It doesn't interpret the actual placeholders. I don't > > think this will be a problem. > > Ok, so here's a patch. > > Patch 01 introduces infrastructure, and elog.c implementation. > Patch 02 converts some elog/ereport() callers to using the z modifier, > some were wrong at least on 64 bit windows. There's just been another instance of printing size_t values being annoying, reminding me of this patch. I'll add it to the current commitfest given I've posted it over a month ago unless somebody protests? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
On 1/15/14, 1:53 PM, Josh Berkus wrote: > I'm particularly thinking about this in relation to the merger of > recovery.conf and postgresql.conf. There are several tools already > (RepMgr, OmniPITR, HandyRep, pgPool, etc.) which manage recovery.conf > separately from postgresql.conf. These tools will want to continue > managing recovery.conf as a separate file, even if it's /included in > postgresql.conf now. That seems like a very general argument. We'd need to consider exactly what these tools want to do, and how that affects the recovery.conf merger. My personal proposal was to allow postgresql.conf to contain recovery parameters, but read recovery.conf last anyway. That would solve that problem. > If conf.d exists by default, and is enabled in postgresql.conf by > default, this is easy: the tool just drops a recovery.conf file into > conf.d. That would just give false hope, I think. My standard operating procedure is to delete the default postgresql.conf and write a new one from scratch (or have deployment tools do it). I don't want to subscribe to a policy that the only endorsed way to maintain postgresql.conf is to start with the default one and edit around carefully. The only way you could achieve your goal is to read the configuration directory automatically, but that would come with its own set of problems. > Yes, I'm also arguing that postgresql.auto.conf should go into conf.d. > I said I'd bring that up again after ALTER SYSTEM SET was committed, and > here it is. Independent of the above, I don't agree with that. postgresql.auto.conf is a special thing and should have its own special place. For one thing, when putting configuration files in a separate directory structure (/etc/ vs /var), then postgresql.auto.conf should stay in the data directory. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
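Peter's proposal above (recovery parameters allowed in postgresql.conf, but recovery.conf still read last) reduces to a precedence rule: files are read in a fixed order and the last occurrence of a setting wins, just like repeated lines within postgresql.conf. A sketch of that merge order (file names and values here are hypothetical):

```python
def effective_settings(files):
    """`files` is a list of (filename, settings-dict) pairs in read order.
    Later files override earlier ones, like repeated lines in postgresql.conf."""
    result = {}
    for _name, settings in files:
        result.update(settings)
    return result

read_order = [
    ("postgresql.conf",       {"work_mem": "4MB", "primary_conninfo": "host=a"}),
    ("conf.d/10-tuning.conf", {"work_mem": "64MB"}),           # conf.d, sorted order
    ("postgresql.auto.conf",  {"work_mem": "128MB"}),          # ALTER SYSTEM
    ("recovery.conf",         {"primary_conninfo": "host=b"}), # read last, wins
]
merged = effective_settings(read_order)
assert merged["primary_conninfo"] == "host=b"
assert merged["work_mem"] == "128MB"
```

Under this rule, a tool that drops its own recovery.conf keeps working unchanged, because nothing an admin writes earlier in the chain can override it.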
Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation
On Wed, Jan 15, 2014 at 7:28 AM, Amit Kapila wrote:

> Unpatched
> ---------
>
>                 testname                 | wal_generated |     duration
> -----------------------------------------+---------------+-----------------
>  one short and one long field, no change |    1054923224 |  33.101135969162
>
> After pgrb_delta_encoding_v4
> ----------------------------
>
>                 testname                 | wal_generated |     duration
> -----------------------------------------+---------------+-----------------
>  one short and one long field, no change |     877859144 | 30.6749138832092
>
> Temporary Changes
> (Revert Max Chunksize = 4 and logic of finding longer match)
> ------------------------------------------------------------
>
>                 testname                 | wal_generated |     duration
> -----------------------------------------+---------------+-----------------
>  one short and one long field, no change |     677337304 | 25.4048750400543

Sure, but watch me not care. If we're interested in taking advantage of the internal compressibility of tuples, we can do a lot better than this patch. We can compress the old tuple and the new tuple. We can compress full-page images. We can compress inserted tuples. But that's not the point of this patch. The point of *this* patch is to exploit the fact that the old and new tuples are likely to be very similar, NOT to squeeze out every ounce of compression from other sources. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
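The similarity Robert refers to is what chunk-based delta encoding exploits: encode the new tuple as copy ranges from the old tuple plus literal bytes for the changed parts. A toy sketch of the idea (not the pgrb_delta_encoding algorithm itself; the chunk size and op format are made up for illustration):

```python
def delta_encode(old: bytes, new: bytes, chunk: int = 4):
    """Encode `new` as ('copy', old_off, len) / ('lit', bytes) ops against `old`."""
    # Index every chunk-sized slice of the old tuple by content.
    table = {}
    for off in range(len(old) - chunk + 1):
        table.setdefault(old[off:off + chunk], off)
    ops, lit, i = [], bytearray(), 0
    while i < len(new):
        off = table.get(new[i:i + chunk])
        if off is None:
            lit.append(new[i])           # changed byte: emit it literally
            i += 1
            continue
        n = chunk                        # found a match: extend it greedily
        while off + n < len(old) and i + n < len(new) and old[off + n] == new[i + n]:
            n += 1
        if lit:
            ops.append(("lit", bytes(lit)))
            lit = bytearray()
        ops.append(("copy", off, n))
        i += n
    if lit:
        ops.append(("lit", bytes(lit)))
    return ops

def delta_decode(old: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)

old = b"one short field and one quite long field that never changes"
new = b"one SHORT field and one quite long field that never changes"
ops = delta_encode(old, new)
assert delta_decode(old, ops) == new
# Nearly all of the new tuple comes out as copies from the old version,
# so the WAL record only needs the small literal runs plus copy references.
copied = sum(op[2] for op in ops if op[0] == "copy")
assert copied > len(new) * 3 // 4
```

This is why similarity between the two tuple versions, rather than general compressibility, is the property the patch depends on: the fewer bytes differ, the shorter the literal runs.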
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed 15-01-14 14:38:44, Hannu Krosing wrote: > On 01/15/2014 02:01 PM, Jan Kara wrote: > > On Wed 15-01-14 12:16:50, Hannu Krosing wrote: > >> On 01/14/2014 06:12 PM, Robert Haas wrote: > >>> This would be pretty similar to copy-on-write, except > >>> without the copying. It would just be > >>> forget-from-the-buffer-pool-on-write. > >> +1 > >> > >> A version of this could probably already be implemented using MADV_DONTNEED > >> and MADV_WILLNEED > >> > >> That is, just after reading the page in, use MADV_DONTNEED on it. When > >> evicting > >> a clean page, check that it is still in cache and if it is, then > >> MADV_WILLNEED it. > >> > >> Another nice thing to do would be dynamically adjusting kernel > >> dirty_background_ratio > >> and other related knobs in real time based on how many buffers are dirty > >> inside postgresql. > >> Maybe in background writer. > >> > >> Question to LKM folks - will the kernel react well to frequent changes to > >> /proc/sys/vm/dirty_* ? > >> How frequent can they be (every few seconds? every second? 100Hz ?) > > So the question is what do you mean by 'react'. We check whether we > > should start background writeback every dirty_writeback_centisecs (5s). We > > will also check whether we didn't exceed the background dirty limit (and > > wake writeback thread) when dirtying pages. However this check happens once > > per several dirtied MB (unless we are close to dirty_bytes). > > > > When writeback is running we check roughly once per second (the logic is > > more complex there but I don't think explaining details would be useful > > here) whether we are below dirty_background_bytes and stop writeback in > > that case. > > > > So changing dirty_background_bytes every few seconds should work > > reasonably, once a second is pushing it and 100 Hz - no way. But I'd also > > note that you have conflicting requirements on the kernel writeback.
On one > > hand you want checkpoint data to steadily trickle to disk (well, trickle > > isn't exactly the proper word since if you need to checkpoint 16 GB every 5 > > minutes then you need a steady throughput of ~50 MB/s just for > > checkpointing) so you want to set dirty_background_bytes low, on the other > > hand you don't want temporary files to get to disk so you want to set > > dirty_background_bytes high. > Is it possible to have more fine-grained control over writeback, like > configuring dirty_background_bytes per file system / device (or even > a file or a group of files) ? Currently it isn't possible to tune dirty_background_bytes per device directly. However see below. > If not, then how hard would it be to provide this ? We do track the amount of dirty pages per device and the thread doing the flushing is also per device. The thing is that currently we compute the per-device background limit as dirty_background_bytes * p, where p is a proportion of writeback happening on this device to total writeback in the system (computed as a floating average with exponential time-based backoff). BTW, similarly the maximum per-device dirty limit is derived from the global dirty_bytes in the same way. And you can also set bounds on the proportion 'p' in /sys/block/sda/bdi/{min,max}_ratio so in theory you should be able to set a fixed background limit for a device by setting matching min and max proportions. Honza -- Jan Kara SUSE Labs, CR -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
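Jan's formula can be sketched numerically: the per-device background threshold is dirty_background_bytes * p, where p is that device's (decayed) share of recent writeback, clamped by the per-bdi min_ratio/max_ratio knobs. An illustration of the arithmetic only (not kernel code; the kernel's p is a floating average with exponential backoff, approximated here by a plain fraction):

```python
def device_background_limit(dirty_background_bytes, device_writeback, total_writeback,
                            min_ratio=0.0, max_ratio=1.0):
    """Per-device share of the global background dirty threshold.

    min_ratio/max_ratio mirror /sys/block/<dev>/bdi/{min,max}_ratio,
    expressed here as fractions of 1.
    """
    p = device_writeback / total_writeback if total_writeback else 0.0
    p = min(max(p, min_ratio), max_ratio)   # clamp to the configured bounds
    return int(dirty_background_bytes * p)

# One device doing 80% of recent writeback, global threshold 512 MB:
limit = device_background_limit(512 * 1024**2, device_writeback=80, total_writeback=100)
assert limit == int(512 * 1024**2 * 0.8)

# Pinning min_ratio == max_ratio fixes the device's limit regardless of its
# writeback share, which is the trick Jan suggests at the end:
fixed = device_background_limit(512 * 1024**2, 5, 100, min_ratio=0.25, max_ratio=0.25)
assert fixed == int(512 * 1024**2 * 0.25)
```

This shows why matching min and max proportions gives a fixed per-device background limit even though the proportion p normally floats with the workload.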
Re: [HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
Josh, * Josh Berkus (j...@agliodbs.com) wrote: > In 9.3, we added support for a config directory (conf.d), but have it > disabled by default. For tool authors, this makes conf.d useless since > you never know, on any given installation, whether it will be > present/enabled or not. While we don't want to prevent users from > disabling it, conf.d only becomes useful if it's present by default. > There's a simple reason why: if you want to write a tool which manages > postgresql.conf, you don't want the user to have to manually edit > postgresql.conf (and create a directory) in order to enable the tool. I don't buy this argument one bit- certainly, on Debian, the defaults are overridden for a number of variables already and could be done to enable a conf.d directory as well. Directory creation would, of course, also be able to be handled by the Debian package. Any tool which is being packaged for Debian would simply have to Depend on the version of the PG package that enabled the conf.d option by default. This doesn't deal with upgrade cases or where the user has disabled the feature and so the tool would need to check for a directory's existence, but changing our default isn't going to completely address that issue either. Indeed, the Debian package would at least be able to indicate, through the existence or not of the directory, whether a given cluster was set up by default with the conf.d structure or not. > I'm particularly thinking about this in relation to the merger of > recovery.conf and postgresql.conf. There are several tools already > (RepMgr, OmniPITR, HandyRep, pgPool, etc.) which manage recovery.conf > separately from postgresql.conf. These tools will want to continue > managing recovery.conf as a separate file, even if it's /included in > postgresql.conf now. Certainly an interesting thought. > If conf.d exists by default, and is enabled in postgresql.conf by > default, this is easy: the tool just drops a recovery.conf file into > conf.d. 
Changing file locations and variable names is a fairly simple > exercise in backwards compatibility. This isn't quite that simple on at least Debian, and I believe RH, as there are going to be multiple directories involved (one for each cluster which exists on the system). Also, there are rules which packages need to follow on Debian regarding conf file changes, which these certainly would be. > If conf.d does NOT exist by default, things become complicated, and > backwards compatibility for replication management tools becomes much > harder. My recommendation would be to argue the point with the package maintainers. Even if we *had* a default, I'm not convinced that the package maintainers would keep it that way until and unless they update their scripts and whatnot to handle it. > Yes, I'm also arguing that postgresql.auto.conf should go into conf.d. > I said I'd bring that up again after ALTER SYSTEM SET was committed, and > here it is. I disagree with enabling that by default. Thanks, Stephen
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 3:41 PM, Stephen Frost wrote: > * Claudio Freire (klaussfre...@gmail.com) wrote: >> But, still, the implementation is very similar to what postgres needs: >> sharing a physical page for two distinct logical pages, efficiently, >> with efficient copy-on-write. > > Agreed, except that KSM seems like it'd be slow/lazy about it and I'm > guessing there's a reason the pagecache isn't included normally.. KSM does an active de-duplication. That's slow. This would be leveraging KSM structures in the kernel (page sharing) but without all the de-duplication logic. > >> So it'd be just a matter of removing that limitation regarding page >> cache and shared pages. > > Any idea why that limitation is there? No, but I'm guessing it's because nobody bothered to implement the required copy-on-write in the page cache, which would be a PITA to write - think of all the complexities with privilege checks and everything - even though the benefits for many kinds of applications would be important. >> If you asked me, I'd implement it as copy-on-write on the page cache >> (not the user page). That ought to be low-overhead. > > Not entirely sure I'm following this- if it's a shared page, it doesn't > matter who starts writing to it, as soon as that happens, it need to get > copied. Perhaps you mean that the application should keep the > "original" and that the page-cache should get the "copy" (or, really, > perhaps just forget about the page existing at that point- we won't want > it again...). > > Would that be a way to go, perhaps? This does go back to the "make it > act like mmap, but not *be* mmap", but the idea would be: > open(..., O_ZEROCOPY_READ) > read() - Goes to PG's shared buffers, pagecache and PG share the page > page fault (PG writes to it) - pagecache forgets about the page > write() / fsync() - operate as normal Yep. 
Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)
On Tue, Jan 14, 2014 at 4:16 AM, David Rowley wrote: > On Tue, Jan 14, 2014 at 9:09 PM, David Rowley wrote: >> >> I think unless anyone has some objections I'm going to remove the inverse >> transition for SUM(numeric) and modify the documents to tell the user how to >> build their own FAST_SUM(numeric) using the built in functions to do it. I'm >> starting to think that playing around with resetting numeric scale will turn >> a possible 9.4 patch into a 9.5/10.0 patch. I see no reason why what's there >> so far, minus sum(numeric), can't go in... >> > > Of course its only now that I discover that this is not possible to do this: > > CREATE AGGREGATE fast_sum (numeric) > ( > stype = numeric, > sfunc = numeric_avg_accum, > invfunc = numeric_avg_accum_inv, > finalfunc = numeric_sum > ); > > because SUM(numeric) uses an internal type to store the transition state. > > hmmm, built-in fast_sum anyone? > Is there any simple way to limit these to only be used in the context of a > window? If so is it worth it? > Would we want fast_sum() for float too? Maybe these additional "fast" functions (one might also say "inaccurate") should live in an extension, in contrib. It strikes me that for numeric what you really need is to just tell the sum operation, whether through a parameter or otherwise, how many decimal places to show in the output. Obviously that's not a practical change for sum() itself, but if we're inventing new stuff it can be done. For floats, things are not so good. The data type is inexact by nature, and I think cases where you get substantially wrong answers will be common enough that people who attempt to use whatever we devise in this area will have sum trouble. *ducks* -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Why conf.d should be default, and auto.conf and recovery.conf should be in it
Hackers, ALTER SYSTEM SET has been committed and recovery.conf GUCs are being reviewed. I'm going to make a last case for conf.d in relation to these two patches before 9.4 goes out the door. In 9.3, we added support for a config directory (conf.d), but have it disabled by default. For tool authors, this makes conf.d useless since you never know, on any given installation, whether it will be present/enabled or not. While we don't want to prevent users from disabling it, conf.d only becomes useful if it's present by default. There's a simple reason why: if you want to write a tool which manages postgresql.conf, you don't want the user to have to manually edit postgresql.conf (and create a directory) in order to enable the tool. I'm particularly thinking about this in relation to the merger of recovery.conf and postgresql.conf. There are several tools already (RepMgr, OmniPITR, HandyRep, pgPool, etc.) which manage recovery.conf separately from postgresql.conf. These tools will want to continue managing recovery.conf as a separate file, even if it's /included in postgresql.conf now. If conf.d exists by default, and is enabled in postgresql.conf by default, this is easy: the tool just drops a recovery.conf file into conf.d. Changing file locations and variable names is a fairly simple exercise in backwards compatibility. If conf.d does NOT exist by default, things become complicated, and backwards compatibility for replication management tools becomes much harder. Yes, I'm also arguing that postgresql.auto.conf should go into conf.d. I said I'd bring that up again after ALTER SYSTEM SET was committed, and here it is. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
* Claudio Freire (klaussfre...@gmail.com) wrote: > But, still, the implementation is very similar to what postgres needs: > sharing a physical page for two distinct logical pages, efficiently, > with efficient copy-on-write. Agreed, except that KSM seems like it'd be slow/lazy about it and I'm guessing there's a reason the pagecache isn't included normally.. > So it'd be just a matter of removing that limitation regarding page > cache and shared pages. Any idea why that limitation is there? > If you asked me, I'd implement it as copy-on-write on the page cache > (not the user page). That ought to be low-overhead. Not entirely sure I'm following this- if it's a shared page, it doesn't matter who starts writing to it, as soon as that happens, it need to get copied. Perhaps you mean that the application should keep the "original" and that the page-cache should get the "copy" (or, really, perhaps just forget about the page existing at that point- we won't want it again...). Would that be a way to go, perhaps? This does go back to the "make it act like mmap, but not *be* mmap", but the idea would be: open(..., O_ZEROCOPY_READ) read() - Goes to PG's shared buffers, pagecache and PG share the page page fault (PG writes to it) - pagecache forgets about the page write() / fsync() - operate as normal The differences here from O_DIRECT are that the pagecache will keep the page while clean (absolutely valuable from PG's perspective- we might have to evict the page from shared buffers sooner than the kernel does), and the write()'s happen at the kernel's pace, allowing for write-combining, etc, until an fsync() happens, of course. This isn't the "big win" of dealing with I/O issues during checkpoints that we'd like to see, but it certainly feels like it'd be an improvement over the current double-buffering situation at least. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] tests for client programs
On 1/15/14, 1:46 AM, Erik Rijkers wrote: > There seems to be a dependency on IPC::Run > > I can install that, of course... but I suppose you want to make this work > without that. No, IPC::Run will be required. It looked like it was part of the default installation where I tested, but apparently not.
Re: [HACKERS] Turning recovery.conf into GUCs
On Wed, Jan 15, 2014 at 2:06 AM, Jaime Casanova wrote: > On Wed, Jan 15, 2014 at 2:00 AM, Jaime Casanova wrote: >> On Mon, Nov 18, 2013 at 12:27 PM, Andres Freund >> wrote: >> >>> * Maybe we should rename names like pause_at_recovery_target to >>> recovery_pause_at_target? Since we already force everyone to bother >>> changing their setup... >> >> i don't have a problem with this, anyone else? if no one speaks i will >> do what Andres suggests >> > > Actually Michael had objected to this idea but i forgot about that... > so i will wait for some consensus (my personal opinion is that > Michael's argument is a good one) I prefer the current name. It's more like the way English is spoken. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] drop duplicate buffers in OS
On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa wrote: > I created a patch that can drop duplicate buffers in the OS using a usage_count > algorithm. I have been developing this patch since last summer. This feature seems to > be a hot discussion topic, so I am submitting it sooner than I had scheduled. > > When usage_count is high in shared_buffers, buffers are hard to drop from > shared_buffers. However, these buffers aren't needed in the file cache, because > they aren't accessed by postgres there (postgres accesses shared_buffers). > So I created an algorithm that drops file-cache pages which have a high usage_count > in shared_buffers and are in a clean state in the OS. If the file-cache pages are clean > in the OS, executing posix_fadvise DONTNEED can free them from the file cache without > writing to physical disk. This algorithm will solve the double-buffering problem and > use memory more efficiently. > > I am testing the DBT-2 benchmark now... The thing about this is that our usage counts for shared_buffers don't really work right now; it's common for everything, or nearly everything, to have a usage count of 5. So I'm reluctant to rely on that for much of anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)
Review of patch 0002: - I think you should just regard ReplicationSlotCtlLock as protecting the "name" and "active" flags of every slot. ReplicationSlotCreate() would then not need to mess with the spinlocks at all, and ReplicationSlotAcquire and ReplicationSlotDrop get a bit simpler too I think. Functions like ReplicationSlotsComputeRequiredXmin() can acquire this lock in shared mode and only lock the slots that are actually active, instead of all of them. - If you address /* FIXME: apply sanity checking to slot name */, then I think that also addresses /* XXX: do we want to use truncate identifier instead? */. In other words, let's just error out if the name is too long. I'm not sure what other sanity checking is needed here; maybe just restrict it to 7-bit characters to avoid problems with encodings for individual databases varying. - ReplicationSlotAcquire probably needs to ignore slots that are not active. - ReplicationSlotAcquire should be tweaked so that the code that holds the spinlock is more self-contained. If you adopt the above-proposed recasting of ReplicationSlotCtlLock, then the part that holds the spinlock can probably look like this: SpinLockAcquire(&slot->mutex); was_active = slot->active; slot->active = true; SpinLockRelease(&slot->mutex), which looks quite a bit safer. - If there's a coding rule that slot->database can't be changed while the slot is active, then the check to make sure that the user isn't trying to bind to a slot with a mis-matching database could be done before the code described in the previous point, avoiding the need to go back and release the resource. - I think the critical section in ReplicationSlotDrop is bogus. If DeleteSlot() fails, we scarcely need to PANIC. The slot just isn't gone. - cancel_before_shmem_exit is only guaranteed to remove the most-recently-added callback. - Why does ReplicationSlotsComputeRequiredXmin() need to hold ProcArrayLock at all? 
- ReplicationSlotsComputeRequiredXmin scarcely needs to hold the spinlock while it does all of those gyrations. It can just acquire the spinlock, copy the three fields needed into local variables, and release the spinlock. The rest can be worked out afterwards. Similarly in ReplicationSlotsComputeRequiredLSN. - A comment in KillSlot wonders whether locking is required. I say yes. It's safe to take lwlocks and spinlocks during shmem exit, and failing to do so seems like a recipe for subtle corner-case bugs. - pg_get_replication_slots() wonders what permissions we require. I don't know that any special permissions are needed here; the data we're exposing doesn't appear to be sensitive. Unless I'm missing something? - PG_STAT_GET_LOGICAL_DECODING_SLOTS_COLS has a leftover "logical" in its name. - There seems to be no interface to acquire or release slots from either SQL or the replication protocol, nor any way for a client of this code to update its slot details. The value of catalog_xmin/data_xmin vs. effective_catalog_xmin/effective_data_xmin is left to the imagination. ...Robert
Re: [HACKERS]
On Thu, Jan 9, 2014 at 2:50 PM, Robert Haas wrote: > > On Tue, Jan 7, 2014 at 10:55 PM, Dilip kumar wrote: > > Below attached patch is implementing following todo item.. > > > > machine-readable pg_controldata? > > > > http://www.postgresql.org/message-id/4b901d73.8030...@agliodbs.com > > > > Possible approaches: > > > > 1. Implement as backend function and provide a view to user. > > I think this would be useful. > Hi, Joe Conway already created this feature [1] some time ago. I think this code can be reused to create an extension. Regards, [1] https://github.com/jconway/pg_controldata -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL >> Timbira: http://www.timbira.com.br >> Blog sobre TI: http://fabriziomello.blogspot.com >> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello >> Twitter: http://twitter.com/fabriziomello
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 2:52 PM, Tom Lane wrote: > Robert Haas writes: >> I think that the bottom line is that we're not likely to make massive >> changes to the way that we do block caching now. Even if some other >> scheme could work much better on Linux (and so far I'm unconvinced >> that any of the proposals made here would in fact work much better), >> we aim to be portable to Windows as well as other UNIX-like systems >> (BSD, Solaris, etc.). So using completely Linux-specific technology >> in an overhaul of our block cache seems to me to have no future. > > Unfortunately, I have to agree with this. Even if there were a way to > merge our internal buffers with the kernel's, it would surely be far > too invasive to coexist with buffer management that'd still work on > more traditional platforms. > > But we could add hint calls, or modify the I/O calls we use, and that > ought to be a reasonably localized change. That's what's pretty nice with the zero-copy read idea. It's almost transparent. You read to a page-aligned address, and it works. The only code change would be enabling zero-copy reads, which I'm not sure will be low-overhead enough to leave enabled by default.
Re: [HACKERS] Deprecations in authentication
Magnus Hagander writes: > One thing I noticed - in MSVC, the config parameter "krb5" (equivalent of > the removed --with-krb5) enabled *both* krb5 and gssapi, and there is no > separate config parameter for gssapi. Do we want to rename that one to > "gss", or do we want to keep it as "krb5"? Renaming it would break > otherwise working environments, but it's kind of weird to leave it... +1 for renaming --- anybody who's building with "krb5" and expecting to, you know, actually *get* krb5 would probably rather find out about this change at build time instead of down the road a ways. A compromise position would be to introduce a gss parameter while leaving krb5 in place as a deprecated (perhaps undocumented?) synonym for it. But I think that's basically confusing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
Robert Haas writes: > I think that the bottom line is that we're not likely to make massive > changes to the way that we do block caching now. Even if some other > scheme could work much better on Linux (and so far I'm unconvinced > that any of the proposals made here would in fact work much better), > we aim to be portable to Windows as well as other UNIX-like systems > (BSD, Solaris, etc.). So using completely Linux-specific technology > in an overhaul of our block cache seems to me to have no future. Unfortunately, I have to agree with this. Even if there were a way to merge our internal buffers with the kernel's, it would surely be far too invasive to coexist with buffer management that'd still work on more traditional platforms. But we could add hint calls, or modify the I/O calls we use, and that ought to be a reasonably localized change. > And the idea of being able to do an 8kB atomic write with OS support > so that we don't have to save full page images in our write-ahead log > to cover the "torn page" scenario seems very intriguing indeed. If > that worked well, it would be a *big* deal for us. +1. That would be a significant win, and trivial to implement, since we already have a way to switch off full-page images for people who trust their filesystems to do atomic writes. It's just that safe use of that switch isn't widely possible ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] shared memory message queues
On Wed, Jan 15, 2014 at 12:39 PM, Tom Lane wrote: > Robert Haas writes: >> On Tue, Jan 14, 2014 at 9:28 PM, Peter Eisentraut wrote: >>> Something is causing this new compiler warning: >>> >>> setup.c: In function ‘setup_dynamic_shared_memory’: >>> setup.c:102:3: error: format ‘%llu’ expects argument of type ‘long long >>> unsigned int’, but argument 2 has type ‘Size’ [-Werror=format=] > >> Oops. I've pushed a fix (at least, I hope it's a fix). > > Unless I'm misreading what you did, that will just move the problem to > a different set of platforms. size_t is not necessarily "long" (one > counterexample is Win64, IIRC). > > In the past we've usually resorted to explicitly casting. You could > silence the warning correctly with either of >ereport("... %lu ...", (unsigned long) shm_mq_minimum_size); >ereport("... " UINT64_FORMAT " ...", (uint64) shm_mq_minimum_size); > > However, the problem with the second one (which also existed in your > original code) is that it breaks translatability of the string, because > any given translator is only going to see the version that gets compiled > on their hardware. Unless there's a significant likelihood of > shm_mq_minimum_size exceeding 4GB, I'd go with the first one. I think the current value is about 56, and while the header might expand in the future, 4294967296 would be a heck of a lot of header data. So I've changed it as you suggest. I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Changeset Extraction v7.0 (was logical changeset generation)
This 0001 patch, to log running transactions more frequently, has been pending for a long time now, and I haven't heard any objections, so I've gone ahead and committed that part. ...Robert
Re: [HACKERS] shared memory message queues
Robert Haas writes: > On Tue, Jan 14, 2014 at 9:28 PM, Peter Eisentraut wrote: >> Something is causing this new compiler warning: >> >> setup.c: In function setup_dynamic_shared_memory: >> setup.c:102:3: error: format %llu expects argument of type long long >> unsigned int, but argument 2 has type Size [-Werror=format=] > Oops. I've pushed a fix (at least, I hope it's a fix). Unless I'm misreading what you did, that will just move the problem to a different set of platforms. size_t is not necessarily "long" (one counterexample is Win64, IIRC). In the past we've usually resorted to explicitly casting. You could silence the warning correctly with either of ereport("... %lu ...", (unsigned long) shm_mq_minimum_size); ereport("... " UINT64_FORMAT " ...", (uint64) shm_mq_minimum_size); However, the problem with the second one (which also existed in your original code) is that it breaks translatability of the string, because any given translator is only going to see the version that gets compiled on their hardware. Unless there's a significant likelihood of shm_mq_minimum_size exceeding 4GB, I'd go with the first one. If you really want to be able to print values >4GB in translatable messages, what you need to do is to print into a local string variable with UINT64_FORMAT, then use %s in the translatable string. There's examples of this in commands/sequence.c among other places. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 1:35 PM, Stephen Frost wrote: >> And there's a nice bingo. Had forgotten about KSM. KSM could help lots. >> >> I could try to see of madvising shared_buffers as mergeable helps. But >> this should be an automatic case of KSM - ie, when reading into a >> page-aligned address, the kernel should summarily apply KSM-style >> sharing without hinting. The current madvise interface puts the burden >> of figuring out what duplicates what on the kernel, but postgres >> already knows. > > I'm certainly curious as to if KSM could help here, but on Ubuntu 12.04 > with 3.5.0-23-generic, it's not doing anything with just PG running. > The page here: http://www.linux-kvm.org/page/KSM seems to indicate why: > > > KSM is a memory-saving de-duplication feature, that merges anonymous > (private) pages (not pagecache ones). > > > Looks like it won't merge between pagecache and private/application > memory? Or is it just that we're not madvise()'ing the shared buffers > region? I'd be happy to test doing that, if there's a chance it'll > actually work.. Yes, it's onlyl *intended* for merging private memory. But, still, the implementation is very similar to what postgres needs: sharing a physical page for two distinct logical pages, efficiently, with efficient copy-on-write. So it'd be just a matter of removing that limitation regarding page cache and shared pages. If you asked me, I'd implement it as copy-on-write on the page cache (not the user page). That ought to be low-overhead. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Tue, Jan 14, 2014 at 4:13 PM, Simon Riggs wrote: > On 8 January 2014 08:33, Simon Riggs wrote: >> VACUUM cleans up blocks, which is nice because it happens offline in a >> lazy manner. >> >> We also make SELECT clean up blocks as it goes. That is useful in OLTP >> workloads, but it means that large SQL queries and pg_dump effectively >> do much the same work as VACUUM, generating huge amounts of I/O and >> WAL on the master, the cost and annoyance of which is experienced >> directly by the user. That is avoided on standbys. >> >> Effects of that are that long running statements often run much longer >> than we want, increasing bloat as a result. It also produces wildly >> varying response times, depending upon extent of cleanup required. >> >> It is a simple task to make that behaviour optional on the master. >> >> I propose a USERSET parameter, prune_cost_limit (<---insert better name here) >> which will make the behaviour optional, default -1, in normal user >> processes. VACUUM will ignore this parameter and so its actions will >> never be deferred. >> >> In detail, this parameter would disable pruning for any scan larger >> than the cost limit. So large scans will disable the behaviour. The >> default, -1, means never disable pruning, which is the current >> behaviour. >> >> We track the number of pages dirtied by the current statement. When >> this reaches prune_cost_limit, we will apply these behaviours to all >> shared_buffer block accesses... >> >> (1) avoid running heap_page_prune_opt() >> >> (2) avoid dirtying the buffer for hints. (This is safe because the >> hinted changes will either be lost or will be part of the full page >> image when we make a logged-change). >> >> (i.e. doesn't apply to temp tables) >> >> For example, if we set prune_cost_limit = 4 this behaviour allows >> small index lookups via bitmapheapscan to continue to cleanup, while >> larger index and seq scans will avoid cleanup. 
>> >> >> >> There would be a postgresql.conf parameter prune_cost_limit, as well >> as a table level parameter that would prevent pruning except via >> VACUUM. >> >> This will help in these ways >> * Reduce write I/O from SELECTs and pg_dump - improving backups and BI >> queries >> * Allow finer grained control over Hot Standby conflicts >> * Potentially allow diagnostic inspection of older data via SeqScan >> >> Prototype patch shows this is possible and simple enough for 9.4. >> Major objections? Or should I polish up and submit? > > Patch attached, implemented to reduce writes by SELECTs only. I am still not sure whether we want this, but I think it's definitely an improvement over the previous version. Assorted comments: - Naming consistency seems to me to dictate that there should be more similarity between the reloption name (allow_buffer_cleanup) and the GUC (prune_page_dirty_limit). - The documentation doesn't describe the use case where suppressing cleanup on a per-table basis would be desirable, and I can't think of one, either. - There are a variety of ways to limit pruning; here, you've chosen to limit it to a particular number of pruning operations per executor invocation. But the flag is global, not part of the executor state, so a query that calls a PL/pgsql function during execution will reset the counter for the parent query also, which doesn't seem very principled. In a patch I posted a few years ago to set hint bits only sometimes, I settled on an algorithm where I dirtied the first 50 pages per scan and then skipped the next 950, or something like that. The idea was that you wanted the pages that did get dirtied to be clustered together to avoid random I/O; and also that you wanted table of arbitrary size to get hinted within a certain number of scans (e.g. 20). The limiting here is much more aggressive, so on large tables it will amount to basically no pruning at all. I dunno whether that's a good idea or not. 
But if the idea of making this an integer rather than a boolean is to allow some pruning to still happen while keeping it checked within reasonable bounds, I'm not sure it will succeed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
* Claudio Freire (klaussfre...@gmail.com) wrote: > Yes, that's basically zero-copy reads. > > It could be done. The kernel can remap the page to the physical page > holding the shared buffer and mark it read-only, then expire the > buffer and transfer ownership of the page if any page fault happens. > > But that incurs: > - Page faults, lots > - Hugely bloated mappings, unless KSM is somehow leveraged for this The page faults might be a problem but might be worth it. Bloated mappings sounds like a real issue though. > And there's a nice bingo. Had forgotten about KSM. KSM could help lots. > > I could try to see if madvising shared_buffers as mergeable helps. But > this should be an automatic case of KSM - ie, when reading into a > page-aligned address, the kernel should summarily apply KSM-style > sharing without hinting. The current madvise interface puts the burden > of figuring out what duplicates what on the kernel, but postgres > already knows. I'm certainly curious as to whether KSM could help here, but on Ubuntu 12.04 with 3.5.0-23-generic, it's not doing anything with just PG running. The page here: http://www.linux-kvm.org/page/KSM seems to indicate why: KSM is a memory-saving de-duplication feature, that merges anonymous (private) pages (not pagecache ones). Looks like it won't merge between pagecache and private/application memory? Or is it just that we're not madvise()'ing the shared buffers region? I'd be happy to test doing that, if there's a chance it'll actually work. Thanks, Stephen signature.asc Description: Digital signature
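For anyone wanting to try the madvise() experiment Stephen mentions, the hinting call itself is simple. Whether KSM will ever merge page-cache-backed buffers is exactly the open question in the thread; this sketch (Linux-specific, invented helper name) only marks an anonymous region as mergeable:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Minimal sketch: map a page-aligned anonymous region and hint it as
 * mergeable, so the ksmd daemon may deduplicate identical pages in it.
 * This does NOT demonstrate merging against the page cache, which KSM
 * (per the cited page) does not do. */
static void *
alloc_mergeable(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;

    /* MADV_MERGEABLE is advisory; it fails with EINVAL on kernels built
     * without CONFIG_KSM, so treat failure as non-fatal. */
    if (madvise(p, len, MADV_MERGEABLE) != 0)
        perror("madvise(MADV_MERGEABLE)");

    return p;
}
```

Even with the hint in place, ksmd only scans regions while /sys/kernel/mm/ksm/run is enabled, which matches Stephen's observation that nothing happens by default.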
Re: [HACKERS] Soften pg_[start|stop]_backup to allow them on a standby?
On 2014-01-15 11:19:52 -0500, Robert Haas wrote: > On Tue, Jan 14, 2014 at 7:54 AM, Andres Freund wrote: > > On 2014-01-14 12:31:09 +0900, Michael Paquier wrote: > >> Currently, pg_start_backup and pg_stop_backup cannot run on a standby > >> because it is not possible to write a backup_label file to disk, > >> because of the nature of a standby server preventing to write any data > >> in its PGDATA. Is this thought right? This is what the comments at the > >> top of do_pg_start_backup make me conclude. > > > > No, the actual reason is that a plain pg_stop_backup() writes WAL - > > which we can't do on a standby. The walsender command gets around this > > by storing the required data in the backup label itself, but that > > requires the label to be written after the basebackup actually finished > > which doesn't work for plain start/stop backup. > > This is true, but a better way to say it might be that when we fire up > postgres and point it at the backup, it needs to begin recovery at a > checkpoint; otherwise, pages torn by the backup process won't get > fixed. Maybe there's a way that pg_start_backup() could piggyback on > the most recent checkpoint rather than performing one itself; if so, > such a mode could be used on either the master or the standby (but > would require replaying more WAL, of course). That's what the walsender variant essentially already does for backups taken in recovery. What that does not solve is correctly setting the min recovery point though. I don't immediately see how we could compute that correctly without either a wal record or a backup label written at the end of recovery. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 10:53 AM, Mel Gorman wrote: > I realise that now and sorry for the noise. > > I later read the parts of the thread that covered the strict ordering > requirements and in a summary mail I split the requirements in two. In one, > there are dirty sticky pages that the kernel should not writeback unless > it has no other option or fsync is called. This may be suitable for large > temporary files that Postgres does not necessarily want to hit the platter > but also does not have strict ordering requirements for. The second is have > pages that are strictly kept dirty until the application syncs them. An > unbounded number of these pages would blow up but maybe bounds could be > placed on it. There are no solid conclusions on that part yet. I think that the bottom line is that we're not likely to make massive changes to the way that we do block caching now. Even if some other scheme could work much better on Linux (and so far I'm unconvinced that any of the proposals made here would in fact work much better), we aim to be portable to Windows as well as other UNIX-like systems (BSD, Solaris, etc.). So using completely Linux-specific technology in an overhaul of our block cache seems to me to have no future. On the other hand, giving the kernel hints about what we're doing that would enable it to be smarter seems to me to have a lot of potential. Ideas so far mentioned include: - Hint that we're going to do an fsync on file X at time Y, so that the kernel can schedule the write-out to complete right around that time. - Hint that a block is a good candidate for reclaim without actually purging it if there's no memory pressure. - Hint that a page we modify in our cache should be dropped from the kernel cache. - Hint that a page we write back to the operating system should be dropped from the kernel cache after the I/O completes. 
It's hard to say which of these ideas will work well without testing them, and the overhead of the extra system calls might be significant in some of those cases, but it seems a promising line of inquiry. And the idea of being able to do an 8kB atomic write with OS support so that we don't have to save full page images in our write-ahead log to cover the "torn page" scenario seems very intriguing indeed. If that worked well, it would be a *big* deal for us. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
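Two of the hints listed above have rough present-day approximations on Linux. The semantics are not what the thread is asking for — POSIX_FADV_DONTNEED may drop clean pages immediately rather than merely nominating them for reclaim — so this sketch (invented function name) only marks the nearest existing calls:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

/* Approximation of two hints: start asynchronous writeback of a file
 * range now (instead of letting a later fsync() do it all at once), and
 * tell the kernel the cached pages for that range are reclaim
 * candidates.  Both calls are advisory; errors are reported, not fatal. */
static void
hint_writeback_and_drop(int fd, off_t offset, off_t len)
{
    /* Kick off writeback without waiting for completion. */
    if (sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");

    /* Unlike the "candidate for reclaim without actually purging" idea
     * above, DONTNEED may evict the clean pages right away. */
    if (posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED) != 0)
        perror("posix_fadvise");
}
```

The "hint that we're going to fsync file X at time Y" idea has no existing equivalent; the closest pattern today is issuing SYNC_FILE_RANGE_WRITE incrementally and fsync()ing at the deadline.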
Re: [HACKERS] Soften pg_[start|stop]_backup to allow them on a standby?
On Tue, Jan 14, 2014 at 7:54 AM, Andres Freund wrote: > On 2014-01-14 12:31:09 +0900, Michael Paquier wrote: >> Currently, pg_start_backup and pg_stop_backup cannot run on a standby >> because it is not possible to write a backup_label file to disk, >> because of the nature of a standby server preventing to write any data >> in its PGDATA. Is this thought right? This is what the comments at the >> top of do_pg_start_backup make me conclude. > > No, the actual reason is that a plain pg_stop_backup() writes WAL - > which we can't do on a standby. The walsender command gets around this > by storing the required data in the backup label itself, but that > requires the label to be written after the basebackup actually finished > which doesn't work for plain start/stop backup. This is true, but a better way to say it might be that when we fire up postgres and point it at the backup, it needs to begin recovery at a checkpoint; otherwise, pages torn by the backup process won't get fixed. Maybe there's a way that pg_start_backup() could piggyback on the most recent checkpoint rather than performing one itself; if so, such a mode could be used on either the master or the standby (but would require replaying more WAL, of course). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] plpgsql.warn_shadow
On 1/15/14 3:09 PM, Pavel Stehule wrote: You first should to say, what is warning and why it is only warning and not error. Personally, I'm a huge fan of the computer telling me that I might have made a mistake. But on the other hand, I also really hate it when the computer gets in my way when I'm trying to test something quickly and making these mistakes on purpose. Warnings are really good for that: hard to ignore (YMMV) accidentally, but easy to spot when developing. As to how we would categorize these checks between warnings and errors.. I can't really answer that. I'm tempted to say "anything that is an error now is an error, any additional checks we might add are warnings", but that's effectively just freezing the definition at an arbitrary point in time. And why plpgsql warning processing should be different than general postgresql processing? What do you mean? We're adding extra checks on *top* of the normal "this is clearly an error" conditions. PostgreSQL in general doesn't really do that. Consider: SELECT * FROM foo WHERE fooid IN (SELECT fooid FROM bar); where the table "bar" doesn't have a column "fooid". That's a perfectly valid query, but it almost certainly doesn't do what you would want. Personally I'd like to see a WARNING here normally, but I've eventually learned to live with this caveat. I'm hoping that in PL/PgSQL we could at least solve some of the most annoying pitfalls. My objection is against too general option. Every option shoudl to do one clean thing. It looks to me like the GUC *is* doing only one thing. "list of warnings I want to see", or the shorthand "all" for convenience. Regards, Marko Tiikkaja -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 10:16:27AM -0500, Robert Haas wrote: > On Wed, Jan 15, 2014 at 4:44 AM, Mel Gorman wrote: > > That applies if the dirty pages are forced to be kept dirty. You call > > this pinned but pinned has special meaning so I would suggest calling it > > something like dirty-sticky pages. It could be the case that such hinting > > will have the pages excluded from dirty background writing but can still > > be cleaned if dirty limits are hit or if fsync is called. It's a hint, > > not a forced guarantee. > > > > It's still a hand grenade because if this is tracked on a per-page basis > > because of what happens if the process crashes? Those pages stay dirty > > potentially forever. An alternative would be to track this on a per-inode > > instead of per-page basis. The hint would only exist where there is an > > open fd for that inode. Treat it as a privileged call with a sysctl > > controlling how many dirty-sticky pages can exist in the system with the > > information presented during OOM kills and maybe it starts becoming a bit > > more manageable. Dirty-sticky pages are not guaranteed to stay dirty > > until userspace action, the kernel just stays away until there are no > > other sensible options. > > I think this discussion is vividly illustrating why this whole line of > inquiry is a pile of fail. If all the processes that have the file > open crash, the changes have to be *thrown away* not written to disk > whenever the kernel likes. > I realise that now and sorry for the noise. I later read the parts of the thread that covered the strict ordering requirements and in a summary mail I split the requirements in two. In one, there are dirty sticky pages that the kernel should not writeback unless it has no other option or fsync is called. This may be suitable for large temporary files that Postgres does not necessarily want to hit the platter but also does not have strict ordering requirements for. 
The second is have pages that are strictly kept dirty until the application syncs them. An unbounded number of these pages would blow up but maybe bounds could be placed on it. There are no solid conclusions on that part yet. -- Mel Gorman SUSE Labs -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] COPY and Volatile default expressions
On 16 April 2013 14:37, Simon Riggs wrote: > On 16 April 2013 13:57, Heikki Linnakangas wrote: > >> You still need to check the args, if the function is nextval, otherwise you >> incorrectly perform the optimization for something like >> "nextval(myvolatilefunc())". > > Guess so. At least its an easy change. > > Thanks for checking. Rebased v3 -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services copy_tuning_with_default_nextval.v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] shared memory message queues
On 2014-01-15 10:19:32 -0500, Robert Haas wrote: > On Tue, Jan 14, 2014 at 9:28 PM, Peter Eisentraut wrote: > > Something is causing this new compiler warning: > > > > setup.c: In function ‘setup_dynamic_shared_memory’: > > setup.c:102:3: error: format ‘%llu’ expects argument of type ‘long long > > unsigned int’, but argument 2 has type ‘Size’ [-Werror=format=] > > Oops. I've pushed a fix (at least, I hope it's a fix). I don't think that works on LLP64 platforms like windows - long is 32bit there. That's basically why I proposed the z modifier to elog to encapsulate this in some other thread... Till then you'll have to cast to a known size. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] shared memory message queues
On Tue, Jan 14, 2014 at 9:53 PM, Kevin Grittner wrote: > I'm not seeing that one but I am now getting these: > > setup.c: In function ‘test_shm_mq_setup’: > setup.c:65:25: warning: ‘outq’ may be used uninitialized in this function > [-Wmaybe-uninitialized] > setup.c:66:24: warning: ‘inq’ may be used uninitialized in this function > [-Wmaybe-uninitialized] That warning is bogus and I don't get it, but I'll add an initialization to placate the compiler, which is being a bit too smart for its own good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] shared memory message queues
On Tue, Jan 14, 2014 at 9:28 PM, Peter Eisentraut wrote: > Something is causing this new compiler warning: > > setup.c: In function ‘setup_dynamic_shared_memory’: > setup.c:102:3: error: format ‘%llu’ expects argument of type ‘long long > unsigned int’, but argument 2 has type ‘Size’ [-Werror=format=] Oops. I've pushed a fix (at least, I hope it's a fix). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [Lsf-pc] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 4:44 AM, Mel Gorman wrote: > That applies if the dirty pages are forced to be kept dirty. You call > this pinned but pinned has special meaning so I would suggest calling it > something like dirty-sticky pages. It could be the case that such hinting > will have the pages excluded from dirty background writing but can still > be cleaned if dirty limits are hit or if fsync is called. It's a hint, > not a forced guarantee. > > It's still a hand grenade if this is tracked on a per-page basis, > because what happens if the process crashes? Those pages stay dirty > potentially forever. An alternative would be to track this on a per-inode > instead of per-page basis. The hint would only exist where there is an > open fd for that inode. Treat it as a privileged call with a sysctl > controlling how many dirty-sticky pages can exist in the system with the > information presented during OOM kills and maybe it starts becoming a bit > more manageable. Dirty-sticky pages are not guaranteed to stay dirty > until userspace action, the kernel just stays away until there are no > other sensible options. I think this discussion is vividly illustrating why this whole line of inquiry is a pile of fail. If all the processes that have the file open crash, the changes have to be *thrown away* not written to disk whenever the kernel likes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed, Jan 15, 2014 at 4:35 AM, Jan Kara wrote: > Filesystems could in theory provide facility like atomic write (at least up > to a certain size say in MB range) but it's not so easy and when there are > no strong usecases fs people are reluctant to make their code more complex > unnecessarily. OTOH without widespread atomic write support I understand > application developers have similar stance. So it's kind of chicken and egg > problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place > due to its data=journal mode so if someone on the PostgreSQL side wanted to > research on this, knitting some experimental ext4 patches should be doable. Atomic 8kB writes would improve performance for us quite a lot. Full page writes to WAL are very expensive. I don't remember what percentage of write-ahead log traffic that accounts for, but it's not small. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
Heikki Linnakangas writes: > On 01/15/2014 07:50 AM, Dave Chinner wrote: >> FWIW [and I know you're probably sick of hearing this by now], but >> the blk-io throttling works almost perfectly with applications that >> use direct IO. > For checkpoint writes, direct I/O actually would be reasonable. > Bypassing the OS cache is a good thing in that case - we don't want the > written pages to evict other pages from the OS cache, as we already have > them in the PostgreSQL buffer cache. But in exchange for that, we'd have to deal with selecting an order to write pages that's appropriate depending on the filesystem layout, other things happening in the system, etc etc. We don't want to build an I/O scheduler, IMO, but we'd have to. > Writing one page at a time with O_DIRECT from a single process might be > quite slow, so we'd probably need to use writev() or asynchronous I/O to > work around that. Yeah, and if the system has multiple spindles, we'd need to be issuing multiple O_DIRECT writes concurrently, no? What we'd really like for checkpointing is to hand the kernel a boatload (several GB) of dirty pages and say "how about you push all this to disk over the next few minutes, in whatever way seems optimal given the storage hardware and system situation. Let us know when you're done." Right now, because there's no way to negotiate such behavior, we're reduced to having to dribble out the pages (in what's very likely a non-optimal order) and hope that the kernel is neither too lazy nor too aggressive about cleaning dirty pages in its caches. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Portal suddenly disappears?
>> I'm confused too. Surely there are lots of ways a portal could get >> dropped, but most of them would have left traces in the postmaster log, >> I'd think, since you evidently have log_statement == LOGSTMT_ALL. >> What was log_min_messages set to? > > I don't have it at this moment. I requested the user to give the > log_min_messages setting. It turned out that it is set to default: notice. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 01/15/2014 07:50 AM, Dave Chinner wrote: However, the first problem is dealing with the IO storm problem on fsync. Then we can measure the effect of spreading those writes out in time and determine what triggers read starvations (if they are apparent). The we can look at whether IO scheduling tweaks or whether blk-io throttling solves those problems. Or whether something else needs to be done to make it work in environments where problems are manifesting. FWIW [and I know you're probably sick of hearing this by now], but the blk-io throttling works almost perfectly with applications that use direct IO. For checkpoint writes, direct I/O actually would be reasonable. Bypassing the OS cache is a good thing in that case - we don't want the written pages to evict other pages from the OS cache, as we already have them in the PostgreSQL buffer cache. Writing one page at a time with O_DIRECT from a single process might be quite slow, so we'd probably need to use writev() or asynchronous I/O to work around that. We'd still need to issue an fsync() to flush any already-written pages from the OS cache to disk, though. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
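A minimal sketch of the batched direct-I/O write Heikki suggests, assuming Linux and a hypothetical helper name. Note the alignment requirements O_DIRECT imposes on buffers, offsets, and lengths, and the buffered fallback for filesystems (such as tmpfs) that reject direct I/O:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define ALIGN 4096              /* must cover the device's logical block size */

/* Batch several page-sized, ALIGN-aligned buffers into one pwritev()
 * call on an O_DIRECT file descriptor, so a single process can push
 * many pages per syscall instead of one write() per page.  If the
 * filesystem rejects O_DIRECT (EINVAL), fall back to buffered I/O. */
static ssize_t
write_pages_direct(const char *path, char *const pages[], int npages)
{
    struct iovec iov[16];
    ssize_t n;
    int fd;

    if (npages < 1 || npages > 16)
        return -1;

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_WRONLY | O_CREAT, 0600);  /* buffered fallback */
    if (fd < 0)
        return -1;

    for (int i = 0; i < npages; i++)
    {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = ALIGN;
    }

    n = pwritev(fd, iov, npages, 0);
    close(fd);
    return n;
}
```

Even with batching, the completed write only reaches the drive's platter once the device cache is flushed, so the fsync() (or O_DSYNC) Heikki mentions is still needed for durability on the direct-I/O path too.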
Re: [HACKERS] ALTER TABLE lock strength reduction patch is unsafe
On 1 August 2013 01:53, Noah Misch wrote: > On Mon, Jul 15, 2013 at 05:50:40PM +0100, Simon Riggs wrote: >> On 15 July 2013 15:06, Robert Haas wrote: >> > Generally, the question on the table is: to what extent do MVCC >> > catalog scans make the world safe for concurrent DDL, or put >> > negatively, what hazards remain? >> >> On Tom's test I've been unable to find a single problem. >> >> > Noah discovered an interesting one recently: apparently, the relcache >> > machinery has some logic in it that depends on the use of >> > AccessExclusiveLock in subtle ways. I'm going to attempt to explain >> > it, and maybe he can jump in and explain it better. Essentially, the >> > problem is that when a relcache reload occurs, certain data structures >> > (like the tuple descriptor, but there are others) are compared against >> > the old version of the same data structure. If there are no changes, >> > we do nothing; else, we free the old one and install the new one. The >> > reason why we don't free the old one and install the new one >> > unconditionally is because other parts of the backend might have >> > pointers to the old data structure, so just replacing it all the time >> > would result in crashes. Since DDL requires AccessExclusiveLock + >> > CheckTableNotInUse(), any actual change to the data structure will >> > happen when we haven't got any pointers anyway. > >> If you look at this as a generalised problem you probably can find >> some issues, but that doesn't mean they apply in the specific cases >> we're addressing. >> >> The lock reductions we are discussing in all cases have nothing at all >> to do with structure and only relate to various options. Except in the >> case of constraints, though even there I see no issues as yet. > > I was able to distill the above hypothesis into an actual crash with > reduce_lock_levels.v13.patch. Test recipe: > > 1. Build with --enable-cassert and with -DCATCACHE_FORCE_RELEASE=1. 
An > AcceptInvalidationMessages() will then happen at nearly every syscache lookup, > making it far easier to hit a problem of this sort. > > 2. Run these commands as setup: > create table root (c int); > create table part (check (c > 0), check (c > 0)) inherits (root); > > 3. Attach a debugger to your session and set a breakpoint at plancat.c:660 (as > of commit 16f38f72ab2b8a3b2d45ba727d213bb3cea4). > > 4. Run this in your session; the breakpoint will trip: > select * from root where c = -1; > > 5. Start another session and run: > alter table part add check (c > 0); > > 6. Exit the debugger to release the first session. It will crash. > > plancache.c:657 stashes a pointer to memory belonging to the rd_att of a > relcache entry. It then proceeds to call eval_const_expressions(), which > performs a syscache lookup in its simplify_function() subroutine. Under > CATCACHE_FORCE_RELEASE, the syscache lookup reliably prompts an > AcceptInvalidationMessages(). The ensuing RelationClearRelation() against > "part" notices the new constraint, decides keep_tupdesc = false, and frees the > existing tupdesc. plancache.c is now left holding a pointer into freed > memory. The next loop iteration will crash when it dereferences a pointer > stored within that freed memory at plancat.c:671. > > > A remediation strategy that seemed attractive when I last contemplated this > problem is to repoint rd_att immediately but arrange to free the obsolete > TupleDesc in AtEOXact_RelationCache(). v15 to fix the above problem. Looking at other potential problems around plancache but nothing found as yet. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services reduce_lock_levels.v15.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [bug fix] PostgreSQL fails to start on Windows if it crashes after tablespace creation
From: "Asif Naeem" As you have followed the destroy_tablespace_directories() function, is there any specific reason not to use the same logic to detect the type of the file/link, i.e. “(lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))”? It also seems to have a more appropriate error message, i.e. Thanks for reviewing and testing the patch. Yes, at first I did what you mentioned, but modified the patch according to some advice in the mail thread. During redo, create_tablespace_directories() needs to handle the case where $PGDATA/pg_tblspc/xxx is not a symlink but a directory, even on UNIX/Linux. Please see TablespaceCreateDbspace(). destroy_tablespace_directories() doesn't have to handle such a situation. Regards MauMau -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Linux kernel impact on PostgreSQL performance (summary v1 2014-1-15)
> One assumption would be that Postgres is perfectly happy with the current > kernel behaviour in which case our discussion here is done. It has been demonstrated that this statement was farcical. The thread is massive just from interaction with the LSF/MM program committee. I'm hoping that there will be Postgres representation at LSF/MM this year to bring the issues to a wider audience. I expect that LSF/MM can only commit to one person attending the whole summit due to limited seats but we could be more flexible for the Postgres track itself so informal meetings can be arranged for the evenings and at collab summit. In case this gets forgotten, this mail describes what has already been discussed and some of the proposals. Some stuff I do not describe because it was superseded by later discussion. If I missed something important, misinterpreted or simply screwed up then shout and I'll update this. I'd rather none of this gets lost even if it takes months or years to address it all. On testing of modern kernels Josh Berkus claims that most people are using Postgres with 2.6.19 and consequently there may be poor awareness of recent kernel developments. This is a disturbingly large window of opportunity for problems to have been introduced. It begs the question of what sort of penetration modern distributions shipping Postgres have. More information on why older kernels dominate in Postgres installations would be nice. Postgres bug reports and LKML --- It is claimed that LKML does not welcome bug reports but it's less clear what the basis of this claim is. Is it because the reports are ignored? A possible explanation is that they are simply getting lost in the LKML noise and there would be better luck if the bug report was cc'd to a specific subsystem list. Another explanation is that there is not enough data available to debug the problem. The worst explanation is that to date the problem has not been fixable but the details of this have been lost and are now unknown. 
Is it possible that some of these bug reports can be refreshed so at least there is a chance they get addressed? Apparently there were changes to the reclaim algorithms that crippled performance without any sysctls. The problem may be compounded by the introduction of adaptive replacement cache in the shape of the thrash detection patches currently being reviewed. Postgres investigated the use of ARC in the past and ultimately abandoned it. Details are in the archives (http://www.Postgres.org/search/?m=1&q=arc&l=1&d=-1&s=r). I have not read them, just noting they exist for future reference. Sysctls to control VM behaviour are not popular as such tuning parameters are often used as an excuse to not properly fix the problem. Would it be possible to describe a test case that shows 2.6.19 performing well and a modern kernel failing? That would give the VM people a concrete basis to work from to either fix the problem or identify exactly what sysctls are required to make this work. I am confident that any bug related to VM reclaim in this area has been lost. At least, I recall no instances of it being discussed on linux-mm and it has not featured on LSF/MM during the last years. IO Scheduling - Kevin Grittner has stated that it is known that the DEADLINE and NOOP schedulers perform better than any alternatives for most database loads. It would be desirable to quantify this for some test case and see whether the default scheduler can cope in some way. The deadline scheduler makes sense to a large extent though. Postgres is sensitive to large latencies due to IO write spikes. It is at least plausible that deadline would give more deterministic behaviour for parallel reads in the presence of large writes assuming there were not ordering problems between the reads/writes and the underlying filesystem. For reference, these IO spikes can be massive. 
If the shared buffer is completely dirtied in a short space of time then it could be 20-25% of RAM being dirtied and writeback required in typical configurations. There have been cases where it was worked around by limiting the size of the shared buffer to a small enough size so that it can be written back quickly. There are other tuning options available such as altering when dirty background writing starts within the kernel but that will not help if the dirtying happens in a very short space of time. Dave Chinner described the considerations as follows: There's no absolute rule here, but the threshold for background writeback needs to consider the amount of dirty data being generated, the rate at which it can be retired and the checkpoint period the application is configured with. i.e. it needs to be slow enough to not cause serious read IO perturbations, but still fast enough that it avoids peaks at synchronisation points. And most importantly, it needs to be fast enough that it
Re: [HACKERS] plpgsql.warn_shadow
2014/1/15 Marko Tiikkaja > On 1/15/14 2:27 PM, Pavel Stehule wrote: > >> 2014/1/15 Marko Tiikkaja >> >>> Yeah, me neither, it's just something that needs to be communicated very >>> clearly. So probably just a list plpgsql.warnings would be the most >>> appropriate then. >>> >>> >> I am thinking so the name is not good. Changing handling warnings is messy >> - minimally in Postgres, where warnings and errors are different >> creatures. >> >> what about >> >> plpgsql.enhanced_checks = (none | warning | error) >> > > You crammed several suggestions into one here: > > 1) You're advocating the ability to turn warnings into errors. This has > been met with some resistance. I think it's a useful feature, but I would > be happy with just having warnings available. > 2) This syntax doesn't allow the user to specify a list of warnings to > enable. Which might be fine, I guess. I imagine the normal approach would > be to turn all warnings on anyway, and possibly fine-tune with per-function > directives if some functions do dangerous things on purpose. > 3) You want to change the name to "enhanced_checks". I still think the > main feature is about displaying warnings to the user. I don't > particularly like this suggestion. > You first should to say, what is warning and why it is only warning and not error. And why plpgsql warning processing should be different than general postgresql processing? My objection is against too general option. Every option shoudl to do one clean thing. Regards Pavel > > > Regards, > Marko Tiikkaja >
Re: [HACKERS] plpgsql.warn_shadow
On 1/15/14 2:27 PM, Pavel Stehule wrote: 2014/1/15 Marko Tiikkaja Yeah, me neither, it's just something that needs to be communicated very clearly. So probably just a list plpgsql.warnings would be the most appropriate then. I don't think the name is good. Changing how warnings are handled is messy - at least in Postgres, where warnings and errors are different creatures. what about plpgsql.enhanced_checks = (none | warning | error) You crammed several suggestions into one here: 1) You're advocating the ability to turn warnings into errors. This has been met with some resistance. I think it's a useful feature, but I would be happy with just having warnings available. 2) This syntax doesn't allow the user to specify a list of warnings to enable. Which might be fine, I guess. I imagine the normal approach would be to turn all warnings on anyway, and possibly fine-tune with per-function directives if some functions do dangerous things on purpose. 3) You want to change the name to "enhanced_checks". I still think the main feature is about displaying warnings to the user. I don't particularly like this suggestion. Regards, Marko Tiikkaja -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 5:23 PM, Dave Chinner wrote: > By default, background writeback doesn't start until 10% of memory > is dirtied, and on your machine that's 25GB of RAM. That's way too > high for your workload. > > It appears to me that we are seeing large memory machines much more > commonly in data centers - a couple of years ago 256GB RAM was only > seen in supercomputers. Hence machines of this size are moving from > "tweaking settings for supercomputers is OK" class to "tweaking > settings for enterprise servers is not OK" > > Perhaps what we need to do is deprecate dirty_ratio and > dirty_background_ratio as the default values and move to the byte > based values as the defaults and cap them appropriately. e.g. > 10/20% of RAM for small machines down to a couple of GB for large > machines I think that's right. In our case we know we're going to call fsync() eventually and that's going to produce a torrent of I/O. If that torrent fits in downstream caches or can be satisfied quickly without disrupting the rest of the system too much, then life is good. But the downstream caches don't typically grow proportionately to the size of system memory. Maybe a machine with 16GB has 1GB of battery-backed write cache, but it doesn't follow that a 256GB machine has 16GB of battery-backed write cache. > Essentially, changing dirty_background_bytes, dirty_bytes and > dirty_expire_centisecs to be much smaller should make the kernel > start writeback much sooner and so you shouldn't have to limit the > amount of buffers the application has to prevent major fsync > triggered stalls... I think this has been tried with some success, but I don't know the details. I think the bytes values are clearly more useful than the percentages, because you can set them smaller and with better granularity. One thought that occurs to me is that it might be useful to have PostgreSQL tell the system when we expect to perform an fsync. Imagine fsync_is_coming(int fd, time_t). 
We know long in advance (minutes) when we're gonna do it, so in some sense what we'd like to tell the kernel is: we're not in a hurry to get this data on disk right now, but when the indicated time arrives, we are going to do fsyncs of a bunch of files in rapid succession, so please arrange to flush the data as close to that time as possible (to maximize write-combining) while still finishing by that time (so that the fsyncs are fast and more importantly so that they don't cause a system-wide stall). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 4:23 PM, James Bottomley wrote: > Yes, that's what I was thinking: it's a cache. About how many files > comprise this cache? Are you thinking it's too difficult for every > process to map the files? No, I'm thinking that would throw cache coherency out the window. Separate mappings are all well and good until somebody decides to modify the page, but after that point the database processes need to see the modified version of the page (which is, further, hedged about with locks) yet the operating system MUST NOT see the modified version of the page until the write-ahead log entry for the page modification has been flushed to disk. There's really no way to do that without having our own private cache. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On 01/15/2014 02:01 PM, Jan Kara wrote: > On Wed 15-01-14 12:16:50, Hannu Krosing wrote: >> On 01/14/2014 06:12 PM, Robert Haas wrote: >>> This would be pretty similar to copy-on-write, except >>> without the copying. It would just be >>> forget-from-the-buffer-pool-on-write. >> +1 >> >> A version of this could probably already be implemented using MADV_DONTNEED >> and MADV_WILLNEED >> >> That is, just after reading the page in, use MADV_DONTNEED on it. When >> evicting >> a clean page, check that it is still in cache and if it is, then >> MADV_WILLNEED it. >> >> Another nice thing to do would be dynamically adjusting kernel >> dirty_background_ratio >> and other related knobs in real time based on how many buffers are dirty >> inside postgresql. >> Maybe in background writer. >> >> Question to LKML folks - will kernel react well to frequent changes to >> /proc/sys/vm/dirty_* ? >> How frequent can they be (every few seconds? every second? 100Hz ?) > So the question is what do you mean by 'react'. We check whether we > should start background writeback every dirty_writeback_centisecs (5s). We > will also check whether we didn't exceed the background dirty limit (and > wake writeback thread) when dirtying pages. However this check happens once > per several dirtied MB (unless we are close to dirty_bytes). > > When writeback is running we check roughly once per second (the logic is > more complex there but I don't think explaining details would be useful > here) whether we are below dirty_background_bytes and stop writeback in > that case. > > So changing dirty_background_bytes every few seconds should work > reasonably, once a second is pushing it and 100 Hz - no way. But I'd also > note that you have conflicting requirements on the kernel writeback. 
On one > hand you want checkpoint data to steadily trickle to disk (well, trickle > isn't exactly the proper word since if you need to checkpoint 16 GB every 5 > minutes then you need a steady throughput of ~50 MB/s just for > checkpointing) so you want to set dirty_background_bytes low, on the other > hand you don't want temporary files to get to disk so you want to set > dirty_background_bytes high. Is it possible to have more fine-grained control over writeback, like configuring dirty_background_bytes per file system / device (or even a file or a group of files) ? If not, then how hard would it be to provide this ? This is a bit backwards from a keeping-the-cache-clean perspective, but would help a lot with hinting the writer that a big sync is coming. > And also that changes of > dirty_background_bytes probably will not take into account other events > happening on the system (maybe a DB backup is running...). So I'm somewhat > skeptical you will be able to tune dirty_background_bytes frequently in a > useful way. > Cheers -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
Re: [HACKERS] trgm regex index peculiarity
On Fri, Jun 21, 2013 at 5:39 PM, Erik Rijkers wrote: > On Fri, June 21, 2013 15:11, Alexander Korotkov wrote: > > On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers wrote: > > > >> On Fri, June 21, 2013 05:25, Tom Lane wrote: > >> > "Erik Rijkers" writes: > >> >> In a 112 MB test table (containing randomly generated text) with a trgm > >> index (gin_trgm_ops), I consistently get these > >> >> timings: > >> >> select txt from azjunk6 where txt ~ '^abcd'; > >> >>130 ms > >> >> select txt from azjunk6 > >> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; > >> >>3 ms > >> > > > > > Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'. > > However, trigram '__a' is much more frequent than '_ab', which in turn is > > much more frequent than 'abc' and 'bcd'. Omitting the ^ leads to omitting > > '__a' and '_ab', and that gives the significant speedup. > > > [trgm_regex_optimize.1.patch ] > > Yes, that fixes the problem, thanks. > Revised version of the patch with necessary comments. -- With best regards, Alexander Korotkov. trgm-regex-optimize.2.patch Description: Binary data
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed 15-01-14 12:16:50, Hannu Krosing wrote: > On 01/14/2014 06:12 PM, Robert Haas wrote: > > This would be pretty similar to copy-on-write, except > > without the copying. It would just be > > forget-from-the-buffer-pool-on-write. > > +1 > > A version of this could probably already be implemented using MADV_DONTNEED > and MADV_WILLNEED > > That is, just after reading the page in, use MADV_DONTNEED on it. When > evicting > a clean page, check that it is still in cache and if it is, then > MADV_WILLNEED it. > > Another nice thing to do would be dynamically adjusting kernel > dirty_background_ratio > and other related knobs in real time based on how many buffers are dirty > inside postgresql. > Maybe in background writer. > > Question to LKML folks - will kernel react well to frequent changes to > /proc/sys/vm/dirty_* ? > How frequent can they be (every few seconds? every second? 100Hz ?) So the question is what do you mean by 'react'. We check whether we should start background writeback every dirty_writeback_centisecs (5s). We will also check whether we didn't exceed the background dirty limit (and wake writeback thread) when dirtying pages. However this check happens once per several dirtied MB (unless we are close to dirty_bytes). When writeback is running we check roughly once per second (the logic is more complex there but I don't think explaining details would be useful here) whether we are below dirty_background_bytes and stop writeback in that case. So changing dirty_background_bytes every few seconds should work reasonably, once a second is pushing it and 100 Hz - no way. But I'd also note that you have conflicting requirements on the kernel writeback. 
On one hand you want checkpoint data to steadily trickle to disk (well, trickle isn't exactly the proper word since if you need to checkpoint 16 GB every 5 minutes then you need a steady throughput of ~50 MB/s just for checkpointing) so you want to set dirty_background_bytes low, on the other hand you don't want temporary files to get to disk so you want to set dirty_background_bytes high. And also that changes of dirty_background_bytes probably will not take into account other events happening on the system (maybe a DB backup is running...). So I'm somewhat skeptical you will be able to tune dirty_background_bytes frequently in a useful way. Honza -- Jan Kara SUSE Labs, CR
Re: [HACKERS] plpgsql.warn_shadow
2014/1/15 Marko Tiikkaja > On 1/15/14 2:00 PM, Florian Pflug wrote: > >> On Jan15, 2014, at 13:32 , Marko Tiikkaja wrote: >> >>> On 1/15/14 1:23 PM, Florian Pflug wrote: >>> The fact that it's named plpgsql.warnings already clearly documents that this only affects plpgsql. But whether a particular warning is emitted during compilation or during execution is largely irrelevant, I think. For example, if we called this compiler_warning, we couldn't add a warning which triggers when SELECT .. INTO ignores excessive rows. >>> >>> There is the fact that something being a "compiler warning" gives you an >>> idea on its effects on performance. But maybe that would be better >>> described in the documentation (perhaps even more accurately). >>> >>> I like the idea of warning about SELECT .. INTO, though, but that one >>> could have a non-negligible performance penalty during execution. >>> >> >> I'm not overly concerned about that. I imagine people would usually enable >> warnings during development, not production. >> > > Yeah, me neither, it's just something that needs to be communicated very > clearly. So probably just a list plpgsql.warnings would be the most > appropriate then. > I don't think the name is good. Changing how warnings are handled is messy - at least in Postgres, where warnings and errors are different creatures. what about plpgsql.enhanced_checks = (none | warning | error) > > > Regards, > Marko Tiikkaja >
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Wed 15-01-14 10:27:26, Heikki Linnakangas wrote: > On 01/15/2014 06:01 AM, Jim Nasby wrote: > >For the sake of completeness... it's theoretically silly that Postgres > >is doing all this stuff with WAL when the filesystem is doing something > >very similar with it's journal. And an SSD drive (and next generation > >spinning rust) is doing the same thing *again* in it's own journal. > > > >If all 3 communities (or even just 2 of them!) could agree on the > >necessary interface a tremendous amount of this duplicated technology > >could be eliminated. > > > >That said, I rather doubt the Postgres community would go this route, > >not so much because of the presumably massive changes needed, but more > >because our community is not a fan of restricting our users to things > >like "Thou shalt use a journaled FS or risk all thy data!" > > The WAL is also used for continuous archiving and replication, not > just crash recovery. We could skip full-page-writes, though, if we > knew that the underlying filesystem/storage is guaranteeing that a > write() is atomic. > > It might be useful for PostgreSQL to somehow tell the filesystem that > we're taking care of WAL-logging, so that the filesystem doesn't > need to. Well, a journalling fs generally cares about its metadata consistency. We have much weaker guarantees regarding file data because those guarantees come at a cost most people don't want to pay. Filesystems could in theory provide a facility like atomic write (at least up to a certain size, say in the MB range) but it's not so easy, and when there are no strong use cases fs people are reluctant to make their code more complex unnecessarily. OTOH without widespread atomic write support I understand application developers have a similar stance. So it's kind of a chicken and egg problem. BTW, e.g. 
ext3/4 has quite a bit of the infrastructure in place due to its data=journal mode so if someone on the PostgreSQL side wanted to research on this, knitting some experimental ext4 patches should be doable. Honza -- Jan Kara SUSE Labs, CR
Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance
On Tue, Jan 14, 2014 at 09:54:20PM -0600, Jim Nasby wrote: > On 1/14/14, 3:41 PM, Dave Chinner wrote: > >On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote: > >>On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman > >>wrote: Whether the problem is with the system call or the > >>programmer is harder to determine. I think the problem is in > >>part that it's not exactly clear when we should call it. So > >>suppose we want to do a checkpoint. What we used to do a long > >>time ago is write everything, and then fsync it all, and then > >>call it good. But that produced horrible I/O storms. So what > >>we do now is do the writes over a period of time, with sleeps in > >>between, and then fsync it all at the end, hoping that the > >>kernel will write some of it before the fsyncs arrive so that we > >>don't get a huge I/O spike. And that sorta works, and it's > >>definitely better than doing it all at full speed, but it's > >>pretty imprecise. If the kernel doesn't write enough of the > >>data out in advance, then there's still a huge I/O storm when we > >>do the fsyncs and everything grinds to a halt. If it writes out > >>more data than needed in advance, it increases the total number > >>of physical writes because we get less write-combining, and that > >>hurts performance, too. > > I think there's a pretty important bit that Robert didn't mention: > we have a specific *time* target for when we want all the fsync's > to complete. People that have problems here tend to tune > checkpoints to complete every 5-15 minutes, and they want the > write traffic for the checkpoint spread out over 90% of that time > interval. To put it another way, fsync's should be done when 90% > of the time to the next checkpoint hits, but preferably not a lot > before then. I think that is pretty much understood. 
I don't recall anyone mentioning a typical checkpoint period, though, so knowing the typical timeframe of IO storms and how much data is typically written in a checkpoint helps us understand the scale of the problem. > >It sounds to me like you want the kernel to start background > >writeback earlier so that it doesn't build up as much dirty data > >before you require a flush. There are several ways to do this by > >tweaking writeback knobs. The simplest is probably just to set > >/proc/sys/vm/dirty_background_bytes to an appropriate threshold > >(say 50MB) and dirty_expire_centisecs to a few seconds so that > >background writeback starts and walks all dirty inodes almost > >immediately. This will keep a steady stream of low level > >background IO going, and fsync should then not take very long. > > Except that still won't throttle writes, right? That's the big > issue here: our users often can't tolerate big spikes in IO > latency. They want user requests to always happen within a > specific amount of time. Right, but that's a different problem and one that io scheduling tweaks can have a major effect on. e.g. the deadline scheduler should be able to provide a maximum upper bound on read IO latency even while writes are in progress, though how successful it is depends on the nature of the write load and the architecture of the underlying storage. However, the first problem is dealing with the IO storm problem on fsync. Then we can measure the effect of spreading those writes out in time and determine what triggers read starvations (if they are apparent). Then we can look at whether IO scheduling tweaks or whether blk-io throttling solves those problems. Or whether something else needs to be done to make it work in environments where problems are manifesting. FWIW [and I know you're probably sick of hearing this by now], but the blk-io throttling works almost perfectly with applications that use direct IO. 
> So while delaying writes potentially reduces the total amount of > data you're writing, users that run into problems here ultimately > care more about ensuring that their foreground IO completes in a > timely fashion. Understood. Applications that crunch randomly through large data sets are almost always read IO latency bound. > >Fundamentally, though, we need bug reports from people seeing > >these problems when they see them so we can diagnose them on > >their systems. Trying to discuss/diagnose these problems without > >knowing anything about the storage, the kernel version, writeback > >thresholds, etc really doesn't work because we can't easily > >determine a root cause. > > So is lsf...@linux-foundation.org the best way to accomplish that? No. That is just the list for organising the LFSMM summit. ;) For general pagecache and writeback issues, discussions, etc, linux-fsde...@vger.kernel.org is the list to use. LKML simply has too much noise to be useful these days, so I'd avoid it. Otherwise the filesystem specific lists are a good place to get help for specific problems (e.g. linux-e...@vger.kernel.org and x...@oss.sgi.com). We tend to cross-post to other relevant lists as tria
[HACKERS] What's the condition of bug "PANIC: WAL contains references to invalid pages"?
Hello, Please tell me a bit about the following bug which has just been solved. I hope this is exactly the problem that has been annoying us for a year. Hot standby 9.2.6 -> 9.2.6 PANIC: WAL contains references to invalid pages http://www.postgresql.org/message-id/CAL_0b1s4QCkFy_55kk_8XWcJPs7wsgVWf8vn4=jxe6v4r7h...@mail.gmail.com I've read the discussion, but I'm wondering under what conditions this failure happens. I guess I understand the following conditions need to hold true. Are there any other conditions? * The database server crashes while a btree index is being extended (by page split). * Hot standby is used. * The standby is rebuilt and started. When I last investigated the bug, the user was doing repeated failover testing --- stop the master by running "pg_ctl stop -mi" while some application was performing database updates, promote the standby, rebuild the standby with pg_basebackup, and start the new standby. In one of those iterations, the newly rebuilt standby crashed with "WAL contains references to invalid pages". This seems to match the above mail thread. However, I don't understand why btree_xlog_vacuum() encountered an all-zero page. How did the all-zero page appear on the standby? Was it transferred from the master by pg_basebackup? FYI, the server log didn't contain any messages related to disk full, nor any ERROR messages. Regards MauMau
Re: [HACKERS] plpgsql.warn_shadow
On 1/15/14 2:00 PM, Florian Pflug wrote: On Jan15, 2014, at 13:32 , Marko Tiikkaja wrote: On 1/15/14 1:23 PM, Florian Pflug wrote: The fact that it's named plpgsql.warnings already clearly documents that this only affects plpgsql. But whether a particular warning is emitted during compilation or during execution is largely irrelevant, I think. For example, if we called this compiler_warning, we couldn't add a warning which triggers when SELECT .. INTO ignores excessive rows. There is the fact that something being a "compiler warning" gives you an idea on its effects on performance. But maybe that would be better described in the documentation (perhaps even more accurately). I like the idea of warning about SELECT .. INTO, though, but that one could have a non-negligible performance penalty during execution. I'm not overly concerned about that. I imagine people would usually enable warnings during development, not production. Yeah, me neither, it's just something that needs to be communicated very clearly. So probably just a list plpgsql.warnings would be the most appropriate then. Regards, Marko Tiikkaja
Re: [HACKERS] identify table oid for an AggState during plan tree initialization
Hmm, ok, this is slightly involved. Is it important to do this at execution time? At the planner level, where the Agg node is created, one can look for Aggref nodes and traverse down the args list to find any Vars involved in aggregation. You can find the Var nodes involved in an Aggref by calling pull_var_clause() with appropriate arguments. The varno in those Var nodes can be used to pull the RangeTblEntry (RTE) from the query's range table, hanging off of the Query (rt_fetch). If the RTE belongs to a table (it could be anything: subquery, function, join ...), the relid in the RTE gives you the OID of the table. Though this is simple for the simple query given as an example in your mail, the process is tricky for complex queries. On Wed, Jan 15, 2014 at 6:05 PM, Masterprojekt Naumann1 < mpws201...@gmail.com> wrote: > 2014/1/15 Ashutosh Bapat >> Hi Cathleen, >> An aggregate can be working on more than one table e.g. >> >> select count(*) from a, b, c where a.c1 = b.c1 and b.c1 = c.c1; >> >> In such a case, which table would you like to be reported? >> >> IOW, it doesn't look to be sensible to attach an aggregate to a table. >> >> If you can explain what you intend to do with that information, there >> might be other ways to solve your problem. >> >> >> On Wed, Jan 15, 2014 at 3:29 PM, Masterprojekt Naumann1 < >> mpws201...@gmail.com> wrote: >> >>> >>> 2014/1/15 Masterprojekt Naumann1 >>> Hi, during the initialization of the nodes in the plan tree (in ExecInitNode in the file execProcnode.c) I want to find out for a node with the type T_Agg which table will be aggregated. I tried the following: resultAsAggState = ExecInitAgg((Agg *) node, estate, eflags); if (resultAsAggState) { //tableOid = rel->rd_id; //tableOid = resultAsAggState->ss.ss_currentRelation->rd_id; } It would be great to get the Oid of the table, but I would also be satisfied if I could get at least the name of the table. Does anyone know if it is possible to gather this information? 
Best regards Cathleen >>> >>> Sorry, my mail program sent the mail too early. Please ignore the line >>> tableOid = rel->rd_id. I was just deleting the line when Google sent the >>> mail :( >>> >>> resultAsAggState is of the type AggState. As you can see I tried to get >>> the Oid with the ScanState inside of the AggState, but its Relation >>> ss_currentRelation is not yet set. Is there another way to get the Oid? >>> >> >> >> -- >> Best Wishes, >> Ashutosh Bapat >> EnterpriseDB Corporation >> The Postgres Database Company >> > > > Hi, > > maybe I should clarify what I want to achieve: At the moment I only handle > special aggregations with a filter that compares values with a constant. An > example query would be "select sum(c_custkey), c_nationkey from customer > group by c_nationkey having sum(c_custkey) = 445820473;" This query can be > run on TPCH. > > For such special queries I want to find out which column of the query > result will be filtered on and what the filter value is. For the columnid, > it is sufficient if I get the table OID and the columnid of the original > column, i.e. the table OID of customer and the columnid of c_custkey in the > original table. I know that the expression is represented by an OpExpr > that has an opno and operators. I would like to find such an expression in an > AggState. Additionally, I need the table OID, which may be elsewhere. > Where can I find this information? > > Best regards, > Cathleen > > -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
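The traversal Ashutosh describes could be sketched roughly as follows. This is hypothetical code against PostgreSQL server internals, not standalone-compilable, and pull_var_clause()'s exact arguments vary between PostgreSQL versions, so treat it as an outline rather than a working function:

```c
/* Hypothetical sketch: given an Aggref found in the plan/parse tree, report
 * the base table OID of each Var it aggregates. Depends on PostgreSQL
 * server headers (nodes/primnodes.h, optimizer/var.h, parser/parsetree.h). */
static void
report_agg_source_tables(Aggref *aggref, Query *query)
{
    /* Collect Vars under the aggregate's arguments. The flags/arguments to
     * pull_var_clause() are version-specific; fill in as appropriate. */
    List     *vars = pull_var_clause((Node *) aggref->args /* , flags... */);
    ListCell *lc;

    foreach(lc, vars)
    {
        Var           *var = (Var *) lfirst(lc);
        RangeTblEntry *rte = rt_fetch(var->varno, query->rtable);

        /* Only plain relations carry a table OID; subqueries, joins and
         * functions need further recursion, as noted in the thread. */
        if (rte->rtekind == RTE_RELATION)
            elog(NOTICE, "aggregated column: relation OID %u, attno %d",
                 rte->relid, var->varattno);
    }
}
```

As the thread points out, this only works directly for simple queries; for subqueries or joins the Var may point at a non-relation RTE and the walk has to recurse.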
Re: [HACKERS] plpgsql.warn_shadow
On Jan15, 2014, at 13:32 , Marko Tiikkaja wrote: > On 1/15/14 1:23 PM, Florian Pflug wrote: >> The fact that it's named plpgsql.warnings already clearly documents that >> this only affects plpgsql. But whether a particular warning is emitted >> during compilation or during execution is largely irrelevant, I think. For >> example, if we called this compiler_warning, we couldn't add a warning >> which triggers when SELECT .. INTO ignores excessive rows. > > There is the fact that something being a "compiler warning" gives you an idea > on its effects on performance. But maybe that would be better described in > the documentation (perhaps even more accurately). > > I like the idea of warning about SELECT .. INTO, though, but that one could > have a non-negligible performance penalty during execution. I'm not overly concerned about that. I imagine people would usually enable warnings during development, not production. best regards, Florian Pflug
Re: [HACKERS] WAL Rate Limiting
On Wed, Jan 15, 2014 at 3:20 AM, Simon Riggs wrote: > We've discussed previously the negative impact of large bulk > operations, especially wrt WAL writes. Patch here allows maintenance > operations to have their WAL generation slowed down as a replication > lag prevention feature. > > I believe there was originally intended to be some work on I/O rate > limiting, but that hasn't happened and is in some ways orthogonal to > this patch and we will likely eventually want both. > > Single new parameter works very similarly to vacuum_cost_delay > > wal_rate_limit_delay = Xms > Seems like a really bad name if we are only slowing down some commands - that seems to indicate we're slowing down all of them. I think it should be something that indicates that it only affects the maintenance commands. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
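For context, the proposed parameter would presumably be set like other cost-delay GUCs. Hypothetical postgresql.conf syntax only; the name is exactly what Magnus is objecting to, and neither name nor semantics are final:

```ini
# Hypothetical, from the WIP patch under discussion -- not a real GUC yet.
# Throttles WAL generation of (maintenance) commands, similarly to
# vacuum_cost_delay.
wal_rate_limit_delay = 10ms
```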
Re: [HACKERS] identify table oid for an AggState during plan tree initialization
2014/1/15 Ashutosh Bapat > Hi Cathleen, > An aggregate can be working on more than one table e.g. > > select count(*) from a, b, c where a.c1 = b.c1 and b.c1 = c.c1; > > In such a case, which table would you like to be reported? > > IOW, it doesn't look to be sensible to attach an aggregate to a table. > > If you can explain what you intend to do with that information, there > might be other ways to solve your problem. > > > On Wed, Jan 15, 2014 at 3:29 PM, Masterprojekt Naumann1 < > mpws201...@gmail.com> wrote: > >> >> 2014/1/15 Masterprojekt Naumann1 >> >>> Hi, >>> >>> during the initialization of the nodes in the plan tree (in ExecInitNode >>> in the file execProcnode.c) I want to find out for a node with the type >>> T_Agg which table will be aggregated. I tried the following: >>> >>> resultAsAggState = ExecInitAgg((Agg *) node, estate, eflags); >>> >>> if (resultAsAggState) >>> { >>> //tableOid = rel->rd_id; >>> //tableOid = resultAsAggState->ss.ss_currentRelation->rd_id; >>> } >>> It would be great to get the Oid of the table, but I would also be >>> satisfied if I could get at least the name of the table. Does anyone know >>> if it is possible to gather this information? >>> >>> Best regards >>> Cathleen >>> >> >> Sorry, my mail program sent the mail too early. Please ignore the line >> tableOid = rel->rd_id. I was just deleting the line when Google sent the >> mail :( >> >> resultAsAggState is of the type AggState. As you can see I tried to get >> the Oid with the ScanState inside of the AggState, but its Relation >> ss_currentRelation is not yet set. Is there another way to get the Oid? >> > > > > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company > Hi, maybe I should clarify what I want to achieve: At the moment I only handle special aggregations with a filter that compares values with a constant. 
An example query would be "select sum(c_custkey), c_nationkey from customer group by c_nationkey having sum(c_custkey) = 445820473;" This query can be run on TPCH. For such special queries I want to find out which column of the query result will be filtered on and what the filter value is. For the columnid, it is sufficient if I get the table OID and the columnid of the original column, i.e. the table OID of customer and the columnid of c_custkey in the original table. I know that the expression is represented by an OpExpr that has an opno and operators. I would like to find such an expression in an AggState. Additionally, I need the table OID, which may be elsewhere. Where can I find this information? Best regards, Cathleen