Re: [PATCHES] Relation forks FSM rewrite patches
Heikki Linnakangas wrote:
> Here's an updated version of the relation forks patch, and an incremental FSM rewrite patch on top of that. The relation forks patch is ready for review. The FSM implementation is more work-in-progress still, but I'd like to get some review on that as well, with the goal of doing more performance testing and committing it after the commit fest.
>
> The one part that I'm not totally satisfied with in the relation forks patch is the smgrcreate() function. The problem is: which piece of code decides which forks to create for a relation, and when to create them? I settled on the concept that all forks that a relation will need are created at once, in one smgrcreate() call. There's no function to create additional forks later on. Likewise, there's no function to unlink individual forks; you have to unlink the whole relation.
>
> Currently, heap_create() decides which forks to create. That's fine at the moment, but will become problematic in the future, as it's indexam-specific which forks an index requires. That decision should really be done in the indexam. One possibility would be to only create the main fork in heap_create(), and let the indexam create any additional forks it needs in its ambuild function.

I've had a bit of a play with this, looks pretty nice. One point that comes to mind is that we are increasing the number of required file descriptors by a factor of 2 (and possibly 3 if we use a fork for the visibility map). Is this a problem do you think?

Cheers

Mark

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [HACKERS] [PATCHES] WITH RECURSIVE patch V0.1
Merlin Moncure wrote:
> On Sun, May 18, 2008 at 5:22 PM, Zoltan Boszormenyi [EMAIL PROTECTED] wrote:
>> Can we get the rows in tree order, please? I.e. something like this:
>
> Is ordering by tree order defined in the standard when no explicit order is given? If not, it probably returns them in the order they are pulled up, which might be the fastest way

+1 for the fastest way, which I expect to often be "find all level 1 matches, find all level 2 matches, ...". If ORDER BY is important, it should be specified (although it may be difficult or impossible to properly represent ORDER BY for a tree? not sure?). I think most uses of recursive queries require extra client-side code to deal with anyway, so only relative order is important (order within a particular branch).

There are things I'd like to use this for right now. Currently I use plpgsql procedures to implement my own recursion. :-)

Cheers,
mark

--
Mark Mielke [EMAIL PROTECTED]
Re: [PATCHES] configure option for XLOG_BLCKSZ
On Fri, May 2, 2008 at 12:04 AM, Joshua D. Drake [EMAIL PROTECTED] wrote:
> Tom Lane wrote:
>> Mark Wong [EMAIL PROTECTED] writes:
>>> I saw that a patch was committed that exposed a configure switch for BLCKSZ. I was hoping that I could do the same for XLOG_BLCKSZ.
>>
>> Well, we certainly *could*, but what's the use-case really? The case for varying BLCKSZ is marginal already, and I've seen none at all for varying XLOG_BLCKSZ. Why do we need to make it easier than "edit pg_config_manual.h"?
>
> The use case I could see is for performance testing but I would concur that it doesn't take much to modify pg_config_manual.h. In thinking about it, this might actually be a foot gun. You have a new pg guy who downloads the source and thinks to himself, "Hey, I have a 4k block size as formatted on my hard disk." Then all of a sudden they have a PostgreSQL that is incompatible with everything else.

As someone who has tested varying both those parameters, it feels awkward to have a configure option for one and not the other. I have slightly stronger feelings for having them both as configure options because it's easier to script, but feel more strongly that BLCKSZ and XLOG_BLCKSZ should both be either configure options or in pg_config_manual.h. Having to change them in different manners makes a tad more work in automating testing. So my case is just for ease of testing.

Regards,
Mark
Re: [PATCHES] configure option for XLOG_BLCKSZ
On Fri, May 2, 2008 at 8:50 AM, Tom Lane [EMAIL PROTECTED] wrote:
> Mark Wong [EMAIL PROTECTED] writes:
>> As someone who has tested varying both those parameters it feels awkward to have a configure option for one and not the other, or vice versa. I have slightly stronger feelings for having them both as configure options because it's easier to script, but feel a little more strongly about having BLCKSZ and XLOG_BLCKSZ both as either configure options or in pg_config_manual.h. To have them such that one needs to change them in different manners makes a tad more work in automating testing. So my case is just for ease of testing.
>
> Well, that's a fair point. Another issue though is whether it makes sense for XLOG_BLCKSZ to be different from BLCKSZ at all, at least in the default case. They are both the unit of I/O and it's not clear why you'd want different units. Mark, has your testing shown any indication that they really ought to be separately configurable? I could see having the same configure switch set both of 'em.

I still believe it makes sense to have them separated. I did have some data, which has since been destroyed, that suggested there were some system characterization differences for OLTP workloads with PostgreSQL. Let's hope those disks get delivered to Portland soon. :)

Regards,
Mark
Re: [PATCHES] configure option for XLOG_BLCKSZ
On Fri, May 2, 2008 at 9:16 AM, Tom Lane [EMAIL PROTECTED] wrote:
> Mark Wong [EMAIL PROTECTED] writes:
>> I still believe it makes sense to have them separated. I did have some data, which has since been destroyed, that suggested there were some system characterization differences for OLTP workloads with PostgreSQL. Let's hope those disks get delivered to Portland soon. :)
>
> Fair enough. It's not that much more code to have another configure switch --- will go do that. If we are allowing blocksize and relation seg size to have configure switches, seems that symmetry would demand that XLOG_SEG_SIZE be configurable as well. Thoughts?

I don't have a feel for this one, but when we get the disks set up we can certainly test to see what effects it has. :)

Regards,
Mark
[PATCHES] configure option for XLOG_BLCKSZ
Hi all,

I saw that a patch was committed that exposed a configure switch for BLCKSZ. I was hoping that I could do the same for XLOG_BLCKSZ. I think I got the configure.in, sgml, pg_config_manual.h, and pg_config.h.in changes correct.

Regards,
Mark

Index: configure
===================================================================
RCS file: /projects/cvsroot/pgsql/configure,v
retrieving revision 1.592
diff -c -r1.592 configure
*** configure	2 May 2008 01:08:22 -0000	1.592
--- configure	2 May 2008 04:39:34 -0000
***************
*** 1374,1379 ****
--- 1374,1380 ----
    --with-libs=DIRS        alternative spelling of --with-libraries
    --with-pgport=PORTNUM   set default port number [5432]
    --with-blocksize=BLOCKSIZE set block size in kB [8]
+   --with-xlog-blocksize=BLOCKSIZE set xlog block size in kB [8]
    --with-segsize=SEGSIZE  set segment size in GB [1]
    --with-tcl              build Tcl modules (PL/Tcl)
    --with-tclconfig=DIR    tclConfig.sh is in DIR
***************
*** 2602,2607 ****
--- 2603,2658 ----
  _ACEOF
  
+ { echo "$as_me:$LINENO: checking for xlog block size" >&5
+ echo $ECHO_N "checking for xlog block size... $ECHO_C" >&6; }
+ 
+ pgac_args="$pgac_args with_xlog_blocksize"
+ 
+ 
+ # Check whether --with-xlog-blocksize was given.
+ if test "${with_xlog_blocksize+set}" = set; then
+   withval=$with_xlog_blocksize;
+   case $withval in
+     yes)
+       { { echo "$as_me:$LINENO: error: argument required for --with-xlog-blocksize option" >&5
+ echo "$as_me: error: argument required for --with-xlog-blocksize option" >&2;}
+    { (exit 1); exit 1; }; }
+       ;;
+     no)
+       { { echo "$as_me:$LINENO: error: argument required for --with-xlog-blocksize option" >&5
+ echo "$as_me: error: argument required for --with-xlog-blocksize option" >&2;}
+    { (exit 1); exit 1; }; }
+       ;;
+     *)
+       xlog_blocksize=$withval
+       ;;
+   esac
+ 
+ else
+   xlog_blocksize=8
+ fi
+ 
+ 
+ case ${xlog_blocksize} in
+   1) XLOG_BLCKSZ=1024;;
+   2) XLOG_BLCKSZ=2048;;
+   4) XLOG_BLCKSZ=4096;;
+   8) XLOG_BLCKSZ=8192;;
+  16) XLOG_BLCKSZ=16384;;
+  32) XLOG_BLCKSZ=32768;;
+   *) { { echo "$as_me:$LINENO: error: Invalid block size. Allowed values are 1,2,4,8,16,32." >&5
+ echo "$as_me: error: Invalid block size. Allowed values are 1,2,4,8,16,32." >&2;}
+    { (exit 1); exit 1; }; }
+ esac
+ { echo "$as_me:$LINENO: result: ${xlog_blocksize}kB" >&5
+ echo "${ECHO_T}${xlog_blocksize}kB" >&6; }
+ 
+ 
+ cat >>confdefs.h <<_ACEOF
+ #define XLOG_BLCKSZ ${XLOG_BLCKSZ}
+ _ACEOF
+ 
+ 
  #
  # File segment size
  #
Index: configure.in
===================================================================
RCS file: /projects/cvsroot/pgsql/configure.in,v
retrieving revision 1.558
diff -c -r1.558 configure.in
*** configure.in	2 May 2008 01:08:26 -0000	1.558
--- configure.in	2 May 2008 04:39:34 -0000
***************
*** 249,254 ****
--- 249,278 ----
    Changing BLCKSZ requires an initdb.
  ])
  
+ AC_MSG_CHECKING([for xlog block size])
+ PGAC_ARG_REQ(with, xlog-blocksize, [  --with-xlog-blocksize=BLOCKSIZE set xlog block size in kB [[8]]],
+              [xlog_blocksize=$withval],
+              [xlog_blocksize=8])
+ case ${xlog_blocksize} in
+   1) XLOG_BLCKSZ=1024;;
+   2) XLOG_BLCKSZ=2048;;
+   4) XLOG_BLCKSZ=4096;;
+   8) XLOG_BLCKSZ=8192;;
+  16) XLOG_BLCKSZ=16384;;
+  32) XLOG_BLCKSZ=32768;;
+   *) AC_MSG_ERROR([Invalid block size. Allowed values are 1,2,4,8,16,32.])
+ esac
+ AC_MSG_RESULT([${xlog_blocksize}kB])
+ 
+ AC_DEFINE_UNQUOTED([XLOG_BLCKSZ], ${XLOG_BLCKSZ}, [
+  Size of a WAL file block. This need have no particular relation to BLCKSZ.
+  XLOG_BLCKSZ must be a power of 2, and if your system supports O_DIRECT I/O,
+  XLOG_BLCKSZ must be a multiple of the alignment requirement for direct-I/O
+  buffers, else direct I/O may fail.
+ 
+  Changing XLOG_BLCKSZ requires an initdb.
+ ])
+ 
  #
  # File segment size
  #
Index: doc/src/sgml/installation.sgml
===================================================================
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/installation.sgml,v
retrieving revision 1.308
diff -c -r1.308 installation.sgml
*** doc/src/sgml/installation.sgml	2 May 2008 01:08:26 -0000	1.308
--- doc/src/sgml/installation.sgml	2 May 2008 04:39:36 -0000
***************
*** 1104,1109 ****
--- 1104,1123 ----
       </varlistentry>
  
+      <varlistentry>
+       <term><option>--with-xlog-blocksize=<replaceable>BLOCKSIZE</replaceable></option></term>
+       <listitem>
+        <para>
+         Set the <firstterm>xlog block size</>, in kilobytes. This is the unit
+         of storage and I/O within the WAL files. The default, 8 kilobytes,
+         is suitable for most situations; but other values may be useful
+         in special cases.
+         The value must be a power of 2 between 1 and 32 (kilobytes).
+         Note that changing this value requires an initdb.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry>
        <term><option>--disable-spinlocks</option></term>
        <listitem>
[PATCHES] Proposing correction to posix_fadvise() usage in xlog.c
I believe I have a correction to the usage of posix_fadvise() in xlog.c. Basically, posix_fadvise() is being called right before the WAL segment file is closed, where it effectively does nothing, as opposed to when the file is opened. This proposed correction calls posix_fadvise() in three locations, to make sure POSIX_FADV_DONTNEED is set correctly, since there are three cases for opening a WAL segment file for writing.

I'm hesitant to post any data I have because I only have a little PC with a SATA drive in it. My hardware knowledge of SATA controllers and drives is a little weak, but my testing with dbt-2 is showing the performance dropping. I am guessing that SATA drives have write cache enabled by default, so it seems to make sense that using POSIX_FADV_DONTNEED will cause writes to be slower by writing through the disk cache. Again, assuming that is possible with SATA hardware.

If memory serves, one of the wins here is supposed to be that in a scenario where we are not expecting to re-read WAL writes, we also do not want the writes to disk to flush other data out of the operating system disk cache. But I'm not sure how best to test correctness. Anyway, I hope I'm not way off, but I'm sure someone will correct me. :)

Regards,
Mark

pgsql-log-fadvise.patch
Description: Binary data

---(end of broadcast)---
TIP 6: explain analyze is your friend
[PATCHES] Auto create (top level) directory for create tablespace
I thought it made sense for CREATE TABLESPACE to attempt to create the top level location directory - and also for tablespace redo to do likewise during WAL replay. Tablespace creation then behaves a bit more like initdb with respect to directory creation, which I found quite nice. Patch against HEAD, passes regression tests.

Cheers

Mark

Index: src/backend/commands/tablespace.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/tablespace.c,v
retrieving revision 1.51
diff -c -r1.51 tablespace.c
*** src/backend/commands/tablespace.c	15 Nov 2007 21:14:34 -0000	1.51
--- src/backend/commands/tablespace.c	15 Dec 2007 19:04:45 -0000
***************
*** 199,204 ****
--- 199,205 ----
  	Oid			tablespaceoid;
  	char	   *location;
  	char	   *linkloc;
+ 	struct stat st;
  	Oid			ownerId;
  
  	/* Must be super user */
***************
*** 297,302 ****
--- 298,317 ----
  	recordDependencyOnOwner(TableSpaceRelationId, tablespaceoid, ownerId);
  
  	/*
+ 	 * Try to create the target directory if it does not exist.
+ 	 */
+ 	if (stat(location, &st) < 0)
+ 	{
+ 		if (mkdir(location, 0700) != 0)
+ 			ereport(ERROR,
+ 					(errcode_for_file_access(),
+ 					 errmsg("could not create location directory \"%s\": %m",
+ 							location)));
+ 
+ 	}
+ 
+ 
+ 	/*
  	 * Attempt to coerce target directory to safe permissions. If this fails,
  	 * it doesn't exist or has the wrong owner.
  	 */
***************
*** 1279,1284 ****
--- 1294,1314 ----
  	xl_tblspc_create_rec *xlrec = (xl_tblspc_create_rec *) XLogRecGetData(record);
  	char	   *location = xlrec->ts_path;
  	char	   *linkloc;
+ 	struct stat st;
+ 
+ 	/*
+ 	 * Try to create the target directory if it does not exist.
+ 	 */
+ 	if (stat(location, &st) < 0)
+ 	{
+ 		if (mkdir(location, 0700) != 0)
+ 		{
+ 			ereport(ERROR,
+ 					(errcode_for_file_access(),
+ 					 errmsg("could not create location directory \"%s\": %m",
+ 							location)));
+ 		}
+ 	}
  
  	/*
  	 * Attempt to coerce target directory to safe permissions. If this
Index: src/test/regress/output/tablespace.source
===================================================================
RCS file: /projects/cvsroot/pgsql/src/test/regress/output/tablespace.source,v
retrieving revision 1.5
diff -c -r1.5 tablespace.source
*** src/test/regress/output/tablespace.source	3 Jun 2007 22:16:03 -0000	1.5
--- src/test/regress/output/tablespace.source	15 Dec 2007 19:04:46 -0000
***************
*** 57,63 ****
  -- Will fail with bad path
  CREATE TABLESPACE badspace LOCATION '/no/such/location';
! ERROR:  could not set permissions on directory "/no/such/location": No such file or directory
  -- No such tablespace
  CREATE TABLE bar (i int) TABLESPACE nosuchspace;
  ERROR:  tablespace "nosuchspace" does not exist
--- 57,63 ----
  -- Will fail with bad path
  CREATE TABLESPACE badspace LOCATION '/no/such/location';
! ERROR:  could not create location directory "/no/such/location": No such file or directory
  -- No such tablespace
  CREATE TABLE bar (i int) TABLESPACE nosuchspace;
  ERROR:  tablespace "nosuchspace" does not exist
Re: [PATCHES] Auto create (top level) directory for create tablespace
Tom Lane wrote:
> Mark Kirkwood [EMAIL PROTECTED] writes:
>> I thought it made sense for CREATE TABLESPACE to attempt to create the top level location directory -
>
> I thought we had deliberately made it not do that. Auto-recreate during replay sounds even worse. The problem is that a tablespace would normally be under a mount point, and auto-create has zero chance of getting such a path right. Ignoring this point is actually a fine recipe for destroying your data; see Joe Conway's report a couple years back about getting burnt by a soft NFS mount. If the DB directory is not there, auto-creating it is a horrible idea.

Hmm - ok, unmounted filesystems could bite you. However, they could bite folks creating the directory manually too... (I guess you could argue it is less likely though).

On the replay front, the use case I was thinking about is a standby database - the classic foot gun there is to create a tablespace on the source box and forget to add the appropriate directory on the target, and bang! replay fails.

It does seem to me like there are scenarios where either behavior is undesirable... a possible option is a configuration parameter to choose between auto creation or not. However, I'm happy to go with the consensus here - if it's universally deemed to be a terrible idea, then let's ditch the patch :-)

Best wishes

Mark
[PATCHES] Cache lookup failed for relation X [was: DROP FUNCTION cache lookup failed for relation X]
Tom Lane wrote:
> Still in the think-about-it mode, personally ... my proposed fix is certainly much too invasive to consider back-patching, so unless someone comes up with a way-simpler idea, it's 8.3 material at best ...

I ran into a variant of this today - simply creating and dropping a table repeatedly while doing \d from another session:

Session 1:

perl -e 'while (1) {print "drop table if exists z0; \n create table z0 (a int, b int);\n drop table z0;\n"}' | psql cache > z0.log 2>&1

Session 2:

psql cache
=# \d
ERROR:  cache lookup failed for relation 945897

(in RelationIsVisible, namespace.c:406)

The previous discussion centered around working on locking in dependency.c whilst dropping related objects - but does this apply when there is just one? Anyway, I tried to understand what was happening, and the attached rather hacky patch seems to cure the behaviour - so I've submitted it as a discussion aid, rather than 'the way of fixing this'... since I'm hoping there is a better way :-)

regards

Mark

Index: src/backend/catalog/namespace.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/catalog/namespace.c,v
retrieving revision 1.99
diff -c -r1.99 namespace.c
*** src/backend/catalog/namespace.c	27 Aug 2007 03:36:08 -0000	1.99
--- src/backend/catalog/namespace.c	1 Nov 2007 07:55:34 -0000
***************
*** 19,26 ****
--- 19,28 ----
   */
  #include "postgres.h"
  
+ #include "access/heapam.h"
  #include "access/xact.h"
  #include "catalog/dependency.h"
+ #include "catalog/indexing.h"
  #include "catalog/namespace.h"
  #include "catalog/pg_authid.h"
  #include "catalog/pg_conversion.h"
***************
*** 41,46 ****
--- 43,49 ----
  #include "storage/ipc.h"
  #include "utils/acl.h"
  #include "utils/builtins.h"
+ #include "utils/fmgroids.h"
  #include "utils/guc.h"
  #include "utils/inval.h"
  #include "utils/lsyscache.h"
***************
*** 398,409 ****
  	Form_pg_class relform;
  	Oid			relnamespace;
  	bool		visible;
  
  	reltup = SearchSysCache(RELOID,
  							ObjectIdGetDatum(relid),
  							0, 0, 0);
  	if (!HeapTupleIsValid(reltup))
! 		elog(ERROR, "cache lookup failed for relation %u", relid);
  	relform = (Form_pg_class) GETSTRUCT(reltup);
  
  	recomputeNamespacePath();
--- 401,441 ----
  	Form_pg_class relform;
  	Oid			relnamespace;
  	bool		visible;
+ 	bool		fromcache = true;
  
  	reltup = SearchSysCache(RELOID,
  							ObjectIdGetDatum(relid),
  							0, 0, 0);
  	if (!HeapTupleIsValid(reltup))
! 	{
! 		/* See if we can get it directly. */
! 		Relation	relation;
! 		HeapScanDesc scan;
! 		ScanKeyData scanKeyData;
! 
! 		fromcache = false;
! 
! 		ScanKeyInit(&scanKeyData,
! 					ObjectIdAttributeNumber,
! 					BTEqualStrategyNumber, F_OIDEQ,
! 					ObjectIdGetDatum(ClassOidIndexId));
! 
! 		relation = heap_open(RelationRelationId, AccessShareLock);
! 
! 		scan = heap_beginscan(relation, ActiveSnapshot,
! 							  1, &scanKeyData);
! 
! 		reltup = heap_getnext(scan, ForwardScanDirection);
! 		reltup = heap_copytuple(reltup);
! 		if (!HeapTupleIsValid(reltup))
! 			elog(ERROR, "cache lookup failed for relation %u", relid);
! 
! 
! 		heap_endscan(scan);
! 
! 		heap_close(relation, AccessShareLock);
! 
! 	}
  	relform = (Form_pg_class) GETSTRUCT(reltup);
  
  	recomputeNamespacePath();
***************
*** 446,452 ****
  		}
  	}
  
! 	ReleaseSysCache(reltup);
  
  	return visible;
  }
--- 478,487 ----
  		}
  	}
  
! 	if (fromcache)
! 		ReleaseSysCache(reltup);
! 	else
! 		heap_freetuple(reltup);
  
  	return visible;
  }
Re: [PATCHES] [DOCS] OS/X startup scripts
Alvaro Herrera wrote:
> David Fetter wrote:
>> On Mon, May 14, 2007 at 03:31:40PM +1200, Mark Kirkwood wrote:
>>> David Fetter wrote:
>>>> cvs diff works just great until you want to add or remove a file without write permissions to the CVS repository, i.e. when you've checked out as anonymous.
>>>
>>> I usually saved an untouched version of the tree to compare against, so something like:
>>>
>>> $ cvs diff -Nacr pgsql.orig pgsql
>>>
>>> gives a complete patch including added/deleted files. It is a bit primitive, but is pretty easy to do!
>
> Except that it also includes diffs for generated files, which tend to be huge. To work around that you need to create a list of files to exclude, and the whole thing (which was cumbersome already) starts to get unmanageable. I do use the cvsup mirror myself which makes things so much easier and faster.

Not so, you just do a

$ make maintainer-clean

before the diff, which clears all those out. While maintaining a cvs mirror is also good, for a small piece of work (or work on a well-defined area of the code) this requires no fooling around with repositories at all (which is nice).

Cheers

Mark
Re: [PATCHES] [DOCS] OS/X startup scripts
David Fetter wrote:
> cvs diff works just great until you want to add or remove a file without write permissions to the CVS repository, i.e. when you've checked out as anonymous.

I usually saved an untouched version of the tree to compare against, so something like:

$ cvs diff -Nacr pgsql.orig pgsql

gives a complete patch including added/deleted files. It is a bit primitive, but is pretty easy to do!

Cheers

Mark
Re: [PATCHES] Updated bitmap index patch
Mark Kirkwood wrote:
> I have applied this to today's HEAD and performed some quick tests - looks good! I have to re-create a TPC-H dataset to test one of the previous bugs, so I'll probably look at that tomorrow or so.

The TPC-H query that previously produced a SIGSEGV now runs and gives the correct answer.

Cheers

Mark
Re: [PATCHES] Updated bitmap index patch
Gavin Sherry wrote:
> Hi all,
>
> Attached is an updated bitmap index patch. It contains bug fixes, API changes, binary changes (page identifier to distinguish it from other indexes) and has been brought up to HEAD.

I have applied this to today's HEAD and performed some quick tests - looks good! I have to re-create a TPC-H dataset to test one of the previous bugs, so I'll probably look at that tomorrow or so.

> I worked on a few approaches to VACUUM, none very satisfactory. The problem is, breaking a compressed word representing matches can have serious consequences -- at the least, creation of new words, at the worst, creation of a new page. If a lot of this were to happen, REINDEX would be much more efficient (this is what earlier patches did). One approach I looked at was modifying the existing read API to be able to do something like "kill prior tuple". This, I think, made the API quite complex and it was hard to implement, since the existing mechanism decompresses words on the fly and it would be hard to identify which TID is no longer a match. So, I dropped this idea pretty quickly. The second approach is to just manually traverse each vector and change matches to non-matches where necessary. The complexity then is in managing the consequences of breaking compressed words, doing WAL (efficiently) and calculating free space. I've only partially implemented this approach. At this stage, I don't have time to finish it due to other commitments.

The second approach seems like the better way to go (as far as I understand the issues...). How much work is remaining on this? - not sure that I'll have time to look at it either... but may as well know the size of the job :-)!

Cheers

Mark
Re: [PATCHES] scan_recycle_buffers
Simon Riggs wrote:
> On Sat, 2007-03-10 at 07:59 +0000, Simon Riggs wrote:
>> On Fri, 2007-03-09 at 18:05 -0500, Tom Lane wrote:
>>> Simon Riggs [EMAIL PROTECTED] writes:
>>>> On Fri, 2007-03-09 at 16:45 -0500, Tom Lane wrote:
>>>>> We do that anyway; but certainly Simon's patch ought not be injecting an additional one.
>>>>
>>>> It should be possible to pass that down from the planner to the executor, in certain cases.
>>>
>>> Huh? See HeapScanDesc->rs_nblocks.
>
> Many thanks. New patch enclosed, implementation as you've requested. Not ready to apply yet, but good for testing.

A quick test using the setup from the "Buffer cache is not scan resistant" thread:

Firstly, vanilla 8.3 from 20070310:

Shared Buffers  Elapsed  vmstat IO rate
--------------  -------  --------------
400MB           101 s    122 MB/s
128KB            79 s    155 MB/s [1]

Now apply cycle scan v2:

Shared Buffers  scan_recycle_buffers  Elapsed  vmstat IO rate
--------------  --------------------  -------  --------------
400MB             0                   101 s    122 MB/s
400MB             8                    78 s    155 MB/s
400MB            16                    77 s    155 MB/s
400MB            32                    78 s    155 MB/s
400MB            64                    82 s    148 MB/s
400MB           128                    93 s    128 MB/s

Certainly seems to have the desired effect!

Cheers

Mark

[1] I'm not seeing 166 MB/s like the previous 8.2.3 data; however the 8.3 PGDATA is located further toward the end of the disk array - which I suspect is limiting the IO rate a little.
Re: [PATCHES] [HACKERS] HOT WIP Patch - version 2
On Tue, Feb 20, 2007 at 08:31:45PM +0530, Pavan Deolasee wrote:
> I see your point, but as you mentioned, do we really care? The chain needs to be broken so that the intermediate DEAD tuples can be vacuumed. We can't vacuum them normally because they could be a part of a live HOT-update chain. Resetting the HOT-updated status of the root tuple helps to mark the index entry LP_DELETE once the entire HOT-update chain is dead. ...

For some reason this paragraph raised a query in my mind. Will we be able to toggle this new hot update code at configure time, so that we can measure what sort of effect this change has once it is complete? Even if only during the early testing cycles for the next release, I think it would be useful.

Cheers,
mark

--
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED]  Ottawa, Ontario, Canada
http://mark.mielke.cc/
Re: [PATCHES] patch adding new regexp functions
Jeremy Drake wrote:
> The regexp_split function code was based on some code that a friend of mine wrote which used PCRE rather than postgres' internal regexp support. I don't know exactly what his use-case was, but he probably had one because he wrote the function and had it returning SETOF text ;) Perhaps he can share a general idea of what it was (nudge nudge)?

db=# CREATE OR REPLACE FUNCTION split(p TEXT, t TEXT) RETURNS SETOF TEXT AS $$
db$#     my ($p, $t) = @_;
db$#     return [ split(/$p/, $t) ];
db$# $$ LANGUAGE plperl;
CREATE FUNCTION
Time: 1.254 ms
db=# select distinct word from (select * from split('\\W+','mary had a little lamb, whose fleece was black as soot') as word) as ss;
  word
--------
 a
 as
 black
 fleece
 had
 lamb
 little
 mary
 soot
 was
 whose
(11 rows)

Time: 30.517 ms

As you can see, this can easily be done with a plperl function. Some people may not want to install plperl, or may not want to allow arbitrary patterns to be handed to perl in this fashion. That was not my concern. I was simply trying to see if I could make it faster in a C-language coded function. In the end I dropped the project because the plperl function works fast enough for me and I don't have any objection to plperl from a security standpoint, etc.

mark
Re: [PATCHES] Avg performance for int8/numeric
Bruce Momjian wrote:
> I have tested this patch but it generates regression failures. There was some code drift, so I am attaching an updated version of the patch, and the regression diffs. The 'four' column is an 'int4' so my guess is that somehow the wrong aggregate is being called.

Good catch - I must have neglected to run the regression test after amending the number of array arguments for the numeric avg :-(.

Hmmm - changing the number of array args for avg means we can't mix transition functions for variance with final functions for avg - which is exactly what the regression suite does with the 'newavg' aggregate. I've 'fixed' this by amending the definition of 'newavg' to use the transition and final functions that 'avg' does. However, I found myself asking if this lost us the point of that test - so I looked back at the older postgres versions (e.g. 7.1.3) and saw that back *then* 'newavg' and 'avg' were defined using the same functions... so I think making the change as indicated is ok. I've attached a new patch with this change.

Cheers

Mark

diff -Nacr pgsql.orig/src/backend/utils/adt/numeric.c pgsql/src/backend/utils/adt/numeric.c
*** pgsql.orig/src/backend/utils/adt/numeric.c	Sun Jan 21 11:36:20 2007
--- pgsql/src/backend/utils/adt/numeric.c	Fri Feb 16 18:09:24 2007
***************
*** 2165,2170 ****
--- 2165,2204 ----
  	return result;
  }
  
+ /*
+  * Improve avg performance by not calculating sum(X*X).
+  */
+ static ArrayType *
+ do_numeric_avg_accum(ArrayType *transarray, Numeric newval)
+ {
+ 	Datum	   *transdatums;
+ 	int			ndatums;
+ 	Datum		N,
+ 				sumX;
+ 	ArrayType  *result;
+ 
+ 	/* We assume the input is array of numeric */
+ 	deconstruct_array(transarray,
+ 					  NUMERICOID, -1, false, 'i',
+ 					  &transdatums, NULL, &ndatums);
+ 	if (ndatums != 2)
+ 		elog(ERROR, "expected 2-element numeric array");
+ 	N = transdatums[0];
+ 	sumX = transdatums[1];
+ 
+ 	N = DirectFunctionCall1(numeric_inc, N);
+ 	sumX = DirectFunctionCall2(numeric_add, sumX,
+ 							   NumericGetDatum(newval));
+ 
+ 	transdatums[0] = N;
+ 	transdatums[1] = sumX;
+ 
+ 	result = construct_array(transdatums, 2,
+ 							 NUMERICOID, -1, false, 'i');
+ 
+ 	return result;
+ }
+ 
  Datum
  numeric_accum(PG_FUNCTION_ARGS)
  {
***************
*** 2175,2180 ****
--- 2209,2226 ----
  }
  
  /*
+  * Optimized case for average of numeric.
+  */
+ Datum
+ numeric_avg_accum(PG_FUNCTION_ARGS)
+ {
+ 	ArrayType  *transarray = PG_GETARG_ARRAYTYPE_P(0);
+ 	Numeric		newval = PG_GETARG_NUMERIC(1);
+ 
+ 	PG_RETURN_ARRAYTYPE_P(do_numeric_avg_accum(transarray, newval));
+ }
+ 
+ /*
   * Integer data types all use Numeric accumulators to share code and
   * avoid risk of overflow. For int2 and int4 inputs, Numeric accumulation
   * is overkill for the N and sum(X) values, but definitely not overkill
***************
*** 2219,2224 ****
--- 2265,2286 ----
  	PG_RETURN_ARRAYTYPE_P(do_numeric_accum(transarray, newval));
  }
  
+ /*
+  * Optimized case for average of int8.
+  */
+ Datum
+ int8_avg_accum(PG_FUNCTION_ARGS)
+ {
+ 	ArrayType  *transarray = PG_GETARG_ARRAYTYPE_P(0);
+ 	Datum		newval8 = PG_GETARG_DATUM(1);
+ 	Numeric		newval;
+ 
+ 	newval = DatumGetNumeric(DirectFunctionCall1(int8_numeric, newval8));
+ 
+ 	PG_RETURN_ARRAYTYPE_P(do_numeric_avg_accum(transarray, newval));
+ }
+ 
+ 
  Datum
  numeric_avg(PG_FUNCTION_ARGS)
  {
***************
*** 2232,2242 ****
  	deconstruct_array(transarray,
  					  NUMERICOID, -1, false, 'i',
  					  &transdatums, NULL, &ndatums);
! 	if (ndatums != 3)
! 		elog(ERROR, "expected 3-element numeric array");
  
  	N = DatumGetNumeric(transdatums[0]);
  	sumX = DatumGetNumeric(transdatums[1]);
- 	/* ignore sumX2 */
  
  	/* SQL92 defines AVG of no values to be NULL */
  	/* N is zero iff no digits (cf. numeric_uminus) */
--- 2294,2303 ----
  	deconstruct_array(transarray,
  					  NUMERICOID, -1, false, 'i',
  					  &transdatums, NULL, &ndatums);
! 	if (ndatums != 2)
! 		elog(ERROR, "expected 2-element numeric array");
  
  	N = DatumGetNumeric(transdatums[0]);
  	sumX = DatumGetNumeric(transdatums[1]);
  
  	/* SQL92 defines AVG of no values to be NULL */
  	/* N is zero iff no digits (cf. numeric_uminus) */
diff -Nacr pgsql.orig/src/include/catalog/catversion.h pgsql/src/include/catalog/catversion.h
*** pgsql.orig/src/include/catalog/catversion.h	Fri Feb 16 18:06:34 2007
--- pgsql/src/include/catalog/catversion.h	Fri Feb 16 18:09:24 2007
***************
*** 53,58 ****
   */
  
  /*							yyyymmddN */
! #define CATALOG_VERSION_NO	200702131
  
  #endif
--- 53,58 ----
   */
  
  /*							yyyymmddN */
! #define CATALOG_VERSION_NO	200702151
  
  #endif
diff -Nacr pgsql.orig/src/include/catalog/pg_aggregate.h pgsql/src/include/catalog/pg_aggregate.h
*** pgsql.orig/src/include/catalog/pg_aggregate.h	Sun Jan 21 11:36:22 2007
--- pgsql/src/include/catalog/pg_aggregate.h	Fri Feb 16 18:09:24 2007
***************
*** 80,89 ****
   */
  
  /* avg */
! DATA(insert ( 2100	int8_accum	numeric_avg		0	1231
Re: [PATCHES] [pgsql-patches] Patch to avoid gprof profiling overwrites
[EMAIL PROTECTED] wrote: And the patch is where? You caught me; guess I'd better make something up fast, huh? Here it is, thanks. -- Korry --- [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: It's difficult to profile a backend server process (using gprof) because each process overwrites any earlier profile as it exits. It is especially tricky to nab a useful profile if you happen to have autovacuum enabled. This patch reduces the problem by forcing the backend to 'cd' to a new directory ($PGDATA/gprof/pid) just before calling exit(), but only if the backend was compiled with -DLINUX_PROFILE. Nice addition - tested on FreeBSD 6.2, works great! The name for the define variable could perhaps be better - feels silly adding -DLINUX_PROFILE on Freebsd! (maybe just PROFILE or GPROF_PROFILE?). Cheers Mark ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] [pgsql-patches] Patch to avoid gprof profiling overwrites
[EMAIL PROTECTED] wrote: The name for the define variable could perhaps be better - feels silly adding -DLINUX_PROFILE on Freebsd! (maybe just PROFILE or GPROF_PROFILE?). That wasn't my choice, there is other code elsewhere that depends on that symbol, I just added a little bit more. Right - but LINUX_PROFILE was added to correct Linux specific oddities with the time counter accumulation, whereas your patch is not Linux specific at all. So I think a more representative symbol is required. Cheers Mark ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] [pgsql-patches] Patch to avoid gprof profiling overwrites
Mark Kirkwood wrote: [EMAIL PROTECTED] wrote: The name for the define variable could perhaps be better - feels silly adding -DLINUX_PROFILE on Freebsd! (maybe just PROFILE or GPROF_PROFILE?). That wasn't my choice, there is other code elsewhere that depends on that symbol, I just added a little bit more. Right - but LINUX_PROFILE was added to correct Linux specific oddities with the time counter accumulation, whereas your patch is not Linux specific at all. So I think a more representative symbol is required. In fact - thinking about this a bit more, probably a construction like: #if defined(LINUX_PROFILE) || defined(PROFILE) or similar would work - because I think those of us not on Linux do *not* want to define LINUX_PROFILE, as our timer accumulation for forked process works fine as it is...but we don't want to make Linux guys have to define LINUX_PROFILE *and* PROFILE... ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] README for vcbuild
Magnus Hagander wrote: Note that PostgreSQL builds natively with Visual C++. You must *there for* make sure that you do *NOT* have any tools from Cygwin or Mingw available in the system PATH. Also, make sure you don't have any Cygwin/Mingw environment variables leaking through. Little typo: Note that PostgreSQL builds natively with Visual C++. You must *therefore* ... Cheers Mark ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] Avg performance for int8/numeric
Neil Conway wrote: On Fri, 2006-11-24 at 11:08 +1300, Mark Kirkwood wrote: - Modifies do_numeric_accum to have an extra bool parameter and does not calc sumX2 when it is false. I think it would be clearer to reorganize this function slightly, and have only a single branch on useSumX2. On first glance it isn't obviously that transdatums[2] is defined (but unchanged) when useSumX2 is false. Right - new patch attached that adds a new function do_numeric_avg_accum that only uses N and sum(X). This means I could amend the avg aggregates for numeric, int8 to have a initvalues of {0,0}. Cheers Mark ? gmon.out Index: src/backend/utils/adt/numeric.c === RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/numeric.c,v retrieving revision 1.96 diff -c -r1.96 numeric.c *** src/backend/utils/adt/numeric.c 4 Oct 2006 00:29:59 - 1.96 --- src/backend/utils/adt/numeric.c 25 Nov 2006 00:00:58 - *** *** 2097,2102 --- 2097,2136 return result; } + /* + * Improve avg performance by not caclulating sum(X*X). + */ + static ArrayType * + do_numeric_avg_accum(ArrayType *transarray, Numeric newval) + { + Datum *transdatums; + int ndatums; + Datum N, + sumX; + ArrayType *result; + + /* We assume the input is array of numeric */ + deconstruct_array(transarray, + NUMERICOID, -1, false, 'i', + transdatums, NULL, ndatums); + if (ndatums != 2) + elog(ERROR, expected 2-element numeric array); + N = transdatums[0]; + sumX = transdatums[1]; + + N = DirectFunctionCall1(numeric_inc, N); + sumX = DirectFunctionCall2(numeric_add, sumX, + NumericGetDatum(newval)); + + transdatums[0] = N; + transdatums[1] = sumX; + + result = construct_array(transdatums, 2, + NUMERICOID, -1, false, 'i'); + + return result; + } + Datum numeric_accum(PG_FUNCTION_ARGS) { *** *** 2107,2112 --- 2141,2158 } /* + * Optimized case for average of numeric. 
+ */ + Datum + numeric_avg_accum(PG_FUNCTION_ARGS) + { + ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0); + Numeric newval = PG_GETARG_NUMERIC(1); + + PG_RETURN_ARRAYTYPE_P(do_numeric_avg_accum(transarray, newval)); + } + + /* * Integer data types all use Numeric accumulators to share code and * avoid risk of overflow. For int2 and int4 inputs, Numeric accumulation * is overkill for the N and sum(X) values, but definitely not overkill *** *** 2151,2156 --- 2197,2218 PG_RETURN_ARRAYTYPE_P(do_numeric_accum(transarray, newval)); } + /* + * Optimized case for average of int8. + */ + Datum + int8_avg_accum(PG_FUNCTION_ARGS) + { + ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0); + Datum newval8 = PG_GETARG_DATUM(1); + Numeric newval; + + newval = DatumGetNumeric(DirectFunctionCall1(int8_numeric, newval8)); + + PG_RETURN_ARRAYTYPE_P(do_numeric_avg_accum(transarray, newval)); + } + + Datum numeric_avg(PG_FUNCTION_ARGS) { *** *** 2164,2174 deconstruct_array(transarray, NUMERICOID, -1, false, 'i', transdatums, NULL, ndatums); ! if (ndatums != 3) ! elog(ERROR, expected 3-element numeric array); N = DatumGetNumeric(transdatums[0]); sumX = DatumGetNumeric(transdatums[1]); - /* ignore sumX2 */ /* SQL92 defines AVG of no values to be NULL */ /* N is zero iff no digits (cf. numeric_uminus) */ --- 2226,2235 deconstruct_array(transarray, NUMERICOID, -1, false, 'i', transdatums, NULL, ndatums); ! if (ndatums != 2) ! elog(ERROR, expected 2-element numeric array); N = DatumGetNumeric(transdatums[0]); sumX = DatumGetNumeric(transdatums[1]); /* SQL92 defines AVG of no values to be NULL */ /* N is zero iff no digits (cf. numeric_uminus) */ Index: src/include/catalog/pg_aggregate.h === RCS file: /projects/cvsroot/pgsql/src/include/catalog/pg_aggregate.h,v retrieving revision 1.58 diff -c -r1.58 pg_aggregate.h *** src/include/catalog/pg_aggregate.h 4 Oct 2006 00:30:07 - 1.58 --- src/include/catalog/pg_aggregate.h 25 Nov 2006 00:01:01 - *** *** 80,89 */ /* avg */ ! 
DATA(insert ( 2100 int8_accum numeric_avg 0 1231 {0,0,0} )); DATA(insert ( 2101 int4_avg_accum int8_avg 0 1016 {0,0} )); DATA(insert ( 2102 int2_avg_accum int8_avg 0 1016 {0,0} )); ! DATA(insert ( 2103 numeric_accum numeric_avg 0 1231 {0,0,0} )); DATA(insert ( 2104 float4_accum float8_avg 0 1022 {0,0,0} )); DATA(insert ( 2105 float8_accum float8_avg 0 1022 {0,0,0} )); DATA(insert ( 2106 interval_accum interval_avg 0 1187 {0 second,0 second} )); --- 80,89 */ /* avg */ ! DATA(insert ( 2100 int8_avg_accum numeric_avg 0 1231 {0,0} )); DATA(insert ( 2101 int4_avg_accum int8_avg 0 1016 {0,0} )); DATA(insert ( 2102 int2_avg_accum int8_avg 0 1016 {0,0} )); ! DATA(insert ( 2103
Re: [PATCHES] Avg performance for int8/numeric
Neil Conway wrote: (it is still slower than doing sum/count - possibly due to the construct/deconstruct overhead of the numeric transition array). This would indeed be worth profiling. If it turns out that array overhead is significant, I wonder if we could use a composite type for the transition variable instead. That might also make it easier to represent the N value as an int8 rather than a numeric. I've profiled the 2nd patch using the setup indicated below. The first 64 lines of the flat graph are attached. The complete profile is here: http://homepages.paradise.net.nz/markir/download/postgres/postgres-avg.gprof.gz Setup: avg=# \d avgtest Table public.avgtest Column | Type | Modifiers +---+--- id | integer | val0 | bigint| val1 | numeric(12,2) | val2 | numeric(10,0) | avg=# analyze verbose avgtest; INFO: analyzing public.avgtest INFO: avgtest: scanned 3000 of 87689 pages, containing 342138 live rows and 0 dead rows; 3000 rows in sample, 1580 estimated total rows ANALYZE Time: 252.033 ms avg=# select avg(val2) from avgtest; avg - 714285.21428580 (1 row) Time: 35196.028 ms avg=# \q regards Mark Flat profile: Each sample counts as 0.01 seconds. 
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
14.42      2.16     2.16     12977     0.00     0.00  AllocSetAlloc
 9.08      3.52     1.36      2000     0.00     0.00  add_abs
 5.54      4.35     0.83      1000     0.00     0.00  slot_deform_tuple
 5.41      5.16     0.81  60001673     0.00     0.00  AllocSetFree
 4.34      5.81     0.65      1000     0.00     0.00  construct_md_array
 4.21      6.44     0.63      2003     0.00     0.00  make_result
 3.54      6.97     0.53      1000     0.00     0.00  numeric_add
 3.27      7.46     0.49      3003     0.00     0.00  set_var_from_num
 3.00      7.91     0.45     12652     0.00     0.00  MemoryContextAlloc
 2.74      8.32     0.41      1001     0.00     0.00  heapgettup_pagemode
 2.54      8.70     0.38      1000     0.00     0.00  advance_transition_function
 2.40      9.06     0.36      3006     0.00     0.00  alloc_var
 2.27      9.40     0.34      1000     0.00     0.00  do_numeric_avg_accum
 2.00      9.70     0.30      1001     0.00     0.00  CopyArrayEls
 2.00     10.00     0.30      1000     0.00     0.00  numeric_inc
 1.94     10.29     0.29      2002     0.00     0.00  ArrayGetNItems
 1.94     10.58     0.29      1001     0.00     0.00  deconstruct_array
 1.87     10.86     0.28      2002     0.00     0.00  ArrayCastAndSet
 1.74     11.12     0.26  60001672     0.00     0.00  pfree
 1.67     11.37     0.25      1001     0.00     0.00  slot_getattr
 1.60     11.61     0.24      1000     0.00     0.00  advance_aggregates
 1.54     11.84     0.23      4006     0.00     0.00  free_var
 1.54     12.07     0.23      1001     0.00     0.00  datumCopy
 1.47     12.29     0.22      1001     0.00     0.00  SeqNext
 1.40     12.50     0.21      2000     0.00     0.00  add_var
 1.34     12.70     0.20      2003     0.00     0.00  strip_var
 1.34     12.90     0.20      1001     0.00     0.00  ExecScan
 1.27     13.09     0.19      1003     0.00     0.00  AllocSetReset
 1.20     13.27     0.18      1003     0.00     0.00  ExecProcNode
 1.13     13.44     0.17      7010     0.00     0.00  pg_detoast_datum
 0.93     13.58     0.14      1000     0.00     0.00  numeric_avg_accum
 0.93     13.72     0.14         2     0.07     6.61  ExecAgg
 0.87     13.85     0.13      1001     0.00     0.00  datumGetSize
 0.87     13.98     0.13     87860     0.00     0.00  heapgetpage
 0.73     14.09     0.11      1001     0.00     0.00  DirectFunctionCall2
 0.73     14.20     0.11      1000     0.00     0.00  construct_array
 0.60     14.29     0.09      1148     0.00     0.00  DirectFunctionCall1
 0.53     14.37     0.08      1001     0.00     0.00  ExecStoreTuple
 0.53     14.45     0.08      1000     0.00     0.00  HeapTupleSatisfiesSnapshot
 0.40     14.51     0.06      1103     0.00     0.00  heap_getnext
 0.33     14.56     0.05    254419     0.00     0.00  hash_search_with_hash_value
 0.27     14.60     0.04      1001     0.00     0.00  MemoryContextReset
 0.27     14.64     0.04      1000     0.00     0.00  ExecEvalVar
 0.27     14.68     0.04      1000     0.00     0.00  XidInSnapshot
 0.27     14.72     0.04    511482     0.00     0.00  LWLockRelease
 0.27     14.76     0.04    164939     0.00     0.00  hash_any
 0.27     14.80     0.04     87760     0.00     0.00  StrategyGetBuffer
 0.20     14.83     0.03      1009     0.00     0.00  TransactionIdPrecedes
 0.20     14.86     0.03     87760     0.00     0.00  FileRead
 0.13     14.88     0.02      1001     0.00     0.00  ExecSeqScan
 0.13     14.90     0.02    511481     0.00     0.00  LWLockAcquire
 0.13     14.92     0.02     88217     0.00     0.00
Re: [PATCHES] Avg performance for int8/numeric
Luke Lonergan wrote: So, if I understand this correctly, we're calling Alloc and ContextAlloc 10 times for every row being summed? There are approx 10M rows and the profile snippet below shows 100M calls to each of those. Unless I've accidentally run gprof on the profile output for a 100M row case I had lying around :-( ... I'll check ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Avg performance for int8/numeric
Mark Kirkwood wrote: Luke Lonergan wrote: So, if I understand this correctly, we're calling Alloc and ContextAlloc 10 times for every row being summed? There are approx 10M rows and the profile snippet below shows 100M calls to each of those. Unless I've accidentally run gprof on the profile output for a 100M row case I had lying around :-( ... I'll check I haven't (so profile as attached is ok)... ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
[PATCHES] pg_buffercache tidyup
This is probably terrible timing, but I noticed Tom had done some nice tidying up on pg_freespacemap to eliminate the clumsy conversion to and from strings. This patch does a similar thing for pg_buffercache. I did wonder about not showing buffers that are invalid or not in use (currently displays all attributes bar the id as NULL). Comments? Cheers Mark Index: contrib/pg_buffercache/pg_buffercache_pages.c === RCS file: /projects/cvsroot/pgsql/contrib/pg_buffercache/pg_buffercache_pages.c,v retrieving revision 1.10 diff -c -r1.10 pg_buffercache_pages.c *** contrib/pg_buffercache/pg_buffercache_pages.c 19 Oct 2006 18:32:46 - 1.10 --- contrib/pg_buffercache/pg_buffercache_pages.c 22 Oct 2006 06:27:52 - *** *** 8,13 --- 8,14 */ #include postgres.h #include funcapi.h + #include access/heapam.h #include catalog/pg_type.h #include storage/buf_internals.h #include storage/bufmgr.h *** *** 44,52 typedef struct { ! AttInMetadata *attinmeta; BufferCachePagesRec *record; - char *values[NUM_BUFFERCACHE_PAGES_ELEM]; } BufferCachePagesContext; --- 45,52 typedef struct { ! TupleDesc tupdesc; BufferCachePagesRec *record; } BufferCachePagesContext; *** *** 77,83 /* Switch context when allocating stuff to be used in later calls */ oldcontext = MemoryContextSwitchTo(funcctx-multi_call_memory_ctx); ! /* Construct a tuple to return. */ tupledesc = CreateTemplateTupleDesc(NUM_BUFFERCACHE_PAGES_ELEM, false); TupleDescInitEntry(tupledesc, (AttrNumber) 1, bufferid, INT4OID, -1, 0); --- 77,87 /* Switch context when allocating stuff to be used in later calls */ oldcontext = MemoryContextSwitchTo(funcctx-multi_call_memory_ctx); ! ! /* Create a user function context for cross-call persistence */ ! fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext)); ! ! /* Construct a tuple to return, and save its descriptor. 
*/ tupledesc = CreateTemplateTupleDesc(NUM_BUFFERCACHE_PAGES_ELEM, false); TupleDescInitEntry(tupledesc, (AttrNumber) 1, bufferid, INT4OID, -1, 0); *** *** 92,118 TupleDescInitEntry(tupledesc, (AttrNumber) 6, isdirty, BOOLOID, -1, 0); ! /* Generate attribute metadata needed later to produce tuples */ ! funcctx-attinmeta = TupleDescGetAttInMetadata(tupledesc); ! ! /* ! * Create a function context for cross-call persistence and initialize ! * the buffer counters. ! */ ! fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext)); ! funcctx-max_calls = NBuffers; ! funcctx-user_fctx = fctx; /* Allocate NBuffers worth of BufferCachePagesRec records. */ fctx-record = (BufferCachePagesRec *) palloc(sizeof(BufferCachePagesRec) * NBuffers); ! /* allocate the strings for tuple formation */ ! fctx-values[0] = (char *) palloc(3 * sizeof(uint32) + 1); ! fctx-values[1] = (char *) palloc(3 * sizeof(uint32) + 1); ! fctx-values[2] = (char *) palloc(3 * sizeof(uint32) + 1); ! fctx-values[3] = (char *) palloc(3 * sizeof(uint32) + 1); ! fctx-values[4] = (char *) palloc(3 * sizeof(uint32) + 1); ! fctx-values[5] = (char *) palloc(2); /* Return to original context when allocating transient memory */ MemoryContextSwitchTo(oldcontext); --- 96,109 TupleDescInitEntry(tupledesc, (AttrNumber) 6, isdirty, BOOLOID, -1, 0); ! fctx-tupdesc = BlessTupleDesc(tupledesc); /* Allocate NBuffers worth of BufferCachePagesRec records. */ fctx-record = (BufferCachePagesRec *) palloc(sizeof(BufferCachePagesRec) * NBuffers); ! /* Set max calls and remember the user function context. */ ! funcctx-max_calls = NBuffers; ! funcctx-user_fctx = fctx; /* Return to original context when allocating transient memory */ MemoryContextSwitchTo(oldcontext); *** *** 167,184 if (funcctx-call_cntr funcctx-max_calls) { uint32 i = funcctx-call_cntr; ! char *values[NUM_BUFFERCACHE_PAGES_ELEM]; ! int j; ! ! /* ! * Use a temporary values array, initially pointing to fctx-values, ! 
* so it can be reassigned w/o losing the storage for subsequent ! * calls. ! */ ! for (j = 0; j NUM_BUFFERCACHE_PAGES_ELEM; j++) ! { ! values[j] = fctx-values[j]; ! } /* --- 158,165 if (funcctx-call_cntr funcctx-max_calls) { uint32 i = funcctx-call_cntr; ! Datum values[NUM_BUFFERCACHE_PAGES_ELEM]; ! bool nulls[NUM_BUFFERCACHE_PAGES_ELEM]; /* *** *** 189,224 fctx-record[i].isvalid == false) { ! sprintf(values[0], %u, fctx-record[i].bufferid); ! values[1] = NULL; ! values[2] = NULL; ! values[3] = NULL; ! values[4] = NULL; ! values[5] = NULL; } else { ! sprintf(values[0], %u, fctx-record[i].bufferid
Re: [PATCHES] WIP: Hierarchical Queries - stage 1
Hi Tom, Thanks for your initial thoughts on this. On Wed, 2006-09-20 at 20:13 -0400, Tom Lane wrote: (cut) You really can't get away with having the identical representation for CTEs and ordinary sub-selects in the range table. For instance, it looks like your patch will think that select ... from (select ...) as x, x, ... is legal when it certainly is not. I think you need either a new RTEKind or an additional flag in the RTE to show that it's a CTE rather than a plain subselect. I'm not entirely sure that you even want the CTEs in the rangetable at all --- that still needs some thought. For semantic reasons, I can see why you are questioning whether or not the CTE should be contained within the rangetable - there is an implicit hint that RTEs reflect entries within the FROM clause, but then I also see that the rewriter adds RTEs when substituting view definitions into queries. The comments in parsenodes.h also suggest that an RTE is a namespace/data source reference for a named entity within the query. The main problem I can see with keeping the CTEs outside the rangetable is that according to the source, jointree nodes must currently have RANGETBLREF nodes as leaf nodes; as I understand it, your suggestion of maintaining the CTEs separately would involve something along the lines of keeping a separate CTETable and creating some form of CTETBLREF node that could be referenced within the jointree. While arguably it may be semantically neater, it appears to involve quite a bit of extra work... could you explain in more detail as to why you feel that CTEs should remain outside the rangetable? This comes back to the question of whether the CTE per se should be an RTE at all. Maybe only the reference to it should be an RTE. The behavior when seeing a plain RangeVar in FROM would be to first search the side list of valid CTEs, and only on failure go looking for a real table. This is effectively what the patch does, albeit not particularly elegantly. 
I'll spend some time on making those changes a bit more refined. Kind regards, Mark. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[PATCHES] WIP: Hierarchical Queries - stage 1
Hi everyone, After spending several days reading PostgreSQL source code (and another couple of days coding), I've managed to come up with some alpha code that attempts to implement non-recursive WITH common table expressions. Having got this far, I feel that I need to ask for advice from some more experienced PostgreSQL hackers and so I post the alpha version here. The patch attempts to alter the parser grammar, and using code modified from that used when processing subselects in the FROM clause, attempts to treat the CTE as if it were presented the same as a FROM subselect. My main issue at the moment is that the code in transformFromClauseItem seems a terrible hack, mainly because the grammar returns each string within the FROM clause as a RangeVar, and transformFromClauseItem assumes that each RangeVar represents a physical relation. Of course, this is not the case when referencing a CTE and so the code first checks to see if an entry has already been created when processing the WITH clause; if it does then we return NULL to indicate that transformFromClause should do nothing. Messy, but I wanted to see what other developers thought before jumping in and rewriting this part of the code. Another point to think about is what should a query return if the SELECT doesn't refer to a CTE? For example: - WITH foo(a, b) AS (SELECT * FROM pg_class) SELECT * FROM pg_class; Since I've inserted the CTE as a range table entry of type RTE_SUBQUERY then the current behaviour is to perform a cross-join between foo and bar which I'm not sure is the correct behaviour since CTEs seem to be more like views in this respect. Finally, the current code fails when supplying an additional alias to a CTE in the select statement and then trying to refer to it, e.g. 
- with myrel(p1) as (select oid from pg_class) select myrel.p1 from myrel AS foo, pg_class AS bar WHERE myrel.p1 = bar.oid; -- WORKS - with myrel(p1) as (select oid from pg_class) select myrel.p1 from myrel AS foo, pg_class AS bar WHERE foo.p1 = bar.oid; -- FAILS ERROR: missing FROM-clause entry for table foo at character 103 So in this case, should foo be accepted as a valid alias for myrel? My feeling is that I will end up with an RTE_CTE range table entry kind which borrows from the current subquery behaviour, but it would be very useful to see where the similarity between the two range table entry types ends. TIA, Mark. Index: src/backend/parser/analyze.c === RCS file: /projects/cvsroot/pgsql/src/backend/parser/analyze.c,v retrieving revision 1.351 diff -c -r1.351 analyze.c *** src/backend/parser/analyze.c 18 Sep 2006 16:04:04 - 1.351 --- src/backend/parser/analyze.c 20 Sep 2006 19:02:44 - *** *** 2087,2092 --- 2087,2095 /* make FOR UPDATE/FOR SHARE info available to addRangeTableEntry */ pstate-p_locking_clause = stmt-lockingClause; + /* process the WITH clause */ + transformWithClause(pstate, stmt-withClause); + /* process the FROM clause */ transformFromClause(pstate, stmt-fromClause); Index: src/backend/parser/gram.y === RCS file: /projects/cvsroot/pgsql/src/backend/parser/gram.y,v retrieving revision 2.565 diff -c -r2.565 gram.y *** src/backend/parser/gram.y 3 Sep 2006 22:37:05 - 2.565 --- src/backend/parser/gram.y 20 Sep 2006 19:02:52 - *** *** 346,351 --- 346,354 %type str OptTableSpace OptConsTableSpace OptTableSpaceOwner %type list opt_check_option + %type list OptCommonTableExpressionList CommonTableExpressionList + %type node CommonTableExpression + /* * If you make any token changes, update the keyword table in *** *** 5594,5599 --- 5597,5638 | WITHOUT HOLD { $$ = FALSE; } ; + + + /* + * + * COMMON TABLE EXPRESSIONS + *WITH STATEMENTS + * + */ + + + OptCommonTableExpressionList: + WITH CommonTableExpressionList { $$ = $2; } + | /* EMPTY */ { $$ = 
NIL; } + ; + + CommonTableExpressionList: + CommonTableExpression { $$ = list_make1($1); } + | CommonTableExpressionList ',' CommonTableExpression { $$ = lappend($1, $3); } + ; + + CommonTableExpression: + ColId '(' columnList ')' AS select_with_parens + { + Alias *a = makeNode(Alias); + a-aliasname = $1; + a-colnames = $3; + + RangeSubselect *n = makeNode(RangeSubselect); + n-subquery = $6; + n-alias = a; + $$ = (Node *) n; + } + ; + + + /* * * QUERY: *** *** 5705,5723 * However, this is not checked by the grammar; parse analysis must check it. */ simple_select: ! SELECT
Re: [HACKERS] [PATCHES] Patch for UUID datatype (beta)
On Tue, Sep 19, 2006 at 08:20:13AM -0500, Jim C. Nasby wrote: On Mon, Sep 18, 2006 at 07:45:07PM -0400, [EMAIL PROTECTED] wrote: I would not use a 100% random number generator for a UUID value as was suggested. I prefer inserting the MAC address and the time, to at least allow me to control if a collision is possible. This is not easy to do using a few lines of C code. I'd rather have a UUID type in core with no generation routine, than no UUID type in core because the code is too complicated to maintain, or not portable enough. As others have mentioned, using MAC address doesn't remove the possibility of a collision. It does, as I control the MAC address. I can choose not to overwrite it. I can choose to ensure that any cases where it is overwritten, it is overwritten with a unique value. Random number does not provide this level of control. Maybe a good compromise that would allow a generator function to go into the backend would be to combine the current time with a random number. That will ensure that you won't get a dupe, so long as your clock never runs backwards. Which standard UUID generation function would you be thinking of? Inventing a new one doesn't seem sensible. I'll have to read over the versions again... Cheers, mark -- [EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] __ . . _ ._ . . .__. . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/|_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] Patch for UUID datatype (beta)
On Tue, Sep 19, 2006 at 11:21:51PM -0400, Alvaro Herrera wrote: [EMAIL PROTECTED] wrote: On Tue, Sep 19, 2006 at 08:20:13AM -0500, Jim C. Nasby wrote: On Mon, Sep 18, 2006 at 07:45:07PM -0400, [EMAIL PROTECTED] wrote: I would not use a 100% random number generator for a UUID value as was suggested. I prefer inserting the MAC address and the time, to at least allow me to control if a collision is possible. This is not easy to do using a few lines of C code. I'd rather have a UUID type in core with no generation routine, than no UUID type in core because the code is too complicated to maintain, or not portable enough. As others have mentioned, using MAC address doesn't remove the possibility of a collision. It does, as I control the MAC address. What happens if you have two postmaster running on the same machine? Could be bad things. :-) For the case of two postmaster processes, I assume you mean two different databases? If you never intend to merge the data between the two databases, the problem is irrelevant. There is a much greater chance that any UUID form is more unique, or can be guaranteed to be unique, within a single application instance, than across all application instances in existence. If you do intend to merge the data, you may have a problem. If I have two connections to PostgreSQL - would the plpgsql procedures be executed from two different processes? With an in-core generation routine, I think it is possible for it to collide unless inter-process synchronization is used (unlikely) to ensure generation of unique time/sequence combinations each time. I use this right now (mostly), but as I've mentioned, it isn't my favourite. It's convenient. I don't believe it provides the sort of guarantees that a SERIAL provides. A model that intended to try and guarantee uniqueness would provide a UUID generation service for the entire host, that was not specific to any application, or database, possibly accessible via the loopback address. 
It would ensure that at any given time, either the time is new, or the sequence is new for the time. If computer time ever went backwards, it could keep the last time issued persistent, and increment from this point forward through the clock sequence values until real time catches up. An alternative would be along the lines of a /dev/uuid device, that like /dev/random, would be responsible for outputting unique uuid values for the system. Who does this? Probably nobody. I'm tempted to implement it, though, for my uses. :-) Cheers, mark -- [EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] __ . . _ ._ . . .__. . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/|_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [PATCHES] Patch for UUID datatype (beta)
On Mon, Sep 18, 2006 at 10:33:22AM -0400, Tom Lane wrote: Andreas Pflug [EMAIL PROTECTED] writes: Isn't guaranteed uniqueness the very attribute that's expected? AFAIK there's a commonly accepted algorithm providing this. Anyone who thinks UUIDs are guaranteed unique has been drinking too much of the kool-aid. They're at best probably unique. Some generator algorithms might make it more probable than others, but you simply cannot guarantee it for UUIDs generated on noncommunicating machines. The versions that include a MAC address, time, and serial number for the machine come pretty close, presuming that the user has not overwritten the MAC address with something else. It's unique at manufacturing time. If the generation is performed from a library with the same state, on the same machine, on the off chance that you do request multiple generations at the same exact time (from my experience, this is already unlikely) the serial number should be bumped for that time. So yeah - if you set your MAC address, or if your machine time is ever set back, or if you assume a serial number of 0 each time (generation routine isn't shared among processes on the system), you can get overlap. All of these can be controlled, making it possible to eliminate overlap. One of the big reasons that I'm hesitant to put a UUID generation function into core is the knowledge that none of them are or can be perfect ... so people might need different ones depending on local conditions. I'm inclined to think that a reasonable setup would put the datatype (with input, output, comparison and indexing support) into core, but provide a generation function as a contrib module, making it easily replaceable. I have UUID generation in core in my current implementation. In the last year that I've been using it, I have already chosen twice to generate UUIDs from my calling program. I find it faster, as it avoids have to call out to PostgreSQL twice. 
Once to generate the UUID, and once to insert the row using it. I have no strong need for UUID generation to be in core, and believe there does exist strong reasons not to. Performance is better when not in core. Portability of PostgreSQL is better when not in core. Ability to control how UUID is defined is better when not in control. The only thing an in-core version provides is convenience for those that do not have easy access to a UUID generation library. I don't care for that convenience. Cheers, mark -- [EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] __ . . _ ._ . . .__. . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/|_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [PATCHES] Patch for UUID datatype (beta)
On Mon, Sep 18, 2006 at 04:17:50PM -0500, Jim C. Nasby wrote: On Mon, Sep 18, 2006 at 12:23:16PM -0400, [EMAIL PROTECTED] wrote: I have UUID generation in core in my current implementation. In the last year that I've been using it, I have already chosen twice to generate UUIDs from my calling program. I find it faster, as it avoids have to call out to PostgreSQL twice. Once to generate the UUID, and once to insert the row using it. I have no strong need for UUID generation to be in core, and believe there does exist strong reasons not to. Performance is better when not in core. Portability of PostgreSQL is better when not in core. Ability to control how UUID is defined is better when not in control. That's kinda short-sighted. You're assuming that the only place you'll want to generate UUIDs is outside the database. What about a stored procedure that's adding data to the database? How about populating a table via a SELECT INTO? There's any number of cases where you'd want to generate a UUID inside the database. contrib module. The only thing an in-core version provides is convenience for those that do not have easy access to a UUID generation library. I don't care for that convenience. It's not about access to a library, it's about how do you get to that library from inside the database, which may not be very easy. You may not care for that convenience, but I certainly would. Then load the contrib module. I do both. I'd happily reduce my contrib module to be based upon a built-in UUID type within PostgreSQL, providing the necessary UUID generation routines. I would not use a 100% random number generator for a UUID value as was suggested. I prefer inserting the MAC address and the time, to at least allow me to control if a collision is possible. This is not easy to do using a few lines of C code. I'd rather have a UUID type in core with no generation routine, than no UUID type in core because the code is too complicated to maintain, or not portable enough. 
Cheers,
mark

--
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] | Neighbourhood Coder | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them...
http://mark.mielke.cc/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] WIP: splitting BLCKSZ
Here's an updated patch with help from Simon. Once I get a test system going again in the lab I'll start posting some data. I'm planning a combination of block sizes (BLCKSZ and XLOG_BLCKSZ) and number of WAL buffers.

Thanks,
Mark

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.229
diff -c -r1.229 xlog.c
*** src/backend/access/transam/xlog.c	28 Mar 2006 22:01:16 -0000	1.229
--- src/backend/access/transam/xlog.c	3 Apr 2006 15:03:09 -0000
***************
*** 113,122 ****
  /*
   * Limitation of buffer-alignment for direct IO depends on OS and filesystem,
!  * but BLCKSZ is assumed to be enough for it.
   */
  #ifdef O_DIRECT
! #define ALIGNOF_XLOG_BUFFER		BLCKSZ
  #else
  #define ALIGNOF_XLOG_BUFFER		ALIGNOF_BUFFER
  #endif
--- 113,122 ----
  /*
   * Limitation of buffer-alignment for direct IO depends on OS and filesystem,
!  * but XLOG_BLCKSZ is assumed to be enough for it.
   */
  #ifdef O_DIRECT
! #define ALIGNOF_XLOG_BUFFER		XLOG_BLCKSZ
  #else
  #define ALIGNOF_XLOG_BUFFER		ALIGNOF_BUFFER
  #endif
***************
*** 130,136 ****
  /* User-settable parameters */
  int			CheckPointSegments = 3;
! int			XLOGbuffers = 8;
  char	   *XLogArchiveCommand = NULL;
  char	   *XLOG_sync_method = NULL;
  const char	XLOG_sync_method_default[] = DEFAULT_SYNC_METHOD_STR;
--- 130,136 ----
  /* User-settable parameters */
  int			CheckPointSegments = 3;
! int			XLOGbuffers = 16;
  char	   *XLogArchiveCommand = NULL;
  char	   *XLOG_sync_method = NULL;
  const char	XLOG_sync_method_default[] = DEFAULT_SYNC_METHOD_STR;
***************
*** 374,380 ****
  	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
! 	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + BLCKSZ */
  	Size		XLogCacheByte;	/* # bytes in xlog buffers */
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
--- 374,380 ----
  	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
! 	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
  	Size		XLogCacheByte;	/* # bytes in xlog buffers */
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
***************
*** 397,403 ****
  /* Free space remaining in the current xlog page buffer */
  #define INSERT_FREESPACE(Insert)  \
! 	(BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
  /* Construct XLogRecPtr value for current insertion point */
  #define INSERT_RECPTR(recptr,Insert,curridx)  \
--- 397,403 ----
  /* Free space remaining in the current xlog page buffer */
  #define INSERT_FREESPACE(Insert)  \
! 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
  /* Construct XLogRecPtr value for current insertion point */
  #define INSERT_RECPTR(recptr,Insert,curridx)  \
***************
*** 441,447 ****
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  
! /* Buffer for currently read page (BLCKSZ bytes) */
  static char *readBuf = NULL;
  
  /* Buffer for current ReadRecord result (expandable) */
--- 441,447 ----
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  
! /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
  
  /* Buffer for current ReadRecord result (expandable) */
***************
*** 706,712 ****
  		 * If cache is half filled then try to acquire write lock and do
  		 * XLogWrite. Ignore any fractional blocks in performing this check.
  		 */
! 		LogwrtRqst.Write.xrecoff -= LogwrtRqst.Write.xrecoff % BLCKSZ;
  		if (LogwrtRqst.Write.xlogid != LogwrtResult.Write.xlogid ||
  			(LogwrtRqst.Write.xrecoff >= LogwrtResult.Write.xrecoff +
  			 XLogCtl->XLogCacheByte / 2))
--- 706,712 ----
  		 * If cache is half filled then try to acquire write lock and do
  		 * XLogWrite. Ignore any fractional blocks in performing this check.
  		 */
! 		LogwrtRqst.Write.xrecoff -= LogwrtRqst.Write.xrecoff % XLOG_BLCKSZ;
  		if (LogwrtRqst.Write.xlogid != LogwrtResult.Write.xlogid ||
  			(LogwrtRqst.Write.xrecoff >= LogwrtResult.Write.xrecoff +
  			 XLogCtl->XLogCacheByte / 2))
***************
*** 1228,1239 ****
  	{
  		/* crossing a logid boundary */
  		NewPageEndPtr.xlogid += 1;
! 		NewPageEndPtr.xrecoff = BLCKSZ;
  	}
  	else
! 		NewPageEndPtr.xrecoff += BLCKSZ;
  	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) BLCKSZ);
  	Insert->curridx = nextidx;
  	Insert->currpage = NewPage;
--- 1228,1239 ----
  	{
  		/* crossing a logid boundary */
  		NewPageEndPtr.xlogid += 1;
! 		NewPageEndPtr.xrecoff = XLOG_BLCKSZ;
  	}
  	else
! 		NewPageEndPtr.xrecoff += XLOG_BLCKSZ;
  	XLogCtl->xlblocks[nextidx] = NewPageEndPtr;
! 	NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size)
Re: [PATCHES] WIP: splitting BLCKSZ
On Wed, 22 Mar 2006 14:19:48 -0500, Tom Lane [EMAIL PROTECTED] wrote:

Mark Wong [EMAIL PROTECTED] writes: I proposed to explore splitting BLCKSZ into separate values for logging and data to see if there might be anything to gain: http://archives.postgresql.org/pgsql-hackers/2006-03/msg00745.php My first pass was to do more or less a search and replace (attached) and I am already running into trouble with a 'make check' (below). I'm guessing that when initdb is run, I'm not properly saving the values that I've defined for DATA_BLCKSZ and possibly LOG_BLCKSZ.

I'd suggest leaving BLCKSZ as-is and inventing XLOG_BLCKSZ to be used only within the WAL code; should make for a *far* smaller patch. Offhand I don't think that anything except xlog.c knows the WAL block size --- it should be fairly closely associated with dependencies on XLOG_SEG_SIZE, if you are looking for something to grep for.

Ok, I have attached something much smaller. Appears to pass a 'make check' but I'll keep going to make sure it's really correct and works.

Thanks,
Mark

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.227
diff -c -r1.227 xlog.c
*** src/backend/access/transam/xlog.c	5 Mar 2006 15:58:22 -0000	1.227
--- src/backend/access/transam/xlog.c	23 Mar 2006 19:13:31 -0000
***************
*** 113,122 ****
  /*
   * Limitation of buffer-alignment for direct IO depends on OS and filesystem,
!  * but BLCKSZ is assumed to be enough for it.
   */
  #ifdef O_DIRECT
! #define ALIGNOF_XLOG_BUFFER		BLCKSZ
  #else
  #define ALIGNOF_XLOG_BUFFER		ALIGNOF_BUFFER
  #endif
--- 113,122 ----
  /*
   * Limitation of buffer-alignment for direct IO depends on OS and filesystem,
!  * but XLOG_BLCKSZ is assumed to be enough for it.
   */
  #ifdef O_DIRECT
! #define ALIGNOF_XLOG_BUFFER		XLOG_BLCKSZ
  #else
  #define ALIGNOF_XLOG_BUFFER		ALIGNOF_BUFFER
  #endif
***************
*** 374,380 ****
  	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
! 	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + BLCKSZ */
  	Size		XLogCacheByte;	/* # bytes in xlog buffers */
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
--- 374,380 ----
  	 * and xlblocks values depends on WALInsertLock and WALWriteLock.
  	 */
  	char	   *pages;			/* buffers for unwritten XLOG pages */
! 	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
  	Size		XLogCacheByte;	/* # bytes in xlog buffers */
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
***************
*** 397,403 ****
  /* Free space remaining in the current xlog page buffer */
  #define INSERT_FREESPACE(Insert)  \
! 	(BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
  /* Construct XLogRecPtr value for current insertion point */
  #define INSERT_RECPTR(recptr,Insert,curridx)  \
--- 397,403 ----
  /* Free space remaining in the current xlog page buffer */
  #define INSERT_FREESPACE(Insert)  \
! 	(XLOG_BLCKSZ - ((Insert)->currpos - (char *) (Insert)->currpage))
  
  /* Construct XLogRecPtr value for current insertion point */
  #define INSERT_RECPTR(recptr,Insert,curridx)  \
***************
*** 441,447 ****
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  
! /* Buffer for currently read page (BLCKSZ bytes) */
  static char *readBuf = NULL;
  
  /* Buffer for current ReadRecord result (expandable) */
--- 441,447 ----
  static uint32 readSeg = 0;
  static uint32 readOff = 0;
  
! /* Buffer for currently read page (XLOG_BLCKSZ bytes) */
  static char *readBuf = NULL;
  
  /* Buffer for current ReadRecord result (expandable) */
***************
*** 662,668 ****
  			{
  				COMP_CRC32(rdata_crc,
  						   page,
! 						   BLCKSZ);
  			}
  			else
  			{
--- 662,668 ----
  			{
  				COMP_CRC32(rdata_crc,
  						   page,
! 						   XLOG_BLCKSZ);
  			}
  			else
  			{
***************
*** 672,678 ****
  						   bkpb->hole_offset);
  				COMP_CRC32(rdata_crc,
  						   page + (bkpb->hole_offset + bkpb->hole_length),
! 						   BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
  			}
  		}
  	}
--- 672,678 ----
  						   bkpb->hole_offset);
  				COMP_CRC32(rdata_crc,
  						   page + (bkpb->hole_offset + bkpb->hole_length),
! 						   XLOG_BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
  			}
  		}
  	}
***************
*** 705,711 ****
  		 * If cache is half filled then try to acquire write lock and do
  		 * XLogWrite. Ignore any fractional blocks in performing this check.
  		 */
! 		LogwrtRqst.Write.xrecoff -= LogwrtRqst.Write.xrecoff % BLCKSZ;
  		if (LogwrtRqst.Write.xlogid != LogwrtResult.Write.xlogid ||
  			(LogwrtRqst.Write.xrecoff >= LogwrtResult.Write.xrecoff +
  			 XLogCtl->XLogCacheByte / 2))
--- 705,711 ----
  		 * If cache is half filled then try
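The ALIGNOF_XLOG_BUFFER change in the patches above matters because, with O_DIRECT, the kernel typically requires the buffer address to be aligned to the I/O block size. Since the block size is a power of two, the rounding can be done with a mask; a rough standalone sketch (the 8 kB XLOG_BLCKSZ value here is just the usual default, not taken from any particular build):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 8 kB WAL block size; ALIGNOF_XLOG_BUFFER plays this role
 * in the patch when O_DIRECT is defined. */
#define XLOG_BLCKSZ 8192

/* Round addr up to the next XLOG_BLCKSZ boundary.  Works because
 * XLOG_BLCKSZ is a power of two. */
static uintptr_t xlog_buffer_align(uintptr_t addr)
{
	return (addr + XLOG_BLCKSZ - 1) & ~((uintptr_t) XLOG_BLCKSZ - 1);
}
```

Allocating XLOG_BLCKSZ extra bytes and aligning the start of the buffer this way is the usual trick for satisfying direct-I/O alignment from a plain malloc'd region.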
Re: [PATCHES] [HACKERS] pg_freespacemap question
Mark Kirkwood wrote: Tom Lane wrote: I do notice a rather serious shortcoming of pg_freespacemap in its current incarnation, which is that it *only* shows you the per-page free space data, and not any of the information that would let you determine what the FSM is doing to filter the raw data. The per-relation avgRequest and lastPageCount fields would be interesting for instance. Perhaps there should be a second view with one row per relation to carry the appropriate data. Ok - I did wonder about 2 views, but was unsure if the per-relation stuff was interesting. Given that it looks like it is interesting, I'll see about getting a second view going. This patch implements the second view for FSM relations. I have renamed the functions and views to be: pg_freespacemap_relations pg_freespacemap_pages This patch depends on the previous one (which was called simply 'pg_freespacemap.patch'). Cheers Mark pg_freespacemap-1.patch.gz Description: application/gzip ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[PATCHES] Pg_buffercache tidy
This patch wraps the DDL in a BEGIN...COMMIT (as suggested by Jim for pg_freespacemap a while ago). In addition I have amended the example queries to correctly project out only relations in the current database (relations with the same name from different databases were getting counted together).

Cheers

Mark

Index: pg_buffercache.sql.in
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/pg_buffercache/pg_buffercache.sql.in,v
retrieving revision 1.3
diff -c -r1.3 pg_buffercache.sql.in
*** pg_buffercache.sql.in	27 Feb 2006 16:09:48 -0000	1.3
--- pg_buffercache.sql.in	19 Mar 2006 10:45:06 -0000
***************
*** 1,4 ****
--- 1,5 ----
  -- Adjust this setting to control where the objects get created.
+ BEGIN;
  SET search_path = public;
  
  -- Register the function.
***************
*** 16,18 ****
--- 17,21 ----
  -- Don't want these to be available at public.
  REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
  REVOKE ALL ON pg_buffercache FROM PUBLIC;
+ 
+ COMMIT;
Index: README.pg_buffercache
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/pg_buffercache/README.pg_buffercache,v
retrieving revision 1.2
diff -c -r1.2 README.pg_buffercache
*** README.pg_buffercache	31 May 2005 00:07:47 -0000	1.2
--- README.pg_buffercache	19 Mar 2006 10:55:44 -0000
***************
*** 76,83 ****
  	isdirty boolean);
  
  regression=# SELECT c.relname, count(*) AS buffers
  	FROM pg_class c, pg_buffercache b
  	WHERE b.relfilenode = c.relfilenode
  	GROUP BY c.relname
  	ORDER BY 2 DESC LIMIT 10;
  
  	relname | buffers
--- 76,84 ----
  	isdirty boolean);
  
  regression=# SELECT c.relname, count(*) AS buffers
  	FROM pg_class c INNER JOIN pg_buffercache b
  	ON b.relfilenode = c.relfilenode INNER JOIN pg_database d
  	ON (b.reldatabase = d.oid AND d.datname = current_database())
  	GROUP BY c.relname
  	ORDER BY 2 DESC LIMIT 10;
  
  	relname | buffers

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] [HACKERS] pg_freespacemap question
Tatsuo Ishii wrote: BTW, I noticed difference of outputs from pg_freespacemap and pgstattuple. I ran pgbench and inspected accounts table by using these tools. pg_freespacemap: sum of bytes: 250712 pgstattuple: free_space: 354880 Shouldn't they be identical? I would have thought so - unless there are not enough pages left in the FSM... pg_freespacemap is reporting on what gets into the FSM - so provided I haven't put a bug in there somewhere (!) - we need to look at how VACUUM reports free space to the FSM cheers Mark ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] [HACKERS] pg_freespacemap question
Tom Lane wrote:

Tatsuo Ishii wrote: BTW, I noticed a difference of outputs from pg_freespacemap and pgstattuple. I ran pgbench and inspected the accounts table by using these tools. pg_freespacemap: sum of bytes: 250712 pgstattuple: free_space: 354880 Shouldn't they be identical?

vacuum/fsm disregard pages with uselessly small amounts of free space (less than the average tuple size, IIRC).

Ah - that's what I was seeing! Thanks.

I do notice a rather serious shortcoming of pg_freespacemap in its current incarnation, which is that it *only* shows you the per-page free space data, and not any of the information that would let you determine what the FSM is doing to filter the raw data. The per-relation avgRequest and lastPageCount fields would be interesting for instance. Perhaps there should be a second view with one row per relation to carry the appropriate data.

Ok - I did wonder about 2 views, but was unsure if the per-relation stuff was interesting. Given that it looks like it is interesting, I'll see about getting a second view going.

Cheers

Mark
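The filtering described above can be sketched as a simple threshold pass: pages whose free space falls below the relation's average request size never make it into the map, which is why pg_freespacemap's total can come out lower than pgstattuple's. This is an illustrative stand-in, not the actual FSM code — the struct, function name, and in-place compaction are all invented for the sketch:

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
	unsigned	blkno;
	size_t		avail;			/* free bytes reported by VACUUM */
} PageFreeSpace;

/* Keep only pages offering at least avg_request free bytes, compacting
 * the array in place; returns the number of pages kept. */
static int fsm_filter_pages(PageFreeSpace *pages, int n, size_t avg_request)
{
	int			kept = 0;

	for (int i = 0; i < n; i++)
		if (pages[i].avail >= avg_request)
			pages[kept++] = pages[i];
	return kept;
}
```

With a threshold like this, the sum over the retained pages is necessarily less than or equal to the raw total free space that pgstattuple reports.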
Re: [PATCHES] [HACKERS] pg_freespacemap question
Tom Lane wrote:

Mark Kirkwood [EMAIL PROTECTED] writes: Good points! I had not noticed this test case. Probably NULL is better. Would setting it to 'BLCKSZ - (fixed index header stuff)' be better,

No, I don't think so, because that will just make it harder to recognize what's what (remember that BLCKSZ isn't really a constant, and the index overhead is not the same for all AMs either). The point here is that for indexes the FSM tracks whole-page availability, not the amount of free space within pages. So I think NULL is a reasonable representation of that. Using NULL will make it easy to filter the results if you want to see only heap-page data or only index-page data, whereas it will be very hard to do that if the view adopts an ultimately-artificial convention about the amount of available space on an index page.

Right - after suggesting it I realized that coding the different index overhead for each possible AM would have been ... difficult :-). A patch is attached to implement the NULL free bytes and other recommendations:

1/ Index free bytes set to NULL
2/ Comment added to the README briefly mentioning the index business
3/ Columns reordered more logically
4/ 'Blockid' column removed
5/ Free bytes column renamed to just 'bytes' instead of 'blockfreebytes'

Now 5/ was only hinted at, but seemed worth doing while I was there (hopefully I haven't made it too terse now).

cheers

Mark

Index: pg_freespacemap.c
===================================================================
RCS file: /projects/cvsroot/pgsql/contrib/pg_freespacemap/pg_freespacemap.c,v
retrieving revision 1.2
diff -c -r1.2 pg_freespacemap.c
*** pg_freespacemap.c	14 Feb 2006 15:03:59 -0000	1.2
--- pg_freespacemap.c	9 Mar 2006 03:38:10 -0000
***************
*** 12,18 ****
  #include "storage/freespace.h"
  #include "utils/relcache.h"
  
! #define NUM_FREESPACE_PAGES_ELEM	6
  
  #if defined(WIN32) || defined(__CYGWIN__)
  /* Need DLLIMPORT for some things that are not so marked in main headers */
--- 12,18 ----
  #include "storage/freespace.h"
  #include "utils/relcache.h"
  
! #define NUM_FREESPACE_PAGES_ELEM	5
  
  #if defined(WIN32) || defined(__CYGWIN__)
  /* Need DLLIMPORT for some things that are not so marked in main headers */
***************
*** 29,40 ****
  
  typedef struct
  {
- 	uint32		blockid;
- 	uint32		relfilenode;
  	uint32		reltablespace;
  	uint32		reldatabase;
  	uint32		relblocknumber;
! 	uint32		blockfreebytes;
  } FreeSpacePagesRec;
--- 29,40 ----
  
  typedef struct
  {
  	uint32		reltablespace;
  	uint32		reldatabase;
+ 	uint32		relfilenode;
  	uint32		relblocknumber;
! 	uint32		bytes;
! 	bool		isindex;
  } FreeSpacePagesRec;
***************
*** 91,107 ****
  
  	/* Construct a tuple to return. */
  	tupledesc = CreateTemplateTupleDesc(NUM_FREESPACE_PAGES_ELEM, false);
! 	TupleDescInitEntry(tupledesc, (AttrNumber) 1, "blockid",
! 					   INT4OID, -1, 0);
! 	TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
  					   OIDOID, -1, 0);
! 	TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
  					   OIDOID, -1, 0);
! 	TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
  					   OIDOID, -1, 0);
! 	TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relblocknumber",
  					   INT8OID, -1, 0);
! 	TupleDescInitEntry(tupledesc, (AttrNumber) 6, "blockfreebytes",
  					   INT4OID, -1, 0);
  
  	/* Generate attribute metadata needed later to produce tuples */
--- 91,105 ----
  
  	/* Construct a tuple to return. */
  	tupledesc = CreateTemplateTupleDesc(NUM_FREESPACE_PAGES_ELEM, false);
! 	TupleDescInitEntry(tupledesc, (AttrNumber) 1, "reltablespace",
  					   OIDOID, -1, 0);
! 	TupleDescInitEntry(tupledesc, (AttrNumber) 2, "reldatabase",
  					   OIDOID, -1, 0);
! 	TupleDescInitEntry(tupledesc, (AttrNumber) 3, "relfilenode",
  					   OIDOID, -1, 0);
! 	TupleDescInitEntry(tupledesc, (AttrNumber) 4, "relblocknumber",
  					   INT8OID, -1, 0);
! 	TupleDescInitEntry(tupledesc
Re: [PATCHES] [SQL] Interval subtracting
Attached is the new patch. To summarize:

- new function justify_interval(interval)
- modified function justify_hours(interval)
- modified function justify_days(interval)

These functions are defined to meet the requirements as discussed in this thread. Specifically:

- justify_hours makes certain the sign bit on the hours matches the sign bit on the days. It only checks the sign bit on the days, and not the months, when determining if the hours should be positive or negative. After the call, -24 < hours < 24.
- justify_days makes certain the sign bit on the days matches the sign bit on the months. Its behavior does not depend on the hours, nor does it modify the hours. After the call, -30 < days < 30.
- justify_interval makes sure the sign bits on all three fields months, days, and hours are all the same. After the call, -24 < hours < 24 AND -30 < days < 30.

'make check' passes all tests. There are no tests for justify_interval, as it is new. But the existing tests for justify_hours and justify_days appear to still work, even though the behavior has changed. Apparently, their test cases are not sensitive to the particular changes that have occurred. I would include new tests in the patch but do not know on which reference machine/platform the patches are supposed to be generated.

mark

Index: src/backend/utils/adt/timestamp.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/timestamp.c,v
retrieving revision 1.160
diff --context -r1.160 timestamp.c
*** src/backend/utils/adt/timestamp.c	22 Nov 2005 22:30:33 -0000	1.160
--- src/backend/utils/adt/timestamp.c	3 Mar 2006 20:23:26 -0000
***************
*** 1975,1980 ****
--- 1975,2054 ----
  }
  
  /*
+  * interval_justify_interval()
+  *
+  * Adjust interval so 'month', 'day', and 'time' portions are within
+  * customary bounds.  Specifically:
+  *
+  *	0 <= abs(time) < 24 hours
+  *	0 <= abs(day)  < 30 days
+  *
+  * Also, the sign bit on all three fields is made equal, so either
+  * all three fields are negative or all are positive.
+  */
+ Datum
+ interval_justify_interval(PG_FUNCTION_ARGS)
+ {
+ 	Interval   *span = PG_GETARG_INTERVAL_P(0);
+ 	Interval   *result;
+ 
+ #ifdef HAVE_INT64_TIMESTAMP
+ 	int64		wholeday;
+ #else
+ 	double		wholeday;
+ #endif
+ 	int32		wholemonth;
+ 
+ 	result = (Interval *) palloc(sizeof(Interval));
+ 	result->month = span->month;
+ 	result->day = span->day;
+ 	result->time = span->time;
+ 
+ #ifdef HAVE_INT64_TIMESTAMP
+ 	TMODULO(result->time, wholeday, USECS_PER_DAY);
+ #else
+ 	TMODULO(result->time, wholeday, (double) SECS_PER_DAY);
+ #endif
+ 	result->day += wholeday;	/* could overflow... */
+ 
+ 	wholemonth = result->day / DAYS_PER_MONTH;
+ 	result->day -= wholemonth * DAYS_PER_MONTH;
+ 	result->month += wholemonth;
+ 
+ 	if (result->month < 0 && result->day > 0)
+ 	{
+ 		result->day -= DAYS_PER_MONTH;
+ 		result->month++;
+ 	}
+ 	else if (result->month > 0 && result->day < 0)
+ 	{
+ 		result->day += DAYS_PER_MONTH;
+ 		result->month--;
+ 	}
+ 
+ 	if (result->time < 0 && result->day > 0)
+ 	{
+ #ifdef HAVE_INT64_TIMESTAMP
+ 		result->time += USECS_PER_DAY;
+ #else
+ 		result->time += (double) SECS_PER_DAY;
+ #endif
+ 		result->day--;
+ 	}
+ 	else if (result->time > 0 && result->day < 0)
+ 	{
+ #ifdef HAVE_INT64_TIMESTAMP
+ 		result->time -= USECS_PER_DAY;
+ #else
+ 		result->time -= (double) SECS_PER_DAY;
+ #endif
+ 		result->day++;
+ 	}
+ 
+ 	PG_RETURN_INTERVAL_P(result);
+ }
+ 
+ /*
   *	interval_justify_hours()
   *
   *	Adjust interval so 'time' contains less than a whole day, adding
***************
*** 2006,2011 ****
--- 2080,2104 ----
  #endif
  	result->day += wholeday;	/* could overflow... */
  
+ 	if (result->time < 0 && result->day > 0)
+ 	{
+ #ifdef HAVE_INT64_TIMESTAMP
+ 		result->time += USECS_PER_DAY;
+ #else
+ 		result->time += (double) SECS_PER_DAY;
+ #endif
+ 		result->day--;
+ 	}
+ 	else if (result->time > 0 && result->day < 0)
+ 	{
+ #ifdef HAVE_INT64_TIMESTAMP
+ 		result->time -= USECS_PER_DAY;
+ #else
+ 		result->time -= (double) SECS_PER_DAY;
+ #endif
+ 		result->day++;
+ 	}
+ 
  	PG_RETURN_INTERVAL_P(result);
  }
  
***************
*** 2031,2036 ****
--- 2124,2140 ----
  	result->day -= wholemonth * DAYS_PER_MONTH;
  	result->month += wholemonth;
  
+ 	if (result->month < 0 && result->day > 0)
+ 	{
+ 		result->day -= DAYS_PER_MONTH;
+ 		result->month++;
+ 	}
+ 	else if (result->month > 0 && result->day < 0)
+ 	{
+ 		result->day += DAYS_PER_MONTH;
+ 		result->month
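Stripped of the fmgr wrappers and the float-timestamp branch, the sign-matching logic of the new justify_interval can be sketched standalone like this. The struct and constant names are local to the sketch (PostgreSQL's Interval uses the same three fields); time is in microseconds, as with integer timestamps:

```c
#include <assert.h>
#include <stdint.h>

#define DAYS_PER_MONTH	30
#define USECS_PER_DAY	INT64_C(86400000000)
#define USECS_PER_HOUR	INT64_C(3600000000)

typedef struct
{
	int32_t		month;
	int32_t		day;
	int64_t		time;			/* microseconds */
} SketchInterval;

/* Normalize so -24h < time < 24h, -30 < day < 30, and all three fields
 * share the same sign, mirroring interval_justify_interval above. */
static SketchInterval justify_interval(SketchInterval v)
{
	int64_t		wholeday = v.time / USECS_PER_DAY;	/* truncates toward 0 */
	int32_t		wholemonth;

	v.time -= wholeday * USECS_PER_DAY;
	v.day += (int32_t) wholeday;

	wholemonth = v.day / DAYS_PER_MONTH;
	v.day -= wholemonth * DAYS_PER_MONTH;
	v.month += wholemonth;

	/* make day's sign agree with month's */
	if (v.month < 0 && v.day > 0)
	{
		v.day -= DAYS_PER_MONTH;
		v.month++;
	}
	else if (v.month > 0 && v.day < 0)
	{
		v.day += DAYS_PER_MONTH;
		v.month--;
	}

	/* make time's sign agree with day's */
	if (v.time < 0 && v.day > 0)
	{
		v.time += USECS_PER_DAY;
		v.day--;
	}
	else if (v.time > 0 && v.day < 0)
	{
		v.time -= USECS_PER_DAY;
		v.day++;
	}
	return v;
}
```

Working through '3 months -33 days -12 hours' by hand with this logic gives '1 mon 26 days 12:00:00', the same value the justify_days(justify_hours(...)) composition produced earlier in the thread.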
Re: [PATCHES] [SQL] Interval subtracting
Mark Dilger wrote:

Mark Dilger wrote:

Tom Lane wrote: Milen A. Radev [EMAIL PROTECTED] writes: Milorad Poluga wrote:

SELECT '10 years 1 mons 1 days'::interval - '9 years 10 mons 15 days'::interval;
    ?column?
-----------------
 3 mons -14 days

Why not '2 mons 16 days'?

Please read the last paragraph in section 8.5.1.4 of the manual (http://www.postgresql.org/docs/8.1/static/datatype-datetime.html#AEN4775). It mentions the functions named justify_days and justify_hours that could do what you need.

justify_days doesn't currently do anything with this result --- it thinks its charter is only to reduce day components that are >= 30 days. However, I think a good case could be made that it should normalize negative days too; that is, the invariant on its result should be 0 <= days < 30, not merely days < 30. Similarly for justify_hours. Comments anyone? Patch anyone?

Sure, if nobody objects to this change I can write the patch.

mark

I've modified the code and it now behaves as follows:

select justify_days('3 months -12 days'::interval);
  justify_days
----------------
 2 mons 18 days

select justify_days('3 months -33 days'::interval);
 justify_days
---------------
 1 mon 27 days

select justify_hours('3 months -33 days -12 hours'::interval);
       justify_hours
----------------------------
 3 mons -34 days +12:00:00

select justify_days(justify_hours('3 months -33 days -12 hours'::interval));
      justify_days
------------------------
 1 mon 26 days 12:00:00

select justify_hours('-73 hours'::interval);
   justify_hours
-------------------
 -4 days +23:00:00

select justify_days('-62 days'::interval);
   justify_days
------------------
 -3 mons +28 days

I find the last two results somewhat peculiar, as the new functionality pushes the negative values upwards (from hours to days, days to months). Changing '-73 hours' to '-3 days -1 hour' might be more intuitive? The '-4 days +23 hours' is however consistent with the behavior in the other cases. Thoughts? I will package this up into a patch fairly soon.

mark

The patch is attached. Since the functionality is being intentionally changed, not surprisingly the regression tests for timestamp, timestamptz and horology failed. The regression.diffs are also attached. I intended to update the docs for justify_days and justify_hours, but the docs don't detail the behavior at a sufficient level for any change to be warranted.

mark

Index: src/backend/utils/adt/timestamp.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/timestamp.c,v
retrieving revision 1.160
diff --context=5 -r1.160 timestamp.c
*** src/backend/utils/adt/timestamp.c	22 Nov 2005 22:30:33 -0000	1.160
--- src/backend/utils/adt/timestamp.c	1 Mar 2006 18:35:36 -0000
***************
*** 2003,2013 ****
  	TMODULO(result->time, wholeday, USECS_PER_DAY);
  #else
  	TMODULO(result->time, wholeday, (double) SECS_PER_DAY);
  #endif
  	result->day += wholeday;	/* could overflow... */
! 
  	PG_RETURN_INTERVAL_P(result);
  }
  
  /*
   *	interval_justify_days()
--- 2003,2021 ----
  	TMODULO(result->time, wholeday, USECS_PER_DAY);
  #else
  	TMODULO(result->time, wholeday, (double) SECS_PER_DAY);
  #endif
  	result->day += wholeday;	/* could overflow... */
! 	if (result->time < 0)
! 	{
! #ifdef HAVE_INT64_TIMESTAMP
! 		result->time += USECS_PER_DAY;
! #else
! 		result->time += (double) SECS_PER_DAY;
! #endif
! 		result->day--;
! 	}
  	PG_RETURN_INTERVAL_P(result);
  }
  
  /*
   *	interval_justify_days()
***************
*** 2028,2037 ****
--- 2036,2050 ----
  	result->time = span->time;
  
  	wholemonth = result->day / DAYS_PER_MONTH;
  	result->day -= wholemonth * DAYS_PER_MONTH;
  	result->month += wholemonth;
  
+ 	if (result->day < 0)
+ 	{
+ 		result->day += DAYS_PER_MONTH;
+ 		result->month--;
+ 	}
  	PG_RETURN_INTERVAL_P(result);
  }
  
  /* timestamp_pl_interval()
*** ./expected/timestamp.out	Sat Jun 25 20:04:18 2005
--- ./results/timestamp.out	Wed Mar 1 10:21:00 2006
***************
*** 442,448 ****
  SELECT '' AS "54", d1 - timestamp without time zone '1997-01-02' AS diff
     FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01';
   54 |              diff
! ----+----------------------------------
     | @ 9863 days ago
     | @ 39 days 17 hours 32 mins 1 sec
     | @ 39 days 17 hours 32 mins 1 sec
--- 442,448 ----
  SELECT '' AS "54", d1 - timestamp without time zone '1997-01-02' AS diff
     FROM TIMESTAMP_TBL WHERE d1 BETWEEN '1902-01-01' AND '2038-01-01';
   54 |              diff
! ----+---------------------------------
     | @ 9863 days ago
     | @ 39 days 17 hours 32 mins
Re: [PATCHES] Free WAL caches on switching segments
Tom Lane wrote: Mark Kirkwood [EMAIL PROTECTED] writes: Tom Lane wrote: Sounds like a recipe for ensuring it never will be tested. What's needed here is some actual tests, not preparation... Does the OP have a test scenario that those of us with appropriate OS's could try? Come to think of it, what are the appropriate OS's? (I see NetBSD mentioned so I suppose all the *BSDs, but what others?). The test run by the OP was just pgbench, Ah - right, missed that sorry. which is probably not the greatest scenario for showing the benefits of this patch, but at least it's neutral ground. You need a situation in which the kernel is under memory stress, else early free of disk cache buffers isn't going to make any difference whatever --- so choose a pgbench scale factor that makes the database noticeably larger than the test machine's RAM. Other than that, follow the usual guidelines for producing trustworthy pgbench numbers: number of clients smaller than scale factor, number of transactions per client at least 1000 or so (to eliminate startup transients), repeat test a couple times to make sure numbers are reproducible. Thinking about this, presumably any write intensive, multi-user benchmark would seem to be suitable, so would something like OSDL's DBT-2 actually be better to try? Cheers Mark (P.s - academic in my case, unless I try out the latest NetBSD or Linux on one of my FreeBSD boxes) ---(end of broadcast)--- TIP 6: explain analyze is your friend
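For reference while testing, the kernel hint at the heart of this patch is posix_fadvise() with POSIX_FADV_DONTNEED — assuming that is the call the patch uses; the thread refers to it loosely as posix_advise(). A minimal sketch of dropping the page cache for a just-finished WAL segment:

```c
#define _XOPEN_SOURCE 600

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Once a WAL segment will no longer be read, flush it and tell the
 * kernel its cached pages can be reclaimed.  Returns 0 on success.
 * The fsync matters: DONTNEED can only evict clean pages. */
static int drop_segment_cache(int fd)
{
	if (fsync(fd) != 0)
		return -1;
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

This is advisory only, which is why the benefit should show up mainly under kernel memory pressure, as described above — when there is no cache contention, the hint changes nothing measurable.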
Re: [PATCHES] Free WAL caches on switching segments
Tom Lane wrote: Bruce Momjian pgman@candle.pha.pa.us writes: Yes, your vote counts very much. What if I apply the patch, but mark the posix_advise() call in a NOT_USED macro block, so it will be ready for people to test, but will not be used until we are sure. Sounds like a recipe for ensuring it never will be tested. What's needed here is some actual tests, not preparation... Gotta second that. Does the OP have a test scenario that those of us with appropriate OS's could try? Come to think of it, what are the appropriate OS's? (I see NetBSD mentioned so I suppose all the *BSDs, but what others?). Cheers Mark ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[PATCHES] Want to add to contrib.... xmldbx
I have a fairly simple extension I want to add to contrib. It is an XML parser that is designed to work with a specific dialect. I have a PHP extension called xmldbx; it allows the PHP system to serialize its web session data to an XML stream (or just serialize variables). PHP's normal serializer is a non-standard home-grown format that is impossible to read. The xmldbx format may not be too much easier, but I have a format document for it. I also have a PostgreSQL extension that can take serialized data and use it in a query, like:

select xmldbx('data.userdata.id', sessions.data);

If the PHP variable has $userdata['id'] set to some variable in its session data, it will be returned. This allows a lot of flexibility and a tight integration between PostgreSQL and PHP.

For more information: http://www.mohawksoft.org -- (The open source end of Mohawk Software :-)

What do you all think?
Re: [PATCHES] Summary table trigger example race condition
Jim C. Nasby wrote:

On Sun, Jan 08, 2006 at 04:13:01PM +1300, Mark Kirkwood wrote: After re-examining the original code, it looks like it was not actually vulnerable to a race condition! (it does the UPDATE, then if not found will do an INSERT, and handles a unique violation with a repeat of the same UPDATE - i.e. three DML statements, which are enough to handle the race in this case).

What happens if someone deletes the row between the failed insert and the second update? :)

In this example the rows in the summary table never get deleted by DELETE operations on the main one - the trigger just decrements the various amounts - i.e. DELETE becomes UPDATE, so no problem there.

AFAICT, example 36-1 is the only way to handle this without creating a race condition.

For the general case indeed you are correct - a good reason for this change :-). In addition, the fact that it is actually quite difficult to be sure that any race condition is actually being handled makes (another) good reason for using the most robust method in the 'official' examples.

Best wishes

Mark
Re: [PATCHES] Summary table trigger example race condition
Mark Kirkwood wrote: Jim C. Nasby wrote: On Sun, Jan 08, 2006 at 04:13:01PM +1300, Mark Kirkwood wrote: What happens if someone deletes the row between the failed insert and the second update? :) In this example the rows in the summary table never get deleted by DELETE operations on that main one - the trigger just decrements the various amounts - i.e DELETE becomes UPDATE, so no problem there. Sorry Jim, I just realized you probably meant someone directly deleting rows in the summary table itself. Well yes, that would certainly fox it! I guess I was implicitly considering a use case where the only direct DML on the summary table would be some sort of ETL process, which would probably lock out other changes (and re-create the summary table data afterwards in all likelihood). Cheers Mark ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] Summary table trigger example race condition
Mark Kirkwood wrote: Jim C. Nasby wrote: On Fri, Jan 06, 2006 at 02:00:34PM +1300, Mark Kirkwood wrote: However, I think the actual change is not quite right - after running DOH! It would be good if doc/src had a better mechanism for handling code; one that would allow for writing the code natively (so you don't have to worry about translating into lt; and into gt;) and for unit testing the different pieces of code. Yes it would - I usually build the SGML - HTML, then cut the code out of a browser session to test - the pain is waiting for the docs to build. Anyway, updated patch attached. This one is good! After re-examining the original code, it looks like it was not actually vulnerable to a race condition! (it does the UPDATE, then if not found will do an INSERT, and handle unique violation with a repeat of the same UPDATE - i.e three DML statements, which are enough to handle the race in this case). However Jim's change handles the race needing only two DML statements in a loop, which seems much more elegant! In addition it provides a nice example of the 'merge' style code shown in e.g 36-1. Cheers Mark ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
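The control flow of that two-statement loop (example 36-1's merge pattern) can be sketched outside the database. Here a toy in-memory array stands in for the summary table, and try_insert returning false plays the part of the unique-key violation raised when a concurrent session inserts first; all names are hypothetical and nothing here is the example's actual plpgsql:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_ROWS 16

typedef struct
{
	bool		used;
	char		key[32];
	long		amount;
} Row;

static Row summary[MAX_ROWS];	/* stand-in for the summary table */

/* UPDATE summary SET amount = amount + delta WHERE key = $1 */
static bool try_update(const char *key, long delta)
{
	for (int i = 0; i < MAX_ROWS; i++)
		if (summary[i].used && strcmp(summary[i].key, key) == 0)
		{
			summary[i].amount += delta;
			return true;
		}
	return false;				/* no row found */
}

/* INSERT INTO summary VALUES ($1, $2); false = unique_violation */
static bool try_insert(const char *key, long amount)
{
	for (int i = 0; i < MAX_ROWS; i++)
		if (summary[i].used && strcmp(summary[i].key, key) == 0)
			return false;		/* a concurrent session got there first */
	for (int i = 0; i < MAX_ROWS; i++)
		if (!summary[i].used)
		{
			summary[i].used = true;
			strncpy(summary[i].key, key, sizeof summary[i].key - 1);
			summary[i].key[sizeof summary[i].key - 1] = '\0';
			summary[i].amount = amount;
			return true;
		}
	return false;				/* toy table full; a real table can't hit this */
}

/* The merge loop: only two DML statements per iteration. */
static void merge_amount(const char *key, long delta)
{
	for (;;)
	{
		if (try_update(key, delta))
			return;				/* row existed: done */
		if (try_insert(key, delta))
			return;				/* we created the row: done */
		/* unique_violation: loop back and UPDATE the row that appeared */
	}
}

static long get_amount(const char *key)
{
	for (int i = 0; i < MAX_ROWS; i++)
		if (summary[i].used && strcmp(summary[i].key, key) == 0)
			return summary[i].amount;
	return -1;
}
```

The loop terminates in two statements on the common paths and only iterates when another session wins the insert race, which is exactly the elegance noted above over the three-statement variant.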
Re: [PATCHES] Summary table trigger example race condition
Jim C. Nasby wrote: On Fri, Jan 06, 2006 at 02:00:34PM +1300, Mark Kirkwood wrote: However, I think the actual change is not quite right - after running DOH! It would be good if doc/src had a better mechanism for handling code; one that would allow for writing the code natively (so you don't have to worry about translating < into &lt; and > into &gt;) and for unit testing the different pieces of code. Yes it would - I usually build the SGML -> HTML, then cut the code out of a browser session to test - the pain is waiting for the docs to build. Anyway, updated patch attached. This one is good! Cheers Mark
Re: [PATCHES] TODO Item - Add system view to show free space map
Mark Kirkwood wrote: Bruce Momjian wrote: Mark Kirkwood wrote: Simon Riggs wrote: I like this, but not because I want to read it myself, but because I want to make autovacuum responsible for re-allocating free space when it runs out. This way we can have an autoFSM feature in 8.2 Not wanting to denigrate the value of the interesting but slightly OT direction this thread has taken - but does anybody want to comment/review the patch itself :-) ? I saw this question about a transaction block and your reply: http://archives.postgresql.org/pgsql-patches/2005-10/msg00226.php but no new patch. I know someone suggested pgfoundry but it seems most natural in /contrib. Do you want to update the patch? In the expectation of further revisions, I was going to batch that one in with the 'rest' - given that there have not been any, I'll submit a revised patch. Here it is - I seem to have had trouble sending any attachments to this list recently. Bruce has the patch (sent privately), so it's in the patches queue, but thought I would have another go at getting it to -patches so others can review it more easily! cheers Mark contrib-freespacemap.patch.gz Description: application/gzip
Re: [PATCHES] TODO Item - Add system view to show free space map
Simon Riggs wrote: I like this, but not because I want to read it myself, but because I want to make autovacuum responsible for re-allocating free space when it runs out. This way we can have an autoFSM feature in 8.2 Not wanting to denigrate the value of the interesting but slightly OT direction this thread has taken - but does anybody want to comment/review the patch itself :-) ? Cheers Mark
[PATCHES] TODO Item - Add system view to show free space map contents
This patch implements a view to display the free space map contents - e.g.:

regression=# SELECT c.relname, m.relblocknumber, m.blockfreebytes
             FROM pg_freespacemap m INNER JOIN pg_class c
             ON c.relfilenode = m.relfilenode LIMIT 10;
         relname         | relblocknumber | blockfreebytes
-------------------------+----------------+----------------
 sql_features            |              5 |           2696
 sql_implementation_info |              0 |           7104
 sql_languages           |              0 |           8016
 sql_packages            |              0 |           7376
 sql_sizing              |              0 |           6032
 pg_authid               |              0 |           7424
 pg_toast_2618           |             13 |           4588
 pg_toast_2618           |             12 |           1680
 pg_toast_2618           |             10 |           1436
 pg_toast_2618           |              7 |           1136
(10 rows)

[I found being able to display the FSM pretty cool, even if I say so myself]. It is written as a contrib module (similar to pg_buffercache) so as to make any revisions non-initdb-requiring. The code needs to know about several of the (currently) internal data structures in freespace.c, so I moved these into freespace.h. Similarly for the handy macros to actually compute the free space. Let me know if this was the wrong way to proceed! Additionally, access to the FSM pointer itself is required; I added a function in freespace.c to return this, rather than making it globally visible - again, if the latter is a better approach, it is easily changed. cheers Mark P.S.: Currently don't have access to a windows box, so had to just 'take a stab' at what DLLIMPORTs were required. diff -Ncar pgsql.orig/contrib/pg_freespacemap/Makefile pgsql/contrib/pg_freespacemap/Makefile *** pgsql.orig/contrib/pg_freespacemap/Makefile Thu Jan 1 12:00:00 1970 --- pgsql/contrib/pg_freespacemap/Makefile Thu Oct 27 17:52:10 2005 *** *** 0 --- 1,17 + # $PostgreSQL$ + + MODULE_big = pg_freespacemap + OBJS = pg_freespacemap.o + + DATA_built = pg_freespacemap.sql + DOCS = README.pg_freespacemap + + ifdef USE_PGXS + PGXS := $(shell pg_config --pgxs) + include $(PGXS) + else + subdir = contrib/pg_freespacemap + top_builddir = ../..
+ include $(top_builddir)/src/Makefile.global + include $(top_srcdir)/contrib/contrib-global.mk + endif diff -Ncar pgsql.orig/contrib/pg_freespacemap/README.pg_freespacemap pgsql/contrib/pg_freespacemap/README.pg_freespacemap *** pgsql.orig/contrib/pg_freespacemap/README.pg_freespacemap Thu Jan 1 12:00:00 1970 --- pgsql/contrib/pg_freespacemap/README.pg_freespacemapThu Oct 27 18:06:20 2005 *** *** 0 --- 1,98 + Pg_freespacemap - Real time queries on the free space map (FSM). + --- + + This module consists of a C function 'pg_freespacemap()' that returns + a set of records, and a view 'pg_freespacemap' to wrapper the function. + + The module provides the ability to examine the contents of the free space + map, without having to restart or rebuild the server with additional + debugging code. + + By default public access is REVOKED from both of these, just in case there + are security issues lurking. + + + Installation + + + Build and install the main Postgresql source, then this contrib module: + + $ cd contrib/pg_freespacemap + $ gmake + $ gmake install + + + To register the functions: + + $ psql -d database -f pg_freespacemap.sql + + + Notes + - + + The definition of the columns exposed in the view is: + +Column | references | Description + +--+ +blockid| | Id, 1.. max_fsm_pages +relfilenode| pg_class.relfilenode | Refilenode of the relation. +reltablespace | pg_tablespace.oid| Tablespace oid of the relation. +reldatabase| pg_database.oid | Database for the relation. +relblocknumber | | Offset of the page in the relation. +blockfreebytes | | Free bytes in the block/page. + + + There is one row for each page in the free space map. + + Because the map is shared by all the databases, there are pages from + relations not belonging to the current database. + + When the pg_freespacemap view is accessed, internal free space map locks are + taken, and a copy of the map data is made for the view to display. 
+ This ensures that the view produces a consistent set of results, while not + blocking normal activity longer than necessary. Nonetheless there + could be some impact on database performance if this view is read often. + + + Sample output + - + + regression=# \d pg_freespacemap + View
Re: [PATCHES] TODO Item - Add system view to show free space map
Jim C. Nasby wrote: Shouldn't the DDL in pg_freespacemap.sql.in be wrapped in a transaction? Specifically I'm considering the case of the script stopping before the REVOKEs. That's nice (probably should have done it in pg_buffercache)!
Re: [PATCHES] TODO Item - Add system view to show free space map
Christopher Kings-Lynne wrote: Want to host it on pgfoundry until 8.2 is released? Absolutely - I'll let it run the gauntlet of feedback to fix the silly mistakes I put in :-), then do patches for 8.1 and 8.0 (maybe 7.4 and 7.3 as well - if it rains a lot). cheers Mark
[PATCHES] TODO item - tid &lt;&gt; operator
Please find enclosed a patch that adds a '<>' operator for tid types (per TODO item). Patch is against current sources (beta4). If I have understood the requirement properly, the relevant function 'tidne' already existed in src/backend/utils/adt/tid.c, just needed an ifdef NOT_USED removed. The '<>' operator, however, needed to be added. I have renumbered the OIDs for the various tid functions in order to keep them together with the 'new' tidne one in pg_proc.h - as we have run out of OIDs in the old range. Similarly for the '=' operator in pg_operator.h. With the patch applied, all 98 regression tests pass, plus it seems to work ok :-)

test=# select ctid, id from foo where ctid != '(0,1)';
 ctid  | id
-------+----
 (0,2) |  2
 (0,3) |  3
(2 rows)

regards Mark Index: src/include/catalog/pg_proc.h === RCS file: /projects/cvsroot/pgsql/src/include/catalog/pg_proc.h,v retrieving revision 1.387 diff -c -r1.387 pg_proc.h *** src/include/catalog/pg_proc.h 15 Oct 2005 02:49:42 - 1.387 --- src/include/catalog/pg_proc.h 24 Oct 2005 02:34:49 - *** *** 1625,1637 DATA(insert OID = 1291 ( array_length_coerce PGNSP PGUID 12 f f t f s 3 2277 2277 23 16 _null_ _null_ _null_ array_length_coerce - _null_ )); DESCR(adjust any array to new element typmod); - DATA(insert OID = 1292 ( tideq PGNSP PGUID 12 f f t f i 2 16 27 27 _null_ _null_ _null_ tideq - _null_ )); - DESCR(equal); - DATA(insert OID = 1293 ( currtid PGNSP PGUID 12 f f t f v 2 27 26 27 _null_ _null_ _null_ currtid_byreloid - _null_ )); - DESCR(latest tid of a tuple); - DATA(insert OID = 1294 ( currtid2 PGNSP PGUID 12 f f t f v 2 27 25 27 _null_ _null_ _null_ currtid_byrelname - _null_ )); - DESCR(latest tid of a tuple); - DATA(insert OID = 2168 ( pg_database_size PGNSP PGUID 12 f f t f v 1 20 19 _null_ _null_ _null_ pg_database_size_name - _null_ )); DESCR(Calculate total disk space usage for the specified database); --- 1625,1630 *** *** 3765,3770 --- 3758,3772 DATA(insert OID = 2592 ( gist_circle_compress PGNSP PGUID 12 f f t f i 1 2281
2281 _null_ _null_ _null_ gist_circle_compress - _null_ )); DESCR(GiST support); + /* Tid functions */ + DATA(insert OID = 2601 ( tideq PGNSP PGUID 12 f f t f i 2 16 27 27 _null_ _null_ _null_ tideq - _null_ )); + DESCR(equal); + DATA(insert OID = 2602 ( currtid PGNSP PGUID 12 f f t f v 2 27 26 27 _null_ _null_ _null_ currtid_byreloid - _null_ )); + DESCR(latest tid of a tuple); + DATA(insert OID = 2603 ( currtid2PGNSP PGUID 12 f f t f v 2 27 25 27 _null_ _null_ _null_ currtid_byrelname - _null_ )); + DESCR(latest tid of a tuple); + DATA(insert OID = 2604 ( tidne PGNSP PGUID 12 f f t f i 2 16 27 27 _null_ _null_ _null_ tidne - _null_ )); + DESCR(not equal); /* * Symbolic values for provolatile column: these indicate whether the result Index: src/include/catalog/pg_operator.h === RCS file: /projects/cvsroot/pgsql/src/include/catalog/pg_operator.h,v retrieving revision 1.137 diff -c -r1.137 pg_operator.h *** src/include/catalog/pg_operator.h 15 Oct 2005 02:49:42 - 1.137 --- src/include/catalog/pg_operator.h 24 Oct 2005 02:34:50 - *** *** 128,135 DATA(insert OID = 389 ( !!PGNSP PGUID l f 0 20 1700 0 0 0 0 0 0 numeric_fac - - )); DATA(insert OID = 385 ( = PGNSP PGUID b t 29 29 16 385 0 0 0 0 0 cideq eqsel eqjoinsel )); DATA(insert OID = 386 ( = PGNSP PGUID b t 22 22 16 386 0 0 0 0 0 int2vectoreq eqsel eqjoinsel )); ! DATA(insert OID = 387 ( = PGNSP PGUID b f 27 27 16 387 0 0 0 0 0 tideq eqsel eqjoinsel )); ! #define TIDEqualOperator 387 DATA(insert OID = 410 ( = PGNSP PGUID b t 20 20 16 410 411 412 412 412 413 int8eq eqsel eqjoinsel )); DATA(insert OID = 411 ( PGNSP PGUID b f 20 20 16 411 410 0 0 0 0 int8ne neqsel neqjoinsel )); --- 128,136 DATA(insert OID = 389 ( !!PGNSP PGUID l f 0 20 1700 0 0 0 0 0 0 numeric_fac - - )); DATA(insert OID = 385 ( = PGNSP PGUID b t 29 29 16 385 0 0 0 0 0 cideq eqsel eqjoinsel )); DATA(insert OID = 386 ( = PGNSP PGUID b t 22 22 16 386 0 0 0 0 0 int2vectoreq eqsel eqjoinsel )); ! 
DATA(insert OID = 390 ( <> PGNSP PGUID b f 27 27 16 390 0 0 0 0 0 tidne neqsel neqjoinsel )); ! DATA(insert OID = 391 ( = PGNSP PGUID b f 27 27 16 391
Re: [PATCHES] TODO item - tid &lt;&gt; operator
Neil Conway wrote: You also probably want to add the declaration for tidne() to include/utils/builtins.h Doh! - Indeed, I obviously missed a few extra compile warnings! Revised patch attached. Index: src/backend/utils/adt/tid.c === RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/tid.c,v retrieving revision 1.49 diff -c -r1.49 tid.c *** src/backend/utils/adt/tid.c 27 May 2005 00:57:49 - 1.49 --- src/backend/utils/adt/tid.c 24 Oct 2005 20:55:57 - *** *** 174,180 arg1->ip_posid == arg2->ip_posid); } - #ifdef NOT_USED Datum tidne(PG_FUNCTION_ARGS) { --- 174,179 *** *** 185,191 BlockIdGetBlockNumber(&(arg2->ip_blkid)) || arg1->ip_posid != arg2->ip_posid); } - #endif /* * Functions to get latest tid of a specified tuple. --- 184,189 Index: src/include/utils/builtins.h === RCS file: /projects/cvsroot/pgsql/src/include/utils/builtins.h,v retrieving revision 1.267 diff -c -r1.267 builtins.h *** src/include/utils/builtins.h 18 Oct 2005 20:38:58 - 1.267 --- src/include/utils/builtins.h 24 Oct 2005 20:55:58 - *** *** 530,535 --- 530,536 extern Datum tidrecv(PG_FUNCTION_ARGS); extern Datum tidsend(PG_FUNCTION_ARGS); extern Datum tideq(PG_FUNCTION_ARGS); + extern Datum tidne(PG_FUNCTION_ARGS); extern Datum currtid_byreloid(PG_FUNCTION_ARGS); extern Datum currtid_byrelname(PG_FUNCTION_ARGS); Index: src/include/catalog/pg_proc.h === RCS file: /projects/cvsroot/pgsql/src/include/catalog/pg_proc.h,v retrieving revision 1.387 diff -c -r1.387 pg_proc.h *** src/include/catalog/pg_proc.h 15 Oct 2005 02:49:42 - 1.387 --- src/include/catalog/pg_proc.h 24 Oct 2005 20:56:01 - *** *** 1625,1637 DATA(insert OID = 1291 ( array_length_coerce PGNSP PGUID 12 f f t f s 3 2277 2277 23 16 _null_ _null_ _null_ array_length_coerce - _null_ )); DESCR(adjust any array to new element typmod); - DATA(insert OID = 1292 ( tideq PGNSP PGUID 12 f f t f i 2 16 27 27 _null_ _null_ _null_ tideq - _null_ )); - DESCR(equal); - DATA(insert OID = 1293 ( currtid PGNSP PGUID 12 f f t f v 2 27 26 27 _null_ _null_ _null_
currtid_byreloid - _null_ )); - DESCR(latest tid of a tuple); - DATA(insert OID = 1294 ( currtid2PGNSP PGUID 12 f f t f v 2 27 25 27 _null_ _null_ _null_ currtid_byrelname - _null_ )); - DESCR(latest tid of a tuple); - DATA(insert OID = 2168 ( pg_database_size PGNSP PGUID 12 f f t f v 1 20 19 _null_ _null_ _null_ pg_database_size_name - _null_ )); DESCR(Calculate total disk space usage for the specified database); --- 1625,1630 *** *** 3765,3770 --- 3758,3772 DATA(insert OID = 2592 ( gist_circle_compressPGNSP PGUID 12 f f t f i 1 2281 2281 _null_ _null_ _null_ gist_circle_compress - _null_ )); DESCR(GiST support); + /* Tid functions */ + DATA(insert OID = 2601 ( tideq PGNSP PGUID 12 f f t f i 2 16 27 27 _null_ _null_ _null_ tideq - _null_ )); + DESCR(equal); + DATA(insert OID = 2602 ( currtid PGNSP PGUID 12 f f t f v 2 27 26 27 _null_ _null_ _null_ currtid_byreloid - _null_ )); + DESCR(latest tid of a tuple); + DATA(insert OID = 2603 ( currtid2PGNSP PGUID 12 f f t f v 2 27 25 27 _null_ _null_ _null_ currtid_byrelname - _null_ )); + DESCR(latest tid of a tuple); + DATA(insert OID = 2604 ( tidne PGNSP PGUID 12 f f t f i 2 16 27 27 _null_ _null_ _null_ tidne - _null_ )); + DESCR(not equal); /* * Symbolic values for provolatile column: these indicate whether the result Index: src/include/catalog/pg_operator.h === RCS file: /projects/cvsroot/pgsql/src/include/catalog/pg_operator.h,v retrieving revision 1.137 diff -c -r1.137 pg_operator.h *** src/include/catalog/pg_operator.h 15 Oct 2005 02:49:42 - 1.137 --- src/include/catalog/pg_operator.h 24 Oct 2005 20:56:02 - *** *** 128,135 DATA(insert OID = 389 ( !!PGNSP PGUID l f 0 20 1700 0 0 0 0 0 0 numeric_fac - - )); DATA(insert OID = 385 ( = PGNSP PGUID b t 29 29 16 385 0 0 0 0 0 cideq eqsel eqjoinsel )); DATA(insert OID = 386 ( = PGNSP PGUID b t 22 22 16 386 0 0 0 0 0 int2vectoreq eqsel eqjoinsel )); ! DATA(insert OID = 387 ( = PGNSP PGUID b f 27 27 16 387 0 0 0 0 0 tideq eqsel eqjoinsel )); ! #define
Re: [PATCHES] [HACKERS] Autovacuum loose ends
On Fri, 12 Aug 2005 17:49:41 -0400 Alvaro Herrera [EMAIL PROTECTED] wrote: On Fri, Aug 12, 2005 at 10:49:43AM -0700, Mark Wong wrote: I thought I'd run a couple of tests to see if it would be helpful against CVS from Aug 3, 2005. Here's a run with autovacuum turned off: http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/42/ 5186.55 notpm Autovacuum on with default settings: http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/ 5462.23 notpm Just noticed what seems to be a bug: in http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/42/db/index_info.input

plot index_info.data using 1:2 title i_customer with lines, \
     index_info.data using 1:2 title i_orders with lines, \
     index_info.data using 1:3 title pk_customer with lines, \
     index_info.data using 1:4 title pk_district with lines, \
     index_info.data using 1:5 title pk_item with lines, \
     index_info.data using 1:6 title pk_new_order with lines, \
     index_info.data using 1:7 title pk_order_line with lines, \
     index_info.data using 1:8 title pk_orders with lines, \
     index_info.data using 1:9 title pk_stock with lines, \
     index_info.data using 1:11 title pk_warehouse with lines

Notice how the subindexes are wrong ... I think it should be 1:3 for i_orders, no? Apparently indexes_scan.data has the same problem. Whoops! I think I fixed it for real and the charts should be updated now. It was broken slightly more previously. It called my attention that the pk_warehouse index seems to have a very different usage in both runs in index_info, but in indexes_scan they seem similar. Thanks, Mark
Re: [PATCHES] [HACKERS] Autovacuum loose ends
On Fri, 12 Aug 2005 18:42:09 -0400 Alvaro Herrera [EMAIL PROTECTED] wrote: On Fri, Aug 12, 2005 at 03:16:04PM -0700, Mark Wong wrote: On Fri, 12 Aug 2005 17:49:41 -0400 Alvaro Herrera [EMAIL PROTECTED] wrote: Notice how the subindexes are wrong ... I think it should be 1:3 for i_orders, no? Apparently indexes_scan.data has the same problem. Whoops! I think I fixed it for real now and the charts should be updated now. It was broken slightly more previously. Hmm, did you fix the 42 case only? The other one is broken too ... The other dev4-015 cases should be fixed too. Also, it seems the tran_lock.out file captured wrong input -- I think you mean WHERE transactionid IS NULL in the query instead of WHERE transaction IS NULL. Hmm, ok I can try that in a future test run. I'm not very familiar with this table, what's the difference between transaction and transactionid? I wonder what the big down-spikes (?) at minutes ~45 and ~85 correspond to. Are those checkpoints? The IO vmstat chart would indicate that, I think. That's correct, those should be checkpoints. Anyway, it's interesting to see the performance go up with autovacuum on. I certainly didn't expect that in this kind of test. I think in Mary's case it was hurting, but she's running the workload dramatically different. I think she was planning to revisit that after we sort out what's going on with the grouped WAL writes. Mark ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] remove BufferBlockPointers for speed and space
Mark Kirkwood wrote: Looks to me like -O2 makes the difference very small (on this platform/gcc combo) - is 5/169 worth doing? Ahem - misunderstood your comment here, sorry. Qingqing Zhou wrote: compiled with gcc testbuf.c. I tried -O2 actually, and it turns out that the timing is reduced a lot so not believable.
Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
Ok, I finally got a couple of tests done against CVS from Aug 3, 2005. I'm not sure if I'm showing anything insightful though. I've learned that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on Linux, which you guys may have already known. There appears to be a fair performance decrease in using open_sync. Just to double check, am I correct in understanding only open_sync uses O_DIRECT?

fdatasync: http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/38/ 5462 notpm
open_sync: http://www.testing.osdl.org/projects/dbt2dev/results/dev4-015/40/ 4860 notpm

Mark
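For context, the sync method being compared in these runs is selected with the wal_sync_method setting; a minimal postgresql.conf sketch (option names as I recall them from this era - check what your build actually offers with SHOW wal_sync_method):

```
# postgresql.conf excerpt (sketch only)
wal_sync_method = open_sync    # alternatives: fsync, fdatasync, open_datasync
```

The patch under discussion ties O_DIRECT use to the open_* style methods, which is why the open_sync run is the one exercising the new code path.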
Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes
Here are comments that Daniel McNeil made earlier, which I neglected to forward until now. I've cc'ed him and Mark Havercamp, which some of you got to meet the other day. Mark - With O_DIRECT on Linux, when the write() returns the i/o has been transferred to the disk. Normally, this i/o will be DMAed directly from user-space to the device. The current exception is when doing an O_DIRECT write to a hole in a file. (If a program does a truncate() or lseek()/write() that makes a file larger, the file system does not allocate space between the old end of file and the new end of file.) An O_DIRECT write to a hole like this requires the file system to allocate space, but there is a race condition between the O_DIRECT write doing the allocation and then writing to initialize the newly allocated data and any other process that attempts a buffered (page cache) read of the same area in the file -- it was possible for the read to return data from the allocated region before the O_DIRECT write(). The fix in Linux is for the O_DIRECT write() to fall back to buffered i/o to do the write() and flush the data from the page cache to the disk. A write() with O_DIRECT only means the data has been transferred to the disk. Depending on the file system and mount options, it does not mean the meta data for the file has been written to disk (see fsync man page). Fsync() will guarantee the data and metadata have been written to disk. Lastly, if a disk has a write back cache, an O_DIRECT write() does not guarantee that the disk has put the data on the physical media. I think some of the journal file systems now support i/o barriers on commit which will flush the disk write back cache. (I'm still looking at the kernel code to see how this is done). Conclusion: O_DIRECT + fsync() can make sense. It avoids the copying of data to the page cache before being written and will also guarantee that the file's metadata is written to disk.
It also prevents the page cache from filling up with write data that will never be read (I assume it is only read if a recovery is necessary - which should be rare). It can also help disks with a write back cache when using a journaling file system that uses i/o barriers. You would want to use large writes, since the kernel page cache won't be writing multiple pages for you. I need to look at the kernel code more to comment on O_DIRECT with O_SYNC. Questions: Does the database transaction logger preallocate the log file? Does the logger care about the order in which each write hits the disk? Now someone else can comment on my comments. Daniel
Re: [PATCHES] COPY FROM performance improvements
I just ran through a few tests with the v14 patch against 100GB of data from dbt3 and found a 30% improvement; 3.6 hours vs 5.3 hours. Just to give a few details, I only loaded data and started a COPY in parallel for each of the data files: http://www.testing.osdl.org/projects/dbt3testing/results/fast_copy/ Here's a visual of my disk layout, for those familiar with the database schema: http://www.testing.osdl.org/projects/dbt3testing/results/fast_copy/layout-dev4-010-dbt3.html I have 6 arrays of fourteen 15k rpm drives in a split-bus configuration attached to a 4-way itanium2 via 6 compaq smartarray pci-x controllers. Let me know if you have any questions. Mark
Re: [PATCHES] COPY FROM performance improvements
On Thu, 14 Jul 2005 17:22:18 -0700 Alon Goldshuv [EMAIL PROTECTED] wrote: I revisited my patch and removed the code duplications that were there, and added support for CSV with buffered input, so CSV now runs faster too (although it is not as optimized as the TEXT format parsing). So now TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original file. Hi Alon, I'm curious, what kind of system are you testing this on? I'm trying to load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm interested in the results you would expect. Mark
Re: [PATCHES] COPY FROM performance improvements
Hi Alon, Yeah, that helps. I just need to break up my scripts a little to just load the data and not build indexes. Is the following information good enough to give a guess about the data I'm loading, if you don't mind? ;) Here's a link to my script to create tables: http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in

File sizes:
-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
-rw-r--r-- 1 markw 50  74G Jul 8 15:03 lineitem.tbl
-rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
-rw-r--r-- 1 markw 50  17G Jul 8 15:03 orders.tbl
-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
-rw-r--r-- 1 markw 50  12G Jul 8 15:03 partsupp.tbl
-rw-r--r-- 1 markw 50  391 Jul 8 15:03 region.tbl
-rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl

Number of rows:
# wc -l *.tbl
      1500 customer.tbl
 600037902 lineitem.tbl
        25 nation.tbl
     15000 orders.tbl
      2000 part.tbl
      8000 partsupp.tbl
         5 region.tbl
       100 supplier.tbl

Thanks, Mark On Tue, 19 Jul 2005 14:05:56 -0700 Alon Goldshuv [EMAIL PROTECTED] wrote: Hi Mark, I improved the data *parsing* capabilities of COPY, and didn't touch the data conversion or data insertion parts of the code. The parsing improvement will vary largely depending on the ratio of parsing -to- converting and inserting. Therefore, the speed increase really depends on the nature of your data: a 100GB file with:
- long data rows (lots of parsing)
- small number of columns (small number of attr conversions per row)
- less rows (less tuple insertions)
will show the best performance improvements. However, the same 100GB file size with:
- short data rows (minimal parsing)
- large number of columns (large number of attr conversions per row) AND/OR
- more rows (more tuple insertions)
will show improvements but not as significant. In general I'll estimate 40%-95% improvement in load speed for the 1st case and 10%-40% for the 2nd. But that also depends on the hardware, disk speed etc... This is for TEXT format.
As for CSV, it may be faster but not as much as I specified here. BINARY will stay the same as before. HTH Alon. On 7/19/05 12:54 PM, Mark Wong [EMAIL PROTECTED] wrote: On Thu, 14 Jul 2005 17:22:18 -0700 Alon Goldshuv [EMAIL PROTECTED] wrote: I revisited my patch and removed the code duplications that were there, and added support for CSV with buffered input, so CSV now runs faster too (although it is not as optimized as the TEXT format parsing). So now TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original file. Hi Alon, I'm curious, what kind of system are you testing this on? I'm trying to load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm interested in the results you would expect. Mark
Re: [PATCHES] COPY FROM performance improvements
Whoopsies, yeah good point about the PRIMARY KEY. I'll fix that. Mark On Tue, 19 Jul 2005 18:17:52 -0400 Andrew Dunstan [EMAIL PROTECTED] wrote: Mark, You should definitely not be doing this sort of thing, I believe: CREATE TABLE orders ( o_orderkey INTEGER, o_custkey INTEGER, o_orderstatus CHAR(1), o_totalprice REAL, o_orderDATE DATE, o_orderpriority CHAR(15), o_clerk CHAR(15), o_shippriority INTEGER, o_comment VARCHAR(79), PRIMARY KEY (o_orderkey)) Create the table with no constraints, load the data, then set up primary keys and whatever other constraints you want using ALTER TABLE. Last time I did a load like this (albeit 2 orders of magnitude smaller) I saw a 50% speedup from deferring constarint creation. cheers andrew Mark Wong wrote: Hi Alon, Yeah, that helps. I just need to break up my scripts a little to just load the data and not build indexes. Is the following information good enough to give a guess about the data I'm loading, if you don't mind? ;) Here's a link to my script to create tables: http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432path=scripts/pgsql/create_tables.sh.in File sizes: -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl -rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl -rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl -rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl -rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl -rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl -rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl Number of rows: # wc -l *.tbl 1500 customer.tbl 600037902 lineitem.tbl 25 nation.tbl 15000 orders.tbl 2000 part.tbl 8000 partsupp.tbl 5 region.tbl 100 supplier.tbl Thanks, Mark On Tue, 19 Jul 2005 14:05:56 -0700 Alon Goldshuv [EMAIL PROTECTED] wrote: Hi Mark, I improved the data *parsing* capabilities of COPY, and didn't touch the data conversion or data insertion parts of the code. 
The parsing improvement will vary largely depending on the ratio of parsing -to- converting and inserting. Therefore, the speed increase really depends on the nature of your data: 100GB file with long data rows (lots of parsing) Small number of columns (small number of attr conversions per row) less rows (less tuple insertions) Will show the best performance improvements. However, same file size 100GB with Short data rows (minimal parsing) large number of columns (large number of attr conversions per row) AND/OR more rows (more tuple insertions) Will show improvements but not as significant. In general I'll estimate 40%-95% improvement in load speed for the 1st case and 10%-40% for the 2nd. But that also depends on the hardware, disk speed etc... This is for TEXT format. As for CSV, it may be faster but not as much as I specified here. BINARY will stay the same as before. HTH Alon. On 7/19/05 12:54 PM, Mark Wong [EMAIL PROTECTED] wrote: On Thu, 14 Jul 2005 17:22:18 -0700 Alon Goldshuv [EMAIL PROTECTED] wrote: I revisited my patch and removed the code duplications that were there, and added support for CSV with buffered input, so CSV now runs faster too (although it is not as optimized as the TEXT format parsing). So now TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original file. Hi Alon, I'm curious, what kind of system are you testing this on? I'm trying to load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm interested in the results you would expect. Mark ---(end of broadcast)--- TIP 6: explain analyze is your friend
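Andrew's suggestion - load first, constrain afterwards - looks roughly like this (a sketch using one of the tables above, with the column list abbreviated and an illustrative file path):

```sql
-- Sketch only: abbreviated columns, made-up path.
CREATE TABLE orders (
    o_orderkey   INTEGER,
    o_custkey    INTEGER,
    o_totalprice REAL
    -- ... remaining columns, but no PRIMARY KEY yet
);

COPY orders FROM '/data/orders.tbl' WITH DELIMITER '|';

-- Build the index once, after the bulk load.
ALTER TABLE orders ADD PRIMARY KEY (o_orderkey);
```

Building the unique index in one pass after the load avoids maintaining it row by row during COPY, which is where the speedup Andrew describes comes from.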
[PATCHES] Bad link Makefile patch
I built the v8.0.3 product from postgresql-8.0.3-1PGDG.src.rpm on RedHat9 (I'm thinking the source RPM for RH9 should not have exactly the same name as the FC3 version, since they are different files). When I tried to roll it into an RPM CD builder transaction, I got 'RPM dependency errors': CRITICAL ERROR: Unable to resolve dependency libecpg.so.3 for postgresql-libs and CRITICAL ERROR: Unable to resolve dependency libpq.so.3 for postgresql-libs. I was only including the base rpm, -server, and -lib in the RPM transaction set. The culprit was a nasty bit at $TOP/src/Makefile.shlib:243. This piece ensures that for the link step of libecpg.so.5.0 and libecpg_compat.so.2.0, /usr/lib is searched for libpq and libecpg.so _before_ the locally built copy is searched. At least for libecpg.so.5.0 and libecpg_compat.so.2.0, the attached patch fixes the problem. Not sure if there would be other instances in -contrib, etc. Hope this helps and that I'm not redundant with your other fans. Regards, Mark userlib_link.patch Description: Binary data
[PATCHES] More to Bad link Makefile patch
So, my co-workers chided me mercilessly to get the -contrib RPM working as well; so the full patch is now attached. BTW, I did search the archive and this problem did not stick out; but it could be my crummy reference skills. And, no, I didn't read all 3500 emails since the v8.0.3 release. Also, there's some good chance that the binary RPMs on the postgresql.org FTP site have the same flaw that I encountered as described below.

Regards, Mark

On Wed, 6 Jul 2005 09:08:50 -0700 Mark Deric [EMAIL PROTECTED] wrote: I built the v8.0.3 product from postgresql-8.0.3-1PGDG.src.rpm on RedHat9 (I'm thinking the source RPM for RH9 should not have exactly the same name as the FC3 version, since they are different files). When I tried to roll it into an RPM CD builder transaction, I got 'RPM dependency errors': CRITICAL ERROR: Unable to resolve dependency libecpg.so.3 for postgresql-libs and CRITICAL ERROR: Unable to resolve dependency libpq.so.3 for postgresql-libs. I was only including the base rpm, -server, and -lib in the RPM transaction set. The culprit was a nasty bit at $TOP/src/Makefile.shlib:243. This piece ensures that for the link step of libecpg.so.5.0 and libecpg_compat.so.2.0, /usr/lib is searched for libpq and libecpg.so _before_ the locally built copy is searched. At least for libecpg.so.5.0 and libecpg_compat.so.2.0, the attached patch fixes the problem. Not sure if there would be other instances in -contrib, etc. Hope this helps and that I'm not redundant with your other fans. Regards, Mark

userlib_link.patch Description: Binary data ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PATCHES] TODO Item - Return compressed length of TOAST datatypes
Neil Conway wrote: Bruce Momjian wrote:

+ /*
+  * Return the length of a datum, possibly compressed
+  */
+ Datum
+ pg_column_size(PG_FUNCTION_ARGS)
+ {
+ 	Datum		value = PG_GETARG_DATUM(0);
+ 	int			result;
+
+ 	/* fn_extra stores the fixed column length, or -1 for varlena. */
+ 	if (fcinfo->flinfo->fn_extra == NULL)	/* first call? */
+ 	{
+ 		/* On the first call lookup the datatype of the supplied argument */
[...]

Is this optimization worth the code complexity?

I didn't performance test it, but the idea of hammering the catalogs for each value to be processed seemed a bad thing.

+ 	tp = SearchSysCache(TYPEOID,
+ 						ObjectIdGetDatum(argtypeid),
+ 						0, 0, 0);
+ 	if (!HeapTupleIsValid(tp))
+ 	{
+ 		/* Oid not in pg_type, should never happen. */
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_INTERNAL_ERROR),
+ 				 errmsg("invalid typid: %u", argtypeid)));
+ 	}

elog(ERROR) is usually used for "can't happen" errors; also, the usual error message in this scenario is "cache lookup failed [...]". Perhaps better to use get_typlen() here, anyway.

Ahem... ah yes, get_typlen()! (ducks) - that is clearly much better! I have attached a little change to varlena.c that uses it. I left the ereport as it was, but am not fussed about it either way.

BTW - Bruce, I like the changes, makes the beast more friendly to use!

Best wishes

Mark

Index: src/backend/utils/adt/varlena.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/varlena.c,v
retrieving revision 1.125
diff -c -r1.125 varlena.c
*** src/backend/utils/adt/varlena.c	6 Jul 2005 19:02:52 -	1.125
--- src/backend/utils/adt/varlena.c	7 Jul 2005 02:52:50 -
***************
*** 28,34 ****
  #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/pg_locale.h"
- #include "utils/syscache.h"

  typedef struct varlena unknown;
--- 28,33 ----
***************
*** 2364,2385 ****
  	{
  		/* On the first call lookup the datatype of the supplied argument */
  		Oid			argtypeid = get_fn_expr_argtype(fcinfo->flinfo, 0);
! 		HeapTuple	tp;
! 		int			typlen;

! 		tp = SearchSysCache(TYPEOID,
! 							ObjectIdGetDatum(argtypeid),
! 							0, 0, 0);
!
 		if (!HeapTupleIsValid(tp))
  		{
  			/* Oid not in pg_type, should never happen. */
  			ereport(ERROR,
  					(errcode(ERRCODE_INTERNAL_ERROR),
! 					 errmsg("invalid typid: %u", argtypeid)));
  		}
!
! 		typlen = ((Form_pg_type) GETSTRUCT(tp))->typlen;
! 		ReleaseSysCache(tp);

  		fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
  													  sizeof(int));
  		*(int *) fcinfo->flinfo->fn_extra = typlen;
--- 2363,2379 ----
  	{
  		/* On the first call lookup the datatype of the supplied argument */
  		Oid			argtypeid = get_fn_expr_argtype(fcinfo->flinfo, 0);
! 		int			typlen = get_typlen(argtypeid);
!
! 		if (typlen == 0)
  		{
  			/* Oid not in pg_type, should never happen. */
  			ereport(ERROR,
  					(errcode(ERRCODE_INTERNAL_ERROR),
! 					 errmsg("cache lookup failed for type %u", argtypeid)));
  		}
!
  		fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
  													  sizeof(int));
  		*(int *) fcinfo->flinfo->fn_extra = typlen;

---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PATCHES] TODO Item - Return compressed length of TOAST datatypes
Alvaro Herrera wrote: On Thu, Jul 07, 2005 at 03:01:46PM +1200, Mark Kirkwood wrote: Neil Conway wrote: elog(ERROR) is usually used for "can't happen" errors. I have attached a little change to varlena.c that uses it. I left the ereport as it was, but am not fussed about it either way. I am, because it gives useless messages to the translators to work on. elog parameters are not marked for translation, ereport ones are (errmsg and friends, really). So please don't do that.

Ok, didn't realize the difference! Revised patch attached that uses elog. Neil, I will run some tests to see if there is any performance saving with the first-call-only business.

Cheers

Mark

Index: src/backend/utils/adt/varlena.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/varlena.c,v
retrieving revision 1.125
diff -c -r1.125 varlena.c
*** src/backend/utils/adt/varlena.c	6 Jul 2005 19:02:52 -	1.125
--- src/backend/utils/adt/varlena.c	7 Jul 2005 03:40:44 -
***************
*** 28,34 ****
  #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/pg_locale.h"
- #include "utils/syscache.h"

  typedef struct varlena unknown;
--- 28,33 ----
***************
*** 2364,2385 ****
  	{
  		/* On the first call lookup the datatype of the supplied argument */
  		Oid			argtypeid = get_fn_expr_argtype(fcinfo->flinfo, 0);
! 		HeapTuple	tp;
! 		int			typlen;

! 		tp = SearchSysCache(TYPEOID,
! 							ObjectIdGetDatum(argtypeid),
! 							0, 0, 0);
! 		if (!HeapTupleIsValid(tp))
  		{
  			/* Oid not in pg_type, should never happen. */
! 			ereport(ERROR,
! 					(errcode(ERRCODE_INTERNAL_ERROR),
! 					 errmsg("invalid typid: %u", argtypeid)));
  		}
!
! 		typlen = ((Form_pg_type) GETSTRUCT(tp))->typlen;
! 		ReleaseSysCache(tp);

  		fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
  													  sizeof(int));
  		*(int *) fcinfo->flinfo->fn_extra = typlen;
--- 2363,2377 ----
  	{
  		/* On the first call lookup the datatype of the supplied argument */
  		Oid			argtypeid = get_fn_expr_argtype(fcinfo->flinfo, 0);
! 		int			typlen = get_typlen(argtypeid);
!
! 		if (typlen == 0)
  		{
  			/* Oid not in pg_type, should never happen. */
!
 			elog(ERROR, "cache lookup failed for type %u", argtypeid);
  		}
!
  		fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
  													  sizeof(int));
  		*(int *) fcinfo->flinfo->fn_extra = typlen;

---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] TODO Item - Return compressed length of TOAST datatypes
Neil Conway wrote: Mark Kirkwood wrote: I didn't performance test it, but the idea of hammering the catalogs for each value to be processed seemed a bad thing. Well, the syscache already sits in front of the catalogs themselves. I'd be curious to see what the performance difference actually is...

I did some tests with two tables, one small and one large. I am seeing a consistent difference favoring the first-call-only type checking:

          Table "public.dim1"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 d1key  | integer                | not null
 dat    | character varying(100) | not null
 dattyp | character varying(20)  | not null
 dfill  | character varying(100) | not null

INFO: dim1: scanned 24 of 24 pages, containing 1001 live rows and 0 dead rows; 1001 rows in sample, 1001 estimated total rows

SELECT max(pg_column_size(dfill)) FROM dim1

 Run | 1st call only check? | elapsed (ms)
-----+----------------------+-------------
  1  | y                    |  5.5
  2  | y                    |  3.1
  3  | n                    | 11.3
  4  | n                    |  4.1

Now try a bigger table:

          Table "public.fact0"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 d0key  | integer                | not null
 d1key  | integer                | not null
 d2key  | integer                | not null
 fval   | integer                | not null
 ffill  | character varying(100) | not null

INFO: fact0: scanned 3000 of 18182 pages, containing 165000 live rows and 0 dead rows; 3000 rows in sample, 110 estimated total rows

SELECT max(pg_column_size(ffill)) FROM fact0

 Run | 1st call only check? | elapsed (ms)
-----+----------------------+-------------
  1  | y                    | 2497
  2  | y                    | 2484
  3  | y                    | 2496
  4  | y                    | 2480
  5  | n                    | 3572
  6  | n                    | 3570
  7  | n                    | 3558
  8  | n                    | 3457

---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PATCHES] TODO Item - Return compressed length of TOAST datatypes
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: On Thu, Jul 07, 2005 at 03:01:46PM +1200, Mark Kirkwood wrote: I have attached a little change to varlena.c that uses it. I left the ereport as it was, but am not fussed about it either way. I am, because it gives useless messages to the translators to work on. Exactly. I had already made a private note to fix that patch --- inventing your own random wording and punctuation for an extremely standard error message is simply Not Acceptable.

Apologies all - my "not fussed about it" was highly misleading: I had already changed the error message as per Neil's suggestion in the revised patch, but that was not obvious from my comment :-(

regards

Mark ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PATCHES] TODO Item - Return compressed length of TOAST datatypes
I did a few cleanups on the last patch. Please examine this one instead. The changes are: 1. Add documentation for pg_datum_length builtin. 2. Correct some typos in the code comments. 3. Move the code in toastfuncs.c to varlena.c as it is probably the correct place. 4. Use ereport instead of elog. 5 Quiet compiler warning in pg_datum_length. Best wishes Mark Mark Kirkwood wrote: The next iteration - Hopefully I have got the idea basically right. I wonder if I have done the am I a varlena the long way.., pls advise if so! diff -Nacr ./doc/src/sgml/func.sgml.orig ./doc/src/sgml/func.sgml *** ./doc/src/sgml/func.sgml.orig Mon Jun 20 15:38:23 2005 --- ./doc/src/sgml/func.sgmlMon Jun 20 15:45:51 2005 *** *** 2187,2192 --- 2187,2200 /row row + entryliteralfunctionpg_datum_length/function(parameterstring/parameter)/literal/entry +entrytypeinteger/type/entry +entryNumber of bytes (before toast decompression) in string/entry +entryliteralpg_datum_length( 'jo\\000se'::bytea)/literal/entry +entryliteral5/literal/entry + /row + + row entryliteralfunctionposition/function(parametersubstring/parameter in parameterstring/parameter)/literal/entry entrytypeinteger/type/entry entryLocation of specified substring/entry diff -Nacr ./src/backend/access/heap/tuptoaster.c.orig ./src/backend/access/heap/tuptoaster.c *** ./src/backend/access/heap/tuptoaster.c.orig Mon Jun 20 17:11:37 2005 --- ./src/backend/access/heap/tuptoaster.c Mon Jun 20 17:11:44 2005 *** *** 1436,1438 --- 1436,1482 return result; } + + /* -- + * toast_datum_size + * + *Show the (possibly compressed) size of a datum + * -- + */ + Size + toast_datum_size(Datum value) + { + + varattrib *attr = (varattrib *) DatumGetPointer(value); + Sizeresult; + + if (VARATT_IS_EXTERNAL(attr)) + { + /* +* Attribute is stored externally - If it is compressed too, +* then we need to get the external datum and calculate its size, +* otherwise we just use the external rawsize. 
+*/ + if (VARATT_IS_COMPRESSED(attr)) + { + varattrib *attrext = toast_fetch_datum(attr); + result = VARSIZE(attrext); + pfree(attrext); + } + else + { + result = attr-va_content.va_external.va_rawsize; + } + } + else + { + /* +* Attribute is stored inline either compressed or not, just +* calculate the size of the datum in either case. +*/ + result = VARSIZE(attr); + } + + return result; + + } diff -Nacr ./src/backend/utils/adt/varlena.c.orig ./src/backend/utils/adt/varlena.c *** ./src/backend/utils/adt/varlena.c.orig Mon Jun 20 14:28:03 2005 --- ./src/backend/utils/adt/varlena.c Mon Jun 20 17:17:58 2005 *** *** 28,33 --- 28,34 #include utils/builtins.h #include utils/lsyscache.h #include utils/pg_locale.h + #include utils/syscache.h typedef struct varlena unknown; *** *** 2330,2333 --- 2331,2396 result_text = PG_STR_GET_TEXT(hexsum); PG_RETURN_TEXT_P(result_text); + } + + /* + * Show the (possibly compressed) length of a datum. + */ + Datum + pg_datum_length(PG_FUNCTION_ARGS) + { + + Datum value = PG_GETARG_DATUM(0); + int result; + + + if (fcinfo-flinfo-fn_extra == NULL) + { + /* +* On the first call lookup the datatype of the supplied argument +* and check if is a varlena. +*/ + Oid argtypeid = get_fn_expr_argtype(fcinfo-flinfo, 0); + HeapTuple tp; + int typlen = 0; + + + tp = SearchSysCache(TYPEOID, + ObjectIdGetDatum(argtypeid), + 0, 0, 0); + if (HeapTupleIsValid(tp)) + { + Form_pg_type typtup = (Form_pg_type) GETSTRUCT(tp); + typlen = typtup-typlen; + ReleaseSysCache(tp); + } + else + { + /* Oid not in pg_type, should never happen. */ + ereport(ERROR, + (errcode(ERRCODE_INTERNAL_ERROR), +errmsg(invalid typid: %u, argtypeid
Re: [PATCHES] TODO Item - Return compressed length of TOAST datatypes
The next iteration - Hopefully I have got the idea basically right. I wonder if I have done the am I a varlena the long way.., pls advise if so! Cheers Mark Tom Lane wrote: My recollection of that discussion is that we just wanted something that would return the actual VARSIZE() of the datum. You're building something way too confusing ... A more interesting point is that the way you've declared the function, it will only work on text values, which is pretty limiting. Ideally we'd define this thing as pg_datum_length(any-varlena-type) returns int but there is no pseudotype corresponding to any varlena type. I was thinking about this the other day in connection with my proposal to make something that could return the TOAST value OID of an out-of-line datum. I think the only non-restrictive way to do it would be to declare the function as taking any, and then add a runtime check using the get_fn_expr_argtype() mechanism to test that what you've been given is in fact a varlena datatype. diff -Nacr ./src/backend/access/heap/tuptoaster.c.orig ./src/backend/access/heap/tuptoaster.c *** ./src/backend/access/heap/tuptoaster.c.orig Mon Mar 21 13:23:58 2005 --- ./src/backend/access/heap/tuptoaster.c Sun Jun 19 14:24:43 2005 *** *** 1436,1438 --- 1436,1482 return result; } + + /* -- + * toast_datum_size + * + *Show the (possibly compressed) size of a datum + * -- + */ + Size + toast_datum_size(Datum value) + { + + varattrib *attr = (varattrib *) DatumGetPointer(value); + Sizeresult; + + if (VARATT_IS_EXTERNAL(attr)) + { + /* +* Attribute is stored externally - If it is compressed too, +* then we need to get the external datum and calculate its size, +* otherwise we just use the external rawsize. 
+*/ + if (VARATT_IS_COMPRESSED(attr)) + { + varattrib *attrext = toast_fetch_datum(attr); + result = VARSIZE(attrext); + pfree(attrext); + } + else + { + result = attr-va_content.va_external.va_rawsize; + } + } + else + { + /* +* Attribute is stored inline either compressed or not, just +* calculate the size of the datum in either case. +*/ + result = VARSIZE(attr); + } + + return result; + + } diff -Nacr ./src/backend/utils/adt/Makefile.orig ./src/backend/utils/adt/Makefile *** ./src/backend/utils/adt/Makefile.orig Fri Apr 2 09:28:45 2004 --- ./src/backend/utils/adt/MakefileFri Jun 17 17:52:09 2005 *** *** 24,30 tid.o timestamp.o varbit.o varchar.o varlena.o version.o xid.o \ network.o mac.o inet_net_ntop.o inet_net_pton.o \ ri_triggers.o pg_lzcompress.o pg_locale.o formatting.o \ ! ascii.o quote.o pgstatfuncs.o encode.o like.o: like.c like_match.c --- 24,30 tid.o timestamp.o varbit.o varchar.o varlena.o version.o xid.o \ network.o mac.o inet_net_ntop.o inet_net_pton.o \ ri_triggers.o pg_lzcompress.o pg_locale.o formatting.o \ ! ascii.o quote.o pgstatfuncs.o encode.o toastfuncs.o like.o: like.c like_match.c diff -Nacr ./src/backend/utils/adt/toastfuncs.c.orig ./src/backend/utils/adt/toastfuncs.c *** ./src/backend/utils/adt/toastfuncs.c.orig Fri Jun 17 17:52:09 2005 --- ./src/backend/utils/adt/toastfuncs.cSun Jun 19 17:35:26 2005 *** *** 0 --- 1,74 + /*- + * + * toastfuncs.c + * Functions for accessing information about toasted data. 
+ * + * Portions Copyright (c) 1996-2005, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * $PostgreSQL$ + * + *- + */ + + #include postgres.h + #include fmgr.h + + #include catalog/pg_type.h + #include access/xact.h + #include access/tuptoaster.h + #include utils/builtins.h + #include utils/syscache.h + + Datum + pg_datum_length(PG_FUNCTION_ARGS) + { + + Datum value = PG_GETARG_DATUM(0); + int size; + + + if (fcinfo-flinfo-fn_extra == NULL) + { + /* +* On the first call check lookup the datatype of the supplied argument +* and check if is a varlena . +*/ + Oid argtypeid = get_fn_expr_argtype(fcinfo-flinfo, 0); + HeapTuple tp; + int typlen
Re: [PATCHES] TODO Item - Return compressed length of TOAST datatypes
Tom Lane wrote: Mark Kirkwood [EMAIL PROTECTED] writes: I thought I would have a look at: (Datatypes) Add function to return compressed length of TOAST data values. My recollection of that discussion is that we just wanted something that would return the actual VARSIZE() of the datum. You're building something way too confusing ... I was guessing a little about exactly what was wanted - for some reason I couldn't access the mail archives for the last few days (however, can now). A more interesting point is that the way you've declared the function, it will only work on text values, which is pretty limiting. Ideally we'd define this thing as pg_datum_length(any-varlena-type) returns int but there is no pseudotype corresponding to any varlena type. I was thinking about this the other day in connection with my proposal to make something that could return the TOAST value OID of an out-of-line datum. I think the only non-restrictive way to do it would be to declare the function as taking any, and then add a runtime check using the get_fn_expr_argtype() mechanism to test that what you've been given is in fact a varlena datatype. Yes - was thinking I needed to check if the Datum was a varlena (and didn't know how to do it...), so thanks for the pointer! With these thoughts in mind, I'll have another go :-) Cheers Mark ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[PATCHES] TODO Item - Return compressed length of TOAST datatypes (WIP)
I thought I would have a look at: (Datatypes) Add function to return compressed length of TOAST data values. A WIP patch is attached for comment (wanted to check I hadn't bitten off more than I could chew *before* asking questions!). A few questions come to mind: 1) The name - I have called it 'toast_compressed_length'. Seems longish - I'm wondering if just 'compressed_length' is ok? 2) What should be returned for toasted data that is not compressed (or plain stored data for that matter)? The WIP patch just gives the uncompressed size (I notice I may need to subtract VARHDRSZ in some cases). 3) What should be returned for non-varlena types? The WIP patch is treating everything as a varlena, so is returning incorrect information for that case. 4) The builtin is declared as immutable - I am not so sure about that (I am wondering if altering a column's storage from MAIN - EXTENDED and then updating the column to be itself will fool it). 5) Any multi-byte locale considerations? regards Mark diff -Nacr src/include/catalog/pg_proc.h.orig src/include/catalog/pg_proc.h *** src/include/catalog/pg_proc.h.orig Fri Jun 17 15:30:17 2005 --- src/include/catalog/pg_proc.h Fri Jun 17 17:08:18 2005 *** *** 3655,3660 --- 3655,3664 DATA(insert OID = 2560 ( pg_postmaster_start_time PGNSP PGUID 12 f f t f s 0 1184 _null_ _null_ _null_ pgsql_postmaster_start_time - _null_ )); DESCR(postmaster start time); + /* Toast compressed length */ + DATA(insert OID = 2561 ( toast_compressed_lengthPGNSP PGUID 12 f f t f i 1 23 25 _null_ _null_ _null_ toast_compressed_length - _null_ )); + DESCR(toast compressed length); + /* * Symbolic values for provolatile column: these indicate whether the result diff -Nacr src/include/access/tuptoaster.h.orig src/include/access/tuptoaster.h *** src/include/access/tuptoaster.h.origThu Jun 16 21:12:57 2005 --- src/include/access/tuptoaster.h Thu Jun 16 21:14:06 2005 *** *** 138,141 --- 138,149 */ extern Size toast_raw_datum_size(Datum value); + /* -- + * 
toast_compressed_datum_size - + * + *Return the compressed (toasted) size of a varlena datum + * -- + */ + extern Size toast_compressed_datum_size(Datum value); + #endif /* TUPTOASTER_H */ diff -Nacr src/include/utils/pg_lzcompress.h.orig src/include/utils/pg_lzcompress.h *** src/include/utils/pg_lzcompress.h.orig Thu Jun 16 21:21:37 2005 --- src/include/utils/pg_lzcompress.h Thu Jun 16 21:21:11 2005 *** *** 228,231 --- 228,238 extern intpglz_get_next_decomp_char_from_lzdata(PGLZ_DecompState *dstate); extern intpglz_get_next_decomp_char_from_plain(PGLZ_DecompState *dstate); + /* -- + * Function to get compressed size. + * Internal use only. + * -- + */ + extern intpglz_fetch_size(PGLZ_Header *source); + #endif /* _PG_LZCOMPRESS_H_ */ diff -Nacr src/include/utils/builtins.h.orig src/include/utils/builtins.h *** src/include/utils/builtins.h.orig Fri Jun 17 15:25:01 2005 --- src/include/utils/builtins.hFri Jun 17 15:27:30 2005 *** *** 828,831 --- 828,834 /* catalog/pg_conversion.c */ extern Datum pg_convert_using(PG_FUNCTION_ARGS); + /* toastfuncs.c */ + Datum toast_compressed_length(PG_FUNCTION_ARGS); + #endif /* BUILTINS_H */ diff -Nacr src/backend/access/heap/tuptoaster.c.orig src/backend/access/heap/tuptoaster.c *** src/backend/access/heap/tuptoaster.c.orig Thu Jun 16 20:56:59 2005 --- src/backend/access/heap/tuptoaster.cFri Jun 17 15:12:30 2005 *** *** 1436,1438 --- 1436,1499 return result; } + + /* -- + * toast_compressed_datum_size + * + *Show the compressed size of a datum + * -- + */ + Size + toast_compressed_datum_size(Datum value) + { + + + Sizesize; + varattrib *attr = (varattrib *) DatumGetPointer(value); + + if (!PointerIsValid(attr)) + { + /* +* No storage or NULL. +*/ + size = 0; + } + else if (VARATT_IS_EXTERNAL(attr)) + { + /* +* Attribute is stored externally +* If it is compressed too, then we need to get the external datum +* and interrogate *its* compressed size +* otherwise just use the external rawsize (i.e. 
no compression) +*/ + if (VARATT_IS_COMPRESSED(attr)) + { + varattrib *attrext = toast_fetch_datum(attr); + size = pglz_fetch_size((PGLZ_Header *)attrext); + pfree(attrext); + } + else + { + + size = attr-va_content.va_external.va_rawsize; + } + } + else if (VARATT_IS_COMPRESSED
Re: [PATCHES] [HACKERS] pg_buffercache causes assertion failure
Mark Kirkwood wrote: Mark Kirkwood wrote: I couldn't use int4 as the underlying datatype is unsigned int (not available as exposed Pg type). However, using int8 sounds promising (is int8 larger than unsigned int on 64-bit platforms?). Blocknumber is defined as uint32 in block.h - so should always be safe to represent as an int8 I am thinking. I will look at patching pg_buffercache, changing numeric - int8 for the relblocknumber column. This seems a tidier solution than using numeric, and loses the numeric overhead. This patch changes the use of numeric to int8 to represent the relblocknumber column. regards Mark diff -Ncar pgsql.orig/contrib/pg_buffercache/README.pg_buffercache pgsql/contrib/pg_buffercache/README.pg_buffercache *** pgsql.orig/contrib/pg_buffercache/README.pg_buffercache Tue May 31 11:02:41 2005 --- pgsql/contrib/pg_buffercache/README.pg_buffercache Tue May 31 11:05:48 2005 *** *** 66,78 relfilenode| oid | reltablespace | oid | reldatabase| oid | !relblocknumber | numeric | isdirty| boolean | View definition: SELECT p.bufferid, p.relfilenode, p.reltablespace, p.reldatabase, p.relblocknumber, p.isdirty FROM pg_buffercache_pages() p(bufferid integer, relfilenode oid, ! reltablespace oid, reldatabase oid, relblocknumber numeric(10,0), isdirty boolean); regression=# SELECT c.relname, count(*) AS buffers --- 66,78 relfilenode| oid | reltablespace | oid | reldatabase| oid | !relblocknumber | bigint | isdirty| boolean | View definition: SELECT p.bufferid, p.relfilenode, p.reltablespace, p.reldatabase, p.relblocknumber, p.isdirty FROM pg_buffercache_pages() p(bufferid integer, relfilenode oid, ! 
reltablespace oid, reldatabase oid, relblocknumber bigint, isdirty boolean); regression=# SELECT c.relname, count(*) AS buffers diff -Ncar pgsql.orig/contrib/pg_buffercache/pg_buffercache.sql.in pgsql/contrib/pg_buffercache/pg_buffercache.sql.in *** pgsql.orig/contrib/pg_buffercache/pg_buffercache.sql.in Tue May 31 11:02:41 2005 --- pgsql/contrib/pg_buffercache/pg_buffercache.sql.in Tue May 31 09:15:03 2005 *** *** 11,17 CREATE VIEW pg_buffercache AS SELECT P.* FROM pg_buffercache_pages() AS P (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid, !relblocknumber numeric(10), isdirty bool); -- Don't want these to be available at public. REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC; --- 11,17 CREATE VIEW pg_buffercache AS SELECT P.* FROM pg_buffercache_pages() AS P (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid, !relblocknumber int8, isdirty bool); -- Don't want these to be available at public. REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC; diff -Ncar pgsql.orig/contrib/pg_buffercache/pg_buffercache_pages.c pgsql/contrib/pg_buffercache/pg_buffercache_pages.c *** pgsql.orig/contrib/pg_buffercache/pg_buffercache_pages.cTue May 31 11:02:41 2005 --- pgsql/contrib/pg_buffercache/pg_buffercache_pages.c Tue May 31 11:23:46 2005 *** *** 93,99 TupleDescInitEntry(tupledesc, (AttrNumber) 4, reldatabase, OIDOID, -1, 0); TupleDescInitEntry(tupledesc, (AttrNumber) 5, relblockbumber, ! NUMERICOID, -1, 0); TupleDescInitEntry(tupledesc, (AttrNumber) 6, isdirty, BOOLOID, -1, 0); --- 93,99 TupleDescInitEntry(tupledesc, (AttrNumber) 4, reldatabase, OIDOID, -1, 0); TupleDescInitEntry(tupledesc, (AttrNumber) 5, relblockbumber, ! INT8OID, -1, 0); TupleDescInitEntry(tupledesc, (AttrNumber) 6, isdirty, BOOLOID, -1, 0); ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PATCHES] [HACKERS] pg_buffercache causes assertion failure
Tom Lane wrote: Mark Kirkwood [EMAIL PROTECTED] writes: This patch changes the use of numeric to int8 to represent the relblocknumber column. Applied, thanks. This reminds me: I did some patches for 7.4 and 8.0 a while ago (attached) - while I do not expect these to be applied (unless it's ok for contrib to get extra modules in stable releases...), is there somewhere for things like this to go? cheers Mark P.s : They are amended to use int8 too :-) diff -Naur pgsql-7.4.7.orig/contrib/Makefile pgsql-7.4.7/contrib/Makefile --- pgsql-7.4.7.orig/contrib/Makefile Fri Mar 18 11:44:25 2005 +++ pgsql-7.4.7/contrib/MakefileFri Mar 18 10:55:55 2005 @@ -25,6 +25,7 @@ noupdate\ oid2name\ pg_autovacuum \ + pg_buffercache \ pg_dumplo \ pg_logger \ pgbench \ diff -Naur pgsql-7.4.7.orig/contrib/README pgsql-7.4.7/contrib/README --- pgsql-7.4.7.orig/contrib/README Fri Mar 18 11:44:19 2005 +++ pgsql-7.4.7/contrib/README Fri Mar 18 10:55:55 2005 @@ -136,6 +136,10 @@ Automatically performs vacuum by Matthew T. O'Connor matthew@zeut.net +pg_buffercache - + Real-time queries on the shared buffer cache + by Mark Kirkwood [EMAIL PROTECTED] + pg_dumplo - Dump large objects by Karel Zak [EMAIL PROTECTED] diff -Naur pgsql-7.4.7.orig/contrib/pg_buffercache/Makefile pgsql-7.4.7/contrib/pg_buffercache/Makefile --- pgsql-7.4.7.orig/contrib/pg_buffercache/MakefileThu Jan 1 12:00:00 1970 +++ pgsql-7.4.7/contrib/pg_buffercache/Makefile Fri Mar 18 10:55:55 2005 @@ -0,0 +1,17 @@ +# $PostgreSQL: pgsql/contrib/pg_buffercache/Makefile,v 1.1 2005/03/12 15:36:24 neilc Exp $ + +MODULE_big = pg_buffercache +OBJS = pg_buffercache_pages.o + +DATA_built = pg_buffercache.sql +DOCS = README.pg_buffercache + +ifdef USE_PGXS +PGXS = $(shell pg_config --pgxs) +include $(PGXS) +else +subdir = contrib/pg_buffercache +top_builddir = ../.. 
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff -Naur pgsql-7.4.7.orig/contrib/pg_buffercache/README.pg_buffercache pgsql-7.4.7/contrib/pg_buffercache/README.pg_buffercache --- pgsql-7.4.7.orig/contrib/pg_buffercache/README.pg_buffercache Thu Jan 1 12:00:00 1970 +++ pgsql-7.4.7/contrib/pg_buffercache/README.pg_buffercacheFri Mar 18 11:47:28 2005 @@ -0,0 +1,110 @@ +Pg_buffercache - Real time queries on the shared buffer cache. +-- + + This module consists of a C function 'pg_buffercache_pages()' that returns + a set of records, plus a view 'pg_buffercache' to wrapper the function. + + The intent is to do for the buffercache what pg_locks does for locks, i.e - + ability to examine what is happening at any given time without having to + restart or rebuild the server with debugging code added. + + By default public access is REVOKED from both of these, just in case there + are security issues lurking. + + +Installation + + + Build and install the main Postgresql source, then this contrib module: + + $ cd contrib/pg_buffercache + $ gmake + $ gmake install + + + To register the functions: + + $ psql -d database -f pg_buffercache.sql + + +Notes +- + + The definition of the columns exposed in the view is: + + Column | references | Description + +--+ + bufferid | | Id, 1-shared_buffers. + relfilenode| pg_class.relfilenode | Refilenode of the relation. + reldatabase| pg_database.oid | Database for the relation. + relblocknumber | | Offset of the page in the relation. + isdirty| | Is the page dirty? + + + There is one row for each buffer in the shared cache. Unused buffers are + shown with all fields null except bufferid. + + Because the cache is shared by all the databases, there are pages from + relations not belonging to the current database. + + When the pg_buffercache view is accessed, internal buffer manager locks are + taken, and a copy of the buffer cache data is made for the view to display. 
+ This ensures that the view produces a consistent set of results, while not + blocking normal buffer activity longer than necessary. Nonetheless there + could be some impact on database performance if this view is read often. + + +Sample output +- + + regression=# \d pg_buffercache; + View public.pg_buffercache + Column | Type | Modifiers + +-+--- + bufferid | integer | + relfilenode| oid | + reldatabase| oid | + relblocknumber | bigint | + isdirty| boolean | + View definition: + SELECT p.bufferid, p.relfilenode, p.reldatabase, + p.relblocknumber, p.isdirty + FROM
Re: [PATCHES] [HACKERS] pg_buffercache causes assertion failure
Neil Conway wrote: On Tue, 2005-05-31 at 13:07 +1200, Mark Kirkwood wrote: I did some patches for 7.4 and 8.0 a while ago (attached) - while I do not expect these to be applied Right, I don't see a need to backport this. is there somewhere for things like this to go? Pg Foundry? Of course! Thanks Mark ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Proof of concept for MIN/MAX optimization
Tom Lane wrote: Mark Kirkwood [EMAIL PROTECTED] writes: The handling of nulls is a little unexpected (still todo?) : Yeah, that was broken in the first draft :-( ... I think it's OK in the committed version though. (post cvs update) yeah - looks good! regards Mark ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PATCHES] Proof of concept for MIN/MAX optimization
Tom Lane wrote: I haven't done the work yet to add a column to pg_aggregate, so this just knows about optimizing a couple of hard-wired cases (int4 and text). Other than that it's reasonably complete, I think. Comments?

Looks good:

regression=# explain select max(unique1) from tenk1;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Result  (cost=0.15..0.16 rows=1 width=0)
   InitPlan
     ->  Limit  (cost=0.00..0.15 rows=1 width=4)
           ->  Index Scan Backward using tenk1_unique1 on tenk1  (cost=0.00..1512.59 rows=1 width=4)
(4 rows)

The handling of nulls is a little unexpected (still todo?):

regression=# insert into tenk2 (unique1,unique2) values(null, 1);
INSERT 0 1
regression=# select max(unique1) from tenk2;
 max
-----

(1 row)

regression=# set enable_indexscan=0;
SET
regression=# select max(unique1) from tenk2;
 max
------

(1 row)

cheers Mark ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PATCHES] [HACKERS] contrib/pg_buffercache
Neil Conway wrote: Mark Kirkwood wrote: Great that it fixes it... however, I had submitted a tidier patch that puts the macro in the header How is this tidier? err... puts the macro in the header (I don't see a reason for pg_buffercache_pages.h at all, actually.) (chuckles) - well, that puts a different complexion on it, best leave it as is then! (in fact, feel free to move the declaration from pg_buffercache_pages.h to pg_buffercache_pages.c, and eliminate the header altogether) cheers Mark ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] [HACKERS] contrib/pg_buffercache
Andrew Dunstan wrote: I didn't see the original of the later patch, which is why I sent in mine. I honestly don't care that much either way, although I'm inclined to agree that the header file is just unnecessary noise. Well, looks like a strong feeling for no header :-). I didn't really think about *not* having one to be honest. So, anyway - instead of inviting Neil to spend valuable time eliminating it, I should do it myself... So, please ignore my previous patch to the header file, and consider this one - which eliminates it completely. best wishes Mark diff -Nacr pg_buffercache.orig/pg_buffercache_pages.c pg_buffercache/pg_buffercache_pages.c *** pg_buffercache.orig/pg_buffercache_pages.c Fri Apr 1 11:25:49 2005 --- pg_buffercache/pg_buffercache_pages.c Fri Apr 1 11:28:22 2005 *** *** 12,18 #include storage/buf_internals.h #include storage/bufmgr.h #include utils/relcache.h - #include pg_buffercache_pages.h #define NUM_BUFFERCACHE_PAGES_ELEM6 --- 12,17 *** *** 21,26 --- 20,27 extern DLLIMPORT BufferDesc *BufferDescriptors; extern DLLIMPORT volatile uint32 InterruptHoldoffCount; #endif + + Datum pg_buffercache_pages(PG_FUNCTION_ARGS); /* diff -Nacr pg_buffercache.orig/pg_buffercache_pages.h pg_buffercache/pg_buffercache_pages.h *** pg_buffercache.orig/pg_buffercache_pages.h Fri Apr 1 11:25:49 2005 --- pg_buffercache/pg_buffercache_pages.h Thu Jan 1 12:00:00 1970 *** *** 1,18 - /*- - * - * pg_buffercache_pages.h - *Prototypes for pg_buffercache_pages - * - * - *$PostgreSQL: pgsql/contrib/pg_buffercache/pg_buffercache_pages.h,v 1.1 2005/03/12 15:36:24 neilc Exp $ - * - *- - */ - - - #ifndef PG_BUFFERCACHE_PAGES_H - #define PG_BUFFERCACHE_PAGES_H - - extern Datum pg_buffercache_pages(PG_FUNCTION_ARGS); - - #endif /* PG_BUFFERCACHE_PAGES_H */ --- 0 ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] [HACKERS] contrib/pg_buffercache
Neil Conway wrote: Andrew Dunstan wrote: I have confirmed that the attached patch works on Cygwin as well as Windows. Please apply. Applied, thanks. Great that it fixes it... however, I had submitted a tidier patch that puts the macro in the header (probably after asking Andrew to test the first one, oops - sorry Andrew). I have tested it on win32 native. Do you want to back out the first one and use this instead? regards Mark *** pg_buffercache_pages.h.orig Thu Mar 17 10:12:20 2005 --- pg_buffercache_pages.h Thu Mar 17 13:44:45 2005 *** *** 15,18 --- 15,24 extern Datum pg_buffercache_pages(PG_FUNCTION_ARGS); + /* A little hackery for Windows and Cygwin */ + #if defined (WIN32) || defined (__CYGWIN__) + extern DLLIMPORT BufferDesc *BufferDescriptors; + extern DLLIMPORT volatile uint32 InterruptHoldoffCount; + #endif + #endif /* PG_BUFFERCACHE_PAGES_H */ ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PATCHES] [HACKERS] WAL: O_DIRECT and multipage-writer
On Tue, Jan 25, 2005 at 06:06:23PM +0900, ITAGAKI Takahiro wrote: Environment: OS : Linux kernel 2.6.9 CPU: Pentium 4 3GHz disk : ATA 5400rpm (Data and WAL are placed on same partition.) memory : 1GB config : shared_buffers=1, wal_buffers=256, XLOG_SEG_SIZE=256MB, checkpoint_segment=4 Hi Itagaki, In light of this thread, have you compared the performance on Linux-2.4? Direct io on block device has performance regression on 2.6.x kernel http://www.ussg.iu.edu/hypermail/linux/kernel/0503.1/0328.html Mark ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PATCHES] [HACKERS] contrib/pg_buffercache
Andrew Dunstan wrote: It fixes the build error on Windows - haven't tried because I don't have time, but I know it won't work on Cygwin, because WIN32 isn't (usually) defined on Cygwin - see previous almost endless discussions. Yes - I recall that discussion a while ago. This patch should sort the issue. One question: should I be using defined(__MINGW32__) as opposed to defined(WIN32)? I figured I didn't need to, as in this case it is not necessary to distinguish between native and Cygwin. regards Mark *** pg_buffercache_pages.h.orig Thu Mar 17 10:12:20 2005 --- pg_buffercache_pages.h Thu Mar 17 13:44:45 2005 *** *** 15,18 --- 15,24 extern Datum pg_buffercache_pages(PG_FUNCTION_ARGS); + /* A little hackery for Windows and Cygwin */ + #if defined (WIN32) || defined (__CYGWIN__) + extern DLLIMPORT BufferDesc *BufferDescriptors; + extern DLLIMPORT volatile uint32 InterruptHoldoffCount; + #endif + #endif /* PG_BUFFERCACHE_PAGES_H */ ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Display Pg buffer cache (WIP)
A couple of minor amendments here: - remove link to libpq (from cut+past of dblnk's Makefile) - add comment for pg_buffercache module in contrib/README - change my listed email to this one (I have resigned) regards Mark diff -Nacr pgsql.orig/contrib/Makefile pgsql/contrib/Makefile *** pgsql.orig/contrib/Makefile Thu Mar 3 11:29:53 2005 --- pgsql/contrib/Makefile Wed Mar 9 10:10:41 2005 *** *** 26,31 --- 26,32 noupdate\ oid2name\ pg_autovacuum \ + pg_buffercache \ pg_dumplo \ pg_trgm \ pgbench \ diff -Nacr pgsql.orig/contrib/README pgsql/contrib/README *** pgsql.orig/contrib/README Thu Mar 3 11:29:53 2005 --- pgsql/contrib/READMEFri Mar 11 12:11:18 2005 *** *** 136,141 --- 136,145 Automatically performs vacuum by Matthew T. O'Connor matthew@zeut.net + pg_buffercace - + Real time queries on the shared buffer cache. + by Mark Kirkwood [EMAIL PROTECTED] + pg_dumplo - Dump large objects by Karel Zak [EMAIL PROTECTED] diff -Nacr pgsql.orig/contrib/pg_buffercache/Makefile pgsql/contrib/pg_buffercache/Makefile *** pgsql.orig/contrib/pg_buffercache/Makefile Thu Jan 1 12:00:00 1970 --- pgsql/contrib/pg_buffercache/Makefile Thu Mar 10 08:19:20 2005 *** *** 0 --- 1,19 + # $PostgreSQL$ + + MODULE_big = pg_buffercache + OBJS = pg_buffercache_pages.o + + DATA_built = pg_buffercache.sql + DOCS = README.pg_buffercache + REGRESS = pg_buffercache + + + ifdef USE_PGXS + PGXS = $(shell pg_config --pgxs) + include $(PGXS) + else + subdir = contrib/pg_buffercache + top_builddir = ../.. + include $(top_builddir)/src/Makefile.global + include $(top_srcdir)/contrib/contrib-global.mk + endif diff -Nacr pgsql.orig/contrib/pg_buffercache/README.pg_buffercache pgsql/contrib/pg_buffercache/README.pg_buffercache *** pgsql.orig/contrib/pg_buffercache/README.pg_buffercache Thu Jan 1 12:00:00 1970 --- pgsql/contrib/pg_buffercache/README.pg_buffercache Fri Mar 11 12:11:03 2005 *** *** 0 --- 1,112 + Pg_buffercache - Real time queries on the shared buffer cache. 
+ -- + + This module consists of a C function 'pg_buffercache_pages()' that returns + a set of records, plus a view 'pg_buffercache' to wrapper the function. + + The intent is to do for the buffercache what pg_locks does for locks, i.e - + ability to examine what is happening at any given time without having to + restart or rebuild the server with debugging code added. + + By default public access is REVOKED from both of these, just in case there + are security issues lurking. + + + Installation + + + Build and install the main Postgresql source, then this contrib module: + + $ cd contrib/pg_buffercache + $ gmake + $ gmake install + + + To register the functions: + + $ psql -d database -f pg_buffercache.sql + + + Notes + - + + The definition of the columns exposed in the view is: + +Column | references | Description + +--+ +bufferid | | Id, 1-shared_buffers. +relfilenode| pg_class.relfilenode | Refilenode of the relation. +reltablespace | pg_tablespace.oid| Tablespace oid of the relation. +reldatabase| pg_database.oid | Database for the relation. +relblocknumber | | Offset of the page in the relation. +isdirty| | Is the page dirty? + + + There is one row for each buffer in the shared cache. Unused buffers are + shown with all fields null except bufferid. + + Because the cache is shared by all the databases, there are pages from + relations not belonging to the current database. + + When the pg_buffercache view is accessed, internal buffer manager locks are + taken, and a copy of the buffer cache data is made for the view to display. + This ensures that the view produces a consistent set of results, while not + blocking normal buffer activity longer than necessary. Nonetheless there + could be some impact on database performance if this view is read often. 
+ + + Sample output + - + + regression=# \d pg_buffercache; +View public.pg_buffercache +Column | Type | Modifiers + +-+--- +bufferid | integer | +relfilenode| oid | +reltablespace | oid | +reldatabase| oid | +relblocknumber | numeric | +isdirty| boolean | + View definition: +SELECT p.bufferid, p.relfilenode, p.reltablespace, p.reldatabase, + p.relblocknumber, p.isdirty + FROM pg_buffercache_pages() p(bufferid integer, relfilenode oid
Re: [PATCHES] Display Pg buffer cache (WIP)
Tom Lane wrote: I would rather see this as a contrib module. There has been *zero* consensus that we need this in core, nor any discussion about whether it might be a security hole. Hmmm, I guess my motivation for thinking it might be useful in core was that it was similar to 'pg_locks' and 'pg_stats', i.e. exposing some internals in a convenient queryable manner (useful for problem solving). On the security front, I should have added a 'REVOKE ALL ...FROM PUBLIC' on the view to at least control access (fortunately easy to add). cheers Mark ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Display Pg buffer cache (WIP)
Neil Conway wrote: Mark Kirkwood wrote: + TupleDescInitEntry(tupledesc, (AttrNumber) 5, relblockbumber, + NUMERICOID, -1, 0); I think this should be an int4, not numeric. I was looking for an UINT4OID :-), but numeric seemed the best compromise (safer than overflowing int4). I guess I could add a type 'blocknumber' that is actually the same as 'oid', but it seems like a lot of effort for one column. cheers Mark ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] Display Pg buffer cache (WIP)
Tom Lane wrote: One reason for making it contrib is that I don't think you've got it entirely right yet, and there will be several iterations before it settles down. In a contrib module that is no problem, in core it's a forced initdb each time. Yeah - certainly less intrusive as a contrib if amendments are required! Barring a huge groundswell of support for it in core :-) I will resubmit a patch for it as a contrib module. cheers Mark ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] Display Pg buffer cache (WIP)
Mark Kirkwood wrote: Tom Lane wrote: One reason for making it contrib is that I don't think you've got it entirely right yet, and there will be several iterations before it settles down. In a contrib module that is no problem, in core it's a forced initdb each time. Yeah - certainly less intrusive as a contrib if amendments are required! Barring a huge groundswell of support for it in core :-) I will resubmit a patch for it as a contrib module. And here it is... Mark diff -Nacr pgsql.orig/contrib/Makefile pgsql/contrib/Makefile *** pgsql.orig/contrib/Makefile Thu Mar 3 11:29:53 2005 --- pgsql/contrib/Makefile Wed Mar 9 10:10:41 2005 *** *** 26,31 --- 26,32 noupdate\ oid2name\ pg_autovacuum \ + pg_buffercache \ pg_dumplo \ pg_trgm \ pgbench \ diff -Nacr pgsql.orig/contrib/pg_buffercache/Makefile pgsql/contrib/pg_buffercache/Makefile *** pgsql.orig/contrib/pg_buffercache/Makefile Thu Jan 1 12:00:00 1970 --- pgsql/contrib/pg_buffercache/Makefile Wed Mar 9 11:42:05 2005 *** *** 0 --- 1,21 + # $PostgreSQL$ + + MODULE_big = pg_buffercache + PG_CPPFLAGS = -I$(libpq_srcdir) + OBJS = pg_buffercache_pages.o + SHLIB_LINK = $(libpq) + + DATA_built = pg_buffercache.sql + DOCS = README.pg_buffercache + REGRESS = pg_buffercache + + + ifdef USE_PGXS + PGXS = $(shell pg_config --pgxs) + include $(PGXS) + else + subdir = contrib/pg_buffercache + top_builddir = ../.. + include $(top_builddir)/src/Makefile.global + include $(top_srcdir)/contrib/contrib-global.mk + endif diff -Nacr pgsql.orig/contrib/pg_buffercache/README.pg_buffercache pgsql/contrib/pg_buffercache/README.pg_buffercache *** pgsql.orig/contrib/pg_buffercache/README.pg_buffercache Thu Jan 1 12:00:00 1970 --- pgsql/contrib/pg_buffercache/README.pg_buffercache Wed Mar 9 12:26:56 2005 *** *** 0 --- 1,112 + Pg_buffercache - Real-time queries on the shared buffer cache. + -- + + This module consists of a C function 'pg_buffercache_pages()' that returns + a set of records, plus a view 'pg_buffercache' to wrapper the function. 
+ + The intent is to do for the buffercache what pg_locks does for locks, i.e - + ability to examine what is happening at any given time without having to + restart or rebuild the server with debugging code added. + + By default public access is REVOKED from both of these, just in case there + are security issues lurking. + + + Installation + + + Build and install the main Postgresql source, then this contrib module: + + $ cd contrib/pg_buffercache + $ gmake + $ gmake install + + + To register the functions: + + $ psql -d database -f pg_buffercache.sql + + + Notes + - + + The definition of the columns exposed in the view is: + +Column | references | Description + +--+ +bufferid | | Id, 1-shared_buffers. +relfilenode| pg_class.relfilenode | Refilenode of the relation. +reltablespace | pg_tablespace.oid| Tablespace oid of the relation. +reldatabase| pg_database.oid | Database for the relation. +relblocknumber | | Offset of the page in the relation. +isdirty| | Is the page dirty? + + + There is one row for each buffer in the shared cache. Unused buffers are + shown with all fields null except bufferid. + + Because the cache is shared by all the databases, there are pages from + relations not belonging to the current database. + + When the pg_buffercache view is accessed, internal buffer manager locks are + taken, and a copy of the buffer cache data is made for the view to display. + This ensures that the view produces a consistent set of results, while not + blocking normal buffer activity longer than necessary. Nonetheless there + could be some impact on database performance if this view is read often. 
+ + + Sample output + - + + regression=# \d pg_buffercache; +View public.pg_buffercache +Column | Type | Modifiers + +-+--- +bufferid | integer | +relfilenode| oid | +reltablespace | oid | +reldatabase| oid | +relblocknumber | numeric | +isdirty| boolean | + View definition: +SELECT p.bufferid, p.relfilenode, p.reltablespace, p.reldatabase, + p.relblocknumber, p.isdirty + FROM pg_buffercache_pages() p(bufferid integer, relfilenode oid, + reltablespace oid, reldatabase oid, relblocknumber numeric(10,0), + isdirty boolean); + + regression=# SELECT c.relname, count(*) AS buffers +FROM pg_class c, pg_buffercache b
Re: [PATCHES] Display Pg buffer cache (WIP)
The latest iteration. I have added documentation and updated the expected output so that the regression tests pass. In addition, after looking at the various system view names, I decided that 'pg_cache_dump' does not fit in nicely - so chose a more Pg-suitable name of 'pg_buffercache'. Some renaming of the backend functions happened too. Finally, since I was saving blocknum, it went into the view as well. Hopefully I am dealing with invalid buffer tags sensibly now. The per-buffer spin lock is still being held - although it is obviously trivial to remove if not actually required. regards Mark P.S.: remembered to use diff -c Mark Kirkwood wrote: Neil Conway wrote: Tom Lane wrote: It'd be possible to dispense with the per-buffer spinlocks so long as you look only at the tag (and perhaps the TAG_VALID flag bit). The tags can't be changing while you hold the BufMappingLock. That's what I had thought at first, but this comment in buf_internals.h dissuaded me: buf_hdr_lock must be held to examine or change the tag, flags, usage_count, refcount, or wait_backend_id fields. The comment already notes this isn't true if you've got the buffer pinned; it would be worth adding another exception for holding the BufMappingLock, IMHO. I'm dubious that there's any point in recording information as transient as the refcounts and dirtybits I think it's worth recording dirty bits -- it provides an indication of the effectiveness of the bgwriter, for example. Reference counts could be done away with, although I doubt it would have a significant effect on the time spent holding the lock. Let's suppose refcount is eliminated. I will then be examining the tag, flags and buf_id elements of the buffer. Holding the BufMappingLock prevents the tag changing, but what about the flags? In addition Tom pointed out that I am not examining the BM_TAG_VALID or BM_VALID flag bits (I am only checking if tag.blockNum equals InvalidBlockNumber).
My initial thought is to handle !BM_TAG_VALID or !BM_VALID similarly to InvalidBlockNumber i.e all non buf_id fields set to NULL. Mark ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED]) diff -Nacr pgsql.orig/doc/src/sgml/catalogs.sgml pgsql/doc/src/sgml/catalogs.sgml *** pgsql.orig/doc/src/sgml/catalogs.sgml Mon Mar 7 12:20:17 2005 --- pgsql/doc/src/sgml/catalogs.sgmlTue Mar 8 12:03:50 2005 *** *** 3875,3880 --- 3875,3885 tbody row + entrylink linkend=view-pg-buffercachestructnamepg_buffercache/structname/link/entry + entryshared buffer cache/entry + /row + + row entrylink linkend=view-pg-indexesstructnamepg_indexes/structname/link/entry entryindexes/entry /row *** *** 3917,3922 --- 3922,4021 /tbody /tgroup /table + /sect1 + + sect1 id=view-pg-buffercache + titlestructnamepg_buffercache/structname/title + + indexterm zone=view-pg-buffercache +primarypg_buffercache/primary + /indexterm + para +The view structnamepg_buffercache/structname provides access to +some information from the shared buffer cache. + /para + + para +There is one row for each buffer in the shared cache. Unused buffers are +shown with all fields null except structfieldbufferid/structfield. +Because the cache is shared by all the databases, there are pages from +relations not belonging to the current database. + /para + + + table +titlestructnamepg_buffercache/structname Columns/title + +tgroup cols=4 + thead + row + entryName/entry + entryType/entry + entryReferences/entry + entryDescription/entry + /row + /thead + tbody + row + entrybufferid/entry + entrytypeinteger/type/entry + entry/entry + entry +The buffer number. This is numbered 1 to varnameshared_buffers/varname. 
+ /entry + /row + row + entryrelfilenode/entry + entrytypeoid/type/entry + entryliterallink linkend=catalog-pg-classstructnamepg_class/structname/link.relfilenode/literal/entry + entry + The on-disk file for the relation that this page came from. + /entry + /row + row + entryreltablespace/entry + entrytypeoid/type/entry + entry + literallink linkend=catalog-pg-tablespacestructnamepg_tablespace/structname/link.oid/literal + /entry + entryTablespace the corresponding relation is in./entry + /row + row + entryreldatabase/entry + entrytypeoid/type/entry + entryliterallink linkend=catalog-pg-databasestructnamepg_database/structname/link.oid/literal/entry + entry +Database the corresponding relation belongs to, or zero if the +relation is a globally-shared table/entry + /row
Re: [PATCHES] Display Pg buffer cache (WIP)
Neil Conway wrote: Tom Lane wrote: It'd be possible to dispense with the per-buffer spinlocks so long as you look only at the tag (and perhaps the TAG_VALID flag bit). The tags can't be changing while you hold the BufMappingLock. That's what I had thought at first, but this comment in buf_internals.h dissuaded me: buf_hdr_lock must be held to examine or change the tag, flags, usage_count, refcount, or wait_backend_id fields. The comment already notes this isn't true if you've got the buffer pinned; it would be worth adding another exception for holding the BufMappingLock, IMHO. I'm dubious that there's any point in recording information as transient as the refcounts and dirtybits I think it's worth recording dirty bits -- it provides an indication of the effectiveness of the bgwriter, for example. Reference counts could be done away with, although I doubt it would have a significant effect on the time spent holding the lock. Let's suppose refcount is eliminated. I will then be examining the tag, flags and buf_id elements of the buffer. Holding the BufMappingLock prevents the tag changing, but what about the flags? In addition Tom pointed out that I am not examining the BM_TAG_VALID or BM_VALID flag bits (I am only checking if tag.blockNum equals InvalidBlockNumber). My initial thought is to handle !BM_TAG_VALID or !BM_VALID similarly to InvalidBlockNumber i.e all non buf_id fields set to NULL. Mark ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [PATCHES] Display Pg buffer cache (WIP)
Neil Conway wrote: If you do decide to hold the BufMappingLock, it might make sense to: 1. allocate an array of NBuffers elements 2. acquire BufferMappingLock in share mode 3. sequentially scan through the buffer pool, copying data into the array 4. release the lock 5. on each subsequent call to the SRF, format and return an element of the array Which should reduce the time to lock is held. This will require allocating NBuffers * size_of_stats memory (where size_of_stats will be something like 16 bytes). That is a better approach, so I've used it in this new iteration. In addition to holding the BufMappingLock, each buffer header is (spin) locked before examining it, hopefully this is correct - BTW, I like the new buffer lock design. I'm still using BuildTupleFromCStrings, so there is considerable use of sprintf conversion and temporary char * stuff. I would like this to be a bit cleaner, so any suggestions welcome. regards Mark diff -Naur pgsql.orig/src/backend/catalog/system_views.sql pgsql/src/backend/catalog/system_views.sql --- pgsql.orig/src/backend/catalog/system_views.sql Fri Mar 4 14:23:09 2005 +++ pgsql/src/backend/catalog/system_views.sql Fri Mar 4 14:21:33 2005 @@ -277,3 +277,9 @@ DO INSTEAD NOTHING; GRANT SELECT, UPDATE ON pg_settings TO PUBLIC; + +CREATE VIEW pg_cache_dump AS + SELECT D.* FROM pg_cache_dump() AS D + (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid, +isdirty bool, refcount int4); + diff -Naur pgsql.orig/src/backend/utils/adt/cachedump.c pgsql/src/backend/utils/adt/cachedump.c --- pgsql.orig/src/backend/utils/adt/cachedump.cThu Jan 1 12:00:00 1970 +++ pgsql/src/backend/utils/adt/cachedump.c Sat Mar 5 20:21:45 2005 @@ -0,0 +1,221 @@ +/*- + * + * cachedump.c + *display some contents of the buffer cache + * + * Portions Copyright (c) 1996-2005, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * $PostgreSQL$ + *- + */ +#include 
postgres.h +#include funcapi.h +#include catalog/pg_type.h +#include storage/buf_internals.h +#include storage/bufmgr.h +#include utils/relcache.h +#include utils/builtins.h + + +#define NUM_CACHE_DUMP_ELEM6 + +/* + * Record structure holding the to be exposed cache data. + */ +typedef struct +{ + uint32 bufferid; + Oid relfilenode; + Oid reltablespace; + Oid reldatabase; + boolisdirty; + uint32 refcount; + BlockNumber blocknum; + +} CacheDumpRec; + + +/* + * Function context for data persisting over repeated calls. + */ +typedef struct +{ + AttInMetadata *attinmeta; + CacheDumpRec*record; + char*values[NUM_CACHE_DUMP_ELEM]; +} CacheDumpContext; + + +/* + * Function returning data from the shared buffer cache - buffer number, + * relation node/tablespace/database, dirty indicator and refcount. + */ +Datum +cache_dump(PG_FUNCTION_ARGS) +{ + FuncCallContext *funcctx; + Datum result; + MemoryContext oldcontext; + CacheDumpContext*fctx; /* User function context. */ + TupleDesc tupledesc; + HeapTuple tuple; + + if (SRF_IS_FIRSTCALL()) + { + RelFileNode rnode; + uint32 i; + BufferDesc *bufHdr; + + + funcctx = SRF_FIRSTCALL_INIT(); + + /* Switch context when allocating stuff to be used in later calls */ + oldcontext = MemoryContextSwitchTo(funcctx-multi_call_memory_ctx); + + /* construct a tuple to return */ + tupledesc = CreateTemplateTupleDesc(NUM_CACHE_DUMP_ELEM, false); + TupleDescInitEntry(tupledesc, (AttrNumber) 1, bufferid, + INT4OID, -1, 0); + TupleDescInitEntry(tupledesc, (AttrNumber) 2, relfilenode, + OIDOID, -1, 0); + TupleDescInitEntry(tupledesc, (AttrNumber) 3, reltablespace, + OIDOID, -1, 0); + TupleDescInitEntry(tupledesc, (AttrNumber) 4, reldatabase, + OIDOID, -1, 0); + TupleDescInitEntry(tupledesc, (AttrNumber) 5, isdirty, + BOOLOID, -1, 0); + TupleDescInitEntry(tupledesc, (AttrNumber) 6, refcount
Re: [PATCHES] Display Pg buffer cache (WIP)
Tom Lane wrote: Mark Kirkwood [EMAIL PROTECTED] writes: I am using anoncvs from yesterday, so if Tom's new scheme is *very* new I may be missing it. It's not committed yet ;-) Yep - that's pretty new, apologies for slow grey matter... I have been following the discussion about the new buffer design, but didn't think to connect it to this context! cheers Mark ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PATCHES] Display Pg buffer cache (WIP)
Neil Conway wrote: I don't like accessing shared data structures without acquiring the appropriate locks; even if it doesn't break anything, it seems like just asking for trouble. In order to be able to inspect a buffer's tag in Tom's new locking scheme (not yet in HEAD, but will be in 8.1), you need only hold a per-buffer lock. That will mean acquiring and releasing a spinlock for each buffer, which isn't _too_ bad... I am looking at this bit now. I take the first part to mean that I don't need to wrap LWLockAcquire(BufMgrLock, LW_EXCLUSIVE) ... LWLockRelease(BufMgrLock) around the inspection of the buffers (?) This per-buffer lock, are we talking about LockBuffer(buf, BUFFER_LOCK_SHARE) ... ReleaseBuffer(buf) ? If so, it looks like I need to do some stuff with ResourceOwners, otherwise ReleaseBuffer will fail (or am I wandering up the wrong track here?). I am using anoncvs from yesterday, so if Tom's new scheme is *very* new I may be missing it. Thanks Mark ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PATCHES] WIP: buffer manager rewrite (take 2)
On Wed, Feb 16, 2005 at 07:50:28PM -0500, Tom Lane wrote: Second iteration of buffer manager rewrite. This uses the idea of a usage counter instead of just a recently-used flag bit. I allowed the counter to go up to 5, but some playing around with that value would be interesting. (Tweak BM_MAX_USAGE_COUNT in src/include/storage/buf_internals.h, then recompile the files in src/backend/storage/buffer/.) Also there are more GUC variables now for controlling the bgwriter. I see a huge performance increase when applied to CVS from 20050301.

Baseline against 8.0.1: http://www.osdl.org/projects/dbt2dev/results/dev4-010/309/ throughput 3639.97
CVS from 20050301: http://www.osdl.org/projects/dbt2dev/results/dev4-010/314/ throughput 5483.01

I only ran this for 30 minutes, as opposed to 60, but it looks promising. So about a 50% increase in throughput for my test. Not too shabby. ;) Mark ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PATCHES] WIP: buffer manager rewrite (take 2)
On Wed, Mar 02, 2005 at 08:48:35PM -0500, Tom Lane wrote: Mark Wong [EMAIL PROTECTED] writes: CVS from 20050301 [ plus clock-sweep buffer manager ] : http://www.osdl.org/projects/dbt2dev/results/dev4-010/314/ throughput 5483.01 I only ran this for 30 minutes, as opposed to 60, but it looks promising. So about a 50% increase in throughput for my test. Not too shabby. ;) Sweet ... and the response-time improvement is just stunning. I think we may have a winner. Is there a reason you didn't run the test longer though? I would normally run for 1 hour, but I guess my fingers were thinking something different at the time. =P Mark ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
[PATCHES] Display Pg buffer cache (WIP)
I found it useful to be able to look at what relations are occupying the cache (mainly for debugging and comparing with page estimates). I am wondering if it might be a useful addition generally?

How it works: an SRF called dump_cache() is created, which is then exposed as the system view pg_dump_cache. So stuff like the following can be performed (cache of 2000 immediately after 'make installcheck'):

regression=# SELECT c.relname, count(*) AS buffers
regression-# FROM pg_class c, pg_dump_cache d
regression-# WHERE d.relfilenode = c.relfilenode
regression-# GROUP BY c.relname
regression-# ORDER BY 2 DESC LIMIT 10;
             relname             | buffers
---------------------------------+---------
 tenk1                           |     345
 tenk2                           |     345
 onek                            |     138
 pg_attribute                    |     102
 road                            |      81
 pg_attribute_relid_attnam_index |      74
 onek2                           |      69
 pg_proc                         |      59
 pg_proc_proname_args_nsp_index  |      56
 hash_f8_heap                    |      49
(10 rows)

regression=#

As of now the patch breaks 1 regression test (rules - that new view...) and does not peek into the contents of the buffers themselves (I was mainly interested in which relations were there), and does not worry too much about a consistent view of the buffer contents (as I didn't want it to block all other activity!). If it seems like a worthwhile addition I will amend the regression expected output and resubmit. Any comments?
Mark diff -Naur pgsql.orig/src/backend/catalog/system_views.sql pgsql/src/backend/catalog/system_views.sql --- pgsql.orig/src/backend/catalog/system_views.sql Thu Mar 3 11:29:55 2005 +++ pgsql/src/backend/catalog/system_views.sql Thu Mar 3 11:41:24 2005 @@ -277,3 +277,8 @@ DO INSTEAD NOTHING; GRANT SELECT, UPDATE ON pg_settings TO PUBLIC; + +CREATE VIEW pg_dump_cache AS + SELECT D.* FROM pg_dump_cache() AS D + (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid); + diff -Naur pgsql.orig/src/backend/utils/adt/Makefile pgsql/src/backend/utils/adt/Makefile --- pgsql.orig/src/backend/utils/adt/Makefile Thu Mar 3 11:29:53 2005 +++ pgsql/src/backend/utils/adt/MakefileThu Mar 3 11:32:10 2005 @@ -24,7 +24,7 @@ tid.o timestamp.o varbit.o varchar.o varlena.o version.o xid.o \ network.o mac.o inet_net_ntop.o inet_net_pton.o \ ri_triggers.o pg_lzcompress.o pg_locale.o formatting.o \ - ascii.o quote.o pgstatfuncs.o encode.o + ascii.o quote.o pgstatfuncs.o encode.o dumpcache.o like.o: like.c like_match.c diff -Naur pgsql.orig/src/backend/utils/adt/dumpcache.c pgsql/src/backend/utils/adt/dumpcache.c --- pgsql.orig/src/backend/utils/adt/dumpcache.cThu Jan 1 12:00:00 1970 +++ pgsql/src/backend/utils/adt/dumpcache.c Thu Mar 3 11:51:53 2005 @@ -0,0 +1,119 @@ +/*- + * + * dumpcache.c + *display some contents for the buffer cache + * + *- + */ +#include postgres.h +#include funcapi.h +#include catalog/pg_type.h +#include storage/buf_internals.h +#include storage/bufmgr.h +#include utils/relcache.h + + +extern Datum dump_cache(PG_FUNCTION_ARGS); + + +/* + * Function context for data persisting over repeated calls. + */ +typedef struct { + int buffer; + AttInMetadata *attinmeta; + BufferDesc *bufhdr; + RelFileNode rnode; + char*values[3]; +} dumpcache_fctx; + + +/* + * Return a tuple that has bufferid, relfilenoide, reltablespace and + * reldatabase OIDs. 
+ */
+Datum
+dump_cache(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	Datum		result;
+	MemoryContext oldcontext;
+	dumpcache_fctx *fctx;		/* User function context. */
+	TupleDesc	tupledesc;
+	HeapTuple	tuple;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/* Switch context when allocating stuff to be used in later calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* construct a tuple to return */
+		tupledesc = CreateTemplateTupleDesc(4, false);
+		TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+						   INT4OID, -1, 0);
+		TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace
Re: [PATCHES] [HACKERS] WAL: O_DIRECT and multipage-writer (+ memory leak)
On Thu, Feb 03, 2005 at 07:25:55PM +0900, ITAGAKI Takahiro wrote:
> Hello everyone. I fixed two bugs in the patch that I sent before.
> Check and test new one, please.

Ok, finally got back into the office and was able to run 1 set of tests.

So the new baseline result with 8.0.1:
	http://www.osdl.org/projects/dbt2dev/results/dev4-010/309/
	Throughput: 3639.97

Results with the patch but open_direct not set:
	http://www.osdl.org/projects/dbt2dev/results/dev4-010/308/
	Throughput: 3494.72

Results with the patch and open_direct set:
	http://www.osdl.org/projects/dbt2dev/results/dev4-010/312/
	Throughput: 3489.69

You can verify that wal_sync_method is set to open_direct under the database parameters link, but I'm wondering if I missed something. It looks a little odd that the performance dropped.

Mark

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PATCHES] WIP: bufmgr rewrite per recent discussions
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
Sent: 17 February 2005 15:46
To: Mark Cave-Ayland
Cc: pgsql-patches@postgresql.org
Subject: Re: [PATCHES] WIP: bufmgr rewrite per recent discussions

(cut)

> > 3. Pad the LWLock struct (in src/backend/storage/lmgr/lwlock.c) to
> > some power of 2 up to 128 bytes --- same issue of space wasted
> > versus cross-lock contention.
>
> Having seen the results above, is it still worth looking at this?

Yeah, probably, because there are other possible contention sources besides buffers that might be alleviated by padding LWLocks. For instance the buffer manager global locks and the LockMgrLock are probably all in the same cache line at the moment.

Hi Tom,

Here are the results from the LWLock test. Firstly, here are the results with your second patch with no modifications, as a refresher:

PATCH #2, no modifications

	10001 10
	204.909702  205.01051   345.098727  345.411606  375.812059  376.37741
	195.100496  195.197463  348.791481  349.111363  314.718619  315.139878
	199.637965  199.735195  313.561366  313.803225  365.061177  365.666103
	195.935529  196.029082  325.893744  326.171754  370.040623  370.625072
	196.661374  196.756481  314.468751  314.711517  319.643145  320.099164
	Mean:
	198.4490132 198.5457462 329.5628138 329.841893  349.0551246 349.5815254

Here are the results with ALIGNOF_BUFFER = 128 and LWLock padded to 64 bytes:

PATCH #2 with ALIGNOF_BUFFER = 128 and LWLock padded to 64 bytes

	10001 10
	199.672932  199.768756  307.051571  307.299088  367.394745  368.016266
	196.443585  196.532912  344.898219  345.204228  375.300921  375.979186
	191.098411  191.185807  329.485633  329.77679   305.413304  305.841889
	201.110132  201.210049  314.219784  314.476356  314.03306   314.477869
	196.615748  196.706032  337.315295  337.62437   370.537538  371.16593
	Mean:
	196.9881616 197.0807112 326.5941004 326.8761664 346.5359136 347.096228

And finally here are the results with ALIGNOF_BUFFER = 128 and LWLock padded to 128 bytes:

PATCH #2 with ALIGNOF_BUFFER = 128 and LWLock padded to 128 bytes

	10001 10
	195.357405  195.449704  346.916069  347.235895  373.354775  373.934842
	190.428061  190.515077  323.932436  324.211975  361.908206  362.476886
	206.059573  206.159472  338.288825  338.590642  306.22198   306.618689
	195.336711  195.427757  309.316534  309.56603   305.295391  305.695336
	188.896205  188.983969  322.889651  323.245394  377.673313  378.269907
	Mean:
	195.215591  195.3071958 328.268703  328.5699872 344.890733  345.399132

So again I don't see any performance improvement. However, I did manage to find out what was causing the 'freezing' I mentioned in my earlier email. By temporarily setting fsync=false in postgresql.conf, the freezing goes away, so I'm guessing it's something to do with disk/kernel caches and buffering. Since the drives are software RAID 1 with ext3, I guess that the server is running I/O bound under load, which is perhaps why padding the data structures doesn't seem to make much difference. I'm not sure whether this makes the test results particularly useful though :(

Kind regards,

Mark.

WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk

---(end of broadcast)---
TIP 8: explain analyze is your friend