Re: [HACKERS] double writes using double-write buffer approach [WIP]
Dan, I believe your double-write buffer approach is right, as it has the potential to avoid the latency backends incur doing full-page writes after a checkpoint. There is a chance that overall I/O will be higher in this case, but if we can make sure that in most scenarios a backend never has to do the I/O itself, it can show a performance improvement compared to full-page writes.

-Original Message-
From: Dan Scales [mailto:sca...@vmware.com]
Sent: Thursday, February 09, 2012 5:30 AM
To: Amit Kapila
Cc: PG Hackers
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

Is there any problem if the double write happens only via the bgwriter or checkpointer? Something like: whenever a backend process has to evict a buffer, it does just what you have described and writes into the double-write buffer, but the bgwriter checks this buffer and flushes from it. Also, whenever a backend sees that the double-write buffer is more than 2/3rds full (or some threshold), it tells the bgwriter to flush from the double-write buffer. This can ensure very little I/O by any backend.

Yes, I think this is a good idea. I could make changes so that the backends hand off the responsibility to flush batches of the double-write buffer to the bgwriter whenever possible. This would avoid some long I/O waits in the backends, though the backends may of course eventually wait anyway for the bgwriter if I/O is not fast enough. I did write the code so that any process can write a completed batch if the batch is not currently being flushed (so as to deal with crashes by backends). Having the backends flush the batches as they fill them up was just simpler for a first prototype.
Dan

- Original Message -
From: Amit Kapila amit.kap...@huawei.com
To: Dan Scales sca...@vmware.com, PG Hackers pgsql-hackers@postgresql.org
Sent: Tuesday, February 7, 2012 1:08:49 AM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

I think it is a good idea, and can help double-writes perform better in the case of lots of backend evictions.

I don't understand this point, because from the data in your mail it appears that when shared buffers are smaller (that is, when more evictions can happen), the performance is lower. ISTM that the performance is lower when the shared buffers size is small because the I/O might be done by the backend process itself, which can degrade performance.

Is there any problem if the double write happens only via the bgwriter or checkpointer? Something like: whenever a backend process has to evict a buffer, it does just what you have described and writes into the double-write buffer, but the bgwriter checks this buffer and flushes from it. Also, whenever a backend sees that the double-write buffer is more than 2/3rds full (or some threshold), it tells the bgwriter to flush from the double-write buffer. This can ensure very little I/O by any backend.

-Original Message-
From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dan Scales
Sent: Saturday, January 28, 2012 4:02 AM
To: PG Hackers
Subject: [HACKERS] double writes using double-write buffer approach [WIP]

I've been prototyping the double-write buffer idea that Heikki and Simon had proposed (as an alternative to a previous patch that only batched up writes by the checkpointer). I think it is a good idea, and can help double-writes perform better in the case of lots of backend evictions. It also centralizes most of the code change in smgr.c. However, it is trickier to reason about.

The idea is that all page writes generally are copied to a double-write buffer, rather than being immediately written.
Note that a full copy of the page is required, but it can be folded in with a checksum calculation. Periodically (e.g. every time a certain-size batch of writes has been added), some writes are pushed out using double writes: the pages are first written and fsynced to a double-write file, then written to the data files, which are then fsynced. Double writes then allow torn pages to be fixed, so full_page_writes can be turned off (thus greatly reducing the size of the WAL log).

The key changes are conceptually simple:

1. In smgrwrite(), copy the page to the double-write buffer. If a big enough batch has accumulated, then flush the batch using double writes. [I don't think I need to intercept calls to smgrextend(), but I am not totally sure.]

2. In smgrread(), always look first in the double-write buffer for a particular page, before going to disk.

3. At the end of a checkpoint and on shutdown, always make sure that the current contents of the double-write buffer are flushed.

4. Pass flags around in some cases to indicate whether a page buffer needs a double write or not. (I think eventually this would be an attribute of the buffer, set when the page is WAL-logged, rather than a flag passed around.)

5.
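The write-then-fsync ordering described above can be sketched in miniature as follows. This is an illustrative sketch only, not code from the patch: the two-file layout, the dw_entry/dw_flush_batch/dw_read_block names, and the fixed 8 KB page size are assumptions made for the example.

```c
/* Sketch of the double-write ordering: a batch of dirty pages is first
 * written and fsynced to a double-write file; only then are the pages
 * written to their home blocks in the data file, which is fsynced in
 * turn.  A torn page in the data file can later be repaired from the
 * complete, already-fsynced copy in the double-write file. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>		/* fsync */
#include <assert.h>

#define PAGE_SZ 8192

struct dw_entry
{
	long	blockno;		/* target block in the data file */
	char	page[PAGE_SZ];	/* full page image; checksummed in the real patch */
};

/* Flush one batch using double writes; returns 0 on success. */
static int
dw_flush_batch(const char *dwpath, const char *datapath,
			   const struct dw_entry *batch, int n)
{
	/* Step 1: write the whole batch to the double-write file and fsync it. */
	FILE	   *dw = fopen(dwpath, "wb");

	if (dw == NULL)
		return -1;
	for (int i = 0; i < n; i++)
	{
		fwrite(&batch[i].blockno, sizeof(long), 1, dw);
		fwrite(batch[i].page, 1, PAGE_SZ, dw);
	}
	fflush(dw);
	fsync(fileno(dw));
	fclose(dw);

	/* Step 2: only now write the pages in place, then fsync the data file. */
	FILE	   *data = fopen(datapath, "r+b");

	if (data == NULL)
		data = fopen(datapath, "w+b");
	if (data == NULL)
		return -1;
	for (int i = 0; i < n; i++)
	{
		fseek(data, batch[i].blockno * PAGE_SZ, SEEK_SET);
		fwrite(batch[i].page, 1, PAGE_SZ, data);
	}
	fflush(data);
	fsync(fileno(data));
	fclose(data);
	return 0;
}

/* Read one block back from the data file (to check the result). */
static int
dw_read_block(const char *datapath, long blockno, char *out)
{
	FILE	   *f = fopen(datapath, "rb");
	size_t		got;

	if (f == NULL)
		return -1;
	fseek(f, blockno * PAGE_SZ, SEEK_SET);
	got = fread(out, 1, PAGE_SZ, f);
	fclose(f);
	return got == PAGE_SZ ? 0 : -1;
}
```

Recovery (not shown) would scan the double-write file and rewrite any data-file pages that fail their checksum, which is what makes it safe to turn full_page_writes off.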
Re: [HACKERS] Checkpoint sync pause
Without sorted checkpoints (or some other fancier method), you have to write out the entire pool before you can do any fsyncs; or you have to do multiple fsyncs of the same file, with at least one occurring after the entire pool has been written. With a sorted checkpoint, you can start issuing once-only fsyncs very early in the checkpoint process. I think that on large servers that would be the main benefit, not actually more efficient I/O. (On small servers I've seen sorted checkpoints be much faster on shutdown checkpoints, but not on natural checkpoints, and presumably this improvement *is* due to better ordering.)

On your servers, you need big delays between fsyncs and not between writes (as they are buffered until the fsync). But in other situations, people need the delays between the writes. By using sorted checkpoints with fsyncs between each file, the delays between writes are naturally delays between fsyncs as well. So I think the benefit of using sorted checkpoints is that code to improve your situation is less likely to degrade someone else's situation, without having to introduce an extra layer of tunables.

What I understood is that you are suggesting it is better to do sorted checkpoints, which essentially means flushing nearby buffers together. However, if done this way, it might violate an Oracle patent (20050044311 - Reducing disk IO by full-cache write-merging). I am not very sure about it, but you may want to look at it.

I think the linked list is a bit of a red herring. Many of the concepts people discuss implementing on the linked list could just as easily be implemented with the clock sweep, and I've seen no evidence at all that the clock sweep is the problem. The LWLock that protects it can obviously be a problem, but that seems to be due to the overhead of acquiring a contended lock, not the work done under the lock.
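The per-file fsync schedule described above can be sketched as below: sort the dirty-buffer tags by file and block, and issue each file's single fsync as soon as its last write has gone out, instead of writing the whole pool and fsyncing everything at the end. The buf_tag structure and function names here are invented for the illustration; they are not the actual buffer-manager types.

```c
/* Sketch: sort dirty-buffer tags so each file's blocks are written
 * together, and one fsync per file can be issued as soon as that file's
 * last block has been handed to the kernel. */
#include <stdlib.h>
#include <assert.h>

struct buf_tag
{
	int		fileno;			/* which relation segment file */
	long	blockno;		/* block within that file */
};

static int
buf_tag_cmp(const void *a, const void *b)
{
	const struct buf_tag *x = a;
	const struct buf_tag *y = b;

	if (x->fileno != y->fileno)
		return (x->fileno > y->fileno) - (x->fileno < y->fileno);
	return (x->blockno > y->blockno) - (x->blockno < y->blockno);
}

/* Sort the tags in place and return how many fsyncs a per-file schedule
 * issues: exactly one per distinct file, each issued "early", right
 * after the last write to that file, rather than all at the very end. */
static int
sorted_checkpoint_fsyncs(struct buf_tag *tags, int n)
{
	int		fsyncs = 0;

	qsort(tags, n, sizeof(struct buf_tag), buf_tag_cmp);
	for (int i = 0; i < n; i++)
	{
		/* write_block(tags[i]) would go here */
		if (i == n - 1 || tags[i].fileno != tags[i + 1].fileno)
			fsyncs++;	/* last block of this file: fsync it now */
	}
	return fsyncs;
}
```

The point of the ordering is visible in the return value: the number of fsyncs equals the number of distinct files, and each one becomes available long before the checkpoint's final write.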
Reducing the lock strength around this might be a good idea, but that reduction could be done just as easily (and as far as I can tell, more easily) with the clock sweep as with the linked list.

With the clock sweep, there are many chances that a backend needs to traverse further to find a suitable buffer. However, if a clean buffer is put in the freelist, it can be picked directly from there.

-Original Message-
From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Jeff Janes
Sent: Monday, February 13, 2012 12:14 AM
To: Greg Smith
Cc: Robert Haas; PostgreSQL-development
Subject: Re: [HACKERS] Checkpoint sync pause

On Tue, Feb 7, 2012 at 1:22 PM, Greg Smith gsm...@gregsmith.com wrote:

On 02/03/2012 11:41 PM, Jeff Janes wrote:

-The steady stream of backend writes that happen between checkpoints have filled up most of the OS write cache. A look at /proc/meminfo shows around 2.5GB Dirty:

backend writes includes bgwriter writes, right?

Right.

Has using a newer kernel with dirty_background_bytes been tried, so it could be set to a lower level? If so, how did it do? Or does it just refuse to obey below the 5% level, as well?

Trying to dip below 5% using dirty_background_bytes slows VACUUM down faster than it improves checkpoint latency.

Does it cause VACUUM to create latency for other processes (like the checkpoint syncs do, by gumming up the I/O for everyone), or does VACUUM just slow down without affecting other tasks? It seems to me that just lowering dirty_background_bytes (while not also lowering dirty_bytes) should not cause the latter to happen, but it seems like these kernel tunables never do exactly what they advertise.

This may not be relevant to the current situation, but I wonder if we don't need a vacuum_cost_page_dirty_seq so that if the pages we are dirtying are consecutive (or at least closely spaced) they cost less, in anticipation that the eventual writes will be combined and thus consume less I/O resources.
I would think it would be common for some regions of a table to be intensely dirtied, and some to be lightly dirtied (but still aggregating up to a considerable amount of random I/O). But the vacuum process might also need to be made more bursty, as even if it generates sequential dirty pages, the I/O system might write them randomly anyway if there are too many delays interspersed.

Since the sort of servers that have checkpoint issues are quite often ones that have VACUUM issues too, that whole path doesn't seem very productive. The one test I haven't tried yet is whether increasing the size of the VACUUM ring buffer might improve how well the server responds to a lower write cache.

I wouldn't expect this to help. It seems like it would hurt, as it just leaves the data for even longer (however long it takes to circumnavigate the ring buffer) before there is any possibility of it getting written. I guess it does increase the chances that the dirty pages will accidentally get written by the bgwriter rather than the vacuum itself, but I doubt that that would be
Re: [HACKERS] Optimize referential integrity checks (todo item)
On Sun, Feb 12, 2012 at 7:36 AM, Vik Reykja vikrey...@gmail.com wrote:

I decided to take a crack at the todo item created from the following post: http://archives.postgresql.org/pgsql-performance/2005-10/msg00458.php The attached patch makes the desired changes in both code and function naming. It seemed quite easy to do but wasn't marked as easy on the todo, so I'm wondering if I've missed something. All regression tests pass.

The patch was not applying; I was seeing the message below:

postgresql$ git apply /Downloads/unchanged.patch
error: src/backend/utils/adt/ri_triggers.c: already exists in working directory

I have come up with the attached patch, which hopefully has not missed any of your changes. Please verify them.

Regards, Chetan

PS: I would like the patch name to be something meaningful.

-- EnterpriseDB Corporation The Enterprise PostgreSQL Company Website: www.enterprisedb.com EnterpriseDB Blog : http://blogs.enterprisedb.com Follow us on Twitter : http://www.twitter.com/enterprisedb

diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 03a974a..09bacb7 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -205,11 +205,11 @@ static void ri_BuildQueryKeyFull(RI_QueryKey *key,
 static void ri_BuildQueryKeyPkCheck(RI_QueryKey *key,
 	const RI_ConstraintInfo *riinfo,
 	int32 constr_queryno);
-static bool ri_KeysEqual(Relation rel, HeapTuple oldtup, HeapTuple newtup,
+static bool ri_KeysUnchanged(Relation rel, HeapTuple oldtup, HeapTuple newtup,
 	const RI_ConstraintInfo *riinfo, bool rel_is_pk);
-static bool ri_AllKeysUnequal(Relation rel, HeapTuple oldtup, HeapTuple newtup,
+static bool ri_AllKeysChanged(Relation rel, HeapTuple oldtup, HeapTuple newtup,
 	const RI_ConstraintInfo *riinfo, bool rel_is_pk);
-static bool ri_OneKeyEqual(Relation rel, int column,
+static bool ri_OneKeyUnchanged(Relation rel, int column,
 	HeapTuple oldtup, HeapTuple newtup,
 	const RI_ConstraintInfo *riinfo,
 	bool rel_is_pk);
 static bool ri_AttributesEqual(Oid eq_opr, Oid typeid,
@@ -932,9 +932,9 @@ RI_FKey_noaction_upd(PG_FUNCTION_ARGS)
 	}

 	/*
-	 * No need to check anything if old and new keys are equal
+	 * No need to check anything if old and new keys are unchanged
 	 */
-	if (ri_KeysEqual(pk_rel, old_row, new_row, riinfo, true))
+	if (ri_KeysUnchanged(pk_rel, old_row, new_row, riinfo, true))
 	{
 		heap_close(fk_rel, RowShareLock);
 		return PointerGetDatum(NULL);
@@ -1281,9 +1281,9 @@ RI_FKey_cascade_upd(PG_FUNCTION_ARGS)
 	}

 	/*
-	 * No need to do anything if old and new keys are equal
+	 * No need to do anything if old and new keys are unchanged
 	 */
-	if (ri_KeysEqual(pk_rel, old_row, new_row, riinfo, true))
+	if (ri_KeysUnchanged(pk_rel, old_row, new_row, riinfo, true))
 	{
 		heap_close(fk_rel, RowExclusiveLock);
 		return PointerGetDatum(NULL);
@@ -1646,9 +1646,9 @@ RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
 	}

 	/*
-	 * No need to check anything if old and new keys are equal
+	 * No need to check anything if old and new keys are unchanged
 	 */
-	if (ri_KeysEqual(pk_rel, old_row, new_row, riinfo, true))
+	if (ri_KeysUnchanged(pk_rel, old_row, new_row, riinfo, true))
 	{
 		heap_close(fk_rel, RowShareLock);
 		return PointerGetDatum(NULL);
@@ -1993,9 +1993,9 @@ RI_FKey_setnull_upd(PG_FUNCTION_ARGS)
 	}

 	/*
-	 * No need to do anything if old and new keys are equal
+	 * No need to do anything if old and new keys are unchanged
 	 */
-	if (ri_KeysEqual(pk_rel, old_row, new_row, riinfo, true))
+	if (ri_KeysUnchanged(pk_rel, old_row, new_row, riinfo, true))
 	{
 		heap_close(fk_rel, RowExclusiveLock);
 		return PointerGetDatum(NULL);
@@ -2012,13 +2012,10 @@ RI_FKey_setnull_upd(PG_FUNCTION_ARGS)
 	 * our cached plan, unless the update happens to change all
 	 * columns in the key. Fortunately, for the most common case of a
 	 * single-column foreign key, this will be true.
-	 *
-	 * In case you're wondering, the inequality check works because we
-	 * know that the old key value has no NULLs (see above).
 	 */
 	use_cached_query = (riinfo.confmatchtype == FKCONSTR_MATCH_FULL) ||
-		ri_AllKeysUnequal(pk_rel, old_row, new_row,
+		ri_AllKeysChanged(pk_rel, old_row, new_row,
 			riinfo, true);

 	/*
@@ -2064,7 +2061,7 @@ RI_FKey_setnull_upd(PG_FUNCTION_ARGS)
 	 * to changed columns in pk_rel's key
 	 */
 	if (riinfo.confmatchtype == FKCONSTR_MATCH_FULL ||
-		!ri_OneKeyEqual(pk_rel, i, old_row, new_row,
+		!ri_OneKeyUnchanged(pk_rel, i, old_row, new_row,
 			riinfo, true))
 	{
 		appendStringInfo(querybuf,
@@ -2389,9 +2386,9 @@ RI_FKey_setdefault_upd(PG_FUNCTION_ARGS)
 	}

 	/*
-	 * No need to do anything if old and new keys are equal
+	 * No need to do anything if old and new keys are unchanged
 	 */
-	if
Re: [HACKERS] bitfield and gcc
On Sat, Feb 11, 2012 at 01:54, Gaetano Mendola mend...@gmail.com wrote:

I wonder if somewhere in the Postgres source we are relying on the GCC correct behaviour regarding the read-modify-write of bitfields in structures.

Probably not. I'm pretty sure that we don't have any bitfields, since not all compilers are happy with them. And it looks like this behavior doesn't affect other kinds of struct fields.

It sounds like the GCC guys are saying that it's theoretically possible that the compiler will generate 64-bit read-modify-writes regardless of the struct member types. In this light, PostgreSQL code is not correct: our slock_t uses a char type on i386/AMD64/SPARC and a 32-bit int on IA-64/PPC64, and there are plenty of places where it's adjacent to other small fields. However, I don't think the latter is a problem with any compilers in practice, as that would break a lot more code than just btrfs and Postgres.

Regards,
Marti

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
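The hazard being discussed can be made concrete with a small, single-threaded simulation. The struct layout below mimics a one-byte slock_t adjacent to another small field; the wide load/store is written out explicitly with memcpy to stand in for what a compiler-generated 64-bit read-modify-write would do. This is an illustration of the failure mode, not PostgreSQL code.

```c
/* Simulate a compiler turning "s->flags = x" into a 64-bit
 * read-modify-write of the whole word containing both fields. */
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <stddef.h>

struct slot
{
	char	lock;		/* one-byte spinlock, as slock_t on i386/AMD64 */
	char	flags;		/* adjacent small field */
	char	pad[6];		/* fill out one 8-byte word */
};

/* A "concurrent" writer that takes the spinlock. */
static void
take_lock(struct slot *s)
{
	s->lock = 1;
}

/* Wide RMW of 'flags': load 8 bytes, patch one byte in the copy, store
 * all 8 bytes back.  Any store to 'lock' that lands between the wide
 * load and the wide store is silently overwritten. */
static void
wide_rmw_set_flags(struct slot *s, char newflags,
				   void (*interleave)(struct slot *))
{
	uint64_t	word;

	memcpy(&word, s, sizeof(word));		/* wide load of the whole struct */
	((char *) &word)[1] = newflags;		/* modify only 'flags' in the copy */
	if (interleave)
		interleave(s);					/* concurrent lock acquisition */
	memcpy(s, &word, sizeof(word));		/* wide store clobbers 'lock' too */
}
```

A byte-sized store (the flags assignment compiled as a one-byte write) would leave lock untouched; the concern in the thread is that the compiler could legally widen the access.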
Re: [HACKERS] CUDA Sorting
2012/2/13 Greg Smith g...@2ndquadrant.com:

On 02/11/2012 08:14 PM, Gaetano Mendola wrote:

The trend is to have servers capable of running CUDA providing GPUs via external hardware (PCI Express interface with PCI Express switches); look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL.

The C410X adds 16 PCIe slots to a server, housed inside a separate 3U enclosure. That's a completely sensible purchase if your goal is to build a computing cluster, where a lot of work is handed off to a set of GPUs. I think that's even less likely to be a cost-effective option for a database server. Adding a single dedicated GPU installed in a server to accelerate sorting is something that might be justifiable, based on your benchmarks. This is a much more expensive option than that though. Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants to see just how big this external box is.

I did some experiments timing the sort done with CUDA and the sort done with pg_qsort:

                      CUDA       pg_qsort
33 million integers:  ~ 900 ms   ~ 6000 ms
1 million integers:   ~ 21 ms    ~ 162 ms
100k integers:        ~ 2 ms     ~ 13 ms

The CUDA time already includes the copy operations (host-device, device-host). As the GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R) Xeon(R) CPU X5650 @ 2.67GHz.

That's really interesting, and the X5650 is by no means a slow CPU. So this benchmark is providing a lot of CPU power yet still seeing over a 6X speedup in sort times. It sounds like the PCI Express bus has gotten fast enough that the time to hand data over and get it back again can easily be justified for medium to large sized sorts.

It would be helpful to take this patch and confirm whether it scales when used in parallel. The easiest way to do that would be to use the pgbench -f feature, which allows running an arbitrary number of some query at once. Seeing whether this acceleration continues to hold as the number of clients increases is a useful data point.
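For anyone wanting to reproduce the CPU side of the timings above, a baseline harness along these lines is easy to build. Note the assumptions: it uses libc qsort rather than pg_qsort (broadly comparable on random integers, but not identical), the function names are invented for the example, and clock() measures CPU time rather than wall-clock time, which is adequate for a single-threaded sort.

```c
/* Time a libc qsort of n pseudo-random integers, as a stand-in for the
 * pg_qsort column of the timings above. */
#include <stdlib.h>
#include <time.h>
#include <assert.h>

static int
int_cmp(const void *a, const void *b)
{
	int x = *(const int *) a;
	int y = *(const int *) b;

	return (x > y) - (x < y);
}

/* Fill buf with n pseudo-random ints and sort it; returns the CPU
 * milliseconds spent inside qsort only (setup excluded). */
static double
sort_random_ints_ms(int *buf, int n, unsigned seed)
{
	srand(seed);
	for (int i = 0; i < n; i++)
		buf[i] = rand();

	clock_t t0 = clock();
	qsort(buf, n, sizeof(int), int_cmp);
	return (double) (clock() - t0) * 1000.0 / CLOCKS_PER_SEC;
}
```

Running this at the 100k / 1M / 33M sizes from the table gives a CPU-only reference point against which a GPU sort (including its host-device transfer time) can be compared.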
Is it possible for you to break down where the time is being spent? For example, how much of this time is consumed in the GPU itself, compared to time spent transferring data between CPU and GPU? I'm also curious where the bottleneck is with this approach. If it's the speed of the PCI-E bus for smaller data sets, adding more GPUs may never be practical. If the bus can handle quite a few of these at once before it saturates, it might be possible to overload a single GPU. That seems like it would be really hard to reach for database sorting, though; I can't really justify my gut feeling that this is true, though.

I've never seen a PostgreSQL server capable of running CUDA, and I don't expect that to change.

That sounds like: "I think there is a world market for maybe five computers." - IBM Chairman Thomas Watson, 1943

Yes, and "640K will be enough for everyone", ha ha. (Having said the 640K thing is flatly denied by Gates, BTW, and no one has come up with proof otherwise.)

I think you've made an interesting case for this sort of acceleration now being useful for systems doing what's typically considered a data warehouse task. I regularly see servers waiting for far more than 13M integers to sort. And I am seeing a clear trend toward providing more PCI-E slots in servers now. Dell's R810 is the most popular single server model my customers have deployed in the last year, and it has 5 X8 slots in it. It's rare that all 5 of those are filled. As long as a dedicated GPU works fine when dropped to X8 speeds, I know a fair number of systems where one could be added now.

There's another data point in your favor I didn't notice before your last e-mail. Amazon has a "Cluster GPU Quadruple Extra Large" node type that runs with NVIDIA Tesla hardware. That means the installed base of people who could consider CUDA is higher than I expected.
To demonstrate how much that costs: provisioning a GPU-enabled reserved instance from Amazon for one year costs $2410 at Light Utilization, giving a system with 22GB of RAM and 1.69TB of storage. (I find the reserved prices easier to compare with dedicated hardware than the hourly ones.) That's halfway between the High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at $2200. If someone could prove sorting was a bottleneck on their server, that isn't an unreasonable option to consider for a cloud-based database deployment.

I still think that an approach based on OpenCL is more likely to be suitable for PostgreSQL, which was part of why I gave CUDA low odds here. The points in favor of OpenCL are:

-Since you last posted, OpenCL compiling has switched to using LLVM as their standard compiler. Good PostgreSQL support for LLVM isn't far away. It looks to me like the compiler situation
Re: [HACKERS] Finer Extension dependencies
Hi,

Sorry for the delays, I'm back on PostgreSQL related work again.

Hitoshi Harada umi.tan...@gmail.com writes:

I just tried DROP EXTENSION now, and found it broken :(

Please find v2 of the patch. I did change the dependency management between the simple cases and the more challenging ones, and forgot that I had to retest it all in between, which is what happens on a tight schedule and when working at night, I guess. So the best option I've found here had me add a new function in pg_depend.c; it's working as intended now.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

diff --git a/contrib/pg_upgrade_support/pg_upgrade_support.c b/contrib/pg_upgrade_support/pg_upgrade_support.c
index 472f152..b5633d4 100644
--- a/contrib/pg_upgrade_support/pg_upgrade_support.c
+++ b/contrib/pg_upgrade_support/pg_upgrade_support.c
@@ -151,6 +151,7 @@ create_empty_extension(PG_FUNCTION_ARGS)
 	Datum		extConfig;
 	Datum		extCondition;
 	List	   *requiredExtensions;
+	List	   *features = NIL;	/* FIXME, get features from catalogs */

 	if (PG_ARGISNULL(4))
 		extConfig = PointerGetDatum(NULL);
@@ -190,7 +191,8 @@ create_empty_extension(PG_FUNCTION_ARGS)
 		text_to_cstring(extVersion),
 		extConfig,
 		extCondition,
-		requiredExtensions);
+		requiredExtensions,
+		features);

 	PG_RETURN_VOID();
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index ea98cb7..8090758 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -149,6 +149,11 @@
 /row

 row
+ entrylink linkend=catalog-pg-extensionstructnamepg_extension_feature/structname/link/entry
+ entryfeatures provided by installed extensions/entry
+ /row
+
+ row
 entrylink linkend=catalog-pg-foreign-data-wrapperstructnamepg_foreign_data_wrapper/structname/link/entry
 entryforeign-data wrapper definitions/entry
 /row
@@ -3058,6 +3063,51 @@
 /para
 /sect1

+ sect1 id=catalog-pg-extension-feature
+ titlestructnamepg_extension_feature/structname/title
+
+ indexterm zone=catalog-pg-extension-feature
+ primarypg_extension/primary
+ /indexterm
+
+ para
+ The catalog structnamepg_extension_feature/structname stores
+ information about the features provided by installed extensions.
+ See xref linkend=extend-extensions for details about extensions.
+ /para
+
+ table
+ titlestructnamepg_extension_features/ Columns/title
+
+ tgroup cols=4
+ thead
+ row
+ entryName/entry
+ entryType/entry
+ entryReferences/entry
+ entryDescription/entry
+ /row
+ /thead
+
+ tbody
+ row
+ entrystructfieldextoid/structfield/entry
+ entrytypeoid/type/entry
+ entryliterallink linkend=catalog-pg-extensionstructnamepg_extension/structname/link.oid/literal/entry
+ entryOid of the extension that provides this feature/entry
+ /row
+
+ row
+ entrystructfieldextfeature/structfield/entry
+ entrytypename/type/entry
+ entry/entry
+ entryName of the feature/entry
+ /row
+
+ /tbody
+ /tgroup
+ /table
+ /sect1

 sect1 id=catalog-pg-foreign-data-wrapper
 titlestructnamepg_foreign_data_wrapper/structname/title
@@ -6815,11 +6865,17 @@
 row
 entrystructfieldrequires/structfield/entry
 entrytypename[]/type/entry
- entryNames of prerequisite extensions,
+ entryNames of prerequisite features,
 or literalNULL/literal if none/entry
 /row

 row
+ entrystructfieldprovides/structfield/entry
+ entrytypename[]/type/entry
+ entryNames of provided features/entry
+ /row
+
+ row
 entrystructfieldcomment/structfield/entry
 entrytypetext/type/entry
 entryComment string from the extension's control file/entry
diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index 8d5b9d0..af5fc4c 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -463,9 +463,30 @@
 termvarnamerequires/varname (typestring/type)/term
 listitem
 para
-A list of names of extensions that this extension depends on,
-for example literalrequires = 'foo, bar'/literal. Those
-extensions must be installed before this one can be installed.
+A list of features that this extension depends on, for
+example literalrequires = 'foo, bar'/literal. Those features
+must be provided by an already installed extension before this one
+can be installed.
+ /para
+ /listitem
+ /varlistentry
+
+ varlistentry
+ termvarnameprovides/varname (typestring/type)/term
+ listitem
+ para
+A list of names of features that this extension provides, for
+example literalprovides = 'foo, extname_bugfix_12345'/literal.
+Those features can help providing finer dependencies: when updating
+an existing extension you can add new features in this list so that
+
Re: [HACKERS] ecpglib use PQconnectdbParams
Because connect_timeout is a separate libpq connection parameter, but now it's stuck into options. It might have worked more or less by accident before.

So it is not an option, right? But the old function accepted it as an option, it seems.

It's not clear to me why this only appears on checktcp. And why we don't run those tests by default. That should be clarified, because

This was decided when regression tests for ecpg were introduced, to not depend on the use of TCP ports. For details see this thread: http://archives.postgresql.org/pgsql-hackers/2006-08/msg00078.php and in particular these emails: http://archives.postgresql.org/pgsql-hackers/2006-08/msg00118.php http://archives.postgresql.org/pgsql-hackers/2006-08/msg00134.php

Michael
--
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at googlemail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
(2012/02/10 20:39), Shigeru Hanada wrote:

(2012/02/08 20:51), Shigeru Hanada wrote:

Attached revised patches. Changes from last version are below. snip

I've found and fixed a bug which generated a wrong remote query when any column of a foreign table had been dropped. A regression test for this case is also added. I attached only pgsql_fdw_v8.patch, because pgsql_fdw_pushdown_v3.patch from my last post can still be applied on top of the v8 patch. Regards,

The patches have been applied, but role-related regression tests failed in my environment. I fixed it in a similar fashion to /src/test/regress/sql/foreign_data.sql. Please find attached an updated patch for the regression tests.

BTW, what do you think about this? http://archives.postgresql.org/pgsql-hackers/2012-01/msg00229.php

Best regards, Etsuro Fujita

*** sql/pgsql_fdw.sql.orig	2012-02-13 19:52:08.0 +0900
--- sql/pgsql_fdw.sql	2012-02-13 19:44:17.0 +0900
***************
*** 2,7 ****
--- 2,19 ----
  -- create FDW objects
  -- ===
+ -- Clean up in case a prior regression run failed
+
+ -- Suppress NOTICE messages when roles don't exist
+ SET client_min_messages TO 'error';
+
+ DROP ROLE IF EXISTS pgsql_fdw_user;
+
+ RESET client_min_messages;
+
+ CREATE ROLE pgsql_fdw_user LOGIN SUPERUSER;
+ SET SESSION AUTHORIZATION 'pgsql_fdw_user';
+
  CREATE EXTENSION pgsql_fdw;
  CREATE SERVER loopback1 FOREIGN DATA WRAPPER pgsql_fdw;
***************
*** 168,173 ****
--- 180,186 ----
  EXPLAIN (COSTS false) SELECT * FROM ft1 t1 WHERE t1.c1 = abs(t1.c2);
  EXPLAIN (COSTS false) SELECT * FROM ft1 t1 WHERE t1.c1 = t1.c2;
  DROP OPERATOR === (int, int) CASCADE;
+ DROP OPERATOR !== (int, int) CASCADE;
  DROP FUNCTION pgsql_fdw_abs(int);
***************
*** 212,216 ****
  -- ===
  -- cleanup
  -- ===
  DROP EXTENSION pgsql_fdw CASCADE;
!
--- 225,231 ----
  -- ===
  -- cleanup
  -- ===
+ DROP SCHEMA S 1 CASCADE;
  DROP EXTENSION pgsql_fdw CASCADE;
! \c
! DROP ROLE pgsql_fdw_user;
*** expected/pgsql_fdw.out.orig	2012-02-13 19:52:03.0 +0900
--- expected/pgsql_fdw.out	2012-02-13 19:51:49.0 +0900
***************
*** 1,6 ****
--- 1,13 ----
  -- ===
  -- create FDW objects
  -- ===
+ -- Clean up in case a prior regression run failed
+ -- Suppress NOTICE messages when roles don't exist
+ SET client_min_messages TO 'error';
+ DROP ROLE IF EXISTS pgsql_fdw_user;
+ RESET client_min_messages;
+ CREATE ROLE pgsql_fdw_user LOGIN SUPERUSER;
+ SET SESSION AUTHORIZATION 'pgsql_fdw_user';
  CREATE EXTENSION pgsql_fdw;
  CREATE SERVER loopback1 FOREIGN DATA WRAPPER pgsql_fdw;
  CREATE SERVER loopback2 FOREIGN DATA WRAPPER pgsql_fdw
***************
*** 130,147 ****
  ALTER FOREIGN TABLE ft1 ALTER COLUMN c1 OPTIONS (colname 'C 1');
  ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (colname 'C 1');
  \dew+
! List of foreign-data wrappers
!    Name| Owner | Handler | Validator | Access privileges | FDW Options | Description
! ---+--+---+-+---+-+-
! pgsql_fdw | postgres | pgsql_fdw_handler | pgsql_fdw_validator | | |
  (1 row)

  \des+
!  List of foreign servers
!    Name| Owner | Foreign-data wrapper | Access privileges | Type | Version | FDW Options | Description
! ---+--+--+---+--+-+-+-
! loopback1 | postgres | pgsql_fdw| | | |
Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases
Re: Alex Hunsaker 2012-02-10 cafapbrr9y1fu6gpvu+8ta8vty6qvcm3dfarkt8jg_ehgetx...@mail.gmail.com

Does the attached fix the issue for you?

Yes. :)

Christoph
--
c...@df7cb.de | http://www.df7cb.de/
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
On Thu, Feb 9, 2012 at 3:37 PM, Jay Levitt jay.lev...@gmail.com wrote:

So my pre-built 9.1.2 takes 434s, my source-built 9.2 takes 509s, and (probably both of our) 9.1-HEAD takes 1918s... is that something to worry about, and if so, are there any tests I can run to assist? That bug doesn't affect me personally, but y'know, community and all that. Also, I wonder if it's something like "9.2 got way faster doing X, but meanwhile, HEAD got way slower doing Y", and this is a canary in the coal mine.

This might be a lame hypothesis, but... is it possible that you built your 9.1-tip binaries with --enable-cassert? Or with different optimization options? There's been some work done on GiST in 9.2, which, as Alexander Korotkov (who did the work) mentioned upthread, might have some issue. But I can't see how there can be a 4x regression between minor releases, though maybe it wouldn't hurt to test.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Removing special case OID generation
On Sat, Feb 11, 2012 at 4:23 AM, Simon Riggs si...@2ndquadrant.com wrote:

Yeh, I was thinking we would do well to implement cached sequences for, say, the first 1000 sequences.

Another option might be to store all the sequences for a particular database in a single underlying data file. The current implementation uses a whole page for a single tuple that is presumably much smaller than that. So when you create a sequence foo, it's really creating a row in some new system catalog pg_sequences, or something like that.

Idea would be to make Sequences as fast as OIDs and get rid of the weird OID code.

Honestly, I think the biggest hassle of the OID code is not so much the way they're generated as the way they're stored within heap tuples. I've wondered whether we should go through the system catalogs and replace all of the hidden OID columns with a normal column called oid of type OID, always placing it at attnum = 1, since we have a lot of code that assumes the OID column always has the same attnum. That would be a fairly large notational change, but maybe it wouldn't break anything /too/ badly... Anyway, if we could get away with that, we could eventually (after N releases) drop the special-case support for system OID columns, which would be a nice simplification.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] initdb and fsync
On Fri, Feb 10, 2012 at 3:57 PM, Peter Eisentraut pete...@gmx.net wrote: On sön, 2012-02-05 at 10:53 -0800, Jeff Davis wrote: initdb should do these syncs by default and offer an option to disable them. For test frameworks that run initdb often, that makes sense. But for developers, it doesn't make sense to spend 0.5s typing an option that saves you 0.3s. So, we'd need some more convenient way to choose the no-fsync option, like an environment variable that developers can set. Or maybe developers don't care about 0.3s? You can use https://launchpad.net/libeatmydata for those cases. That's hilarious. But a command-line option seems more convenient. It also seems entirely sufficient. The comments above suggest that it would take too long to type the option, but any PG developers who are worried about the speed difference surely know how to create shell aliases, shell functions, shell scripts, ... and if anyone's really concerned about it, we can provide a short form for the option. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
On Mon, Feb 13, 2012 at 7:45 AM, Robert Haas robertmh...@gmail.com wrote: On Thu, Feb 9, 2012 at 3:37 PM, Jay Levitt jay.lev...@gmail.com wrote: So my pre-built 9.1.2 takes 434s, my source-built 9.2 takes 509s, and (probably both of our) 9.1-HEAD takes 1918s... is that something to worry about, and if so, are there any tests I can run to assist? That bug doesn't affect me personally, but y'know, community and all that. Also, I wonder if it's something like "9.2 got way faster doing X, but meanwhile, HEAD got way slower doing Y", and this is a canary in the coal mine. This might be a lame hypothesis, but... is it possible that you built your 9.1-tip binaries with --enable-cassert? Or with different optimization options? There's been some work done on GiST in 9.2, which, as Alexander Korotkov (who did the work) mentioned upthread, might have some issue. But I can't see how there could be a 4x regression between minor releases, though maybe it wouldn't hurt to test. So I tested. On my MacBook Pro, your test script builds the index in just over 25 s on both REL9_1_2 and this morning's REL9_1_STABLE. This is with the following non-default configuration settings:

shared_buffers = 400MB
maintenance_work_mem = 1GB
checkpoint_segments = 30
checkpoint_timeout = 10min
checkpoint_completion_target = 0.9
checkpoint_warning = 60s

I then tested with master, which also showed similar performance. Based on this comment from your original email: [***] never completed after 10-20 minutes; nothing in server.log at default logging levels, postgres process consuming about 1 CPU in IOWAIT, checkpoints every 7-8 seconds ...I wonder if you have left checkpoint_segments set to the default value of 3, which would account for the very frequent checkpoints. At any rate, I can't measure a difference between the branches on this test. That doesn't mean there isn't one, but in my test setup I'm not seeing it. 
As an afterthought, I also retested with wal_level=archive added to the config, but I still don't see any significant difference between 9.1.2, 9.1-stable, and 9.2-devel. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Removing special case OID generation
On Mon, Feb 13, 2012 at 15:08, Robert Haas robertmh...@gmail.com wrote: Honestly, I think the biggest hassle of the OID code is not so much the way they're generated as the way they're stored within heap tuples. I've wondered whether we should go through the system catalogs and replace all of the hidden OID columns with a normal column called oid of type OID Do we have a clear idea about what to do with user tables that are created WITH OIDS? Do we care about compatibility with that at all? Do we generate this explicit oid column manually, or do we just tell users to use a serial or global sequence instead? Personally I'd also like to see us get rid of the default_with_oids setting -- I assume the existence of that is the reason why pgAdmin and TOAD still generate table DDL with an explicit WITH (OIDS=FALSE). Regards, Marti
Re: [HACKERS] Removing special case OID generation
On Mon, Feb 13, 2012 at 8:51 AM, Marti Raudsepp ma...@juffo.org wrote: On Mon, Feb 13, 2012 at 15:08, Robert Haas robertmh...@gmail.com wrote: Honestly, I think the biggest hassle of the OID code is not so much the way they're generated as the way they're stored within heap tuples. I've wondered whether we should go through the system catalogs and replace all of the hidden OID columns with a normal column called oid of type OID Do we have a clear idea about what to do with user tables that are created WITH OIDS? Do we care about compatibility with that at all? I think it would be fine to eventually drop support for user tables with OIDs. That hasn't been enabled by default for a very long time:

commit 7ce9b7c0d8c8dbefc04978765422f760dcf3788c
Author: Bruce Momjian br...@momjian.us
Date: Mon Dec 1 22:08:02 2003 +

    This patch adds a new GUC var, default_with_oids, which follows the proposal for eventually deprecating OIDs on user tables that I posted earlier to pgsql-hackers. pg_dump now always specifies WITH OIDS or WITHOUT OIDS when dumping a table. The documentation has been updated. Neil Conway

I think there's not much benefit to deprecating that feature as long as we need system OID columns in the catalogs. But if we got rid of them there, then I think we could drop support in userland, too. Do we generate this explicit oid column manually or do we just tell users to use a serial or global sequence instead? Personally I'd also like to see us get rid of the default_with_oids setting -- I assume the existence of that is the reason why pgAdmin and TOAD still generate table DDL with an explicit WITH (OIDS=FALSE) That probably has as much to do with inertia as anything else. I agree that it's ugly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Optimize referential integrity checks (todo item)
On Sat, Feb 11, 2012 at 9:06 PM, Vik Reykja vikrey...@gmail.com wrote: I decided to take a crack at the todo item created from the following post: http://archives.postgresql.org/pgsql-performance/2005-10/msg00458.php The attached patch makes the desired changes in both code and function naming. It seemed quite easy to do but wasn't marked as easy on the todo, so I'm wondering if I've missed something. It's kind of hard to say whether you've missed something, because you haven't really explained what problem this is solving; the thread you linked to isn't very clear about that either. At first blush, it seems like you've renamed a bunch of stuff without making very much change to what actually happens. Changing lots of copies of equal to unchanged doesn't seem to me to be accomplishing anything. All regression tests pass. You should add some new ones showing how this patch improves the behavior relative to the previous code. Or if you can't, then you should provide a complete, self-contained test case that a reviewer can use to see how your proposed changes improve things. We're in the middle of a CommitFest right now, so please add this patch to the next one if you would like it reviewed: https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] index-only quals vs. security_barrier views
On Sat, Feb 11, 2012 at 7:16 AM, Jesper Krogh jes...@krogh.cc wrote: Ok, but there are still cases where we don't even need to construct a data tuple at all:

2012-02-11 13:14:01.579 jk=# explain select count(*) from testtable where fts @@ to_tsquery('english','test1');
                                  QUERY PLAN
---------------------------------------------------------------------------
 Aggregate  (cost=31.24..31.25 rows=1 width=0)
   ->  Bitmap Heap Scan on testtable  (cost=16.03..31.23 rows=4 width=0)
         Recheck Cond: (fts @@ '''test1'''::tsquery)
         ->  Bitmap Index Scan on ftsid  (cost=0.00..16.03 rows=4 width=0)
               Index Cond: (fts @@ '''test1'''::tsquery)
(5 rows)

In that case I believe you DO need the heap tuple. That Recheck Cond there means that the index might be lossy - i.e. return tuples that don't really match the search condition. Another idea sprung into my head, that indices on (ctid, some mix of columns) could actually serve as some kind of vertical partitioning of the table. The ctid of a tuple is its physical position in the table. It makes no sense to index that. Since it's unique, it makes even less sense to index that plus other things in the same index. Does anyone have any comments on the issue raised in my original email? I would like to get (some version of) his patch committed, but I would also like to not back ourselves into a corner. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch n...@leadboat.com wrote: On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote: On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch n...@leadboat.com wrote: I like the design you have chosen. It would find applications beyond TRUNCATE, so your use of non-specific naming is sound. For example, older snapshots will see an empty table t after CREATE TABLE t AS SELECT 1 commits; that's a comparable MVCC anomaly. Some of our non-MVCC-safe commands should perhaps just become MVCC-safe, but there will always be use cases for operations that shortcut MVCC. When one truly does want that, your proposal for keeping behavior consistent makes plenty of sense. I guess I'm not particularly excited by the idea of trying to make TRUNCATE MVCC-safe. I notice that the example involves the REPEATABLE READ isolation level, which is already known to be busted in a variety of ways; that's why we now have SERIALIZABLE, and why most people use READ COMMITTED. Are there examples of this behavior at other isolation levels? I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and not at READ COMMITTED. They tend to be narrow race conditions at READ COMMITTED, yet easy to demonstrate at REPEATABLE READ. Related: http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php Yeah. Well, that's actually an interesting example, because it illustrates how general this problem is. We could potentially get ourselves into a situation where just about every system catalog table needs an xmin field to store the point at which the object came into existence - or for that matter, was updated. But it's not quite the same as the xmin of the row itself, because some updates might be judged not to matter. There could also be intermediate cases where updates are invalidating for some purposes but not others. I think we'd better get our hands around more of the problem space before we start trying to engineer solutions. 
Incidentally, people use READ COMMITTED because they don't question the default, not because they know the hazards of REPEATABLE READ. I don't know the bustedness you speak of; could we improve the documentation to inform folks? The example that I remember was related to SELECT FOR UPDATE/SELECT FOR SHARE. The idea of those statements is that you want to prevent the row from being updated or deleted until some other concurrent action is complete; for example, in the case of a foreign key, we'd like to prevent the referenced row from being deleted or updated in the relevant columns until the inserting transaction is committed. But it doesn't work, because when the updating or deleting process gets done with the lock wait, it is still using the same snapshot as before, and merrily does exactly the thing that the lock wait was supposed to prevent. If an actual UPDATE is used, it's safe (I think): anyone who was going to UPDATE or DELETE the row will fail with some kind of serialization error. But a SELECT FOR UPDATE that commits is treated more like an UPDATE that rolls back: it's as if the lock never existed. Someone (Florian?) proposed a patch to change this, but it seemed problematic for reasons I no longer exactly remember. When using an actual foreign key, we work around this by taking a new snapshot to cross-check that things haven't changed under us, but user-level code can't do that. At READ COMMITTED, depending on the situation, either the fact that we take new snapshots pretty frequently or the EPQ machinery sometimes make things work sensibly anyway, and at SERIALIZABLE, SSI prevents these kinds of anomalies. But REPEATABLE READ has no protection. I wish I could find the thread where we discussed this before. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Removing special case OID generation
On Monday, February 13, 2012 02:08:08 PM Robert Haas wrote: On Sat, Feb 11, 2012 at 4:23 AM, Simon Riggs si...@2ndquadrant.com wrote: Yeh, I was thinking we would do well to implement cached sequences for say first 1000 sequences. Another option might be to store all the sequences for a particular database in a single underlying data file. The current implementation uses a whole page for a single tuple that is presumably much smaller than that. So when you create a sequence foo, it's really creating a row in some new system catalog pg_sequences, or something like that. I wonder if the tighter packing would be noticeable contention-wise. If several hot sequences end up in a single page, that could end up being measurable. Andres
Re: [HACKERS] When do we lose column names?
On Sat, Feb 11, 2012 at 11:11 AM, Andrew Dunstan aduns...@postgresql.org wrote: Other candidates I have found that don't set colnames and should probably use dummy names are: * src/backend/parser/gram.y (row: production) * src/backend/optimizer/prep/prepunion.c:adjust_appendrel_attrs_mutator() * src/backend/optimizer/util/var.c:flatten_join_alias_vars_mutator() Hm, can't the last two get real column names from somewhere? Possibly. I'll dig a bit deeper. I've had a look at these two. It's at least not obvious to me how to do this simply, if at all. In the last case it looks like we'd need to process the object recursively just like we do to extract the field values, and I don't know where to get them in the appendrel case at all, possibly because I'm not very familiar with this code. Do we actually need to bother with these cases? The regression tests pass without touching them, either because they don't matter or because we don't have a test for these cases that would tickle the assertion that was failing. If they don't matter, would it not be better just to note that in the code rather than building a list of field names for no good purpose? In flatten_join_alias_vars_mutator(), we've got a RangeTblEntry to work with. I think the column names are to be found in the alias and/or eref attributes of the RangeTblEntry. Each of those is an Alias object, which is defined like this:

typedef struct Alias
{
    NodeTag     type;
    char       *aliasname;      /* aliased rel name (never qualified) */
    List       *colnames;       /* optional list of column aliases */
} Alias;

I'm not sure whether we should look at rte->eref.colnames, rte->alias.colnames, or both. In adjust_appendrel_attrs_mutator(), we have a list, translated_vars, whose order matches the column order of the parent rel. If we had the parent's RangeTblEntry, we could probably proceed as in the previous case. But the AppendRelInfo only contains the index of the RT. 
Maybe we can figure out a way to use rt_fetch to get the RangeTblEntry itself, but that requires a pointer to the range table itself, which we haven't got. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Removing special case OID generation
On Mon, Feb 13, 2012 at 9:41 AM, Andres Freund and...@anarazel.de wrote: On Monday, February 13, 2012 02:08:08 PM Robert Haas wrote: On Sat, Feb 11, 2012 at 4:23 AM, Simon Riggs si...@2ndquadrant.com wrote: Yeh, I was thinking we would do well to implement cached sequences for say first 1000 sequences. Another option might be to store all the sequences for a particular database in a single underlying data file. The current implementation uses a whole page for a single tuple that is presumably much smaller than that. So when you create a sequence foo, it's really creating a row in some new system catalog pg_sequences, or something like that. I wonder if the tighter packing would be noticeable contention-wise. If several hot sequences end up in a single page, that could end up being measurable. For the contention to really be an issue, you'd need a very high rate of access to that sequence - in my tests so far, the only things that seem to get hot enough to really hurt are the roots of btrees and visibility map pages. And on the plus side, you'd be reducing the number of pages fighting to stay in shared_buffers. That having been said, it's something to watch out for - I certainly don't know enough to say for certain that it wouldn't be a problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Optimize referential integrity checks (todo item)
On Mon, Feb 13, 2012 at 15:25, Robert Haas robertmh...@gmail.com wrote: On Sat, Feb 11, 2012 at 9:06 PM, Vik Reykja vikrey...@gmail.com wrote: I decided to take a crack at the todo item created from the following post: http://archives.postgresql.org/pgsql-performance/2005-10/msg00458.php The attached patch makes the desired changes in both code and function naming. It seemed quite easy to do but wasn't marked as easy on the todo, so I'm wondering if I've missed something. It's kind of hard to say whether you've missed something, because you haven't really explained what problem this is solving; the thread you linked too isn't very clear about that either. At first blush, it seems like you've renamed a bunch of stuff without making very much change to what actually happens. Changing lots of copies of equal to unchanged doesn't seem to me to be accomplishing anything. It's very simple really, and most of it is indeed renaming the functions. The problem this solves is that foreign key constraints are sometimes checked when they don't need to be. See my example below. All regression tests pass. You should add some new ones showing how this patch improves the behavior relative to the previous code. Or if you can't, then you should provide a complete, self-contained test case that a reviewer can use to see how your proposed changes improve things. I have no idea how a regression test would be able to see this change, so here's a test case that you can follow with the debugger. /* initial setup */ create table a (x int, y int, primary key (x, y)); create table b (x int, y int, z int, foreign key (x, y) references a); insert into a values (1, 2); insert into b values (1, null, 3); /* seeing the difference */ update b set z=0; When that update is run, it will check if the FK (x, y) has changed to know if it needs to verify that the values are present in the other table. 
The equality functions that do that don't consider two nulls to be equal (per SQL logic), and so the constraint was rechecked anyway. Tom noticed that it didn't need to be, because the key hadn't really changed. In the above example, the current code will recheck the constraint and the new code won't. It's not really testing equality anymore (because null does not equal null), so I renamed the functions, causing a lot of noise in the diff. We're in the middle of a CommitFest right now, Yes, I wasn't expecting this to be committed, I just didn't want to lose track of it. so please add this patch to the next one if you would like it reviewed: https://commitfest.postgresql.org/action/commitfest_view/open Will do.
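To make the distinction concrete, here is a small Python sketch of the two comparison semantics (illustrative only; the names and structure are not those of ri_triggers.c, and the NULL sentinel is an assumption of the sketch):

```python
NULL = object()  # stand-in for SQL NULL

def sql_equal(a, b):
    """SQL-style comparison: anything compared to NULL is not true."""
    if a is NULL or b is NULL:
        return False
    return a == b

def key_unchanged(old_key, new_key):
    """The renamed check: a key column counts as 'unchanged' when it is
    NULL in both row versions, so no FK recheck is needed."""
    return all(
        (o is NULL and n is NULL) or sql_equal(o, n)
        for o, n in zip(old_key, new_key)
    )

# Row in b from the example above: FK columns (x, y) = (1, NULL); only z changes.
old_key, new_key = (1, NULL), (1, NULL)

print(sql_equal(NULL, NULL))            # False -> old code rechecks the FK
print(key_unchanged(old_key, new_key))  # True  -> new code skips the recheck
```

The point is only that "equal" and "unchanged" diverge exactly when both sides are NULL, which is why the rename is more than cosmetic.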
Re: [HACKERS] Removing special case OID generation
Robert Haas robertmh...@gmail.com writes: Another option might be to store all the sequences for a particular database in a single underlying data file. We've looked into that before, and found some roadblocks IIRC, though it probably isn't completely infeasible. See the archives. regards, tom lane
Re: [HACKERS] Optimize referential integrity checks (todo item)
On Mon, Feb 13, 2012 at 11:02, Chetan Suttraway chetan.suttra...@enterprisedb.com wrote: The patch was not getting applied. Was seeing below message: postgresql$ git apply /Downloads/unchanged.patch error: src/backend/utils/adt/ri_triggers.c: already exists in working directory Have come up with attached patch which hopefully should not have missed any of your changes. Thank you for doing that. What command did you use? I followed the procedure on the wiki [1] but I must be doing something wrong. [1] http://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git Please verify the changes. They look good. Thanks again.
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
Robert Haas robertmh...@gmail.com wrote: The example that I remember was related to SELECT FOR UPDATE/SELECT FOR SHARE. The idea of those statements is that you want to prevent the row from being updated or deleted until some other concurrent action is complete; for example, in the case of a foreign key, we'd like to prevent the referenced row from being deleted or updated in the relevant columns until the inserting transaction is committed. But it doesn't work, because when the updating or deleting process gets done with the lock wait, it is still using the same snapshot as before, and merrily does exactly the thing that the lock wait was supposed to prevent. This issue is one which appears to be a problem for people trying to migrate from Oracle, where a write conflict would be generated. If an actual UPDATE is used, it's safe (I think): anyone who was going to UPDATE or DELETE the row will fail with some kind of serialization error. Right; a write conflict. But a SELECT FOR UPDATE that commits is treated more like an UPDATE that rolls back: it's as if the lock never existed. Someone (Florian?) proposed a patch to change this, but it seemed problematic for reasons I no longer exactly remember. It had to do with only having one xmax and how that worked with subtransactions. Of course, besides the technical obstacles, such a semantic change could break existing code for PostgreSQL users. :-( When using an actual foreign key, we work around this by taking a new snapshot to cross-check that things haven't changed under us, but user-level code can't do that. At READ COMMITTED, depending on the situation, either the fact that we take new snapshots pretty frequently or the EPQ machinery sometimes make things work sensibly anyway, and at SERIALIZABLE, SSI prevents these kinds of anomalies. But REPEATABLE READ has no protection. Well, personally I have a hard time calling READ COMMITTED behavior sensible. 
Consider this:

-- connection 1
test=# create table t (id int not null primary key);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index t_pkey for table t
CREATE TABLE
test=# insert into t select generate_series(1, 10);
INSERT 0 10

-- connection 2
test=# begin;
BEGIN
test=# update t set id = id - 1;
UPDATE 10

-- connection 1
test=# select * from t where id = (select min(id) from t) for update;
[blocks]

-- connection 2
test=# commit;
COMMIT

-- connection 1
[unblocks]
 id
----
(0 rows)

-Kevin
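To see mechanically why that SELECT ... FOR UPDATE comes back with zero rows, here is a toy replay in Python. This only mirrors the snapshot-versus-recheck arithmetic described above; PostgreSQL's actual EvalPlanQual machinery is considerably more involved:

```python
# ids as inserted, and after the concurrent "update t set id = id - 1" commits
old_ids = list(range(1, 11))        # 1..10: the snapshot the subselect sees
new_ids = [i - 1 for i in old_ids]  # 0..9:  the committed row versions

min_id = min(old_ids)  # the subselect computes min(id) = 1 from its snapshot

result = []
for old, new in zip(old_ids, new_ids):
    if old == min_id:       # scan qual matches the snapshot version of the row...
        if new == min_id:   # ...but after the lock wait, the committed version
            result.append(new)  # is rechecked, and id is now 0, not 1

print(result)  # [] -- the row that was id=1 is now id=0, so nothing qualifies
```

The row the scan latched onto changed under it, and the row that now holds id=1 was never the one the scan selected, hence the empty result.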
Re: [HACKERS] When do we lose column names?
Robert Haas robertmh...@gmail.com writes: On Sat, Feb 11, 2012 at 11:11 AM, Andrew Dunstan aduns...@postgresql.org wrote: Do we actually need to bother with these cases? In flatten_join_alias_vars_mutator(), we've got a RangeTblEntry to work with. I think the column names are to be found in the alias and/or eref attributes of the RangeTblEntry. The eref names are the ones to use. alias just records the original AS clause (if any) attached to the RTE, which is mostly useful only for reverse-listing the query. In adjust_appendrel_attrs_mutator(), we have a list, translated_vars, whose order matches the column order of the parent rel. If we had the parent's RangeTblEntry, we could probably proceed as in the previous case. But the AppendRelInfo only contains the index of the RT. Maybe we can figure out a way to use rt_fetch to get the RangeTblEntry itself, but that requires a pointer to the range table itself, which we haven't got. This is surely fixable by passing a bit more information down. If you (Andrew) have something that covers everything but this issue, pass it over and I'll take a whack at it. regards, tom lane
Re: [HACKERS] auto_explain produces invalid JSON
[ sorry for ignoring this over the weekend --- I wasn't feeling very well ] Andrew Dunstan and...@dunslane.net writes: On 02/11/2012 03:22 PM, Tom Lane wrote: I'm inclined to think that this is auto_explain's error, not that of the core code, ie we should be changing the output. Well, maybe this is more to your taste, although it strikes me as more than something of a kludge. At least it's short :-) I see you've already committed this, but I agree that it's quite a kluge. After looking a bit more at the existing explain code, it seems like the critical issue is that explain.c has ExplainOpenGroup/ExplainCloseGroup calls around the ExplainPrintPlan call (see ExplainOnePlan), while auto_explain does not. I did not like your originally proposed patch because it seemed to introduce yet another formatting concept into code that has already got a few too many. But couldn't we fix this by exporting ExplainOpenGroup/ExplainCloseGroup and then calling those from auto_explain? regards, tom lane
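The JSON output format makes the problem easy to see: what ExplainPrintPlan emits on its own amounts to a bare key/value pair, which only becomes a valid JSON document once an enclosing group supplies the braces. A quick Python sketch (the fragment shown is hypothetical, not auto_explain's exact output):

```python
import json

# Hypothetical fragment resembling what ExplainPrintPlan emits by itself:
fragment = '"Plan": { "Node Type": "Result" }'

def is_valid_json(s):
    """Return True if s parses as a complete JSON document."""
    try:
        json.loads(s)
        return True
    except ValueError:
        return False

print(is_valid_json(fragment))              # False: a bare pair is not a document
print(is_valid_json("{" + fragment + "}"))  # True: the enclosing group adds braces
```

Which is why wrapping the call in the open/close group, rather than patching the output afterward, is the cleaner fix.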
Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)
On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 13.02.2012 01:04, Jeff Janes wrote: Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD. I have no idea if I did it correctly, in particular if calling GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side effects that make that a bad thing to do. I'm not proposing it as the real fix, I just wanted to get around this problem in order to do more testing. Thanks. That's basically the right approach. Attached patch contains a cleaned up version of that. It does get rid of the "there is no contrecord flag" errors, but recovery still does not work. Now the count of tuples in the table is always correct (I never provoke a crash during the initial table load), but sometimes updates to those tuples that were reported to have been committed are lost. This is more subtle, it does not happen on every crash. It seems that when recovery ends on "record with zero length at...", the recovery is correct. But when it ends on "invalid magic number in log file..." then the recovery is screwed up. Can you write a self-contained test case for that? I've been trying to reproduce that by running the regression tests and pgbench with a streaming replication standby, which should be pretty much the same as crash recovery. No luck this far. Probably I could reproduce the same problem as Jeff got. 
Here is the test case:

$ initdb -D data
$ pg_ctl -D data start
$ psql -c "create table t (i int); insert into t values(generate_series(1,1)); delete from t"
$ pg_ctl -D data stop -m i
$ pg_ctl -D data start

The crash recovery emitted the following server logs:

LOG: database system was interrupted; last known up at 2012-02-14 02:07:01 JST
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/179CC90
LOG: invalid magic number in log file 0, segment 1, offset 8060928
LOG: redo done at 0/17AD858
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

After recovery, I could not see the table t which I created before:

$ psql -c "select count(*) from t"
ERROR: relation "t" does not exist

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
[HACKERS] Access Error Details from PL/pgSQL
Hackers, In PL/pgSQL exception handling, I'm able to access the error code (SQLSTATE) and error message (SQLERRM). Is there any way to get at error details (yet)? If not, could SQLDETAIL or some such be added? Thanks, David
Re: [HACKERS] Access Error Details from PL/pgSQL
Hello 2012/2/13 David E. Wheeler da...@justatheory.com: Hackers, In PL/pgSQL exception handling, I'm able to access the error code (SQLSTATE) and error message (SQLERRM). Is there any way to get at error details (yet)? If not, could SQLDETAIL or some such be added? Not in any stable release yet: http://www.depesz.com/2011/07/20/waiting-for-9-2-stacked-diagnostics-in-plpgsql/ Pavel Thanks, David
Re: [HACKERS] CUDA Sorting
On Feb 13, 2012 11:39 a.m., Kohei KaiGai kai...@kaigai.gr.jp wrote: 2012/2/13 Greg Smith g...@2ndquadrant.com: On 02/11/2012 08:14 PM, Gaetano Mendola wrote: The trend is to have server capable of running CUDA providing GPU via external hardware (PCI Express interface with PCI Express switches), look for example at PowerEdge C410x PCIe Expansion Chassis from DELL. The C410X adds 16 PCIe slots to a server, housed inside a separate 3U enclosure. That's a completely sensible purchase if your goal is to build a computing cluster, where a lot of work is handed off to a set of GPUs. I think that's even less likely to be a cost-effective option for a database server. Adding a single dedicated GPU installed in a server to accelerate sorting is something that might be justifiable, based on your benchmarks. This is a much more expensive option than that though. Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants to see just how big this external box is. I did some experiments timing the sort done with CUDA and the sort done with pg_qsort:

                     CUDA      pg_qsort
33 million integers: ~ 900 ms  ~ 6000 ms
 1 million integers: ~  21 ms  ~  162 ms
      100k integers: ~   2 ms  ~   13 ms

The CUDA times already include the copy operations (host-device, device-host). As GPU I was using a C2050, and the CPU doing the pg_qsort was a Intel(R) Xeon(R) CPU X5650 @ 2.67GHz That's really interesting, and the X5650 is by no means a slow CPU. So this benchmark is providing a lot of CPU power yet still seeing over a 6X speedup in sort times. It sounds like the PCI Express bus has gotten fast enough that the time to hand data over and get it back again can easily be justified for medium to large sized sorts. It would be helpful to take this patch and confirm whether it scales when using in parallel. Easiest way to do that would be to use the pgbench -f feature, which allows running an arbitrary number of some query at once. 
Seeing whether this acceleration continues to hold as the number of clients increases is a useful data point. Is it possible for you to break down where the time is being spent? For example, how much of this time is consumed in the GPU itself, compared to time spent transferring data between CPU and GPU? I'm also curious where the bottleneck is at with this approach. If it's the speed of the PCI-E bus for smaller data sets, adding more GPUs may never be practical. If the bus can handle quite a few of these at once before it saturates, it might be possible to overload a single GPU. That seems like it would be really hard to reach for database sorting, though I can't really justify my gut feel for that being true. I've never seen a PostgreSQL server capable of running CUDA, and I don't expect that to change. That sounds like: "I think there is a world market for maybe five computers." - IBM Chairman Thomas Watson, 1943 Yes, and "640K will be enough for everyone", ha ha. (Having said the 640K thing is flat-out denied by Gates, BTW, and no one has come up with proof otherwise.) I think you've made an interesting case for this sort of acceleration now being useful for systems doing what's typically considered a data warehouse task. I regularly see servers waiting for far more than 13M integers to sort. And I am seeing a clear trend toward providing more PCI-E slots in servers now. Dell's R810 is the most popular single server model my customers have deployed in the last year, and it has 5 X8 slots in it. It's rare all 5 of those are filled. As long as a dedicated GPU works fine when dropped to X8 speeds, I know a fair number of systems where one of those could be added now. There's another data point in your favor I didn't notice before your last e-mail. Amazon has a Cluster GPU Quadruple Extra Large node type that runs with NVIDIA Tesla hardware. That means the installed base of people who could consider CUDA is higher than I expected.
To demonstrate how much that costs, provisioning a GPU-enabled reserved instance from Amazon for one year costs $2410 at Light Utilization, giving a system with 22GB of RAM and 1.69TB of storage. (I find the reserved prices easier to compare with dedicated hardware than the hourly ones.) That's halfway between the High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at $2200. If someone could prove sorting was a bottleneck on their server, that isn't an unreasonable option to consider on a cloud-based database deployment. I still think that an approach based on OpenCL is more likely to be suitable for PostgreSQL, which was part of why I gave CUDA low odds here. The points in favor of OpenCL are: -Since you last posted,
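As a quick sanity check on the figures Gaetano quoted earlier in this thread, the speedup ratios can be computed directly from those numbers; this is just arithmetic on the reported timings, not a new benchmark (sketch in Python):

```python
# Timings quoted upthread, in milliseconds: (n_integers, cuda_ms, pg_qsort_ms)
timings = [
    (33_000_000, 900, 6000),
    (1_000_000, 21, 162),
    (100_000, 2, 13),
]

# All three sizes show a roughly 6-8x advantage for the GPU sort,
# consistent with the "over a 6X speedup" claim above.
for n, cuda_ms, qsort_ms in timings:
    print(f"{n:>10d} integers: {qsort_ms / cuda_ms:.1f}x speedup")
```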
Re: [HACKERS] Access Error Details from PL/pgSQL
On Feb 13, 2012, at 9:30 AM, Pavel Stehule wrote: no in stable http://www.depesz.com/2011/07/20/waiting-for-9-2-stacked-diagnostics-in-plpgsql/ Ah, great, I had forgotten about that. Thank you, David
Re: [HACKERS] auto_explain produces invalid JSON
On 02/13/2012 11:15 AM, Tom Lane wrote: [ sorry for ignoring this over the weekend --- I wasn't feeling very well ] Andrew Dunstan and...@dunslane.net writes: On 02/11/2012 03:22 PM, Tom Lane wrote: I'm inclined to think that this is auto_explain's error, not that of the core code, ie we should be changing the output. Well, maybe this is more to your taste, although it strikes me as more than something of a kludge. At least it's short :-) I see you've already committed this, but I agree that it's quite a kluge. After looking a bit more at the existing explain code, it seems like the critical issue is that explain.c has ExplainOpenGroup/ExplainCloseGroup calls around the ExplainPrintPlan call (see ExplainOnePlan), while auto_explain does not. I did not like your originally proposed patch because it seemed to introduce yet another formatting concept into code that has already got a few too many. But couldn't we fix this by exporting ExplainOpenGroup/ExplainCloseGroup and then calling those from auto_explain? Yeah, maybe. We'd still have to do it conditionally (have to use ExplainBeginOutput for the XML case), but it would possibly be less kludgy. cheers andrew
Re: [HACKERS] auto_explain produces invalid JSON
Andrew Dunstan and...@dunslane.net writes: On 02/13/2012 11:15 AM, Tom Lane wrote: After looking a bit more at the existing explain code, it seems like the critical issue is that explain.c has ExplainOpenGroup/ExplainCloseGroup calls around the ExplainPrintPlan call (see ExplainOnePlan), while auto_explain does not. Yeah, maybe. We'd still have to do it conditionally (have to use ExplainBeginOutput for the XML case), but it would possibly be less kludgy. Hm? I wasn't suggesting removing the ExplainBeginOutput call, but more like:

  ExplainBeginOutput(es);
+ ExplainOpenGroup(...);
  ExplainQueryText(es, queryDesc);
  ExplainPrintPlan(es, queryDesc);
+ ExplainCloseGroup(...);
  ExplainEndOutput(es);

Details still TBD; the point is just that it's not clear to me why auto_explain should need a formatting concept that doesn't already exist within explain.c. regards, tom lane
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
On Mon, Feb 13, 2012 at 10:48 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Well, personally I have a hard time calling READ COMMITTED behavior sensible. Consider this: [ gigantic pile of fail ] Yeah, that's bad all right. I think it's hard to argue that the current behavior is sensible; the trick is to figure out something that's better but doesn't impose too much additional overhead. I wonder if it's possible to use SSI (or some stripped-down mechanism along similar lines) to track these kinds of write conflicts, rather than cluttering the system catalogs with lots more TransactionId fields. Like, when we TRUNCATE a table, we could essentially make a note in memory someplace recording the write conflict. Unfortunately, the full-blown SSI machinery seems too expensive to use for this, especially on all-read workloads where there are no actual conflicts but lots of conflict checks. But that could probably be optimized. The attraction of using something like an in-memory conflict manager is that it would probably be quite general. We could fix problems of this type with no on-disk format changes whenever we discover them (as we will certainly continue to do) just by adding the appropriate flag-a-conflict calls to the right places in the code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] auto_explain produces invalid JSON
On 02/13/2012 12:48 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 02/13/2012 11:15 AM, Tom Lane wrote: After looking a bit more at the existing explain code, it seems like the critical issue is that explain.c has ExplainOpenGroup/ExplainCloseGroup calls around the ExplainPrintPlan call (see ExplainOnePlan), while auto_explain does not. Yeah, maybe. We'd still have to do it conditionally (have to use ExplainBeginOutput for the XML case), but it would possibly be less kludgy. Hm? I wasn't suggesting removing the ExplainBeginOutput call, but more like:

  ExplainBeginOutput(es);
+ ExplainOpenGroup(...);
  ExplainQueryText(es, queryDesc);
  ExplainPrintPlan(es, queryDesc);
+ ExplainCloseGroup(...);
  ExplainEndOutput(es);

Details still TBD; the point is just that it's not clear to me why auto_explain should need a formatting concept that doesn't already exist within explain.c. This will introduce an extra level of nesting for no good reason. But this would work:

- ExplainBeginOutput(es);
+ if (auto_explain_log_format != EXPLAIN_FORMAT_JSON)
+     ExplainBeginOutput(es);
+ else
+     ExplainOpenGroup(NULL, NULL, true, es);
  ExplainQueryText(es, queryDesc);
  ExplainPrintPlan(es, queryDesc);
- ExplainEndOutput(es);
+ if (auto_explain_log_format != EXPLAIN_FORMAT_JSON)
+     ExplainEndOutput(es);
+ else
+     ExplainCloseGroup(NULL, NULL, true, es);

cheers andrew
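The underlying problem is easy to demonstrate outside of PostgreSQL: without an enclosing group, the JSON output contains multiple top-level members, which is not a valid JSON document. A sketch in Python (illustrative only; the key names here are stand-ins, not auto_explain's actual output):

```python
import json

# Two top-level members with no enclosing object, roughly the shape of the
# invalid output this thread is about: not a parseable JSON document.
ungrouped = '"Query Text": "SELECT 1", "Plan": {"Node Type": "Result"}'
try:
    json.loads(ungrouped)
    parses = True
except json.JSONDecodeError:
    parses = False
print("ungrouped parses:", parses)

# Wrapping everything in one enclosing group (the effect of the
# ExplainOpenGroup/ExplainCloseGroup pair) yields a single valid object.
grouped = "{" + ungrouped + "}"
print("grouped parses:", json.loads(grouped) is not None)
```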
Re: [HACKERS] CUDA Sorting
I don't think we should be looking at either CUDA or OpenCL directly. We should be looking for a generic library that can target either and is well maintained and actively developed. Any GPU code we write ourselves would rapidly be overtaken by changes in the hardware and innovations in parallel algorithms. If we find a library that provides a sorting api and adapt our code to use it then we'll get the benefits of any new hardware feature as the library adds support for them.
Re: [HACKERS] auto_explain produces invalid JSON
On 02/13/2012 01:33 PM, Andrew Dunstan wrote: On 02/13/2012 12:48 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 02/13/2012 11:15 AM, Tom Lane wrote: After looking a bit more at the existing explain code, it seems like the critical issue is that explain.c has ExplainOpenGroup/ExplainCloseGroup calls around the ExplainPrintPlan call (see ExplainOnePlan), while auto_explain does not. Yeah, maybe. We'd still have to do it conditionally (have to use ExplainBeginOutput for the XML case), but it would possibly be less kludgy. Hm? I wasn't suggesting removing the ExplainBeginOutput call, but more like:

  ExplainBeginOutput(es);
+ ExplainOpenGroup(...);
  ExplainQueryText(es, queryDesc);
  ExplainPrintPlan(es, queryDesc);
+ ExplainCloseGroup(...);
  ExplainEndOutput(es);

Details still TBD; the point is just that it's not clear to me why auto_explain should need a formatting concept that doesn't already exist within explain.c. This will introduce an extra level of nesting for no good reason. But this would work:

- ExplainBeginOutput(es);
+ if (auto_explain_log_format != EXPLAIN_FORMAT_JSON)
+     ExplainBeginOutput(es);
+ else
+     ExplainOpenGroup(NULL, NULL, true, es);
  ExplainQueryText(es, queryDesc);
  ExplainPrintPlan(es, queryDesc);
- ExplainEndOutput(es);
+ if (auto_explain_log_format != EXPLAIN_FORMAT_JSON)
+     ExplainEndOutput(es);
+ else
+     ExplainCloseGroup(NULL, NULL, true, es);

Except that it causes other problems. I think we'd probably best let sleeping dogs lie. cheers andrew
Re: [HACKERS] When do we lose column names?
On 02/13/2012 11:00 AM, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Sat, Feb 11, 2012 at 11:11 AM, Andrew Dunstan aduns...@postgresql.org wrote: Do we actually need to bother with these cases? In flatten_join_alias_vars_mutator(), we've got a RangeTblEntry to work with. I think the column names are to be found in the alias and/or eref attributes of the RangeTblEntry. The eref names are the ones to use. alias just records the original AS clause (if any) attached to the RTE, which is mostly useful only for reverse-listing the query. In adjust_appendrel_attrs_mutator(), we have a list, translated_vars, whose order matches the column order of the parent rel. If we had the parent's RangeTblEntry, we could probably proceed as in the previous case. But the AppendRelInfo only contains the index of the RT. Maybe we can figure out a way to use rt_fetch to get the RangeTblEntry itself, but that requires a pointer to the range table itself, which we haven't got. This is surely fixable by passing a bit more information down. If you (Andrew) have something that covers everything but this issue, pass it over and I'll take a whack at it. What I have fixes one of the three cases I have identified that appear to need fixing, the one that caused the json tests to crash with your original partial patch. See https://bitbucket.org/adunstan/pgdevel/changesets/tip/rowexprs. I won't have time to return to this for a few days. The two remaining cases should be fairly independent I think. If you don't get to this before then I'll look at the flatten_join_alias_vars_mutator case again and try to fix it based on the above, and then give you what I've got. cheers andrew
Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe
Robert Haas robertmh...@gmail.com wrote: Kevin Grittner kevin.gritt...@wicourts.gov wrote: Well, personally I have a hard time calling READ COMMITTED behavior sensible. Consider this: [ gigantic pile of fail ] Yeah, that's bad all right. I think it's hard to argue that the current behavior is sensible; the trick is to figure out something that's better but doesn't impose too much additional overhead. I wonder if it's possible to use SSI (or some stripped-down mechanism along similar lines) to track these kinds of write conflicts, rather than cluttering the system catalogs with lots more TransactionId fields. Like, when we TRUNCATE a table, we could essentially make a note in memory someplace recording the write conflict. Potential additional uses of the predicate locking developed for SSI keep surfacing. At some point we should probably pick a couple of them and try to fashion an API for the non-blocking predicate locking logic that serves them and SSI. Since this predicate locking system is explicitly interested only in read-write conflicts, it does seem like it could work for SELECT FOR UPDATE versus writes. As mentioned once or twice before, it was pretty clear that while predicate locking is required for SSI and is probably 80% of the C code in the patch, it is really a separate thing -- we just didn't want to try to create a generalized API based on the one initial usage. I think that an API based on registering and listening would be the ticket. Unfortunately, the full-blown SSI machinery seems too expensive to use for this, especially on all-read workloads where there are no actual conflicts but lots of conflict checks. In an all-read workload, if you actually declare the transactions to be read-only SSI should not introduce much overhead. If there's much overhead showing up there at the moment, it would probably be pretty easy to eliminate. When there are any read-write transactions active, it's a different story. But that could probably be optimized. 
Undoubtedly. It's disappointing that neither Dan nor I could find the round tuits to make the kinds of changes in the SSI locking that you made in some other areas for 9.2. I'm not really sure how the performance impact breaks down between predicate locking and SSI proper, although I would tend to start from the assumption that, like the lines of code, it's 80% in the predicate locking. The attraction of using something like an in-memory conflict manager is that it would probably be quite general. We could fix problems of this type with no on-disk format changes whenever we discover them (as we will certainly continue to do) just by adding the appropriate flag-a-conflict calls to the right places in the code. Assuming that the problems could be expressed in terms of read-write conflicts, that's probably largely true. I'm not sure that holds for some of the funny READ COMMITTED behaviors, though. I think the only real cure there would be to make subtransactions cheap enough that we could wrap execution of each SELECT and DML statement in a subtransaction and roll back for another try with a new snapshot on conflict. If you want to track something other than read-write conflicts and/or you want blocking when a conflict is found, you might be better off looking to bend the heavyweight locks to your purposes. -Kevin
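The register-and-listen pattern Kevin describes can be sketched in a few lines. This is a toy model in Python, purely illustrative: the class and method names are invented here and bear no relation to the actual SSI code. Readers register interest in an object, and a conflicting rewrite such as TRUNCATE flags every overlapping reader.

```python
from collections import defaultdict

class ConflictTracker:
    """Toy in-memory read-write conflict tracker: a transaction that read
    an object is flagged if a concurrent transaction later rewrites it."""

    def __init__(self):
        self.readers = defaultdict(set)  # object name -> set of reader xids
        self.flagged = set()             # xids that must fail at commit

    def note_read(self, xid, obj):
        self.readers[obj].add(xid)

    def note_rewrite(self, xid, obj):
        # e.g. TRUNCATE: conflicts with every *other* transaction that read obj
        for reader in self.readers[obj]:
            if reader != xid:
                self.flagged.add(reader)

    def can_commit(self, xid):
        return xid not in self.flagged

tracker = ConflictTracker()
tracker.note_read(xid=1, obj="accounts")     # T1 reads the table
tracker.note_rewrite(xid=2, obj="accounts")  # T2 truncates it concurrently
print(tracker.can_commit(1), tracker.can_commit(2))
```

The appeal, as Robert notes upthread, is that new problem cases need only a new pair of note calls at the right places, with no on-disk format change.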
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
Robert Haas wrote: On Mon, Feb 13, 2012 at 7:45 AM, Robert Haas robertmh...@gmail.com wrote: On Thu, Feb 9, 2012 at 3:37 PM, Jay Levitt jay.lev...@gmail.com wrote: So my pre-built 9.1.2 takes 434s, my source-built 9.2 takes 509s, and (probably both of our) 9.1-HEAD takes 1918s... is that something to worry about, and if so, are there any tests I can run to assist? That bug doesn't affect me personally, but y'know, community and all that. Also, I wonder if it's something like "9.2 got way faster doing X, but meanwhile, HEAD got way slower doing Y", and this is a canary in the coal mine. This might be a lame hypothesis, but... is it possible that you built your 9.1-tip binaries with --enable-cassert? Or with different optimization options? No, I think I/O just varies more than my repeated tests on 1M rows indicated. I ran the 10M-row test four times on the same server, alternating between packaged 9.1.2 and source-built 9.1.2 (default configure options), and saw these times:

         INSERT  INDEX
apt          76    578
source      163    636
apt          73    546
source       80    473

EBS has no performance guarantees at all; you share your disks with an arbitrary number of other users, so if someone in the neighborhood decides to do some heavy disk I/O, you lose. Let this be a lesson to me: run benchmarks locally! So I tested. On my MacBook Pro, your test script builds the index in just over 25 s on both REL9_1_2 and this morning's REL9_1_STABLE. I think that's the 1-million version I emailed; try adding a zero and see if it doesn't take a little longer. Jay
Re: [HACKERS] psql tab completion for SELECT
On tor, 2012-02-09 at 23:02 +0100, Dimitri Fontaine wrote: Peter Eisentraut pete...@gmx.net writes: Make tab-completion complete also function names – like: SELECT pg_get<tab><tab> to see all functions that start with pg_get. Make tab-completion work for columns in SELECT. I know that when writing the SELECT clause, psql doesn’t know which table it will deal with, but it could search through all the columns in the database. That seems pretty useful, and it's more or less a one-line change, as in the attached patch. Does that include support for completing SRF functions in the FROM clause? No, that's an entirely different issue.
Re: [HACKERS] psql tab completion for SELECT
On fre, 2012-02-10 at 01:24 -0500, Tom Lane wrote: That seems pretty nearly entirely bogus. What is the argument for supposing that the word right after SELECT is a function name? I would think it would be a column name (from who-knows-what table) much more often. That's what the patch does. It looks like you misread what I wrote. Also, if it is useful, people will expect it to work in more places than just the first word after SELECT --- for instance, somebody who didn't realize what a kluge it was would expect it to also work right after a top-level comma after SELECT. Or probably after a left parenthesis in the SELECT list. Etc. All of psql tab completion works like that. We only complete the first table after FROM, the first column after ORDER BY, the first privilege to grant, the first role to grant to, the first table to drop, the first reloption, etc.
Re: [HACKERS] pl/python long-lived allocations in datum-dict transformation
On 12/02/12 00:48, Tom Lane wrote: Jan Urbański wulc...@wulczer.org writes: This is annoying for functions that plough through large tables, doing some calculation. Attached is a patch that does the conversion of PostgreSQL Datums into Python dict objects in a scratch memory context that gets reset every time. As best I can tell, this patch proposes creating a new, separate context (chewing up 8KB+) for every plpython procedure that's ever used in a given session. This cure could easily be worse than the disease as far [...] Yeah, that's not ideal. What's more, it's unclear that it won't malfunction altogether if the function is used recursively (ie, what if PLyDict_FromTuple ends up calling the same function again?) I was a bit worried about that, but the only place where PLyDict_FromTuple calls into some other code is when it calls the type's specific I/O function, which AFAICT can't call back into user code (except for user-defined C I/O routines). It's not very comfortable, but I think PLyDict_FromTuple can be allowed to be non-reentrant. Can't you fix it so that the temp context is associated with a particular function execution, rather than being statically allocated per-function? That would be cool, but I failed to easily get a handle on something that's like the execution context of a PL/Python function... Actually, if we assume that PLyDict_FromTuple (which is quite a low-level thing) never calls PL/Python UDFs we could keep a single memory context in top-level PL/Python memory and pay the overhead once in a session, not once per function. OTOH if we want to make it reentrant, some more tinkering would be in order. Cheers, Jan
Re: [HACKERS] CUDA Sorting
On Feb 13, 2012 7:49 p.m., Greg Stark st...@mit.edu wrote: I don't think we should be looking at either CUDA or OpenCL directly. We should be looking for a generic library that can target either and is well maintained and actively developed. Any GPU code we write ourselves would rapidly be overtaken by changes in the hardware and innovations in parallel algorithms. If we find a library that provides a sorting api and adapt our code to use it then we'll get the benefits of any new hardware feature as the library adds support for them. To sort integers I used the NVIDIA Thrust library.
Re: [HACKERS] Speed dblink using alternate libpq tuple storage
On Tue, Feb 07, 2012 at 04:44:09PM +0200, Marko Kreen wrote: Although it seems we could allow exceptions, at least when we are speaking of the Postgres backend, as the connection and result are in an internally consistent state when the handler is called, and the partial PGresult is stored under PGconn->result. Even the connection is in a consistent state, as the row packet is fully processed. So we have 3 variants, all work, but which one do we want to support?

1) Exceptions are not allowed.
2) Exceptions are allowed, but when one happens, the user must call PQfinish() next, to avoid processing incoming data from a potentially invalid state.
3) Exceptions are allowed, and further row processing can continue with PQisBusy() / PQgetResult()
   3.1) The problematic row is retried. (Current behaviour)
   3.2) The problematic row is skipped.

I converted the patch to support 3.2), that is - skip the row on exception. That required some cleanups of the getAnotherTuple() API, plus I made some other cleanups. Details below. But during this I also started to think about what happens if the user does *not* throw exceptions. Eg. Python does exceptions via regular return values, which means complications when passing them upwards. The main case I'm thinking of is actually a resultset iterator in a high-level language. The current callback-only style API requires that rows are stuffed into a temporary buffer until the network blocks and user code gets control again. This feels clumsy for a performance-oriented API. Another case is user code that wants to do complex processing. Doing a lot of stuff under a callback seems dubious. And what if some part of the code calls PQfinish() during some error recovery? I tried imagining some sort of getFoo() style API for fetching in-flight row data, but I always ended up with a "rewrite libpq" step, so I feel it's not productive to go there. Instead I added a simple feature: rowProcessor can return '2', in which case getAnotherTuple() does an early exit without setting any error state.
On the user side it appears as if PQisBusy() returned with a TRUE result. All pointers stay valid, so the callback can just stuff them into some temp area. ATM there is no indication though whether the exit was due to the callback or other reasons, so the user must detect it based on whether new temp pointers appeared, which means those must be cleaned before calling PQisBusy() again. This actually feels like a feature: those must not stay around after a single call. It's included in the main patch, but I also attached it as a separate patch so that it can be examined separately and reverted if not acceptable. But as it's really simple, I recommend including it. Its usage might not be obvious though; should we include example code in the docs?

New feature:
* Row processor can return 2, then PQisBusy() does an early exit. Supported only when the connection is in non-blocking mode.

Cleanups:
* Row data is tagged as processed when rowProcessor is called, so exceptions will skip the row. This simplifies non-exceptional handling as well.
* Converted 'return EOF' in the V3 getAnotherTuple() to set an error instead. AFAICS those EOFs are remnants from the V2 getAnotherTuple(), not something that is coded deliberately. Note that when the V3 getAnotherTuple() is called, the row packet is fully in the buffer. The EOF on a broken packet to signify 'give me more data' does not result in any useful behaviour; instead you can simply get into an infinite loop.

Fix bugs in my previous changes:
* Split the OOM error handling from custom error message handling; previously the error message in the PGresult was lost because the OOM logic freed the PGresult early.
* Drop unused goto label from experimental debug code.
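The three outcomes described above can be illustrated with a toy driver loop. This is a Python sketch with invented names, not libpq's actual C API: 0 signals an error whose message the callback has set, 1 means the row was consumed, and 2 triggers the early exit that makes the PQisBusy() equivalent return so the caller can drain the stashed row before pumping again.

```python
ROW_ERROR, ROW_OK, ROW_SUSPEND = 0, 1, 2

def pump_rows(rows, processor):
    """Toy analogue of getAnotherTuple(): feed rows to a callback and
    translate its return code into a driver state plus unconsumed rows."""
    for i, row in enumerate(rows):
        rc = processor(row)
        if rc == ROW_ERROR:
            return ("error", rows[i + 1:])  # error message set by the callback
        if rc == ROW_SUSPEND:
            return ("busy", rows[i + 1:])   # early exit; row already stashed
    return ("done", [])

stash = []
def stash_one(row):
    stash.append(row)   # caller must clear the stash before pumping again
    return ROW_SUSPEND  # hand control back to the caller after every row

state, remaining = pump_rows(["r1", "r2", "r3"], stash_one)
print(state, stash, remaining)
```

Run once, the driver returns "busy" with one row stashed and two rows still pending, which is exactly the resultset-iterator shape Marko wants for high-level languages that signal errors by return value rather than by exception.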
-- marko

*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** 160,162 ****
  PQconnectStartParams      157
--- 160,164 ----
  PQping                    158
  PQpingParams              159
  PQlibVersion              160
+ PQsetRowProcessor         161
+ PQsetRowProcessorErrMsg   162

*** a/src/interfaces/libpq/fe-connect.c
--- b/src/interfaces/libpq/fe-connect.c
*** 2693,2698 makeEmptyPGconn(void)
--- 2693,2701 ----
  	conn->wait_ssl_try = false;
  #endif

+ 	/* set default row processor */
+ 	PQsetRowProcessor(conn, NULL, NULL);
+
  	/*
  	 * We try to send at least 8K at a time, which is the usual size of pipe
  	 * buffers on Unix systems.  That way, when we are sending a large amount
*** 2711,2718 makeEmptyPGconn(void)
--- 2714,2726 ----
  	initPQExpBuffer(&conn->errorMessage);
  	initPQExpBuffer(&conn->workBuffer);

+ 	/* set up initial row buffer */
+ 	conn->rowBufLen = 32;
+ 	conn->rowBuf = (PGrowValue *) malloc(conn->rowBufLen * sizeof(PGrowValue));
+
  	if (conn->inBuffer == NULL ||
  		conn->outBuffer == NULL ||
+ 		conn->rowBuf == NULL ||
  		PQExpBufferBroken(&conn->errorMessage) ||
  		PQExpBufferBroken(&conn->workBuffer))
  	{
*** 2814,2819 freePGconn(PGconn *conn)
Re: [HACKERS] pl/python long-lived allocations in datum-dict transformation
Jan Urbański wulc...@wulczer.org writes: On 12/02/12 00:48, Tom Lane wrote: What's more, it's unclear that it won't malfunction altogether if the function is used recursively (ie, what if PLyDict_FromTuple ends up calling the same function again?) I was a bit worried about that, but the only place where PLyDict_FromTuple calls into some other code is when it calls the type's specific I/O function, which AFAICT can't call back into user code (except for user-defined C I/O routines). It's not very comfortable, but I think PLyDict_FromTuple can be allowed to be non-reentrant. I think that's pretty short-sighted. Even if it's safe today (which I am not 100% convinced of), there are plenty of foreseeable reasons why it might^Wwill break in the future.

* There is no reason to think that datatype I/O functions will never be written in anything but C. People have asked repeatedly for the ability to write them in higher-level languages. I doubt that would ever be possible in plpgsql, but with languages that can do bit-twiddling like plpython or plperl, it seems possible.

* A datatype I/O function, even if written in C, could call user-written code. See domain_in for example, which can invoke arbitrary processing via domain constraint checking. If you were proposing to patch PLyObject_ToTuple rather than the other direction, this patch would be breakable today. Admittedly the breakage would require some rather contrived coding (your domain's constraint check function does *what*?), but it would still be a security bug.

* Once we have the ability to associate a temp memory context with plpython, there will be a temptation to use it for other purposes besides this one, and it will not be long before such a purpose does open a recursion risk, even if there's none there today.

(Speaking of which, it sure looks to me like PLyObject_ToDatum, PLyObject_ToTuple, etc leak memory like there's no tomorrow.)
OTOH if we want to make it reentrant, some more tinkering would be in order. I think that's in order. regards, tom lane
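The reentrancy hazard Tom describes is easy to reproduce with a static per-function scratch area. The sketch below uses Python as a stand-in for the memory-context machinery, purely for illustration: a scratch buffer shared by all invocations of a function gets clobbered when recursion resets it, while a scratch area created per invocation is safe.

```python
# Unsafe: one scratch buffer shared by all invocations of the function,
# analogous to a single statically allocated per-procedure memory context.
shared_scratch = []

def convert_shared(depth):
    shared_scratch.clear()         # "reset the context" on entry
    shared_scratch.append(depth)
    if depth > 0:
        convert_shared(depth - 1)  # recursion resets the caller's scratch!
    return list(shared_scratch)

# Safe: each invocation gets its own scratch area, analogous to a context
# tied to a particular function *execution*.
def convert_local(depth):
    scratch = [depth]
    if depth > 0:
        convert_local(depth - 1)
    return scratch

print(convert_shared(1))  # the outer call's data was clobbered by the inner one
print(convert_local(1))   # the outer call's data survived
```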
[HACKERS] pg_test_fsync performance
I have heard complaints that /contrib/pg_test_fsync is too slow. I thought it was impossible to speed up pg_test_fsync without reducing its accuracy. However, now that I have some consumer-grade SATA 2 drives, I noticed that the slowness is really in the open_sync test:

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write open_sync sizes.)
 1 * 16kB open_sync write          76.421 ops/sec
 2 *  8kB open_sync writes         38.689 ops/sec
 4 *  4kB open_sync writes         19.140 ops/sec
 8 *  2kB open_sync writes          4.938 ops/sec
16 *  1kB open_sync writes          2.480 ops/sec

These last few lines can take very long, so I developed the attached patch that scales down the number of tests. This makes it more reasonable to run pg_test_fsync. I would like to apply this for PG 9.2. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +

diff --git a/contrib/pg_test_fsync/pg_test_fsync.c b/contrib/pg_test_fsync/pg_test_fsync.c
new file mode 100644
index 042b02b..970deda
*** a/contrib/pg_test_fsync/pg_test_fsync.c
--- b/contrib/pg_test_fsync/pg_test_fsync.c
*** test_open_syncs(void)
*** 358,368 ****
  	printf("(This is designed to compare the cost of writing 16kB\n");
  	printf("in different write open_sync sizes.)\n");
! 	test_open_sync("16kB open_sync write", 16);
! 	test_open_sync(" 8kB open_sync writes", 8);
! 	test_open_sync(" 4kB open_sync writes", 4);
! 	test_open_sync(" 2kB open_sync writes", 2);
! 	test_open_sync(" 1kB open_sync writes", 1);
  }

  /*
--- 358,368 ----
  	printf("(This is designed to compare the cost of writing 16kB\n");
  	printf("in different write open_sync sizes.)\n");
! 	test_open_sync(" 1 * 16kB open_sync write", 16);
! 	test_open_sync(" 2 *  8kB open_sync writes", 8);
! 	test_open_sync(" 4 *  4kB open_sync writes", 4);
! 	test_open_sync(" 8 *  2kB open_sync writes", 2);
!
 	test_open_sync("16 *  1kB open_sync writes", 1);
  }

  /*
*** test_open_sync(const char *msg, int writ
*** 376,381 ****
--- 376,385 ----
  	ops, writes;
  #endif

+ 	int		save_ops_per_test = ops_per_test;
+
+ 	/* This test can be long, so scale down the number of tests */
+ 	ops_per_test = ops_per_test * writes_size / 16;

  	printf(LABEL_FORMAT, msg);
  	fflush(stdout);
*** test_open_sync(const char *msg, int writ
*** 402,407 ****
--- 406,412 ----
  #else
  	printf(NA_FORMAT, "n/a\n");
  #endif
+ 	ops_per_test = save_ops_per_test;
  }

  static void
Re: [HACKERS] pg_test_fsync performance
On Mon, Feb 13, 2012 at 7:42 PM, Bruce Momjian br...@momjian.us wrote:
> I have heard complaints that /contrib/pg_test_fsync is too slow. I thought it was impossible to speed up pg_test_fsync without reducing its accuracy. However, now that I have some consumer-grade SATA 2 drives, I noticed that the slowness is really in the open_sync test:
>
> Compare open_sync with different write sizes:
> (This is designed to compare the cost of writing 16kB in different write open_sync sizes.)
>          1 * 16kB open_sync write          76.421 ops/sec
>          2 *  8kB open_sync writes         38.689 ops/sec
>          4 *  4kB open_sync writes         19.140 ops/sec
>          8 *  2kB open_sync writes          4.938 ops/sec
>         16 *  1kB open_sync writes          2.480 ops/sec
>
> These last few lines can take very long, so I developed the attached patch that scales down the number of tests. This makes it more reasonable to run pg_test_fsync. I would like to apply this for PG 9.2.

On my MacOS X, it's fsync_writethrough that's insanely slow:

[rhaas pg_test_fsync]$ ./pg_test_fsync
2000 operations per test
Direct I/O is not supported on this platform.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                3523.267 ops/sec
        fdatasync                    3360.023 ops/sec
        fsync                        2410.048 ops/sec
        fsync_writethrough             12.576 ops/sec
        open_sync                    3649.475 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                1885.284 ops/sec
        fdatasync                    2544.652 ops/sec
        fsync                        3241.218 ops/sec
        fsync_writethrough ^C

Instead of, or in addition to, a fixed number of operations per test, maybe we should cut off each test after a certain amount of wall-clock time, like 15 seconds. It's kind of insane to run one of these tests for 3 minutes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] pg_test_fsync performance
Robert Haas robertmh...@gmail.com writes:
> Instead of, or in addition to, a fixed number of operations per test, maybe we should cut off each test after a certain amount of wall-clock time, like 15 seconds.

+1, I was about to suggest the same thing. Running any of these tests for a fixed number of iterations will result in drastic degradation of accuracy as soon as the machine's behavior changes noticeably from what you were expecting. Run them for a fixed time period instead. Or maybe do a few, then check elapsed time and estimate a number of iterations to use, if you're worried about the cost of doing gettimeofday after each write.

			regards, tom lane
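[Editor's note: Tom's calibration idea — time a short sample, then extrapolate the iteration count for the target duration — can be sketched as follows. This is a minimal illustration with invented names, not PostgreSQL code; it avoids a gettimeofday() call after every write by sampling once up front.]

```c
#include <sys/time.h>

/* Seconds elapsed between two gettimeofday() readings. */
static double
elapsed_secs(struct timeval start, struct timeval stop)
{
	return (stop.tv_sec - start.tv_sec) +
		   (stop.tv_usec - start.tv_usec) / 1000000.0;
}

/* Extrapolate how many operations fit in target_secs, given that
 * sample_ops operations took sample_secs. */
static long
estimate_iterations(long sample_ops, double sample_secs, double target_secs)
{
	if (sample_secs <= 0.0)
		return sample_ops;		/* sample too fast to measure; punt */
	return (long) (sample_ops * target_secs / sample_secs);
}
```

For example, if 10 sample writes take half a second, a 15-second test should run roughly 300 iterations.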
Re: [HACKERS] pg_test_fsync performance
On Mon, Feb 13, 2012 at 08:28:03PM -0500, Tom Lane wrote:
> Robert Haas robertmh...@gmail.com writes:
> > Instead of, or in addition to, a fixed number of operations per test, maybe we should cut off each test after a certain amount of wall-clock time, like 15 seconds.
>
> +1, I was about to suggest the same thing. Running any of these tests for a fixed number of iterations will result in drastic degradation of accuracy as soon as the machine's behavior changes noticeably from what you were expecting. Run them for a fixed time period instead. Or maybe do a few, then check elapsed time and estimate a number of iterations to use, if you're worried about the cost of doing gettimeofday after each write.

Good idea, and it worked out very well. I changed the -o loops parameter to -s seconds, which calls alarm() after (default) 2 seconds, and then once the operation completes, computes a duration per operation. The test now runs in 30 seconds and produces similar output to the longer version.

--
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB                           http://enterprisedb.com

  + It's impossible for everything to be true. +

diff --git a/contrib/pg_test_fsync/pg_test_fsync.c b/contrib/pg_test_fsync/pg_test_fsync.c
new file mode 100644
index 3fcb087..8554426
*** a/contrib/pg_test_fsync/pg_test_fsync.c
--- b/contrib/pg_test_fsync/pg_test_fsync.c
***
*** 27,41 ****
  #define NA_FORMAT		"%18s"
  #define OPS_FORMAT		"%9.3f ops/sec"

  static const char *progname;

! static int	ops_per_test = 2000;
  static int	needs_unlink = 0;
  static char full_buf[XLOG_SEG_SIZE], *buf, *filename = FSYNC_FILENAME;
  static struct timeval start_t, stop_t;

  static void handle_args(int argc, char *argv[]);
--- 27,50 ----
  #define NA_FORMAT		"%18s"
  #define OPS_FORMAT		"%9.3f ops/sec"

+ #define START_TIMER \
+ do { \
+ 	alarm_triggered = false; \
+ 	alarm(secs_per_test); \
+ 	gettimeofday(&start_t, NULL); \
+ } while (0)
+
  static const char *progname;

! static int	secs_per_test = 2;
  static int	needs_unlink = 0;
  static char full_buf[XLOG_SEG_SIZE], *buf, *filename = FSYNC_FILENAME;
  static struct timeval start_t, stop_t;
+ static bool alarm_triggered = false;

  static void handle_args(int argc, char *argv[]);
*** static void test_sync(int writes_per_op)
*** 46,57 ****
  static void test_open_syncs(void);
  static void test_open_sync(const char *msg, int writes_size);
  static void test_file_descriptor_sync(void);
  static void signal_cleanup(int sig);

  #ifdef HAVE_FSYNC_WRITETHROUGH
  static int	pg_fsync_writethrough(int fd);
  #endif
! static void print_elapse(struct timeval start_t, struct timeval stop_t);
  static void die(const char *str);
--- 55,67 ----
  static void test_open_syncs(void);
  static void test_open_sync(const char *msg, int writes_size);
  static void test_file_descriptor_sync(void);
+ static void process_alarm(int sig);
  static void signal_cleanup(int sig);

  #ifdef HAVE_FSYNC_WRITETHROUGH
  static int	pg_fsync_writethrough(int fd);
  #endif
! static void print_elapse(struct timeval start_t, struct timeval stop_t, int ops);
  static void die(const char *str);
*** main(int argc, char *argv[])
*** 65,70 ****
--- 75,81 ----
  	/* Prevent leaving behind the test file */
  	signal(SIGINT, signal_cleanup);
  	signal(SIGTERM, signal_cleanup);
+ 	signal(SIGALRM, process_alarm);
  #ifdef SIGHUP
  	/* Not defined on win32 */
  	signal(SIGHUP, signal_cleanup);
*** handle_args(int argc, char *argv[])
*** 96,102 ****
  {
  	static struct option long_options[] = {
  		{"filename", required_argument, NULL, 'f'},
! 		{"ops-per-test", required_argument, NULL, 'o'},
  		{NULL, 0, NULL, 0}
  	};
  	int			option;			/* Command line option */
--- 107,113 ----
  {
  	static struct option long_options[] = {
  		{"filename", required_argument, NULL, 'f'},
! 		{"secs-per-test", required_argument, NULL, 's'},
  		{NULL, 0, NULL, 0}
  	};
  	int			option;			/* Command line option */
*** handle_args(int argc, char *argv[])
*** 117,123 ****
  		}
  	}

! 	while ((option = getopt_long(argc, argv, "f:o:", long_options, &optindex)) != -1)
  	{
  		switch (option)
--- 128,134 ----
  		}
  	}

! 	while ((option = getopt_long(argc, argv, "f:s:", long_options, &optindex)) != -1)
  	{
  		switch (option)
*** handle_args(int argc, char *argv[])
*** 126,133 ****
  				filename = strdup(optarg);
  				break;

! 			case 'o':
! 				ops_per_test = atoi(optarg);
  				break;

  			default:
--- 137,144 ----
  				filename = strdup(optarg);
  				break;

! 			case 's':
! 				secs_per_test = atoi(optarg);
  				break;

  			default:
*** handle_args(int argc, char *argv[])
*** 148,154 ****
  		exit(1);
  	}

! 	printf("%d operations per test\n", ops_per_test);
  #if PG_O_DIRECT != 0
  	printf("O_DIRECT supported on this
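[Editor's note: the alarm-driven loop the patch adopts can be shown as a standalone sketch — run operations until SIGALRM flips a flag, then report how many completed. Variable names echo the patch (`alarm_triggered`, `process_alarm`, `secs_per_test`), but this is an illustration, not the pg_test_fsync source.]

```c
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t alarm_triggered = 0;

/* SIGALRM handler: just set a flag; the test loop polls it. */
static void
process_alarm(int sig)
{
	(void) sig;
	alarm_triggered = 1;
}

/* Run "operations" for secs_per_test wall-clock seconds and return the
 * count; the caller divides by the elapsed time to get ops/sec. */
static long
run_timed_test(unsigned int secs_per_test)
{
	long		ops = 0;

	signal(SIGALRM, process_alarm);
	alarm_triggered = 0;
	alarm(secs_per_test);
	while (!alarm_triggered)
		ops++;					/* in the real tool: one write + sync */
	return ops;
}
```

This is why the total runtime becomes predictable: every test takes the same wall-clock budget regardless of how slow the sync method is.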
[HACKERS] SSI rw-conflicts and 2PC
Looking over the SSI 2PC code recently, I noticed that I overlooked a case that could lead to non-serializable behavior after a crash. When we PREPARE a serializable transaction, we store part of the SERIALIZABLEXACT in the statefile (in addition to the list of SIREAD locks). One of the pieces of information we record is whether the transaction had any conflicts in or out. The problem is that that can change if a new conflict occurs after the transaction has prepared. Here's an example of the problem (based on the receipt-report test):

-- Setup
CREATE TABLE ctl (k text NOT NULL PRIMARY KEY, deposit_date date NOT NULL);
INSERT INTO ctl VALUES ('receipt', DATE '2008-12-22');
CREATE TABLE receipt (receipt_no int NOT NULL PRIMARY KEY, deposit_date date NOT NULL, amount numeric(13,2));

-- T2
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO receipt VALUES (3, (SELECT deposit_date FROM ctl WHERE k = 'receipt'), 4.00);
PREPARE TRANSACTION 't2';

-- T3
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE ctl SET deposit_date = DATE '2008-12-23' WHERE k = 'receipt';
COMMIT;

-- T1
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM ctl WHERE k = 'receipt';
SELECT * FROM receipt WHERE deposit_date = DATE '2008-12-22';
COMMIT;

Running this sequence of transactions normally, T1 will be rolled back because of the pattern of conflicts T1 -> T2 -> T3, as we'd expect. This should still be true even if we restart the database before executing the last transaction -- but it's not. The problem is that, when T2 prepared, it had no conflicts, so we recorded that in the statefile. The T2 -> T3 conflict happened later, so we didn't know about it during recovery.

I discussed this a bit with Kevin and we agreed that this is important to fix, since it's a false negative that violates serializability. The question is how to fix it. There are a couple of options...

The easiest answer would be to just treat every prepared transaction found during recovery as though it had a conflict in and out.
This is roughly a one-line change, and it's certainly safe. But the downside is that this is pretty restrictive: after recovery, we'd have to abort any serializable transaction that tries to read anything that a prepared transaction wrote, or modify anything that it read, until that transaction is either committed or rolled back.

To do better than that, we want to know accurately whether the prepared transaction had a conflict with a transaction that prepared or committed before the crash. We could do this if we had a way to append a record to the 2PC statefile of an already-prepared transaction -- then we'd just add a new record indicating the conflict. Of course, we don't have a way to do that. It'd be tricky to add support for this, since it has to be crash-safe, so the question is whether the improved precision justifies the complexity it would require.

A third option is to observe that the only conflicts *in* that matter from a recovered prepared transaction are from other prepared transactions. So we could have prepared transactions include in their statefile the xids of any prepared transactions they conflicted with at prepare time, and match them up during recovery to reconstruct the graph. This is a middle ground between the other two options. It doesn't require modifying the statefile after prepare. However, conflicts *out* to non-prepared transactions do matter, and this doesn't record those, so we'd have to do the conservative thing -- which means that after recovery, no one can read anything a prepared transaction wrote.

I thought I'd throw these options out there to see which ones people think are reasonable (or any better ideas). Of the three, I think the first (simplest) solution is the only one we could plausibly backpatch to 9.1. But if the extra aborts after recovery seem too expensive, we may want to consider one of the other options for future releases.

Dan

--
Dan R. K. Ports    MIT CSAIL    http://drkp.net/
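[Editor's note: the first (conservative) option can be modeled in a few lines. This is an illustrative sketch with invented struct and function names — not the actual predicate.c or twophase.c data structures: at recovery, the flags recorded at PREPARE time are ignored and both conflict directions are assumed, because conflicts may have arisen after the statefile was written.]

```c
#include <stdbool.h>

/* Toy model of the conflict flags a statefile records at PREPARE time. */
typedef struct TwoPhasePredicateRecord
{
	bool		flagged_conflict_in;
	bool		flagged_conflict_out;
} TwoPhasePredicateRecord;

typedef struct RecoveredSerializableXact
{
	bool		conflict_in;
	bool		conflict_out;
} RecoveredSerializableXact;

/* Option 1: treat every recovered prepared xact as conflicted both ways,
 * regardless of what was recorded -- safe but restrictive. */
static RecoveredSerializableXact
recover_prepared_xact(const TwoPhasePredicateRecord *rec)
{
	RecoveredSerializableXact x;

	(void) rec;					/* recorded flags deliberately ignored */
	x.conflict_in = true;
	x.conflict_out = true;
	return x;
}
```

In the T2 example above, T2's statefile recorded no conflicts, yet the T2 -> T3 conflict existed by crash time; assuming both flags makes recovery correct at the cost of extra aborts.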
Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)
On Mon, Feb 13, 2012 at 7:51 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
> I rebased the patch due to the updates of pg_proc.h. Please see the newer one. Thanks,

Thanks, committed.

I think, though, that some further adjustment is needed here, because you currently can't do ALTER FUNCTION ... NO LEAKPROOF, which seems unacceptable. It's fairly clear why not, though: you get a grammar conflict, because the parser allows this:

create or replace function z() returns int as $$select 1$$ language sql set transaction not deferrable;

However, since that syntax doesn't actually work, I'm thinking we could just refactor things a bit to reject that at the parser stage. The attached patch adopts that approach. Anyone have a better idea?

I also think we ought to stick create_function_3 into one of the parallel groups in the regression tests, if possible. Can you investigate that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

not-leakproof.patch
Description: Binary data
Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server
(2012/02/13 20:50), Etsuro Fujita wrote:
> The patches have been applied, but role-related regression tests failed in my environment. I fixed it in a similar fashion to /src/test/regress/sql/foreign_data.sql. Please find attached an updated patch for the regression tests.

Good catch, thanks. I'll revise the pgsql_fdw tests a little more.

> BTW, what do you think about this?
> http://archives.postgresql.org/pgsql-hackers/2012-01/msg00229.php

I'm sorry that I've left the thread unfinished... I've given up on proposing join push-down of foreign tables for 9.2, because it will take a while to achieve general semantics mapping for join push-down and WHERE clause push-down. For 9.2, I'm proposing pgsql_fdw, which has WHERE clause push-down for built-in elements that are free from collation. I'd like to go back to that item after 9.2 development enters beta or RC, hopefully :)

--
Shigeru Hanada
Re: [HACKERS] Measuring relation free space
On Wed, Jan 25, 2012 at 9:47 PM, Noah Misch n...@leadboat.com wrote:
> With all that done, run some quick benchmarks: see how SELECT free_percent FROM pgstattuple(rel) fares compared to SELECT relation_free_space(rel) for a large heap and for a large B-tree index. If the timing difference is too small to be interesting to you, remove relation_free_space() and submit your pgstattuple() improvements alone. Otherwise, submit as written.

Ok. I split this in three patches.

1) pgstattuple-gin_spgist.patch
This first patch adds gin and spgist support to pgstattuple, and also makes pgstattuple use a ring buffer when reading tables or indexes.

2) pgstattuple-relation_free_space.patch
This patch adds the relation_free_space function to pgstattuple. The relation_free_space() function is faster than pgstattuple(). To test that, I initialized pgbench with a scale of 40. In that context pgstattuple() took 1.4s to process the pgbench_accounts table and relation_free_space() took 730ms (half the time!). For the index the difference is less noticeable: 170ms for the former and 150ms for the latter.

3) pgstattuple-stats_target.patch
This patch adds a stats_target parameter to the relation_free_space() function. It mimics the way ANALYZE chooses the blocks to read and is faster than plain relation_free_space(), but of course it could be inexact if the pages that we don't read are the ones with more free space.

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index beff1b9..9f2ec1f 100644
*** a/contrib/pgstattuple/pgstatindex.c
--- b/contrib/pgstattuple/pgstatindex.c
*** pgstatindex(PG_FUNCTION_ARGS)
*** 95,100 ****
--- 95,101 ----
  	BlockNumber nblocks;
  	BlockNumber blkno;
  	BTIndexStat indexStat;
+ 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);

  	if (!superuser())
  		ereport(ERROR,
*** pgstatindex(PG_FUNCTION_ARGS)
*** 122,128 ****
  	 * Read metapage
  	 */
  	{
! 		Buffer		buffer = ReadBuffer(rel, 0);
  		Page		page = BufferGetPage(buffer);
  		BTMetaPageData *metad = BTPageGetMeta(page);
--- 123,129 ----
  	 * Read metapage
  	 */
  	{
! 		Buffer		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, 0, RBM_NORMAL, bstrategy);
  		Page		page = BufferGetPage(buffer);
  		BTMetaPageData *metad = BTPageGetMeta(page);
*** pgstatindex(PG_FUNCTION_ARGS)
*** 159,165 ****
  		CHECK_FOR_INTERRUPTS();

  		/* Read and lock buffer */
! 		buffer = ReadBuffer(rel, blkno);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  		page = BufferGetPage(buffer);
--- 160,166 ----
  		CHECK_FOR_INTERRUPTS();

  		/* Read and lock buffer */
! 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  		page = BufferGetPage(buffer);
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index e5ddd87..6bbc957 100644
*** a/contrib/pgstattuple/pgstattuple.c
--- b/contrib/pgstattuple/pgstattuple.c
***
*** 24,33 ****
--- 24,35 ----
  #include "postgres.h"

+ #include "access/gin_private.h"
  #include "access/gist_private.h"
  #include "access/hash.h"
  #include "access/nbtree.h"
  #include "access/relscan.h"
+ #include "access/spgist_private.h"
  #include "catalog/namespace.h"
  #include "funcapi.h"
  #include "miscadmin.h"
*** static void pgstat_hash_page(pgstattuple
*** 73,83 ****
  			Relation rel, BlockNumber blkno);
  static void pgstat_gist_page(pgstattuple_type *stat,
  			Relation rel, BlockNumber blkno);
+ static void pgstat_gin_page(pgstattuple_type *stat,
+ 			Relation rel, BlockNumber blkno);
+ static void pgstat_spgist_page(pgstattuple_type *stat,
+ 			Relation rel, BlockNumber blkno);
  static Datum pgstat_index(Relation rel, BlockNumber start,
  			pgstat_page pagefn, FunctionCallInfo fcinfo);
  static void pgstat_index_page(pgstattuple_type *stat, Page page,
  			OffsetNumber minoff, OffsetNumber maxoff);
+
+ /*
+  * Buffer access strategy for reading relations, it's simpler to keep it
+  * global because pgstat_*_page() functions read one buffer at a time.
+  * pgstat_heap() and pgstat_index() should initialize it before use.
+  */
+ BufferAccessStrategy bstrategy;
+
  /*
   * build_pgstattuple_type -- build a pgstattuple_type tuple
   */
*** pgstat_relation(Relation rel, FunctionCa
*** 229,235 ****
  			return pgstat_index(rel, GIST_ROOT_BLKNO + 1,
  								pgstat_gist_page, fcinfo);
  		case GIN_AM_OID:
! 			err = "gin index";
  			break;
  		default:
  			err = "unknown index";
--- 242,253 ----
  			return pgstat_index(rel, GIST_ROOT_BLKNO + 1,
  								pgstat_gist_page, fcinfo);
  		case GIN_AM_OID:
! 			return pgstat_index(rel, GIN_METAPAGE_BLKNO + 1,
! 								pgstat_gin_page, fcinfo);
! 			break;
! 		case SPGIST_AM_OID:
! 			return pgstat_index(rel,
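[Editor's note: the trade-off in the stats_target patch — read only a sample of pages and extrapolate — can be illustrated with a toy sketch. The function name is invented; the actual patch samples blocks the way ANALYZE does rather than using this naive scaling.]

```c
/* Toy illustration of sampling-based free-space estimation: measure
 * free bytes on a subset of blocks, then scale up by the fraction of
 * blocks actually read.  Exact only if free space is evenly
 * distributed, which is precisely the caveat noted above. */
static double
extrapolate_free_space(double sampled_free_bytes,
					   long sampled_blocks, long total_blocks)
{
	if (sampled_blocks <= 0)
		return 0.0;
	return sampled_free_bytes * ((double) total_blocks / sampled_blocks);
}
```

For example, finding 1000 free bytes on 10 of 40 blocks estimates 4000 free bytes for the whole relation.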
Re: [HACKERS] pg_stats_recovery view
On Thu, Feb 2, 2012 at 2:32 AM, Magnus Hagander mag...@hagander.net wrote:
> I haven't looked through the code in detail, but one direct comment: do we really need/want to send this through the stats collector? It will only ever have one sender - perhaps we should just either store it in shared memory or store it locally and only send it on demand?

FYI, I intend to send a reworked patch later today; it will store the info locally and send it on demand.

About the _short_desc functions, I added those because I wanted to understand what was happening during recovery, and the wal_record_type (xl_info) being a number is not that clear.

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación