Re: [HACKERS] double writes using double-write buffer approach [WIP]

2012-02-13 Thread Amit Kapila
Dan, I believe your double-write buffer approach is right, as it has the
potential to avoid the latency backends incur from full page writes after a
checkpoint.  Although there is a chance that overall I/O will be higher in
this case, if we can make sure that in most scenarios a backend never has to
do the I/O itself, it can also show a performance improvement compared to
full page writes.

-Original Message-
From: Dan Scales [mailto:sca...@vmware.com] 
Sent: Thursday, February 09, 2012 5:30 AM
To: Amit Kapila
Cc: PG Hackers
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

 Is there any problem if the double-write happens only by the bgwriter or
 checkpointer?
 Something like: whenever a backend process has to evict a buffer, it does the
 same as you have described and writes into the double-write buffer, but the
 bgwriter will check this double-write buffer and flush from it.
 Also, whenever any backend sees that the double-write buffer is more than
 2/3rds (or some threshold value) full, it will tell the bgwriter to flush
 from the double-write buffer.
 This can ensure very little I/O by any backend.

Yes, I think this is a good idea.  I could make changes so that the backends 
hand off the responsibility to flush batches of the double-write buffer to the 
bgwriter whenever possible.  This would avoid some long IO waits in the 
backends, though the backends may of course eventually wait anyways for the 
bgwriter if IO is not fast enough.  I did write the code so that any process 
can write a completed batch if the batch is not currently being flushed (so as 
to deal with crashes by backends).  Having the backends flush the batches as 
they fill them up was just simpler for a first prototype.

Dan

- Original Message -
From: Amit Kapila amit.kap...@huawei.com
To: Dan Scales sca...@vmware.com, PG Hackers 
pgsql-hackers@postgresql.org
Sent: Tuesday, February 7, 2012 1:08:49 AM
Subject: Re: [HACKERS] double writes using double-write buffer approach [WIP]

 I think it is a good idea, and can help double-writes perform better in the 
 case of lots of backend evictions.
  I don't understand this point, because from the data in your mail, it
appears that when shared_buffers is smaller (which means more evictions can
happen), the performance is lower.

ISTM that the performance is lower when the shared_buffers size is smaller,
because I/O might be done by the backend process, which can degrade
performance.
Is there any problem if the double-write happens only by the bgwriter or
checkpointer?
Something like: whenever a backend process has to evict a buffer, it does the
same as you have described and writes into the double-write buffer, but the
bgwriter will check this double-write buffer and flush from it.
Also, whenever any backend sees that the double-write buffer is more than
2/3rds (or some threshold value) full, it will tell the bgwriter to flush
from the double-write buffer.
This can ensure very little I/O by any backend.


-Original Message-
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dan Scales
Sent: Saturday, January 28, 2012 4:02 AM
To: PG Hackers
Subject: [HACKERS] double writes using double-write buffer approach [WIP]

I've been prototyping the double-write buffer idea that Heikki and Simon had 
proposed (as an alternative to a previous patch that only batched up writes by 
the checkpointer).  I think it is a good idea, and can help double-writes 
perform better in the case of lots of backend evictions.
It also centralizes most of the code change in smgr.c.  However, it is trickier 
to reason about.

The idea is that all page writes are generally copied to a double-write buffer,
rather than being written immediately.  Note that a full copy of the page is
required, but this can be folded in with a checksum calculation.
Periodically (e.g. every time a certain-size batch of writes has been added),
some writes are pushed out using double writes -- the pages are first written
and fsynced to a double-write file, then written to the data files, which are
then fsynced.  The double writes then allow torn pages to be fixed, so
full_page_writes can be turned off (thus greatly reducing the size of the WAL
log).

The key changes are conceptually simple:

1.  In smgrwrite(), copy the page to the double-write buffer.  If a big
enough batch has accumulated, then flush the batch using double
writes.  [I don't think I need to intercept calls to smgrextend(),
but I am not totally sure.]

2.  In smgrread(), always look first in the double-write buffer for a
particular page, before going to disk.

3.  At the end of a checkpoint and on shutdown, always make sure that the
current contents of the double-write buffer are flushed.

4.  Pass flags around in some cases to indicate whether a page buffer
needs a double write or not.  (I think eventually this would be an
attribute of the buffer, set when the page is WAL-logged, rather than
a flag passed around.)

5. 
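
For illustration, here is a minimal, self-contained C sketch of the ordering
the points above rely on.  All names (dw_add_page, dw_flush_batch, the batch
size) are hypothetical and the structures are greatly simplified; this is not
the patch's code, only the copy-into-buffer, fsync-the-double-write-file,
then write-and-fsync-the-data-files sequence described above.

/*
 * Toy sketch of the double-write ordering described above; hypothetical
 * names and structures, not the actual patch.
 */
#include <sys/types.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define PAGE_SIZE   8192
#define BATCH_PAGES 32

typedef struct
{
    int     datafd;                /* data file this page belongs to */
    off_t   offset;                /* byte offset of the page */
    char    image[PAGE_SIZE];      /* full copy taken at smgrwrite() time */
} DWEntry;

static DWEntry dw_buffer[BATCH_PAGES];
static int     dw_count = 0;
static int     dw_fd = -1;         /* the double-write file */

/* Point 1: instead of writing immediately, copy the page into the buffer. */
static void
dw_add_page(int datafd, off_t offset, const char *page)
{
    DWEntry *e = &dw_buffer[dw_count++];

    e->datafd = datafd;
    e->offset = offset;
    memcpy(e->image, page, PAGE_SIZE);  /* a checksum could be folded in here */
}

/*
 * Flush a batch: write and fsync the double-write file first, then write the
 * pages to their data files and fsync those.  A torn data-file write can then
 * be repaired from the double-write file during recovery.
 */
static void
dw_flush_batch(void)
{
    int i;

    for (i = 0; i < dw_count; i++)
        pwrite(dw_fd, dw_buffer[i].image, PAGE_SIZE, (off_t) i * PAGE_SIZE);
    fsync(dw_fd);

    for (i = 0; i < dw_count; i++)
        pwrite(dw_buffer[i].datafd, dw_buffer[i].image, PAGE_SIZE,
               dw_buffer[i].offset);
    for (i = 0; i < dw_count; i++)
        fsync(dw_buffer[i].datafd);

    dw_count = 0;
}

int
main(void)
{
    char page[PAGE_SIZE] = "example page contents";
    int  datafd = open("datafile.bin", O_CREAT | O_RDWR, 0600);

    dw_fd = open("double-write.bin", O_CREAT | O_RDWR, 0600);

    dw_add_page(datafd, 0, page);
    dw_flush_batch();              /* a real batch would wait for BATCH_PAGES */

    close(datafd);
    close(dw_fd);
    return 0;
}

A read path matching point 2 would simply search dw_buffer for the requested
page before going to disk.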

Re: [HACKERS] Checkpoint sync pause

2012-02-13 Thread Amit Kapila
 Without sorted checkpoints (or some other fancier method) you have to
 write out the entire pool before you can do any fsyncs.  Or you have
 to do multiple fsyncs of the same file, with at least one occurring
 after the entire pool was written.  With a sorted checkpoint, you can
 start issuing once-only fsyncs very early in the checkpoint process.
 I think that on large servers, that would be the main benefit, not the
 actually more efficient IO.  (On small servers I've seen sorted
 checkpoints be much faster on shutdown checkpoints, but not on natural
 checkpoints, and presumably this improvement *is* due to better
 ordering).

 On your servers, you need big delays between fsyncs and not between
 writes (as they are buffered until the fsync).  But in other
 situations, people need the delays between the writes.  By using
 sorted checkpoints with fsyncs between each file, the delays between
 writes are naturally delays between fsyncs as well.  So I think the
 benefit of using sorted checkpoints is that code to improve your
 situations is less likely to degrade someone else's situation, without
 having to introduce an extra layer of tunables.

What I understood is that you are suggesting it is better to do sorted
checkpoints, which essentially means flushing nearby buffers together.
However, if it is done this way, it might violate an Oracle patent
(20050044311 - Reducing disk IO by full-cache write-merging).  I am not very
sure about it, but you may want to look into it.
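
For what it's worth, here is a tiny, self-contained sketch of the
sorted-checkpoint idea being discussed: sort the dirty pages by (file, block)
and issue each file's single fsync as soon as that file's pages have been
written, instead of fsyncing everything at the end.  The structures and names
are made up for illustration; this is not the checkpointer code.

/*
 * Toy illustration of a sorted checkpoint; hypothetical structures.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int fileno;     /* which data file the dirty page lives in */
    int blockno;    /* block number within that file */
} DirtyPage;

static int
cmp_pages(const void *a, const void *b)
{
    const DirtyPage *pa = a;
    const DirtyPage *pb = b;

    if (pa->fileno != pb->fileno)
        return pa->fileno - pb->fileno;
    return pa->blockno - pb->blockno;
}

int
main(void)
{
    DirtyPage dirty[] = {
        {2, 10}, {1, 7}, {2, 3}, {1, 1}, {3, 5}, {1, 9}
    };
    int n = sizeof(dirty) / sizeof(dirty[0]);
    int i;

    qsort(dirty, n, sizeof(DirtyPage), cmp_pages);

    for (i = 0; i < n; i++)
    {
        printf("write file %d block %d\n", dirty[i].fileno, dirty[i].blockno);

        /* once a file's last page has been written, fsync it right away */
        if (i + 1 == n || dirty[i + 1].fileno != dirty[i].fileno)
            printf("fsync file %d (issued early, not at the very end)\n",
                   dirty[i].fileno);
    }
    return 0;
}

Because the writes for one file are grouped together, a delay inserted between
files is automatically a delay between fsyncs as well, which is the point made
in the quoted text above.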

 I think the linked list is a bit of a red herring.  Many of the
 concepts people discuss implementing on the linked list could just as
 easily be implemented with the clock sweep.  And I've seen no evidence
 at all that the clock sweep is the problem.  The LWLock that protects
 can obviously be a problem, but that seems to be due to the overhead
 of acquiring a contended lock, not the work done under the lock.
 Reducing the lock-strength around this might be a good idea, but that
 reduction could be done just as easily (and as far as I can tell, more
 easily) with the clock sweep than the linked list.

With the clock sweep, there is a good chance that a backend needs to traverse
further to find a suitable buffer.
However, if a clean buffer is put on the freelist, it can be picked directly
from there.
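
To make the difference concrete, here is a toy model of the two allocation
paths (made-up structures, not PostgreSQL's buffer manager): the clock sweep
may have to visit and decrement several buffers before it finds a victim,
while a freelist hit hands back a known-clean buffer in a single step.

/*
 * Toy model of victim-buffer selection; not PostgreSQL code.
 */
#include <stdio.h>

#define NBUFFERS 8

typedef struct
{
    int usage_count;    /* bumped on access, decayed by the sweep */
    int pinned;         /* in use, cannot be evicted */
} Buffer;

static Buffer buffers[NBUFFERS] = {
    {3, 0}, {1, 1}, {2, 0}, {0, 0}, {4, 0}, {1, 0}, {0, 1}, {2, 0}
};
static int clock_hand = 0;

/* freelist of known-clean, unused buffer ids (simplified as a stack) */
static int freelist[NBUFFERS] = {3};
static int freelist_len = 1;

/*
 * Clock sweep: keep advancing and decrementing usage counts until an
 * unpinned buffer with usage_count == 0 is found.
 */
static int
clocksweep_victim(int *visited)
{
    *visited = 0;
    for (;;)
    {
        Buffer *b = &buffers[clock_hand];
        int     id = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        (*visited)++;

        if (!b->pinned)
        {
            if (b->usage_count == 0)
                return id;
            b->usage_count--;       /* give it another chance */
        }
    }
}

static int
freelist_victim(void)
{
    return freelist_len > 0 ? freelist[--freelist_len] : -1;
}

int
main(void)
{
    int visited;
    int v1 = freelist_victim();
    int v2 = clocksweep_victim(&visited);

    printf("freelist pick: buffer %d (one step)\n", v1);
    printf("clock sweep pick: buffer %d after visiting %d buffers\n",
           v2, visited);
    return 0;
}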

-Original Message-
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Jeff Janes
Sent: Monday, February 13, 2012 12:14 AM
To: Greg Smith
Cc: Robert Haas; PostgreSQL-development
Subject: Re: [HACKERS] Checkpoint sync pause

On Tue, Feb 7, 2012 at 1:22 PM, Greg Smith gsm...@gregsmith.com wrote:
 On 02/03/2012 11:41 PM, Jeff Janes wrote:

 -The steady stream of backend writes that happen between checkpoints have
 filled up most of the OS write cache.  A look at /proc/meminfo shows around
 2.5GB Dirty:

 backend writes includes bgwriter writes, right?


 Right.


 Has using a newer kernal with dirty_background_bytes been tried, so it
 could be set to a lower level?  If so, how did it do?  Or does it just
 refuse to obey below the 5% level, as well?


 Trying to dip below 5% using dirty_background_bytes slows VACUUM down faster
 than it improves checkpoint latency.

Does it cause VACUUM to create latency for other processes (like the
checkpoint syncs do, by gumming up the IO for everyone) or does VACUUM
just slow down without affecting other tasks?

It seems to me that just lowering dirty_background_bytes (while not
also lowering dirty_bytes) should not cause the latter to happen, but
it seems like these kernel tunables never do exactly what they
advertise.

This may not be relevant to the current situation, but I wonder if we
don't need a vacuum_cost_page_dirty_seq so that if the pages we are
dirtying are consecutive (or at least closely spaced) they cost less,
in anticipation that the eventual writes will be combined and thus
consume less IO resources.  I would think it would be common for some
regions of table to be intensely dirtied, and some to be lightly
dirtied (but still aggregating up to a considerable amount of random
IO).   But the vacuum process might also need to be made more "bursty",
as even if it generates sequential dirty pages, the IO system might write
them randomly anyway if there are too many delays interspersed.


 Since the sort of servers that have
 checkpoint issues are quite often ones that have VACUUM ones, too, that
 whole path doesn't seem very productive.  The one test I haven't tried yet
 is whether increasing the size of the VACUUM ring buffer might improve how
 well the server responds to a lower write cache.

I wouldn't expect this to help.  It seems like it would hurt, as it
just leaves the data for even longer (however long it takes to
circumnavigate the ring buffer) before there is any possibility of it
getting written.  I guess it does increase the chances that the dirty
pages will accidentally get written by the bgwriter rather than the
vacuum itself, but I doubt that that would be 

Re: [HACKERS] Optimize referential integrity checks (todo item)

2012-02-13 Thread Chetan Suttraway
On Sun, Feb 12, 2012 at 7:36 AM, Vik Reykja vikrey...@gmail.com wrote:

 I decided to take a crack at the todo item created from the following post:
 http://archives.postgresql.org/pgsql-performance/2005-10/msg00458.php

 The attached patch makes the desired changes in both code and function
 naming.

 It seemed quite easy to do but wasn't marked as easy on the todo, so I'm
 wondering if I've missed something.  All regression tests pass.


The patch was not applying; I was seeing the message below:
postgresql$ git apply  /Downloads/unchanged.patch
error: src/backend/utils/adt/ri_triggers.c: already exists in working
directory

I have come up with the attached patch, which hopefully has not missed any
of your changes.
Please verify the changes.

Regards,
Chetan

PS: I would like the patch name to be something meaningful.


-- 
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

 Website: www.enterprisedb.com
EnterpriseDB Blog : http://blogs.enterprisedb.com
Follow us on Twitter : http://www.twitter.com/enterprisedb
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 03a974a..09bacb7 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -205,11 +205,11 @@ static void ri_BuildQueryKeyFull(RI_QueryKey *key,
 static void ri_BuildQueryKeyPkCheck(RI_QueryKey *key,
 		const RI_ConstraintInfo *riinfo,
 		int32 constr_queryno);
-static bool ri_KeysEqual(Relation rel, HeapTuple oldtup, HeapTuple newtup,
+static bool ri_KeysUnchanged(Relation rel, HeapTuple oldtup, HeapTuple newtup,
 			 const RI_ConstraintInfo *riinfo, bool rel_is_pk);
-static bool ri_AllKeysUnequal(Relation rel, HeapTuple oldtup, HeapTuple newtup,
+static bool ri_AllKeysChanged(Relation rel, HeapTuple oldtup, HeapTuple newtup,
   const RI_ConstraintInfo *riinfo, bool rel_is_pk);
-static bool ri_OneKeyEqual(Relation rel, int column,
+static bool ri_OneKeyUnchanged(Relation rel, int column,
 			   HeapTuple oldtup, HeapTuple newtup,
 			   const RI_ConstraintInfo *riinfo, bool rel_is_pk);
 static bool ri_AttributesEqual(Oid eq_opr, Oid typeid,
@@ -932,9 +932,9 @@ RI_FKey_noaction_upd(PG_FUNCTION_ARGS)
 			}
 
 			/*
-			 * No need to check anything if old and new keys are equal
+			 * No need to check anything if old and new keys are unchanged
 			 */
-			if (ri_KeysEqual(pk_rel, old_row, new_row, riinfo, true))
+			if (ri_KeysUnchanged(pk_rel, old_row, new_row, riinfo, true))
 			{
 heap_close(fk_rel, RowShareLock);
 return PointerGetDatum(NULL);
@@ -1281,9 +1281,9 @@ RI_FKey_cascade_upd(PG_FUNCTION_ARGS)
 			}
 
 			/*
-			 * No need to do anything if old and new keys are equal
+			 * No need to do anything if old and new keys are unchanged
 			 */
-			if (ri_KeysEqual(pk_rel, old_row, new_row, riinfo, true))
+			if (ri_KeysUnchanged(pk_rel, old_row, new_row, riinfo, true))
 			{
 heap_close(fk_rel, RowExclusiveLock);
 return PointerGetDatum(NULL);
@@ -1646,9 +1646,9 @@ RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
 			}
 
 			/*
-			 * No need to check anything if old and new keys are equal
+			 * No need to check anything if old and new keys are unchanged
 			 */
-			if (ri_KeysEqual(pk_rel, old_row, new_row, riinfo, true))
+			if (ri_KeysUnchanged(pk_rel, old_row, new_row, riinfo, true))
 			{
 heap_close(fk_rel, RowShareLock);
 return PointerGetDatum(NULL);
@@ -1993,9 +1993,9 @@ RI_FKey_setnull_upd(PG_FUNCTION_ARGS)
 			}
 
 			/*
-			 * No need to do anything if old and new keys are equal
+			 * No need to do anything if old and new keys are unchanged
 			 */
-			if (ri_KeysEqual(pk_rel, old_row, new_row, riinfo, true))
+			if (ri_KeysUnchanged(pk_rel, old_row, new_row, riinfo, true))
 			{
 heap_close(fk_rel, RowExclusiveLock);
 return PointerGetDatum(NULL);
@@ -2012,13 +2012,10 @@ RI_FKey_setnull_upd(PG_FUNCTION_ARGS)
 			 * our cached plan, unless the update happens to change all
 			 * columns in the key.	Fortunately, for the most common case of a
 			 * single-column foreign key, this will be true.
-			 *
-			 * In case you're wondering, the inequality check works because we
-			 * know that the old key value has no NULLs (see above).
 			 */
 
 			use_cached_query = (riinfo.confmatchtype == FKCONSTR_MATCH_FULL) ||
-ri_AllKeysUnequal(pk_rel, old_row, new_row,
+ri_AllKeysChanged(pk_rel, old_row, new_row,
   riinfo, true);
 
 			/*
@@ -2064,7 +2061,7 @@ RI_FKey_setnull_upd(PG_FUNCTION_ARGS)
 	 * to changed columns in pk_rel's key
 	 */
 	if (riinfo.confmatchtype == FKCONSTR_MATCH_FULL ||
-		!ri_OneKeyEqual(pk_rel, i, old_row, new_row,
+		!ri_OneKeyUnchanged(pk_rel, i, old_row, new_row,
 		riinfo, true))
 	{
 		appendStringInfo(querybuf,
@@ -2389,9 +2386,9 @@ RI_FKey_setdefault_upd(PG_FUNCTION_ARGS)
 			}
 
 			/*
-			 * No need to do anything if old and new keys are equal
+			 * No need to do anything if old and new keys are unchanged
 			 */
-			if 

Re: [HACKERS] bitfield and gcc

2012-02-13 Thread Marti Raudsepp
On Sat, Feb 11, 2012 at 01:54, Gaetano Mendola mend...@gmail.com wrote:
 I wonder if somewhere in Postgres source we are relying on the GCC
 correct behaviour regarding the read-modify-write of bitfield in
 structures.

Probably not. I'm pretty sure that we don't have any bitfields, since
not all compilers are happy with them. And it looks like this behavior
doesn't affect other kinds of struct fields.

It sounds like the GCC guys are saying that it's theoretically
possible that the compiler will generate 64-bit read-modify-writes
regardless of the struct member types. In this light, PostgreSQL code
is not correct -- our slock_t uses a char type on i386/AMD64/SPARC
and 32-bit int on IA-64/PPC64. There are plenty of places where it's
adjacent to other small fields.

However, I don't think the latter is a problem with any compilers in
practice, as that would break a lot more code than just btrfs and
Postgres.
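
To visualize the concern, here is a small, purely hypothetical example (the
struct is made up, not taken from the PostgreSQL source): the byte-sized
lock, the flag byte, and the counter all share one naturally aligned word, so
a compiler that implemented an update of the flag byte as a wide
read-modify-write would transiently rewrite the lock byte as well.

/*
 * Hypothetical layout illustrating the concern above; not an actual
 * PostgreSQL struct.  A 32- or 64-bit read-modify-write of 'flags' would
 * also cover the adjacent one-byte 'lock' member, which is only safe if
 * the compiler never generates such wide accesses for narrow members.
 */
#include <stdio.h>
#include <stddef.h>

typedef volatile char slock_t_like;     /* char-sized spinlock, as on x86 */

typedef struct
{
    slock_t_like lock;      /* byte 0 */
    char         flags;     /* byte 1: updated with read-modify-write */
    short        count;     /* bytes 2-3 */
} PackedHeader;

int
main(void)
{
    printf("sizeof(PackedHeader) = %zu\n", sizeof(PackedHeader));
    printf("offsetof lock  = %zu\n", offsetof(PackedHeader, lock));
    printf("offsetof flags = %zu\n", offsetof(PackedHeader, flags));
    printf("offsetof count = %zu\n", offsetof(PackedHeader, count));
    return 0;
}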

Regards,
Marti



Re: [HACKERS] CUDA Sorting

2012-02-13 Thread Kohei KaiGai
2012/2/13 Greg Smith g...@2ndquadrant.com:
 On 02/11/2012 08:14 PM, Gaetano Mendola wrote:

 The trend is to have server capable of running CUDA providing GPU via
 external hardware (PCI Express interface with PCI Express switches), look
 for example at PowerEdge C410x PCIe Expansion Chassis from DELL.


 The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
 enclosure.  That's a completely sensible purchase if your goal is to build a
 computing cluster, where a lot of work is handed off to a set of GPUs.  I
 think that's even less likely to be a cost-effective option for a database
 server.  Adding a single dedicated GPU installed in a server to accelerate
 sorting is something that might be justifiable, based on your benchmarks.
  This is a much more expensive option than that though.  Details at
 http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants
 to see just how big this external box is.


 I did some experiments timing the sort done with CUDA and the sort done
 with pg_qsort:
                       CUDA      pg_qsort
 33 million integers: ~ 900 ms,  ~ 6000 ms
 1 million integers:  ~  21 ms,  ~  162 ms
 100k integers:       ~   2 ms,  ~   13 ms
 The CUDA time already includes the copy operations (host-device, device-host).
 As the GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R)
 Xeon(R) CPU X5650 @ 2.67GHz.


 That's really interesting, and the X5650 is by no means a slow CPU.  So this
 benchmark is providing a lot of CPU power yet still seeing over a 6X speedup
 in sort times.  It sounds like the PCI Express bus has gotten fast enough
 that the time to hand data over and get it back again can easily be
 justified for medium to large sized sorts.

 It would be helpful to take this patch and confirm whether it scales when
 using in parallel.  Easiest way to do that would be to use the pgbench -f
 feature, which allows running an arbitrary number of some query at once.
  Seeing whether this acceleration continued to hold as the number of clients
 increases is a useful data point.

 Is it possible for you to break down where the time is being spent?  For
 example, how much of this time is consumed in the GPU itself, compared to
 time spent transferring data between CPU and GPU?  I'm also curious where
 the bottleneck is at with this approach.  If it's the speed of the PCI-E bus
 for smaller data sets, adding more GPUs may never be practical.  If the bus
 can handle quite a few of these at once before it saturates, it might be
 possible to overload a single GPU.  That seems like it would be really hard
 to reach for database sorting, though; I can't really justify my gut
 feel for that being true.


  I've never seen a PostgreSQL server capable of running CUDA, and I
  don't expect that to change.

 That sounds like:

 I think there is a world market for maybe five computers.
 - IBM Chairman Thomas Watson, 1943


 Yes, and 640K will be enough for everyone, ha ha.  (Having said the 640K
 thing is flat out denied by Gates, BTW, and no one has come up with proof
 otherwise).

 I think you've made an interesting case for this sort of acceleration now
 being useful for systems doing what's typically considered a data warehouse
 task.  I regularly see servers waiting for far more than 13M integers to
 sort.  And I am seeing a clear trend toward providing more PCI-E slots in
 servers now.  Dell's R810 is the most popular single server model my
 customers have deployed in the last year, and it has 5 X8 slots in it.  It's
 rare all 5 of those are filled.  As long as a dedicated GPU works fine when
 dropped to X8 speeds, I know a fair number of systems where one of those
 could be added now.

 There's another data point in your favor I didn't notice before your last
 e-mail.  Amazon has a Cluster GPU Quadruple Extra Large node type that
 runs with NVIDIA Tesla hardware.  That means the installed base of people
 who could consider CUDA is higher than I expected.  To demonstrate how much
 that costs, to provision a GPU enabled reserved instance from Amazon for one
 year costs $2410 at Light Utilization, giving a system with 22GB of RAM
 and 1.69TB of storage.  (I find the reserved prices easier to compare with
 dedicated hardware than the hourly ones)  That's halfway between the
 High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and
 the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at
 $2200.  If someone could prove sorting was a bottleneck on their server,
 that isn't an unreasonable option to consider on a cloud-based database
 deployment.

 I still think that an approach based on OpenCL is more likely to be suitable
 for PostgreSQL, which was part of why I gave CUDA low odds here.  The points
 in favor of OpenCL are:

 -Since you last posted, OpenCL compiling has switched to using LLVM as their
 standard compiler.  Good PostgreSQL support for LLVM isn't far away.  It
 looks to me like the compiler situation 

Re: [HACKERS] Finer Extension dependencies

2012-02-13 Thread Dimitri Fontaine
Hi,

Sorry for the delays, I'm back on PostgreSQL related work again.

Hitoshi Harada umi.tan...@gmail.com writes:
 I just tried DROP EXTENSION now, and found it broken :(

Please find v2 of the patch.  I did change the dependency management in
between the simple cases and the more challenging ones and forgot that I
had to retest it all in between, which is what happens on a tight
schedule and when working at night, I guess.

So the best option I've found here had me add a new function in
pg_depend.c; it's working as intended now.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

diff --git a/contrib/pg_upgrade_support/pg_upgrade_support.c b/contrib/pg_upgrade_support/pg_upgrade_support.c
index 472f152..b5633d4 100644
--- a/contrib/pg_upgrade_support/pg_upgrade_support.c
+++ b/contrib/pg_upgrade_support/pg_upgrade_support.c
@@ -151,6 +151,7 @@ create_empty_extension(PG_FUNCTION_ARGS)
 	Datum		extConfig;
 	Datum		extCondition;
 	List	   *requiredExtensions;
+	List   *features = NIL;	/* FIXME, get features from catalogs */
 
 	if (PG_ARGISNULL(4))
 		extConfig = PointerGetDatum(NULL);
@@ -190,7 +191,8 @@ create_empty_extension(PG_FUNCTION_ARGS)
 		 text_to_cstring(extVersion),
 		 extConfig,
 		 extCondition,
-		 requiredExtensions);
+		 requiredExtensions,
+		 features);
 
 	PG_RETURN_VOID();
 }
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index ea98cb7..8090758 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -149,6 +149,11 @@
  /row
 
  row
+  entrylink linkend=catalog-pg-extensionstructnamepg_extension_feature/structname/link/entry
+  entryfeatures provided by installed extensions/entry
+ /row
+
+ row
   entrylink linkend=catalog-pg-foreign-data-wrapperstructnamepg_foreign_data_wrapper/structname/link/entry
   entryforeign-data wrapper definitions/entry
  /row
@@ -3058,6 +3063,51 @@
   /para
  /sect1
 
+ sect1 id=catalog-pg-extension-feature
+  titlestructnamepg_extension_feature/structname/title
+
+  indexterm zone=catalog-pg-extension-feature
+   primarypg_extension/primary
+  /indexterm
+
+  para
+   The catalog structnamepg_extension_feature/structname stores
+   information about the features provided by installed extensions.
+   See xref linkend=extend-extensions for details about extensions.
+  /para
+
+  table
+   titlestructnamepg_extension_features/ Columns/title
+
+   tgroup cols=4
+thead
+ row
+  entryName/entry
+  entryType/entry
+  entryReferences/entry
+  entryDescription/entry
+ /row
+/thead
+
+tbody
+ row
+  entrystructfieldextoid/structfield/entry
+  entrytypeoid/type/entry
+  entryliterallink linkend=catalog-pg-extensionstructnamepg_extension/structname/link.oid/literal/entry
+  entryOid of the extension that provides this feature/entry
+ /row
+
+ row
+  entrystructfieldextfeature/structfield/entry
+  entrytypename/type/entry
+  entry/entry
+  entryName of the feature/entry
+ /row
+
+/tbody
+   /tgroup
+  /table
+ /sect1
 
  sect1 id=catalog-pg-foreign-data-wrapper
   titlestructnamepg_foreign_data_wrapper/structname/title
@@ -6815,11 +6865,17 @@
  row
   entrystructfieldrequires/structfield/entry
   entrytypename[]/type/entry
-  entryNames of prerequisite extensions,
+  entryNames of prerequisite features,
or literalNULL/literal if none/entry
  /row
 
  row
+  entrystructfieldprovides/structfield/entry
+  entrytypename[]/type/entry
+  entryNames of provided features/entry
+ /row
+
+ row
   entrystructfieldcomment/structfield/entry
   entrytypetext/type/entry
   entryComment string from the extension's control file/entry
diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index 8d5b9d0..af5fc4c 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -463,9 +463,30 @@
   termvarnamerequires/varname (typestring/type)/term
   listitem
para
-A list of names of extensions that this extension depends on,
-for example literalrequires = 'foo, bar'/literal.  Those
-extensions must be installed before this one can be installed.
+A list of features that this extension depends on, for
+example literalrequires = 'foo, bar'/literal. Those features
+must be provided by an already installed extension before this one
+can be installed.
+   /para
+  /listitem
+ /varlistentry
+
+ varlistentry
+  termvarnameprovides/varname (typestring/type)/term
+  listitem
+   para
+A list of names of features that this extension provides, for
+example literalprovides = 'foo, extname_bugfix_12345'/literal.
+Those features can help providing finer dependencies: when updating
+an existing extension you can add new features in this list so that
+  

Re: [HACKERS] ecpglib use PQconnectdbParams

2012-02-13 Thread Michael Meskes
 Because connect_timeout is a separate libpq connection parameter, but
 now it's stuck into options.  It might have worked more or less by
 accident before.

So it is not an option, right? But the old function accepted it as an option it 
seems.

 It's not clear to me why this only appears on checktcp.  And why we
 don't run those tests by default.  That should be clarified, because

This was decided when regression tests for ecpg were introduced, so as not to
depend on the use of TCP ports.

For details see this thread:
http://archives.postgresql.org/pgsql-hackers/2006-08/msg00078.php
and in particular these emails: 
http://archives.postgresql.org/pgsql-hackers/2006-08/msg00118.php
http://archives.postgresql.org/pgsql-hackers/2006-08/msg00134.php

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at googlemail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL



Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-13 Thread Etsuro Fujita
(2012/02/10 20:39), Shigeru Hanada wrote:
 (2012/02/08 20:51), Shigeru Hanada wrote:
 Attached revised patches.  Changes from last version are below.
 snip
 
 I've found and fixed a bug which generates wrong remote query when any
 column of a foreign table has been dropped.  Also regression test for
 this case is added.
 
 I attached only pgsql_fdw_v8.patch, because pgsql_fdw_pushdown_v3.patch
 in last post still can be applied onto v8 patch.
 
 Regards,

The patches have been applied, but role-related regression tests failed
in my environment.  I fixed it in a similar fashion to
/src/test/regress/sql/foreign_data.sql.  Please find attached an updated
patch for the regression tests.

BTW, What do you think about this?

http://archives.postgresql.org/pgsql-hackers/2012-01/msg00229.php

Best regards,
Etsuro Fujita
*** sql/pgsql_fdw.sql.orig  2012-02-13 19:52:08.0 +0900
--- sql/pgsql_fdw.sql   2012-02-13 19:44:17.0 +0900
***
*** 2,7 
--- 2,19 
  -- create FDW objects
  -- ===
  
+ -- Clean up in case a prior regression run failed
+ 
+ -- Suppress NOTICE messages when roles don't exist
+ SET client_min_messages TO 'error';
+ 
+ DROP ROLE IF EXISTS pgsql_fdw_user;
+ 
+ RESET client_min_messages;
+ 
+ CREATE ROLE pgsql_fdw_user LOGIN SUPERUSER;
+ SET SESSION AUTHORIZATION 'pgsql_fdw_user';
+ 
  CREATE EXTENSION pgsql_fdw;
  
  CREATE SERVER loopback1 FOREIGN DATA WRAPPER pgsql_fdw;
***
*** 168,173 
--- 180,186 
  EXPLAIN (COSTS false) SELECT * FROM ft1 t1 WHERE t1.c1 = abs(t1.c2);
  EXPLAIN (COSTS false) SELECT * FROM ft1 t1 WHERE t1.c1 = t1.c2;
  DROP OPERATOR === (int, int) CASCADE;
+ DROP OPERATOR !== (int, int) CASCADE;
  DROP FUNCTION pgsql_fdw_abs(int);
  
  -- ===
***
*** 212,216 
  -- ===
  -- cleanup
  -- ===
  DROP EXTENSION pgsql_fdw CASCADE;
! 
--- 225,231 
  -- ===
  -- cleanup
  -- ===
+ DROP SCHEMA S 1 CASCADE;
  DROP EXTENSION pgsql_fdw CASCADE;
! \c
! DROP ROLE pgsql_fdw_user;
*** expected/pgsql_fdw.out.orig 2012-02-13 19:52:03.0 +0900
--- expected/pgsql_fdw.out  2012-02-13 19:51:49.0 +0900
***
*** 1,6 
--- 1,13 
  -- ===
  -- create FDW objects
  -- ===
+ -- Clean up in case a prior regression run failed
+ -- Suppress NOTICE messages when roles don't exist
+ SET client_min_messages TO 'error';
+ DROP ROLE IF EXISTS pgsql_fdw_user;
+ RESET client_min_messages;
+ CREATE ROLE pgsql_fdw_user LOGIN SUPERUSER;
+ SET SESSION AUTHORIZATION 'pgsql_fdw_user';
  CREATE EXTENSION pgsql_fdw;
  CREATE SERVER loopback1 FOREIGN DATA WRAPPER pgsql_fdw;
  CREATE SERVER loopback2 FOREIGN DATA WRAPPER pgsql_fdw
***
*** 130,147 
  ALTER FOREIGN TABLE ft1 ALTER COLUMN c1 OPTIONS (colname 'C 1');
  ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (colname 'C 1');
  \dew+
!  List of foreign-data wrappers
!Name|  Owner   |  Handler  |  Validator  | Access 
privileges | FDW Options | Description 
! 
---+--+---+-+---+-+-
!  pgsql_fdw | postgres | pgsql_fdw_handler | pgsql_fdw_validator | 
  | | 
  (1 row)
  
  \des+
!   

 List of foreign servers
!Name|  Owner   | Foreign-data wrapper | Access privileges | Type | 
Version |   

  FDW Options   

  | Description 
! 
---+--+--+---+--+-+-+-
!  loopback1 | postgres | pgsql_fdw|   |  | 
| 

Re: [HACKERS] pl/perl and utf-8 in sql_ascii databases

2012-02-13 Thread Christoph Berg
Re: Alex Hunsaker 2012-02-10 
cafapbrr9y1fu6gpvu+8ta8vty6qvcm3dfarkt8jg_ehgetx...@mail.gmail.com
 Does the attached fix the issue for you?

Yes. :)

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/




Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-13 Thread Robert Haas
On Thu, Feb 9, 2012 at 3:37 PM, Jay Levitt jay.lev...@gmail.com wrote:
 So my pre-built 9.1.2 takes 434s, my source-built 9.2 takes 509s, and
 (probably both of our) 9.1-HEAD takes 1918s... is that something to worry
 about, and if so, are there any tests I can run to assist? That bug doesn't
 affect me personally, but y'know, community and all that.  Also, I wonder if
 it's something like 9.2 got way faster doing X, but meanwhile, HEAD got way
 slower doing Y., and this is a canary in the coal mine.

This might be a lame hypothesis, but... is it possible that you built
your 9.1-tip binaries with --enable-cassert?  Or with different
optimization options?

There's been some work done on GiST in 9.2 which, as Alexander Korotkov
(who did the work) mentioned upthread, might have some issue.
But I can't see how there can be a 4x regression between minor
releases, though maybe it wouldn't hurt to test.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Removing special case OID generation

2012-02-13 Thread Robert Haas
On Sat, Feb 11, 2012 at 4:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Yeh, I was thinking we would do well to implement cached sequences for
 say first 1000 sequences.

Another option might be to store all the sequences for a particular
database in a single underlying data file.  The current implementation
uses a whole page for a single tuple that is presumably much smaller
than that.  So when you create a sequence foo, it's really creating
a row in some new system catalog pg_sequences, or something like that.

 Idea would be to make Sequences as fast as OIDs and get rid of the
 weird OID code.

Honestly, I think the biggest hassle of the OID code is not so much
the way they're generated as the way they're stored within heap
tuples.  I've wondered whether we should go through the system
catalogs and replace all of the hidden OID columns with a normal
column called oid of type OID, always placing it at attnum = 1 since
we have a lot of code that assumes the OID column always has the same
attnum.  That would be a fairly large notational change, but maybe it
wouldn't break anything /too/ badly...

Anyway, if we could get away with that, we could eventually (after N
releases) drop the special case support for system OID columns, which
would be a nice simplification.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] initdb and fsync

2012-02-13 Thread Robert Haas
On Fri, Feb 10, 2012 at 3:57 PM, Peter Eisentraut pete...@gmx.net wrote:
 On sön, 2012-02-05 at 10:53 -0800, Jeff Davis wrote:
  initdb should do these syncs by default and offer an option to
 disable them.

 For test frameworks that run initdb often, that makes sense.

 But for developers, it doesn't make sense to spend 0.5s typing an option
 that saves you 0.3s. So, we'd need some more convenient way to choose
 the no-fsync option, like an environment variable that developers can
 set. Or maybe developers don't care about 0.3s?

 You can use https://launchpad.net/libeatmydata for those cases.

That's hilarious.

But, a command-line option seems more convenient.

It also seems entirely sufficient.  The comments above suggest that it
would take too long to type the option, but any PG developers who are
worried about the speed difference surely know how to create shell
aliases, shell functions, shell scripts, ... and if anyone's really
concerned about it, we can provide a short form for the option.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-13 Thread Robert Haas
On Mon, Feb 13, 2012 at 7:45 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Feb 9, 2012 at 3:37 PM, Jay Levitt jay.lev...@gmail.com wrote:
 So my pre-built 9.1.2 takes 434s, my source-built 9.2 takes 509s, and
 (probably both of our) 9.1-HEAD takes 1918s... is that something to worry
 about, and if so, are there any tests I can run to assist? That bug doesn't
 affect me personally, but y'know, community and all that.  Also, I wonder if
 it's something like 9.2 got way faster doing X, but meanwhile, HEAD got way
 slower doing Y., and this is a canary in the coal mine.

 This might be a lame hypothesis, but... is it possible that you built
 your 9.1-tip binaries with --enable-cassert?  Or with different
 optimization options?

 There's been some work done on GiST in 9.2, which as Alexander
 Korotkov who did the work mentioned upthread, might have some issue.
 But I can't see how there can be a 4x regression between minor
 releases, though maybe it wouldn't hurt to test.

So I tested.  On my MacBook Pro, your test script builds the index in
just over 25 s on both REL9_1_2 and this morning's REL9_1_STABLE.
This is with the following non-default configuration settings:

shared_buffers = 400MB
maintenance_work_mem = 1GB
checkpoint_segments = 30
checkpoint_timeout = 10min
checkpoint_completion_target = 0.9
checkpoint_warning = 60s

I then tested with master, which also showed similar performance.
Based on this comment from your original email:

 [***] never completed after 10-20 minutes; nothing in server.log at default 
 logging levels, postgres process consuming about 1 CPU in IOWAIT, 
 checkpoints every 7-8 seconds

...I wonder if you have left checkpoint_segments set to the default
value of 3, which would account for the very frequent checkpoints.

At any rate, I can't measure a difference between the branches on this
test.  That doesn't mean there isn't one, but in my test setup I'm not
seeing it.  As an afterthought, I also retested with wal_level=archive
added to the config, but I still don't see any significant difference
between 9.1.2, 9.1-stable, and 9.2-devel.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Removing special case OID generation

2012-02-13 Thread Marti Raudsepp
On Mon, Feb 13, 2012 at 15:08, Robert Haas robertmh...@gmail.com wrote:
 Honestly, I think the biggest hassle of the OID code is not so much
 the way they're generated as the way they're stored within heap
 tuples.  I've wondered whether we should go through the system
 catalogs and replace all of the hidden OID columns with a normal
 column called oid of type OID

Do we have a clear idea about what to do with user tables that are
created WITH OIDS? Do we care about compatibility with that at all? Do
we generate this explicit oid column manually or do we just tell
users to use a serial or global sequence instead?

Personally I'd also like to see us get rid of the default_with_oids
setting -- I assume the existence of that is the reason why pgAdmin
and TOAD still generate table DDL with an explicit WITH (OIDS=FALSE).

Regards,
Marti



Re: [HACKERS] Removing special case OID generation

2012-02-13 Thread Robert Haas
On Mon, Feb 13, 2012 at 8:51 AM, Marti Raudsepp ma...@juffo.org wrote:
 On Mon, Feb 13, 2012 at 15:08, Robert Haas robertmh...@gmail.com wrote:
 Honestly, I think the biggest hassle of the OID code is not so much
 the way they're generated as the way they're stored within heap
 tuples.  I've wondered whether we should go through the system
 catalogs and replace all of the hidden OID columns with a normal
 column called oid of type OID

 Do we have a clear idea about what to do with user tables that are
 created WITH OIDS? Do we care about compatibility with that at all?

I think it would be fine to eventually drop support for user tables
with OIDs.  That hasn't been enabled by default for a very long time:

commit 7ce9b7c0d8c8dbefc04978765422f760dcf3788c
Author: Bruce Momjian br...@momjian.us
Date:   Mon Dec 1 22:08:02 2003 +

This patch adds a new GUC var, default_with_oids, which follows the
proposal for eventually deprecating OIDs on user tables that I posted
earlier to pgsql-hackers. pg_dump now always specifies WITH OIDS or
WITHOUT OIDS when dumping a table. The documentation has been updated.

Neil Conway

I think there's not much benefit to deprecating that feature as long
as we need system OID columns in the catalogs.  But if we got rid of
them there then I think we could drop support in userland, too.

 Do
 we generate this explicit oid column manually or do we just tell
 users to use a serial or global sequence instead?

 Personally I'd also like to see us get rid of the default_with_oids
 setting -- I assume the existence of that is the reason why pgAdmin
 and TOAD still generate table DDL with an explicit WITH (OIDS=FALSE)

That probably has as much to do with inertia as anything else.  I
agree that it's ugly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Optimize referential integrity checks (todo item)

2012-02-13 Thread Robert Haas
On Sat, Feb 11, 2012 at 9:06 PM, Vik Reykja vikrey...@gmail.com wrote:
 I decided to take a crack at the todo item created from the following post:
 http://archives.postgresql.org/pgsql-performance/2005-10/msg00458.php

 The attached patch makes the desired changes in both code and function
 naming.

 It seemed quite easy to do but wasn't marked as easy on the todo, so I'm
 wondering if I've missed something.

It's kind of hard to say whether you've missed something, because you
haven't really explained what problem this is solving; the thread you
linked to isn't very clear about that either.  At first blush, it
seems like you've renamed a bunch of stuff without making very much
change to what actually happens.  Changing lots of copies of equal
to unchanged doesn't seem to me to be accomplishing anything.

 All regression tests pass.

You should add some new ones showing how this patch improves the
behavior relative to the previous code.  Or if you can't, then you
should provide a complete, self-contained test case that a reviewer
can use to see how your proposed changes improve things.

We're in the middle of a CommitFest right now, so please add this
patch to the next one if you would like it reviewed:

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-13 Thread Robert Haas
On Sat, Feb 11, 2012 at 7:16 AM, Jesper Krogh jes...@krogh.cc wrote:
 Ok, but there are still cases where we don't even need to construct
 a data tuple at all:

 2012-02-11 13:14:01.579 jk=# explain select count(*) from testtable where
 fts @@ to_tsquery('english','test1');
                                QUERY PLAN
 ---
  Aggregate  (cost=31.24..31.25 rows=1 width=0)
   -  Bitmap Heap Scan on testtable  (cost=16.03..31.23 rows=4 width=0)
         Recheck Cond: (fts @@ '''test1'''::tsquery)
         -  Bitmap Index Scan on ftsid  (cost=0.00..16.03 rows=4 width=0)
               Index Cond: (fts @@ '''test1'''::tsquery)
 (5 rows)

In that case I believe you DO need the heap tuple.  That Recheck
Cond there means that the index might be lossy - i.e. return tuples
that don't really match the search condition.

 Another idea sprung into my head, that indices on (ctid,some mix of
 columns)
 could actually serve as some kind of vertical partitioning of the table.

The ctid of a tuple is its physical position in the table.  It makes
no sense to index that.  Since it's unique, it makes even less sense
to index that plus other things in the same index.

Does anyone have any comments on the issue raised in my original
email?  I would like to get (some version of) his patch committed, but
I would also like to not back ourselves into a corner.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-13 Thread Robert Haas
On Fri, Feb 10, 2012 at 11:46 PM, Noah Misch n...@leadboat.com wrote:
 On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
 On Fri, Feb 10, 2012 at 6:42 AM, Noah Misch n...@leadboat.com wrote:
  I like the design you have chosen.  It would find applications beyond
  TRUNCATE, so your use of non-specific naming is sound.  For example, older
  snapshots will see an empty table t after CREATE TABLE t AS SELECT 1
  commits; that's a comparable MVCC anomaly.  Some of our non-MVCC-safe
  commands should perhaps just become MVCC-safe, but there will always be use
  cases for operations that shortcut MVCC.  When one truly does want that,
  your proposal for keeping behavior consistent makes plenty of sense.

 I guess I'm not particularly excited by the idea of trying to make
 TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
 READ isolation level, which is already known to be busted in a variety
 of ways; that's why we now have SERIALIZABLE, and why most people use
 READ COMMITTED.  Are there examples of this behavior at other
 isolation levels?

 I've yet to see an MVCC anomaly that one can reproduce at REPEATABLE READ and
 not at READ COMMITTED.  They tend to be narrow race conditions at READ
 COMMITTED, yet easy to demonstrate at REPEATABLE READ.  Related:
 http://archives.postgresql.org/pgsql-performance/2011-02/msg00451.php

Yeah.  Well, that's actually an interesting example, because it
illustrates how general this problem is.  We could potentially get
ourselves into a situation where just about every system catalog table
needs an xmin field to store the point at which the object came into
existence - or for that matter, was updated.  But it's not quite the
same as the xmin of the row itself, because some updates might be
judged not to matter.  There could also be intermediate cases where
updates are invalidating for some purposes but not others.  I think
we'd better get our hands around more of the problem space before we
start trying to engineer solutions.

 Incidentally, people use READ COMMITTED because they don't question the
 default, not because they know hazards of REPEATABLE READ.  I don't know the
 bustedness you speak of; could we improve the documentation to inform folks?

The example that I remember was related to SELECT FOR UPDATE/SELECT
FOR SHARE.  The idea of those statements is that you want to prevent
the row from being updated or deleted until some other concurrent
action is complete; for example, in the case of a foreign key, we'd
like to prevent the referenced row from being deleted or updated in
the relevant columns until the inserting transaction is committed.
But it doesn't work, because when the updating or deleting process
gets done with the lock wait, they are still using the same snapshot
as before, and merrily do exactly the thing that the lock-wait was
supposed to prevent.  If an actual UPDATE is used, it's safe (I
think): anyone who was going to UPDATE or DELETE the row will fail
with some kind of serialization error.  But a SELECT FOR UPDATE that
commits is treated more like an UPDATE that rolls back: it's as if the
lock never existed.  Someone (Florian?) proposed a patch to change
this, but it seemed problematic for reasons I no longer exactly
remember.

When using an actual foreign key, we work around this by taking a new
snapshot to cross-check that things haven't changed under us, but
user-level code can't do that.  At READ COMMITTED, depending on the
situation, either the fact that we take new snapshots pretty
frequently or the EPQ machinery sometimes make things work sensibly
anyway, and at SERIALIZABLE, SSI prevents these kinds of anomalies.
But REPEATABLE READ has no protection.  I wish I could find the thread
where we discussed this before.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Removing special case OID generation

2012-02-13 Thread Andres Freund
On Monday, February 13, 2012 02:08:08 PM Robert Haas wrote:
 On Sat, Feb 11, 2012 at 4:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
  Yeh, I was thinking we would do well to implement cached sequences for
  say first 1000 sequences.
 
 Another option might be to store all the sequences for a particular
 database in a single underlying data file.  The current implementation
 uses a whole page for a single tuple that is presumably much smaller
 than that.  So when you create a sequence foo, it's really creating
 a row in some new system catalog pg_sequences, or something like that.
I wonder if the tighter packing would be noticeable contention-wise.  If
several hot sequences end up in a single page, that could end up being
measurable.

Andres



Re: [HACKERS] When do we lose column names?

2012-02-13 Thread Robert Haas
On Sat, Feb 11, 2012 at 11:11 AM, Andrew Dunstan
aduns...@postgresql.org wrote:
 Other candidates I have found that don't set colnames and should
 probably use dummy names are:
   * src/backend/parser/gram.y (row: production)
   *
 src/backend/optimizer/prep/prepunion.c:adjust_appendrel_attrs_mutator()
   * src/backend/optimizer/util/var.c:flatten_join_alias_vars_mutator()

 Hm, can't the last two get real column names from somewhere?

 Possibly. I'll dig a bit deeper.

 I've had a look at these two. It's at least not obvious to me how to do this
 simply, if at all. In the last case it looks like we'd need to process the
 object recursively just like we do to extract the field values, and I don't
 know where to get them in the appendrel case at all, possibly because I'm
 not very familiar with this code.

 Do we actually need to bother with these cases? The regression tests pass
 without touching them, either because they don't matter or because we don't
 have a test for these cases that would tickle the assertion that was
 failing. If they don't matter, would it not be better just to note that in
 the code rather than building a list of field names for no good purpose?

In flatten_join_alias_vars_mutator(), we've got a RangeTblEntry to
work with.  I think the column names are to be found in the alias
and/or eref attributes of the RangeTblEntry.  Each of those is an
Alias object, which is defined like this:

typedef struct Alias
{
    NodeTag     type;
    char       *aliasname;      /* aliased rel name (never qualified) */
    List       *colnames;       /* optional list of column aliases */
} Alias;

I'm not sure whether we should look at rte->eref.colnames,
rte->alias.colnames, or both.

In adjust_appendrel_attrs_mutator(), we have a list, translated_vars,
whose order matches the column order of the parent rel.  If we had the
parent's RangeTblEntry, we could probably proceed as in the previous
case.  But the AppendRelInfo only contains the index of the RT.  Maybe
we can figure out a way to use rt_fetch to get the RangeTblEntry
itself, but that requires a pointer to the range table itself, which
we haven't got.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Removing special case OID generation

2012-02-13 Thread Robert Haas
On Mon, Feb 13, 2012 at 9:41 AM, Andres Freund and...@anarazel.de wrote:
 On Monday, February 13, 2012 02:08:08 PM Robert Haas wrote:
 On Sat, Feb 11, 2012 at 4:23 AM, Simon Riggs si...@2ndquadrant.com wrote:
  Yeh, I was thinking we would do well to implement cached sequences for
  say first 1000 sequences.

 Another option might be to store all the sequences for a particular
 database in a single underlying data file.  The current implementation
 uses a whole page for a single tuple that is presumably much smaller
 than that.  So when you create a sequence foo, it's really creating
 a row in some new system catalog pg_sequences, or something like that.
 I wonder if the tigher packing would be noticeable contentionwise If
 several hot sequences end up in a single page that could end up being
 measurable.

For the contention to really be an issue, you'd need a very high rate
of access to that sequence - in my tests so far, the only things that
seem to get hot enough to really hurt are the roots of btrees and
visibility map pages.  And on the plus side, you'd be reducing the
number of pages fighting to stay in shared_buffers.  That having been
said, it's something to watch out for - I certainly don't know enough
to say for certain that it wouldn't be a problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Optimize referential integrity checks (todo item)

2012-02-13 Thread Vik Reykja
On Mon, Feb 13, 2012 at 15:25, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Feb 11, 2012 at 9:06 PM, Vik Reykja vikrey...@gmail.com wrote:
  I decided to take a crack at the todo item created from the following
 post:
  http://archives.postgresql.org/pgsql-performance/2005-10/msg00458.php
 
  The attached patch makes the desired changes in both code and function
  naming.
 
  It seemed quite easy to do but wasn't marked as easy on the todo, so I'm
  wondering if I've missed something.

 It's kind of hard to say whether you've missed something, because you
 haven't really explained what problem this is solving; the thread you
 linked too isn't very clear about that either.  At first blush, it
 seems like you've renamed a bunch of stuff without making very much
 change to what actually happens.  Changing lots of copies of equal
 to unchanged doesn't seem to me to be accomplishing anything.


It's very simple really, and most of it is indeed renaming the functions.
The problem this solves is that foreign key constraints are sometimes
checked when they don't need to be.  See my example below.


  All regression tests pass.

 You should add some new ones showing how this patch improves the
 behavior relative to the previous code.  Or if you can't, then you
 should provide a complete, self-contained test case that a reviewer
 can use to see how your proposed changes improve things.


I have no idea how a regression test would be able to see this change, so
here's a test case that you can follow with the debugger.

/* initial setup */
create table a (x int, y int, primary key (x, y));
create table b (x int, y int, z int, foreign key (x, y) references a);
insert into a values (1, 2);
insert into b values (1, null, 3);

/* seeing the difference */
update b set z=0;

When that update is run, it will check if the FK (x, y) has changed to know
if it needs to verify that the values are present in the other table.  The
equality functions that do that don't consider two nulls to be equal (per
SQL logic) and so the constraint was re-verified.  Tom noticed that it didn't
need to be, because the key hadn't really changed.

In the above example, the current code will recheck the constraint and the
new code won't.  It's not really testing equality anymore (because null
does not equal null), so I renamed them, causing a lot of noise in the diff.
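
To spell out that semantics with a hypothetical function (not the patch's
actual ri_KeysUnchanged()): under SQL logic NULL = NULL is not true, but for
deciding whether the RI check can be skipped it is enough that the key did
not change, and NULL staying NULL counts as unchanged.

/*
 * Simplified illustration only; hypothetical types and names.
 */
#include <stdio.h>
#include <stdbool.h>

typedef struct
{
    bool isnull;
    int  value;
} KeyAttr;

static bool
key_attr_unchanged(KeyAttr oldv, KeyAttr newv)
{
    if (oldv.isnull || newv.isnull)
        return oldv.isnull && newv.isnull;  /* NULL -> NULL: unchanged */
    return oldv.value == newv.value;        /* both non-NULL: compare values */
}

static bool
keys_unchanged(const KeyAttr *oldkey, const KeyAttr *newkey, int nkeys)
{
    for (int i = 0; i < nkeys; i++)
        if (!key_attr_unchanged(oldkey[i], newkey[i]))
            return false;
    return true;
}

int
main(void)
{
    /* the test case above: FK (x, y) = (1, NULL), and only z is updated */
    KeyAttr oldkey[2] = {{false, 1}, {true, 0}};
    KeyAttr newkey[2] = {{false, 1}, {true, 0}};

    printf("keys unchanged: %s -> the RI check can be skipped\n",
           keys_unchanged(oldkey, newkey, 2) ? "yes" : "no");
    return 0;
}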


 We're in the middle of a CommitFest right now,


Yes, I wasn't expecting this to be committed, I just didn't want to lose
track of it.


 so please add this patch to the next one if you would like it reviewed:

https://commitfest.postgresql.org/action/commitfest_view/open


Will do.


Re: [HACKERS] Removing special case OID generation

2012-02-13 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Another option might be to store all the sequences for a particular
 database in a single underlying data file.

We've looked into that before, and found some roadblocks IIRC, though
it probably isn't completely infeasible.  See the archives.

regards, tom lane



Re: [HACKERS] Optimize referential integrity checks (todo item)

2012-02-13 Thread Vik Reykja
On Mon, Feb 13, 2012 at 11:02, Chetan Suttraway 
chetan.suttra...@enterprisedb.com wrote:

 The patch was not applying; I was seeing the message below:
 postgresql$ git apply  /Downloads/unchanged.patch
 error: src/backend/utils/adt/ri_triggers.c: already exists in working
 directory

 I have come up with the attached patch, which hopefully has not missed
 any of your changes.


Thank you for doing that.  What command did you use?  I followed the
procedure on the wiki [1] but I must be doing something wrong.

[1] http://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git


 Please verify the changes.


They look good.  Thanks again.


Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-13 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 The example that I remember was related to SELECT FOR
 UPDATE/SELECT FOR SHARE.  The idea of those statements is that you
 want to prevent the row from being updated or deleted until some
 other concurrent action is complete; for example, in the case of a
 foreign key, we'd like to prevent the referenced row from being
 deleted or updated in the relevant columns until the inserting
 transaction is committed.  But it doesn't work, because when the
 updating or deleting process gets done with the lock wait, they
 are still using the same snapshot as before, and merrily do
 exactly the thing that the lock-wait was supposed to prevent.
 
This issue is one which appears to be a problem for people trying to
migrate from Oracle, where a write conflict would be generated.
 
 If an actual UPDATE is used, it's safe (I think): anyone who was
 going to UPDATE or DELETE the row will fail with some kind of
 serialization error.
 
Right; a write conflict.
 
 But a SELECT FOR UPDATE that commits is treated more like an
 UPDATE that rolls back: it's as if the lock never existed. 
 Someone (Florian?) proposed a patch to change this, but it seemed
 problematic for reasons I no longer exactly remember.
 
It had to do with only having one xmax and how that worked with
subtransactions.
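 
For anyone who hasn't seen the failure mode, it looks roughly like this
(a sketch only; both sessions REPEATABLE READ, table tbl is hypothetical):
 
-- session 1
begin isolation level repeatable read;
select * from tbl where id = 1 for update;  -- "reserve" the row
 
-- session 2
begin isolation level repeatable read;
delete from tbl where id = 1;               -- blocks on session 1's lock
 
-- session 1
commit;
 
-- session 2: unblocks, and the DELETE succeeds with no serialization
-- error because session 1 only locked the row -- exactly the thing the
-- lock wait was supposed to prevent
commit;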
 
Of course, besides the technical obstacles, such a semantic change
could break existing code for PostgreSQL users.  :-(
 
 When using an actual foreign key, we work around this by taking a
 new snapshot to cross-check that things haven't changed under us,
 but user-level code can't do that.  At READ COMMITTED, depending
 on the situation, either the fact that we take new snapshots
 pretty frequently or the EPQ machinery sometimes make things work
 sensibly anyway, and at SERIALIZABLE, SSI prevents these kinds of
 anomalies.  But REPEATABLE READ has no protection.
 
Well, personally I have a hard time calling READ COMMITTED behavior
sensible.  Consider this:
 
-- connection 1
test=# create table t (id int not null primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"t_pkey" for table "t"
CREATE TABLE
test=# insert into t select generate_series(1, 10);
INSERT 0 10
 
-- connection 2
test=# begin;
BEGIN
test=# update t set id = id - 1;
UPDATE 10
 
-- connection 1
test=# select * from t where id = (select min(id) from t) for
update;
[blocks]
 
-- connection 2
test=# commit;
COMMIT
 
-- connection 1
[unblocks]
 id 
----
(0 rows)
 
-Kevin



Re: [HACKERS] When do we lose column names?

2012-02-13 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sat, Feb 11, 2012 at 11:11 AM, Andrew Dunstan
 aduns...@postgresql.org wrote:
 Do we actually need to bother with these cases?

 In flatten_join_alias_vars_mutator(), we've got a RangeTblEntry to
 work with.  I think the column names are to be found in the alias
 and/or eref attributes of the RangeTblEntry.

The eref names are the ones to use.  alias just records the original AS
clause (if any) attached to the RTE, which is mostly useful only for
reverse-listing the query.

 In adjust_appendrel_attrs_mutator(), we have a list, translated_vars,
 whose order matches the column order of the parent rel.  If we had the
 parent's RangeTblEntry, we could probably proceed as in the previous
 case.  But the AppendRelInfo only contains the index of the RT.  Maybe
 we can figure out a way to use rt_fetch to get the RangeTblEntry
 itself, but that requires a pointer to the range table itself, which
 we haven't got.

This is surely fixable by passing a bit more information down.  If you
(Andrew) have something that covers everything but this issue, pass it
over and I'll take a whack at it.

regards, tom lane



Re: [HACKERS] auto_explain produces invalid JSON

2012-02-13 Thread Tom Lane
[ sorry for ignoring this over the weekend --- I wasn't feeling very well ]

Andrew Dunstan and...@dunslane.net writes:
 On 02/11/2012 03:22 PM, Tom Lane wrote:
 I'm inclined to think that this is auto_explain's error, not that of
 the core code, ie we should be changing the output.

 Well, maybe this is more to your taste, although it strikes me as more 
 than something of a kludge. At least it's short :-)

I see you've already committed this, but I agree that it's quite a
kluge.

After looking a bit more at the existing explain code, it seems like the
critical issue is that explain.c has ExplainOpenGroup/ExplainCloseGroup
calls around the ExplainPrintPlan call (see ExplainOnePlan), while
auto_explain does not.  I did not like your originally proposed patch
because it seemed to introduce yet another formatting concept into code
that has already got a few too many.  But couldn't we fix this by
exporting ExplainOpenGroup/ExplainCloseGroup and then calling those from
auto_explain?

regards, tom lane



Re: Scaling XLog insertion (was Re: [HACKERS] Moving more work outside WALInsertLock)

2012-02-13 Thread Fujii Masao
On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 13.02.2012 01:04, Jeff Janes wrote:

 Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
  I have no idea if I did it correctly, in particular if calling
 GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
 effects that make that a bad thing to do.  I'm not proposing it as the
 real fix, I just wanted to get around this problem in order to do more
 testing.


 Thanks. That's basically the right approach. Attached patch contains a
 cleaned up version of that.


  It does get rid of the "there is no contrecord flag" errors, but
  recovery still does not work.

 Now the count of tuples in the table is always correct (I never
 provoke a crash during the initial table load), but sometimes updates
 to those tuples that were reported to have been committed are lost.

 This is more subtle, it does not happen on every crash.

  It seems that when recovery ends on "record with zero length at...",
  that recovery is correct.

  But when it ends on "invalid magic number  in log file.." then the
  recovery is screwed up.


 Can you write a self-contained test case for that? I've been trying to
 reproduce that by running the regression tests and pgbench with a streaming
 replication standby, which should be pretty much the same as crash recovery.
 No luck this far.

I could probably reproduce the same problem as Jeff did. Here is the test case:

$ initdb -D data
$ pg_ctl -D data start
$ psql -c "create table t (i int); insert into t
values(generate_series(1,1)); delete from t"
$ pg_ctl -D data stop -m i
$ pg_ctl -D data start

The crash recovery emitted the following server logs:

LOG:  database system was interrupted; last known up at 2012-02-14 02:07:01 JST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/179CC90
LOG:  invalid magic number  in log file 0, segment 1, offset 8060928
LOG:  redo done at 0/17AD858
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

After recovery, I could not see the table t which I created before:

$ psql -c "select count(*) from t"
ERROR:  relation "t" does not exist

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



[HACKERS] Access Error Details from PL/pgSQL

2012-02-13 Thread David E. Wheeler
Hackers,

In PL/pgSQL exception handling, I'm able to access the error code (SQLSTATE) 
and error message (SQLERRM). Is there any way to get at error details (yet)? If 
not, could SQLDETAIL or some such be added?

Thanks,

David


Re: [HACKERS] Access Error Details from PL/pgSQL

2012-02-13 Thread Pavel Stehule
Hello

2012/2/13 David E. Wheeler da...@justatheory.com:
 Hackers,

 In PL/pgSQL exception handling, I'm able to access the error code (SQLSTATE) 
 and error message (SQLERRM). Is there any way to get at error details (yet)? 
 If not, could SQLDETAIL or some such be added?


not in a stable release yet -- see:

http://www.depesz.com/2011/07/20/waiting-for-9-2-stacked-diagnostics-in-plpgsql/
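
For reference, a rough sketch of what the 9.2-era syntax described there
looks like (table t is hypothetical, with a unique column, and is used
only to raise an error that carries a DETAIL field):

do $$
declare
  det text;
begin
  insert into t values (1);  -- assume this collides with an existing row
exception when unique_violation then
  get stacked diagnostics det = pg_exception_detail;
  raise notice 'detail: %', det;  -- e.g. Key (i)=(1) already exists.
end $$;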

Pavel

 Thanks,

 David



Re: [HACKERS] CUDA Sorting

2012-02-13 Thread Gaetano Mendola
On Feb 13, 2012 11:39 a.m., Kohei KaiGai kai...@kaigai.gr.jp wrote:

 2012/2/13 Greg Smith g...@2ndquadrant.com:
  On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
 
  The trend is to have servers capable of running CUDA, providing GPUs via
  external hardware (a PCI Express interface with PCI Express switches);
  look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL.
 
 
  The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
  enclosure.  That's a completely sensible purchase if your goal is to
build a
  computing cluster, where a lot of work is handed off to a set of GPUs.
 I
  think that's even less likely to be a cost-effective option for a
database
  server.  Adding a single dedicated GPU installed in a server to
accelerate
  sorting is something that might be justifiable, based on your
benchmarks.
   This is a much more expensive option than that though.  Details at
  http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who
wants
  to see just how big this external box is.
 
 
  I did some experiments timing the sort done with CUDA and the sort done
  with pg_qsort:

                            CUDA       pg_qsort
  33 Million integers:   ~  900 ms   ~ 6000 ms
   1 Million integers:   ~   21 ms   ~  162 ms
      100k integers:     ~    2 ms   ~   13 ms

  The CUDA time already includes the copy operations (host->device,
  device->host).  As GPU I was using a C2050, and the CPU doing the
  pg_qsort was an Intel(R) Xeon(R) CPU X5650 @ 2.67GHz.
 
 
  That's really interesting, and the X5650 is by no means a slow CPU.  So
this
  benchmark is providing a lot of CPU power yet still seeing over a 6X
speedup
  in sort times.  It sounds like the PCI Express bus has gotten fast
enough
  that the time to hand data over and get it back again can easily be
  justified for medium to large sized sorts.
 
  It would be helpful to take this patch and confirm whether it scales
when
  using in parallel.  Easiest way to do that would be to use the pgbench
-f
  feature, which allows running an arbitrary number of some query at once.
   Seeing whether this acceleration continued to hold as the number of
clients
  increases is a useful data point.
 
  Is it possible for you to break down where the time is being spent?  For
  example, how much of this time is consumed in the GPU itself, compared
to
  time spent transferring data between CPU and GPU?  I'm also curious
where
  the bottleneck is at with this approach.  If it's the speed of the
PCI-E bus
  for smaller data sets, adding more GPUs may never be practical.  If the
bus
  can handle quite a few of these at once before it saturates, it might be
  possible to overload a single GPU.  That seems like it would be really
hard
  to reach for database sorting though; I can't really justify my gut
  feel for that being true.
 
 
   I've never seen a PostgreSQL server capable of running CUDA, and I
   don't expect that to change.
 
  That sounds like:
 
  I think there is a world market for maybe five computers.
  - IBM Chairman Thomas Watson, 1943
 
 
  Yes, and "640K will be enough for everyone", ha ha.  (Having said the
  640K thing is flat out denied by Gates, BTW, and no one has come up
  with proof otherwise).
 
  I think you've made an interesting case for this sort of acceleration
now
  being useful for systems doing what's typically considered a data
warehouse
  task.  I regularly see servers waiting for far more than 13M integers to
  sort.  And I am seeing a clear trend toward providing more PCI-E slots
in
  servers now.  Dell's R810 is the most popular single server model my
  customers have deployed in the last year, and it has 5 X8 slots in it.
 It's
  rare all 5 of those are filled.  As long as a dedicated GPU works fine
when
  dropped to X8 speeds, I know a fair number of systems where one of those
  could be added now.
 
  There's another data point in your favor I didn't notice before your
last
  e-mail.  Amazon has a Cluster GPU Quadruple Extra Large node type that
  runs with NVIDIA Tesla hardware.  That means the installed base of
people
  who could consider CUDA is higher than I expected.  To demonstrate how
much
  that costs, to provision a GPU enabled reserved instance from Amazon
for one
  year costs $2410 at Light Utilization, giving a system with 22GB of
RAM
  and 1.69GB of storage.  (I find the reserved prices easier to compare
with
  dedicated hardware than the hourly ones)  That's halfway between the
  High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100
and
  the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at
  $2200.  If someone could prove sorting was a bottleneck on their server,
  that isn't an unreasonable option to consider on a cloud-based database
  deployment.
 
  I still think that an approach based on OpenCL is more likely to be
suitable
  for PostgreSQL, which was part of why I gave CUDA low odds here.  The
points
  in favor of OpenCL are:
 
  -Since you last posted, 

Re: [HACKERS] Access Error Details from PL/pgSQL

2012-02-13 Thread David E. Wheeler
On Feb 13, 2012, at 9:30 AM, Pavel Stehule wrote:

 no in stable
 
 http://www.depesz.com/2011/07/20/waiting-for-9-2-stacked-diagnostics-in-plpgsql/

Ah, great, I had forgotten about that.

Thank you,

David




Re: [HACKERS] auto_explain produces invalid JSON

2012-02-13 Thread Andrew Dunstan



On 02/13/2012 11:15 AM, Tom Lane wrote:

[ sorry for ignoring this over the weekend --- I wasn't feeling very well ]

Andrew Dunstanand...@dunslane.net  writes:

On 02/11/2012 03:22 PM, Tom Lane wrote:

I'm inclined to think that this is auto_explain's error, not that of
the core code, ie we should be changing the output.

Well, maybe this is more to your taste, although it strikes me as more
than something of a kludge. At least it's short :-)

I see you've already committed this, but I agree that it's quite a
kluge.

After looking a bit more at the existing explain code, it seems like the
critical issue is that explain.c has ExplainOpenGroup/ExplainCloseGroup
calls around the ExplainPrintPlan call (see ExplainOnePlan), while
auto_explain does not.  I did not like your originally proposed patch
because it seemed to introduce yet another formatting concept into code
that has already got a few too many.  But couldn't we fix this by
exporting ExplainOpenGroup/ExplainCloseGroup and then calling those from
auto_explain?


Yeah, maybe. We'd still have to do it conditionally (have to use 
ExplainBeginOutput for the XML case), but it would possibly be less kludgy.


cheers

andrew



Re: [HACKERS] auto_explain produces invalid JSON

2012-02-13 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 02/13/2012 11:15 AM, Tom Lane wrote:
 After looking a bit more at the existing explain code, it seems like the
 critical issue is that explain.c has ExplainOpenGroup/ExplainCloseGroup
 calls around the ExplainPrintPlan call (see ExplainOnePlan), while
 auto_explain does not.

 Yeah, maybe. We'd still have to do it conditionally (have to use 
 ExplainBeginOutput for the XML case), but it would possibly be less kludgy.

Hm?  I wasn't suggesting removing the ExplainBeginOutput call, but more
like

ExplainBeginOutput(es);
+   ExplainOpenGroup(...);
ExplainQueryText(es, queryDesc);
ExplainPrintPlan(es, queryDesc);
+   ExplainCloseGroup(...);
ExplainEndOutput(es);

Details still TBD; the point is just that it's not clear to me why
auto_explain should need a formatting concept that doesn't already exist
within explain.c.

regards, tom lane



Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-13 Thread Robert Haas
On Mon, Feb 13, 2012 at 10:48 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Well, personally I have a hard time calling READ COMMITTED behavior
 sensible.  Consider this:

 [ gigantic pile of fail ]

Yeah, that's bad all right.  I think it's hard to argue that the
current behavior is sensible; the trick is to figure out something
that's better but doesn't impose too much additional overhead.  I
wonder if it's possible to use SSI (or some stripped-down mechanism
along similar lines) to track these kinds of write conflicts, rather
than cluttering the system catalogs with lots more TransactionId
fields.  Like, when we TRUNCATE a table, we could essentially make a
note in memory someplace recording the write conflict.

Unfortunately, the full-blown SSI machinery seems too expensive to use
for this, especially on all-read workloads where there are no actual
conflicts but lots of conflict checks.  But that could probably be
optimized.  The attraction of using something like an in-memory
conflict manager is that it would probably be quite general.  We could
fix problems of this type with no on-disk format changes whenever we
discover them (as we will certainly continue to do) just by adding the
appropriate flag-a-conflict calls to the right places in the code.
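
For concreteness, the anomaly the subject line refers to can be seen with
a sketch like this (illustrative only; it shows the documented
not-MVCC-safe behaviour of TRUNCATE today, not the proposed patch):

-- session 1
create table trunc_demo (i int);
insert into trunc_demo values (1);
begin isolation level repeatable read;
select 1;                  -- take the snapshot without touching trunc_demo

-- session 2
truncate trunc_demo;       -- commits immediately (autocommit)

-- session 1
select * from trunc_demo;  -- returns no rows, although row (1) should
                           -- still be visible under this snapshot
commit;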

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] auto_explain produces invalid JSON

2012-02-13 Thread Andrew Dunstan



On 02/13/2012 12:48 PM, Tom Lane wrote:

Andrew Dunstanand...@dunslane.net  writes:

On 02/13/2012 11:15 AM, Tom Lane wrote:

After looking a bit more at the existing explain code, it seems like the
critical issue is that explain.c has ExplainOpenGroup/ExplainCloseGroup
calls around the ExplainPrintPlan call (see ExplainOnePlan), while
auto_explain does not.

Yeah, maybe. We'd still have to do it conditionally (have to use
ExplainBeginOutput for the XML case), but it would possibly be less kludgy.

Hm?  I wasn't suggesting removing the ExplainBeginOutput call, but more
like

ExplainBeginOutput(es);
+   ExplainOpenGroup(...);
ExplainQueryText(es, queryDesc);
ExplainPrintPlan(es, queryDesc);
+   ExplainCloseGroup(...);
ExplainEndOutput(es);

Details still TBD; the point is just that it's not clear to me why
auto_explain should need a formatting concept that doesn't already exist
within explain.c.




This will introduce an extra level of nesting for no good reason.

But this would work:

-   ExplainBeginOutput(es);
+   if (auto_explain_log_format != EXPLAIN_FORMAT_JSON)
+   ExplainBeginOutput(es);
+   else
+   ExplainOpenGroup(NULL, NULL, true, es);
ExplainQueryText(es, queryDesc);
ExplainPrintPlan(es, queryDesc);
-   ExplainEndOutput(es);
+   if (auto_explain_log_format != EXPLAIN_FORMAT_JSON)
+   ExplainEndOutput(es);
+   else
+   ExplainCloseGroup(NULL, NULL, true, es);

cheers

andrew



Re: [HACKERS] CUDA Sorting

2012-02-13 Thread Greg Stark
I don't think we should be looking at either CUDA or OpenCL directly.
We should be looking for a generic library that can target either and
is well maintained and actively developed. Any GPU code we write
ourselves would rapidly be overtaken by changes in the hardware and
innovations in parallel algorithms. If we find a library that provides
a sorting api and adapt our code to use it then we'll get the benefits
of any new hardware feature as the library adds support for them.



Re: [HACKERS] auto_explain produces invalid JSON

2012-02-13 Thread Andrew Dunstan



On 02/13/2012 01:33 PM, Andrew Dunstan wrote:



On 02/13/2012 12:48 PM, Tom Lane wrote:

Andrew Dunstanand...@dunslane.net  writes:

On 02/13/2012 11:15 AM, Tom Lane wrote:
After looking a bit more at the existing explain code, it seems 
like the
critical issue is that explain.c has 
ExplainOpenGroup/ExplainCloseGroup

calls around the ExplainPrintPlan call (see ExplainOnePlan), while
auto_explain does not.

Yeah, maybe. We'd still have to do it conditionally (have to use
ExplainBeginOutput for the XML case), but it would possibly be less 
kludgy.

Hm?  I wasn't suggesting removing the ExplainBeginOutput call, but more
like

ExplainBeginOutput(es);
+ExplainOpenGroup(...);
ExplainQueryText(es, queryDesc);
ExplainPrintPlan(es, queryDesc);
+ExplainCloseGroup(...);
ExplainEndOutput(es);

Details still TBD; the point is just that it's not clear to me why
auto_explain should need a formatting concept that doesn't already exist
within explain.c.




This will introduce an extra level of nesting for no good reason.

But this would work:

-   ExplainBeginOutput(es);
+   if (auto_explain_log_format != EXPLAIN_FORMAT_JSON)
+   ExplainBeginOutput(es);
+   else
+   ExplainOpenGroup(NULL, NULL, true, es);
ExplainQueryText(es, queryDesc);
ExplainPrintPlan(es, queryDesc);
-   ExplainEndOutput(es);
+   if (auto_explain_log_format != EXPLAIN_FORMAT_JSON)
+   ExplainEndOutput(es);
+   else
+   ExplainCloseGroup(NULL, NULL, true, es);




Except that it causes other problems.

I think we'd probably better let sleeping dogs lie.

cheers

andrew



Re: [HACKERS] When do we lose column names?

2012-02-13 Thread Andrew Dunstan



On 02/13/2012 11:00 AM, Tom Lane wrote:

Robert Haasrobertmh...@gmail.com  writes:

On Sat, Feb 11, 2012 at 11:11 AM, Andrew Dunstan
aduns...@postgresql.org  wrote:

Do we actually need to bother with these cases?

In flatten_join_alias_vars_mutator(), we've got a RangeTblEntry to
work with.  I think the column names are to be found in the alias
and/or eref attributes of the RangeTblEntry.

The eref names are the ones to use.  alias just records the original AS
clause (if any) attached to the RTE, which is mostly useful only for
reverse-listing the query.


In adjust_appendrel_attrs_mutator(), we have a list, translated_vars,
whose order matches the column order of the parent rel.  If we had the
parent's RangeTblEntry, we could probably proceed as in the previous
case.  But the AppendRelInfo only contains the index of the RT.  Maybe
we can figure out a way to use rt_fetch to get the RangeTblEntry
itself, but that requires a pointer to the range table itself, which
we haven't got.

This is surely fixable by passing a bit more information down.  If you
(Andrew) have something that covers everything but this issue, pass it
over and I'll take a whack at it.


What I have fixes one of the three cases I have identified that appear 
to need fixing, the one that caused the json tests to crash with your 
original partial patch. See 
https://bitbucket.org/adunstan/pgdevel/changesets/tip/rowexprs. I 
won't have time to return to this for a few days. The two remaining 
cases should be fairly independent I think. If you don't get to this 
before then I'll look at the flatten_join_alias_vars_mutator case again 
and try to fix it based on the above, and then give you what I've got.


cheers

andrew




Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-13 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote: 
 Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Well, personally I have a hard time calling READ COMMITTED
 behavior sensible.  Consider this:

 [ gigantic pile of fail ]
 
 Yeah, that's bad all right.  I think it's hard to argue that the
 current behavior is sensible; the trick is to figure out something
 that's better but doesn't impose too much additional overhead.  I
 wonder if it's possible to use SSI (or some stripped-down
 mechanism along similar lines) to track these kinds of write
 conflicts, rather than cluttering the system catalogs with lots
 more TransactionId fields.  Like, when we TRUNCATE a table, we
 could essentially make a note in memory someplace recording the
 write conflict.
 
Potential additional uses of the predicate locking developed for SSI
keep surfacing.  At some point we should probably pick a couple of
them and try to fashion an API for the non-blocking predicate
locking logic that serves them and SSI.  Since this predicate
locking system is explicitly interested only in read-write
conflicts, it does seem like it could work for SELECT FOR UPDATE
versus writes.
 
As mentioned once or twice before, it was pretty clear that while
predicate locking is required for SSI and is probably 80% of the C
code in the patch, it is really a separate thing -- we just didn't
want to try to create a generalized API based on the one initial
usage.  I think that an API based on registering and listening would
be the ticket.
 
 Unfortunately, the full-blown SSI machinery seems too expensive to
 use for this, especially on all-read workloads where there are no
 actual conflicts but lots of conflict checks.
 
In an all-read workload, if you actually declare the transactions to
be read-only SSI should not introduce much overhead.  If there's
much overhead showing up there at the moment, it would probably be
pretty easy to eliminate.  When there are any read-write
transactions active, it's a different story.
 
 But that could probably be optimized.
 
Undoubtedly.  It's disappointing that neither Dan nor I could find
the round tuits to make the kinds of changes in the SSI locking that
you made in some other areas for 9.2.  I'm not really sure how the
performance impact breaks down between predicate locking and SSI
proper, although I would tend to start from the assumption that,
like the lines of code, it's 80% in the predicate locking.
 
 The attraction of using something like an in-memory conflict
 manager is that it would probably be quite general.  We could fix
 problems of this type with no on-disk format changes whenever we
 discover them (as we will certainly continue to do) just by adding
 the appropriate flag-a-conflict calls to the right places in the
 code.
 
Assuming that the problems could be expressed in terms of read-write
conflicts, that's probably largely true.  I'm not sure that holds
for some of the funny READ COMMITTED behaviors, though.  I think the
only real cure there would be to make subtransactions cheap enough
that we could wrap execution of each SELECT and DML statement in a
subtransaction and roll back for another try with a new snapshot on
conflict.
 
If you want to track something other than read-write conflicts
and/or you want blocking when a conflict is found, you might be
better off looking to bend the heavyweight locks to your purposes.
 
-Kevin



Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-13 Thread Jay Levitt

Robert Haas wrote:

On Mon, Feb 13, 2012 at 7:45 AM, Robert Haasrobertmh...@gmail.com  wrote:

On Thu, Feb 9, 2012 at 3:37 PM, Jay Levittjay.lev...@gmail.com  wrote:

So my pre-built 9.1.2 takes 434s, my source-built 9.2 takes 509s, and
(probably both of our) 9.1-HEAD takes 1918s... is that something to worry
about, and if so, are there any tests I can run to assist? That bug doesn't
affect me personally, but y'know, community and all that.  Also, I wonder if
it's something like 9.2 got way faster doing X, but meanwhile, HEAD got way
slower doing Y., and this is a canary in the coal mine.

This might be a lame hypothesis, but... is it possible that you built
your 9.1-tip binaries with --enable-cassert?  Or with different
optimization options?


No, I think I/O just varies more than my repeated tests on 1M rows 
indicated.  I ran the 10M-row test four times on the same server, 
alternating between packaged 9.1.2 and source-built 9.1.2 (default configure 
options), and saw these times:


         INSERT  INDEX
apt          76    578
source      163    636
apt          73    546
source       80    473

EBS has no performance guarantees at all; you share your disks with an 
arbitrary number of other users, so if someone in the neighborhood decides 
to do some heavy disk I/O, you lose. Let this be a lesson to me: run 
benchmarks locally!



So I tested.  On my MacBook Pro, your test script builds the index in
just over 25 s on both REL9_1_2 and this morning's REL9_1_STABLE.


I think that's the 1-million version I emailed; try adding a zero and see if 
it doesn't take a little longer.


Jay



Re: [HACKERS] psql tab completion for SELECT

2012-02-13 Thread Peter Eisentraut
On tor, 2012-02-09 at 23:02 +0100, Dimitri Fontaine wrote:
 Peter Eisentraut pete...@gmx.net writes:
  Make tab-completion also complete function names – like: SELECT
  pg_get<tab><tab> to see all functions that start with pg_get.

  Make tab-completion work for columns in SELECT. I know that when writing
  a SELECT clause, psql doesn’t know which table it will deal with, but it
  could search through all the columns in the database.
 
  That seems pretty useful, and it's more or less a one-line change, as in
  the attached patch.
 
 Does that include support for completing SRF functions in the FROM clause?

No, that's an entirely different issue.




Re: [HACKERS] psql tab completion for SELECT

2012-02-13 Thread Peter Eisentraut
On fre, 2012-02-10 at 01:24 -0500, Tom Lane wrote:
 That seems pretty nearly entirely bogus.  What is the argument for
 supposing that the word right after SELECT is a function name?  I would
 think it would be a column name (from who-knows-what table) much more
 often.

That's what the patch does.  It looks like you misread what I wrote.

 Also, if it is useful, people will expect it to work in more
 places than just the first word after SELECT --- for instance, somebody
 who didn't realize what a kluge it was would expect it to also work
 right after a top-level comma after SELECT.  Or probably after a left
 parenthesis in the SELECT list.  Etc.

All of psql tab completion works like that.  We only complete the first
table after FROM, the first column after ORDER BY, the first privilege
to grant, the first role to grant to, the first table to drop, the first
reloption, etc.





Re: [HACKERS] pl/python long-lived allocations in datum-dict transformation

2012-02-13 Thread Jan Urbański
On 12/02/12 00:48, Tom Lane wrote:
Jan Urbański wulc...@wulczer.org writes:
 This is annoying for functions that plough through large tables, doing
 some calculation. Attached is a patch that does the conversion of
 PostgreSQL Datums into Python dict objects in a scratch memory context
 that gets reset every time.
 
 As best I can tell, this patch proposes creating a new, separate context
 (chewing up 8KB+) for every plpython procedure that's ever used in a
 given session.  This cure could easily be worse than the disease as far

Yeah, that's not ideal.

 What's more, it's unclear that
 it won't malfunction altogether if the function is used recursively
 (ie, what if PLyDict_FromTuple ends up calling the same function again?)

I was a bit worried about that, but the only place where
PLyDict_FromTuple calls into some other code is when it calls the type's
specific I/O function, which AFAICT can't call back into user code
(except for user-defined C I/O routines). It's not very comfortable, but
I think PLyDict_FromTuple can be allowed to be non-reentrant.

 Can't you fix it so that the temp context is associated with a
 particular function execution, rather than being statically allocated
 per-function?

That would be cool, but I failed to easily get a handle on something
that's like the execution context of a PL/Python function... Actually,
if we assume that PLyDict_FromTuple (which is quite a low-level thing)
never calls PL/Python UDFs we could keep a single memory context in
top-level PL/Python memory and pay the overhead once in a session, not
once per function.

OTOH if we want to make it reentrant, some more tinkering would be in order.

Cheers,
Jan



Re: [HACKERS] CUDA Sorting

2012-02-13 Thread Gaetano Mendola
On Feb 13, 2012 7:49 p.m., Greg Stark st...@mit.edu wrote:

 I don't think we should be looking at either CUDA or OpenCL directly.
 We should be looking for a generic library that can target either and
 is well maintained and actively developed. Any GPU code we write
 ourselves would rapidly be overtaken by changes in the hardware and
 innovations in parallel algorithms. If we find a library that provides
 a sorting api and adapt our code to use it then we'll get the benefits
 of any new hardware feature as the library adds support for them.

To sort integers I used the Thrust Nvidia library.


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-02-13 Thread Marko Kreen
On Tue, Feb 07, 2012 at 04:44:09PM +0200, Marko Kreen wrote:
 Although it seems we could allow exceptions, at least when we are
 speaking of the Postgres backend, as the connection and result are in
 an internally consistent state when the handler is called, and the
 partial PGresult is stored under PGconn->result.  Even the
 connection is in a consistent state, as the row packet is
 fully processed.
 
 So we have 3 variants, all work, but which one do we want to support?
 
 1) Exceptions are not allowed.
 2) Exceptions are allowed, but when it happens, user must call
PQfinish() next, to avoid processing incoming data from
potentially invalid state.
 3) Exceptions are allowed, and further row processing can continue
with PQisBusy() / PQgetResult()
 3.1) The problematic row is retried.  (Current behaviour)
 3.2) The problematic row is skipped.

I converted the patch to support 3.2), that is - skip row on exception.
That required some cleanups of getAnotherTuple() API, plus I made some
other cleanups.  Details below.

But during this I also started to think what happens if the user does
*not* throw exceptions.  Eg. Python does exceptions via regular return
values, which means complications when passing them upwards.

The main case I'm thinking of is actually a resultset iterator in a
high-level language.  The current callback-only style API requires that
rows are stuffed into a temporary buffer until the network blocks and user
code gets control again.  This feels clumsy for a performance-oriented API.
Another case is user code that wants to do complex processing.  Doing a
lot of stuff under a callback seems dubious.  And what if some part of the
code calls PQfinish() during some error recovery?

I tried imagining some sort of getFoo() style API for fetching in-flight
row data, but I always ended up at a "rewrite libpq" step, so I feel
it's not productive to go there.

Instead I added a simple feature: rowProcessor can return '2',
in which case getAnotherTuple() does an early exit without setting
any error state.  On the user side it appears as if PQisBusy()
returned with a TRUE result.  All pointers stay valid, so the callback
can just stuff them into some temp area.  ATM there is no indication,
though, whether the exit was due to the callback or to other reasons,
so the user must detect it based on whether new temp pointers appeared,
which means those must be cleaned before calling PQisBusy() again.
This actually feels like a feature: those must not stay around
after a single call.

It's included in main patch, but I also attached it as separate patch
so that it can be examined separately and reverted if not acceptable.

But as it's really simple, I recommend including it.

Its usage might not be obvious though; should we include
example code in the docs?



New feature:

* Row processor can return 2, then PQisBusy() does an early exit.
  Supported only when the connection is in non-blocking mode.

Cleanups:

* Row data is tagged as processed when rowProcessor is called,
  so exceptions will skip the row.  This simplifies non-exceptional
  handling as well.

* Converted 'return EOF' in V3 getAnotherTuple() to set an error instead.
  AFAICS those EOFs are remnants from V2 getAnotherTuple(),
  not something that is coded deliberately.  Note that when
  v3 getAnotherTuple() is called the row packet is fully in the buffer.
  Returning EOF on a broken packet to signify 'give me more data' does not
  result in any useful behaviour; instead you can simply get
  into an infinite loop.

Fix bugs in my previous changes:

* Split the OOM error handling from custom error message handling;
  previously the error message in PGresult was lost due to the OOM
  logic's early free of the PGresult.

* Drop unused goto label from experimental debug code.

-- 
marko

*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
***
*** 160,162  PQconnectStartParams  157
--- 160,164 
  PQping158
  PQpingParams  159
  PQlibVersion  160
+ PQsetRowProcessor	  161
+ PQsetRowProcessorErrMsg	  162
*** a/src/interfaces/libpq/fe-connect.c
--- b/src/interfaces/libpq/fe-connect.c
***
*** 2693,2698  makeEmptyPGconn(void)
--- 2693,2701 
  	conn->wait_ssl_try = false;
  #endif
  
+ 	/* set default row processor */
+ 	PQsetRowProcessor(conn, NULL, NULL);
+ 
  	/*
  	 * We try to send at least 8K at a time, which is the usual size of pipe
  	 * buffers on Unix systems.  That way, when we are sending a large amount
***
*** 2711,2718  makeEmptyPGconn(void)
--- 2714,2726 
  	initPQExpBuffer(conn->errorMessage);
  	initPQExpBuffer(conn->workBuffer);
  
+ 	/* set up initial row buffer */
+ 	conn->rowBufLen = 32;
+ 	conn->rowBuf = (PGrowValue *)malloc(conn->rowBufLen * sizeof(PGrowValue));
+ 
  	if (conn->inBuffer == NULL ||
  		conn->outBuffer == NULL ||
+ 		conn->rowBuf == NULL ||
  		PQExpBufferBroken(conn->errorMessage) ||
  		PQExpBufferBroken(conn->workBuffer))
  	{
***
*** 2814,2819  freePGconn(PGconn *conn)

Re: [HACKERS] pl/python long-lived allocations in datum-dict transformation

2012-02-13 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 On 12/02/12 00:48, Tom Lane wrote:
 What's more, it's unclear that
 it won't malfunction altogether if the function is used recursively
 (ie, what if PLyDict_FromTuple ends up calling the same function again?)

 I was a bit worried about that, but the only place where
 PLyDict_FromTuple calls into some other code is when it calls the type's
 specific I/O function, which AFAICT can't call back into user code
 (except for user-defined C I/O routines). It's not very comfortable, but
 I think PLyDict_FromTuple can be allowed to be non-reentrant.

I think that's pretty short-sighted.  Even if it's safe today (which
I am not 100% convinced of), there are plenty of foreseeable reasons
why it might^Wwill break in the future.

* There is no reason to think that datatype I/O functions will never
be written in anything but C.  People have asked repeatedly for the
ability to write them in higher-level languages.  I doubt that would
ever be possible in plpgsql, but with languages that can do
bit-twiddling like plpython or plperl, it seems possible.

* A datatype I/O function, even if written in C, could call user-written
code.  See domain_in for example, which can invoke arbitrary processing
via domain constraint checking.  If you were proposing to patch
PLyObject_ToTuple rather than the other direction, this patch would be
breakable today.  Admittedly the breakage would require some rather
contrived coding (your domain's constraint check function does
*what*?), but it would still be a security bug.

* Once we have the ability to associate a temp memory context with
plpython, there will be a temptation to use it for other purposes
besides this one, and it will not be long before such a purpose does
open a recursion risk, even if there's none there today.  (Speaking of
which, it sure looks to me like PLyObject_ToDatum, PLyObject_ToTuple,
etc leak memory like there's no tomorrow.)

 OTOH if we want to make it reentrant, some more tinkering would be in order.

I think that's in order.

regards, tom lane



[HACKERS] pg_test_fsync performance

2012-02-13 Thread Bruce Momjian
I have heard complaints that /contrib/pg_test_fsync is too slow.  I
thought it was impossible to speed up pg_test_fsync without reducing its
accuracy.

However, now that I have some consumer-grade SATA 2 drives, I noticed that
the slowness is really in the open_sync test:

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
 1 * 16kB open_sync write  76.421 ops/sec
 2 *  8kB open_sync writes 38.689 ops/sec
 4 *  4kB open_sync writes 19.140 ops/sec
 8 *  2kB open_sync writes  4.938 ops/sec
16 *  1kB open_sync writes  2.480 ops/sec

These last few lines can take very long, so I developed the attached
patch that scales down the number of tests.  This makes it more
reasonable to run pg_test_fsync.

I would like to apply this for PG 9.2.
 
-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_test_fsync/pg_test_fsync.c b/contrib/pg_test_fsync/pg_test_fsync.c
new file mode 100644
index 042b02b..970deda
*** a/contrib/pg_test_fsync/pg_test_fsync.c
--- b/contrib/pg_test_fsync/pg_test_fsync.c
*** test_open_syncs(void)
*** 358,368 
  	printf("(This is designed to compare the cost of writing 16kB\n");
  	printf("in different write open_sync sizes.)\n");
  
! 	test_open_sync("16kB open_sync write", 16);
! 	test_open_sync(" 8kB open_sync writes", 8);
! 	test_open_sync(" 4kB open_sync writes", 4);
! 	test_open_sync(" 2kB open_sync writes", 2);
! 	test_open_sync(" 1kB open_sync writes", 1);
  }
  
  /*
--- 358,368 
  	printf("(This is designed to compare the cost of writing 16kB\n");
  	printf("in different write open_sync sizes.)\n");
  
! 	test_open_sync(" 1 * 16kB open_sync write", 16);
! 	test_open_sync(" 2 *  8kB open_sync writes", 8);
! 	test_open_sync(" 4 *  4kB open_sync writes", 4);
! 	test_open_sync(" 8 *  2kB open_sync writes", 2);
! 	test_open_sync("16 *  1kB open_sync writes", 1);
  }
  
  /*
*** test_open_sync(const char *msg, int writ
*** 376,381 
--- 376,385 
  ops,
  writes;
  #endif
+ 	int	save_ops_per_test = ops_per_test;
+ 
+ 	/* This test can be long, so scale down the number of tests */
+ 	ops_per_test = ops_per_test * writes_size / 16;
  
  	printf(LABEL_FORMAT, msg);
  	fflush(stdout);
*** test_open_sync(const char *msg, int writ
*** 402,407 
--- 406,412 
  #else
  	printf(NA_FORMAT, "n/a\n");
  #endif
+ 	ops_per_test = save_ops_per_test;
  }
  
  static void



Re: [HACKERS] pg_test_fsync performance

2012-02-13 Thread Robert Haas
On Mon, Feb 13, 2012 at 7:42 PM, Bruce Momjian br...@momjian.us wrote:
 I have heard complaints that /contrib/pg_test_fsync is too slow.  I
 thought it was impossible to speed up pg_test_fsync without reducing its
 accuracy.

 However, now that I have some consumer-grade SATA 2 drives, I noticed that
 the slowness is really in the open_sync test:

        Compare open_sync with different write sizes:
        (This is designed to compare the cost of writing 16kB
        in different write open_sync sizes.)
                 1 * 16kB open_sync write          76.421 ops/sec
                 2 *  8kB open_sync writes         38.689 ops/sec
                 4 *  4kB open_sync writes         19.140 ops/sec
                 8 *  2kB open_sync writes          4.938 ops/sec
                16 *  1kB open_sync writes          2.480 ops/sec

 These last few lines can take very long, so I developed the attached
 patch that scales down the number of tests.  This makes it more
 reasonable to run pg_test_fsync.

 I would like to apply this for PG 9.2.

On my MacOS X, it's fsync_writethrough that's insanely slow:

[rhaas pg_test_fsync]$ ./pg_test_fsync
2000 operations per test
Direct I/O is not supported on this platform.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync                      3523.267 ops/sec
fdatasync                          3360.023 ops/sec
fsync                              2410.048 ops/sec
fsync_writethrough                   12.576 ops/sec
open_sync                          3649.475 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync                      1885.284 ops/sec
fdatasync                          2544.652 ops/sec
fsync                              3241.218 ops/sec
fsync_writethrough  ^C

Instead of or in addition to a fixed number of operations per test, maybe
we should cut off each test after a certain amount of wall-clock time,
like 15 seconds.  It's kind of insane to run one of these tests for 3
minutes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_test_fsync performance

2012-02-13 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Instead of or in addition to a fixed number operations per test, maybe
 we should cut off each test after a certain amount of wall-clock time,
 like 15 seconds.

+1, I was about to suggest the same thing.  Running any of these tests
for a fixed number of iterations will result in drastic degradation of
accuracy as soon as the machine's behavior changes noticeably from what
you were expecting.  Run them for a fixed time period instead.  Or maybe
do a few, then check elapsed time and estimate a number of iterations to
use, if you're worried about the cost of doing gettimeofday after each
write.

regards, tom lane



Re: [HACKERS] pg_test_fsync performance

2012-02-13 Thread Bruce Momjian
On Mon, Feb 13, 2012 at 08:28:03PM -0500, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  Instead of or in addition to a fixed number operations per test, maybe
  we should cut off each test after a certain amount of wall-clock time,
  like 15 seconds.
 
 +1, I was about to suggest the same thing.  Running any of these tests
 for a fixed number of iterations will result in drastic degradation of
 accuracy as soon as the machine's behavior changes noticeably from what
 you were expecting.  Run them for a fixed time period instead.  Or maybe
 do a few, then check elapsed time and estimate a number of iterations to
 use, if you're worried about the cost of doing gettimeofday after each
 write.

Good idea, and it worked out very well.  I changed the -o loops
parameter to -s seconds which calls alarm() after (default) 2 seconds,
and then once the operation completes, computes a duration per
operation.

The test now runs in 30 seconds and produces similar output to the
longer version.
 
-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_test_fsync/pg_test_fsync.c b/contrib/pg_test_fsync/pg_test_fsync.c
new file mode 100644
index 3fcb087..8554426
*** a/contrib/pg_test_fsync/pg_test_fsync.c
--- b/contrib/pg_test_fsync/pg_test_fsync.c
***
*** 27,41 
  #define NA_FORMAT			"%18s"
  #define OPS_FORMAT			"%9.3f ops/sec"
  
  static const char *progname;
  
! static int	ops_per_test = 2000;
  static int	needs_unlink = 0;
  static char full_buf[XLOG_SEG_SIZE],
  		   *buf,
  		   *filename = FSYNC_FILENAME;
  static struct timeval start_t,
  			stop_t;
  
  
  static void handle_args(int argc, char *argv[]);
--- 27,50 
  #define NA_FORMAT			"%18s"
  #define OPS_FORMAT			"%9.3f ops/sec"
  
+ #define START_TIMER	\
+ do { \
+ 	alarm_triggered = false; \
+ 	alarm(secs_per_test); \
+ 	gettimeofday(start_t, NULL); \
+ } while (0)
+ 
+ 
  static const char *progname;
  
! static int	secs_per_test = 2;
  static int	needs_unlink = 0;
  static char full_buf[XLOG_SEG_SIZE],
  		   *buf,
  		   *filename = FSYNC_FILENAME;
  static struct timeval start_t,
  			stop_t;
+ static bool alarm_triggered = false;
  
  
  static void handle_args(int argc, char *argv[]);
*** static void test_sync(int writes_per_op)
*** 46,57 
  static void test_open_syncs(void);
  static void test_open_sync(const char *msg, int writes_size);
  static void test_file_descriptor_sync(void);
  static void signal_cleanup(int sig);
  
  #ifdef HAVE_FSYNC_WRITETHROUGH
  static int	pg_fsync_writethrough(int fd);
  #endif
! static void print_elapse(struct timeval start_t, struct timeval stop_t);
  static void die(const char *str);
  
  
--- 55,67 
  static void test_open_syncs(void);
  static void test_open_sync(const char *msg, int writes_size);
  static void test_file_descriptor_sync(void);
+ static void process_alarm(int sig);
  static void signal_cleanup(int sig);
  
  #ifdef HAVE_FSYNC_WRITETHROUGH
  static int	pg_fsync_writethrough(int fd);
  #endif
! static void print_elapse(struct timeval start_t, struct timeval stop_t, int ops);
  static void die(const char *str);
  
  
*** main(int argc, char *argv[])
*** 65,70 
--- 75,81 
  	/* Prevent leaving behind the test file */
  	signal(SIGINT, signal_cleanup);
  	signal(SIGTERM, signal_cleanup);
+ 	signal(SIGALRM, process_alarm);
  #ifdef SIGHUP
  	/* Not defined on win32 */
  	signal(SIGHUP, signal_cleanup);
*** handle_args(int argc, char *argv[])
*** 96,102 
  {
  	static struct option long_options[] = {
  		{"filename", required_argument, NULL, 'f'},
! 		{"ops-per-test", required_argument, NULL, 'o'},
  		{NULL, 0, NULL, 0}
  	};
  	int			option;			/* Command line option */
--- 107,113 
  {
  	static struct option long_options[] = {
  		{"filename", required_argument, NULL, 'f'},
! 		{"secs-per-test", required_argument, NULL, 's'},
  		{NULL, 0, NULL, 0}
  	};
  	int			option;			/* Command line option */
*** handle_args(int argc, char *argv[])
*** 117,123 
  		}
  	}
  
! 	while ((option = getopt_long(argc, argv, "f:o:",
 				 long_options, &optindex)) != -1)
  	{
  		switch (option)
--- 128,134 
  		}
  	}
  
! 	while ((option = getopt_long(argc, argv, "f:s:",
 				 long_options, &optindex)) != -1)
  	{
  		switch (option)
*** handle_args(int argc, char *argv[])
*** 126,133 
  filename = strdup(optarg);
  break;
  
! 			case 'o':
! ops_per_test = atoi(optarg);
  break;
  
  			default:
--- 137,144 
  filename = strdup(optarg);
  break;
  
! 			case 's':
! secs_per_test = atoi(optarg);
  break;
  
  			default:
*** handle_args(int argc, char *argv[])
*** 148,154 
  		exit(1);
  	}
  
! 	printf("%d operations per test\n", ops_per_test);
  #if PG_O_DIRECT != 0
  	printf(O_DIRECT supported on this 

[HACKERS] SSI rw-conflicts and 2PC

2012-02-13 Thread Dan Ports
Looking over the SSI 2PC code recently, I noticed that I overlooked a
case that could lead to non-serializable behavior after a crash.

When we PREPARE a serializable transaction, we store part of the
SERIALIZABLEXACT in the statefile (in addition to the list of SIREAD
locks). One of the pieces of information we record is whether the
transaction had any conflicts in or out. The problem is that that can
change if a new conflict occurs after the transaction has prepared.

Here's an example of the problem (based on the receipt-report test):

-- Setup
CREATE TABLE ctl (k text NOT NULL PRIMARY KEY, deposit_date date NOT NULL);
INSERT INTO ctl VALUES ('receipt', DATE '2008-12-22');
CREATE TABLE receipt (receipt_no int NOT NULL PRIMARY KEY, deposit_date date 
NOT NULL, amount numeric(13,2));

-- T2
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO receipt VALUES (3, (SELECT deposit_date FROM ctl WHERE k = 
'receipt'), 4.00);
PREPARE TRANSACTION 't2';

-- T3
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE ctl SET deposit_date = DATE '2008-12-23' WHERE k = 'receipt';
COMMIT;

-- T1
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM ctl WHERE k = 'receipt';
SELECT * FROM receipt WHERE deposit_date = DATE '2008-12-22';
COMMIT;

Running this sequence of transactions normally, T1 will be rolled back
because of the pattern of conflicts T1 -> T2 -> T3, as we'd expect. This
should still be true even if we restart the database before executing
the last transaction -- but it's not. The problem is that, when T2
prepared, it had no conflicts, so we recorded that in the statefile.
The T2 -> T3 conflict happened later, so we didn't know about it during
recovery.

I discussed this a bit with Kevin and we agreed that this is important
to fix, since it's a false negative that violates serializability. The
question is how to fix it. There are a couple of options...

The easiest answer would be to just treat every prepared transaction
found during recovery as though it had a conflict in and out. This is
roughly a one-line change, and it's certainly safe. But the downside is
that this is pretty restrictive: after recovery, we'd have to abort any
serializable transaction that tries to read anything that a prepared
transaction wrote, or modify anything that it read, until that
transaction is either committed or rolled back.

To do better than that, we want to know accurately whether the prepared
transaction had a conflict with a transaction that prepared or
committed before the crash. We could do this if we had a way to append
a record to the 2PC statefile of an already-prepared transaction --
then we'd just add a new record indicating the conflict. Of course, we
don't have a way to do that. It'd be tricky to add support for this,
since it has to be crash-safe, so the question is whether the improved
precision justifies the complexity it would require.

A third option is to observe that the only conflicts *in* that matter
from a recovered prepared transaction are from other prepared
transactions. So we could have prepared transactions include in their
statefile the xids of any prepared transactions they conflicted with
at prepare time, and match them up during recovery to reconstruct the
graph. This is a middle ground between the other two options. It
doesn't require modifying the statefile after prepare. However, conflicts
*out* to non-prepared transactions do matter, and this doesn't record
those, so we'd have to do the conservative thing -- which means that
after recovery, no one can read anything a prepared transaction wrote.

I thought I'd throw these options out there to see which ones people
think are reasonable (or any better ideas). Of the three, I think the
first (simplest) solution is the only one we could plausibly backpatch
to 9.1. But if the extra aborts after recovery seem too expensive, we
may want to consider one of the other options for future releases.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/



Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)

2012-02-13 Thread Robert Haas
On Mon, Feb 13, 2012 at 7:51 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 I rebased the patch due to the updates of pg_proc.h.

 Please see the newer one. Thanks,

Thanks, committed.  I think, though, that some further adjustment is
needed here, because you currently can't do ALTER FUNCTION ... NO
LEAKPROOF, which seems unacceptable.  It's fairly clear why not,
though: you get a grammar conflict, because the parser allows this:

create or replace function z() returns int as $$select 1$$ language
sql set transaction not deferrable;

However, since that syntax doesn't actually work, I'm thinking we
could just refactor things a bit to reject that at the parser stage.
The attached patch adopts that approach.  Anyone have a better idea?
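
For reference, the command this is about would look something like the
following (sketch only; the NO LEAKPROOF spelling follows the wording
above, and z() is just the throwaway function from the example):

ALTER FUNCTION z() NO LEAKPROOF;

This cannot currently be expressed because adding it to the grammar
conflicts with the SET ... production shown above; the refactoring is
meant to remove that conflict.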

I also think we ought to stick create_function_3 into one of the
parallel groups in the regression tests, if possible.  Can you
investigate that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


not-leakproof.patch
Description: Binary data



Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-13 Thread Shigeru Hanada
(2012/02/13 20:50), Etsuro Fujita wrote:
 The patches have been applied, but role-related regression tests failed
 in my environment.  I fixed it in a similar fashion to
 /src/test/regress/sql/foreign_data.sql.  Please find attached an updated
 patch for the regression tests.

Good catch, thanks.  I'll revise the pgsql_fdw tests a little more.

 BTW, What do you think about this?
 
 http://archives.postgresql.org/pgsql-hackers/2012-01/msg00229.php

I'm sorry that I've left the thread unfinished...  I've given up on
proposing join push-down for foreign tables for 9.2, because it will
take a while to achieve a general semantics mapping for join push-down
and WHERE clause push-down.  For 9.2, I'm proposing pgsql_fdw, which has
WHERE clause push-down for built-in elements that are free from
collation.  I'd like to go back to that item after 9.2 development
enters beta or RC, hopefully :)
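
To illustrate what that distinction means in practice, a hypothetical
foreign table ft (name made up for the example) would behave roughly
like this:

-- A collation-free condition on a built-in type could be sent to the
-- remote server as-is:
SELECT * FROM ft WHERE id = 42;
-- whereas a collation-dependent comparison such as
--   WHERE name < 'foo'
-- would be evaluated locally, since its result can depend on the
-- collation in use on each side.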

-- 
Shigeru Hanada



Re: [HACKERS] Measuring relation free space

2012-02-13 Thread Jaime Casanova
On Wed, Jan 25, 2012 at 9:47 PM, Noah Misch n...@leadboat.com wrote:

 With all that done, run some quick benchmarks: see how SELECT free_percent
 FROM pgstattuple(rel) fares compared to SELECT relation_free_space(rel) for
 a large heap and for a large B-tree index.  If the timing difference is too
 small to be interesting to you, remove relation_free_space() and submit your
 pgstattuple() improvements alone.  Otherwise, submit as written.


OK. I split this into three patches.

1) pgstattuple-gin_spgist.patch
This first patch adds GIN and SP-GiST support to pgstattuple, and also
makes pgstattuple use a ring buffer when reading tables or indexes.

2) pgstattuple-relation_free_space.patch
This patch adds the relation_free_space function to pgstattuple.

The function relation_free_space() is faster than pgstattuple(). To
test that, I initialized pgbench with a scale factor of 40.
In that context pgstattuple() took 1.4s to process the pgbench_accounts
table while relation_free_space() took 730ms (half the time!).
On the index the difference is less noticeable: 170ms for the former
and 150ms for the latter.
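
For anyone who wants to repeat the comparison, the two calls being
timed look roughly like this (argument form assumed to mirror
pgstattuple(); pgbench tables from the scale-40 run above):

-- full scan with tuple-level accounting
SELECT free_percent FROM pgstattuple('pgbench_accounts');
-- free-space-only scan from the new function
SELECT relation_free_space('pgbench_accounts');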

3) pgstattuple-stats_target.patch
This patch adds a stats_target parameter to the relation_free_space()
function. It mimics the way ANALYZE chooses the blocks to read and is
faster than plain relation_free_space(), but of course it could be
inaccurate if the pages we don't read are the ones with the most free
space.
-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: 24x7 support and training
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index beff1b9..9f2ec1f 100644
*** a/contrib/pgstattuple/pgstatindex.c
--- b/contrib/pgstattuple/pgstatindex.c
*** pgstatindex(PG_FUNCTION_ARGS)
*** 95,100 
--- 95,101 
  	BlockNumber nblocks;
  	BlockNumber blkno;
  	BTIndexStat indexStat;
+  	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
  
  	if (!superuser())
  		ereport(ERROR,
*** pgstatindex(PG_FUNCTION_ARGS)
*** 122,128 
  	 * Read metapage
  	 */
  	{
! 		Buffer		buffer = ReadBuffer(rel, 0);
  		Page		page = BufferGetPage(buffer);
  		BTMetaPageData *metad = BTPageGetMeta(page);
  
--- 123,129 
  	 * Read metapage
  	 */
  	{
! 		Buffer		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, 0, RBM_NORMAL, bstrategy);
  		Page		page = BufferGetPage(buffer);
  		BTMetaPageData *metad = BTPageGetMeta(page);
  
*** pgstatindex(PG_FUNCTION_ARGS)
*** 159,165 
  		CHECK_FOR_INTERRUPTS();
  
  		/* Read and lock buffer */
! 		buffer = ReadBuffer(rel, blkno);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  
  		page = BufferGetPage(buffer);
--- 160,166 
  		CHECK_FOR_INTERRUPTS();
  
  		/* Read and lock buffer */
!  		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  
  		page = BufferGetPage(buffer);
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index e5ddd87..6bbc957 100644
*** a/contrib/pgstattuple/pgstattuple.c
--- b/contrib/pgstattuple/pgstattuple.c
***
*** 24,33 
--- 24,35 
  
  #include "postgres.h"
  
+ #include "access/gin_private.h"
  #include "access/gist_private.h"
  #include "access/hash.h"
  #include "access/nbtree.h"
  #include "access/relscan.h"
+ #include "access/spgist_private.h"
  #include "catalog/namespace.h"
  #include "funcapi.h"
  #include "miscadmin.h"
*** static void pgstat_hash_page(pgstattuple
*** 73,83 
--- 75,96 
   Relation rel, BlockNumber blkno);
  static void pgstat_gist_page(pgstattuple_type *stat,
   Relation rel, BlockNumber blkno);
+ static void pgstat_gin_page(pgstattuple_type *stat,
+  Relation rel, BlockNumber blkno);
+ static void pgstat_spgist_page(pgstattuple_type *stat,
+  Relation rel, BlockNumber blkno);
  static Datum pgstat_index(Relation rel, BlockNumber start,
  			 pgstat_page pagefn, FunctionCallInfo fcinfo);
  static void pgstat_index_page(pgstattuple_type *stat, Page page,
    OffsetNumber minoff, OffsetNumber maxoff);
  
+ /* 
+  * Buffer access strategy for reading relations, it's simpler to keep it
+  * global because pgstat_*_page() functions read one buffer at a time.
+  * pgstat_heap() and pgstat_index() should initialize it before use.
+  */
+ BufferAccessStrategy bstrategy;
+ 
  /*
   * build_pgstattuple_type -- build a pgstattuple_type tuple
   */
*** pgstat_relation(Relation rel, FunctionCa
*** 229,235 
  	return pgstat_index(rel, GIST_ROOT_BLKNO + 1,
  		pgstat_gist_page, fcinfo);
  case GIN_AM_OID:
! 	err = "gin index";
  	break;
  default:
  	err = "unknown index";
--- 242,253 
  	return pgstat_index(rel, GIST_ROOT_BLKNO + 1,
  		pgstat_gist_page, fcinfo);
  case GIN_AM_OID:
! 	return pgstat_index(rel, GIN_METAPAGE_BLKNO + 1,
! 		pgstat_gin_page, fcinfo);
! 	break;
! case SPGIST_AM_OID:
! 	return pgstat_index(rel, 

Re: [HACKERS] pg_stats_recovery view

2012-02-13 Thread Jaime Casanova
On Thu, Feb 2, 2012 at 2:32 AM, Magnus Hagander mag...@hagander.net wrote:

 I haven't looked through the code in detail, but one direct comment:
 do we really need/want to send this through the stats collector? It
 will only ever have one sender - perhaps we should just either store
 it in shared memory or store it locally and only send it on demand?


FYI, I intend to send a reworked patch later today; it will store the
info locally and send it on demand.
About the _short_desc functions: I added them because I wanted to
understand what was happening during recovery, and the wal_record_type
(xl_info) being a bare number is not that clear.

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: 24x7 support and training
