Re: [HACKERS] patch for implementing SPI_gettypemod()
On Wed, Feb 8, 2012 at 12:19 PM, Chetan Suttraway
<chetan.suttra...@enterprisedb.com> wrote:
>
> On Wed, Feb 8, 2012 at 8:15 AM, Robert Haas wrote:
>> On Tue, Feb 7, 2012 at 4:25 PM, Peter Eisentraut wrote:
>> > On ons, 2012-02-01 at 15:53 +0530, Chetan Suttraway wrote:
>> >> This is regarding the TODO item :
>> >> "Add SPI_gettypmod() to return a field's typemod from a TupleDesc"
>> >
>> > My first thought was, this should be spelled SPI_gettypmod(). Not sure
>> > what others think.
>>
>> +1.
>
> The reason for using the SPI_gettypemod() name was that I did see
> SPI_gettypeid().
>
> Anyways, will update the patch with the recommended name.
>
> Regards,
> Chetan

Please refer to the attached patch, which now uses the SPI_gettypmod() name.

Regards,
Chetan

--
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com
Follow us on Twitter: http://www.twitter.com/enterprisedb

diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index 81f284c..1f4632e 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -955,6 +955,24 @@ SPI_gettypeid(TupleDesc tupdesc, int fnumber)
 	return (SystemAttributeDefinition(fnumber, true))->atttypid;
 }
 
+int4
+SPI_gettypmod(TupleDesc tupdesc, int fnumber)
+{
+	SPI_result = 0;
+
+	if (fnumber > tupdesc->natts || fnumber == 0 ||
+		fnumber <= FirstLowInvalidHeapAttributeNumber)
+	{
+		SPI_result = SPI_ERROR_NOATTRIBUTE;
+		return -1;
+	}
+
+	if (fnumber > 0)
+		return tupdesc->attrs[fnumber - 1]->atttypmod;
+	else
+		return (SystemAttributeDefinition(fnumber, true))->atttypmod;
+}
+
 char *
 SPI_getrelname(Relation rel)
 {
diff --git a/src/include/executor/spi.h b/src/include/executor/spi.h
index cfbaa14..dedb1c7 100644
--- a/src/include/executor/spi.h
+++ b/src/include/executor/spi.h
@@ -113,6 +113,7 @@ extern char *SPI_getvalue(HeapTuple tuple, TupleDesc tupdesc, int fnumber);
 extern Datum SPI_getbinval(HeapTuple tuple, TupleDesc tupdesc, int fnumber,
 						   bool *isnull);
 extern char *SPI_gettype(TupleDesc tupdesc, int fnumber);
 extern Oid	SPI_gettypeid(TupleDesc tupdesc, int fnumber);
+extern int4 SPI_gettypmod(TupleDesc tupdesc, int fnumber);
 extern char *SPI_getrelname(Relation rel);
 extern char *SPI_getnspname(Relation rel);
 extern void *SPI_palloc(Size size);

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
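[For context, a minimal sketch of how the new function might be called from a C extension, alongside the existing SPI accessors. This assumes the patch above has been applied; it is not part of the patch, and the query and function name are illustrative only.]

```c
/*
 * Illustration only: walk the columns of an SPI result and report each
 * column's type OID and typmod.  Assumes SPI_connect() has already been
 * called and the SPI_gettypmod() patch above is applied.
 */
#include "postgres.h"
#include "executor/spi.h"

static void
describe_result_columns(void)
{
	int			col;

	if (SPI_execute("SELECT * FROM pg_class LIMIT 1", true, 0) != SPI_OK_SELECT)
		elog(ERROR, "SPI_execute failed");

	for (col = 1; col <= SPI_tuptable->tupdesc->natts; col++)
	{
		Oid			typid = SPI_gettypeid(SPI_tuptable->tupdesc, col);
		int4		typmod = SPI_gettypmod(SPI_tuptable->tupdesc, col);

		elog(NOTICE, "column %d: type %u, typmod %d", col, typid, typmod);
	}
}
```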
Re: [HACKERS] patch for implementing SPI_gettypemod()
On Wed, Feb 8, 2012 at 8:15 AM, Robert Haas wrote: > On Tue, Feb 7, 2012 at 4:25 PM, Peter Eisentraut wrote: > > On ons, 2012-02-01 at 15:53 +0530, Chetan Suttraway wrote: > >> This is regarding the TODO item : > >> "Add SPI_gettypmod() to return a field's typemod from a TupleDesc" > > > > My first thought was, this should be spelled SPI_gettypmod(). Not sure > > what others think. > > +1. > > The reason for using SPI_gettypemod() name was that I did see SPI_gettypeid(). Anyways, will update patch with recommended name. Regards, Chetan -- EnterpriseDB Corporation The Enterprise PostgreSQL Company Website: www.enterprisedb.com EnterpriseDB Blog : http://blogs.enterprisedb.com Follow us on Twitter : http://www.twitter.com/enterprisedb
Re: [HACKERS] Can PQstatus() be used by Application to check connection to postgres periodically?
Hello, sujayr06.

You wrote:

s> Hello All,
s>        My application has to do a real time data upload to PostgreSQL
s> server.
s>        Every time i have to do a real time upload, i do not wish to open
s> new connection.
s>        I want to open a connection once [when my application comes up]
s> and periodically check if the connection is active.
s>        Can PQstatus() be used by application to check the status of the
s> connection already established?
s>        If PQstatus() cannot be used, does PostgreSQL provide alternate
s> interface to check the status of the connection.
s> Note : I do not wish to open connection on every real time upload
s> as its an overhead for the application.
s>        Appreciate any help!

You may use PQtransactionStatus for this purpose
(http://www.postgresql.org/docs/9.1/static/libpq-status.html)

s> WBR,
s> Sujay
s>
s> --
s> View this message in context:
s> http://postgresql.1045698.n5.nabble.com/Can-PQstatus-be-used-by-Application-to-check-connection-to-postgres-periodically-tp5462315p5462315.html
s> Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
With best wishes,
 Pavel                          mailto:pa...@gf.microolap.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
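[A minimal libpq sketch of the kind of periodic health check being asked about, using PQstatus() and falling back to PQreset() when the connection has gone bad. The probe query and retry policy are illustrative assumptions, not a recommendation from this thread.]

```c
#include <stdio.h>
#include <libpq-fe.h>

/*
 * Illustration only: check whether an existing connection is still usable
 * before an upload, and try to re-establish it if not.  PQstatus() reports
 * the client-side state; it does not probe the server, so a cheap query
 * (or PQping() on newer releases) is needed to detect a dead connection.
 */
static int
connection_is_healthy(PGconn *conn)
{
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		PQreset(conn);			/* try to re-establish the same connection */
		if (PQstatus(conn) != CONNECTION_OK)
			return 0;
	}

	/* round trip to the server to make sure it is really alive */
	res = PQexec(conn, "SELECT 1");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		PQclear(res);
		return 0;
	}
	PQclear(res);
	return 1;
}
```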
Re: [HACKERS] patch for preventing the specification of conflicting transaction read/write options
On Tue, Feb 7, 2012 at 8:44 PM, Kevin Grittner wrote: > Chetan Suttraway wrote: > > > This is regarding the TODO item: > > "Prevent the specification of conflicting transaction read/write > > options" > > > > listed at: > > http://wiki.postgresql.org/wiki/Todo > > Thanks for chipping away at items on the list. > > > Please pass on any inputs on the patch. > > Right now we want people focusing on fixing bugs and reviewing > patches submitted by the start of the current CommitFest. Please > add this to the open CF so we don't lose track of it. > > -Kevin > Sure. Thanks! -Chetan -- EnterpriseDB Corporation The Enterprise PostgreSQL Company Website: www.enterprisedb.com EnterpriseDB Blog : http://blogs.enterprisedb.com Follow us on Twitter : http://www.twitter.com/enterprisedb
Re: [HACKERS] pgstat documentation tables
On Feb 8, 2012 5:32 AM, "Tom Lane" wrote: > > Bruce Momjian writes: > > How do people feel about pulling text out of the SGML docs and loading > > it into the database as table and column comments? > > I'm not thrilled by that proposal. The length limits on comments are > very much shorter than what is sensible to use in catalogs.sgml (at > least if you want the comments to look passable in \d displays). And > I don't want people trying to serve two different use-cases when they > write those. Yes, I'd rather see longer than shorter descriptions in the docs, and this would turn that backwards... > > Perhaps it'd work to pull column comments from the C header files > --- any same-line comment in catalog/pg_*.h is probably short enough to > serve this purpose usefully. That could work a lot better.. /Magnus
Re: [HACKERS] Speed dblink using alternate libpq tuple storage
(2012/02/07 23:44), Marko Kreen wrote:
> On Tue, Feb 07, 2012 at 07:25:14PM +0900, Shigeru Hanada wrote:
>> - What is the right (or recommended) way to prevent throwing an
>> exception in a row-processor callback function?  When should an author
>> use a PG_TRY block to catch exceptions in the callback function?
>
> When it calls backend functions that can throw exceptions?
> As all handlers running in backend will want to convert data
> to Datums, that means "always wrap handler code in PG_TRY"?

ISTM that telling a) what happens to PGresult and PGconn when a row
processor ends without returning, and b) how to recover them, would be
necessary.  We can't assume much about the caller because libpq is a
client library.  IMHO, it's most important to tell authors of row
processors clearly what should be done in the error case.  In such a
case, must we call PQfinish, or is calling PQgetResult until it returns
NULL enough to reuse the connection?  IIUC calling pqClearAsyncResult
seems sufficient, but it's not exported for clients...

Regards,
--
Shigeru Hanada

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
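[The proposed row-processor API itself is not shown in this thread, so the handler below is purely hypothetical; the only point of the sketch is the "always wrap handler code in PG_TRY" pattern Marko describes for backend code that converts incoming values to Datums. Memory-context cleanup and the surrounding API are omitted.]

```c
#include "postgres.h"

/*
 * Hypothetical handler body.  convert_row_to_datums() stands in for
 * whatever backend calls the handler makes that can elog(ERROR).
 */
static bool
my_row_handler_guts(void *arg)
{
	bool		ok = true;

	PG_TRY();
	{
		convert_row_to_datums(arg);		/* hypothetical helper */
	}
	PG_CATCH();
	{
		/*
		 * Either clean up libpq state, swallow the error and report
		 * failure to the caller ...
		 */
		ok = false;
		FlushErrorState();
		/* ... or release resources and PG_RE_THROW() instead. */
	}
	PG_END_TRY();

	return ok;
}
```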
Re: [HACKERS] pgstat documentation tables
Bruce Momjian writes: > How do people feel about pulling text out of the SGML docs and loading > it into the database as table and column comments? I'm not thrilled by that proposal. The length limits on comments are very much shorter than what is sensible to use in catalogs.sgml (at least if you want the comments to look passable in \d displays). And I don't want people trying to serve two different use-cases when they write those. Perhaps it'd work to pull column comments from the C header files --- any same-line comment in catalog/pg_*.h is probably short enough to serve this purpose usefully. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier
Marti Raudsepp writes:
> Case #1 uses the normal textcat(text, text) operator by automatically
> coercing 'x' as text.
> However, case #2 uses the anytextcat(anynonarray, text), which is
> marked as volatile thus acts as an optimization barrier.

Hmm ... since those operators were invented (in 8.3), we have adopted
a policy that I/O functions are presumed to be no worse than stable:
http://archives.postgresql.org/pgsql-committers/2010-07/msg00307.php
ISTM that would justify relabeling anytextcat/textanycat as stable,
which should fix this.

> One way would be doing preprocess_expression() before
> pull_up_subqueries() so function inlining happens earlier, but I can't
> imagine what unintended consequences that might have.

I'm pretty sure that would be a bad idea.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgstat documentation tables
On Mon, Jan 16, 2012 at 01:01:50PM -0300, Alvaro Herrera wrote: > > > Bruce had a patch to turn SGML descriptions of system view into comments > via some Perl program or something. He posted it many moons ago and I > haven't seen an updated version. Bruce, do you have something to say on > this topic? Yes, I was waiting to get feedback on that but it seems the 9.2 release slipped right by me on that. I guess I will try to pick it up for 9.3. How do people feel about pulling text out of the SGML docs and loading it into the database as table and column comments? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum rate limit in KBps
On Thu, Jan 19, 2012 at 09:42:52PM -0500, Robert Haas wrote: > I certainly didn't intend to come across as disparaging your work on > this topic. I understand that there are big problems with the way > things work now; I'm just cautious about trying to replace them too > hastily with something that may not turn out to be any better. Of > course, if we can replace it with something that we're sure is > actually an improvement, I'm all in favor of that. But, IMHO, the > problems in this area are too serious to be solved by renaming the > knobs. At most, we're going to buy ourselves a little time to come up > with a better solution. > > IMHO, and at the risk of repeating myself, one of the big problems in > this area is that we're making the user guess something that we really > ought to be figuring out for them. Just as users want checkpoints to > run as slowly as possible while still not bumping into the next > checkpoint, they'd presumably like vacuum to run as slowly as possible > without bumping into the next vacuum. Instead, we make them tell us > how fast they'd like it tor run, which requires them to guess a value > high enough to finish soon enough but low enough to minimize the > impact on the rest of the system. > > Another problem is that the vacuum algorithm itself could, I think, be > made much smarter. We could teach HOT to prune pages that contain no > HOT chains but do contain dead tuples. That would leave dead line > pointers behind, but that's not nearly as bad as leaving the entire > tuple behind. We could, as Simon and others have suggested, have one > threshold for vacuuming the heap (i.e. reclaiming dead tuples) and > another for vacuuming the indexes (i.e. reclaiming dead line > pointers). That would open the door to partial vacuuming: just vacuum > half a gigabyte or so of the heap, and then move on; the next vacuum > can pick up where that one left off, at least up to the point where we > decide we need to make an index pass; it would possibly also allow us > to permit more than one vacuum on the same table at the same time, > which is probably needed for very large tables. We could have > backends that see dead tuples on a page throw them over to the fence > to the background writer for immediate pruning. I blather, but I > guess my point is that I really hope we're going to do something > deeper here at some point in the near future, whatever becomes of the > proposals now on the table. As much as I hate to poo-poo a patch addition, I have to agree with Robert Haas on this one. Renaming settings really isn't moving us forward. It introduces a migration problem and really doesn't move us forward in solving the underlying problem. Additional monitoring, while helpful, also is only a stop-gap. Only a small number of sites are going to monitor auto-vacuum/analyze. Let's not start writing Postgres for those super-busy sites with a team of administrators --- while they are important, they are not the majority of our user-base, and we can pride ourselves that Postgres runs pretty well without a team of admins. We don't want to get into a case where our super-visible, high-volume folks are overly setting the project direction. If we look at checkpoint smoothing, that was solved the right way with a setting that worked automatically for everyone. 
Now, I don't know if the solution is to time read/write duration to see how busy the system is, or to look at the statistics to see how backlogged the autovacuum system is when it gets time to actually process a table, but those are the questions we should be asking here. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum rate limit in KBps
On Thu, Jan 19, 2012 at 05:39:41PM -0500, Greg Smith wrote: > On 1/19/12 1:10 PM, Robert Haas wrote: > >I have to say that I find that intensely counterintuitive. The > >current settings are not entirely easy to tune correctly, but at least > >they're easy to explain. > > I attempt to explain those settings to people in training classes > about once a month. It's never been anything but a complete > disaster. I tell students you will not have problems with these settings unless you change them. ;-) -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch wrote:
> > On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
> > This patch uses FPIs to guard against torn hint writes, even when the
> > checksums are disabled.  One could not simply condition them on the
> > page_checksums setting, though.  Suppose page_checksums=off and we're
> > hinting a page previously written with page_checksums=on.  If that write
> > tears, leaving the checksum intact, that block will now fail verification.
> > A couple of ideas come to mind.  (a) Read the on-disk page and emit an FPI
> > only if the old page had a checksum.  (b) Add a checksumEnableLSN field to
> > pg_control.  Whenever the cluster starts with checksums disabled, set the
> > field to InvalidXLogRecPtr.  Whenever the cluster starts with checksums
> > enabled and the field is InvalidXLogRecPtr, set the field to the next LSN.
> > When a checksum failure occurs in a page with LSN older than the stored
> > one, emit either a softer warning or no message at all.
>
> We can only change page_checksums at restart (now) so the above seems moot.
>
> If we crash with FPWs enabled we repair any torn pages.

There's no live bug, but that comes at a high cost: the patch has us emit
full-page images for hint bit writes regardless of the page_checksums
setting.

> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be
> > holding is insufficient.
>
> Am serialising this by only writing PageLSN while holding buf hdr lock.

That means also taking the buffer header spinlock in every PageGetLSN()
caller holding only a shared buffer content lock.  Do you think that will
pay off, versus the settled pattern of trading here your shared buffer
content lock for an exclusive one?

> > I can see value in an option to exclude local buffers, since corruption
> > there may be less exciting.  It doesn't seem important for an initial
> > patch, though.
>
> I'm continuing to exclude local buffers. Let me know if that should change.

Seems reasonable.

Thanks,
nm

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for parallel pg_dump
On Tue, Feb 7, 2012 at 4:59 PM, Robert Haas wrote: > It turns out that (as you anticipated) there are some problems with my > previous proposal. I assume you're talking to Tom, as my powers of anticipation are actually quite limited... :-) > This is not > quite enough to get rid of g_conn, but it's close: the major stumbling > block at this point is probably exit_nicely(). The gyrations we're > going through to make sure that AH->connection gets closed before > exiting are fairly annoying; maybe we should invent something in > dumputils.c along the line of the backend's on_shmem_exit(). Yeah, this becomes even more important with parallel jobs where you want all worker processes die once the parent exits. Otherwise some 10 already started processes would continue to dump your 10 largest tables for the next few hours with the master process long dead... All while you're about to start up the next master process... In my patch I dealt with exactly the same problem for the error handler by creating a separate function that has a static variable (a pointer to the ParallelState). The value is set and retrieved through the same function, so yes, it's kinda global but then again it can only be accessed from this function, which is only called from the error handler. > I'm starting to think it might make sense to press on with this > refactoring just a bit further and eliminate the distinction between > Archive and ArchiveHandle. How about doing more refactoring after applying the patch, you'd then see what is really needed and then we'd also have an actual use case for more than one connection (You might have already guessed that this proposal is heavily influenced by my self-interest of avoiding too much work to make my patch match your refactoring)... -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
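[A rough sketch of what an on_shmem_exit()-style cleanup registry for pg_dump could look like, as floated above. All names and details here are invented for illustration; they are not taken from any posted patch.]

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch of an exit-callback registry in the spirit of the backend's
 * on_shmem_exit(): callers register cleanup functions (e.g. closing the
 * archive's connection), and exit_nicely() runs them in reverse order
 * before exiting.
 */
#define MAX_ON_EXIT_NICELY 20

typedef void (*on_exit_nicely_callback) (int code, void *arg);

static struct
{
	on_exit_nicely_callback func;
	void	   *arg;
}			on_exit_nicely_list[MAX_ON_EXIT_NICELY];

static int	on_exit_nicely_index = 0;

void
on_exit_nicely(on_exit_nicely_callback func, void *arg)
{
	if (on_exit_nicely_index >= MAX_ON_EXIT_NICELY)
	{
		fprintf(stderr, "out of on_exit_nicely slots\n");
		exit(1);
	}
	on_exit_nicely_list[on_exit_nicely_index].func = func;
	on_exit_nicely_list[on_exit_nicely_index].arg = arg;
	on_exit_nicely_index++;
}

void
exit_nicely(int code)
{
	while (--on_exit_nicely_index >= 0)
		on_exit_nicely_list[on_exit_nicely_index].func(code,
							on_exit_nicely_list[on_exit_nicely_index].arg);
	exit(code);
}
```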
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote: > So we need some principled way of deciding how much inlining is > reasonable, because I am 100% certain this is not going to be the last > time someone discovers that a massive exercise in inlining can yield a > nifty performance benefit in some case or another: index builds and > COPY have already been mentioned on this thread, and I expect that > people will find other cases as well. I'm not really sure what the > "budget" is - i.e. how much binary bloat we can afford to add - or how > many cases there are that can benefit, but the first isn't infinite > and the second is more than the first. > > Having said all that, I am inclined to commit at least some portion of > this, but I wanted to knock off a few other patches that have been > lingering for a while first. One approach would be to only do a few types now, e.g. integers and strings, and perhaps leave the others for later. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql case preserving completion
On Tue, Feb 7, 2012 at 8:02 PM, Bruce Momjian wrote: > On Wed, Feb 01, 2012 at 08:19:24PM +0200, Peter Eisentraut wrote: >> On tis, 2012-01-17 at 16:46 +0900, Fujii Masao wrote: >> > When I tested the patch, "create ta" was converted unexpectedly to >> > "create TABLE" >> > though "alter ta" was successfully converted to "alter table". As far >> > as I read the patch, >> > you seems to have forgotten to change create_or_drop_command_generator() or >> > something. >> >> Thanks, fixed and committed. > > I have to admit I like the capitalized keywords, but don't normally type > them, but it must be just me because no one else said anything. Yeah, I liked the old behavior, too. But I figured it was a sign that I'm an old fuddy-duddy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for implementing SPI_gettypemod()
On Tue, Feb 7, 2012 at 4:25 PM, Peter Eisentraut wrote:
> On ons, 2012-02-01 at 15:53 +0530, Chetan Suttraway wrote:
>> This is regarding the TODO item :
>> "Add SPI_gettypmod() to return a field's typemod from a TupleDesc"
>
> My first thought was, this should be spelled SPI_gettypmod(). Not sure
> what others think.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add protransform for numeric, varbit, and temporal types
On Tue, Feb 7, 2012 at 8:45 PM, Noah Misch wrote: > On Tue, Feb 07, 2012 at 12:43:11PM -0500, Robert Haas wrote: >> I've committed the numeric and varbit patches and will look at the >> temporal one next. > > Thanks. The comment you added to numeric_transform() has a few typos, > "constained" -> "constrained" and "nodes" -> "notes". Yuck, that's pretty bad. Thanks, fixed (I hope). > When we rebuild the table at this point, it has a small maximum tuple size. > Therefore, we omit the toast table entirely. Ah, OK. The debug messages might be more clear if they mentioned whether or not we were omitting the TOAST table, I suppose. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On Tue, Feb 7, 2012 at 2:39 PM, Jay Levitt wrote:
> Jim "Decibel!" Nasby wrote:
>>
>> I agree that it's probably pretty unusual to index floats.
>
> FWIW: Cubes and points are floats, right? So would spatial indexes benefit
> from this optimization, or is it only raw floats?

Cubes are not floats, nor are points.

In general, what's being proposed here is to make a separate copy of
quicksort for each of N datatypes, with the comparator function inlined.
In order for this to benefit multiple types, they'd have to use the same
set of machine instructions to compare values on disk.  So in general you
can make this apply to as many datatypes as you want: it's just that each
one will require another copy of the quicksort code in the binary.  The
more complex the comparator is, the less you'll save, because the savings
presumably come largely from things like saving and restoring registers
and pushing/popping stack frames, and that's really only going to be
material if the underlying comparator is pretty darn cheap.

Actually, to be more precise, the patch is proposing to create TWO copies
of the quicksort code for each of N datatypes, one for the case where
there is but a single sort key and the other for the case where there are
multiple sort keys.  Having separate code for the single sort key case
saves more because it lets you get rid of a control loop - you don't have
to iterate down the list of keys, because there's only one.

I've been skeptical of this patch for a number of reasons, the major one
of which is that most workloads spend only a very small amount of time
doing quicksorts, and most quicksorts are of very small amounts of data
and therefore fast anyway.  It is easy to construct an artificial test
case that does lots and lots of in-memory sorting, but in real life I
think that's not the great part of what people use databases for.  The
time gets spent on things like groveling through big tables or doing
joins, and then maybe we sort the rows at the end, but that's not where
most of the time goes.

It may be rightly pointed out, of course, that a penny saved is a penny
earned: the fact that most people don't spend much time quicksorting
doesn't mean that we shouldn't try to make quicksorting as fast as
possible.  But there are a couple of additional considerations.

First, the code is, at present, uglier than I would like.  I mean to
spend some time trying to clean that up, but there are 102 patches in
this CommitFest (the number seems to keep slowly growing, despite the
deadline being three weeks gone) and this isn't the only one I care
about, so I haven't quite gotten there yet.  But hopefully soon.

Second, there's a concern about binary bloat: duplicating lots of code
with different comparators inlined generates, well, a lot of machine
code.  Of course, an 0.8% increase in the size of the resulting binary is
very unlikely to produce any measurable performance regression on its
own, but that's not really the issue.  The issue is rather that we could
easily go nuts applying this technique in lots of different places - or
even just in one place to lots of different types.  Let's say that we
find that even for types with very complex comparators, we can get a 5%
speedup on quick-sorting speed using this technique.  Should we, then,
apply the technique to every type we have?  Perhaps the binary would grow
by, I don't know, 10%.
Maybe that's still not enough to start hurting performance (or making packagers complain), but surely it's getting there, and what happens when we discover that similar speedups are possible using the same trick in five other scenarios? I think it's a bit like going out to dinner and spending $50. It's nicer than eating at home, and many people can afford to do it occasionally, but if you do it every night, it gets expensive (and possibly fattening). So we need some principled way of deciding how much inlining is reasonable, because I am 100% certain this is not going to be the last time someone discovers that a massive exercise in inlining can yield a nifty performance benefit in some case or another: index builds and COPY have already been mentioned on this thread, and I expect that people will find other cases as well. I'm not really sure what the "budget" is - i.e. how much binary bloat we can afford to add - or how many cases there are that can benefit, but the first isn't infinite and the second is more than the first. Having said all that, I am inclined to commit at least some portion of this, but I wanted to knock off a few other patches that have been lingering for a while first. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
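[The technique being debated, stamping out one copy of the sort routine per comparator so the comparison can be inlined, can be illustrated outside PostgreSQL with a small macro-generated sort. This is only a toy sketch of the idea; the actual patch specializes quicksort inside tuplesort.c, and insertion sort is used here purely to keep the sketch short.]

```c
#include <stddef.h>

/*
 * SPECIALIZED_SORT expands to a complete sort routine with the comparison
 * expression baked in, so the compiler can inline it and skip the indirect
 * call a qsort()-style interface would need.
 */
#define SPECIALIZED_SORT(NAME, TYPE, LT)					\
static void													\
NAME(TYPE *a, size_t n)										\
{															\
	for (size_t i = 1; i < n; i++)							\
	{														\
		TYPE	key = a[i];									\
		size_t	j = i;										\
		while (j > 0 && LT(key, a[j - 1]))					\
		{													\
			a[j] = a[j - 1];								\
			j--;											\
		}													\
		a[j] = key;											\
	}														\
}

#define INT_LT(x, y)	((x) < (y))
#define DBL_LT(x, y)	((x) < (y))

/*
 * Each expansion adds another copy of the routine to the binary,
 * which is the bloat trade-off discussed above.
 */
SPECIALIZED_SORT(sort_int32, int, INT_LT)
SPECIALIZED_SORT(sort_double, double, DBL_LT)
```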
Re: [HACKERS] Add protransform for numeric, varbit, and temporal types
On Tue, Feb 07, 2012 at 12:43:11PM -0500, Robert Haas wrote:
> I've committed the numeric and varbit patches and will look at the
> temporal one next.

Thanks.  The comment you added to numeric_transform() has a few typos,
"constained" -> "constrained" and "nodes" -> "notes".

> I did notice one odd thing, though: sometimes we
> seem to get a rebuild on the toast index for no obvious reason:
>
> rhaas=# set client_min_messages=debug1;
> SET
> rhaas=# create table foo (a serial primary key, b varbit);
> NOTICE: CREATE TABLE will create implicit sequence "foo_a_seq" for
> serial column "foo.a"
> DEBUG: building index "pg_toast_16430_index" on table "pg_toast_16430"
> NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
> "foo_pkey" for table "foo"
> DEBUG: building index "foo_pkey" on table "foo"
> CREATE TABLE
> rhaas=# alter table foo alter column b set data type varbit(4);
> DEBUG: rewriting table "foo"
> DEBUG: building index "foo_pkey" on table "foo"
> ALTER TABLE

When we rebuild the table at this point, it has a small maximum tuple
size.  Therefore, we omit the toast table entirely.

> rhaas=# alter table foo alter column b set data type varbit;
> DEBUG: building index "pg_toast_16430_index" on table "pg_toast_16430"
> ALTER TABLE

This command makes the tuple size unbounded again, so we create a toast
table.

> Strangely, it doesn't happen if I add another column to the table:
>
> rhaas=# set client_min_messages=debug1;
> SET
> rhaas=# create table foo (a serial primary key, b varbit, c varbit);

With the extra unconstrained varbit column, the tuple size is
continuously unbounded and the table has a toast table at all stages.
So, the difference in behavior is correct, albeit unintuitive.

> NOTICE: CREATE TABLE will create implicit sequence "foo_a_seq" for
> serial column "foo.a"
> DEBUG: building index "pg_toast_16481_index" on table "pg_toast_16481"
> NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
> "foo_pkey" for table "foo"
> DEBUG: building index "foo_pkey" on table "foo"
> CREATE TABLE
> rhaas=# alter table foo alter column b set data type varbit(4);
> DEBUG: building index "pg_toast_16490_index" on table "pg_toast_16490"
> DEBUG: rewriting table "foo"
> DEBUG: building index "foo_pkey" on table "foo"
> ALTER TABLE
> rhaas=# alter table foo alter column b set data type varbit;
> ALTER TABLE

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql case preserving completion
On Wed, Feb 01, 2012 at 08:19:24PM +0200, Peter Eisentraut wrote: > On tis, 2012-01-17 at 16:46 +0900, Fujii Masao wrote: > > When I tested the patch, "create ta" was converted unexpectedly to > > "create TABLE" > > though "alter ta" was successfully converted to "alter table". As far > > as I read the patch, > > you seems to have forgotten to change create_or_drop_command_generator() or > > something. > > Thanks, fixed and committed. I have to admit I like the capitalized keywords, but don't normally type them, but it must be just me because no one else said anything. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random_page_cost vs seq_page_cost
On Tue, Feb 07, 2012 at 05:06:18PM -0500, Greg Smith wrote:
> On 02/07/2012 03:23 PM, Bruce Momjian wrote:
> > Where did you see that there will be an improvement in the 9.2
> > documentation?  I don't see an improvement.
>
> I commented that I'm hoping for an improvement in the documentation
> of how much timing overhead impacts attempts to measure this area
> better.  That's from the "add timing of buffer I/O requests" feature
> submission.  I'm not sure if Bene read too much into that or not; I
> didn't mean to imply that the docs around random_page_cost have
> gotten better.
>
> This particular complaint is extremely common though, seems to pop
> up on one of the lists a few times each year.  Your suggested doc
> fix is fine as a quick one, but I think it might be worth expanding
> further on this topic.  Something discussing SSDs seems due here
> too.  Here's a first draft of a longer discussion, to be inserted
> just after where it states the default value is 4.0:

I was initially concerned that tuning advice in this part of the docs
would look out of place, but now see the 25% shared_buffers
recommendation, and it looks fine, so we are OK.  (Should we caution
against more than 8GB of shared buffers?  I don't see that in the docs.)

I agree we are overdue for a better explanation of random_page_cost, so
I agree with your direction.  I did a little word-smithing to tighten up
your text; feel free to discard what you don't like:

	Random access to mechanical disk storage is normally much more
	expensive than four times sequential access.  However, a lower
	default is used (4.0) because the majority of random accesses to
	disk, such as indexed reads, are assumed to be in cache.  The
	default value can be thought of as modeling random access as 40
	times slower than sequential, while expecting 90% of random reads
	to be cached.

	If you believe a 90% cache rate is an incorrect assumption for
	your workload, you can increase random_page_cost to better
	reflect the true cost of random storage reads.  Correspondingly,
	if your data is likely to be completely in cache, such as when
	the database is smaller than the total server memory, decreasing
	random_page_cost can be appropriate.  Storage that has a low
	random read cost relative to sequential, e.g. solid-state drives,
	might also be better modeled with a lower value for
	random_page_cost.

--
  Bruce Momjian                            http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgindent README correction
On Mon, Jan 09, 2012 at 01:32:49PM -0500, Robert Haas wrote: > > The one other issue I ran into in following the latest pgindent > > instructions was that I had to add #include to the > > parse.c file (as included in the pg_bsd_indent-1.1.tar.gz file at > > ftp://ftp.postgresql.org/pub/dev ). Without it I got this: > > > > parse.c: In function *parse*: > > parse.c:236:6: warning: implicit declaration of function *exit* > > parse.c:236:6: warning: incompatible implicit declaration of built-in > > function *exit* > > > > Can someone fix that and put up a 1.2 version? > > Sounds like a job for Mr. Momjian. What server do I log into to update the ftp pgindent tgz file? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random_page_cost vs seq_page_cost
On 02/07/2012 03:23 PM, Bruce Momjian wrote:
> Where did you see that there will be an improvement in the 9.2
> documentation?  I don't see an improvement.

I commented that I'm hoping for an improvement in the documentation of
how much timing overhead impacts attempts to measure this area better.
That's from the "add timing of buffer I/O requests" feature submission.
I'm not sure if Bene read too much into that or not; I didn't mean to
imply that the docs around random_page_cost have gotten better.

This particular complaint is extremely common though, seems to pop up on
one of the lists a few times each year.  Your suggested doc fix is fine
as a quick one, but I think it might be worth expanding further on this
topic.  Something discussing SSDs seems due here too.  Here's a first
draft of a longer discussion, to be inserted just after where it states
the default value is 4.0:

True random access to mechanical disk storage will normally be more
expensive than this default suggests.  The value used is lower to
reflect caching effects.  Some common random accesses to disk, such as
indexed reads, are considered likely to be in cache.  The default value
can be thought of as modeling random access as 40 times as expensive as
sequential, while expecting that 90% of random reads will actually be
cached.

If you believe a high cache rate is an incorrect assumption for your
workload, you might increase random_page_cost to closer reflect the true
cost of random reads against your storage.  Correspondingly, if your
data is likely to be completely cached, such as when the database is
smaller than the total memory in the server, decreasing random_page_cost
can be appropriate.  Storage where the true cost of random reads is low,
such as solid-state drives and similar memory-based devices, might also
find lower values of random_page_cost better reflect the real-world cost
of that operation.

===

I think of the value as being more like 80 times as expensive and a 95%
hit rate, but the above seems more likely to turn into understandable
math to a first-time reader of this section.  I stopped just short of
recommending a value for the completely cached case.  I normally use
1.01 there; I know others prefer going fully to 1.0 instead.  That
argument seems like it could rage on for some time.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
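[For reference, the arithmetic behind either pair of numbers above is the same once cached random reads are counted as essentially free:]

\[
\texttt{random\_page\_cost} \approx (1 - p_{\text{cached}}) \times \frac{c_{\text{random}}}{c_{\text{sequential}}}
 = (1 - 0.90) \times 40 = (1 - 0.95) \times 80 = 4.0
\]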
Re: [HACKERS] [GENERAL] pg_dump -s dumps data?!
On 02/07/2012 02:56 PM, Tom Lane wrote: Andrew Dunstan writes: On 01/31/2012 11:10 PM, Andrew Dunstan wrote: Here's a possible patch for the exclude-table-data problem along the lines you suggest. Should I apply this? I'm not happy with this yet. My core complaint is that pg_dump used to consider that creation of a TableDataInfo object for a table happens if and only if we're going to dump the table's data. And the comments (eg in pg_dump.h) still say that. But the previous patch left us in a halfway zone where sometimes we'd create a TableDataInfo object and then choose not to dump the data, and this patch doesn't get us out of that. I think we should either revert to the previous definition, or go over to a design wherein we always create TableDataInfo objects for all tables (but probably still excluding data-less relations such as views) and the whether-to-dump decision is expressed only by setting or not setting the object's dump flag. I worked a little bit on a patch to do the latter but found that it was more invasive than I'd hoped. Given the lack of any immediate payoff I think it'd probably make more sense to do the former. We could still centralize the decision making into makeTableDataInfo a bit more than now, but it should take the form of not creating the object at all, rather than creating it and then clearing its dump flag. OK, in this version we simply suppress creation of the TableDataInfo object if it's not wanted. I actually removed the code from makeTableDataInfo - there are only two places it gets called and doing it in those two spots seemed a bit cleaner. cheers andrew *** a/src/bin/pg_dump/pg_dump.c --- b/src/bin/pg_dump/pg_dump.c *** *** 1141,1154 selectDumpableTable(TableInfo *tbinfo) tbinfo->dobj.catId.oid)) tbinfo->dobj.dump = false; - /* If table is to be dumped, check that the data is not excluded */ - if (tbinfo->dobj.dump && ! - simple_oid_list_member(&tabledata_exclude_oids, - tbinfo->dobj.catId.oid)) - tbinfo->dobj.dumpdata = true; - else - tbinfo->dobj.dumpdata = false; - } /* --- 1141,1146 *** *** 1599,1608 dumpTableData(Archive *fout, TableDataInfo *tdinfo) DataDumperPtr dumpFn; char *copyStmt; - /* don't do anything if the data isn't wanted */ - if (!tbinfo->dobj.dumpdata) - return; - if (!dump_inserts) { /* Dump/restore using COPY */ --- 1591,1596 *** *** 1658,1663 getTableData(TableInfo *tblinfo, int numTables, bool oids) --- 1646,1656 && no_unlogged_table_data) continue; + /* Check that the data is not explicitly excluded */ + if (simple_oid_list_member(&tabledata_exclude_oids, + tblinfo[i].dobj.catId.oid)) + continue; + if (tblinfo[i].dobj.dump && tblinfo[i].dataObj == NULL) makeTableDataInfo(&(tblinfo[i]), oids); } *** *** 14127,14133 getExtensionMembership(Archive *fout, ExtensionInfo extinfo[], TableInfo *configtbl; configtbl = findTableByOid(atooid(extconfigarray[j])); ! if (configtbl && configtbl->dataObj == NULL) { /* * Note: config tables are dumped without OIDs regardless --- 14120,14132 TableInfo *configtbl; configtbl = findTableByOid(atooid(extconfigarray[j])); ! ! if (configtbl == NULL || ! simple_oid_list_member(&tabledata_exclude_oids, ! configtbl->dobj.catId.oid)) ! continue; ! ! 
if (configtbl->dataObj == NULL) { /* * Note: config tables are dumped without OIDs regardless *** a/src/bin/pg_dump/pg_dump.h --- b/src/bin/pg_dump/pg_dump.h *** *** 129,135 typedef struct _dumpableObject char *name; /* object name (should never be NULL) */ struct _namespaceInfo *namespace; /* containing namespace, or NULL */ bool dump; /* true if we want to dump this object */ - booldumpdata; /* true if we want data for this object */ bool ext_member; /* true if object is member of extension */ DumpId *dependencies; /* dumpIds of objects this one depends on */ int nDeps; /* number of valid dependencies */ --- 129,134 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for implementing SPI_gettypemod()
On ons, 2012-02-01 at 15:53 +0530, Chetan Suttraway wrote:
> This is regarding the TODO item :
> "Add SPI_gettypmod() to return a field's typemod from a TupleDesc"

My first thought was, this should be spelled SPI_gettypmod().  Not sure
what others think.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Checkpoint sync pause
On 02/03/2012 11:41 PM, Jeff Janes wrote: -The steady stream of backend writes that happen between checkpoints have filled up most of the OS write cache. A look at /proc/meminfo shows around 2.5GB "Dirty:" "backend writes" includes bgwriter writes, right? Right. Has using a newer kernal with dirty_background_bytes been tried, so it could be set to a lower level? If so, how did it do? Or does it just refuse to obey below the 5% level, as well? Trying to dip below 5% using dirty_background_bytes slows VACUUM down faster than it improves checkpoint latency. Since the sort of servers that have checkpoint issues are quite often ones that have VACUUM ones, too, that whole path doesn't seem very productive. The one test I haven't tried yet is whether increasing the size of the VACUUM ring buffer might improve how well the server responds to a lower write cache. If there is 3GB of dirty data spread over>300 segments each segment is about full-sized (1GB) then on average<1% of each segment is dirty? If that analysis holds, then it seem like there is simply an awful lot of data has to be written randomly, no matter how clever the re-ordering is. In other words, it is not that a harried or panicked kernel or RAID control is failing to do good re-ordering when it has opportunities to, it is just that you dirty your data too randomly for substantial reordering to be possible even under ideal conditions. Averages are deceptive here. This data follows the usual distribution for real-world data, which is that there is a hot chunk of data that receives far more writes than average (particularly index blocks), along with a long tail of segments that are only seeing one or two 8K blocks modified (catalog data, stats, application metadata). Plenty of useful reordering happens here. It's happening in Linux's cache and in the controller's cache. The constant of stream of checkpoint syncs doesn't stop that. It does seems to do two bad things though: a) makes some of these bad cache filled situations more likely, and b) doesn't leave any I/O capacity unused for clients to get some work done. One of the real possibilities I've been considering more lately is that the value we've seen of the pauses during sync aren't so much about optimizing I/O, that instead it's from allowing a brief window of client backend I/O to slip in there between the cache filling checkpoint sync. Does the BBWC, once given an fsync command and reporting success, write out those block forthwith, or does it lolly-gag around like the kernel (under non-fsync) does? If it is waiting around for write-combing opportunities that will never actually materialize in sufficient quantities to make up for the wait, how to get it to stop? Was the sorted checkpoint with an fsync after every file (real file, not VFD) one of the changes you tried? As far as I know the typical BBWC is always working. When a read or a write comes in, it starts moving immediately. When it gets behind, it starts making seek decisions more intelligently based on visibility of the whole queue. But they don't delay doing any work at all the way Linux does. I haven't had very good luck with sorting checkpoints at the PostgreSQL relation level on server-size systems. There is a lot of sorting already happening at both the OS (~3GB) and BBWC (>=512MB) levels on this server. My own tests on my smaller test server--with a scaled down OS (~750MB) and BBWC (256MB) cache--haven't ever validated sorting as a useful technique on top of that. 
It's never bubbled up to being considered a likely win on the production one as a result. DEBUG: Sync #1 time=21.969000 gap=0.00 msec DEBUG: Sync #2 time=40.378000 gap=0.00 msec DEBUG: Sync #3 time=12574.224000 gap=3007.614000 msec DEBUG: Sync #4 time=91.385000 gap=2433.719000 msec DEBUG: Sync #5 time=2119.122000 gap=2836.741000 msec DEBUG: Sync #6 time=67.134000 gap=2840.791000 msec DEBUG: Sync #7 time=62.005000 gap=3004.823000 msec DEBUG: Sync #8 time=0.004000 gap=2818.031000 msec DEBUG: Sync #9 time=0.006000 gap=3012.026000 msec DEBUG: Sync #10 time=302.75 gap=3003.958000 msec Syncs 3 and 5 kind of surprise me. It seems like the times should be more bimodal. If the cache is already full, why doesn't the system promptly collapse, like it does later? And if it is not, why would it take 12 seconds, or even 2 seconds? Or is this just evidence that the gaps you are inserting are partially, but not completely, effective? Given a mix of completely random I/O, a 24 disk array like this system has is lucky to hit 20MB/s clearing it out. It doesn't take too much of that before even 12 seconds makes sense. The 45 second pauses are the ones demonstrating the controller's cached is completely overwhelmed. It's rare to see caching turn truly bimodal, because the model for it has both a variable ingress and egress rate involved. Even as the checkpoint sync is pushing st
Re: [HACKERS] 16-bit page checksums for 9.2
On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch wrote:
> On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
>> v7

Thanks very much for the review. Just realised I hadn't actually replied...

> This patch uses FPIs to guard against torn hint writes, even when the
> checksums are disabled.  One could not simply condition them on the
> page_checksums setting, though.  Suppose page_checksums=off and we're
> hinting a page previously written with page_checksums=on.  If that write
> tears, leaving the checksum intact, that block will now fail verification.
> A couple of ideas come to mind.  (a) Read the on-disk page and emit an FPI
> only if the old page had a checksum.  (b) Add a checksumEnableLSN field to
> pg_control.  Whenever the cluster starts with checksums disabled, set the
> field to InvalidXLogRecPtr.  Whenever the cluster starts with checksums
> enabled and the field is InvalidXLogRecPtr, set the field to the next LSN.
> When a checksum failure occurs in a page with LSN older than the stored
> one, emit either a softer warning or no message at all.

We can only change page_checksums at restart (now) so the above seems moot.

If we crash with FPWs enabled we repair any torn pages.

> Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
> the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
> bumping the page LSN:
> 	heap_xlog_visible()
> 	visibilitymap_clear()
> 	visibilitymap_truncate()
> 	lazy_scan_heap()
> 	XLogRecordPageWithFreeSpace()
> 	FreeSpaceMapTruncateRel()
> 	fsm_set_and_search()
> 	fsm_vacuum_page()
> 	fsm_search_avail()
> Though I have not investigated each of these in detail, I suspect most or all
> represent continued torn-page hazards.  Since the FSM is just a hint, we do
> not WAL-log it at all; it would be nicest to always skip checksums for it,
> too.  The visibility map shortcuts are more nuanced.

Still checking all, but not as bad as the list looks.

I'm removing chksum code from smgr and putting back into bufmgr and other
locations. Patch footprint now looks like this.

 doc/src/sgml/config.sgml                      |   40 +++
 src/backend/access/hash/hashpage.c            |    1
 src/backend/access/heap/rewriteheap.c         |    4
 src/backend/access/heap/visibilitymap.c       |    1
 src/backend/access/nbtree/nbtree.c            |    1
 src/backend/access/nbtree/nbtsort.c           |    3
 src/backend/access/spgist/spginsert.c         |    2
 src/backend/access/transam/twophase.c         |   16 -
 src/backend/access/transam/xact.c             |    8
 src/backend/access/transam/xlog.c             |  114
 src/backend/commands/tablecmds.c              |    2
 src/backend/storage/buffer/bufmgr.c           |   75 +
 src/backend/storage/buffer/localbuf.c         |    2
 src/backend/storage/ipc/procarray.c           |   63 +++-
 src/backend/storage/lmgr/proc.c               |    4
 src/backend/storage/page/bufpage.c            |  331 +-
 src/backend/utils/misc/guc.c                  |   14 +
 src/backend/utils/misc/postgresql.conf.sample |   15 -
 src/include/access/xlog.h                     |    1
 src/include/catalog/pg_control.h              |    3
 src/include/catalog/storage.h                 |    1
 src/include/storage/bufpage.h                 |  107 ++--
 src/include/storage/proc.h                    |    3
 src/include/storage/procarray.h               |    4
 24 files changed, 743 insertions(+), 72 deletions(-)

Patch is *not* attached here, yet. Still working on it.

> When page_checksums=off and we read a page last written by page_checksums=on,
> we still verify its checksum.  If a mostly-insert/read site uses checksums for
> some time and eventually wishes to shed the overhead, disabling the feature
> will not stop the costs for reads of old data.  They would need a dump/reload
> to get the performance of a never-checksummed database.  Let's either
> unconditionally skip checksum validation under page_checksums=off or add a new
> state like page_checksums=ignore for that even-weaker condition.

Agreed, changed.

>> --- a/doc/src/sgml/config.sgml
>> +++ b/doc/src/sgml/config.sgml
>
>> +
>> +         Turning this parameter off speeds normal operation, but
>> +         might allow data corruption to go unnoticed. The checksum uses
>> +         16-bit checksums, using the fast Fletcher 16 algorithm. With this
>> +         parameter enabled there is still a non-zero probability that an error
>> +         could go undetected, as well as a non-zero probability of false
>> +         positives.
>> +
>
> What sources of false positives do you intend to retain?  I see none.

Will reword.

>> --- a/src/backend/catalog/storage.c
>> +++ b/src/backend/catalog/storage.c
>> @@ -20,6 +20,7 @@
>>  #include "postgres.h"
>>
>>  #include "access/visibilitymap.h"
>> +#include "access/transam.h"
>>  #include "access/xact.h"
>>  #i
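[The documentation text quoted above names the Fletcher-16 algorithm; as background, a generic Fletcher-16 over a byte buffer looks roughly like this. This is not the patch's implementation, which presumably differs in detail, for example by excluding the checksum field itself from the calculation.]

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Generic Fletcher-16 over a byte buffer, shown for background only.
 * Two running sums are maintained; the second sum makes the result
 * sensitive to byte order, unlike a plain additive checksum.
 */
static uint16_t
fletcher16(const uint8_t *data, size_t len)
{
	uint32_t	sum1 = 0;
	uint32_t	sum2 = 0;
	size_t		i;

	for (i = 0; i < len; i++)
	{
		sum1 = (sum1 + data[i]) % 255;
		sum2 = (sum2 + sum1) % 255;
	}
	return (uint16_t) ((sum2 << 8) | sum1);
}
```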
Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier
On 02/07/2012 03:36 PM, Marti Raudsepp wrote: On Tue, Feb 7, 2012 at 22:31, Andrew Dunstan wrote: It gets worse if you replace the expression with a call to a (non-sql) function returning text, which was in fact the original use case. Then you're pretty much hosed. Oh, if it's a non-SQL function then marking it as STABLE/IMMUTABLE should solve it -- did you try that? Hmm, your're right. I could have sworn I tried that. That still leaves the oddity you reported, though. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier
On Tue, Feb 7, 2012 at 22:31, Andrew Dunstan wrote: > It gets worse if you replace the expression with a call to a (non-sql) > function returning text, which was in fact the original use case. Then > you're pretty much hosed. Oh, if it's a non-SQL function then marking it as STABLE/IMMUTABLE should solve it -- did you try that? Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Text-any concatenation volatility acting as optimization barrier
On 02/07/2012 03:18 PM, Marti Raudsepp wrote: Hi list, Andrew Dunstan reported an awkward-seeming case on IRC where shifting around a concatenation expression in a view made the planner choose a good or a bad execution plan. Simplified, it boils down to this: db=# create table foo(i int); db=# explain verbose select i from (select i, i::text || 'x' as asd from foo) as subq; Seq Scan on public.foo (cost=0.00..34.00 rows=2400 width=4) Output: foo.i db=# explain verbose select i from (select i, i || 'x'::text as asd from foo) as subq; Subquery Scan on subq (cost=0.00..76.00 rows=2400 width=4) Output: subq.i -> Seq Scan on public.foo (cost=0.00..52.00 rows=2400 width=4) Output: foo.i, ((foo.i)::text || 'x'::text) Case #1 uses the normal textcat(text, text) operator by automatically coercing 'x' as text. However, case #2 uses the anytextcat(anynonarray, text), which is marked as volatile thus acts as an optimization barrier. Later, the anytextcat SQL function is inlined and the EXPLAIN VERBOSE output has no trace of what happened. Is this something we can, or want, to fix? One way would be doing preprocess_expression() before pull_up_subqueries() so function inlining happens earlier, but I can't imagine what unintended consequences that might have. Another option would be creating explicit immutable text || foo operators for common types, but that sounds pretty hacky. It gets worse if you replace the expression with a call to a (non-sql) function returning text, which was in fact the original use case. Then you're pretty much hosed. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random_page_cost vs seq_page_cost
On Wed, Jan 11, 2012 at 08:26:52AM +, Benedikt Grundmann wrote: > (replying just to you) > On 10/01/12 15:22, Greg Smith wrote: > > On 1/5/12 5:04 AM, Benedikt Grundmann wrote: > > That sort of thing is one reason why all attempts so far to set > > random_page_cost based on physical characteristics haven't gone > > anywhere useful. The setting is sort of overloaded right now, it's a > > fuzzy mix of true random seek cost blended with some notion of cache > > percentage. Trying to bring some measurements to bear on it is a less > > effective approach than what people actually do here. Monitor the > > profile of query execution, change the value, see what happens. Use > > that as feedback for what direction to keep going; repeat until > > you're just spinning with no improvements. > > > Thank you very much for the reply it is very interesting. I'm > excited to hear that documentation in that area will improve in > 9.2. It's interesting postgres has remarkable good documentation > but it is a sufficiently complex system that to actually sensible > tune the knobs provided you have to understand quite a lot about > what is going on. A colleague of mine likes to say > "all abstractions leak", which seems very appropriate in this case. Where did you see that there will be an improvement in the 9.2 documentation? I don't see an improvement. I looked over the random_page_cost documentation and remembered I was always concerned about how vague it was about caching effects, so I wrote the attached doc patch to explicity state it. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml new file mode 100644 index 3a84321..19e3e36 *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** SET ENABLE_SEQSCAN TO OFF; *** 2590,2595 --- 2590,2597 Sets the planner's estimate of the cost of a non-sequentially-fetched disk page. The default is 4.0. + (The default is lower than the typical difference between random + and sequential storage access speed because of caching effects.) This value can be overridden for tables and indexes in a particular tablespace by setting the tablespace parameter of the same name (see ). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Text-any concatenation volatility acting as optimization barrier
Hi list, Andrew Dunstan reported an awkward-seeming case on IRC where shifting around a concatenation expression in a view made the planner choose a good or a bad execution plan. Simplified, it boils down to this: db=# create table foo(i int); db=# explain verbose select i from (select i, i::text || 'x' as asd from foo) as subq; Seq Scan on public.foo (cost=0.00..34.00 rows=2400 width=4) Output: foo.i db=# explain verbose select i from (select i, i || 'x'::text as asd from foo) as subq; Subquery Scan on subq (cost=0.00..76.00 rows=2400 width=4) Output: subq.i -> Seq Scan on public.foo (cost=0.00..52.00 rows=2400 width=4) Output: foo.i, ((foo.i)::text || 'x'::text) Case #1 uses the normal textcat(text, text) operator by automatically coercing 'x' as text. However, case #2 uses the anytextcat(anynonarray, text), which is marked as volatile thus acts as an optimization barrier. Later, the anytextcat SQL function is inlined and the EXPLAIN VERBOSE output has no trace of what happened. Is this something we can, or want, to fix? One way would be doing preprocess_expression() before pull_up_subqueries() so function inlining happens earlier, but I can't imagine what unintended consequences that might have. Another option would be creating explicit immutable text || foo operators for common types, but that sounds pretty hacky. Regards, Marti -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] When do we lose column names?
Andrew Dunstan writes: > On 02/07/2012 12:47 PM, Tom Lane wrote: >> In general I think we'd have to require that colnames be supplied in all >> RowExprs if we go this way. Anyplace that's trying to slide by without >> will have to be fixed. I don't recall how many places that is. > I just had a thought that maybe we could make this simpler by dummying > up a list of colnames if we don't have one, instead of that assertion. > Or am I on the wrong track. Well, if there are more than one or two RowExpr creators for which a dummy set of colnames is the best we can do anyway, that might be a reasonable answer. But I think it would encourage people to be lazy and let the dummy colnames be used even when they can do better, so I'd rather not take this as a first-choice solution. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
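
For illustration only (this is not from any patch in this thread): "dummying up a list of colnames" would amount to something like the sketch below, generating the usual f1..fN field names so the colnames list always matches the argument list. It assumes the standard backend list and String helpers; where such a fallback would actually live is exactly the open question.

    /* Hedged sketch: invent dummy column names for a RowExpr that has none. */
    if (rowexpr->colnames == NIL)
    {
        List       *colnames = NIL;
        int         natts = list_length(rowexpr->args);
        int         i;

        for (i = 1; i <= natts; i++)
        {
            char        buf[32];

            snprintf(buf, sizeof(buf), "f%d", i);
            colnames = lappend(colnames, makeString(pstrdup(buf)));
        }
        rowexpr->colnames = colnames;
    }
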
Re: [HACKERS] [GENERAL] pg_dump -s dumps data?!
Andrew Dunstan writes: > On 01/31/2012 11:10 PM, Andrew Dunstan wrote: >> Here's a possible patch for the exclude-table-data problem along the >> lines you suggest. > Should I apply this? I'm not happy with this yet. My core complaint is that pg_dump used to consider that creation of a TableDataInfo object for a table happens if and only if we're going to dump the table's data. And the comments (eg in pg_dump.h) still say that. But the previous patch left us in a halfway zone where sometimes we'd create a TableDataInfo object and then choose not to dump the data, and this patch doesn't get us out of that. I think we should either revert to the previous definition, or go over to a design wherein we always create TableDataInfo objects for all tables (but probably still excluding data-less relations such as views) and the whether-to-dump decision is expressed only by setting or not setting the object's dump flag. I worked a little bit on a patch to do the latter but found that it was more invasive than I'd hoped. Given the lack of any immediate payoff I think it'd probably make more sense to do the former. We could still centralize the decision making into makeTableDataInfo a bit more than now, but it should take the form of not creating the object at all, rather than creating it and then clearing its dump flag. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] When do we lose column names?
On 02/07/2012 12:47 PM, Tom Lane wrote: Andrew Dunstan writes: On 11/16/2011 10:38 PM, Tom Lane wrote: Upon further review, this patch would need some more work even for the RowExpr case, because there are several places that build RowExprs without bothering to build a valid colnames list. It's clearly soluble if anyone cares to put in the work, but I'm not personally excited enough to pursue it ... The patch itself causes a core dump with the current regression tests. Yeah, observing that was what made me write the above. I've been looking at the other places that build RowExprs. Most of them look OK and none seem clearly in need of fixing at first glance. Which do you think need attention? In general I think we'd have to require that colnames be supplied in all RowExprs if we go this way. Anyplace that's trying to slide by without will have to be fixed. I don't recall how many places that is. I just had a thought that maybe we could make this simpler by dummying up a list of colnames if we don't have one, instead of that assertion. Or am I on the wrong track. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
Jim "Decibel!" Nasby wrote: I agree that it's probably pretty unusual to index floats. FWIW: Cubes and points are floats, right? So would spatial indexes benefit from this optimization, or is it only raw floats? Jay Levitt -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Progress on fast path sorting, btree index creation time
On 2/6/12 3:19 PM, Bruce Momjian wrote: > While we're waiting for anyone else to weigh in with an opinion on the > right place to draw the line here, do you want to post an updated > patch with the changes previously discussed? Well, I think we have to ask not only how many people are using float4/8, but how many people are sorting or creating indexes on them. I think it would be few and perhaps should be eliminated. I agree that it's probably pretty unusual to index floats. My objection was on the assumption that float8 is valid but float4 isn't. If we are going to provide a fast-path for one then we should do it for both if for no other reason than least surprise. -- Jim C. Nasby, Database architect...@nasby.net 512.569.9461 (cell)http://jim.nasby.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
Jay Levitt wrote: [Posted at Andres's request] TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in various builds. And I bet you'll want the test script... sigh. attached.

\c postgres
drop database if exists slowcube;
create database slowcube;
\c slowcube
\timing
create schema slowcube;
set search_path to slowcube;
create extension cube;
set work_mem to '1GB';
set maintenance_work_mem to '1GB';
create table cubetest ( position cube );
insert into cubetest (position) select cube(array[random() * 1000, random() * 1000, random() * 1000]) from generate_series(1,100);
select now();
create index q on cubetest using gist(position);

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Bugs/slowness inserting and indexing cubes
[Posted at Andres's request] TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in various builds.

NOTABLE PROBLEMS

1. In 9.1.2, inserting 10x rows takes 19x the time.
   - 9.1-HEAD and 9.2 "fix" this; it now slows down linearly
   - but: 10s > 8s > 5s!
   - but: comparing Ubuntu binary w/vanilla source build on virtual disks, might not be significant
2. In both 9.1 and 9.2, there is a long delay before CREATE INDEX realizes it can't work on an unlogged table
3. In 9.2, creating the 10-million-row index always fails
4. 9.1-HEAD never successfully indexes 10 million rows ("never" = at least 20 minutes on two runs; I will follow up in a few hours)

DETAILS

Times are in seconds, single run.

+-------------------+---------+---------+----------+----------+
| Platform          | 1m rows | 1m rows | 10m rows | 10m rows |
|                   | INSERT  | CR NDX  | INSERT   | CR NDX   |
+-------------------+---------+---------+----------+----------+
| 9.1.2 logged      | 5       | 35      | 98       | 434      |
| 9.1.2 unlogged    | 2       | 34[**]  | 22       | 374[**]  |
| 9.1-HEAD logged   | 10      | 65      | 89       | [***]    |
| 9.1-HEAD unlogged | 2       | 39      | 20       | 690[**]  |
| 9.2 logged        | 8       | 57      | 87       | 509[*]   |
| 9.2 unlogged      | 2       | 33[**]  | 21       | 327[*]   |
+-------------------+---------+---------+----------+----------+

[*] psql:slowcube.sql:20: ERROR: node buffer of page being split (121550) does not exist
[**] psql:slowcube.sql:21: ERROR: unlogged GiST indexes are not supported
[***] never completed after 10-20 minutes; nothing in server.log at default logging levels, postgres process consuming about 1 CPU in IOWAIT, checkpoints every 7-8 seconds

VARIABILITY

A few runs in a row on 9.1-HEAD, 1 million rows, logged:

+--------+--------------+
| INSERT | CREATE INDEX |
+--------+--------------+
| 10     | 65           |
| 8      | 61           |
| 7      | 59           |
| 8      | 61           |
| 7      | 55           |
+--------+--------------+

SYSTEM SPECS

Amazon EC2, EBS-backed, m1.large
7.5GB RAM, 2 cores
Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
shared_buffers = 1867MB
checkpoint_segments = 32
effective_cache_size = 3734MB

9.1.2: installed binaries from Ubuntu's oneiric repo
9.1-HEAD: REL9_1_STABLE, ef19c9dfaa99a2b78ed0f78aa4a44ed31636fdc4, built with simple configure/make/make install
9.2: master, 1631598ea204a3b05104f25d008b510ff5a5c94a, built with simple configure/make/make install

9.1.2 and 9.1-HEAD were run on different (but identically configured) instances. 9.1-HEAD and 9.2 were run on the same instance, but EBS performance is unpredictable. YMMV.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Setting -Werror in CFLAGS
On Wed, Jan 04, 2012 at 01:44:07PM -0500, Robert Haas wrote: > On Tue, Jan 3, 2012 at 9:23 PM, Tom Lane wrote: > > Robert Haas writes: > >> On Tue, Jan 3, 2012 at 7:39 PM, Peter Geoghegan > >> wrote: > >>> Yes, I know that these only appeared in GCC 4.6+ and as such are a > >>> relatively recent phenomenon, but there has been some effort to > >>> eliminate them, and if I could get a non-hacked -Werror build I'd feel > >>> happy enough about excluding them as already outlined. > > > >> I just do this: > >> echo COPT=-Werror > src/Makefile.custom > >> ...which seems to work reasonably well. > > > > I see no point in -Werror whatsoever. If you aren't examining the make > > output for warnings, you're not following proper development practice > > IMO. > > I find -Werror to be a convenient way to examine the output for > warnings. Otherwise they scroll off the screen. Yeah, I could save > the output to a file and grep it afterwards, but that seems less > convenient. Our src/tools/pgtest does this: http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/tools/pgtest;hb=HEAD -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] When do we lose column names?
Andrew Dunstan writes: > On 11/16/2011 10:38 PM, Tom Lane wrote: >> Upon further review, this patch would need some more work even for the >> RowExpr case, because there are several places that build RowExprs >> without bothering to build a valid colnames list. It's clearly soluble >> if anyone cares to put in the work, but I'm not personally excited >> enough to pursue it ... > The patch itself causes a core dump with the current regression tests. Yeah, observing that was what made me write the above. > I've been looking at the other places that build RowExprs. Most of them > look OK and none seem clearly in need of fixing at first glance. Which > do you think need attention? In general I think we'd have to require that colnames be supplied in all RowExprs if we go this way. Anyplace that's trying to slide by without will have to be fixed. I don't recall how many places that is. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add protransform for numeric, varbit, and temporal types
On Tue, Jan 3, 2012 at 10:31 AM, Robert Haas wrote: > On Sat, Dec 31, 2011 at 7:36 PM, Noah Misch wrote: >> Building on commit 8f9fe6edce358f7904e0db119416b4d1080a83aa, this adds >> protransform functions to the length coercions for numeric, varbit, >> timestamp, >> timestamptz, time, timetz and interval. This mostly serves to make more >> ALTER >> TABLE ALTER TYPE operations avoid a rewrite, including numeric(10,2) -> >> numeric(12,2), varbit(4) -> varbit(8) and timestamptz(2) -> timestamptz(4). >> The rules for varbit are exactly the same as for varchar. Numeric is >> slightly >> more complex: >> >> * Flatten calls to our length coercion function that solely represent >> * increases in allowable precision. Scale changes mutate every datum, so >> * they are unoptimizable. Some values, e.g. 1E-1001, can only fit into an >> * unconstrained numeric, so a change from an unconstrained numeric to any >> * constrained numeric is also unoptimizable. >> >> time{,stamp}{,tz} are similar to varchar for these purposes, except that, for >> example, plain "timestamptz" is equivalent to "timestamptz(6)". interval has >> a vastly different typmod format, but the principles applicable to length >> coercion remain the same. >> >> Under --disable-integer-datetimes, I'm not positive that timestamp_scale() is >> always a no-op when one would logically expect as much. Does there exist a >> timestamp such that v::timestamp(2) differs from v:timestamp(2)::timestamp(4) >> due to floating point rounding? Even if so, I'm fairly comfortable calling >> it >> a feature rather than a bug to avoid perturbing values that way. >> >> After these patches, the only core length coercion casts not having >> protransform functions are those for "bpchar" and "bit". For those, we could >> only optimize trivial cases of no length change. I'm not planning to do so. > > This is cool stuff. I will plan to review this once the CF starts. I've committed the numeric and varbit patches and will look at the temporal one next. 
I did notice one odd thing, though: sometimes we seem to get a rebuild on the toast index for no obvious reason: rhaas=# set client_min_messages=debug1; SET rhaas=# create table foo (a serial primary key, b varbit); NOTICE: CREATE TABLE will create implicit sequence "foo_a_seq" for serial column "foo.a" DEBUG: building index "pg_toast_16430_index" on table "pg_toast_16430" NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo" DEBUG: building index "foo_pkey" on table "foo" CREATE TABLE rhaas=# alter table foo alter column b set data type varbit(4); DEBUG: rewriting table "foo" DEBUG: building index "foo_pkey" on table "foo" ALTER TABLE rhaas=# alter table foo alter column b set data type varbit; DEBUG: building index "pg_toast_16430_index" on table "pg_toast_16430" ALTER TABLE Strangely, it doesn't happen if I add another column to the table: rhaas=# set client_min_messages=debug1; SET rhaas=# create table foo (a serial primary key, b varbit, c varbit); NOTICE: CREATE TABLE will create implicit sequence "foo_a_seq" for serial column "foo.a" DEBUG: building index "pg_toast_16481_index" on table "pg_toast_16481" NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo" DEBUG: building index "foo_pkey" on table "foo" CREATE TABLE rhaas=# alter table foo alter column b set data type varbit(4); DEBUG: building index "pg_toast_16490_index" on table "pg_toast_16490" DEBUG: rewriting table "foo" DEBUG: building index "foo_pkey" on table "foo" ALTER TABLE rhaas=# alter table foo alter column b set data type varbit; ALTER TABLE There may not be any particular harm in a useless rebuild of the TOAST table index (wasted effort excepted), but my lack of understanding of why this is happening causes me to fear that there's a bug, not so much in these patches as in the core alter table code that is enabling skipping table and index rebuilds. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches
On 01/24/2012 03:53 PM, Robert Haas wrote: There are two graphs for each branch. The first is a scatter plot of latency vs. transaction time. I found that graph hard to understand, though; I couldn't really tell what I was looking at. So I made a second set of graphs which graph number of completed transactions in a given second of the test against time. Note that you're now reinventing parts of pgbench-tools; the main two graphs it gives are the latency scatter plot and TPS per second. The things you're likely to find interesting next are maximum latency, 90th percentile latency, and a delta for what changed in pg_stat_bgwriter during the test; those are the other things I track in that program. I'm working toward publishing my own tests of the performance patches still considered useful by the end of the week. Murphy's Law has active on that project since it started though--server crashed the day I left on a week long trip, and I've been sick ever since getting back. First, some of these transactions had really long latency. Second, there are a remarkable number of seconds all of the test during which no transactions at all manage to complete, sometimes several seconds in a row. These periods have in my tests always been associated with Linux turning aggressive about cleaning out its write cache, either due to fsync request or simply crossing one of its thresholds for doing so. My current record is an 80 second pause with no transactions completing. One of things I expect to add to pgbench_tools within the next week is tracking how much dirty memory is accumulating during each test. Seeing that graph overlaid on top of the rest makes a lot of what's happening at any time more obvious. Noting when the checkpoints happen is a bit less interesting, because once the first one happens, they happen almost continuously. You really need to track when the write and sync phases are happening for that to be really useful. This circles back to why I proposed exposing those timing bits in pg_stat_bgwriter. pgbench-tools already grabs data from it, which avoids all the mess around log file parsing. If I could do that more often and extract checkpoint timing from that data, it would make labelling graphs like these much easier to do, from the client that's running the benchmark even. Third, all of the tests initially start of processing transactions very quickly, and get slammed down very hard, probably because the very high rate of transaction processing early on causes a checkpoint to occur around 200 s. At the beginning of a write-heavy pgbench run, rate is high until one of these two things happen: 1) A checkpoint begins 2) Linux's write cache threshold (typically /proc/sys/vm/dirty_background_ratio) worth of dirty memory accumulates. Note that (1) on its own isn't necessarily the problem, it's something the case that it just makes (2) happen much faster then. Basically, the first 30 to 150 seconds of any write-heavy test always have an inflated speed. You're writing into the OS cache at maximum speed, and none of those writes are making it to physical disk--except perhaps for the WAL, which is all fast and sequential. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
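
The dirty-memory number being referred to is the "Dirty:" line of /proc/meminfo, in kB; a minimal standalone sampler (not part of pgbench-tools, names invented for the example) looks like this:

    #include <stdio.h>
    #include <string.h>

    /* Return the kernel's count of dirty page-cache memory in kB, or -1. */
    static long
    dirty_memory_kb(void)
    {
        FILE   *f = fopen("/proc/meminfo", "r");
        char    line[128];
        long    kb = -1;

        if (f == NULL)
            return -1;
        while (fgets(line, sizeof(line), f) != NULL)
        {
            if (strncmp(line, "Dirty:", 6) == 0)
            {
                sscanf(line + 6, "%ld", &kb);
                break;
            }
        }
        fclose(f);
        return kb;
    }

    int
    main(void)
    {
        printf("Dirty: %ld kB\n", dirty_memory_kb());
        return 0;
    }

Sampling this once a second alongside the TPS data is enough to see the dirty total climbing toward dirty_background_ratio before the stalls.
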
Re: [HACKERS] xlog location arithmetic
On 26-01-2012 06:19, Fujii Masao wrote: Thanks for your review. Comments below. > When I compiled the source with xlogdiff.patch, I got the following warnings. > > xlogfuncs.c:511:2: warning: format '%lX' expects argument of type > 'long unsigned int *', but argument 3 has type 'uint64 *' [-Wformat] > What is your compiler? I'm using gcc 4.6.2. I refactored the patch so I'm now using XLogRecPtr and %X. > postgres=# SELECT pg_xlog_location_diff('0/274', '0/274'); > ERROR: xrecoff "274" is out of valid range, 0..A4A534C > Ugh? I can't reproduce that. It seems to be related to long int used by the prior version. > Since pg_xlog_location_diff() can be executed during recovery, > the above needs to be updated. > Fixed. >> While the output was int8 I could use >> pg_size_pretty but now I couldn't. I attached another patch that implements >> pg_size_pretty(numeric). > I realized that it collides with the pg_size_pretty(int8) if we don't specify a type. Hence, I decided to drop the pg_size_pretty(int8) in favor of pg_size_pretty(numeric). It is slower than the former but it is not a performance critical function. > According to the above source code comment in pg_proc.h, ISTM > pg_size_pretty() for numeric also needs to have its own DESCR(). > Fixed. > According to "man strcat", the dest string must have enough space for > the result. > "buf" has enough space? > Ops. Fixed. -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 236a60a..511a918 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -14942,7 +14942,7 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup()); -pg_size_pretty(bigint) +pg_size_pretty(numeric) text Converts a size in bytes into a human-readable format with size units diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c index 26a8c01..d4a142b 100644 --- a/src/backend/utils/adt/dbsize.c +++ b/src/backend/utils/adt/dbsize.c @@ -24,6 +24,7 @@ #include "storage/fd.h" #include "utils/acl.h" #include "utils/builtins.h" +#include "utils/numeric.h" #include "utils/rel.h" #include "utils/relmapper.h" #include "utils/syscache.h" @@ -506,48 +507,101 @@ pg_total_relation_size(PG_FUNCTION_ARGS) PG_RETURN_INT64(size); } -/* - * formatting with size units - */ Datum pg_size_pretty(PG_FUNCTION_ARGS) { - int64 size = PG_GETARG_INT64(0); - char buf[64]; - int64 limit = 10 * 1024; - int64 limit2 = limit * 2 - 1; + Numeric size = PG_GETARG_NUMERIC(0); + Numeric limit, limit2; + + char *buf, *result; - if (size < limit) - snprintf(buf, sizeof(buf), INT64_FORMAT " bytes", size); + limit = DatumGetNumeric(DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) (10 * 1024; + limit2 = DatumGetNumeric(DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) (10 * 1024 * 2 - 1; + + if (DatumGetBool(DirectFunctionCall2(numeric_lt, NumericGetDatum(size), NumericGetDatum(limit + { + buf = DatumGetCString(DirectFunctionCall1(numeric_out, NumericGetDatum(size))); + result = palloc(strlen(buf) + 7); + strcpy(result, buf); + strcat(result, " bytes"); + } else { - size >>= 9;/* keep one extra bit for rounding */ - if (size < limit2) - snprintf(buf, sizeof(buf), INT64_FORMAT " kB", - (size + 1) / 2); + Numeric arg2; + + /* keep one extra bit for rounding */ + /* size >>= 9 */ + arg2 = DatumGetNumeric(DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) pow(2, 9; + size = 
DatumGetNumeric(DirectFunctionCall2(numeric_div_trunc, NumericGetDatum(size), NumericGetDatum(arg2))); + + if (DatumGetBool(DirectFunctionCall2(numeric_lt, NumericGetDatum(size), NumericGetDatum(limit2 + { + /* size = (size + 1) / 2 */ + size = DatumGetNumeric(DirectFunctionCall2(numeric_add, NumericGetDatum(size), + DirectFunctionCall1(int8_numeric, Int64GetDatum(1; + size = DatumGetNumeric(DirectFunctionCall2(numeric_div_trunc, NumericGetDatum(size), + DirectFunctionCall1(int8_numeric, Int64GetDatum(2; + buf = DatumGetCString(DirectFunctionCall1(numeric_out, NumericGetDatum(size))); + result = palloc(strlen(buf) + 4); + strcpy(result, buf); + strcat(result, " kB"); + } else { - size >>= 10; - if (size < limit2) -snprintf(buf, sizeof(buf), INT64_FORMAT " MB", - (size + 1) / 2); + Numeric arg3; + + /* size >>= 10 */ + arg3 = DatumGetNumeric(DirectFunctionCall1(int8_numeric, Int64GetDatum((int64) pow(2, 10; + size = DatumGetNumeric(DirectFunctionCall2(numeric_div_trunc, NumericGetDatum(size), NumericGetDatum(arg3))); + + if (DatumGetBool(DirectFunctionCall2(numeric_lt, NumericGetDatum(size), NumericGetDatum(limit2 + { +/* size = (size + 1) / 2 */ +size = DatumGetNumeric(DirectFunctionCal
Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
On 02/07/2012 12:09 AM, Dan Scales wrote: So, yes, good point -- double writes cannot replace the functionality of full_page_writes for base backup. If double writes were in use, they might be automatically switched over to full page writes for the duration of the base backup. And the double write file should not be part of the base backup. There is already a check for this sort of problem during the base backup. It forces full_pages_writes on for the backup, even if the running configuration has it off. So long as double writes can be smoothly turned off and back on again, that same section of code can easily be made to handle that, too. As far as not making the double write file part of the base backup, I was assuming that would go into a subdirectory under pg_xlog by default. I would think that people who relocate pg_xlog using one of the methods for doing that would want the double write buffer to move as well. And if it's inside pg_xlog, existing base backup scripts won't need to be changed--the correct ones already exclude pg_xlog files. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for parallel pg_dump
Robert Haas writes: > most places that issue queries can simply use those routines without > needing to peek under the hood into the ArchiveHandle. This is not > quite enough to get rid of g_conn, but it's close: the major stumbling > block at this point is probably exit_nicely(). The gyrations we're > going through to make sure that AH->connection gets closed before > exiting are fairly annoying; maybe we should invent something in > dumputils.c along the line of the backend's on_shmem_exit(). Do we actually care about closing the connection? Worst case is that backend logs an "unexpected EOF" message. But yeah, an atexit hook might be the easiest solution. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
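
A minimal sketch of the atexit-style hook being floated here (names invented, not actual pg_dump code): register the live connection once and let a single handler close it on any exit path, so exit_nicely() callers no longer have to remember to do it themselves.

    #include <stdlib.h>
    #include <libpq-fe.h>

    static PGconn *current_conn = NULL;     /* invented name */

    /* single exit handler: close whatever connection is registered */
    static void
    close_connection_atexit(void)
    {
        if (current_conn != NULL)
        {
            PQfinish(current_conn);
            current_conn = NULL;
        }
    }

    /* remember the live connection; install the handler on first use */
    void
    register_exit_connection(PGconn *conn)
    {
        static int  registered = 0;

        current_conn = conn;
        if (!registered)
        {
            atexit(close_connection_atexit);
            registered = 1;
        }
    }
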
Re: [HACKERS] incorrect handling of the timeout in pg_receivexlog
On Tue, Feb 7, 2012 at 17:29, Heikki Linnakangas wrote: > On 07.02.2012 16:55, Tom Lane wrote: >> >> (The integer vs float TimestampTz issue is a kind of portability >> problem, but we've already bought into the assumption that sender and >> receiver must be built with the same choice, no?) > > > Hmm, true. In hindsight, I think that was a bad choice, but it's a bit late > to change that. pg_basebackup doesn't otherwise care about the integer/float > timestamps, but it does send a timestamp back to the server. You won't be > able to actually start up the database if the config options don't match, > but I think it would be good if pg_basebackup still worked across platforms > and versions. For example, you might have a central backup server that calls > pg_basebackup on several database servers, running on different platforms. > > In 9.0, the only field in the protocol that depends on timestamp format is > WalDataMessageHeader->sendTime. That goes from server to client, and > pg_basebackup/pg_receivexlog don't care about that. In 9.1 we introduced > StandbyReplyMessage->sendTime, which is sent from client to server, but > looking at the code it looks like the server doesn't use it for anything. In > 9.2, we added WalSndrMessage->sendTime, which is used by a standby server to > calculate how far behind the standby is. > > I'm tempted to just change all of those TimestampTz fields to something > that's independent of integer/float timestamp setting, in 9.2. At a quick > glance, it seems that it wouldn't break anything. In general, I think that would work. Since we can't replicate across versions anyway. Will it break using pg_basebackup 9.2 on a 9.1 server, though? that would also be very useful in the scenario of the central server... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] incorrect handling of the timeout in pg_receivexlog
On 07.02.2012 16:55, Tom Lane wrote: (The integer vs float TimestampTz issue is a kind of portability problem, but we've already bought into the assumption that sender and receiver must be built with the same choice, no?) Hmm, true. In hindsight, I think that was a bad choice, but it's a bit late to change that. pg_basebackup doesn't otherwise care about the integer/float timestamps, but it does send a timestamp back to the server. You won't be able to actually start up the database if the config options don't match, but I think it would be good if pg_basebackup still worked across platforms and versions. For example, you might have a central backup server that calls pg_basebackup on several database servers, running on different platforms. In 9.0, the only field in the protocol that depends on timestamp format is WalDataMessageHeader->sendTime. That goes from server to client, and pg_basebackup/pg_receivexlog don't care about that. In 9.1 we introduced StandbyReplyMessage->sendTime, which is sent from client to server, but looking at the code it looks like the server doesn't use it for anything. In 9.2, we added WalSndrMessage->sendTime, which is used by a standby server to calculate how far behind the standby is. I'm tempted to just change all of those TimestampTz fields to something that's independent of integer/float timestamp setting, in 9.2. At a quick glance, it seems that it wouldn't break anything. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
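
To make that concrete, a sketch (only a sketch, not a committed design) of a build-independent wire value: always send int64 microseconds since 2000-01-01, whichever way --disable-integer-datetimes was set. TimestampTz and HAVE_INT64_TIMESTAMP are the backend's existing definitions; the function name is invented.

    #include <math.h>               /* rint() for the float-timestamp case */

    static int64
    timestamptz_to_wire(TimestampTz ts)
    {
    #ifdef HAVE_INT64_TIMESTAMP
        return ts;                              /* already int64 microseconds */
    #else
        return (int64) rint(ts * 1000000.0);    /* double seconds -> microseconds */
    #endif
    }
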
Re: [HACKERS] Review of: explain / allow collecting row counts without timing info
On Sun, Feb 5, 2012 at 12:44 AM, Kevin Grittner wrote: > Tom Lane wrote: > >> Yeah, I think we need to preserve that property. Unexpectedly >> executing query (which may have side-effects) is a very dangerous >> thing. People are used to the idea that ANALYZE == execute, and >> adding random other flags that also cause execution is going to >> burn somebody. > > +1 > > FWIW, another reason not to use Robert's suggested syntax is that you > get "rows=n" entries with or without the actual run. You just don't > get the "actual" block to compare to the estimate. So ROWS as an > option would be very ambiguous. OK, so based on that resoundingly unanimous input, I've committed Tomas's last version. I made some alterations to the sgml documentation to avoid mentioning "gettimeofday" specifically, because that might not be the call everywhere (e.g. Windows) and even if it is, it doesn't seem too user-friendly. The code is entirely as he had it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for parallel pg_dump
On Fri, Feb 3, 2012 at 10:43 AM, Joachim Wieland wrote: > On Fri, Feb 3, 2012 at 7:52 AM, Robert Haas wrote: >> If you're OK with that much I'll go do it. > > Sure, go ahead! It turns out that (as you anticipated) there are some problems with my previous proposal. For one thing, an Archive isn't really just a connection, because it's also used as a data sink by passing it to functions like ArchiveEntry(). For two things, as you probably realized but I failed to, ConnectDatabase() is already setting AH->connection to the same PGconn it returns, so the idea that we can potentially have multiple connection objects using the existing framework is not really true; or at least it's going to require more work than I originally thought. I think it might still be worth doing at some point, but I think we probably need to clean up some of the rest of this mess first. I've now rejiggered things so that the Archive is passed down to everything that needs it, so the global variable g_fout is gone. I've also added a couple of additional accessors to pg_backup_db.c so that most places that issue queries can simply use those routines without needing to peek under the hood into the ArchiveHandle. This is not quite enough to get rid of g_conn, but it's close: the major stumbling block at this point is probably exit_nicely(). The gyrations we're going through to make sure that AH->connection gets closed before exiting are fairly annoying; maybe we should invent something in dumputils.c along the line of the backend's on_shmem_exit(). I'm starting to think it might make sense to press on with this refactoring just a bit further and eliminate the distinction between Archive and ArchiveHandle. Given a few more accessors (and it really looks like it would only be a few), pg_dump.c could treat an ArchiveHandle as an opaque struct, which would accomplish more or less the same thing as the current design, but in a less confusing fashion - i.e. without giving the reader the idea that the author desperately wishes he were coding in C++. That would allow simplification in a number other places; just to take one example, we wouldn't need both appendStringLiteralAH and appendStringLiteralAHX. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed dblink using alternate libpq tuple storage
On Tue, Feb 7, 2012 at 9:44 AM, Marko Kreen wrote: >> - What is the right (or recommended) way to prevent from throwing >> exception in row-processor callback function? When author should use >> PG_TRY block to catch exception in the callback function? > > When it calls backend functions that can throw exceptions? > As all handlers running in backend will want to convert data > to Datums, that means "always wrap handler code in PG_TRY"? I would hate to have such a rule. PG_TRY isn't free, and it's prone to subtle bugs, like failing to mark enough stuff in the same function "volatile". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
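
For concreteness, a rough sketch (not taken from the patch) of what "wrap handler code in PG_TRY" would mean for a backend-side row processor using the proposed PGrowValue callback signature. store_row() is an invented helper standing in for the Datum-conversion work, and the return convention (nonzero on success) is assumed; note the volatile marking, which is exactly the class of subtle bug mentioned above.

    #include "postgres.h"           /* backend-side sketch: PG_TRY/elog machinery */

    extern void store_row(void *param, PGrowValue *columns);   /* invented helper */

    static int
    my_row_processor(PGresult *res, void *param, PGrowValue *columns)
    {
        volatile int ok = 0;        /* assigned inside PG_TRY, read afterwards */

        (void) res;                 /* unused in this sketch */

        PG_TRY();
        {
            /* may elog/ereport while converting column text into Datums */
            store_row(param, columns);
            ok = 1;
        }
        PG_CATCH();
        {
            /* a real handler would also restore memory context state here */
            FlushErrorState();
        }
        PG_END_TRY();

        return ok;                  /* assuming nonzero means "row consumed" */
    }
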
[HACKERS] Can PQstatus() be used by Application to check connection to postgres periodically?
Hello All, My application has to do a real time data upload to the PostgreSQL server. Every time I have to do a real time upload, I do not wish to open a new connection. I want to open a connection once [when my application comes up] and periodically check if the connection is active. Can PQstatus() be used by the application to check the status of the connection already established? If PQstatus() cannot be used, does PostgreSQL provide an alternate interface to check the status of the connection? Note: I do not wish to open a connection on every real time upload as it's an overhead for the application. Appreciate any help! WBR, Sujay -- View this message in context: http://postgresql.1045698.n5.nabble.com/Can-PQstatus-be-used-by-Application-to-check-connection-to-postgres-periodically-tp5462315p5462315.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
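
A minimal client-side sketch of the periodic check being asked about (the connection string and probe query are placeholders). PQstatus() only reports the state libpq last observed -- it does not probe the server -- so a common pattern is to pair it with a trivial query and fall back to PQreset() to re-establish a dropped session:

    #include <stdio.h>
    #include <libpq-fe.h>

    /*
     * Return 1 if the connection is usable, trying to re-establish it if not.
     * PQstatus() alone only reflects the last state libpq saw, so a cheap
     * query is used to force detection of a silently dropped connection.
     */
    static int
    ensure_connection(PGconn *conn)
    {
        PGresult   *res;
        int         ok = 0;

        if (PQstatus(conn) == CONNECTION_OK)
        {
            res = PQexec(conn, "SELECT 1");
            ok = (PQresultStatus(res) == PGRES_TUPLES_OK);
            if (res != NULL)
                PQclear(res);
            if (ok)
                return 1;
        }

        /* connection is (or just turned out to be) bad: try to reset it */
        PQreset(conn);
        return PQstatus(conn) == CONNECTION_OK;
    }

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("dbname=postgres");  /* placeholder conninfo */

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        /* call this before each periodic upload instead of reconnecting */
        if (!ensure_connection(conn))
            fprintf(stderr, "could not restore connection: %s", PQerrorMessage(conn));

        PQfinish(conn);
        return 0;
    }
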
Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
>> I think it is a good idea, and can help double-writes perform better in the >> case of lots of backend evictions. I don't understand this point, because from the data in your mail, it appears that when shared buffers are less means when more evictions can happen, the performance is less. ISTM that the performance is less incase shared buffers size is less because I/O might happen by the backend process which can degrade performance. Is there any problem if the double-write happens only by bgwriter or checkpoint. Something like whenever backend process has to evict the buffer, it will do same as you have described that write in a double-write buffer, but bgwriter will check this double-buffer and flush from it. Also whenever any backend will see that the double buffer is more than 2/3rd or some threshhold value full it will tell bgwriter to flush from double-write buffer. This can ensure very less I/O by any backend. -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dan Scales Sent: Saturday, January 28, 2012 4:02 AM To: PG Hackers Subject: [HACKERS] double writes using "double-write buffer" approach [WIP] I've been prototyping the double-write buffer idea that Heikki and Simon had proposed (as an alternative to a previous patch that only batched up writes by the checkpointer). I think it is a good idea, and can help double-writes perform better in the case of lots of backend evictions. It also centralizes most of the code change in smgr.c. However, it is trickier to reason about. The idea is that all page writes generally are copied to a double-write buffer, rather than being immediately written. Note that a full copy of the page is required, but can folded in with a checksum calculation. Periodically (e.g. every time a certain-size batch of writes have been added), some writes are pushed out using double writes -- the pages are first written and fsynced to a double-write file, then written to the data files, which are then fsynced. Then double writes allow for fixing torn pages, so full_page_writes can be turned off (thus greatly reducing the size of the WAL log). The key changes are conceptually simple: 1. In smgrwrite(), copy the page to the double-write buffer. If a big enough batch has accumulated, then flush the batch using double writes. [I don't think I need to intercept calls to smgrextend(), but I am not totally sure.] 2. In smgrread(), always look first in the double-write buffer for a particular page, before going to disk. 3. At the end of a checkpoint and on shutdown, always make sure that the current contents of the double-write buffer are flushed. 4. Pass flags around in some cases to indicate whether a page buffer needs a double write or not. (I think eventually this would be an attribute of the buffer, set when the page is WAL-logged, rather than a flag passed around.) 5. Deal with duplicates in the double-write buffer appropriately (very rarely happens). To get good performance, I needed to have two double-write buffers, one for the checkpointer and one for all other processes. The double-write buffers are circular buffers. The checkpointer double-write buffer is just a single batch of 64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches of 64 pages each. Each batch goes to a different double-write file, so that they can be issued independently as soon as each batch is completed. 
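
[Illustration, not part of the WIP patch: a simplified standalone sketch of the flush ordering just described -- the whole batch is written and fsync'd to the double-write file before any page is written to its home data file, which is what makes a torn data-file page recoverable at crash time. BLCKSZ and the batch bookkeeping are simplified.]

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192             /* assumed page size */

    typedef struct
    {
        int     data_fd;            /* file the page really belongs to */
        off_t   data_offset;        /* offset of the page in that file */
        char    page[BLCKSZ];       /* page image, checksummed by the caller */
    } DWEntry;

    /* Returns 0 on success, -1 on any write/fsync failure. */
    static int
    flush_double_write_batch(int dw_fd, DWEntry *batch, int nentries)
    {
        int     i;

        /* 1. write the whole batch sequentially to the double-write file ... */
        for (i = 0; i < nentries; i++)
            if (pwrite(dw_fd, batch[i].page, BLCKSZ, (off_t) i * BLCKSZ) != BLCKSZ)
                return -1;
        /* ... and make it durable before touching any data file */
        if (fsync(dw_fd) != 0)
            return -1;

        /* 2. only now write each page to its home data file */
        for (i = 0; i < nentries; i++)
            if (pwrite(batch[i].data_fd, batch[i].page, BLCKSZ,
                       batch[i].data_offset) != BLCKSZ)
                return -1;

        /* 3. fsync the data files so the double-write file can be recycled
         * (a real implementation would deduplicate these fsyncs per file) */
        for (i = 0; i < nentries; i++)
            if (fsync(batch[i].data_fd) != 0)
                return -1;

        return 0;
    }
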
Also, I need to sort the buffers being checkpointed by file/offset (see ioseq.c), so that the checkpointer batches will most likely only have to write and fsync one data file. Interestingly, I find that the plot of tpm for DBT2 is much smoother (though still has wiggles) with double writes enabled, since there are no unpredictable long fsyncs at the end (or during) a checkpoint. Here are performance numbers for double-write buffer (same configs as previous numbers), for 2-processor, 60-minute 50-warehouse DBT2. One the right shows the size of the shared_buffers, and the size of the RAM in the virtual machine. FPW stands for full_page_writes, DW for double_writes. 'two disk' means the WAL log is on a separate ext3 filesystem from the data files. FPW off FPW on DW on, FPW off one disk: 15488 13146 11713[5G buffers, 8G VM] two disk: 18833 16703 18013 one disk: 12908 111599758[3G buffers, 6G VM] two disk: 14258 12694 11229 one disk 1082998655806[1G buffers, 8G VM] two disk 13605 126945682 one disk: 675261294878 two disk: 725366775239[1G buffers, 2G VM] The performance of DW on the small cache cases (1G shared_buffers) is now much better, though still not as good as FPW on. In the medium cache case (3G buffers), where there are significant backend dirty
Re: [HACKERS] patch for preventing the specification of conflicting transaction read/write options
Chetan Suttraway wrote: > This is regarding the TODO item: > "Prevent the specification of conflicting transaction read/write > options" > > listed at: > http://wiki.postgresql.org/wiki/Todo Thanks for chipping away at items on the list. > Please pass on any inputs on the patch. Right now we want people focusing on fixing bugs and reviewing patches submitted by the start of the current CommitFest. Please add this to the open CF so we don't lose track of it. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing special case OID generation
Simon Riggs writes: > So ISTM that we should just strip out the OID handling code and just > have a system sequence defined like this I think this is a pretty poor idea, because the overhead of nextval() is quite a lot larger than the overhead to get an OID. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] incorrect handling of the timeout in pg_receivexlog
Heikki Linnakangas writes: > On 07.02.2012 09:03, Fujii Masao wrote: >> What about changing receivelog.c so that it uses time_t instead of >> TimestampTz? Which would make the code simpler, I think. > Hmm, that would reduce granularity to seconds. It also creates portability issues that I'd just as soon not deal with, ie, time_t is not the same width on all platforms. (The integer vs float TimestampTz issue is a kind of portability problem, but we've already bought into the assumption that sender and receiver must be built with the same choice, no?) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed dblink using alternate libpq tuple storage
On Tue, Feb 07, 2012 at 07:25:14PM +0900, Shigeru Hanada wrote: > (2012/02/02 23:30), Marko Kreen wrote: > > On Thu, Feb 02, 2012 at 04:51:37PM +0900, Kyotaro HORIGUCHI wrote: > >> Hello, This is new version of dblink.c > >> > >> - Memory is properly freed when realloc returns NULL in storeHandler(). > >> > >> - The bug that free() in finishStoreInfo() will be fed with > >>garbage pointer when malloc for sinfo->valbuflen fails is > >>fixed. > > > > Thanks, merged. I also did some minor coding style cleanups. > > > > Tagging it Ready For Committer. I don't see any notable > > issues with the patch anymore. > > > > There is some potential for experimenting with more aggressive > > optimizations on dblink side, but I'd like to get a nod from > > a committer for libpq changes first. > > I tried to use this feature in pgsql_fdw patch, and found some small issues. > > - Typos > - mes -> msg > - funcion -> function > - overritten -> overwritten > - costom -> custom Fixed. > - What is the right (or recommended) way to prevent from throwing > exceptoin in row-processor callback function? When author should use > PG_TRY block to catch exception in the callback function? When it calls backend functions that can throw exceptions? As all handlers running in backend will want to convert data to Datums, that means "always wrap handler code in PG_TRY"? Although it seems we could allow exceptions, at least when we are speaking of Postgres backend, as the connection and result are internally consistent state when the handler is called, and the partial PGresult is stored under PGconn->result. Even the connection is in consistent state, as the row packet is fully processed. So we have 3 variants, all work, but which one do we want to support? 1) Exceptions are not allowed. 2) Exceptions are allowed, but when it happens, user must call PQfinish() next, to avoid processing incoming data from potentially invalid state. 3) Exceptions are allowed, and further row processing can continue with PQisBusy() / PQgetResult() 3.1) The problematic row is retried. (Current behaviour) 3.2) The problematic row is skipped. No clue if that is ok for handler written in C++, I have no idea whether you can throw C++ exception when part of the stack contains raw C calls. > - It would be better to describe how to determine whether a column > result is NULL should be described clearly. Perhaps result value is > NULL when PGrowValue.len is less than 0, right? Eh, seems it's documented everywhere except in sgml doc. Fixed. [ Is it better to document that it is "-1" or "< 0"? ] Also I removed one remaining dynamic stack array in dblink.c Current state of patch attached. -- marko *** a/src/interfaces/libpq/exports.txt --- b/src/interfaces/libpq/exports.txt *** *** 160,162 PQconnectStartParams 157 --- 160,164 PQping158 PQpingParams 159 PQlibVersion 160 + PQsetRowProcessor 161 + PQsetRowProcessorErrMsg 162 *** a/src/interfaces/libpq/fe-connect.c --- b/src/interfaces/libpq/fe-connect.c *** *** 2693,2698 makeEmptyPGconn(void) --- 2693,2701 conn->wait_ssl_try = false; #endif + /* set default row processor */ + PQsetRowProcessor(conn, NULL, NULL); + /* * We try to send at least 8K at a time, which is the usual size of pipe * buffers on Unix systems. 
That way, when we are sending a large amount *** *** 2711,2718 makeEmptyPGconn(void) --- 2714,2726 initPQExpBuffer(&conn->errorMessage); initPQExpBuffer(&conn->workBuffer); + /* set up initial row buffer */ + conn->rowBufLen = 32; + conn->rowBuf = (PGrowValue *)malloc(conn->rowBufLen * sizeof(PGrowValue)); + if (conn->inBuffer == NULL || conn->outBuffer == NULL || + conn->rowBuf == NULL || PQExpBufferBroken(&conn->errorMessage) || PQExpBufferBroken(&conn->workBuffer)) { *** *** 2814,2819 freePGconn(PGconn *conn) --- 2822,2829 free(conn->inBuffer); if (conn->outBuffer) free(conn->outBuffer); + if (conn->rowBuf) + free(conn->rowBuf); termPQExpBuffer(&conn->errorMessage); termPQExpBuffer(&conn->workBuffer); *** *** 5078,5080 PQregisterThreadLock(pgthreadlock_t newhandler) --- 5088,5091 return prev; } + *** a/src/interfaces/libpq/fe-exec.c --- b/src/interfaces/libpq/fe-exec.c *** *** 66,71 static PGresult *PQexecFinish(PGconn *conn); --- 66,72 static int PQsendDescribe(PGconn *conn, char desc_type, const char *desc_target); static int check_field_number(const PGresult *res, int field_num); + static int pqAddRow(PGresult *res, void *param, PGrowValue *columns); /* *** *** 160,165 PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status) --- 161,167 result->curBlock = NULL; result->curOffset = 0; result->spaceLeft = 0; +
Re: [HACKERS] semi-PoC: kNN-gist for cubes
On Tue, Feb 07, 2012 at 05:56:52AM -0800, David Fetter wrote: > On Mon, Feb 06, 2012 at 06:25:33PM -0500, Jay Levitt wrote: > > I have a rough proof-of-concept for getting nearest-neighbor > > searches working with cubes. > > Please attach such patches to the email when posting them :) And here's a cleaned-up version of the patch that at least passes "make check." Cheers, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate *** a/contrib/cube/cube.c --- b/contrib/cube/cube.c *** *** 73,78 PG_FUNCTION_INFO_V1(g_cube_penalty); --- 73,79 PG_FUNCTION_INFO_V1(g_cube_picksplit); PG_FUNCTION_INFO_V1(g_cube_union); PG_FUNCTION_INFO_V1(g_cube_same); + PG_FUNCTION_INFO_V1(g_cube_distance); Datum g_cube_consistent(PG_FUNCTION_ARGS); Datum g_cube_compress(PG_FUNCTION_ARGS); *** *** 81,86 Datumg_cube_penalty(PG_FUNCTION_ARGS); --- 82,88 Datum g_cube_picksplit(PG_FUNCTION_ARGS); Datum g_cube_union(PG_FUNCTION_ARGS); Datum g_cube_same(PG_FUNCTION_ARGS); + Datum g_cube_distance(PG_FUNCTION_ARGS); /* ** B-tree support functions *** *** 649,654 g_cube_same(PG_FUNCTION_ARGS) --- 651,671 } /* + ** Distance method + */ + Datum + g_cube_distance(PG_FUNCTION_ARGS) + { + GISTENTRY *entry = (GISTENTRY *) PG_GETARG_POINTER(0); + NDBOX *query = PG_GETARG_NDBOX(1); + double distance; + + distance = DatumGetFloat8(DirectFunctionCall2(cube_distance, entry->key, PointerGetDatum(query))); + + PG_RETURN_FLOAT8(distance); + } + + /* ** SUPPORT ROUTINES */ bool -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Removing special case OID generation
Excerpts from Simon Riggs's message of mar feb 07 10:46:09 -0300 2012: > Recent events have made me notice the OID handling. > > AFAICS, OIDs are just a sequence with a max value that fits in a uint. > > So ISTM that we should just strip out the OID handling code and just > have a system sequence defined like this > > CREATE SEQUENCE _pg_oid >MINVALUE 0 >MAXVALUE 4294967296 >CACHE 8192 >CYCLE; > > Which would then make it easier to have a sequence for each toast > table and a sequence for lobs. Yeah, I agree that having a single sequence to handle oid allocations on all toast tables across all databases is strange. I don't have evidence that this is a real scalability problem though. But theoretically at least it seems to me that there could sporadically be a problem if a table has a long string of allocated OIDs and the system OID generator wraps around to somewhere within that range to allocate a new one for that table. That could cause a long period of spinning to get a new value, thus high latency on that insert. (Now admittedly if the same were to happen with a non-shared sequence, it would have to spin just the same -- but it'd do so without having to grab a system-level lwlock each time.) Having one sequence for each toast table could be wasteful though. I mean, sequences are not the best use of shared buffer cache currently. If we could have more than one sequence data in a shared buffer page, things would be different. Not sure how serious this really is. -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] LWLockWaitUntilFree (was: Group commit, revised)
On Tue, Feb 7, 2012 at 3:48 AM, Heikki Linnakangas wrote: > On 03.02.2012 22:57, Robert Haas wrote: >> >> I think I recommended a bad name for this function. It's really >> LWLockAcquireOrWaitUntilFree. Can we rename that? > > Agreed, that's better. Although quite long. "LWLockAcquireOrWait" perhaps? Sure. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] semi-PoC: kNN-gist for cubes
On Mon, Feb 06, 2012 at 06:25:33PM -0500, Jay Levitt wrote: > I have a rough proof-of-concept for getting nearest-neighbor > searches working with cubes. Please attach such patches to the email when posting them :) Cheers, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Removing special case OID generation
Recent events have made me notice the OID handling. AFAICS, OIDs are just a sequence with a max value that fits in a uint. So ISTM that we should just strip out the OID handling code and just have a system sequence defined like this CREATE SEQUENCE _pg_oid MINVALUE 0 MAXVALUE 4294967296 CACHE 8192 CYCLE; Which would then make it easier to have a sequence for each toast table and a sequence for lobs. Not sure it's important now, but maybe it will reduce the size of the executable and avoid oid-specific bugs. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] incorrect handling of the timeout in pg_receivexlog
On Tue, Feb 7, 2012 at 7:06 PM, Magnus Hagander wrote: > On Tue, Feb 7, 2012 at 10:55, Heikki Linnakangas > wrote: >> On 07.02.2012 11:35, Magnus Hagander wrote: >>> >>> On Tue, Feb 7, 2012 at 10:31, Heikki Linnakangas >>> wrote: So, --statusint needs to be in milliseconds. And while we're at it, how Attached patch does that and fixes the problem caused under --disable-integer-datetimes. difficult would be to ask the server for the current value of replication_timeout, and set --statusint automatically based on that? Or perhaps mark replication_timeout as GUC_REPORT. It is rather fiddly that depending on a server setting, you need to pass an option in the client or it will just silently fail with no indication of what the problem is. >>> >>> >>> We can't really ask for it easily, since we're on a replication >>> connection. Unless we add that to the walsender grammar, but that >>> would make it impossible to use the receivexlog stuff against a 9.1 >>> server (which I think still works, though I haven't tested it in a >>> while). >> >> >> You could put a version-check there, and only send the command if connected >> to a 9.2 server. > > True.. > > >>> Do we have a facility to make it a GUC_REPORT but only for walsender >>> connections? >> >> >> There's no such facility at the moment. >> >> >>> It seems like a very unnecessary thing to have it sent out over every >>> single connection, since it would only be useful in a very small >>> subset of them. >> >> >> True, and conversely, many of the current GUC_REPORT settings don't apply to >> replication clients at all. Like standard_conforming_strings and DateStyle. > > Right. But it's less important there, since the replication > connections will in any reasonable configuration be only a tiny part > of the connections. > > >> I think we need another flag for settings that should be sent to replication >> clients. GUC_REPORT_REPLICATION? Some settings would have both GUC_REPORT >> and GUC_REPORT_REPLICATION, while others would have only one of them. > > Yup, seems like the cleanest solution. Agreed. But when replication_timeout is changed by SIGHUP in the server, there would be a delay before pg_receivexlog receives the parameter change packet and changes the standby message interval based on the new value of replication_timeout. Which might cause unexpected replication timeout. So the user-settable interval (i.e., --statusint) is still useful for people who want to avoid such fragility, even if we will support the auto- tuning of the interval. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center *** a/doc/src/sgml/ref/pg_basebackup.sgml --- b/doc/src/sgml/ref/pg_basebackup.sgml *** *** 335,341 PostgreSQL documentation --statusint=interval ! Specifies the number of seconds between status packets sent back to the server. This is required when streaming the transaction log (using --xlog=stream) if replication timeout is configured on the server, and allows for easier monitoring. The default value is --- 335,341 --statusint=interval ! Specifies the number of milliseconds between status packets sent back to the server. This is required when streaming the transaction log (using --xlog=stream) if replication timeout is configured on the server, and allows for easier monitoring. The default value is *** a/doc/src/sgml/ref/pg_receivexlog.sgml --- b/doc/src/sgml/ref/pg_receivexlog.sgml *** *** 108,114 PostgreSQL documentation --statusint=interval ! 
Specifies the number of seconds between status packets sent back to the server. This is required if replication timeout is configured on the server, and allows for easier monitoring. The default value is 10 seconds. --- 108,114 --statusint=interval ! Specifies the number of milliseconds between status packets sent back to the server. This is required if replication timeout is configured on the server, and allows for easier monitoring. The default value is 10 seconds. *** a/src/bin/pg_basebackup/pg_basebackup.c --- b/src/bin/pg_basebackup/pg_basebackup.c *** *** 46,52 int compresslevel = 0; bool includewal = false; bool streamwal = false; bool fastcheckpoint = false; ! int standby_message_timeout = 10; /* 10 sec = default */ /* Progress counters */ static uint64 totalsize; --- 46,52 bool includewal = false; bool streamwal = false; bool fastcheckpoint = false; ! int standby_message_timeout = 10 * 1000; /* 10 sec = default */ /* Progress counters */ static uint64 totalsize; *** *** 118,124 usage(void) printf(_(" --help show this help
[HACKERS] patch for preventing the specification of conflicting transaction read/write options
Hi, This is regarding the TODO item: "Prevent the specification of conflicting transaction read/write options" listed at: http://wiki.postgresql.org/wiki/Todo The issue is : SET TRANSACTION read only read write read only; The fix involved iteration over transaction_mode_list and checking for duplicate entries. The patch is based on suggestions mentioned in message: http://archives.postgresql.org/pgsql-hackers/2009-01/msg00692.php As per this, the patch does not throw any error for the first test case mentioned above. It throws error only in case of conflicting modes. For ex: postgres=# SET TRANSACTION read only read only; SET postgres=# SET TRANSACTION read only read write; ERROR: conflicting options LINE 1: SET TRANSACTION read only read write; ^ Below are basic unit test logs: postgres=# SET TRANSACTION ISOLATION LEVEL serializable ISOLATION LEVEL serializable; SET postgres=# SET TRANSACTION ISOLATION LEVEL serializable ISOLATION LEVEL repeatable read; ERROR: conflicting options LINE 1: SET TRANSACTION ISOLATION LEVEL serializable ISOLATION LEVEL... postgres=# SET TRANSACTION DEFERRABLE DEFERRABLE; SET postgres=# SET TRANSACTION DEFERRABLE NOT DEFERRABLE; ERROR: conflicting options LINE 1: SET TRANSACTION DEFERRABLE NOT DEFERRABLE; ^ postgres=# START TRANSACTION read only, read only; START TRANSACTION postgres=# rollback; ROLLBACK postgres=# START TRANSACTION read only, read write; ERROR: conflicting options LINE 1: START TRANSACTION read only, read write; postgres=# rollback; ROLLBACK ^ postgres=# BEGIN TRANSACTION ISOLATION LEVEL serializable, ISOLATION LEVEL serializable; BEGIN postgres=# rollback; ROLLBACK postgres=# BEGIN TRANSACTION ISOLATION LEVEL serializable, ISOLATION LEVEL repeatable read; ERROR: conflicting options LINE 1: BEGIN TRANSACTION ISOLATION LEVEL serializable, ISOLATION LE... ^ Please pass on any inputs on the patch. Regards, Chetan -- EnterpriseDB Corporation The Enterprise PostgreSQL Company Website: www.enterprisedb.com EnterpriseDB Blog : http://blogs.enterprisedb.com Follow us on Twitter : http://www.twitter.com/enterprisedb diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 62fde67..28c987f 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -146,6 +146,9 @@ static void processCASbits(int cas_bits, int location, const char *constrType, bool *deferrable, bool *initdeferred, bool *not_valid, core_yyscan_t yyscanner); +static void +check_trans_mode(List *list, DefElem *elem, core_yyscan_t yyscanner); + %} %pure-parser @@ -7510,9 +7513,16 @@ transaction_mode_list: transaction_mode_item { $$ = list_make1($1); } | transaction_mode_list ',' transaction_mode_item - { $$ = lappend($1, $3); } + { + check_trans_mode((List *)$1, (DefElem *)$3, yyscanner); + $$ = lappend($1, $3); + } + | transaction_mode_list transaction_mode_item - { $$ = lappend($1, $2); } + { + check_trans_mode((List *)$1, (DefElem *)$2, yyscanner); + $$ = lappend($1, $2); + } ; transaction_mode_list_or_empty: @@ -13215,6 +13225,57 @@ parser_init(base_yy_extra_type *yyext) } /* + * checks for conflicting transaction modes by looking up current + * element in the given list. + */ +static void +check_trans_mode(List *list, DefElem *elem, core_yyscan_t yyscanner) +{ + ListCell *lc; + A_Const *elem_arg; + + elem_arg =(A_Const *) elem->arg; + + /* cannot specify conflicting value for transaction mode. 
*/ + foreach (lc, list) + { + DefElem *next; + A_Const *arg; + + next = (DefElem*)lfirst (lc); + arg = (A_Const *)next->arg; + + /* check for duplicate value in remaining list */ + if (strcmp(elem->defname, next->defname) == 0) + { + /* + * isolation level values are stored + * as string whereas other modes are + * stored as integer values. + */ + if (strcmp(elem->defname,"transaction_isolation") == 0) + { +/* check for conflicting values */ +if (strcmp(elem_arg->val.val.str,arg->val.val.str)!= 0) + ereport(ERROR, +(errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting options"), + parser_errposition(arg->location))); + } + else + { +/* check for conflicting values */ +if (elem_arg->val.val.ival != arg->val.val.ival) + ereport(ERROR, +(errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting options"), + parser_errposition(arg->location))); + } + } + } +} + +/* * Must undefine this stuff before including scan.c, since it has different * definitions for these macros. */ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Command Triggers
On Thu, Jan 26, 2012 at 10:00 PM, Dimitri Fontaine wrote: > Dimitri Fontaine writes: >>> Really I think there is not any single point where you can put the >>> command-trigger hook and be done. In almost every case, the right >>> place is going to be buried somewhere within the execution of the >>> command. >> >> I'm finally realizing it. I already introduced a structure called >> CommandContextData to keep track of the current command elements we want >> to pass down to command triggers, I think we're saying that this should >> be a static variable that commands will need to be editing. > > In fact passing it as an argument to the command trigger API is much > simpler and done in the attached. I'm having problems here with my > install and not enough time this week (you don't speak English if you > don't use understatements here and there, right?) so please expect a > one-step-further patch to show the integration concept, not a ready for > commit one just yet. > > Next steps are about adding support for more commands, and now that we > have settled on a simpler integration that will be easy. The parameters > sent to the trigger procedure are now the command tag, the main object > Oid, its schema name and its object name. Only the command tag will > never be NULL, all the other columns could be left out when calling an > ANY COMMAND trigger, or a command on a schemaless object. > > Note: the per-command integration means the Oid is generally available, > so let's just export it. > > An ack about the way it's now implemented would be awesome, and we could > begin to start about which set of command exactly we want supported from > the get go (default to all of them of course, but well, I don't think > that's necessarily the best use of our time given our command list). Looks good so far. A user centric review of this patch... Would like a way to say ALL COMMANDS rather than write out list of supported commands. That list is presumably subject to change, so not having this could result in bugs in later code. The example given in the docs has no explanation. More examples and explanation please. Would specifically be interested in a command trigger which simply prevents all commands, which has many uses and is simple enough to be in docs. Why do we exclude SEQUENCEs? That could well be a problem for intended users Please explain/add to docs/comments why ALTER VIEW is not supported? Why not other commands?? Not a problem, but users will ask, so we should write down the answer Please document why pg_cmdtriggers is a separate table and not enhancement of pg_triggers? Thanks -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for implementing SPI_gettypemod()
On Tue, Feb 7, 2012 at 5:44 PM, Kevin Grittner wrote: > "Kevin Grittner" wrote: > > > moved this to "Replication and Recovery" > > Oh, that was a different patch -- I didn't see yours. > > (It's early, and the caffeine isn't working yet.) > > Anyway, you should have plenty of options now. > > -Kevin > > Thanks Kevin :) -- EnterpriseDB Corporation The Enterprise PostgreSQL Company Website: www.enterprisedb.com EnterpriseDB Blog : http://blogs.enterprisedb.com Follow us on Twitter : http://www.twitter.com/enterprisedb
Re: [HACKERS] patch for implementing SPI_gettypemod()
"Kevin Grittner" wrote: > moved this to "Replication and Recovery" Oh, that was a different patch -- I didn't see yours. (It's early, and the caffeine isn't working yet.) Anyway, you should have plenty of options now. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] patch for implementing SPI_gettypemod()
Chetan Suttraway wrote: > Robert Haas wrote: >> Please add this to the next CommitFest: >> >> https://commitfest.postgresql.org/action/commitfest_view/open > At the given link, I am able to choose only "System administration" > under commitfest topic. > I think there has to be "server features" or "Miscellaneous". I created all the topics from the last CF, in the same order, and moved this to "Replication and Recovery" -- it is about the ability to recover if the OS crashes around the time initdb completes, right? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] controlling the location of server-side SSL files
On tis, 2012-01-24 at 22:05 +0200, Peter Eisentraut wrote: > > > One thing that is perhaps worth thinking about: Currently, we just > > > ignore missing root.crt and root.crl files. With this patch, we still > > > do this, even if the user has given a specific nondefault location. > > > That seems a bit odd, but I can't think of a simple way to do it better. > > > > There's a review in the CF app for this finding only minor issues, so > > I'm marking this patch therein as "Ready for Committer". > > OK, no one had any concerns about the missing file behavior I > described above? If not, then I'll commit it soon. I'm still worried about this. If we ignore a missing root.crt, then the effect is that authentication and certificate verification might fail, which would be annoying, but you'd notice it soon enough. But if we ignore a missing root.crl, we are creating a security hole. My best idea at the moment is that we should set these parameters to empty by default, and make users point them to existing files if they want to use that functionality. Comments? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
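To make the trade-off concrete, here is a minimal sketch of what "error out when an explicitly configured CRL file cannot be loaded" could look like. The parameter name ssl_crl_file and the loader load_crl_file() are assumptions for illustration only, not the actual be-secure.c code; the point is just that an empty setting disables the feature, while a non-empty setting that cannot be honored fails loudly instead of silently skipping revocation checks.

/* Sketch only: parameter and loader names are illustrative. */
if (ssl_crl_file != NULL && ssl_crl_file[0] != '\0')
{
    /* the user explicitly asked for CRL checking: a missing file is an error */
    if (!load_crl_file(ssl_crl_file))       /* hypothetical loader */
        ereport(FATAL,
                (errmsg("could not load SSL certificate revocation list file \"%s\"",
                        ssl_crl_file)));
}
else
{
    /* empty (the proposed default): CRL checking is simply not used */
}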
[HACKERS] pg_basebackup -x stream from the standby gets stuck
Hi, http://www.depesz.com/2012/02/03/waiting-for-9-2-pg_basebackup-from-slave/ > =$ time pg_basebackup -D /home/pgdba/slave2/ -F p -x stream -c fast -P -v -h > 127.0.0.1 -p 5921 -U replication > xlog start point: 2/AC4E2600 > pg_basebackup: starting background WAL receiver > 692447/692447 kB (100%), 1/1 tablespace > xlog end point: 2/AC4E2600 > pg_basebackup: waiting for background process to finish streaming... > pg_basebackup: base backup completed > > real 3m56.237s > user 0m0.224s > sys 0m0.936s > > (time is long because this is only test database with no traffic, so I had to > make some inserts for it to finish) The above article points out a problem with pg_basebackup from the standby: when "-x stream" is specified, pg_basebackup from the standby gets stuck if there is no traffic in the database. When "-x stream" is specified, pg_basebackup forks a background process to receive WAL records during the backup, takes an online backup, and waits for the background process to end. The forked background process keeps receiving WAL records, and whenever it reaches the end of a WAL file it checks whether it has already received all the WAL files required for the backup, and exits if so. This means that at least one WAL segment switch is required for pg_basebackup with the "-x stream" option to finish. In a backup from the master, a WAL file switch always occurs at both the start and the end of the backup (i.e., in do_pg_start_backup() and do_pg_stop_backup()), so the above logic works fine even if there is no traffic. OTOH, in a backup from the standby, no WAL file switch is performed at all while there is no traffic. In that case the background process never reaches the end of a WAL file, never checks whether all the required WAL has arrived, and never exits. As a result, pg_basebackup gets stuck. To fix the problem, I propose changing the background process so that it checks whether all the required WAL has arrived every time data is received, even if the end of a WAL file has not been reached. Patch attached. Comments? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center *** a/src/bin/pg_basebackup/pg_basebackup.c --- b/src/bin/pg_basebackup/pg_basebackup.c *** *** 78,84 static void ReceiveTarFile(PGconn *conn, PGresult *res, int rownum); static void ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum); static void BaseBackup(void); ! static bool segment_callback(XLogRecPtr segendpos, uint32 timeline); #ifdef HAVE_LIBZ static const char * --- 78,84 static void ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res, int rownum); static void BaseBackup(void); ! static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline); #ifdef HAVE_LIBZ static const char * *** *** 129,136 usage(void) /* ! * Called in the background process whenever a complete segment of WAL ! * has been received. * On Unix, we check to see if there is any data on our pipe * (which would mean we have a stop position), and if it is, check if * it is time to stop. --- 129,137 /* ! * Called in the background process every time data is received. ! * Also called when the streaming stops to check whether ! * the current log segment can be treated as a complete one. * On Unix, we check to see if there is any data on our pipe * (which would mean we have a stop position), and if it is, check if * it is time to stop. *** *** 138,144 usage(void) * time to stop. */ static bool ! segment_callback(XLogRecPtr segendpos, uint32 timeline) { if (!has_xlogendptr) { --- 139,145 * time to stop. */ static bool ! 
reached_end_position(XLogRecPtr segendpos, uint32 timeline) { if (!has_xlogendptr) { *** *** 231,237 LogStreamerMain(logstreamer_param * param) { if (!ReceiveXlogStream(param->bgconn, param->startptr, param->timeline, param->sysidentifier, param->xlogdir, ! segment_callback, NULL, standby_message_timeout)) /* * Any errors will already have been reported in the function process, --- 232,238 { if (!ReceiveXlogStream(param->bgconn, param->startptr, param->timeline, param->sysidentifier, param->xlogdir, ! reached_end_position, reached_end_position, standby_message_timeout)) /* * Any errors will already have been reported in the function process, *** a/src/bin/pg_basebackup/pg_receivexlog.c --- b/src/bin/pg_basebackup/pg_receivexlog.c *** *** 71,77 usage(void) static bool segment_callback(XLogRecPtr segendpos, uint32 timeline) { ! if (verbose) fprintf(stderr, _("%s: finished segment at %X/%X (timeline %u)\n"), progname, segendpos.xlogid, segendpos.xrecoff, timeline); --- 71,77 static bool segment_callback(XLogRecPtr segendpos, uint32 timeline) { ! if (verbose && segendpos.xrecoff % XLOG_SEG_SIZE == 0) fprin
Re: [HACKERS] Assertion failure in AtCleanup_Portals
On Tue, Feb 7, 2012 at 3:03 AM, Tom Lane wrote: > Pavan Deolasee writes: >> If I run the following sequence of commands, I get an assertion >> failure in current HEAD. > >> postgres=# BEGIN; >> BEGIN >> postgres=# SELECT 1/0; >> ERROR: division by zero >> postgres=# ROLLBACK TO A; >> ERROR: no such savepoint >> postgres=# \q > >> The process fails when the session is closed and aborted transaction >> gets cleaned at the proc_exit time. The stack trace is as below. > > Hmm. It works fine if you issue an actual ROLLBACK command there, > so my gut reaction is that AbortOutOfAnyTransaction isn't sufficiently > duplicating the full-fledged ROLLBACK code path. No time to dig further > right now though. > It works OK for a ROLLBACK command because we create a new unnamed portal for the ROLLBACK command, silently dropping the old one if it already exists. Since the ROLLBACK command then runs successfully, we don't see the same assertion. Would it be safe to drop FAILED unnamed portals during AbortOutOfAnyTransaction? Maybe it is, if we can do that before creating a new portal for the ROLLBACK command itself. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
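A minimal sketch of the idea being asked about, using the existing portal API (GetPortalByName, PortalIsValid, PORTAL_FAILED, PortalDrop); this is only an illustration of the question, not the fix that was eventually committed:

#include "postgres.h"
#include "utils/portal.h"

/*
 * Sketch: drop a leftover unnamed portal that ended up in the failed
 * state, so that neither AbortOutOfAnyTransaction at proc_exit nor the
 * creation of a new unnamed portal for a later ROLLBACK trips over it.
 */
static void
DropFailedUnnamedPortal(void)
{
    Portal      portal = GetPortalByName("");   /* unnamed portal, if any */

    if (PortalIsValid(portal) && portal->status == PORTAL_FAILED)
        PortalDrop(portal, false);
}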
Re: [HACKERS] psql NUL record and field separator
On tor, 2012-01-26 at 19:00 +0530, Abhijit Menon-Sen wrote: > At issue are (at least) these three lines from print_unaligned_text in > src/bin/psql/print.c: > > 358 /* the last record needs to be concluded with a newline > */ > 359 if (need_recordsep) > 360 fputc('\n', fout); > > Perhaps the right thing to do would be to change this to output \0 if > --record-separator-zero was used (but leave it at \n otherwise)? That > is what my second attached patch does: > > $ bin/psql --record-separator-zero --field-separator-zero -At -c > 'select 1,2 union select 3,4'|xargs -0 echo > 1 2 3 4 > > Thoughts? > > > I think the most common use of this would be to set the record > > separator from the command line, so we could use a short option > > such as -0 or -z for that. > > I agree. The current option names are very unwieldy to type. > I have incorporated your two patches and added short options. Updated patch attached. This made me wonder, however. The existing -F and -R options set the record *separator*. The new options, however, set the record *terminator*. This is the small distinction that you had discovered. Should we rename the options and/or add that to the documentation, or is the new behavior obvious and any new terminology would be too confusing? diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index a9b1ed2..55aa5f2 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -482,6 +482,27 @@ PostgreSQL documentation + + -z + --field-separator-zero + + + Set the field separator for unaligned output to a zero byte. + + + + + + -0 + --record-separator-zero + + + Set the record separator for unaligned output to a zero byte. This is + useful for interfacing, for example, with xargs -0. + + + + -1 --single-transaction @@ -1909,6 +1930,16 @@ lo_import 152801 + fieldsep_zero + + + Sets the field separator to use in unaligned output format to a zero + byte. + + + + + footer @@ -2078,6 +2109,16 @@ lo_import 152801 + recordsep_zero + + + Sets the record separator to use in unaligned output format to a zero + byte. 
+ + + + + tableattr (or T) diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index ab809ec..8421ad0 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -2272,11 +2272,26 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet) { if (value) { - free(popt->topt.fieldSep); - popt->topt.fieldSep = pg_strdup(value); + free(popt->topt.fieldSep.separator); + popt->topt.fieldSep.separator = pg_strdup(value); + popt->topt.fieldSep.separator_zero = false; } if (!quiet) - printf(_("Field separator is \"%s\".\n"), popt->topt.fieldSep); + { + if (popt->topt.fieldSep.separator_zero) +printf(_("Field separator is zero byte.\n")); + else +printf(_("Field separator is \"%s\".\n"), popt->topt.fieldSep.separator); + } + } + + else if (strcmp(param, "fieldsep_zero") == 0) + { + free(popt->topt.fieldSep.separator); + popt->topt.fieldSep.separator = NULL; + popt->topt.fieldSep.separator_zero = true; + if (!quiet) + printf(_("Field separator is zero byte.\n")); } /* record separator for unaligned text */ @@ -2284,18 +2299,30 @@ do_pset(const char *param, const char *value, printQueryOpt *popt, bool quiet) { if (value) { - free(popt->topt.recordSep); - popt->topt.recordSep = pg_strdup(value); + free(popt->topt.recordSep.separator); + popt->topt.recordSep.separator = pg_strdup(value); + popt->topt.recordSep.separator_zero = false; } if (!quiet) { - if (strcmp(popt->topt.recordSep, "\n") == 0) + if (popt->topt.recordSep.separator_zero) +printf(_("Record separator is zero byte.\n")); + else if (strcmp(popt->topt.recordSep.separator, "\n") == 0) printf(_("Record separator is .")); else -printf(_("Record separator is \"%s\".\n"), popt->topt.recordSep); +printf(_("Record separator is \"%s\".\n"), popt->topt.recordSep.separator); } } + else if (strcmp(param, "recordsep_zero") == 0) + { + free(popt->topt.recordSep.separator); + popt->topt.recordSep.separator = NULL; + popt->topt.recordSep.separator_zero = true; + if (!quiet) + printf(_("Record separator is zero byte.\n")); + } + /* toggle between full and tuples-only format */ else if (strcmp(param, "t") == 0 || strcmp(param, "tuples_only") == 0) { diff --git a/src/bin/psql/help.c b/src/bin/psql/help.c index 172fd0c..eff0ea5 100644 --- a/src/bin/psql/help.c +++ b/src/bin/psql/help.c @@ -123,6
Re: [HACKERS] Speed dblink using alternate libpq tuple storage
(2012/02/02 23:30), Marko Kreen wrote: > On Thu, Feb 02, 2012 at 04:51:37PM +0900, Kyotaro HORIGUCHI wrote: >> Hello, This is new version of dblink.c >> >> - Memory is properly freed when realloc returns NULL in storeHandler(). >> >> - The bug that free() in finishStoreInfo() will be fed with >> garbage pointer when malloc for sinfo->valbuflen fails is >> fixed. > > Thanks, merged. I also did some minor coding style cleanups. > > Tagging it Ready For Committer. I don't see any notable > issues with the patch anymore. > > There is some potential for experimenting with more aggressive > optimizations on dblink side, but I'd like to get a nod from > a committer for libpq changes first. I tried to use this feature in the pgsql_fdw patch, and found some small issues. - Typos - mes -> msg - funcion -> function - overritten -> overwritten - costom -> custom - What is the right (or recommended) way to prevent throwing an exception in the row-processor callback function? When should the author use a PG_TRY block to catch exceptions in the callback function? - It would be better to describe clearly how to determine whether a column result is NULL. Perhaps the result value is NULL when PGrowValue.len is less than 0, right? Regards, -- Shigeru Hanada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
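On the NULL question, a tiny sketch of the convention as described above; the struct layout and callback shape here are assumptions for illustration (the proposed API may differ), the only point being that a negative len marks a NULL column:

/* Assumed layout, for illustration only -- see the actual patch. */
typedef struct
{
    int         len;            /* assumed: < 0 means the column is NULL */
    char       *value;          /* assumed: len bytes, not NUL-terminated */
} ExampleRowValue;

static void
example_handle_row(ExampleRowValue *columns, int ncolumns)
{
    int         i;

    for (i = 0; i < ncolumns; i++)
    {
        if (columns[i].len < 0)
            ;                   /* treat column i as SQL NULL */
        else
            ;                   /* copy columns[i].len bytes from columns[i].value */
    }
}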
Re: [HACKERS] incorrect handling of the timeout in pg_receivexlog
On Tue, Feb 7, 2012 at 10:55, Heikki Linnakangas wrote: > On 07.02.2012 11:35, Magnus Hagander wrote: >> >> On Tue, Feb 7, 2012 at 10:31, Heikki Linnakangas >> wrote: >>> >>> So, --statusint needs to be in milliseconds. And while we're at it, how >>> >>> difficult would be to ask the server for the current value of >>> replication_timeout, and set --statusint automatically based on that? Or >>> perhaps mark replication_timeout as GUC_REPORT. It is rather fiddly that >>> depending on a server setting, you need to pass an option in the client >>> or >>> it will just silently fail with no indication of what the problem is. >> >> >> We can't really ask for it easily, since we're on a replication >> connection. Unless we add that to the walsender grammar, but that >> would make it impossible to use the receivexlog stuff against a 9.1 >> server (which I think still works, though I haven't tested it in a >> while). > > > You could put a version-check there, and only send the command if connected > to a 9.2 server. True.. >> Do we have a facility to make it a GUC_REPORT but only for walsender >> connections? > > > There's no such facility at the moment. > > >> It seems like a very unnecessary thing to have it sent out over every >> single connection, since it would only be useful in a very small >> subset of them. > > > True, and conversely, many of the current GUC_REPORT settings don't apply to > replication clients at all. Like standard_conforming_strings and DateStyle. Right. But it's less important there, since the replication connections will in any reasonable configuration be only a tiny part of the connections. > I think we need another flag for settings that should be sent to replication > clients. GUC_REPORT_REPLICATION? Some settings would have both GUC_REPORT > and GUC_REPORT_REPLICATION, while others would have only one of them. Yup, seems like the cleanest solution. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] incorrect handling of the timeout in pg_receivexlog
On 07.02.2012 11:35, Magnus Hagander wrote: On Tue, Feb 7, 2012 at 10:31, Heikki Linnakangas wrote: So, --statusint needs to be in milliseconds. And while we're at it, how difficult would be to ask the server for the current value of replication_timeout, and set --statusint automatically based on that? Or perhaps mark replication_timeout as GUC_REPORT. It is rather fiddly that depending on a server setting, you need to pass an option in the client or it will just silently fail with no indication of what the problem is. We can't really ask for it easily, since we're on a replication connection. Unless we add that to the walsender grammar, but that would make it impossible to use the receivexlog stuff against a 9.1 server (which I think still works, though I haven't tested it in a while). You could put a version-check there, and only send the command if connected to a 9.2 server. Do we have a facility to make it a GUC_REPORT but only for walsender connections? There's no such facility at the moment. It seems like a very unnecessary thing to have it sent out over every single connection, since it would only be useful in a very small subset of them. True, and conversely, many of the current GUC_REPORT settings don't apply to replication clients at all. Like standard_conforming_strings and DateStyle. I think we need another flag for settings that should be sent to replication clients. GUC_REPORT_REPLICATION? Some settings would have both GUC_REPORT and GUC_REPORT_REPLICATION, while others would have only one of them. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] incorrect handling of the timeout in pg_receivexlog
On Tue, Feb 7, 2012 at 10:31, Heikki Linnakangas wrote: > On 07.02.2012 09:03, Fujii Masao wrote: >> >> On Tue, Feb 7, 2012 at 2:58 PM, Fujii Masao wrote: >>> >>> When I compiled HEAD with --disable-integer-datetimes and tested >>> >>> pg_receivexlog, I encountered unexpected replication timeout. As >>> far as I read the pg_receivexlog code, the cause of this problem is >>> that pg_receivexlog handles the standby message timeout incorrectly >>> in --disable-integer-datetimes. The attached patch fixes this problem. >>> Comments? >> >> >> receivelog.c >> --- >> timeout.tv_sec = last_status + standby_message_timeout - now - 1; >> if (timeout.tv_sec<= 0) >> --- >> >> Umm.. the above code also handles the timestamp incorrectly. ISTM that the >> root cause of these problems is that receivelog.c uses TimestampTz. > > > Yep. While localGetCurrentTimestamp() returns a TimestampTz and handles > float timestamps correctly, the caller just assigns the result to a int64 > variable, assuming --enable-integer-datetimes. Ugh. Indeed. >> What about changing receivelog.c so that it uses time_t instead of >> TimestampTz? Which would make the code simpler, I think. > > > Hmm, that would reduce granularity to seconds. The --statusint option is > given in seconds, but it would be good to have more precision in the > calculations to avoid rounding errors. > > But actually, if the purpose of the --statusint option is to avoid > disconnection because of exceeding the server's replication_timeout, one > second granularity just isn't enough to be begin with. replication_timeout > is given in milliseconds, so if you set replication_timeout=900ms in the > server, there is no way to make pg_basebackup/pg_receivexlog to reply in > time. > > So, --statusint needs to be in milliseconds. And while we're at it, how > difficult would be to ask the server for the current value of > replication_timeout, and set --statusint automatically based on that? Or > perhaps mark replication_timeout as GUC_REPORT. It is rather fiddly that > depending on a server setting, you need to pass an option in the client or > it will just silently fail with no indication of what the problem is. We can't really ask for it easily, since we're on a replication connection. Unless we add that to the walsender grammar, but that would make it impossible to use the receivexlog stuff against a 9.1 server (which I think still works, though I haven't tested it in a while). Do we have a facility to make it a GUC_REPORT but only for walsender connections? It seems like a very unnecessary thing to have it sent out over every single connection, since it would only be useful in a very small subset of them. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] incorrect handling of the timeout in pg_receivexlog
On 07.02.2012 09:03, Fujii Masao wrote: On Tue, Feb 7, 2012 at 2:58 PM, Fujii Masao wrote: When I compiled HEAD with --disable-integer-datetimes and tested pg_receivexlog, I encountered unexpected replication timeout. As far as I read the pg_receivexlog code, the cause of this problem is that pg_receivexlog handles the standby message timeout incorrectly in --disable-integer-datetimes. The attached patch fixes this problem. Comments? receivelog.c --- timeout.tv_sec = last_status + standby_message_timeout - now - 1; if (timeout.tv_sec<= 0) --- Umm.. the above code also handles the timestamp incorrectly. ISTM that the root cause of these problems is that receivelog.c uses TimestampTz. Yep. While localGetCurrentTimestamp() returns a TimestampTz and handles float timestamps correctly, the caller just assigns the result to a int64 variable, assuming --enable-integer-datetimes. What about changing receivelog.c so that it uses time_t instead of TimestampTz? Which would make the code simpler, I think. Hmm, that would reduce granularity to seconds. The --statusint option is given in seconds, but it would be good to have more precision in the calculations to avoid rounding errors. But actually, if the purpose of the --statusint option is to avoid disconnection because of exceeding the server's replication_timeout, one second granularity just isn't enough to be begin with. replication_timeout is given in milliseconds, so if you set replication_timeout=900ms in the server, there is no way to make pg_basebackup/pg_receivexlog to reply in time. So, --statusint needs to be in milliseconds. And while we're at it, how difficult would be to ask the server for the current value of replication_timeout, and set --statusint automatically based on that? Or perhaps mark replication_timeout as GUC_REPORT. It is rather fiddly that depending on a server setting, you need to pass an option in the client or it will just silently fail with no indication of what the problem is. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
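For what it's worth, once the interval is expressed in milliseconds the client-side arithmetic does not need TimestampTz at all. A minimal sketch (names are illustrative, not the actual receivelog.c code) of computing a select() timeout from plain gettimeofday() values, which behaves identically under --enable- and --disable-integer-datetimes:

#include <sys/time.h>

/*
 * Fill *timeout with the time remaining until the next status packet
 * is due, given when the previous one was sent and the interval in
 * milliseconds.
 */
static void
set_status_timeout(struct timeval *timeout,
                   const struct timeval *last_status,
                   int standby_message_timeout_ms)
{
    struct timeval now;
    long        elapsed_ms;
    long        remaining_ms;

    gettimeofday(&now, NULL);
    elapsed_ms = (now.tv_sec - last_status->tv_sec) * 1000L +
                 (now.tv_usec - last_status->tv_usec) / 1000L;

    remaining_ms = standby_message_timeout_ms - elapsed_ms;
    if (remaining_ms < 0)
        remaining_ms = 0;

    timeout->tv_sec = remaining_ms / 1000L;
    timeout->tv_usec = (remaining_ms % 1000L) * 1000L;
}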
Re: [HACKERS] LWLockWaitUntilFree (was: Group commit, revised)
On 03.02.2012 22:57, Robert Haas wrote: I think I recommended a bad name for this function. It's really LWLockAcquireOrWaitUntilFree. Can we rename that? Agreed, that's better. Although quite long. "LWLockAcquireOrWait" perhaps? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
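Whatever the final name, the calling pattern the primitive enables may be worth spelling out. A minimal sketch of how a group-commit style caller would use it (the recheck helper is hypothetical, and this is not the actual XLogFlush code):

#include "postgres.h"
#include "storage/lwlock.h"

/* hypothetical: reports whether another backend already flushed our record */
extern bool our_record_already_flushed(void);

static void
flush_with_acquire_or_wait(void)
{
    if (!LWLockAcquireOrWait(WALWriteLock, LW_EXCLUSIVE))
    {
        /*
         * The lock was busy; we slept until it was released but did not
         * acquire it.  The holder may have flushed our record for us,
         * so recheck before doing any work ourselves.
         */
        if (our_record_already_flushed())
            return;
        LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
    }

    /* ... write and fsync WAL here ... */

    LWLockRelease(WALWriteLock);
}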
Re: [HACKERS] Group commit, revised
On 04.02.2012 07:24, Jeff Janes wrote: Is it safe to assume that, under "#ifdef LWLOCK_STATS", a call to LWLockAcquire will always precede any calls to LWLockWaitUntilFree when a new process is started, to calloc the stats assays? > I guess it is right now, because the only user is WALWrite, which would never be acquired before WALInsert is at least once. But this doesn't seem very future proof. Agreed, we can't count on that. There's no clear single point after a process startup where the first lwlock is acquired. Out of curiosity, I added an elog(LOG, ...) to that initialization code, to log which lwlock is acquired first in a process. It depends on the process and circumstances - here's the list I got: BufFreeListLock ShmemIndexLock XidGenLock ProcArrayLock BgWriterCommLock AutoVacuumLock And that's probably not all, I bet you would acquire different locks first with recovery, streaming replication etc.. I didn't test those. Anyway, I added the initialization to LWLockWaitUntilFree now. Thanks! I guess the same complain could be logged against LWLockConditionalAcquire. LWLockConditionalAcquire doesn't update the stats, so it's ok. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
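One way to keep this future proof is lazy per-process initialization guarded in every entry point that can be the first lwlock call in a process. A rough sketch with made-up names (the real LWLOCK_STATS code keeps several counter arrays, not just one), called at the top of LWLockAcquire and LWLockAcquireOrWait alike; LWLockConditionalAcquire can keep skipping it since it does not touch the stats:

#ifdef LWLOCK_STATS
#include <stdlib.h>

static int *lwstats_block_counts = NULL;    /* illustrative single array */

static void
lwstats_init_if_needed(void)
{
    if (lwstats_block_counts == NULL)
        lwstats_block_counts = calloc(NumLWLocks(), sizeof(int));
}
#endif   /* LWLOCK_STATS */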