Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Greg Smith

Robert Haas wrote:

What is the basis for thinking that the sync should get the same
amount of time as the writes?  That seems pretty arbitrary.  Right
now, you're allowing 3 seconds per fsync, which could be a lot more or
a lot less than 40% of the total checkpoint time...


Just that it's where I ended up when fighting with this for a month 
on the system I've seen the most problems at.  The 3 second number was 
reverse-engineered from a computation that said "aim for an interval of 
X minutes; we have Y relations on average involved in the checkpoint".  
The direction my latest patch is struggling to go is computing a 
reasonable time automatically in the same way--count the relations, do a 
time estimate, add enough delay so the sync calls should be spread 
linearly over the given time range.
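As a sketch, the pacing amounts to something like this (names are made up
and the usual backend includes are omitted; the real code has to get the
relation count from the fsync request queue):

/* Illustrative only: spread nrels fsync calls evenly over sync_window_secs. */
static void
spread_syncs(int nrels, double sync_window_secs, void (*sync_one)(int))
{
    double  delay_secs = (nrels > 0) ? sync_window_secs / nrels : 0.0;
    int     i;

    for (i = 0; i < nrels; i++)
    {
        sync_one(i);                              /* fsync one pending file */
        pg_usleep((long) (delay_secs * 1000000)); /* nap before the next one */
    }
}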




the checkpoint activity is always going to be spikey if it does
anything at all, so spacing it out *more* isn't obviously useful.
  


One of the components to the write queue is some notion that writes that 
have been waiting longest should eventually be flushed out.  Linux has 
this number called dirty_expire_centisecs which suggests it enforces 
just that, set to a default of 30 seconds.  This is why some 5-minute 
interval checkpoints with default parameters, effectively spreading the 
checkpoint over 2.5 minutes, can work under the current design.  
Anything you wrote at T+0 to T+2:00 *should* have been written out 
already when you reach T+2:30 and sync.  Unfortunately, when the system 
gets busy, there is this "congestion control" logic that basically 
throws out any guarantee of writes starting shortly after the expiration 
time.
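The arithmetic behind that example, spelled out as a trivial sketch (the
values are just the defaults discussed above):

/* Sketch of the timing in the example above. */
#include <stdio.h>

int
main(void)
{
    double  checkpoint_timeout = 300;   /* 5 minute checkpoints */
    double  completion_target  = 0.5;   /* writes spread over 2.5 minutes */
    double  dirty_expire       = 30;    /* dirty_expire_centisecs / 100 */
    double  sync_at     = checkpoint_timeout * completion_target;  /* T+2:30 */
    double  safe_before = sync_at - dirty_expire;                   /* T+2:00 */

    printf("writes issued before T+%.0fs should already be on disk at T+%.0fs\n",
           safe_before, sync_at);
    return 0;
}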


It turns out that the only things that really work are the tunables that 
block new writes from happening once the queue is full, but they can't 
be set low enough to work well in earlier kernels when combined with 
lots of RAM.  Using the terminology of 
http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt, at some point 
"a process generating disk writes will itself 
start writeback."  This is analogous to the PostgreSQL situation where 
backends do their own fsync calls.  The kernel will eventually move to 
where those trying to write new data are instead recruited into being 
additional sources of write flushing.  That's the part you just can't 
make aggressive enough on older kernels; dirty writers can always win.  
Ideally, the system never digs itself into a hole larger than you can 
afford to wait to write out.  It's a transaction speed vs. latency thing 
though, and the older kernels just don't consider the latency side well 
enough.


There is a new mechanism in the latest kernels to control this much 
better:  dirty_bytes and dirty_background_bytes are the tunables.  I 
haven't had a chance to test them yet.  As mentioned upthread, some of the 
bleeding-edge kernels that have this feature available are showing 
such large general performance regressions in our tests, compared to the 
boring old RHEL5 kernel, that whether this feature works or not is 
irrelevant.  I haven't yet tracked down which new kernel distributions 
work well performance-wise for PostgreSQL and which don't.


I'm hoping that when I get there, I'll see results like 
http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages 
, where the ideal setting for dirty_bytes  to keep latency under control 
with BBWC was 15MB.  To put that into perspective, the lowest useful 
setting you can set dirty_ratio to is 5% of RAM.  That's 410MB on my 
measly 8GB desktop, and 3.3GB on the 64GB production server I've been 
trying to tune.
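When I do get to testing it, pinning the value down for a run should be as
simple as this sketch (standard procfs path, run as root; 15MB being the
value from that report, and equivalent to echoing 15728640 into the file):

/* Sketch: set vm.dirty_bytes for a test run. */
#include <stdio.h>

int
main(void)
{
    FILE   *f = fopen("/proc/sys/vm/dirty_bytes", "w");

    if (f == NULL)
    {
        perror("dirty_bytes not available on this kernel");
        return 1;
    }
    fprintf(f, "%ld\n", 15L * 1024 * 1024);
    fclose(f);
    return 0;
}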


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books




Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 5:57 PM, Greg Smith  wrote:
> I was just giving an example of how I might do an initial split.  There's a
> checkpoint happening now at time T; we have a rough idea that it needs to be
> finished before some upcoming time T+D.  Currently with default parameters
> this becomes:
>
> Write:  0.5 * D; Sync:  0
>
> Even though Sync obviously doesn't take zero.  The slop here is enough that
> it usually works anyway.
>
> I was suggesting that a quick reshuffling to:
>
> Write:  0.4 * D; Sync:  0.4 * D
>
> Might be a good first candidate for how to split the time up better.

What is the basis for thinking that the sync should get the same
amount of time as the writes?  That seems pretty arbitrary.  Right
now, you're allowing 3 seconds per fsync, which could be a lot more or
a lot less than 40% of the total checkpoint time, but I have a pretty
clear sense of why that's a sensible thing to try: you give the rest
of the system a moment or two to get some I/O done for something other
than the checkpoint before flushing the next batch of buffers.  But
the checkpoint activity is always going to be spikey if it does
anything at all, so spacing it out *more* isn't obviously useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Marti Raudsepp
On Sat, Jan 15, 2011 at 14:05, Robert Haas  wrote:
> Idea #4: For ext3 filesystems that like to dump the entire buffer
> cache instead of only the requested file, write a little daemon that
runs alongside of (and completely independently of) PostgreSQL.  Every
> 30 s, it opens a 1-byte file, changes the byte, fsyncs the file, and
> closes the file, thus dumping the cache and preventing a ridiculous
> growth in the amount of data to be sync'd at checkpoint time.
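(That daemon is only a handful of lines; a sketch, with the scratch file
path as a placeholder:)

/* Sketch of idea #4: dribble out an fsync every 30 s so ext3 never
 * accumulates a huge amount of dirty data to dump at checkpoint time. */
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    char    byte = 0;

    for (;;)
    {
        int     fd = open("/var/tmp/flush-marker", O_CREAT | O_WRONLY, 0600);

        if (fd >= 0)
        {
            byte ^= 1;                  /* change the byte each round */
            (void) write(fd, &byte, 1);
            fsync(fd);                  /* on ext3 this flushes far more than one byte */
            close(fd);
        }
        sleep(30);
    }
}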

Wouldn't it be easier to just mount in data=writeback mode? This
provides a similar level of journaling as most other file systems.

Regards,
Marti



Re: [HACKERS] Bug in pg_describe_object, patch v2

2011-01-15 Thread Andreas Karlsson
On Sat, 2011-01-15 at 10:36 -0500, Tom Lane wrote:
> But I can read the handwriting on the wall: if I want this done right,
> I'm going to have to do it myself.
> 
>   regards, tom lane

Do I understand you correctly that what you would like to see is the
same format used now in these cases?

1) For btree and hash, when the lefttype and righttype are the same as
the two function parameters.

2) For GiST and GIN, when there is only one member of the operator family
and the lefttype and righttype are the same as the type of the operator
class that is a member of the operator family.

Those are the default rules in opclasscmds.c as I understood them.

If so, I think I could take a stab at this and see, once done, whether it
looks like it's worth the additional code complexity.

Regards,
Andreas





Re: [HACKERS] Streaming base backups

2011-01-15 Thread Tatsuo Ishii
> When do the standby launch its walreceiver? It would be extra-nice for
> the base backup tool to optionally continue streaming WALs until the
> standby starts doing it itself, so that wal_keep_segments is really
> deprecated.  No idea how feasible that is, though.

Good point. I have always been wondering why we can't use the existing WAL
transport infrastructure for sending/receiving WAL archive
segments in streaming replication.
If my memory serves, Fujii has already proposed such an idea but was
rejected for some reason I don't understand.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: [HACKERS] psql: Add \dL to show languages

2011-01-15 Thread Andreas Karlsson
Hi Josh,

Here is my review of this patch for the commitfest.

Review of https://commitfest.postgresql.org/action/patch_view?id=439

Contents and Purpose
====================

This patch adds the \dL command in psql to list the procedural languages.

To me this seems like a useful addition to the commands in psql and \dL
is also a quite sane name for it which follows the established
conventions.

Documentation of the new command is included and looks good.

Running it
==========

The patch did not apply cleanly against HEAD, so I fixed it.

I tested the commands \dL, \dLS, \dL+, \dLS+ and they worked as I
expected. Support for patterns worked too, and while not that useful it
was not much code either. The psql completion worked fine too.

Some things I noticed when using it though.

I do not like the use of parentheses in the usage description "list
(procedural) languages". Why not have it simply as "list procedural
languages"?

Should we include a column in \dL+ for the laninline function (DO
blocks)?

Performance
===

Quite irrelevant to this patch. :)

Coding
==

In general the code looks good and follows conventions except for some
whitespace errors that I cleaned up.

* Trailing whitespace in src/bin/psql/describe.c.
* Incorrect indentation (spaces vs. tabs) in psql/describe.c and
psql/tab-complete.c.
* Removed empty line after return in listLanguages to match other
functions.

The comment for the function is not that descriptive but it is enough
and fits with the other functions.

Another thing is that the generated SQL needs its formatting fixed up, in my
opinion. I recommend looking at the query built by \dL by using psql
-E. Here is an example of how the query looks for \dL+:

SELECT l.lanname AS "Procedural Language",
   pg_catalog.pg_get_userbyid(l.lanowner) as "Owner",
   l.lanpltrusted AS "Trusted",
   l.lanplcallfoid::regprocedure AS "Call Handler",
   l.lanvalidator::regprocedure AS "Validator",
   NOT l.lanispl AS "System Language",
pg_catalog.array_to_string(l.lanacl, E'\n') AS "Access privileges" FROM 
pg_catalog.pg_language l
 ORDER BY 1;

Conclusion
==

The patch looks useful, worked, and there were no bugs obvious to me.
The code also looks good and in line with other functions doing similar
things. There are just some small issues that I think should be resolved
before committing it: laninline, format of the built query and the usage
string.

Attached is a version that applies to current HEAD and with whitespace
fixed.

Regards,
Andreas

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 5f61561..30d4507 100644
*** a/doc/src/sgml/ref/psql-ref.sgml
--- b/doc/src/sgml/ref/psql-ref.sgml
*** testdb=>
*** 1249,1254 
--- 1249,1269 
  

  
+   
+ \dL[S+] [ pattern ]
+ 
+ 
+ Lists all procedural languages. If pattern
+ is specified, only languages whose names match the pattern are listed.
+ By default, only user-created languages
+ are shown; supply the S modifier to include system
+ objects. If + is appended to the command name, each
+ language is listed with its call handler, validator, access privileges,
+ and whether it is a system object.
+ 
+ 
+   
  

  \dn[S+] [ pattern ]
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 962c13c..301dc11 100644
*** a/src/bin/psql/command.c
--- b/src/bin/psql/command.c
*** exec_command(const char *cmd,
*** 416,421 
--- 416,424 
  			case 'l':
  success = do_lo_list();
  break;
+ 			case 'L':
+ success = listLanguages(pattern, show_verbose, show_system);
+ break;
  			case 'n':
  success = listSchemas(pattern, show_verbose, show_system);
  break;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 205190f..e41453b 100644
*** a/src/bin/psql/describe.c
--- b/src/bin/psql/describe.c
*** listTables(const char *tabtypes, const c
*** 2566,2571 
--- 2566,2635 
  }
  
  
+ /* \dL
+ *
+ * Describes Languages.
+ */
+ bool
+ listLanguages(const char *pattern, bool verbose, bool showSystem)
+ {
+ 	PQExpBufferData buf;
+ 	PGresult *res;
+ 	printQueryOpt myopt = pset.popt;
+ 
+ 	initPQExpBuffer(&buf);
+ 
+ 	printfPQExpBuffer(&buf,
+ 	  "SELECT l.lanname AS \"%s\",\n",
+ 	  gettext_noop("Procedural Language"));
+ 	if (pset.sversion >= 80300)
+ 			appendPQExpBuffer(&buf,
+ 			  "   pg_catalog.pg_get_userbyid(l.lanowner) as \"%s\",\n",
+ 			  gettext_noop("Owner"));
+ 
+ 	appendPQExpBuffer(&buf,
+ 	  "   l.lanpltrusted AS \"%s\"",
+ 	  gettext_noop("Trusted"));
+ 
+ 	if (verbose)
+ 	{
+ 			appendPQExpBuffer(&buf,
+ 			  ",\n   l.lanplcallfoid::regprocedure AS \"%s\",\n"
+ 			  "   l.lanvalidator::regprocedure AS \"%s\",\n"
+ 			  "   NOT l.lanispl AS \"%s\",\n",
+ 			  gettext_noop("Call Handler"),
+ 			  g

[HACKERS] review: FDW API

2011-01-15 Thread Jan Urbański
Hi,

what follows is a review of the FDW API patch from
http://archives.postgresql.org/message-id/20110114212358.82c7.69899...@metrosystems.co.jp

All three patches apply cleanly and compile without warnings. Regression
tests pass.

Let me go patch by patch, starting with the first one that adds the
HANDLER option.

It adds one useless hunk in src/backend/commands/foreigncmds.c (it changes
the order of #includes).

There's a typo in a C comment ("determin which validator to be used").

Other than that, it looks OK.

The third patch just adds a GetForeignTable helper function and it looks OK.

The second patch actually adds the API. First of all, I'd like to say that
it's a really cool piece of code, allowing all kinds of awesome
functionality to be added. I'm already excited by the things that this
will make possible. Congratulations!

To get a feel of the API I wrote a simple FDW wrapper that presents data
from the commitfest RSS feed, based heavily on twitter_fdw by Hitoshi
Harada. Please treat my thoughts as someone's who doesn't really know
*why* the API looks like it does, but has some observations about what
was missing or what felt strange when using it. I guess that's the
position a typical FDW wrapper writer will be in.

First of all, the C comments mention that PlanRelScan should put a tuple
descriptor in FdwPlan, but there's no such field in it. Also, comments
for PlanRelScan talk about the 'attnos' argument, which is not in the
function's signature. I guess the comments are just obsolete and should
be updated. I think it's actually a good thing you don't have to put a
TupleDesc in FdwPlan.

There are two API methods, PlanNative and PlanQuery that are ifdef'd out
using IN_THE_FUTURE. Isn't the standard symbol we use NOT_USED? Also,
the comments say you can implement either PlanRelScan or PlanNative, and
only the former is available for now. If we keep these methods
commented, they should be moved to the end of the struct, so that
uncommenting them will not break compatibility with existing FDWs.

The only documentation a FDW writer has is fdwapi.h, so comments there
need to be top-notch. We might contemplate writing a documentation
chapter explaining how FDW handlers should be written, like we do for C
SRFs and libpq programs, but I guess for now just the headers file would
be enough.

FdwExecutionState is just a struct around a void pointer, can we imagine
adding more fields there? If not, maybe we could just remove the
structure and pass void pointers around? OTOH that gives us some
compiler checking and possibility of extending the struct, so I guess we
could also just leave it like that.

The Iterate method gets passed a TupleTableSlot. Do we really need such
a low-level structure? It makes returning the result easy, because you
just form your tuple and call ExecStoreTuple, but perhaps we could
abstract that away a bit and add a helper method that will take a tuple
and call ExecStoreTuple for us, passing InvalidBuffer and false as the
remaining arguments. Or maybe make Iterate return the tuple and call
ExecStoreTuple internally? I don't have strong opinions, but
TupleTableSlot feels a bit too gutty - I'm having a hard time imagining
what fields from it would be useful for a FDW writer, and so perhaps you
don't need to expose it.

Why does BeginScan accept a ParamListInfo argument? First of all, it
feels like a parsing thing, not a relation scan thing, so perhaps it
should be available at some other, earlier stage. Second of all, what
would it be useful for anyway? Neither file_fdw nor my commitfest_fdw
does anything with it.

We could use comments about how to return tuples from Iterate and how to
finish returning them. I had to look at the example to figure out that
you need ExecClearTuple(slot) in your last Iterate. If Iterate's
interface were to change, we could just return NULL instead of a tuple
to say that we're done.
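For what it's worth, the shape my toy wrapper's Iterate ended up with is
roughly the following; the callback signature, the return convention and the
per-scan struct are my own guesses, so read it as a sketch rather than as
what the patch mandates:

/* Sketch only; usual backend includes and the per-scan state struct omitted. */
static bool
commitfest_iterate(FdwExecutionState *festate, TupleTableSlot *slot)
{
    CommitfestScan *scan = (CommitfestScan *) festate->private_state; /* hypothetical */
    HeapTuple       tuple;

    if (scan->next_row >= scan->nrows)
    {
        ExecClearTuple(slot);   /* the "we're done" signal I had to hunt for */
        return false;
    }

    tuple = heap_form_tuple(slot->tts_tupleDescriptor,
                            scan->values[scan->next_row],
                            scan->nulls[scan->next_row]);
    ExecStoreTuple(tuple, slot, InvalidBuffer, false);
    scan->next_row++;
    return true;
}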

We could be a bit more explicit about how to allocate objects, for
instance if I'm allocating a FdwPlan in my PlanRelScan with a palloc,
will it not go away too soon, or too late (IOW leak)?

I ran into a problem when doing:

select i from generate_series(1, 10) as g(i), pgcommitfest;

when I was trying to check for leaks. It returned four rows
(pgcommitfest is the foreign table that returns four rows). Explain
analyze shows a nested loop with a foreign scan as inner loop. Maybe
it's because I didn't implement ReScan, but the API says I don't have to.

If you don't implement Iterate you get an elog(ERROR).  But if you don't
implement one of the other required methods, you segfault.  That feels
inconsistent.

PlanRelScan looks like something that could use all kinds of information
to come up with a good plan. Maybe we could change its input argument to
a single struct that would contain all the current arguments, so it'll
be easier to extend when people will start writing FDWs and will find
out that they'd like more information available. Doing that would mean
that adding a field in a 

Re: [HACKERS] textarray option for file FDW

2011-01-15 Thread Andrew Dunstan



On 01/15/2011 12:29 PM, Andrew Dunstan wrote:


I've been waiting for the latest FDW patches as patiently as I can, 
and I've been reviewing them this morning, in particular the file_fdw 
patch and how it interacts with the newly exposed COPY API. Overall it 
seems to be a whole lot cleaner, and the wholesale duplication of the 
copy code is gone, so it's much nicer and cleaner. So now I'd like to 
add a new option to it: "textarray". This option would require that 
the foreign table have exactly one field, of type text[], and would 
compose all the field strings read from the file for each record into 
the array (however many there are). This would require a few changes 
to contrib/file_fdw/file_fdw.c and a few changes to 
src/backend/commands/copy.c, which I can probably have done in fairly 
short order, Deo Volente. This will allow something like:


   CREATE FOREIGN TABLE arr_text (
t text[]
   )  SERVER file_server
   OPTIONS (format 'csv', filename '/path/to/ragged.csv', textarray
   'true');
   SELECT t[3]::int as a, t[1]::timestamptz as b, t[99] as not_there
   FROM arr_text;




A WIP patch is attached. It's against Shigeru Hanada's latest FDW 
patches. It's surprisingly tiny. Right now it probably leaks memory like 
a sieve, and that's the next thing I'm going to chase down.
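The interesting part on the copy.c side is just composing the per-field
strings into one array datum, roughly like this sketch (not the patch code
itself; usual backend includes omitted):

/* Sketch: build a text[] datum from the raw field strings parsed from a line. */
static Datum
fields_to_text_array(char **fields, int nfields)
{
    Datum      *elems = (Datum *) palloc(nfields * sizeof(Datum));
    ArrayType  *arr;
    int         i;

    for (i = 0; i < nfields; i++)
        elems[i] = CStringGetTextDatum(fields[i]);   /* NULL handling omitted */

    /* text: varlena element, not by-value, int-aligned */
    arr = construct_array(elems, nfields, TEXTOID, -1, false, 'i');
    return PointerGetDatum(arr);
}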


cheers

andrew
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***
*** 58,67  static struct FileFdwOption valid_options[] = {
{ "quote",  ForeignTableRelationId },
{ "escape", ForeignTableRelationId },
{ "null",   ForeignTableRelationId },
  
/* FIXME: implement force_not_null option */
  
!   /* Centinel */
{ NULL, InvalidOid }
  };
  
--- 58,68 
{ "quote",  ForeignTableRelationId },
{ "escape", ForeignTableRelationId },
{ "null",   ForeignTableRelationId },
+   { "textarray",  ForeignTableRelationId },
  
/* FIXME: implement force_not_null option */
  
!   /* Sentinel */
{ NULL, InvalidOid }
  };
  
***
*** 134,139  file_fdw_validator(PG_FUNCTION_ARGS)
--- 135,141 
char   *escape = NULL;
char   *null = NULL;
boolheader;
+   booltextarray;
  
/* Only superuser can change generic options of the foreign table */
if (catalog == ForeignTableRelationId && !superuser())
***
*** 220,225  file_fdw_validator(PG_FUNCTION_ARGS)
--- 222,231 
 errmsg("null representation 
cannot use newline or carriage return")));
null = strVal(def->arg);
}
+   else if (strcmp(def->defname, "textarray") == 0)
+   {
+   textarray = defGetBoolean(def);
+   }
}
  
/* Check options which depend on the file format. */
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***
*** 117,122  typedef struct CopyStateData
--- 117,125 
bool   *force_quote_flags;  /* per-column CSV FQ flags */
bool   *force_notnull_flags;/* per-column CSV FNN flags */
  
+   /* param from FDW */
+   bool   text_array;  /* scan to a single text array field */
+ 
/* these are just for error messages, see CopyFromErrorCallback */
const char *cur_relname;/* table name for error messages */
int cur_lineno; /* line number for 
error messages */
***
*** 970,975  BeginCopy(bool is_from,
--- 973,986 
 errmsg("argument to option 
\"%s\" must be a list of column names",

defel->defname)));
}
+   else if (strcmp(defel->defname, "textarray") == 0)
+   {
+   if (cstate->text_array)
+   ereport(ERROR,
+   (errcode(ERRCODE_SYNTAX_ERROR),
+errmsg("conflicting or 
redundant options")));
+   cstate->text_array = defGetBoolean(defel);
+   }
else
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
***
*** 1109,1114  BeginCopy(bool is_from,
--- 1120,1131 
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("CSV quote character must not appear in 
the NULL specification")));
  
+   /* check textarray */
+   if (cstate->text_array && !is_from)
+   ereport(ERROR,
+ 

Re: [HACKERS] What happened to open_sync_without_odirect?

2011-01-15 Thread Bruce Momjian
Josh Berkus wrote:
> Last I remember, we were going to add this as an option.  But I don't
> see a patch in the queue.  Am I missing it?  Was I supposed to write it?

I don't know, but let me add that I am confused how this would look to
users.  In many cases, kernels don't even support O_DIRECT, so what
would we do to specify this?  What about just auto-disabling O_DIRECT if
the filesystem does not support it; maybe issue a log message about it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



[HACKERS] New test_fsync messages for direct I/O

2011-01-15 Thread Bruce Momjian
Josh Berkus wrote:
> Greg,
> 
> > This is interesting, because test_fsync consistently reported a rate of
> > about half this when using open_datasync instead of the equal
> > performance I'm getting from the database.  I'll see if I can reproduce
> > that further, but it's no reason to be concerned about the change that's
> > been made I think.  Just more evidence that test_fsync has quirks left
> > to be sorted out.  But that's not backbranch material, it should be part
> > of 9.1 only refactoring, already in progress via the patch Josh
> > submitted.  There's a bit of time left to get that done.
> 
> Did you rerun test_fsync with O_DIRECT enabled, using my patch?  The
> figures you had from test_fsync earlier were without O_DIRECT.

I have modified test_fsync with the attached, applied patch to report
cases where we are testing without O_DIRECT when only O_DIRECT would be
used by the server, and cases where O_DIRECT fails because of the file
system type.   Josh Berkus wanted the first case kept in case we decide
to offer non-direct-io options on machines that support direct i/o.

The new messages are:

* This non-direct I/O option is not used by Postgres.

** This file system and its mount options do not support direct
I/O, e.g. ext4 in journaled mode.

You can see the first one below in my output from Ubuntu:

$ ./test_fsync 
Ops-per-test = 2000

Simple non-sync'ed write:
        8k write                           58.175 ops/sec

Compare file sync methods using one write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                         n/a
        8k write, fdatasync                68.425 ops/sec
        8k write, fsync                    63.932 ops/sec
        fsync_writethrough                    n/a
        open_sync 8k write*                73.785 ops/sec
        open_sync 8k direct I/O write      82.929 ops/sec
        * This non-direct I/O option is not used by Postgres.

Compare file sync methods using two writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                         n/a
        8k write, 8k write, fdatasync      42.728 ops/sec
        8k write, 8k write, fsync          43.625 ops/sec
        fsync_writethrough                    n/a
        2 open_sync 8k writes*             37.150 ops/sec
        2 open_sync 8k direct I/O writes   43.722 ops/sec
        * This non-direct I/O option is not used by Postgres.

Compare open_sync with different sizes:
(This is designed to compare the cost of one large
sync'ed write and two smaller sync'ed writes.)
        open_sync 16k write                46.428 ops/sec
        2 open_sync 8k writes              38.703 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        8k write, fsync, close             65.744 ops/sec
        8k write, close, fsync             63.077 ops/sec

I believe test_fsync now matches the backend code.  If we decide to
change things, it can be adjusted.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index cd2b1f2..562da66 100644
*** /tmp/8wSNdd_test_fsync.c	Sat Jan 15 18:42:58 2011
--- src/tools/fsync/test_fsync.c	Sat Jan 15 18:38:57 2011
*** void
*** 163,169 
  test_sync(int writes_per_op)
  {
  	int			tmpfile, ops, writes;
! 
  	if (writes_per_op == 1)
  		printf("\nCompare file sync methods using one write:\n");
  	else
--- 163,170 
  test_sync(int writes_per_op)
  {
  	int			tmpfile, ops, writes;
! 	bool		fs_warning = false;
! 	
  	if (writes_per_op == 1)
  		printf("\nCompare file sync methods using one write:\n");
  	else
*** test_sync(int writes_per_op)
*** 176,184 
  	 */
  #ifdef OPEN_DATASYNC_FLAG
  	if (writes_per_op == 1)
! 		printf(LABEL_FORMAT, "open_datasync 8k write");
  	else
! 	 	printf(LABEL_FORMAT, "2 open_datasync 8k writes");
  	fflush(stdout);
  
  	if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1)
--- 177,193 
  	 */
  #ifdef OPEN_DATASYNC_FLAG
  	if (writes_per_op == 1)
! 		printf(LABEL_FORMAT, "open_datasync 8k write"
! #if PG_O_DIRECT != 0
! 		"*"
! #endif
! 		);
  	else
! 	 	printf(LABEL_FORMAT, "2 open_datasync 8k writes"
! #if PG_O_DIRECT != 0
! 		"*"
! #endif
! 		);
  	fflush(stdout);
  
  	if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1)
*** test_sync(int writes_per_op)
*** 

[HACKERS] .gitignore file needed for new replication parser

2011-01-15 Thread Kevin Grittner
diff --git a/src/backend/replication/.gitignore
b/src/backend/replication/.gitignore
new file mode 100644
index 000..a0332b2
--- /dev/null
+++ b/src/backend/replication/.gitignore
@@ -0,0 +1,3 @@
+/repl_gram.c
+/repl_gram.h
+/repl_scanner.c





Re: [HACKERS] Include WAL in base backup

2011-01-15 Thread Magnus Hagander
On Sat, Jan 15, 2011 at 23:32, Dimitri Fontaine  wrote:
> Magnus Hagander  writes:
>> Here's a cutdown version of the idea about including WAL in the base
>> backup. What I initially wanted was to introduce a way to guarantee
>> that the required WAL (with some sort of cutoff of course) would be
>> available for the backup, but I ran out of time for that. We can
>> always add that later.
>
> What if you start a concurrent process that begins streaming the WAL
> segments just before you start the backup, and you stop it after having
> stopped the backup.  I would think that then, the local received files
> would be complete.  We would only need a program able to stream the WAL
> segments and build WAL files from that… do you know about one? :)

Sure, if you stream the backups "on the side", then you don't need
this feature. This is for "very simple filesystem backups", without
the need to set up streaming or archiving.

If you're streaming the WAL separately, you'd use pg_basebackup in
non-wal mode. That's why it's optional.


>> For now, you need to set wal_keep_segments to make it work properly,
>
> That's quite a big foot gun, isn't it?  You would have to at least offer
> an option to check for your backup or to call it broken when you miss
> some WAL files on the server.

Oh, it will give you an ERROR if any of the required WAL files aren't
around anymore. So you will know that your backup is incomplete.


> The only other safe option I know about that's not a pg_streamrecv
> subprocess would be to require archiving for the duration of the base
> backup, but I think we agreed that it would be nice being able to bypass
> that.

No, the safe option I'm envisioning for this one is to just prevent
the server from recycling the logfiles until they have been sent off,
with a cap so we don't prevent recycling forever if the client hangs.
This is the part I started working on, but didn't have time to finish
for the CF. I've not given up on it, it's just waiting for later.
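The rule amounts to something like this sketch (made-up names and a made-up
segment numbering, just to show the shape of it):

/* Sketch: keep segments the base backup still needs, but never more than a cap. */
static bool
can_recycle_segment(uint64 seg, uint64 oldest_unsent_seg,
                    uint64 current_seg, uint64 max_kept_segments)
{
    if (seg >= oldest_unsent_seg &&
        current_seg - seg <= max_kept_segments)
        return false;   /* backup still needs it and we're under the cap */

    return true;        /* either already sent, or the client is too far behind */
}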

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] limiting hint bit I/O

2011-01-15 Thread Josh Berkus

> If the problem is that all the freezing happens at once, then ISTM the
> solution is to add a random factor. Say, when a tuple just passes the
> lower threshold it has a 1% chance of being frozen. The chance grows
> until it is 100% as it reaches the upper threshold.

Doesn't have to be random; it could be determinative.  That is, we could
have a vacuum_freeze_max_size parameter ... and accompanying autovacuum
parameter ... which allowed the user to limit freezing scans to, say,
1GB of the table at a time.  If I could, say, call a manual freeze of
10% of the largest tables every night, then I might actually be able to
schedule it.  It's a full scan of the whole table which is fatal.
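The size-capped scan I'm picturing is roughly this (hypothetical names, just
a sketch):

/* Sketch: freeze at most max_bytes worth of blocks per pass, resuming
 * where the previous pass left off. */
static BlockNumber
freeze_partial(Relation rel, BlockNumber start_blk, int64 max_bytes)
{
    BlockNumber nblocks  = RelationGetNumberOfBlocks(rel);
    BlockNumber max_blks = (BlockNumber) (max_bytes / BLCKSZ);
    BlockNumber blk;

    for (blk = start_blk; blk < nblocks && blk - start_blk < max_blks; blk++)
    {
        /* freeze_tuples_on_page(rel, blk);   stand-in for the real work */
    }

    /* caller remembers the return value and resumes there on the next pass */
    return (blk >= nblocks) ? 0 : blk;
}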

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] LAST CALL FOR 9.1

2011-01-15 Thread Josh Berkus
On 1/15/11 6:50 AM, Robert Haas wrote:
> It is really already too late for us to be seriously considering
> integrating sync rep into 9.1.  It will lead to another enormous beta
> period during which the tree will be closed to new patches and
> everyone will complain, or else we'll open the tree for 9.2
> development and a different though overlapping set of people will
> complain about that, but if I try to bring down the gavel and actually
> insist that we don't consider sync rep, then a third, different, also
> overlapping set of people will complain about that.

Given that people are going to complain regardless, that gives you a lot
of freedom, no?

I'm more liberal; if we have a working-with-minor-bugs version of Sync
Rep by 2/15, I'm OK with it being in 9.1.  However, if major issues
remain outstanding ... or major disputes on features/API ... then boot it.

It's really up to  Simon/Heikki/Fujii as to whether that's a realistic
goal.  Certainly there's been a lot of work to *simplify* Synch Rep this
year.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Greg Smith

Robert Haas wrote:

That seems like a bad idea - don't we routinely recommend that people
crank this up to 0.9?  You'd be effectively bounding the upper range
of this setting to a value to the less than the lowest value we
recommend anyone use today.
  


I was just giving an example of how I might do an initial split.  
There's a checkpoint happening now at time T; we have a rough idea that 
it needs to be finished before some upcoming time T+D.  Currently with 
default parameters this becomes:


Write:  0.5 * D; Sync:  0

Even though Sync obviously doesn't take zero.  The slop here is enough 
that it usually works anyway.


I was suggesting that a quick reshuffling to:

Write:  0.4 * D; Sync:  0.4 * D

Might be a good first candidate for how to split the time up better.  
The fact that this gives less writing time than the current biggest 
spread possible:


Write:  0.9 * D; Sync: 0

Is true.  It's also true that in the case where sync time really is 
zero, this new default would spread writes less than the current 
default.  I think that's optimistic, but it could happen if checkpoints 
are small and you have a good write cache.
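Spelled out, the split being debated is nothing more than this (a sketch of
the arithmetic, not the patch):

/* Sketch of carving the available time D into write and sync phases. */
typedef struct CheckpointBudget
{
    double  write_secs;     /* spread the buffer writes over this long */
    double  sync_secs;      /* spread the fsync calls over this long */
} CheckpointBudget;

static CheckpointBudget
split_checkpoint_time(double D, double write_frac, double sync_frac)
{
    CheckpointBudget b;

    b.write_secs = write_frac * D;      /* today: 0.5 * D, up to 0.9 * D */
    b.sync_secs  = sync_frac * D;       /* today: effectively 0 */
    return b;
}

/* The proposal above is split_checkpoint_time(D, 0.4, 0.4), which leaves
 * 0.2 * D of slack before the next checkpoint is due. */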


Step back from that a second though.  Ultimately, the person who is 
getting checkpoints at a 5 minute interval, and is being nailed by 
spikes, should have the option of just increasing the parameters to make 
that interval bigger.  First you increase the measly default segments to 
a reasonable range, then checkpoint_completion_target is the second one 
you can try.  But from there, you quickly move onto making 
checkpoint_timeout longer.  At some point, there is no option but to 
give up checkpoints every 5 minutes as being practical, and make the 
average interval longer.


Whether or not a refactoring here makes things slightly worse for cases 
closer to the default doesn't bother me too much.  What bothers me is 
the way trying to stretch checkpoints out further fails to deliver as 
well as it should.  I'd be OK with saying "to get the exact same spread 
situation as in older versions, you may need to retarget for checkpoints 
every 6 minutes" *if* in the process I get a much better sync latency 
distribution in most cases.


Here's an interesting data point from the customer site this all started 
at, one I don't think they'll mind sharing since it helps make the 
situation more clear to the community.  After applying this code to 
spread sync out, in order to get their server back to functional we had 
to move all the parameters involved up to where checkpoints were spaced 
35 minutes apart.  It just wasn't possible to write any faster than that 
without disrupting foreground activity. 

The whole current model where people think of this stuff in terms of 
segments and completion targets is a UI disaster.  The direction I want 
to go in is where users can say "make sure checkpoints happen every N 
minutes", and something reasonable happens without additional parameter 
fiddling.  And if the resulting checkpoint I/O spike is too big, they 
just increase the timeout to N+1 or N*2 to spread the checkpoint 
further.  Getting too bogged down thinking in terms of the current, 
really terrible interface is something I'm trying to break myself of.  
Long-term, I want there to be checkpoint_timeout, and all the other 
parameters are gone, replaced by an internal implementation of the best 
practices proven to work even on busy systems.  I don't have as much 
clarity on exactly what that best practice is the way that, say, I just 
suggested exactly how to eliminate wal_buffers as an important thing to 
manually set.  But that's the dream UI:  you shoot for a checkpoint 
interval, and something reasonable happens; if that's too intense, you 
just increase the interval to spread further.  There probably will be 
small performance regression possible vs. the current code with 
parameter combination that happen to work well on it.  Preserving every 
one of those is something that's not as important to me as making the 
tuning interface simple and clear.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books




Re: [HACKERS] arrays as pl/perl input arguments [PATCH]

2011-01-15 Thread Alex Hunsaker
On Wed, Jan 12, 2011 at 13:04, Alexey Klyukin  wrote:
>
> On Jan 12, 2011, at 8:52 PM, David E. Wheeler wrote:
>
>> On Jan 12, 2011, at 5:14 AM, Alexey Klyukin wrote:
>>
>>> You mean packing both a string representation and a reference to a single 
>>> SV * value?
>>
>> Dunno, I'm not a guts guy.
>
> Well, neither me (I haven't used much of the guts api there).

Find attached a proof of concept that modifies Alexey's patch to do
the above (using the overload example I and others posted).

Arrays have a reference of 'PostgreSQL::InServer::ARRAY'-- mainly to
be consistent with the other PL/Perl packages we have. It also lets
you match ref() =~ m/ARRAY$/ to make it a tad easier to figure out if
something is an array.

The other thing to note is I only applied this to the 'top' array.
That is when you have a multidimensional array, its children are
'real' arrays:

 create or replace function takes_array(text[]) returns void as $$
 my $array = shift;

 elog(NOTICE, ref $array);
 elog(NOTICE, ref $_) for (@$array);

$$ language plperl;

 select takes_array('{{1}, {2, 3}}'::text[]);
NOTICE:  PostgreSQL::InServer::ARRAY
CONTEXT:  PL/Perl function "takes_array"
NOTICE:  ARRAY
CONTEXT:  PL/Perl function "takes_array"
NOTICE:  ARRAY
CONTEXT:  PL/Perl function "takes_array"

We could change that so _all_ arrays have a ref of
PostgreSQL::InServer::ARRAY. However I thought it would be nice to
easily use built-ins, this way you can just 'cast' away just the top
level without having to recurse to children (my @arr = @$array).

This also will create a string representation if and when you try to
use it as a string.  It currently lacks row support in the stringify
case (and I left the regression test Alexey added for that failing) so
we would need to fix that up if we want to go down this road.

Thoughts?  Should I polish this a bit more?  Or do we like the GUC better?


pg_to_perl_arrays_kind_of.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] auto-sizing wal_buffers

2011-01-15 Thread Josh Berkus
On 1/14/11 10:51 PM, Greg Smith wrote:
> 
> ! Since the data is written out to disk at every transaction commit,
> ! the setting may only need to be large enough to hold the amount
> ! of WAL data generated by one typical transaction.  Larger values,
> ! typically at least a few megabytes, can improve write performance
> ! on a busy server where many clients are committing at once.
> ! Extremely large settings are unlikely to provide additional benefit.

I think we can be more specific on that last sentence; is there even any
*theoretical* benefit to settings above 16MB, the size of a WAL segment?
 Certainly there have been no test results to show any.

If we don't know, keep it vague, but otherwise I suggest:

"Settings larger than the size of a single WAL segment (16MB by default)
are unlikely to produce any benefit."

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] Include WAL in base backup

2011-01-15 Thread Dimitri Fontaine
Magnus Hagander  writes:
> Here's a cutdown version of the idea about including WAL in the base
> backup. What I initially wanted was to introduce a way to guarantee
> that the required WAL (with some sort of cutoff of course) would be
> available for the backup, but I ran out of time for that. We can
> always add that later.

What if you start a concurrent process that begins streaming the WAL
segments just before you start the backup, and you stop it after having
stopped the backup.  I would think that then, the local received files
would be complete.  We would only need a program able to stream the WAL
segments and build WAL files from that… do you know about one? :)

> For now, you need to set wal_keep_segments to make it work properly,

That's quite a big foot gun, isn't it?  You would have to at least offer
an option to check for your backup or to call it broken when you miss
some WAL files on the server.

The only other safe option I know about that's not a pg_streamrecv
subprocess would be to require archiving for the duration of the base
backup, but I think we agreed that it would be nice being able to bypass
that.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



[HACKERS] What happened to open_sync_without_odirect?

2011-01-15 Thread Josh Berkus
Last I remember, we were going to add this as an option.  But I don't
see a patch in the queue.  Am I missing it?  Was I supposed to write it?

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [HACKERS] ALTER TYPE 0: Introduction; test cases

2011-01-15 Thread Noah Misch
On Sat, Jan 15, 2011 at 02:31:21PM -0500, Robert Haas wrote:
> On Sat, Jan 15, 2011 at 10:25 AM, Noah Misch  wrote:
> > Do you value test coverage so little?
> 
> If you're asking whether I think real-world usability is more
> important than test coverage, then yes.

No, I wasn't asking that.

The attached version implements your preferred real-world usability for the
index_build DEBUG message.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5790f87..6f5ffb3 100644
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***
*** 1420,1425  index_build(Relation heapRelation,
--- 1420,1435 
procedure = indexRelation->rd_am->ambuild;
Assert(RegProcedureIsValid(procedure));
  
+   if (indexInfo->ii_ToastForRelName != NULL)
+   ereport(DEBUG1,
+   (errmsg("building TOAST index for table \"%s\"",
+   
indexInfo->ii_ToastForRelName)));
+   else
+   ereport(DEBUG1,
+   (errmsg("building index \"%s\" on table \"%s\"",
+   
RelationGetRelationName(indexRelation),
+   
RelationGetRelationName(heapRelation;
+ 
/*
 * Switch to the table owner's userid, so that any index functions are 
run
 * as that user.  Also lock down security-restricted operations and
***
*** 2412,2418  IndexGetRelation(Oid indexId)
   * reindex_index - This routine is used to recreate a single index
   */
  void
! reindex_index(Oid indexId, bool skip_constraint_checks)
  {
RelationiRel,
heapRelation,
--- 2422,2429 
   * reindex_index - This routine is used to recreate a single index
   */
  void
! reindex_index(Oid indexId, const char *toastFor,
! bool skip_constraint_checks)
  {
RelationiRel,
heapRelation,
***
*** 2470,2475  reindex_index(Oid indexId, bool skip_constraint_checks)
--- 2481,2489 
indexInfo->ii_ExclusionStrats = NULL;
}
  
+   /* Pass the name of relation this TOAST index serves, if any. */
+   indexInfo->ii_ToastForRelName = toastFor;
+ 
/* We'll build a new physical relation for the index */
RelationSetNewRelfilenode(iRel, InvalidTransactionId);
  
***
*** 2529,2534  reindex_index(Oid indexId, bool skip_constraint_checks)
--- 2543,2551 
   * reindex_relation - This routine is used to recreate all indexes
   * of a relation (and optionally its toast relation too, if any).
   *
+  * If this is a TOAST relation, toastFor may bear the parent relation name,
+  * facilitating improved messages.
+  *
   * If heap_rebuilt is true, then the relation was just completely rebuilt by
   * an operation such as VACUUM FULL or CLUSTER, and therefore its indexes are
   * inconsistent with it.  This makes things tricky if the relation is a system
***
*** 2548,2554  reindex_index(Oid indexId, bool skip_constraint_checks)
   * CommandCounterIncrement will occur after each index rebuild.
   */
  bool
! reindex_relation(Oid relid, bool toast_too, bool heap_rebuilt)
  {
Relationrel;
Oid toast_relid;
--- 2565,2572 
   * CommandCounterIncrement will occur after each index rebuild.
   */
  bool
! reindex_relation(Oid relid, const char *toastFor,
!bool toast_too, bool heap_rebuilt)
  {
Relationrel;
Oid toast_relid;
***
*** 2625,2631  reindex_relation(Oid relid, bool toast_too, bool 
heap_rebuilt)
if (is_pg_class)
RelationSetIndexList(rel, doneIndexes, 
InvalidOid);
  
!   reindex_index(indexOid, heap_rebuilt);
  
CommandCounterIncrement();
  
--- 2643,2649 
if (is_pg_class)
RelationSetIndexList(rel, doneIndexes, 
InvalidOid);
  
!   reindex_index(indexOid, toastFor, heap_rebuilt);
  
CommandCounterIncrement();
  
***
*** 2648,2658  reindex_relation(Oid relid, bool toast_too, bool 
heap_rebuilt)
if (is_pg_class)
RelationSetIndexList(rel, indexIds, ClassOidIndexId);
  
-   /*
-* Close rel, but continue to hold the lock.
-*/
-   heap_close(rel, NoLock);
- 
result = (indexIds != NIL);
  
/*
--- 2666,2671 
***
*** 2660,2666  reindex_relation(Oid relid, bool toast_too, bool 
heap_rebuilt)
 * still hold the lock on the master table.
 */
if (toast_too && OidIsValid(toast_relid))
!   result 

Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-15 Thread Dimitri Fontaine
Magnus Hagander  writes:
> Not sure pg_ctl clone would be the proper name, since it's not
> actually a clone at this point (it might be with the second patch I
> ust posted that includes the WAL files)

Let's keep the clone name for the client that makes it all then :)

>> That should be -D --pgdata, for consistency with pg_dump.
>
> pg_dump doesn't have a -D. I assume you mean pg_ctl / initdb?

Yes, sorry, been too fast.

> They are. The docs clearly say "Only one of -d and
> -t can be specified"

Really too fast…

> Another option, I think Heikki mentioned this on IM at some point, is
> to do something like name it <oid>-<name>.tar. That would give us best
> of both worlds?

Well I'd think we know the encoding of the pg_tablespace columns, so the
problem might be the filesystem encodings, right?  Well there's also the
option of creating <oid>.tar and having a symlink to it called <name>.tar,
but that's pushing it.  I don't think naming after OIDs is a good
service for users, but if that's all we can reasonably do…

Will continue reviewing and post something more polished and
comprehensive next week — mainly wanted to see if you wanted to include
pg_ctl  in the patch already.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] ALTER TABLE ... REPLACE WITH

2011-01-15 Thread Simon Riggs
On Tue, 2010-12-14 at 19:48 +, Simon Riggs wrote:
> On Tue, 2010-12-14 at 11:34 -0800, Josh Berkus wrote:
> > As for the utility of this command: there is no question that I would
> > use it.  I'm not sure I like the syntax (I'd prefer REPLACE TABLE 
> > WITH _), but that's painting the bike shed.
> 
> REPLACE TABLE ying WITH yang
> 
> is probably easier to implement than hacking at the ALTER TABLE code
> mountain. 
> 
> >   While the command may
> > appear frivolous and unnecessary syntactical ornamentation to some, I
> > have to say that doing the "table doesy-doe" which this command
> > addresses is something I have written scripts for on at least 50% of
> > my professional clients.  It keeps coming up. 
> 
> Yeh.

Patch. Needs work.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f3bd565..671d2c3 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -8766,3 +8766,102 @@ AtEOSubXact_on_commit_actions(bool isCommit, SubTransactionId mySubid,
 		}
 	}
 }
+
+/*
+ * ExchangeTable
+ *		Swap the contents of two tables, after many checks.
+ *		We refer to the two tables as yin and yang to avoid confusion.
+ */
+void
+ExchangeTable(RangeVar *yin, RangeVar *yang)
+{
+	Oid			yinrelid, yangrelid;
+	Relation	yinrel, yangrel;
+
+	yinrel = heap_openrv(yin, AccessExclusiveLock);
+	yinrelid = RelationGetRelid(yinrel);
+
+	/* Checks on yin table */
+	if (yinrel->rd_rel->relkind != RELKIND_RELATION)
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table",
+		RelationGetRelationName(yinrel;
+
+	if (!pg_class_ownercheck(yinrelid, GetUserId()))
+		aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_CLASS,
+	   RelationGetRelationName(yinrel));
+
+	if (IsSystemRelation(yinrel))
+		ereport(ERROR,
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("permission denied: \"%s\" is a system catalog",
+		RelationGetRelationName(yinrel;
+
+	yangrel = heap_openrv(yang, AccessExclusiveLock);
+	yangrelid = RelationGetRelid(yangrel);
+
+	/* Check on yin and yang are not the same */
+	if (yangrelid == yinrelid)
+		ereport(ERROR,
+	(errcode(ERRCODE_DUPLICATE_TABLE),
+	errmsg("relation \"%s\" cannot be exchanged with itself",
+	RelationGetRelationName(yinrel;
+
+	/* Checks on yang table */
+	if (yangrel->rd_rel->relkind != RELKIND_RELATION)
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a table",
+		RelationGetRelationName(yangrel;
+
+	if (!pg_class_ownercheck(yangrelid, GetUserId()))
+		aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_CLASS,
+	   RelationGetRelationName(yangrel));
+
+	if (IsSystemRelation(yangrel))
+		ereport(ERROR,
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("permission denied: \"%s\" is a system catalog",
+		RelationGetRelationName(yangrel;
+
+	/*
+	 * Don't allow exchange on temp tables of other backends ... their local
+	 * buffer manager is not going to cope.
+	 */
+	if (RELATION_IS_OTHER_TEMP(yinrel) ||
+		RELATION_IS_OTHER_TEMP(yangrel))
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			  errmsg("cannot exchange temporary tables of other sessions")));
+
+	/*
+	 * Also check for active uses of the relation in the current transaction,
+	 * including open scans and pending AFTER trigger events.
+	 */
+	CheckTableNotInUse(yinrel, "EXCHANGE");
+	CheckTableNotInUse(yangrel, "EXCHANGE");
+
+	/*
+	 * Exchange table contents
+	 *
+	 * Swap heaps, toast tables, toast indexes
+	 * all forks
+	 * all indexes
+	 *
+	 * Checks:
+	 * * table definitions must match
+	 * * constraints must match
+	 * * indexes need not match
+	 * * outbound FKs don't need to match
+	 * * inbound FKs will be set to not validated
+	 *
+	 * No Trigger behaviour
+	 *
+	 * How is it WAL logged? By locks and underlying catalog updates
+	 */
+
+	/* keep our locks until commit */
+	heap_close(yangrel, NoLock);
+	heap_close(yinrel, NoLock);
+}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 660947c..67bc432 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -201,7 +201,8 @@ static RangeVar *makeRangeVarFromAnyName(List *names, int position, core_yyscan_
 		DropGroupStmt DropOpClassStmt DropOpFamilyStmt DropPLangStmt DropStmt
 		DropAssertStmt DropTrigStmt DropRuleStmt DropCastStmt DropRoleStmt
 		DropUserStmt DropdbStmt DropTableSpaceStmt DropFdwStmt
-		DropForeignServerStmt DropUserMappingStmt ExplainStmt FetchStmt
+		DropForeignServerStmt DropUserMappingStmt
+		ExchangeStmt ExplainStmt FetchStmt
 		GrantStmt GrantRoleStmt IndexStmt InsertStmt ListenStmt LoadStmt
 		LockStmt NotifyStmt ExplainableStmt PreparableStmt
 		CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
@@ -488,7 +489,7 @@ static RangeVar *makeRangeVarFromAnyNam

Re: [HACKERS] Sync Rep for 2011CF1

2011-01-15 Thread Magnus Hagander
On Sat, Jan 15, 2011 at 22:40, Simon Riggs  wrote:
>
> Here's the latest patch for sync rep.
>
> From here, I will be developing the patch further on public git
> repository towards commit. My expectation is that commit is at least 2

That's great. Just one tiny detail - which repository and which branch? ;)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-15 Thread Euler Taveira de Oliveira

On 15-01-2011 15:10, Magnus Hagander wrote:

One thing I'm thinking about - right now the tool just takes -c
<connection string> to connect to the database. Should it instead be taught to
take the connection parameters that for example pg_dump does - one for
each of host, port, user, password? (shouldn't be hard to do..)


+1.


--
  Euler Taveira de Oliveira
  http://www.timbira.com/



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 10:31 AM, Greg Smith  wrote:
> That's going to give worse performance than the current code in some cases.

OK.

>> How does the checkpoint target give you any time to sync them?  Unless
>> you squeeze the writes together more tightly, but that seems sketchy.
>
> Obviously the checkpoint target idea needs to be shuffled around some too.
>  I was thinking of making the new default 0.8, and having it split the time
> in half for write and sync.  That will make the write phase close to the
> speed people are seeing now, at the default of 0.5, while giving some window
> for spread sync too.  The exact way to redistribute that around I'm not so
> concerned about yet.  When I get to where that's the most uncertain thing
> left I'll benchmark the TPS vs. latency trade-off and see what happens.  If
> the rest of the code is good enough but this just needs to be tweaked,
> that's a perfect thing to get beta feedback to finalize.

That seems like a bad idea - don't we routinely recommend that people
crank this up to 0.9?  You'd be effectively bounding the upper range
of this setting to a value to the less than the lowest value we
recommend anyone use today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Return command tag 'REPLACE X' for CREATE OR REPLACE statements.

2011-01-15 Thread Marti Raudsepp
Here's an updated patch that reports command status back to
ProcessUtility via 'bool' return value.

I was a bit unsure about using bool return values because it's not
immediately obvious what "true" or "false" refer to, but defining a
new enum seemed like overkill, so I went with bool anyway. Any better
ideas?
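For comparison, the enum I decided against would just be (hypothetical name):

/* Hypothetical alternative to the bare bool return value. */
typedef enum CreateOrReplaceResult
{
    OBJECT_CREATED,
    OBJECT_REPLACED
} CreateOrReplaceResult;

It reads a little better at the call sites, but it adds a whole type for a
two-way distinction.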

The 2nd patch also moves MOVE/FETCH command tag formatting up to
ProcessUtility, hopefully this change is for the better.

Regards,
Marti
From 25a19ca972e6f02ba350b6e4112935ff1ed44b24 Mon Sep 17 00:00:00 2001
From: Marti Raudsepp 
Date: Sun, 28 Nov 2010 16:49:41 +0200
Subject: [PATCH 1/2] Return command tag 'REPLACE X' for CREATE OR REPLACE statements.

This affects CREATE OR REPLACE LANGUAGE/VIEW/RULE/FUNCTION

Related functions that previously returned "void" have been changed to
"bool" to report status up. Others have a new argument bool *didReplace.
---
 src/backend/catalog/pg_aggregate.c  |3 +-
 src/backend/catalog/pg_proc.c   |6 -
 src/backend/commands/functioncmds.c |   13 --
 src/backend/commands/proclang.c |   34 ++-
 src/backend/commands/view.c |   22 +
 src/backend/rewrite/rewriteDefine.c |   33 ++-
 src/backend/tcop/utility.c  |   43 +++---
 src/include/catalog/pg_proc_fn.h|3 +-
 src/include/commands/defrem.h   |2 +-
 src/include/commands/proclang.h |2 +-
 src/include/commands/view.h |2 +-
 src/include/rewrite/rewriteDefine.h |4 +-
 12 files changed, 119 insertions(+), 48 deletions(-)

diff --git a/src/backend/catalog/pg_aggregate.c b/src/backend/catalog/pg_aggregate.c
index 86e8c6b..95e4469 100644
--- a/src/backend/catalog/pg_aggregate.c
+++ b/src/backend/catalog/pg_aggregate.c
@@ -229,7 +229,8 @@ AggregateCreate(const char *aggName,
 			  NIL,		/* parameterDefaults */
 			  PointerGetDatum(NULL),	/* proconfig */
 			  1,	/* procost */
-			  0);		/* prorows */
+			  0,		/* prorows */
+			  NULL);	/* didReplace */
 
 	/*
 	 * Okay to create the pg_aggregate entry.
diff --git a/src/backend/catalog/pg_proc.c b/src/backend/catalog/pg_proc.c
index 2ab87d2..8b15efc 100644
--- a/src/backend/catalog/pg_proc.c
+++ b/src/backend/catalog/pg_proc.c
@@ -85,7 +85,8 @@ ProcedureCreate(const char *procedureName,
 List *parameterDefaults,
 Datum proconfig,
 float4 procost,
-float4 prorows)
+float4 prorows,
+bool *didReplace)
 {
 	Oid			retval;
 	int			parameterCount;
@@ -650,6 +651,9 @@ ProcedureCreate(const char *procedureName,
 			AtEOXact_GUC(true, save_nestlevel);
 	}
 
+	if(didReplace)
+		*didReplace = is_update;
+
 	return retval;
 }
 
diff --git a/src/backend/commands/functioncmds.c b/src/backend/commands/functioncmds.c
index 2a2b7c7..4678d9c 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -768,8 +768,11 @@ interpret_AS_clause(Oid languageOid, const char *languageName,
 /*
  * CreateFunction
  *	 Execute a CREATE FUNCTION utility statement.
+ *
+ * Returns TRUE if function was replaced via "CREATE OR REPLACE", FALSE
+ * if created.
  */
-void
+bool
 CreateFunction(CreateFunctionStmt *stmt, const char *queryString)
 {
 	char	   *probin_str;
@@ -791,7 +794,8 @@ CreateFunction(CreateFunctionStmt *stmt, const char *queryString)
 	Oid			requiredResultType;
 	bool		isWindowFunc,
 isStrict,
-security;
+security,
+didReplace;
 	char		volatility;
 	ArrayType  *proconfig;
 	float4		procost;
@@ -958,7 +962,10 @@ CreateFunction(CreateFunctionStmt *stmt, const char *queryString)
 	parameterDefaults,
 	PointerGetDatum(proconfig),
 	procost,
-	prorows);
+	prorows,
+	&didReplace);
+
+	return didReplace;
 }
 
 
diff --git a/src/backend/commands/proclang.c b/src/backend/commands/proclang.c
index 3860105..7a1520c 100644
--- a/src/backend/commands/proclang.c
+++ b/src/backend/commands/proclang.c
@@ -49,7 +49,7 @@ typedef struct
 	char	   *tmpllibrary;	/* path of shared library */
 } PLTemplate;
 
-static void create_proc_lang(const char *languageName, bool replace,
+static bool create_proc_lang(const char *languageName, bool replace,
  Oid languageOwner, Oid handlerOid, Oid inlineOid,
  Oid valOid, bool trusted);
 static PLTemplate *find_language_template(const char *languageName);
@@ -59,9 +59,12 @@ static void AlterLanguageOwner_internal(HeapTuple tup, Relation rel,
 
 /* -
  * CREATE PROCEDURAL LANGUAGE
+ *
+ * Returns TRUE if language was replaced via "CREATE OR REPLACE", FALSE
+ * if created.
  * -
  */
-void
+bool
 CreateProceduralLanguage(CreatePLangStmt *stmt)
 {
 	char	   *languageName;
@@ -146,7 +149,8 @@ CreateProceduralLanguage(CreatePLangStmt *stmt)
 		 NIL,
 		 PointerGetDatum(NULL),
 		 1,
-		 0);
+		 0,
+		

[HACKERS] plperlu problem with utf8 [REVIEW]

2011-01-15 Thread Andy Colson


This is a review of  "plperl encoding issues"

https://commitfest.postgresql.org/action/patch_view?id=452

Purpose:

Your database uses one encoding and passes data to perl in that same encoding, 
which perl is not prepared for (it assumes UTF-8).  This patch makes sure data 
is encoded into UTF-8 before it's passed to plperl, then converts the response 
from UTF-8 back to the database encoding for storage.
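
(For illustration only, here is a minimal sketch of the round trip the patch
performs -- database encoding to UTF-8 on the way into Perl, and back again on
the way out -- using PostgreSQL's generic conversion routine.  This is not the
patch's actual code; error handling and memory ownership are simplified.)

#include "postgres.h"
#include "mb/pg_wchar.h"

/* Convert a database-encoded C string to UTF-8 (no-op if already UTF-8). */
static char *
db_string_to_utf8(char *db_str)
{
    return (char *) pg_do_encoding_conversion((unsigned char *) db_str,
                                              strlen(db_str),
                                              GetDatabaseEncoding(),
                                              PG_UTF8);
}

/* Convert a UTF-8 string coming back from Perl to the database encoding. */
static char *
utf8_to_db_string(char *utf8_str)
{
    return (char *) pg_do_encoding_conversion((unsigned char *) utf8_str,
                                              strlen(utf8_str),
                                              PG_UTF8,
                                              GetDatabaseEncoding());
}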

My test:

ptest2=# create database ptest2 encoding 'EUC_JP' template template0;

I created a simple perl function that reverses the string.  I don't know Japanese so I 
found a tattoo website that had sayings in Japanese... I picked: "I am awesome".

 
create or replace function preverse(x text) returns text as $$

my $tmp = reverse($_[0]);
return $tmp;
$$ LANGUAGE plperl;


Before the patch:

ptest2=#select preverse('私はよだれを垂らす');

  preverse

 垢蕕眇鬚譴世茲呂篁
(1 row)

It is also possible to generate invalid characters.  This function pulls off 
the last character in the string... assuming it's UTF-8

create or replace function plastchar(x text) returns text as $$
my $tmp = substr($_[0], -1);
return $tmp;
$$ LANGUAGE plperl;

ptest2=# select plastchar('私はよだれを垂らす');

ERROR:  invalid byte sequence for encoding "EUC_JP": 0xb9
CONTEXT:  PL/Perl function "plastchar"

Because the string was not UTF-8, perl got confused and returned an invalid 
character.

After the patch:
The exact same plperl functions work fine:

ptest2=# select preverse('私はよだれを垂らす');

  preverse

 すら垂をれだよは私
(1 row)

ptest2=# select plastchar('私はよだれを垂らす');

 plastchar
---
 す
(1 row)




Performance:

This is a bug fix, not a performance patch; however, as noted by the author, many 
encodings will be very UTF-8'ish and the overhead will be very small.  For 
those encodings that do need conversion, you'd need to do the same conversion 
inside your perl function anyway before you could use the data.  The processing 
has just moved from inside your perl func to inside PG.




The Patch:
==
Applies cleanly to git head as of January 15 2011.  PG built with 
--enable-cassert and --enable-debug seems to run fine with no errors.

I don't think the regression tests cover plperl, so it's understandable that there 
are no tests in the patch.

There are no manual updates in the patch either, and I think there should be.  I 
think it should be made clear
that data (varchar, text, etc., but not bytea) will be passed to perl as UTF-8, 
regardless of the database encoding.  Also that "use utf8;" is always loaded and in 
use.



Code Review:

I am not qualified.  Looking through the patch, I'm reminded of the old saying: "Any 
sufficiently advanced perl XS code is indistinguishable from magic"  :-)


Other Remarks:
==
- Yes I know... it was a joke.
- I sure hope this posts to the newsgroup ok
- My terminal (konsole) had a hard time displaying Japanese, so I used psql's 
\i and \o to read/write files that kwrite displayed/encoded correctly as EUC_JP


Summary:

Looks good.  Looks needed.  Needs manual updates.




Re: [HACKERS] Transaction-scope advisory locks

2011-01-15 Thread Marko Tiikkaja

On 2010-12-14 12:52 AM +0200, Marko Tiikkaja wrote:




Here's the latest version of the patch.  It now uses the API proposed by 
Simon, but still lacks documentation changes, which I'm going to send 
tomorrow.



Regards,
Marko Tiikkaja
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
***
*** 130,136  static const LockMethodData default_lockmethod = {
  
  static const LockMethodData user_lockmethod = {
AccessExclusiveLock,/* highest valid lock mode number */
!   false,
LockConflicts,
lock_mode_names,
  #ifdef LOCK_DEBUG
--- 130,136 
  
  static const LockMethodData user_lockmethod = {
AccessExclusiveLock,/* highest valid lock mode number */
!   true,
LockConflicts,
lock_mode_names,
  #ifdef LOCK_DEBUG
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***
*** 629,636  LockWaitCancel(void)
   * At subtransaction abort, we release all locks held by the subtransaction;
   * this is implemented by retail releasing of the locks under control of
   * the ResourceOwner mechanism.
-  *
-  * Note that user locks are not released in any case.
   */
  void
  ProcReleaseLocks(bool isCommit)
--- 629,634 
***
*** 641,646  ProcReleaseLocks(bool isCommit)
--- 639,645 
LockWaitCancel();
/* Release locks */
LockReleaseAll(DEFAULT_LOCKMETHOD, !isCommit);
+   LockReleaseAll(USER_LOCKMETHOD, false);
  }
  
  
*** a/src/backend/utils/adt/lockfuncs.c
--- b/src/backend/utils/adt/lockfuncs.c
***
*** 343,348  pg_advisory_lock_int8(PG_FUNCTION_ARGS)
--- 343,365 
  }
  
  /*
+  * pg_advisory_xact_lock(int8) - acquire xact scoped
+  * exclusive lock on an int8 key
+  */
+ Datum
+ pg_advisory_xact_lock_int8(PG_FUNCTION_ARGS)
+ {
+   int64   key = PG_GETARG_INT64(0);
+   LOCKTAG tag;
+ 
+   SET_LOCKTAG_INT64(tag, key);
+ 
+   (void) LockAcquire(&tag, ExclusiveLock, false, false);
+ 
+   PG_RETURN_VOID();
+ }
+ 
+ /*
   * pg_advisory_lock_shared(int8) - acquire share lock on an int8 key
   */
  Datum
***
*** 359,364  pg_advisory_lock_shared_int8(PG_FUNCTION_ARGS)
--- 376,398 
  }
  
  /*
+  * pg_advisory_xact_lock_shared(int8) - acquire xact scoped
+  * share lock on an int8 key
+  */
+ Datum
+ pg_advisory_xact_lock_shared_int8(PG_FUNCTION_ARGS)
+ {
+   int64   key = PG_GETARG_INT64(0);
+   LOCKTAG tag;
+ 
+   SET_LOCKTAG_INT64(tag, key);
+ 
+   (void) LockAcquire(&tag, ShareLock, false, false);
+ 
+   PG_RETURN_VOID();
+ }
+ 
+ /*
   * pg_try_advisory_lock(int8) - acquire exclusive lock on an int8 key, no wait
   *
   * Returns true if successful, false if lock not available
***
*** 378,383  pg_try_advisory_lock_int8(PG_FUNCTION_ARGS)
--- 412,437 
  }
  
  /*
+  * pg_try_advisory_xact_lock(int8) - acquire xact scoped
+  * exclusive lock on an int8 key, no wait
+  *
+  * Returns true if successful, false if lock not available
+  */
+ Datum
+ pg_try_advisory_xact_lock_int8(PG_FUNCTION_ARGS)
+ {
+   int64   key = PG_GETARG_INT64(0);
+   LOCKTAG tag;
+   LockAcquireResult res;
+ 
+   SET_LOCKTAG_INT64(tag, key);
+ 
+   res = LockAcquire(&tag, ExclusiveLock, false, true);
+ 
+   PG_RETURN_BOOL(res != LOCKACQUIRE_NOT_AVAIL);
+ }
+ 
+ /*
   * pg_try_advisory_lock_shared(int8) - acquire share lock on an int8 key, no 
wait
   *
   * Returns true if successful, false if lock not available
***
*** 397,402  pg_try_advisory_lock_shared_int8(PG_FUNCTION_ARGS)
--- 451,476 
  }
  
  /*
+  * pg_try_advisory_xact_lock_shared(int8) - acquire xact scoped
+  * share lock on an int8 key, no wait
+  *
+  * Returns true if successful, false if lock not available
+  */
+ Datum
+ pg_try_advisory_xact_lock_shared_int8(PG_FUNCTION_ARGS)
+ {
+   int64   key = PG_GETARG_INT64(0);
+   LOCKTAG tag;
+   LockAcquireResult res;
+ 
+   SET_LOCKTAG_INT64(tag, key);
+ 
+   res = LockAcquire(&tag, ShareLock, false, true);
+ 
+   PG_RETURN_BOOL(res != LOCKACQUIRE_NOT_AVAIL);
+ }
+ 
+ /*
   * pg_advisory_unlock(int8) - release exclusive lock on an int8 key
   *
   * Returns true if successful, false if lock was not held
***
*** 452,457  pg_advisory_lock_int4(PG_FUNCTION_ARGS)
--- 526,549 
  }
  
  /*
+  * pg_advisory_xact_lock(int4, int4) - acquire xact scoped
+  * exclusive lock on 2 int4 keys
+  */
+ Datum
+ pg_advisory_xact_lock_int4(PG_FUNCTION_ARGS)
+ {
+   int32   key1 = PG_GETARG_INT32(0);
+   int32   key2 = PG_GETARG_INT32(1);
+   LOCKTAG tag;
+ 
+   SET_LOCKTAG_INT32(tag, key1, key2);
+ 
+   (void) LockAcquire(&tag, ExclusiveLock, false, false);
+ 
+   PG_RETURN_VOID();
+ }
+ 
+ /*
   * pg_advisory_lock_shared(int4, int4) - acquire share lock on 2 int4 ke

Re: [HACKERS] LOCK for non-tables

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 3:37 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Sat, Jan 15, 2011 at 10:25 AM, Tom Lane  wrote:
>>> I suggest also marking each item with a release in which we intend to do
>>> it, so we don't have to try to remember whether a reasonable amount of
>>> time has elapsed.
>
>> You mean like the way the 9.1devel documentation says that
>> contrib/xml2 will be removed in 8.4?  I wonder if we'll do anything
>> either about deprecating the module or about changing the
>> documentation before 8.4 is EOL.
>
> Well, that one is a special case, because we knew perfectly well that we
> hadn't replaced all the functionality of xml2 (and we still haven't).
> I think the "official" deprecation list should only contain items for
> which there's no blocking issue other than wanting to give users time to
> migrate to an existing alternate solution.

Fair enough.

Do we wish to officially document LOCK without TABLE as a good idea to
start avoiding, in case we decide to do something about that in the
future?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-15 Thread Magnus Hagander
On Sat, Jan 15, 2011 at 21:16, Dimitri Fontaine  wrote:
> Hi,
>
> I have an unexpected 5 mins window to do a first reading of the patch,
> so here goes the quick doc and comments proof reading of it. :)

:-)


> Magnus Hagander  writes:
>> This patch creates pg_basebackup in bin/, being a client program for
>> the streaming base backup feature.
>
> Great!  We have pg_ctl init[db], I think we want pg_ctl clone or some
> other command here to call the binary for us.  What do you think?

That might be useful, but I think we need to settle on the
pg_basebackup contents themselves first.

Not sure pg_ctl clone would be the proper name, since it's not
actually a clone at this point (it might be with the second patch I
just posted that includes the WAL files).


>> One thing I'm thinking about - right now the tool just takes -c
>> <connection string> to connect to the database. Should it instead be taught to
>> take the connection parameters that for example pg_dump does - one for
>> each of host, port, user, password? (shouldn't be hard to do..)
>
> Consistency is good.
>
> Now, basic first patch reading level review:
>
> I think doc/src/sgml/backup.sgml should include some notes about how
> libpq base backup streaming compares to rsync and the like in term of
> efficiency or typical performances, when to prefer which, etc.  I'll see
> about doing some tests next week.

Yeah, the whole backup chapter may well need some more work after this.


> +      --basedir=directory
>
> That should be -D --pgdata, for consistency with pg_dump.

pg_dump doesn't have a -D. I assume you mean pg_ctl / initdb?


> On a quick reading it's unclear from the docs alone how -d and -t live
> together.  It seems like the options are exclusive but I'd have to ask…

They are. The docs clearly say "Only one of -d and
-t can be specified"


> + * The file will be named base.tar[.gz] if it's for the main data directory
> + * or <oid>.tar[.gz] if it's for another tablespace.
>
> Well we have UNIQUE, btree (spcname), so maybe we can use that here?

We could, but that would make it more likely to run into encoding
issues and such - do we restrict what can be in a tablespace name?

Also with a tar named by the oid, you *can* untar it into a directory
in pg_tblspc to recover from if you have to.

Another option, which I think Heikki mentioned on IM at some point, is
to do something like name it <oid>-<name>.tar. That would give us the best
of both worlds?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



[HACKERS] Include WAL in base backup

2011-01-15 Thread Magnus Hagander
Here's a cutdown version of the idea about including WAL in the base
backup. What I initially wanted was to introduce a way to guarantee
that the required WAL (with some sort of cutoff of course) would be
available for the backup, but I ran out of time for that. We can
always add that later.

For now, you need to set wal_keep_segments to make it work properly,
but if you do the idea is that the tar file/stream generated in the
base backup will include all the required WAL files. That means that
you can start a postmaster directly against the directory extracted
from the tarball, and you no longer need to set up archive logging to
get a backup.

I've got some refactoring I want to do around the
SendBackupDirectory() function after this, but a review of the
functionality first would be good. And obviously, documentation is
still necessary.

The patch to pg_basebackup applies on top of the previously posted version.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b4d5bbe..d8f0031 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -38,9 +38,9 @@ static void _tarWriteHeader(char *filename, char *linktarget,
 struct stat * statbuf);
 static void send_int8_string(StringInfoData *buf, int64 intval);
 static void SendBackupHeader(List *tablespaces);
-static void SendBackupDirectory(char *location, char *spcoid);
+static void SendBackupDirectory(char *location, char *spcoid, bool closetar);
 static void base_backup_cleanup(int code, Datum arg);
-static void perform_base_backup(const char *backup_label, bool progress, DIR *tblspcdir);
+static void perform_base_backup(const char *backup_label, bool progress, DIR *tblspcdir, bool includewal);
 
 typedef struct
 {
@@ -67,9 +67,12 @@ base_backup_cleanup(int code, Datum arg)
  * clobbered by longjmp" from stupider versions of gcc.
  */
 static void
-perform_base_backup(const char *backup_label, bool progress, DIR *tblspcdir)
+perform_base_backup(const char *backup_label, bool progress, DIR *tblspcdir, bool includewal)
 {
-	do_pg_start_backup(backup_label, true);
+	XLogRecPtr	startptr;
+	XLogRecPtr	endptr;
+
+	startptr = do_pg_start_backup(backup_label, true);
 
 	PG_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 	{
@@ -79,11 +82,6 @@ perform_base_backup(const char *backup_label, bool progress, DIR *tblspcdir)
 		tablespaceinfo *ti;
 
 
-		/* Add a node for the base directory */
-		ti = palloc0(sizeof(tablespaceinfo));
-		ti->size = progress ? sendDir(".", 1, true) : -1;
-		tablespaces = lappend(tablespaces, ti);
-
 		/* Collect information about all tablespaces */
 		while ((de = ReadDir(tblspcdir, "pg_tblspc")) != NULL)
 		{
@@ -111,6 +109,10 @@ perform_base_backup(const char *backup_label, bool progress, DIR *tblspcdir)
 			tablespaces = lappend(tablespaces, ti);
 		}
 
+		/* Add a node for the base directory at the end */
+		ti = palloc0(sizeof(tablespaceinfo));
+		ti->size = progress ? sendDir(".", 1, true) : -1;
+		tablespaces = lappend(tablespaces, ti);
 
 		/* Send tablespace header */
 		SendBackupHeader(tablespaces);
@@ -120,12 +122,62 @@ perform_base_backup(const char *backup_label, bool progress, DIR *tblspcdir)
 		{
 			tablespaceinfo *ti = (tablespaceinfo *) lfirst(lc);
 
-			SendBackupDirectory(ti->path, ti->oid);
+			SendBackupDirectory(ti->path, ti->oid,
+!includewal || ti->path != NULL);
 		}
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 
-	do_pg_stop_backup();
+	endptr = do_pg_stop_backup();
+
+	if (includewal)
+	{
+		/*
+		 * We've left the last tar file "open", so we can now append the
+		 * required WAL files to it.
+		 */
+		int 	logid;
+		int		logseg;
+
+		elog(LOG, "Going to write wal from %i.%i to %i.%i",
+			 startptr.xlogid, startptr.xrecoff / XLogSegSize,
+			 endptr.xlogid, endptr.xrecoff / XLogSegSize);
+		logid = startptr.xlogid;
+		logseg = startptr.xrecoff / XLogSegSize;
+
+		while (true)
+		{
+			char	xlogname[64];
+			char	fn[MAXPGPATH];
+			struct stat statbuf;
+
+			/* Send the current WAL file */
+#define TIMELINE 1
+			XLogFileName(xlogname, TIMELINE, logid, logseg);
+			sprintf(fn, "./pg_xlog/%s", xlogname);
+			if (lstat(fn, &statbuf) != 0)
+ereport(ERROR,
+		(errcode(errcode_for_file_access()),
+		 errmsg("could not stat file \"%s\": %m",
+fn)));
+
+			if (!S_ISREG(statbuf.st_mode))
+ereport(ERROR,
+		(errmsg("\"%s\" is not a file",
+fn)));
+			sendFile(fn, 1, &statbuf);
+
+			/* Advance to the next WAL file */
+			NextLogSeg(logid, logseg);
+
+			/* Have we reached our stop position yet? */
+			if (logid > endptr.xlogid ||
+(logid == endptr.xlogid && logseg >= endptr.xrecoff / XLogSegSize))
+break;
+		}
+		/* Send CopyDone message for the last tar file*/
+		pq_putemptymessage('c');
+	}
 }
 
 /*
@@ -135,7 +187,7 @@ perform_base_backup(const char *backup_label, bool pr

Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Simon Riggs
On Sat, 2011-01-15 at 09:15 -0500, Robert Haas wrote:
> On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs  wrote:
> > On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote:
> >> Robert Haas wrote:
> >> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith  wrote:
> >> >
> >> > > One of the ideas Simon and I had been considering at one point was 
> >> > > adding
> >> > > some better de-duplication logic to the fsync absorb code, which I'm
> >> > > reminded by the pattern here might be helpful independently of other
> >> > > improvements.
> >> > >
> >> >
> >> > Hopefully I'm not stepping on any toes here, but I thought this was an
> >> > awfully good idea and had a chance to take a look at how hard it would
> >> > be today while en route from point A to point B.  The answer turned
> >> > out to be "not very", so PFA a patch that seems to work.  I tested it
> >> > by attaching gdb to the background writer while running pgbench, and
> >> > it eliminate the backend fsyncs without even breaking a sweat.
> >> >
> >>
> >> No toe damage, this is great, I hadn't gotten to coding for this angle
> >> yet at all.  Suffering from an overload of ideas and (mostly wasted)
> >> test data, so thanks for exploring this concept and proving it works.
> >
> > No toe damage either, but are we sure we want the de-duplication logic
> > and in this place?
> >
> > I was originally of the opinion that de-duplicating the list would save
> > time in the bgwriter, but that guess was wrong by about two orders of
> > magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable.
> 
> Well, the point of this is not to save time in the bgwriter - I'm not
> surprised to hear that wasn't noticeable.  The point is that when the
> fsync request queue fills up, backends start performing an fsync *for
> every block they write*, and that's about as bad for performance as
> it's possible to be.  So it's worth going to a little bit of trouble
> to try to make sure it doesn't happen.  It didn't happen *terribly*
> frequently before, but it does seem to be common enough to worry about
> - e.g. on one occasion, I was able to reproduce it just by running
> pgbench -i -s 25 or something like that on a laptop.
> 
> With this patch applied, there's no performance impact vs. current
> code in the very, very common case where space remains in the queue -
> 999 times out of 1000, writing to the fsync queue will be just as fast
> as ever.  But in the unusual case where the queue has been filled up,
> compacting the queue is much much faster than performing an fsync, and
> the best part is that the compaction is generally massive.  I was
> seeing things like "4096 entries compressed to 14".  So clearly even
> if the compaction took as long as the fsync itself it would be worth
> it, because the next 4000+ guys who come along again go through the
> fast path.  But in fact I think it's much faster than an fsync.
> 
> In order to get pathological behavior even with this patch applied,
> you'd need to have NBuffers pending fsync requests and they'd all have
> to be different.  I don't think that's theoretically impossible, but
> Greg's research seems to indicate that even on busy systems we don't
> come even a little bit close to the circumstances that would cause it
> to occur in practice.  Every other change we might make in this area
> will further improve this case, too: for example, doing an absorb
> after each fsync would presumably help, as would the more drastic step
> of splitting the bgwriter into two background processes (one to do
> background page cleaning, and the other to do checkpoints, for
> example).  But even without those sorts of changes, I think this is
> enough to effectively eliminate the full fsync queue problem in
> practice, which seems worth doing independently of anything else.

You've persuaded me.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] LOCK for non-tables

2011-01-15 Thread Tom Lane
Robert Haas  writes:
> On Sat, Jan 15, 2011 at 10:25 AM, Tom Lane  wrote:
>> I suggest also marking each item with a release in which we intend to do
>> it, so we don't have to try to remember whether a reasonable amount of
>> time has elapsed.

> You mean like the way the 9.1devel documentation says that
> contrib/xml2 will be removed in 8.4?  I wonder if we'll do anything
> either about deprecating the module or about changing the
> documentation before 8.4 is EOL.

Well, that one is a special case, because we knew perfectly well that we
hadn't replaced all the functionality of xml2 (and we still haven't).
I think the "official" deprecation list should only contain items for
which there's no blocking issue other than wanting to give users time to
migrate to an existing alternate solution.

regards, tom lane



Re: [HACKERS] EXPLAIN and nfiltered

2011-01-15 Thread Marko Tiikkaja

On 2010-11-18 5:45 PM +0200, Marko Tiikkaja wrote:

Here's a patch for showing in EXPLAIN ANALYZE the number of rows a plan
qual filtered from a node's input.


Rebased against master.


Regards,
Marko Tiikkaja
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***
*** 975,992  ExplainNode(PlanState *planstate, List *ancestors,
double  startup_sec = 1000.0 * 
planstate->instrument->startup / nloops;
double  total_sec = 1000.0 * 
planstate->instrument->total / nloops;
double  rows = planstate->instrument->ntuples / nloops;
  
if (es->format == EXPLAIN_FORMAT_TEXT)
{
appendStringInfo(es->str,
!" (actual 
time=%.3f..%.3f rows=%.0f loops=%.0f)",
!startup_sec, 
total_sec, rows, nloops);
}
else
{
ExplainPropertyFloat("Actual Startup Time", 
startup_sec, 3, es);
ExplainPropertyFloat("Actual Total Time", total_sec, 3, 
es);
ExplainPropertyFloat("Actual Rows", rows, 0, es);
ExplainPropertyFloat("Actual Loops", nloops, 0, es);
}
}
--- 975,994 
double  startup_sec = 1000.0 * 
planstate->instrument->startup / nloops;
double  total_sec = 1000.0 * 
planstate->instrument->total / nloops;
double  rows = planstate->instrument->ntuples / nloops;
+   double  filtered = planstate->instrument->nfiltered / 
nloops;
  
if (es->format == EXPLAIN_FORMAT_TEXT)
{
appendStringInfo(es->str,
!" (actual 
time=%.3f..%.3f rows=%.0f filtered=%.0f loops=%.0f)",
!startup_sec, 
total_sec, rows, filtered, nloops);
}
else
{
ExplainPropertyFloat("Actual Startup Time", 
startup_sec, 3, es);
ExplainPropertyFloat("Actual Total Time", total_sec, 3, 
es);
ExplainPropertyFloat("Actual Rows", rows, 0, es);
+   ExplainPropertyFloat("Rows Filtered", filtered, 0, es);
ExplainPropertyFloat("Actual Loops", nloops, 0, es);
}
}
***
*** 999,1004  ExplainNode(PlanState *planstate, List *ancestors,
--- 1001,1007 
ExplainPropertyFloat("Actual Startup Time", 0.0, 3, es);
ExplainPropertyFloat("Actual Total Time", 0.0, 3, es);
ExplainPropertyFloat("Actual Rows", 0.0, 0, es);
+   ExplainPropertyFloat("Rows Filtered", 0.0, 0, es);
ExplainPropertyFloat("Actual Loops", 0.0, 0, es);
}
}
*** a/src/backend/executor/execScan.c
--- b/src/backend/executor/execScan.c
***
*** 19,24 
--- 19,25 
  #include "postgres.h"
  
  #include "executor/executor.h"
+ #include "executor/instrument.h"
  #include "miscadmin.h"
  #include "utils/memutils.h"
  
***
*** 221,226  ExecScan(ScanState *node,
--- 222,229 
 * Tuple fails qual, so free per-tuple memory and try again.
 */
ResetExprContext(econtext);
+   if (node->ps.instrument)
+   node->ps.instrument->nfiltered += 1;
}
  }
  
*** a/src/include/executor/instrument.h
--- b/src/include/executor/instrument.h
***
*** 49,54  typedef struct Instrumentation
--- 49,55 
double  startup;/* Total startup time (in 
seconds) */
double  total;  /* Total total time (in 
seconds) */
double  ntuples;/* Total tuples produced */
+   double  nfiltered;  /* Total tuples filtered by 
qual */
double  nloops; /* # of run cycles for this 
node */
BufferUsage bufusage;   /* Total buffer usage */
  } Instrumentation;



Re: [HACKERS] Review: Fix snapshot taking inconsistencies

2011-01-15 Thread Marko Tiikkaja

On 2010-10-21 3:32 PM +0200, Marko Tiikkaja wrote:




Here's the patch rebased against the master.  No code changes since the 
last patch I sent.



Regards,
Marko Tiikkaja
*** a/src/backend/catalog/pg_proc.c
--- b/src/backend/catalog/pg_proc.c
***
*** 759,765  fmgr_sql_validator(PG_FUNCTION_ARGS)
--- 759,767 
Oid funcoid = PG_GETARG_OID(0);
HeapTuple   tuple;
Form_pg_proc proc;
+   List   *raw_parsetree_list;
List   *querytree_list;
+   ListCell   *list_item;
boolisnull;
Datum   tmp;
char   *prosrc;
***
*** 832,840  fmgr_sql_validator(PG_FUNCTION_ARGS)
 */
if (!haspolyarg)
{
!   querytree_list = pg_parse_and_rewrite(prosrc,
!   
  proc->proargtypes.values,
!   
  proc->pronargs);
(void) check_sql_fn_retval(funcoid, proc->prorettype,
   
querytree_list,
   
NULL, NULL);
--- 834,858 
 */
if (!haspolyarg)
{
!   /*
!* Parse and rewrite the queries in the function text.
!*
!* Even though check_sql_fn_retval is only interested 
in the last
!* query, we analyze all of them here to check for any 
errors.
!*/
!   raw_parsetree_list = pg_parse_query(prosrc);
!   
!   querytree_list = NIL;
!   foreach(list_item, raw_parsetree_list)
!   {
!   Node *parsetree = (Node *) lfirst(list_item);
! 
!   querytree_list = 
pg_analyze_and_rewrite(parsetree, prosrc,
!   
proc->proargtypes.values, proc->pronargs);
!   }
! 
!   Assert(querytree_list != NIL);
! 
(void) check_sql_fn_retval(funcoid, proc->prorettype,
   
querytree_list,
   
NULL, NULL);
*** a/src/backend/executor/functions.c
--- b/src/backend/executor/functions.c
***
*** 84,89  typedef struct
--- 84,90 
boolreturnsSet; /* true if returning multiple 
rows */
boolreturnsTuple;   /* true if returning whole tuple result 
*/
boolshutdown_reg;   /* true if registered shutdown callback 
*/
+   boolsnapshot;   /* true if pushed an active 
snapshot */
boolreadonly_func;  /* true to run in "read only" mode */
boollazyEval;   /* true if using lazyEval for 
result query */
  
***
*** 93,107  typedef struct
  
JunkFilter *junkFilter; /* will be NULL if function returns 
VOID */
  
!   /* head of linked list of execution_state records */
!   execution_state *func_state;
  } SQLFunctionCache;
  
  typedef SQLFunctionCache *SQLFunctionCachePtr;
  
  
  /* non-export function prototypes */
! static execution_state *init_execution_state(List *queryTree_list,
 SQLFunctionCachePtr fcache,
 bool lazyEvalOK);
  static void init_sql_fcache(FmgrInfo *finfo, bool lazyEvalOK);
--- 94,107 
  
JunkFilter *junkFilter; /* will be NULL if function returns 
VOID */
  
!   List *func_state;
  } SQLFunctionCache;
  
  typedef SQLFunctionCache *SQLFunctionCachePtr;
  
  
  /* non-export function prototypes */
! static List *init_execution_state(List *queryTree_list,
 SQLFunctionCachePtr fcache,
 bool lazyEvalOK);
  static void init_sql_fcache(FmgrInfo *finfo, bool lazyEvalOK);
***
*** 123,183  static void sqlfunction_destroy(DestReceiver *self);
  
  
  /* Set up the list of per-query execution_state records for a SQL function */
! static execution_state *
  init_execution_state(List *queryTree_list,
 SQLFunctionCachePtr fcache,
 bool lazyEvalOK)
  {
!   execution_state *firstes = NULL;
!   execution_state *preves = NULL;
execution_state *lasttages = NULL;
!   ListCell   *qtl_item;

Re: [HACKERS] pg_basebackup for streaming base backups

2011-01-15 Thread Dimitri Fontaine
Hi,

I have an unexpected 5 mins window to do a first reading of the patch,
so here goes the quick doc and comments proof reading of it. :)

Magnus Hagander  writes:
> This patch creates pg_basebackup in bin/, being a client program for
> the streaming base backup feature.

Great!  We have pg_ctl init[db], I think we want pg_ctl clone or some
other command here to call the binary for us.  What do you think?

> One thing I'm thinking about - right now the tool just takes -c
> <connection string> to connect to the database. Should it instead be taught to
> take the connection parameters that for example pg_dump does - one for
> each of host, port, user, password? (shouldn't be hard to do..)

Consistency is good.

Now, basic first patch reading level review:

I think doc/src/sgml/backup.sgml should include some notes about how
libpq base backup streaming compares to rsync and the like in term of
efficiency or typical performances, when to prefer which, etc.  I'll see
about doing some tests next week.

+  --basedir=directory

That should be -D --pgdata, for consistency with pg_dump.

On a quick reading it's unclear from the docs alone how -d and -t live
together.  It seems like the options are exclusive but I'd have to ask…

+ * The file will be named base.tar[.gz] if it's for the main data directory
+ * or <oid>.tar[.gz] if it's for another tablespace.

Well we have UNIQUE, btree (spcname), so maybe we can use that here?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] textarray option for file FDW

2011-01-15 Thread Dimitri Fontaine
Tom Lane  writes:
> Why is this a good thing?  It seems like it would accomplish little
> except to defeat the SQL type system entirely.

It simply allows the classical ETL problem to be solved directly in
the database server, in SQL.  That's huge.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] Visual Studio 2010/Windows SDK 7.1 support

2011-01-15 Thread Andrew Chernow

On 1/15/2011 1:42 PM, Magnus Hagander wrote:

On Mon, Jan 3, 2011 at 16:45, Brar Piening  wrote:

On Mon, 3 Jan 2011 10:44:19 +0100, Magnus Hagander
wrote:


Yeah, it looks that way - it's missing the ordering of the contrib
I'll run it once for that now, and then please rebase your
patch on top of that - makes it easier to review it.


The rebased patch can be grabbed from http://www.piening.info/VS2010v2.patch


Hi!

Please make sure this goes on the commitfest page
(https://commitfest.postgresql.org/action/commitfest_view?id=9), so
it's not missed.

I'm currently lacking an env where I can put VS2010 on it (given that
the amazon cloud no longer works reasonably for windows machines, and
I can't put another version of VS on the other VM I'm using right
now), so it'll be a while before I can look at this.



I can provide a windows box with VS2010 if you'd like.  Wouldn't be 
until Monday or Tuesday.  Any preference on windows version?  Maybe 
Windows 7?  You want 64-bit?


Send a private email.

--
Andrew Chernow
eSilo, LLC
global backup
http://www.esilo.com/



[HACKERS] test_fsync made modular

2011-01-15 Thread Bruce Momjian
I have modified test_fsync to use modular C so there is less duplicate
code and it can be enhanced more easily.  Applied patch attached, though the
diff is larger than the C file so you might want to just look at the C
file in git.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/src/tools/fsync/README b/src/tools/fsync/README
index a1c2ae4..5bf6f1f 100644
*** /tmp/pgdiff.20499/6PmS7d_README	Sat Jan 15 14:41:11 2011
--- src/tools/fsync/README	Sat Jan 15 14:20:47 2011
*** test_fsync
*** 3,9 
  
  This program tests fsync.  The tests are described as part of the program output.
  
! 	Usage:	test_fsync [-f filename] [loops]
  	
  test_fsync is intended to give you a reasonable idea of what the fastest
  fsync_method is on your specific system, as well as supplying diagnostic
--- 3,9 
  
  This program tests fsync.  The tests are described as part of the program output.
  
! 	Usage:	test_fsync [-f filename] [ops_per_test]
  	
  test_fsync is intended to give you a reasonable idea of what the fastest
  fsync_method is on your specific system, as well as supplying diagnostic
*** The output filename defaults to test_fsy
*** 16,20 
  test_fsync should be run in the same filesystem as your transaction log
  directory (pg_xlog).
  
! Loops default to 2000.  Increase this to get more accurate measurements.
  
--- 16,21 
  test_fsync should be run in the same filesystem as your transaction log
  directory (pg_xlog).
  
! Ops-per-test defaults to 2000.  Increase this to get more accurate
! measurements.
  
diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index ebfd5fd..ce59afa 100644
*** /tmp/pgdiff.20499/aaqVod_test_fsync.c	Sat Jan 15 14:41:11 2011
--- src/tools/fsync/test_fsync.c	Sat Jan 15 14:36:45 2011
***
*** 22,27 
--- 22,28 
  #include 
  #include 
  
+ 
  /* 
   * put the temp files in the local directory
   * unless the user specifies otherwise 
***
*** 34,54 
  #define NA_FORMAT		LABEL_FORMAT "%18s"
  
  
! int			loops = 2000;
  
! void		die(char *str);
  void		print_elapse(struct timeval start_t, struct timeval stop_t);
  
  int
  main(int argc, char *argv[])
  {
! 	struct timeval start_t, stop_t;
! 	int			tmpfile, i;
! 	char	   *full_buf = (char *) malloc(XLOG_SEG_SIZE),
! 			   *buf, *filename = FSYNC_FILENAME;
  
  	/* 
! 	 * arguments: loops and filename (optional) 
  	 */
  	if (argc > 2 && strcmp(argv[1], "-f") == 0)
  	{
--- 35,93 
  #define NA_FORMAT		LABEL_FORMAT "%18s"
  
  
! int			ops_per_test = 2000;
! char	full_buf[XLOG_SEG_SIZE], *buf, *filename = FSYNC_FILENAME;
! struct timeval start_t, stop_t;
  
! 
! void		handle_args(int argc, char *argv[]);
! void		prepare_buf(void);
! void		test_open(void);
! void		test_non_sync(void);
! void		test_sync(int writes_per_op);
! void		test_open_syncs(void);
! void		test_file_descriptor_sync(void);
  void		print_elapse(struct timeval start_t, struct timeval stop_t);
+ void		die(char *str);
+ 
  
  int
  main(int argc, char *argv[])
  {
! 	handle_args(argc, argv);
! 	
! 	prepare_buf();
! 
! 	test_open();
! 	
! 	test_non_sync();
! 	
! 	/* Test using 1 8k write */
! 	test_sync(1);
! 
! 	/* Test using 2 8k writes */
! 	test_sync(2);
! 	
! 	test_open_syncs();
! 
! 	test_file_descriptor_sync();
! 	
! 	unlink(filename);
! 
! 	return 0;
! }
  
+ void
+ handle_args(int argc, char *argv[])
+ {
+ 	if (argc > 1 && strcmp(argv[1], "-h") == 0)
+ 	{
+ 		fprintf(stderr, "test_fsync [-f filename] [ops-per-test]\n");
+ 		exit(1);
+ 	}
+ 	
  	/* 
! 	 * arguments: ops_per_test and filename (optional) 
  	 */
  	if (argc > 2 && strcmp(argv[1], "-f") == 0)
  	{
*** main(int argc, char *argv[])
*** 58,67 
  	}
  
  	if (argc > 1) 
! 		loops = atoi(argv[1]);
  
! 	for (i = 0; i < XLOG_SEG_SIZE; i++)
! 		full_buf[i] = random();
  
  	/* 
  	 * test if we can open the target file 
--- 97,123 
  	}
  
  	if (argc > 1) 
! 		ops_per_test = atoi(argv[1]);
  
! 	printf("Ops-per-test = %d\n\n", ops_per_test);
! }
! 
! void
! prepare_buf(void)
! {
! 	int			ops;
! 
! 	/* write random data into buffer */
! 	for (ops = 0; ops < XLOG_SEG_SIZE; ops++)
! 		full_buf[ops] = random();
! 
! 	buf = (char *) TYPEALIGN(ALIGNOF_XLOG_BUFFER, full_buf);
! }
! 
! void
! test_open(void)
! {
! 	int			tmpfile;
  
  	/* 
  	 * test if we can open the target file 
*** main(int argc, char *argv[])
*** 70,94 
  		die("Cannot open output file.");
  	if (write(tmpfile, full_buf, XLOG_SEG_SIZE) != XLOG_SEG_SIZE)
  		die("write failed");
! 	/*
! 	 * fsync now so that dirty buffers don't skew later tests
! 	 */
  	if (fsync(tmpfile) != 0)
  		die("fsync failed");
- 	close(tmpfile);
  
! 	buf = (char *) TYPEALIGN(ALIGNOF_XLOG_BUFFER, full_buf);
  
! 	printf("Loops = %d\n\n", loops);
  
  	/*
  	 * Test a simple write without fsync
  	 */
! 	printf("Simple w

Re: [HACKERS] ALTER TYPE 0: Introduction; test cases

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 10:25 AM, Noah Misch  wrote:
> Do you value test coverage so little?

If you're asking whether I think real-world usability is more
important than test coverage, then yes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] We need to log aborted autovacuums

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 11:14 AM, Tom Lane  wrote:
> Greg Smith  writes:
>> Does try_relation_open need to have a lock acquisition timeout when AV
>> is calling it?
>
> Hmm.  I think when looking at the AV code, I've always subconsciously
> assumed that try_relation_open would fail immediately if it couldn't get
> the lock.  That certainly seems like it would be a more appropriate way
> to behave than delaying indefinitely.

I'm confused how that's not happening already. What does "try" mean, otherwise?
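
(For reference: try_relation_open() currently takes the lock unconditionally --
which is where the indefinite wait comes from -- and only returns NULL if the
relation has vanished by the time the lock is obtained.  A hypothetical
non-blocking variant along the lines Tom suggests might look roughly like the
sketch below; this is not existing code.)

#include "postgres.h"
#include "access/heapam.h"
#include "storage/lmgr.h"
#include "utils/rel.h"
#include "utils/syscache.h"

static Relation
try_relation_open_nowait(Oid relid, LOCKMODE lockmode)
{
    /* Give up immediately if someone holds a conflicting lock. */
    if (!ConditionalLockRelationOid(relid, lockmode))
        return NULL;

    /* The relation may have been dropped concurrently. */
    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
    {
        UnlockRelationOid(relid, lockmode);
        return NULL;
    }

    /* The lock is already held, so open without acquiring it again. */
    return relation_open(relid, NoLock);
}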

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] LOCK for non-tables

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 10:25 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Sat, Jan 15, 2011 at 6:29 AM, Simon Riggs  wrote:
>>> I think we should have a section in the release notes on Deprecated
>>> Features, noting that certain things will be removed later and should be
>>> changed now and not relied upon in the future. A pending
>>> incompatibilities list.
>
>> Agreed.  Of course, the problem is sometimes we don't do what we say
>> we're going to do, but it's worth a try.
>
> I think if we had a formal list of planned removals, it'd be more likely
> that they'd actually happen.  Right now there's no process at all
> driving such things forward.

OK.

> I suggest also marking each item with a release in which we intend to do
> it, so we don't have to try to remember whether a reasonable amount of
> time has elapsed.

You mean like the way the 9.1devel documentation says that
contrib/xml2 will be removed in 8.4?  I wonder if we'll do anything
either about deprecating the module or about changing the
documentation before 8.4 is EOL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] estimating # of distinct values

2011-01-15 Thread Tomas Vondra
Hi,

a short update regarding the ndistinct stream estimators - I've
implemented the estimators described in the papers I've mentioned in my
previous posts. If you want to try it, the sources are available at github,
at http://tvondra.github.com/pg_estimator (ignore the page, I have to
update it, just download the sources).

You can build the estimators as standalone benchmarks (build.sh) or as
aggregate functions (build-agg.sh) - I guess the latter is much easier
to play with. The installation is described on my blog, along with
two simple benchmarks:

http://www.fuzzy.cz/en/articles/aggregate-functions-for-distinct-estimation/

Feel free to play with it with different data, and if you notice
something strange just let me know.

Several comments regarding the implemented estimators:

1) PCSA, Probabilistic Counting (both described in the 1985 paper)

   - quite good performance, especially with respect to the memory
   requirements (95 and 495 B)

2) Adaptive Sampling (1990), Self-Learning Bitmap (2009)

   - very different algorithms, almost the same performance (especially
   with large number of distinct values)

   - my personal favourites, as the memory requirements are still very
   reasonable (compared to the Bloom Counter)

3) Rough Estimator (2010)

   - this algorithm is described as an "optimal algorithm" (not
   exactly, it's a building block of the optimal algorithm), but the
   results are not as good as with (1) and (2)

   - one of the problems is it needs k-wise independent hash functions,
   and the MD5 hashing scheme used with the other algorithms probably
   does not conform to this condition :-(

   - I've tried to implement such hash functions on my own (using a prime
   field and polynomials in modular arithmetic), but the results were
   quite bad - if you know how to get such hash functions, let me know
   (one standard construction is sketched after this list).

   - this algorithm is much more complex than the other algorithms, so
   if someone could do a review, that would be nice (maybe there is a
   bug resulting in the big estimate error)

4) Bloom Counter

   - basically a Bloom filter + a counter (incremented when a new
   element is added to the filter; a minimal sketch follows after this
   list)

   - due to the design (no false negatives, possible false positives)
   it's significantly biased (gives underestimates), but I think that
   can be easily fixed with a coefficient (depends on the number of
   items / false positive error rate)

   - needs significantly more memory to get performance similar to the
   other algorithms (a Bloom counter that can track 1 million distinct
   values needs about 1.2MB of memory, while an S-Bitmap that tracks 10
   million items needs just 65kB)

   - so unless we can use the special feature of Bloom Filter (i.e. the
   ability to detect items that are in the data), this is a waste of
   memory

   I know Google uses Bloom filters in BigTable to limit I/O, i.e.
   before doing the I/O they ask the Bloom filter if there are such
   data - if the answer is no, they don't have to do the I/O (false
   negatives not allowed), if the answer is yes they do the I/O.

   Not sure if we can / want to do something like this in PostgreSQL,
   as it significantly depends on the workload (and in most cases we
   know the data are there, so we have to do the I/O anyway). But I
   guess it might be a good idea for a contrib module (to maintain the
   Bloom filter on your own and make the decision in the application). The
   tricky part here is keeping the Bloom filter up to date.
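
(For concreteness, a minimal, self-contained sketch of the Bloom counter idea
from item 4: a bit array plus a counter that is bumped only when an element was
not already present in the filter.  The filter size and the two FNV-style hash
functions are illustrative choices, not the parameters used in the benchmark
code linked above.)

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NBITS (1 << 20)                 /* 1 Mbit filter, ~128 kB */

static unsigned char filter[NBITS / 8];
static uint64_t ndistinct = 0;          /* the estimate being maintained */

/* FNV-1a style hash; the seed turns one function into a small family. */
static uint64_t
hash_str(const char *s, uint64_t seed)
{
    uint64_t h = 14695981039346656037ULL ^ seed;

    for (; *s; s++)
    {
        h ^= (unsigned char) *s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Set one bit, returning whether it was already set. */
static int
test_and_set(uint64_t h)
{
    uint64_t pos = h % NBITS;
    int      was_set = (filter[pos / 8] >> (pos % 8)) & 1;

    filter[pos / 8] |= (unsigned char) (1 << (pos % 8));
    return was_set;
}

/* Add a value; count it only if the filter did not already contain it. */
static void
bloom_counter_add(const char *value)
{
    int seen = 1;

    seen &= test_and_set(hash_str(value, 0));
    seen &= test_and_set(hash_str(value, 0x9e3779b97f4a7c15ULL));
    if (!seen)          /* at least one bit was newly set => new element */
        ndistinct++;    /* false positives => may undercount, never overcount */
}

int
main(void)
{
    const char *vals[] = {"a", "b", "a", "c", "b", "d"};

    for (size_t i = 0; i < sizeof(vals) / sizeof(vals[0]); i++)
        bloom_counter_add(vals[i]);
    printf("estimated ndistinct = %llu\n", (unsigned long long) ndistinct);
    return 0;
}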
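
(And regarding the k-wise independent hash functions needed by the Rough
Estimator in item 3: one standard construction is a random polynomial of degree
k-1 evaluated over a prime field, with the result folded into the target range.
A rough sketch follows -- the parameters are illustrative, rand() is only a toy
source of randomness, the GCC/Clang __int128 extension is used for the modular
multiplication, and the final reduction into [0, m) is only approximately
uniform.)

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define K     4                                     /* gives 4-wise independence */
#define PRIME ((uint64_t) 0x1FFFFFFFFFFFFFFFULL)    /* 2^61 - 1 */

typedef struct
{
    uint64_t a[K];      /* random coefficients, uniform in [0, PRIME) */
} kwise_hash;

static void
kwise_init(kwise_hash *h)
{
    for (int i = 0; i < K; i++)
        h->a[i] = (((uint64_t) rand() << 31) ^ (uint64_t) rand()) % PRIME;
}

/* Evaluate the degree-(K-1) polynomial at x (Horner), then map into [0, m). */
static uint64_t
kwise_eval(const kwise_hash *h, uint64_t x, uint64_t m)
{
    unsigned __int128 acc = 0;

    x %= PRIME;
    for (int i = K - 1; i >= 0; i--)
        acc = (acc * x + h->a[i]) % PRIME;
    return (uint64_t) (acc % m);
}

int
main(void)
{
    kwise_hash h;

    srand(42);
    kwise_init(&h);
    for (uint64_t x = 1; x <= 5; x++)
        printf("h(%llu) = %llu\n", (unsigned long long) x,
               (unsigned long long) kwise_eval(&h, x, (uint64_t) 1 << 16));
    return 0;
}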

I'll try to fix the optimal estimator (not sure how to do that right
now), and I'll investigate the 'relation fork' proposed as a way to
store the estimator.

regards
Tomas



Re: [HACKERS] LAST CALL FOR 9.1

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 1:59 PM, David E. Wheeler  wrote:
> On Jan 15, 2011, at 6:50 AM, Robert Haas wrote:
>> It is really already too late for us to be seriously considering
>> integrating sync rep into 9.1.  It will lead to another enormous beta
>> period during which the tree will be closed to new patches and
>> everyone will complain, or else we'll open the tree for 9.2
>> development and a different though overlapping set of people will
>> complain about that, but if I try to bring down the gavel and actually
>> insist that we don't consider sync rep, then a third, different, also
>> overlapping set of people will complain about that.
>
> I hereby complain about your complaints about the complainers.
>
> There, you lose. ;-P

Yeah, I needed a fourth way to do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] LAST CALL FOR 9.1

2011-01-15 Thread David E. Wheeler
On Jan 15, 2011, at 6:50 AM, Robert Haas wrote:

> It is really already too late for us to be seriously considering
> integrating sync rep into 9.1.  It will lead to another enormous beta
> period during which the tree will be closed to new patches and
> everyone will complain, or else we'll open the tree for 9.2
> development and a different though overlapping set of people will
> complain about that, but if I try to bring down the gavel and actually
> insist that we don't consider sync rep, then a third, different, also
> overlapping set of people will complain about that.

I hereby complain about your complaints about the complainers.

There, you lose. ;-P

David



Re: [HACKERS] Visual Studio 2010/Windows SDK 7.1 support

2011-01-15 Thread Magnus Hagander
On Mon, Jan 3, 2011 at 16:45, Brar Piening  wrote:
> On Mon, 3 Jan 2011 10:44:19 +0100, Magnus Hagander 
> wrote:
>>
>> Yeah, it looks that way - it's missing the ordering of the contrib
>> I'll run it once for that now, and then please rebase your
>> patch on top of that - makes it easier to review it.
>
> The rebased patch can be grabbed from http://www.piening.info/VS2010v2.patch

Hi!

Please make sure this goes on the commitfest page
(https://commitfest.postgresql.org/action/commitfest_view?id=9), so
it's not missed.

I'm currently lacking an env where I can put VS2010 on it (given that
the amazon cloud no longer works reasonably for windows machines, and
I can't put another version of VS on the other VM I'm using right
now), so it'll be a while before I can look at this.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] textarray option for file FDW

2011-01-15 Thread Andrew Dunstan



On 01/15/2011 01:24 PM, Itagaki Takahiro wrote:

On Sun, Jan 16, 2011 at 02:29, Andrew Dunstan  wrote:

"textarray". This option would require that the foreign table have exactly
one field, of type text[], and would compose all the field strings read from
the file for each record into the array (however many there are).

   CREATE FOREIGN TABLE arr_text (
t text[]
   )  SERVER file_server
   OPTIONS (format 'csv', filename '/path/to/ragged.csv', textarray 'true');

FOREIGN TABLE has stable definitions, so there are some issues
that don't exist in the COPY command:
- when the type of the last column is not a text[]
- when the last column is changed by ALTER FOREIGN TABLE ADD/DROP COLUMN



I intended that this would be checked for validity at runtime, in 
BeginCopy(). If the table doesn't match the requirement an error would 
be raised. I don't think it's a big problem.



BTW, those options could be specified in column options rather than table
options. But we don't have column option syntax in the current proposal.

   CREATE FOREIGN TABLE arr_text (
i integer OPTION (column 1), -- column position in csv file
t text[]  OPTION (column 'all the rest'),
        d date    OPTION (column 2)
   )  SERVER file_server
   OPTIONS (format 'csv', filename '/path/to/ragged.csv');



Well, ok, but is there any proposal on the table to do that?

cheers

andrew





Re: [HACKERS] Streaming base backups

2011-01-15 Thread Magnus Hagander
On Sat, Jan 15, 2011 at 19:27, Tom Lane  wrote:
> Magnus Hagander  writes:
>> Something like this to fix? or is this going to put those "warnings by
>> stupid versions of gcc" back?
>
> Possibly.  If so, I'll fix it --- I have an old gcc to test against
> here.

Ok, thanks, I'll commit this one then.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Streaming base backups

2011-01-15 Thread Tom Lane
Magnus Hagander  writes:
> Something like this to fix? or is this going to put those "warnings by
> stupid versions of gcc" back?

Possibly.  If so, I'll fix it --- I have an old gcc to test against
here.

regards, tom lane



Re: [HACKERS] textarray option for file FDW

2011-01-15 Thread Itagaki Takahiro
On Sun, Jan 16, 2011 at 02:29, Andrew Dunstan  wrote:
> "textarray". This option would require that the foreign table have exactly
> one field, of type text[], and would compose all the field strings read from
> the file for each record into the array (however many there are).
>
>   CREATE FOREIGN TABLE arr_text (
>        t text[]
>   )  SERVER file_server
>   OPTIONS (format 'csv', filename '/path/to/ragged.csv', textarray 'true');

FOREIGN TABLE has stable definitions, so there are some issues
that don't exist in the COPY command:
- when the type of the last column is not a text[]
- when the last column is changed by ALTER FOREIGN TABLE ADD/DROP COLUMN

BTW, those options could be specified in column options rather than table
options. But we don't have column option syntax in the current proposal.

  CREATE FOREIGN TABLE arr_text (
       i integer OPTION (column 1), -- column position in csv file
       t text[]  OPTION (column 'all the rest'),
       d date    OPTION (column 2)
  )  SERVER file_server
  OPTIONS (format 'csv', filename '/path/to/ragged.csv');

-- 
Itagaki Takahiro



Re: [HACKERS] Streaming base backups

2011-01-15 Thread Magnus Hagander
On Sat, Jan 15, 2011 at 16:54, Tom Lane  wrote:
> Heikki Linnakangas  writes:
>> On 15.01.2011 17:30, Tom Lane wrote:
>>> So what?  The needed actions will be covered by WAL replay.
>
>> No, they won't, if pg_base_backup() is called *after* getting the list
>> of tablespaces.
>
> Ah.  Then the fix is to change the order in which those things are done.

Grumble. It used to be that way. For some reason I can't recall, I broke it.

Something like this to fix? or is this going to put those "warnings by
stupid versions of gcc" back?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 1ed5e2a..b4d5bbe 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -40,7 +40,7 @@ static void send_int8_string(StringInfoData *buf, int64 intval);
 static void SendBackupHeader(List *tablespaces);
 static void SendBackupDirectory(char *location, char *spcoid);
 static void base_backup_cleanup(int code, Datum arg);
-static void perform_base_backup(const char *backup_label, List *tablespaces);
+static void perform_base_backup(const char *backup_label, bool progress, DIR *tblspcdir);
 
 typedef struct
 {
@@ -67,13 +67,50 @@ base_backup_cleanup(int code, Datum arg)
  * clobbered by longjmp" from stupider versions of gcc.
  */
 static void
-perform_base_backup(const char *backup_label, List *tablespaces)
+perform_base_backup(const char *backup_label, bool progress, DIR *tblspcdir)
 {
 	do_pg_start_backup(backup_label, true);
 
 	PG_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 	{
+		List	   *tablespaces = NIL;
 		ListCell   *lc;
+		struct dirent *de;
+		tablespaceinfo *ti;
+
+
+		/* Add a node for the base directory */
+		ti = palloc0(sizeof(tablespaceinfo));
+		ti->size = progress ? sendDir(".", 1, true) : -1;
+		tablespaces = lappend(tablespaces, ti);
+
+		/* Collect information about all tablespaces */
+		while ((de = ReadDir(tblspcdir, "pg_tblspc")) != NULL)
+		{
+			char		fullpath[MAXPGPATH];
+			char		linkpath[MAXPGPATH];
+
+			/* Skip special stuff */
+			if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+continue;
+
+			snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
+
+			MemSet(linkpath, 0, sizeof(linkpath));
+			if (readlink(fullpath, linkpath, sizeof(linkpath) - 1) == -1)
+			{
+ereport(WARNING,
+		(errmsg("unable to read symbolic link %s: %m", fullpath)));
+continue;
+			}
+
+			ti = palloc(sizeof(tablespaceinfo));
+			ti->oid = pstrdup(de->d_name);
+			ti->path = pstrdup(linkpath);
+			ti->size = progress ? sendDir(linkpath, strlen(linkpath), true) : -1;
+			tablespaces = lappend(tablespaces, ti);
+		}
+
 
 		/* Send tablespace header */
 		SendBackupHeader(tablespaces);
@@ -101,9 +138,6 @@ void
 SendBaseBackup(const char *backup_label, bool progress)
 {
 	DIR		   *dir;
-	struct dirent *de;
-	List	   *tablespaces = NIL;
-	tablespaceinfo *ti;
 	MemoryContext backup_context;
 	MemoryContext old_context;
 
@@ -134,41 +168,10 @@ SendBaseBackup(const char *backup_label, bool progress)
 		ereport(ERROR,
 				(errmsg("unable to open directory pg_tblspc: %m")));
 
-	/* Add a node for the base directory */
-	ti = palloc0(sizeof(tablespaceinfo));
-	ti->size = progress ? sendDir(".", 1, true) : -1;
-	tablespaces = lappend(tablespaces, ti);
-
-	/* Collect information about all tablespaces */
-	while ((de = ReadDir(dir, "pg_tblspc")) != NULL)
-	{
-		char		fullpath[MAXPGPATH];
-		char		linkpath[MAXPGPATH];
-
-		/* Skip special stuff */
-		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
-			continue;
-
-		snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
-
-		MemSet(linkpath, 0, sizeof(linkpath));
-		if (readlink(fullpath, linkpath, sizeof(linkpath) - 1) == -1)
-		{
-			ereport(WARNING,
-				(errmsg("unable to read symbolic link %s: %m", fullpath)));
-			continue;
-		}
+	perform_base_backup(backup_label, progress, dir);
 
-		ti = palloc(sizeof(tablespaceinfo));
-		ti->oid = pstrdup(de->d_name);
-		ti->path = pstrdup(linkpath);
-		ti->size = progress ? sendDir(linkpath, strlen(linkpath), true) : -1;
-		tablespaces = lappend(tablespaces, ti);
-	}
 	FreeDir(dir);
 
-	perform_base_backup(backup_label, tablespaces);
-
 	MemoryContextSwitchTo(old_context);
 	MemoryContextDelete(backup_context);
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_basebackup for streaming base backups

2011-01-15 Thread Magnus Hagander
This patch creates pg_basebackup in bin/, being a client program for
the streaming base backup feature.

I think it's more or less done now. I've again split it out of
pg_streamrecv, because it had very little shared code with that
(basically just the PQconnectdb() wrapper).

One thing I'm thinking about - right now the tool just takes -c with a
conninfo string to connect to the database. Should it instead be taught to
take the connection parameters that for example pg_dump does - one for
each of host, port, user, password? (shouldn't be hard to do..)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index db7c834..c14ae43 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -813,6 +813,16 @@ SELECT pg_stop_backup();

 

+You can also use the  tool to take
+the backup, instead of manually copying the files. This tool will take
+care of the pg_start_backup(), copy and
+pg_stop_backup() steps automatically, and transfers the
+backup over a regular PostgreSQL connection
+using the replication protocol, instead of requiring filesystem level
+access.
+   
+
+   
 Some file system backup tools emit warnings or errors
 if the files they are trying to copy change while the copy proceeds.
 When taking a base backup of an active database, this situation is normal
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f40fa9d..c44d11e 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -160,6 +160,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
new file mode 100644
index 000..42a714b
--- /dev/null
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -0,0 +1,249 @@
+
+
+
+ 
+  pg_basebackup
+  1
+  Application
+ 
+
+ 
+  pg_basebackup
+  take a base backup of a PostgreSQL cluster
+ 
+
+ 
+  pg_basebackup
+ 
+
+ 
+  
+   pg_basebackup
+   option
+  
+ 
+
+ 
+  
+   Description
+  
+  
+   pg_basebackup is used to take base backups of
+   a running PostgreSQL database cluster. These
+   are taken without affecting other clients to the database, and can be used
+   both for point-in-time recovery (see )
+   and as the starting point for a log shipping or streaming replication standby
+   server (see ).
+  
+
+  
+   pg_basebackup makes a binary copy of the database
+   cluster files, while making sure the system is put in and out of
+   backup mode automatically. Backups are always taken of the entire
+   database cluster; it is not possible to back up individual databases or
+   database objects. For individual database backups, a tool such as
+must be used.
+  
+
+  
+   The backup is made over a regular PostgreSQL
+   connection, and uses the replication protocol. The connection must be
+   made with a user having REPLICATION permissions (see
+   ).
+  
+
+  
+   Only one backup can be concurrently active in
+   PostgreSQL, meaning that only one instance of
+   pg_basebackup can run at the same time
+   against a single database cluster.
+  
+ 
+
+ 
+  Options
+
+   
+
+ 
+  -c conninfo
+  --conninfo=conninfo
+  
+   
+Specify the conninfo string used to connect to the server. For example:
+
+$ pg_basebackup -c "host=192.168.0.2 user=backup"
+
+See  for more information on all the
+available connection options.
+   
+  
+ 
+
+ 
+  -d directory
+  --basedir=directory
+  
+   
+Directory to restore the base data directory to. When the cluster has
+no additional tablespaces, the whole database will be placed in this
+directory. If the cluster contains additional tablespaces, the main
+data directory will be placed in this directory, but all other
+tablespaces will be placed in the same absolute path as they have
+on the server.
+   
+   
+Only one of -d and -t can be specified.
+   
+  
+ 
+
+ 
+  -l label
+  --label=label
+  
+   
+Sets the label for the backup. If none is specified, a default value of
+pg_basebackup base backup will be used.
+   
+  
+ 
+
+ 
+  -p
+  --progress
+  
+   
+Enables progress reporting. Turning this on will deliver an approximate
+progress report during the backup. Since the database may change during
+the backup, this is only an approximation and may not end at exactly
+100%.
+   
+   
+When this is enabled, the backup will start by enumerating the size of
+the entire database, and then go back and send the actual contents.
+This may make the backup take slightly longer, and in particular it
+will take longer before the first data is sent.
+   
+  
+ 

Re: [HACKERS] textarray option for file FDW

2011-01-15 Thread Andrew Dunstan



On 01/15/2011 12:44 PM, Tom Lane wrote:

Andrew Dunstan  writes:

... So now I'd like to add a
new option to it: "textarray". This option would require that the
foreign table have exactly one field, of type text[], and would compose
all the field strings read from the file for each record into the array
(however many there are).

Why is this a good thing?  It seems like it would accomplish little
except to defeat the SQL type system entirely.





We have discussed previously allowing this sort of thing. It's not a new 
proposal at all.


My use case is that I have a customer who reads in data like this (using 
my patch to allow ragged CSV input, which you previously objected to) 
and then performs a sophisticated battery of validity tests on the data 
before loading it into its final destination. To do that their 
requirement is that we not error out on reading the data, so we load the 
data into a table that is all text fields.


In fact, having COPY read in a text array is *your* suggestion: 
. 
This is simply a proposal to implement that via FDW, which makes it easy 
to avoid any syntax issues.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: patch: fix performance problems with repated decomprimation of varlena values in plpgsql

2011-01-15 Thread Noah Misch
Hello Pavel,

I'm reviewing this patch for CommitFest 2011-01.

The patch seems fully desirable.  It appropriately contains no documentation
updates.  It contains no new tests, and that's probably fine, too; I can't think
of any corner cases where this would do something other than work correctly or
break things comprehensively.

Using your test case from here:
http://archives.postgresql.org/message-id/aanlktikdcw+c-c4u4ngaobhpfszkb5uy_zuatdzfp...@mail.gmail.com
I observed a 28x speedup, similar to Álvaro's report.  I also made my own test
case, attached, to evaluate this from a somewhat different angle and also to
consider the worst case.  With a 100 KiB string (good case), I see a 4.8x
speedup.  With a 1 KiB string (worst case - never toasted), I see no
statistically significant change.  This is to be expected.

A few specific questions and comments follow, all minor.  Go ahead and mark the
patch ready for committer when you've acted on them, or declined to do so, to
your satisfaction.  I don't think a re-review will be needed.

Thanks,
nm

On Mon, Nov 22, 2010 at 01:46:51PM +0100, Pavel Stehule wrote:
> *** ./pl_exec.c.orig  2010-11-16 10:28:42.0 +0100
> --- ./pl_exec.c   2010-11-22 13:33:01.597726809 +0100

The patch applies cleanly, but the format is slightly nonstandard (-p0 when
applied from src/pl/plpgsql/src, rather than -p1 from the root).

> ***
> *** 3944,3949 
> --- 3965,3993 
>   
>   *typeid = var->datatype->typoid;
>   *typetypmod = var->datatype->atttypmod;
> + 
> + /*
> +  * explicit deTOAST and decomprim for varlena 
> types

"decompress", perhaps?

> +  */
> + if (var->should_be_detoasted)
> + {
> + Datum dvalue;
> + 
> + Assert(!var->isnull);
> + 
> + oldcontext = 
> MemoryContextSwitchTo(estate->fn_mcxt);
> + dvalue = 
> PointerGetDatum(PG_DETOAST_DATUM(var->value));
> + if (dvalue != var->value)
> + {
> + if (var->freeval)
> + free_var(var);
> + var->value = dvalue;
> + var->freeval = true;
> + }
> + MemoryContextSwitchTo(oldcontext); 

This line adds trailing white space.

> + var->should_be_detoasted = false;
> + }
> + 
>   *value = var->value;
>   *isnull = var->isnull;
>   break;

> *** ./plpgsql.h.orig  2010-11-16 10:28:42.0 +0100
> --- ./plpgsql.h   2010-11-22 13:12:38.897851879 +0100

> ***
> *** 644,649 
> --- 645,651 
>   boolfn_is_trigger;
>   PLpgSQL_func_hashkey *fn_hashkey;   /* back-link to hashtable key */
>   MemoryContext fn_cxt;
> + MemoryContext   fn_mcxt;/* link to function's memory 
> context */
>   
>   Oid fn_rettype;
>   int fn_rettyplen;
> ***
> *** 692,697 
> --- 694,701 
>   Oid rettype;/* type of current 
> retval */
>   
>   Oid fn_rettype; /* info about declared 
> function rettype */
> + MemoryContext   fn_mcxt;/* link to function's memory 
> context */
> + 
>   boolretistuple;
>   boolretisset;
>   

I only see PLpgSQL_execstate.fn_mcxt getting populated in this patch.  Is the
PLpgSQL_function.fn_mcxt leftover from an earlier design?

I could not readily tell the difference between fn_cxt and fn_mcxt.  As I
understand it, fn_mcxt is the SPI procCxt, and fn_cxt is the long-lived context
used to cache facts across many transactions.  Perhaps name the member something
like "top_cxt", "exec_cxt" or "proc_cxt" and comment it like "/* SPI Proc memory
context */" to make this clearer.
\set datscale 1000

CREATE TABLE t (c text);
INSERT INTO t VALUES (repeat('a', 1024 * :datscale));

CREATE OR REPLACE FUNCTION f(runs int) RETURNS void LANGUAGE plpgsql AS $$
DECLARE
foo text;
BEGIN
SELECT c INTO foo FROM t;
FOR n IN 1 .. runs LOOP
PERFORM foo < 'x';
END LOOP;
END
$$;

-- datscale=1000: 20.4s before patch, 4.21s after patch
SELECT f(4000);

-- datscale=1: 20.8s before patch, 21.0s after patch
SELECT f(300);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] textarray option for file FDW

2011-01-15 Thread Tom Lane
Andrew Dunstan  writes:
> ... So now I'd like to add a 
> new option to it: "textarray". This option would require that the 
> foreign table have exactly one field, of type text[], and would compose 
> all the field strings read from the file for each record into the array 
> (however many there are).

Why is this a good thing?  It seems like it would accomplish little
except to defeat the SQL type system entirely.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] textarray option for file FDW

2011-01-15 Thread Andrew Dunstan


I've been waiting for the latest FDW patches as patiently as I can, and 
I've been reviewing them this morning, in particular the file_fdw patch 
and how it interacts with the newly exposed COPY API. Overall it seems 
to be a whole lot cleaner, and the wholesale duplication of the copy 
code is gone, so it's much nicer and cleaner. So now I'd like to add a 
new option to it: "textarray". This option would require that the 
foreign table have exactly one field, of type text[], and would compose 
all the field strings read from the file for each record into the array 
(however many there are). This would require a few changes to 
contrib/file_fdw/file_fdw.c and a few changes to 
src/backend/commands/copy.c, which I can probably have done in fairly 
short order, Deo Volente. This will allow something like:


   CREATE FOREIGN TABLE arr_text (
t text[]
   )  SERVER file_server
   OPTIONS (format 'csv', filename '/path/to/ragged.csv', textarray
   'true');
   SELECT t[3]::int as a, t[1]::timestamptz as b, t[99] as not_there
   FROM arr_text;

Thoughts?

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in pg_describe_object, patch v2

2011-01-15 Thread Joel Jacobson
2011/1/15 Tom Lane :
> But I can read the handwriting on the wall: if I want this done right,
> I'm going to have to do it myself.
>
>                        regards, tom lane

Excellently put! I will with pride steal that phrase and use it
whenever I run into the same situation myself.
Quite often a proposed "good enough solution" introduces unnecessary
complexity and awkwardness, while at the same time the "proper
solution" requires a disproportionate amount of resources to
accomplish.
The model most people adopt is a mix of both: spend time on details
when it is absolutely necessary, and accept "good enough solutions"
until someone (not necessarily yourself, as you suggested) gets
motivated enough to implement the "proper solution".

I don't think, and I certainly hope, it's not just one guy in the
universe who can get it "done right".

So, instead of fighting like war maniacs over this
god-damn-two-lines-of-code patch, could Tom please describe to the
patch submitter what he means by getting it "done right"? Perhaps the
patch submitter will then be motivated enough to invest a few hundred
or thousand hours of his time to solve the problem the way Tom hinted.

-- 
Best regards,

Joel Jacobson
Glue Finance

E: j...@gluefinance.com
T: +46 70 360 38 01

Postal address:
Glue Finance AB
Box  549
114 11  Stockholm
Sweden

Visiting address:
Glue Finance AB
Birger Jarlsgatan 14
114 34 Stockholm
Sweden

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

2011-01-15 Thread Bruce Momjian
Josh Berkus wrote:
> On 12/6/10 6:13 PM, Tom Lane wrote:
> > Josh Berkus  writes:
> >> OK, patch coming then.  Right now test_fsync aborts when O_DIRECT fails.
> >>  What should I have it do instead?
> > 
> > Report that it fails, and keep testing the other methods.
> 
> Patch attached.  Includes a fair amount of comment cleanup, since
> existing comments did not meet our current project standards.  Tests all
> 6 of the methods we support separately.
> 
> Some questions, though:
> 
> (1) Why are we doing the open_sync different-size write test?  AFAIK,
> this doesn't match any behavior which PostgreSQL has.

I added program output to explain this.

> (2) In this patch, I'm stepping down the number of loops which
> fsync_writethrough does by 90%.  The reason for that was that on the
> platforms where I tested writethrough (desktop machines), doing 10,000
> loops took 15-20 *minutes*, which seems hard on the user.  Would be easy
> to revert if you think it's a bad idea.
>   Possibly auto-sizing the number of loops based on the first fsync test
> might be a good idea, but seems like going a bit too far.

I did not know why writethrough would always be much slower than the other
sync methods, so I just reduced the loop count to 2k.

> (3) Should the multi-descriptor test be using writethrough on platforms
> which support it?

Thank you for your patch.  I have applied most of it, attached.  

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/src/tools/fsync/Makefile b/src/tools/fsync/Makefile
index fe3e626..44419ee 100644
*** /tmp/pgdiff.17122/M3047c_Makefile	Sat Jan 15 11:53:43 2011
--- src/tools/fsync/Makefile	Fri Jan 14 22:35:30 2011
*** override CPPFLAGS := -I$(libpq_srcdir) $
*** 16,24 
  
  OBJS= test_fsync.o
  
! all: test_fsync
  
! test_fsync: test_fsync.o | submake-libpq submake-libpgport
  	$(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
  
  clean distclean maintainer-clean:
--- 16,24 
  
  OBJS= test_fsync.o
  
! all: submake-libpq submake-libpgport test_fsync
  
! test_fsync: test_fsync.o $(libpq_builddir)/libpq.a
  	$(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
  
  clean distclean maintainer-clean:
diff --git a/src/tools/fsync/README b/src/tools/fsync/README
index 6d9acd3..a1c2ae4 100644
*** /tmp/pgdiff.17122/wFeqrc_README	Sat Jan 15 11:53:43 2011
--- src/tools/fsync/README	Sat Jan 15 11:34:58 2011
***
*** 1,11 
! src/tools/fsync/README
! 
! fsync
! =
  
  This program tests fsync.  The tests are described as part of the program output.
  
  	Usage:	test_fsync [-f filename] [loops]
  
- Loops defaults to 5000.  The default output file is /var/tmp/test_fsync.out.
- Consider that /tmp or /var/tmp might be memory-based file systems.
--- 1,20 
! test_fsync
! ==
  
  This program tests fsync.  The tests are described as part of the program output.
  
  	Usage:	test_fsync [-f filename] [loops]
+ 	
+ test_fsync is intended to give you a reasonable idea of what the fastest
+ fsync_method is on your specific system, as well as supplying diagnostic
+ information in the event of an identified I/O problem.  However,
+ differences shown by test_fsync might not make any difference in real
+ database throughput, especially since many database servers are not
+ speed-limited by their transaction logs.
+ 
+ The output filename defaults to test_fsync.out in the current directory.
+ test_fsync should be run in the same filesystem as your transaction log
+ directory (pg_xlog).
+ 
+ Loops default to 2000.  Increase this to get more accurate measurements.
  
diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index 28c2119..831e2a0 100644
*** /tmp/pgdiff.17122/OSeljb_test_fsync.c	Sat Jan 15 11:53:43 2011
--- src/tools/fsync/test_fsync.c	Sat Jan 15 11:52:03 2011
***
*** 3,9 
   *
   *
   *	test_fsync.c
!  *		test various fsync() methods
   */
  
  #include "postgres.h"
--- 3,9 
   *
   *
   *	test_fsync.c
!  *		tests all supported fsync() methods
   */
  
  #include "postgres.h"
***
*** 22,40 
  #include 
  #include 
  
! 
! #ifdef WIN32
  #define FSYNC_FILENAME	"./test_fsync.out"
- #else
- /* /tmp might be a memory file system */
- #define FSYNC_FILENAME	"/var/tmp/test_fsync.out"
- #endif
  
  #define WRITE_SIZE	(8 * 1024)	/* 8k */
  
  #define LABEL_FORMAT	"\t%-30s"
  
! int			loops = 1;
  
  void		die(char *str);
  void		print_elapse(struct timeval start_t, struct timeval stop_t);
--- 22,39 
  #include 
  #include 
  
! /* 
!  * put the temp files in the local directory
!  * unless the user specifies otherwise 
!  */
  #define FSYNC_FILENAME	"./test_fsync.out"
  
  #define WRITE_SIZE	(8 * 1024)	/* 8k */
  
  #define LABEL_FORMAT	"\t%-30s"
  
! 
! int			loops = 2000;
  
  void		die(char *str);
  void	

Re: [HACKERS] auto-sizing wal_buffers

2011-01-15 Thread Greg Smith

Fujii Masao wrote:

+intXLOGbuffersMin = 8;

XLOGbuffersMin is a fixed value. I think that defining it as a macro
rather than a variable seems better.

+   if (XLOGbuffers > 2048)
+   XLOGbuffers = 2048;

Using "XLOG_SEG_SIZE/XLOG_BLCKSZ" rather than 2048 seems
better.

+#wal_buffers = -1  # min 32kB, -1 sets based on 
shared_buffers

Typo: s/32kB/64kB
  


Thanks, I've fixed all these issues and attached a new full patch, 
pushed to github, etc.  Tests give the same results back, and it's nice that 
it scales to reasonable behavior if someone changes their XLOG segment size.


It should be possible to set the value back to the older minimum value 
of 32kB too.  That doesn't actually seem to work though; when I try it 
I get:


$ psql -c "SELECT name,unit,boot_val,setting,current_setting(name) FROM 
pg_settings WHERE name IN ('wal_buffers','shared_buffers')"

 name  | unit | boot_val | setting | current_setting
+--+--+-+-
shared_buffers | 8kB  | 1024 | 131072  | 1GB
wal_buffers| 8kB  | -1   | 8   | 64kB

I was expecting that setting to be "4" there instead, for 32kB.  So 
there's probably some minor bug left in where I inserted this into the 
initialization sequence.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8e2a2c5..1c13f20 100644
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
*** SET ENABLE_SEQSCAN TO OFF;
*** 1638,1649 


 
! The amount of memory used in shared memory for WAL data.  The
! default is 64 kilobytes (64kB).  The setting need only
! be large enough to hold the amount of WAL data generated by one
! typical transaction, since the data is written out to disk at
! every transaction commit.  This parameter can only be set at server
! start.
 
  
 
--- 1638,1659 


 
! The amount of shared memory used for storing WAL data.  The
! default setting of -1 adjusts this automatically based on the size
! of shared_buffers, making it 1/32 (about 3%) of
! the size of that normally larger shared memory block.  Automatically
! set values are limited to a maximum sufficient to hold one WAL
! 		segment worth of data, normally 16 megabytes (16MB).
! The smallest allowable setting is 32 kilobytes (32kB).
!
! 
!
! Since the data is written out to disk at every transaction commit,
! the setting may only need to be large enough to hold the amount
! of WAL data generated by one typical transaction.  Larger values,
! typically at least a few megabytes, can improve write performance
! on a busy server where many clients are committing at once.
! Extremely large settings are unlikely to provide additional benefit.
 
  
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5b6a230..d057773 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 69,75 
  /* User-settable parameters */
  int			CheckPointSegments = 3;
  int			wal_keep_segments = 0;
! int			XLOGbuffers = 8;
  int			XLogArchiveTimeout = 0;
  bool		XLogArchiveMode = false;
  char	   *XLogArchiveCommand = NULL;
--- 69,76 
  /* User-settable parameters */
  int			CheckPointSegments = 3;
  int			wal_keep_segments = 0;
! int			XLOGbuffers = -1;
! int			XLOGbuffersMin = 8;
  int			XLogArchiveTimeout = 0;
  bool		XLogArchiveMode = false;
  char	   *XLogArchiveCommand = NULL;
*** GetSystemIdentifier(void)
*** 4780,4790 
--- 4781,4819 
  /*
   * Initialization of shared memory for XLOG
   */
+ 
+ void XLOGTuneNumBuffers(void)
+ {
+ 	int one_segment = XLOG_SEG_SIZE / XLOG_BLCKSZ;
+ 
+ 	/*
+ 	 * If automatic setting was requested, use about 3% as much memory as
+ 	 * requested for the buffer cache.  Clamp the automatic maximum to the
+ 	 * size of one XLOG segment, while still allowing a larger manual
+ 	 * setting.  Don't go below the default setting in earlier versions:
+ 	 * twice the size of the minimum, which at default build options is 64kB.
+ 	 */	 
+ 	if (XLOGbuffers == -1)  /* set automatically */
+ 		{
+ 		XLOGbuffers = NBuffers / 32;
+ 		if (XLOGbuffers > one_segment)
+ 			XLOGbuffers = one_segment;
+ 		if (XLOGbuffers < (XLOGbuffersMin * 2))
+ 			XLOGbuffers = XLOGbuffersMin * 2;
+ 		}
+ 
+ 	/* Enforce a 32KB minimum in every case */
+ 	if (XLOGbuffers < XLOGbuffersMin)
+ 		XLOGbuffers = XLOGbuffersMin;
+ }
+ 
  Size
  XLOGShmemSize(void)
  {
  	Size		size;
  
+ 	XLOGTuneNumBuffers();
+ 	
  	/

Re: [HACKERS] Add support for logging the current role

2011-01-15 Thread Andrew Dunstan



On 01/15/2011 11:08 AM, Tom Lane wrote:

Robert Haas  writes:

On Fri, Jan 14, 2011 at 9:24 PM, Andrew Dunstan  wrote:

What's your suggestion, then?

If there's a practical way to add the requested escape, add it to the
text format and leave reengineering the CSV format for another day.
Yeah, I know that's not the most beautiful solution in the world, but
we're doing engineering here, not theology.

Well, the original patch was exactly that.  But I don't agree with that
approach; I think allowing the capabilities of text and CSV logs to
diverge significantly would be a mistake.  If a piece of information is
valuable enough to need a way to include it in textual log entries,
then you need a way to include it in CSV log entries too.  If it's not
valuable enough to do the work to support it in CSV, then we can live
without it.




Yeah, I agree, that's exactly the kind of divergence we usually try to 
avoid. And it's hardly theology to say let's not do a half-assed job on 
this.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] log_hostname and pg_stat_activity

2011-01-15 Thread Peter Eisentraut
On tor, 2010-12-23 at 22:21 +0100, Magnus Hagander wrote:
> On Thu, Dec 23, 2010 at 22:09, Peter Eisentraut  wrote:
> > Somehow I fantasized that log_hostname would also turn
> > pg_stat_activity.client_addr into names instead of IP addresses.  It
> > doesn't, but perhaps it should?  It would be nice to be able to
> > configure an IP-address free setup.  Or would it be worth having a
> > separate configuration parameter for that?
> 
> It should certainly be renamed to something else if it does both, but
> I don't see the point of having two separate parameters between them.
> As long as you can use a cached version of the lookup, you're only
> paying the price once, after all...
> 
> However, pg_stat_activity.client_addr is an inet field, not a text
> string, so you'd have to invent a separate field for it I think...

Here is a patch that adds a client_hostname field to pg_stat_activity.
It takes the hostname if it is available, either because log_hostname is
set or because the pg_hba.conf processing resolved it.  So I think for most
people who would care about this thing, it would "just work".

I'm not so sure about the pgstat.c internals.  Do we need the separate
buffer in shared memory, as in this patch, or could we just copy the
name directly into the PgBackendStatus struct?  I guess not the latter,
since my compiler complains about copying a string into a volatile
pointer.

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 718e996..dd92728 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -482,6 +482,7 @@ CREATE VIEW pg_stat_activity AS
 U.rolname AS usename,
 S.application_name,
 S.client_addr,
+S.client_hostname,
 S.client_port,
 S.backend_start,
 S.xact_start,
@@ -499,6 +500,7 @@ CREATE VIEW pg_stat_replication AS
 U.rolname AS usename,
 S.application_name,
 S.client_addr,
+S.client_hostname,
 S.client_port,
 S.backend_start,
 W.state,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 301568f..3a00157 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2223,6 +2223,7 @@ pgstat_fetch_global(void)
 
 static PgBackendStatus *BackendStatusArray = NULL;
 static PgBackendStatus *MyBEEntry = NULL;
+static char *BackendClientHostnameBuffer = NULL;
 static char *BackendAppnameBuffer = NULL;
 static char *BackendActivityBuffer = NULL;
 
@@ -2244,7 +2245,7 @@ BackendStatusShmemSize(void)
 }
 
 /*
- * Initialize the shared status array and activity/appname string buffers
+ * Initialize the shared status array and several string buffers
  * during postmaster startup.
  */
 void
@@ -2286,6 +2287,24 @@ CreateSharedBackendStatus(void)
 		}
 	}
 
+	/* Create or attach to the shared client hostname buffer */
+	size = mul_size(NAMEDATALEN, MaxBackends);
+	BackendClientHostnameBuffer = (char *)
+		ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
+
+	if (!found)
+	{
+		MemSet(BackendClientHostnameBuffer, 0, size);
+
+		/* Initialize st_clienthostname pointers. */
+		buffer = BackendClientHostnameBuffer;
+		for (i = 0; i < MaxBackends; i++)
+		{
+			BackendStatusArray[i].st_clienthostname = buffer;
+			buffer += NAMEDATALEN;
+		}
+	}
+
 	/* Create or attach to the shared activity buffer */
 	size = mul_size(pgstat_track_activity_query_size, MaxBackends);
 	BackendActivityBuffer = (char *)
@@ -2386,16 +2405,21 @@ pgstat_bestart(void)
 	beentry->st_databaseid = MyDatabaseId;
 	beentry->st_userid = userid;
 	beentry->st_clientaddr = clientaddr;
+	beentry->st_clienthostname[0] = '\0';
 	beentry->st_waiting = false;
 	beentry->st_appname[0] = '\0';
 	beentry->st_activity[0] = '\0';
 	/* Also make sure the last byte in each string area is always 0 */
+	beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
 	beentry->st_appname[NAMEDATALEN - 1] = '\0';
 	beentry->st_activity[pgstat_track_activity_query_size - 1] = '\0';
 
 	beentry->st_changecount++;
 	Assert((beentry->st_changecount & 1) == 0);
 
+	if (MyProcPort && MyProcPort->remote_hostname)
+		strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname, NAMEDATALEN);
+
 	/* Update app name to current GUC setting */
 	if (application_name)
 		pgstat_report_appname(application_name);
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a95ba8b..a608495 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -504,7 +504,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 
 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
 
-		tupdesc = CreateTemplateTupleDesc(11, false);
+		tupdesc = CreateTemplateTupleDesc(12, false);
 		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "datid", OIDOID, -1, 0);
 		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "procpid", INT

Re: [HACKERS] We need to log aborted autovacuums

2011-01-15 Thread Tom Lane
Greg Smith  writes:
> Does try_relation_open need to have a lock acquisition timeout when AV 
> is calling it?

Hmm.  I think when looking at the AV code, I've always subconsciously
assumed that try_relation_open would fail immediately if it couldn't get
the lock.  That certainly seems like it would be a more appropriate way
to behave than delaying indefinitely.

In practice, the scenario you're worried about seems unlikely to persist
indefinitely: as soon as someone else comes along and blocks behind
autovacuum's request, the deadlock checker will kick AV off the lock
queue.  But it would probably be better not to depend on that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add support for logging the current role

2011-01-15 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jan 14, 2011 at 9:24 PM, Andrew Dunstan  wrote:
>> What's your suggestion, then?

> If there's a practical way to add the requested escape, add it to the
> text format and leave reengineering the CSV format for another day.
> Yeah, I know that's not the most beautiful solution in the world, but
> we're doing engineering here, not theology.

Well, the original patch was exactly that.  But I don't agree with that
approach; I think allowing the capabilities of text and CSV logs to
diverge significantly would be a mistake.  If a piece of information is
valuable enough to need a way to include it in textual log entries,
then you need a way to include it in CSV log entries too.  If it's not
valuable enough to do the work to support it in CSV, then we can live
without it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Tom Lane
Heikki Linnakangas  writes:
> On 15.01.2011 17:30, Tom Lane wrote:
>> So what?  The needed actions will be covered by WAL replay.

> No, they won't, if pg_base_backup() is called *after* getting the list 
> of tablespaces.

Ah.  Then the fix is to change the order in which those things are done.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] kill -KILL: What happens?

2011-01-15 Thread Florian Pflug
On Jan14, 2011, at 17:45 , Robert Haas wrote:
> On Fri, Jan 14, 2011 at 11:28 AM, Florian Pflug  wrote:
>> I gather that the behaviour we want is for normal backends to exit
>> once the postmaster is gone, and for utility processes (bgwriter, ...)
>> to exit once all the backends are gone.
>> 
>> The test program I posted in this thread proves that FIFOs and select()
>> can be used to implement this, if we're ready to check for EOF on the
>> socket in CHECK_FOR_INTERRUPTS() every few seconds. Is this a viable
>> route to take?
> 
> I don't think there's much point in getting excited about the order in
> which things exit.  If we're agreed (and we seem to be, modulo Tom)
> that the backends should exit quickly if the postmaster dies, then
> worrying about whether the utility processes exit slightly before or
> slightly after that doesn't excite me very much.

I've realized that POSIX actually *does* provide a way to receive a signal -
the SIGIO machinery. I've modified my test case to do that. To simplify things,
I've removed support for multiple life sign objects.

The code now does the following:

The parent creates a pipe, sets its reading fd to O_NONBLOCK and O_ASYNC,
and registers a SIGIO handler. The SIGIO handler checks a global flag, and
simply sends a SIGTERM to its own pid if the flag is set.

Child processes close the pipe's writing end (called "giving up ownership
of the life sign" in the code) and set the global flag if they want to receive
a SIGTERM once the parent is gone. The parent's health state can additionally
be checked at any time by trying to read() from the pipe. read() returns
EAGAIN as long as the parent is still alive and EOF otherwise.
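
For illustration, here is a minimal standalone sketch of that scheme (not the
actual liveness.c attachment; names are made up and error handling is omitted),
assuming the SIGIO-on-EOF behavior described above:

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static volatile sig_atomic_t exit_on_parent_death = 0;
static int life_sign_fd = -1;

static void
sigio_handler(int signo)
{
    char c;

    /* EOF on the life sign pipe means the parent has gone away */
    if (exit_on_parent_death && read(life_sign_fd, &c, 1) == 0)
        kill(getpid(), SIGTERM);
}

int
main(void)
{
    int pipefd[2];

    if (pipe(pipefd) < 0)
        exit(1);

    if (fork() == 0)
    {
        /* child: give up ownership of the life sign (the write end) ... */
        close(pipefd[1]);
        life_sign_fd = pipefd[0];
        exit_on_parent_death = 1;

        /* ... and ask for SIGIO when the read end changes state */
        signal(SIGIO, sigio_handler);
        fcntl(life_sign_fd, F_SETOWN, getpid());
        fcntl(life_sign_fd, F_SETFL, O_NONBLOCK | O_ASYNC);

        printf("backend %d waiting for parent\n", (int) getpid());
        for (;;)
            pause();        /* SIGTERM's default action ends the child */
    }

    /* parent: hold the write end open for a while, then exit */
    close(pipefd[0]);
    sleep(3);
    printf("parent exiting\n");
    return 0;               /* closing the write end wakes up the child */
}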

I'm not sure how portable this is. It compiles and runs fine on both my linux
machine (Ubuntu 10.04.01 LTS) and my laptop (OSX 10.6.6). 

In the EXEC_BACKEND case the pipe would need to be created with mkfifo() in
the data directory, but otherwise things should work the same. Haven't tried
that yet, though.

Code attached. The output should be

Launched backend 8636
Launched backend 8637
Launched backend 8638
Backend 8636 detected live parent
Backend 8637 detected live parent
Backend 8638 detected live parent
Backend 8636 detected live parent
Backend 8637 detected live parent
Backend 8638 detected live parent
Parent exiting
Backend 8637 exiting after parent died
Backend 8638 exiting after parent died
Backend 8636 exiting after parent died

if things work correctly.

best regards,
Florian Pflug


liveness.c
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in pg_describe_object, patch v2

2011-01-15 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jan 13, 2011 at 1:24 PM, Tom Lane  wrote:
>> Read the CREATE OPERATOR CLASS source code.  (It likely would be best to
>> refactor that a bit so it would expose some way to obtain the implied
>> defaults --- I don't think that's done explicitly now, and it's
>> certainly not exported from opclasscmds.c.)

> I didn't say I *couldn't* take the time to understand this; I said I
> think it'd be more clear to more people if we just printed the types
> all the time.

No, it would just be useless noise that would confuse most people,
especially since the default behavior varies across AMs.  Please observe
that not one single one of the contrib modules has any occasion to
specify non-default amprocleft/right --- that should give you a fix on
how useful it is to worry about what they are.  If it weren't for
binary-compatibility kluges we wouldn't need the ability to specify
these at all.

But I can read the handwriting on the wall: if I want this done right,
I'm going to have to do it myself.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Heikki Linnakangas

On 15.01.2011 17:30, Tom Lane wrote:

Heikki Linnakangas  writes:

BTW, I just spotted a small race condition between creating a new table
space and base backup. We take a snapshot of all the tablespaces in
pg_tblspc before calling pg_start_backup(). If someone creates a new
tablespace and puts some data in it in the window between base backup
acquiring the list tablespaces and starting the backup, the new
tablespace won't be included in the backup.


So what?  The needed actions will be covered by WAL replay.


No, they won't, if pg_base_backup() is called *after* getting the list 
of tablespaces.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Greg Smith

Robert Haas wrote:

I'll believe it when I see it.  How about this:

a 1
a 2
sync a
b 1
b 2
sync b
c 1
c 2
sync c

Or maybe some variant, where we become willing to fsync a file a
certain number of seconds after writing the last block, or when all
the writes are done, whichever comes first.


That's going to give worse performance than the current code in some 
cases.  The goal of what's in there now is that you get a sequence like 
this:


a1
b1
a2
[Filesystem writes a1]
b2
[Filesystem writes b1]
sync a [Only has to write a2]
sync b [Only has to write b2]

This idea works until you get to where the filesystem write cache is so 
large that it becomes lazier about writing things.  The fundamental 
idea--push writes out some time before the sync, in hopes the filesystem 
will get to them before the sync arrives--is not unsound.  On some systems, 
doing the sync more aggressively than that will be a regression.  This 
approach just breaks down in some cases, and those cases are happening 
more now because their likelihood scales with total RAM.  I don't want 
to screw the people with smaller systems, who may be getting 
considerable benefit from the existing sequence.  Today's little 
systems--which are very similar to the high-end ones the spread 
checkpoint stuff was developed on during 8.3--do get some benefit from 
it as far as I know.


Anyway, now that the ability to get logging on all this stuff went in 
during the last CF, it's way easier to just set up a random system to run 
tests in this area than it used to be.  Whatever testing does happen 
should include, say, a 2GB laptop with a single hard drive in it.  I 
think that's the bottom of what is reasonable to consider as a target 
for tweaking write performance on, given the hardware 9.1 is likely 
to be deployed on.



How does the checkpoint target give you any time to sync them?  Unless
you squeeze the writes together more tightly, but that seems sketchy.
  


Obviously the checkpoint target idea needs to be shuffled around some 
too.  I was thinking of making the new default 0.8, and having it split 
the time in half for write and sync.  That will make the write phase 
close to the speed people are seeing now, at the default of 0.5, while 
giving some window for spread sync too.  The exact way to redistribute 
that around I'm not so concerned about yet.  When I get to where that's 
the most uncertain thing left I'll benchmark the TPS vs. latency 
trade-off and see what happens.  If the rest of the code is good enough 
but this just needs to be tweaked, that's a perfect thing to get beta 
feedback to finalize.



Well you don't have to put it in shared memory on account of any of
that.  You can just hang it on a global variable.
  


Hmm.  Because it's so similar to other things being allocated in shared 
memory, I just automatically pushed it over to there.  But you're right; 
it doesn't need to be that complicated.  Nobody is touching it but the 
background writer.



If we can find something that's a modest improvement on the
status quo and we can be confident in quickly, good, but I'd rather
have 9.1 go out the door on time without fully fixing this than delay
the release.
  


I'm not somebody who needs to be convinced of that.  There are two near 
commit quality pieces of this out there now:


1) Keep some BGW cleaning and fsync absorption going while sync is 
happening, rather than starting it and ignoring everything else until 
it's done.


2) Compact fsync requests when the queue fills

If that's all we can get for 9.1, it will still be a major improvement.  
I realize I only have a very short period of time to complete a major 
integration breakthrough on the pieces floating around before the goal 
here has to drop to something less ambitious.  I head to the West Coast 
for a week on the 23rd; I'll be forced to throw in the towel at that 
point if I can't get the better ideas we have in pieces here all 
assembled well by then.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Tom Lane
Heikki Linnakangas  writes:
> BTW, I just spotted a small race condition between creating a new table 
> space and base backup. We take a snapshot of all the tablespaces in 
> pg_tblspc before calling pg_start_backup(). If someone creates a new 
> tablespace and puts some data in it in the window between base backup 
> acquiring the list tablespaces and starting the backup, the new 
> tablespace won't be included in the backup.

So what?  The needed actions will be covered by WAL replay.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LOCK for non-tables

2011-01-15 Thread Tom Lane
Robert Haas  writes:
> On Sat, Jan 15, 2011 at 6:29 AM, Simon Riggs  wrote:
>> I think we should have a section in the release notes on Deprecated
>> Features, noting that certain things will be removed later and should be
>> changed now and not relied upon in the future. A pending
>> incompatibilities list.

> Agreed.  Of course, the problem is sometimes we don't do what we say
> we're going to do, but it's worth a try.

I think if we had a formal list of planned removals, it'd be more likely
that they'd actually happen.  Right now there's no process at all
driving such things forward.

I suggest also marking each item with a release in which we intend to do
it, so we don't have to try to remember whether a reasonable amount of
time has elapsed.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ALTER TYPE 0: Introduction; test cases

2011-01-15 Thread Noah Misch
On Sat, Jan 15, 2011 at 08:57:30AM -0500, Robert Haas wrote:
> On Sat, Jan 15, 2011 at 1:30 AM, Noah Misch  wrote:
> > Here's v2 based on your feedback.
> >
> > I pruned test coverage down to just the highlights.  By the end of patch
> > series, the net change becomes +67 to alter_table.sql and +111 to
> > alter_table.out.  The alter_table.out delta is larger in this patch (+150),
> > because the optimizations don't yet apply and more objects are reported as
> > rebuilt.
> >
> > If this looks sane, I'll rebase the rest of the patches accordingly.
> 
> + * Table, NOT NULL and DEFAULT constraints and the "oid" system column do
> + * not (currently) follow the row type, so they require no attention here.
> + * The non-propagation of DEFAULT and NOT NULL make ADD COLUMN safe, too.
> 
> This comment seems somewhat unrelated to the rest of the patch, and I
> really hope that the first word ("Table") actually means "CHECK",
> because we certainly shouldn't be treating table check constraints and
> column check constraints differently, at least AIUI.

"Table" should be "CHECK"; thanks.  I added the comment in this patch because
it's clarifying existing behavior that was not obvious to me, rather than
documenting a new fact arising due to the patch series.  A few of the new test
cases address this behavior.

> > That was a good idea, but the implementation is awkward.  index_build has
> > the TOAST table and index relations, and there's no good way to find the
> > parent table from either.  No index covers pg_class.reltoastrelid, so
> > scanning by that would be a bad idea.  Autovacuum solves this by building
> > its own hash table with the mapping; that wouldn't fit well here.  We could
> > parse the parent OID out of the TOAST name (heh, heh).  I suppose we could
> > pass the parent relation from create_toast_table down through index_create
> > to index_build.  Currently, index_create knows nothing of the fact that
> > it's making a TOAST index, and changing that to improve these messages
> > seems a bit excessive.  Thoughts?
> >
> > For this version, I've kept the DEBUG1/DEBUG2 split by TOAST.
> 
> Well, I pretty much think that split is going to be not what anyone
> wants for any purpose OTHER than the regression tests.  So if there's
> no other way around it I'd be much more inclined to remove the
> regression tests than to keep that split.

Do you value test coverage so little?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] limiting hint bit I/O

2011-01-15 Thread Martijn van Oosterhout
On Fri, Jan 14, 2011 at 05:24:31PM -0800, Josh Berkus wrote:
> On 1/14/11 11:51 AM, Tom Lane wrote:
> > The people whose tables are mostly insert-only complain about it, but
> > that's not the majority of our userbase IMO.  We just happen to have a
> > couple of particularly vocal ones, like Berkus.
> 
> It might or might not be the majority, but it's an extremely common case
> affecting a lot of users.  Many, if not most, software applications have
> a "log" table (or two, or three) which just accumulates rows, and when
> that log table gets a vacuum freeze it pretty much halts the database in
> its tracks.  Between my client practice and IRC, I run across complaints
> about this issue around 3 times a month.

If the problem is that all the freezing happens at once, then ISTM the
solution is to add a random factor. Say, when a tuple just passes the
lower threshold it has a 1% chance of being frozen. The chance grows
until it is 100% as it reaches the upper threshold.

This should reduce the freezing traffic to a constant (hopefully
manageable) stream, since as the chance of freezing increases the
amount of data to be frozen goes down, so they should cancel somewhat.

To avoid rewriting pages multiple times, if one tuple can be frozen on
a page, we should freeze as many as possible, but the logic may do that
already.
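
For illustration, a rough standalone sketch of the ramp described above (the
function name and thresholds are made up; this is not vacuum code, just the
idea):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Decide whether to freeze a tuple whose age lies between a lower and an
 * upper threshold: never below the lower one, always at the upper one,
 * and with a linearly increasing chance in between.
 */
static bool
should_freeze(uint32_t tuple_age, uint32_t lower, uint32_t upper)
{
    double chance;

    if (tuple_age < lower)
        return false;
    if (tuple_age >= upper)
        return true;

    /* chance ramps from 0 to 1 as the age moves between the thresholds */
    chance = (double) (tuple_age - lower) / (double) (upper - lower);
    return (double) rand() / ((double) RAND_MAX + 1.0) < chance;
}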

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first. 
>   - Charles de Gaulle


signature.asc
Description: Digital signature


[HACKERS] LAST CALL FOR 9.1

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 9:25 AM, Greg Smith  wrote:
> I'm going to share it before someone announces a deadline has passed or
> something.  (whistling)

Speaking of that, it is now Saturday, January 15, 2011 and CommitFest
2011-01 is now in progress.  If you have any patches that you haven't
submitted yet, have submitted but not listed in the CommitFest app
yet, or (loud ahem) you have any patches that you listed on the
CommitFest app with a fake message id but haven't actually submitted
yet, please plan to rectify that in the next 8 hours or so, or else
add them to CF 2011-Next, which is now the open CommitFest.

It is really already too late for us to be seriously considering
integrating sync rep into 9.1.  It will lead to another enormous beta
period during which the tree will be closed to new patches and
everyone will complain, or else we'll open the tree for 9.2
development and a different though overlapping set of people will
complain about that, but if I try to bring down the gavel and actually
insist that we don't consider sync rep, then a third, different, also
overlapping set of people will complain about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Streaming base backups

2011-01-15 Thread Heikki Linnakangas

On 14.01.2011 13:38, Magnus Hagander wrote:

On Fri, Jan 14, 2011 at 11:19, Heikki Linnakangas
  wrote:

On 14.01.2011 08:45, Fujii Masao wrote:

1. Smart shutdown is requested while walsender is sending a backup.
2. Shutdown causes archiver to end.
  (Though shutdown sends SIGUSR2 to walsender to exit, walsender
   running backup doesn't respond for now)
3. At the end of backup, walsender calls do_pg_stop_backup, which
  forces a switch to a new WAL file and waits until the last WAL file
has
  been archived.
  *BUT*, since archiver has already been dead, walsender waits for
  that forever.


Not only does it wait forever, but it writes the end-of-backup WAL record
after bgwriter has already exited and written the shutdown checkpoint
record.

I think postmaster should treat a walsender as a regular backend, until it
has started streaming.

We can achieve that by starting up the child as PM_CHILD_ACTIVE, and
changing the state to PM_CHILD_WALSENDER later, when streaming is started.
Looking at the postmaster.c, that should be safe, postmaster will treat a
backend as a regular backend anyway until it has connected to shared memory.
It is *not* safe to switch a walsender back to a regular process, but we
have no need to do that.


Seems reasonable to me.

I've applied a patch that exits base backups when the postmaster is
shutting down - I'm happily waiting for Heikki to submit one that
changes the shutdown logic in the postmaster :-)


Ok, committed a fix for that.

BTW, I just spotted a small race condition between creating a new table 
space and base backup. We take a snapshot of all the tablespaces in 
pg_tblspc before calling pg_start_backup(). If someone creates a new 
tablespace and puts some data in it in the window between base backup 
acquiring the list tablespaces and starting the backup, the new 
tablespace won't be included in the backup.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 9:25 AM, Greg Smith  wrote:
> Once upon a time we got a patch from Itagaki Takahiro whose purpose was to
> sort writes before sending them out:
>
> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

Ah, a fine idea!

> Which has very low odds of the sync on "a" finishing quickly, we'd get this
> one:
>
> table block
> a 1
> a 2
> b 1
> b 2
> c 1
> c 2
> sync a
> sync b
> sync c
>
> Which sure seems like a reasonable way to improve the odds data has been
> written before the associated sync comes along.

I'll believe it when I see it.  How about this:

a 1
a 2
sync a
b 1
b 2
sync b
c 1
c 2
sync c

Or maybe some variant, where we become willing to fsync a file a
certain number of seconds after writing the last block, or when all
the writes are done, whichever comes first.  It seems to me that it's
going to be a bear to figure out what fraction of the checkpoint
you've completed if you put all of the syncs at the end, and this
whole problem appears to be predicated on the assumption that the OS
*isn't* writing out in a timely fashion.  Are we sure that postponing
the fsync relative to the writes is anything more than wishful
thinking?

> Also, I could just traverse the sorted list with some simple logic to count
> the number of unique files, and then set the delay between fsync writes
> based on it.  In the above, once the list was sorted, easy to just see how
> many times the table name changes on a linear scan of the sorted data.  3
> files, so if the checkpoint target gives me, say, a minute of time to sync
> them, I can delay 20 seconds between.  Simple math, and exactly the sort I

How does the checkpoint target give you any time to sync them?  Unless
you squeeze the writes together more tightly, but that seems sketchy.

> So I fixed the bitrot on the old sorted patch, which was fun as it came from
> before the 8.3 changes.  It seemed to work.  I then moved the structure it
> uses to hold the list of buffers to write, the thing that's sorted, into
> shared memory.  It's got a predictable maximum size, relying on palloc in
> the middle of the checkpoint code seems bad, and there's some potential gain
> from not reallocating it every time through.

Well you don't have to put it in shared memory on account of any of
that.  You can just hang it on a global variable.

> There's good bits in the patch I submitted for the last CF and in the patch
> you wrote earlier this week.  This unfinished patch may be a valuable idea
> to fit in there too once I fix it, or maybe it's fundamentally flawed and
> one of the other ideas you suggested (or I have sitting on the potential
> design list) will work better.  There's a patch integration problem that
> needs to be solved here, but I think almost all the individual pieces are
> available.  I'd hate to see this fail to get integrated now just for lack of
> time, considering the problem is so serious when you run into it.

Likewise, but committing something half-baked is no good either.  I
think we're in a position to crush the full-fsync-queue problem flat
(my patch should do that, and there are several other obvious things
we can do for extra certainty) but the problem of spreading out the
fsyncs looks to me like something we don't completely know how to
solve.  If we can find something that's a modest improvement on the
status quo and we can be confident in quickly, good, but I'd rather
have 9.1 go out the door on time without fully fixing this than delay
the release.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Greg Smith

Robert Haas wrote:

Idea #2: At the beginning of a checkpoint when we scan all the
buffers, count the number of buffers that need to be synced for each
relation.  Use the same hashtable that we use for tracking pending
fsync requests.  Then, interleave the writes and the fsyncs...

Idea #3: Stick with the idea of a fixed delay between fsyncs, but
compute how many fsyncs you think you're ultimately going to need at
the start of the checkpoint, and back up the target completion time by
3 s per fsync from the get-go, so that the checkpoint still finishes
on schedule.
  


What I've been working on is something halfway between these two ideas.  
I have a patch, and it doesn't work right yet because I just broke it, 
but since I have some faint hope this will all come together any minute 
now I'm going to share it before someone announces a deadline has passed 
or something.  (whistling).  I'm going to add this messy thing and the 
patch you submitted upthread to the CF list; I'll review yours, I'll 
either fix the remaining problem in this one myself or rewrite to one of 
your ideas, and then it's onto a round of benchmarking.


Once upon a time we got a patch from Itagaki Takahiro whose purpose was 
to sort writes before sending them out:


http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

This didn't really work repeatably for everyone because of the now well 
understood ext3 issues--I never replicated that speedup at the time for 
example.  And this was before the spread checkpoint code was in 8.3.  
The hope was that it wasn't really going to be necessary after that anyway.


Back to today...instead of something complicated, it struck me that if I 
just had a count of exactly how many files were involved in each 
checkpoint, that would be helpful.  I could keep the idea of a fixed 
delay between fsyncs, but just auto-tune that delay amount based on the 
count.  And how do you count the number of unique things in a list?  
Well, you can always sort them.  I thought that if the sorted writes 
patch got back to functional again, it could serve two purposes.  It 
would group all of the writes for a file together, and if you did the 
syncs in the same sorted order they would have the maximum odds of 
discovering the data was already written.  So rather than this possible 
order:


table block
a 1
b 1
c 1
c 2
b 2
a 2
sync a
sync b
sync c

Which has very low odds of the sync on "a" finishing quickly, we'd get 
this one:


table block
a 1
a 2
b 1
b 2
c 1
c 2
sync a
sync b
sync c

Which sure seems like a reasonable way to improve the odds data has been 
written before the associated sync comes along.


Also, I could just traverse the sorted list with some simple logic to 
count the number of unique files, and then set the delay between fsync 
writes based on it.  In the above, once the list was sorted, easy to 
just see how many times the table name changes on a linear scan of the 
sorted data.  3 files, so if the checkpoint target gives me, say, a 
minute of time to sync them, I can delay 20 seconds between.  Simple 
math, and exactly the sort I used to get reasonable behavior on the busy 
production system this all started on.  There's some unresolved 
trickiness in the segment-driven checkpoint case, but one thing at a time.
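
To make that arithmetic concrete, here is a minimal standalone sketch of 
the counting-and-spacing logic (the struct and names are invented for 
illustration, not taken from the actual patch):

#include <stdio.h>

/* Hypothetical stand-in for one entry of the sorted checkpoint write list. */
typedef struct
{
    int         relfilenode;    /* the "table" column in the example above */
    int         blocknum;
} BufferWrite;

/* Sorted input: a new file starts whenever the key changes. */
static int
count_unique_files(const BufferWrite *writes, int nwrites)
{
    int         nfiles = 0;
    int         i;

    for (i = 0; i < nwrites; i++)
    {
        if (i == 0 || writes[i].relfilenode != writes[i - 1].relfilenode)
            nfiles++;
    }
    return nfiles;
}

int
main(void)
{
    /* The sorted example from above: a 1, a 2, b 1, b 2, c 1, c 2 */
    BufferWrite writes[] = {{1, 1}, {1, 2}, {2, 1}, {2, 2}, {3, 1}, {3, 2}};
    int         nwrites = sizeof(writes) / sizeof(writes[0]);
    int         nfiles = count_unique_files(writes, nwrites);
    int         sync_window = 60;   /* seconds the target leaves for syncing */

    /* 3 files in a 60 second window => one fsync every 20 seconds */
    printf("%d files, %d seconds between fsyncs\n",
           nfiles, sync_window / nfiles);
    return 0;
}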


So I fixed the bitrot on the old sorted patch, which was fun as it came 
from before the 8.3 changes.  It seemed to work.  I then moved the 
structure it uses to hold the list of buffers to write, the thing that's 
sorted, into shared memory.  It's got a predictable maximum size, 
relying on palloc in the middle of the checkpoint code seems bad, and 
there's some potential gain from not reallocating it every time through.


Somewhere along the way, it started doing this instead of what I wanted:

BadArgument("!(((header->context) != ((void *)0) && 
((((Node*)((header->context)))->type) == T_AllocSetContext)))", File: 
"mcxt.c", Line: 589)


(that's from initdb, not a good sign)

And it's left me wondering whether this whole idea is a dead end I used 
up my window of time wandering down.


There's good bits in the patch I submitted for the last CF and in the 
patch you wrote earlier this week.  This unfinished patch may be a 
valuable idea to fit in there too once I fix it, or maybe it's 
fundamentally flawed and one of the other ideas you suggested (or I have 
sitting on the potential design list) will work better.  There's a patch 
integration problem that needs to be solved here, but I think almost all 
the individual pieces are available.  I'd hate to see this fail to get 
integrated now just for lack of time, considering the problem is so 
serious when you run into it.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c

Re: [HACKERS] Bug in pg_describe_object, patch v2

2011-01-15 Thread Robert Haas
On Thu, Jan 13, 2011 at 1:24 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Wed, Jan 12, 2011 at 7:47 PM, Tom Lane  wrote:
>>> IMO, what this patch needs is to not output the types unless they are
>>> actually different from the default (which can be inferred from the AM
>>> type and the function arguments). That would fix my concern about it
>>> emitting information that is 99.44% useless.
>
>> I guess we could do that, but I don't understand how you're supposed
>> to infer them, which means probably a lot of other people won't
>> either.
>
> Read the CREATE OPERATOR CLASS source code.  (It likely would be best to
> refactor that a bit so it would expose some way to obtain the implied
> defaults --- I don't think that's done explicitly now, and it's
> certainly not exported from opclasscmds.c.)

I didn't say I *couldn't* take the time to understand this; I said I
think it'd be more clear to more people if we just printed the types
all the time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 8:55 AM, Simon Riggs  wrote:
> On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote:
>> Robert Haas wrote:
>> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith  wrote:
>> >
>> > > One of the ideas Simon and I had been considering at one point was adding
>> > > some better de-duplication logic to the fsync absorb code, which I'm
>> > > reminded by the pattern here might be helpful independently of other
>> > > improvements.
>> > >
>> >
>> > Hopefully I'm not stepping on any toes here, but I thought this was an
>> > awfully good idea and had a chance to take a look at how hard it would
>> > be today while en route from point A to point B.  The answer turned
>> > out to be "not very", so PFA a patch that seems to work.  I tested it
>> > by attaching gdb to the background writer while running pgbench, and
>> > it eliminated the backend fsyncs without even breaking a sweat.
>> >
>>
>> No toe damage, this is great, I hadn't gotten to coding for this angle
>> yet at all.  Suffering from an overload of ideas and (mostly wasted)
>> test data, so thanks for exploring this concept and proving it works.
>
> No toe damage either, but are we sure we want the de-duplication logic
> and in this place?
>
> I was originally of the opinion that de-duplicating the list would save
> time in the bgwriter, but that guess was wrong by about two orders of
> magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable.

Well, the point of this is not to save time in the bgwriter - I'm not
surprised to hear that wasn't noticeable.  The point is that when the
fsync request queue fills up, backends start performing an fsync *for
every block they write*, and that's about as bad for performance as
it's possible to be.  So it's worth going to a little bit of trouble
to try to make sure it doesn't happen.  It didn't happen *terribly*
frequently before, but it does seem to be common enough to worry about
- e.g. on one occasion, I was able to reproduce it just by running
pgbench -i -s 25 or something like that on a laptop.

With this patch applied, there's no performance impact vs. current
code in the very, very common case where space remains in the queue -
999 times out of 1000, writing to the fsync queue will be just as fast
as ever.  But in the unusual case where the queue has been filled up,
compacting the queue is much much faster than performing an fsync, and
the best part is that the compaction is generally massive.  I was
seeing things like "4096 entries compressed to 14".  So clearly even
if the compaction took as long as the fsync itself it would be worth
it, because the next 4000+ guys who come along again go through the
fast path.  But in fact I think it's much faster than an fsync.
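
For readers following along, here is a small standalone sketch of the 
de-duplication idea (not the actual patch: the real code works against the 
shared-memory request queue and would use a hash table rather than the 
quadratic scan used here for brevity):

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pending fsync request: one file segment. */
typedef struct
{
    int         relfilenode;
    int         segno;
} FsyncRequest;

/* Compact the queue in place, keeping only the first occurrence of each
 * (file, segment) pair, and return the new length. */
static int
compact_fsync_queue(FsyncRequest *queue, int nrequests)
{
    int         kept = 0;
    int         i, j;

    for (i = 0; i < nrequests; i++)
    {
        bool        duplicate = false;

        for (j = 0; j < kept; j++)
        {
            if (queue[j].relfilenode == queue[i].relfilenode &&
                queue[j].segno == queue[i].segno)
            {
                duplicate = true;
                break;
            }
        }
        if (!duplicate)
            queue[kept++] = queue[i];
    }
    return kept;
}

int
main(void)
{
    FsyncRequest queue[] = {{1, 0}, {2, 0}, {1, 0}, {1, 0}, {2, 0}, {3, 1}};
    int         newlen = compact_fsync_queue(queue, 6);

    printf("compacted 6 requests down to %d\n", newlen);   /* prints 3 */
    return 0;
}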

In order to get pathological behavior even with this patch applied,
you'd need to have NBuffers pending fsync requests and they'd all have
to be different.  I don't think that's theoretically impossible, but
Greg's research seems to indicate that even on busy systems we don't
come even a little bit close to the circumstances that would cause it
to occur in practice.  Every other change we might make in this area
will further improve this case, too: for example, doing an absorb
after each fsync would presumably help, as would the more drastic step
of splitting the bgwriter into two background processes (one to do
background page cleaning, and the other to do checkpoints, for
example).  But even without those sorts of changes, I think this is
enough to effectively eliminate the full fsync queue problem in
practice, which seems worth doing independently of anything else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] ALTER TYPE 0: Introduction; test cases

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 1:30 AM, Noah Misch  wrote:
> Here's v2 based on your feedback.
>
> I pruned test coverage down to just the highlights.  By the end of patch 
> series,
> the net change becomes +67 to alter_table.sql and +111 to alter_table.out.  
> The
> alter_table.out delta is larger in this patch (+150), because the 
> optimizations
> don't yet apply and more objects are reported as rebuilt.
>
> If this looks sane, I'll rebase the rest of the patches accordingly.

+ * Table, NOT NULL and DEFAULT constraints and the "oid" system column do
+ * not (currently) follow the row type, so they require no attention here.
+ * The non-propagation of DEFAULT and NOT NULL make ADD COLUMN safe, too.

This comment seems somewhat unrelated to the rest of the patch, and I
really hope that the first word ("Table") actually means "CHECK",
because we certainly shouldn't be treating table check constraints and
column check constraints differently, at least AIUI.

> That was a good idea, but the implementation is awkward.  index_build has the
> TOAST table and index relations, and there's no good way to find the parent
> table from either.  No index covers pg_class.reltoastrelid, so scanning by 
> that
> would be a bad idea.  Autovacuum solves this by building its own hash table 
> with
> the mapping; that wouldn't fit well here.  We could parse the parent OID out 
> of
> the TOAST name (heh, heh).  I suppose we could pass the parent relation from
> create_toast_table down through index_create to index_build.  Currently,
> index_create knows nothing of the fact that it's making a TOAST index, and
> changing that to improve this messages seems a bit excessive.  Thoughts?
>
> For this version, I've kept the DEBUG1/DEBUG2 split by TOAST.

Well, I pretty much think that split is going to be not what anyone
wants for any purpose OTHER than the regression tests.  So if there's
no other way around it I'd be much more inclined to remove the
regression tests than to keep that split.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Simon Riggs
On Sat, 2011-01-15 at 05:47 -0500, Greg Smith wrote:
> Robert Haas wrote: 
> > On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith  wrote:
> >   
> > > One of the ideas Simon and I had been considering at one point was adding
> > > some better de-duplication logic to the fsync absorb code, which I'm
> > > reminded by the pattern here might be helpful independently of other
> > > improvements.
> > > 
> > 
> > Hopefully I'm not stepping on any toes here, but I thought this was an
> > awfully good idea and had a chance to take a look at how hard it would
> > be today while en route from point A to point B.  The answer turned
> > out to be "not very", so PFA a patch that seems to work.  I tested it
> > by attaching gdb to the background writer while running pgbench, and
> > it eliminated the backend fsyncs without even breaking a sweat.
> >   
> 
> No toe damage, this is great, I hadn't gotten to coding for this angle
> yet at all.  Suffering from an overload of ideas and (mostly wasted)
> test data, so thanks for exploring this concept and proving it works.

No toe damage either, but are we sure we want the de-duplication logic
and in this place?

I was originally of the opinion that de-duplicating the list would save
time in the bgwriter, but that guess was wrong by about two orders of
magnitude, IIRC. The extra time in the bgwriter wasn't even noticeable.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] Add support for logging the current role

2011-01-15 Thread Robert Haas
On Fri, Jan 14, 2011 at 9:24 PM, Andrew Dunstan  wrote:
> On 01/14/2011 08:41 PM, Robert Haas wrote:
>> I think we're in the process of designing a manned mission to Mars to
>> solve the problem that our shoelaces are untied.
> What's your suggestion, then?

If there's a practical way to add the requested escape, add it to the
text format and leave reengineering the CSV format for another day.
Yeah, I know that's not the most beautiful solution in the world, but
we're doing engineering here, not theology.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] kill -KILL: What happens?

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 6:27 AM, Florian Pflug  wrote:
> On Jan14, 2011, at 17:45 , Robert Haas wrote:
>> On Fri, Jan 14, 2011 at 11:28 AM, Florian Pflug  wrote:
>>> I gather that the behaviour we want is for normal backends to exit
>>> once the postmaster is gone, and for utility processes (bgwriter, ...)
>>> to exit once all the backends are gone.
>>>
>>> The test program I posted in this thread proves that FIFOs and select()
>>> can be used to implement this, if we're ready to check for EOF on the
>>> socket in CHECK_FOR_INTERRUPTS() every few seconds. Is this a viable
>>> route to take?
>>
>> I don't think there's much point in getting excited about the order in
>> which things exit.  If we're agreed (and we seem to be, modulo Tom)
>> that the backends should exit quickly if the postmaster dies, then
>> worrying about whether the utility processes exit slightly before or
>> slightly after that doesn't excite me very much.
>
> Tom seems to think that as our utility processes gain importance, one day
> we might require one to outlive all the backends, and that whatever solution
> we adopt should allow us to arrange for that. Or at least this how I
> understood him.

Well, there's certainly ONE of those already: the logging collector.
But it already has its own solution to this problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] LOCK for non-tables

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 6:29 AM, Simon Riggs  wrote:
> On Sat, 2011-01-15 at 12:19 +0100, Florian Pflug wrote:
>> On Jan15, 2011, at 02:03 , Tom Lane wrote:
>> > Robert Haas  writes:
>> >> Me, too.  But I don't agree with your particular choice of small
>> >> syntax adjustment.  Maybe we should just let the issue drop for now.
>> >> Nobody's actually complained about this that I can recall; it's just a
>> >> comment that's been sitting there in pg_dump for ages, and I was
>> >> inspired to think of it again because of the SQL/MED work.  I'm not
>> >> sufficiently in love with this idea to walk through fire for it.
>> >
>> > Agreed.  Once there's some pressing need for it, it'll be easier to make
>> > the case that some amount of incompatibility is acceptable.
>>
>> Assuming that day will come eventually, should we deprecate the LOCK 
>> shortcut now to ease the transition later? If people want that, I could go
>> through the docs and add some appropriate warnings.
>
> Sounds good to me.

+1.

> I think we should have a section in the release notes on Deprecated
> Features, noting that certain things will be removed later and should be
> changed now and not relied upon in the future. A pending
> incompatibilities list.

Agreed.  Of course, the problem is sometimes we don't do what we say
we're going to do, but it's worth a try.

> I would urge people to come up with a much wider list of "things we
> don't like" so we can more easily avoid discussions like this in the
> future. Forward planning helps make change easier.

Optional keywords!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Robert Haas
On Sat, Jan 15, 2011 at 5:47 AM, Greg Smith  wrote:
> No toe damage, this is great, I hadn't gotten to coding for this angle yet
> at all.  Suffering from an overload of ideas and (mostly wasted) test data,
> so thanks for exploring this concept and proving it works.

Yeah - obviously I want to make sure that someone reviews the logic
carefully, since a loss of fsyncs or a corruption of the request queue
could affect system stability, but only very rarely, since you'd need
full fsync queue + crash.  But the code is pretty simple, so it should
be possible to convince ourselves as to its correctness (or
otherwise).  Obviously, major credit to you and Simon for identifying
the problem and coming up with a proposed fix.

> I'm not sure what to do with the rest of the work I've been doing in this
> area here, so I'm tempted to just combine this new bit from you with the
> older patch I submitted, streamline, and see if that makes sense.  Expected
> to be there already, then "how about spending 5 minutes first checking out
> that autovacuum lock patch again" turned out to be a wild underestimate.

I'd rather not combine the patches, because this one is pretty simple
and just does one thing, but feel free to write something that applies
over top of it.  Looking through your old patch (sync-spread-v3),
there seem to be a couple of components there:

- Compact the fsync queue based on percentage fill rather than number
of writes per absorb.  If we apply my queue-compacting logic, do we
still need this?  The queue compaction may hold the BgWriterCommLock
for slightly longer than AbsorbFsyncRequests() would, but I'm not
inclined to jump to the conclusion that this is worth getting excited
about.  The whole idea of accessing BgWriterShmem->num_requests
without the lock gives me the willies anyway - sure, it'll probably
work OK most of the time, especially on x86, but it seems hard to
predict whether there will be occasional bad behavior on platforms
with weak memory ordering.

- Call pgstat_send_bgwriter() at the end of AbsorbFsyncRequests().
Not sure what the motivation for this is.

- CheckpointSyncDelay(), to make sure that we absorb fsync requests
and free up buffers during a long checkpoint.  I think this part is
clearly valuable, although I'm not sure we've satisfactorily solved
the problem of how to spread out the fsyncs and still complete the
checkpoint on schedule.

As to that, I have a couple of half-baked ideas I'll throw out so you
can laugh at them.  Some of these may be recycled versions of ideas
you've already had/mentioned, so, again, credit to you for getting the
ball rolling.

Idea #1: When we absorb fsync requests, don't just remember that there
was an fsync request; also remember the time of said fsync request.
If a new fsync request arrives for a segment for which we're already
remembering an fsync request, update the timestamp on the request.
Periodically scan the fsync request queue for requests older than,
say, 30 s, and perform one such request.   The idea is - if we wrote a
bunch of data to a relation and then haven't touched it for a while,
force it out to disk before the checkpoint actually starts so that the
volume of work required by the checkpoint is lessened.
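
A rough sketch of what Idea #1 might look like, under the assumption of a 
hypothetical per-segment request struct (none of these names exist in the 
backend; flush_one_segment() is a placeholder for the real fsync call):

#include <time.h>
#include <stdbool.h>

#define FLUSH_AFTER_SECS 30

/* Hypothetical pending-fsync entry carrying the time of the last request. */
typedef struct
{
    int         relfilenode;
    int         segno;
    time_t      last_request;   /* refreshed whenever a duplicate arrives */
    bool        pending;
} PendingRequest;

/* Placeholder for issuing the real fsync on the segment's file. */
static void
flush_one_segment(PendingRequest *req)
{
    req->pending = false;
}

/* Run periodically (say, from the bgwriter loop): push out at most one
 * segment that has gone FLUSH_AFTER_SECS seconds without new writes, so the
 * eventual checkpoint has that much less left to sync. */
static void
flush_stale_requests(PendingRequest *reqs, int nreqs)
{
    time_t      now = time(NULL);
    int         i;

    for (i = 0; i < nreqs; i++)
    {
        if (reqs[i].pending && now - reqs[i].last_request >= FLUSH_AFTER_SECS)
        {
            flush_one_segment(&reqs[i]);
            break;              /* one per pass keeps the I/O spread out */
        }
    }
}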

Idea #2: At the beginning of a checkpoint when we scan all the
buffers, count the number of buffers that need to be synced for each
relation.  Use the same hashtable that we use for tracking pending
fsync requests.  Then, interleave the writes and the fsyncs.  Start by
performing any fsyncs that need to happen but have no buffers to sync
(i.e. everything that must be written to that relation has already
been written).  Then, start performing the writes, decrementing the
pending-write counters as you go.  If the pending-write count for a
relation hits zero, you can add it to the list of fsyncs that can be
performed before the writes are finished.  If it doesn't hit zero
(perhaps because a non-bgwriter process wrote a buffer that we were
going to write anyway), then we'll do it at the end.  One problem with
this - aside from complexity - is that most likely most fsyncs would
either happen at the beginning or very near the end, because there's
no reason to assume that buffers for the same relation would be
clustered together in shared_buffers.  But I'm inclined to think that
at least the idea of performing fsyncs for which no dirty buffers
remain in shared_buffers at the beginning of the checkpoint rather
than at the end might have some value.
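
And a similarly rough sketch of the bookkeeping Idea #2 implies (again, 
hypothetical names only; issue_fsync() stands in for the real call):

#include <stdbool.h>

/* Hypothetical per-relation state built while scanning the buffer pool at
 * checkpoint start. */
typedef struct
{
    int         relfilenode;
    int         writes_left;    /* dirty buffers counted for this relation */
    bool        fsync_done;
} RelSyncState;

static void
issue_fsync(RelSyncState *rel)
{
    rel->fsync_done = true;
}

/* Called after each checkpoint write: once everything we planned to write
 * for a relation is out, its fsync can be issued immediately rather than
 * being held until the end of the checkpoint. */
static void
note_buffer_written(RelSyncState *rel)
{
    if (--rel->writes_left == 0 && !rel->fsync_done)
        issue_fsync(rel);
}

/* End of checkpoint: sync whatever never hit zero, e.g. because a backend
 * wrote one of "our" buffers first and the counter stayed above zero. */
static void
finish_checkpoint_fsyncs(RelSyncState *rels, int nrels)
{
    int         i;

    for (i = 0; i < nrels; i++)
    {
        if (!rels[i].fsync_done)
            issue_fsync(&rels[i]);
    }
}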

Idea #3: Stick with the idea of a fixed delay between fsyncs, but
compute how many fsyncs you think you're ultimately going to need at
the start of the checkpoint, and back up the target completion time by
3 s per fsync from the get-go, so that the checkpoint still finishes
on schedule.
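
Idea #3 is really just arithmetic; a hedged sketch of the adjustment (the 
3 s constant and the names are only illustrative):

#define SECS_PER_FSYNC 3

/* Shrink the write phase so that writes plus an estimated SECS_PER_FSYNC per
 * file still land inside the original checkpoint target window. */
static double
write_phase_seconds(double checkpoint_target_secs, int expected_fsyncs)
{
    double      reserved = (double) expected_fsyncs * SECS_PER_FSYNC;

    if (reserved >= checkpoint_target_secs)
        return 0.0;             /* degenerate: the syncs alone fill the window */
    return checkpoint_target_secs - reserved;
}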

Idea #4: For ext3 filesystems that like to dump the entire buffer
cache instead of only the requested file, write a little daemon that
runs alongside of (and completely independently of) PostgreSQL.  Every
30 s,

Re: [HACKERS] pov 1.0 is released, testers with huge schemas needed

2011-01-15 Thread Joel Jacobson
2011/1/14 Joel Jacobson :
> To install:
> git clone g...@github.com:gluefinance/pov.git

Oops, typo; to do this without my ssh key, you need to do:

git clone git://github.com/gluefinance/pov.git

-- 
Best regards,

Joel Jacobson
Glue Finance



Re: [HACKERS] LOCK for non-tables

2011-01-15 Thread Magnus Hagander
On Jan 15, 2011 12:30 PM, "Simon Riggs"  wrote:
>
> On Sat, 2011-01-15 at 12:19 +0100, Florian Pflug wrote:
> > On Jan15, 2011, at 02:03 , Tom Lane wrote:
> > > Robert Haas  writes:
> > >> Me, too.  But I don't agree with your particular choice of small
> > >> syntax adjustment.  Maybe we should just let the issue drop for now.
> > >> Nobody's actually complained about this that I can recall; it's just a
> > >> comment that's been sitting there in pg_dump for ages, and I was
> > >> inspired to think of it again because of the SQL/MED work.  I'm not
> > >> sufficiently in love with this idea to walk through fire for it.
> > >
> > > Agreed.  Once there's some pressing need for it, it'll be easier to make
> > > the case that some amount of incompatibility is acceptable.
> >
> > Assuming that day will come eventually, should we deprecate the LOCK
> > shortcut now to ease the transition later? If people want that, I could go
> > through the docs and add some appropriate warnings.
>
> Sounds good to me.
>
>
> I think we should have a section in the release notes on Deprecated
> Features, noting that certain things will be removed later and should be
> changed now and not relied upon in the future. A pending
> incompatibilities list.

+1.  This would be very useful. It's hard enough for us "on the inside" to
keep track of things that we deprecated...

> I would urge people to come up with a much wider list of "things we
> don't like" so we can more easily avoid discussions like this in the
> future. Forward planning helps make change easier.

There is a section on the TODO for that already, i think. Seems reasonable
since this is more for developers than users.

/Magnus


Re: [HACKERS] [RRR] reviewers needed!

2011-01-15 Thread Dimitri Fontaine
Robert Haas  writes:
> lot more people pitch in.

I just assigned myself the review of those two patches:

  https://commitfest.postgresql.org/action/patch_view?id=475
Streaming base backups

  https://commitfest.postgresql.org/action/patch_view?id=502
FOR KEY LOCK foreign keys

I think I'll post reviews sometime next week; it would be a surprise if
that happens sooner than Tuesday.  Depending on how the extension review
goes I might have some more time later in the CF, of course.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] LOCK for non-tables

2011-01-15 Thread Simon Riggs
On Sat, 2011-01-15 at 12:19 +0100, Florian Pflug wrote:
> On Jan15, 2011, at 02:03 , Tom Lane wrote:
> > Robert Haas  writes:
> >> Me, too.  But I don't agree with your particular choice of small
> >> syntax adjustment.  Maybe we should just let the issue drop for now.
> >> Nobody's actually complained about this that I can recall; it's just a
> >> comment that's been sitting there in pg_dump for ages, and I was
> >> inspired to think of it again because of the SQL/MED work.  I'm not
> >> sufficiently in love with this idea to walk through fire for it.
> > 
> > Agreed.  Once there's some pressing need for it, it'll be easier to make
> > the case that some amount of incompatibility is acceptable.
> 
> Assuming that day will come eventually, should we deprecate the LOCK 
> shortcut now to ease the transition later? If people want that, I could go
> through the docs and add some appropriate warnings.

Sounds good to me.


I think we should have a section in the release notes on Deprecated
Features, noting that certain things will be removed later and should be
changed now and not relied upon in the future. A pending
incompatibilities list.

I would urge people to come up with a much wider list of "things we
don't like" so we can more easily avoid discussions like this in the
future. Forward planning helps make change easier.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] kill -KILL: What happens?

2011-01-15 Thread Florian Pflug
On Jan14, 2011, at 17:45 , Robert Haas wrote:
> On Fri, Jan 14, 2011 at 11:28 AM, Florian Pflug  wrote:
>> I gather that the behaviour we want is for normal backends to exit
>> once the postmaster is gone, and for utility processes (bgwriter, ...)
>> to exit once all the backends are gone.
>> 
>> The test program I posted in this thread proves that FIFOs and select()
>> can be used to implement this, if we're ready to check for EOF on the
>> socket in CHECK_FOR_INTERRUPTS() every few seconds. Is this a viable
>> route to take?
> 
> I don't think there's much point in getting excited about the order in
> which things exit.  If we're agreed (and we seem to be, modulo Tom)
> that the backends should exit quickly if the postmaster dies, then
> worrying about whether the utility processes exit slightly before or
> slightly after that doesn't excite me very much.


Tom seems to think that as our utility processes gain importance, one day
we might require one to outlive all the backends, and that whatever solution
we adopt should allow us to arrange for that. Or at least this how I
understood him.

That part can also easily be left out by using only one FIFO instead of
two, kept open for writing only in the postmaster.
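
As a toy illustration of the EOF-detection mechanism being discussed (not 
the test program referenced above; an ordinary anonymous pipe stands in for 
the FIFO here), a child notices the parent is gone when select() reports the 
pipe readable and read() returns 0, because the parent held the only write 
end open:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/select.h>

int
main(void)
{
    int         pipefd[2];

    if (pipe(pipefd) != 0)
        return 1;

    if (fork() == 0)
    {
        /* child: keep only the read end */
        char        buf[1];

        close(pipefd[1]);
        for (;;)
        {
            fd_set      rfds;
            struct timeval tv = {2, 0};     /* poll every 2 seconds */

            FD_ZERO(&rfds);
            FD_SET(pipefd[0], &rfds);
            if (select(pipefd[0] + 1, &rfds, NULL, NULL, &tv) > 0 &&
                read(pipefd[0], buf, 1) == 0)
            {
                printf("parent is gone, exiting\n");
                _exit(0);
            }
            /* a real backend would do its CHECK_FOR_INTERRUPTS() work here */
        }
    }

    /* parent: keep the write end open for a while, then exit */
    close(pipefd[0]);
    sleep(5);
    return 0;
}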

best regards,
Florian Pflug




Re: [HACKERS] Recovery control functions

2011-01-15 Thread Simon Riggs
On Sat, 2011-01-15 at 20:00 +0900, Fujii Masao wrote:
> On Fri, Jan 14, 2011 at 8:15 PM, Simon Riggs  wrote:
> > On Fri, 2011-01-14 at 11:09 +, Simon Riggs wrote:
> >> Functions to control recovery, to aid PITR and Hot Standby.
> >> pg_is_xlog_replay_paused()
> >> pg_xlog_replay_pause()
> >> pg_xlog_replay_resume()
> >>
> >> recovery.conf parameter: pause_at_recovery_target (bool)
> >
> > And now with the correct patch.
> 
> IIRC, last year you implemented a related function which
> advances recovery by the specified number of records and pauses.
> Why did you drop that from the patch? That's very useful at least
> for me to do PITR and debug the recovery code.

SMoP. It complicated the code and the testing time would have exceeded
the amount of time I had available to spend on this, by a long way.

> +If in hot standby, all queries will see the same consistent snapshot
> +of the database, and no query conflicts will be generated.
> 
> Really? Can't an access exclusive lock taken on the master before
> recovery pauses conflict with a query?

No "recovery conflicts" will be generated. i.e. no new conflicts.

Yes, existing locks will interfere with queries, if they exist.

> +pause_at_recovery_target recovery
> parameter
> +  
> +  
> +   
> +Specifies whether recovery should pause when the recovery target
> +is reached. The default is true, if a recovery target is set.
> 
> The default is false, according to the code.

Thanks. Well spotted.

> If HS is disabled and pause_at_recovery_target is enabled,
> recovery might never end. This is not desirable.
> Should we reject such a combination of settings or emit a
> WARNING?

I was about to say "but it already does that". Checking the patch it
seems I must have removed that line, though I can't see any reason why I
would have removed it now. Will put it back.

> + while (RecoveryIsPaused());
> + {
> + pg_usleep(100000L); /* 100 ms */
> + HandleStartupProcInterrupts();
> 
> 100ms seems too short. What about 1s or bigger?
> Or wait on the latch rather than using poll loop.

Yes, time is short.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] LOCK for non-tables

2011-01-15 Thread Florian Pflug
On Jan15, 2011, at 02:03 , Tom Lane wrote:
> Robert Haas  writes:
>> Me, too.  But I don't agree with your particular choice of small
>> syntax adjustment.  Maybe we should just let the issue drop for now.
>> Nobody's actually complained about this that I can recall; it's just a
>> comment that's been sitting there in pg_dump for ages, and I was
>> inspired to think of it again because of the SQL/MED work.  I'm not
>> sufficiently in love with this idea to walk through fire for it.
> 
> Agreed.  Once there's some pressing need for it, it'll be easier to make
> the case that some amount of incompatibility is acceptable.

Assuming that day will come eventually, should we deprecate the LOCK 
shortcut now to ease the transition later? If people want that, I could go
through the docs and add some appropriate warnings.

best regards,
Florian Pflug




Re: [HACKERS] Recovery control functions

2011-01-15 Thread Simon Riggs
On Sat, 2011-01-15 at 20:11 +0900, Fujii Masao wrote:
> On Fri, Jan 14, 2011 at 9:00 PM, Simon Riggs  wrote:
> >> How hard would it be to have a pg_xlog_replay_until(<target
> >> timestamp>), to have it resume recovery up to that point and then
> >> pause again?
> >
> > You can already do that for timestamps.
> 
> You mean using recovery_target_time and pause_at_recovery_target?
> The problem is that we cannot continue recovery after the pause
> by them. If we resume recovery after the pause, recovery ends
> immediately.

Shutdown while paused, alter parameter, restart.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 




Re: [HACKERS] Recovery control functions

2011-01-15 Thread Fujii Masao
On Fri, Jan 14, 2011 at 9:00 PM, Simon Riggs  wrote:
>> How hard would it be to have a pg_xlog_replay_until(<target
>> timestamp>), to have it resume recovery up to that point and then
>> pause again?
>
> You can already do that for timestamps.

You mean using recovery_target_time and pause_at_recovery_target?
The problem is that we cannot continue recovery after the pause
by them. If we resume recovery after the pause, recovery ends
immediately.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Recovery control functions

2011-01-15 Thread Fujii Masao
On Fri, Jan 14, 2011 at 8:15 PM, Simon Riggs  wrote:
> On Fri, 2011-01-14 at 11:09 +, Simon Riggs wrote:
>> Functions to control recovery, to aid PITR and Hot Standby.
>> pg_is_xlog_replay_paused()
>> pg_xlog_replay_pause()
>> pg_xlog_replay_resume()
>>
>> recovery.conf parameter: pause_at_recovery_target (bool)
>
> And now with the correct patch.

IIRC, last year you implemented a related function which
advances recovery by the specified number of records and pauses.
Why did you drop that from the patch? That's very useful at least
for me to do PITR and debug the recovery code.

+If in hot standby, all queries will see the same consistent snapshot
+of the database, and no query conflicts will be generated.

Really? Can't an access exclusive lock taken on the master before
recovery pauses conflict with a query?

+pause_at_recovery_target recovery
parameter
+  
+  
+   
+Specifies whether recovery should pause when the recovery target
+is reached. The default is true, if a recovery target is set.

The default is false, according to the code.

If HS is disabled and pause_at_recovery_target is enabled,
recovery might never end. This is not desirable.
Should we reject such a combination of settings or emit a
WARNING?

+   while (RecoveryIsPaused());
+   {
+   pg_usleep(100000L); /* 100 ms */
+   HandleStartupProcInterrupts();

100ms seems too short. What about 1s or bigger?
Or wait on the latch rather than using poll loop.
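
For concreteness, the 1 s poll variant suggested here might look like this 
(illustrative only; it reuses the calls already present in the quoted hunk, 
and the latch-based alternative is left out since its details vary):

while (RecoveryIsPaused())
{
    HandleStartupProcInterrupts();
    pg_usleep(1000000L);        /* 1 s */
}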

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Spread checkpoint sync

2011-01-15 Thread Greg Smith

Robert Haas wrote:

On Tue, Nov 30, 2010 at 3:29 PM, Greg Smith  wrote:
  

One of the ideas Simon and I had been considering at one point was adding
some better de-duplication logic to the fsync absorb code, which I'm
reminded by the pattern here might be helpful independently of other
improvements.



Hopefully I'm not stepping on any toes here, but I thought this was an
awfully good idea and had a chance to take a look at how hard it would
be today while en route from point A to point B.  The answer turned
out to be "not very", so PFA a patch that seems to work.  I tested it
by attaching gdb to the background writer while running pgbench, and
it eliminated the backend fsyncs without even breaking a sweat.
  


No toe damage, this is great, I hadn't gotten to coding for this angle 
yet at all.  Suffering from an overload of ideas and (mostly wasted) 
test data, so thanks for exploring this concept and proving it works.


I'm not sure what to do with the rest of the work I've been doing in 
this area here, so I'm tempted to just combine this new bit from you 
with the older patch I submitted, streamline, and see if that makes 
sense.  Expected to be there already, then "how about spending 5 minutes 
first checking out that autovacuum lock patch again" turned out to be a 
wild underestimate.


Part of the problem is that it's become obvious to me the last month 
that right now is a terrible time to be doing Linux benchmarks that 
impact filesystem sync behavior.  The recent kernel changes that are 
showing in the next rev of the enterprise distributions--like RHEL6 and 
Debian Squeeze both working to get a stable 2.6.32--have made testing a 
nightmare.  I don't want to dump a lot of time into optimizing for 
<2.6.32 if this problem changes its form in newer kernels, but the 
distributions built around newer kernels are just not fully baked enough 
yet to tell.  For example, the pre-release Squeeze numbers we're seeing 
are awful so far, but it's not really done yet either.  I expect 3-6 
months from today, that all will have settled down enough that I can 
make some sense of it.  Lately my work with the latest distributions has 
just been burning time installing stuff that doesn't work quite right yet.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books



Re: [HACKERS] We need to log aborted autovacuums

2011-01-15 Thread Greg Smith

Josh Berkus wrote:

The lack of vacuum could be occurring for any of 4 reasons:

1) Locking
2) You have a lot of tables and not enough autovac_workers / too much
sleep time
3) You need to autovac this particular table more frequently, since it
gets dirtied really fast
4) The table has been set with special autovac settings which keep it
from being autovac'd

We can currently distinguish between cased 2, 3, 4 based on existing
available facts.  However, distinguishing case 1 from 2 or 3, in
particular, isn't currently possible except by methods which require
collecting a lot of ad-hoc monitoring data over a period of time.  This
makes the effort required for the diagnosis completely out of proportion
with the magnitude of the problem.
  


Since no one coded up something more exciting, I just did a round of 
self-review of the little logging patch I sent upthread to see if it was 
a worthy CF candidate.  I agree that *something* should be done, and I'd 
rather have a logging-based solution than nothing at all here.


So that patch is, to use a polite UK term I've adopted recently, 
rubbish.  But unless I'm confused (it is late), I think there's a 
problem here that's much worse than what you described--in the case 
where someone has grabbed a really large lock on a table that needs 
cleanup at least.  I started with the following test case:


SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial,i integer);
INSERT INTO t(i) SELECT generate_series(1,10);
SELECT relname,last_autovacuum,autovacuum_count FROM pg_stat_user_tables 
WHERE relname='t';

DELETE FROM t WHERE s<5;
\q

Which, so long as the min duration is 0, does mostly what I expected.  I 
followed immediately by starting a new session, doing:


BEGIN;
LOCK t;

Before autovacuum got to the table; I left this session open, stuck 
there idle in a transaction holding a lock.  I threw some extra logging 
in the code path leading up to where we'd both like the lack of lock 
acquisition to be logged, and it shows the following (at DEBUG3):


DEBUG:  t: vac: 4 (threshold 50), anl: 14 (threshold 50)
DEBUG:  autovacuum: t: vac needed
INFO:  vacuum_rel:  processing 16528
INFO:  vacuum_rel:  trying to open relation 16528

Only the first of those lines exists in the current code, and as 
noted already that existing line is at DEBUG3.  This was trying to 
simulate the case you were complaining about:  autovacuum has been 
triggered, it knows there is work to be done, and it can't do said work.


It hasn't triggered the error message I tried to add yet, though, 
because it's not hitting that spot.  Look where it's actually blocked at:


$ ps -eaf | grep postgres
gsmith    5490  4880  0 04:09 ?        00:00:00 postgres: autovacuum 
worker process   gsmith waiting  

$ psql -c "select procpid,now() - query_start as runtime,current_query 
from pg_stat_activity where current_query like 'autovacuum%'"
 procpid |     runtime     |            current_query
---------+-----------------+--------------------------------------
    5490 | 00:11:35.813727 | autovacuum: VACUUM ANALYZE public.t

I haven't just failed to vacuum the table that needs it without any 
entry in the logs saying as much.  I've actually tied down an autovacuum 
worker and kept it from doing anything else until that lock is 
released!  When vacuum_rel inside of vacuum.c does this on a relation it 
wants to acquire a lock on:


   onerel = try_relation_open(relid, lmode);

It just hangs out there forever, if the only issue is not being able to 
get the lock it wants.  The abort code path I tried to insert logging 
into only occurs if the relation was deleted.  Watch what's happened 
while I've been typing:


 procpid |     runtime     |            current_query
---------+-----------------+--------------------------------------
    5490 | 00:36:24.426528 | autovacuum: VACUUM ANALYZE public.t

I've now hit the AV worker with a denial of service attack, essentially, 
for over 35 minutes.  I'd bet that if I was a malicious user, I could 
make AV come to a halt altogether for the whole server with this 
approach, even with relatively limited rights on the database.


Does try_relation_open need to have a lock acquisition timeout when AV 
is calling it?  I don't see any other way to get a log message noting 
when this problem passed by in there; right now you can only see it live 
in the process views.  And the current behavior doesn't feel right to me.
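
One possible shape for that, sketched under the assumption that autovacuum 
should simply skip and log rather than wait (ConditionalLockRelationOid() 
and try_relation_open() are existing backend functions, but this particular 
wiring is hypothetical, not a tested patch):

if (!ConditionalLockRelationOid(relid, lmode))
{
    ereport(LOG,
            (errmsg("skipping vacuum of relation with OID %u --- lock not available",
                    relid)));
    return;                     /* give the worker back instead of blocking */
}

/* The lock is already held, so open without acquiring it again. */
onerel = try_relation_open(relid, NoLock);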


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books




Re: [HACKERS] auto-sizing wal_buffers

2011-01-15 Thread Fujii Masao
On Sat, Jan 15, 2011 at 3:51 PM, Greg Smith  wrote:
> Agreed on both fronts.  Attached patch does the magic.  Also available in
> branch "walbuffers" from git://github.com/greg2ndQuadrant/postgres.git

+int        XLOGbuffersMin = 8;

XLOGbuffersMin is a fixed value. I think that defining it as a macro
rather than a variable seems better.

+   if (XLOGbuffers > 2048)
+   XLOGbuffers = 2048;

Using "XLOG_SEG_SIZE/XLOG_BLCKSZ" rather than 2048 seems
better.
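
In other words, something along these lines (a sketch of the suggestion, 
reusing the patch's XLOGbuffers variable):

if (XLOGbuffers > XLOG_SEG_SIZE / XLOG_BLCKSZ)
    XLOGbuffers = XLOG_SEG_SIZE / XLOG_BLCKSZ;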

+#wal_buffers = -1  # min 32kB, -1 sets based on shared_buffers

Typo: s/32kB/64kB

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
