Re: [HACKERS] GSoC - proposal - Materialized Views in PostgreSQL

2010-04-11 Thread Florian G. Pflug

On 11.04.10 20:47 , Robert Haas wrote:

On Sun, Apr 11, 2010 at 10:26 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Robert Haas wrote:

2010/4/10 Andrew Dunstan and...@dunslane.net:

Heikki Linnakangas wrote:

1. Keep the materialized view up-to-date when the base tables
change. This can be further divided into many steps, you can
begin by supporting automatic updates only on very simple
views with e.g a single table and a where clause. Then extend
that to support joins, aggregates, subqueries etc. Keeping it
really limited, you could even require the user to write the
required triggers himself.

That last bit doesn't strike me as much of an advance. Isn't
the whole point of this to automate it? Creating greedy
materialized views is usually not terribly difficult now, but
you do have to write the triggers.


Yeah, I agree.


It doesn't accomplish anything interesting on its own. But if you
do the planner changes to automatically use the materialized view
to satisfy queries (item 2. in my previous email), it's useful.


But you can't do that with a snapshot view, only a continuously updated
one.


If continuous updates prove to be too hard initially, you could instead
update the view on select if it's outdated. Such a materialized view
would be a kind of inter-session cache for subselects.

The hard part would probably be to figure out how to decide whether the
view is outdated or not, and to deal with two concurrent transactions
trying to use an outdated view (and both trying to refresh it). What
makes the second problem hard is that you wouldn't want one of the
transactions to wait for the other to complete, because this is not how
SELECTs traditionally behave.
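
To illustrate the idea, here is a minimal hand-rolled sketch of such a
refresh-on-select view (all table, column and function names are made up,
and a real implementation would of course live in the backend, not in SQL):

create table sales (product text, amount numeric);
create table mv_sales (product text, total numeric);
create table mv_sales_meta (is_stale boolean not null default true);
insert into mv_sales_meta values (true);

-- Base-table triggers would only set is_stale = true; the expensive rebuild
-- happens lazily, on the first select that notices the flag.
create or replace function mv_sales_refresh_if_stale() returns void as $body$
begin
  if (select is_stale from mv_sales_meta) then
    -- Don't wait if another session is already refreshing - just serve the
    -- slightly stale contents instead (the lock key 42 is arbitrary).
    if pg_try_advisory_lock(42) then
      truncate mv_sales;
      insert into mv_sales
        select product, sum(amount) from sales group by product;
      update mv_sales_meta set is_stale = false;
      perform pg_advisory_unlock(42);
    end if;
  end if;
end;
$body$ language plpgsql;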

best regards, Florian Pflug





Re: [HACKERS] Serializable implementation

2009-12-28 Thread Florian G. Pflug

On 28.12.09 18:54 , Kevin Grittner wrote:

To give some idea of the scope of development, Michael Cahill added
SSI to InnoDB by modifying 250 lines of code and adding 450 lines of
 code; however, InnoDB already had the S2PL option and the prototype
 implementation isn't as sophisticated as I feel is necessary for
real production use (particularly regarding the granularity of SIREAD
locks).  I'm assuming it would take more to reach real production
quality in PostgreSQL.  My SWAG would be to multiply by two or
three.


I believe the hard part of implementing true serializability is not the
actual SSI or S2PL algorithm, but rather the necessary predicate locking
strategy.

So I think checking how InnoDB tackles that and how much of its code is
involved might give a more realistic estimate of the effort required.

best regards,
Florian Pflug





[HACKERS] [WIP] Inspection of row types in pl/pgsql and pl/sql

2009-12-20 Thread Florian G. Pflug

Hi

I've completed a (first) working version of an extension that allows
easier introspection of composite types from SQL and pl/pgsql.

The original proposal and ensuing discussion can be found here:
http://archives.postgresql.org/pgsql-hackers/2009-11/msg00695.php

The extension can be found on:
http://github.com/fgp/pg_record_inspect

This is what the extension currently provides (all in schema
record_inspect).

* fieldinfo [composite type]
  Used by fieldinfos() to describe a record's fields.
  Contains the fields
fieldname (name),
fieldtype (regclass),
fieldtypemod (varchar)

* fieldinfo[] fieldinfos(record)
  Returns an array of fieldinfos describing the record's fields.

* anyelement fieldvalue(record, field name, defval anyelement,
coerce boolean)
  Returns the value of the field named field, or defval should the value
  be null. If coerce is true, the value is coerced to defval's type
  if possible, otherwise an error is raised if the field's type and
  defval's type differ.

* anyelement fieldvalues(record, defval anyelement, coerce boolean)
  Returns an array containing the values of the record's fields. NULL
  values are replaced by defval. If coerce is false, only the
  fields with the same type as defval are considered. Otherwise, the
  fields' values are coerced if possible, or an error is raised if not.
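
To illustrate, a minimal usage sketch (the table is made up, and the exact
output is approximate - it depends on the extension's current state):

create table t (a int, b text);
insert into t values (1, 'one');

-- describe the row type: yields something like {"(a,integer,)","(b,text,)"}
select record_inspect.fieldinfos(t.*) from t;

-- fetch field a as an int (defval also selects the return type): yields 1
select record_inspect.fieldvalue(t.*, 'a', null::int, false) from t;

-- all int fields, since coerce is false: yields {1}
select record_inspect.fieldvalues(t.*, null::int, false) from t;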

The most hacky part of the code is probably coerceDatum() - needed to
coerce a field's value to the requested output type. I wanted to avoid
creating and parsing an actual SQL statement for every cast, and instead
chose to use coerce_to_target_type() to create the expression trees
representing casts. I use the node type CoerceToDomainValue to inject the
source value into the cast plan upon execution - see makeCastPlan() and
execCastPlan() for details. If anyone has a better idea, please speak up.

I personally would like to see this becoming a contrib module one day,
but that of course depends on how much interest there is in such a feature.

best regards,
Florian Pflug





[HACKERS] Compiling HEAD with -Werror in 64-bit mode

2009-12-15 Thread Florian G. Pflug

Hi

HEAD fails to compile in 64-bit mode on Mac OS X 10.6 with gcc 4.2 and
-Werror.

What happens is that INT64_FORMAT gets defined as %ld (which is
correct - long and unsigned long are 64 bits wide on x86_64), but
the check for a working 64-bit int fails, causing INT64_IS_BUSTED to get
defined and int64 to actually become a 32-bit type. This in turn causes
warnings when these pseudo int64s are passed to printf with the format
specifier INT64_FORMAT, which are turned into errors by -Werror.

configure fails to recognize long as a working 64-bit type because the
does_int64_work configure test produces warnings due to a missing return
value declaration for main() and a missing prototype for
does_int64_work(). (Again, those warnings are turned into errors by -Werror).

I use the following envvar settings (when running ./configure) to force
64-bit mode and -Werror:
CC=gcc-4.2 CFLAGS="-arch x86_64 -Werror" LDFLAGS="-arch x86_64"

The following patch fixed the problem for me - though I didn't yet try
it on any other platform than Mac OS X 10.6 with gcc 4.2 in 64-bit mode.

--
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 9ac2c30..c6bd523 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -35,7 +35,7 @@ AC_CACHE_CHECK([whether $1 is 64 bits], [Ac_cachevar],
 ac_int64 a = 2001;
 ac_int64 b = 4005;

-int does_int64_work()
+static int does_int64_work()
 {
   ac_int64 c,d;

@@ -49,8 +49,8 @@ int does_int64_work()
 return 0;
   return 1;
 }
-main() {
-  exit(! does_int64_work());
+int main() {
+  return(! does_int64_work());
 }],
 [Ac_cachevar=yes],
 [Ac_cachevar=no],
--

best regards,
Florian Pflug






Re: [HACKERS] Compiling HEAD with -Werror in 64-bit mode

2009-12-15 Thread Florian G. Pflug

On 15.12.09 16:02 , Tom Lane wrote:

Florian G. Pflug f...@phlo.org writes:

configure fails to recognize long as a working 64-bit type
because the does_int64_work configure test produces warnings due to
a missing return value declaration for main() and a missing
prototype for does_int64_work(). (Again, those warnings are turned
into errors by -Werror).


autoconf's test programs tend to be sufficiently sloppy that I would
expect -Werror to break a whole lot of things, not just this. We can
possibly neaten up the particular test case but there are many tests
whose expansion we don't have much control over.


Yeah, I expected all hell to break loose - only to be pleasantly
surprised by this being the only issue I encountered. So I figured
fixing this might be worthwhile - even if this surely does not fix
-Werror builds on all platforms and/or compilers.

Alternatively - is there a way to use -Werror only for building the
actual sources, not the configure tests? I didn't find one, but my
autoconf-fu is pretty limited...

best regards,
Florian Pflug





Re: [HACKERS] Range types

2009-12-15 Thread Florian G. Pflug

On 15.12.09 15:52 , Tom Lane wrote:

to...@tuxteam.de writes:

(and as Andrew Dunstan pointed out off-list: I was wrong with my
bold assertion that one can squeeze infinitely many (arbitrary
length) strings between two given. This is not always the case).


Really?  If the string length is unbounded I think you were right.


One example is 'a' and 'aa' (assuming 'a' is the minimal character in your
alphabet). The general case is the strings 'A' and 'Aaa...a', I think -
it doesn't get any more exciting than this.

This *is* a bit surprising, since one usually assumes that the ordering
of strings and reals is fairly similar, since both are lexical.

But note that the mapping of strings into the reals this intuition is
based on (simply prefix the string with "0." and interpret it as a real,
or something similar if the alphabet isn't {0,1}) isn't one-to-one - the
strings 1, 10, 100, ... are all mapped to the *same* real number 0.1.

So for reals, the statement is reduced to the trivial fact that for
every x there is no y with x < y < x. Which is of course true.

best regards,
Florian Pflug





Re: [HACKERS] Compiling HEAD with -Werror in 64-bit mode

2009-12-15 Thread Florian G. Pflug

On 15.12.09 23:38 , Tom Lane wrote:

Peter Eisentraut pete...@gmx.net writes:

So to summarize, this is just a bad idea.  Creating a less obscure
way to use -Werror might be worthwhile, though.


I suppose we could add --with-Werror but it seems pretty
specialized to me.  A more appropriate solution would allow the user
to provide flags that get added to CFLAGS only after we do all the
configure tests (implying that it's on the user's head that these
flags are right and don't break anything, but then again that's
pretty much true of up-front CFLAGS too).  And that basically
describes COPTS ... the only thing lacking is documentation.


For what it's worth, I agree. Though we might want to arrange for
configure to store the value of COPT somewhere so that
  COPT=-Werror ./configure
  make
works, which it currently doesn't seem to.

best regards,
Florian Pflug





Re: [HACKERS] Application name patch - v4

2009-11-29 Thread Florian G. Pflug

Tom Lane wrote:

: One possibility would be to make it possible to issue SETs that behave
: as if set in a startup packet - imho it's an implementation detail that
: SET currently is used.

I think there's a good deal of merit in this, and it wouldn't be hard
at all to implement, seeing that we already have SET LOCAL and SET
SESSION. We could add a third keyword, say SET DEFAULT, that would
have the behavior of setting the value in a fashion that would
persist across resets.  I'm not sure that DEFAULT is exactly le mot
juste here, but agreeing on a keyword would probably be the hardest
part of making it happen.


Hm, but without a way to prevent the users of a connection pool from
issuing SET DEFAULT, that leaves a connection pool with no way to
revert a connection to a known state.

How about SET CONNECTION, with an additional GUC called
connection_setup which can only be set to true, never back to false.
Once connection_setup is set to true, further SET CONNECTION attempts
would fail.

In a way, this mimics startup-packet SETs without actually doing things
in the startup packet.
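
A sketch of how a pool might use this (hypothetical syntax, since none of
it is implemented):

-- at connection setup, done by the pooler:
SET CONNECTION application_name = 'mypool';
SET connection_setup = true;            -- further SET CONNECTION now fails
-- handed to a client, which can only change the session value:
SET application_name = 'client-42';
-- when the connection goes back into the pool:
RESET ALL;                              -- reverts to the SET CONNECTION values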

best regards,
Florian Pflug





Re: [HACKERS] File IO - why does PG do things in pages?

2009-11-26 Thread Florian G. Pflug

Dan Eloff wrote:
At the lower levels in PG, reading from the disk into cache, and 
writing from the cache to the disk is always done in pages.


Why does PG work this way? Is it any slower to write whole pages 
rather than just the region of the page that changed? Conversely, is 
it faster? From what I think I know of operating systems, reading 
should bring the whole page into the os buffers anyway, so reading
the whole page instead of just part of it isn't much more expensive. 
Perhaps writing works similarly?


First, data fetched from the disk is (except for data in temporary
tables, I believe) not stored in private memory of the backend process
doing the read(), but instead in a shared memory segment accessible by
all backend processes. This allows two different backend processes to
modify the data concurrently without them stepping on each other's
toes. Note that immediately writing back any changes is *not* an option,
since WAL logging mandates that all changes go to the WAL *first*.
Hence, if you were to write out each changed tuple immediately, you'd
have to first write the changes to the WAL *and* fsync the WAL to
guarantee they hit the disk first.

Sharing the data between backend processes requires a fair amount of
infrastructure. You need a way to locate a given chunk of on-disk data
in the shared memory buffer cache, and be able to acquire and release
locks on those buffers to prevent two backends from wreaking havoc when
they try to update the same piece of information. Organizing data in
fixed-sized chunks (which is what pages are) helps with keeping the
complexity of that infrastructure manageable, and the overhead
reasonably low.

There are also things like tracking the free space in a data file, which
also gets easier if you only have to track it page-wise (Is there free
space on this page or not), instead of having to track arbitrary ranges
of free space.

Finally, since data writes happen in units of blocks (and not bytes),
you need to guarantee that you do your IO in some multiple of that unit
anyway, otherwise you'd have a very hard time guaranteeing data
consistency after a crash. Google for torn page writes, that should
give you more details about this problem.

Note, however, that a postgres page (usually 8K) is usually larger than
the filesystem's blocksize (usually 512b). So always reading in full
pages induces *some* IO overhead. Just not that much - especially since
the blocks comprising a page are extremely likely to be arranged
consecutively on disk, so there is no extra seeking involved.
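
The page size and per-relation page counts are also visible from plain SQL,
for example (a quick sketch; the numbers will differ on your system):

show block_size;                                  -- 8192 by default
select relname, relpages,
       pg_relation_size(oid) / 8192 as blocks_on_disk
from pg_class where relname = 'pg_class';
-- relpages is only updated by VACUUM/ANALYZE, so the two counts can differ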

These, at least, are what I believe to be the main reasons for doing
things in units of pages - hope this helps at least somewhat.

best regards,
Florian Pflug




[HACKERS] ALTER TABLE, find_composite_type_dependencies and locking

2009-11-25 Thread Florian G. Pflug

Hi

I'm currently investigating how much work it'd be to implement arrays of
domains since I have a client who might be interested in sponsoring that
work.

The comments around the code handling ALTER DOMAIN ADD CONSTRAINT are
pretty clear about the lack of proper locking in that code - altering a
domain while simultaneously adding a column with that domain as its type
might result in inconsistencies between the data in that column and the
domain's constraints after both transactions have committed.

I do, however, suspect that ALTER TABLE is plagued by similar problems.
Currently, during the rewrite phase of ALTER TABLE,
find_composite_type_dependencies is used to verify that the table's row
type (or any type directly or indirectly depending on that type) is not
used as a column's type anywhere in the database.

But since this code does not take any permanent locks on the visited
types, it seems that adding such a column concurrently is not prevented.
 If the original ALTER TABLE changed a column's type, data inserted into
the newly added column before the original ALTER TABLE committed will
have a type different from what the catalog says after the original
ALTER TABLE commits. Or at least so I think - I haven't yet tested that
theory...

I am aware that since a commit fest is currently running, now might not
be the best time to bring up this topic. Since I feared forgetting this
altogether, I decided to still post now, though. I figured people
still have the option to ignore this for now if they're busy with getting
those patches committed.

best regards,
Florian Pflug





Re: [HACKERS] ALTER TABLE, find_composite_type_dependencies and locking (Confirmed)

2009-11-25 Thread Florian G. Pflug

Florian G. Pflug wrote:
I do, however, suspect that ALTER TABLE is plagued by similar 
problems. Currently, during the rewrite phase of ALTER TABLE, 
find_composite_type_dependencies is used to verify that the table's 
row type (or any type directly or indirectly depending on that type) 
is not used as a column's type anywhere in the database.


But since this code does not take any permanent locks on the visited
 types, it seems that adding such a column concurrently is not 
prevented. If the original ALTER TABLE changed a column's type, data 
inserted into the newly added column before the original ALTER TABLE 
committed will have a type different from what the catalog says after

 the original ALTER TABLE commits. Or at least so I think - I haven't
 yet tested that theory...


I was able to confirm that this is an actual bug in 8.5. I did, however,
need to use an array-of-composite type. With only nested composite types
it seems that CheckAttributeType() protects against the race, because it
follows the dependency chain and opens each type's relation in
AccessShareLock mode. This blocks once the traversal hits the type which
is being altered, hence forcing the table creation to wait for the
concurrent alter table to complete.

Create two types in session 1
session 1: create table t1 (t1_i int);
session 1: create type t2 as (t2_t1 t1);

Warm the type cache in session 2
(A simple select array[row(row(-1))::t2] would probably suffice)
session 2: create table bug (bug_t2s t2[]);
session 2: insert into bug (bug_t2s) values (array[row(row(-1))::t2]);
session 2: select bug.bug_t2s[1].t2_t1.t1_i from bug;
[select correctly returns one row containing -1]
session 2: drop table bug;

Alter type of t1_i in session 1
session 1: alter table t1 alter column t1_i type varchar;
[Pause session 1 using gdb *right* after the call to
 find_composite_type_dependencies in ATRewriteTable
 returned]

Create the bug table in session 2, and insert record
session 2: create table bug (bug_t2s t2[]);
session 2: insert into bug (bug_t2s) values (array[row(row(-1))::t2]);
session 2: select bug.bug_t2s[1].t2_t1.t1_i from bug;
[select correctly returns one row containing -1]

Complete the alter table in session 1
[Resume session 1 using gdb]
session 1: select bug.bug_t2s[1].t2_t1.t1_i from bug;
[select returns bogus string. On my 8.5 debug+cassert build,
 it's a long chain of \x7F\x7F\x7F\x...]

Don't have any good idea how to fix this, yet. If CheckAttributeType()
really *does* offer sufficient protection in the non-array case,
extending that to the general case might work. But OTOH it might equally
well be that a more sophisticated race exists even in the non-array
case, and I simply didn't manage to trigger it...

best regards, Florian Pflug




Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)

2009-11-21 Thread Florian G. Pflug

Tom Lane wrote:

Josh Berkus j...@agliodbs.com writes:

(2) this change, while very useful, does change what had been a
simple rule (All variables are NULL unless specifically set
otherwise) into a conditional one (All variables are NULL unless
set otherwise OR unless they are declared as domain types with
defaults).  Do people feel that the new behavior would be
sufficiently intuitive to avoid user confusion?


I'm inclined to leave it alone.  It complicates the mental model, and
 frankly attaching defaults to domains was not one of the SQL
committee's better ideas anyway.  It's *fundamentally*
non-orthogonal.


I've always thought of domains as being a kind of subtype of their base
type. In this picture, DEFAULTs for domains correspond to overriding the
default constructor of the type (thinking C++ now), and seem like a
natural thing to have. But maybe that's more a C++ programmer's than a
database designer's point of view...

I've just checked how rowtypes behave, and while the "set to null unless
specifically set otherwise" rule kind of holds for them, their NULL
value seems to be special-cased enough to blur the line quite a bit:

create or replace function myt() returns t as $body$
declare
  r t;
begin
  raise notice 'r: %, r is null: %', r, (r is null);
  return r;
end;
$body$ language plpgsql immutable;
select myt(),myt() is null;

gives:

NOTICE:  r: (,), r is null: t
NOTICE:  r: (,), r is null: t
 myt | ?column?
-+--
 (,) | f

Strange I think... And at least half of an exception to the simple
"always null unless specifically set otherwise" rule.

It also seems that while domain DEFAULTs are ignored, the resulting
(null-initialized) variable is still checked against the domain's
constraints, including a potential NOT NULL constraint

create domain myint as int not null;
create or replace function myint() returns myint as $body$
declare
  i myint;
begin
  return i;
end;
$body$ language plpgsql immutable;

raises

ERROR:  domain myint does not allow null values
CONTEXT:  PL/pgSQL function myint line 3 during statement block local
variable initialization

This has the potential to cause some headache I think if you use domains
to prohibit NULL values because they make no semantic sense for your
application, and depend on DEFAULT to fill in some other value (like an
empty string or an empty array).

best regards,
Florian Pflug




Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)

2009-11-21 Thread Florian G. Pflug

Gurjeet Singh wrote:
On Sat, Nov 21, 2009 at 7:26 AM, Josh Berkus j...@agliodbs.com wrote:
However, there are some other

issues to be resolved:

(1) what should be the interaction of DEFAULT parameters and domains 
with defaults?


The function's DEFAULT parameter should take precedence over the
default of the domain.


I think Josh was pondering whether

create domain myint as int default 0;
create function f(i myint) ...;

should behave like

create function f(i myint default 0) ...;

and hence call f(0) if you do select f();, or instead
raise an error because no f with zero parameters is defined (as it does
now).

I'd say no, because "no default" should be treated the same as "default
null", so for consistency we'd then have to also support

create function g(i int) ...;
select g();

And of course throw an error if there was another function defined as
create function g() ...;

This way leads to madness...

If one really wanted to do that, there'd have to be an OPTIONAL clause
for function parameters that works like DEFAULT, but doesn't take a
default value and instead uses the type's default (NULL except for
domains with a DEFAULT clause). But I wouldn't go that far, personally...
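
A sketch of what that could look like (purely hypothetical syntax):

create domain myint as int default 0;
create function f(i myint OPTIONAL) returns int as
  $$ select $1 $$ language sql;
select f();   -- would call f(0), using the domain's DEFAULT
select f(7);  -- behaves as today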

best regards,
Florian Pflug






Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)

2009-11-20 Thread Florian G. Pflug

Robert Haas wrote:

On Thu, Nov 19, 2009 at 9:06 PM, Florian G. Pflug f...@phlo.org wrote:

I've tried to create a patch, but didn't see how I'd convert the result
from get_typedefault() (a Node*, presumably the parse tree corresponding
to the default expression?) into a plan that I could store in a
PLpgSQL_expr. I guess I'd need something like SPI_prepare_plan that
takes a parse tree instead of a query string. Or am I on a completely
wrong track there?

While trying to cook up a patch I've also stumbled over what I perceive
as a bug relating to DOMAINS and column DEFAULTs. I'll write that up in
a second E-Mail to avoid confusion.


I suggest adding this to the open CommitFest (2010-01) at
https://commitfest.postgresql.org/action/commitfest_view/open


Hm, but I don't (yet) have a patch to add...

best regards,
Florian Pflug





Re: [HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)

2009-11-20 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug f...@phlo.org writes:

It seems that pl/pgsql ignores the DEFAULT value of domains for
local variables.


The plpgsql documentation seems entirely clear on this:

The DEFAULT clause, if given, specifies the initial value assigned to
 the variable when the block is entered. If the DEFAULT clause is not
 given then the variable is initialized to the SQL null value.


Hm, must have missed that paragraph :-(. Sorry for that.

Would a patch that changes that have any chance of being accepted? Or is
the gain (not having to repeat the DEFAULT clause, and being able to
maintain it at one place instead of many) considered too small compared
to the risk of breaking existing code?

best regards,
Florian Pflug




Re: [HACKERS] Listen / Notify - what to do when the queue is full

2009-11-19 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Joachim Wieland wrote:

On Thu, Nov 19, 2009 at 4:12 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Yes, I have been thinking about that also. So what should happen
when you prepare a transaction that has sent a NOTIFY before?


From the user's point of view, nothing should happen at prepare.

At a quick glance, it doesn't seem hard to support 2PC. Messages should
be put to the queue at prepare, as just before normal commit, but the
backends won't see them until they see that the XID has committed.


Yeah, but if the server is restarted after the PREPARE but before the
COMMIT, the notification will be lost, since all notification queue
entries are lost upon restart with the SLRU design, no?


best regards,
Florian Pflug





Re: [HACKERS] Listen / Notify - what to do when the queue is full

2009-11-19 Thread Florian G. Pflug

Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
A better approach is to do something similar to what we do now: at 
prepare, just store the notifications in the state file like we do 
already. In notify_twophase_postcommit(), copy the messages to the 
shared queue. Although it's the same approach we have now, it

becomes a lot cleaner with the patch, because we're not
piggybacking the messages on the backend-private queue of the
current transaction, but sending the messages directly on behalf of
the prepared transaction being committed.


This is still ignoring the complaint: you are creating a clear risk 
that COMMIT PREPARED will fail.


I'm not sure that it's really worth it, but one way this could be
made safe would be for PREPARE to reserve the required amount of
queue space, such that nobody else could use it during the window
from PREPARE to COMMIT PREPARED.


I'd see no problem with COMMIT PREPARED failing, as long as it was
possible to retry the COMMIT PREPARED at a later time. There surely are
other failure cases for COMMIT PREPARED too, like an IO error that
prevents the clog bit from being set, or a server crash half-way through
COMMIT PREPARED.

best regards,
Florian Pflug




Re: [HACKERS] Listen / Notify - what to do when the queue is full

2009-11-19 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug f...@phlo.org writes:

Tom Lane wrote:

This is still ignoring the complaint: you are creating a clear
risk that COMMIT PREPARED will fail.



I'd see no problem with COMMIT PREPARED failing, as long as it
was possible to retry the COMMIT PREPARED at a later time. There
surely are other failure cases for COMMIT PREPARED too, like an IO
error that prevents the clog bit from being set, or a server crash
half-way through COMMIT PREPARED.


Yes, there are failure cases that are outside our control.  That's no
 excuse for creating one that's within our control.


True. On the other hand, people might prefer having to deal with (very
unlikely) *transient* COMMIT PREPARED failures over not being able to
use NOTIFY together with 2PC at all. Especially since any credible
distributed transaction manager has to deal with COMMIT PREPARED
failures anyway.

Just my $0.02, though.

best regards,
Florian Pflug




[HACKERS] DEFAULT of domain ignored in plpgsql (8.4.1)

2009-11-19 Thread Florian G. Pflug

Hi

It seems that pl/pgsql ignores the DEFAULT value of domains for local
variables. With the following definitions in place

create domain myint as int default 0;
create or replace function myint() returns myint as $body$
declare
  v_result myint;
begin
  return v_result;
end;
$body$ language plpgsql immutable;

issuing
select myint();
returns NULL, not 0 on postgres 8.4.1

If the line
  v_result myint;
is changed to
  v_result myint default 0;
then 0 is returned as expected.

I've tried to create a patch, but didn't see how I'd convert the result
from get_typedefault() (a Node*, presumably the parse tree corresponding
to the default expression?) into a plan that I could store in a
PLpgSQL_expr. I guess I'd need something like SPI_prepare_plan that
takes a parse tree instead of a query string. Or am I on a completely
wrong track there?

While trying to cook up a patch I've also stumbled over what I perceive
as a bug relating to DOMAINS and column DEFAULTs. I'll write that up in
a second E-Mail to avoid confusion.

best regards,
Florian Pflug




[HACKERS] column DEFAULTs and prepared statements

2009-11-19 Thread Florian G. Pflug

Hi

While trying to come up with a patch to handle domain DEFAULTs in
plpgsql I've stumbled across the following behavior regarding domain
DEFAULTs and prepared statements.

session 1: create domain myint as int default 0 ;
session 1: create table mytable (i myint) ;
session 2: prepare ins as insert into mytable (i) values (default);
session 2: execute ins;
session 1: alter domain myint set default 1;
session 2: execute ins;

select * from mytable returns:
 i
---
 0
 0


while I'd have expected:
 i
---
 0
 1

After doing the same without using a domain
session 1: create table mytable (i myint default 0) ;
session 2: prepare ins as insert into mytable (i) values (default);
session 2: execute ins;
session 1: alter table mytable alter column i set default 1;
session 2: execute ins;

select * from mytable returns:
 i
---
 0
 1

As far as I understand the code this happens because the dependency on
the domain (for the default value) is not recorded in the plan cache
entry. This would imply that the same error also occurs if the INSERT
happens from a pl/pgsql function instead of a manually prepared
statement, but I haven't tested that.
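
A workaround (not a fix) is to re-prepare after the ALTER DOMAIN, since a
fresh plan picks up the then-current default:

session 2: deallocate ins;
session 2: prepare ins as insert into mytable (i) values (default);
session 2: execute ins;
-- this execute inserts 1, as expected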

If someone gives me a general idea where to start, I could try to come
up with a patch

best regards,
Florian Pflug





Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql

2009-11-14 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug f...@phlo.org writes:

Tom Lane wrote:

Trying to do this in plpgsql is doomed to failure and heartache,



Well, the proposed functions at least allow for some more
flexibility in working with row types, given that you know in
advance which types you will be dealing with (but not necessarily
the precise ordering and number of the record's fields). They might
feel a bit kludgy because of the anyelement dummy argument that
bridges the gap between the statically typed nature of SQL and the
rather dynamic RECORDs, but the kludgy-ness factor is still within
reasonable limits I think.


It sounds pretty d*mn klugy to me, and I stand by my comment that it
isn't going to work anyway.  When you try it you are going to run
into "parameter type doesn't match that while preparing the plan"
errors.


Ok, I must be missing something. I currently fail to see how
my proposed
  record_value(record, name, anyelement) returns anyelement
function differs (from the type system's point of view) from
  value_from_string(text, anyelement) returns anyelement
which simply casts the text value to the given type and can easily be
implemented in plpgsql.

create or replace function
value_from_string(v_value text, v_type_dummy anyelement)
returns anyelement as
$body$
declare
  v_result v_type_dummy%type;
begin
  if v_value is null then
return null;
  end if;

  v_result := v_value;
  return v_result;
end;
$body$ language plpgsql immutable;

-- Returns 124
select value_from_string('123', NULL::int) + 1;
-- returns {1,2,3,4}
select value_from_string('{1,2,3}', NULL::int[]) || array[4];

best regards,
Florian Pflug




Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql

2009-11-14 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug f...@phlo.org writes:

Ok, I must be missing something. I currently fail to see how my
proposed record_value(record, name, anyelement) returns anyelement 
function differs (from the type system's point of view) from 
value_from_string(text, anyelement) returns anyelement which simply
casts the text value to the given type and can easily be 
implemented in plpgsq.


The problem is at the call site --- if you try to call it with
different record types on different calls you're going to get a
failure. Or so I expect anyway.


Ah, OK - so it's really the record type, and not my anyelement kludge
that might cause problems.

Actually, I do now realize that record is a way more special case than
I'd have initially thought. For example, I could have sworn that it's
possible to pass record values to pl/pgsql functions, but just found
out the hard way that it isn't. Seems that the possibility of declaring
record variables lulled me into thinking it's a pretty standard type
when it actually isn't.

best regards, Florian Pflug




Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql

2009-11-14 Thread Florian G. Pflug

Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:
Yes, and I have used it, but it really would be nicer to have some 
introspection facilities built in, especially for use in triggers.


Maybe, but the proposal at hand is spectacularly ugly --- in particular
it seems designed around the assumption that a given trigger will only
care about handling a predetermined set of datatypes, which hardly
fits with PG's normal goals for datatype extensibility.  If the argument
is that you don't like hstore or other PLs because they'll smash
everything to text, then I think you have to do better than this.


While I agree that handling arbitrary datatypes at runtime would be 
nice, I really don't see how that could ever be done from within a 
plpgsql procedure, unless plpgsql somehow morphs into a dynamically 
typed language. Plus, the set of datatypes an application deals with is 
usually much smaller than the set of tables, and less likely to change 
over time.


I'd also argue that this restriction does not conflict with PG's goal of
datatype extensibility at all. Datatype extensibility in PG boils down
to being able to create new datatypes without modifying postgres itself
- but it still expects that you do so while designing your application.
Which also is when trigger functions that use record_value() or a
similar function would be written.


Plus, fully generic handling of data of arbitrary type is a somewhat 
strange notion anyway, because it leaves you with very few operations 
guaranteed to be defined for those values. In the case of PG, you'd be 
pretty much limited to casting those values from and to text.


best regards,
Florian Pflug






Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql

2009-11-14 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug f...@phlo.org writes:

While I agree that handling arbitrary datatypes at runtime would be
 nice, I really don't see how that could ever be done from within a
 plpgsql procedure, unless plpgsql somehow morphs into a
dynamically typed language.


Which is not likely to happen, which is why this is fundamentally a 
dead end.  I don't think it's appropriate to put ugly, hard to use 
band-aids over the fact that plpgsql isn't designed to do this. One

of the principal reasons why we work so hard to support multiple PLs
is that they have different strengths.  If you need something that's 
more dynamically typed than plpgsql, you should go use something

else.


In principle, I agree. In practice, however, the company I do my
current project for has settled on plpgsql and isn't willing to use
other PLs in their software because they lack the skill to maintain code
written in other PLs. Therefore I'm trying to find an at least somewhat
acceptable solution using plpgsql.


Plus, fully generic handling of data of arbitrary type is a
somewhat strange notion anyway, because it leaves you with very few
operations guaranteed to be defined for those values. In the case
of PG, you'd be pretty much limited to casting those values from
and to text.


Well, that's the wrong way to look at it.  To me, the right design 
would involve saying that my trigger needs to do operation X on the 
data, and therefore it should support all datatypes that can do X. It

should not need a hard-wired list of which types those are.


True, but that'd require fairly large changes to plpgsql AFAICS.

Perhaps it would help if we looked at some specific use-cases that 
people need, rather than debating abstractly.  What do you need your 
generic trigger to *do*?


I need to build a global index table of all values of a certain type
together with a pointer to the row and table that contains them. Since
all involved tables have an id column, storing that pointer is the
easy part. The hard part is collecting all those values in an
insert/update/delete trigger so that I can update the global index
accordingly.

Currently, a set of plpgsql functions generates a separate trigger
function for each table. Yuck!

Instead of this nearly-impossible-to-read code-generating function I
want to create a generic trigger function that works for any of the
involved tables. Preferably in plpgsql because of the skill issue
mentioned above.

best regards,
Florian Pflug






Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql

2009-11-14 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug f...@phlo.org writes:

Tom Lane wrote:

Perhaps it would help if we looked at some specific use-cases
that people need, rather than debating abstractly.  What do you
need your generic trigger to *do*?



I need to build a global index table of all values of a certain
type together with a pointer to the row and table that contains
them. Since all involved tables have an id column, storing that
pointer is the easy part. The hard part is collecting all those
values in an insert/update/delete trigger so that I can update the
global index accordingly.


So in this case it seems like you don't actually need any
polymorphism at all; the target columns are always of a known
datatype.  You just don't want to commit to their names.  I wonder
though why you're willing to pin down the name of the id column but
not the name of the data column.


There might be more than one column (or none at all) of the type to be
indexed. I need to process all such columns (each of them produces a
separate record in the index table). Plus, this schema is relatively
volatile - new fields are added about once a month or so.

Currently, a set of plpgsql functions generates a separate trigger
function for each table. Yuck!


Would you be happy with an approach similar to what Andrew mentioned,
 ie, you generate CREATE TRIGGER commands that list the names of the 
target column(s) as TG_ARGV arguments?  The alternative to that seems
 to be that you iterate at runtime through all the table columns to 
see which ones are of the desired type.  Which might be less trouble 
to set up, but the performance penalty of figuring out 
basically-unchanging information again on every single tuple update 
seems awful high.


Hm.. I had hoped to get away without any need to modify the trigger
definitions if the schema changes. But having a function that does DROP
TRIGGER; CREATE TRIGGER ... is already a huge improvement over having
one that does CREATE FUNCTION ...

I've now played around with the
EXECUTE 'select $1.' || quote_ident(fieldname) USING NEW/OLD
trick, and simply look up the existing fields with
SELECT attname
FROM pg_attribute
WHERE
attrelid = TG_RELID AND
atttypid IN (...) AND
attname NOT IN ('referenced_by', 'self') AND
attnum > 0 AND NOT attisdropped
This at least gives me a working proof-of-concept implementation of the
trigger.
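
For illustration, a rough sketch of what such a generic trigger can look
like (the global_index table, the indexed type text and the trigger target
are made up, and only INSERT is shown to keep it short):

create or replace function global_index_trig() returns trigger as $body$
declare
  v_col name;
  v_val text;
begin
  for v_col in
    select attname
    from pg_attribute
    where attrelid = tg_relid
      and atttypid = 'text'::regtype
      and attnum > 0 and not attisdropped
  loop
    -- the EXECUTE ... USING trick from above extracts the field by name
    execute 'select ($1).' || quote_ident(v_col) into v_val using new;
    -- every involved table has an id column, as described earlier
    insert into global_index (tbl, row_id, value)
    values (tg_relid::regclass, new.id, v_val);
  end loop;
  return new;
end;
$body$ language plpgsql;

create trigger t_global_index after insert on some_table
for each row execute procedure global_index_trig();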

Still, doing that SELECT seems rather silly since NEW and OLD already
contain the required information. So I still believe that having
something like record_names() and record_types() would be useful. And at
least these functions have less of an issue with the type system...

best regards,
Florian Pflug





Re: [HACKERS] Check constraint on domain over an array not executed for array literals

2009-11-13 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Agreed, it's a bug. A simpler example is just: [snipped]


Will the fix for this be included in 8.4.2 (or .3), or will it have to
wait for 8.5 because it changes behavior?

There's a special case in transformExpr function to handle the 
ARRAY[...]::arraytype construct, which skips the usual type-casting

 and just constructs an ArrayExpr with the right target type.
However, it's not taking into account that the target type can be a
domain.

Attached patch fixes that. Anyone see a problem with it?

I'm not familiar with the parser so I can't really judge this. However,
I've applied the patch to my development db and it seems to work fine,
and fixes the bug.

Thanks for the quick response!

best regards,
Florian Pflug




[HACKERS] Inspection of row types in pl/pgsql and pl/sql

2009-11-13 Thread Florian G. Pflug

Hi

I'm currently working on a project where we need to build a global cache
table containing all values of certain types found in any of the other
tables. Currently, a separate insert, update and delete (plpgsql)
trigger function exists for each table in the database, which is
auto-generated by a (plpgsql) function which queries the system catalogs
to find all fields with a certain type, and then generates the
appropriate plpgsql function using EXECUTE '...'.

I'd like to replace this function-generating function by a generic
trigger function that works for all tables. Due to the lack of any way
to inspect the *structure* of a record type, however, I'd have to use a
C language function for that, which induces quite some maintenance
headaches (especially if deployed on windows).

I'm therefore thinking about implementing the following general-purpose
inspection functions for row types:

record_length(record) returns smallint
  Returns the number of fields in the given record.

record_names(record) returns name[]
  Returns the names of the record's fields. Array will contain NULLs
  if one or more fields are unnamed.

record_types(record) returns regtype[];
  Returns the OIDs of the record's types. Array won't contain NULLs

record_value(record, name, anyelement) returns anyelement
  Returns the value of a certain (named) field. The type of the third
  argument defines the return type (its value is ignored). The
  field's value is cast to that type if possible, otherwise an
  error is raised.

record_value(record, smallint, anyelement) returns anyelement
  Returns the value of the field at the given position.

record_values(record, regtype, anyelement) returns anyarray
  Returns an array of all values of all fields with the given type or
  whose type is a domain over the given type. No casting is done.
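
For illustration, here's how these could combine in a trigger function
(hypothetical, of course, since the functions don't exist yet; the type
text is just an example):

create or replace function show_text_fields() returns trigger as $body$
declare
  v_names name[];
  v_types regtype[];
begin
  v_names := record_names(new);
  v_types := record_types(new);
  for i in 1 .. record_length(new) loop
    if v_types[i] = 'text'::regtype then
      raise notice '% = %',
        v_names[i], record_value(new, v_names[i], null::text);
    end if;
  end loop;
  return new;
end;
$body$ language plpgsql;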

Any comment/critique is appreciated.

Would anyone else find those functions useful?

best regards,
Florian Pflug





Re: [HACKERS] Inspection of row types in pl/pgsql and pl/sql

2009-11-13 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug f...@phlo.org writes:
I'd like to replace this function-generating function by a generic 
trigger function that works for all tables. Due to the lack of any

way to inspect the *structure* of a record type, however, I'd have
to use a C language function for that, which induces quite some
maintenance headaches (especially if deployed on windows).


Trying to do this in plpgsql is doomed to failure and heartache,
because it's fundamentally a strongly typed language.  The proposed
functions won't fix that and hence will be unusable in practice.  I'd
suggest either using C, or using one of the less-strongly-typed PLs.


Well, the proposed functions at least allow for some more flexibility in
working with row types, given that you know in advance which types you
will be dealing with (but not necessarily the precise ordering and
number of the record's fields). They might feel a bit kludgy because of
the anyelement dummy argument that bridges the gap between the
statically typed nature of SQL and the rather dynamic RECORDs, but the
kludgy-ness factor is still within reasonable limits I think.

Since all the other PLs (except C) are not nearly as integrated with the
postgres type system, using them for this task does not really buy
anything IMHO. AFAIK, all these PLs will convert any SQL type which
isn't specifically mapped to one of the PL's types to a string. *That* I
can do with pl/pgsql too, by simply using record_out() and then parsing
the result...

C of course lets me work around all these problems - but at the cost of
a longer development time and (more importantly) more maintenance
headaches (especially on windows, where a C compiler is not just one
apt-get/yum/whatever call away).

Regarding usability - the proposed functions would, for example, allow you
to implement a wide range of row-to-text conversion functions in pure
pl/pgsql by calling record_value(record, name, anyelement) with
NULL::varchar as the last argument for each field, and then
concatenating the resulting text together any way you like.

best regards,
Florian Pflug





[HACKERS] Check constraint on domain over an array not executed for array literals

2009-11-12 Thread Florian G. Pflug

Hi

While trying to create a domain over an array type to enforce a certain
shape or certain contents of an array (like the array being only
one-dimensional or not containing NULLs), I've stumbled over what I
believe to be a bug in postgresql 8.4

It seems that check constraints on domains are *not* executed for
literals of the domain-over-array-type - in other words, for expressions
like:
array[...]::my-domain-over-array-type.

They are, however, executed if I first force the array to be of the base
type, and then cast it to the array type.

Here is an example that reproduces the problem:

create domain myintarray as int[] check (
  -- Check that the array is neither null, nor empty,
  -- nor multi-dimensional
  (value is not null) and
  (array_length(value,1) is not null) and
  (array_length(value,1) > 0) and
  (array_length(value,2) is null)
);

select null::myintarray; -- Fails (Right)

select array[]::myintarray; -- Succeeds (Wrong)
select array[]::int[]::myintarray; -- Fails (Right)

select array[1]::myintarray; -- Succeeds (Right)
select array[1]::int[]::myintarray; -- Succeeds (Right)

select array[array[1]]::myintarray; -- Succeeds (Wrong)
select array[array[1]]::int[][]::myintarray; -- Fails (Right)


I guess the reason is that the ::arraytype part of
array[...]::arraytype isn't really a cast at all, but instead part of
the array literal syntax. Hence, array[]::myintarray probably creates an
empty myintarray instance, and then adds the elements between the square
brackets (none) - with none of these steps triggering a run of the check
constraint.

I still have the feeling that this is a bug, though. First, because it
leaves you with no way to guarantee that values of a given domain always
fulfill certain constraints. And second because array[...]::arraytype
at least *looks* like a cast, and hence should behave like one too.

best regards,
Florian Pflug





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Florian G. Pflug

Simon Riggs wrote:

On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:


The main idea was to invert the meaning of the xid array in the snapshot
struct - instead of storing all the xids between xmin and xmax that are
to be considered in-progress, the array contained all the xids >
xmin that are to be considered completed.



The downside is that the size of the read-only snapshot is theoretically
unbounded, which poses a bit of a problem if it's supposed to live
inside shared memory...


Why do it inverted? That clearly has problems.


Because it solves the problem of spontaneously appearing XIDs in the
WAL. At least prior to 8.3 with virtual xids, a transaction might have
allocated its xid long before actually writing anything to disk, and
therefore long before this XID ever shows up in the WAL. And with a
non-inverted snapshot such an XID would be considered to be completed
by transactions on the slave... So, one either needs to periodically log
a snapshot on the master or log XID allocations, which both seem to cause
considerable additional load on the master. With an inverted snapshot,
it's sufficient to log the current RecentXmin - a value that is readily
available on the master, and therefore the cost amounts to just one
additional 4-byte field per xlog entry.


regards, Florian Pflug





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-15 Thread Florian G. Pflug

Simon Riggs wrote:

On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:


The current read-only snapshot (where "current" means the
corresponding state on the master at the time the last replayed wal
record was generated) was maintained in shared memory. Its xmin field
was continually updated with the (newly added) XLogRecord.xl_xmin
field, which contained the xid of the oldest running query on the
master, with a pruning step after each ReadOnlySnapshot.xmin update to
remove all entries < xmin from the xid array. If a commit was seen for
an xid, that xid was added to the ReadOnlySnapshot.xid array.

The advantage of this concept is that it handles snapshotting on the
slave without too much additional work for the master (the only change
is the addition of the xl_xmin field to XLogRecord). It especially
removes the need to track ShmemVariableCache->nextXid.


Snapshots only need to know which transactions are currently running
during WAL apply. The standby can't remove any tuples itself, so it
doesn't need to know what the master's OldestXmin is. 


I used the logged xmin value to track the shared snapshot's xmin, which
in turn allowed me to prune the xid array, eliminating all xids below that xmin.


regards, Florian Pflug





Re: [HACKERS] Transaction Snapshots and Hot Standby

2008-09-13 Thread Florian G. Pflug

Heikki Linnakangas wrote:

BTW, we haven't talked about how to acquire a snapshot in the slave.
 You'll somehow need to know which transactions have not yet
committed, but will in the future. In the master, we keep track of
in-progress transaction in the ProcArray, so I suppose we'll need to
do the same in the slave. Very similar to prepared transactions,
actually. I believe the Abort records, which are not actually needed
for normal operation, become critical here. The slave will need to
put an entry to ProcArray for any new XLogRecord.xl_xid it sees in
the WAL, and remove the entry at a Commit and Abort record. And clear
them all at a shutdown record.


For reference, here is how I solved the snapshot problem in my
Summer-of-Code project last year, which dealt exactly with executing
read-only queries on PITR slaves (But sadly never came out of alpha
stage due to both my and Simon's lack of time)

The main idea was to invert the meaning of the xid array in the snapshot
struct - instead of storing all the xids between xmin and xmax that are
to be considered in-progress, the array contained all the xids >
xmin that are to be considered completed.

The current read-only snapshot (where "current" means the
corresponding state on the master at the time the last replayed wal
record was generated) was maintained in shared memory. Its xmin field
was continually updated with the (newly added) XLogRecord.xl_xmin
field, which contained the xid of the oldest running query on the
master, with a pruning step after each ReadOnlySnapshot.xmin update to
remove all entries < xmin from the xid array. If a commit was seen for
an xid, that xid was added to the ReadOnlySnapshot.xid array.

The advantage of this concept is that it handles snapshotting on the
slave without too much additional work for the master (the only change
is the addition of the xl_xmin field to XLogRecord). It especially
removes the need to track ShmemVariableCache->nextXid.

The downside is that the size of the read-only snapshot is theoretically
unbounded, which poses a bit of a problem if it's supposed to live
inside shared memory...

regards, Florian Pflug





Re: [HACKERS] Getting to universal binaries for Darwin

2008-07-19 Thread Florian G. Pflug

Tom Lane wrote:

You can get around that by hacking up the generated config files
with #ifdef __i386__ and so on to expose the correct values of
the hardware-dependent symbols to each build.  Of course you have
to know what the correct values are --- if you don't have a sample
of each architecture handy to run configure against, it'd be easy
to miss some things.  And even then it's pretty tedious.  I am
not sure if it is possible or worth the trouble to try to automate
this part better.


Hm - configure *does* the right thing if CFLAGS is set to *just* -arch 
i386 or -arch ppc (at least on intel hardware, because OSX can run 
ppc binaries there, but not vice versa), right? If this is true, we need
some way to run configure multiple times, once for each arch, but then 
still get *one* set of Makefiles that have all the archs in their CFLAGS..



Modulo the above problems, I was able to build i386+ppc binaries that
do in fact work on both architectures.  I haven't got any 64-bit Apple
machines to play with, so there might be 64-bit issues I missed.
Still, this is a huge step forward compared to what was discussed here:
http://archives.postgresql.org/pgsql-general/2008-02/msg00200.php

I think that my MacBook should be able to build and run 64-bit binaries,
so I can test that if you want. Do you have a script that does the 
necessary config file magic, or did you do that by hand?


regards, Florian Pflug



Re: [PATCHES] [HACKERS] Solaris ident authentication using unix domain sockets

2008-07-08 Thread Florian G. Pflug

Josh Berkus wrote:

Tom,


Indeed.  If the Solaris folk feel that getpeerucred() is insecure,
 they had better explain why their kernel is that broken.  This is
 entirely unrelated to the known shortcomings of the ident IP 
protocol.


The Solaris security  kernel folks do, actually.  However, there's 
no question that TRUST is inherently insecure, and that's what people

 are going to use if they can't get IDENT to work.


I'd be *very* interested in how they come to that assessment. I'd have
thought that the only alternative to getpeereid()/getpeerucred() is
password-based or certificate-based authentication - which seem *less*
secure because a) they also rely on the client having the correct uid
or gid (to read the password/private key), plus b) the risk of the
password/private key getting into the wrong hands.

How is that sort of authentication handled by services shipping with Solaris?

regards, Florian Pflug, hoping to be enlightened beyond his limited
posix-ish view of the world...




Re: [HACKERS] VirtualXactLockTableInsert

2008-06-27 Thread Florian G. Pflug

Simon Riggs wrote:

When we move from having a virtual xid to having a real xid I don't
see any attempt to re-arrange the lock queues. Surely if there are
people waiting on the virtual xid, they must be moved across to wait
on the actual xid? Otherwise the locking queue will not be respected
because we have two things on which people might queue. Anybody
explain that?


Locks on real xids serve a different purpose than locks on virtual xids.
Locks on real xids are used to wait for transaction who touched a
certain tuple (in which case they certainly must have acquired a real
xid) to end. Locks on vxids on the other hand are used to wait for the
ending of transactions which either hold a certain lock or use a
snapshot with a xmin earlier than some point in time.

indexcmds.c is the only place where VirtualXactLockTableWait() is used -
the concurrent index creation needs to wait for all transactions to end
which either might not know about the index (after phase 1 and 2), or
who might still see tuples not included in the index (before marking the
index valid).


In cases where we know we will assign a real xid, can we just skip
the assignment of the virtual xid completely? For example, where an
implicit transaction is started by a DML statement. Otherwise we have
to wait for 2 lock table inserts, not just one.

A more general solution would be to get rid of vxid locks completely.
This is possible if we figure out a way to handle the first two waiting
phases of concurrent index builds in another way. One idea I had for
approaching this was to extend the lock manager by adding some sort of
WaitForCurrentLockHolders(LockTag) function. I felt (and still feel)
that I didn't understand the locking code well enough to start hacking
it though, and Tom didn't like the idea either. His argument was that it
wasn't clear how deadlock detection would cope with such a facility IIRC.

regards, Florian Pflug




Re: [HACKERS] Hint Bits and Write I/O

2008-05-30 Thread Florian G. Pflug

Kevin Grittner wrote:

On Wed, May 28, 2008 at  6:26 PM, in message

[EMAIL PROTECTED],
Florian G. Pflug [EMAIL PROTECTED] wrote: 
 

I think we should put some randomness into the decision,
to spread the IO caused by hint-bit updates after a batch load.
 
Currently we have a policy of doing a VACUUM FREEZE ANALYZE on a table

after a bulk load, or on the entire database after loading a pg_dump
of a database.  We do this before putting the table or database into
production.  This avoids surprising clusters of writes at
unpredictable times.  Please don't defeat that.  (I'm not sure whether
your current suggestion would.)


No, VACUUM (and therefore VACUUM FREEZE) dirty all buffers they set hint 
bits on anyway, since they also update the xmin values. But a more 
IO-friendly approach to setting hint bits might make that VACUUM FREEZE 
step unnecessary ;-)


regards, Florian Pflug



Re: [HACKERS] Hint Bits and Write I/O

2008-05-28 Thread Florian G. Pflug

Simon Riggs wrote:
Hmm, I think the question is: How many hint bits need to be set 
before we mark the buffer dirty? (N)


Should it be 1, as it is now? Should it be never? Never is a long 
time. As N increases, clog accesses increase. So it would seem there 
is likely to be an optimal value for N.


After further thought, I begin to think that the number of times we set
a hint bit shouldn't influence the decision of whether to dirty
the page too much. Instead, we should look at the *age* of the last xid
which modified the tuple. The idea is that the clog pages showing the
status of young xids are far more likely to be cached than the pages
for older xids. This makes a lost hint-bit update much cheaper for
young than for old xids, because we probably won't waste any IO if we
have to set the hint-bit again later, because the buffer was evicted
from shared_buffers before being written out. Additionally, I think we
should put some randomness into the decision, to spread the IO caused by
hint-bit updates after a batch load.

All in all, I envision a formula like
chance_of_dirtying = min(1,
  alpha
  *floor((next_xid - last_modifying_xid)/clog_page_size)
  /clog_buffers
)

This means that a hint-bit update never triggers dirtying if the last
modifying xid belongs to the same clog page as the next unused xid -
which sounds good, since that clog page gets touched on every commit and
abort, and therefore is cached nearly for sure.

For xids on older pages, the chance of dirtying grows (more aggressively
for larger alpha values). For alpha = 1, a hint-bit update dirties a
buffer for sure only if the xid is older than clog_page_size*clog_buffers.
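To make that concrete, here is a minimal standalone sketch of the decision
rule I have in mind (the constant values and the function name are made up
for illustration only, not actual backend symbols):

#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

#define CLOG_XACTS_PER_PAGE 32768   /* illustrative value only */
#define NUM_CLOG_BUFFERS    8       /* illustrative value only */

/*
 * Decide whether a hint-bit-only update should dirty the buffer.  The
 * probability grows with the distance, in clog pages, between the tuple's
 * last modifying xid and the next xid to be assigned.
 */
static bool
hint_bit_update_should_dirty(uint32_t next_xid, uint32_t last_modifying_xid,
                             double alpha)
{
    uint32_t age_in_pages = (next_xid - last_modifying_xid) / CLOG_XACTS_PER_PAGE;
    double   chance = alpha * age_in_pages / NUM_CLOG_BUFFERS;

    if (chance >= 1.0)
        return true;    /* old xid: its clog page is probably not cached */
    /* same clog page as next_xid gives chance == 0, so we never dirty then */
    return ((double) rand() / RAND_MAX) < chance;
}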

regards,
Florian Pflug



Re: [HACKERS] Hint Bits and Write I/O

2008-05-27 Thread Florian G. Pflug

Simon Riggs wrote:

After some discussions at PGCon, I'd like to make some proposals for
hint bit setting with the aim to reduce write overhead.

Currently, when we see an un-hinted row we set the bit, if possible and
then dirty the block.

If we were to set the bit but *not* dirty the block we may be able to
find a reduction in I/O. In many cases this would make no difference at
all, since we often set hints on an already dirty block. In other cases,
particularly random INSERTs, UPDATEs and DELETEs against large tables
this would reduce I/O, though possibly increase accesses to clog.


Hm, but the IO overhead of hint-bit setting occurs only once, while the
pressure on the clog is increased until we set the hint-bit. This looks 
like not writing the hint-bit update to disk results in worse throughput 
unless there are many updates, and only very few selects. But not too 
many updates either, because if a page gets hit by tuple updates faster 
than the bgwriter writes it out, you won't waste any IO on hint-bit-only 
writes either. That might turn out to be a pretty slim window which 
actually shows substantial IO savings...



My proposal is to have this as a two-stage process. When we set the hint
on a tuple in a clean buffer we mark it BM_DIRTY_HINTONLY, if not
already dirty. If we set a hint on a buffer that is BM_DIRTY_HINTONLY
then we mark it BM_DIRTY.

The objective of this is to remove effects of single index accesses.
So effectively, only the first hint-bit update hitting a previously clean 
buffer gets treated specially - the second hint-bit update flags the 
buffer as dirty, just as it does now? That sounds a bit strange - why is 
it exactly the *second* write that triggers the dirtying? Or did I 
misunderstand what you wrote?
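For concreteness, the reading I have in mind looks like this toy sketch
(BM_DIRTY_HINTONLY is the flag proposed above, not an existing one; buffer
locking is ignored here):

#include <stdio.h>
#include <stdint.h>

typedef uint16_t bits16;

#define BM_DIRTY          (1 << 0)
#define BM_DIRTY_HINTONLY (1 << 1)   /* the proposed new flag */

/* Called whenever a hint bit is set on a tuple in the buffer. */
static void
buffer_hint_bit_set(bits16 *flags)
{
    if (*flags & BM_DIRTY)
        return;                          /* already dirty, nothing to decide */
    if (*flags & BM_DIRTY_HINTONLY)
        *flags |= BM_DIRTY;              /* second hint-bit update: dirty for real */
    else
        *flags |= BM_DIRTY_HINTONLY;     /* first one: remember it, stay "clean" */
}

int
main(void)
{
    bits16 flags = 0;

    buffer_hint_bit_set(&flags);
    printf("after 1st hint-bit update: dirty=%d\n", (flags & BM_DIRTY) != 0);
    buffer_hint_bit_set(&flags);
    printf("after 2nd hint-bit update: dirty=%d\n", (flags & BM_DIRTY) != 0);
    return 0;
}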


regards, Florian Pflug



Re: [HACKERS] triggers on prepare, commit, rollback... ?

2008-05-20 Thread Florian G. Pflug

Tom Lane wrote:

Bruce Momjian [EMAIL PROTECTED] writes:

trigger on prepare, commit, rollback, savepoint,
This is a sufficiently frequently asked question that I wish someone 
would add an entry to the FAQ about it, or add it to the TODO list's 
Features we don't want section.



OK, remind me why we don't want it again?


I'm sure I've ranted on this several times before, but a quick archive
search doesn't find anything.  So, here are a few points to chew on:

* Trigger on rollback: what's that supposed to do?  The current
transaction is already aborted, so the trigger has no hope of making any
database changes that will ever be visible to anyone.

* Trigger on commit: what do you do if the transaction fails after
calling the trigger?  The reductio ad absurdum for this is to consider
having two on-commit triggers, where obviously the second one could
fail.


I fully agree that having triggers on rollback and on commit of 2PC 
transactions is broken by design. Triggers on COMMIT (for non-2PC 
transactions) and PREPARE (for 2PC transactions) seem workable though -
I'd expect such a trigger to be executed *before* any actual commit 
handling takes place. Essentially, doing

BEGIN
some work
COMMIT
in the presence of an on-commit trigger would be equivalent to doing
BEGIN
some work
SELECT my_trigger_function()
COMMIT.

A possible use-case for that is aggregating some statistics collected 
during a transaction. One could e.g. maintain a cache of table rowcounts
by summing up the number of inserted and deleted records per table with 
some per-row ON INSERT and ON DELETE (presumably C-language) triggers,
and then update a global cache at transaction end.

regards, Florian Pflug




Re: [HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.

2008-03-12 Thread Florian G. Pflug

Bruce Momjian wrote:

Is this a TODO?


It's far from clear that avoiding an exclusive ProcArray lock on subxact 
abort will bring a measurable performance benefit, so probably not.


I actually coded a prototype for this a few months ago, to
check if it would bring any benefit at all, though I ran out of time 
before I could benchmark it, and I probably also lack the 
hardware for running high-concurrency tests.



---
Florian G. Pflug wrote:

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Currently, we do not assume that either the childXids array or the xid
cache in the proc array is sorted in ascending xid order. I believe that
we could simplify the code, further reduce the locking requirements, and
enable a transaction to de-overflow its xid cache if we assume that those
arrays are in ascending xid order.

de-overflowing the cache sounds completely unsafe, as other backends need
that state to determine whether they need to look into pg_subtrans.

We'd only de-overflow if we abort *all* xids that are missing from the
xid cache. And only after marking them as aborted in the clog. If someone
concurrently checks for an overflow, and already sees the new (non-overflowed)
state, then he'll assume the xid is not running if he hasn't found it in
the array. Which is correct - we just aborted it.

Plus, removing the exclusive lock doesn't depend on de-overflowing. It's
just something that seems rather easy to do once the subxid handling is
in a state that allows concurrent removal of entries. If it turns out that
it's not that easy, then I'll just drop the idea again.

I still don't believe you can avoid taking exclusive lock, either; your 
argument here did not address latestCompletedXid.

Sorry, not addressing latestCompletedXid was an oversight :-(.
My point is that we only *need* to advance latestCompletedXid on COMMITs. We do
so for aborts only to avoid running with unnecessarily low xmins after
a transaction ABORT. That corner case can only happen after a toplevel
ABORT, though - aborting subxacts cannot change the xmin, because the
toplevel xact will have a lower xid than any of its subtransactions anyway.

We can therefore just remember the largest assigned xid for a given transaction,
and update latestCompletedXid to that on toplevel commit or abort. That
prevents that corner-case too, without updating latestCompletedXid during
subxact abort.


But the main point remains this: there is no evidence whatsoever that these
code paths are sufficiently performance-critical to be worth speeding up by
making the code more fragile.

The gain will be less than that of the locking improvements done so far.
It will be a win for heavy users of plpgsql BEGIN/END/EXCEPTION blocks,
though, I think.

We'll also save some cycles in TransactionIdIsInProgress, because we can
use a binary search, but that's just an added bonus.
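The lookup itself would then just be a plain binary search over the sorted
subxid cache - a minimal sketch, ignoring xid wraparound for the sake of
illustration:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Returns true if xid is present in a cache kept in ascending xid order. */
static bool
subxid_cache_contains(const TransactionId *xids, int nxids, TransactionId xid)
{
    int lo = 0, hi = nxids - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (xids[mid] == xid)
            return true;
        if (xids[mid] < xid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}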

I'm currently trying to code up a patch, since it's easier to judge the
correctness of actual code than that of a mere proposal. I'll do some
benchmarking when the patch is done to see if it brings measurable benefits.




Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Florian G. Pflug

Pavan Deolasee wrote:

What I am thinking is if we can read ahead these blocks in the shared
 buffers and then apply redo changes to them, it can potentially 
improve things a lot. If there are multiple read requests, kernel (or

 controller ?) can probably schedule the reads more efficiently.

The same holds true for index scans, though. Maybe we can find a
solution that benefits both cases - something along the lines of a
bgreader process.

Btw, isn't our redo recovery completely physical in nature ? I mean, 
can we replay redo logs related to a block independent of other 
blocks ? The reason I am asking because if thats the case, ISTM we 
can introduce parallelism in recovery by splitting and reordering the

 xlog records and then run multiple processes to do the redo
recovery.


I'd say it's physical on the tuple level (We just log the new tuple on an
update, not how to calculate it from the old one), but logical on the
page level (We log the fact that a tuple was inserted on a page, but
e.g. the physical location of the tuple on the page can come out
differently upon replay). It's even more logical for indices, because
we log page splits as multiple wal records, letting the recovery process
deal with synthesizing upper-level updates should we crash in the middle
of a page split. Additionally, we log full-page images as a safeguard
against torn page writes. Those would need to be considered as a kind of
reorder barrier in any parallel restore scenario, I guess.

I know that Simon has some ideas about parallel restore, though I don't
know how he wants to solve the dependency issues involved. Perhaps by
not parallelizing within one table or index...





Re: [HACKERS] Read-ahead and parallelism in redo recovery

2008-02-29 Thread Florian G. Pflug

Greg Stark wrote:

Florian G. Pflug wrote:
The same holds true for index scans, though. Maybe we can find a 
solution that benefits both cases - something along the line of a 
bgreader process
I posted a patch to do readahead for bitmap index scans using 
posix_fadvise. Experiments showed it works great on raid arrays on 
Linux. Solaris will need to use libaio though which I haven't tried 
yet.

Cool! I'd like to try it out - is that patch available in the pg-patches
archives?

Doing it for normal index scans is much much harder. You can 
readahead a single page by using the next pointer if it looks like 
you'll need it. But I don't see a convenient way to get more than 
that.

I was thinking that after reading a page from the index, the backend
could post a list of heap pages referenced from that index page to the
shmem. A background process would repeatedly scan that list, and load
those pages into the buffer cache.
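Just to illustrate the readahead part of that idea (this is not Greg's
patch, only a toy sketch assuming a plain file descriptor for the heap
file): once such a list of interesting heap blocks exists, the reader can
hint the kernel about them with posix_fadvise.

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/* Ask the kernel to start reading the given heap blocks ahead of time. */
static void
prefetch_heap_blocks(int fd, const uint32_t *blocknums, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
    {
        off_t offset = (off_t) blocknums[i] * BLCKSZ;
        int   err = posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);

        if (err != 0)
            fprintf(stderr, "posix_fadvise failed: error %d\n", err);
    }
}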

regards, Florian Pflug





Re: [HACKERS] CREATE TABLE, load and freezing

2008-02-28 Thread Florian G. Pflug

Pavan Deolasee wrote:

In a typical scenario, user might create a table and load data in the
table as part of a single transaction (e.g pg_restore). In this case,
it would help if we create the tuples in the *frozen* state to avoid
any wrap-around related issues with the table.  Without this, very
large read-only tables would require one round of complete freezing
if there are lot of transactional activities in the other parts of
the database. And when that happens, it would generate lots of
unnecessary IOs on these large tables.

If that works, then we might also want to set the visibility hint bits.
Not because lookup of that information is expensive - the tuples all 
came from the same transaction, virtually guaranteeing that the relevant
pg_clog page stays in memory after the first few pages.
But by setting them immediately we'd save some IO, since we won't dirty
all pages during the first scan.


I don't know if this is a real problem for anybody, but I could think
 of its use case, at least in theory.
I cannot speak for freeze-on-restore, but in a project I'm currently 
working on, the IO caused (I guess) by hint-bit updates during the
first scan of the table is at least noticeable...

regards, Florian Pflug



Re: [HACKERS] An idea for parallelizing COPY within one backend

2008-02-27 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

...
Neither the dealer nor the workers would need access to either
the shared memory or the disk, thereby not messing with the one backend
is one transaction is one session dogma.
...


Unfortunately, this idea has far too narrow a view of what a datatype
input function might do.  Just for starters, consider enum input,
which certainly requires catalog access.  We have also explicitly
acknowledged the idea that datatype I/O functions might try to store
typmod-related data in some special catalog somewhere.


Hm... how many in-core datatypes are there which need catalog access in
their input or output functions? Maybe we could change the API for
i/o functions in a way that allows us to request all needed information
to be cached?

regards, Florian Pflug



Re: [HACKERS] An idea for parallelizing COPY within one backend

2008-02-27 Thread Florian G. Pflug

Dimitri Fontaine wrote:
Of course, the backends still have to parse the input given by pgloader, which 
only pre-processes data. I'm not sure having the client prepare the data some 
more (binary format or whatever) is a wise idea, as you mentionned and wrt 
Tom's follow-up. But maybe I'm all wrong, so I'm all ears!


As far as I understand, pgloader starts N threads or processes that open 
up N individual connections to the server. In that case, moving the 
text-to-binary conversion from the backend into the loader won't give any
additional performance, I'd say.

The reason that I'd love some within-one-backend solution is that it'd 
allow you to utilize more than one CPU for a restore within a *single* 
transaction. This is something that a client-side solution won't be able 
to deliver, unless major changes to the architecture of postgres happen 
first...


regards, Florian Pflug




Re: [HACKERS] An idea for parallelizing COPY within one backend

2008-02-27 Thread Florian G. Pflug

Brian Hurt wrote:

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

...
Neither the dealer nor the workers would need access to either
the shared memory or the disk, thereby not messing with the one backend
is one transaction is one session dogma.
...

Unfortunately, this idea has far too narrow a view of what a datatype
input function might do.  Just for starters, consider enum input,
which certainly requires catalog access.  We have also explicitly
acknowledged the idea that datatype I/O functions might try to store
typmod-related data in some special catalog somewhere.


Would it be possible to determine when the copy is starting that this 
case holds, and not use the parallel parsing idea in those cases?


In theory, yes. In practice, I don't want to be the one who has to answer 
to an angry user who just suffered a major drop in COPY performance 
after adding an ENUM column to his table.


I was thinking more along the lines of letting a datatype specify a
function void* ioprepare(typmod) which returns some opaque object
specifying all that the input and output functions need to know.
We could then establish the rule that input/output functions may not 
access the catalog, and instead pass them a pointer to that opaque object.


All pretty pie-in-the-sky at the moment, though...
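To make the hand-waving a little more concrete, such an API could
hypothetically look like the following - pure invention on my part, none of
these types or callbacks exist in PostgreSQL:

#include <stdint.h>

typedef struct IOPrepareData IOPrepareData;   /* opaque, datatype-specific */

typedef struct DatatypeIORoutines
{
    /* May access the catalogs; called once, in the main backend. */
    IOPrepareData *(*ioprepare) (int32_t typmod);

    /* Must not touch the catalogs; safe to call from a worker process. */
    void *(*input) (const char *str, const IOPrepareData *prep);
    char *(*output) (const void *value, const IOPrepareData *prep);
} DatatypeIORoutines;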

regards, Florian Pflug




Re: [HACKERS] An idea for parallelizing COPY within one backend

2008-02-27 Thread Florian G. Pflug

Andrew Dunstan wrote:

Florian G. Pflug wrote:
Would it be possible to determine when the copy is starting that this 
case holds, and not use the parallel parsing idea in those cases?


In theory, yes. In practice, I don't want to be the one who has to 
answer to an angry user who just suffered a major drop in COPY 
performance after adding an ENUM column to his table.


I am yet to be convinced that this is even theoretically a good path to 
follow. Any sufficiently large table could probably be partitioned and 
then we could use the parallelism that is being discussed for pg_restore 
without any modification to the backend at all. Similar tricks could be 
played by an external bulk loader for third party data sources.


That assumes that some specific bulkloader like pg_restore, pgloader
or similar is used to perform the load. Plain libpq-users would either 
need to duplicate the logic these loaders contain, or wouldn't be able 
to take advantage of fast loads.


Plus, I'd see this as a kind of testbed for gently introducing 
parallelism into postgres backends (especially thinking about sorting 
here). CPUs gain more and more cores, so in the long run I fear that we 
will have to find ways to utilize more than one of those to execute a 
single query.


But of course the architectural details need to be sorted out before any 
credible judgement about the feasibility of this idea can be made...


regards, Florian Pflug




Re: [HACKERS] An idea for parallelizing COPY within one backend

2008-02-27 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:
Plus, I'd see this as a kind of testbed for gently introducing 
parallelism into postgres backends (especially thinking about sorting 
here).


This thinking is exactly what makes me scream loudly and run in the
other direction.  I don't want threads introduced into the backend,
whether gently or otherwise.  The portability and reliability hits
that we'll take are too daunting.  Threads that invoke user-defined
code (as anything involved with datatype-specific operations must)
are especially fearsome, as there is precisely 0 chance of that code
being thread-safe.


Exactly my thinking. That is why I was looking for a way to introduce 
parallelism *without* threading. Though it's not so much the 
user-defined code that scares me, but rather the portability issues. The 
differences between NPTL and non-NPTL threads on linux alone make me 
shudder...


What I was saying is that there might be a chance to get some parallelism 
without threading, by executing well-defined pieces of code with 
controlled dependencies in separate processes. COPY seemed like an ideal 
testbed for that idea, since the conversion of received lines into 
tuples seemed reasonably self-contained, and with few outside 
dependencies. If the idea can't be made to work there, it probably won't 
work anywhere. If it turns out that it does (with an API change for 
input/output functions) however, then it *might* be possible to apply it 
to other relatively self-contained parts in the future...


To restate, I don't want threaded backends. Not in the foreseeable 
future at least. But I'd still love to see a single transaction using 
more than one core.


regards, Florian Pflug






[HACKERS] An idea for parallelizing COPY within one backend

2008-02-26 Thread Florian G. Pflug

As far as I can see, the main difficulty in making COPY run faster (on
the server) is the pretty involved conversion from plain-text lines
into tuples. Trying to get rid of this conversion by having the client
send something that resembles the data stored in on-disk tuples is not a
good answer, either, because it ties the client too closely to
backend-version specific implementation details.

But those problems only arise if the *client* needs to deal with the
binary format. What I envision is parallelizing that conversion step on
the server, controlled by a backend process, kind of like a filter
between the server and the client.

Upon reception of a COPY INTO command, a backend would
.) Retrieve all catalog information required to convert a plain-text
line into a tuple
.) Fork off a dealer and N worker processes that take over the
client connection. The dealer distributes lines received from the
client to the N workers, while the original backend receives them
as tuples back from the workers.

Neither the dealer nor the workers would need access to either
the shared memory or the disk, thereby not messing with the one backend
is one transaction is one session dogma.
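
To illustrate the intended process topology, here is a toy sketch using
nothing but ordinary pipes and fork() - it is not PostgreSQL code, the
tuple conversion is a placeholder, and result ordering is not preserved:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define NWORKERS 4

/* Placeholder for the plain-text-line -> tuple conversion step. */
static void
worker_loop(FILE *in, FILE *out)
{
    char line[8192];

    while (fgets(line, sizeof(line), in))
        fprintf(out, "converted: %s", line);
}

int
main(void)
{
    int to_worker[NWORKERS][2];
    int results[2];

    pipe(results);
    for (int i = 0; i < NWORKERS; i++)
        pipe(to_worker[i]);

    for (int i = 0; i < NWORKERS; i++)
    {
        if (fork() == 0)
        {
            /* worker i: keep only its own read end and the result write end */
            close(results[0]);
            for (int j = 0; j < NWORKERS; j++)
            {
                close(to_worker[j][1]);
                if (j != i)
                    close(to_worker[j][0]);
            }
            worker_loop(fdopen(to_worker[i][0], "r"), fdopen(results[1], "w"));
            exit(0);
        }
    }

    if (fork() == 0)
    {
        /* dealer: stdin stands in for the client connection; deal the
         * received lines out to the workers round-robin */
        FILE *out[NWORKERS];
        char  line[8192];
        int   next = 0;

        close(results[0]);
        close(results[1]);
        for (int i = 0; i < NWORKERS; i++)
        {
            close(to_worker[i][0]);
            out[i] = fdopen(to_worker[i][1], "w");
        }
        while (fgets(line, sizeof(line), stdin))
        {
            fputs(line, out[next]);
            next = (next + 1) % NWORKERS;
        }
        for (int i = 0; i < NWORKERS; i++)
            fclose(out[i]);
        exit(0);
    }

    /* original backend: consume converted tuples; the real thing would
     * form heap tuples here and insert them */
    close(results[1]);
    for (int i = 0; i < NWORKERS; i++)
    {
        close(to_worker[i][0]);
        close(to_worker[i][1]);
    }

    FILE *res = fdopen(results[0], "r");
    char  line[8192];

    while (fgets(line, sizeof(line), res))
        fputs(line, stdout);

    while (wait(NULL) > 0)
        ;
    return 0;
}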

Now I'm eagerly waiting to hear all the reasons why this idea is broken
as hell ;-)
regards, Florian Pflug



Re: [HACKERS] dblink doesn't honor interrupts while waiting a result

2008-02-25 Thread Florian G. Pflug

Marko Kreen wrote:

On 2/25/08, Florian G. Pflug [EMAIL PROTECTED] wrote:

 I'm not sure how a proper fix for this could look like, since the
 blocking actually happens inside libpq - but this certainly makes
 working with dblink painfull...


Proper fix would be to use async libpq API, then loop on poll(2)
with small timeout.  You can look at pl/proxy for example code.

Ah, cool, I'll check out pl/proxy.
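
For the archives, the pattern looks roughly like this minimal sketch based
on the plain libpq async API (it assumes an already-open PGconn and a
hypothetical interrupt_pending flag standing in for whatever cancellation
check the caller wants; error handling is mostly omitted):

#include <poll.h>
#include <libpq-fe.h>

/*
 * Send a query and wait for its result without blocking indefinitely,
 * checking for interrupts between short poll() timeouts.
 */
static PGresult *
exec_query_interruptibly(PGconn *conn, const char *query,
                         volatile int *interrupt_pending)
{
    if (!PQsendQuery(conn, query))
        return NULL;

    while (PQisBusy(conn))
    {
        struct pollfd pfd;

        if (*interrupt_pending)
            return NULL;             /* caller decides how to cancel/clean up */

        pfd.fd = PQsocket(conn);
        pfd.events = POLLIN | POLLERR;
        poll(&pfd, 1, 100);          /* wait at most 100 ms per iteration */

        if (!PQconsumeInput(conn))   /* read whatever has arrived so far */
            return NULL;
    }
    return PQgetResult(conn);        /* a complete result is now available */
}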

regards, Florian Pflug



[HACKERS] Behaviour of rows containg not-null domains in plpgsql

2008-02-24 Thread Florian G. Pflug

Hi

I just stumbled over the following behaviour, introduced with 8.3, and
wondered if this is by design or an oversight.

If you define a domain over some existing type, constrain it to
non-null values, and use that domain as a field type in a table 
definition, it seems to be impossible to declare pl/pgsql variables
of that table's row type. The problem seems to be that upon declaration,
the row variable is filled with nulls - but since the domain is marked
not-null, that immediately triggers an exception.

Here is an example
CREATE DOMAIN d AS varchar NOT NULL;
CREATE TABLE t (txt d);
CREATE FUNCTION f() RETURNS VOID AS $$
DECLARE
  v_t t;
BEGIN
END;
$$ LANGUAGE 'plpgsql' VOLATILE;
SELECT f();

Note that the following works.
CREATE TABLE t2 (txt varchar not null);
CREATE FUNCTION f2() RETURNS VOID AS $$
DECLARE
  v_t t2;
BEGIN
END;
$$ LANGUAGE 'plpgsql' VOLATILE;
SELECT f2();

If you guys agree that this is a bug, I'll try to find a fix and send a 
patch.


greetings, Florian Pflug




Re: [HACKERS] Behaviour of rows containg not-null domains in plpgsql

2008-02-24 Thread Florian G. Pflug

Andrew Dunstan wrote:

Florian G. Pflug wrote:
If you define a domain over some existing type, constrain it to 
non-null values, and use that domain as a field type in a table 
definition, it seems to be impossible to declare pl/pgsql variables

 of that table's row type. The problem seems to be that upon
declaration, the row variable is filled with nulls - but since the
domain is marked not-null, that immediately triggers an exception.

Here is an example [snipped example]



What seems worse is that it still fails even if you declare the
domain to have a default value.

I didn't try that, but what I *did* try was providing a default value for the
row variable - which doesn't work either, since we do not currently
support row variable defaults.

The only workaround I found was to define the variable as record.

regards, Florian Pflug




Re: [HACKERS] Behaviour of rows containg not-null domains in plpgsql

2008-02-24 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:
I just stumbled over the following behaviour, introduced with 8.3, 
and wondered if this is by design or an oversight.


No, this was in 8.2.

Ah, sorry - I'm porting an app from 8.1 straight to 8.3, and blindly
assumed that it'd have worked with 8.2...

If you define a domain over some existing type, constrain it to 
non-null values, and use that domain as a field type in a table 
definition, it seems to be impossible to declare pl/pgsql variables
 of that table's row type. The problem seems to be that upon 
declaration, the row variable is filled with nulls - but since the 
domain is marked not-null, that immediately triggers an exception.


What else would you expect it to do?  AFAICS any other behavior would
 be contrary to spec.

It's the inconsistency between row types (where the not-null constraint
in the table definition *doesn't* prevent a declaration like myvar
mytable in pl/pgsql), and domains (where the not-null constraint *does*
prevent such a declaration) that bugs me.

Plus, the fact that we don't support default specifications in
pl/pgsql for row types turns this inconvenience into a major PITA,
forcing you to use record when you know the correct type perfectly
well...

Is there some difficulty in implementing row-type defaults, or is it
just that nobody cared enough about them to do the work?

regards, Florian Pflug




Re: [HACKERS] Behaviour of rows containg not-null domains in plpgsql

2008-02-24 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Plus, the fact that we don't support default specifications in
pl/pgsql for row types turns this inconvenience into a major PITA,


You mean initialization expressions, not defaults, correct?  (I would
consider the latter to mean that whatever attrdef entries were attached
to the rowtype's parent table would be used implicitly.)

Yeah, I mean writing declare v_var schema.table default row(...)

regards, Florian Pflug



[HACKERS] dblink doesn't honor interrupts while waiting a result

2008-02-24 Thread Florian G. Pflug

Hi

dblink in 8.3 blocks without any possibility of interrupting it while
waiting for an answer from the remote server. Here is a strace
[pid 27607] rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0
[pid 27607] sendto(56, Q\0\0\0008lock table travelhit.booking_code in 
exclusive mode\0, 57, 0, NULL, 0) = 57

[pid 27607] rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_IGN}, 8) = 0
[pid 27607] poll([{fd=56, events=POLLIN|POLLERR}], 1, -1) = ? 
ERESTART_RESTARTBLOCK (To be restarted)

[pid 27607] --- SIGTERM (Terminated) @ 0 (0) ---
[pid 27607] rt_sigreturn(0xf)   = -1 EINTR (Interrupted system call)
[pid 27607] poll(

As you can see, I'm trying to lock the table travelhit.booking_code, 
which blocks because someone else is already holding that lock. When
I send a SIGTERM to the backend, the poll() syscall is interrupted -
but immediately restarted.

I'm not sure what a proper fix for this would look like, since the 
blocking actually happens inside libpq - but this certainly makes 
working with dblink painful...


regards, Florian Pflug



Re: [HACKERS] configurability of OOM killer

2008-02-02 Thread Florian G. Pflug

Tom Lane wrote:

Florian Weimer [EMAIL PROTECTED] writes:

* Alvaro Herrera:

I am wondering if we can set the system up so that it skips postmaster,



How much does that help?  The postmaster's children still need to be shut down
when a regular backend dies due to SIGKILL.


The $64 problem is that if the parent postmaster process is victimized
by the OOM killer, you won't get an automatic restart.  In most people's
eyes that is considerably worse than the momentary DOS imposed by a kill
of a child backend.  And what we now find, which is truly staggeringly
stupid on the kernel's part, is that it *preferentially* kills the
parent instead of whatever child might actually be eating the memory.


Maybe we should react in an equally brute-force way, and just disable the 
OOM killer for the postmaster if we're running on Linux. It seems that 
something like echo -17 > /proc/<pid>/oom_adj should do the trick.


And maybe add a note to the docs telling people to disable memory 
overcommit on dedicated database servers if that isn't already there...


regards, Florian Pflug



Re: [HACKERS] configurability of OOM killer

2008-02-02 Thread Florian G. Pflug

Tom Lane wrote:
Another thought is to tell people to run the postmaster under a 
per-process memory ulimit that is conservative enough so that the 
system can't get into the regime where the OOM killer activates. 
ulimit actually behaves the way we want, ie, it's polite about 
telling you you can't have more memory ;-).


That will only work if postgres is the only service running on the
machine, though, no? If the postmaster and its children use up 80% of
the available memory, then launching a forkbomb will still lead to the
postmaster being killed (since it will get the most points). Or at least
this is how I interpret the link posted originally.

And *if* postgres is the only service, does setting a ulimit have an
advantage over disabling memory overcommitting?

AFAICS, memory overcommit helps if a program creates 50MB of mostly
read-only data, and then forks 10 times, or if it maps a large amount of
memory but writes to that block only sparsely. Since postgres does
neither, a dedicated postgres server won't see any benefits from
overcommitting memory, I'd think.

regards, Florian Pflug




Re: [HACKERS] CLUSTER and synchronized scans and pg_dump et al

2008-01-28 Thread Florian G. Pflug

Guillaume Smet wrote:

On Jan 27, 2008 9:07 PM, Markus Bertheau
[EMAIL PROTECTED] wrote:

2008/1/28, Tom Lane [EMAIL PROTECTED]:

Do we have nominations for a name?  The first idea that comes to
mind is synchronized_scanning (defaulting to ON).
synchronized_sequential_scans is a bit long, but contains the 
keyword sequential scans, which will ring a bell with many, more

so than synchronized_scanning.


synchronize_seqscans?


How about enable_syncscan, or enable_seqscan_sync? It's not strictly
something that influences the planner, but maybe it's similar enough to
justify a similar naming?

regards, Florian Pflug



Re: [HACKERS] CLUSTER and synchronized scans and pg_dump et al

2008-01-28 Thread Florian G. Pflug

Steve Atkins wrote:

On Jan 28, 2008, at 8:36 AM, Tom Lane wrote:


Andrew Dunstan [EMAIL PROTECTED] writes:

Kevin Grittner wrote:

It would seem reasonable to me for pg_dump to use ORDER BY to select
data from clustered tables.



What will be the performance hit from doing that?


That worries me too.  Also, in general pg_dump's charter is to reproduce
the state of the database as best it can, not to improve it.


One common use of cluster around here is to act as a faster version
of vacuum full when there's a lot of dead rows in a table. There's no
intent to keep the table clustered on that index, and the cluster flag
isn't removed with alter table (why bother, the only thing it affects is
the cluster command).

I'm guessing that's not unusual, and it'd lead to sorting tables as part
of pg_dump.


I've done that too - and every time I typed that CLUSTER ...  I 
thought why, oh why isn't there something like REWRITE TABLE table, 
which would work just like CLUSTER, but without the sorting ;-) Maybe 
something to put on the TODO list...


We might even call it VACUUM REWRITE ;-)

regards, Florian Pflug



Re: [HACKERS] Transaction Snapshot Cloning

2008-01-17 Thread Florian G. Pflug

Tom Lane wrote:

I'm not sure what the most convenient user API would be for an on-demand
hard-read-only mode, but we can't use SET TRANSACTION READ ONLY for it.
It'd have to be some other syntax.  Maybe just use a GUC variable
instead of bespoke syntax?  SET TRANSACTION is really just syntactic
sugar for GUC SET operations anyway ...


We could reuse the transaction_read_only GUC, adding strict as a 3rd 
allowed value beside on and off. And maybe make ansi an alias for 
on to emphasize that one behavior is what the standard wants, and the 
other is a postgres extension.


regards, Florian Pflug




Re: [HACKERS] Postgresql Materialized views

2008-01-17 Thread Florian G. Pflug

Tom Lane wrote:

Well, my point is that taking automatic rewriting as a required feature
has at least two negative impacts:

* it rules out any form of lazy update, even though for many applications
an out-of-date summary view would be acceptable for some purposes;

* requiring MVCC consistency will probably hugely reduce the variety of
views that we can figure out how to materialize, and cost performance
even for the ones we can do at all.

It's not zero-cost, even if you consider implementation effort and
complexity as free (which I don't).


There is one big additional advantage of automatic rewriting though, I 
believe. If we had the infrastructure to recognize the possibility of 
using a predefined (materialized) view for executing a query, we could 
also use that infrastructure to implement a kind of optimizer hint.


How? We'd need statistics-materialized views, which don't materialize 
the full result of the view, but instead compute its statistical 
properties (the same ones that ANALYZE computes for a table). When planning 
a query we can then substitute the guessed values for rowcount and 
friends of a subplan with the values computed for the corresponding 
statistics-materialized view.


However, until someone figures out *how* to *actually* recognize the 
possibility of using a MV for executing a query, this is just 
hand-waving of course...


regards, Florian Pflug




Re: [HACKERS] Transaction Snapshot Cloning

2008-01-16 Thread Florian G. Pflug

Tom Lane wrote:

Chris Browne [EMAIL PROTECTED] writes:

Note that we required that the provider transaction have the
attributes IsXactIsoLevelSerializable and XactReadOnly both being
true, so we have the mandates that the resultant backend process:



a) Is in read only mode, and
b) Is in serializable mode.


If XactReadOnly were a hard read only constraint, that argument
might be worth the electrons it's written on.  I quote TFM:

Now I think someone was looking into a hard read only mode for
use in doing read-only queries against a PITR slave; if that
ever happens it might be adaptable to serve this purpose too.
But we haven't got it today.


That would have been me then ;-)

I think that lazy xid assignment actually got us 90% of the way towards
a hard transaction read-only constraint - nearly all data-modifying 
operations surely depend on the xact having an xid assigned, no? (The 
only exception might be nextval() and friends).


I seem to remember there being some pushback to the idea of changing the 
semantics of set transaction read only from soft to hard 
semantics though - on the basis that it might break existing 
applications. If that has changed (or my memory is playing tricks on me ;-) ) I'd 
volunteer to create a patch for 8.4 to make set transaction read only 
a hard constraint.


regards, Florian Pflug



Re: [HACKERS] module archive

2007-10-27 Thread Florian G. Pflug

Peter Eisentraut wrote:

Am Donnerstag, 25. Oktober 2007 schrieb Andrew Dunstan:

From time to time people have raised the idea of a CPAN-like mechanism for
downloading, building and installing extensions and the like (types, 
functions, sample dbs, anything not requiring Postgres itself to be 
rebuilt), and I have been thinking on this for the last few days. What sort

of requirements would people have of such a mechanism? How do people
envision it working?


Downloading, building, and installing extensions is actually fairly 
standardized already (well, perhaps there are 2 or 3 standards, but CPAN has

 that as well).  I think the inhibitions relate more to the management of
what is installed.

I imagine we need a package manager inside of PostgreSQL to manage 
installation, setup, removal, dependencies and so on.  Much like rpm or dpkg

 really.  That should replace the current run this .sql file mechanism,
much like rpm and dpkg replaced the run make install and trust me
mechanism.  I have some of this mapped out in my head if there is interest.


The major challenge that I see is getting updates right, especially when the
package/module contains tables which the user might have added data to (Like for
example pre-8.3 tsearch). Both for updates of the packages, and for upgrading
postgres to a new major revision.

Maybe there are some schema-versioning tools available already that might help,
though...

We'd also need easy integration with the real rpm and dpkg, so that 
distribution packages can be built easily and I can run


apt-get install postgresql extension1 extension2

Wow, that is ambitious ;-)
I haven't yet seen a single distribution that gets this right for CPAN, Ruby
gems, or anything of the like - if you know one, I'd be very interested in trying
it out.

Speaking for myself, I'd already be very happy if I could do
apt-get install postgresql postgresql-pkg
and then
postpkg database install whatever module.

That'd also allow postpkg to deal with the database-specific requirements of a
package manager (like specifying which db to add the module to).

regards, Florian Pflug




Re: [HACKERS] PostgreSQL 8.3, libpq and WHERE CURRENT OF

2007-10-27 Thread Florian G. Pflug

Sebastien FLAESCH wrote:

Forget this one, just missing the WITH HOLD option... Must teach myself a bit
more before sending further mails. Seb


AFAIK you cannot use WITH HOLD together with updateable cursors.
I might be wrong, though...

regards, Florian Pflug



Re: [HACKERS] full text search in 8.3

2007-10-19 Thread Florian G. Pflug

andy wrote:
Is there any chance there is an easier way to backup/restore?  On one 
hand, its not too bad, and it'll only be once (correct?).  Now that fts 
is in core future backup/restores will work, right?  I think it's 
analogous to telling someone they are updating from tsearch2 to 
tsearch3, and it might be a little more painful than just a backup/restore.


On the other hand I think a backup/restore will pollute the new db with 
a bunch of functions and types that wont ever be used, so it's so much 
cleaner to build it by hand.


Are there other fts users that might have opinions on that?


I'm not really a tsearch user (just played with it a bit once). But I wondered 
if you are aware that you can prevent certain objects from being restored
quite easily if you use pg_dump and pg_restore together with the custom format 
(-Fc). pg_restore has an option (-l) that reads the dump and outputs a 
table of contents. You can then remove some entries from that list, and pass the 
modified list back to pg_restore (-L), which will skip entries that do not show up on your 
modified list.


Maybe we could document some regexp, awk script, or similar that strips the 
tsearch stuff from such a table of contents?


regards, Florian Pflug



Re: [HACKERS] Plan invalidation vs temp sequences

2007-10-19 Thread Florian G. Pflug

Gregory Stark wrote:

Tom Lane [EMAIL PROTECTED] writes:


There doesn't seem to be any very nice way to fix this.  There is
not any existing support mechanism (comparable to query_tree_walker)
for scanning whole plan trees, which means that searching a cached plan
for regclass Consts is going to involve a chunk of new code no matter
how we approach it.  We might want to do that someday --- in particular,
if we ever try to extend the plan inval mechanism to react to
redefinitions of non-table objects, we'd likely need some such thing
anyway.  I'm disinclined to try to do it for 8.3 though.  The use-case
for temp sequences seems a bit narrow and there are several workarounds
(see followups to bug report), so I'm feeling this is a
fix-some-other-day kind of issue.


Given that sequences are in fact relations is there some way to work around
the issue at least in this case by stuffing the sequence's relid someplace
which the plan invalldation code can check for it?


Hm... couldn't this be worked around by doing
create or replace function dynamic_oid(text) returns regclass as
'select $1::regclass' language sql stable;
and then writing
nextval(dynamic_oid('mysequence'))?

I didn't test this, but if it actually works, maybe we should just stick this
into the docs somewhere. It's probably too late to add that function to the 
backend, though...


As long as mysequence is really a temporary sequence, this won't even have
search path issues I think, because those are always on top of the search path
anyway, aren't they?

greetings, Florian Pflug




Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-18 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Tom Lane wrote:

I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.


Ok, I'll write a patch to do that.


What is the argument against making relfilenodes globally unique by adding the 
xid and epoch of the creating transaction to the filename? Those 64 bits could 
be stuffed into 13 bytes by base-36 encoding (A-Z,0-9). The maximum length of a 
relfilenode would then be 10 + 1 + 13 = 24, which any reasonable filesystem 
should support IMHO.
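
Just to show that the arithmetic works out, a tiny standalone sketch of the
encoding step (only the 64-bit-value-to-13-characters part, nothing
PostgreSQL-specific):

#include <stdio.h>
#include <stdint.h>

/*
 * Encode a 64-bit value (e.g. epoch << 32 | xid) as exactly 13 base-36
 * digits (A-Z, 0-9); since 36^13 > 2^64, 13 digits always suffice.
 */
static void
encode_base36(uint64_t value, char out[14])
{
    static const char digits[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    for (int i = 12; i >= 0; i--)
    {
        out[i] = digits[value % 36];
        value /= 36;
    }
    out[13] = '\0';
}

int
main(void)
{
    char buf[14];

    encode_base36(((uint64_t) 1 << 32) | 12345, buf);   /* epoch 1, xid 12345 */
    printf("1234567890_%s\n", buf);   /* e.g. relfilenode + "_" + encoded part */
    return 0;
}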


regards, Florian Pflug





Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-18 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

What is the argument against making relfilenodes globally unique by adding
the xid and epoch of the creating transaction to the filename?


1. Zero chance of ever backpatching.  (I know I said I wasn't excited about
that, but it's still a strike against a proposed fix.)

2. Adds new fields to RelFileNode, which will be a major code change, and
possibly a noticeable performance hit (bigger hashtable keys).

3. Adds new columns to pg_class, which is a real PITA ...

4. Breaks oid2name and all similar code that knows about relfilenode.


Ah, Ok. I was under the impression that relfilenode in pg_class is a string of
some kind. In that case only GetNewRelFileNode would have needed patching...
But that is obviously not the case, as I realized now :-(

Thanks for setting me straight ;-)

regards, Florian Pflug



Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-17 Thread Florian G. Pflug

Heikki Linnakangas wrote:

I wrote:

Unfortunately I don't see any easy way to fix it. One approach would be
to avoid reusing the relfilenodes until next checkpoint, but I don't see
any nice place to keep track of OIDs that have been dropped since last
checkpoint.


Ok, here's one idea:

Instead of deleting the file immediately on commit of DROP TABLE, the
file is truncated to release the space, but not unlink()ed, to avoid
reusing that relfilenode. The truncated file can be deleted after next
checkpoint.

Now, how does checkpoint know what to delete? We can use the fsync
request mechanism for that. When a file is truncated, a new kind of
fsync request, a deletion request, is sent to the bgwriter, which
collects all such requests to a list. Before checkpoint calculates new
RedoRecPtr, the list is swapped with an empty one, and after writing the
new checkpoint record, all the files that were in the list are deleted.

We would leak empty files on crashes, but we leak files on crashes
anyway, so that shouldn't be an issue. This scheme wouldn't require
catalog changes, so it would be suitable for backpatching.

Any better ideas?
Couldn't we fix this by forcing a checkpoint before we commit the transaction 
that created the new pg_class entry for the clustered table? Or rather, more 
generally, before committing a transaction that created a new non-temporary 
relfilenode but didn't WAL-log any subsequent inserts.


That's of course a rather sledgehammer-like approach to this problem - but at 
least for the back branches the fix would be less intrusive...


regards, Florian Pflug




Re: [HACKERS] Why copy_relation_data only use wal when WAL archiving is enabled

2007-10-17 Thread Florian G. Pflug

Simon Riggs wrote:

On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
Second, suppose that no checkpoint has occurred during the upper 
series--although not quite possible;
That part is irrelevant. It's forced out to disk and doesn't need 
recovery, with or without the checkpoint.


There's no hole that I can see.
No, Jacky is right. The same problem exists at least with CLUSTER, and I 
think there's other commands that rely on immediate fsync as well.


Attached is a shell script that demonstrates the problem on CVS HEAD with
CLUSTER. It creates two tables, T1 and T2, both with one row. Then T1 is
dropped, and T2 is CLUSTERed, so that the new T2 relation file happens to
get the same relfilenode that T1 had. Then we crash the server, forcing a
WAL replay. After that, T2 is empty. Oops.

Unfortunately I don't see any easy way to fix it.


So, what you are saying is that re-using relfilenodes can cause problems 
during recovery in any command that alters the relfilenode of a relation?


From what I understand, I'd say that creating a relfilenode *and* subsequently
inserting data without WAL-logging causes the problem. If the relfilenode was
recently deleted, the inserts might be effectively undone upon recovery (because
we first replay the delete), but later *not* redone (because we didn't WAL-log
the inserts).

That brings me to another idea for a fix that is less heavyweight than my
previous checkpoint-before-commit suggestion.

We could make relfilenodes globally unique if we added the xid and epoch of the
creating transaction to the filename. Those are 64 bits, so if we encode them
in base 36 (using A-Z,0-9), that'd increase the length of the filenames by 13.

regards, Florian Pflug




Re: [HACKERS] pg_restore oddity?

2007-10-12 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Mario Weilguni wrote:

I cannot use -1 for performance, because some gist stuff has changed
and the restore fails. But there seems to be no option for pg_restore to
use transactions for data restore, so it's very very slow (one million
records, each obviously in it's own transaction - because a separate
session select count(1) from logins shows a growing number).


By default, pg_dump/pg_restore uses a COPY command for each table, and
each COPY executes as a single transaction, so you shouldn't see the row
count growing like that. Is the dump file in --inserts format?


It would be nice to use transactions for the data stuff itself, but not
for schema changes or functions. I know I can use separate pg_restore
runs for schema and data, but it's complicated IMHO.


pg_restore -s foo
pg_restore -a -1 foo

doesn't seem too complicated to me. Am I missing something?


Doesn't pg_restore create the indices *after* loading the data if you let it 
restore the schema *and* the data in one step? The above workaround would 
disable that optimization, thereby making the data-restore phase much more costly.


Now that I think about it, I remember that I've often wished that we not only 
had --schema-only and --data-only, but also --schema-unconstrained-only and 
--constraints-only.


regards, Florian Pflug



Re: [HACKERS] Locales and Encodings

2007-10-12 Thread Florian G. Pflug

Tom Lane wrote:

Peter Eisentraut [EMAIL PROTECTED] writes:

Am Freitag, 12. Oktober 2007 schrieb Gregory Stark:

It would make Postgres inconsistent and less integrated with the rest of
the OS. How do you explain that Postgres doesn't follow the system's
configurations and the collations don't agree with the system collations?


We already have our own encoding support (for better or worse), and I don't 
think having one's own locale support would be that much different.


Well, yes it would be, because encodings are pretty well standardized;
there is not likely to be any user-visible difference between one
platform's idea of UTF8 and another's.  This is very very far from being
the case for locales.  See for instance the recent thread in which we
found out that en_US locale has utterly different sort orders on
Linux and OS X.


For me, this paragraph is more of an argument *in favour* of having our own 
locale support. At least for me, consistency between PG running on different 
platforms would bring more benefits than consistency between PG and the platform 
it runs on.


At the company I used to work for, we had all our databases running with 
encoding=utf-8 and locale=C, because I didn't want our applications to depend on 
platform-specific locale issues. Plus, some of the applications supported 
multiple languages, making a cluster-global locale unworkable anyway - a 
restriction which would go away if we went with ICU.


regards, Florian Pflug




Re: [HACKERS] Plan invalidation vs temp sequences

2007-10-11 Thread Florian G. Pflug

Gregory Stark wrote:

Tom Lane [EMAIL PROTECTED] writes:


There doesn't seem to be any very nice way to fix this.  There is
not any existing support mechanism (comparable to query_tree_walker)
for scanning whole plan trees, which means that searching a cached plan
for regclass Consts is going to involve a chunk of new code no matter
how we approach it.  We might want to do that someday --- in particular,
if we ever try to extend the plan inval mechanism to react to
redefinitions of non-table objects, we'd likely need some such thing
anyway.  I'm disinclined to try to do it for 8.3 though.  The use-case
for temp sequences seems a bit narrow and there are several workarounds
(see followups to bug report), so I'm feeling this is a
fix-some-other-day kind of issue.


Given that sequences are in fact relations is there some way to work around
the issue at least in this case by stuffing the sequence's relid someplace
which the plan invalidation code can check for it?


Hm... couldn't this be worked around by doing
create or replace function dynamic_oid(text) returns regclass as
'begin return $1::regclass; end;' language plpgsql stable;
And then writing
nextval(dynamic_oid('mysequence')).

I didn't test this, but if it actually works, maybe we should just stick this
into the docs somewhere. It's probably too late to add that function to the 
backend, though...


As long as mysequence is really a temporary sequence, this won't even have
search_path issues I think, because those are always on top of the search path
anyway, aren't they?

greetings, Florian Pflug




Re: [HACKERS] full text search in 8.3

2007-10-11 Thread Florian G. Pflug

andy wrote:
Is there any chance there is an easier way to backup/restore?  On one 
hand, its not too bad, and it'll only be once (correct?).  Now that fts 
is in core future backup/restores will work, right?  I think it's 
analogous to telling someone they are updating from tsearch2 to 
tsearch3, and it might be a little more painful than just a backup/restore.


On the other hand I think a backup/restore will pollute the new db with 
a bunch of functions and types that wont ever be used, so it's so much 
cleaner to build it by hand.


Are there other fts users that might have opinions on that?


I'm not really a tsearch user (just played with it a bit once). But I wondered 
if you are aware that you can prevent certain objects from being restored
quite easily if you use pg_dump and pg_restore together with custom format 
(-Fc). There is some option to pg_restore that reads the dump, and outputs a 
table of contents. You can then remove some entries from that list, and pass the 
modified list to pg_restore which will skip entries that do not show up on your 
modified list.


Maybe we could document some regexp, awk script, or similar that strips the 
tsearch stuff from such a table of contents?


regards, Florian Pflug



Re: [HACKERS] Including Snapshot Info with Indexes

2007-10-09 Thread Florian G. Pflug

Csaba Nagy wrote:

Can we frame a set of guidelines, or may be some test procedure, which
can declare a certain function as deterministic? 


You mean postgres should check your function if it is really immutable ?
I can't imagine any way to do it correctly in reasonable time :-)
Imagine a function of 10 parameters which returns the sum of the
parameters all the time except for parameters all 1 it will randomly
return a value _once in a thousand executions_... please find a generic
algorithm which spots this function as not immutable in reasonable
execution time ;-)
So this example is a bit extreme, but don't underestimate the user ;-)


I think you're overly pessimistic here ;-) This classification can be done quite 
efficiently as long as your language is static enough. The trick is not to 
execute the function, but to scan the code to find all other functions and SQL 
statements a given function may possibly call. If your function calls no SQL 
statements, and only other functions already marked IMMUTABLE, then it must be 
IMMUTABLE itself.


It does seem that only pl/pgsql is static enough for this to work, though,
making this idea rather unappealing.


I am just saying from the top of my mind. Even otherwise, if we can
even restrict this indexing to only Built-in deterministic functions.,
don't you think it would help the cause of a majority? I have just
made the proposal to create the index with snapshot a optional one. 


Restrictions like this are always confusing for the end user (i.e. why
can I use built-ins here and not my own ?). I leave to the actual coders
to say anything about code maintenance concerns...
Yes, and some built-ins have gotten that classification wrong too in the past 
IIRC. Which probably is a good reason not to trust our users to get it right ;-)


greetings, Florian Pflug



Re: [HACKERS] Including Snapshot Info with Indexes

2007-10-09 Thread Florian G. Pflug

Andrew Dunstan wrote:

Florian G. Pflug wrote:


I think you're overly pessimistic here ;-) This classification can be done
quite efficiently as long as your language is static enough. The trick is
not to execute the function, but to scan the code to find all other
functions and SQL statements a given function may possibly call. If your
function calls no SQL statements, and only other functions already marked
IMMUTABLE, then it must be IMMUTABLE itself.

It does seem that only pl/pgsql is static enough for this to work, 
though, making this idea rather unappealing.




How would you propose to analyse C functions, for which you might not have
the C code?

Scanning the binary, together with symbol annotations for immutability of course
;-))

No, seriously. I do *not* advocate that we actually autoclassify functions, for
a lot of reasons. I just wanted to refute the statement that doing so is
generally impossible - it's not. It's trivial for some languages (in Haskell,
for example, all functions that don't use monads are immutable, and their
signatures tell whether they use monads or not), realistic for others (pl/pgsql,
where we do have the sourcecode), and utterly impossible for others
(pl/{ruby,python,perl,...}, pl/c, ...).

Besides - AFAICS *anything* that makes VACUUM depend on IMMUTABLE to be correct
would instantly break tsearch, no? At least as long as we allow changing
stopwords and the like of dictionaries used by an index - which we'd better
allow, unless we want the DBAs to come with pitchforks after us...

regards, Florian Pflug, who shudders when imagining DBAs with pitchforks...



Re: [HACKERS] Including Snapshot Info with Indexes

2007-10-08 Thread Florian G. Pflug

Gokulakannan Somasundaram wrote:

Hi Heikki, I am always slightly late in understanding things. Let me
try to understand the use of DSM. It is a bitmap index on whether all
the tuples in a particular block is visible to all the backends,
whether a particular block contains tuples which are invisible to
everyone. But i think this will get subjected to the same limitations
of Bitmap index. Even Oracle suggests the use of Bitmap index for
only data warehousing tables, where the Bitmap indexes will be
dropped and recreated after every bulk load. This is not a viable
alternative for OLTP  transactions. But i think i am late in the game
 as i haven't participated in those discussions

While the DSM might be similar in spirit to a bitmap index, the actual
implementation has a lot more freedom I'd say, since you can tailor it
exactly to the need of tracking some summarized visibility info. So not
all shortcomings of bitmap indices must necessarily apply to the DSM
also. But of course that's mostly handwaving...


One Bitmap index block usually maps to lot of blocks in the heap. So
locking of one page to update the DSM for update/delete/insert would
hit the concurrency. But again all these are my observation w.r.t
oracle bitmap indexes. May be i am missing something in DSM.

A simple DSM would probably contain a bit per page that says all xmin <
GlobalXmin, and all xmax unset or aborted. That bit would only get SET
during VACUUM, and only unset during INSERT/UPDATE/DELETE. If setting it
is protected by a VACUUM-grade lock on the page, we might get away with
no locking during the unset, making the locking overhead pretty small.
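To illustrate what I have in mind, a rough sketch (made-up names and a plain
byte array, not a proposal for the actual shared-memory layout):

#include <stdint.h>

typedef struct
{
    uint8_t    *bits;       /* one bit per heap page, zero-initialized */
} PageVisibilityMap;

/* Set by VACUUM only, while holding a VACUUM-grade lock on the heap page. */
static inline void
dsm_set_all_visible(PageVisibilityMap *map, uint32_t pageno)
{
    map->bits[pageno / 8] |= (uint8_t) (1 << (pageno % 8));
}

/* Unset by any INSERT/UPDATE/DELETE touching the page - this is the part
 * that the idea above hopes to do without further locking. */
static inline void
dsm_clear_all_visible(PageVisibilityMap *map, uint32_t pageno)
{
    map->bits[pageno / 8] &= (uint8_t) ~(1 << (pageno % 8));
}

static inline int
dsm_page_all_visible(const PageVisibilityMap *map, uint32_t pageno)
{
    return (map->bits[pageno / 8] >> (pageno % 8)) & 1;
}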


I couldn't get that piece of discussion in the archive, which 
discusses the design of Retail Vacuum. So please advise me again

here. Let's take up Retail Vacuuming again. The User defined function
 which would return different values at different time can be
classified as non-deterministic  functions. We can say that this
index cannot be created on a non-deterministic function. This is the
way it is implemented in Oracle. What they have done is they have
classified certain built-in operators and functions as deterministic.
Similarly they have classified a few as non-deterministic operators
and functions. Can we  follow a similar approach?

Postgres already distinguishes VOLATILE, STABLE and IMMUTABLE functions.
It doesn't, however, risk physical data corruption, even if you get that
classification wrong. The worst that happens AFAIK are wrong query
results - but fixing your function, followed by a REINDEX, always
corrects the problem. If you start poking holes into that safety net,
there'll be a lot of pushback I believe - and IMHO rightly so, because
people do, and always will, get such classifications wrong.

greetings, Florian Pflug



Re: [HACKERS] [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)

2007-09-26 Thread Florian G. Pflug

Kevin Grittner wrote:

I omitted the code I was originally considering to have it work against
files in place rather than as a filter.  It seemed much simpler this
way, we didn't actually have a use case for the additional functionality,
and it seemed safer as a filter.  Thoughts?


A special non-filter mode could save some IO and disk space by not actually 
writing all those zeros, but instead just seeking to SizeOfWal-1 after writing the 
last valid byte, and writing one more zero. Of course, if you're gonna
compress the WAL anyway, there is no point...
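For what it's worth, a minimal sketch of that non-filter variant, assuming a plain
POSIX file descriptor for the segment file (illustrative only):

#include <sys/types.h>
#include <unistd.h>

/* Pad an archived WAL segment to its full size by writing a single zero
 * byte at offset seg_size - 1 instead of writing all the trailing zeros.
 * On most filesystems this yields a sparse file. */
static int
pad_wal_segment(int fd, off_t seg_size)
{
    char        zero = 0;

    if (lseek(fd, seg_size - 1, SEEK_SET) == (off_t) -1)
        return -1;
    if (write(fd, &zero, 1) != 1)
        return -1;
    return 0;
}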

greetings, Florian Pflug




[HACKERS] GCC builtins for atomic-test-and-set, memory barriers, and such

2007-09-23 Thread Florian G. Pflug

Hi

When reading Tom's comment about the bug in my "use latestCompletedXid 
to slightly speed up TransactionIdIsInProgress" patch, I remembered that 
I recently stumbled across GCC builtins for atomic test-and-set and
read/write reordering barriers...

Has anyone looked into those? It seems that those could be used to 
provide a fallback spinlock implementation - though maybe we already 
cover all interesting targets, and it's not worth the effort.


Anyway, here is the link to the GCC docu. It says that the naming of 
these follows some Intel Spec, so presumably the Intel compiler supports 
the same builtins...

http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html
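For illustration, a hedged sketch of what a fallback spinlock on top of those
builtins could look like (not the actual s_lock.h code):

typedef volatile int slock_t;

static inline void
spin_acquire(slock_t *lock)
{
    /* __sync_lock_test_and_set acts as an acquire barrier per the GCC docs */
    while (__sync_lock_test_and_set(lock, 1))
    {
        /* spin on a plain read until the lock looks free, then retry */
        while (*lock)
            ;
    }
}

static inline void
spin_release(slock_t *lock)
{
    /* release barrier plus store of 0 */
    __sync_lock_release(lock);
}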

greetings, Florian Pflug



Re: [HACKERS] SPI access to PostgreSQL query plan

2007-09-17 Thread Florian G. Pflug

Cristiano Duarte wrote:

2007/9/17, Tom Lane [EMAIL PROTECTED]:

Cristiano Duarte [EMAIL PROTECTED] writes:
Is there a way to have access to PostgreSQL query plan and/or predicates 
inside a function using spi (or any other way)?

No.

Hi Tom,


No means: there is no way since the query plan is stored in a 
secret/safe/protected/non-visible/fort-knox like place :) or it means, there
is no friendly way to do it with spi or casual c language programming?


No as in: Your function is not told by the executor which filters are applied
to its results, and since it might be called multiple times within a query you
cannot figure that out yourself, even if you somehow got hold of the currently
executed plan.

So unless you start to hack the executor in serious ways, you'll either have to
pass the filter condition manually to your function, or live with it producing
unnecessary output rows.

That only holds true for functions in languages other than SQL (which is
*not* the same as pl/pgsql) - SQL functions can be inlined by the executor, and
then are subject to the usual optimizations. (So they essentially behave like
views).

greetings, Florian Pflug




[HACKERS] A small mistake in the initial latestCompletedXid idea

2007-09-12 Thread Florian G. Pflug

Hi

When I initially proposed to use the latest *committed* xid as the xmax instead
of ReadNewTransactionId(), I believed that this would cause tuples created by a
later aborted transaction not to be vacuumed until another transaction (with a
higher xid) commits later. The idea was therefore modified to store the latest
*completed* xid, instead of the latest committed one.

I just realized that my fear was unjustified. AFAICS, VACUUM will always remove
tuples created by aborted transactions, even if the xid is >= OldestXmin.

Therefore, I suggest that we rename latestCompletedXid to latestCommittedXid,
and update it only on commits. Admittedly, this won't bring any measurable
performance benefit in itself (it will slightly reduce the average snapshot
size, though), but not doing so might stand in the way of possible future
optimizations in that area.

I'll submit a patch to the patches list shortly.

greetings, Florian Pflug




Re: [HACKERS] A small mistake in the initial latestCompletedXid idea

2007-09-12 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Therefore, I suggest that we rename latestCompletedXid to latestCommittedXid,
and update it only on commits. Admittedly, this won't bring any measurable
performance benefit in itself (it will slightly reduce the average snapshot
size, though), but not doing so might stand in the way of possible future
optimizations in that area.


This is a bad idea.  As you say, it doesn't directly save anything,
and the downside is that it may result in RecentGlobalXmin not moving
forward.  Consider a situation where there's a long string of aborts and
nary a commit.  latestCommittedXid won't advance, therefore each new
transaction continues to compute xmin = xmax = latestCommittedXid+1,
and so the window between global xmin and the newest active XIDs gets
wider and wider.  That puts performance stress on pg_clog and
pg_subtrans buffers --- if it goes on long enough, we get into a context
swap storm caused by pg_subtrans buffer thrashing.  We need to be sure
that xmin/xmax move forward when XIDs exit the ProcArray, whether they
commit or not.


Hm.. Ok.. I see your point.

Maybe we should then make your initial argument against this hold, by
adding a check for xid > latestCompletedXid into TransactionIdIsInProgress.
That is cheap, might save some unnecessary proc array scanning, and justifies
advancing latestCommittedXid during subxact abort (where your argument
above doesn't hold, as you say yourself later on).


Your post made me think for awhile about whether we really need to
serialize aborts at all.  From a transactional correctness standpoint
I think maybe we don't, but again the difficulty is with xmin tracking.

Agreed - For transactional correctness, aborted and in-progress transactions
are similar enough that it doesn't matter much in which pot our snapshot
puts them. The cases where the difference matters recheck with
TransactionIdIsInProgress anyway and/or wait on the xid's lock.


If an aborting xact can remove its XID from ProcArray without locking,
then it is possible that two concurrent scans of ProcArray arrive at
different xmin values, which means that GetOldestXmin might deliver
an incorrectly large answer, and that's fatal.  (One of the possible
consequences is truncating pg_subtrans too soon, but I believe there
are other ones too.)

I'm not yet sure if that deviation in the xmin calculations poses any
real risk, or not. After all, even the transaction ending up with the
larger xmin won't actually see xids between the smaller and the larger
xid as in-progress. I haven't yet been able to come up with either a
counterexample, or an argument that this is indeed safe.


Subtransactions don't affect xmin, of course, so there may be an
argument here that we don't have to do this stuff for a subtransaction
abort.  But I remain unconvinced that optimizing subtransaction abort
will really buy a performance gain worth taking any risk for.

That depends largely on the workload, I would think. If there is any
benefit, I'd expect to see it for workloads involving deeply nested
BEGIN/END/EXCEPTION blocks. Especially because we currently roll back
each sub-transaction separately AFAICS, meaning we might take that
exclusive lock many times in short succession.

greetings, Florian Pflug




Re: [HACKERS] Final Thoughts for 8.3 on LWLocking and Scalability

2007-09-11 Thread Florian G. Pflug

Simon Riggs wrote:

On Tue, 2007-09-11 at 10:21 -0400, Tom Lane wrote:

Simon Riggs [EMAIL PROTECTED] writes:
1. The ProcArrayLock is acquired Exclusive-ly by only one 
remaining operation: XidCacheRemoveRunningXids(). Reducing things

 to that level is brilliant work, Florian and Tom.
It would be brilliant if it were true, but it isn't.  Better look 
again.


On the more detailed explanation, I say in normal operation.

My analytical notes attached to the original post show ProcArrayLock 
is acquired exclusively during backend start, exit and while making a
 prepared (twophase) commit. So yes, it is locked Exclusively in 
other places, but they happen rarely and they actually add/remove 
procs from the array, so its unlikely anything can change there 
anyhow.


Well, and during normal operation at COMMIT and ABORT, which might happen
rather frequently ;-)

I do agree, however, that XidCacheRemoveRunningXids() is the only site
left where getting rid of it might be possible, and might bring
measurable benefit for some workloads. With more effort, we might not
even need it during ABORT, but I doubt that the effort would be worth
it. While some (plpgsql intensive) workloads might abort subxacts rather
frequently, I doubt the same holds true for toplevel aborts.

I'm actually working on a patch to remove that lock from
XidCacheRemoveRunningXids(), but I'm not yet completely sure that my
approach is safe. Tom had some objections that I take rather seriously.
We'll see ;-)

greetings, Florian Pflug



Re: [HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.

2007-09-10 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Currently, we do not assume that either the childXids array, nor the xid
cache in the proc array are sorted by ascending xid order. I believe that
we could simplify the code, further reduce the locking requirements, and
enable a transaction to de-overflow its xid cache if we assume that those
arrays are in ascending xid order.


de-overflowing the cache sounds completely unsafe, as other backends need
that state to determine whether they need to look into pg_subtrans.


We'd only de-overflow if we abort *all* xids that are missing from the
xid cache. And only after marking them as aborted in the clog. If someone
concurrently checks for an overflow, and already sees the new (non-overflowed)
state, then he'll assume the xid is not running if he hasn't found it in
the array. Which is correct - we just aborted it.

Plus, removing the exclusive lock doesn't depend on de-overflowing. It's
just something that seems rather easy to do once the subxid handling is
in a state that allows concurrent removal of entries. If it turns out that
it's not that easy, then I'll just drop the idea again.

I still don't believe you can avoid taking exclusive lock, either; your 
argument here did not address latestCompletedXid.


Sorry, not addressing latestCompletedXid was an oversight :-(.
My point is that we only *need* to advance latestCompletedXid on COMMITS. We do
so for aborts only to avoid running with unnecessarily low xmins after
a transaction ABORT. That corner case can only happen after a toplevel
ABORT, though - aborting subxacts cannot change the xmin, because the
toplevel xact will have a lower xid than any of its subtransactions anyway.

We can therefore just remember the largest assigned xid for a given transaction,
and update latestCompletedXid to that on toplevel commit or abort. That
prevents that corner-case too, without updating latestCompletedXid during
subxact abort.


But the main point remains this: there is no evidence whatsoever that these
code paths are sufficiently performance-critical to be worth speeding up by
making the code more fragile.


The gain will be less than that of the locking improvements done so far.
It will be a win for heavy users of plpgsql BEGIN/END/EXCEPTION blocks,
though I think.

We'll also save some cycles in TransactionIdIsInProgress, because we can
use a binary search, but that's just an added bonus.
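Just to illustrate that bonus, a minimal sketch of such a lookup over a sorted
subxid cache (hypothetical helper, ignoring xid wraparound for brevity):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

static bool
subxid_cache_contains(const TransactionId *xids, int nxids, TransactionId xid)
{
    int         lo = 0;
    int         hi = nxids - 1;

    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if (xids[mid] == xid)
            return true;
        if (xids[mid] < xid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}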

I'm currently trying to code up a patch, since it's easier to judge the
correctness of actual code than that of a mere proposal. I'll do some
benchmarking when the patch is done to see if it brings measurable benefits.

greetings, Florian Pflug




[HACKERS] Maybe some more low-hanging fruit in the latestCompletedXid patch.

2007-09-09 Thread Florian G. Pflug

Hi

I've already posted this idea, but I feel that I did explain it
rather badly. So here comes a new try.

Currently, we do not assume that either the childXids array, nor
the xid cache in the proc array are sorted by ascending xid order.
I believe that we could simplify the code, further reduce the locking
requirements, and enable a transaction to de-overflow its xid cache
if we assume that those arrays are in ascending xid order.

** Sortedness **
Presumably, the existing code *already* guarantees that both the childXids
and the xid cache *are* in fact sorted by ascending xid order - due to
the following reasons:

A transaction (including its subtransactions) forms a tree. Each xact (or node)
may have any number of children - assume that they are added left to right
in the order of their creation. For every (sub)xact, all subtrees but the
rightmost one are closed (meaning that no new nodes will ever be created inside
them). That last assertion follows from the fact that each non-rightmost subtree
is either aborted, or subcommitted.

Since adding a new sibling to a (sub) xact is only possible if all the other 
subtrees are subcommitted or aborted, it will surely get a higher xid than any 
node in any sibling subtree. Since the xids in the subcommitted siblings were 
already added to the original (sub) xact upon subcommit, the childXids of that
xact will have the following structure:
[xids of 1st subcommitted child] [xids of 2nd subcommitted child] ...
Each of the sublists is guaranteed to include only xids larger than those
inside their predecessors - which leads to a completely sorted array as
a whole.

The proc-array cache is surely sorted by ascending xid order, because we
add entries while we generate them (until we overflow).

** Benefits **
Since we know that both the childXids and the xid cache array are sorted,
removing entries from the cache becomes much easier. Since we can only abort
a rightmost subtree (all others are either already aborted, or subcommitted),
we know that the to-be-aborted xids are the N largest xids in our whole
transaction. To remove them from the proc array we overwrite their xids
with InvalidTransactionId from right-to-left, and afterwards set the new
array length.

This allows us to replace the ExclusiveLock with a ShareLock. Since we
overwrite the xids right-to-left, a snapshot taken concurrently will miss
some of the right-most xids. This is exactly the same situation as if
we had aborted those xids one-by-one, instead of all in one go.

(Xmin calculations are completely unaffected - the toplevel xid is *not*
removed this way, and that one is surely smaller than any subxact xid)
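A rough sketch of that right-to-left removal, assuming the subxid cache is kept
sorted ascending (the names are illustrative, not the actual PGPROC layout):

#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Abort the nremove largest (rightmost) subxids: overwrite from the right,
 * so a concurrent snapshot reading the array can only miss xids that are
 * being aborted anyway, then shrink the count. */
static void
subxid_cache_remove_rightmost(volatile TransactionId *xids,
                              volatile int *nxids,
                              int nremove)
{
    int         newcount = *nxids - nremove;
    int         i;

    for (i = *nxids - 1; i >= newcount; i--)
        xids[i] = InvalidTransactionId;

    *nxids = newcount;
}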

The rather easy xid-cache remove algorithms will also detect if a previously
overflowed cache will be de-overflowed by the removal. (This needs a little
bit more checking - de-overflowing the xid cache is a bit tricky because
we must not race with TransactionIdIsInProgress. But that seems doable,
it'll just need the right ordering of things - maybe updating the CLOG
*before* updating the proc array is already sufficient even).

If there is interest in doing this, I can come up with a patch.

greetings, Florian Pflug





Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-07 Thread Florian G. Pflug

Simon Riggs wrote:

On Fri, 2007-09-07 at 06:36 +0200, Florian G. Pflug wrote:

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:
- I actually think with just a little bit of more work, we

can go even further, and get rid of the ReadNewTransactionId() call
completely during snapshotting.

[ squint... ]  This goes a bit far for me.  In particular, I think this
will fail in the edge case when there are no live XIDs visible in
ProcArray.  You cannot go back and do ReadNewTransactionId afterward,
at least not without re-scanning the ProcArray a second time, which
makes it at best a questionable win.

Why would it?


I think the additional suggestion goes a bit too far. You may be right,
but I don't want to change the transaction system in advanced ways this
close to the next release. We may have difficulty spotting bugs in that
thinking during beta.


Ok, those were two clear votes against doing this, so I'll stop
arguing ;-). I do think that we should have another look at this when
8.4 opens, though.

greetings, Florian Pflug




Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-07 Thread Florian G. Pflug

Tom Lane wrote:

I've spent the past hour or so trying to consolidate the comments in
GetSnapshotData and related places into a single chunk of text to be
added to src/backend/access/transam/README.  Attached is what I have so
far --- this incorporates the idea of not taking ProcArrayLock to exit
an XID-less transaction, but not yet Florian's idea.  I think it'd get
simpler if we changed to that, but am posting this for comments.



Interlocking transaction begin, transaction end, and snapshots
--

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot.  Unfortunately, we must have some interlocking for this, because
it is critical that all backends agree on the commit order of transactions.

This is actually still a slightly stronger requirement than what we really
need I think.

To simplify the following discussion, A -> B shall mean that transaction A saw
B as committed. Conversely, A ! B shall mean that A treats B as in-progress.
If A was in read-committed mode, the visibility refers to the latest snapshot
that A used.

Now assume A and B commit at nearly the same time, and for two other
transactions C and D the following holds:
C -> A, C ! B but
D ! A, D -> B.

This would violate the requirement that the commit order is globally
agreed upon, yet as long as both A ! B and B ! A hold, there is no
conflict. (Note that if A and B are serializable, A ! B and B ! A
implies that A and B cannot have touched the same record and have
both committed - one would have been aborted due to a
SerializationError).

I must admit, though, that this is a quite academic case, since the
prerequisite A ! B and B ! A is something we have no control over
for read-committed transactions - who knows when they might have taken
their last snapshot...

Still, I wanted to mention this because I believe that the minimal
requirement that we actually *need* to enforce is
A -> B and B -> C imply A -> C.  (T1)

The actual implementation will probably always have to enforce
something slightly stronger, but it's still nice to know the
minimal guarantee needed to be able to judge correctness.


For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot.  Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A. 

In my notation this becomes: A -> B and C ! B implies C ! A.

This then follows from (T1) - Assume that A -> B, C ! B but C -> A;
then with (T1), C -> B follows from C -> A and A -> B, which contradicts
C ! B.


We
enforce this by not allowing any transaction to exit the set of running
transactions while a snapshot is being taken.  (This rule is probably
stronger than necessary, but see the next problem.)  The implementation
of this is that GetSnapshotData takes the ProcArrayLock in shared mode
(thereby allowing multiple backends to take snapshots in parallel), but
xact.c must take the ProcArrayLock in exclusive mode while clearing
MyProc->xid at transaction end (either commit or abort).

Agreed. We *do* enforce a strict global ordering of commits and snapshots.
This then guarantees (T1) because if A -> B and B -> C, then A *must*
have taken its snapshot after B committed, and B in turn *must* have
taken its snapshot after C committed, so surely A -> C will hold too.


Here is another variant of the risk scenario:

1. Xact A is running (in Read Committed mode).
2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is
   swapped out before it can acquire ProcArrayLock.
3. Xact B gets new XID (>= C's xmax), makes changes and commits.
4. Xact A changes some row R changed by xact B and commits.
5. Xact C finishes getting its snapshot data.  It sees xact A as done,
   but sees xact B as still running (since B >= xmax).

Now C will see R changed by xact B and then xact A, *but* does not see
other changes made by xact B.  If C is supposed to be in Serializable mode,
this is wrong.

I never really grasped why we need to assume serializable here - this seems
wrong if C is read-committed too. Seeing only half of a transaction's changes
can never be right, can it?


To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock
before it calls ReadNewTransactionId.  This prevents xact A from exiting
the set of running transactions seen by xact C.  Therefore both A and B
will be seen as still running => no inconsistency.

Another point of view is that determining the xmax of a snapshot really *is*
part of taking the snapshot. Since we already established that we need to
serialize snapshotting and committing, it follows that we must not allow
commits to happen between the computation of a snapshot's xmax and the scan
of the ProcArray.

Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-07 Thread Florian G. Pflug

Tom Lane wrote:
Here's some revised text for the README file, based on using Florian's idea 
of a global latestCompletedXid variable.  As I worked through it I realized 
that in this design, XidGenLock gates entry of new XIDs into the ProcArray 
while ProcArrayLock gates their removal.  Which is an interesting sort of 
symmetry property.  It also turns out that the reason we need to gate entry 
with XidGenLock is to keep from breaking GetOldestXmin, rather than to ensure

correctness of snapshots per se.

I believe it would break both, no? If an xid <= latestCompletedXid is
not included in the snapshot, but later used for updates, the snapshot
will see those changes as committed when they really are not.

But other than that, it really sounds fine. It certainly explains things much
better than the comments in the existing code.

I noticed two rather cosmetic issues
.) latestCompletedXid sounds as if it might refer to the *last* completed xid,
but it actually refers to the largest / highest completed xid. So maybe we
should call it highestCompletedXid or largestCompletedXid.

.) Since you mention that we assume reading and writing int4s are atomic
operations, maybe we should mention that for safety's sake we mark the
corresponding pointers with volatile?
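Just to illustrate the second point, something along these lines (the struct is a
stand-in, not the real PGPROC definition):

#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct PGPROC
{
    TransactionId xid;
    TransactionId xmin;
} PGPROC;

/* Read another backend's xmin through a volatile-qualified pointer, so the
 * compiler cannot cache or reorder the 4-byte load (which we assume the
 * hardware performs atomically). */
static TransactionId
read_proc_xmin(PGPROC *proc)
{
    volatile PGPROC *vproc = proc;

    return vproc->xmin;
}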

greetings, Florian Pflug



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-06 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

So I believe you're right, and we can skip taking the lock in the no
xid case - I actually think with just a little bit of more work, we
can go even further, and get rid of the ReadNewTransactionId() call
completely during snapshotting.


[ squint... ]  This goes a bit far for me.  In particular, I think this
will fail in the edge case when there are no live XIDs visible in
ProcArray.  You cannot go back and do ReadNewTransactionId afterward,
at least not without re-scanning the ProcArray a second time, which
makes it at best a questionable win.


Why would it? The idea was to remember the largest committed xid, and that
won't go away just because the proc array is rather empty xid-wise. Actually,
in that case the largest committed xid+1 will (nearly) be what
ReadNewTransactionId() returns. (Nearly because the transaction with the xid
ReadNewTransactionId()-1 might have aborted, so largestCommittedXid might be
a bit further behind ReadNewTransactionId().)

(That slight lagging of largestCommittedXid might cause some tuples not to
be VACUUMED though, so we might want to update largestCommittedXid for
ABORTS too, and probably rename it to largestNonRunningXid or whatever ;-) ).

I would go as far as saying that largestCommittedXid+1 is the natural choice
for xmax - after all, xmax is the cutoff point after which an xid *cannot*
be seen as committed, and largestCommittedXid+1 is the smallest xmax that
guarantees that we see xacts committed before the snapshot as committed.

The xmin computation won't change - apart from using some other initial value.

This would rid us of the rather complicated entanglement of XidGenLock and
the ProcArrayLock, lessen the lock contention, and reduce the average snapshot
size a bit.

greetings, Florian Pflug



Re: [HACKERS] Low hanging fruit in lazy-XID-assignment patch?

2007-09-06 Thread Florian G. Pflug

Tom Lane wrote:

Simon was complaining a bit ago that we still have problems with
excessive contention for the ProcArrayLock, and that much of this stems
from the need for transaction exit to take that lock exclusively.
The lazy-XID patch, as committed, doesn't help that situation at all,
saying

/*
 * Lock ProcArrayLock because that's what GetSnapshotData uses.
 * You might assume that we can skip this step if we had no
 * transaction id assigned, because the failure case outlined
 * in GetSnapshotData cannot happen in that case. This is true,
 * but we *still* need the lock guarantee that two concurrent
 * computations of the *oldest* xmin will get the same result.
 */


I think the comment is correct in principle - If we remove the oldest
xmin without locking, then two concurrent OldestXmin calculations
will get two different results. The question is if that has any
negative effects, though.


That leaves xmin, which AFAICS is
only interesting for the computations of GetOldestXmin() and
RecentGlobalXmin.  And I assert it doesn't matter if those numbers
advance asynchronously, so long as they never go backward.

Yes, the xmin is surely the only field that might need the locking.

It was this comment in GetSnapshotData that made me keep the locking
in the first place:

 * It is sufficient to get shared lock on ProcArrayLock, even if we are
 * computing a serializable snapshot and therefore will be setting
 * MyProc->xmin. This is because any two backends that have overlapping
 * shared holds on ProcArrayLock will certainly compute the same xmin
 * (since no xact, in particular not the oldest, can exit the set of
 * running transactions while we hold ProcArrayLock --- see further
 * discussion just below). So it doesn't matter whether another backend
 * concurrently doing GetSnapshotData or GetOldestXmin sees our xmin as
 * set or not; he'd compute the same xmin for himself either way.
 * (We are assuming here that xmin can be set and read atomically,
 * just like xid.)

But now that I read this again, I think that comment is just misleading -
especially the part "So it doesn't matter whether another backend concurrently
doing GetSnapshotData or GetOldestXmin sees our xmin as set or not; he'd compute
the same xmin for himself either way."
This sounds as if the Proc->xmin that *one* backend announces had
influence over the Proc->xmin that *another* backend might compute.
Which isn't true - it only influences the GlobalXmin that another backend might
compute.

So I believe you're right, and we can skip taking the lock in the no xid case -
I actually think with just a little bit of more work, we can go even further,
and get rid of the ReadNewTransactionId() call completely during snapshotting.

There are two things we must ensure when it comes to snapshots, commits and
xid assignment.

1) A transaction must either be not in progress, be in our snapshot, or
   have an xid >= xmax.
2) If transaction A sees B as committed, and B sees C as committed, then
   A must see C as committed.

ad 1): We guarantee that by storing the xid in the proc array before releasing
   the XidGenLock. Therefore, when we later obtain our xmax value,
   we can be sure that we see all xacts in the proc array that have an
   xid < xmax and are in progress.

ad 2): We guarantee that by serializing snapshotting against committing. Since
   we use ReadNewTransactionId() as the snapshot's xmax this implies that
   we take the ProcArrayLock *before* reading our xmax value.

Now, ReadNewTransactionId() is actually larger than necessary as an xmax.
The minimal xmax that we can set is the largest committed xid+1. We can
easily track that value during commit when we hold the ProcArrayLock
(if we have no xid, and therefore don't have to hold the lock, we also
don't need to update that value).

If we used this LatestCommittedXid as xmax, we'd still guarantee (2),
but without having to hold the XidGenLock during GetSnapshotData().
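In code, the idea would amount to something like the following sketch
(shared-memory layout and names are made up; xid wraparound handling is omitted):

#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct SharedXidState
{
    TransactionId latestCommittedXid;   /* protected by ProcArrayLock */
} SharedXidState;

/* Called while holding ProcArrayLock exclusively, as part of clearing our
 * entry at COMMIT: remember the largest xid that has ever committed. */
static void
record_committed_xid(SharedXidState *shared, TransactionId xid)
{
    if (xid > shared->latestCommittedXid)
        shared->latestCommittedXid = xid;
}

/* Called under shared ProcArrayLock in GetSnapshotData: the smallest xmax
 * that still sees every previously committed xact as committed. */
static TransactionId
snapshot_xmax(const SharedXidState *shared)
{
    return shared->latestCommittedXid + 1;
}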

I wouldn't have dared to suggest this for 8.3, but since you came up with
locking improvements in the first place... ;-)

greetings, Florian Pflug




Re: [HACKERS] loose ends in lazy-XID-assigment patch

2007-09-05 Thread Florian G. Pflug

Tom Lane wrote:

I've committed Florian's patch, but there remain a couple of things
that need work:

* Should CSV-mode logging include the virtual transaction ID (VXID) in
addition to, or instead of, XID?  There will be many situations where
there is no XID.

Maybe make %x show both, or only the xid if that is set, and the vxid
otherwise? That would probably be what most existing users of %x want.
For those who want them separated, we'd have %v (vxid), and maybe
%X (xid only). Seems a bit like overkill, though...



* As things stand, when a two-phase transaction is prepared, it drops
its lock on the original VXID; this seems necessary for reasons
previously discussed.  I made the code put an invalid VXID into the
gxact structure for the prepared xact, which means that pg_locks shows
things like this:

regression=# select * from pg_locks;
   locktype    | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |      mode       | granted
---------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+-----------------+---------
 transactionid |          |          |      |       |            |         21774 |         |       |          | -1/0               |       | ExclusiveLock   | t
 relation      |   126093 |   126124 |      |       |            |               |         |       |          | -1/0               |       | AccessShareLock | t
 relation      |   126093 |    10969 |      |       |            |               |         |       |          | 1/260              | 20592 | AccessShareLock | t
 virtualxid    |          |          |      |       | 1/260      |               |         |       |          | 1/260              | 20592 | ExclusiveLock   | t
(4 rows)

This seems fairly undesirable :-( not least because you can't tell one
prepared xact from another and thus can't see which locks belong to
each.  But I'm unsure what to do about it.  We could have the prepared
xact continue to display the original VXID, but there would be no
certainty about the VXID remaining unique, which seems bad.  Another
possibility is to put back the transaction ID column, but since that's
not unique for read-only transactions, we still don't have anything
usable as a join key.

The best idea I can think of is to make the virtualtransaction column
read out the VXID for regular transactions and the transaction ID for
prepared transactions, or maybe the transaction ID for any transaction
that has one and VXID just for read-only xacts.  We can get away with
that because the column is only text and not any better-defined
datatype.  It seems mighty ugly though; and changing the ID shown for
a transaction mid-stream isn't very pleasant either.


We could make the VXID in the gxact struct be
backendId=InvalidBackendId, lxid=xid. That'd be still an invalid vxid, but not
the same for every prepared transaction.

If we take this further, we could get rid of the lock on the xid completely,
I believe. We'd define some PermanentBackendId (let's say, -2, since -1 is
taken). When preparing the xact, we'd drop the lock on the old VXID, and
instead acquire one on (PermanentBackendId, xid). Waiting on an xid would
become a bit tricky, but doable I think. We'd have to first check the procarray
- if we find the xid there, we translate it to a vxid, and wait on that.
Afterwards (whether we found a vxid, or not) we wait on (PermanentBackendId, xid).
That doesn't exactly make XactLockTableWait cheaper, but that might be OK,
since we

I haven't really thought this through, though. I think that with careful
ordering of things we can control the race conditions this might pose -
but I'm not sure at this point.

greetings, Florian Pflug




Re: [HACKERS] Per-function GUC settings: trickier than it looked

2007-09-03 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Tom Lane wrote:

So it seems that only SET LOCAL within a function with per-function
GUC settings is at issue.  I think that there is a pretty strong
use-case for saying that if you have a per-function setting of a
particular variable foo, then any SET LOCAL foo within the function
ought to vanish at function end --- for instance a function could want
to try a few different search_path settings and automatically revert to
the caller's setting on exit.

Agreed.

 The question is what about SET LOCAL

on a variable that *hasn't* been explicitly SET by the function
definition.  Either approach we take with it could be surprising,
but probably having it revert at function end is more surprising...


At least for me, the least surprising behaviour would be to
revert it too. Then the rule becomes "a function is always
executed in a pseudo-subtransaction that affects only GUCs".

Since at least for pl/pgsql, a function body *already* is a
BEGIN/END block - and therefore syntactically even looks
like a subtransaction - this seems quite logical.

And it would mean that the semantics of SET LOCAL won't change,
just because you add an EXCEPTION clause to the function's toplevel
BEGIN/END block.

greetings, Florian Pflug




Re: [HACKERS] Per-function GUC settings: trickier than it looked

2007-09-03 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

At least for me, the least surprising behaviour would be to
revert it too. Then the rule becomes a function is always
executed in a pseudo-subtransaction that affects only GUCs


Only if it has at least one SET clause.  The overhead is too high
to insist on this for every function call.


In that case, I agree that only variables specified in a SET-clause
should be reverted. Otherwise, adding or removing
SET-clauses (e.g., because you chose a different implementation
of a function that suddenly doesn't need regexps anymore) will
cause quite arbitrary behavior changes.

And the rule becomes (I tend to forget things, so I like simple
rules that I can remember ;-) ) For each SET-clause, there is
a pseudo-subtransaction affecting only *this* GUC.

greetings, Florian Pflug





Re: [HACKERS] Per-function GUC settings: trickier than it looked

2007-09-03 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

And the rule becomes (I tend to forget things, so I like simple
rules that I can remember ;-) ) For each SET-clause, there is
a pseudo-subtransaction affecting only *this* GUC.


The other question is whether we want to change the behavior of SET
LOCAL even in the absence of function SET-clauses.  The current rule
is that a LOCAL setting goes away at subtransaction commit, leading
to this behavior:

regression=# show regex_flavor;
 regex_flavor 
--------------
 advanced
(1 row)

regression=# begin;
BEGIN
regression=# savepoint x;
SAVEPOINT
regression=# set local regex_flavor to basic;
SET
regression=# release x;
RELEASE
regression=# show regex_flavor;
 regex_flavor 
--------------
 advanced
(1 row)

which makes some sense if you think of release as subtransaction
end, but not a lot if you think of it as forgetting a savepoint.
Likewise, SET LOCAL within a plpgsql exception block goes away at
successful block exit, which is not the first thing you'd expect.
Neither of these behaviors are documented anywhere AFAIR; certainly
the SET reference page doesn't explain 'em.

I think we should probably take this opportunity to fix that, and
make SET LOCAL mean persists until end of current top-level
transaction, unless rolled back earlier or within a function SET
clause.

So:

* Plain SET takes effect immediately and persists unless rolled back
or overridden by another explicit SET.  In particular the value will
escape out of a function that has a SET-clause for the same variable.

* SET LOCAL takes effect immediately and persists until rolled back,
overridden by another SET, or we exit a function that has a SET-clause
for the same variable.

* Rollback of a transaction or subtransaction cancels any SET or SET
LOCAL within it.  Otherwise, the latest un-rolled-back SET or SET LOCAL
determines the active value within a transaction, and the latest
un-rolled-back SET determines the value that will prevail after the
transaction commits.

* A function SET clause saves the entry-time value, and restores it at
function exit, except when overridden by an un-rolled-back SET (but not
SET LOCAL) within the function.

Clear to everyone?  Any objections?

That makes SET LOCAL completely equivalent to SET, except
when used inside a function that has a corresponding SET-clause, right?
So I think *if* this is done, SET LOCAL should be renamed to
SET FUNCTION. This would also prevent confusion, because everyone
who currently uses SET LOCAL will have to change his code anyway,
since the semantics change for every use-case apart from functions
with SET-clauses, which don't exist in 8.2.
Or am I overlooking something?

And renaming SET LOCAL also emphasizes the point that we are taking
away functionality here - even if that functionality might not seem
very useful.

BTW, I *did* check the documentation before responding to Simon's original
mail, and I *did* read it as "SET LOCAL goes away at subtransaction end".
I figured that since there is no word on subtransactions in that part
of the documentation, "transaction" will apply equally to both toplevel
and subtransactions. It might very well be that I'm the only one who
read it that way, though ;-) And I must admit that I wasn't completely
sure, so I *did* try it out before I posted...

I'd strongly prefer SET LOCAL to keep its current semantics, only that SET
LOCAL changes will now be rolled back if the function has a matching 
SET-clause. For multiple reasons:

  .) It's useful to be able to temporarily change GUCs from a client, and
 being able to reset them afterwards. Using a subtransaction for this
 is maybe a bit wasteful, but at least it works.
  .) In pl/pgsql, the fact that SET LOCAL goes away after the current
 BEGIN/END block seems entirely logical.
  .) It doesn't take away existing functionality

greetings, Florian Pflug



Re: [HACKERS] Per-function GUC settings: trickier than it looked

2007-09-03 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

Tom Lane wrote:

Clear to everyone?  Any objections?



That makes SET LOCAL completely equivalent to SET, except
when used inside a function that has a corresponding SET-clause, right?


Maybe it wasn't clear :-(.  They aren't equivalent because in the
absence of rollback, SET's effects persist past main-transaction end;
SET LOCAL's don't.  That's the way they were defined originally
(pre-subtransactions) and it still seems to make sense.

Ah, OK - things make much more sense now.


So I think *if* this is done, SET LOCAL should be renamed to
SET FUNCTION. This would also prevent confusion, because everyone
who currently uses SET LOCAL will have to change his code anyway,
since the semantics change for every use-case apart from functions
with SET-clauses, which don't exist in 8.2.


I'm not sure how many people have really written code that depends on
the behavior of SET LOCAL rolling back at successful subtransaction end.
I think we'd have heard about it if very many people had noticed,
because it's not what the manual says.

For the one use we've actually advocated (setting a temporary value
within a function and then reverting to the old setting before exit),
there isn't any visible change in behavior, since abandonment of the
restored value at subtransaction end still ends up with the same result.


And renaming SET LOCAL also emphasized that point that we are taking
away functionality here - even if that functionality might not seem
very useful.


We can't break the officially advocated solution for secure search_path.
However, that particular coding pattern will still work with the change
I'm proposing.  It's only where you *don't* manually restore the prior
value that you might notice a difference.


BTW, I *did* check the documentation before responding to Simon's original
mail, and I *did* read it as SET LOCAL goes away a subtransaction end.
I figured that since there is no word on subtransactions in that part
of the documentation, transaction will apply equally to both toplevel
and subtransaction.


Yeah, but you know that it's subtransactions under the hood, whereas
someone who's thinking in terms of SAVEPOINT/RELEASE and BEGIN/EXCEPTION
probably hasn't a clue about that.

I plead guilty here ;-). That whole SAVEPOINT/RELEASE thing always seemed
strange to me - I just accepted it at some point, but still translated it
into something hierarchical I guess




   .) In pl/pgsql, that fact that SET LOCAL goes away after the current
  BEGIN/END block seems entirely logical.


I don't think so ... your other side-effects such as table updates don't
disappear, so why should SET's

I guess because LOCAL to me implies some lexical locality - like the
surrounding BEGIN/END block.


I'm not necessarily averse to inventing a third version of SET, but I
don't see a well-thought-out proposal here.  In particular, we should be
making an effort to *not* expose the concept of subtransaction at the
SQL level at all, because that's not what the spec has.

Thanks for your explanation - I can see your point now, after realizing
why the spec has SAVEPOINT/RELEASE and *not* nested BEGIN/COMMIT blocks.

So, at least on the SQL-level, I guess I agree - your new semantics fit
better with the sql spec, even if they seemed quite strange to me at
first sight. Though maybe we should add  SET TRANSACTION as a synonym for
SET LOCAL? - the former seems to convey your new semantics much better than
the later.

It still seems a bit strange that SET LOCAL is undone at function-exit,
if the function has a matching SET-clause. But we need that for backwards-
compatibility of the secure-search_path workaround, right? Maybe we could
make SET TRANSACTION different from SET LOCAL in pl/pgsql, and warn
if SET LOCAL is used? That would enable us to either get rid of SET LOCAL
in the long term, or to really make it local to the surrounding BEGIN/END
block.

So, to reiterate, my idea is
.) Make SET TRANSACTION a synonym for SET LOCAL at the SQL-Level.
.) In pl/pgsql, SET TRANSACTION sets a new value that is kept after the
   function exits, even if the function has a matching SET-clause.
.) SET LOCAL in pl/pgsql sets a new value that is kept if the function
   has no matching SET-clause. If it has one, the value is restored.
   In any case, we emit a warning that SET LOCAL is going away.
.) One day, make SET LOCAL in pl/pgsql mean local to the surrounding
   BEGIN/END block. Independent of any SET-clauses the function
   might or might not have.

The last idea might seem to create a inconsistency between the SQL-level
and pl/pgsql, but I think it does not. SET LOCAL is local to the
surrounding BEGIN/{END|COMMIT} block in both cases - it's just that
you have nested such blocks in pl/pgsql, but not in plain SQL.

greetings, Florian Pflug



Re: [HACKERS] Per-function GUC settings: trickier than it looked

2007-09-03 Thread Florian G. Pflug

Tom Lane wrote:

So, to reiterate, my idea is
.) Make SET TRANSACTION a synonym for SET LOCAL at the SQL-Level.
.) In pl/pgsql, SET TRANSACTION sets a new value that is kept after the
function exits, even if the function has a matching SET-clause.
.) SET LOCAL in pl/pgsql set a new value that is kept if the function
has no matching SET-clause. If it has one, the value is restored.
In any case, we emit a warning that SET LOCAL is going away.
.) One day, make SET LOCAL in pl/pgsql mean local to the surrounding
BEGIN/END block. Independent of any SET-clauses the function
might or might not have.


I don't think it's a good idea to change SET LOCAL now and plan on
changing it again later ;-).  If we really want BEGIN-block-local
SET capability, I'd prefer to think of some new keyword for that.
But I'm not convinced it's interesting --- given the proposed behavior
of function SET-clauses, attaching a SET to your function seems like
it'll cover the need for restoring outer values.


Hm... could we still have SET TRANSACTION as a synonym for SET LOCAL?
That would blend nicely with SET TRANSACTION ISOLATION LEVEL and
SET TRANSACTION READ ONLY.

[ thinking... ] Hey, wait a moment. Regarding SET TRANSACTION READ ONLY -
This is not strictly speaking a GUC, but still, if we pretend that
there are no subtransactions, that command should propagate to the
outermost transaction on RELEASE too, shouldn't it?

This is what happens currently (CVS HEAD with at least your initial
function-SET-clause patch already in)
regression=# begin ;
BEGIN
regression=# savepoint s1 ;
SAVEPOINT
regression=# set transaction read only ;
SET
regression=# release s1 ;
RELEASE
regression=# create table test (id int) ;
CREATE TABLE
regression=# commit ;
COMMIT

compared to:
regression=# begin ;
BEGIN
regression=# set transaction read only ;
SET
regression=# create table test (id int) ;
ERROR:  transaction is read-only

I believe that, for consistency's sake, the set transaction read only should
have propagated to the outermost transaction on release s1.

greetings, Florian Pflug




Re: [HACKERS] [PATCH] Lazy xid assingment V2

2007-09-01 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Tom Lane wrote:

I had an idea this morning that might be useful: back off the strength
of what we try to guarantee.  Specifically, does it matter if we leak a
file on crash, as long as it isn't occupying a lot of disk space?
(I suppose if you had enough crashes to accumulate many thousands of
leaked files, the directory entries would start to be a performance drag,
but if your DB crashes that much you have other problems.)  This leads
to the idea that we don't really need to protect the open(O_CREAT) per
se.  Rather, we can emit a WAL entry *after* successful creation of a
file, while it's still empty.  This eliminates all the issues about
logging an action that might fail.  The WAL entry would need to include
the relfilenode and the creating XID.  Crash recovery would track these
until it saw the commit or abort or prepare record for the XID, and if
it didn't find any, would remove the file.


That idea, like all other approaches based on tracking WAL records, fails
if there's a checkpoint after the WAL record (and that's quite likely to
happen if the file is large). WAL replay wouldn't see the file creation
WAL entry, and wouldn't know to track the xid. We'd need a way to carry
the information over checkpoints.


Yes, checkpoints would need to include a list of created-but-yet-uncommitted
files. I think the hardest part is figuring out a way to get that information
to the backend doing the checkpoint - my idea was to track them in shared
memory, but that would impose a hard limit on the number of concurrent
file creations. Not nice :-(

But wait... I just had an idea.
We already have such a central list of created-but-uncommitted
files - pg_class itself. There is a small window between file creation
and inserting the name into pg_class - but as Tom says, if we leak it then,
it won't use up much space anyway.

So maybe we should just scan pg_class on VACUUM, and obtain a list of files
that are referenced only from DEAD tuples. Those files we can then safely
delete, no?
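
Just to illustrate the idea of cross-checking files against pg_class (a
rough sketch only, run as superuser on a reasonably modern server - it is
not the proposed VACUUM-time check itself), one could list the files in
the database directory that no visible pg_class row references:

SELECT f.filename
FROM pg_ls_dir('base/' || (SELECT oid FROM pg_database
                           WHERE datname = current_database())) AS f(filename)
WHERE f.filename ~ '^[0-9]+$'   -- main segments only; skip ".1", "_fsm", ...
  AND f.filename::oid NOT IN (SELECT relfilenode FROM pg_class);

Any hits are only candidates, though: the query also reports relations
whose creating transaction is still in progress, and on current releases
the mapped catalogs (which store relfilenode = 0 in pg_class). That is
exactly why the real check would belong in VACUUM, where dead and
in-progress pg_class tuples can be taken into account.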

If we *do* want a strict no-leakage guarantee, then we'd have to update pg_class
before creating the file, and flush the WAL. If we take Alvaro's idea of storing
temporary relations in a separate directory, we could skip the flush for those,
because we can just clean out that directory after recovery. Having to flush
the WAL when creating non-temporary relations doesn't sound too bad - those
operations won't occur very often, I'd say.

greetings, Florian Pflug




Re: [HACKERS] [PATCH] Lazy xid assingment V2

2007-09-01 Thread Florian G. Pflug

August Zajonc wrote:
Yes, checkpoints would need to include a list of created-but-yet-uncommitted
files. I think the hardest part is figuring out a way to get that information
to the backend doing the checkpoint - my idea was to track them in shared
memory, but that would impose a hard limit on the number of concurrent
file creations. Not nice :-(

I'm confused about this.

As long as we assert the rule that the file name can't change on the
move, then after commit the file can be in only one of two places. The
name of the file is known (i.e., pg_class). The directories are known.
What needs to be carried forward past a checkpoint? We don't even look
at WAL, so checkpoints seem irrelevant.


If there is a crash just after commit and before the move, no harm. You
just move on startup. If the move fails, no harm, you can emit a warning
and open in /pending (or simply error, even easier).

If you're going to open the file from /pending, what's the point of moving
it in the first place?

The idea would have to be that you move on commit (or on COMMIT-record
replay, in case of a crash), and then, after recovering the whole WAL,
you could remove leftover files in /pending.

The main problem is that you have to do the move *after* flushing the COMMIT
record to disk - otherwise you're gonna leak the file if you crash between
moving and flushing.

But that implies that the transaction is *already* committed when you do
the move. Others won't know that yet (you do the move *after* flushing,
but *before* updating the CLOG) - but still, since the COMMIT-record is
on disk, you cannot roll back anymore (since if you crash and replay the
COMMIT record, the transaction *will* be committed).

So, what are you going to do if the move fails? You cannot roll back, and
you cannot update the CLOG (because then others would see your new table,
but no datafile). The only option is to PANIC. This will lead to a server
restart, WAL recovery, and probably another PANIC once the COMMIT-record
is replayed (since the move probably still won't be possible).

It might be even worse - I'm not sure that a rename is an atomic operation
on most filesystems. If it's not, then you might end up with two files if
power fails *just* as you rename, or, worse, with no file at all. Even a slight
possibility of the second case seems unacceptable - it means losing
a committed transaction.

I agree that we should eventually find a way to guarantee either no file
leakage, or at least an upper bound on the amount of wasted space. But
doing so at the cost of PANICing if the move fails seems like a bad
tradeoff...

greetings, Florian Pflug



Re: [HACKERS] [PATCH] Lazy xid assingment V2

2007-09-01 Thread Florian G. Pflug

Tom Lane wrote:

Florian G. Pflug [EMAIL PROTECTED] writes:

It might be even worse - I'm not sure that a rename is an atomic operation
on most filesystems.


rename(2) is specified to be atomic by POSIX, but relinking a file into
a different directory can hardly be --- it's not even provided as a
single kernel call, is it?

I'd have thought that it only guarantees that if the new name already
exists it's atomically replaced. But I might be wrong.


And there's still the problem that changing the filename on-the-fly is
going to break tons of low-level stuff, most of which is not supposed to
know about transactions at all, notably bgwriter.

Good point - I thought that we wouldn't have to care about this because
we could close the relation before renaming in the committing backend
and be done with it, since other backends won't see the new file
before we update the CLOG. But you're right, bgwriter is a problem
and one not easily solved...

So that rename-on-commit idea seems to be quite dead...


What I was thinking about was a flag file separate from the data file
itself, a bit like what we use for archiver signaling.  If  is the
new data file, then touch .new to mark the file as needing to be
deleted on restart.  Remove these files just *before* commit.  This
leaves you with a narrow window between removing the flag file and
actually committing, but there's no risk of having to PANIC --- if the
remove fails, you just abort the transaction.

Hm.. we could call the file nnn.xid.new, and delete it after the commit,
silently ignoring any failures. During both database-wide VACUUM and
after recovery we'd remove any leftover *.xid.new files, but only
if the xid is marked committed in the clog. After that cleanup step,
we'd delete any files which still have an associated flag file.

Processing those nnn.xid.new files during VACUUM is just needed to
avoid any problems because of xid wraparound - it could perhaps
be replaced by naming the file nnn.epoch.xid.new instead.
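
As a purely hypothetical sketch of that naming scheme (nothing creates
such flag files today), a cleanup pass would essentially enumerate the
leftover flag files and split the relfilenode and creating xid back out
of the name - the clog lookup itself would of course happen inside the
server, not in SQL:

SELECT f.flagfile,
       split_part(f.flagfile, '.', 1) AS relfilenode,
       split_part(f.flagfile, '.', 2) AS creating_xid
FROM pg_ls_dir('base/' || (SELECT oid FROM pg_database
                           WHERE datname = current_database())) AS f(flagfile)
WHERE f.flagfile LIKE '%.new';

For each row, the pass would then drop the flag file if the xid committed,
and drop both the flag file and the data file otherwise.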


However, this has nonzero overhead compared to the current behavior.
I'm still dubious that we have a problem that needs solving ...

I agree that file leakage is not a critical problem - if it were, there'd
be many more complaints...

But it's still something that a postgres DBA has to be aware of, because
it might bite you quite badly. Since IMHO admin friendliness is one of
the strengths of postgresql, removing the possibility of leakage would be
nice in the long term.

Nothing that needs any rushing, though - and nothing that we'd want to pay
for in terms of performance.


