Re: [HACKERS] Questions and experiences writing a Foreign Data Wrapper

2011-07-21 Thread Albe Laurenz
Heikki Linnakangas wrote:
 2) If I decide to close remote database connections after
 use, I would like to do so where reasonable.
 I would like to keep the connection open between query
 planning and query execution and close it when the
 scan is done.
 The exception could be named prepared statements.
 Is there a way to tell if that is the case during
 planning or execution?

 Hmm, maybe you could add a hook to close the connection when the
 transaction ends. But actually, you'd want to keep the connection
 open across transactions too. Some sort of a general connection
 caching facility would be useful for many FDWs.

I agree, and that is how I implemented it at the moment.
But it might be nice to give the user the option, say, if they know
that it is a long session in a daemon process that accesses the
remote table only once a day.

I'll look into the hook option.
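
A minimal sketch of that hook approach, using the backend's
RegisterXactCallback machinery (the connection-cache helper named here
is hypothetical):

    #include "postgres.h"
    #include "access/xact.h"

    /* Hypothetical helper that closes any cached remote connections. */
    extern void my_fdw_close_cached_connections(void);

    static void
    my_fdw_xact_callback(XactEvent event, void *arg)
    {
        /* Once the local transaction ends, drop the remote connection. */
        if (event == XACT_EVENT_COMMIT || event == XACT_EVENT_ABORT)
            my_fdw_close_cached_connections();
    }

    void
    _PG_init(void)
    {
        RegisterXactCallback(my_fdw_xact_callback, NULL);
    }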

Here are some more ideas for FDW API functions/macros that might be
useful for FDW developers.

- A function that gives you the internal and external parameters at
  execution time.
- A function that gives you a type's input and output function.
- A function that gives you the OID of the foreign table owner.
- A function that gives you the list of columns of the foreign
  table (atttypid, atttypmod, attname, maybe others); a sketch of
  what I mean follows below.
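
For that last one, a rough sketch, assuming the relation's TupleDesc is
the natural source (the function name is invented; a real API function
would presumably return a List rather than print):

    #include "postgres.h"
    #include "access/heapam.h"
    #include "utils/rel.h"

    /* Invented helper: show the columns of a foreign table. */
    void
    ShowForeignTableColumns(Oid relid)
    {
        Relation    rel = relation_open(relid, AccessShareLock);
        TupleDesc   tupdesc = RelationGetDescr(rel);
        int         i;

        for (i = 0; i < tupdesc->natts; i++)
        {
            Form_pg_attribute att = tupdesc->attrs[i];

            if (att->attisdropped)
                continue;           /* skip dropped columns */
            elog(NOTICE, "column %s: type %u, typmod %d",
                 NameStr(att->attname), att->atttypid, att->atttypmod);
        }
        relation_close(rel, AccessShareLock);
    }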

Yours,
Laurenz Albe



Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Yeb Havinga

On 2011-07-21 00:08, Robert Haas wrote:

On Wed, Jul 20, 2011 at 4:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Kohei Kaigai kohei.kai...@emea.nec.com writes:

I'd like to have a discussion about syscache towards next commit-fest.
The issues may be:
  - Initial bucket allocation: in most cases the buckets are never referenced.
  - Reclaiming cache entries when the cache grows too large.

There used to be support for limiting the number of entries in a
syscache.  It got removed (cf commit
8b9bc234ad43dfa788bde40ebf12e94f16556b7f) because (1) it was remarkably
expensive to do it (extra list manipulations, etc), and (2) performance
tended to fall off a cliff as soon as you had a few more tables or
whatever than the caches would hold.  I'm disinclined to reverse that
decision.  It appears to me that the security label stuff needs a
different set of performance tradeoffs than the rest of the catalogs,
which means it probably ought to do its own caching, rather than trying
to talk us into pessimizing the other catalogs for seclabel's benefit.

I agree that we don't want to limit the size of the catcaches.  We've
been careful to design them in such a way that they won't blow out
memory, and so far there's no evidence that they do.  If it ain't
broke, don't fix it.  Having catcaches that can grow in size as needed
sounds useful to me, though.
Is there a way to dump syscache statistics like there is for 
MemoryContextStats (something - gdb helped me there)?


Besides that, I have to admit having problems understanding why the 5MB
cache for pg_seclabel is a problem; its memory consumption is linear
in the size of the underlying database (in contrast with the other
cache, storing access vectors, which would have O(n*m) space complexity
if it didn't reclaim space). So it is proportional to the number of
objects in a database, and in size it seems to be in the same order as
pg_proc, pg_class and pg_attribute.


regards,

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




[HACKERS] Environment checks prior to regression tests?

2011-07-21 Thread Kohei Kaigai
How about an idea that allows launching an environment checker (typically
shell scripts) prior to regression tests?

The following things must be preconfigured to run sepgsql's regression test:
- SELinux must be running and configured to enforcing mode.
- The sepgsql-regtest policy module must be loaded.
- The boolean sepgsql_regression_test_mode must be turned on.
- The psql command should be labeled as 'bin_t'.

If installcheck optionally allowed launching an environment checker before
the regression tests, we could suggest to users how to fix up their
configuration. That seems quite helpful to me.

For example, one idea is to inject a dummy variable (normally empty) as a
dependency of installcheck, which a contrib module's Makefile can
override, as follows:

  # against installed postmaster
  installcheck: submake $(REGRESS_PRE)
          $(pg_regress_installcheck) $(REGRESS_OPTS) $(REGRESS)
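
A contrib module's Makefile could then hook in its checker along these
lines (the target and script names here are made up):

  # in the contrib module's Makefile (sketch)
  REGRESS_PRE = check_environment

  check_environment:
          ./check_environment.sh   # hypothetical script testing the items above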

Thanks,
--
NEC Europe Ltd, SAP Global Competence Center
KaiGai Kohei kohei.kai...@emea.nec.com



Re: [HACKERS] Commitfest Status: Sudden Death Overtime

2011-07-21 Thread Hannu Krosing
On Thu, 2011-07-21 at 00:21 +0200, Florian Pflug wrote:
 There's a small additional concern, though, which is that there's an
 XPath 2.0 spec out there, and it modifies the type system and data model
 rather heavily. So before we go adding functions, it'd probably be wise
 to check that we're not painting ourselves into a corner.

Why not just write an XPATH2() function conforming to the XPath 2.0 spec
if the new spec is substantially different?


-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] Environment checks prior to regression tests?

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 6:16 AM, Kohei Kaigai kohei.kai...@emea.nec.com wrote:
 How about an idea that allows launching an environment checker (typically
 shell scripts) prior to regression tests?

 The following things must be preconfigured to run sepgsql's regression test:
 - SELinux must be running and configured to enforcing mode.
 - The sepgsql-regtest policy module must be loaded.
 - The boolean sepgsql_regression_test_mode must be turned on.
 - The psql command should be labeled as 'bin_t'.

 If installcheck optionally allowed launching an environment checker before
 the regression tests, we could suggest to users how to fix up their
 configuration. That seems quite helpful to me.

 For example, one idea is to inject a dummy variable (normally empty) as a
 dependency of installcheck, which a contrib module's Makefile can
 override, as follows:

  # against installed postmaster
  installcheck: submake $(REGRESS_PRE)
          $(pg_regress_installcheck) $(REGRESS_OPTS) $(REGRESS)

Seems reasonable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] timing for 9.1beta4 / rc1

2011-07-21 Thread Robert Haas
All,

9.1beta3 was tagged on July 7th and announced July 12th.  In the
interest of avoiding institutional inertia, should we try to set a
tentative date for the next beta or, if we don't end up fixing too many
bugs in the meanwhile, perhaps a release candidate?

Obviously we want to give people some time to shake out bugs in beta3,
but I think it would be good to get our next 9.1 release (whatever it
ends up being named) out the door sometime in August, and we'll be
more likely to actually do that if we pick a date to shoot for.

On a related note,
http://wiki.postgresql.org/wiki/PostgreSQL_9.1_Open_Items is looking
pretty barren at the moment.  Are there things that should be listed
there?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 4:00 AM, Yeb Havinga yebhavi...@gmail.com wrote:
 Is there a way to dump syscache statistics like there is for
 MemoryContextStats (something - gdb helped me there)?

I don't know of one.

 Besides that, I have to admit having problems understanding why the 5MB cache
 for pg_seclabel is a problem; its memory consumption is linear in the
 size of the underlying database (in contrast with the other cache, storing
 access vectors, which would have O(n*m) space complexity if it didn't
 reclaim space). So it is proportional to the number of objects in a database,
 and in size it seems to be in the same order as pg_proc, pg_class and
 pg_attribute.

Fair enough.  I'm not convinced that the sheer quantity of memory use
is a problem, although I would like to see a few more test results
before we decide that definitively.  I *am* unwilling to pay the
startup overhead of initializing an extra 2048-bucket syscache that only
sepgsql users will actually need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Sushant Sinha
I am using plpythonu on postgres 9.0.2. One of my python functions was
throwing a TypeError exception. However, I only see the exception in the
database and not the stack trace. It becomes difficult to debug if the
stack trace is absent in Python.

logdb=# select get_words(forminput) from fi;   
ERROR:  PL/Python: TypeError: an integer is required
CONTEXT:  PL/Python function get_words


And here is the error if I run that function on the same data in python:

Traceback (most recent call last):
  File "valid.py", line 215, in <module>
    parse_query(result['forminput'])
  File "valid.py", line 132, in parse_query
    dateobj = datestr_to_obj(columnHash[column])
  File "valid.py", line 37, in datestr_to_obj
    dateobj = datetime.date(words[2], words[1], words[0])
TypeError: an integer is required


Is this a known problem, or does it need addressing?

Thanks,
Sushant.




Re: [HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Peter Geoghegan
On 21 July 2011 14:27, Sushant Sinha sushant...@gmail.com wrote:
 I am using plpythonu on postgres 9.0.2. One of my python functions was
 throwing a TypeError exception. However, I only see the exception in the
 database and not the stack trace. It becomes difficult to debug if the
 stack trace is absent in Python.

 logdb=# select get_words(forminput) from fi;
 ERROR:  PL/Python: TypeError: an integer is required
 CONTEXT:  PL/Python function get_words


 And here is the error if I run that function on the same data in python:

 Traceback (most recent call last):
   File "valid.py", line 215, in <module>
     parse_query(result['forminput'])
   File "valid.py", line 132, in parse_query
     dateobj = datestr_to_obj(columnHash[column])
   File "valid.py", line 37, in datestr_to_obj
     dateobj = datetime.date(words[2], words[1], words[0])
 TypeError: an integer is required


 Is this a known problem, or does it need addressing?

Traceback information will be added to PL/Python errors in Postgres
9.1, due out around September.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Jan Urbański
On 21/07/11 15:27, Sushant Sinha wrote:
 I am using plpythonu on postgres 9.0.2. One of my python functions was
 throwing a TypeError exception. However, I only see the exception in the
 database and not the stack trace. It becomes difficult to debug if the
 stack trace is absent in Python.
 
 logdb=# select get_words(forminput) from fi;   
 ERROR:  PL/Python: TypeError: an integer is required
 CONTEXT:  PL/Python function get_words
 
 And here is the error if I run that function on the same data in python:
 
 [traceback]
 
 Is this a known problem, or does it need addressing?

Yes, traceback support in PL/Python has already been implemented and is
a new feature that will be available in PostgreSQL 9.1.

Cheers,
Jan



Re: [HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Sushant Sinha

On Thu, 2011-07-21 at 15:31 +0200, Jan Urbański wrote:
 On 21/07/11 15:27, Sushant Sinha wrote:
  I am using plpythonu on postgres 9.0.2. One of my python functions was
  throwing a TypeError exception. However, I only see the exception in the
  database and not the stack trace. It becomes difficult to debug if the
  stack trace is absent in Python.
  
  logdb=# select get_words(forminput) from fi;   
  ERROR:  PL/Python: TypeError: an integer is required
  CONTEXT:  PL/Python function get_words
  
  And here is the error if I run that function on the same data in python:
  
  [traceback]
  
  Is this a known problem, or does it need addressing?
 
 Yes, traceback support in PL/Python has already been implemented and is
 a new feature that will be available in PostgreSQL 9.1.
 
 Cheers,
 Jan

Thanks Jan! Just one more reason to try 9.1.





Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Robert Haas
On Tue, Jul 12, 2011 at 4:47 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 Comments ?

I was going to spend some time reviewing this, but I see that (1) it
has bit-rotted slightly - there is a failing hunk in pg_class.h and
(2) some of the comments downthread seem to suggest that you're
thinking about whether to revise this somewhat, in particular by using
some counter other than an LSN.  Are you planning to submit an updated
version?

A few comments on this version just reading through it:

- In lazy_scan_heap, where you've made the call to
RecordPageWithFreeSpace() unconditional, the comment change you made
immediately above is pretty half-baked.  It still refers to
lazy_vacuum_heap, which you've meanwhile removed.  You need to rewrite
the whole comment, I think.

- Instead of passing bool need_vaclsn to PageRepairFragmentation(),
how about passing Offset new_special_size?  Currently,
PageRepairFragmentation() doesn't know whether it's looking at a heap
page or an index page, and it would be nice to keep it that way.  It's
even possible that expanding the special space opportunistically
during page defragmentation could be useful in other contexts besides
this.  Or perhaps contracting it.

- itemid.h seems a bit schizophrenic about dead line pointers.  Here,
you've decided that it's OK for lp_flags == LP_DEAD && lp_off == 1 to
mean dead-vacuumed, but the existing code says:

#define LP_DEAD        3       /* dead, may or may not have storage */

AFAICT, the actual situation here is that indexes sometimes use dead
line pointers with storage, but the heap doesn't; thus, the heap can
safely use the storage bits of dead line pointers to mean something
else, but indexes can't.  I think the comments throughout itemid.h
should be adjusted to bring this out a bit more clearly, though.
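
A sketch of what the heap-only convention amounts to, with invented
macro names (lp_off and lp_flags are the bit fields in ItemIdData):

    /* The heap never stores data for LP_DEAD items, so lp_off is free
     * to carry a marker; indexes cannot assume this. */
    #define ItemIdMarkDeadVacuumed(itemId) \
        ((itemId)->lp_flags = LP_DEAD, (itemId)->lp_off = 1)

    #define ItemIdIsDeadVacuumed(itemId) \
        ((itemId)->lp_flags == LP_DEAD && (itemId)->lp_off == 1)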

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] timing for 9.1beta4 / rc1

2011-07-21 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 http://wiki.postgresql.org/wiki/PostgreSQL_9.1_Open_Items is
 looking pretty barren at the moment.  Are there things that should
 be listed there?
 
I don't know of any.  The non-blocking item about paring down Dan's
latest isolation test case to a size which can be committed is
something I won't be able to get to for at least a couple weeks --
at least if I want to keep drawing paychecks from my employer.  If
someone else took that, I wouldn't complain.  Meanwhile, I am
running the test on my workstation regularly.
 
-Kevin



Re: [HACKERS] fixing PQsetvalue()

2011-07-21 Thread Merlin Moncure
On Wed, Jul 20, 2011 at 10:28 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Jul 18, 2011 at 6:38 AM, Pavel Golub pa...@microolap.com wrote:
 Hello, Merlin.

 I hope it's OK that I've added Andrew's patch to CommitFest:
 https://commitfest.postgresql.org/action/patch_view?id=606

 I did this because beta3 was already released, but nothing has been done
 about this bug.

 So I finally got around to taking a look at this patch, and I guess my
 basic feeling is that I like it.  The existing code is pretty weird
 and inconsistent: the logic in PQsetvalue() basically does the same
 thing as the logic in pqAddTuple(), but incompatibly and less
 efficiently.  Unifying them seems sensible, and the fix looks simple
 enough to back-patch.

 With respect to Tom's concern about boxing ourselves in, I guess it's
 hard for me to get worried about that.  I've heard no one suggest
 changing the internal representation libpq uses for result sets, and
 even if we did, presumably the new format would also need to support
 an append-a-tuple operation - or at the very worst we could cause it to
 support that without much difficulty.

 So, +1 from me.

right -- thanks for that. For the record, I think a rework of the
libpq internal representation would be likely to happen concurrently
with a rework of the API -- for example to better support streaming
data. PQsetvalue very well might prove to be a headache -- just too
hard to say.  libpq strikes me as a 50-year-plus marriage might:
fractious, full of mystery and regrets, but highly functional.

merlin



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Kevin Grittner
Robert Haas wrote:
 
 SIGetDataEntries(). I believe we need to bite the bullet and
 rewrite this using a lock-free algorithm, using memory barriers on
 processors with weak memory ordering.
 
 [32 processors; 80 clients]
 
 On unpatched master
 
 tps = 132518.586371 (including connections establishing)
 tps = 130968.749747 (including connections establishing)
 tps = 132574.338942 (including connections establishing)
 
 With the lazy vxid locks patch
 
 tps = 119215.958372 (including connections establishing)
 tps = 113056.859871 (including connections establishing)
 tps = 160562.770998 (including connections establishing)
 
 gets rid of SInvalReadLock and instead gives each backend its own
 spinlock.
 
 tps = 167392.042393 (including connections establishing)
 tps = 171336.145020 (including connections establishing)
 tps = 170500.529303 (including connections establishing)
 
 SIGetDataEntries() can pretty easily be made lock-free.
 
 tps = 203256.701227 (including connections establishing)
 tps = 190637.957571 (including connections establishing)
 tps = 190228.617178 (including connections establishing)
 
 Thoughts? Comments? Ideas?
 
Very impressive!  Those numbers definitely justify some #ifdef code
to provide alternatives for weak memory ordering machines versus
others.  With the number of CPUs climbing as it is, this is very
important work!
 
-Kevin



Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Robert Haas
On Thu, Jul 14, 2011 at 12:43 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 How does this interact with the visibility map? If you set the visibility
 map bit after vacuuming indexes, a subsequent vacuum will not visit the
 page. The second vacuum will update relindxvacxlogid/off, but it will not
 clean up the dead line pointers left behind by the first vacuum. Now the LSN
 on the page differs from the one stored in pg_class, so subsequent pruning
 will not remove the dead line pointers either.

Currently, I think we would only set the visibility map bit after
vacuuming the page for the second time.  The patch as submitted
doesn't appear to go back and set visibility map bits after finishing
the index vacuum.  Now, that might be nice to do, because then a
hypothetical index-only scan could start taking advantage of vacuum
having been done sooner.  If we wanted to do that, we could
restructure the visibility map to store two bits per page: one to
indicate whether there is any potential work for VACUUM to do (modulo
freezing) and the other to indicate whether an index pointer could
possibly be aimed at a dead line pointer.  (In fact, maybe we'd even
want to have a third bit to indicate all tuples frozen, which would
be useful for optimizing anti-wraparound vacuum.)

 I think you can sidestep that
 if you check that the page's vacuum LSN <= vacuum LSN in pg_class, instead
 of equality.

I don't think that works, because the point of storing the LSN in
pg_class is to verify that the vacuum completed the index cleanup
without error.  The fact that a newer vacuum accomplished that goal
does not mean that all older ones did.

 Ignoring the issue stated in the previous paragraph, I think you wouldn't
 actually need a 64-bit LSN. A smaller counter is enough, as wrap-around
 doesn't matter. In fact, a single bit would be enough. After a successful
 vacuum, the counter on each heap page (with dead line pointers) is N, and
 the value in pg_class is N. There are no other values on the heap, because
 vacuum will have cleaned them up. When you begin the next vacuum, it will
 stamp pages with N+1. So at any stage, there is only one of two values on
 any page, so a single bit is enough. (But as I said, that doesn't hold if
 vacuum skips some pages thanks to the visibility map)

If this can be made to work, it's a very appealing idea.  The patch as
submitted uses lp_off to store a single bit, to distinguish between
vacuum and dead-vacuumed, but we could actually have (for greater
safety and debuggability) a 15-byte counter that just wraps around
from 32,767 to 1.  (Maybe it would be wise to reserve a few counter
values, or a few bits, or both, for future projects.)  That would
eliminate the need to touch PageRepairFragmentation() or use the
special space, since all the information would be in the line pointer
itself.  Not having to rearrange the page to reclaim dead line
pointers is appealing, too.

 Is there something in place to make sure that pruning uses an up-to-date
 relindxvacxlogid/off value? I guess it doesn't matter if it's out-of-date,
 you'll just miss the opportunity to remove some dead tuples.

This seems like a tricky problem, because it could cause us to
repeatedly fail to remove the same dead line pointers, which would be
poor.  We could do something like this: after updating pg_class,
vacuum sends an interrupt to any backend which holds RowExclusiveLock
or higher on that relation.  The interrupt handler just sets a flag.
If that backend does heap_page_prune() and sees the flag set, it knows
that it needs to recheck pg_class.  This is a bit grotty and doesn't
completely close the race condition (the signal might not arrive in
time), but it ought to make it narrow enough not to matter in
practice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] fixing PQsetvalue()

2011-07-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So I finally got around to taking a look at this patch, and I guess my
 basic feeling is that I like it.  The existing code is pretty weird
 and inconsistent: the logic in PQsetvalue() basically does the same
 thing as the logic in pqAddTuple(), but incompatibly and less
 efficiently.  Unifying them seems sensible, and the fix looks simple
 enough to back-patch.

Yeah, I've been looking at it too.  For some reason I had had the
idea that the proposed patch complicated the code, but actually it's
simplifying it by removing almost-duplicate code.  So that's good.

The patch as proposed adds back a bug in return for the one it fixes
(you can not free() the result of pqResultAlloc()), but that's easily
fixed.

Will fix and commit.

regards, tom lane



Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Pavan Deolasee
On Thu, Jul 21, 2011 at 11:51 AM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Jul 12, 2011 at 4:47 PM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:
  Comments ?

 I was going to spend some time reviewing this, but I see that (1) it
 has bit-rotted slightly - there is a failing hunk in pg_class.h and
 (2) some of the comments downthread seem to suggest that you're
 thinking about whether to revise this somewhat, in particular by using
 some counter other than an LSN.  Are you planning to submit an updated
 version?


Yeah, I would submit an updated version. I was just waiting to see if there
are more comments about the general design. But I think I can now proceed.

I wonder if we can just ignore the wrap-around issue and use a 32-bit
counter. The counter can be stored in pg_class itself since its use is
limited to the given table. At the start of vacuum, we get the current
value. We then increment the counter (taking care of wrap-around) and use
the incremented value as a marker in the page special area. If the vacuum
runs to completion, we store the new value back in the pg_class row. Since
vacuums are serialized for a given table, we don't need to worry about
concurrent updates to the value.

While collecting dead-vacuum line pointers, either during HOT-prune or a
subsequent vacuum, we check the current pg_class value, and if it is equal
to the page counter, we can safely collect the dead-vacuum line pointers.
For a moment, I thought we could just do away with a bit as Heikki
suggested upthread, but the problem comes with backends that might be
running with a stale value of the counter from pg_class; the counter
should be large enough that it does not wrap around quickly for all
practical purposes.
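
A compressed sketch of that scheme (names illustrative; uint32 is the
usual PostgreSQL typedef):

    /* Advance the per-table vacuum generation, reserving 0 as "unset". */
    static uint32
    advance_vacuum_gen(uint32 cur)
    {
        uint32 next = cur + 1;
        return (next == 0) ? 1 : next;
    }

    /* During HOT-prune: equality proves the last *completed* vacuum
     * stamped this page, i.e. its index pointers are already gone. */
    static bool
    can_collect_dead_vacuumed(uint32 page_gen, uint32 pgclass_gen)
    {
        return page_gen == pgclass_gen;
    }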



 A few comments on this version just reading through it:

 - In lazy_scan_heap, where you've made the call to
 RecordPageWithFreeSpace() unconditional, the comment change you made
 immediately above is pretty half-baked.  It still refers to
 lazy_vacuum_heap, which you've meanwhile removed.  You need to rewrite
 the whole comment, I think.

 - Instead of passing bool need_vaclsn to PageRepairFragmentation(),
 how about passing Offset new_special_size?  Currently,
 PageRepairFragmentation() doesn't know whether it's looking at a heap
 page or an index page, and it would be nice to keep it that way.  It's
 even possible that expanding the special space opportunistically
 during page defragmentation could be useful in other contexts besides
 this.  Or perhaps contracting it.

 - itemid.h seems a bit schizophrenic about dead line pointers.  Here,
 you've decided that it's OK for lp_flags == LP_DEAD && lp_off == 1 to
 mean dead-vacuumed, but the existing code says:

 #define LP_DEAD        3       /* dead, may or may not have storage */

 AFAICT, the actual situation here is that indexes sometimes use dead
 line pointers with storage, but the heap doesn't; thus, the heap can
 safely use the storage bits of dead line pointers to mean something
 else, but indexes can't.  I think the comments throughout itemid.h
 should be adjusted to bring this out a bit more clearly, though.



I will take care of these issues in the revised patch. Thanks for looking at
the patch.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Pavan Deolasee
On Thu, Jul 21, 2011 at 12:17 PM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Jul 14, 2011 at 12:43 PM, Heikki Linnakangas
  I think you can sidestep that
  if you check that the page's vacuum LSN <= vacuum LSN in pg_class,
  instead of equality.

 I don't think that works, because the point of storing the LSN in
 pg_class is to verify that the vacuum completed the index cleanup
 without error.  The fact that a newer vacuum accomplished that goal
 does not mean that all older ones did.


Since we force the subsequent vacuum to also look at the pages scanned and
pruned by a previous failed vacuum, all the pages that have dead-vacuum line
pointers will have a new stamp once the vacuum finishes successfully, and
pg_class will have the same stamp.


  Ignoring the issue stated in the previous paragraph, I think you wouldn't
  actually need a 64-bit LSN. A smaller counter is enough, as wrap-around
  doesn't matter. In fact, a single bit would be enough. After a successful
  vacuum, the counter on each heap page (with dead line pointers) is N, and
  the value in pg_class is N. There are no other values on the heap,
 because
  vacuum will have cleaned them up. When you begin the next vacuum, it will
  stamp pages with N+1. So at any stage, there is only one of two values on
  any page, so a single bit is enough. (But as I said, that doesn't hold if
  vacuum skips some pages thanks to the visibility map)

 If this can be made to work, it's a very appealing idea.


I thought more about it and for a moment believed that we could do this
with just a bit, since we rescan the pages with dead and dead-vacuum line
pointers after an aborted vacuum, but concluded that a bit or a small
counter is not good enough, since other backends might be running with a
stale value and would be fooled into believing that they can collect the
dead-vacuum line pointers before the index pointers are actually removed. We
can still use a 32-bit counter though, since the wrap-around period is
practically too large for any backend to still be running with such a stale
counter (you would need more than 1 billion vacuums on the same table in
between to hit this).


 The patch as
 submitted uses lp_off to store a single bit, to distinguish between
 vacuum and dead-vacuumed, but we could actually have (for greater
 safety and debuggability) a 15-byte counter that just wraps around
 from 32,767 to 1.  (Maybe it would be wise to reserve a few counter
 values, or a few bits, or both, for future projects.)  That would
 eliminate the need to touch PageRepairFragmentation() or use the
 special space, since all the information would be in the line pointer
 itself.  Not having to rearrange the page to reclaim dead line
 pointers is appealing, too.


Not sure if I get you here. We need a mechanism to distinguish between dead
and dead-vacuum line pointers. How would the counter (by which I assume you
mean 15-bit, not byte) help solve that? Or are you just suggesting
replacing the LSN with the counter in the page header?


  Is there something in place to make sure that pruning uses an up-to-date
  relindxvacxlogid/off value? I guess it doesn't matter if it's out-of-date,
  you'll just miss the opportunity to remove some dead tuples.

 This seems like a tricky problem, because it could cause us to
 repeatedly fail to remove the same dead line pointers, which would be
 poor.  We could do something like this: after updating pg_class,
 vacuum sends an interrupt to any backend which holds RowExclusiveLock
 or higher on that relation.  The interrupt handler just sets a flag.
 If that backend does heap_page_prune() and sees the flag set, it knows
 that it needs to recheck pg_class.  This is a bit grotty and doesn't
 completely close the race condition (the signal might not arrive in
 time), but it ought to make it narrow enough not to matter in
 practice.


I am not too excited about adding that complexity to the code. Even if a
backend does not have the up-to-date value, it will fail to collect the
dead-vacuum pointers, but soon either it will catch up, some other backend
will remove them, or the next vacuum will take care of it.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove O(N^2) performance issue with multiple SAVEPOINTs.

2011-07-21 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Instead of leaving the locks dangling to an already-destroyed resource 
 owner, how about assigning all locks directly to the top-level resource 
 owner in one sweep? That'd still be much better than the old way of 
 recursively reassigning them up the subtransaction tree, one level at a 
 time.

I haven't actually read the patch, but the reason for pushing them up
only one level at a time is that if an intermediate-level subtransaction
aborts, the locks taken by its child subtransactions have to be released
at that time.  It sure sounds like this patch broke that.

regards, tom lane



Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove O(N^2) performance issue with multiple SAVEPOINTs.

2011-07-21 Thread Simon Riggs
On Thu, Jul 21, 2011 at 5:56 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Instead of leaving the locks dangling to an already-destroyed resource
 owner, how about assigning all locks directly to the top-level resource
 owner in one sweep? That'd still be much better than the old way of
 recursively reassigning them up the subtransaction tree, one level at a
 time.

 I haven't actually read the patch, but the reason for pushing them up
 only one level at a time is that if an intermediate-level subtransaction
 aborts, the locks taken by its child subtransactions have to be released
 at that time.  It sure sounds like this patch broke that.

The only path altered by the patch was the
final-commit-while-in-a-subxact, so I don't see a problem in the part
you mention.

At commit all the locks get transferred to the parent, so we scan the
lock table repeatedly, giving O(N^2).

I think I'll just revert it though. Subtransactions need a lot of
tuning but this isn't high enough up my list to be worth the work.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] Policy on pulling in code from other projects?

2011-07-21 Thread Joshua D. Drake

Hey,

So I am looking intently at what it is going to take to get the URI
patch done for psql [1], and while digging around the web I found a URI
parser library. It is under the New BSD license and is strictly RFC
3986 [2] compliant.


Now, I have not dug into the code, but the parser is used by other
projects. So the question is:


Assuming the code actually makes this patch easier, do we:

A. Pull the code into the main tree, or
B. Have it as a requirement via configure?

1. 
http://archives.postgresql.org/message-id/1302114698.23164.17.camel@jd-desktop


2. http://tools.ietf.org/html/rfc3986


Sincerely,

Joshua D. Drake


--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579



Re: [HACKERS] fixing PQsetvalue()

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 12:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 So I finally got around to taking a look at this patch, and I guess my
 basic feeling is that I like it.  The existing code is pretty weird
 and inconsistent: the logic in PQsetvalue() basically does the same
 thing as the logic in pqAddTuple(), but incompatibly and less
 efficiently.  Unifying them seems sensible, and the fix looks simple
 enough to back-patch.

 Yeah, I've been looking at it too.  For some reason I had had the
 idea that the proposed patch complicated the code, but actually it's
 simplifying it by removing almost-duplicate code.  So that's good.

 The patch as proposed adds back a bug in return for the one it fixes
 (you can not free() the result of pqResultAlloc()), but that's easily
 fixed.

 Will fix and commit.

Cool.  I believe that's the last patch for CommitFest 2011-06.

*bangs gavel*

I believe that makes it time for 9.2alpha1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove O(N^2) performance issue with multiple SAVEPOINTs.

2011-07-21 Thread Alvaro Herrera
Excerpts from Simon Riggs's message of Thu Jul 21 13:30:25 -0400 2011:

 I think I'll just revert it though. Subtransactions need a lot of
 tuning but this isn't high enough up my list to be worth the work.

If it works and is sane, why would you revert it?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Policy on pulling in code from other projects?

2011-07-21 Thread Peter Geoghegan
On 21 July 2011 18:43, Joshua D. Drake j...@commandprompt.com wrote:
 So I am looking intently at what it is going to take to get the URI patch
 done for psql [1], and while digging around the web I found a URI parser
 library. It is under the New BSD license and is strictly RFC 3986 [2]
 compliant.

 Now, I have not dug into the code, but the parser is used by other projects.
 So the question is:

 Assuming the code actually makes this patch easier, do we:

 A. Pull the code into the main tree, or
 B. Have it as a requirement via configure?

 1.
 http://archives.postgresql.org/message-id/1302114698.23164.17.camel@jd-desktop

 2. http://tools.ietf.org/html/rfc3986

Without commenting on the practicalities of what you'd like to do,
including code from other projects in the tree is well precedented.
Off the top of my head, I can tell you that pgcrypto uses code from
various sources, while preserving the original copyright notices.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Policy on pulling in code from other projects?

2011-07-21 Thread Tom Lane
Joshua D. Drake j...@commandprompt.com writes:
 So I am looking intently at what it is going to take to get the URI
 patch done for psql [1], and while digging around the web I found a URI
 parser library. It is under the New BSD license and is strictly RFC
 3986 [2] compliant.

Surely we do not need a whole library to parse URIs.

regards, tom lane



Re: [HACKERS] Policy on pulling in code from other projects?

2011-07-21 Thread Joshua D. Drake

On 07/21/2011 11:13 AM, Tom Lane wrote:

Joshua D. Drake j...@commandprompt.com writes:

So I am looking intently at what it is going to take to get the URI
patch done for psql [1], and while digging around the web I found a URI
parser library. It is under the New BSD license and is strictly RFC
3986 [2] compliant.


Surely we do not need a whole library to parse URIs.


Shrug: standards compliant, already runs on Windows.

Seems like a good idea to me?

http://uriparser.sourceforge.net/

Sincerely,

Joshua D. Drake








--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579



Re: [HACKERS] Policy on pulling in code from other projects?

2011-07-21 Thread Joshua D. Drake

On 07/21/2011 11:13 AM, Tom Lane wrote:

Joshua D. Drake j...@commandprompt.com writes:

So I am looking intently at what it is going to take to get the URI
patch done for psql [1], and while digging around the web I found a URI
parser library. It is under the New BSD license and is strictly RFC
3986 [2] compliant.


Surely we do not need a whole library to parse URIs.


Also:

http://uriparser.git.sourceforge.net/git/gitweb-index.cgi

Sincerely,

Joshua D. Drake







--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 12:16 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Very impressive!  Those numbers definitely justify some #ifdef code
 to provide alternatives for weak memory ordering machines versus
 others.  With the number of CPUs climbing as it is, this is very
 important work!

Thanks.  I'm not thinking so much about #ifdef (although that could
work, too) as I am about providing some primitives to allow this sort
of code-writing to be done in a somewhat less ad-hoc fashion.  It
seems like there are basically two categories of machines we need to
worry about.

1. Machines with strong memory ordering.  On this category of machines
(which includes x86), the CPU basically does not perform loads or
stores out of order.  On some of these machines, it is apparently
possible for there to be some reordering of stores relative to loads,
but if the program stores two values or loads two values, those
operations will be performed in the same order they appear in the
program.  The main thing you need to make your code work reliably on
these machines is a primitive that keeps the compiler from reordering
your code during optimization.  On x86, certain categories of exotic
instructions (such as non-temporal stores) do require explicit fences,
but ordinary loads and stores do not.

2. Machines with weak memory ordering.  On this category of machines
(which includes PowerPC, DEC Alpha, and maybe some others), the CPU
reorders memory accesses arbitrarily unless you explicitly issue
instructions that enforce synchronization.  You still need to keep the
compiler from moving things around, too.  Alpha is particularly
pernicious, because something like a->b can fetch the pointed-to value
before loading the pointer itself.  This is otherwise known as "we
have basically no cache coherency circuits on this chip at all."  On
these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.

So you can imagine a primitive that is defined to be a compiler
barrier on machines with strong memory ordering, and as a memory
fencing instruction on machines with weak memory ordering.
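
A hedged sketch of such a primitive (names illustrative, gcc-style asm;
the weak-ordering fallback relies on gcc's __sync_synchronize builtin):

    /* Keep the compiler from reordering code across this point. */
    #define pg_compiler_barrier()  __asm__ __volatile__("" ::: "memory")

    #if defined(__i386__) || defined(__x86_64__)
    /* Strong ordering: the compiler barrier is the main thing needed. */
    #define pg_memory_barrier()    pg_compiler_barrier()
    #elif defined(__GNUC__)
    /* Weak ordering (PowerPC, Alpha, ...): emit a real fence. */
    #define pg_memory_barrier()    __sync_synchronize()
    #endif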

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 ... On these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.

Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on machines
where it matters.

To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support.  In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.

regards, tom lane



Re: [HACKERS] Questions and experiences writing a Foreign Data Wrapper

2011-07-21 Thread Tom Lane
Albe Laurenz laurenz.a...@wien.gv.at writes:
 I wrote a FDW for Oracle to a) learn some server coding
 and b) see how well the FDW API works for me.

 I came up with three questions/experiences:

 1) GetUserMapping throws an error if there is no
user mapping for the user (or PUBLIC).
I think that it would be much more useful if
it would return NULL or something similar instead.

We could make it do that, but under what circumstances would it be
useful to not throw an error?  It doesn't seem like you should try
to establish a remote connection anyway, if there's no mapping.

 3) I am confused by the order of function calls
during execution of a subplan. It is like this:
  BeginForeignScan
  ReScanForeignScan
  IterateForeignScan
  IterateForeignScan
  ...
  ReScanForeignScan
  IterateForeignScan
  IterateForeignScan
  ...
  EndForeignScan
   So the first ReScan is done immediately after
   BeginForeignScan. Moreover, internal parameters are not
   set in the BeginForeignScan call.

   This is probably working as designed, but BeginForeignScan
   has no way to know whether it should execute a remote
   query or not.

I'd say it probably shouldn't, ever.  If you look at the executor's node
init functions, none of them do any actual data fetching.  They just
prepare data structures.
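
Applied to an FDW, that suggests deferring the remote query to the first
IterateForeignScan call, roughly like this (the private state struct and
its fields are invented; error handling omitted):

    #include "postgres.h"
    #include "executor/tuptable.h"
    #include "foreign/fdwapi.h"

    typedef struct MyFdwScanState
    {
        bool        query_sent;     /* remote query started yet? */
        /* ... connection handle, cursor name, etc. ... */
    } MyFdwScanState;

    static void
    myBeginForeignScan(ForeignScanState *node, int eflags)
    {
        /* Only prepare state; fetch nothing yet. */
        MyFdwScanState *fsstate = palloc0(sizeof(MyFdwScanState));

        node->fdw_state = fsstate;
    }

    static TupleTableSlot *
    myIterateForeignScan(ForeignScanState *node)
    {
        MyFdwScanState *fsstate = (MyFdwScanState *) node->fdw_state;

        if (!fsstate->query_sent)
        {
            /* Parameters are set by now; safe to run the remote query. */
            fsstate->query_sent = true;
        }
        /* ... fetch and return the next tuple; empty slot when done ... */
        return ExecClearTuple(node->ss.ss_ScanTupleSlot);
    }

A ReScanForeignScan callback would then just reset query_sent, which also
covers the initial ReScan described above.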

regards, tom lane



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 ... On these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.

 Right, and the reason that a spinlock fixes it is that we have memory
 barrier instructions built into the spinlock code sequences on machines
 where it matters.

 To get to the point where we could do the sort of optimization Robert
 is talking about, someone will have to build suitable primitives for
 all the platforms we support.  In the cases where we use gcc ASM in
 s_lock.h, it shouldn't be too hard to pull out the barrier
 instruction(s) ... but on platforms where we rely on OS-supplied
 functions, some research is going to be needed.

Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
on a backend-private slock_t should work anywhere that PostgreSQL
works at all[1]. That will probably be slower than a memory fence
instruction and certainly slower than a compiler barrier, but the
point is that - right now - we're doing it the slow way everywhere.

I think the real challenge is going to be testing.  If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
Getting it to work on x86 is not the hard part.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] This was a suggestion from Noah Misch.  I wasn't quite convinced
when he initially made it, but having studied the issue a lot more, I
now am.  The CPU doesn't know how many processes have the memory
mapped into their address space.



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Dave Page
On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas robertmh...@gmail.com wrote:

 I think the real challenge is going to be testing.  If anyone has a
 machine with weak memory ordering they can give me access to, that
 would be really helpful for flushing the bugs out of this stuff.
 Getting it to work on x86 is not the hard part.

I believe there's a PPC box in our storage facility in NJ that we
might be able to dig out for you. There's also a couple in our India
office. Let me know if they'd be of help.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Yeb Havinga

On 2011-07-21 15:03, Robert Haas wrote:

On Thu, Jul 21, 2011 at 4:00 AM, Yeb Havinga yebhavi...@gmail.com wrote:

Besides that, I have to admit having problems understanding why the 5MB cache
for pg_seclabel is a problem; its memory consumption is linear in the
size of the underlying database (in contrast with the other cache, storing
access vectors, which would have O(n*m) space complexity if it didn't
reclaim space). So it is proportional to the number of objects in a database,
and in size it seems to be in the same order as pg_proc, pg_class and
pg_attribute.

Fair enough.  I'm not convinced that the sheer quantity of memory use
is a problem, although I would like to see a few more test results
before we decide that definitively.  I *am* unwilling to pay the
startup overhead of initializing an extra 2048-bucket syscache that only
sepgsql users will actually need.


Is it possible to only include the syscache on --enable-selinux 
configurations? It would imply physical data incompatibility with 
standard configurations, but that's also true for e.g. the block size.


Also, the tests I did with varying bucket sizes suggested that
decreasing the syscache to 256 buckets didn't show a significant
performance decrease compared to 2048 buckets for the restorecon test,
which hits over 3000 objects with security labels. My guess is that that
is a fair middle-of-the-road database schema size. Are you unwilling to
pay the startup overhead for an extra 256-bucket syscache?


--

Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 3:22 PM, Dave Page dp...@pgadmin.org wrote:
 On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas robertmh...@gmail.com wrote:
 I think the real challenge is going to be testing.  If anyone has a
 machine with weak memory ordering they can give me access to, that
 would be really helpful for flushing the bugs out of this stuff.
 Getting it to work on x86 is not the hard part.

 I believe there's a PPC box in our storage facility in NJ that we
 might be able to dig out for you. There's also a couple in our India
 office. Let me know if they'd be of help.

Yes!

More processors is better, of course, but having anything at all to
test on would be an improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Dave Page
On Thu, Jul 21, 2011 at 8:25 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jul 21, 2011 at 3:22 PM, Dave Page dp...@pgadmin.org wrote:
 On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas robertmh...@gmail.com wrote:
 I think the real challenge is going to be testing.  If anyone has a
 machine with weak memory ordering they can give me access to, that
 would be really helpful for flushing the bugs out of this stuff.
 Getting it to work on x86 is not the hard part.

 I believe there's a PPC box in our storage facility in NJ that we
 might be able to dig out for you. There's also a couple in our India
 office. Let me know if they'd be of help.

 Yes!

 More processors is better, of course, but having anything at all to
 test on would be an improvement.

OK, will check with India first, as it'll be easier for them to deploy.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 3:25 PM, Yeb Havinga yebhavi...@gmail.com wrote:
 Is it possible to only include the syscache on --enable-selinux
 configurations? It would imply physical data incompatibility with standard
 configurations, but that's also true for e.g. the block size.

Not really.  SECURITY LABEL is supposedly a generic facility that can
be used by a variety of providers, and the regression tests load a
dummy provider which works on any platform to test that it hasn't
gotten broken.

 Also, the tests I did with varying bucket sizes suggested that decreasing
 the syscache to 256 buckets didn't show a significant performance decrease
 compared to 2048 buckets for the restorecon test, which hits over 3000
 objects with security labels. My guess is that that is a fair
 middle-of-the-road database schema size. Are you unwilling to pay the
 startup overhead for an extra 256-bucket syscache?

Not sure.  I'd rather not, if it's easy to rejigger things so we don't
have to.  I don't think this is necessarily a hard problem to solve -
it's just that no one has tried yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Yeb Havinga


Is it possible to only include the syscache on --enable-selinux 
configurations? It would imply physical data incompatibility with 
standard configurations, but that's also true for e.g. the block size.


Also, the tests I did with varying bucket sizes suggested that 
decreasing the syscache to 256 didn't show a significant performance 
decrease compared to the 2048 #buckets, for the restorecon test, which 
hits over 3000 objects with security labels. My guess is that that is 
a fair middle of the road database schema size. Are you unwilling to 
pay the startup overhead for an extra 256 syscache?




Hello KaiGai-san,

off-list,

I was wondering why the catalog pg_seclabel exists at all. Why not store 
the labels together with the objects (pg_class, pg_attribute, etc.)? The 
syscache wouldn't be needed in that case.


regards,
Yeb



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Kohei KaiGai
2011/7/21 Yeb Havinga yebhavi...@gmail.com:

 Is it possible to only include the syscache on --enable-selinux
 configurations? It would imply physical data incompatibility with standard
 configurations, but that's also true for e.g. the block size.

 Also, the tests I did with varying bucket sizes suggested that decreasing
 the syscache to 256 didn't show a significant performance decrease compared
 to the 2048 #buckets, for the restorecon test, which hits over 3000 objects
 with security labels. My guess is that that is a fair middle of the road
 database schema size. Are you unwilling to pay the startup overhead for an
 extra 256 syscache?


 Hello KaiGai-san,

 off-list,

Unfortunately, not so...

 I was wondering why the catalog pg_seclabel exists at all. Why not store the
 labels together with the objects (pg_class, pg_attribute, etc.)? The syscache
 wouldn't be needed in that case.

Currently, sepgsql supports assigning security labels to only a limited
number of object classes (schemas, relations, columns and functions).
However, we plan to control access to all of the objects managed by
PostgreSQL, not only these four.

If we needed to expand a system catalog every time an object class gained
support, the patch would be more invasive and harder to get upstream.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I think the real challenge is going to be testing.  If anyone has a
 machine with weak memory ordering they can give me access to, that
 would be really helpful for flushing the bugs out of this stuff.

There are multi-CPU PPCen in the buildfarm, or at least there were last
time I broke the sinval code ;-).  Note that testing on a single-core
PPC will prove nothing.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Kohei KaiGai
2011/7/21 Robert Haas robertmh...@gmail.com:
 On Thu, Jul 21, 2011 at 3:25 PM, Yeb Havinga yebhavi...@gmail.com wrote:
 Is it possible to only include the syscache on --enable-selinux
 configurations? It would imply physical data incompatibility with standard
 configurations, but that's also true for e.g. the block size.

 Not really.  SECURITY LABEL is supposedly a generic facility that can
 be used by a variety of providers, and the regression tests load a
 dummy provider which works on any platform to test that it hasn't
 gotten broken.

 Also, the tests I did with varying bucket sizes suggested that decreasing
 the syscache to 256 didn't show a significant performance decrease compared
 to the 2048 #buckets, for the restorecon test, which hits over 3000 objects
 with security labels. My guess is that that is a fair middle of the road
 database schema size. Are you unwilling to pay the startup overhead for an
 extra 256 syscache?

 Not sure.  I'd rather not, if it's easy to rejigger things so we don't
 have to.  I don't think this is necessarily a hard problem to solve -
 it's just that no one has tried yet.

Now I am inclined to implement a cache mechanism that translates an
ObjectAddress into a security label inside the sepgsql module itself,
rather than using the generic syscache, although it requires a new hook
in PrepareForTupleInvalidation() as Robert suggested in this thread.
Indeed, it seems worthwhile not to allocate memory that goes unused for
90% of users, from the perspective of both startup performance and
resource consumption.

In addition, we could potentially build a cache that is well optimized
for SELinux access control, for example one that reclaims recently
unused entries (see the sketch below). So I'd like to focus on the code
in sepgsql/uavc.c right now.
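
A rough sketch of what such a cache could look like - all names here are
illustrative, not the actual uavc.c code, and the hash-table lookup that
would sit in front of the LRU list is omitted:

    #include <stdlib.h>

    typedef struct LabelCacheEntry
    {
        unsigned int classId;            /* OID of the system catalog */
        unsigned int objectId;           /* OID of the object itself */
        char        *label;              /* cached security label */
        struct LabelCacheEntry *prev;    /* doubly linked LRU list */
        struct LabelCacheEntry *next;
    } LabelCacheEntry;

    static LabelCacheEntry *lru_head;    /* most recently used */
    static LabelCacheEntry *lru_tail;    /* least recently used */
    static int cache_count;

    #define LABEL_CACHE_MAX 256          /* illustrative threshold */

    /* Move an entry to the MRU end whenever it is referenced. */
    static void
    label_cache_touch(LabelCacheEntry *e)
    {
        if (e == lru_head)
            return;
        if (e->prev)
            e->prev->next = e->next;
        if (e->next)
            e->next->prev = e->prev;
        if (e == lru_tail)
            lru_tail = e->prev;
        e->prev = NULL;
        e->next = lru_head;
        if (lru_head)
            lru_head->prev = e;
        lru_head = e;
        if (lru_tail == NULL)
            lru_tail = e;
    }

    /* Reclaim the least recently used entries once the cache grows
     * beyond the threshold. */
    static void
    label_cache_reclaim(void)
    {
        while (cache_count > LABEL_CACHE_MAX && lru_tail != NULL)
        {
            LabelCacheEntry *victim = lru_tail;

            lru_tail = victim->prev;
            if (lru_tail)
                lru_tail->next = NULL;
            else
                lru_head = NULL;
            free(victim->label);
            free(victim);
            cache_count--;
        }
    }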

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 12:51 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 Since we force the subsequent vacuum to also look at the pages scanned and
 pruned by the previous failed vacuum, all the pages that have dead-vacuum line
 pointers would have a new stamp once the vacuum finishes successfully, and
 pg_class would have the same stamp.

That seems a bit fragile.  One of the things we've talked about doing
is skipping pages that are pinned by some other backend.  Now maybe
that would be infrequent enough not to matter... but...

Also, I'm not sure that's the only potential change that would break this.

I think we are better off doing only equality comparisons and dodging
this problem altogether.

 I thought more about it and for a moment believed that we can do this with
 just a bit since we rescan the pages with dead and dead-vacuum line
 pointers after an aborted vacuum, but concluded that a bit or a small
 counter is not good enough since other backends might be running with a
 stale value and would get fooled into believing that they can collect the
 dead-vacuum line pointers before the index pointers are actually removed. We
 can still use a 32-bit counter though since the wrap-around for that is
 practically very large for any backend to still run with such a stale
 counter (you would need more than 1 billion vacuums on the same table in
 between for you to hit this).

I think that's a safe assumption.

 The patch as
 submitted uses lp_off to store a single bit, to distinguish between
 vacuum and dead-vacuumed, but we could actually have (for greater
 safety and debuggability) a 15-byte counter that just wraps around
 from 32,767 to 1.  (Maybe it would be wise to reserve a few counter
 values, or a few bits, or both, for future projects.)  That would
 eliminate the need to touch PageRepairFragmentation() or use the
 special space, since all the information would be in the line pointer
 itself.  Not having to rearrange the page to reclaim dead line
 pointers is appealing, too.

 Not sure if I get you here. We need a mechanism to distinguish between dead
 and dead-vacuum line pointers. How would the counter (which I assume you
 mean 15-bit and not byte) help solve that? Or are you just suggesting
 replacing the LSN with the counter in the page header?

Just-plain-dead line pointers would have lp_off = 0.  Dead-vacuumed
line pointers would have lp_off != 0.  The first vacuum would use
lp_off = 1, the next one lp_off = 2, etc.

Actually, come to think of it, we could fit a 30-bit counter into the
line pointer.  There are 15 unused bits in lp_off and 15 unused bits
in lp_len.
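
To make that packing concrete, here is a rough sketch (not the submitted
patch; the helper functions are hypothetical, though the bitfield layout
matches the real ItemIdData in storage/itemid.h):

    typedef struct ItemIdData
    {
        unsigned    lp_off:15,      /* normally: offset to tuple */
                    lp_flags:2,     /* state of line pointer */
                    lp_len:15;      /* normally: byte length of tuple */
    } ItemIdData;

    #define LP_DEAD 3               /* dead, may or may not have storage */

    /* counter == 0 means just-plain-dead; a nonzero value records which
     * vacuum generation marked the item dead-vacuumed */
    static void
    ItemIdSetDeadVacuumed(ItemIdData *lp, unsigned counter)
    {
        lp->lp_flags = LP_DEAD;
        lp->lp_off = counter & 0x7FFF;          /* low 15 bits */
        lp->lp_len = (counter >> 15) & 0x7FFF;  /* high 15 bits */
    }

    static unsigned
    ItemIdGetVacuumGeneration(const ItemIdData *lp)
    {
        return ((unsigned) lp->lp_len << 15) | lp->lp_off;
    }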

  Is there something in place to make sure that pruning uses an up-to-date
  relindxvacxlogid/off value? I guess it doesn't matter if it's
  out-of-date,
  you'll just miss the opportunity to remove some dead tuples.

 This seems like a tricky problem, because it could cause us to
 repeatedly fail to remove the same dead line pointers, which would be
 poor.  We could do something like this: after updating pg_class,
 vacuum send an interrupt to any backend which holds RowExclusiveLock
 or higher on that relation.  The interrupt handler just sets a flag.
 If that backend does heap_page_prune() and sees the flag set, it knows
 that it needs to recheck pg_class.  This is a bit grotty and doesn't
 completely close the race condition (the signal might not arrive in
 time), but it ought to make it narrow enough not to matter in
 practice.

 I am not too excited about adding that complexity to the code. Even if a
 backend does not have up-to-date value, it will fail to collect the
 dead-vacuum pointers, but soon either it will catch up or some other backend
 will remove them or the next vacuum will take care of it.

If we use a counter that is large enough that we don't have to worry
about wrap-around, I guess that's OK, though it seems a little weird
to think about having different backends running with different ideas
about the correct counter value.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 3:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I think the real challenge is going to be testing.  If anyone has a
 machine with weak memory ordering they can give me access to, that
 would be really helpful for flushing the bugs out of this stuff.

 There are multi-CPU PPCen in the buildfarm, or at least there were last
 time I broke the sinval code ;-).  Note that testing on a single-core
 PPC will prove nothing.

Yeah, I was just thinking about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Pavan Deolasee
On Thu, Jul 21, 2011 at 4:01 PM, Robert Haas robertmh...@gmail.com wrote:



 I think we are better off doing only equality comparisons and dodging
 this problem altogether.


Fair enough.



 Just-plain-dead line pointers would have lp_off = 0.  Dead-vacuumed
 line pointers would have lp_off != 0.  The first vacuum would use
 lp_off = 1, the next one lp_off = 2, etc.

 Actually, come to think of it, we could fit a 30-bit counter into the
 line pointer.  There are 15 unused bits in lp_off and 15 unused bits
 in lp_len.


That's clever! I think we can go down this path and completely avoid any special
area or additional header fields.



 If we use a counter that is large enough that we don't have to worry
 about wrap-around, I guess that's OK, though it seems a little weird
 to think about having different backends running with different ideas
 about the correct counter value.


I think that's fine. For example, every backend runs with a different
RecentXmin today and that doesn't impact any functionality. It only limits
how much they can prune at any given time. The same would happen with a
stale counter.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Noah Misch
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 For the last week or so, in between various other tasks, I've been
 trying to understand why performance drops off when you run pgbench -n
 -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
 counts.  The answer, in a word, is SIGetDataEntries().  I believe we
 need to bite the bullet and rewrite this using a lock-free algorithm,
 using memory barriers on processors with weak memory ordering.
 Perhaps there is another way to do it, but nothing I've tried has
 really worked so far, and I've tried quite a few things.  Here's the
 data.
 
 On unpatched master, performance scales pretty much linearly out to 32
 clients.  As you add more clients, it drops off:

 [80 clients]
 tps = 132518.586371 (including connections establishing)
 tps = 130968.749747 (including connections establishing)
 tps = 132574.338942 (including connections establishing)

 [80 clients, with lazy vxid locks and sinval-unlocked]
 tps = 203256.701227 (including connections establishing)
 tps = 190637.957571 (including connections establishing)
 tps = 190228.617178 (including connections establishing)

Nice numbers.  The sinval-unlocked.patch implementation looks like it's taking a
sound direction.

In
http://archives.postgresql.org/message-id/ca+tgmobbxmh_9zjudheswo6m8sbmb5hdzt+3chcluv5eztv...@mail.gmail.com,
you quoted 210k TPS when you stubbed out AcceptInvalidationMessages().  Is it
correct to conclude that AcceptInvalidationMessages() still reduces the
transaction rate by 5-10% with this stack of patches?

-- 
Noah Misch                                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and log file output on Windows

2011-07-21 Thread Bruce Momjian

I started thinking more about this --- we already allow multiple
processes to write to a single file when running Postgres as a backend
on Windows.  I think it is these open() flags that make it possible:

(FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE),

However, we are not going to be able to make all shell and command-line
tools use those flags, so the best solution seems to just avoid the
extra logging.  I just added a C comment to pg_upgrade to document this.
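
For illustration, here is roughly what opening a file with those share
flags looks like at the Win32 level (a sketch only; the helper name and
error handling are made up, but the flags are the ones above):

    #include <windows.h>

    /* Open a log file so that other cooperating processes can also read,
     * write, and delete it.  Shell redirection does not open files with
     * these share flags, which is why ">> logfile" from multiple
     * processes fails on Windows. */
    static HANDLE
    open_shared_log(const char *path)
    {
        return CreateFileA(path,
                           GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           NULL,                    /* default security */
                           OPEN_ALWAYS,             /* create if missing */
                           FILE_ATTRIBUTE_NORMAL,
                           NULL);                   /* no template file */
    }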

---

bruce wrote:
 
 I have fixed the bug below with the attached patches for 9.0, 9.1, and
 9.2.  I did a minimal patch for 9.0 and 9.1, and a more thorough patch
 for 9.2.
 
 The use of -l/log was tested originally in pg_migrator (for 8.4) and
 reported to be working, but it turns out it only worked in 'check' mode,
 and threw an error during actual upgrade.  This bug was reported to me
 recently by EnterpriseDB testing of PG 9.0 on Windows.  The 9.2 C
 comments should make it clear that Windows doesn't allow multiple
 processes to write to the same file and should avoid future breakage.
 
 ---
 
 Bruce Momjian wrote:
  Has anyone successfully used pg_upgrade 9.0 with -l (log) on Windows?
  
  I received a private email bug report that pg_upgrade 9.0 does not work
  with the -l/log option on Windows.  The error is:
  
  Analyzing all rows in the new cluster
  "c:/MinGW/msys/1.0/home/edb/inst/bin/vacuumdb" --port 55445 --username
  "edb" --all --analyze
  >> "c:/MinGW/msys/1.0/home/edb/auxschedule/test.log" 2>&1
  The process cannot access the file because it is being used by another
  process.
 
  What has me confused is this same code exists in pg_migrator, which was
  fixed to work with -l on Windows by Hiroshi Saito with this change:
  
  /*
   * On Win32, we can't send both server output and pg_ctl output
   * to the same file because we get the error:
   * "The process cannot access the file because it is being used by
   * another process."
   * so we have to send pg_ctl output to 'nul'.
   */
  sprintf(cmd, SYSTEMQUOTE "\"%s/pg_ctl\" -l \"%s\" -D \"%s\" "
               "-o \"-p %d -c autovacuum=off -c "
               "autovacuum_freeze_max_age=2000000000\" "
               "start >> \"%s\" 2>&1" SYSTEMQUOTE,
          bindir, ctx->logfile, datadir, port,
  #ifndef WIN32
          ctx->logfile);
  #else
          DEVNULL);
  #endif
  
  The fix was not to use the same log file and output file for pg_ctl. 
  But as you can see, the pg_ctl and vacuumdb code is unchanged:
  
  prep_status(ctx, "Analyzing all rows in the new cluster");
  exec_prog(ctx, true,
            SYSTEMQUOTE "\"%s/vacuumdb\" --port %d --username \"%s\" "
            "--all --analyze >> %s 2>&1" SYSTEMQUOTE,
            ctx->new.bindir, ctx->new.port, ctx->user, ctx->logfile);
  
  I can't figure out if there is something odd about this user's setup or
  if there is a bug in pg_upgrade with -l on Windows.
  
  -- 
Bruce Momjian  br...@momjian.ushttp://momjian.us
EnterpriseDB http://enterprisedb.com
  
+ It's impossible for everything to be true. +
  
  -- 
  Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-hackers
 
 -- 
   Bruce Momjian  br...@momjian.ushttp://momjian.us
   EnterpriseDB http://enterprisedb.com
 
   + It's impossible for everything to be true. +


 diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
 new file mode 100644
 index 1a515e7..37d2eed
 *** a/contrib/pg_upgrade/pg_upgrade.c
 --- b/contrib/pg_upgrade/pg_upgrade.c
 *** prepare_new_cluster(migratorContext *ctx
 *** 161,168 ****
     prep_status(ctx, "Analyzing all rows in the new cluster");
     exec_prog(ctx, true,
               SYSTEMQUOTE "\"%s/vacuumdb\" --port %d --username \"%s\" "
 !             "--all --analyze >> %s 2>&1" SYSTEMQUOTE,
 !             ctx->new.bindir, ctx->new.port, ctx->user, ctx->logfile);
     check_ok(ctx);
 
     /*
 --- 161,174 ----
     prep_status(ctx, "Analyzing all rows in the new cluster");
     exec_prog(ctx, true,
               SYSTEMQUOTE "\"%s/vacuumdb\" --port %d --username \"%s\" "
 !             "--all --analyze >> \"%s\" 2>&1" SYSTEMQUOTE,
 !             ctx->new.bindir, ctx->new.port, ctx->user,
 ! #ifndef WIN32
 !             ctx->logfile
 ! #else
 !             DEVNULL
 ! #endif
 !             );
     check_ok(ctx);
 
     /*
 *** prepare_new_cluster(migratorContext *ctx
 *** 174,181 ****
     prep_status(ctx, "Freezing all rows on the new cluster");


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Jim Nasby
On Jul 19, 2011, at 4:20 PM, David E. Wheeler wrote:
 On Jul 19, 2011, at 2:06 PM, Josh Berkus wrote:
 
 I am strongly in favor of having a *timezone* data type and some system
 whereby we can uniquely identify timezones in the Zic database.
 
 CREATE OR REPLACE FUNCTION is_timezone(
tz CITEXT
 ) RETURNS BOOLEAN LANGUAGE plpgsql STABLE AS $$
...
 CREATE DOMAIN timezone AS CITEXT CHECK ( is_timezone( VALUE ) );

Storing giant globs of text with every timestamp field is really ugly.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Jim Nasby
On Jul 19, 2011, at 11:22 AM, Ian Caulfield wrote:
 On 19 July 2011 17:11, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Josh Berkus j...@agliodbs.com wrote:
 The timestamp and the timezone in which that timestamp was
 entered are two separate pieces of data and *ought* to be in two
 separate fields.
 
 So, if you're grabbing a timestamp and the time zone for it, how
 do you ensure you've done that atomically if you're at the
 boundary of a DST change?
 
 In my view of the world, the timezone that you are in is not an
 object that changes across a DST boundary.
 
 You're right -- the moment in time should be fixed like in the
 current PostgreSQL timestamp with time zone, and the time zone
 doesn't change with DST.  Not an intentional red herring, but
 definitely some muddy thinking there.
 
 There was an earlier point made that if someone puts eg 5pm local time
 two years in the future into the database, and then the DST boundary
 gets moved subsequently, some applications would like the value to
 still say 5pm local time, even though that means it now refers to a
 different point in absolute time - this potentially seems like a
 useful feature. Retroactive timezone changes wouldn't make a lot of
 sense in this case though...

Right; and timezones aren't supposed to change retroactively. The ZIC database 
is specifically set up so that it knows the history of TZ changes and deals with 
the past correctly.

 I guess there are three concepts of time here - an absolute fixed time
 with no reference to a timezone, a time with a timezone that is still
 set as a fixed point in time, or a local time in a specific timezone
 that would move if the timezone definition changed.

Or, another way to put the third class: a timestamp that remembers what its 
original timezone was, so that you can refer to it in a common timezone (such as 
UTC), OR you can refer to it at its original, local time. That's our exact 
need for this: we have different businesses that operate in different 
timezones. Generally, we only care about things in local time, but there are 
cases (such as event logging) where we could care about local *OR* unified time.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread David E. Wheeler
On Jul 21, 2011, at 2:39 PM, Jim Nasby wrote:

 CREATE OR REPLACE FUNCTION is_timezone(
   tz CITEXT
 ) RETURNS BOOLEAN LANGUAGE plpgsql STABLE AS $$
 ...
 CREATE DOMAIN timezone AS CITEXT CHECK ( is_timezone( VALUE ) );
 
 Storing giant globs of text with every timestamp field is really ugly.

You work with what you've got.

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Jim Nasby
On Jul 19, 2011, at 4:06 PM, Josh Berkus wrote:
 I have my doubts about that, and I hope not.  These details haven't been
 discussed at all; I only started this thread to get community approval
 on cataloguing the TZs.
 
 I am strongly in favor of having a *timezone* data type and some system
 whereby we can uniquely identify timezones in the Zic database.  That
 would be tremendously useful for all sorts of things.  I'm just
 asserting that those who want a composite timestamp+saved-time-zone data
 type have not thought about all of the complications involved.

Having to deal with timezones completely separately from their timestamps is a 
huge PITA. That said, if we had a timezone datatype there's at least the 
possibility of using a composite type to deal with all of this. Or at least we 
can just create a custom datatype using existing tools... the only part of this 
that I see that actually requires closer core support is the timezone data 
itself.

So if the community is OK with adding a timezone datatype then we can focus on 
that and leave the timestamptztz data type as an add-on (at least assuming we 
don't run into any gotchas).

Alvaro, please speak up if there are any technical issues here that I've missed.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Tom Lane
Jim Nasby j...@nasby.net writes:
 On Jul 19, 2011, at 11:22 AM, Ian Caulfield wrote:
 There was an earlier point made that if someone puts eg 5pm local time
 two years in the future into the database, and then the DST boundary
 gets moved subsequently, some applications would like the value to
 still say 5pm local time, even though that means it now refers to a
 different point in absolute time - this potentially seems like a
 useful feature. Retroactive timezone changes wouldn't make a lot of
 sense in this case though...

 Right; and timezones aren't supposed to change retroactively. The ZIC 
 database is specifically set up so that it knows the history of TZ changes and 
 deals with the past correctly.

You haven't noticed that at least two or three times a year, there are
historical corrections in the ZIC database?  The mapping between local
time and UTC might be less likely to change for a time instant in the
past than one in the future, but it would be folly to assume that it's
immutable in either direction.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: new contrib module plpgsql's embeded sql validator

2011-07-21 Thread Jim Nasby
On Jul 19, 2011, at 10:51 PM, Pavel Stehule wrote:
 If you mean that such checks would be done automatically, no, they
 shouldn't be.  Consider a function that creates a table and then uses
 it, or even just depends on using a table that doesn't yet exist when
 you do CREATE FUNCTION.
 
 Yes, any deep check is impossible for a function that uses temporary tables.
 
 plpgsql_lint is not a silver bullet - for these cases it is necessary to
 disable the lint.
 
 I can't speak generally - I have no idea what percentage of functions
 access temporary tables - in my last project I used zero temp tables in
 ca. 300 KB of plpgsql code.
 
 The worse problem is the new dependencies between functions. I use
 a workaround - something like headers.

You can work around temp table issues the same way: just define the temp table 
before you create the function.

In practice, if I have a function that depends on a temp table it either 
creates it itself if it doesn't already exist or I have a separate function to 
create the table; that way you have a single place that has the temp table 
definition, and that is in the database itself.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 4:54 PM, Noah Misch n...@2ndquadrant.com wrote:
 On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 For the last week or so, in between various other tasks, I've been
 trying to understand why performance drops off when you run pgbench -n
 -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
 counts.  The answer, in a word, is SIGetDataEntries().  I believe we
 need to bite the bullet and rewrite this using a lock-free algorithm,
 using memory barriers on processors with weak memory ordering.
 Perhaps there is another way to do it, but nothing I've tried has
 really worked so far, and I've tried quite a few things.  Here's the
 data.

 On unpatched master, performance scales pretty much linearly out to 32
 clients.  As you add more clients, it drops off:

 [80 clients]
 tps = 132518.586371 (including connections establishing)
 tps = 130968.749747 (including connections establishing)
 tps = 132574.338942 (including connections establishing)

 [80 clients, with lazy vxid locks and sinval-unlocked]
 tps = 203256.701227 (including connections establishing)
 tps = 190637.957571 (including connections establishing)
 tps = 190228.617178 (including connections establishing)

 Nice numbers.  The sinval-unlocked.patch implementation looks like it's 
 taking a
 sound direction.

 In
 http://archives.postgresql.org/message-id/ca+tgmobbxmh_9zjudheswo6m8sbmb5hdzt+3chcluv5eztv...@mail.gmail.com,
 you quoted 210k TPS when you stubbed out AcceptInvalidationMessages().  Is it
 correct to conclude that AcceptInvalidationMessages() still reduces the
 transaction rate by 5-10% with this stack of patches?

Good question - I have not tested.

One idea I just had... if we use a 64-bit counter for maxMsgNum, maybe
we could make AcceptInvalidationMessages() a macro, something like
this:

    if (MyProcState->nextMsgNum != shmInvalState->maxMsgNum)
        ReallyAcceptInvalidationMessages();

That ought to be extremely cheap and - if we use 64-bit counters for
the message-number counters - safe.  You might object that the load of
maxMsgNum might migrate backward, but it can't possibly back up any
further than the preceding lock acquisition, since that's required to
be a full memory barrier on every architecture.  And if we haven't
acquired a relevant lock, then a relevant sinval message could show up
an instant after we check, regardless of the implementation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Florian Pflug
On Jul21, 2011, at 21:15 , Robert Haas wrote:
 On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 ... On these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.
 
 Right, and the reason that a spinlock fixes it is that we have memory
 barrier instructions built into the spinlock code sequences on machines
 where it matters.
 
 To get to the point where we could do the sort of optimization Robert
 is talking about, someone will have to build suitable primitives for
 all the platforms we support.  In the cases where we use gcc ASM in
 s_lock.h, it shouldn't be too hard to pull out the barrier
 instruction(s) ... but on platforms where we rely on OS-supplied
 functions, some research is going to be needed.
 
 Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
 on a backend-private slock_t should work anywhere that PostgreSQL
 works at all[1]. That will probably be slower than a memory fence
 instruction and certainly slower than a compiler barrier, but the
 point is that - right now - we're doing it the slow way everywhere.

As I discovered while playing with various lockless algorithms to
improve our LWLocks, spin locks aren't actually a replacement for
a (full) barrier.

Lock acquisition only really needs to guarantee that loads and stores
which come after the acquisition operation in program order (i.e., in
the instruction stream) aren't globally visible before that operation
completes. This kind of barrier behaviour is often fittingly called
acquire barrier.

Similarly, a lock release operation only needs to guarantee that loads
and stores which occur before that operation in program order are
globally visible before the release operation completes. This, again,
is fittingly called release barrier.

Now assume the following code fragment

global1 = 1;
SpinLockAcquire();
SpinLockRelease();
global2 = 1;

If SpinLockAcquire() has acquire barrier semantics, and SpinLockRelease()
has release barrier semantics, then it's possible for the store to global1
to be delayed until after SpinLockAcquire(), and similarly for the store
to global2 to be executed before SpinLockRelease() completes. In other
words, what happens is

SpinLockAcquire();
global1 = 1;
global2 = 1;
SpinLockRelease();

But once that can happen, there's no reason that it couldn't also be

SpinLockAcquire();
global2 = 1;
global1 = 1;
SpinLockRelease();

I didn't check if any of our spin lock implementations is actually affected
by this, but it doesn't seem wise to rely on them being full barriers, even
if it may be true today.
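
For anyone who wants to see the two barrier flavors spelled out in code,
here is a minimal sketch using C11 atomics (which postdate this mail and
are used here purely for illustration):

    #include <stdatomic.h>

    static atomic_int lock;     /* 0 = free, 1 = held */
    static int global1, global2;

    void
    fragment(void)
    {
        global1 = 1;            /* may sink below the acquire barrier */

        /* acquire barrier: later accesses cannot be hoisted above it */
        while (atomic_exchange_explicit(&lock, 1, memory_order_acquire))
            ;                   /* spin */

        /* release barrier: earlier accesses cannot sink below it */
        atomic_store_explicit(&lock, 0, memory_order_release);

        global2 = 1;            /* may be hoisted above the release barrier */
    }

Once both stores can migrate into the critical section, nothing orders
them relative to each other; only a full barrier between them would.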

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Christopher Browne
On Thu, Jul 21, 2011 at 5:48 PM, Jim Nasby j...@nasby.net wrote:
 On Jul 19, 2011, at 4:06 PM, Josh Berkus wrote:
 I have my doubts about that, and I hope not.  These details haven't been
 discussed at all; I only started this thread to get community approval
 on cataloguing the TZs.

 I am strongly in favor of having a *timezone* data type and some system
 whereby we can uniquely identify timezones in the Zic database.  That
 would be tremendously useful for all sorts of things.  I'm just
 asserting that those who want a composite timestamp+saved-time-zone data
 type have not thought about all of the complications involved.

 Having to deal with timezone's completely separate from their timestamps is a 
 huge PITA. That said, if we had a timezone datatype there's at least the 
 possibility of using a composite type to deal with all of this. Or at least 
 we can just create a custom datatype using existing tools... the only part of 
 this that I see that actually requires closer core support is the timezone 
 data itself.

 So if the community is OK with adding a timezone datatype then we can focus 
 on that and leave the timestamptztz data type as an add-on (at least assuming 
 we don't run into any gotchas).

As I have been watching this whole thread, my inclination has been to
look at this from a Prolog perspective, where we think about the
database as indicating a series of assertions about facts, from which
we then try to reason.

I suspect that determining what *really* needs to get recorded depends
on this.  And it seems to me that trying to head down the path of
defining oid-based lookups of timezone names may be putting the cart
before the horse.

There are a number of facts about a timestamp:

1.  What time did the database server think it was?

SELECT NOW();

captures the database's concept of what time it was, complete with:
a) The time, based, I think, on UT1.  With the caveat that there's no
certainty that the DB server's time is necessarily correct.
b) An encoding of the timezone offset based on the value of the
TimeZone GUC for this connection.

If one is running an NTP daemon, pointing to a decently-connected
network of NTP servers, then it's likely that this time is pretty
accurate.  And most of the time, I'd be inclined to treat this as
authoritative, and contend that anything else is likely to be less
correct and less easy to work with.

The goal of this discussion thread is to record another timestamp with
a different basis.  It's not entirely clear what that basis is.  I'll
suggest one, which mostly underlines my contention that it's likely
less correct and less easy to work with than having a column
defined as...
   some_timestamp timestamp with timezone default NOW()

2.  Client-based timestamp, comprising two things:

a) A time, ascertained by the client.
b) A timezone, ascertained by the client.

Note that timezones are pretty open-ended.  There is an authoritative
encoding defined in the tz database, but there are other values used
out there.  We had to patch Slony-I to have it use 'ISO' timestamps,
and recommend running in GMT/UTC, because there are values that blow
things up.

For instance, on AIX, there is a habit for boxes to set TZ=CUT0, out
of the box, which isn't on what PostgreSQL considers to be the
official list.

On the more whimsical side of things, Joey Hess, a Debian developer
noted for such things as ikiwiki, etckeeper, git-annex, decided to
create his very own custom timezone, JEST, because he was irritated
about DST.

http://kitenet.net/~joey/blog/entry/howto_create_your_own_time_zone/
http://kitenet.net/~joey/blog/entry/JEST_results/

That whimsical entry won't be going into tzdata, and while we could
discount this case as whimsy, it's not out of the question for
organizations such as nation states to legislate their own zones, which
we can't be certain will necessarily get into tzdata.

There are enough aliases and possibilities of local national decisions
to make it at least somewhat troublesome to treat this as something
that can be considered fixed down to the OID level.

My conclusion would be that if someone is really, really, really keen
on capturing their own notion of timezone, then this fits with the
notion that, if they want to have something that could be treated as
remotely authoritative, they should capture a multiplicity of pieces
of datestamp information, and actively accept that this will be pretty
duplicative.

- I'd commend capturing NOW() in a timestamptz field.  That gives you:
1.  What time the DB server thought it was, in terms of UT1
2.  What timezone it thought was tied to that connection.
- Also, I'd be inclined to capture, in plain text form:
3.  A client-recorded timestamp.  I'm agnostic as to whether this has
*any* validation done on it; I'd think it plausible that this is
simply a text field that might require a human to interpret it.
4.  A client-recorded timezone.  This would be a plain text field, and
I'm not certain it's of any 

Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Dan Ports
On Thu, Jul 21, 2011 at 02:31:15PM -0400, Robert Haas wrote:
 1. Machines with strong memory ordering.  On this category of machines
 (which include x86), the CPU basically does not perform loads or
 stores out of order.  On some of these machines, it is apparently
 possible for there to be some ordering of stores relative to loads,
 but if the program stores two values or loads two values, those
 operations will be performed in the same order they appear in the
 program. 

This is all correct, but...

 The main thing you need to make your code work reliably on
 these machines is a primitive that keeps the compiler from reordering
 your code during optimization. 

If you're suggesting that hardware memory barriers aren't going to be
needed to implement lock-free code on x86, that isn't true. Because a
read can be reordered with respect to a write to a different memory
location, you can still have problems. So you do still need memory
barriers, just fewer of them.

Dekker's algorithm is the classic example: two threads each set a flag
and then check whether the other thread's flag is set. In any
sequential execution, at least one should see the other's flag set, but
on the x86 that doesn't always happen. One thread's read might be
reordered before its write.
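
A bare-bones sketch of that hazard (the names are made up; the point is
the store-then-load pattern):

    /* Both threads may observe the other's flag as 0 unless a full
     * memory barrier (mfence, or any locked instruction) separates the
     * store from the load. */
    volatile int flag[2];

    void
    enter(int self)             /* self is 0 or 1 */
    {
        flag[self] = 1;
        /* a full barrier is needed here even on x86: the load below can
         * be satisfied before the store above is globally visible */
        if (flag[1 - self] == 0)
        {
            /* "alone" - yet without the barrier, both threads can get here */
        }
    }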

 2. Machines with weak memory ordering.  On this category of machines
 (which includes PowerPC, Dec Alpha, and maybe some others), the CPU
 reorders memory accesses arbitrarily unless you explicitly issue
 instructions that enforce synchronization.  You still need to keep the
 compiler from moving things around, too.  Alpha is particularly
 pernicious, because something like a->b can fetch the pointed-to value
 before loading the pointer itself.  This is otherwise known as "we
 have basically no cache coherency circuits on this chip at all".  On
 these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.

The Alpha is pretty much unique (thankfully!) in allowing dependent
reads to be reordered. That makes it even weaker than the typical
weak-ordering machine. Since reading a pointer and then dereferencing
it is a pretty reasonable thing to do regularly in RCU code, you
probably don't want to emit barriers in between on architectures where
it's not actually necessary. That argues for another operation that's
defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
Certainly the Linux kernel found it useful to do so
(read_barrier_depends)

Alternatively, one might question how important it is to support the
Alpha these days...

Dan

-- 
Dan R. K. Ports              MIT CSAIL              http://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Noah Misch
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 Profiling this combination of patches reveals that there is still some
 pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
 to me that on x86, we really don't need this lock ... or
 SInvalReadLock ... or a per-backend mutex.  The whole of
 SIGetDataEntries() can pretty easily be made lock-free.  The only real
 changes that seem to be needed are (1) to use a 64-bit counter, so
 you never need to decrement

On second thought, won't this be inadequate on 32-bit systems, where updating
the 64-bit counter produces two stores?  You must avoid reading it between those
stores.

-- 
Noah Misch                                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Florian Pflug
On Jul21, 2011, at 03:46 , Robert Haas wrote:
 Profiling this combination of patches reveals that there is still some
 pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
 to me that on x86, we really don't need this lock ... or
 SInvalReadLock ... or a per-backend mutex.  The whole of
 SIGetDataEntries() can pretty easily be made lock-free.  The only real
 changes that seem to be needed are (1) to use a 64-bit counter, so
 you never need to decrement and (2) to recheck resetState after
 reading the entries from the queue, to see if we got reset while we
 were reading those entries.  Since x86 guarantees that writes will
 become visible in the order they are executed, we only need to make
 sure that the compiler doesn't rearrange things.  As long as we first
 read the maxMsgNum and then read the messages, we can't read garbage.
 As long as we read the messages before we check resetState, we will be
 sure to notice if we got reset before we read all the messages (which
 is the only way that we can have read garbage messages).

Sounds sensible. There's one additional hazard though - you'll also
need the reads to be atomic. x86 guarantees that for up to 32-bit (i386)
respectively 64-bit (x64) loads, but only for reads from properly aligned
addresses (4-byte alignment for 4-byte reads, 8-byte for 8-byte reads).

I found that out the hard way a few days ago, again while playing with
different LWLock implementations, when I botched my test setup and
the proc array entries ended up being misaligned. Boy, was it fun
to debug the random crashes caused by non-atomic pointer reads...

If we widen the counter to 64-bit, reading it atomically on i386 becomes
a bit of a challenge, but is doable too. From what I remember, there are
two options. You can use the 8-byte compare-and-exchange operation (see
the sketch below), though it might be that only quite recent CPUs support
it. The other option seems to be to use floating-point instructions. I
believe the latter is what Intel's own Threading Building Blocks library
does, but I'd have to re-check to be sure. It might also be that, once
you start using floating-point instructions, you find that you actually
do need fencing instructions even on x86. I don't know whether the weaker
ordering affects only SIMD instructions or all floating-point stuff...
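
A sketch of the compare-and-exchange trick, assuming GCC's __sync
builtins and a CPU that has the 8-byte cmpxchg instruction:

    #include <stdint.h>

    /* Atomically read a 64-bit value on 32-bit x86.  A CAS with
     * old == new == 0 never modifies *p: if *p was 0 it swaps 0 for 0,
     * otherwise it fails - but either way it returns the current
     * contents as one atomic 8-byte operation. */
    static uint64_t
    atomic_read_u64(volatile uint64_t *p)
    {
        return __sync_val_compare_and_swap(p, (uint64_t) 0, (uint64_t) 0);
    }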

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Jim Nasby
On Jul 21, 2011, at 5:30 PM, Christopher Browne wrote:
 - I'd commend capturing NOW() in a timestamptz field.  That gives you:
 1.  What time the DB server thought it was, in terms of UT1
 2.  What timezone it thought was tied to that connection.

Except that it doesn't, and that's exactly the problem I'm trying to solve 
here. I want to know what timezone we were using when we put a value into 
timestamptz, which then got converted to UT1. Without a reliable way to store 
what the timezone *was* at that time, we have no way to go back to it.

Now, we can debate whether it makes more sense to store the original time 
without conversion to UT1, or whether we should store the time after converting 
it to UT1 (or whether we should offer both options), but that debate is 
pointless without a good way to remember what timezone it started out in.

Arguably, we could just create an add-on data type for storing that timezone 
information, but that seems pretty daft to me: you're stuck either storing raw 
text which takes what should be a 12 byte datatype up to a 20-30 byte type (8 
byte timestamp + varlena + text of timezone name), or you end up with major 
problems trying to keep an enum in sync with what the database has available in 
its ZIC database.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] spinlock contention

2011-07-21 Thread Florian Pflug
On Jul18, 2011, at 04:36 , Robert Haas wrote:
 On Fri, Jul 8, 2011 at 6:02 AM, Florian Pflug f...@phlo.org wrote:
 I don't want to fiddle with your git repo, but if you attach a patch
 that applies to the master branch I'll give it a spin if I have time.
 
 Patch attached.
 
 Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
 spin lock instead of locked xadd to increment the shared counters.
 
 [ Back from vacation, catching up on email. ]
 
 gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5)
 
 pgbench -n -S -T 180 -c 32 -j 32
 
 with lwlock-part patch:
 tps = 36974.644091 (including connections establishing)
 
 unpatched cd34647c666be867f95ef8fc0492c30356043f10:
 tps = 39432.064293 (including connections establishing)
 
 And with -c 8 -j 8:
 
 tps = 26946.202428 (including connections establishing)
 tps = 27206.507424 (including connections establishing)

:-( That's disappointing, to say the least.

I also completely fail to understand what the heck is going on there.

I mean, you did conclusively prove that commenting out the SInval stuff
made a huge difference. There's also supposed to be hardly any invalidation
going on during a pgbench -S run. So, since the patch removes two of the
three spin-lock acquisitions from SIGetDataEntries() (so long as there are
no exclusive lockers of SInvalReadLock), there should be some effect.
Or so I'd think at least...

If anyone has a theory, I'd love to hear it.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 6:22 PM, Florian Pflug f...@phlo.org wrote:
 On Jul21, 2011, at 21:15 , Robert Haas wrote:
 On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 ... On these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.

 Right, and the reason that a spinlock fixes it is that we have memory
 barrier instructions built into the spinlock code sequences on machines
 where it matters.

 To get to the point where we could do the sort of optimization Robert
 is talking about, someone will have to build suitable primitives for
 all the platforms we support.  In the cases where we use gcc ASM in
 s_lock.h, it shouldn't be too hard to pull out the barrier
 instruction(s) ... but on platforms where we rely on OS-supplied
 functions, some research is going to be needed.

 Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
 on a backend-private slock_t should work anywhere that PostgreSQL
 works at all[1]. That will probably be slower than a memory fence
 instruction and certainly slower than a compiler barrier, but the
 point is that - right now - we're doing it the slow way everywhere.

 As I discovered while playing with various lockless algorithms to
 improve our LWLocks, spin locks aren't actually a replacement for
 a (full) barrier.

 Lock acquisition only really needs to guarantee that loads and stores
 which come after the acquisition operation in program order (i.e., in
 the instruction stream) aren't globally visible before that operation
 completes. This kind of barrier behaviour is often fittingly called
 acquire barrier.

 Similarly, a lock release operation only needs to guarantee that loads
 and stores which occur before that operation in program order are
 globally visible before the release operation completes. This, again,
 is fittingly called release barrier.

 Now assume the following code fragment

 global1 = 1;
 SpinLockAcquire();
 SpinLockRelease();
 global2 = 1;

 If SpinLockAcquire() has acquire barrier semantics, and SpinLockRelease()
 has release barrier semantics, then it's possible for the store to global1
 to be delayed until after SpinLockAcquire(), and similarly for the store
 to global2 to be executed before SpinLockRelease() completes. In other
 words, what happens is

 SpinLockAcquire();
 global1 = 1;
 global2 = 1;
 SpinLockRelease();

 But once that can happen, there's no reason that it couldn't also be

 SpinLockAcquire();
 global2 = 1;
 global1 = 1;
 SpinLockRelease();

 I didn't check if any of our spin lock implementations is actually affected
 by this, but it doesn't seem wise to rely on them being full barriers, even
 if it may be true today.

Hmm.  I'm not worried about that.  AFAIK, only IA64 has such an
implementation, and our existing spinlock implementation doesn't use
it.  If we were to add something like that in the future, we'd
presumably know that we were doing it, and would add the appropriate
memory barrier primitive at the same time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 6:44 PM, Dan Ports d...@csail.mit.edu wrote:
 If you're suggesting that hardware memory barriers aren't going to be
 needed to implement lock-free code on x86, that isn't true. Because a
 read can be reordered with respect to a write to a different memory
 location, you can still have problems. So you do still need memory
 barriers, just fewer of them.

 Dekker's algorithm is the classic example: two threads each set a flag
 and then check whether the other thread's flag is set. In any
 sequential execution, at least one should see the other's flag set, but
 on the x86 that doesn't always happen. One thread's read might be
 reordered before its write.

In the case of sinval, what we need to do for SIGetDataEntries() is,
approximately, a bunch of loads, followed by a store to one of the
locations we loaded (which no one else can have written meanwhile).
So I think that's OK.

In SIInsertDataEntries(), what we need to do is, approximately, take a
lwlock, load from a location which can only be written while holding
the lwlock, do a bunch of stores, ending with a store to that first
location, and release the lwlock.  I think that's OK, too.

 2. Machines with weak memory ordering.  On this category of machines
 (which includes PowerPC, Dec Alpha, and maybe some others), the CPU
 reorders memory accesses arbitrarily unless you explicitly issue
 instructions that enforce synchronization.  You still need to keep the
 compiler from moving things around, too.  Alpha is particularly
 pernicious, because something like a->b can fetch the pointed-to value
 before loading the pointer itself.  This is otherwise known as "we
 have basically no cache coherency circuits on this chip at all".  On
 these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.

 The Alpha is pretty much unique (thankfully!) in allowing dependent
 reads to be reordered. That makes it even weaker than the typical
 weak-ordering machine. Since reading a pointer and then dereferencing
 it is a pretty reasonable thing to do regularly in RCU code, you
 probably don't want to emit barriers in between on architectures where
 it's not actually necessary. That argues for another operation that's
 defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
 Certainly the Linux kernel found it useful to do so
 (read_barrier_depends)

 Alternatively, one might question how important it is to support the
 Alpha these days...

Well, currently, we do, so we probably don't want to drop support for
that without some careful thought.  I searched the archive and found
someone trying to compile 8.3.something on Alpha just a few years ago,
so it's apparently not totally dead yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch n...@2ndquadrant.com wrote:
 On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 Profiling this combination of patches reveals that there is still some
 pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
 to me that on x86, we really don't need this lock ... or
 SInvalReadLock ... or a per-backend mutex.  The whole of
 SIGetDataEntries() can pretty easily be made lock-free.  The only real
 changes that seem to be needed are (1) to use a 64-bit counter, so
 you never need to decrement

 On second thought, won't this be inadequate on 32-bit systems, where updating
 the 64-bit counter produces two stores?  You must avoid reading it between 
 those
 stores.

Now that is a potentially big problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql: bogus descriptions displayed by \d+

2011-07-21 Thread Josh Kupershmidt
On Sun, Jul 17, 2011 at 10:54 AM, Josh Kupershmidt schmi...@gmail.com wrote:
 On Sat, Jul 16, 2011 at 12:49 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 After a bit of review of the archives, the "somebody" was me:
 http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=b7d67954456f15762c04e5269b64adc88dcd0860

 and this thread was the discussion about it:
 http://archives.postgresql.org/pgsql-hackers/2009-12/msg01982.php

 It looks like we thought about pg_dump, but did not think about psql.

 Ah, interesting. I didn't even know this functionality existed. And I
 think there is some documentation lacking; in the 8.4 doc page:

Here's a small patch against branch 8.4 to mention support for COMMENT
ON index_name.column_name.

Also, a patch against master to:
 * get rid of the bogus Description outputs for \d+ sequence_name
and \d+ index_name
 * clarify in the COMMENT ON doc page that a table _or view_ name may
be used for comments on columns, rules, and triggers. If we allowed
constraints on views, we could have just put in a note explaining that
table_name.column_name applies to tables and views, but constraints
are the odd man out.
 * slightly reordered the listing of the first bunch of Parameters on
that page so that agg_name comes first, as it does in the Synopsis
section

I noticed that the synopsis for CREATE RULE:
  http://www.postgresql.org/docs/9.1/static/sql-createrule.html

uses the term table, which could be a similar omission. However, on
that page the first sentence of the description specifies table or
view so it might be fine as-is.

And while I'm messing with this, some further nitpicks about psql not
addressed by these patches:
 * The Storage column for \d+ sequence_name is correct, I suppose,
but repetitive
 * The Type column for \dv+ view_name, \di+ index_name, \ds+
sequence_name, etc. seems borderline useless... shouldn't you know
what type you're looking at based on the backslash command you're
using? Plus the table heading could be more specific than "List of
relations", e.g. "List of views".

Josh
diff --git a/doc/src/sgml/ref/comment.sgml b/doc/src/sgml/ref/comment.sgml
index ab12614..58a2f02 100644
*** a/doc/src/sgml/ref/comment.sgml
--- b/doc/src/sgml/ref/comment.sgml
*** COMMENT ON
*** 26,32 ****
    AGGREGATE <replaceable class="PARAMETER">agg_name</replaceable> (<replaceable class="PARAMETER">agg_type</replaceable> [, ...] ) |
    CAST (<replaceable>source_type</replaceable> AS <replaceable>target_type</replaceable>) |
    COLLATION <replaceable class="PARAMETER">object_name</replaceable> |
!   COLUMN <replaceable class="PARAMETER">table_name</replaceable>.<replaceable class="PARAMETER">column_name</replaceable> |
    CONSTRAINT <replaceable class="PARAMETER">constraint_name</replaceable> ON <replaceable class="PARAMETER">table_name</replaceable> |
    CONVERSION <replaceable class="PARAMETER">object_name</replaceable> |
    DATABASE <replaceable class="PARAMETER">object_name</replaceable> |
--- 26,32 ----
    AGGREGATE <replaceable class="PARAMETER">agg_name</replaceable> (<replaceable class="PARAMETER">agg_type</replaceable> [, ...] ) |
    CAST (<replaceable>source_type</replaceable> AS <replaceable>target_type</replaceable>) |
    COLLATION <replaceable class="PARAMETER">object_name</replaceable> |
!   COLUMN <replaceable class="PARAMETER">table_or_view_name</replaceable>.<replaceable class="PARAMETER">column_name</replaceable> |
    CONSTRAINT <replaceable class="PARAMETER">constraint_name</replaceable> ON <replaceable class="PARAMETER">table_name</replaceable> |
    CONVERSION <replaceable class="PARAMETER">object_name</replaceable> |
    DATABASE <replaceable class="PARAMETER">object_name</replaceable> |
*** COMMENT ON
*** 42,48 ****
    OPERATOR FAMILY <replaceable class="PARAMETER">object_name</replaceable> USING <replaceable class="parameter">index_method</replaceable> |
    [ PROCEDURAL ] LANGUAGE <replaceable class="PARAMETER">object_name</replaceable> |
    ROLE <replaceable class="PARAMETER">object_name</replaceable> |
!   RULE <replaceable class="PARAMETER">rule_name</replaceable> ON <replaceable class="PARAMETER">table_name</replaceable> |
    SCHEMA <replaceable class="PARAMETER">object_name</replaceable> |
    SEQUENCE <replaceable class="PARAMETER">object_name</replaceable> |
    SERVER <replaceable class="PARAMETER">object_name</replaceable> |
--- 42,48 ----
    OPERATOR FAMILY <replaceable class="PARAMETER">object_name</replaceable> USING <replaceable class="parameter">index_method</replaceable> |
    [ PROCEDURAL ] LANGUAGE <replaceable class="PARAMETER">object_name</replaceable> |
    ROLE <replaceable class="PARAMETER">object_name</replaceable> |
!   RULE <replaceable class="PARAMETER">rule_name</replaceable> ON <replaceable class="PARAMETER">table_or_view_name</replaceable> |
    SCHEMA <replaceable class="PARAMETER">object_name</replaceable> |
    SEQUENCE <replaceable class="PARAMETER">object_name</replaceable> |
    SERVER <replaceable class="PARAMETER">object_name</replaceable> |
*** COMMENT ON
*** 52,58 ****
    TEXT SEARCH DICTIONARY <replaceable class="PARAMETER">object_name</replaceable> |
    TEXT SEARCH PARSER <replaceable 

Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch n...@2ndquadrant.com wrote:
 On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 SIGetDataEntries() can pretty easily be made lock-free.  The only real
 changes that seem to be needed are (1) to use a 64-bit counter, so
 you never need to decrement

 On second thought, won't this be inadequate on 32-bit systems, where updating
 the 64-bit counter produces two stores?  You must avoid reading it between 
 those stores.

 Now that is a potentially big problem.

Could we do something similar to the xxid hacks?  That is, we have a lot
of counters that should be fairly close to each other, so we store only
the low-order 32 bits of each notional value, and separately maintain a
common high-order word.  You probably would need some additional
overhead each time the high-order word bumps, but that's reasonably
infrequent.
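
For reference, a minimal sketch of the modulo-2^32 comparison underlying
the xxid trick (illustrative names, not the actual sinval code); it
stays correct as long as live values never drift more than 2^31 apart:

#include <stdint.h>

/* Compare two 32-bit counters on a circle, the way PostgreSQL
 * compares TransactionIds: the signed difference gives the right
 * answer while the two values are within 2^31 of each other. */
static inline int
msgnum_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}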

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] libedit memory stomp is apparently fixed in OS X Lion

2011-07-21 Thread Tom Lane
We've had several complaints about that tab-completion bug in the
Apple-supplied version of libedit, most recently here:
http://archives.postgresql.org/pgsql-bugs/2011-06/msg00119.php

I had a bug filed with Apple about that, and today I got some auto-mail
indicating they'd fixed that bug as of OS X 10.7 (Lion).  I don't have
Lion installed here, but I grabbed the libedit sources from
www.opensource.apple.com and indeed it looks fixed.  So, if any early
adopters want to try it out ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 9:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch n...@2ndquadrant.com wrote:
 On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 SIGetDataEntries() can pretty easily be made lock-free.  The only real
 changes that seem to be needed are (1) to use a 64-bit counter, so
 you never need to decrement

 On second thought, won't this be inadequate on 32-bit systems, where updating
 the 64-bit counter produces two stores?  You must avoid reading it between
 those stores.

 Now that is a potentially big problem.

 Could we do something similar to the xxid hacks?  That is, we have a lot
 of counters that should be fairly close to each other, so we store only
 the low-order 32 bits of each notional value, and separately maintain a
 common high-order word.  You probably would need some additional
 overhead each time the high-order word bumps, but that's reasonably
 infrequent.

Well, the trouble is figuring out what the shape of that additional
overhead needs to look like.  I think I have a simpler idea, though:
before acquiring any locks, just have SIGetDataEntries() do this:

+   if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
+       return 0;

Patch (with comment explaining why I think this is OK) attached.  If
the message numbers happen to be equal only because the counter has
wrapped, then stateP->resetState will be true, so we'll still realize
we need to do some work.

Test results, with the lazy vxid patch plus this patch, at 8 clients:

tps = 34028.144439 (including connections establishing)
tps = 34079.085935 (including connections establishing)
tps = 34125.295938 (including connections establishing)

And at 32 clients:

tps = 185521.605364 (including connections establishing)
tps = 188250.700451 (including connections establishing)
tps = 186077.847215 (including connections establishing)

And at 80 clients:

tps = 188568.886569 (including connections establishing)
tps = 191035.971512 (including connections establishing)
tps = 189363.019377 (including connections establishing)

Not quite as good as the unlocked version, but better than the
per-backend mutex, and a whole lot simpler.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


sinval-fastpath.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: new contrib module plpgsql's embeded sql validator

2011-07-21 Thread Pavel Stehule
2011/7/22 Jim Nasby j...@nasby.net:
 On Jul 19, 2011, at 10:51 PM, Pavel Stehule wrote:
 If you mean that such checks would be done automatically, no, they
 shouldn't be.  Consider a function that creates a table and then uses
 it, or even just depends on using a table that doesn't yet exist when
 you do CREATE FUNCTION.

 yes, any deep check is not possible for functions that use temporary
 tables.

 A plpgsql_lint is not a silver bullet - for these cases it is
 necessary to disable the lint.

 I can't speak generally - I have no idea what percentage of functions
 access temporary tables - but in my last project I used zero temp
 tables in circa 300 KB of plpgsql code.

 The more serious problem is the new dependencies between functions. I
 use a workaround - something like headers

 You can work around temp table issues the same way: just define the temp 
 table before you create the function.

 In practice, if I have a function that depends on a temp table, it either 
 creates it itself if it doesn't already exist, or I have a separate function 
 to create the table; that way you have a single place that has the temp table 
 definition, and that is in the database itself.

There is another trick - create a persistent table with the same name
beforehand. The runtime temporary table comes earlier in search_path,
so all executed SQL will resolve to that temp table.
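
A minimal sketch of that trick, with made-up names - the permanent table
exists only so that a validator can resolve the reference when checking
the function; at runtime the temp table shadows it:

CREATE TABLE scratch (k int, v text);    -- permanent shadow, stays empty

CREATE FUNCTION use_scratch() RETURNS void AS $$
BEGIN
    INSERT INTO scratch VALUES (1, 'x'); -- checks against the shadow
END;
$$ LANGUAGE plpgsql;

-- in a session, create the temp table before the first call:
CREATE TEMP TABLE scratch (k int, v text);
SELECT use_scratch();                    -- inserts into the temp table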

Pavel

 --
 Jim C. Nasby, Database Architect                   j...@nasby.net
 512.569.9461 (cell)                         http://jim.nasby.net

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers