Re: [HACKERS] Questions and experiences writing a Foreign Data Wrapper

2011-07-21 Thread Albe Laurenz
Heikki Linnakangas wrote:
 2) If I decide to close remote database connections after
 use, I would like to do so where reasonable.
 I would like to keep the connection open between query
 planning and query execution and close it when the
 scan is done.
 The exception could be named prepared statements.
 Is there a way to tell if that is the case during
 planning or execution?

 Hmm, maybe you could add a hook to close the connection when the
 transaction ends. But actually, you'd want to keep the connection
 open across transactions too. Some sort of a general connection
 caching facility would be useful for many FDWs.

I agree, and that is how I implemented it at the moment.
But it might be nice to give the user the option, say, if they know
that it is a long session in a daemon process that accesses the
remote table only once a day.

I'll look into the hook option.
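
A minimal sketch of that hook approach, using the backend's
RegisterXactCallback machinery (the connection-cache helper named here
is hypothetical):

    #include "postgres.h"
    #include "access/xact.h"

    /* Hypothetical helper that closes any cached remote connections. */
    extern void my_fdw_close_cached_connections(void);

    static void
    my_fdw_xact_callback(XactEvent event, void *arg)
    {
        /* Once the local transaction ends, drop the remote connection. */
        if (event == XACT_EVENT_COMMIT || event == XACT_EVENT_ABORT)
            my_fdw_close_cached_connections();
    }

    void
    _PG_init(void)
    {
        RegisterXactCallback(my_fdw_xact_callback, NULL);
    }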

Here are some more ideas for FDW API functions/macros that might be
useful for FDW developers.

- A function that gives you the internal and external parameters at
  execution time.
- A function that gives you a type's input and output function.
- A function that gives you the OID of the foreign table owner.
- A function that gives you the list of columns of the foreign
  table (atttypid, atttypmod, attname, maybe others); a sketch of
  what I mean follows below.
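
For that last one, a rough sketch, assuming the relation's TupleDesc is
the natural source (the function name is invented; a real API function
would presumably return a List rather than print):

    #include "postgres.h"
    #include "access/heapam.h"
    #include "utils/rel.h"

    /* Invented helper: show the columns of a foreign table. */
    void
    ShowForeignTableColumns(Oid relid)
    {
        Relation    rel = relation_open(relid, AccessShareLock);
        TupleDesc   tupdesc = RelationGetDescr(rel);
        int         i;

        for (i = 0; i < tupdesc->natts; i++)
        {
            Form_pg_attribute att = tupdesc->attrs[i];

            if (att->attisdropped)
                continue;           /* skip dropped columns */
            elog(NOTICE, "column %s: type %u, typmod %d",
                 NameStr(att->attname), att->atttypid, att->atttypmod);
        }
        relation_close(rel, AccessShareLock);
    }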

Yours,
Laurenz Albe



Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Yeb Havinga

On 2011-07-21 00:08, Robert Haas wrote:

On Wed, Jul 20, 2011 at 4:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Kohei Kaigai kohei.kai...@emea.nec.com writes:

I'd like to have a discussion about syscache towards next commit-fest.
The issues may be:
  - Initial bucket allocation: in most cases the buckets are never referenced.
  - Reclaiming cache entries when the cache grows too large.

There used to be support for limiting the number of entries in a
syscache.  It got removed (cf commit
8b9bc234ad43dfa788bde40ebf12e94f16556b7f) because (1) it was remarkably
expensive to do it (extra list manipulations, etc), and (2) performance
tended to fall off a cliff as soon as you had a few more tables or
whatever than the caches would hold.  I'm disinclined to reverse that
decision.  It appears to me that the security label stuff needs a
different set of performance tradeoffs than the rest of the catalogs,
which means it probably ought to do its own caching, rather than trying
to talk us into pessimizing the other catalogs for seclabel's benefit.

I agree that we don't want to limit the size of the catcaches.  We've
been careful to design them in such a way that they won't blow out
memory, and so far there's no evidence that they do.  If it ain't
broke, don't fix it.  Having catcaches that can grow in size as needed
sounds useful to me, though.
Is there a way to dump syscache statistics like there is for 
MemoryContextStats (something - gdb helped me there)?


Besides that, I have to admit having problems understanding why the 5MB
cache for pg_seclabel is a problem; its memory consumption is linear
in the size of the underlying database (in contrast with the other
cache, storing access vectors, which would have O(n*m) space complexity
if it didn't reclaim space). So it is proportional to the number of
objects in a database, and in size it seems to be in the same order as
pg_proc, pg_class and pg_attribute.


regards,

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




[HACKERS] Environment checks prior to regression tests?

2011-07-21 Thread Kohei Kaigai
How about an idea that allows launching an environment checker (typically
shell scripts) prior to regression tests?

The following things must be preconfigured to run sepgsql's regression test:
- SELinux must be running and configured to enforcing mode.
- The sepgsql-regtest policy module must be loaded.
- The boolean sepgsql_regression_test_mode must be turned on.
- The psql command should be labeled as 'bin_t'.

If installcheck optionally allowed launching an environment checker before
the regression tests, we could suggest to users how to fix up their
configuration. That seems quite helpful to me.

For example, one idea is to inject a dummy variable (normally empty) as a
dependency of installcheck, which a contrib module's Makefile can
override, as follows:

  # against installed postmaster
  installcheck: submake $(REGRESS_PRE)
          $(pg_regress_installcheck) $(REGRESS_OPTS) $(REGRESS)
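
A contrib module's Makefile could then hook in its checker along these
lines (the target and script names here are made up):

  # in the contrib module's Makefile (sketch)
  REGRESS_PRE = check_environment

  check_environment:
          ./check_environment.sh   # hypothetical script testing the items above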

Thanks,
--
NEC Europe Ltd, SAP Global Competence Center
KaiGai Kohei kohei.kai...@emea.nec.com



Re: [HACKERS] Commitfest Status: Sudden Death Overtime

2011-07-21 Thread Hannu Krosing
On Thu, 2011-07-21 at 00:21 +0200, Florian Pflug wrote:
 There's a small additional concern, though, which is that there's an
 XPath 2.0 spec out there, and it modifies the type system and data model
 rather heavily. So before we go adding functions, it'd probably be wise
 to check that we're not painting ourselves into a corner.

Why not just write an XPATH2() function conforming to the XPath 2.0 spec
if the new spec is substantially different?


-- 
---
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] Environment checks prior to regression tests?

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 6:16 AM, Kohei Kaigai kohei.kai...@emea.nec.com wrote:
 How about an idea that allows launching an environment checker (typically
 shell scripts) prior to regression tests?

 The following things must be preconfigured to run sepgsql's regression test:
 - SELinux must be running and configured to enforcing mode.
 - The sepgsql-regtest policy module must be loaded.
 - The boolean sepgsql_regression_test_mode must be turned on.
 - The psql command should be labeled as 'bin_t'.

 If installcheck optionally allowed launching an environment checker before
 the regression tests, we could suggest to users how to fix up their
 configuration. That seems quite helpful to me.

 For example, one idea is to inject a dummy variable (normally empty) as a
 dependency of installcheck, which a contrib module's Makefile can
 override, as follows:

  # against installed postmaster
  installcheck: submake $(REGRESS_PRE)
          $(pg_regress_installcheck) $(REGRESS_OPTS) $(REGRESS)

Seems reasonable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] timing for 9.1beta4 / rc1

2011-07-21 Thread Robert Haas
All,

9.1beta3 was tagged on July 7th and announced July 12th.  In the
interest of avoiding institutional inertia, should we try to set a
tentative date for the next beta or, if we don't end up fixing too many
bugs in the meanwhile, perhaps a release candidate?

Obviously we want to give people some time to shake out bugs in beta3,
but I think it would be good to get our next 9.1 release (whatever it
ends up being named) out the door sometime in August, and we'll be
more likely to actually do that if we pick a date to shoot for.

On a related note,
http://wiki.postgresql.org/wiki/PostgreSQL_9.1_Open_Items is looking
pretty barren at the moment.  Are there things that should be listed
there?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 4:00 AM, Yeb Havinga yebhavi...@gmail.com wrote:
 Is there a way to dump syscache statistics like there is for
 MemoryContextStats (something - gdb helped me there)?

I don't know of one.

 Besides that, I have to admit having problems understanding why the 5MB cache
 for pg_seclabel is a problem; its memory consumption is linear in the
 size of the underlying database (in contrast with the other cache, storing
 access vectors, which would have O(n*m) space complexity if it didn't
 reclaim space). So it is proportional to the number of objects in a database,
 and in size it seems to be in the same order as pg_proc, pg_class and
 pg_attribute.

Fair enough.  I'm not convinced that the sheer quantity of memory use
is a problem, although I would like to see a few more test results
before we decide that definitively.  I *am* unwilling to pay the
startup overhead of initializing an extra 2048-bucket syscache that only
sepgsql users will actually need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Sushant Sinha
I am using plpythonu on postgres 9.0.2. One of my python functions was
throwing a TypeError exception. However, I only see the exception in the
database and not the stack trace. It becomes difficult to debug if the
stack trace is absent in Python.

logdb=# select get_words(forminput) from fi;   
ERROR:  PL/Python: TypeError: an integer is required
CONTEXT:  PL/Python function get_words


And here is the error if I run that function on the same data in python:

Traceback (most recent call last):
  File "valid.py", line 215, in <module>
    parse_query(result['forminput'])
  File "valid.py", line 132, in parse_query
    dateobj = datestr_to_obj(columnHash[column])
  File "valid.py", line 37, in datestr_to_obj
    dateobj = datetime.date(words[2], words[1], words[0])
TypeError: an integer is required


Is this a known problem, or does it need addressing?

Thanks,
Sushant.




Re: [HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Peter Geoghegan
On 21 July 2011 14:27, Sushant Sinha sushant...@gmail.com wrote:
 I am using plpythonu on postgres 9.0.2. One of my python functions was
 throwing a TypeError exception. However, I only see the exception in the
 database and not the stack trace. It becomes difficult to debug if the
 stack trace is absent in Python.

 logdb=# select get_words(forminput) from fi;
 ERROR:  PL/Python: TypeError: an integer is required
 CONTEXT:  PL/Python function get_words


 And here is the error if I run that function on the same data in python:

 Traceback (most recent call last):
   File "valid.py", line 215, in <module>
     parse_query(result['forminput'])
   File "valid.py", line 132, in parse_query
     dateobj = datestr_to_obj(columnHash[column])
   File "valid.py", line 37, in datestr_to_obj
     dateobj = datetime.date(words[2], words[1], words[0])
 TypeError: an integer is required


 Is this a known problem, or does it need addressing?

Traceback information will be added to PL/Python errors in Postgres
9.1, due out around September.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Jan Urbański
On 21/07/11 15:27, Sushant Sinha wrote:
 I am using plpythonu on postgres 9.0.2. One of my python functions was
 throwing a TypeError exception. However, I only see the exception in the
 database and not the stack trace. It becomes difficult to debug if the
 stack trace is absent in Python.
 
 logdb=# select get_words(forminput) from fi;   
 ERROR:  PL/Python: TypeError: an integer is required
 CONTEXT:  PL/Python function get_words
 
 And here is the error if I run that function on the same data in python:
 
 [traceback]
 
 Is this a known problem, or does it need addressing?

Yes, traceback support in PL/Python has already been implemented and is
a new feature that will be available in PostgreSQL 9.1.

Cheers,
Jan



Re: [HACKERS] PL/Python: No stack trace for an exception

2011-07-21 Thread Sushant Sinha

On Thu, 2011-07-21 at 15:31 +0200, Jan Urbański wrote:
 On 21/07/11 15:27, Sushant Sinha wrote:
  I am using plpythonu on postgres 9.0.2. One of my python functions was
  throwing a TypeError exception. However, I only see the exception in the
  database and not the stack trace. It becomes difficult to debug if the
  stack trace is absent in Python.
  
  logdb=# select get_words(forminput) from fi;   
  ERROR:  PL/Python: TypeError: an integer is required
  CONTEXT:  PL/Python function get_words
  
  And here is the error if I run that function on the same data in python:
  
  [traceback]
  
  Is this a known problem, or does it need addressing?
 
 Yes, traceback support in PL/Python has already been implemented and is
 a new feature that will be available in PostgreSQL 9.1.
 
 Cheers,
 Jan

Thanks Jan! Just one more reason to try 9.1.





Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Robert Haas
On Tue, Jul 12, 2011 at 4:47 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 Comments ?

I was going to spend some time reviewing this, but I see that (1) it
has bit-rotted slightly - there is a failing hunk in pg_class.h and
(2) some of the comments downthread seem to suggest that you're
thinking about whether to revise this somewhat, in particular by using
some counter other than an LSN.  Are you planning to submit an updated
version?

A few comments on this version just reading through it:

- In lazy_scan_heap, where you've made the call to
RecordPageWithFreeSpace() unconditional, the comment change you made
immediately above is pretty half-baked.  It still refers to
lazy_vacuum_heap, which you've meanwhile removed.  You need to rewrite
the whole comment, I think.

- Instead of passing bool need_vaclsn to PageRepairFragmentation(),
how about passing Offset new_special_size?  Currently,
PageRepairFragmentation() doesn't know whether it's looking at a heap
page or an index page, and it would be nice to keep it that way.  It's
even possible that expanding the special space opportunistically
during page defragmentation could be useful in other contexts besides
this.  Or perhaps contracting it.

- itemid.h seems a bit schizophrenic about dead line pointers.  Here,
you've decided that it's OK for lp_flags == LP_DEAD && lp_off == 1 to
mean dead-vacuumed, but the existing code says:

#define LP_DEAD        3       /* dead, may or may not have storage */

AFAICT, the actual situation here is that indexes sometimes use dead
line pointers with storage, but the heap doesn't; thus, the heap can
safely use the storage bits of dead line pointers to mean something
else, but indexes can't.  I think the comments throughout itemid.h
should be adjusted to bring this out a bit more clearly, though.
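
A sketch of what the heap-only convention amounts to, with invented
macro names (lp_off and lp_flags are the bit fields in ItemIdData):

    /* The heap never stores data for LP_DEAD items, so lp_off is free
     * to carry a marker; indexes cannot assume this. */
    #define ItemIdMarkDeadVacuumed(itemId) \
        ((itemId)->lp_flags = LP_DEAD, (itemId)->lp_off = 1)

    #define ItemIdIsDeadVacuumed(itemId) \
        ((itemId)->lp_flags == LP_DEAD && (itemId)->lp_off == 1)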

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] timing for 9.1beta4 / rc1

2011-07-21 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 http://wiki.postgresql.org/wiki/PostgreSQL_9.1_Open_Items is
 looking pretty barren at the moment.  Are there things that should
 be listed there?
 
I don't know of any.  The non-blocking item about paring down Dan's
latest isolation test case to a size which can be committed is
something I won't be able to get to for at least a couple weeks --
at least if I want to keep drawing paychecks from my employer.  If
someone else took that, I wouldn't complain.  Meanwhile, I am
running the test on my workstation regularly.
 
-Kevin



Re: [HACKERS] fixing PQsetvalue()

2011-07-21 Thread Merlin Moncure
On Wed, Jul 20, 2011 at 10:28 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Jul 18, 2011 at 6:38 AM, Pavel Golub pa...@microolap.com wrote:
 Hello, Merlin.

 I hope it's OK that I've added Andrew's patch to CommitFest:
 https://commitfest.postgresql.org/action/patch_view?id=606

 I did this because beta3 was already released, but nothing has been done
 about this bug.

 So I finally got around to taking a look at this patch, and I guess my
 basic feeling is that I like it.  The existing code is pretty weird
 and inconsistent: the logic in PQsetvalue() basically does the same
 thing as the logic in pqAddTuple(), but incompatibly and less
 efficiently.  Unifying them seems sensible, and the fix looks simple
 enough to back-patch.

 With respect to Tom's concern about boxing ourselves in, I guess it's
 hard for me to get worried about that.  I've heard no one suggest
 changing the internal representation libpq uses for result sets, and
 even if we did, presumably the new format would also need to support
 an append-a-tuple operation - or at the very worst we could cause it to
 support that without much difficulty.

 So, +1 from me.

right -- thanks for that. For the record, I think a rework of the
libpq internal representation would be likely to happen concurrently
with a rework of the API -- for example to better support streaming
data. PQsetvalue very well might prove to be a headache -- just too
hard to say.  libpq strikes me as a 50-year-plus marriage might:
fractious, full of mystery and regrets, but highly functional.

merlin



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Kevin Grittner
Robert Haas wrote:
 
 SIGetDataEntries(). I believe we need to bite the bullet and
 rewrite this using a lock-free algorithm, using memory barriers on
 processors with weak memory ordering.
 
 [32 processors; 80 clients]
 
 On unpatched master
 
 tps = 132518.586371 (including connections establishing)
 tps = 130968.749747 (including connections establishing)
 tps = 132574.338942 (including connections establishing)
 
 With the lazy vxid locks patch
 
 tps = 119215.958372 (including connections establishing)
 tps = 113056.859871 (including connections establishing)
 tps = 160562.770998 (including connections establishing)
 
 gets rid of SInvalReadLock and instead gives each backend its own
 spinlock.
 
 tps = 167392.042393 (including connections establishing)
 tps = 171336.145020 (including connections establishing)
 tps = 170500.529303 (including connections establishing)
 
 SIGetDataEntries() can pretty easily be made lock-free.
 
 tps = 203256.701227 (including connections establishing)
 tps = 190637.957571 (including connections establishing)
 tps = 190228.617178 (including connections establishing)
 
 Thoughts? Comments? Ideas?
 
Very impressive!  Those numbers definitely justify some #ifdef code
to provide alternatives for weak memory ordering machines versus
others.  With the number of CPUs climbing as it is, this is very
important work!
 
-Kevin



Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Robert Haas
On Thu, Jul 14, 2011 at 12:43 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 How does this interact with the visibility map? If you set the visibility
 map bit after vacuuming indexes, a subsequent vacuum will not visit the
 page. The second vacuum will update relindxvacxlogid/off, but it will not
 clean up the dead line pointers left behind by the first vacuum. Now the LSN
 on the page differs from the one stored in pg_class, so subsequent pruning
 will not remove the dead line pointers either.

Currently, I think we would only set the visibility map bit after
vacuuming the page for the second time.  The patch as submitted
doesn't appear to go back and set visibility map bits after finishing
the index vacuum.  Now, that might be nice to do, because then a
hypothetical index-only scan could start taking advantage of vacuum
having been done sooner.  If we wanted to do that, we could
restructure the visibility map to store two bits per page: one to
indicate whether there is any potential work for VACUUM to do (modulo
freezing) and the other to indicate whether an index pointer could
possibly be aimed at a dead line pointer.  (In fact, maybe we'd even
want to have a third bit to indicate all tuples frozen, which would
be useful for optimizing anti-wraparound vacuum.)

 I think you can sidestep that
 if you check that the page's vacuum LSN <= vacuum LSN in pg_class, instead
 of equality.

I don't think that works, because the point of storing the LSN in
pg_class is to verify that the vacuum completed the index cleanup
without error.  The fact that a newer vacuum accomplished that goal
does not mean that all older ones did.

 Ignoring the issue stated in the previous paragraph, I think you wouldn't
 actually need a 64-bit LSN. A smaller counter is enough, as wrap-around
 doesn't matter. In fact, a single bit would be enough. After a successful
 vacuum, the counter on each heap page (with dead line pointers) is N, and
 the value in pg_class is N. There are no other values on the heap, because
 vacuum will have cleaned them up. When you begin the next vacuum, it will
 stamp pages with N+1. So at any stage, there is only one of two values on
 any page, so a single bit is enough. (But as I said, that doesn't hold if
 vacuum skips some pages thanks to the visibility map)

If this can be made to work, it's a very appealing idea.  The patch as
submitted uses lp_off to store a single bit, to distinguish between
vacuum and dead-vacuumed, but we could actually have (for greater
safety and debuggability) a 15-byte counter that just wraps around
from 32,767 to 1.  (Maybe it would be wise to reserve a few counter
values, or a few bits, or both, for future projects.)  That would
eliminate the need to touch PageRepairFragmentation() or use the
special space, since all the information would be in the line pointer
itself.  Not having to rearrange the page to reclaim dead line
pointers is appealing, too.

 Is there something in place to make sure that pruning uses an up-to-date
 relindxvacxlogid/off value? I guess it doesn't matter if it's out-of-date,
 you'll just miss the opportunity to remove some dead tuples.

This seems like a tricky problem, because it could cause us to
repeatedly fail to remove the same dead line pointers, which would be
poor.  We could do something like this: after updating pg_class,
vacuum sends an interrupt to any backend which holds RowExclusiveLock
or higher on that relation.  The interrupt handler just sets a flag.
If that backend does heap_page_prune() and sees the flag set, it knows
that it needs to recheck pg_class.  This is a bit grotty and doesn't
completely close the race condition (the signal might not arrive in
time), but it ought to make it narrow enough not to matter in
practice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] fixing PQsetvalue()

2011-07-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So I finally got around to taking a look at this patch, and I guess my
 basic feeling is that I like it.  The existing code is pretty weird
 and inconsistent: the logic in PQsetvalue() basically does the same
 thing as the logic in pqAddTuple(), but incompatibly and less
 efficiently.  Unifying them seems sensible, and the fix looks simple
 enough to back-patch.

Yeah, I've been looking at it too.  For some reason I had had the
idea that the proposed patch complicated the code, but actually it's
simplifying it by removing almost-duplicate code.  So that's good.

The patch as proposed adds back a bug in return for the one it fixes
(you can not free() the result of pqResultAlloc()), but that's easily
fixed.

Will fix and commit.

regards, tom lane



Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Pavan Deolasee
On Thu, Jul 21, 2011 at 11:51 AM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Jul 12, 2011 at 4:47 PM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:
  Comments ?

 I was going to spend some time reviewing this, but I see that (1) it
 has bit-rotted slightly - there is a failing hunk in pg_class.h and
 (2) some of the comments downthread seem to suggest that you're
 thinking about whether to revise this somewhat, in particular by using
 some counter other than an LSN.  Are you planning to submit an updated
 version?


Yeah, I would submit an updated version. I was just waiting to see if there
are more comments about the general design. But I think I can now proceed.

I wonder if we can just ignore the wrap-around issue and use a 32-bit
counter. The counter can be stored in pg_class itself since its use is
limited to the given table. At the start of vacuum, we get the current
value. We then increment the counter (taking care of wrap-around) and use
the incremented value as a marker in the page special area. If the vacuum
runs to completion, we store the new value back in the pg_class row. Since
vacuums are serialized for a given table, we don't need to worry about
concurrent updates to the value.

While collecting dead-vacuum line pointers, either during HOT-prune or a
subsequent vacuum, we check the current pg_class value, and if it is equal
to the page counter, we can safely collect the dead-vacuum line pointers.
For a moment, I thought we could just do away with a bit as Heikki
suggested upthread, but the problem comes with backends that might be
running with a stale value of the counter from pg_class; the counter
should be large enough that it does not wrap around quickly for all
practical purposes.
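
A compressed sketch of that scheme (names illustrative; uint32 is the
usual PostgreSQL typedef):

    /* Advance the per-table vacuum generation, reserving 0 as "unset". */
    static uint32
    advance_vacuum_gen(uint32 cur)
    {
        uint32 next = cur + 1;
        return (next == 0) ? 1 : next;
    }

    /* During HOT-prune: equality proves the last *completed* vacuum
     * stamped this page, i.e. its index pointers are already gone. */
    static bool
    can_collect_dead_vacuumed(uint32 page_gen, uint32 pgclass_gen)
    {
        return page_gen == pgclass_gen;
    }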



 A few comments on this version just reading through it:

 - In lazy_scan_heap, where you've made the call to
 RecordPageWithFreeSpace() unconditional, the comment change you made
 immediately above is pretty half-baked.  It still refers to
 lazy_vacuum_heap, which you've meanwhile removed.  You need to rewrite
 the whole comment, I think.

 - Instead of passing bool need_vaclsn to PageRepairFragmentation(),
 how about passing Offset new_special_size?  Currently,
 PageRepairFragmentation() doesn't know whether it's looking at a heap
 page or an index page, and it would be nice to keep it that way.  It's
 even possible that expanding the special space opportunistically
 during page defragmentation could be useful in other contexts besides
 this.  Or perhaps contracting it.

 - itemid.h seems a bit schizophrenic about dead line pointers.  Here,
 you've decided that it's OK for lp_flags == LP_DEAD && lp_off == 1 to
 mean dead-vacuumed, but the existing code says:

 #define LP_DEAD        3       /* dead, may or may not have storage */

 AFAICT, the actual situation here is that indexes sometimes use dead
 line pointers with storage, but the heap doesn't; thus, the heap can
 safely use the storage bits of dead line pointers to mean something
 else, but indexes can't.  I think the comments throughout itemid.h
 should be adjusted to bring this out a bit more clearly, though.



I will take care of these issues in the revised patch. Thanks for looking at
the patch.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Pavan Deolasee
On Thu, Jul 21, 2011 at 12:17 PM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Jul 14, 2011 at 12:43 PM, Heikki Linnakangas
  I think you can sidestep that
  if you check that the page's vacuum LSN <= vacuum LSN in pg_class,
  instead of equality.

 I don't think that works, because the point of storing the LSN in
 pg_class is to verify that the vacuum completed the index cleanup
 without error.  The fact that a newer vacuum accomplished that goal
 does not mean that all older ones did.


Since we force the subsequent vacuum to also look at the pages scanned and
pruned by a previous failed vacuum, all the pages that have dead-vacuum line
pointers will have a new stamp once the vacuum finishes successfully, and
pg_class will have the same stamp.


  Ignoring the issue stated in the previous paragraph, I think you wouldn't
  actually need a 64-bit LSN. A smaller counter is enough, as wrap-around
  doesn't matter. In fact, a single bit would be enough. After a successful
  vacuum, the counter on each heap page (with dead line pointers) is N, and
  the value in pg_class is N. There are no other values on the heap,
 because
  vacuum will have cleaned them up. When you begin the next vacuum, it will
  stamp pages with N+1. So at any stage, there is only one of two values on
  any page, so a single bit is enough. (But as I said, that doesn't hold if
  vacuum skips some pages thanks to the visibility map)

 If this can be made to work, it's a very appealing idea.


I thought more about it and for a moment believed that we could do this
with just a bit, since we rescan the pages with dead and dead-vacuum line
pointers after an aborted vacuum, but concluded that a bit or a small
counter is not good enough, since other backends might be running with a
stale value and would be fooled into believing that they can collect the
dead-vacuum line pointers before the index pointers are actually removed. We
can still use a 32-bit counter though, since the wrap-around period is
practically too large for any backend to still be running with such a stale
counter (you would need more than 1 billion vacuums on the same table in
between to hit this).


 The patch as
 submitted uses lp_off to store a single bit, to distinguish between
 vacuum and dead-vacuumed, but we could actually have (for greater
 safety and debuggability) a 15-byte counter that just wraps around
 from 32,767 to 1.  (Maybe it would be wise to reserve a few counter
 values, or a few bits, or both, for future projects.)  That would
 eliminate the need to touch PageRepairFragmentation() or use the
 special space, since all the information would be in the line pointer
 itself.  Not having to rearrange the page to reclaim dead line
 pointers is appealing, too.


Not sure if I get you here. We need a mechanism to distinguish between dead
and dead-vacuum line pointers. How would the counter (by which I assume you
mean 15-bit, not byte) help solve that? Or are you just suggesting
replacing the LSN with the counter in the page header?


  Is there something in place to make sure that pruning uses an up-to-date
  relindxvacxlogid/off value? I guess it doesn't matter if it's out-of-date,
  you'll just miss the opportunity to remove some dead tuples.

 This seems like a tricky problem, because it could cause us to
 repeatedly fail to remove the same dead line pointers, which would be
 poor.  We could do something like this: after updating pg_class,
 vacuum sends an interrupt to any backend which holds RowExclusiveLock
 or higher on that relation.  The interrupt handler just sets a flag.
 If that backend does heap_page_prune() and sees the flag set, it knows
 that it needs to recheck pg_class.  This is a bit grotty and doesn't
 completely close the race condition (the signal might not arrive in
 time), but it ought to make it narrow enough not to matter in
 practice.


I am not too excited about adding that complexity to the code. Even if a
backend does not have the up-to-date value, it will fail to collect the
dead-vacuum pointers, but soon either it will catch up, some other backend
will remove them, or the next vacuum will take care of it.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove O(N^2) performance issue with multiple SAVEPOINTs.

2011-07-21 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Instead of leaving the locks dangling to an already-destroyed resource 
 owner, how about assigning all locks directly to the top-level resource 
 owner in one sweep? That'd still be much better than the old way of 
 recursively reassigning them up the subtransaction tree, one level at a 
 time.

I haven't actually read the patch, but the reason for pushing them up
only one level at a time is that if an intermediate-level subtransaction
aborts, the locks taken by its child subtransactions have to be released
at that time.  It sure sounds like this patch broke that.

regards, tom lane



Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove O(N^2) performance issue with multiple SAVEPOINTs.

2011-07-21 Thread Simon Riggs
On Thu, Jul 21, 2011 at 5:56 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Instead of leaving the locks dangling to an already-destroyed resource
 owner, how about assigning all locks directly to the top-level resource
 owner in one sweep? That'd still be much better than the old way of
 recursively reassigning them up the subtransaction tree, one level at a
 time.

 I haven't actually read the patch, but the reason for pushing them up
 only one level at a time is that if an intermediate-level subtransaction
 aborts, the locks taken by its child subtransactions have to be released
 at that time.  It sure sounds like this patch broke that.

The only path altered by the patch was the
final-commit-while-in-a-subxact, so I don't see a problem in the part
you mention.

At commit all the locks get transferred to the parent, so we scan the
lock table repeatedly, giving O(N^2).

I think I'll just revert it though. Subtransactions need a lot of
tuning but this isn't high enough up my list to be worth the work.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] Policy on pulling in code from other projects?

2011-07-21 Thread Joshua D. Drake

Hey,

So I am looking intently at what it is going to take to get the URI
patch done for psql [1], and while digging around the web I found a URI
parser library. It is under the New BSD license and is strictly RFC
3986 [2] compliant.


Now, I have not dug into the code, but the parser is used by other
projects. So the question is:


Assuming the code actually makes this patch easier, do we:

A. Pull the code into the main tree, or
B. Have it as a requirement via configure?

1. 
http://archives.postgresql.org/message-id/1302114698.23164.17.camel@jd-desktop


2. http://tools.ietf.org/html/rfc3986


Sincerely,

Joshua D. Drake


--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579



Re: [HACKERS] fixing PQsetvalue()

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 12:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 So I finally got around to taking a look at this patch, and I guess my
 basic feeling is that I like it.  The existing code is pretty weird
 and inconsistent: the logic in PQsetvalue() basically does the same
 thing as the logic in pqAddTuple(), but incompatibly and less
 efficiently.  Unifying them seems sensible, and the fix looks simple
 enough to back-patch.

 Yeah, I've been looking at it too.  For some reason I had had the
 idea that the proposed patch complicated the code, but actually it's
 simplifying it by removing almost-duplicate code.  So that's good.

 The patch as proposed adds back a bug in return for the one it fixes
 (you can not free() the result of pqResultAlloc()), but that's easily
 fixed.

 Will fix and commit.

Cool.  I believe that's the last patch for CommitFest 2011-06.

*bangs gavel*

I believe that makes it time for 9.2alpha1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove O(N^2) performance issue with multiple SAVEPOINTs.

2011-07-21 Thread Alvaro Herrera
Excerpts from Simon Riggs's message of Thu Jul 21 13:30:25 -0400 2011:

 I think I'll just revert it though. Subtransactions need a lot of
 tuning but this isn't high enough up my list to be worth the work.

If it works and is sane, why would you revert it?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Policy on pulling in code from other projects?

2011-07-21 Thread Peter Geoghegan
On 21 July 2011 18:43, Joshua D. Drake j...@commandprompt.com wrote:
 So I am looking intently at what it is going to take to get the URI patch
 done for psql [1], and while digging around the web I found a URI parser
 library. It is under the New BSD license and is strictly RFC 3986 [2]
 compliant.

 Now, I have not dug into the code, but the parser is used by other projects.
 So the question is:

 Assuming the code actually makes this patch easier, do we:

 A. Pull the code into the main tree, or
 B. Have it as a requirement via configure?

 1.
 http://archives.postgresql.org/message-id/1302114698.23164.17.camel@jd-desktop

 2. http://tools.ietf.org/html/rfc3986

Without commenting on the practicalities of what you'd like to do,
including code from other projects in the tree is well precedented.
Off the top of my head, I can tell you that pgcrypto uses code from
various sources, while preserving the original copyright notices.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] Policy on pulling in code from other projects?

2011-07-21 Thread Tom Lane
Joshua D. Drake j...@commandprompt.com writes:
 So I am looking intently at what it is going to take to get the URI
 patch done for psql [1], and while digging around the web I found a URI
 parser library. It is under the New BSD license and is strictly RFC
 3986 [2] compliant.

Surely we do not need a whole library to parse URIs.

regards, tom lane



Re: [HACKERS] Policy on pulling in code from other projects?

2011-07-21 Thread Joshua D. Drake

On 07/21/2011 11:13 AM, Tom Lane wrote:

Joshua D. Drake j...@commandprompt.com writes:

So I am looking intently at what it is going to take to get the URI
patch done for psql [1], and while digging around the web I found a URI
parser library. It is under the New BSD license and is strictly RFC
3986 [2] compliant.


Surely we do not need a whole library to parse URIs.


Shrug: standards compliant, already runs on Windows.

Seems like a good idea to me?

http://uriparser.sourceforge.net/

Sincerely,

Joshua D. Drake








--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579



Re: [HACKERS] Policy on pulling in code from other projects?

2011-07-21 Thread Joshua D. Drake

On 07/21/2011 11:13 AM, Tom Lane wrote:

Joshua D. Drake j...@commandprompt.com writes:

So I am looking intently at what it is going to take to get the URI
patch done for psql [1], and while digging around the web I found a URI
parser library. It is under the New BSD license and is strictly RFC
3986 [2] compliant.


Surely we do not need a whole library to parse URIs.


Also:

http://uriparser.git.sourceforge.net/git/gitweb-index.cgi

Sincerely,

Joshua D. Drake







--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 12:16 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Very impressive!  Those numbers definitely justify some #ifdef code
 to provide alternatives for weak memory ordering machines versus
 others.  With the number of CPUs climbing as it is, this is very
 important work!

Thanks.  I'm not thinking so much about #ifdef (although that could
work, too) as I am about providing some primitives to allow this sort
of code-writing to be done in a somewhat less ad-hoc fashion.  It
seems like there are basically two categories of machines we need to
worry about.

1. Machines with strong memory ordering.  On this category of machines
(which includes x86), the CPU basically does not perform loads or
stores out of order.  On some of these machines, it is apparently
possible for there to be some reordering of stores relative to loads,
but if the program stores two values or loads two values, those
operations will be performed in the same order they appear in the
program.  The main thing you need to make your code work reliably on
these machines is a primitive that keeps the compiler from reordering
your code during optimization.  On x86, certain categories of exotic
instructions (such as non-temporal stores) do require explicit fences,
but ordinary loads and stores do not.

2. Machines with weak memory ordering.  On this category of machines
(which includes PowerPC, DEC Alpha, and maybe some others), the CPU
reorders memory accesses arbitrarily unless you explicitly issue
instructions that enforce synchronization.  You still need to keep the
compiler from moving things around, too.  Alpha is particularly
pernicious, because something like a->b can fetch the pointed-to value
before loading the pointer itself.  This is otherwise known as "we
have basically no cache coherency circuits on this chip at all."  On
these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.

So you can imagine a primitive that is defined to be a compiler
barrier on machines with strong memory ordering, and as a memory
fencing instruction on machines with weak memory ordering.
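
A hedged sketch of such a primitive (names illustrative, gcc-style asm;
the weak-ordering fallback relies on gcc's __sync_synchronize builtin):

    /* Keep the compiler from reordering code across this point. */
    #define pg_compiler_barrier()  __asm__ __volatile__("" ::: "memory")

    #if defined(__i386__) || defined(__x86_64__)
    /* Strong ordering: the compiler barrier is the main thing needed. */
    #define pg_memory_barrier()    pg_compiler_barrier()
    #elif defined(__GNUC__)
    /* Weak ordering (PowerPC, Alpha, ...): emit a real fence. */
    #define pg_memory_barrier()    __sync_synchronize()
    #endif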

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 ... On these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.

Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on machines
where it matters.

To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support.  In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.

regards, tom lane



Re: [HACKERS] Questions and experiences writing a Foreign Data Wrapper

2011-07-21 Thread Tom Lane
Albe Laurenz laurenz.a...@wien.gv.at writes:
 I wrote a FDW for Oracle to a) learn some server coding
 and b) see how well the FDW API works for me.

 I came up with three questions/experiences:

 1) GetUserMapping throws an error if there is no
user mapping for the user (or PUBLIC).
I think that it would be much more useful if
it would return NULL or something similar instead.

We could make it do that, but under what circumstances would it be
useful to not throw an error?  It doesn't seem like you should try
to establish a remote connection anyway, if there's no mapping.

 3) I am confused by the order of function calls
during execution of a subplan. It is like this:
  BeginForeignScan
  ReScanForeignScan
  IterateForeignScan
  IterateForeignScan
  ...
  ReScanForeignScan
  IterateForeignScan
  IterateForeignScan
  ...
  EndForeignScan
   So the first ReScan is done immediately after
   BeginForeignScan. Moreover, internal parameters are not
   set in the BeginForeignScan call.

   This is probably working as designed, but BeginForeignScan
   has no way to know whether it should execute a remote
   query or not.

I'd say it probably shouldn't, ever.  If you look at the executor's node
init functions, none of them do any actual data fetching.  They just
prepare data structures.
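
Applied to an FDW, that suggests deferring the remote query to the first
IterateForeignScan call, roughly like this (the private state struct and
its fields are invented; error handling omitted):

    #include "postgres.h"
    #include "executor/tuptable.h"
    #include "foreign/fdwapi.h"

    typedef struct MyFdwScanState
    {
        bool        query_sent;     /* remote query started yet? */
        /* ... connection handle, cursor name, etc. ... */
    } MyFdwScanState;

    static void
    myBeginForeignScan(ForeignScanState *node, int eflags)
    {
        /* Only prepare state; fetch nothing yet. */
        MyFdwScanState *fsstate = palloc0(sizeof(MyFdwScanState));

        node->fdw_state = fsstate;
    }

    static TupleTableSlot *
    myIterateForeignScan(ForeignScanState *node)
    {
        MyFdwScanState *fsstate = (MyFdwScanState *) node->fdw_state;

        if (!fsstate->query_sent)
        {
            /* Parameters are set by now; safe to run the remote query. */
            fsstate->query_sent = true;
        }
        /* ... fetch and return the next tuple; empty slot when done ... */
        return ExecClearTuple(node->ss.ss_ScanTupleSlot);
    }

A ReScanForeignScan callback would then just reset query_sent, which also
covers the initial ReScan described above.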

regards, tom lane



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 ... On these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.

 Right, and the reason that a spinlock fixes it is that we have memory
 barrier instructions built into the spinlock code sequences on machines
 where it matters.

 To get to the point where we could do the sort of optimization Robert
 is talking about, someone will have to build suitable primitives for
 all the platforms we support.  In the cases where we use gcc ASM in
 s_lock.h, it shouldn't be too hard to pull out the barrier
 instruction(s) ... but on platforms where we rely on OS-supplied
 functions, some research is going to be needed.

Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
on a backend-private slock_t should work anywhere that PostgreSQL
works at all[1]. That will probably be slower than a memory fence
instruction and certainly slower than a compiler barrier, but the
point is that - right now - we're doing it the slow way everywhere.

I think the real challenge is going to be testing.  If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
Getting it to work on x86 is not the hard part.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] This was a suggestion from Noah Misch.  I wasn't quite convinced
when he initially made it, but having studied the issue a lot more, I
now am.  The CPU doesn't know how many processes have the memory
mapped into their address space.



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Dave Page
On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas robertmh...@gmail.com wrote:

 I think the real challenge is going to be testing.  If anyone has a
 machine with weak memory ordering they can give me access to, that
 would be really helpful for flushing the bugs out of this stuff.
 Getting it to work on x86 is not the hard part.

I believe there's a PPC box in our storage facility in NJ that we
might be able to dig out for you. There's also a couple in our India
office. Let me know if they'd be of help.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Yeb Havinga

On 2011-07-21 15:03, Robert Haas wrote:

On Thu, Jul 21, 2011 at 4:00 AM, Yeb Havinga yebhavi...@gmail.com wrote:

Besides that, I have to admit having problems understanding why the 5MB cache
for pg_seclabel is a problem; its memory consumption is linear in the
size of the underlying database (in contrast with the other cache, storing
access vectors, which would have O(n*m) space complexity if it didn't
reclaim space). So it is proportional to the number of objects in a database,
and in size it seems to be in the same order as pg_proc, pg_class and
pg_attribute.

Fair enough.  I'm not convinced that the sheer quantity of memory use
is a problem, although I would like to see a few more test results
before we decide that definitively.  I *am* unwilling to pay the
startup overhead of initializing an extra 2048-bucket syscache that only
sepgsql users will actually need.


Is it possible to only include the syscache on --enable-selinux 
configurations? It would imply physical data incompatibility with 
standard configurations, but that's also true for e.g. the block size.


Also, the tests I did with varying bucket sizes suggested that
decreasing the syscache to 256 buckets didn't show a significant
performance decrease compared to 2048 buckets for the restorecon test,
which hits over 3000 objects with security labels. My guess is that that
is a fair middle-of-the-road database schema size. Are you unwilling to
pay the startup overhead for an extra 256-bucket syscache?


--

Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data




Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 3:22 PM, Dave Page dp...@pgadmin.org wrote:
 On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas robertmh...@gmail.com wrote:
 I think the real challenge is going to be testing.  If anyone has a
 machine with weak memory ordering they can give me access to, that
 would be really helpful for flushing the bugs out of this stuff.
 Getting it to work on x86 is not the hard part.

 I believe there's a PPC box in our storage facility in NJ that we
 might be able to dig out for you. There's also a couple in our India
 office. Let me know if they'd be of help.

Yes!

More processors is better, of course, but having anything at all to
test on would be an improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Dave Page
On Thu, Jul 21, 2011 at 8:25 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jul 21, 2011 at 3:22 PM, Dave Page dp...@pgadmin.org wrote:
 On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas robertmh...@gmail.com wrote:
 I think the real challenge is going to be testing.  If anyone has a
 machine with weak memory ordering they can give me access to, that
 would be really helpful for flushing the bugs out of this stuff.
 Getting it to work on x86 is not the hard part.

 I believe there's a PPC box in our storage facility in NJ that we
 might be able to dig out for you. There's also a couple in our India
 office. Let me know if they'd be of help.

 Yes!

 More processors is better, of course, but having anything at all to
 test on would be an improvement.

OK, will check with India first, as it'll be easier for them to deploy.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 3:25 PM, Yeb Havinga yebhavi...@gmail.com wrote:
 Is it possible to only include the syscache on --enable-selinux
 configurations? It would imply physical data incompatibility with standard
 configurations, but that's also true for e.g. the block size.

Not really.  SECURITY LABEL is supposedly a generic facility that can
be used by a variety of providers, and the regression tests load a
dummy provider which works on any platform to test that it hasn't
gotten broken.

 Also, the tests I did with varying bucket sizes suggested that decreasing
 the syscache to 256 buckets didn't show a significant performance decrease
 compared to 2048 buckets for the restorecon test, which hits over 3000
 objects with security labels. My guess is that that is a fair
 middle-of-the-road database schema size. Are you unwilling to pay the
 startup overhead for an extra 256-bucket syscache?

Not sure.  I'd rather not, if it's easy to rejigger things so we don't
have to.  I don't think this is necessarily a hard problem to solve -
it's just that no one has tried yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Yeb Havinga


Is it possible to only include the syscache on --enable-selinux 
configurations? It would imply physical data incompatibility with 
standard configurations, but that's also true for e.g. the block size.


Also, the tests I did with varying bucket sizes suggested that 
decreasing the syscache to 256 didn't show a significant performance 
decrease compared to the 2048 #buckets, for the restorecon test, which 
hits over 3000 objects with security labels. My guess is that that is 
a fair middle of the road database schema size. Are you unwilling to 
pay the startup overhead for an extra 256 syscache?




Hello KaiGai-san,

off-list,

I was wondering why the catalog pg_seclabel exists at all. Why not store 
the labels together with the objects (pg_class, pg_attribute, etc.)? The 
syscache wouldn't be needed in that case.


regards,
Yeb



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Kohei KaiGai
2011/7/21 Yeb Havinga yebhavi...@gmail.com:

 Is it possible to only include the syscache on --enable-selinux
 configurations? It would imply physical data incompatibility with standard
 configurations, but that's also true for e.g. the block size.

 Also, the tests I did with varying bucket sizes suggested that decreasing
 the syscache to 256 didn't show a significant performance decrease compared
 to the 2048 #buckets, for the restorecon test, which hits over 3000 objects
 with security labels. My guess is that that is a fair middle of the road
 database schema size. Are you unwilling to pay the startup overhead for an
 extra 256 syscache?


 Hello KaiGai-san,

 off-list,

Unfortunately, not so...

 I was wondering why the catalog pg_seclabel exists at all. Why not store the
 labels together with the objects (pg_class, pg_attribute, etc.)? The syscache
 wouldn't be needed in that case.

Currently, sepgsql supports assigning security labels to only a limited
number of object classes (schemas, relations, columns and functions).
However, we plan to control access to all of the objects managed by
PostgreSQL, not only these four.

If we needed to expand a system catalog every time an object class gained
support, the patch would be more invasive and harder to get upstream.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I think the real challenge is going to be testing.  If anyone has a
 machine with weak memory ordering they can give me access to, that
 would be really helpful for flushing the bugs out of this stuff.

There are multi-CPU PPCen in the buildfarm, or at least there were last
time I broke the sinval code ;-).  Note that testing on a single-core
PPC will prove nothing.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.1] sepgsql - userspace access vector cache

2011-07-21 Thread Kohei KaiGai
2011/7/21 Robert Haas robertmh...@gmail.com:
 On Thu, Jul 21, 2011 at 3:25 PM, Yeb Havinga yebhavi...@gmail.com wrote:
 Is it possible to only include the syscache on --enable-selinux
 configurations? It would imply physical data incompatibility with standard
 configurations, but that's also true for e.g. the block size.

 Not really.  SECURITY LABEL is supposedly a generic facility that can
 be used by a variety of providers, and the regression tests load a
 dummy provider which works on any platform to test that it hasn't
 gotten broken.

 Also, the tests I did with varying bucket sizes suggested that decreasing
 the syscache to 256 didn't show a significant performance decrease compared
 to the 2048 #buckets, for the restorecon test, which hits over 3000 objects
 with security labels. My guess is that that is a fair middle of the road
 database schema size. Are you unwilling to pay the startup overhead for an
 extra 256 syscache?

 Not sure.  I'd rather not, if it's easy to rejigger things so we don't
 have to.  I don't think this is necessarily a hard problem to solve -
 it's just that no one has tried yet.

Now I am inclined to implement a cache mechanism that translates an
ObjectAddress into a security label inside the sepgsql module itself,
rather than using the generic syscache, although it requires a new hook
in PrepareForTupleInvalidation() as Robert suggested in this thread.
Indeed, it seems worthwhile not to allocate memory that goes unused for
90% of users, from the perspective of both startup performance and
resource consumption.

In addition, we could potentially build a cache that is well optimized
for SELinux access control, for example one that reclaims recently
unused entries (see the sketch below). So I'd like to focus on the code
in sepgsql/uavc.c right now.
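
A rough sketch of what such a cache could look like - all names here are
illustrative, not the actual uavc.c code, and the hash-table lookup that
would sit in front of the LRU list is omitted:

    #include <stdlib.h>

    typedef struct LabelCacheEntry
    {
        unsigned int classId;            /* OID of the system catalog */
        unsigned int objectId;           /* OID of the object itself */
        char        *label;              /* cached security label */
        struct LabelCacheEntry *prev;    /* doubly linked LRU list */
        struct LabelCacheEntry *next;
    } LabelCacheEntry;

    static LabelCacheEntry *lru_head;    /* most recently used */
    static LabelCacheEntry *lru_tail;    /* least recently used */
    static int cache_count;

    #define LABEL_CACHE_MAX 256          /* illustrative threshold */

    /* Move an entry to the MRU end whenever it is referenced. */
    static void
    label_cache_touch(LabelCacheEntry *e)
    {
        if (e == lru_head)
            return;
        if (e->prev)
            e->prev->next = e->next;
        if (e->next)
            e->next->prev = e->prev;
        if (e == lru_tail)
            lru_tail = e->prev;
        e->prev = NULL;
        e->next = lru_head;
        if (lru_head)
            lru_head->prev = e;
        lru_head = e;
        if (lru_tail == NULL)
            lru_tail = e;
    }

    /* Reclaim the least recently used entries once the cache grows
     * beyond the threshold. */
    static void
    label_cache_reclaim(void)
    {
        while (cache_count > LABEL_CACHE_MAX && lru_tail != NULL)
        {
            LabelCacheEntry *victim = lru_tail;

            lru_tail = victim->prev;
            if (lru_tail)
                lru_tail->next = NULL;
            else
                lru_head = NULL;
            free(victim->label);
            free(victim);
            cache_count--;
        }
    }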

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 12:51 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 Since we force the subsequent vacuum to also look at the pages scanned and
 pruned by the previous failed vacuum, all the pages that have dead-vacuum line
 pointers would have a new stamp once the vacuum finishes successfully, and
 pg_class would have the same stamp.

That seems a bit fragile.  One of the things we've talked about doing
is skipping pages that are pinned by some other backend.  Now maybe
that would be infrequent enough not to matter... but...

Also, I'm not sure that's the only potential change that would break this.

I think we are better off doing only equality comparisons and dodging
this problem altogether.

 I thought more about it and for a moment believed that we can do this with
 just a bit since we rescan the pages with dead and dead-vacuum line
 pointers after an aborted vacuum, but concluded that a bit or a small
 counter is not good enough since other backends might be running with a
 stale value and would get fooled into believing that they can collect the
 dead-vacuum line pointers before the index pointers are actually removed. We
 can still use a 32-bit counter though since the wrap-around for that is
 practically very large for any backend to still run with such a stale
 counter (you would need more than 1 billion vacuums on the same table in
 between for you to hit this).

I think that's a safe assumption.

 The patch as
 submitted uses lp_off to store a single bit, to distinguish between
 vacuum and dead-vacuumed, but we could actually have (for greater
 safety and debuggability) a 15-byte counter that just wraps around
 from 32,767 to 1.  (Maybe it would be wise to reserve a few counter
 values, or a few bits, or both, for future projects.)  That would
 eliminate the need to touch PageRepairFragmentation() or use the
 special space, since all the information would be in the line pointer
 itself.  Not having to rearrange the page to reclaim dead line
 pointers is appealing, too.

 Not sure if I get you here. We need a mechanism to distinguish between dead
 and dead-vacuum line pointers. How would the counter (which I assume you
 mean 15-bit and not byte) help solve that? Or are you just suggesting
 replacing the LSN with the counter in the page header?

Just-plain-dead line pointers would have lp_off = 0.  Dead-vacuumed
line pointers would have lp_off != 0.  The first vacuum would use
lp_off = 1, the next one lp_off = 2, etc.

Actually, come to think of it, we could fit a 30-bit counter into the
line pointer.  There are 15 unused bits in lp_off and 15 unused bits
in lp_len.
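
To make that packing concrete, here is a rough sketch (not the submitted
patch; the helper functions are hypothetical, though the bitfield layout
matches the real ItemIdData in storage/itemid.h):

    typedef struct ItemIdData
    {
        unsigned    lp_off:15,      /* normally: offset to tuple */
                    lp_flags:2,     /* state of line pointer */
                    lp_len:15;      /* normally: byte length of tuple */
    } ItemIdData;

    #define LP_DEAD 3               /* dead, may or may not have storage */

    /* counter == 0 means just-plain-dead; a nonzero value records which
     * vacuum generation marked the item dead-vacuumed */
    static void
    ItemIdSetDeadVacuumed(ItemIdData *lp, unsigned counter)
    {
        lp->lp_flags = LP_DEAD;
        lp->lp_off = counter & 0x7FFF;          /* low 15 bits */
        lp->lp_len = (counter >> 15) & 0x7FFF;  /* high 15 bits */
    }

    static unsigned
    ItemIdGetVacuumGeneration(const ItemIdData *lp)
    {
        return ((unsigned) lp->lp_len << 15) | lp->lp_off;
    }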

  Is there something in place to make sure that pruning uses an up-to-date
  relindxvacxlogid/off value? I guess it doesn't matter if it's
  out-of-date,
  you'll just miss the opportunity to remove some dead tuples.

 This seems like a tricky problem, because it could cause us to
 repeatedly fail to remove the same dead line pointers, which would be
 poor.  We could do something like this: after updating pg_class,
 vacuum send an interrupt to any backend which holds RowExclusiveLock
 or higher on that relation.  The interrupt handler just sets a flag.
 If that backend does heap_page_prune() and sees the flag set, it knows
 that it needs to recheck pg_class.  This is a bit grotty and doesn't
 completely close the race condition (the signal might not arrive in
 time), but it ought to make it narrow enough not to matter in
 practice.

 I am not too excited about adding that complexity to the code. Even if a
 backend does not have up-to-date value, it will fail to collect the
 dead-vacuum pointers, but soon either it will catch up or some other backend
 will remove them or the next vacuum will take care of it.

If we use a counter that is large enough that we don't have to worry
about wrap-around, I guess that's OK, though it seems a little weird
to think about having different backends running with different ideas
about the correct counter value.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 3:53 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I think the real challenge is going to be testing.  If anyone has a
 machine with weak memory ordering they can give me access to, that
 would be really helpful for flushing the bugs out of this stuff.

 There are multi-CPU PPCen in the buildfarm, or at least there were last
 time I broke the sinval code ;-).  Note that testing on a single-core
 PPC will prove nothing.

Yeah, I was just thinking about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Single pass vacuum - take 1

2011-07-21 Thread Pavan Deolasee
On Thu, Jul 21, 2011 at 4:01 PM, Robert Haas robertmh...@gmail.com wrote:



 I think we are better off doing only equality comparisons and dodging
 this problem altogether.


Fair enough.



 Just-plain-dead line pointers would have lp_off = 0.  Dead-vacuumed
 line pointers would have lp_off != 0.  The first vacuum would use
 lp_off = 1, the next one lp_off = 2, etc.

 Actually, come to think of it, we could fit a 30-bit counter into the
 line pointer.  There are 15 unused bits in lp_off and 15 unused bits
 in lp_len.


That's clever! I think we can go down this path and completely avoid any special
area or additional header fields.



 If we use a counter that is large enough that we don't have to worry
 about wrap-around, I guess that's OK, though it seems a little weird
 to think about having different backends running with different ideas
 about the correct counter value.


I think that's fine. For example, every backend runs with a different
RecentXmin today and that doesn't impact any functionality. It only limits
how much they can prune at any given time. The same would happen with a
stale counter.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Noah Misch
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 For the last week or so, in between various other tasks, I've been
 trying to understand why performance drops off when you run pgbench -n
 -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
 counts.  The answer, in a word, is SIGetDataEntries().  I believe we
 need to bite the bullet and rewrite this using a lock-free algorithm,
 using memory barriers on processors with weak memory ordering.
 Perhaps there is another way to do it, but nothing I've tried has
 really worked so far, and I've tried quite a few things.  Here's the
 data.
 
 On unpatched master, performance scales pretty much linearly out to 32
 clients.  As you add more clients, it drops off:

 [80 clients]
 tps = 132518.586371 (including connections establishing)
 tps = 130968.749747 (including connections establishing)
 tps = 132574.338942 (including connections establishing)

 [80 clients, with lazy vxid locks and sinval-unlocked]
 tps = 203256.701227 (including connections establishing)
 tps = 190637.957571 (including connections establishing)
 tps = 190228.617178 (including connections establishing)

Nice numbers.  The sinval-unlocked.patch implementation looks like it's taking a
sound direction.

In
http://archives.postgresql.org/message-id/ca+tgmobbxmh_9zjudheswo6m8sbmb5hdzt+3chcluv5eztv...@mail.gmail.com,
you quoted 210k TPS when you stubbed out AcceptInvalidationMessages().  Is it
correct to conclude that AcceptInvalidationMessages() still reduces the
transaction rate by 5-10% with this stack of patches?

-- 
Noah Misch                                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and log file output on Windows

2011-07-21 Thread Bruce Momjian

I started thinking more about this --- we already allow multiple
processes to write to a single file when running Postgres as a backend
on Windows.  I think it is these open() flags that make it possible:

(FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE),

However, we are not going to be able to make all shell and command-line
tools use those flags, so the best solution seems to just avoid the
extra logging.  I just added a C comment to pg_upgrade to document this.
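
For illustration, here is roughly what opening a file with those share
flags looks like at the Win32 level (a sketch only; the helper name and
error handling are made up, but the flags are the ones above):

    #include <windows.h>

    /* Open a log file so that other cooperating processes can also read,
     * write, and delete it.  Shell redirection does not open files with
     * these share flags, which is why ">> logfile" from multiple
     * processes fails on Windows. */
    static HANDLE
    open_shared_log(const char *path)
    {
        return CreateFileA(path,
                           GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           NULL,                    /* default security */
                           OPEN_ALWAYS,             /* create if missing */
                           FILE_ATTRIBUTE_NORMAL,
                           NULL);                   /* no template file */
    }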

---

bruce wrote:
 
 I have fixed the bug below with the attached patches for 9.0, 9.1, and
 9.2.  I did a minimal patch for 9.0 and 9.1, and a more thorough patch
 for 9.2.
 
 The use of -l/log was tested originally in pg_migrator (for 8.4) and
 reported to be working, but it turns out it only worked in 'check' mode,
 and threw an error during actual upgrade.  This bug was reported to me
 recently by EnterpriseDB testing of PG 9.0 on Windows.  The 9.2 C
 comments should make it clear that Windows doesn't allow multiple
 processes to write to the same file and should avoid future breakage.
 
 ---
 
 Bruce Momjian wrote:
  Has anyone successfully used pg_upgrade 9.0 with -l (log) on Windows?
  
  I received a private email bug report that pg_upgrade 9.0 does not work
  with the -l/log option on Windows.  The error is:
  
  Analyzing all rows in the new cluster
  "c:/MinGW/msys/1.0/home/edb/inst/bin/vacuumdb" --port 55445 --username
  "edb" --all --analyze
  >> "c:/MinGW/msys/1.0/home/edb/auxschedule/test.log" 2>&1
  The process cannot access the file because it is being used by another
  process.
 
  What has me confused is this same code exists in pg_migrator, which was
  fixed to work with -l on Windows by Hiroshi Saito with this change:
  
  /*
   * On Win32, we can't send both server output and pg_ctl output
   * to the same file because we get the error:
   * "The process cannot access the file because it is being used by
   * another process."
   * so we have to send pg_ctl output to 'nul'.
   */
  sprintf(cmd, SYSTEMQUOTE "\"%s/pg_ctl\" -l \"%s\" -D \"%s\" "
               "-o \"-p %d -c autovacuum=off -c "
               "autovacuum_freeze_max_age=2000000000\" "
               "start >> \"%s\" 2>&1" SYSTEMQUOTE,
          bindir, ctx->logfile, datadir, port,
  #ifndef WIN32
          ctx->logfile);
  #else
          DEVNULL);
  #endif
  
  The fix was not to use the same log file and output file for pg_ctl. 
  But as you can see, the pg_ctl and vacuumdb code is unchanged:
  
  prep_status(ctx, "Analyzing all rows in the new cluster");
  exec_prog(ctx, true,
            SYSTEMQUOTE "\"%s/vacuumdb\" --port %d --username \"%s\" "
            "--all --analyze >> %s 2>&1" SYSTEMQUOTE,
            ctx->new.bindir, ctx->new.port, ctx->user, ctx->logfile);
  
  I can't figure out if there is something odd about this user's setup or
  if there is a bug in pg_upgrade with -l on Windows.
  
  -- 
Bruce Momjian  br...@momjian.ushttp://momjian.us
EnterpriseDB http://enterprisedb.com
  
+ It's impossible for everything to be true. +
  
  -- 
  Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-hackers
 
 -- 
   Bruce Momjian  br...@momjian.ushttp://momjian.us
   EnterpriseDB http://enterprisedb.com
 
   + It's impossible for everything to be true. +


 diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
 new file mode 100644
 index 1a515e7..37d2eed
 *** a/contrib/pg_upgrade/pg_upgrade.c
 --- b/contrib/pg_upgrade/pg_upgrade.c
 *** prepare_new_cluster(migratorContext *ctx
 *** 161,168 ****
     prep_status(ctx, "Analyzing all rows in the new cluster");
     exec_prog(ctx, true,
               SYSTEMQUOTE "\"%s/vacuumdb\" --port %d --username \"%s\" "
 !             "--all --analyze >> %s 2>&1" SYSTEMQUOTE,
 !             ctx->new.bindir, ctx->new.port, ctx->user, ctx->logfile);
     check_ok(ctx);
 
     /*
 --- 161,174 ----
     prep_status(ctx, "Analyzing all rows in the new cluster");
     exec_prog(ctx, true,
               SYSTEMQUOTE "\"%s/vacuumdb\" --port %d --username \"%s\" "
 !             "--all --analyze >> \"%s\" 2>&1" SYSTEMQUOTE,
 !             ctx->new.bindir, ctx->new.port, ctx->user,
 ! #ifndef WIN32
 !             ctx->logfile
 ! #else
 !             DEVNULL
 ! #endif
 !             );
     check_ok(ctx);
 
     /*
 *** prepare_new_cluster(migratorContext *ctx
 *** 174,181 ****
     prep_status(ctx, "Freezing all rows on the new cluster");


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Jim Nasby
On Jul 19, 2011, at 4:20 PM, David E. Wheeler wrote:
 On Jul 19, 2011, at 2:06 PM, Josh Berkus wrote:
 
 I am strongly in favor of having a *timezone* data type and some system
 whereby we can uniquely identify timezones in the Zic database.
 
 CREATE OR REPLACE FUNCTION is_timezone(
tz CITEXT
 ) RETURNS BOOLEAN LANGUAGE plpgsql STABLE AS $$
...
 CREATE DOMAIN timezone AS CITEXT CHECK ( is_timezone( VALUE ) );

Storing giant globs of text with every timestamp field is really ugly.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Jim Nasby
On Jul 19, 2011, at 11:22 AM, Ian Caulfield wrote:
 On 19 July 2011 17:11, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Josh Berkus j...@agliodbs.com wrote:
 The timestamp and the timezone in which that timestamp was
 entered are two separate pieces of data and *ought* to be in two
 separate fields.
 
 So, if you're grabbing a timestamp and the time zone for it, how
 do you ensure you've done that atomically if you're at the
 boundary of a DST change?
 
 In my view of the world, the timezone that you are in is not an
 object that changes across a DST boundary.
 
 You're right -- the moment in time should be fixed like in the
 current PostgreSQL timestamp with time zone, and the time zone
 doesn't change with DST.  Not an intentional red herring, but
 definitely some muddy thinking there.
 
 There was an earlier point made that if someone puts eg 5pm local time
 two years in the future into the database, and then the DST boundary
 gets moved subsequently, some applications would like the value to
 still say 5pm local time, even though that means it now refers to a
 different point in absolute time - this potentially seems like a
 useful feature. Retroactive timezone changes wouldn't make a lot of
 sense in this case though...

Right; and timezones aren't supposed to change retroactively. The ZIC database 
is specifically set up so that it knows the history of TZ changes and deals with 
the past correctly.

 I guess there are three concepts of time here - an absolute fixed time
 with no reference to a timezone, a time with a timezone that is still
 set as a fixed point in time, or a local time in a specific timezone
 that would move if the timezone definition changed.

Or, another way to put the third class: a timestamp that remembers what its 
original timezone was, so that you can refer to it in a common timezone (such as 
UTC), OR you can refer to it at its original, local time. That's our exact 
need for this: we have different businesses that operate in different 
timezones. Generally, we only care about things in local time, but there are 
cases (such as event logging) where we could care about local *OR* unified time.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread David E. Wheeler
On Jul 21, 2011, at 2:39 PM, Jim Nasby wrote:

 CREATE OR REPLACE FUNCTION is_timezone(
   tz CITEXT
 ) RETURNS BOOLEAN LANGUAGE plpgsql STABLE AS $$
 ...
 CREATE DOMAIN timezone AS CITEXT CHECK ( is_timezone( VALUE ) );
 
 Storing giant globs of text with every timestamp field is really ugly.

You work with what you've got.

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Jim Nasby
On Jul 19, 2011, at 4:06 PM, Josh Berkus wrote:
 I have my doubts about that, and I hope not.  These details haven't been
 discussed at all; I only started this thread to get community approval
 on cataloguing the TZs.
 
 I am strongly in favor of having a *timezone* data type and some system
 whereby we can uniquely identify timezones in the Zic database.  That
 would be tremendously useful for all sorts of things.  I'm just
 asserting that those who want a composite timestamp+saved-time-zone data
 type have not thought about all of the complications involved.

Having to deal with timezones completely separately from their timestamps is a 
huge PITA. That said, if we had a timezone datatype there's at least the 
possibility of using a composite type to deal with all of this. Or at least we 
can just create a custom datatype using existing tools... the only part of this 
that I see that actually requires closer core support is the timezone data 
itself.

So if the community is OK with adding a timezone datatype then we can focus on 
that and leave the timestamptztz data type as an add-on (at least assuming we 
don't run into any gotchas).

Alvaro, please speak up if there are any technical issues here that I've missed.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Tom Lane
Jim Nasby j...@nasby.net writes:
 On Jul 19, 2011, at 11:22 AM, Ian Caulfield wrote:
 There was an earlier point made that if someone puts eg 5pm local time
 two years in the future into the database, and then the DST boundary
 gets moved subsequently, some applications would like the value to
 still say 5pm local time, even though that means it now refers to a
 different point in absolute time - this potentially seems like a
 useful feature. Retroactive timezone changes wouldn't make a lot of
 sense in this case though...

 Right; and timezones aren't supposed to change retroactively. The ZIC 
 database is specifically set up so that it knows the history of TZ changes and 
 deals with the past correctly.

You haven't noticed that at least two or three times a year, there are
historical corrections in the ZIC database?  The mapping between local
time and UTC might be less likely to change for a time instant in the
past than one in the future, but it would be folly to assume that it's
immutable in either direction.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: new contrib module plpgsql's embeded sql validator

2011-07-21 Thread Jim Nasby
On Jul 19, 2011, at 10:51 PM, Pavel Stehule wrote:
 If you mean that such checks would be done automatically, no, they
 shouldn't be.  Consider a function that creates a table and then uses
 it, or even just depends on using a table that doesn't yet exist when
 you do CREATE FUNCTION.
 
 Yes, any deep check is impossible for a function that uses temporary tables.
 
 plpgsql_lint is not a silver bullet - for these cases it is necessary to
 disable the lint.
 
 I can't speak generally - I have no idea what percentage of functions
 access temporary tables - in my last project I used zero temp tables in
 ca. 300 KB of plpgsql code.
 
 The worse problem is the new dependencies between functions. I use
 a workaround - something like headers.

You can work around temp table issues the same way: just define the temp table 
before you create the function.

In practice, if I have a function that depends on a temp table it either 
creates it itself if it doesn't already exist or I have a separate function to 
create the table; that way you have a single place that has the temp table 
definition, and that is in the database itself.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 4:54 PM, Noah Misch n...@2ndquadrant.com wrote:
 On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 For the last week or so, in between various other tasks, I've been
 trying to understand why performance drops off when you run pgbench -n
 -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
 counts.  The answer, in a word, is SIGetDataEntries().  I believe we
 need to bite the bullet and rewrite this using a lock-free algorithm,
 using memory barriers on processors with weak memory ordering.
 Perhaps there is another way to do it, but nothing I've tried has
 really worked so far, and I've tried quite a few things.  Here's the
 data.

 On unpatched master, performance scales pretty much linearly out to 32
 clients.  As you add more clients, it drops off:

 [80 clients]
 tps = 132518.586371 (including connections establishing)
 tps = 130968.749747 (including connections establishing)
 tps = 132574.338942 (including connections establishing)

 [80 clients, with lazy vxid locks and sinval-unlocked]
 tps = 203256.701227 (including connections establishing)
 tps = 190637.957571 (including connections establishing)
 tps = 190228.617178 (including connections establishing)

 Nice numbers.  The sinval-unlocked.patch implementation looks like it's 
 taking a
 sound direction.

 In
 http://archives.postgresql.org/message-id/ca+tgmobbxmh_9zjudheswo6m8sbmb5hdzt+3chcluv5eztv...@mail.gmail.com,
 you quoted 210k TPS when you stubbed out AcceptInvalidationMessages().  Is it
 correct to conclude that AcceptInvalidationMessages() still reduces the
 transaction rate by 5-10% with this stack of patches?

Good question - I have not tested.

One idea I just had... if we use a 64-bit counter for maxMsgNum, maybe
we could make AcceptInvalidationMessages() a macro, something like
this:

    if (MyProcState->nextMsgNum != shmInvalState->maxMsgNum)
        ReallyAcceptInvalidationMessages();

That ought to be extremely cheap and - if we use 64-bit counters for
the message-number counters - safe.  You might object that the load of
maxMsgNum might migrate backward, but it can't possibly back up any
further than the preceding lock acquisition, since that's required to
be a full memory barrier on every architecture.  And if we haven't
acquired a relevant lock, then a relevant sinval message could show up
an instant after we check, regardless of the implementation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Florian Pflug
On Jul21, 2011, at 21:15 , Robert Haas wrote:
 On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 ... On these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.
 
 Right, and the reason that a spinlock fixes it is that we have memory
 barrier instructions built into the spinlock code sequences on machines
 where it matters.
 
 To get to the point where we could do the sort of optimization Robert
 is talking about, someone will have to build suitable primitives for
 all the platforms we support.  In the cases where we use gcc ASM in
 s_lock.h, it shouldn't be too hard to pull out the barrier
 instruction(s) ... but on platforms where we rely on OS-supplied
 functions, some research is going to be needed.
 
 Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
 on a backend-private slock_t should work anywhere that PostgreSQL
 works at all[1]. That will probably be slower than a memory fence
 instruction and certainly slower than a compiler barrier, but the
 point is that - right now - we're doing it the slow way everywhere.

As I discovered while playing with various lockless algorithms to
improve our LWLocks, spin locks aren't actually a replacement for
a (full) barrier.

Lock acquisition only really needs to guarantee that loads and stores
which come after the acquisition operation in program order (i.e., in
the instruction stream) aren't globally visible before that operation
completes. This kind of barrier behaviour is often fittingly called
acquire barrier.

Similarly, a lock release operation only needs to guarantee that loads
and stores which occur before that operation in program order are
globally visible before the release operation completes. This, again,
is fittingly called release barrier.

Now assume the following code fragment

global1 = 1;
SpinLockAcquire();
SpinLockRelease();
global2 = 1;

If SpinLockAcquire() has acquire barrier semantics, and SpinLockRelease()
has release barrier semantics, then it's possible for the store to global1
to be delayed until after SpinLockAcquire(), and similarly for the store
to global2 to be executed before SpinLockRelease() completes. In other
words, what happens is

SpinLockAcquire();
global1 = 1;
global2 = 1;
SpinLockRelease();

But once that can happen, there's no reason that it couldn't also be

SpinLockAcquire();
global2 = 1;
global1 = 1;
SpinLockRelease();

I didn't check if any of our spin lock implementations is actually affected
by this, but it doesn't seem wise to rely on them being full barriers, even
if it may be true today.
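
For anyone who wants to see the two barrier flavors spelled out in code,
here is a minimal sketch using C11 atomics (which postdate this mail and
are used here purely for illustration):

    #include <stdatomic.h>

    static atomic_int lock;     /* 0 = free, 1 = held */
    static int global1, global2;

    void
    fragment(void)
    {
        global1 = 1;            /* may sink below the acquire barrier */

        /* acquire barrier: later accesses cannot be hoisted above it */
        while (atomic_exchange_explicit(&lock, 1, memory_order_acquire))
            ;                   /* spin */

        /* release barrier: earlier accesses cannot sink below it */
        atomic_store_explicit(&lock, 0, memory_order_release);

        global2 = 1;            /* may be hoisted above the release barrier */
    }

Once both stores can migrate into the critical section, nothing orders
them relative to each other; only a full barrier between them would.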

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Christopher Browne
On Thu, Jul 21, 2011 at 5:48 PM, Jim Nasby j...@nasby.net wrote:
 On Jul 19, 2011, at 4:06 PM, Josh Berkus wrote:
 I have my doubts about that, and I hope not.  These details haven't been
 discussed at all; I only started this thread to get community approval
 on cataloguing the TZs.

 I am strongly in favor of having a *timezone* data type and some system
 whereby we can uniquely identify timezones in the Zic database.  That
 would be tremendously useful for all sorts of things.  I'm just
 asserting that those who want a composite timestamp+saved-time-zone data
 type have not thought about all of the complications involved.

 Having to deal with timezone's completely separate from their timestamps is a 
 huge PITA. That said, if we had a timezone datatype there's at least the 
 possibility of using a composite type to deal with all of this. Or at least 
 we can just create a custom datatype using existing tools... the only part of 
 this that I see that actually requires closer core support is the timezone 
 data itself.

 So if the community is OK with adding a timezone datatype then we can focus 
 on that and leave the timestamptztz data type as an add-on (at least assuming 
 we don't run into any gotchas).

As I have been watching this whole thread, my inclination has been to
look at this from a Prolog perspective, where we think about the
database as indicating a series of assertions about facts, from which
we then try to reason.

I suspect that determining what *really* needs to get recorded depends
on this.  And it seems to me that trying to head down the path of
defining oid-based lookups of timezone names may be putting the cart
before the horse.

There are a number of facts about a timestamp:

1.  What time did the database server think it was?

SELECT NOW();

captures the database's concept of what time it was, complete with:
a) The time, based, I think, on UT1.  With the caveat that there's no
certainty that the DB server's time is necessarily correct.
b) An encoding of the timezone offset based on the value of the
TimeZone GUC for this connection.

If one is running an NTP daemon, pointing to a decently-connected
network of NTP servers, then it's likely that this time is pretty
accurate.  And most of the time, I'd be inclined to treat this as
authoritative, and contend that anything else is likely to be less
correct and less easy to work with.

The goal of this discussion thread is to record another timestamp with
a different basis.  It's not entirely clear what that basis is.  I'll
suggest one, which mostly underlines my contention that it's likely
less correct and less easy to work with than having a column
defined as...
   some_timestamp timestamp with timezone default NOW()

2.  Client-based timestamp, comprising two things:

a) A time, ascertained by the client.
b) A timezone, ascertained by the client.

Note that timezones are pretty open-ended.  There is an authoritative
encoding defined in the tz database, but there are other values used
out there.  We had to patch Slony-I to have it use 'ISO' timestamps,
and recommend running in GMT/UTC, because there are values that blow
things up.

For instance, on AIX, there is a habit for boxes to set TZ=CUT0, out
of the box, which isn't on what PostgreSQL considers to be the
official list.

On the more whimsical side of things, Joey Hess, a Debian developer
noted for such things as ikiwiki, etckeeper, git-annex, decided to
create his very own custom timezone, JEST, because he was irritated
about DST.

http://kitenet.net/~joey/blog/entry/howto_create_your_own_time_zone/
http://kitenet.net/~joey/blog/entry/JEST_results/

That whimsical entry won't be going into tzdata, and while we could
discount this case as whimsy, it's not out of the question for
organizations such as nation states to legislate their own zones, which
we can't be certain will necessarily get into tzdata.

There are enough aliases and possibilities of local national decisions
to make it at least somewhat troublesome to treat this as something
that can be considered fixed down to the OID level.

My conclusion would be that if someone is really, really, really keen
on capturing their own notion of timezone, then this fits with the
notion that, if they want to have something that could be treated as
remotely authoritative, they should capture a multiplicity of pieces
of datestamp information, and actively accept that this will be pretty
duplicative.

- I'd commend capturing NOW() in a timestamptz field.  That gives you:
1.  What time the DB server thought it was, in terms of UT1
2.  What timezone it thought was tied to that connection.
- Also, I'd be inclined to capture, in plain text form:
3.  A client-recorded timestamp.  I'm agnostic as to whether this has
*any* validation done on it; I'd think it plausible that this is
simply a text field that might require a human to interpret it.
4.  A client-recorded timezone.  This would be a plain text field, and
I'm not certain it's of any 

Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Dan Ports
On Thu, Jul 21, 2011 at 02:31:15PM -0400, Robert Haas wrote:
 1. Machines with strong memory ordering.  On this category of machines
 (which include x86), the CPU basically does not perform loads or
 stores out of order.  On some of these machines, it is apparently
 possible for there to be some ordering of stores relative to loads,
 but if the program stores two values or loads two values, those
 operations will be performed in the same order they appear in the
 program. 

This is all correct, but...

 The main thing you need to make your code work reliably on
 these machines is a primitive that keeps the compiler from reordering
 your code during optimization. 

If you're suggesting that hardware memory barriers aren't going to be
needed to implement lock-free code on x86, that isn't true. Because a
read can be reordered with respect to a write to a different memory
location, you can still have problems. So you do still need memory
barriers, just fewer of them.

Dekker's algorithm is the classic example: two threads each set a flag
and then check whether the other thread's flag is set. In any
sequential execution, at least one should see the other's flag set, but
on the x86 that doesn't always happen. One thread's read might be
reordered before its write.
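
A bare-bones sketch of that hazard (the names are made up; the point is
the store-then-load pattern):

    /* Both threads may observe the other's flag as 0 unless a full
     * memory barrier (mfence, or any locked instruction) separates the
     * store from the load. */
    volatile int flag[2];

    void
    enter(int self)             /* self is 0 or 1 */
    {
        flag[self] = 1;
        /* a full barrier is needed here even on x86: the load below can
         * be satisfied before the store above is globally visible */
        if (flag[1 - self] == 0)
        {
            /* "alone" - yet without the barrier, both threads can get here */
        }
    }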

 2. Machines with weak memory ordering.  On this category of machines
 (which includes PowerPC, Dec Alpha, and maybe some others), the CPU
 reorders memory accesses arbitrarily unless you explicitly issue
 instructions that enforce synchronization.  You still need to keep the
 compiler from moving things around, too.  Alpha is particularly
 pernicious, because something like a->b can fetch the pointed-to value
 before loading the pointer itself.  This is otherwise known as "we
 have basically no cache coherency circuits on this chip at all".  On
 these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.

The Alpha is pretty much unique (thankfully!) in allowing dependent
reads to be reordered. That makes it even weaker than the typical
weak-ordering machine. Since reading a pointer and then dereferencing
it is a pretty reasonable thing to do regularly in RCU code, you
probably don't want to emit barriers in between on architectures where
it's not actually necessary. That argues for another operation that's
defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
Certainly the Linux kernel found it useful to do so
(read_barrier_depends)

Alternatively, one might question how important it is to support the
Alpha these days...

Dan

-- 
Dan R. K. Ports              MIT CSAIL              http://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Noah Misch
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 Profiling this combination of patches reveals that there is still some
 pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
 to me that on x86, we really don't need this lock ... or
 SInvalReadLock ... or a per-backend mutex.  The whole of
 SIGetDataEntries() can pretty easily be made lock-free.  The only real
 changes that seem to be needed are (1) to use a 64-bit counter, so
 you never need to decrement

On second thought, won't this be inadequate on 32-bit systems, where updating
the 64-bit counter produces two stores?  You must avoid reading it between those
stores.

-- 
Noah Misch                                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Florian Pflug
On Jul21, 2011, at 03:46 , Robert Haas wrote:
 Profiling this combination of patches reveals that there is still some
 pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
 to me that on x86, we really don't need this lock ... or
 SInvalReadLock ... or a per-backend mutex.  The whole of
 SIGetDataEntries() can pretty easily be made lock-free.  The only real
 changes that seem to be needed are (1) to use a 64-bit counter, so
 you never need to decrement and (2) to recheck resetState after
 reading the entries from the queue, to see if we got reset while we
 were reading those entries.  Since x86 guarantees that writes will
 become visible in the order they are executed, we only need to make
 sure that the compiler doesn't rearrange things.  As long as we first
 read the maxMsgNum and then read the messages, we can't read garbage.
 As long as we read the messages before we check resetState, we will be
 sure to notice if we got reset before we read all the messages (which
 is the only way that we can have read garbage messages).

Sounds sensible. There's one additional hazard though - you'll also
need the reads to be atomic. x86 guarantees that for up to 32-bit (i386)
respectively 64-bit (x64) loads, but only for reads from properly aligned
addresses (4-byte alignment for 4-byte reads, 8-byte for 8-byte reads).

I found that out the hard way a few days ago, again while playing with
different LWLock implementations, when I botched my test setup and
the proc array entries ended up being misaligned. Boy, was it fun
to debug the random crashes caused by non-atomic pointer reads...

If we widen the counter to 64-bit, reading it atomically on i386 becomes
a bit of a challenge, but is doable too. From what I remember, there are
two options. You can use the 8-byte compare-and-exchange operation (see
the sketch below), though it might be that only quite recent CPUs support
it. The other option seems to be to use floating-point instructions. I
believe the latter is what Intel's own Threading Building Blocks library
does, but I'd have to re-check to be sure. It might also be that, once
you start using floating-point instructions, you find that you actually
do need fencing instructions even on x86. I don't know whether the weaker
ordering affects only SIMD instructions or all floating-point stuff...
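
A sketch of the compare-and-exchange trick, assuming GCC's __sync
builtins and a CPU that has the 8-byte cmpxchg instruction:

    #include <stdint.h>

    /* Atomically read a 64-bit value on 32-bit x86.  A CAS with
     * old == new == 0 never modifies *p: if *p was 0 it swaps 0 for 0,
     * otherwise it fails - but either way it returns the current
     * contents as one atomic 8-byte operation. */
    static uint64_t
    atomic_read_u64(volatile uint64_t *p)
    {
        return __sync_val_compare_and_swap(p, (uint64_t) 0, (uint64_t) 0);
    }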

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] storing TZ along timestamps

2011-07-21 Thread Jim Nasby
On Jul 21, 2011, at 5:30 PM, Christopher Browne wrote:
 - I'd commend capturing NOW() in a timestamptz field.  That gives you:
 1.  What time the DB server thought it was, in terms of UT1
 2.  What timezone it thought was tied to that connection.

Except that it doesn't, and that's exactly the problem I'm trying to solve 
here. I want to know what timezone we were using when we put a value into 
timestamptz, which then got converted to UT1. Without a reliable way to store 
what the timezone *was* at that time, we have no way to go back to it.

Now, we can debate whether it makes more sense to store the original time 
without conversion to UT1, or whether we should store the time after converting 
it to UT1 (or whether we should offer both options), but that debate is 
pointless without a good way to remember what timezone it started out in.

Arguably, we could just create an add-on data type for storing that timezone 
information, but that seems pretty daft to me: you're stuck either storing raw 
text which takes what should be a 12 byte datatype up to a 20-30 byte type (8 
byte timestamp + varlena + text of timezone name), or you end up with major 
problems trying to keep an enum in sync with what the database has available in 
its ZIC database.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] spinlock contention

2011-07-21 Thread Florian Pflug
On Jul18, 2011, at 04:36 , Robert Haas wrote:
 On Fri, Jul 8, 2011 at 6:02 AM, Florian Pflug f...@phlo.org wrote:
 I don't want to fiddle with your git repo, but if you attach a patch
 that applies to the master branch I'll give it a spin if I have time.
 
 Patch attached.
 
 Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
 spin lock instead of locked xadd to increment the shared counters.
 
 [ Back from vacation, catching up on email. ]
 
 gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5)
 
 pgbench -n -S -T 180 -c 32 -j 32
 
 with lwlock-part patch:
 tps = 36974.644091 (including connections establishing)
 
 unpatched cd34647c666be867f95ef8fc0492c30356043f10:
 tps = 39432.064293 (including connections establishing)
 
 And with -c 8 -j 8:
 
 tps = 26946.202428 (including connections establishing)
 tps = 27206.507424 (including connections establishing)

:-( That's disappointing, to say the least.

I also completely fail to understand what the heck is going on there.

I mean, you did conclusively prove that commenting out the SInval stuff
made a huge difference. There's also supposed to be hardly any invalidation
going on during a pgbench -S run. So, since the patch removes two of the
three spin-lock acquisitions from SIGetDataEntries() (so long as there are
no exclusive lockers of SInvalReadLock), there should be some effect.
Or so I'd think at least...

If anyone has a theory, I'd love to hear it.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 6:22 PM, Florian Pflug f...@phlo.org wrote:
 On Jul21, 2011, at 21:15 , Robert Haas wrote:
 On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 ... On these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.

 Right, and the reason that a spinlock fixes it is that we have memory
 barrier instructions built into the spinlock code sequences on machines
 where it matters.

 To get to the point where we could do the sort of optimization Robert
 is talking about, someone will have to build suitable primitives for
 all the platforms we support.  In the cases where we use gcc ASM in
 s_lock.h, it shouldn't be too hard to pull out the barrier
 instruction(s) ... but on platforms where we rely on OS-supplied
 functions, some research is going to be needed.

 Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
 on a backend-private slock_t should work anywhere that PostgreSQL
 works at all[1]. That will probably be slower than a memory fence
 instruction and certainly slower than a compiler barrier, but the
 point is that - right now - we're doing it the slow way everywhere.

 As I discovered while playing with various lockless algorithms to
 improve our LWLocks, spin locks aren't actually a replacement for
 a (full) barrier.

 Lock acquisition only really needs to guarantee that loads and stores
 which come after the acquisition operation in program order (i.e., in
 the instruction stream) aren't globally visible before that operation
 completes. This kind of barrier behaviour is often fittingly called
 acquire barrier.

 Similarly, a lock release operation only needs to guarantee that loads
 and stores which occur before that operation in program order are
 globally visible before the release operation completes. This, again,
 is fittingly called release barrier.

 Now assume the following code fragment

 global1 = 1;
 SpinLockAcquire();
 SpinLockRelease();
 global2 = 1;

 If SpinLockAcquire() has acquire barrier semantics, and SpinLockRelease()
 has release barrier semantics, then it's possible for the store to global1
 to be delayed until after SpinLockAcquire(), and similarly for the store
 to global2 to be executed before SpinLockRelease() completes. In other
 words, what happens is

 SpinLockAcquire();
 global1 = 1;
 global2 = 1;
 SpinLockRelease();

 But once that can happen, there's no reason that it couldn't also be

 SpinLockAcquire();
 global2 = 1;
 global1 = 1;
 SpinLockRelease();

 I didn't check if any of our spin lock implementations is actually affected
 by this, but it doesn't seem wise to rely on them being full barriers, even
 if it may be true today.

Hmm.  I'm not worried about that.  AFAIK, only IA64 has such an
implementation, and our existing spinlock implementation doesn't use
it.  If we were to add something like that in the future, we'd
presumably know that we were doing it, and would add the appropriate
memory barrier primitive at the same time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 6:44 PM, Dan Ports d...@csail.mit.edu wrote:
 If you're suggesting that hardware memory barriers aren't going to be
 needed to implement lock-free code on x86, that isn't true. Because a
 read can be reordered with respect to a write to a different memory
 location, you can still have problems. So you do still need memory
 barriers, just fewer of them.

 Dekker's algorithm is the classic example: two threads each set a flag
 and then check whether the other thread's flag is set. In any
 sequential execution, at least one should see the other's flag set, but
 on the x86 that doesn't always happen. One thread's read might be
 reordered before its write.

In the case of sinval, what we need to do for SIGetDataEntries() is,
approximately, a bunch of loads, followed by a store to one of the
locations we loaded (which no one else can have written meanwhile).
So I think that's OK.

In SIInsertDataEntries(), what we need to do is, approximately, take a
lwlock, load from a location which can only be written while holding
the lwlock, do a bunch of stores, ending with a store to that first
location, and release the lwlock.  I think that's OK, too.

 2. Machines with weak memory ordering.  On this category of machines
 (which includes PowerPC, Dec Alpha, and maybe some others), the CPU
 reorders memory accesses arbitrarily unless you explicitly issue
 instructions that enforce synchronization.  You still need to keep the
 compiler from moving things around, too.  Alpha is particularly
 pernicious, because something like a->b can fetch the pointed-to value
 before loading the pointer itself.  This is otherwise known as "we
 have basically no cache coherency circuits on this chip at all".  On
 these machines, you need to issue an explicit memory barrier
 instruction at each sequence point, or just acquire and release a
 spinlock.

 The Alpha is pretty much unique (thankfully!) in allowing dependent
 reads to be reordered. That makes it even weaker than the typical
 weak-ordering machine. Since reading a pointer and then dereferencing
 it is a pretty reasonable thing to do regularly in RCU code, you
 probably don't want to emit barriers in between on architectures where
 it's not actually necessary. That argues for another operation that's
 defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
 Certainly the Linux kernel found it useful to do so
 (read_barrier_depends)

 Alternatively, one might question how important it is to support the
 Alpha these days...

Well, currently, we do, so we probably don't want to drop support for
that without some careful thought.  I searched the archive and found
someone trying to compile 8.3.something on Alpha just a few years ago,
so it's apparently not totally dead yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch n...@2ndquadrant.com wrote:
 On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 Profiling this combination of patches reveals that there is still some
 pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
 to me that on x86, we really don't need this lock ... or
 SInvalReadLock ... or a per-backend mutex.  The whole of
 SIGetDataEntries() can pretty easily be made lock-free.  The only real
 changes that seem to be needed are (1) to use a 64-bit counter, so
 you never need to decrement

 On second thought, won't this be inadequate on 32-bit systems, where updating
 the 64-bit counter produces two stores?  You must avoid reading it between 
 those
 stores.

Now that is a potentially big problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql: bogus descriptions displayed by \d+

2011-07-21 Thread Josh Kupershmidt
On Sun, Jul 17, 2011 at 10:54 AM, Josh Kupershmidt schmi...@gmail.com wrote:
 On Sat, Jul 16, 2011 at 12:49 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 After a bit of review of the archives, the "somebody" was me:
 http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=b7d67954456f15762c04e5269b64adc88dcd0860

 and this thread was the discussion about it:
 http://archives.postgresql.org/pgsql-hackers/2009-12/msg01982.php

 It looks like we thought about pg_dump, but did not think about psql.

 Ah, interesting. I didn't even know this functionality existed. And I
 think there is some documentation lacking; in the 8.4 doc page:

Here's a small patch against branch 8.4 to mention support for COMMENT
ON index_name.column_name.

Also, a patch against master to:
 * get rid of the bogus Description outputs for \d+ sequence_name
and \d+ index_name
 * clarify in the COMMENT ON doc page that a table _or view_ name may
be used for comments on columns, rules, and triggers. If we allowed
constraints on views, we could have just put in a note explaining that
table_name.column_name applies to tables and views, but constraints
are the odd man out.
 * slightly reordered the listing of the first bunch of Parameters on
that page so that agg_name comes first, as it does in the Synopsis
section

I noticed that the synopsis for CREATE RULE:
  http://www.postgresql.org/docs/9.1/static/sql-createrule.html

uses the term table, which could be a similar omission. However, on
that page the first sentence of the description specifies table or
view so it might be fine as-is.

And while I'm messing with this, some further nitpicks about psql not
addressed by these patches:
 * The Storage column for \d+ sequence_name is correct, I suppose,
but repetitive
 * The Type column for \dv+ view_name, \di+ index_name, \ds+
sequence_name, etc. seems borderline useless... shouldn't you know
what type you're looking at based on the backslash command you're
using? Plus the table heading could be more specific than "List of
relations", e.g. "List of views".

Josh
diff --git a/doc/src/sgml/ref/comment.sgml b/doc/src/sgml/ref/comment.sgml
index ab12614..58a2f02 100644
*** a/doc/src/sgml/ref/comment.sgml
--- b/doc/src/sgml/ref/comment.sgml
*** COMMENT ON
*** 26,32 ****
    AGGREGATE <replaceable class="PARAMETER">agg_name</replaceable> (<replaceable class="PARAMETER">agg_type</replaceable> [, ...] ) |
    CAST (<replaceable>source_type</replaceable> AS <replaceable>target_type</replaceable>) |
    COLLATION <replaceable class="PARAMETER">object_name</replaceable> |
!   COLUMN <replaceable class="PARAMETER">table_name</replaceable>.<replaceable class="PARAMETER">column_name</replaceable> |
    CONSTRAINT <replaceable class="PARAMETER">constraint_name</replaceable> ON <replaceable class="PARAMETER">table_name</replaceable> |
    CONVERSION <replaceable class="PARAMETER">object_name</replaceable> |
    DATABASE <replaceable class="PARAMETER">object_name</replaceable> |
--- 26,32 ----
    AGGREGATE <replaceable class="PARAMETER">agg_name</replaceable> (<replaceable class="PARAMETER">agg_type</replaceable> [, ...] ) |
    CAST (<replaceable>source_type</replaceable> AS <replaceable>target_type</replaceable>) |
    COLLATION <replaceable class="PARAMETER">object_name</replaceable> |
!   COLUMN <replaceable class="PARAMETER">table_or_view_name</replaceable>.<replaceable class="PARAMETER">column_name</replaceable> |
    CONSTRAINT <replaceable class="PARAMETER">constraint_name</replaceable> ON <replaceable class="PARAMETER">table_name</replaceable> |
    CONVERSION <replaceable class="PARAMETER">object_name</replaceable> |
    DATABASE <replaceable class="PARAMETER">object_name</replaceable> |
*** COMMENT ON
*** 42,48 ****
    OPERATOR FAMILY <replaceable class="PARAMETER">object_name</replaceable> USING <replaceable class="parameter">index_method</replaceable> |
    [ PROCEDURAL ] LANGUAGE <replaceable class="PARAMETER">object_name</replaceable> |
    ROLE <replaceable class="PARAMETER">object_name</replaceable> |
!   RULE <replaceable class="PARAMETER">rule_name</replaceable> ON <replaceable class="PARAMETER">table_name</replaceable> |
    SCHEMA <replaceable class="PARAMETER">object_name</replaceable> |
    SEQUENCE <replaceable class="PARAMETER">object_name</replaceable> |
    SERVER <replaceable class="PARAMETER">object_name</replaceable> |
--- 42,48 ----
    OPERATOR FAMILY <replaceable class="PARAMETER">object_name</replaceable> USING <replaceable class="parameter">index_method</replaceable> |
    [ PROCEDURAL ] LANGUAGE <replaceable class="PARAMETER">object_name</replaceable> |
    ROLE <replaceable class="PARAMETER">object_name</replaceable> |
!   RULE <replaceable class="PARAMETER">rule_name</replaceable> ON <replaceable class="PARAMETER">table_or_view_name</replaceable> |
    SCHEMA <replaceable class="PARAMETER">object_name</replaceable> |
    SEQUENCE <replaceable class="PARAMETER">object_name</replaceable> |
    SERVER <replaceable class="PARAMETER">object_name</replaceable> |
*** COMMENT ON
*** 52,58 ****
    TEXT SEARCH DICTIONARY <replaceable class="PARAMETER">object_name</replaceable> |
    TEXT SEARCH PARSER <replaceable 

Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch n...@2ndquadrant.com wrote:
 On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 SIGetDataEntries() can pretty easily be made lock-free.  The only real
 changes that seem to be needed are (1) to use a 64-bit counter, so
 you never need to decrement

 On second thought, won't this be inadequate on 32-bit systems, where updating
 the 64-bit counter produces two stores?  You must avoid reading it between 
 those stores.

 Now that is a potentially big problem.

Could we do something similar to the xxid hacks?  That is, we have a lot
of counters that should be fairly close to each other, so we store only
the low-order 32 bits of each notional value, and separately maintain a
common high-order word.  You probably would need some additional
overhead each time the high-order word bumps, but that's reasonably
infrequent.
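
For reference, a minimal sketch of the modulo-2^32 comparison underlying
the xxid trick (illustrative names, not the actual sinval code); it
stays correct as long as live values never drift more than 2^31 apart:

#include <stdint.h>

/* Compare two 32-bit counters on a circle, the way PostgreSQL
 * compares TransactionIds: the signed difference gives the right
 * answer while the two values are within 2^31 of each other. */
static inline int
msgnum_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}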

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] libedit memory stomp is apparently fixed in OS X Lion

2011-07-21 Thread Tom Lane
We've had several complaints about that tab-completion bug in the
Apple-supplied version of libedit, most recently here:
http://archives.postgresql.org/pgsql-bugs/2011-06/msg00119.php

I had a bug filed with Apple about that, and today I got some auto-mail
indicating they'd fixed that bug as of OS X 10.7 (Lion).  I don't have
Lion installed here, but I grabbed the libedit sources from
www.opensource.apple.com and indeed it looks fixed.  So, if any early
adopters want to try it out ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] sinval synchronization considered harmful

2011-07-21 Thread Robert Haas
On Thu, Jul 21, 2011 at 9:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch n...@2ndquadrant.com wrote:
 On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
 SIGetDataEntries() can pretty easily be made lock-free.  The only real
 changes that seem to be needed are (1) to use a 64-bit counter, so
 you never need to decrement

 On second thought, won't this be inadequate on 32-bit systems, where updating
 the 64-bit counter produces two stores?  You must avoid reading it between
 those stores.

 Now that is a potentially big problem.

 Could we do something similar to the xxid hacks?  That is, we have a lot
 of counters that should be fairly close to each other, so we store only
 the low-order 32 bits of each notional value, and separately maintain a
 common high-order word.  You probably would need some additional
 overhead each time the high-order word bumps, but that's reasonably
 infrequent.

Well, the trouble is figuring out what the shape of that additional
overhead needs to look like.  I think I have a simpler idea, though:
before acquiring any locks, just have SIGetDataEntries() do this:

+   if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
+       return 0;

Patch (with comment explaining why I think this is OK) attached.  If
the message numbers happen to be equal only because the counter has
wrapped, then stateP->resetState will be true, so we'll still realize
we need to do some work.

Test results, with the lazy vxid patch plus this patch, at 8 clients:

tps = 34028.144439 (including connections establishing)
tps = 34079.085935 (including connections establishing)
tps = 34125.295938 (including connections establishing)

And at 32 clients:

tps = 185521.605364 (including connections establishing)
tps = 188250.700451 (including connections establishing)
tps = 186077.847215 (including connections establishing)

And at 80 clients:

tps = 188568.886569 (including connections establishing)
tps = 191035.971512 (including connections establishing)
tps = 189363.019377 (including connections establishing)

Not quite as good as the unlocked version, but better than the
per-backend mutex, and a whole lot simpler.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


sinval-fastpath.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: new contrib module plpgsql's embeded sql validator

2011-07-21 Thread Pavel Stehule
2011/7/22 Jim Nasby j...@nasby.net:
 On Jul 19, 2011, at 10:51 PM, Pavel Stehule wrote:
 If you mean that such checks would be done automatically, no, they
 shouldn't be.  Consider a function that creates a table and then uses
 it, or even just depends on using a table that doesn't yet exist when
 you do CREATE FUNCTION.

 yes, any deep check is not possible for functions that use temporary
 tables.

 A plpgsql_lint is not a silver bullet - for these cases it is
 necessary to disable the lint.

 I can't speak generally - I have no idea what percentage of functions
 access temporary tables - but in my last project I used zero temp
 tables in circa 300 KB of plpgsql code.

 The more serious problem is the new dependencies between functions. I
 use a workaround - something like headers

 You can work around temp table issues the same way: just define the temp 
 table before you create the function.

 In practice, if I have a function that depends on a temp table, it either 
 creates it itself if it doesn't already exist, or I have a separate function 
 to create the table; that way you have a single place that has the temp table 
 definition, and that is in the database itself.

There is another trick - create a persistent table with the same name
beforehand. The runtime temporary table comes earlier in search_path,
so all executed SQL will resolve to that temp table.
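
A minimal sketch of that trick, with made-up names - the permanent table
exists only so that a validator can resolve the reference when checking
the function; at runtime the temp table shadows it:

CREATE TABLE scratch (k int, v text);    -- permanent shadow, stays empty

CREATE FUNCTION use_scratch() RETURNS void AS $$
BEGIN
    INSERT INTO scratch VALUES (1, 'x'); -- checks against the shadow
END;
$$ LANGUAGE plpgsql;

-- in a session, create the temp table before the first call:
CREATE TEMP TABLE scratch (k int, v text);
SELECT use_scratch();                    -- inserts into the temp table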

Pavel

 --
 Jim C. Nasby, Database Architect                   j...@nasby.net
 512.569.9461 (cell)                         http://jim.nasby.net

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers