Re: [HACKERS] tracking commit timestamps
On Tue, 2014-11-04 at 23:43 -0600, Jim Nasby wrote: I'm worried about 2 commits in the same microsecond on the same system, not on 2 different systems. Or, put another way, if we're going to expose this I think it should also provide a guaranteed unique commit ordering for a single cluster. Presumably, this shouldn't be that hard since we do know the exact order in which things committed. Addition of LSN when the timestamps for two transactions are exactly the same isn't enough. There isn't any guarantee that a later commit gets a later timestamp than an earlier commit. In addition, I wonder if this feature would be misused. Record transaction ids to a table to find out commit order (use case could be storing historical row versions for example). Do a dump and restore on another cluster, and all the transaction ids are completely meaningless to the system. Having the ability to record commit order into an audit table would be extremely welcome, but as is, this feature doesn't provide it. - Anssi -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote: On 10/30/2014 06:02 PM, Andres Freund wrote: On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote: On 10/06/2014 02:29 PM, Andres Freund wrote: I've not yet really looked, but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't make me happy... Ok.. Can you elaborate? To me the split between xloginsert.c doing some of the record assembly, and xlog.c doing the lower level part of the assembly is just wrong. I moved the checks for bootstrap mode and xl_len == 0, from the current XLogInsert to the new XLogInsert() function. I don't imagine that to be enough to address your feelings about the split between xlog.c and xloginsert.c, but it makes it a tiny bit clearer IMHO. I don't know what else to do about it, as it feels very sensible to me as it is now. So unless you object more loudly, I'm going to just go ahead with this and commit, probably tomorrow. I'm still not particularly happy with the split. But I think this patch is enough of a win anyhow. So go ahead. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] tracking commit timestamps
On Wed, Nov 5, 2014 at 5:24 PM, Anssi Kääriäinen anssi.kaariai...@thl.fi wrote: On Tue, 2014-11-04 at 23:43 -0600, Jim Nasby wrote: I'm worried about 2 commits in the same microsecond on the same system, not on 2 different systems. Or, put another way, if we're going to expose this I think it should also provide a guaranteed unique commit ordering for a single cluster. Presumably, this shouldn't be that hard since we do know the exact order in which things committed. Addition of LSN when the timestamps for two transactions are exactly same isn't enough. There isn't any guarantee that a later commit gets a later timestamp than an earlier commit. True if the WAL record ID is not globally consistent. Two-level commit ordering can be done with (timestamp or LSN, nodeID). At equal timestamps, we could say as well that the node with the lowest systemID wins, for example. That's not something... In addition, I wonder if this feature would be misused. Record transaction ids to a table to find out commit order (use case could be storing historical row versions for example). Do a dump and restore on another cluster, and all the transaction ids are completely meaningless to the system. I think you are forgetting that you can take a consistent dump using an exported snapshot. In this case the commit order may not be that meaningless.. Having the ability to record commit order into an audit table would be extremely welcome, but as is, this feature doesn't provide it. That's something that can actually be achieved with this feature if the SQL interface is able to query all the timestamps in an xid range, with for example a background worker that tracks this data periodically. Now the thing is as well: how much timestamp history do we want to keep? The patch truncating SLRU files with frozenID may cover a sufficient range... -- Michael
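The two-level ordering sketched above can be illustrated in a few lines. This is only a model of the proposal, not anything in the patch; the timestamps and systemID values are invented for the example:

```python
# Model of two-level commit ordering: sort by commit timestamp first,
# and break exact timestamp ties with a node identifier (here the
# lowest systemID wins). All values are made up for illustration.
commits = [
    ("tx1", 1_000_001, 7001),  # (label, commit timestamp in µs, systemID)
    ("tx2", 1_000_000, 7002),
    ("tx3", 1_000_000, 7001),  # same microsecond as tx2
]

def commit_order(c):
    _, ts, node = c
    return (ts, node)  # tuple comparison: timestamp, then systemID

ordered = [name for name, _, _ in sorted(commits, key=commit_order)]
print(ordered)  # tx3 sorts before tx2: equal timestamps, lower systemID wins
```

The point of the tuple key is that it yields a total order even when timestamps collide, as long as no two nodes share a systemID.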
Re: [HACKERS] Sequence Access Method WIP
On 11/04/2014 11:01 PM, Petr Jelinek wrote: On 04/11/14 13:11, Heikki Linnakangas wrote: On 10/13/2014 01:01 PM, Petr Jelinek wrote: Only the alloc and reloptions methods are required (and implemented by the local AM). The caching, xlog writing, updating the page, etc is handled by backend, the AM does not see the tuple at all. I decided to not pass even the struct around and just pass the relevant options because I think if we want to abstract the storage properly then the AM should not care about how the pg_sequence looks like at all, even if it means that the sequence_alloc parameter list is bit long. Hmm. The division of labour between the seqam and commands/sequence.c still feels a bit funny. sequence.c keeps track of how many values have been WAL-logged, and thus usable immediately, but we still call sequence_alloc even when using up those already WAL-logged values. If you think of using this for something like a centralized sequence server in a replication cluster, you certainly don't want to make a call to the remote server for every value - you'll want to cache them. With the local seqam, there are two levels of caching. Each backend caches some values (per the CACHE value option in CREATE SEQUENCE). In addition to that, the server WAL-logs 32 values at a time. If you have a remote seqam, it would most likely add a third cache, but it would interact in strange ways with the second cache. Considering a non-local seqam, the locking is also a bit strange. The server keeps the sequence page locked throughout nextval(). But if the actual state of the sequence is maintained elsewhere, there's no need to serialize the calls to the remote allocator, i.e. the sequence_alloc() calls. I'm not exactly sure what to do about that. One option is to completely move the maintenance of the current value, i.e. sequence.last_value, to the seqam. That makes sense from an abstraction point of view. 
For example with a remote server managing the sequence, storing the current value in the local catalog table makes no sense as it's always going to be out-of-date. The local seqam would store it as part of the am-private data. However, you would need to move the responsibility of locking and WAL-logging to the seqam. Maybe that's OK, but we'll need to provide an API that the seqam can call to do that. Perhaps just let the seqam call heap_inplace_update on the sequence relation. My idea of how this works is - sequence_next handles the allocation and the level2 caching (WAL logged cache) via amdata if it supports it, or returns single value if it doesn't - then the WAL will always just write the 1 value and there will basically be no level2 cache, since it is the sequence_next who controls how much will be WAL-logged, what backend asks for is just suggestion. Hmm, so the AM might return an nallocated value less than the fetch value that sequence.c requested? As the patch stands, wouldn't that make sequence.c write a WAL record more often? In fact, if the seqam manages the current value outside the database (e.g. a remote seqam that gets the value from another server), nextval() never needs to write a WAL record. The third level caching as you say should be solved by the sequence_request_update and sequence_update callback - that will enable the sequence AM to handle this kind of caching asynchronously without blocking the sequence_next unnecessarily. That way it's possible to implement many different strategies, from no cache at all, to three levels of caching, with and without blocking of calls to sequence_next. It's nice that asynchronous operation is possible, but that seems like a more advanced feature. Surely an approach where you fetch a bunch of values when needed is going to be more common. Let's focus on the simple synchronous case first, and make sure the API works well for that. 
For the amdata handling (which is the AM's private data variable) the API assumes that (Datum) 0 is NULL; this seems to work well for reloptions so it should work here also, and it simplifies things a little compared to passing pointers to pointers around and making sure everything is allocated, etc. Sadly, the fact that amdata is not fixed size and can be NULL made the page updates of the sequence relation quite a bit more complex than they used to be. It would be nice if the seqam could define exactly the columns it needs, with any datatypes. There would be a set of common attributes: sequence_name, start_value, cache_value, increment_by, max_value, min_value, is_cycled. The local seqam would add last_value, log_cnt and is_called to that. A remote seqam that calls out to some other server might store the remote server's hostname etc. There could be a seqam function that returns a TupleDesc with the required columns, for example. Wouldn't that somewhat bloat the catalog if we had a new catalog table for each sequence AM? No, that's not what I meant. The number of catalog tables would be the same as today. Sequences
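The two caching levels described in this thread (a per-backend CACHE, plus the server WAL-logging a batch of 32 values at a time) can be modelled as a toy. Everything below is an invented sketch of the idea, not the actual sequence.c logic:

```python
# Toy model of the WAL-logged caching level: the server "logs" a batch
# of 32 values at a time, so most nextval() calls consume pre-logged
# values and only every 32nd call has to write a new "WAL record".
# All names here are invented for illustration.
WAL_BATCH = 32

class ToySequence:
    def __init__(self):
        self.logged_up_to = 0  # highest value covered by a "WAL record"
        self.current = 0       # last value actually handed out
        self.wal_records = 0   # how many times we had to "log"

    def nextval(self):
        if self.current >= self.logged_up_to:
            # pre-logged range exhausted: "log" another batch
            self.logged_up_to += WAL_BATCH
            self.wal_records += 1
        self.current += 1
        return self.current

seq = ToySequence()
values = [seq.nextval() for _ in range(64)]
print(seq.wal_records)  # only 2 "WAL records" for 64 nextval() calls
```

This also illustrates the concern raised above: if an AM returns fewer values per allocation than the batch size, the "logging" step runs more often.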
Re: [HACKERS] WAL replay bugs
Michael Paquier wrote: Now, do we really want this feature in-core? That's somewhat a duplicate of what is mentioned here: http://www.postgresql.org/message-id/CAB7nPqQMq=4ejak317mxz4has0i+1rslbqu29zx18jwlb2j...@mail.gmail.com Of course both things do not have the same coverage, as the former is for buildfarm and dev, while the latter is dedicated to production systems, but could be used for development as well. Oh, I had forgotten that other patch. The patch sent there is a bit outdated, but a potential implementation gets simpler with XLogReadBufferForRedo able to return flags about each block state during redo. I am still planning to come back to it for this cycle, though I stopped for now, waiting for the WAL format patches to finish shaping the APIs this feature would rely on. I agree it makes sense to wait until the WAL reworks are done -- glad to hear you're putting some time in this area. Thanks, -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] Representing a SRF return column in catalogs
Hi, I am working on something that requires representing a SRF return column in pg_proc and being able to retrieve it, retrieve the name of the column, and make a ColumnRef node from it. The catch here is aliases: SELECT generate_series(1,100) AS a ORDER BY a; I need to know that the return column of generate_series is being referred to as a in this query (hence tlist, sortClause will have 'c'). I discussed and think that a way can be to have the position of the column rather than the actual name, but that raises the question of the functionality needed for back-resolving position to name (to make a ColumnRef node) and then inferring that the column is being referred to by an alias in this query. Any pointers will be really helpful. Regards, Atri
Re: [HACKERS] get_cast_func syscache utility function
On 11/04/2014 01:45 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: here's a patch for a utility function to look up the cast function for a from/to pair of types, as recently suggested by Alvaro. Although it only contains one use (in json.c), the upcoming jsonb generators would also use it twice. I'd like to get this committed fairly quickly so I can prepare an updated patch for the jsonb generators. I'm not exactly convinced that this is an appropriate utility function, because casts for which a COERCION_METHOD_FUNCTION entry exists in pg_cast are by no means the whole universe of casts. I'm concerned that creating this function will encourage patch authors to blow off other possibilities when they should not. In the particular context at hand, it seems like you might be better advised to think about how to refactor json_categorize_type to be more helpful to your new use-case. A concrete example of what I'm worried about is that json_categorize_type already ignores the possibility of binary-compatible (WITHOUT FUNCTION) casts to json. Since it's eliminated the domain case earlier, that's perhaps not too horridly broken; but it'd be very easy for new uses of this get_cast_func() function to overlook such considerations. In short, I'd rather see this addressed through functions with slightly higher-level APIs that are capable of covering more cases. In most cases it'd be best if callers were using find_coercion_pathway() rather than taking shortcuts. Well, then, do we really need a wrapper at all? Should we just be doing something like this? 
if (typoid >= FirstNormalObjectId)
{
    Oid castfunc;
    CoercionPathType ctype;

    ctype = find_coercion_pathway(JSONOID, typoid, COERCION_EXPLICIT,
                                  &castfunc);
    if (ctype == COERCION_PATH_FUNC && OidIsValid(castfunc))
    {
        *tcategory = JSONTYPE_CAST;
        *outfuncoid = castfunc;
    }
}

cheers andrew
Re: [HACKERS] [REVIEW] Re: Fix xpath() to return namespace definitions
2014-11-04 22:16 GMT+07:00 Peter Eisentraut pete...@gmx.net: On 10/6/14 10:24 PM, Ali Akbar wrote: While reviewing the patch myself, i spotted some formatting problems in the code. Fixed in this v5 patch. Also, this patch uses context patch format (in first versions, because of the small modification, context patch format obfuscates the changes. After reimplementation this isn't the case anymore) I think the problem this patch is addressing is real, and your approach is sound, but I'd ask you to go back to the xmlCopyNode() version, and try to add a test case for why the second argument = 1 is necessary. I don't see any other problems. OK. Because upstream code is fixed in the current version, i'll revert to the previous version. Test case added to regression test. With the =1 argument, the result is correct:

    <local:piece xmlns:local="http://127.0.0.1" xmlns="http://127.0.0.2" id="1"><internal>number one</internal><internal2/></local:piece>

Without the argument, the result is not correct: all children will be lost. Because of that, the other regression test will fail too, because the children are not copied:

*** 584,593 ***
  -- Text XPath expressions evaluation
  SELECT xpath('/value', data) FROM xmltest;
!        xpath
! ----------------------
!  {<value>one</value>}
!  {<value>two</value>}
  (2 rows)
  SELECT xpath(NULL, NULL) IS NULL FROM xmltest;
--- 584,593 ----
  -- Text XPath expressions evaluation
  SELECT xpath('/value', data) FROM xmltest;
!    xpath
! ------------
!  {<value/>}
!  {<value/>}
  (2 rows)
  SELECT xpath(NULL, NULL) IS NULL FROM xmltest;
*** ... cut

updated patch attached. I noticed a somewhat big performance regression if the xml is large (i use the PRODML Product Volume sample from energistics.org):

* Without patch (tested 3 times):

select unnest(xpath('//a:flow', x, ARRAY[['a','http://www.prodml.org/schemas/1series']])) from u;
        unnest
-----------------------
 <flow>               +
 <kind>gas lift</kind>+
 ...

Time: 84,012 ms
Time: 85,683 ms
Time: 88,766 ms

* With latest v6 patch (notice the correct result with namespace definition):

select unnest(xpath('//a:flow', x, ARRAY[['a','http://www.prodml.org/schemas/1series']])) from u;
                      unnest
--------------------------------------------------
 <flow xmlns="http://www.prodml.org/schemas/1series">+
 ...

Time: 108,912 ms
Time: 108,267 ms
Time: 114,848 ms

It's a 23% performance regression.

* Just curious, i'm also testing v5 patch performance (notice the namespace in the result):

select unnest(xpath('//a:flow', x, ARRAY[['a','http://www.prodml.org/schemas/1series']])) from u;
                      unnest
--------------------------------------------------
 <flow xmlns="http://www.prodml.org/schemas/1series">+
 <kind>gas lift</kind>                               +
 ...

Time: 92,552 ms
Time: 97,440 ms
Time: 99,309 ms

The regression is only 13%. I know the xmlCopyNode() version (v6 patch) is much cleaner than the v5 patch; should we consider the performance benefit? Anyway, thanks for the review! :) Regards, -- Ali Akbar
[HACKERS] Time to remove dummy autocommit GUC?
There's a thread over in pgsql-admin http://www.postgresql.org/message-id/flat/1317404836.812502.1415174912912.javamail.ya...@jws11120.mail.ir2.yahoo.com suggesting that the server-side autocommit GUC causes more confusion than it's worth, because newbies think it should reflect the state of client-side autocommit behavior. We left that variable in there back in 7.4, on backward-compatibility grounds that were a bit debatable even then. ISTM that by now we could just flat-out remove it. Surely no client-side code is still looking at it. Thoughts, objections? regards, tom lane
Re: [HACKERS] Sequence Access Method WIP
On 05/11/14 13:45, Heikki Linnakangas wrote: On 11/04/2014 11:01 PM, Petr Jelinek wrote: On 04/11/14 13:11, Heikki Linnakangas wrote: On 10/13/2014 01:01 PM, Petr Jelinek wrote: Only the alloc and reloptions methods are required (and implemented by the local AM). The caching, xlog writing, updating the page, etc is handled by backend, the AM does not see the tuple at all. I decided to not pass even the struct around and just pass the relevant options because I think if we want to abstract the storage properly then the AM should not care about how the pg_sequence looks like at all, even if it means that the sequence_alloc parameter list is bit long. Hmm. The division of labour between the seqam and commands/sequence.c still feels a bit funny. sequence.c keeps track of how many values have been WAL-logged, and thus usable immediately, but we still call sequence_alloc even when using up those already WAL-logged values. If you think of using this for something like a centralized sequence server in a replication cluster, you certainly don't want to make a call to the remote server for every value - you'll want to cache them. With the local seqam, there are two levels of caching. Each backend caches some values (per the CACHE value option in CREATE SEQUENCE). In addition to that, the server WAL-logs 32 values at a time. If you have a remote seqam, it would most likely add a third cache, but it would interact in strange ways with the second cache. Considering a non-local seqam, the locking is also a bit strange. The server keeps the sequence page locked throughout nextval(). But if the actual state of the sequence is maintained elsewhere, there's no need to serialize the calls to the remote allocator, i.e. the sequence_alloc() calls. I'm not exactly sure what to do about that. One option is to completely move the maintenance of the current value, i.e. sequence.last_value, to the seqam. That makes sense from an abstraction point of view. 
For example with a remote server managing the sequence, storing the current value in the local catalog table makes no sense as it's always going to be out-of-date. The local seqam would store it as part of the am-private data. However, you would need to move the responsibility of locking and WAL-logging to the seqam. Maybe that's OK, but we'll need to provide an API that the seqam can call to do that. Perhaps just let the seqam call heap_inplace_update on the sequence relation. My idea of how this works is - sequence_next handles the allocation and the level2 caching (WAL logged cache) via amdata if it supports it, or returns single value if it doesn't - then the WAL will always just write the 1 value and there will basically be no level2 cache, since it is the sequence_next who controls how much will be WAL-logged, what backend asks for is just suggestion. Hmm, so the AM might return an nallocated value less than the fetch value that sequence.c requested? As the patch stands, wouldn't that make sequence.c write a WAL record more often? That's correct, that's also why you usually want to have some form of local caching if possible. In fact, if the seqam manages the current value outside the database (e.g. a remote seqam that gets the value from another server), nextval() never needs to write a WAL record. Sure it does, you need to keep the current state in Postgres also, at least the current value so that you can pass correct input to sequence_alloc(). And you need to do this in crash-safe way so WAL is necessary. I think sequences will cache in amdata if possible for that type of sequence and in case where it's not and it's caching on some external server then you'll most likely get bigger overhead from network than WAL anyway... 
For the amdata handling (which is the AM's private data variable) the API assumes that (Datum) 0 is NULL, this seems to work well for reloptions so should work here also and it simplifies things a little compared to passing pointers to pointers around and making sure everything is allocated, etc. Sadly the fact that amdata is not fixed size and can be NULL made the page updates of the sequence relation quite more complex that it used to be. It would be nice if the seqam could define exactly the columns it needs, with any datatypes. There would be a set of common attributes: sequence_name, start_value, cache_value, increment_by, max_value, min_value, is_cycled. The local seqam would add last_value, log_cnt and is_called to that. A remote seqam that calls out to some other server might store the remote server's hostname etc. There could be a seqam function that returns a TupleDesc with the required columns, for example. Wouldn't that somewhat bloat catalog if we had new catalog table for each sequence AM? No, that's not what I meant. The number of catalog tables would be the same as today. Sequences look much like any other relation, with entries in pg_attribute catalog table for all the attributes for each
Re: [HACKERS] get_cast_func syscache utility function
Andrew Dunstan and...@dunslane.net writes: On 11/04/2014 01:45 PM, Tom Lane wrote: In short, I'd rather see this addressed through functions with slightly higher-level APIs that are capable of covering more cases. In most cases it'd be best if callers were using find_coercion_pathway() rather than taking shortcuts. Well, then, do we really need a wrapper at all? Should we just be doing something like this?

if (typoid >= FirstNormalObjectId)
{
    Oid castfunc;
    CoercionPathType ctype;

    ctype = find_coercion_pathway(JSONOID, typoid, COERCION_EXPLICIT,
                                  &castfunc);
    if (ctype == COERCION_PATH_FUNC && OidIsValid(castfunc))
    {
        *tcategory = JSONTYPE_CAST;
        *outfuncoid = castfunc;
    }
}

Well, of course, the question that immediately raises is why isn't this code handling the other possible CoercionPathTypes ;-). But at least it's pretty obvious from the code that you are ignoring such cases, so yes I think this is better than what's there now. regards, tom lane
[HACKERS] Order of views in stats docs
http://www.postgresql.org/docs/9.3/static/monitoring-stats.html, table 27-1. Can somebody find or explain the order of the views in there? It's not actually alphabetical, but it's also not logical. In particular, what is pg_stat_replication doing second to last? I would suggest we move pg_stat_replication up to directly under pg_stat_activity, and move pg_stat_database_conflicts up to directly under pg_stat_database. I think the rest makes reasonable sense. Any objections to this? Can anybody spot a reason for why they are where they are other than that it was just appended to the end of the table without realizing the order that I'm missing now and am about to break? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Order of views in stats docs
Magnus Hagander mag...@hagander.net writes: http://www.postgresql.org/docs/9.3/static/monitoring-stats.html, table 27-1. Can somebody find or explain the order of the views in there? It's not actually alphabetical, but it's also not logical. In particular, what is pg_stat_replication doing second to last? I would suggest we move pg_stat_replication up to directly under pg_stat_activity, and move pg_stat_database_conflicts up to directly under pg_stat_database. I think the rest makes reasonable sense. Any objections to this? Can anybody spot a reason for why they are where they are other than that it was just appended to the end of the table without realizing the order that I'm missing now and am about to break? I agree that the last two items seem to be suffering from blindly-add-it-to-the-end syndrome, which is a disease that runs rampant around here. However, should we consider the possibility of changing the table to straight alphabetical ordering? I'm not as much in love with that approach as some folks, but it does have the merit that it's always clear where you ought to put a new item. This would result in grouping the all, sys, and user views separately, rather than grouping those variants of a view together ... but on reflection I'm not sure that that'd be totally horrible. regards, tom lane
Re: [HACKERS] tracking commit timestamps
On 11/5/14, 6:10 AM, Michael Paquier wrote: In addition, I wonder if this feature would be misused. Record transaction ids to a table to find out commit order (use case could be storing historical row versions for example). Do a dump and restore on another cluster, and all the transaction ids are completely meaningless to the system. I think you are forgetting the fact to be able to take a consistent dump using an exported snapshot. In this case the commit order may not be that meaningless.. Anssi's point is that you can't use xmin because it can change, but I think anyone working with this feature would understand that. Having the ability to record commit order into an audit table would be extremely welcome, but as is, this feature doesn't provide it. That's something that can actually be achieved with this feature if the SQL interface is able to query all the timestamps in a xid range with for example a background worker that tracks this data periodically. Now the thing is as well: how much timestamp history do we want to keep? The patch truncating SLRU files with frozenID may cover a sufficient range... Except that commit time is not guaranteed unique *even on a single system*. That's my whole point. If we're going to bother with all the commit time machinery it seems really silly to not provide a way to uniquely order every commit. Clearly trying to uniquely order commits across multiple systems is a far larger problem, and I'm not suggesting we attempt that. But for a single system AIUI all we need to do is expose the LSN of each commit record and that will give you the exact and unique order in which transactions committed. This isn't a hypothetical feature either; if we had this, logical replication systems wouldn't have to try and fake this via batches. You could actually recreate exactly what data was visible at what time to all transactions, not just repeatable read ones (as long as you kept snapshot data as well, which isn't hard).
As for how much data to keep, if you have a process that's doing something to record this information permanently all it needs to do is keep an old enough snapshot around. That's not that hard to do, even from user space. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
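The argument above — commit timestamps can collide at microsecond resolution, while each commit record's LSN (a byte position in the WAL) is unique and strictly increasing on a single system — can be sketched like this. All concrete values are invented for the illustration:

```python
# Sketch: timestamps alone cannot always distinguish two commits,
# but commit-record LSNs always give an exact, unique order on one
# system. The xids, timestamps, and LSNs below are made up.
commits = [
    {"xid": 100, "commit_ts": 1_000_000, "lsn": 0x16B3768},
    {"xid": 101, "commit_ts": 1_000_000, "lsn": 0x16B37F0},  # same µs as xid 100
    {"xid": 102, "commit_ts": 1_000_002, "lsn": 0x16B3888},
]

# Do timestamps collide? Do LSNs?
ts_ties = len({c["commit_ts"] for c in commits}) < len(commits)
lsn_ties = len({c["lsn"] for c in commits}) < len(commits)

# Ordering by LSN recovers the exact commit order.
by_lsn = [c["xid"] for c in sorted(commits, key=lambda c: c["lsn"])]
print(ts_ties, lsn_ties, by_lsn)  # True False [100, 101, 102]
```

This is only a model of the property being discussed, not of any proposed SQL interface.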
Re: [HACKERS] Time to remove dummy autocommit GUC?
Tom Lane t...@sss.pgh.pa.us wrote: ISTM that by now we could just flat-out remove it. +1 -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] tracking commit timestamps
On 11/5/14, 10:30 AM, Andres Freund wrote: Except that commit time is not guaranteed unique *even on a single system*. That's my whole point. If we're going to bother with all the commit time machinery it seems really silly to provide a way to uniquely order every commit. Well. I think that's the misunderstanding here. That's absolutely not what committs is supposed to be used for. For the replication stream you'd hopefully use logical decoding. That gives you the transaction data exactly in commit order. So presumably you'd want to use logical decoding to insert into a table with a sequence on it, or similar? I agree, that sounds like a better way to handle this. I think it's worth mentioning in the docs for commit_ts, because people WILL mistakenly try and use it to determine commit ordering. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] tracking commit timestamps
On 2014-11-05 10:23:15 -0600, Jim Nasby wrote: On 11/5/14, 6:10 AM, Michael Paquier wrote: In addition, I wonder if this feature would be misused. Record transaction ids to a table to find out commit order (use case could be storing historical row versions for example). Do a dump and restore on another cluster, and all the transaction ids are completely meaningless to the system. I think you are forgetting the fact to be able to take a consistent dump using an exported snapshot. In this case the commit order may not be that meaningless.. Anssi's point is that you can't use xmin because it can change, but I think anyone working with this feature would understand that. Having the ability to record commit order into an audit table would be extremely welcome, but as is, this feature doesn't provide it. That's something that can actually be achieved with this feature if the SQL interface is able to query all the timestamps in a xid range with for example a background worker that tracks this data periodically. Now the thing is as well: how much timestamp history do we want to keep? The patch truncating SLRU files with frozenID may cover a sufficient range... Except that commit time is not guaranteed unique *even on a single system*. That's my whole point. If we're going to bother with all the commit time machinery it seems really silly to provide a way to uniquely order every commit. Well. I think that's the misunderstanding here. That's absolutely not what committs is supposed to be used for. For the replication stream you'd hopefully use logical decoding. That gives you the transaction data exactly in commit order. Clearly trying to uniquely order commits across multiple systems is a far larger problem, and I'm not suggesting we attempt that. But for a single system AIUI all we need to do is expose the LSN of each commit record and that will give you the exact and unique order in which transactions committed. I don't think that's something you should attempt. 
That's what logical decoding is for. Hence I see little point in exposing the LSN that way. Where I think committs is useful is as a method for analyzing and resolving conflicts between multiple systems. In that case you likely can't use the LSN for anything, as it won't be very meaningful. If you get conflicts below the accuracy of the timestamps you'd better use another deterministic method of resolving them - BDR e.g. compares the system identifier, timeline id, database oid, and a user defined name, while enforcing that those aren't the same between systems. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
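The BDR-style tie-break described above can be sketched as a tuple comparison: compare commit timestamps first, and fall back to identifiers that are guaranteed to differ between nodes. This is only an illustration of the idea, not BDR's actual code, and every concrete value below is invented:

```python
# Sketch of deterministic last-update-wins conflict resolution:
# the resolution key is (commit timestamp, system identifier,
# timeline id, database oid, node name), so even with identical
# timestamps every node computes the same winner.
def resolution_key(update):
    return (
        update["commit_ts"],
        update["sysid"],
        update["timeline"],
        update["dboid"],
        update["node_name"],
    )

a = {"commit_ts": 500, "sysid": 6011, "timeline": 1, "dboid": 16384, "node_name": "nodeA"}
b = {"commit_ts": 500, "sysid": 6012, "timeline": 1, "dboid": 16384, "node_name": "nodeB"}

# Equal timestamps: the tuple comparison still yields a total order,
# so both sides deterministically pick the same winner.
winner = max((a, b), key=resolution_key)
print(winner["node_name"])  # nodeB
```

The design point is that the fallback fields are never equal across nodes, so the comparison can never declare a tie.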
Re: [HACKERS] B-Tree index builds, CLUSTER, and sortsupport
On 10/10/14, 7:26 PM, Peter Geoghegan wrote: Both Robert and Heikki expressed dissatisfaction with the fact that B-Tree index builds don't use sortsupport. Because B-Tree index builds cannot really use the onlyKey optimization, the historic oversight of not supporting B-Tree builds (and CLUSTER) might have been at least a little understandable [1]. But with the recent work on sortsupport for text, it's likely that B-Tree index builds will be missing out on rather a lot by not taking advantage of sortsupport. Attached patch modifies tuplesort to correct this oversight. What's really nice about it is that there is actually a net negative code footprint: Did anything ever happen with this? -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 2014-11-05 10:34:40 -0600, Jim Nasby wrote: On 11/5/14, 10:30 AM, Andres Freund wrote: Except that commit time is not guaranteed unique *even on a single system*. That's my whole point. If we're going to bother with all the commit time machinery it seems really silly to not provide a way to uniquely order every commit. Well. I think that's the misunderstanding here. That's absolutely not what committs is supposed to be used for. For the replication stream you'd hopefully use logical decoding. That gives you the transaction data exactly in commit order. So presumably you'd want to use logical decoding to insert into a table with a sequence on it, or similar? I'm not following. I'd use logical decoding to replicate the data to another system, thereby guaranteeing its done in commit order. Then, when applying the data on the other side, I can detect/resolve some forms of conflicts by looking at the timestamps of rows via committs. I agree, that sounds like a better way to handle this. I think it's worth mentioning in the docs for commit_ts, because people WILL mistakenly try and use it to determine commit ordering. Ok, sounds sensible. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On 11/4/14, 6:11 PM, Álvaro Hernández Tortosa wrote: Should we then improve the docs to state this more clearly? Any objection to doing this? If we go that route we should also mention that now() will no longer be doing what you probably hope it would (AFAIK it's driven by BEGIN and not the first snapshot). Perhaps we should change how now() works, but I'm worried about what that might do to existing applications... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] get_cast_func syscache utility function
On 11/05/2014 10:10 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 11/04/2014 01:45 PM, Tom Lane wrote: In short, I'd rather see this addressed through functions with slightly higher-level APIs that are capable of covering more cases. In most cases it'd be best if callers were using find_coercion_pathway() rather than taking shortcuts. Well, then, do we really need a wrapper at all? Should we just be doing something like this?

    if (typoid >= FirstNormalObjectId)
    {
        Oid castfunc;
        CoercionPathType ctype;

        ctype = find_coercion_pathway(JSONOID, typoid,
                                      COERCION_EXPLICIT,
                                      &castfunc);
        if (ctype == COERCION_PATH_FUNC && OidIsValid(castfunc))
        {
            *tcategory = JSONTYPE_CAST;
            *outfuncoid = castfunc;
        }
    }

Well, of course, the question that immediately raises is why isn't this code handling the other possible CoercionPathTypes ;-). But at least it's pretty obvious from the code that you are ignoring such cases, so yes I think this is better than what's there now. Also, it's equivalent to what's there now, I think. I wasn't intending to change the behaviour - if someone wants to do that they can submit a separate patch. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time to remove dummy autocommit GUC?
On Nov 5, 2014 5:27 PM, Kevin Grittner kgri...@ymail.com wrote: Tom Lane t...@sss.pgh.pa.us wrote: ISTM that by now we could just flat-out remove it. +1 +1. I thought we had already removed it years ago :-) /Magnus
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
On Tue, Nov 4, 2014 at 2:28 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I was clearly too careless about testing the xlog code --- it had numerous bugs. This version should be a lot better, but there might be problems lurking still as I don't think I covered it all. Let me know if you see anything wrong. At line 252 of brin_xlog.c, should the UnlockReleaseBuffer(metabuf) be protected by a BufferIsValid? XLogReadBufferForRedo says that it might return an invalid buffer under some situations. Perhaps it is known that those situations can't apply here? Now I am getting segfaults during normal (i.e. no intentional crashes) operations. I think I was seeing them sometimes before as well, I just wasn't looking for them. The attached script invokes the segfault within a few minutes. A lot of the stuff in the script is probably not necessary, I just didn't spend the time to pare it down to the essentials. It does not need to be in parallel, I get the crash when invoked with only one job (perl ~/brin_crash.pl 1). I think this is related to having block ranges which have no tuples in them when they are first summarized.
If I take out the "with t as (delete from foo returning *) insert into foo select * from t", then I don't see the crashes.

#0  0x0089ed3e in pg_detoast_datum_packed (datum=0x0) at fmgr.c:2270
#1  0x00869be9 in text_le (fcinfo=0x7fff1bf6b9f0) at varlena.c:1661
#2  0x0089cfc7 in FunctionCall2Coll (flinfo=0x297e640, collation=100, arg1=0, arg2=43488216) at fmgr.c:1324
#3  0x004678f8 in minmaxConsistent (fcinfo=0x7fff1bf6be40) at brin_minmax.c:213
#4  0x0089d0c9 in FunctionCall3Coll (flinfo=0x297b830, collation=100, arg1=43509512, arg2=43510296, arg3=43495856) at fmgr.c:1349
#5  0x00462484 in bringetbitmap (fcinfo=0x7fff1bf6c310) at brin.c:469
#6  0x0089cfc7 in FunctionCall2Coll (flinfo=0x28f2440, collation=0, arg1=43495712, arg2=43497376) at fmgr.c:1324
#7  0x004b3fc9 in index_getbitmap (scan=0x297b120, bitmap=0x297b7a0) at indexam.c:651
#8  0x0062ece0 in MultiExecBitmapIndexScan (node=0x297af30) at nodeBitmapIndexscan.c:89
#9  0x00619783 in MultiExecProcNode (node=0x297af30) at execProcnode.c:550
#10 0x0062dea2 in BitmapHeapNext (node=0x2974750) at nodeBitmapHeapscan.c:104

Cheers, Jeff

brin_crash.pl Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
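[Editor's note: the backtrace shows minmaxConsistent handing a NULL datum to text_le, apparently for block ranges that were summarized while they contained no tuples. A toy Python model of why a min/max consistency check must special-case such ranges rather than compare their (nonexistent) bounds — purely illustrative, not the brin_minmax.c logic:]

```python
# Toy model of a BRIN min/max block-range summary.  A range summarized
# while it contained no tuples has no min/max values; the consistency
# check must not compare against them (the crash above is exactly that:
# a comparison function called on a NULL datum).

class RangeSummary:
    def __init__(self, minval=None, maxval=None, has_data=False):
        self.minval = minval
        self.maxval = maxval
        self.has_data = has_data

def range_consistent(summary, key):
    """Could any tuple in this summarized block range equal 'key'?"""
    if not summary.has_data:
        # Summarized-while-empty range: there is nothing to compare, and
        # it cannot contain a match, so skip it -- never dereference the
        # missing min/max.
        return False
    return summary.minval <= key <= summary.maxval
```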
Re: [HACKERS] Sequence Access Method WIP
On 11/05/2014 05:07 PM, Petr Jelinek wrote: On 05/11/14 13:45, Heikki Linnakangas wrote: In fact, if the seqam manages the current value outside the database (e.g. a remote seqam that gets the value from another server), nextval() never needs to write a WAL record. Sure it does, you need to keep the current state in Postgres also, at least the current value so that you can pass correct input to sequence_alloc(). And you need to do this in a crash-safe way, so WAL is necessary. Why does sequence_alloc need the current value? If it's a remote seqam, the current value is kept in the remote server, and the last value that was given to this PostgreSQL server is irrelevant. That irks me with this API. The method for acquiring a new value isn't fully abstracted behind the AM interface, as sequence.c still needs to track it itself. That's useful for the local AM, of course, and maybe some others, but for others it's totally useless. For the amdata handling (which is the AM's private data variable) the API assumes that (Datum) 0 is NULL, this seems to work well for reloptions so should work here also and it simplifies things a little compared to passing pointers to pointers around and making sure everything is allocated, etc. Sadly the fact that amdata is not fixed size and can be NULL made the page updates of the sequence relation quite a bit more complex than they used to be. It would be nice if the seqam could define exactly the columns it needs, with any datatypes. There would be a set of common attributes: sequence_name, start_value, cache_value, increment_by, max_value, min_value, is_cycled. The local seqam would add last_value, log_cnt and is_called to that. A remote seqam that calls out to some other server might store the remote server's hostname etc. There could be a seqam function that returns a TupleDesc with the required columns, for example. Wouldn't that somewhat bloat the catalog if we had a new catalog table for each sequence AM? No, that's not what I meant.
The number of catalog tables would be the same as today. Sequences look much like any other relation, with entries in the pg_attribute catalog table for all the attributes for each sequence. Currently, all sequences have the same set of attributes, sequence_name, last_value and so forth. What I'm proposing is that there would be a set of attributes that are common to all sequences, but in addition to that there could be any number of AM-specific attributes. Oh, that's an interesting idea, so the AM interfaces would basically return an updated tuple and there would be some description function that returns the tupledesc. Yeah, something like that. I am a bit worried that this would kill any possibility of ALTER SEQUENCE USING access_method. Plus I don't think it actually solves any real problem - serializing the internal C structs into bytea is not any harder than serializing them into a tuple IMHO. I agree that serialization to bytea isn't that difficult, but it's still nicer to work directly with the correct data types. And it makes the internal state easily accessible for monitoring and debugging purposes. It also does not really solve the amdata being dynamic size issue. Yes it would. There would not be a single amdata attribute, but the AM could specify any number of custom attributes, which could be fixed size or varlen. It would be solely the AM's responsibility to set the values of those attributes. That's not the issue I was referring to, I was talking about the page replacement code, which is not as simple now that we have a potentially dynamic-size tuple, and if tuples were different for different AMs the code would still have to be able to handle that case. Setting the values in the tuple itself is not too complicated. I don't see the problem with that. We deal with variable-sized tuples in heap pages all the time. The max size of amdata (or the extra AM-specific columns) is going to be determined by the block size, though.
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
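[Editor's note: Heikki's point — that for a "remote" sequence AM the locally tracked last value is irrelevant, while a local AM must persist its state after every allocation — can be sketched as below. The class and method names (alloc, fetch) are invented for this example; they are not the patch's actual C API.]

```python
# Illustrative sketch of the two sequence-AM shapes discussed above.

class LocalSeqAM:
    """Classic local sequence: state (last_value etc.) lives in the
    sequence relation and must be WAL-logged after each allocation."""
    def __init__(self, start=1, increment=1):
        self.last_value = start - increment
        self.increment = increment

    def alloc(self):
        self.last_value += self.increment
        return self.last_value  # caller persists (WAL-logs) the new state

class RemoteSeqAM:
    """Remote sequence: values come from another node, so there is no
    meaningful local 'current value' to track or WAL-log."""
    def __init__(self, fetch):
        self.fetch = fetch  # callable standing in for a network round-trip

    def alloc(self):
        return self.fetch()  # nothing local to persist
```

The asymmetry is the design tension in the thread: sequence.c tracking the current value is essential for LocalSeqAM and dead weight for RemoteSeqAM.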
Re: [HACKERS] B-Tree index builds, CLUSTER, and sortsupport
On Wed, Nov 5, 2014 at 8:36 AM, Jim Nasby jim.na...@bluetreble.com wrote: Did anything ever happen with this? I consider this to be related to the abbreviated keys stuff, although it is clearly independently valuable (so it could be reviewed independently). Since Robert ran out of time to work on abbreviated keys (hopefully temporarily), it's not all that surprising that no one has paid attention to this smaller piece of work. I see that Andreas Karlsson is signed up to review this, so that's something. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] to_char_at_timezone()?
On 11/04/2014 04:04 PM, Marko Tiikkaja wrote: On 11/5/14, 12:59 AM, Tom Lane wrote: Marko Tiikkaja ma...@joh.to writes: So I got into thinking whether it would make sense to provide a new function, say, to_char_at_timezone() to solve this problem. For example: ... Any thoughts? The patch is quite trivial. I'm not convinced that it's all that trivial. Is the input timestamp or timestamptz, and what's the semantics of that exactly (ie what timezone rotation happens)? One's first instinct is often wrong in this area. In my example, the input is a timestamptz, and the output is converted to the target time zone the same way timestamptz_out() does, except based on the input timezone instead of TimeZone. Not sure whether it would make sense to do this for timestamp, or whether there's even a clear intuitive behaviour there. Why wouldn't we just add the timezone as an additional parameter? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] to_char_at_timezone()?
On 11/5/14, 7:36 PM, Josh Berkus wrote: On 11/04/2014 04:04 PM, Marko Tiikkaja wrote: In my example, the input is a timestamptz, and the output is converted to the target time zone the same way timestamptz_out() does, except based on the input timezone instead of TimeZone. Not sure whether it would make sense to do this for timestamp, or whether there's even a clear intuitive behaviour there. Why wouldn't we just add the timezone as an additional parameter? Are you suggesting that we add a new overload of to_char() instead of a new function to_char_at_timezone()? That sounds a bit confusing, but that might just be me. .marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
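[Editor's note: the semantics Marko proposes — render an absolute instant (timestamptz) using a caller-supplied zone instead of the session's TimeZone, the way timestamptz_out() would — can be modeled in a few lines of Python. Fixed offsets are used to keep the example self-contained; this is not the Postgres implementation.]

```python
# Model of the proposed to_char_at_timezone() behavior: same instant,
# formatted at an explicitly supplied zone rather than the session zone.
from datetime import datetime, timedelta, timezone

def to_char_at_timezone(instant, fmt, tz):
    """Format an aware datetime in the given target zone."""
    return instant.astimezone(tz).strftime(fmt)

utc_instant = datetime(2014, 11, 5, 12, 0, tzinfo=timezone.utc)
helsinki_winter = timezone(timedelta(hours=2))  # UTC+2 stand-in
print(to_char_at_timezone(utc_instant, "%Y-%m-%d %H:%M", helsinki_winter))
# -> 2014-11-05 14:00
```

Josh's suggestion amounts to the same thing spelled as an extra parameter to to_char() rather than a separately named function.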
Re: [HACKERS] pg_multixact not getting truncated
On 11/3/14, 7:40 PM, Josh Berkus wrote: On 11/03/2014 05:24 PM, Josh Berkus wrote: BTW, the reason I started poking into this was a report from a user that they have a pg_multixact directory which is 21GB in size, and is 2X the size of the database. Here's XID data:

Latest checkpoint's NextXID:          0/1126461940
Latest checkpoint's NextOID:          371135838
Latest checkpoint's NextMultiXactId:  162092874
Latest checkpoint's NextMultiOffset:  778360290
Latest checkpoint's oldestXID:        945761490
Latest checkpoint's oldestXID's DB:   370038709
Latest checkpoint's oldestActiveXID:  1126461940
Latest checkpoint's oldestMultiXid:   123452201
Latest checkpoint's oldestMulti's DB: 370038709

Oldest mxid file is 29B2, newest is 3A13. No tables had a relminmxid of 1 (some of the system tables were 0, though), and the data from pg_class and pg_database is consistent. More tidbits: I just did a quick check on customer systems (5 of them). This only seems to be happening on customer systems where I specifically know there is a high number of FK lock waits (the system above gets around 1000 per minute that we know of). Other systems with higher transaction rates don't exhibit this issue; I checked a 9.3.5 database which normally needs to do XID wraparound once every 10 days, and its pg_multixact is only 48K (it has one file, ). Note that pg_clog on the bad machine is only 64K in size. How many IDs are there per mxid file? #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset)) So for 8k blocks, there are 2k offsets (really MultiXactIds) per page, 32 pages per SLRU segment. Your file names aren't making sense to me. :( If I'm doing the math correctly, 29B2 is MXID 699 531 264 and 3A13 is 974 323 712. You're only looking in pg_multixact/members/, yes? Relevant code starts in vacuum.c/vac_update_datfrozenxid(). If there's any rows in pg_class for tables/matviews/toast with either relfrozenxid > next XID or relminmxid > next MXID then the code *silently* pulls the plug right there.
IMO we should at least issue a warning. That you see relminmxid advancing tells me this isn't the case here. ForceTransactionIdLimitUpdate() is a bit suspect in that it only looks at xidVacLimit, but if it were breaking then you wouldn't see pg_database minmxid advancing. Looking through TruncateMultiXact, I don't see anything that could prevent truncation, unless the way we're handling MultiXactID wraparound is broken (which I don't see any indication of). Can you post the contents of pg_multixact/members/? -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
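[Editor's note: the segment-filename arithmetic Jim uses above can be checked directly — with 8 kB pages, each pg_multixact/offsets page holds 8192 / 4 = 2048 entries and each SLRU segment file spans 32 pages, i.e. 65536 MultiXactIds per segment. (For the members/ directory the per-page count differs, which is the confusion resolved in the next message.)]

```python
# Map a pg_multixact/offsets segment file name (hex) to the first
# MultiXactId it covers, using the constants quoted above.

BLCKSZ = 8192
SIZEOF_MULTIXACT_OFFSET = 4
OFFSETS_PER_PAGE = BLCKSZ // SIZEOF_MULTIXACT_OFFSET  # 2048
PAGES_PER_SEGMENT = 32

def first_mxid_of_segment(segname):
    """First MultiXactId covered by an offsets segment file."""
    return int(segname, 16) * OFFSETS_PER_PAGE * PAGES_PER_SEGMENT

print(first_mxid_of_segment("29B2"))  # -> 699531264
print(first_mxid_of_segment("3A13"))  # -> 974323712
```

These reproduce Jim's figures of 699 531 264 and 974 323 712, both far beyond the reported NextMultiXactId of 162 092 874 — hence his confusion about the file names.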
Re: [HACKERS] pg_multixact not getting truncated
On 11/05/2014 10:40 AM, Jim Nasby wrote: On 11/3/14, 7:40 PM, Josh Berkus wrote: On 11/03/2014 05:24 PM, Josh Berkus wrote: BTW, the reason I started poking into this was a report from a user that they have a pg_multixact directory which is 21GB in size, and is 2X the size of the database. Here's XID data:

Latest checkpoint's NextXID:          0/1126461940
Latest checkpoint's NextOID:          371135838
Latest checkpoint's NextMultiXactId:  162092874
Latest checkpoint's NextMultiOffset:  778360290
Latest checkpoint's oldestXID:        945761490
Latest checkpoint's oldestXID's DB:   370038709
Latest checkpoint's oldestActiveXID:  1126461940
Latest checkpoint's oldestMultiXid:   123452201
Latest checkpoint's oldestMulti's DB: 370038709

Oldest mxid file is 29B2, newest is 3A13. No tables had a relminmxid of 1 (some of the system tables were 0, though), and the data from pg_class and pg_database is consistent. More tidbits: I just did a quick check on customer systems (5 of them). This only seems to be happening on customer systems where I specifically know there is a high number of FK lock waits (the system above gets around 1000 per minute that we know of). Other systems with higher transaction rates don't exhibit this issue; I checked a 9.3.5 database which normally needs to do XID wraparound once every 10 days, and its pg_multixact is only 48K (it has one file, ). Note that pg_clog on the bad machine is only 64K in size. How many IDs are there per mxid file? #define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset)) So for 8k blocks, there are 2k offsets (really MultiXactIds) per page, 32 pages per SLRU segment. Your file names aren't making sense to me. :( If I'm doing the math correctly, 29B2 is MXID 699 531 264 and 3A13 is 974 323 712. You're only looking in pg_multixact/members/, yes? These are members, not offsets. What's stored in members?
Relevant code starts in vacuum.c/vac_update_datfrozenxid(). If there's any rows in pg_class for tables/matviews/toast with either relfrozenxid > next XID or relminmxid > next MXID then the code *silently* pulls the plug right there. IMO we should at least issue a warning. Wait, how would that situation arise in the first place? Wraparound is supposed to prevent it. Mind you, I checked pg_class, and it matches the minmxid shown by pg_database, so that's not the smoking gun. Can you post the contents of pg_multixact/members/? Well, not as of last week, obviously. https://gist.github.com/jberkus/d05db3629e8c898664c4 I haven't pasted all the filenames, because, well, look at the count. I also added the contents of the /offsets directory, for full information. Note that we've added 400 multixact files since my first email. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
Jim Nasby jim.na...@bluetreble.com wrote: for a single system AIUI all we need to do is expose the LSN of each commit record and that will give you the exact and unique order in which transactions committed. This isn't a hypothetical feature either; if we had this, logical replication systems wouldn't have to try and fake this via batches. You could actually recreate exactly what data was visible at what time to all transactions, not just repeatable read ones (as long as you kept snapshot data as well, which isn't hard). Well, that's not entirely true for serializable transactions; there are points in time where reading the committed state could cause a transaction to roll back[1] -- either a writing transaction which would make that visible state inconsistent with the later committed state, or the reading transaction if it views something which is not (yet) consistent. That's not to say that this feature is a bad idea; part of the serializable implementation itself depends on being able to accurately determine commit order, and this feature could allow that to work more efficiently. I'm saying that, like hot standby, a replicated database could not provide truly serializable transactions (even read only ones) without something else in addition to this. We've discussed various ways of doing that. Perhaps the most promising is to include in the stream some indication of which points in the transaction stream are safe for a serializable transaction to read. If there's a way to implement commit order recording such that a two-state flag could be associated with each commit, I think it could be made to work for serializable transactions. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] https://wiki.postgresql.org/wiki/SSI#Read_Only_Transactions -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
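[Editor's note: Kevin's suggestion — a one-bit "safe for serializable read" flag carried with each commit in the ordered stream — can be sketched as a consumer that only takes snapshots at flagged commits. This is a hypothetical design sketch of the idea, not anything implemented in PostgreSQL.]

```python
# Toy sketch: a read-only replica wanting truly serializable reads scans
# the ordered commit stream and only snapshots at commits flagged safe.

def latest_safe_snapshot(commit_stream):
    """Return the last commit LSN at which a serializable read is safe.

    commit_stream is an ordered iterable of (lsn, safe_flag) pairs;
    returns None if no safe point has been seen yet.
    """
    safe_lsn = None
    for lsn, safe in commit_stream:
        if safe:
            safe_lsn = lsn
    return safe_lsn
```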
Re: [HACKERS] WAL replay bugs
On 11/4/14 10:50 PM, Michael Paquier wrote: Except pg_upgrade, are there other tests using bash? There are a few obscure things under src/test/. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Order of views in stats docs
On 11/5/14 10:57 AM, Tom Lane wrote: However, should we consider the possibility of changing the table to straight alphabetical ordering? I'm not as much in love with that approach as some folks, but it does have the merit that it's always clear where you ought to put a new item. Yes, I think that property is important when developing in a loose community. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
Jim Nasby jim.na...@bluetreble.com wrote: If we go that route we should also mention that now() will no longer be doing what you probably hope it would (AFAIK it's driven by BEGIN and not the first snapshot). There is also the fact that pg_stat_activity shows a connection as being idle in transaction as soon as BEGIN is executed; it does not wait for the snapshot to be acquired. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL format and API changes (9.5)
On 10/30/2014 09:19 PM, Andres Freund wrote: Some things I noticed while reading the patch: A lot of good comments, but let me pick up just two that are related: * There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only refer to the relation, but not to the block number. These still log their rnode manually. Shouldn't we somehow deal with those in a similar way explicit block references are dealt with? * Hm. At least WriteMZeroPageXlogRec (and probably the same for all the other slru stuff) doesn't include a reference to the page. Isn't that bad? Shouldn't we make XLogRegisterBlock() usable for that case? Otherwise I fail to see how pg_rewind like tools can sanely deal with this? Yeah, there are still operations that modify relation pages, but don't store the information about the modified pages in the standard format. That includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE, and also XLOG_DBASE_CREATE/DROP. And then there are updates to non-relation files, like all the slru stuff, relcache init files, etc. And updates to the FSM and VM bypass the full-page write mechanism too. To play it safe, pg_rewind copies all non-relation files as is. That includes all SLRUs, FSM and VM files, and everything else whose filename doesn't match the (main fork of) a relation file. Of course, that's a fair amount of copying to do, so we might want to optimize that in the future, but I want to nail the relation files first. They are usually an order of magnitude larger than the other files, after all. Unfortunately pg_rewind still needs to recognize and parse the special WAL records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files outside the normal block registration system. I've been thinking that we should add another flag to the WAL record format to mark such records. pg_rewind will still need to understand the record format of such records, but the flag will help to catch bugs of omission. 
If pg_rewind or another such tool sees a record that's flagged as special, but doesn't recognize the record type, it can throw an error. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
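[Editor's note: the classification rule Heikki proposes — flag WAL records that modify relation files outside the block-reference system, so a tool can fail loudly on a flagged record type it doesn't understand instead of silently missing changed blocks — can be sketched as below. Record shapes and type names are invented for illustration.]

```python
# Sketch of how a pg_rewind-like tool would consume flagged WAL records.

KNOWN_SPECIAL = {"SMGR_CREATE", "SMGR_TRUNCATE", "DBASE_CREATE", "DBASE_DROP"}

def collect_changed_blocks(records):
    """Accumulate touched blocks; error out on unknown 'special' records."""
    changed = set()
    for rec in records:
        for blk in rec.get("blocks", []):
            changed.add(blk)  # standard block references
        if rec.get("special"):
            if rec["type"] not in KNOWN_SPECIAL:
                # The flag is what lets us catch this case instead of
                # silently missing a relation-file change.
                raise ValueError("unrecognized special record: %s" % rec["type"])
            # a real tool would parse the special record type here
    return changed
```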
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Mon, Oct 27, 2014 at 5:15 PM, Peter Geoghegan p...@heroku.com wrote: Let's see if we can link these two thoughts. 1. You think the biggest problem is the lack of attention to the design. 2. I keep asking you to put the docs in a readable form. If you can't understand the link between those two things, I am at a loss. You've read the docs. Please be clearer. In what sense are they not readable? The main description of the feature appears on the INSERT reference page: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/sql-insert.html I've updated that reference page. I did a fair amount of copy-editing, but also updated the docs to describe the latest (unpublished) refinements to the syntax. Which is, as you and Robert requested, that the target and rejected-for-insertion tuples may be referenced with magical aliases in the style of OLD.* and NEW.*. I've spelt these aliases as TARGET.* and EXCLUDED.*, since OLD.* and NEW.* didn't seem to make much sense here. This requires some special processing during rewriting (which, as you probably know, is true of DML statements in general), and is certainly more invasive than what I had before, but all told isn't too bad. Basically, there is still an ExcludedExpr, but it only appears in the post-rewrite query tree, and is never created by the raw grammar or processed during parse analysis. I attach the doc patch with the relevant changes, in case you'd like a quick reference to where things are changed. I have already implemented the two things that you and Robert asked for most recently: A costing model for unique index inference, and the above syntax. I've also added IGNORE support to postgres_fdw (so you can IGNORE if and only if a unique index inference specification is omitted, just as with updatable views since V1.3). 
Currently, I'm working on fixing an issue with RLS that I describe in detail here: https://wiki.postgresql.org/wiki/UPSERT#RLS Once I fix that (provided it doesn't take too long), I'll publish a V1.4. AFAICT, that'll close out all of the current open issues. I hope this goes some way towards addressing your concerns. -- Peter Geoghegan From b9556286b3710234a32ce48e8d163d5e844154b8 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan p...@heroku.com Date: Fri, 26 Sep 2014 20:59:04 -0700 Subject: [PATCH 6/6] User-visible documentation for INSERT ... ON CONFLICT {UPDATE | IGNORE} INSERT ... ON CONFLICT {UPDATE | IGNORE} is documented as a new clause of the INSERT command. Some potentially surprising interactions with triggers are noted -- BEFORE INSERT per-row triggers must fire without the INSERT path necessarily being taken, for example. All the existing features that INSERT ... ON CONFLICT {UPDATE | IGNORE} fails to completely play nice with have those limitations noted. (Notes are added to the existing documentation for those other features, although some of these cases will need to be revisited). This includes postgres_fdw, updatable views and table inheritance (although these have most interesting cases covered, particularly inheritance). Finally, a user-level description of the new MVCC violation that the ON CONFLICT UPDATE variant sometimes requires has been added to Chapter 13 - Concurrency Control, beside existing commentary on READ COMMITTED mode's special handling of concurrent updates. The new MVCC violation introduced seems somewhat distinct from the existing one (i.e. what is internally referred to as the EvalPlanQual() mechanism), because in READ COMMITTED mode it is no longer necessary for any row version to be conventionally visible to the command's MVCC snapshot for an UPDATE of the row to occur (or for the row to be locked, should the WHERE clause predicate not be satisfied). 
---
 doc/src/sgml/ddl.sgml                 |  23 +++
 doc/src/sgml/fdwhandler.sgml          |   8 ++
 doc/src/sgml/indices.sgml             |  11 +-
 doc/src/sgml/keywords.sgml            |   7 +
 doc/src/sgml/mvcc.sgml                |  24
 doc/src/sgml/plpgsql.sgml             |  14 +-
 doc/src/sgml/postgres-fdw.sgml        |   8 ++
 doc/src/sgml/ref/create_index.sgml    |   7 +-
 doc/src/sgml/ref/create_rule.sgml     |   6 +-
 doc/src/sgml/ref/create_table.sgml    |   5 +-
 doc/src/sgml/ref/create_trigger.sgml  |   5 +-
 doc/src/sgml/ref/create_view.sgml     |  33 -
 doc/src/sgml/ref/insert.sgml          | 262 +++---
 doc/src/sgml/ref/set_constraints.sgml |   6 +-
 doc/src/sgml/trigger.sgml             |  29 +++-
 15 files changed, 419 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index f9dc151..1bf3537 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -2428,9 +2428,27 @@ VALUES ('Albany', NULL, NULL, 'NY');
  </para>

  <para>
+  There is limited inheritance support for <command>INSERT</command>
+  commands with <literal>ON CONFLICT</literal> clauses.  Tables with
+  children are not generally accepted as targets.
Re: [HACKERS] Tweaking Foreign Keys for larger tables
On 10/31/14 6:19 AM, Simon Riggs wrote: Various ways of tweaking Foreign Keys are suggested that are helpful for larger databases. *INITIALLY NOT ENFORCED FK created, but is not enforced during DML. Will be/Must be marked NOT VALID when first created. We can run a VALIDATE on the constraint at any time; if it passes the check it is marked VALID and presumed to stay that way until the next VALIDATE run. Does that mean the FK would become invalid after every DML operation, until you explicitly revalidate it? Is that practical? ON DELETE IGNORE ON UPDATE IGNORE If we allow this specification then the FK is one way - we check the existence of a row in the referenced table, but there is no need for a trigger on the referenced table to enforce an action on delete or update, so no need to lock the referenced table when adding FKs. Are you worried about locking the table at all, or about having to lock many rows? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Amazon Redshift
Hackers, do you think we should implement some of the functions offered by Amazon Redshift? http://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html Amazon Redshift is completely based on PostgreSQL, but they have implemented some features not currently available in PostgreSQL. Given the fact that it's a successful customer-driven commercial project, one can easily guess that the functions they implemented are useful and requested by many users. Phil
Re: [HACKERS] Amazon Redshift
philip taylor philta...@mail.com wrote: do you think we should implement some of the functions offered by Amazon Redshift? http://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html Well, obviously we already have many of the functions linked from that page. I think that if you want to go through them and pick out a few that you think are particularly promising it could start a useful discussion. Pointing at a page that includes such things as CURRENT_DATE, the avg() aggregate, and the lead() window function is just going to waste the time of everyone who follows the link. It's not like all of them would be committed in the same patch anyway. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Order of views in stats docs
On 11/5/14, 2:43 PM, Peter Eisentraut wrote: On 11/5/14 10:57 AM, Tom Lane wrote: However, should we consider the possibility of changing the table to straight alphabetical ordering? I'm not as much in love with that approach as some folks, but it does have the merit that it's always clear where you ought to put a new item. Yes, I think that property is important when developing in a loose community. Couldn't we just stick a warning SGML comment at the end of the list? ISTM that's no more likely to be missed/ignored than noticing that the list happens to be alphabetical. Perhaps the best solution is to split the list into different areas; one for database stats, another for table stats, etc. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Amazon Redshift
On 11/5/14, 3:42 PM, Kevin Grittner wrote: philip taylor philta...@mail.com wrote: do you think we should implement some of the functions offered by Amazon Redshift? http://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html Well, obviously we already have many of the functions linked from that page. I think that if you want to go through them and pick out a few that you think are particularly promising it could start a useful discussion. Pointing at a page that includes such things as CURRENT_DATE, the avg() aggregate, and the lead() window function is just going to waste the time of everyone who follows the link. It's not like all of them would be committed in the same patch anyway. At the risk of stating the obvious, before we look at implementing them we should see if Amazon would donate that effort. But as Kevin said, first step is discussion on what we don't have and whether we want it. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] superuser() shortcuts
Attached is two separate patches to address previous comments/recommendations: * superuser-cleanup-shortcuts_11-5-2014.patch * has_privilege-cleanup_11-5-2014.patch -Adam -- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com diff --git a/contrib/test_decoding/expected/permissions.out b/contrib/test_decoding/expected/permissions.out new file mode 100644 index 212fd1d..f011955 *** a/contrib/test_decoding/expected/permissions.out --- b/contrib/test_decoding/expected/permissions.out *** RESET ROLE; *** 54,66 -- plain user *can't* can control replication SET ROLE lr_normal; SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding'); ! ERROR: must be superuser or replication role to use replication slots INSERT INTO lr_test VALUES('lr_superuser_init'); ERROR: permission denied for relation lr_test SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); ! ERROR: must be superuser or replication role to use replication slots SELECT pg_drop_replication_slot('regression_slot'); ! ERROR: must be superuser or replication role to use replication slots RESET ROLE; -- replication users can drop superuser created slots SET ROLE lr_superuser; --- 54,69 -- plain user *can't* can control replication SET ROLE lr_normal; SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding'); ! ERROR: permission denied to use replication slots ! HINT: You must be superuser or replication role to use replication slots. INSERT INTO lr_test VALUES('lr_superuser_init'); ERROR: permission denied for relation lr_test SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); ! ERROR: permission denied to use replication slots ! HINT: You must be superuser or replication role to use replication slots. SELECT pg_drop_replication_slot('regression_slot'); ! 
ERROR:  permission denied to use replication slots
! HINT:  You must be superuser or replication role to use replication slots.
RESET ROLE;
-- replication users can drop superuser created slots
SET ROLE lr_superuser;
*** SELECT 'init' FROM pg_create_logical_rep
*** 90,96 ****
RESET ROLE;
SET ROLE lr_normal;
SELECT pg_drop_replication_slot('regression_slot');
! ERROR:  must be superuser or replication role to use replication slots
RESET ROLE;
-- all users can see existing slots
SET ROLE lr_superuser;
--- 93,100 ----
RESET ROLE;
SET ROLE lr_normal;
SELECT pg_drop_replication_slot('regression_slot');
! ERROR:  permission denied to use replication slots
! HINT:  You must be superuser or replication role to use replication slots.
RESET ROLE;
-- all users can see existing slots
SET ROLE lr_superuser;

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
new file mode 100644
index 133143d..52d8208
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
*** 27,32 **** --- 27,33 ----
  #include "miscadmin.h"
  #include "replication/walreceiver.h"
  #include "storage/smgr.h"
+ #include "utils/acl.h"
  #include "utils/builtins.h"
  #include "utils/numeric.h"
  #include "utils/guc.h"
*** pg_start_backup(PG_FUNCTION_ARGS)
*** 54,63 ****
  backupidstr = text_to_cstring(backupid);
! if (!superuser() && !has_rolreplication(GetUserId()))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! errmsg("must be superuser or replication role to run a backup")));
  startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
--- 55,65 ----
  backupidstr = text_to_cstring(backupid);
! if (!has_replication_privilege(GetUserId()))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! errmsg("permission denied to run a backup"),
! errhint("You must be superuser or replication role to run a backup.")));
  startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
*** pg_stop_backup(PG_FUNCTION_ARGS)
*** 82,91 ****
  {
  XLogRecPtr stoppoint;
! if (!superuser() && !has_rolreplication(GetUserId()))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! errmsg("must be superuser or replication role to run a backup")));
  stoppoint = do_pg_stop_backup(NULL, true, NULL);
--- 84,94 ----
  {
  XLogRecPtr stoppoint;
! if (!has_replication_privilege(GetUserId()))
  ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! errmsg("permission denied to run a backup"),
! errhint("You must be superuser or replication role to run a backup.")));
  stoppoint = do_pg_stop_backup(NULL, true, NULL);

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
new file mode 100644
index d30612c..4d26b02
*** a/src/backend/catalog/aclchk.c
---
Re: [HACKERS] tracking commit timestamps
On 11/05/2014 11:23 AM, Jim Nasby wrote: Except that commit time is not guaranteed unique *even on a single system*. That's my whole point. If we're going to bother with all the commit time machinery it seems really silly not to provide a way to uniquely order every commit. Clearly trying to uniquely order commits across multiple systems is a far larger problem, and I'm not suggesting we attempt that. But for a single system AIUI all we need to do is expose the LSN of each commit record and that will give you the exact and unique order in which transactions committed. This isn't a hypothetical feature either; if we had this, logical replication systems wouldn't have to try and fake this via batches. You could actually recreate exactly what data was visible at what time to all transactions, not just repeatable read ones (as long as you kept snapshot data as well, which isn't hard). As for how much data to keep, if you have a process that's doing something to record this information permanently all it needs to do is keep an old enough snapshot around. That's not that hard to do, even from user space. +1 for this. It isn't just 'replication' systems that have a need for getting the commit order of transactions on a single system. I have an application (not slony) where we want to query a table but order the output based on the transaction commit order of when the insert into the table was done (think of a queue). I'm not replicating the output but passing the data to other applications for further processing. If I just had the commit timestamp I would need to put in some other condition to break ties in a consistent way. I think being able to get an ordering by commit LSN is what I really want in this case not the timestamp. Logical decoding is one solution to this (that I was considering) but being able to do something like select * FROM event_log order by commit_id would be a lot simpler.
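The gap being described can be sketched as follows. `event_log` is the table from the message above; `pg_xact_commit_timestamp()` stands for whatever lookup function the patch ends up exposing, and `pg_xact_commit_lsn()` is purely hypothetical — it is the function this sub-thread wishes existed:

```sql
-- With only a commit timestamp, ties within one microsecond are possible:
SELECT e.*, pg_xact_commit_timestamp(e.xmin) AS committed_at
FROM event_log e
ORDER BY committed_at;              -- order among tied rows is undefined

-- What is being asked for: a total order derived from the commit LSN
-- (hypothetical function, does not exist)
SELECT e.*
FROM event_log e
ORDER BY pg_xact_commit_lsn(e.xmin);
```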
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Amazon Redshift
do you think we should implement some of the functions offered by Amazon Redshift? http://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html Well, obviously we already have many of the functions linked from that page. I think that if you want to go through them and pick out a few that you think are particularly promising it could start a useful discussion. Pointing at a page that includes such things as CURRENT_DATE, the avg() aggregate, and the lead() window function is just going to waste the time of everyone who follows the link. It's not like all of them would be committed in the same patch anyway. At the risk of stating the obvious, before we look at implementing them we should see if Amazon would donate that effort. But as Kevin said, first step is discussion on what we don't have and whether we want it. Ok, this is a summary of what they have that we don't (of course, I could have missed something): Date Functions http://docs.aws.amazon.com/redshift/latest/dg/r_ADD_MONTHS.html http://docs.aws.amazon.com/redshift/latest/dg/r_DATEADD_function.html http://docs.aws.amazon.com/redshift/latest/dg/r_DATEDIFF_function.html http://docs.aws.amazon.com/redshift/latest/dg/r_LAST_DAY.html http://docs.aws.amazon.com/redshift/latest/dg/r_MONTHS_BETWEEN_function.html http://docs.aws.amazon.com/redshift/latest/dg/r_NEXT_DAY.html String Functions http://docs.aws.amazon.com/redshift/latest/dg/REGEXP_COUNT.html http://docs.aws.amazon.com/redshift/latest/dg/REGEXP_INSTR.html http://docs.aws.amazon.com/redshift/latest/dg/r_STRTOL.html http://docs.aws.amazon.com/redshift/latest/dg/crc32-function.html http://docs.aws.amazon.com/redshift/latest/dg/FUNC_SHA1.html JSON Functions http://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html http://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_ARRAY_ELEMENT_TEXT.html http://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html Window functions 
http://docs.aws.amazon.com/redshift/latest/dg/r_WF_RATIO_TO_REPORT.html http://docs.aws.amazon.com/redshift/latest/dg/r_WF_MEDIAN.html -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
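For context, several of the listed functions already have reasonable spellings in stock PostgreSQL. The sketch below shows rough equivalents (approximations only, not drop-in replacements — they don't mimic Redshift's exact NULL handling or rounding). Note also that json_array_length() and json_extract_path_text() already exist for the json type as of 9.3:

```sql
-- ADD_MONTHS(d, 3)
SELECT date '2014-11-05' + interval '3 months';
-- DATEDIFF(day, a, b)
SELECT date '2014-12-01' - date '2014-11-05';
-- LAST_DAY(d)
SELECT (date_trunc('month', date '2014-11-05')
        + interval '1 month' - interval '1 day')::date;
-- REGEXP_COUNT(s, pat): count matches with the 'g' flag
SELECT count(*) FROM regexp_matches('abcabc', 'a', 'g');
-- MEDIAN(x): 9.4's ordered-set aggregate
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY x)
FROM (VALUES (1.0), (2.0), (4.0)) v(x);
-- RATIO_TO_REPORT(x) OVER ()
SELECT x / sum(x) OVER () FROM (VALUES (1.0), (3.0)) v(x);
```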
Re: [HACKERS] WAL format and API changes (9.5)
On 2014-11-05 23:08:31 +0200, Heikki Linnakangas wrote: On 10/30/2014 09:19 PM, Andres Freund wrote: Some things I noticed while reading the patch: A lot of good comments, but let me pick up just two that are related: * There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only refer to the relation, but not to the block number. These still log their rnode manually. Shouldn't we somehow deal with those in a similar way explicit block references are dealt with? * Hm. At least WriteMZeroPageXlogRec (and probably the same for all the other slru stuff) doesn't include a reference to the page. Isn't that bad? Shouldn't we make XLogRegisterBlock() usable for that case? Otherwise I fail to see how pg_rewind like tools can sanely deal with this? Yeah, there are still operations that modify relation pages, but don't store the information about the modified pages in the standard format. That includes XLOG_SMGR_TRUNCATE that you spotted, and XLOG_SMGR_CREATE, and also XLOG_DBASE_CREATE/DROP. And then there are updates to non-relation files, like all the slru stuff, relcache init files, etc. And updates to the FSM and VM bypass the full-page write mechanism too. That's an awful number of special cases. I see little reason not to invent something that can reference at least the relation in those cases. Then we can at least only resync the relations if there were changes. E.g. for VMs that's not necessarily all that likely in an insert-mostly case. I'm not that worried about things like create/drop database, but fsm, vm, and the various slru's are really somewhat essential. To play it safe, pg_rewind copies all non-relation files as is. That includes all SLRUs, FSM and VM files, and everything else whose filename doesn't match the (main fork of) a relation file. Of course, that's a fair amount of copying to do, so we might want to optimize that in the future, but I want to nail the relation files first. 
They are usually an order of magnitude larger than the other files, after all. That's fair enough. Unfortunately pg_rewind still needs to recognize and parse the special WAL records like XLOG_SMGR_CREATE/TRUNCATE, that modify relation files outside the normal block registration system. I've been thinking that we should add another flag to the WAL record format to mark such records. So every time pg_rewind comes across a record with that flag set for which it doesn't have special-case code, it'd balk? Sounds sensible. Personally I think we should at least have a generic format to refer to entire relations without a specific block number. And one to refer to SLRUs. I don't think we necessarily need to implement them now, but we should make sure there's bit space left to denote them. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sequence Access Method WIP
On 05/11/14 18:32, Heikki Linnakangas wrote: On 11/05/2014 05:07 PM, Petr Jelinek wrote: On 05/11/14 13:45, Heikki Linnakangas wrote: In fact, if the seqam manages the current value outside the database (e.g. a remote seqam that gets the value from another server), nextval() never needs to write a WAL record. Sure it does, you need to keep the current state in Postgres also, at least the current value so that you can pass correct input to sequence_alloc(). And you need to do this in a crash-safe way, so WAL is necessary. Why does sequence_alloc need the current value? If it's a remote seqam, the current value is kept in the remote server, and the last value that was given to this PostgreSQL server is irrelevant. Hmm, I am not sure if I see this use case as practical TBH, but I also don't see a fundamental problem with it. That irks me with this API. The method for acquiring a new value isn't fully abstracted behind the AM interface, as sequence.c still needs to track it itself. That's useful for the local AM, of course, and maybe some others, but for others it's totally useless. Hmm, I think that kind of abstraction can only be done by passing the current tuple and returning an updated tuple (yes I realize that it's what you have been saying basically). In general it sounds like the level of abstraction you'd want would be one where the AM cares about everything except the code that does the actual writes to the page and WAL (but when to do those would still be controlled completely by the AM?) and the SQL interface. I don't see how to make that work with ALTER SEQUENCE USING to be honest, and I do care quite a lot about that use case (I think the ability to convert the local sequences to 3rd party ones and back is very important). 
That's not the issue I was referring to, I was talking about the page replacement code which is not as simple now that we have potentially dynamic size tuple and if tuples were different for different AMs the code would still have to be able to handle that case. Setting the values in tuple itself is not too complicated. I don't see the problem with that. We deal with variable-sized tuples in heap pages all the time. The max size of amdata (or the extra AM-specific columns) is going to be determined by the block size, though. Glad to hear that. Yes the limit is block size, I think we can live with that at least for the moment... -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 2014-11-05 17:17:05 -0500, Steve Singer wrote: It isn't just 'replication' systems that have a need for getting the commit order of transactions on a single system. I have a application (not slony) where we want to query a table but order the output based on the transaction commit order of when the insert into the table was done (think of a queue). I'm not replicating the output but passing the data to other applications for further processing. If I just had the commit timestamp I would need to put in some other condition to break ties in a consistent way. I think being able to get an ordering by commit LSN is what I really want in this case not the timestamp. Logical decoding is one solution to this (that I was considering) but being able to do something like select * FROM event_log order by commit_id would be a lot simpler. Imo that's essentially a different feature. What you essentially would need here is a 'commit sequence number' - but no timestamps. And probably to be useful that number has to be 8 bytes in itself. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
On Wed, Nov 5, 2014 at 12:54 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Thanks for the updated patch. Now when I run the test program (version with better error reporting attached), it runs fine until I open a psql session and issue: reindex table foo; Then it immediately falls over with some rows no longer being findable through the index. -- use index select count(*) from foo where text_array = md5(4611::text); 0 -- use seq scan select count(*) from foo where text_array||'' = md5(4611::text); 1 Where the number '4611' was taken from the error message of the test program. brin_crash.pl Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Amazon Redshift
On 11/05/2014 02:36 PM, philip taylor wrote: Ok, this is a summary of what they have that we don't (of course, I could have missed something): I can't see any functions on that list I'd want. For example, DATEADD is there just to be compatible with MSSQL. It's useless to us. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] numeric_normalize() is a few bricks shy of a load
9.4 introduced a function numeric_normalize() whose charter is to produce a string representing a numeric, such that two numerics' string representations will compare equal if and only if the numeric values compare equal. (This is used, so far, only by GIN indexes on JSONB.) Thomas Fanghaenel pointed out to me that this function is utterly, completely broken: it will reduce distinct values such as 1 and 1000 to the same string, and what's much worse for the GIN use-case, it can reduce values that should compare equal to distinct strings. Here's a test case demonstrating the latter assertion:

regression=# create table t1 (f1 jsonb);
CREATE TABLE
regression=# insert into t1 values('{"x": 12000}');
INSERT 0 1
regression=# insert into t1 values('{"x": 12000.00}');
INSERT 0 1
regression=# select * from t1 where f1 @> '{"x": 12000}';
       f1
-----------------
 {"x": 12000}
 {"x": 12000.00}
(2 rows)

regression=# create index on t1 using gin (f1);
CREATE INDEX
regression=# set enable_seqscan TO 0;
SET
regression=# select * from t1 where f1 @> '{"x": 12000}';
      f1
--------------
 {"x": 12000}
(1 row)

regression=# select * from t1 where f1 @> '{"x": 12000.00}';
        f1
-----------------
 {"x": 12000.00}
(1 row)

Since JSONB GIN indexes are always considered lossy, the reduction-to-same-string case is masked by index rechecks, resulting only in an inefficient index that fails to distinguish keys it could distinguish. However, the reduction-to-different-strings case results in wrong answers, as illustrated above. I think there's no choice but to fix this in 9.4. The only reason it even needs discussion is that this would invalidate index entries in beta testers' JSONB GIN indexes, such that queries would not find entries that they did find before. I'm not sure we can do anything about this except recommend that beta testers REINDEX such indexes after updating to 9.4next. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: dynahash replacement for buffer table
Hi, git branch also available at: http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014 A minor review of this: * Should be rebased on top of the atomics API * In benchmarks it becomes apparent that the dynamic element width makes some macros like CHashTableGetNode() and CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86 can't figure out how to build an efficient expression for the target. There's two ways to address this: a) Actually morph chash into something that will be included into the user sites. Since I think there'll only be a limited number of those that's probably acceptable. b) Simplify the access rules. I quite seriously doubt that the interleaving of garbage and freelists is actually necessary/helpful. * There's several cases where it's noticeable that chash creates more cacheline misses - that's imo a good part of why the single threaded performance regresses. There's good reasons for the current design without an inline bucket, namely that it makes the concurrency handling easier. But I think that can be countered by still storing a pointer - just to an element inside the bucket. Afaics that'd allow this to be introduced quite easily? I'm afraid that I can't see us going forward unless we address the noticeable single threaded penalties. Do you see that differently? * There's some whitespace damage. (space followed by tab, followed by actual contents) * I still think it's a good idea to move the full memory barriers into the cleanup/writing process by doing write memory barriers when acquiring a hazard pointer and RMW atomic operations (e.g. atomic add) in the process testing for the hazard pointer. * Shouldn't we relax the CPU in a couple places like CHashAllocate(), CHashBucketScan()'s loops? * I don't understand right now (but then I'm in a Hotel bar) why it's safe for CHashAddToGarbage() to clobber the hash value? CHashBucketScan() relies on the chain being ordered by hashcode. No? 
Another backend might just have been past the IsMarked() bit and look at the hashcode? * We really should find a way to sensibly print the stats. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On Mon, Nov 3, 2014 at 2:14 PM, Álvaro Hernández Tortosa a...@8kdata.com wrote: Given a transaction started with BEGIN (REPEATABLE READ | SERIALIZABLE), if a concurrent session commits some data before *any* query within the first transaction, that committed data is seen by the transaction. This is not what I'd expect. I think the problem is with your expectation, not the behavior. Serializable means that the transactions execute in such a fashion that their parallel execution is equivalent to some serial order of execution. It doesn't say that it must be equivalent to any particular order that you might imagine, such as the order in which the transactions commit, or, as you propose, the order in which they begin. Generally, I think that's a good thing, because transaction isolation is expensive: even at repeatable read, but for the need to provide transaction isolation, there would be no such thing as bloat. The repeatable read and serializable levels add further overhead of various kinds. We could provide a super-duper serializable level that provides even tighter guarantees, but (1) I can't imagine many people are keen to make the cost of serialization even higher than it already is and (2) if you really want that behavior, just do some trivial operation sufficient to cause a snapshot to be taken immediately after the BEGIN. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL replay bugs
On Thu, Nov 6, 2014 at 5:41 AM, Peter Eisentraut pete...@gmx.net wrote: On 11/4/14 10:50 PM, Michael Paquier wrote: Except pg_upgrade, are there other tests using bash? There are a few obscure things under src/test/. Oh, I see. There is quite a number here, and each script is doing quite different things.. $ git grep /sh src/test/ src/test/locale/de_DE.ISO8859-1/runall:#! /bin/sh src/test/locale/gr_GR.ISO8859-7/runall:#! /bin/sh src/test/locale/koi8-r/runall:#! /bin/sh src/test/locale/koi8-to-win1251/runall:#! /bin/sh src/test/mb/mbregress.sh:#! /bin/sh src/test/performance/start-pgsql.sh:#!/bin/sh src/test/regress/regressplans.sh:#! /bin/sh -- Michael
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On 05/11/14 17:46, Jim Nasby wrote: On 11/4/14, 6:11 PM, Álvaro Hernández Tortosa wrote: Should we improve then the docs stating this more clearly? Any objection to do this? If we go that route we should also mention that now() will no longer be doing what you probably hope it would (AFAIK it's driven by BEGIN and not the first snapshot). If I understand you correctly, you mean that if we add a note to the documentation stating that the transaction really freezes when you do the first query, people would expect now() to be also frozen when the first query is done, which is not what happens (it's frozen at tx start). Then, yes, you're right, probably *both* the isolation levels and the now() function documentation should be patched to become more precise. Perhaps we should change how now() works, but I'm worried about what that might do to existing applications... Perhaps, I also believe it might not be good for existing applications, but it definitely has a different freeze behavior, which seems inconsistent too. Thanks, Álvaro -- Álvaro Hernández Tortosa --- 8Kdata -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On 06/11/14 00:42, Robert Haas wrote: On Mon, Nov 3, 2014 at 2:14 PM, Álvaro Hernández Tortosa a...@8kdata.com wrote: Given a transaction started with BEGIN (REPEATABLE READ | SERIALIZABLE), if a concurrent session commits some data before *any* query within the first transaction, that committed data is seen by the transaction. This is not what I'd expect. I think the problem is with your expectation, not the behavior. But my expectation is derived from the documentation: The Repeatable Read isolation level only sees data committed before the transaction began; In PostgreSQL you will see data committed after a BEGIN ... (REPEATABLE READ | SERIALIZABLE) statement (only before the first query). And it's reasonable to think that transaction begins when you issue a BEGIN statement. It's also reasonable to think this way as: - now() is frozen at BEGIN time, as Nasby pointed out - pg_stat_activity says that the transaction is started, as Kevin mentioned So if the behavior is different from what the documentation says and what other external indicators may point out, I think at least documentation should be clear about this precise behavior, to avoid confusing users. Serializable means that the transactions execute in such a fashion that their parallel execution is equivalent to some serial order of execution. It doesn't say that it must be equivalent to any particular order that you might imagine, such as the order in which the transactions commit, or, as you propose, the order in which they begin. Generally, I think that's a good thing, because transaction isolation is expensive: even at repeatable read, but for the need to provide transaction isolation, there would be no such thing as bloat. The repeatable read and serializable levels add further overhead of various kinds. 
We could provide a super-duper serializable level that provides even tighter guarantees, but (1) I can't imagine many people are keen to make the cost of serialization even higher than it already is and (2) if you really want that behavior, just do some trivial operation sufficient to cause a snapshot to be taken immediately after the BEGIN. I'm not really asking for a new isolation level, just that either BEGIN really freezes (for repeatable read and serializable) or if that's expensive and not going to happen, that the documentation clearly states the fact that freeze starts at first-query-time, and that if you need to freeze before your first real query, you should do a dummy one instead (like SELECT 1). Also, if this early freeze is a performance hit -and for that reason BEGIN is not going to be changed to freeze- then that also should be pointed out in the documentation, so that users that freeze early with SELECT 1-type queries understand that. Regards, Álvaro -- Álvaro Hernández Tortosa --- 8Kdata -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
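The workaround both Robert and Álvaro mention amounts to this (plain SQL; the comments summarize the snapshot-timing behavior described above):

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- now() is frozen here, at BEGIN time ...
SELECT 1;  -- ... but the snapshot is only taken by this first (dummy) query
-- from this point on, rows committed by concurrent sessions are invisible
COMMIT;
```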
Re: [HACKERS] tracking commit timestamps
On 11/05/2014 05:43 PM, Andres Freund wrote: On 2014-11-05 17:17:05 -0500, Steve Singer wrote: Imo that's essentially a different feature. What you essentially would need here is a 'commit sequence number' - but no timestamps. And probably to be useful that number has to be 8 bytes in itself. I think this gets to the heart of some of the differing views people have expressed on this patch. Is this patch supposed to: A) Add commit timestamp tracking but nothing more B) Add infrastructure to store commit timestamps and provide a facility for storing additional bits of data extensions might want to be associated with the commit C) Add commit timestamps and node identifiers to commits If the answer is (A) then I can see why some people are objecting to adding extradata. If the answer is (B) then it's fair to ask how well this patch handles various types of things people might want to attach to the commit record (such as the LSN). I think the problem is that once you start providing a facility extensions can use to store data along with the commit record then being restricted to 4 or 8 bytes is very limiting. It also doesn't allow you to load two extensions at once on a system. You wouldn't be able to have both the 'steve_commit_order' extension and BDR installed at the same time. I don't think this patch does a very good job at (B) but it wasn't intended to. If what we are really doing is (C), and we're just calling the space 'extradata' until we get the logical identifier stuff in and then we are going to rename extradata to nodeid, then we should say so. If we are really concerned about the pg_upgrade impact of expanding the record later then maybe we should add 4 bytes of padding to the CommitTimeStampEntry now but leave manipulating the node id until later. Steve Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] recovery_target_time and standby_mode
Hackers, Someone brought me an issue that recovery_target_time and standby_mode weren't working as they expected. I think that the way they work now makes sense, but we do need to clarify it in the docs. However, I'm posting this to hackers first in case the way these two work together *isn't* as intended. Setup: two servers are restored from the same pgBarman archive. The master is brought to a specific point in time using recovery_target_time. Then the user attempts to do the same with the new replica. recovery.conf: recovery_target_time = 'SOME-PAST-TIMESTAMP' standby_mode = on primary_conninfo = 'host=mymaster user=postgres port=5432' How It Works Now: When the recovery_target_time is reached, standby_mode is ignored and the server comes up as a standalone. How The User Wanted It To Work: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Note that there is a workaround for what the user wants to do. I'm just trying to clarify what our desired behavior is. From there we can either work on patches or on doc fixes. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] B-Tree index builds, CLUSTER, and sortsupport
On 10/11/2014 02:26 AM, Peter Geoghegan wrote: Both Robert and Heikki expressed dissatisfaction with the fact that B-Tree index builds don't use sortsupport. Because B-Tree index builds cannot really use the onlyKey optimization, the historic oversight of not supporting B-Tree builds (and CLUSTER) might have been at least a little understandable [1]. But with the recent work on sortsupport for text, it's likely that B-Tree index builds will be missing out on rather a lot by not taking advantage of sortsupport. Attached patch modifies tuplesort to correct this oversight. What's really nice about it is that there is actually a net negative code footprint: src/backend/access/nbtree/nbtsort.c | 63 +++--- src/backend/utils/sort/sortsupport.c | 77 ++-- src/backend/utils/sort/tuplesort.c | 274 +++ src/include/utils/sortsupport.h | 3 + 4 files changed, 203 insertions(+), 214 deletions(-) The code compiles and passes the test suite. I looked at the changes to the code. The new code is clean and there is more code re-use and improved readability. One possible further improvement would be to move the preparation of SortSupport to a common function since this is done three times in the code. I did some simple benchmarks by adding indexes to temporary tables and could see improvements of around 10% in index build time. So it gives a nice, but not amazing, performance improvement. Is there any case where we should expect any greater performance improvement? Either way I find this a nice patch which improves code quality and performance. = Minor code style issues I found - There is a double space in strategy = (scanKey-sk_flags [...]. - I think there should be a newline in tuplesort_begin_index_btree() before /* Prepare SortSupport data for each column */. - Remove the extra newline after reversedirection(). - End sentences in comments with a period. That seems to be the common practice in the project.
-- Andreas Karlsson
Re: [HACKERS] numeric_normalize() is a few bricks shy of a load
* Tom Lane (t...@sss.pgh.pa.us) wrote: I think there's no choice but to fix this in 9.4. The only reason it even needs discussion is that this would invalidate index entries in beta testers' JSONB GIN indexes, such that queries would not find entries that they did find before. I'm not sure we can do anything about this except recommend that beta testers REINDEX such indexes after updating to 9.4next. Thoughts? I'm not happy about it, but I agree with your assessment. Beta testers will survive (not exactly the first time they've had to deal with such changes..). We surely can't ship it as-is. Thanks, Stephen
Re: [HACKERS] recovery_target_time and standby_mode
On Thu, Nov 6, 2014 at 12:32 AM, Josh Berkus j...@agliodbs.com wrote: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Then shouldn't he just not specify a recovery_target at all? That's the default behaviour for standby_mode on; the whole point of recovery_target is to specify when to stop recovery and leave standby mode, no? -- greg
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On 11/5/14, 6:04 PM, Álvaro Hernández Tortosa wrote: On 05/11/14 17:46, Jim Nasby wrote: On 11/4/14, 6:11 PM, Álvaro Hernández Tortosa wrote: Should we then improve the docs to state this more clearly? Any objection to doing this? If we go that route we should also mention that now() will no longer be doing what you probably hope it would (AFAIK it's driven by BEGIN and not the first snapshot). If I understand you correctly, you mean that if we add a note to the documentation stating that the transaction really freezes when you do the first query, people would expect now() to also be frozen when the first query is done, which is not what happens (it's frozen at tx start). Then, yes, you're right, probably *both* the isolation levels and the now() function documentation should be patched to become more precise. Bingo. Hrm, is there anything else that differs between the two? Perhaps we should change how now() works, but I'm worried about what that might do to existing applications... Perhaps; I also believe it might not be good for existing applications, but it definitely has a different freeze behavior, which seems inconsistent too. Yeah, I'd really rather fix it... Hmm... we do have transaction_timestamp(); perhaps we could leave that as the time BEGIN executed and shift everything else to use the snapshot time. Or we could do the opposite. I suspect that would be more likely to cause data quality errors, but anyone that treats timestamps as magic values is going to have those anyway. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] B-Tree index builds, CLUSTER, and sortsupport
Thanks for the review. On Wed, Nov 5, 2014 at 4:33 PM, Andreas Karlsson andr...@proxel.se wrote: I looked at the changes to the code. The new code is clean and there is more code re-use and improved readability. One possible further improvement would be to move the preparation of SortSupport to a common function since this is done three times in the code. The idea there is to have more direct control of sortsupport. With the abbreviated keys patch, abbreviation occurs based on a decision made by tuplesort.c. I can see why you'd say that, but I prefer to keep initialization of sortsupport structs largely concentrated in tuplesort.c, and more or less uniform regardless of the tuple-type being sorted. I did some simple benchmarks by adding indexes to temporary tables and could see improvements of around 10% in index build time. So it gives a nice, but not amazing, performance improvement. Cool. Is there any case where we should expect any greater performance improvement? The really compelling case is abbreviated keys - as you probably know, there is a patch that builds on this patch, and the abbreviated keys patch, so that B-Tree builds and CLUSTER can use abbreviated keys too. That doesn't really have much to do with this patch, though. The important point is that heap tuple sorting (with a query that has no client overhead, and involves one big sort) and B-Tree index creation both have their tuplesort as a totally dominant cost. The improvements for each case should be quite comparable, which is good (except, as noted in my opening e-mail, when heap/datum tuplesorting can use the onlyKey optimization, while B-Tree/CLUSTER tuplesorting cannot). -- Peter Geoghegan
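The abbreviated-keys technique Peter refers to can be illustrated with a toy comparator: pack a fixed-size prefix of each string into an integer so that most comparisons never touch the full value, and fall back to the authoritative comparator only on ties. This is a sketch of the idea only, not PostgreSQL's actual sortsupport code:

```c
#include <stdint.h>
#include <string.h>

/*
 * Toy abbreviated key: the first 8 bytes of a NUL-terminated string,
 * packed big-endian so that integer comparison matches byte order.
 */
static uint64_t
abbrev_key(const char *s)
{
    uint64_t key = 0;

    for (int i = 0; i < 8 && s[i] != '\0'; i++)
        key |= (uint64_t) (unsigned char) s[i] << (8 * (7 - i));
    return key;
}

/*
 * Compare abbreviations first; only when they tie do we pay for the
 * full comparison. Real sortsupport caches the abbreviation alongside
 * each SortTuple instead of recomputing it per comparison.
 */
static int
abbrev_cmp(const char *a, const char *b)
{
    uint64_t ka = abbrev_key(a);
    uint64_t kb = abbrev_key(b);

    if (ka != kb)
        return ka < kb ? -1 : 1;
    return strcmp(a, b);    /* authoritative tiebreak */
}
```

The win comes from sorts where most keys differ within the first few bytes, which is why the benefit compounds with the text sortsupport work mentioned in the thread.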
Re: [HACKERS] [BUGS] ltree::text not immutable?
On Tue, Nov 4, 2014 at 6:48 PM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: An alternative that just occurred to me is to put the no-volatile- I/O-functions check into CREATE TYPE, but make it just WARNING not ERROR. That would be nearly as good as an ERROR in terms of nudging people who'd accidentally omitted a volatility marking from their custom types. But we could leave chkpass as-is and wait to see if anyone complains "hey, why am I getting this warning?". If we don't hear anyone complaining, maybe that means we can get away with changing the type's behavior in 9.6 or later. Attached is a complete patch along these lines. As I suggested earlier, this just makes the relevant changes in ltree--1.0.sql and pg_trgm--1.1.sql without bumping their extension version numbers, since it doesn't seem important enough to justify a version bump. I propose that we could back-patch the immutability additions in ltree and pg_trgm, since they won't hurt anything and might make life a little easier for future adopters of those modules. The WARNING additions should only go into HEAD though. I don't understand why you went to all the trouble of building a versioning system for extensions if you're not going to use it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] recovery_target_time and standby_mode
On 11/05/2014 05:00 PM, Greg Stark wrote: On Thu, Nov 6, 2014 at 12:32 AM, Josh Berkus j...@agliodbs.com wrote: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Then shouldn't he just not specify a recovery_target at all? That's the default behaviour for standby_mode on; the whole point of recovery_target is to specify when to stop recovery and leave standby mode, no? Their goal was to restore a master *and replica* to a point in time before Bad Stuff happened, and then have a working master-replica pair. Like I said, this is probably working as intended, we just need to clarify it in the docs. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] recovery_target_time and standby_mode
On Thu, Nov 6, 2014 at 10:00 AM, Greg Stark st...@mit.edu wrote: On Thu, Nov 6, 2014 at 12:32 AM, Josh Berkus j...@agliodbs.com wrote: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Then shouldn't he just not specify a recovery_target at all? That's the default behaviour for standby_mode on; the whole point of recovery_target is to specify when to stop recovery and leave standby mode, no? Agreed with Greg: once a recovery target is reached, the node gets out of recovery. What the user should have done here is not specify recovery_target_time in the standby's recovery.conf, so that it follows the master through streaming. -- Michael
Re: [HACKERS] recovery_target_time and standby_mode
On Thu, Nov 6, 2014 at 10:41 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Nov 6, 2014 at 10:00 AM, Greg Stark st...@mit.edu wrote: On Thu, Nov 6, 2014 at 12:32 AM, Josh Berkus j...@agliodbs.com wrote: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Then shouldn't he just not specify a recovery_target at all? That's the default behaviour for standby_mode on; the whole point of recovery_target is to specify when to stop recovery and leave standby mode, no? Agreed with Greg: once a recovery target is reached, the node gets out of recovery. What the user should have done here is not specify recovery_target_time in the standby's recovery.conf, so that it follows the master through streaming. Just adding: ... on the new timeline that the master switches to. If the standby has already replayed past the point where the WAL forked on the master, then the standby needs to be rewound. -- Michael
Re: [HACKERS] recovery_target_time and standby_mode
On 11/05/2014 05:41 PM, Michael Paquier wrote: On Thu, Nov 6, 2014 at 10:00 AM, Greg Stark st...@mit.edu wrote: On Thu, Nov 6, 2014 at 12:32 AM, Josh Berkus j...@agliodbs.com wrote: When the recovery_target_time is reached, switch to streaming replication and stay a standby. Then shouldn't he just not specify a recovery_target at all? That's the default behaviour for standby_mode on; the whole point of recovery_target is to specify when to stop recovery and leave standby mode, no? Agreed with Greg: once a recovery target is reached, the node gets out of recovery. What the user should have done here is not specify recovery_target_time in the standby's recovery.conf, so that it follows the master through streaming. What I'm pointing out is that you can't actually do that. You think you can, but you can't. Instead, what you need to do is: 1) recover to target_time; 2) pause; 3) shut down the replica; 4) replace recovery.conf with one that streams; 5) restart the replica. This is consistent behavior and makes sense when you think about it. So I think what we need to do is clarify in the documentation covering recovery_target and standby_mode that they are exclusive. Hmmm. You know, I think this means we do have a bug. If recovery_target_time and standby_mode are exclusive, we should error if the user attempts to set them both. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Sun, Nov 2, 2014 at 7:31 AM, Simon Riggs si...@2ndquadrant.com wrote: The procgloballist stuff should be the subject of a separate patch which I agree with. Yes, I think that's probably a net improvement in robustness quite apart from what we decide to do about any of the rest of this. I've attached it here as revise-procglobal-tracking.patch and will commit that bit if nobody objects. The remainder is reattached without change as group-locking-v0.1.patch. Per your other comment, I've developed the beginnings of a testing framework which I attached here as test_group_locking-v0.patch. That doesn't look to have much hope of evolving into something we'd want even in contrib, but I think it'll be rather useful for debugging. It works like this: rhaas=# create table foo (a int); CREATE TABLE rhaas=# select test_group_locking('1.0:start,2.0:start,1.0:lock:AccessExclusiveLock:foo,2.0:lock:AccessExclusiveLock:foo'); NOTICE: starting worker 1.0 NOTICE: starting worker 2.0 NOTICE: instructing worker 1.0 to acquire AccessExclusiveLock on relation with OID 16387 NOTICE: instructing worker 2.0 to acquire AccessExclusiveLock on relation with OID 16387 ERROR: could not obtain AccessExclusiveLock on relation with OID 16387 CONTEXT: background worker, group 2, task 0 The syntax is a little arcane, I guess, but it's documented in the comments within. In this case I asked it to start up two background workers and have them both try to take AccessExclusiveLock on table foo. As expected, the second one fails. The idea is that workers are identified by a pair of numbers X.Y; two workers with the same X-value are in the same locking group. So if I call the second worker 1.1 rather than 2.0, it'll join the same locking group as worker 1.0 and ... then it does the wrong thing, and then it crashes the server, because my completely-untested code is unsurprisingly riddled with bugs. Eventually, this needs to be generalized a bit so that we can use it to test deadlock detection. 
That's tricky, because what you really want to do is tell worker A to wait for some lock and then, once you're sure it's on the wait queue, tell worker B to go take some other lock and check that you see the resulting deadlock. There doesn't seem to be a good API for the user backend to find out whether some background worker is waiting for some particular lock, so I may have to resort to the hacky expedient of having the driver process wait for a few seconds and assume that's long enough that the background worker will be on the wait queue by then. Or maybe I can drum up some solution, but anyway it's not done yet. The value of this test code is that we can easily reproduce locking scenarios which would be hard to reproduce in a real workload - e.g. because they're timing-dependent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 723051e..85997f6 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -210,6 +210,10 @@ static bool FastPathUnGrantRelationLock(Oid relid, LOCKMODE lockmode); static bool FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag, uint32 hashcode); static PROCLOCK *FastPathGetRelationLockEntry(LOCALLOCK *locallock); +static bool GroupLockShouldJumpQueue(LockMethod lockMethodTable, + LOCKMODE lockmode, + LOCK *lock, + PROCLOCK *proclock); /* * To make the fast-path lock mechanism work, we must have some way of @@ -339,7 +343,8 @@ PROCLOCK_PRINT(const char *where, const PROCLOCK *proclockP) static uint32 proclock_hash(const void *key, Size keysize); static void RemoveLocalLock(LOCALLOCK *locallock); static PROCLOCK *SetupLockInTable(LockMethod lockMethodTable, PGPROC *proc, - const LOCKTAG *locktag, uint32 hashcode, LOCKMODE lockmode); + PGPROC *leader, const LOCKTAG *locktag, uint32 hashcode, + LOCKMODE lockmode); static void GrantLockLocal(LOCALLOCK 
*locallock, ResourceOwner owner); static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode); static void FinishStrongLockAcquire(void); @@ -894,8 +899,8 @@ LockAcquireExtended(const LOCKTAG *locktag, * away anytime. So we have to use SetupLockInTable() to recompute the * lock and proclock pointers, even if they're already set. */ - proclock = SetupLockInTable(lockMethodTable, MyProc, locktag, -hashcode, lockmode); + proclock = SetupLockInTable(lockMethodTable, MyProc, LockGroupLeader, +locktag, hashcode, lockmode); if (!proclock) { AbortStrongLockAcquire(); @@ -914,18 +919,27 @@ LockAcquireExtended(const LOCKTAG *locktag, /* * If lock requested conflicts with locks requested by waiters, must join - * wait queue. Otherwise, check for conflict with already-held locks. - * (That's last because most complex check.) + * wait queue (except for certain cases involving
Re: [HACKERS] Additional role attributes superuser review
On Mon, Nov 3, 2014 at 11:44 AM, Adam Brightwell adam.brightw...@crunchydatasolutions.com wrote: That said, I don't feel very strongly about that position, so if you and Robert (and others on the thread) agree that's the right approach then I'll see about getting it done. We haven't reached consensus on this one yet and I didn't want it to fall too far off the radar. Here is what I summarize as the current state of the discussion: 1. Syntax: ALTER ROLE role { ADD | DROP } CAPABILITY capability Seems reasonable. * Perhaps keeping the current syntax around as deprecated, to be removed in a scheduled future version. (provide a deprecation message to the user?) Yeah, I don't think we should remove the existing syntax because a lot of people are used to it. We still have some very old COPY syntax around for backward-compatibility, and it's not hurting us, either. At the same time, I think recasting the existing capability-like things as capabilities per se is a good idea, because otherwise we've got this old stuff that is randomly different for no especially good reason. GRANT EXECUTE PRIVILEGES ON capability TO role Yuck. The only other approach I can think of is to have some magic, hard-coded roles that implicitly have each capability, and then those can be granted. e.g. have a role pg_unrestricted_copy or whatever with a given OID, and then test has_privs_of_role() with that OID. Condense all attributes in pg_authid to a single int64 column and create bitmasks accordingly. Obviously there is some concern for upgrading and whether to do both at once or to do them incrementally. IMHO, I think if the changes are going to be made, then we should go ahead and do them at the same time. Though, would it be beneficial to separate into two distinct patches?
If we're going to change the catalog representation for the existing capabilities, I'd recommend that the first patch change the catalog representation and add the new syntax without adding any more capabilities; and then the second and subsequent patches can add additional capabilities. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
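The "condense all attributes in pg_authid to a single int64 column" option discussed above might look like the following bitmask sketch. The flag names and helper functions are hypothetical illustrations, not anything that exists in pg_authid:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits packed into one int64-style mask. */
#define CAP_SUPERUSER   (INT64_C(1) << 0)
#define CAP_CREATEROLE  (INT64_C(1) << 1)
#define CAP_CREATEDB    (INT64_C(1) << 2)
#define CAP_REPLICATION (INT64_C(1) << 3)

/* ALTER ROLE role ADD CAPABILITY cap */
static int64_t
cap_add(int64_t caps, int64_t cap)
{
    return caps | cap;
}

/* ALTER ROLE role DROP CAPABILITY cap */
static int64_t
cap_drop(int64_t caps, int64_t cap)
{
    return caps & ~cap;
}

/* Does the role hold all bits in cap? */
static bool
cap_has(int64_t caps, int64_t cap)
{
    return (caps & cap) == cap;
}
```

One column then replaces a row of boolean catalog attributes, which is what makes the upgrade-path question in the thread (change representation first, add capabilities later) worth separating into distinct patches.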
Re: [HACKERS] Convert query plan to sql query
wrote: I don't think SQL can express the information the plan contains. For example, join methods (hash, nest loop, merge). I don't need the way the query will be executed, so there is no need for (hash, nest loop, merge). -- View this message in context: http://postgresql.1045698.n5.nabble.com/Convert-query-plan-to-sql-query-tp5825727p5825877.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Re: [HACKERS] Convert query plan to sql query
Maybe you want to check how it's done in Postgres-XC. Postgres-XC works on plans being created by PostgreSQL and reverse-engineers queries (for the parts of the plans which are shippable). The notion of shippability may not be of interest to you, but the code to reverse-engineer most of the plan nodes is there in Postgres-XC. I'm glad you used reverse engineering; that's exactly what I need. I didn't get a chance to look at the internals of Postgres-XC, thank you for the info. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Convert-query-plan-to-sql-query-tp5825727p5825878.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
Re: [HACKERS] REINDEX CONCURRENTLY 2.0
On Thu, Oct 30, 2014 at 5:19 PM, Michael Paquier michael.paqu...@gmail.com wrote: Updated patch is attached. Please find attached an updated patch with the following things changed: - Addition of tab completion in psql for all new commands - Addition of a call to WaitForLockers in index_concurrent_swap to ensure that there are no running transactions on the parent table running before exclusive locks are taken on the index and its concurrent entry. Previous patch versions created deadlocks because of that, issue spotted by the isolation tests integrated in the patch. - Isolation tests for reindex concurrently are re-enabled by default. Regards, -- Michael diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml index cd55be8..653b120 100644 --- a/doc/src/sgml/mvcc.sgml +++ b/doc/src/sgml/mvcc.sgml @@ -864,7 +864,8 @@ ERROR: could not serialize access due to read/write dependencies among transact para Acquired by commandVACUUM/command (without optionFULL/option), - commandANALYZE/, commandCREATE INDEX CONCURRENTLY/, and + commandANALYZE/, commandCREATE INDEX CONCURRENTLY/, + commandREINDEX CONCURRENTLY/, commandALTER TABLE VALIDATE/command and other commandALTER TABLE/command variants (for full details see xref linkend=SQL-ALTERTABLE). 
@@ -1143,7 +1144,7 @@ ERROR: could not serialize access due to read/write dependencies among transact sect2 id=locking-pages titlePage-level Locks/title - + para In addition to table and row locks, page-level share/exclusive locks are used to control read/write access to table pages in the shared buffer diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index cabae19..285f3ff 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -21,7 +21,7 @@ PostgreSQL documentation refsynopsisdiv synopsis -REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERname/replaceable [ FORCE ] +REINDEX { INDEX | TABLE | DATABASE | SYSTEM } [ CONCURRENTLY ] replaceable class=PARAMETERname/replaceable [ FORCE ] /synopsis /refsynopsisdiv @@ -68,9 +68,12 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERnam An index build with the literalCONCURRENTLY/ option failed, leaving an quoteinvalid/ index. Such indexes are useless but it can be convenient to use commandREINDEX/ to rebuild them. Note that - commandREINDEX/ will not perform a concurrent build. To build the - index without interfering with production you should drop the index and - reissue the commandCREATE INDEX CONCURRENTLY/ command. + commandREINDEX/ will perform a concurrent build if literal + CONCURRENTLY/ is specified. To build the index without interfering + with production you should drop the index and reissue either the + commandCREATE INDEX CONCURRENTLY/ or commandREINDEX CONCURRENTLY/ + command. Indexes of toast relations can be rebuilt with commandREINDEX + CONCURRENTLY/. 
/para /listitem @@ -139,6 +142,21 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERnam /varlistentry varlistentry +termliteralCONCURRENTLY/literal/term +listitem + para + When this option is used, productnamePostgreSQL/ will rebuild the + index without taking any locks that prevent concurrent inserts, + updates, or deletes on the table; whereas a standard reindex build + locks out writes (but not reads) on the table until it's done. + There are several caveats to be aware of when using this option + mdash; see xref linkend=SQL-REINDEX-CONCURRENTLY + endterm=SQL-REINDEX-CONCURRENTLY-title. + /para +/listitem + /varlistentry + + varlistentry termliteralFORCE/literal/term listitem para @@ -218,6 +236,194 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } replaceable class=PARAMETERnam reindex anything. /para + refsect2 id=SQL-REINDEX-CONCURRENTLY + title id=SQL-REINDEX-CONCURRENTLY-titleRebuilding Indexes Concurrently/title + + indexterm zone=SQL-REINDEX-CONCURRENTLY + primaryindex/primary + secondaryrebuilding concurrently/secondary + /indexterm + + para +Rebuilding an index can interfere with regular operation of a database. +Normally productnamePostgreSQL/ locks the table whose index is rebuilt +against writes and performs the entire index build with a single scan of the +table. Other transactions can still read the table, but if they try to +insert, update, or delete rows in the table they will block until the +index rebuild is finished. This could have a severe effect if the system is +a live production database. Very large tables can take many hours to be +indexed, and even for smaller tables, an index rebuild can lock out writers +for periods that are unacceptably long for a production system. + /para
Re: [HACKERS] Amazon Redshift
On Wed, Nov 5, 2014 at 5:36 PM, philip taylor philta...@mail.com wrote: do you think we should implement some of the functions offered by Amazon Redshift? http://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html JSON Functions http://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html http://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html We actually have the two above: http://www.postgresql.org/docs/current/static/functions-json.html Actually, if you look at the examples you will see they only copied ours and changed the values a little. -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157
Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED
On Sat, Sep 13, 2014 at 11:02 PM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: Patch rebased and added to commitfest [1]. It looks like a good thing to remove ATChangeIndexesPersistence; this puts the persistence switch directly into the reindex process. A couple of minor comments about this patch: 1) Reading it, I am wondering if it would not finally be time to switch to a macro to get a relation's persistence, something like RelationGetPersistence in rel.h... Not related directly to this patch. 2) reindex_index has as a new argument a relpersistence value for the new index. reindex_relation, differently, has a new set of flags to enforce the relpersistence of all the underlying indexes. Wouldn't it be better for API consistency to pass a relpersistence value directly through reindex_relation? In any case, the comment block of reindex_relation does not contain a description of the new flags. 3) Here you may as well just set the value and be done: +/* + * Check if need to set the new relpersistence + */ +if (iRel->rd_rel->relpersistence != relpersistence) +iRel->rd_rel->relpersistence = relpersistence; Regards, -- Michael
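The accessor macro suggested in point 1 might look like the following. The structs here are mocked down to the single field needed for illustration; the real RelationData and FormData_pg_class in PostgreSQL carry many more fields, and no such macro is confirmed to exist in rel.h:

```c
#include <assert.h>

/* Mocked-down stand-ins for the catalog structs. */
typedef struct FormData_pg_class
{
    char relpersistence;    /* 'p' permanent, 'u' unlogged, 't' temp */
} FormData_pg_class;

typedef struct RelationData
{
    FormData_pg_class *rd_rel;  /* cached pg_class tuple */
} RelationData;

typedef RelationData *Relation;

/* The hypothetical accessor the review suggests adding to rel.h. */
#define RelationGetPersistence(relation) \
    ((relation)->rd_rel->relpersistence)
```

With such a macro, call sites compare or assign `RelationGetPersistence(iRel)` instead of spelling out the `rd_rel` dereference each time, which is the consistency win the review is after.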
Re: [HACKERS] WAL format and API changes (9.5)
On Wed, Nov 5, 2014 at 2:56 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 11/05/2014 09:06 AM, Amit Kapila wrote: 2. XLogRecPtr XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn) So the scenario is that: * XLogRecordAssemble decides that a page doesn't need to be backed up * both RedoRecPtr and doPageWrites change while building the record. doPageWrites goes from true to false. Without the patch, we would retry, because the first check sees that RedoRecPtr has changed. With the patch, we notice that even though RedoRecPtr has changed, doPageWrites is now off, so no FPWs are required regardless of RedoRecPtr, and we do not retry. Yeah, you're right, the behavior changes in that case. However, the new behavior is correct; the retry is unnecessary in that scenario. How does it interact with backup? Basically, in stop backup we first change forcePageWrite to false and then get the stop WAL location by inserting XLOG_BACKUP_END, so it seems to me that it is quite possible that the record which a backend is inserting using XLogInsert() will be considered part of the backup. Now shouldn't this record contain an FPW if forcePageWrite was true when XLogInsert() started, to avoid any torn page in the backup? 3. There are a couple of places where *XLogInsert* is used in wal.sgml and it seems to me some of those need changes w.r.t this patch. Hmm. It's a bit strange to mention the XLogInsert and XLogFlush functions by name in that text. It's otherwise admin-level stuff, on how to set WAL-related settings. Up to 9.2 that manual page actually called them LogInsert and LogFlush. But I guess you're right; if we are to mention the function by name, it should say XLogInsertRecord. Agreed. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Convert query plan to sql query
mariem mariem.benfad...@gmail.com writes: I don't think SQL can express the information the plan contains. For example, join methods (hash, nest loop, merge). I don't need the way the query will be executed, so there is no need for (hash, nest loop, merge). If you don't need that, why are you insistent on extracting the information from a plan tree? It seems far simpler to me to make use of ruleutils.c to reverse-list the original parsetree. That functionality already exists and is well tested and well maintained. If you insist on working from a plan tree, you will be writing a fair amount of code that you will have to maintain yourself. And I absolutely, positively guarantee that we will break it in every major release, and occasionally in minor releases. You should read the git history of explain.c and ruleutils.c and ask yourself whether you want to keep up with that level of churn. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
Jim Nasby jim.na...@bluetreble.com writes: Hmm... we do have transaction_timestamp(); perhaps we could leave that as the time BEGIN executed and shift everything else to use the snapshot time. It's not possible to take a timestamp that *exactly* matches the snapshot time. We could rearrange the code so that we ask the kernel for timeofday just before or after capturing some relevant snapshot, but there's still going to be some skew there. In any case, I believe we document those timestamps as being the time of arrival of a command from the client, not in terms of any snapshots. To the extent that people use now() to represent the time of insertion of data, I think the command arrival time is arguably the best definition. Delaying it until the server code gets around to capturing a snapshot would not be an improvement. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 2014-11-05 19:31:52 -0500, Steve Singer wrote: On 11/05/2014 05:43 PM, Andres Freund wrote: On 2014-11-05 17:17:05 -0500, Steve Singer wrote: Imo that's essentially a different feature. What you essentially would need here is a 'commit sequence number' - but no timestamps. And probably, to be useful, that number has to be 8 bytes in itself. I think this gets to the heart of some of the differing views people have expressed on this patch. I think it's actually beside the heart... Is this patch supposed to: A) Add commit timestamp tracking but nothing more; B) Add infrastructure to store commit timestamps and provide a facility for storing additional bits of data extensions might want to associate with the commit; or C) Add commit timestamps and node identifiers to commits? If the answer is (A) then I can see why some people are objecting to adding extradata. If the answer is (B) then it's fair to ask how well this patch handles the various types of things people might want to attach to the commit record (such as the LSN). I think there's a mistake exactly here. The LSN of the commit isn't just some extra information about the commit. You can't say 'here, also attach this piece of information'. Instead you need special-case code in xact.c to add it, thus prohibiting that space from being used for something else. I think the problem is that once you start providing a facility extensions can use to store data along with the commit record, then being restricted to 4 or 8 bytes is very limiting. Well, you can easily use those 4/8 bytes to start adding more data to the transaction, by referring to some table with transaction metadata for example. It also doesn't allow you to load two such extensions at once on a system. You wouldn't be able to have both the 'steve_commit_order' extension and BDR installed at the same time. I don't think this patch does a very good job at (B), but it wasn't intended to. Well, I don't agree that steve_commit_order makes much sense in this context.
But there actually is a real problem here, namely that there's no namespacing in those bytes. I'd be ok with saying that we split the extradata into four bytes for the namespace and four for the actual data. That's roughly how it works for advisory locks already. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers