Re: [HACKERS] CLUSTER FREEZE

2013-10-25 Thread Andres Freund
On 2013-10-24 17:17:22 -0700, Josh Berkus wrote:
 On 10/24/2013 04:55 PM, Robert Haas wrote:
  On Thu, Oct 24, 2013 at 1:09 PM, Josh Berkus j...@agliodbs.com wrote:
  On 10/23/2013 09:58 PM, Amit Kapila wrote:
  I wonder why anyone would like to freeze during CLUSTER command when
  they already have separate way (VACUUM FREEZE) to achieve it, do you
  know or can think of any case where user wants to do it along with
  Cluster command?
 
  If I'm rewriting the table anyway, let's freeze it.
 
  Otherwise, you have to write the same pages twice, if both CLUSTER and
  FREEZE are required.
  
  I wonder if we should go so far as to make this the default behavior,
  instead of just making it an option.
 
 +1 from me.  Can you think of a reason you *wouldn't* want to freeze?

It makes content from the future appear when you start using the
relation in a query/session with an older snapshot. Currently CLUSTER is
safe against that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for fail-back without fresh backup

2013-10-25 Thread Michael Paquier
On Fri, Oct 25, 2013 at 5:57 AM, Magnus Hagander mag...@hagander.net wrote:
 In fact I've been considering suggesting we might want to retire the
 difference between archive and hot_standby as wal_level, because the
 difference is usually so small. And the advantage of hot_standby is in
 almost every case worth it. Even in the archive recovery mode, being
 able to do pause_at_recovery_target is extremely useful. And as you
 say in (c) above, many users don't realize that until it's too late.
+1 on removing archive from wal_level. Having both archive and
hot_standby for wal_level is confusing, and if I recall correctly
hot_standby and archive have been kept as possible settings only to
protect people from bugs that the newly-introduced hot_standby could
introduce due to the few WAL records it adds. But it has been a couple
of releases since there have been no such bugs, no?
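
For reference, a hedged sketch of the settings being discussed (configuration
lines shown as comments; parameter names as of 9.3, which is an assumption
about the target release):

-- postgresql.conf:  wal_level = hot_standby        # rather than 'archive'
-- recovery.conf:    pause_at_recovery_target = on  # the pause knob mentioned above
SHOW wal_level;   -- verify the active setting
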
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Using indexes for ORDER BY and PARTITION BY clause in windowing functions

2013-10-25 Thread Kyotaro HORIGUCHI
Hello, 

 Agree that the windowing function will return all the rows compared to max and
 group by returning only max rows per group. But even while arriving at the
 aggregate/sorting the windowing function seems to spend more effort than group
 by/order by.

(I'll apologise in advance for possible misreading..)

Most of the difference in time comes from sorting. Over
90% of the total execution time is spent sorting
(49 ms to 2733 ms) in the query using the windowing function. If this sort
were eliminated, the execution time would be less than 300 ms -
comparable enough to the group-by query.

| Subquery Scan on __unnamed_subquery_0 
|  (actual time=2606.075..2953.937 rows=558 loops=1)
|   Filter: (__unnamed_subquery_0.rn = 1)
|   ->  WindowAgg  (actual time=2606.063..2928.061 rows=122880 loops=1)
| ->  Sort (actual time=2606.020..2733.677 rows=122880 loops=1)
|   Sort Key: student_score.course, student_score.score
|   ->  Seq Scan on student_score
|   (actual time=0.009..49.026 rows=122880 loops=1)

As you can see in the plan above, the sort key is (course, score). If your
point is overall performance rather than reusing a kind of
'hash', there is a good chance of eliminating this sort if you are
able to add an additional index, say,

=# create index idx_co_sc on student_score using btree (course, score);

With this index, you will get a different plan like this,

 uniontest=# explain analyze select student_name from (select student_name, 
 dense_rank() over(partition by course order by score) rn, score from 
 student_score) rnn where rn=2;
   QUERY PLAN
 ---
  Subquery Scan on rnn  (actual time=0.088..319.403 rows=135 loops=1)
Filter: (rnn.rn = 2)
Rows Removed by Filter: 122746
->  WindowAgg  (actual time=0.037..296.851 rows=122881 loops=1)
  ->  Index Scan using idx_co_sc on student_score
(actual time=0.027..111.333 rows=122881 loops=1)
  Total runtime: 319.483 ms

Does this satisfy your needs?
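
For comparison, a hedged guess at the GROUP BY formulation being measured
against (column names are taken from the sort keys above; whether min() or
max() was intended is an assumption):

SELECT course, min(score) AS score
FROM student_score
GROUP BY course
ORDER BY course;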

===
 Another thing, (I may be stupid and naive here) does PostgreSQL
 re-use the hash which has already been created for the sort? In
 this case the inner query must have created a hash for the windowing
 aggregate. Can't we use that same one while applying the
 filter rn=1 ?

Generally speaking, hashes cannot yield ordered output by their
nature, I believe.

The windowing function (exec node) always receives tuples sequentially
in the window-defined order (as you see in the explained plan
above), then processes the tuples in a semi tuple-by-tuple manner to
perform per-frame aggregation, and finally outputs the same number of
tuples as it received. Furthermore, dense_rank() doesn't even
need per-frame aggregation. So the planner as it stands has no chance
to use any kind of hash table to calculate/execute
windowing functions. On another point, automatically
preserving some internal data within a query beyond the end of
the query raises the 'when do we discard it?' problem.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Example query causing param_info to be set in plain rel path

2013-10-25 Thread Ashutosh Bapat
Hi All,
In order to test various cases of LATERAL join in Postgres-XC, I am trying
to find a query where RelOptInfo->lateral_relids would get set for plain
base relations. I tried various examples like the ones below, but none of
them set lateral_relids in the RelOptInfo of the involved relations. I am
using the current master branch.

select * from tab1 left join lateral (select * from tab2 where tab2.val =
tab1.val2) q using(val);
This just pulled up the qual to a higher level.

select * from tab1 left join lateral (select sum(val2) sum_val2, val from
tab2 where tab2.val2 = tab1.val2 group by tab2.val) q using(val);

This at least caused the outer reference to be added as a parameter, but
again no lateral_relids.

Is there an example query (however impractical it may be) which can
get lateral_relids set in the RelOptInfo of a plain base relation?
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


[HACKERS] Document update in alter_foreign_data_wrapper.sgml

2013-10-25 Thread Etsuro Fujita
Hi,

ISTM the document in alter_foreign_data_wrapper.sgml and the comment in
foreigncmds.c should be updated.  Please find attached a patch.

Thanks,

Best regards,
Etsuro Fujita


fdw-doc-update.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for fail-back without fresh backup

2013-10-25 Thread Andres Freund
On 2013-10-24 13:51:52 -0700, Josh Berkus wrote:
 On 10/24/2013 01:14 PM, Heikki Linnakangas wrote:
  One extra WAL record whenever a hint bit is set on a page, for the first
  time after a checkpoint. In other words, a WAL record needs to be
  written in the same circumstances as with page checksums, but the WAL
  records are much smaller as they don't need to contain a full page
  image, just the block number of the changed block.
  
  Or maybe we'll write the full page image after all, like with page
  checksums, just without calculating the checksums. It might be tricky to
  skip the full-page image, because then a subsequent change of the page
  (which isn't just a hint-bit update) needs to somehow know it needs to
  take a full page image even though a WAL record for it was already written.
 
 I think it would be worth estimating what this actually looks like in
 terms of log write quantity.  My inclination is to say that if it
 increases log writes by less than 10%, we don't need to provide an option
 to turn it off.

It entirely depends on your workload. If it happens to be something
like:
INSERT INTO table (lots_of_data);
CHECKPOINT;
SELECT * FROM TABLE;

i.e. there's a checkpoint between loading the data and reading it - not
exactly all that uncommon - we'll need to log something for every
page. That can be rather noticeable. Especially as I think it will be
rather hard to log anything but a real FPI.
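
A hedged, runnable version of that pattern (hypothetical table t; sizes are
arbitrary):

CREATE TABLE t AS
SELECT g AS id, repeat('x', 100) AS payload
FROM generate_series(1, 1000000) g;
CHECKPOINT;
-- the first read after the checkpoint sets hint bits on every page, which
-- under the proposed scheme would each need to be WAL-logged
SELECT count(*) FROM t;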

I really don't think everyone will want this. I am absolutely not
against providing an option to log enough information to make pg_rewind
work, but I think providing a command to do *safe* *planned* failover
will help in many more cases.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for fail-back without fresh backup

2013-10-25 Thread Andres Freund
On 2013-10-24 22:57:29 +0200, Magnus Hagander wrote:
 In fact I've been considering suggesting we might want to retire the
 difference between archive and hot_standby as wal_level, because the
 difference is usually so small. And the advantage of hot_standby is in
 almost every case worth it. Even in the archive recovery mode, being
 able to do pause_at_recovery_target is extremely useful. And as you
 say in (c) above, many users don't realize that until it's too late.

+1.

On 2013-10-25 15:16:30 +0900, Michael Paquier wrote:
 But it has been a couple of releases since there have been no such
 bugs, no?

One 'no' too much? Anyway, I think there have been more recent ones, but
it's infrequent enough that we can remove the level anyway.

FWIW, I've wondered if we shouldn't remove most of the EnableHotStandby
checks in xlog.c. There are way too many difference how StartupXLOG
behaves depending on HS.
E.g. I quite dislike that we do stuff like StartupCLOG at entirely
different times during recovery.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-25 Thread Andres Freund
On 2013-10-24 10:59:21 -0400, Robert Haas wrote:
 On Tue, Oct 22, 2013 at 2:13 PM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
  On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
   That strikes me as a flaw in the implementation rather than the idea.
   You're presupposing a patch where the necessary information is
   available in WAL yet you don't make use of it at the proper time.
  
   The problem is that the mapping would be somewhere *ahead* from the
   transaction/WAL we're currently decoding. We'd need to read ahead till
   we find the correct one.
 
  Yes, I think that's what you need to do.
 
  My problem with that is that rewrite can be gigabytes into the future.
 
  When reading forward we could either just continue reading data into the
  reorderbuffer, but delay replaying all future commits till we find the
  currently needed remap. That might have quite an additional
  storage/memory cost, but runtime complexity should be the same as normal
  decoding.
  Or we could individually read ahead for every transaction. But doing so
  for every transaction will get rather expensive (roughly O(amount_of_wal^2)).
 
 [ Sorry it's taken me a bit of time to get back to this; other tasks
 intervened, and I also just needed some time to let it settle in my
 brain. ]

No worries. I've had enough things to work on ;)

 If you read ahead looking for a set of ctid translations from
 relfilenode A to relfilenode B, and along the way you happen to
 encounter a set of translations from relfilenode C to relfilenode D,
 you could stash that set of translations away somewhere, so that if
 the next transaction you process needs that set of mappings, it's
 already computed.  With that approach, you'd never have to pre-read
 the same set of WAL files more than once.

 But, as I think about it more, that's not very different from your
 idea of stashing the translations someplace other than WAL in the
 first place.  I mean, if the read-ahead thread generates a series of
 files in pg_somethingorother that contain those maps, you could have
 just written the maps to that directory in the first place.  So on
 further review I think we could adopt that approach.

Yea, that basically was my reasoning, only expressed much more nicely ;)

 However, I'm leery about the idea of using a relation fork for this.
 I'm not sure whether that's what you had it mind, but it gives me the
 willies.  First, it adds distributed overhead to the system, as
 previously discussed; and second, I think the accounting may be kind
 of tricky, especially in the face of multiple rewrites.  I'd be more
 inclined to find a separate place to store the mappings.  Note that,
 AFAICS, there's no real need for the mapping file to be
 block-structured, and I believe they'll be written first (with no
 readers) and subsequently only read (with no further writes) and
 eventually deleted.

I was thinking of storing it along with the other data used during logical
decoding and letting decoding's cleanup clean up that data as well. All the
information for that should be there.

There's one snag I can currently see, namely that we actually need to
prevent a formerly dropped relfilenode from being reused. I'm not
entirely sure what the best way to do that is.

 One possible objection to this is that it would preclude decoding on a
 standby, which seems like a likely enough thing to want to do.  So
 maybe it's best to WAL-log the changes to the mapping file so that the
 standby can reconstruct it if needed.

The mapping file probably can be one big wal record, so it should be
easy enough to do.

For a moment I thought there's a problem with decoding on the standby
having to read ahead of the current location to find the newer mapping,
but that's actually not required since we're protected by the AEL lock
during rewrites on the standby as well.

  I think that'd be pretty similar to just disallowing VACUUM
  FREEZE/CLUSTER on catalog relations since effectively it'd be too
  expensive to use.
 
 This seems unduly pessimistic to me; unless the catalogs are really
 darn big, this is a mostly theoretical problem.

Well, it's not the size of the relation, but the amount of concurrent
WAL that's being generated that matters. But anyway, if we do it like
you described above that shouldn't be a problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.2

2013-10-25 Thread Andres Freund
On 2013-10-21 16:15:58 +0200, Andres Freund wrote:
  I don't think I understand exactly what you have in mind for (2); can
  you elaborate?  I have always thought that having a
  WaitForDecodingToCatchUp() primitive was a good way of handling
  changes that were otherwise too difficult to track our way through.  I
  am not sure you're doing that at all right now, which in some sense I
  guess is fine, but I haven't really understood your aversion to this
  solution.  There are some locking issues to be worked out here, but
  the problems don't seem altogether intractable.
 
 So, what we need to do for rewriting catalog tables would be:
 1) lock table against writes
 2) wait for all in-progress xacts to finish, they could have modified
the table in question (we don't keep locks on system tables)
 3) acquire xlog insert pointer
 4) wait for all logical decoding actions to read past that pointer
 5) upgrade the lock to an access exclusive one
 6) perform vacuum full as usual
 
  The lock upgrade hazards in here are the reason I am averse to the
  solution. And I don't see how we can avoid them, since in order for
  decoding to catch up it has to be able to read from the
  catalog... Otherwise it's easy enough to implement.

So, I thought about this some more and I think I have a partial
solution to the problem.

The worst thing about deadlocks that occur in the above is that they
could be the VACUUM FULL waiting for the restart LSN[1] of a decoding
slot to progress, but the restart LSN cannot progress because the slot
is waiting for an xid/transaction to end which is being blocked by the
lock upgrade from VACUUM FULL. Such conflicts are not visible to the
deadlock detector, which obviously is bad.
I've prototyped this (~25 lines) and this happens pretty frequently. But
it turns out that we can actually fix this by exporting (to shared
memory) the oldest in-progress xid of a decoding slot. Then the waiting
code can do a XactLockTableWait() for that xid...

I wonder if this isn't maybe sufficient. Yes, it can deadlock, but
that's already the case for VACUUM FULLs of system tables, although less
likely. And it will be detected/handled.
There's one more snag though: we currently allow CLUSTER system_table;
in an existing transaction. I think that'd have to be disallowed.
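
For concreteness, a hedged illustration of the case that would be disallowed
(CLUSTER of a single table is allowed inside a transaction block today;
pg_class and pg_class_oid_index are only examples):

BEGIN;
CLUSTER pg_class USING pg_class_oid_index;
-- ... more work in the same transaction ...
COMMIT;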

What do you think?

Greetings,

Andres Freund

[1] The restart LSN is the point from where we need to be able to read
WAL to replay all changes the receiving side hasn't acked yet.

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Location for external scripts for Extensions?

2013-10-25 Thread Peter Eisentraut
On 10/24/13, 5:37 PM, Josh Berkus wrote:
 On 10/24/2013 02:36 PM, Peter Eisentraut wrote:
 On 10/22/13, 2:27 PM, Josh Berkus wrote:
 pg_partman has several external (python) scripts which help the
 extension, located in /extras/ in its source.  The problem currently is
 that if you install pg_partman via pgxn or package, you don't get those
 scripts, because there's no install location for them.

 Use the SCRIPTS variable in pgxs, and they will get installed.

 
 Oh yeah?  Cool.  Is there an existing extension with an example of this?

Don't know.  I just saw this in the pgxs source code.  Presumably, it
was added because it was used somewhere.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Detection of nested function calls

2013-10-25 Thread Hugo Mercier
Hi all,

The Oslandia team has been involved in the PostGIS project for years, with a
current focus on PostGIS 3D support.
In PostGIS queries, nested function calls that manipulate geometries
are quite common, e.g.: SELECT ST_Union(ST_Intersection(a.geom,
ST_Buffer(b.geom, 50)))

PostGIS functions that manipulate geometries have to unserialize their
input geometries from the 'flat' varlena representation into their own,
and serialize the processed geometries back when returning.
But in such nested-call queries, this serialization-unserialization
process is just overhead.

Avoiding it could then lead to a real gain in terms of performance [1],
especially when the internal type takes time to serialize (and with
new PostGIS types like rasters or 3D geometries it's really meaningful).

So we thought that having a way for user functions to know if they are part
of a nested call could allow them to avoid this serialization phase.

The idea would be to have a boolean flag reachable from a user function
(within FunctionCallInfoData) that says if the current function is
nested or not.
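
A hedged reading of the flag's intended semantics, reusing the example above
(table and column names are hypothetical):

SELECT ST_Union(ST_Buffer(geom, 50)) FROM areas;  -- ST_Buffer is nested: could skip serializing
SELECT ST_Buffer(geom, 50) FROM areas;            -- not nested: result must be serialized as usual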

We have already investigated such a modification and here is where we are up
to now:
  - we modified the parser, adding a new boolean member 'nested' to the
FuncExpr struct. Within the parser, we know if a function call is nested
inside another one and can then mark the FuncExpr
  - the executor has been modified so it takes this
nested member into account and passes it to the FunctionCallInfoData structure
before evaluating the function

We are working on a PostGIS branch that takes advantage of this
functionality [2].

You can find in attachment a first draft of the patch.

Obviously, even though this is about a PostGIS use case, this subject
could be helpful for any other query using both nested functions and
serialization.

I am quite new to PostgreSQL hacking, so I'm sure there is room for
improvement. But what about this first proposal?

I'll be at the PGDay conf in Dublin next week, so we could discuss this
topic.

[1] Talking about performance, we already investigated such a
pass-by-reference mechanism with PostGIS. Taking a dummy function
st_copy that only copies its input geometry to its output, with 4
levels of nesting, gives encouraging results (passing geometries by
reference is more than 2x faster than (un)serializing):
https://github.com/Oslandia/sfcgal-tests/blob/master/bench/report_serialization_referenced_vs_native.pdf

[2] https://github.com/Oslandia/postgis/tree/nested_ref_passing
-- 
Hugo Mercier
Oslandia
diff --git a/src/backend/executor/execQual.c b/src/backend/executor/execQual.c
index 90c2753..56c701f 100644
--- a/src/backend/executor/execQual.c
+++ b/src/backend/executor/execQual.c
@@ -182,6 +182,8 @@ static Datum ExecEvalArrayCoerceExpr(ArrayCoerceExprState *astate,
 static Datum ExecEvalCurrentOfExpr(ExprState *exprstate, ExprContext *econtext,
 	  bool *isNull, ExprDoneCond *isDone);
 
+static ExprState * ExecInitExprRec(Expr *node, PlanState *parent, Expr *parentNode);
+
 
 /* 
  *		ExecEvalExpr routines
@@ -4303,7 +4305,6 @@ ExecEvalExprSwitchContext(ExprState *expression,
 	return retDatum;
 }
 
-
 /*
  * ExecInitExpr: prepare an expression tree for execution
  *
@@ -4336,9 +4337,21 @@ ExecEvalExprSwitchContext(ExprState *expression,
  * associated with a plan tree.  (If so, it can't have aggs or subplans.)
  * This case should usually come through ExecPrepareExpr, not directly here.
  */
+
 ExprState *
 ExecInitExpr(Expr *node, PlanState *parent)
 {
+	return ExecInitExprRec(node, parent, NULL);
+}
+
+/*
+ * ExecInitExprRec: used by ExecInitExpr
+ *
+ * The additional parameter 'parentNode' is the expression node parent of 'node'.
+ */
+static ExprState *
+ExecInitExprRec(Expr *node, PlanState *parent, Expr *parentNode)
+{
 	ExprState  *state;
 
 	if (node == NULL)
@@ -4408,10 +4421,10 @@ ExecInitExpr(Expr *node, PlanState *parent)
	aggstate->aggs = lcons(astate, aggstate->aggs);
	naggs = ++aggstate->numaggs;
 
-	astate->args = (List *) ExecInitExpr((Expr *) aggref->args,
-		 parent);
-	astate->aggfilter = ExecInitExpr(aggref->aggfilter,
-	 parent);
+	astate->args = (List *) ExecInitExprRec((Expr *) aggref->args,
+			parent, node);
+	astate->aggfilter = ExecInitExprRec(aggref->aggfilter,
+		parent, node);
 
 	/*
 	 * Complain if the aggregate's arguments contain any
@@ -4448,10 +4461,10 @@ ExecInitExpr(Expr *node, PlanState *parent)
	if (wfunc->winagg)
		winstate->numaggs++;
 
-	wfstate->args = (List *) ExecInitExpr((Expr *) wfunc->args,
-		  parent);
-	wfstate->aggfilter = ExecInitExpr(wfunc->aggfilter,
-	  parent);
+	wfstate->args = (List *) ExecInitExprRec((Expr *) wfunc->args,
+			 parent, node);
+	wfstate->aggfilter = ExecInitExprRec(wfunc->aggfilter,
+		 parent, node);
 
 	/*
 	 * Complain if 

Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Pavel Stehule
Hello


2013/10/25 Hugo Mercier hugo.merc...@oslandia.com

 Hi all,

 The Oslandia team is involved in PostGIS project for years, with a
 current focus on PostGIS 3D support.
 With PostGIS queries, nested functions calls that manipulate geometries
 are quite common, e.g.: SELECT ST_Union(ST_Intersection(a.geom,
 ST_Buffer(b.geom, 50)))

 PostGIS functions that manipulate geometries have to unserialize their
 input geometries from the 'flat' varlena representation to their own,
 and serialize the processed geometries back when returning.
 But in such nested call queries, this serialization-unserialization
 process is just an overhead.

 Avoiding it could then lead to a real gain in terms of performances [1],
 especially here when the internal type takes time to serialize (and with
 new PostGIS types like rasters or 3D geometries it's really meaningful)

 So we thought having a way for user functions to know if they are part
 of a nested call could allow them to avoid this serialization phase.

 The idea would be to have a boolean flag reachable from a user function
 (within FunctionCallInfoData) that says if the current function is
 nested or not.

 We already investigated such a modification and here is where we are up
 to now :
   - we modified the parser with a new boolean member 'nested' to the
 FuncExpr struct. Within the parser, we know if a function call is nested
 into another one and then we can mark the FuncExpr
   - the executor has been modified so it can take into account this
 nested member and pass it to the FunctionCallInfoData structure before
 evaluating the function

 We are working on a PostGIS branch that takes benefit of this
 functionality [2]

 You can find in attachment a first draft of the patch.

 Obviously, even if this is about a PostGIS use case here, this subject
 could be helpful for every other queries using both nested functions and
 serialization.

 I am quite new to postgresql hacking, so I'm sure there is room for
 improvements. But, what about this first proposal ?


I am not sure, if this solution is enough - what will be done if I store
some values in PL/pgSQL variables?

Regards

Pavel



 I'll be at the PGDay conf in Dublin next week, so we could discuss this
 topic.

 [1] Talking about performances, we already investigated such
 pass-by-reference mechanism with PostGIS. Taking a dummy function
 st_copy that only copies its input geometry to its output with 4
 levels of nesting gives encouraging results (passing geometries by
 reference is more than 2x faster than (un)serializing) :

 https://github.com/Oslandia/sfcgal-tests/blob/master/bench/report_serialization_referenced_vs_native.pdf

 [2] https://github.com/Oslandia/postgis/tree/nested_ref_passing
 --
 Hugo Mercier
 Oslandia


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




[HACKERS] New committer

2013-10-25 Thread Tatsuo Ishii
Hi pgpool hackers,

Muhammad Usama did a great job restructuring the pgpool source code. And
now he is working on incorporating PostgreSQL's exception and memory
manager. I believe now is the time to add him as one of our committers,
and I have just done so.

Please welcome Muhammad Usama as our new committer!
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Hugo Mercier
On 25/10/2013 14:29, Pavel Stehule wrote:
 Hello
 
 
 2013/10/25 Hugo Mercier hugo.merc...@oslandia.com
 
 I am quite new to postgresql hacking, so I'm sure there is room for
 improvements. But, what about this first proposal ?
 
 
 I am not sure, if this solution is enough - what will be done if I store
 some values in PL/pgSQL variables?
 

You mean if you store the result of a (nested) function evaluation in a
PL/pgSQL variable ?
Then no nesting will be detected by the parser and in this case the user
function must ensure its result is serialized, since it could be stored
(in a variable or a table) at any time.
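
A hedged sketch of that case (the PostGIS function names are only for
illustration):

DO $$
DECLARE
  g geometry;
BEGIN
  -- the result is assigned to a variable, so the parser sees no nesting
  -- and the value must be serialized here
  g := ST_Buffer(ST_GeomFromText('POINT(0 0)'), 10);
  PERFORM ST_Area(g);
END $$;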

-- 
Hugo Mercier
Oslandia


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Pavel Stehule
2013/10/25 Hugo Mercier hugo.merc...@oslandia.com

 On 25/10/2013 14:29, Pavel Stehule wrote:
  Hello
 
 
  2013/10/25 Hugo Mercier hugo.merc...@oslandia.com
 
  I am quite new to postgresql hacking, so I'm sure there is room for
  improvements. But, what about this first proposal ?
 
 
  I am not sure, if this solution is enough - what will be done if I store
  some values in PL/pgSQL variables?
 

 You mean if you store the result of a (nested) function evaluation in a
 PL/pgSQL variable ?
 Then no nesting will be detected by the parser and in this case the user
 function must ensure its result is serialized, since it could be stored
 (in a variable or a table) at any time.


ok

I remember that I thought about a similar optimization when I worked on the SQL/XML
implementation - the same optimization could be used there.

Regards

Pavel




 --
 Hugo Mercier
 Oslandia


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] CLUSTER FREEZE

2013-10-25 Thread Robert Haas
On Fri, Oct 25, 2013 at 2:12 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-10-24 17:17:22 -0700, Josh Berkus wrote:
 On 10/24/2013 04:55 PM, Robert Haas wrote:
  On Thu, Oct 24, 2013 at 1:09 PM, Josh Berkus j...@agliodbs.com wrote:
  On 10/23/2013 09:58 PM, Amit Kapila wrote:
  I wonder why anyone would like to freeze during CLUSTER command when
  they already have separate way (VACUUM FREEZE) to achieve it, do you
  know or can think of any case where user wants to do it along with
  Cluster command?
 
  If I'm rewriting the table anyway, let's freeze it.
 
  Otherwise, you have to write the same pages twice, if both CLUSTER and
  FREEZE are required.
 
  I wonder if we should go so far as to make this the default behavior,
  instead of just making it an option.

 +1 from me.  Can you think of a reason you *wouldn't* want to freeze?

 It makes content from the future appear when you start using the
 relation in a query/session with an older snapshot. Currently CLUSTER is
 safe against that.

Eh, what?  We wouldn't freeze XIDs that don't precede RecentGlobalXmin.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLUSTER FREEZE

2013-10-25 Thread Andres Freund
On 2013-10-25 09:13:20 -0400, Robert Haas wrote:
 On Fri, Oct 25, 2013 at 2:12 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-10-24 17:17:22 -0700, Josh Berkus wrote:
  On 10/24/2013 04:55 PM, Robert Haas wrote:
   On Thu, Oct 24, 2013 at 1:09 PM, Josh Berkus j...@agliodbs.com wrote:
   On 10/23/2013 09:58 PM, Amit Kapila wrote:
   I wonder why anyone would like to freeze during CLUSTER command when
   they already have separate way (VACUUM FREEZE) to achieve it, do you
   know or can think of any case where user wants to do it along with
   Cluster command?
  
   If I'm rewriting the table anyway, let's freeze it.
  
   Otherwise, you have to write the same pages twice, if both CLUSTER and
   FREEZE are required.
  
   I wonder if we should go so far as to make this the default behavior,
   instead of just making it an option.
 
  +1 from me.  Can you think of a reason you *wouldn't* want to freeze?
 
  It makes content from the future appear when you start using the
  relation in a query/session with an older snapshot. Currently CLUSTER is
  safe against that.
 
 Eh, what?  We wouldn't freeze XIDs that don't precede RecentGlobalXmin.

Ah sorry, I thought that'd be the plan, similar to COPY FREEZE. Maybe
because I've wished for it in the past ;)

In that case I agree it should be the default. There really isn't any
reason not to immediately freeze tuples that can be frozen according
to the xmin horizon. We don't do it immediately during normal vacuums
because it could cause superfluous I/O - but that's not the case here.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLUSTER FREEZE

2013-10-25 Thread Robert Haas
On Thu, Oct 24, 2013 at 10:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I wonder if we should go so far as to make this the default behavior,
 instead of just making it an option.

 In that case you'd have to invent a NOFREEZE keyword, no?  Ick.

Only if we think anyone would ever NOT want to freeze.

 In any case, it's very far from obvious to me that CLUSTER ought
 to throw away information by default, which is what you're proposing.

I find it odd to refer to this as throwing away information.  I
know that you have a general concern about throwing away XIDs that are
still needed for forensic purposes, but that is clearly the ONLY
purpose that those XIDs serve, and the I/O advantages of freezing by
default could be massive for many of our users.  What's going to
happen in practice is that experienced users will simply recommend
CLUSTER FREEZE rather than plain CLUSTER, and you won't have the
forensic information *anyway*.
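
For concreteness, a hedged sketch of the two alternatives being weighed
(hypothetical table and index names):

-- today: two full rewrites of the same pages
CLUSTER t USING t_pkey;
VACUUM FREEZE t;
-- proposed: a single rewrite that also freezes, roughly
--   CLUSTER FREEZE t USING t_pkey;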

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNION ALL on partitioned tables won't use indices.

2013-10-25 Thread Robert Haas
On Thu, Oct 24, 2013 at 6:39 AM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Hello, Sorry that it's been a while..

 1. Observed symptom

 As you know, UNION ALL accompanied with ORDER BY uses indexes if
 available.

 uniontest=# EXPLAIN SELECT * FROM c11 UNION ALL SELECT * FROM c12 ORDER BY a;
QUERY PLAN
 ---
  Merge Append  (cost=0.59..10214.10 rows=20 width=16)
Sort Key: c11.a
->  Index Scan using ... on c11  (cost=0.29..3857.04 rows=10 width=16)
->  Index Scan using ... on c12  (cost=0.29..3857.04 rows=10 width=16)

 So do simple queries on partitioned (inheritance) tables.

 uniontest=# EXPLAIN SELECT * FROM p1 ORDER BY a;
 QUERY PLAN
 --
  Merge Append  (cost=0.73..11392.19 rows=21 width=16)
Sort Key: p1.a
->  Index Scan using ... on p1  (cost=0.12..8.14 rows=1 width=44)
->  Index Scan using ... on c11  (cost=0.29..3857.04 rows=10 width=16)
->  Index Scan using ... on c12  (cost=0.29..3857.04 rows=10 width=16)

 Nevertheless, UNION ALL on partitioned tables doesn't. This is
 quite unfavourable behavior, especially with LIMIT.

uniontest=# EXPLAIN ANALYZE SELECT * FROM p1
UNION ALL SELECT * FROM p2 ORDER BY a LIMIT 10;
  QUERY PLAN

 Limit   (actual time=182.732..182.735 rows=10 loops=1)
   ->  Sort  (actual time=182.729..182.730 rows=10 loops=1)
 Sort Key: p1.a
 Sort Method: top-N heapsort  Memory: 25kB
 ->  Append  (actual time=0.029..108.109 rows=40 loops=1)
   ->  Seq Scan on p1  (actual time=0.001..0.001 rows=0 loops=1)
   ->  Seq Scan on c11  (actual time=0.027..19.074 rows=10 loops=1)
   ->  Seq Scan on c12  (actual time=0.014..16.653 rows=10 loops=1)
   ->  Seq Scan on p2  (actual time=0.000..0.000 rows=0 loops=1)
   ->  Seq Scan on c21  (actual time=0.012..16.677 rows=10 loops=1)
   ->  Seq Scan on c22  (actual time=0.012..16.794 rows=10 loops=1)
 Total runtime: 182.857 ms
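
(For reference, a hedged reconstruction of the test schema implied by the
plans above; table and index placement follow the EXPLAIN output, while
column types and row counts are guesses.)

CREATE TABLE p1 (a int, b int);
CREATE TABLE c11 () INHERITS (p1);
CREATE TABLE c12 () INHERITS (p1);
CREATE TABLE p2 (a int, b int);
CREATE TABLE c21 () INHERITS (p2);
CREATE TABLE c22 () INHERITS (p2);
CREATE INDEX ON p1 (a);
CREATE INDEX ON c11 (a);
CREATE INDEX ON c12 (a);
CREATE INDEX ON p2 (a);
CREATE INDEX ON c21 (a);
CREATE INDEX ON c22 (a);
INSERT INTO c11 SELECT g, g FROM generate_series(1, 100000) g;
INSERT INTO c12 SELECT g, g FROM generate_series(1, 100000) g;
INSERT INTO c21 SELECT g, g FROM generate_series(1, 100000) g;
INSERT INTO c22 SELECT g, g FROM generate_series(1, 100000) g;
ANALYZE;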


 2. The cause

 In grouping_planner, flatten_simple_union_all creates
 appendrelinfos for each subquery, then expand_inherited_tables
 further expands the parent tables in each subquery. Finally,
 this sequence leaves the following structure, where rte[2] and [3]
 are subqueries abandoned on the way while pulling up rte[4] and [5].

 rte[1] Subquery SELECT*1, inh = 1
+- appinfo[0] -> rte[4] Relation p1, inh = 1
|   +- appinfo[2] -> rte[6]  Relation p1, inh = 0
|   +- appinfo[3] -> rte[7]  Relation c11, inh = 0
|   +- appinfo[4] -> rte[8]  Relation c12, inh = 0
+- appinfo[1] -> rte[5] Relation p2, inh = 1
+- appinfo[5] -> rte[9]  Relation p1, inh = 0
+- appinfo[6] -> rte[10] Relation c11, inh = 0
+- appinfo[7] -> rte[11] Relation c12, inh = 0

 On the other hand, the EC member list finally has exprs only for varno =
 1, 4 and 5; in other words, it lacks the members for the most
 descendant RTEs.  This is because add_child_rel_equivalences()
 always inhibits add_eq_member from registering the child_rel's
 relid on the EC. Consequently these grandchild relations do not find
 index pathkeys for themselves.


 3. Measures

 I could think of three approaches to the problem.

 One is simply to modify this spot to pass the child flag in the
 parameters of add_eq_member according to whether the child_rel is
 inh or not. This seems to work fine, although it leaves two levels of
 MergeAppends (PATCH-1). So an additional patch is attached to
 collapse these MergeAppends (PATCH-2). This gives the same plan
 as PATCH-3.

 uniontest=# explain analyze select * from p1 union all
 select * from p2 order by a limit 10;
QUERY PLAN
 
  Limit  (actual time=0.208..0.224 rows=10 loops=1)
->  Merge Append  (actual time=0.205..0.218 rows=10 loops=1)
  Sort Key: p1.a
  ->  Merge Append  (actual time=0.110..0.120 rows=10 loops=1)
Sort Key: p1.a
->  Index Scan .. on p1  (actual time=0.006..0.006 rows=0 loops=1)
->  Index Scan .. on c11  (actual time=0.054..0.060 rows=10 loops=1)
->  Index Scan .. on c12  (actual time=0.046..0.046 rows=1 loops=1)
  ->  Merge Append  (actual time=0.093..0.093 rows=1 loops=1)
Sort Key: p2.a
->  Index Scan .. on p2  (actual time=0.002..0.002 rows=0 loops=1)
->  Index Scan .. on c21  (actual time=0.047..0.047 rows=1 loops=1)
->  Index Scan .. on c22  (actual time=0.043..0.043 rows=1 loops=1)
  Total runtime: 0.448 ms


 The second is to collapse the appendrel structure shown above to
 have only single level 

Re: [HACKERS] proposal: lob conversion functionality

2013-10-25 Thread Pavel Stehule
fixed documentation

Regards

Pavel


2013/10/24 Heikki Linnakangas hlinnakan...@vmware.com

 On 22.10.2013 13:55, Pavel Stehule wrote:

2013/10/21 Noah Misch n...@leadboat.com

 If you're prepared to change the function names and add the
 subset-oriented
 functions, I would appreciate that.

  here is patch


 lobj.sgml still refer to the old names.

 - Heikki

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e397386..8509d09 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3399,6 +3399,78 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
See also the aggregate function functionstring_agg/function in
xref linkend=functions-aggregate.
   /para
+
+   table id=functions-binarystring-largeobjects-transformations
+titleLoading and saving from/to Large Objects Functions/title
+tgroup cols=5
+ thead
+  row
+   entryFunction/entry
+   entryReturn Type/entry
+   entryDescription/entry
+   entryExample/entry
+   entryResult/entry
+  /row
+ /thead
+
+ tbody
+ row
+  entry
+indexterm
+ primarylo_create/primary
+/indexterm
+   literalfunctionlo_create(parameterloid/parameter typeoid/type, parameterstring/parameter typebytea/type )/function/literal
+  /entry
+  entrytypeoid/type/entry
+  entry
+   Create a large object and store a binary string in it. Returns the OID of the
+   created large object.
+  /entry
+  entryliteralselect lo_create(0, decode('ff00','hex'))/literal/entry
+  entryliteral24528/literal/entry
+ /row
+
+  row
+   entry
+indexterm
+ primarylo_get/primary
+/indexterm
+literalfunctionlo_get(parameterloid/parameter typeoid/type optional, parameteroffset/parameter typebigint/type, parameterstr/parameter typebytea/type/optional)/function/literal
+   /entry
+   entrytypebytea/type/entry
+   entry
+Returns a binary string based on the content of the given large object. Attention: a
+binary string has a lower size limit (1GB) than large objects (4GB). Processing a very
+large large object can be very expensive in terms of memory, since the bytea data is
+held completely in memory.
+  /entry
+  entryliterallo_get(24628)/literal/entry
+  entryliteral\xff00/literal/entry
+ /row
+
+  row
+   entry
+indexterm
+ primarylo_put/primary
+/indexterm
+literalfunctionlo_put(parameterloid/parametertypeoid/type, parameteroffset/parameter typebigint/type, parameterstr/parameter typebytea/type)/function/literal
+   /entry
+   entrytypevoid/type/entry
+   entry
+Write data at offset.
+  /entry
+  entryliterallo_put(24628, 0, decode('', 'hex'))/literal/entry
+  entry/entry
+ /row
+
+/tbody
+   /tgroup
+  /table
+
+  para
+   See also a description of other Large Objects Function
+   in xref linkend=lo-funcs.
+  /para
  /sect1
 
 
diff --git a/doc/src/sgml/lobj.sgml b/doc/src/sgml/lobj.sgml
index bb3e08f..32fda11 100644
--- a/doc/src/sgml/lobj.sgml
+++ b/doc/src/sgml/lobj.sgml
@@ -580,6 +580,42 @@ SELECT lo_export(image.raster, '/tmp/motd') FROM image
 these functions as functionloread/ and functionlowrite/.
   /para
 
+  para
+   There are two other functions that do not correspond to the client API
+   (see in xref linkend=functions-binarystring-largeobjects-transformations).
+   functionlo_create/function transforms a binary string into a large object,
+   functionlo_get/function transforms a large object into a binary string, and
+   functionlo_put/function writes a binary string at an offset into a large object.
+  /para
+
+  para
+   Some examples:
+programlisting
+SELECT lo_create(0, decode('ff00','hex'));
+ lo_create 
+---
+ 16392
+(1 row)
+
+SELECT lo_get(16392);
+   lo_get   
+
+ \xff00
+(1 row)
+
+SELECT lo_put(16392, 1, decode('aa','hex'));
+ lo_put 
+
+ 
+(1 row)
+
+SELECT lo_get(16392);
+   lo_get   
+
+ \xffaaff00
+(1 row)
+/programlisting
+  /para
 /sect1
 
 sect1 id=lo-examplesect
diff --git a/src/backend/libpq/be-fsstubs.c b/src/backend/libpq/be-fsstubs.c
index fa00383..aa12349 100644
--- a/src/backend/libpq/be-fsstubs.c
+++ b/src/backend/libpq/be-fsstubs.c
@@ -754,3 +754,137 @@ deleteLOfd(int fd)
 {
 	cookies[fd] = NULL;
 }
+
+/*
+ *	auxiliary LO functions for management LO from SQL and PL
+ */
+
+/*
+ * Load LO fragment and returns bytea
+ *
+ * When nbytes is a -1, then it reads from start (specified by offset) to end.
+ */
+static bytea *
+lo_get_fragment_internal(Oid loOid, int64 offset, int nbytes)
+{
+	LargeObjectDesc	*loDesc;
+	int64		loSize;
+	int		result_length;
+	int total_read;
+	bytea *result = NULL;
+
+	/*
+	 * We don't actually need to store into fscxt, but create it 

Re: [HACKERS] Example query causing param_info to be set in plain rel path

2013-10-25 Thread Tom Lane
Ashutosh Bapat ashutosh.ba...@enterprisedb.com writes:
 In order to test various cases of LATERAL join in Postgres-XC, I am trying
 to find a query where RelOptInfo->lateral_relids would get set for plain
 base relations.

I think you need a lateral reference in a function or VALUES FROM-item.
As you say, plain sub-selects are likely to get flattened.  (Possibly
if you stuck in a flattening fence such as OFFSET 0, you could get the
case to happen with a sub-select FROM item, but I'm too lazy to check.)
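
A couple of hedged examples of what that might look like (reusing the
hypothetical tab1/tab2 tables from upthread and assuming tab1.val is an
integer):

-- lateral reference inside a VALUES FROM-item
SELECT *
FROM tab1,
     LATERAL (VALUES (tab1.val, tab1.val2)) AS v(val, val2);

-- lateral reference inside a function FROM-item
SELECT *
FROM tab1,
     LATERAL generate_series(1, tab1.val) AS g(i);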

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RULE regression test fragility?

2013-10-25 Thread Alvaro Herrera
Andres Freund wrote:
 On 2013-10-24 09:22:52 -0400, Tom Lane wrote:
  Andres Freund and...@2ndquadrant.com writes:
   FWIW, I've repeatedly now thought that it'd make maintaining/updating
   patches easier if we switched that query into unaligned tuple only (\a
   \t) mode. That would remove the frequent conflicts on the row count and
   widespread changes due to changed alignment.
   Alternatively we could just wrap the query in \copy ... CSV.
  
  Hm ... yeah, it would be a good thing if changes in one view didn't so
  frequently have ripple effects to the whole output.  Not sure which
  format is best for that though.
 
 Something like the attached maybe?

+1 (but what are those silly parens in pg_seclabels definition?),
except:

 +-- disable fancy output again
 +\a\t

Should be enable.
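
For reference, a hedged sketch of the toggling in question (the actual query
in the patch is not reproduced here):

-- switch to unaligned, tuples-only output
\a\t
SELECT rulename, definition FROM pg_rules ORDER BY rulename;
-- re-enable fancy output (\a and \t are toggles)
\a\t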

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Tom Lane
Hugo Mercier hugo.merc...@oslandia.com writes:
 PostGIS functions that manipulate geometries have to unserialize their
 input geometries from the 'flat' varlena representation to their own,
 and serialize the processed geometries back when returning.
 But in such nested call queries, this serialization-unserialization
 process is just an overhead.

This is a reasonable thing to worry about, not just for PostGIS types but
for many container types such as arrays --- it'd be nice to be able to
work with an in-memory representation that wasn't just a contiguous blob
of data.  For instance, assignment to an array element might become a
constant-time operation even when working with variable-length datatypes.

 So we thought having a way for user functions to know if they are part
 of a nested call could allow them to avoid this serialization phase.

However, this seems like a completely wrong way to go at it.  In the first
place, it wouldn't help for situations like a complex value stored in a
plpgsql variable.  In the second, I don't think that what you are
describing scales to any more than the most trivial situations.  What
about functions with more than one complex-type input, for example?  And
you'd need to be certain that every single function taking or returning
the datatype gets updated at exactly the same time, else it'll break.

I think the right way to attack it is to create some way for a Datum
value to indicate, at runtime, whether it's a flat value or an in-memory
representation.  Any given function returning the type could choose to
return either representation.  The datatype would have to provide a way
to serialize the in-memory representation, when and if it came time to
store it in a table.  To avoid breaking functions that hadn't yet been
taught about the new representation, we'd probably want to redefine the
existing DETOAST macros as also invoking this datatype flattening
function, and then you'd need to use some new access macro if you wanted
visibility of the non-flat representation.  (This assumes that the whole
thing is only applicable to toastable datatypes, but that seems like a
reasonable restriction.)

Another thing that would have to be attacked in order to make the
plpgsql-variable case work is that you'd need some design for copying such
Datums in-memory, and perhaps a reference count mechanism to optimize away
unnecessary copies.  Your idea of tying the optimization to the nested
function call scenario would avoid the need to solve this problem, but
I think it's too narrow a scope to justify all the other work that'd be
involved.

Some colleagues of mine at Salesforce have been playing with ideas like
this, though last I heard they were nowhere near having a submittable
patch.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New committer

2013-10-25 Thread Tom Lane
Tatsuo Ishii is...@postgresql.org writes:
 Hi pgpool hackers,
 Muhammad Usama did a great job restructuring the pgpool source code. And
 now he is working on incorporating PostgreSQL's exception and memory
 manager. I believe now is the time to add him as one of our committers,
 and I have just done so.

I assume this went to the wrong mailing list?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Andres Freund
On 2013-10-25 10:18:27 -0400, Tom Lane wrote:
 I think the right way to attack it is to create some way for a Datum
 value to indicate, at runtime, whether it's a flat value or an in-memory
 representation.  Any given function returning the type could choose to
 return either representation.  The datatype would have to provide a way
 to serialize the in-memory representation, when and if it came time to
 store it in a table.  To avoid breaking functions that hadn't yet been
 taught about the new representation, we'd probably want to redefine the
 existing DETOAST macros as also invoking this datatype flattening
 function, and then you'd need to use some new access macro if you wanted
 visibility of the non-flat representation.  (This assumes that the whole
 thing is only applicable to toastable datatypes, but that seems like a
 reasonable restriction.)

That sounds reasonable, and we have most of the infrastructure for it
since the indirect toast thing got in.

 Another thing that would have to be attacked in order to make the
 plpgsql-variable case work is that you'd need some design for copying such
 Datums in-memory, and perhaps a reference count mechanism to optimize away
 unnecessary copies.  Your idea of tying the optimization to the nested
 function call scenario would avoid the need to solve this problem, but
 I think it's too narrow a scope to justify all the other work that'd be
 involved.

I've thought about refcounting Datums several times, but I always got
stuck when thinking about how to deal with memory context resets and errors.
Any ideas about that?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-25 Thread Andres Freund
Hi,

On 2013-10-22 16:07:16 +0200, Andres Freund wrote:
 On 2013-10-21 20:16:29 +0200, Andres Freund wrote:
  Current draft is:
  ALTER TABLE ... REPLICA IDENTITY NOTHING|FULL|DEFAULT
  ALTER TABLE ... REPLICA IDENTITY USING INDEX ...;
  
  which leaves the door open for
  
  ALTER TABLE ... REPLICA IDENTITY USING '(' column_name_list ')';
  
  Does anybody have a strong feeling about requiring support for CREATE
  TABLE for this?
 
 Attached is a patch on top of master implementing this syntax. It's not
 wired up to the changeset extraction patch yet as I am not sure whether
 others agree about the storage.
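
For reference, a hedged usage sketch of the draft syntax quoted above
(hypothetical table t with primary-key index t_pkey):

ALTER TABLE t REPLICA IDENTITY FULL;
ALTER TABLE t REPLICA IDENTITY USING INDEX t_pkey;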

So, based on this, I am currently wondering how to store the old
tuple. Currently it is stored using the TupleDesc of the index the old
tuple is based on. But if we want to allow transporting the entire tuple
that obviously cannot be the only option.
One option would be to change the stored format based on what's
configured, using the relation's TupleDesc if FULL is used. But I think
always using the heap relation's desc is better.
The not-logged columns would then just be represented as NULLs. That
will make old primary keys bigger if the relation has a high number of
columns and the key small, but I don't think it matters enough.

Opinions?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-10-25 10:18:27 -0400, Tom Lane wrote:
 I think the right way to attack it is to create some way for a Datum
 value to indicate, at runtime, whether it's a flat value or an in-memory
 representation.

 That sounds reasonable, and we have most of the infrastructure for it
 since the indirect toast thing got in.

Oh really?  I hadn't been paying much attention to that, but obviously
I better go back and study it.

 I've thought about refcounting Datums several times, but I always got
 stuck when thinking about how to deal with memory context resets and errors.
 Any ideas about that?

Not yet.  But it makes no sense to claim that a Datum could have a
reference that's longer-lived than the memory context it's in, so
I'm not sure the context reset case is really a problem.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Hugo Mercier
On 25/10/2013 16:18, Tom Lane wrote:
 Hugo Mercier hugo.merc...@oslandia.com writes:
 PostGIS functions that manipulate geometries have to unserialize their
 input geometries from the 'flat' varlena representation to their own,
 and serialize the processed geometries back when returning.
 But in such nested call queries, this serialization-unserialization
 process is just an overhead.
 
 This is a reasonable thing to worry about, not just for PostGIS types but
 for many container types such as arrays --- it'd be nice to be able to
 work with an in-memory representation that wasn't just a contiguous blob
 of data.  For instance, assignment to an array element might become a
 constant-time operation even when working with variable-length datatypes.
 
 So we thought having a way for user functions to know if they are part
 of a nested call could allow them to avoid this serialization phase.
 
 However, this seems like a completely wrong way to go at it.  In the first
 place, it wouldn't help for situations like a complex value stored in a
 plpgsql variable.  In the second, I don't think that what you are
 describing scales to any more than the most trivial situations.  What
 about functions with more than one complex-type input, for example?  And
 you'd need to be certain that every single function taking or returning
 the datatype gets updated at exactly the same time, else it'll break.

About plpgsql variables: no, there won't be any optimization in that
case. At the time the function result has to be stored in a variable, it
must be serialized.

About functions with more than one complex-type input: as long as all
parameters are of the same type, there is no problem with that.
But if your function deals with more than one complex type AND you want
to avoid serialization on each parameter, then yes, each type must be
aware of this possible optimization (choose whether to serialize or not).

I don't understand what you mean by be certain that every single
function ... gets updated at exactly the same time. Could you elaborate?

 
 I think the right way to attack it is to create some way for a Datum
 value to indicate, at runtime, whether it's a flat value or an in-memory
 representation.  Any given function returning the type could choose to
 return either representation.  The datatype would have to provide a way
 to serialize the in-memory representation, when and if it came time to
 store it in a table.  To avoid breaking functions that hadn't yet been
 taught about the new representation, we'd probably want to redefine the
 existing DETOAST macros as also invoking this datatype flattening
 function, and then you'd need to use some new access macro if you wanted
 visibility of the non-flat representation.  (This assumes that the whole
 thing is only applicable to toastable datatypes, but that seems like a
 reasonable restriction.)

You're totally right. That is very close to what I am working on with
PostGIS.
This is still early work, but here are some details:

https://github.com/Oslandia/postgis/blob/nested_ref_passing/postgis/lwgeom_ref.h

Basically, the 'geometry' type of PostGIS is here extended with a flag
saying if the data is actual 'flat' data or a plain pointer. And if this
is a pointer, a type identifier is stored.

And there is a new DETOAST macro (here POSTGIS_DETOAST_DATUM) that will
test whether the Datum is a pointer or not and, if it is the case, call
the corresponding unserializing functions. So you can avoid copies if
your function is aware of that, and the change for existing functions is
minimal.

https://github.com/Oslandia/postgis/blob/nested_ref_passing/postgis/lwgeom_ref.c

You said "when and if it came time to store it in a table". That is
exactly the point of this 'nested' boolean: otherwise, from a function's
point of view, how do you know that it is time to store the value in a table?

 
 Another thing that would have to be attacked in order to make the
 plpgsql-variable case work is that you'd need some design for copying such
 Datums in-memory, and perhaps a reference count mechanism to optimize away
 unnecessary copies.  Your idea of tying the optimization to the nested
 function call scenario would avoid the need to solve this problem, but
 I think it's too narrow a scope to justify all the other work that'd be
 involved.

Do you think it must necessarily cover the plpgsql variable case to be
acceptable?


-- 
Hugo Mercier
Oslandia


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Andres Freund
On 2013-10-25 11:01:28 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-10-25 10:18:27 -0400, Tom Lane wrote:
  I think the right way to attack it is to create some way for a Datum
  value to indicate, at runtime, whether it's a flat value or an in-memory
  representation.
 
  That sounds reasonable, and we have most of the infrastructure for it
  since the indirect toast thing got in.
 
 Oh really?  I hadn't been paying much attention to that, but obviously
 I better go back and study it.

Well, it has the infrastructure for adding further varattrib_1b_e types
and for computing their size independently. So you can easily add a new
type of toast datum. There still needs to be handling for it in
tuptoaster.c et al, but that's not surprising ;)
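
For reference, constructing such an indirect datum with the existing
machinery looks roughly like this (illustrative sketch only; a genuinely
non-flat representation would presumably add its own vartag next to
VARTAG_INDIRECT plus the matching flattening support in tuptoaster.c):

#include "postgres.h"
#include "access/tuptoaster.h"

/* Wrap an in-memory varlena in an "indirect" external datum. */
static Datum
wrap_in_memory_value(struct varlena *in_memory)
{
	struct varlena *result;
	varatt_indirect redirect;

	redirect.pointer = in_memory;

	result = (struct varlena *) palloc(INDIRECT_POINTER_SIZE);
	SET_VARTAG_EXTERNAL(result, VARTAG_INDIRECT);
	memcpy(VARDATA_EXTERNAL(result), &redirect, sizeof(redirect));

	/* Aware code can test VARATT_IS_EXTERNAL_INDIRECT() on the result. */
	return PointerGetDatum(result);
}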

  I've thought about refcounting Datums several times, but I always got
  stuck when thinking about how to deal with memory context resets and errors.
  Any ideas about that?
 
 Not yet.  But it makes no sense to claim that a Datum could have a
 reference that's longer-lived than the memory context it's in, so
 I'm not sure the context reset case is really a problem.

Given how short-lived many of the contexts used for expression
evaluation are, that might restrict the usefulness quite a bit. I think
at the very least it has to be allowed that a Datum also gets used in
child contexts.
But that already opens up the door for refcount leakage when the child
context gets destroyed.

I wonder if this needs mcxt.c/aset.c support to be useful.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Tom Lane
Hugo Mercier hugo.merc...@oslandia.com writes:
 On 25/10/2013 16:18, Tom Lane wrote:
 However, this seems like a completely wrong way to go at it.  In the first
 place, it wouldn't help for situations like a complex value stored in a
 plpgsql variable.  In the second, I don't think that what you are
 describing scales to any more than the most trivial situations.  What
 about functions with more than one complex-type input, for example?  And
 you'd need to be certain that every single function taking or returning
 the datatype gets updated at exactly the same time, else it'll break.

 About functions with more than one complex-type input, as long as all
 parameters are of the same type, there is no problem with that.

How do you tell the difference between

   foo(col1, bar(col2))
   foo(bar(col1), col2)

 I don't understand what you mean by being certain that every single
 function ... gets updated at exactly the same time. Could you elaborate?

If you're tying this to the syntax of the expression, then bar() *must*
return a non-serialized value when and only when foo() is expecting that,
therefore their implementations must change at the same time.  Perhaps
that's workable for PostGIS, but it's a complete nonstarter for
widely-known datatypes like arrays, where affected functions might be
spread through any number of extensions.  We need a design that permits
incremental fixing of functions that work with a deserializable datatype.

Another point worth worrying about is that not all expressions are
function calls, nor do all function calls arise from expressions.
Chasing down all the corner cases and making sure they work properly
in a syntax-driven approach is going to be a headache.

 Basically, the 'geometry' type of PostGIS is here extended with a flag
 saying if the data is actual 'flat' data or a plain pointer. And if this
 is a pointer, a type identifier is stored.

If you're doing that, why do you need the decoration on the FuncExpr
expressions?  Can't you just look at your input datums and see if they're
flat or not?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Hugo Mercier
On 25/10/2013 17:20, Tom Lane wrote:
 Hugo Mercier hugo.merc...@oslandia.com writes:
 On 25/10/2013 16:18, Tom Lane wrote:

 How do you tell the difference between
 
foo(col1, bar(col2))
foo(bar(col1), col2)
 

Still not sure I understand...
I assume foo() takes two arguments of type A.
bar() can take one argument of type A or of another type B.

In bar(), you would have the choice to return either a plain A
or a pointer to A. Because bar() knows its call is nested (by foo()),
it can decide to return a pointer to A.

foo() is then evaluated and we assume it knows A can be a pointer.
foo() then knows its nesting level is 0 and must return something
serialized in that case.

 I don't understand what you mean by being certain that every single
 function ... gets updated at exactly the same time. Could you elaborate?
 
 If you're tying this to the syntax of the expression, then bar() *must*
 return a non-serialized value when and only when foo() is expecting that,
 therefore their implementations must change at the same time.  Perhaps
 that's workable for PostGIS, but it's a complete nonstarter for
 widely-known datatypes like arrays, where affected functions might be
 spread through any number of extensions.  We need a design that permits
 incremental fixing of functions that work with a deserializable datatype.

Yes.
It could work for each user type, assuming every function working with
this type is aware of this pointer/serialized nature, including in extensions.
So you have to, at least, recompile every extension depending on those
types. Which ... limits the interest for very general types, I have to
admit.

 
 Another point worth worrying about is that not all expressions are
 function calls, nor do all function calls arise from expressions.
 Chasing down all the corner cases and making sure they work properly
 in a syntax-driven approach is going to be a headache.

We could add this 'nesting' detection to operators (and probably other
constructs that I don't know of) little by little.
Isn't optimizing only function calls enough as a first step?

 
 Basically, the 'geometry' type of PostGIS is extended here with a flag
 saying whether the data is actual 'flat' data or a plain pointer. If it
 is a pointer, a type identifier is also stored.
 
 If you're doing that, why do you need the decoration on the FuncExpr
 expressions?  Can't you just look at your input datums and see if they're
 flat or not?
 

If a function returns a pointer whatever the nesting level is, you could
end up with something storing a raw pointer, which is bad. You could
eventually add a way to detect that what you stored was a pointer and
that your data no longer exists (return NULL?) when read back, but you
basically end up with users manipulating pointers, which is bad.

If you want to make it transparent to the user, you need to know the
nesting level to decide whether you can just pass the value to something
that is aware of this pointer (nesting level >= 1) or must serialize it
back (nesting level == 0).

-- 
Hugo Mercier
Oslandia


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] LISTEN / NOTIFY enhancement request for Postgresql

2013-10-25 Thread Sev Zaslavsky

Hi pgsql-hackers,

The LISTEN / NOTIFY feature (along with the pg_notify() function) is a
unique feature that differentiates PostgreSQL from nearly all other
relational database systems.  With the exception of SQL Server, I know
of no other RDBMS that allows a client to be asynchronously notified by
the database server.


This feature embodies the modern push approach and allows delivering
timely data to the user as it changes, instead of the more traditional
pull approach, which requires the user to request the data at specific
intervals.  Vendors are rolling out push technologies to meet market
demand.  Microsoft recently introduced SignalR, a framework for pushing
content to ASP.NET web pages.  Similarly, Complex Event Processing
systems push information to users' dashboards in real time.


In contrast with RDBMSs, where asynchronous notification is a special
feature, message broker software implementations live and breathe
asynchronous notification.  So I feel that the LISTEN / NOTIFY feature
is trying to deliver some of the asynchronous notification features of a
message broker, but it lacks some of the flexibility.


One particular shortcoming of LISTEN / NOTIFY is the fact that the
channel specified on the LISTEN must _exactly_ match the channel
specified on the NOTIFY.  Here is an example of the problem:


I have two listeners:
 1. Interested in all stock quote updates
 2. Interested in stock quote updates for IBM only

There is a table that contains stock prices with a trigger proc that 
issues a NOTIFY using pg_notify() upon update.  There isn't a single 
channel that I can use that will deliver the message to both listeners.  
To get around the problem I could publish a message on channel PRICE 
and another message on channel PRICE.IBM but sending two notifications 
is far from optimal.


Message brokers have implemented a neat way to get around this issue.   
It is accomplished by allowing wildcards in message topic subscriptions.


Here is an example implementation:
http://activemq.apache.org/nms/activemq-wildcards.html


 * . is used to separate names in a path
 * * is used to match any name in a path
 * > is used to recursively match any destination starting from this name

Using the example above, these subscriptions become possible:

Subscription              Meaning
PRICE.>                   Any price for any product on any exchange
PRICE.STOCK.>             Any price for a stock on any exchange
PRICE.STOCK.NASDAQ.*      Any stock price on NASDAQ
PRICE.STOCK.*.IBM         Any IBM stock price on any exchange


My request is to implement the same or a similar feature in PostgreSQL.
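
To make the request concrete, here is a rough sketch of the matching rule
the backend could apply (purely illustrative; none of this exists in
PostgreSQL today):

#include <stdbool.h>
#include <string.h>

/*
 * Does a wildcard subscription match a notification channel?
 * '.' separates names, '*' matches exactly one name, '>' matches the
 * rest of the path (the ActiveMQ-style rules described above).
 */
static bool
channel_matches(const char *pattern, const char *channel)
{
	while (*pattern)
	{
		const char *pdot;
		const char *cdot;
		size_t		plen;
		size_t		clen;

		if (strcmp(pattern, ">") == 0)
			return true;		/* '>' matches everything from here on */

		pdot = strchr(pattern, '.');
		cdot = strchr(channel, '.');
		plen = pdot ? (size_t) (pdot - pattern) : strlen(pattern);
		clen = cdot ? (size_t) (cdot - channel) : strlen(channel);

		/* a literal name must match exactly; '*' matches any single name */
		if (!(plen == 1 && pattern[0] == '*') &&
			(plen != clen || strncmp(pattern, channel, plen) != 0))
			return false;

		/* if either path ends here, both must end here */
		if (pdot == NULL || cdot == NULL)
			return pdot == NULL && cdot == NULL;

		pattern = pdot + 1;
		channel = cdot + 1;
	}
	return *channel == '\0';
}

With something like this, the trigger could issue a single NOTIFY on, say,
PRICE.STOCK.NYSE.IBM, and both the PRICE.> and the PRICE.STOCK.*.IBM
subscribers above would receive it.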

Thank you.

-Sev


Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Tom Lane
Hugo Mercier hugo.merc...@oslandia.com writes:
 On 25/10/2013 17:20, Tom Lane wrote:
 How do you tell the difference between
 
 foo(col1, bar(col2))
 foo(bar(col1), col2)

 Still not sure to understand ...
 I assume foo() takes two argument of type A.
 bar() can take one argument of A or another type B.

I was assuming everything was the same datatype in this example, ie
col1, col2, and the result of bar() are all type A.

The point I'm trying to make is that in the first case, foo would be
receiving a first argument that was flat and a second that was not flat;
while in the second case, it would be receiving a first argument that was
not flat and a second that was flat.  The expression labeling you're
proposing does not help it tell the difference.  What's more, you're
proposing that the labeling be made by generic code that can't possibly
know what bar() is really going to do.

 In bar(), you would have the choice to return either a plain A
 or a pointer to A. Because bar() knows its call is nested (by foo()),
 it can decide to return a pointer to A.

 foo() is then evaluated and we assume it knows A can be a pointer.
 foo() then knows its nesting level is 0 and must return something
 serialized in that case.

Whoa.  That's the most fragile, assumption-filled way you could possibly
go about this.  In general, bar() cannot be expected to know whether the
outer function is able to take a non-flat parameter value.  And you've
glossed over how foo() would know whether its input was flat or not.

Another point here is that there's no good reason to suppose that a
function should return a flattened value just because it's at the outer
level of its syntactic expression.  For example, if we're doing a plain
SELECT foo(...) FROM ..., the next thing that will happen with that value
is it'll be fed to the output function for the datatype.  Maybe that
output function would like to have a non-flat input value, too, to save
the time of transforming back to that representation.  On the other hand,
if it's a SELECT ... ORDER BY ... and the planner chooses to do the ORDER
BY with a final sort step, we'll probably have to flatten the value to
pass it through sorting.  (Or possibly not --- perhaps we could just pass
the toast token through sorting?)  There are a lot of considerations here
and it's really unreasonable to expect that static expression labeling
will be able to do the right thing every time.

Basically the only way to make this work reliably is for Datums to be
self-identifying as to whether they're flat or structured values; then
make code do the right thing on-the-fly at runtime depending on what kind
of Datum it gets.  Once you've done that, I don't see that parse-time
labeling of expression nesting adds anything useful.  As Andres said,
the provisions for toasted datums are a good precedent, and none of that
depends on parse-time decisions.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLUSTER FREEZE

2013-10-25 Thread Josh Berkus
On 10/24/2013 07:19 PM, Tom Lane wrote:
 In any case, it's very far from obvious to me that CLUSTER ought
 to throw away information by default, which is what you're proposing.

The problem here is that you're thinking of the 1/10 of 1% of our users
who have a serious PostgreSQL failure and post something on the lists
for help, for which XID forensic information is useful.  As opposed to
the 99.9% of our users for whom deferred freezing is a performance
burden.  While I realize that the 0.1% of users are more likely to have
contact with you, personally, it's still bad policy for the project.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql should show disabled internal triggers

2013-10-25 Thread fabriziomello
On 2013-09-18 15:15:55 +0200, Bernd Helmle wrote: 
 On 2013-09-18 15:15:55 +0200, Bernd Helmle wrote:
  --On 18. September 2013 13:52:29 +0200 Andres Freund
  andres@... wrote:
  
  If you do ALTER TABLE ... DISABLE TRIGGER ALL; and then individually
  re-enable the disabled triggers it's easy to miss internal triggers.
  A \d+ tablename will not show anything out of the ordinary for that
  situation since we don't show internal triggers. But foreign key checks
  won't work.
  So, how about displaying disabled internal triggers in psql?
  
   I had exactly the same concerns this morning while starting to look at
 the
   ENABLE/DISABLE constraint patch. However, I wouldn't display them as
  triggers, but maybe more generally as disabled constraints or such.
 
 Well, that will lead the user in the wrong direction, won't it? They
 haven't disabled the constraint but the trigger. Especially as we
 already have NOT VALID and might grow DISABLED for constraint
 themselves...
 

Hi,

The attached patch [1] enables psql to list disabled internal triggers in \d,
but only for versions >= 9.0.

[1]  psql-display-all-triggers-v1.patch
http://postgresql.1045698.n5.nabble.com/file/n5775954/psql-display-all-triggers-v1.patch
  

Regards,

-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog sobre TI: http://fabriziomello.blogspot.com
 Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello



-
-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Blog sobre TI: http://fabriziomello.blogspot.com
 Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/psql-should-show-disabled-internal-triggers-tp5771406p5775954.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detection of nested function calls

2013-10-25 Thread Robert Haas
On Fri, Oct 25, 2013 at 10:18 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Hugo Mercier hugo.merc...@oslandia.com writes:
 PostGIS functions that manipulate geometries have to unserialize their
 input geometries from the 'flat' varlena representation to their own,
 and serialize the processed geometries back when returning.
 But in such nested call queries, this serialization-unserialization
 process is just an overhead.

 This is a reasonable thing to worry about, not just for PostGIS types but
 for many container types such as arrays --- it'd be nice to be able to
 work with an in-memory representation that wasn't just a contiguous blob
 of data.  For instance, assignment to an array element might become a
 constant-time operation even when working with variable-length datatypes.

I bet numeric could benefit as well.  Essentially all of the
operations on numeric start by transforming the on-disk representation
into an internal form used only for the duration of a single call, and
end by transforming the internal form of the result back to the
on-disk representation.
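
From memory, the shape of most functions in numeric.c is roughly the
following (simplified; add_var(), init_var_from_num(), make_result() and
friends are static helpers internal to that file, so treat the exact
names as approximate):

Datum
numeric_add(PG_FUNCTION_ARGS)
{
	Numeric		num1 = PG_GETARG_NUMERIC(0);
	Numeric		num2 = PG_GETARG_NUMERIC(1);
	NumericVar	arg1, arg2, result;
	Numeric		res;

	init_var_from_num(num1, &arg1);		/* unpack the on-disk form */
	init_var_from_num(num2, &arg2);
	init_var(&result);

	add_var(&arg1, &arg2, &result);		/* compute on the internal form */

	res = make_result(&result);			/* pack back to the on-disk form */
	free_var(&result);

	PG_RETURN_NUMERIC(res);
}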

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Additional information on log_line_prefix

2013-10-25 Thread Emanuel Calvo
Hi guys,

I'm working on a quick conversion script for query reviews, and I wonder
whether a feature request to add the following values would be possible:

 %D = duration
 %L = lock_time   (lock only for this query)
 %E = estimated rows
 %R = total rows returned
 %B = total bytes sent
 %T = temporal tables used

Those prefixes/values are just examples/proposals.

Thanks for the hard work!


-- 
--
Emanuel Calvo


Re: [HACKERS] Additional information on log_line_prefix

2013-10-25 Thread David Johnston
emanuel_calvo wrote
  %E = estimated rows

How would you expect this to work?  This information seems mostly useless
without the context of a full EXPLAIN output.


  %T = temporal tables used

I am guessing you mean temporary, not temporal - the latter also being
known as time-oriented.

David J.




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Additional-information-on-log-line-prefix-tp5775956p5775958.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Additional information on log_line_prefix

2013-10-25 Thread Andrew Dunstan


On 10/25/2013 01:50 PM, Emanuel Calvo wrote:


Hi guys,

I'm working on a quick conversion script for query reviews, and I
wonder whether a feature request to add the following values would be possible:


 %D = duration
 %L = lock_time   (lock only for this query)
 %E = estimated rows
 %R = total rows returned
 %B = total bytes sent
 %T = temporal tables used

Those prefixes/values are just examples/proposals.

Thanks for the hard work!




Pretty much all of this can be got with the auto_explain module, and I 
think that's where it belongs.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failure while inserting parent tuple to B-tree is not fun

2013-10-25 Thread Heikki Linnakangas

On 22.10.2013 19:55, Heikki Linnakangas wrote:

I fixed the the same problem in GiST a few years back, by making it
tolerate missing downlinks, and inserting them lazily. The B-tree code
tolerates them already on scans, but gets confused on insertion, as seen
above. I propose that we use the same approach I used with GiST, and add
a flag to the page header to indicate the downlink hasn't been inserted
yet. When insertion (or vacuum) bumps into a flagged page, it can
finish the incomplete action by inserting the downlink.


This is what I came up with.

One thing I'm not totally happy about is the way page deletions of 
incompletely split pages are handled. Basically, it just bails out and 
refuses to delete a page that is part of an incomplete split. That's 
probably OK in practice, as incomplete splits should be very rare 
anyway, but it's a bit dissatisfying to not handle the case because at 
first glance it seems like it should be even simpler than usual to 
delete a page that has no downlink. Nevertheless, I decided to just skip 
that for now.


After this patch, deleting the only child of a parent and the parent 
itself is still a multi-WAL-record operation that needs to be tracked 
during recovery, and completed at the end of recovery. I'd like to 
eliminate that too, but that's another patch.


- Heikki
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40f09e3..29d8bd1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -384,12 +384,41 @@ an additional insertion above that, etc).
 For a root split, the followon WAL entry is a new root entry rather than
 an insertion entry, but details are otherwise much the same.
 
-Because insertion involves multiple atomic actions, the WAL replay logic
-has to detect the case where a page split isn't followed by a matching
-insertion on the parent level, and then do that insertion on its own (and
-recursively for any subsequent parent insertion, of course).  This is
-feasible because the WAL entry for the split contains enough info to know
-what must be inserted in the parent level.
+Because insertion involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks along the way, when
+searching the tree for a new insertion. It could be done during searches,
+too, but it seems best not to put any extra writes in what would otherwise
+be a read-only operation (it would not be possible in hot standby mode
+anyway). To identify missing downlinks, when a page is split, the left page
+is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is atomically cleared.
+The child page is kept locked until the insertion in the parent is finished
+and the flag in the child cleared, but can be released immediately after
+that, before recursing up the tree, if the parent also needs to be split.
+That ensures that incompletely split pages should not be seen under normal
+circumstances; only when a transaction fails to insert the parent for some
+reason.
+
+We flag the left page, even though it's the right page that's missing the
+downlink, because it's more convenient to know already when following the
+right-link from the left page to the right page that it will need to have
+its downlink inserted to the parent.
+
+We used to keep track of incomplete splits during recovery and finish them
+immediately at end of recovery, instead of doing it lazily at the next
+insertion. However, that made the recovery much more complicated, and only
+fixed the problem when crash recovery was performed. An incomplete split can
+also occur if an otherwise recoverable error, like out-of-memory or
+out-of-disk-space, happens while inserting the downlink to the parent.
 
 When splitting a non-root page that is alone on its level, the required
 metapage update (of the fast root link) is performed and logged as part
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index a452fea..a810015 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -59,14 +59,16 @@ static void _bt_findinsertloc(Relation rel,
   ScanKey scankey,
   IndexTuple newtup,
   Relation heapRel);
-static void _bt_insertonpg(Relation rel, Buffer buf,
+static void _bt_insertonpg(Relation rel, Buffer buf, Buffer 

Re: [HACKERS] CLUSTER FREEZE

2013-10-25 Thread Thomas Munro
On 25 October 2013 01:17, Josh Berkus j...@agliodbs.com wrote:

 On 10/24/2013 04:55 PM, Robert Haas wrote:
  I wonder if we should go so far as to make this the default behavior,
  instead of just making it an option.

 +1 from me.  Can you think of a reason you *wouldn't* want to freeze?


Ok, I attach an alternative patch that makes CLUSTER *always* freeze,
without any option (but doesn't affect VACUUM FULL in the same way). I will
post both alternatives to the commitfest app since there seems to be some
disagreement about whether tuple freezing should be optional.

Thanks
Thomas Munro


cluster-freeze-always.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v6.4

2013-10-25 Thread Robert Haas
On Fri, Oct 25, 2013 at 10:58 AM, Andres Freund and...@2ndquadrant.com wrote:
 So, I am currently wondering about how to store the old tuple, based
 on this. Currently it is stored using the TupleDesc of the index the old
 tuple is based on. But if we want to allow transporting the entire tuple
 that obviously cannot be the only option.
 One option would be to change the stored format based on what's
 configured, using the relation's TupleDesc if FULL is used. But I think
 always using the heap relation's desc is better.

I heartily agree.

 The not-logged columns would then just be represented as NULLs. That
 will make old primary keys bigger if the relation has a high number of
 columns and the key small, but I don't think it matters enough.

Even if it does matter, the cure seems likely to be worse than the disease.

My only other comment is that if NONE is selected, we ought to omit
the old tuple altogether, not store one that is all-nulls.  But I bet
you had that in mind anyway.
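
(For concreteness, a rough sketch of forming such a tuple on the heap
descriptor, with every non-key column stored as NULL; this is purely
illustrative and not taken from the patch.  The bitmap offset convention
follows RelationGetIndexAttrBitmap().)

#include "postgres.h"
#include "access/htup_details.h"
#include "access/sysattr.h"
#include "nodes/bitmapset.h"
#include "utils/rel.h"

static HeapTuple
form_old_key_tuple(Relation rel, HeapTuple oldtup, Bitmapset *key_attrs)
{
	TupleDesc	desc = RelationGetDescr(rel);
	int			natts = desc->natts;
	Datum	   *values = palloc(natts * sizeof(Datum));
	bool	   *isnull = palloc(natts * sizeof(bool));
	int			attno;

	heap_deform_tuple(oldtup, desc, values, isnull);

	/* null out everything that is not part of the configured key */
	for (attno = 0; attno < natts; attno++)
	{
		if (!bms_is_member(attno + 1 - FirstLowInvalidHeapAttributeNumber,
						   key_attrs))
			isnull[attno] = true;
	}

	return heap_form_tuple(desc, values, isnull);
}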

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] dsm use of uint64

2013-10-25 Thread Robert Haas
When I wrote the dynamic shared memory patch, I used uint64 everywhere
to measure sizes - rather than, as we do for the main shared memory
segment, Size.  This now seems to me to have been the wrong decision;
I'm finding that it's advantageous to make dynamic shared memory
behave as much like the main shared memory segment as is reasonably
possible, and using Size facilitates the use of MAXALIGN(),
TYPEALIGN(), etc. as well as things like add_size() and mul_size()
which are just as relevant in the dynamic shared memory case as they
are for the main shared memory segment.
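
For illustration only (this is not part of the attached patch, and the
fragment would live inside dsm.c, where dsm_control_header and
dsm_control_item are visible), a sizing function in the style of
dsm_control_bytes_needed() can then use the same overflow-checked
helpers as the main shared memory segment:

static Size
example_dsm_control_size(uint32 nitems)
{
	Size		size;

	size = MAXALIGN(offsetof(dsm_control_header, item));	/* fixed header */
	size = add_size(size, mul_size(nitems, sizeof(dsm_control_item)));
	return size;
}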

Therefore, I propose to apply the attached patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index e516197..4d7fb08 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -67,7 +67,7 @@ struct dsm_segment
 	uint32		control_slot;		/* Slot in control segment. */
 	void   *impl_private;		/* Implementation-specific private data. */
 	void	   *mapped_address;		/* Mapping address, or NULL if unmapped. */
-	uint64		mapped_size;		/* Size of our mapping. */
+	Size		mapped_size;		/* Size of our mapping. */
 };
 
 /* Shared-memory state for a dynamic shared memory segment. */
@@ -94,7 +94,7 @@ static void dsm_postmaster_shutdown(int code, Datum arg);
 static void dsm_backend_shutdown(int code, Datum arg);
 static dsm_segment *dsm_create_descriptor(void);
 static bool dsm_control_segment_sane(dsm_control_header *control,
-		 uint64 mapped_size);
+		 Size mapped_size);
 static uint64 dsm_control_bytes_needed(uint32 nitems);
 
 /* Has this backend initialized the dynamic shared memory system yet? */
@@ -128,7 +128,7 @@ static dlist_head dsm_segment_list = DLIST_STATIC_INIT(dsm_segment_list);
  */
 static dsm_handle dsm_control_handle;
 static dsm_control_header *dsm_control;
-static uint64 dsm_control_mapped_size = 0;
+static Size dsm_control_mapped_size = 0;
 static void	*dsm_control_impl_private = NULL;
 
 /*
@@ -142,7 +142,7 @@ dsm_postmaster_startup(void)
 {
 	void	   *dsm_control_address = NULL;
 	uint32		maxitems;
-	uint64		segsize;
+	Size		segsize;
 
 	Assert(!IsUnderPostmaster);
 
@@ -214,8 +214,8 @@ dsm_cleanup_using_control_segment(void)
 	void	   *junk_mapped_address = NULL;
 	void	   *impl_private = NULL;
 	void	   *junk_impl_private = NULL;
-	uint64		mapped_size = 0;
-	uint64		junk_mapped_size = 0;
+	Size		mapped_size = 0;
+	Size		junk_mapped_size = 0;
 	uint32		nitems;
 	uint32		i;
 	dsm_handle	old_control_handle;
@@ -453,7 +453,7 @@ dsm_postmaster_shutdown(int code, Datum arg)
 	void	   *dsm_control_address;
 	void	   *junk_mapped_address = NULL;
 	void	   *junk_impl_private = NULL;
-	uint64		junk_mapped_size = 0;
+	Size		junk_mapped_size = 0;
 
 	/*
 	 * If some other backend exited uncleanly, it might have corrupted the
@@ -562,7 +562,7 @@ dsm_backend_startup(void)
  * Create a new dynamic shared memory segment.
  */
 dsm_segment *
-dsm_create(uint64 size)
+dsm_create(Size size)
 {
 	dsm_segment	   *seg = dsm_create_descriptor();
 	uint32			i;
@@ -733,7 +733,7 @@ dsm_backend_shutdown(int code, Datum arg)
  * address.  For the caller's convenience, we return the mapped address.
  */
 void *
-dsm_resize(dsm_segment *seg, uint64 size)
+dsm_resize(dsm_segment *seg, Size size)
 {
 	Assert(seg-control_slot != INVALID_CONTROL_SLOT);
 	dsm_impl_op(DSM_OP_RESIZE, seg-handle, size, seg-impl_private,
@@ -887,7 +887,7 @@ dsm_segment_address(dsm_segment *seg)
 /*
  * Get the size of a mapping.
  */
-uint64
+Size
 dsm_segment_map_length(dsm_segment *seg)
 {
 	Assert(seg-mapped_address != NULL);
@@ -947,7 +947,7 @@ dsm_create_descriptor(void)
  * our segments at all.
  */
 static bool
-dsm_control_segment_sane(dsm_control_header *control, uint64 mapped_size)
+dsm_control_segment_sane(dsm_control_header *control, Size mapped_size)
 {
 	if (mapped_size  offsetof(dsm_control_header, item))
 		return false;			/* Mapped size too short to read header. */
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 627f00b..2056668 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -69,24 +69,24 @@
 #include utils/memutils.h
 
 #ifdef USE_DSM_POSIX
-static bool dsm_impl_posix(dsm_op op, dsm_handle handle, uint64 request_size,
+static bool dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 			   void **impl_private, void **mapped_address,
-			   uint64 *mapped_size, int elevel);
+			   Size *mapped_size, int elevel);
 #endif
 #ifdef USE_DSM_SYSV
-static bool dsm_impl_sysv(dsm_op op, dsm_handle handle, uint64 request_size,
+static bool dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size,
 			   void **impl_private, void **mapped_address,
-			   uint64 *mapped_size, int elevel);
+			   Size *mapped_size, int elevel);
 #endif
 #ifdef USE_DSM_WINDOWS
-static bool dsm_impl_windows(dsm_op op, dsm_handle