Re: [HACKERS] Tweaking Foreign Keys for larger tables

2014-11-06 Thread David G Johnston
Peter Eisentraut-2 wrote
 On 10/31/14 6:19 AM, Simon Riggs wrote:
 Various ways of tweaking Foreign Keys are suggested that are helpful
 for larger databases.
 
 *INITIALLY NOT ENFORCED
 FK created, but is not enforced during DML.
 Will be/Must be marked NOT VALID when first created.
 We can run a VALIDATE on the constraint at any time; if it passes the
 check it is marked VALID and presumed to stay that way until the next
 VALIDATE run.
 
 Does that mean the FK would become invalid after every DML operation,
 until you explicitly revalidate it?  Is that practical?

My read is that it means that you can insert invalid data but the system
will pretend it is valid unless someone asks it for confirmation.  Upon
validation the FK will become invalid until the discrepancy is fixed and
another validation is performed.
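To make that concrete, here is a rough sketch of how I read the proposed
workflow. The INITIALLY NOT ENFORCED clause is only the syntax being
proposed upthread (nothing is implemented, and where the clause goes is my
guess); the table and constraint names are invented:

  -- Proposed syntax: create the FK without enforcing it during DML;
  -- it starts out marked NOT VALID.
  ALTER TABLE order_line
    ADD CONSTRAINT order_line_order_fk
    FOREIGN KEY (order_id) REFERENCES orders (order_id)
    INITIALLY NOT ENFORCED;

  -- Later, ask the system for confirmation (existing VALIDATE syntax);
  -- on success the constraint is marked VALID again.
  ALTER TABLE order_line VALIDATE CONSTRAINT order_line_order_fk;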


 ON DELETE IGNORE
 ON UPDATE IGNORE
 If we allow this specification then the FK is one way - we check the
 existence of a row in the referenced table, but there is no need for a
 trigger on the referenced table to enforce an action on delete or
 update, so no need to lock the referenced table when adding FKs.
 
 Are you worried about locking the table at all, or about having to lock
 many rows?

Wouldn't you at least need some kind of trigger to make the constraint
invalid as soon as any record is updated or removed from the referenced
table since in all likelihood the FK relationship has just been broken?


How expensive is validation going to be?  Especially, can validation occur
incrementally or does every record need to be validated each time?

Is this useful for master-detail setups, record-category, or both (others?)?

Will optimizations over invalid data give incorrect answers and in what
specific scenarios can that be expected?

I get the idea of having a system that lets you skip constant data
validation, since in all likelihood some scenarios, once in production, would
be extremely resistant to the introduction of errors and could be dealt with
on-the-fly.  Trust only, since the verification is expensive - but keep the
option open and the model faithfully represented.

I don't know that I would ever think to use this in my world, since the
additional admin effort is obvious but the cost of the thing I'd be avoiding
is vague.  As it is now, someone could simply drop their FK constraints and
run a validation query periodically to see whether the data being inserted is
correct.  That doesn't allow for optimizations to take place, though, so
this is an improvement; but the documentation and support aspects of a
keep/drop decision could be fleshed out first, as that would be valuable in
its own right.  Then go about figuring out how to make a hybrid
implementation work.

Put another way: at what point does the cost of the FK constraint outweigh
the optimization savings?  While size is the obvious factor, both schema and
read/write patterns likely have a significant influence.

David J.




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Tweaking-Foreign-Keys-for-larger-tables-tp5825162p5825891.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tweaking Foreign Keys for larger tables

2014-11-06 Thread Simon Riggs
On 5 November 2014 21:15, Peter Eisentraut pete...@gmx.net wrote:

 ON DELETE IGNORE
 ON UPDATE IGNORE
 If we allow this specification then the FK is one way - we check the
 existence of a row in the referenced table, but there is no need for a
 trigger on the referenced table to enforce an action on delete or
 update, so no need to lock the referenced table when adding FKs.

 Are you worried about locking the table at all, or about having to lock
 many rows?

This is useful for smaller, highly referenced tables that don't change
much, if ever.

In that case the need for correctness through locking is minimal. If we
do lock, it will cause very high multixact traffic, so that alone is worth
avoiding.

The main issue is referencing a table many times. Getting a full table
lock can halt all FK checks, so skipping adding the trigger altogether
avoids freezing up everything just for a trigger that doesn't actually
do much.
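As a rough illustration, the sort of declaration this is aimed at would look
something like the following (the syntax is only a proposal at this point,
and the table names are invented):

  -- One-way FK to a small, static reference table: existence is checked
  -- on insert/update of the referencing table, but no action trigger is
  -- installed on the referenced table.
  ALTER TABLE sales_fact
    ADD CONSTRAINT sales_fact_currency_fk
    FOREIGN KEY (currency_code) REFERENCES currency (code)
    ON DELETE IGNORE
    ON UPDATE IGNORE;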

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tweaking Foreign Keys for larger tables

2014-11-06 Thread Simon Riggs
On 5 November 2014 21:15, Peter Eisentraut pete...@gmx.net wrote:
 On 10/31/14 6:19 AM, Simon Riggs wrote:
 Various ways of tweaking Foreign Keys are suggested that are helpful
 for larger databases.

 *INITIALLY NOT ENFORCED
 FK created, but is not enforced during DML.
 Will be/Must be marked NOT VALID when first created.
 We can run a VALIDATE on the constraint at any time; if it passes the
 check it is marked VALID and presumed to stay that way until the next
 VALIDATE run.

 Does that mean the FK would become invalid after every DML operation,
 until you explicitly revalidate it?  Is that practical?

I think so.

We store the validity on the relcache entry.

Constraint would add a statement-level after trigger for insert,
update, delete and truncate, which issues a relcache invalidation if
the state was marked valid. It would be marked as deferrable, initially deferred.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-11-06 Thread Fujii Masao
On Tue, Nov 4, 2014 at 2:03 PM, Rahila Syed rahilasye...@gmail.com wrote:
 Hello ,

 Please find updated patch with the review comments given above implemented

Hunk #3 FAILED at 692.
1 out of 3 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlogreader.c.rej

The patch does not apply cleanly to master. Could you update the patch?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE IF NOT EXISTS INDEX

2014-11-06 Thread Fujii Masao
On Sat, Nov 1, 2014 at 1:56 AM, Fabrízio de Royes Mello
fabriziome...@gmail.com wrote:

 On Fri, Oct 31, 2014 at 2:46 PM, Adam Brightwell
 adam.brightw...@crunchydatasolutions.com wrote:

 All,

 FWIW, I've cleanly applied v8 of this patch to master (252e652) and
 check-world was successful.  I also successfully ran through a few manual
 test cases.


 Thanks for your review!

Applied. Thanks!

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sequence Access Method WIP

2014-11-06 Thread Craig Ringer
On 11/05/2014 05:01 AM, Petr Jelinek wrote:
 I guess I could port BDR sequences to this if it would help (once we
 have a bit more solid agreement that the proposed API at least
 theoretically seems OK, so that I don't have to rewrite it 10 times if at
 all possible).

Because the BDR sequences rely on all the other BDR machinery I suspect
that'd be a pretty big thing to review and follow for someone who
doesn't know the BDR code.

Do you think it'd be simple to provide a blocking, transactional
sequence allocator via this API? i.e. gapless sequences, much the same
as typically implemented with SELECT ... FOR UPDATE on a counter table.

It might be more digestible standalone, and would be a handy contrib/
example extension demonstrating use of the API.
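For reference, a minimal sketch of the counter-table approach I mean
(table, column and function names are just placeholders, and the counter
row is assumed to have been seeded beforehand):

  CREATE TABLE gapless_seq (seq_name text PRIMARY KEY, last_value bigint NOT NULL);

  CREATE FUNCTION gapless_nextval(p_name text) RETURNS bigint
  LANGUAGE plpgsql AS $$
  DECLARE
      v bigint;
  BEGIN
      -- The row lock serializes concurrent callers until commit/abort,
      -- which is what makes the counter transactional and gapless.
      SELECT last_value INTO v FROM gapless_seq
       WHERE seq_name = p_name FOR UPDATE;
      UPDATE gapless_seq SET last_value = v + 1 WHERE seq_name = p_name;
      RETURN v + 1;
  END;
  $$;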

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Typo in comment

2014-11-06 Thread Etsuro Fujita
I ran into a typo in a comment in src/backend/commands/matview.c.
Please find attached a patch.

Best regards,
Etsuro Fujita
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 30bd40d..db05f7c 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -473,7 +473,7 @@ transientrel_destroy(DestReceiver *self)
  * the given integer, to make a new table name based on the old one.
  *
  * This leaks memory through palloc(), which won't be cleaned up until the
- * current memory memory context is freed.
+ * current memory context is freed.
  */
 static char *
 make_temptable_name_n(char *tempname, int n)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] two dimensional statistics in Postgres

2014-11-06 Thread Gavin Flower

On 06/11/14 23:15, Katharina Büchse wrote:

Hi,

I'm a PhD student at the University of Jena, Thüringen, Germany, in
the field of databases, more precisely query optimization.
I want to implement a system in PostgreSQL that detects column 
correlations and creates statistical data about correlated columns for 
the optimizer. Therefore I need to store two dimensional statistics 
(especially two dimensional histograms) in PostgreSQL.
I had a look at the description of WIP: multivariate statistics / 
proof of concept, which looks really promising, I guess these 
statistics are based on scans of the data? For my system I need both 
-- statistical data based on table scans (actually, samples are 
enough) and those based on query feedback. Query feedback (tuple 
counts and, speaking a little inaccurately, the where-part of the 
query itself) needs to be extracted and there needs to be a decision 
for the optimizer, when to take multivariate statistics and when to 
use the one dimensional ones. Oracle in this case just disables one 
dimensional histograms if there is already a multidimensional 
histogram, but this is not always useful, especially in the case of a 
feedback based histogram (which might not cover the whole data space). 
I want to use both kinds of histograms because correlations might 
occur only in parts of the data. In this case a histogram based on a 
sample of the whole table might not get the point and wouldn't help 
for the part of the data the user seems to be interested in.
There are special data structures for storing multidimensional 
histograms based on feedback and I already tried to implement one of 
these in C. In the case of two dimensions they are of course not for 
free (one dimensional would be much cheaper), but based on the 
principle of maximum entropy they deliver really good results. I 
decided for only two dimensions because in this case we have the best 
proportion of cost and benefit when searching for correlation (here 
I'm relying on tests that were made in DB2 within a project called 
CORDS which detects correlations even between different tables).


I'd be grateful for any advice and discussion.
Regards,

Katharina


Could you store a 2 dimensional histogram in a one dimensional array: 
A[z] = value, where z = col * rowSize + row (zero starting index)?
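For example (just a sketch with hypothetical table and column names; note
that SQL arrays are 1-based, so the 0-based cell (col, row) lands at
position col * rowSize + row + 1):

  CREATE TABLE hist2d (row_size int NOT NULL, cells float8[] NOT NULL);

  -- fetch the cell for 0-based (col, row) = (2, 1)
  SELECT cells[(2 * row_size) + 1 + 1] AS cell_value FROM hist2d;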



Cheers,
Gavin




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree index builds, CLUSTER, and sortsupport

2014-11-06 Thread Andreas Karlsson

On 11/06/2014 02:30 AM, Peter Geoghegan wrote:

Thanks for the review.

On Wed, Nov 5, 2014 at 4:33 PM, Andreas Karlsson andr...@proxel.se wrote:

I looked at the changes to the code. The new code is clean and there is more
code re-use and improved readability. One possible further improvement would
be to move the preparation of SortSupport to a common function, since this is
done three times in the code.


The idea there is to have more direct control of sortsupport. With the
abbreviated keys patch, abbreviation occurs based on a decision made
by tuplesort.c. I can see why you'd say that, but I prefer to keep
initialization of sortsupport structs largely concentrated in
tuplesort.c, and more or less uniform regardless of the tuple-type
being sorted.


Ok, I am fine with either.


Is there any case where we should expect any greater performance
improvement?


The really compelling case is abbreviated keys - as you probably know,
there is a patch that builds on this patch, and the abbreviated keys
patch, so that B-Tree builds and CLUSTER can use abbreviated keys too.
That doesn't really have much to do with this patch, though. The
important point is that heap tuple sorting (with a query that has no
client overhead, and involves one big sort) and B-Tree index creation
both have their tuplesort as a totally dominant cost. The improvements
for each case should be quite comparable, which is good (except, as
noted in my opening e-mail, when heap/datum tuplesorting can use the
onlyKey optimization, while B-Tree/CLUSTER tuplesorting cannot).


Ah, I see.

--
Andreas Karlsson


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] two dimensional statistics in Postgres

2014-11-06 Thread Tomas Vondra
Hi,

On 6 November 2014, 11:15, Katharina Büchse wrote:
 Hi,

 I'm a PhD student at the University of Jena, Thüringen, Germany, in the
 field of databases, more precisely query optimization.
 I want to implement a system in PostgreSQL that detects column
 correlations and creates statistical data about correlated columns for
 the optimizer. Therefore I need to store two dimensional statistics
 (especially two dimensional histograms) in PostgreSQL.

Cool!

 I had a look at the description of WIP: multivariate statistics / proof
 of concept, which looks really promising, I guess these statistics are
 based on scans of the data? For my system I need both -- statistical

Yes, it's based on a sample of the data.

 data based on table scans (actually, samples are enough) and those based
 on query feedback. Query feedback (tuple counts and, speaking a little
 inaccurately, the where-part of the query itself) needs to be extracted
 and there needs to be a decision for the optimizer, when to take
 multivariate statistics and when to use the one dimensional ones. Oracle
 in this case just disables one dimensional histograms if there is
 already a multidimensional histogram, but this is not always useful,
 especially in the case of a feedback based histogram (which might not
 cover the whole data space). I want to use both kinds of histograms

What do you mean by not covering the whole data space? I assume that when
building a feedback-based histogram, parts of the data will be filtered out
because of WHERE clauses etc. Is that what you mean? I don't see how this
could happen for regular histograms, though.

 because correlations might occur only in parts of the data. In this case
 a histogram based on a sample of the whole table might not get the point
 and wouldn't help for the part of the data the user seems to be
 interested in.

Yeah, there may be dependencies that are difficult to spot in the whole
dataset, but emerge once you filter to a specific subset.

Now, how would that work in practice? Initially the query needs to be
planned using regular stats (because there's no feedback yet), and then -
when we decide the estimates are way off - may be re-planned using the
feedback. The feedback is inherently query-specific, so I'm not sure if
it's possible to reuse it for multiple queries (might be possible for
sufficiently similar ones).

Would this be done automatically for all queries / all conditions, or only
when specifically enabled (on a table, columns, ...)?

 There are special data structures for storing multidimensional
 histograms based on feedback and I already tried to implement one of
 these in C. In the case of two dimensions they are of course not for
 free (one dimensional would be much cheaper), but based on the
 principle of maximum entropy they deliver really good results. I decided
 for only two dimensions because in this case we have the best proportion
 of cost and benefit when searching for correlation (here I'm relying on

I think hardcoding the two-dimensions limit is wrong. I understand a higher
number of dimensions means a more expensive operation, but if the user can
influence it, I believe it's OK.

Also, is there any particular reason why not to support other kinds of
stats (say, MCV lists)? In the end it's just a different way to
approximate the distribution, and it may be way cheaper than histograms.

 tests that were made in DB2 within a project called CORDS which detects
 correlations even between different tables).

Is this somehow related to LEO? I'm not familiar with the details, but
from the description it might be related.

regards
Tomas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] two dimensional statistics in Postgres

2014-11-06 Thread Tomas Vondra
On 6 November 2014, 11:50, Gavin Flower wrote:

 Could you store a 2 dimensional histogram in a one dimensional array:
 A[z] = value, where z = col * rowSize + row (zero starting index)?

How would that work for columns with different data types?

Tomas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : REINDEX SCHEMA

2014-11-06 Thread Fujii Masao
On Fri, Oct 24, 2014 at 3:41 AM, Fabrízio de Royes Mello
fabriziome...@gmail.com wrote:
 On Wed, Oct 22, 2014 at 9:21 PM, Alvaro Herrera alvhe...@2ndquadrant.com
 wrote:

 Sawada Masahiko wrote:

  Thank you for reviewing.
  I agree 2) - 5).
  Attached patch is latest version patch I modified above.
  Also, I noticed I had forgotten to add the patch regarding document of
  reindexdb.

 Please don't use pg_catalog in the regression test.  That way we will
 need to update the expected file whenever a new catalog is added, which
 seems pointless.  Maybe create a schema with a couple of tables
 specifically for this, instead.


 Attached new regression test.

Hunk #1 FAILED at 1.
1 out of 2 hunks FAILED -- saving rejects to file
src/bin/scripts/t/090_reindexdb.pl.rej

I tried to apply the 001 patch after applying the 000, but it did not
apply cleanly.

At least to me, it's more intuitive to use SCHEMA instead of ALL IN SCHEMA
here because we already use DATABASE instead of ALL IN DATABASE.
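In other words, of the two spellings under discussion (both illustrative
only, since the final syntax isn't settled), the first seems more
consistent to me:

  REINDEX SCHEMA public;          -- matches REINDEX DATABASE mydb;
  REINDEX ALL IN SCHEMA public;   -- the alternative spelling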

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Typo in comment

2014-11-06 Thread Fujii Masao
On Thu, Nov 6, 2014 at 7:44 PM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
 I ran into a typo in a comment in src/backend/commands/matview.c.
 Please find attached a patch.

Thanks! Applied.


 Best regards,
 Etsuro Fujita


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] two dimensional statistics in Postgres

2014-11-06 Thread Gavin Flower

On 06/11/14 23:57, Tomas Vondra wrote:

On 6 November 2014, 11:50, Gavin Flower wrote:

Could you store a 2 dimensional histogram in a one dimensional array:
A[z] = value, where z = col * rowSize + row (zero starting index)?

How would that work for columns with different data types?

Tomas

I implicitly assumed that all cells were the same size & type. However, 
I could devise a scheme to cope with columns of different types, given 
the relevant definitions - but this would obviously be more complicated.


Also this method can be extended into higher dimensions.


Cheers,
Gavin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Order of views in stats docs

2014-11-06 Thread Magnus Hagander
On Wed, Nov 5, 2014 at 4:57 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Magnus Hagander mag...@hagander.net writes:
 http://www.postgresql.org/docs/9.3/static/monitoring-stats.html, table 27-1.
 Can somebody find or explain the order of the views in there? It's not
 actually alphabetical, but it's also not logical. In particular, what
 is pg_stat_replication doing second to last?

 I would suggest we move pg_stat_replication up to directly under
 pg_stat_activity, and move pg_stat_database_conflicts up to directly
 under pg_stat_database. I think the rest makes reasonable sense.

 Any objections to this? Can anybody spot a reason for why they are
 where they are other than that it was just appended to the end of the
 table without realizing the order that I'm missing now and am about to
 break?

 I agree that the last two items seem to be suffering from blindly-add-
 it-to-the-end syndrome, which is a disease that runs rampant around here.

 However, should we consider the possibility of changing the table to
 straight alphabetical ordering?  I'm not as much in love with that
 approach as some folks, but it does have the merit that it's always clear
 where you ought to put a new item.  This would result in grouping the
  "all", "sys", and "user" views separately, rather than grouping those
 variants of a view together ... but on reflection I'm not sure that
 that'd be totally horrible.

That would at least make it very predictable, yes.

Another thought I had in that case is maybe we need to break out the
pg_stat_activity and pg_stat_replication views into their own table.
They are really the only two views that are different in a lot of
ways. Maybe call the splits "session statistics views" and "object
statistics views", instead of just "standard statistics views"? The
difference being that they show snapshots of *right now*, whereas all
the other views show accumulated statistics over time.
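For example, the distinction is visible in how the views are typically
queried (nothing special here, just the standard views):

  -- snapshot of what is happening right now:
  SELECT pid, state, query FROM pg_stat_activity;

  -- counters accumulated since the last statistics reset:
  SELECT relname, seq_scan, n_tup_ins FROM pg_stat_user_tables;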

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] two dimensional statistics in Postgres

2014-11-06 Thread Tomas Vondra
On 6 November 2014, 12:05, Gavin Flower wrote:
 On 06/11/14 23:57, Tomas Vondra wrote:
 Dne 6 Listopad 2014, 11:50, Gavin Flower napsal(a):
 Could you store a 2 dimensional histogram in a one dimensional array:
 A[z] = value, where z = col * rowSize + row (zero starting index)?
 How would that work for columns with different data types?

 Tomas

 I implicitly assumed that all cells were the same size  type. However,
 I could devise a scheme to cope with columns of different types, given
 the relevant definitions - but this would obviously be more complicated.

 Also this method can be extended into higher dimensions.

Which is what I did in the WIP: multivariate stats ;-)

Tomas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start

2014-11-06 Thread Álvaro Hernández Tortosa


On 06/11/14 02:06, Jim Nasby wrote:

On 11/5/14, 6:04 PM, Álvaro Hernández Tortosa wrote:


On 05/11/14 17:46, Jim Nasby wrote:

On 11/4/14, 6:11 PM, Álvaro Hernández Tortosa wrote:
 Should we then improve the docs to state this more clearly? Any
objection to doing this?


If we go that route we should also mention that now() will no longer 
be doing what you probably hope it would (AFAIK it's driven by BEGIN 
and not the first snapshot).


 If I understand you correctly, you mean that if we add a note to 
the documentation stating that the transaction really freezes when 
you do the first query, people would expect now() to be also frozen 
when the first query is done, which is not what happens (it's frozen 
at tx start). Then, yes, you're right, probably *both* the isolation 
levels and the now() function documentation should be patched to 
become more precise.


Bingo.

Hrm, is there anything else that differs between the two?

Perhaps we should change how now() works, but I'm worried about what 
that might do to existing applications...


 Perhaps, I also believe it might not be good for existing 
applications, but it definitely has a different freeze behavior, 
which seems inconsistent too.


Yeah, I'd really rather fix it...


There have been two comments which seem to state that changing this 
may introduce some performance problems and some limitations when you 
need to take out some locks. I still believe, however, that the current 
behavior is confusing for the user. Sure, one option is to patch the 
documentation, as I was suggesting.


But what about creating a flag for the BEGIN and SET TRANSACTION 
commands, called IMMEDIATE FREEZE (or something similar), which 
applies only to REPEATABLE READ and SERIALIZABLE? If this flag is set 
(it may be off by default, but of course the default may be 
configurable via a GUC parameter), the freeze happens at BEGIN or 
SET TRANSACTION time. This would be a backwards-compatible 
change, while still providing the option of freezing without the nasty 
hack of having to do a SELECT 1 prior to your real queries, and 
everything will of course be well documented.
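Purely as an illustration of what I have in mind (this is only a proposal,
none of it exists today):

  -- proposed: take the snapshot at BEGIN time rather than at the first query
  BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE IMMEDIATE FREEZE;
  -- ... real queries ...
  COMMIT;

  -- today's workaround: force the snapshot with a dummy statement
  BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  SELECT 1;   -- snapshot is taken here, not at BEGIN
  -- ... real queries ...
  COMMIT;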


What do you think?


Best regards,

Álvaro

--
Álvaro Hernández Tortosa


---
8Kdata



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format and API changes (9.5)

2014-11-06 Thread Heikki Linnakangas

On 11/05/2014 11:30 AM, Andres Freund wrote:

On 2014-11-04 18:33:34 +0200, Heikki Linnakangas wrote:

On 10/30/2014 06:02 PM, Andres Freund wrote:

On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:

On 10/06/2014 02:29 PM, Andres Freund wrote:

I've not yet really looked,
but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
make me happy...


Ok.. Can you elaborate?


To me the split between xloginsert.c doing some of the record assembly,
and xlog.c doing the lower level part of the assembly is just wrong.


I moved the checks for bootstrap mode and xl_len == 0, from the current
XLogInsert to the new XLogInsert() function. I don't imagine that to be
enough to address your feelings about the split between xlog.c and
xloginsert.c, but it makes it a tiny bit clearer IMHO. I don't know what else
to do about it, as it feels very sensible to me as it is now. So unless you
object more loudly, I'm going to just go ahead with this and commit,
probably tomorrow.


I'm still not particularly happy with the split. But I think this patch
is enough of a win anyhow. So go ahead.


Committed. Now, to the main patch...

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2014-11-06 Thread Abhijit Menon-Sen
At 2014-10-30 14:30:27 +0530, a...@2ndquadrant.com wrote:

 Here's a proposed patch to initdb to make initdb -S fsync everything
 under pg_tblspc.

Oops, I meant to include the corresponding patch to xlog.c to do the
same at startup. It's based on the initdb patch, but modified to not
use fprintf/exit_nicely and so on. (Note that this was written in a
single chunk to aid backpatching. There's no attempt made to share
code in this set of patches.)

Now attached.

-- Abhijit
From ae91da4df7ee60e6f83c98801e979929442f588a Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Thu, 6 Nov 2014 00:45:56 +0530
Subject: If we need to perform crash recovery, fsync PGDATA recursively

This is so that we don't lose older unflushed writes in the event of
a power failure after crash recovery, where more recent writes are
preserved.

See 20140918083148.ga17...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c | 184 ++
 1 file changed, 184 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ab04380..ef95f64 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -830,6 +830,12 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static void pre_sync_fname(char *fname, bool isdir);
+static void walkdir(char *path, void (*action) (char *fname, bool isdir));
+static void walktblspc_links(char *path,
+			 void (*action) (char *fname, bool isdir));
+static void perform_fsync(char *pg_data);
+
 /*
  * Insert an XLOG record having the specified RMID and info bytes,
  * with the body of the record being the data chunk(s) described by
@@ -5981,6 +5987,19 @@ StartupXLOG(void)
 		ereport(FATAL,
 				(errmsg("control file contains invalid data")));
 
+	/*
+	 * If we need to perform crash recovery, we issue an fsync on the
+	 * data directory and its contents to try to ensure that any data
+	 * written before the crash are flushed to disk. Otherwise a power
+	 * failure in the near future might cause earlier unflushed writes
+	 * to be lost, even though more recent data written to disk from
+	 * here on would be persisted.
+	 */
+
+	if (ControlFile->state != DB_SHUTDOWNED &&
+		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+		perform_fsync(data_directory);
+
 	if (ControlFile->state == DB_SHUTDOWNED)
 	{
 		/* This is the expected case, so don't be chatty in standalone mode */
@@ -11262,3 +11281,168 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&XLogCtl->info_lck);
 }
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Adapted from pre_sync_fname in initdb.c
+ */
+static void
+pre_sync_fname(char *fname, bool isdir)
+{
+#if defined(HAVE_SYNC_FILE_RANGE) || \
+	(defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED))
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES)
+	 */
+	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
+		return;
+
+	if (fd < 0)
+		ereport(FATAL,
+				(errmsg("could not open file \"%s\" before fsync",
+						fname)));
+
+	/*
+	 * Prefer sync_file_range, else use posix_fadvise.  We ignore any error
+	 * here since this operation is only a hint anyway.
+	 */
+#if defined(HAVE_SYNC_FILE_RANGE)
+	sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+#endif
+
+	close(fd);
+#endif
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself).
+ *
+ * Adapted from copydir() in copydir.c.
+ */
+static void
+walkdir(char *path, void (*action) (char *fname, bool isdir))
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir(path);
+	while ((de = ReadDir(dir, path)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
+
+		if (lstat(subpath, &fst) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not stat file \"%s\": %m", subpath)));
+
+		if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action);
+		else if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false);
+	}
+	FreeDir(dir);
+
+	/*
+	 * It's important to fsync the destination directory itself as individual
+	 * file fsyncs don't guarantee that the directory entry for the file is
+	 * synced.  Recent versions of ext4 have made the window much wider but
+	 * it's been an issue for ext3 and other filesystems in the past.
+	 */
+	(*action) (path, true);
+}
+
+/*
+ * walktblspc_links: call walkdir on each entry under the given
+ * 

Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start

2014-11-06 Thread Kevin Grittner
Álvaro Hernández Tortosa a...@8kdata.com wrote:

 There have been two comments which seem to state that changing this
 may introduce some performance problems and some limitations when you
 need to take out some locks. I still believe, however, that the current
 behavior is confusing for the user. Sure, one option is to patch the
 documentation, as I was suggesting.

Yeah, I thought that's what we were talking about, and in that
regard I agree that the docs could be more clear.  I'm not quite
sure what to say, or where, to fix that, but I can see how someone could
be confused and have the expectation that once they have run BEGIN
TRANSACTION ISOLATION LEVEL SERIALIZABLE the transaction will not
see the work of transactions committing after that.  The fact that
this is possible is implied, if one reads carefully and thinks
about it, by the statement right near the start of the Transaction
Isolation section which says "any concurrent execution of a set of
Serializable transactions is guaranteed to produce the same effect
as running them one at a time in some order."  As Robert pointed
out, this is not necessarily the commit order or the transaction
start order.

It is entirely possible that if you have serializable transactions
T1 and T2, where T1 executes BEGIN first (and even runs a query
before T2 executes BEGIN) and T1 commits first, that T2 will
appear to have run first because it will look at a set of data
which T1 modifies and not see the changes.  If T1 were to *also*
look at a set of data which T2 modifies, then one of the
transactions would be rolled back with a serialization failure, to
prevent a cycle in the apparent order of execution; so the
requirements of the standard (and of most software which is
attempting to handle race conditions) are satisfied.  For many
popular benchmarks (and I suspect most common workloads) this
provides the necessary protections with better performance than is
possible using blocking to provide the required guarantees.[1]
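A compact way to see that scenario (session labels T1/T2; table and data
are invented, and the interleaving is wall-clock order from top to bottom):

  T1: BEGIN ISOLATION LEVEL SERIALIZABLE;
  T1: UPDATE stock SET qty = qty - 1 WHERE item = 'widget';
  T2: BEGIN ISOLATION LEVEL SERIALIZABLE;
  T2: SELECT qty FROM stock WHERE item = 'widget';  -- does not see T1's update
  T1: COMMIT;
  T2: COMMIT;  -- both succeed; the apparent order of execution is T2, then T1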

At any rate, the language in that section is a little fuzzy on the
concept of the start of the transaction.  Perhaps it would be
enough to change language like:

| sees a snapshot as of the start of the transaction, not as of the
| start of the current query within the transaction.

to:

| sees a snapshot as of the start of the first query within the
| transaction, not as of the start of the current query within the
| transaction.

Would that have prevented the confusion here?

 But what about creating a flag for the BEGIN and SET TRANSACTION
 commands, called IMMEDIATE FREEZE (or something similar), which
 applies only to REPEATABLE READ and SERIALIZABLE? If this flag is set
 (it may be off by default, but of course the default may be
 configurable via a GUC parameter), the freeze happens at BEGIN or
 SET TRANSACTION time. This would be a backwards-compatible
 change, while still providing the option of freezing without the nasty
 hack of having to do a SELECT 1 prior to your real queries, and
 everything will of course be well documented.

What is the use case where you are having a problem?  This seems
like an odd solution, so it would be helpful to know what problem
it is attempting to solve.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Dan R. K. Ports and Kevin Grittner.  Serializable Snapshot
Isolation in PostgreSQL.  In VLDB, pages 1850--1861, 2012.
http://vldb.org/pvldb/vol5/p1850_danrkports_vldb2012.pdf
(see section 8 for performance graphs and numbers)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Order of views in stats docs

2014-11-06 Thread Peter Eisentraut
On 11/6/14 6:16 AM, Magnus Hagander wrote:
 Another thought I had in that case is maybe we need to break out the
 pg_stat_activity and pg_stat_replication views into their own table.
 They are really the only two views that are different in a lot of
 ways. Maybe call the splits session statistics views and object
 statistics views, instead of just standard statistics views? The
 difference being that they show snapshots of *right now*, whereas all
 the other views show accumulated statistics over time.

Yeah, splitting this up a bit would help, too.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start

2014-11-06 Thread Greg Sabino Mullane

-BEGIN PGP SIGNED MESSAGE-
Hash: RIPEMD160

Kevin Grittner wrote:

(wording change suggestion)
 | sees a snapshot as of the start of the first query within the
 | transaction, not as of the start of the current query within the
 | transaction.

 Would that have prevented the confusion here?

I think it may have, but I also think the wording should be much 
stronger and clearer, as this is unintuitive behavior. Consider 
this snippet from Bruce's excellent "MVCC Unmasked" presentation:

"A snapshot is recorded at the start of each SQL statement in 
READ COMMITTED transaction isolation mode, and at transaction start 
in SERIALIZABLE transaction isolation mode."

This is both correct and incorrect, depending on whether you consider 
a transaction to start with BEGIN; or with the first statement 
after the BEGIN. :) I think most people have always assumed that 
BEGIN starts the transaction and that is the point at which the snapshot 
is obtained.
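A quick way to see the actual behavior (table name invented; two separate
sessions, reading top to bottom in wall-clock order):

  -- session A:
  BEGIN ISOLATION LEVEL REPEATABLE READ;
  -- session B, in another connection, after A's BEGIN:
  INSERT INTO t VALUES (1);
  -- session A again:
  SELECT count(*) FROM t;   -- sees B's row: the snapshot was only taken
                            -- here, at the first query, not at BEGIN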

 But what about creating a flag for the BEGIN and SET TRANSACTION
 commands, called IMMEDIATE FREEZE (or something similar), which
 applies only to REPEATABLE READ and SERIALIZABLE? If this flag is set
 (it may be off by default, but of course the default may be
 configurable via a GUC parameter), the freeze happens at BEGIN or
 SET TRANSACTION time. This would be a backwards-compatible
 change, while still providing the option of freezing without the nasty
 hack of having to do a SELECT 1 prior to your real queries, and
 everything will of course be well documented.

 What is the use case where you are having a problem?  This seems
 like an odd solution, so it would be helpful to know what problem
 it is attempting to solve.

Seems like a decent solution to me. The problem is that having to execute 
a dummy SQL statement to start a serializable transaction, rather 
than simply a BEGIN, is ugly and error-prone. Perhaps their app 
assumes (or even requires) that BEGIN starts the snapshot.

- -- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201411060922
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iEYEAREDAAYFAlRbhD4ACgkQvJuQZxSWSsg/kwCdE9E+d3jDDpLOo4+08wCOMMxE
EHkAnj4uMO8cY6Jl0R19C/6lE6n3bae5
=syg9
-END PGP SIGNATURE-




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-11-06 Thread Fujii Masao
On Tue, Nov 4, 2014 at 12:04 PM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
 IIUC, I think that min = 0 disables fast update, so ISTM that it'd be
 appropriate to set min to some positive value.  And ISTM that the idea of
 using the min value of work_mem is not so bad.

OK. I changed the min value to 64kB.

 *** 356,361  CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ <replaceable
 class="parameter">name</>
 --- 356,372 
   </listitem>
  </varlistentry>
  </variablelist>
 +<variablelist>
 +<varlistentry>
 + <term><literal>PENDING_LIST_CLEANUP_SIZE</></term>

 The above is still in uppercase.

Fixed.

Attached is the updated version of the patch. Thanks for the review!
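For the archives, usage as I understand the patch intends it would look
roughly like this (the GUC name is taken from the patch; the storage
parameter spelling and unit are my assumption and could still change):

  -- server-wide or per-session default:
  SET pending_list_cleanup_size = '1MB';

  -- per-index override via a storage parameter (value assumed to be in kB):
  CREATE INDEX t_doc_gin ON t USING gin (doc_tsv)
    WITH (pending_list_cleanup_size = 512);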

Regards,

-- 
Fujii Masao
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 5911,5916  SET XML OPTION { DOCUMENT | CONTENT };
--- 5911,5937 
    </listitem>
   </varlistentry>
  
+  <varlistentry id="guc-pending-list-cleanup-size" xreflabel="pending_list_cleanup_size">
+   <term><varname>pending_list_cleanup_size</varname> (<type>integer</type>)
+   <indexterm>
+    <primary><varname>pending_list_cleanup_size</> configuration parameter</primary>
+   </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Sets the maximum size of the GIN pending list which is used
+     when <literal>fastupdate</> is enabled. If the list grows
+     larger than this maximum size, it is cleaned up by moving
+     the entries in it to the main GIN data structure in bulk.
+     The default is four megabytes (<literal>4MB</>). This setting
+     can be overridden for individual GIN indexes by changing
+     storage parameters.
+     See <xref linkend="gin-fast-update"> and <xref linkend="gin-tips">
+     for more information.
+    </para>
+   </listitem>
+  </varlistentry>
+ 
   </variablelist>
  </sect2>
   <sect2 id="runtime-config-client-format">
*** a/doc/src/sgml/gin.sgml
--- b/doc/src/sgml/gin.sgml
***
*** 728,735 
  from the indexed item). As of <productname>PostgreSQL</productname> 8.4,
  <acronym>GIN</> is capable of postponing much of this work by inserting
  new tuples into a temporary, unsorted list of pending entries.
! When the table is vacuumed, or if the pending list becomes too large
! (larger than <xref linkend="guc-work-mem">), the entries are moved to the
  main <acronym>GIN</acronym> data structure using the same bulk insert
  techniques used during initial index creation.  This greatly improves
  <acronym>GIN</acronym> index update speed, even counting the additional
--- 728,735 
  from the indexed item). As of <productname>PostgreSQL</productname> 8.4,
  <acronym>GIN</> is capable of postponing much of this work by inserting
  new tuples into a temporary, unsorted list of pending entries.
! When the table is vacuumed, or if the pending list becomes larger than
! <xref linkend="guc-pending-list-cleanup-size">, the entries are moved to the
  main <acronym>GIN</acronym> data structure using the same bulk insert
  techniques used during initial index creation.  This greatly improves
  <acronym>GIN</acronym> index update speed, even counting the additional
***
*** 750,756 
 <para>
  If consistent response time is more important than update speed,
  use of pending entries can be disabled by turning off the
! <literal>FASTUPDATE</literal> storage parameter for a
  <acronym>GIN</acronym> index.  See <xref linkend="sql-createindex">
  for details.
 </para>
--- 750,756 
 <para>
  If consistent response time is more important than update speed,
  use of pending entries can be disabled by turning off the
! <literal>fastupdate</literal> storage parameter for a
  <acronym>GIN</acronym> index.  See <xref linkend="sql-createindex">
  for details.
 </para>
***
*** 812,829 
 </varlistentry>
 
 <varlistentry>
! <term><xref linkend="guc-work-mem"></term>
  <listitem>
   <para>
    During a series of insertions into an existing <acronym>GIN</acronym>
!   index that has <literal>FASTUPDATE</> enabled, the system will clean up
    the pending-entry list whenever the list grows larger than
!   <varname>work_mem</>.  To avoid fluctuations in observed response time,
!   it's desirable to have pending-list cleanup occur in the background
!   (i.e., via autovacuum).  Foreground cleanup operations can be avoided by
!   increasing <varname>work_mem</> or making autovacuum more aggressive.
!   However, enlarging <varname>work_mem</> means that if a foreground
!   cleanup does occur, it will take even longer.
   </para>
  </listitem>
 </varlistentry>
--- 812,837 
 </varlistentry>
 
 <varlistentry>
! <term><xref linkend="guc-pending-list-cleanup-size"></term>
  <listitem>
   <para>
    During a series of insertions into an existing <acronym>GIN</acronym>
!   index that has <literal>fastupdate</> enabled, the system will clean up
    the pending-entry list whenever the list grows larger than
!   

Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start

2014-11-06 Thread Kevin Grittner
Greg Sabino Mullane g...@turnstep.com wrote:
 Kevin Grittner wrote:

 (wording change suggestion)
 | sees a snapshot as of the start of the first query within the
 | transaction, not as of the start of the current query within the
 | transaction.

 Would that have prevented the confusion here?

 I think it may have, but I also think the wording should be much
 stronger and clearer, as this is unintuitive behavior. Consider
 this snippet from Bruce's excellent MVCC Unmasked presentation:

 A snapshot is recorded at the start of each SQL statement in
 READ COMMITTED transaction isolation mode, and at transaction start
 in SERIALIZABLE transaction isolation mode.

 This is both correct and incorrect, depending on whether you consider
 a transaction to start with BEGIN; or with the first statement
 after the BEGIN. :) I think most people have always assumed that
 BEGIN starts the transaction and that is the point at which the snapshot
 is obtained.

But there is so much evidence to the contrary.  Not only does the
*name* of the command (BEGIN or START) imply a start, but
pg_stat_activity shows the connection "idle in transaction" after
the command (and before a snapshot is acquired), the Explicit
Locking section of the docs asserts that "Once acquired, a lock is
normally held till end of transaction", and the docs for the SET
command assert that "The effects of SET LOCAL last only till the
end of the current transaction, whether committed or not".  The end
of *which* transaction?  The one that started with BEGIN or START
and which might not (or in some cases *must* not) yet have a
snapshot.

But what about creating a flag for the BEGIN and SET TRANSACTION
 commands, called IMMEDIATE FREEZE (or something similar), which
 applies only to REPEATABLE READ and SERIALIZABLE? If this flag is set
 (it may be off by default, but of course the default may be
 configurable via a GUC parameter), the freeze happens at BEGIN or
 SET TRANSACTION time. This would be a backwards-compatible
 change, while still providing the option of freezing without the nasty
 hack of having to do a SELECT 1 prior to your real queries, and
 everything will of course be well documented.

 What is the use case where you are having a problem?  This seems
 like an odd solution, so it would be helpful to know what problem
 it is attempting to solve.

 Seems like a decent solution to me. The problem is that having to execute
 a dummy SQL statement to start a serializable transaction, rather
 than simply a BEGIN, is ugly and error-prone. Perhaps their app
 assumes (or even requires) that BEGIN starts the snapshot.

Why?  This fix might not deal with the bigger issues that I
discussed, like that the later-to-start and
later-to-acquire-a-snapshot transaction might logically be first in
the apparent order of execution.  You can't fix that without a
lot of blocking -- that most of us don't want.  Depending on *why*
they think this is important, they might need to be acquiring
various locks to prevent behavior they don't want, in which case
having acquired a snapshot at BEGIN would be exactly the *wrong*
thing to do.  The exact nature of the problem we're trying to solve
here does matter.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REINDEX CONCURRENTLY 2.0

2014-11-06 Thread Peter Eisentraut
On 10/1/14 3:00 AM, Michael Paquier wrote:
 - Use of AccessExclusiveLock when swapping relfilenodes of an index and
 its concurrent entry instead of ShareUpdateExclusiveLock for safety. At
 the limit of my understanding, that's the consensus reached until now.

I'm very curious about this point.  I looked through all the previous
discussions, and the only time I saw this mentioned was at the very
beginning when it was said that we could review the patch while ignoring
this issue and fix it later with MVCC catalog access.  Then it got very
technical, but it was never explicitly concluded whether it was possible
to fix this or not.

Also, in the thread Concurrently option for reindexdb you pointed out
that requiring an exclusive lock isn't really concurrent and proposed an
option like --minimum-locks.

I will point out again that we specifically invented DROP INDEX
CONCURRENTLY because holding an exclusive lock even briefly isn't good
enough.

If REINDEX cannot work without an exclusive lock, we should invent some
other qualifier, like WITH FEWER LOCKS.  It's still useful, but we
shouldn't give people the idea that they can throw away their custom
CREATE INDEX CONCURRENTLY + DROP INDEX CONCURRENTLY scripts.
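For context, the kind of custom script I mean is roughly the following
(index and table names invented; this doesn't cover indexes that back
constraints):

  CREATE INDEX CONCURRENTLY foo_col_idx_new ON foo (col);
  DROP INDEX CONCURRENTLY foo_col_idx;
  ALTER INDEX foo_col_idx_new RENAME TO foo_col_idx;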



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start

2014-11-06 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 Why?  This fix might not deal with the bigger issues that I
 discussed, like that the later-to-start and
 later-to-acquire-a-snapshot transaction might logically be first in
 the apparent order of execution.  You can't fix that without a
 lot of blocking -- that most of us don't want.  Depending on *why*
 they think this is important, they might need to be acquiring
 various locks to prevent behavior they don't want, in which case
 having acquired a snapshot at BEGIN would be exactly the *wrong*
 thing to do.  The exact nature of the problem we're trying to solve
 here does matter.

Another thing that I think hasn't been mentioned in this thread is
that we used to have severe problems with client libraries that like
to issue BEGIN and then go idle until they have something to do.
Which, for some reason, is a prevalent behavior.  That used to result
in problems like VACUUM not being able to clean up dead rows promptly.
We fixed that some time ago by making sure we didn't acquire an XID until
the first actual statement after BEGIN.  Snapshots as such were never a
problem for this, because we've never acquired a snapshot immediately at
BEGIN ... but if we did so, this problem would come right back.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Amazon Redshift

2014-11-06 Thread Bernd Helmle



--On 5. November 2014 23:36:03 +0100 philip taylor philta...@mail.com 
wrote:



Date Functions

http://docs.aws.amazon.com/redshift/latest/dg/r_ADD_MONTHS.html
http://docs.aws.amazon.com/redshift/latest/dg/r_DATEADD_function.html
http://docs.aws.amazon.com/redshift/latest/dg/r_DATEDIFF_function.html
http://docs.aws.amazon.com/redshift/latest/dg/r_LAST_DAY.html
http://docs.aws.amazon.com/redshift/latest/dg/r_MONTHS_BETWEEN_function.html
http://docs.aws.amazon.com/redshift/latest/dg/r_NEXT_DAY.html


FYI, except for DATEDIFF(), all of these functions can be installed by using 
the orafce extension, I believe.
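For example, after CREATE EXTENSION orafce, something along these lines
should work (assuming the functions end up on the search_path; I haven't
double-checked every signature):

  SELECT add_months(date '2014-11-06', 3);       -- expected: 2015-02-06
  SELECT last_day(date '2014-11-06');            -- expected: 2014-11-30
  SELECT months_between(date '2015-02-06', date '2014-11-06');
  SELECT next_day(date '2014-11-06', 'friday');  -- expected: 2014-11-07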


--
Thanks

Bernd


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format and API changes (9.5)

2014-11-06 Thread Heikki Linnakangas
Replying to some of your comments below. The rest were trivial issues 
that I'll just fix.


On 10/30/2014 09:19 PM, Andres Freund wrote:

* Is it really a good idea to separate XLogRegisterBufData() from
   XLogRegisterBuffer() the way you've done it?  If we ever actually get
   a record with a large number of blocks touched this essentially is
   O(touched_buffers*data_entries).


Are you worried that the linear search in XLogRegisterBufData(), to find 
the right registered_buffer struct, might be expensive? If that ever 
becomes a problem, a simple fix would be to start the linear search from 
the end, and make sure that when you touch a large number of blocks, you 
do all the XLogRegisterBufData() calls right after the corresponding 
XLogRegisterBuffer() call.


I've also though about having XLogRegisterBuffer() return the pointer to 
the struct, and passing it as argument to XLogRegisterBufData. So the 
pattern in WAL generating code would be like:


registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have 
potential to turn XLogRegisterBufData into a macro or inline function, 
to eliminate the function call overhead. I played with that a little 
bit, but the difference in performance was so small that it didn't seem 
worth it. But passing the registered_buffer pointer, like above, might 
not be a bad thing anyway.



* There's lots of functions like XLogRecHasBlockRef() that dig through
   the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
 XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
 oldblk = newblk;

   I think doing that repeatedly is quite a bad idea. We should parse the
   record once and then use it in a sensible format. Not do it in pieces,
   over and over again. It's not like we ignore backup blocks - so doing
   this lazily doesn't make sense to me.
   Especially as ValidXLogRecord() *already* has parsed the whole damn
   thing once.


Hmm. Adding some kind of a parsed XLogRecord representation would need a 
fair amount of new infrastructure. Vast majority of WAL records contain 
one, maybe two, block references, so it's not that expensive to find the 
right one, even if you do it several times.


I ran a quick performance test on WAL replay performance yesterday. I 
ran pgbench for 100 transactions with WAL archiving enabled, and 
measured the time it took to replay the generated WAL. I did that with 
and without the patch, and I didn't see any big difference in replay 
times. I also ran perf on the startup process, and the profiles looked 
identical. I'll do more comprehensive testing later, with different 
index types, but I'm convinced that there is no performance issue in 
replay that we'd need to worry about.


If it mattered, a simple optimization to the above pattern would be to 
have XLogRecGetBlockTag return true/false, indicating if the block 
reference existed at all. So you'd do:


if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
    oldblk = newblk;


On the other hand, decomposing the WAL record into parts, and passing 
the decomposed representation to the redo routines would allow us to 
pack the WAL record format more tightly, as accessing the different 
parts of the on-disk format wouldn't then need to be particularly fast. 
For example, I've been thinking that it would be nice to get rid of the 
alignment padding in XLogRecord, and between the per-buffer data 
portions. We could copy the data to aligned addresses as part of the 
decomposition or parsing of the WAL record, so that the redo routines 
could still assume aligned access.



* I wonder if it wouldn't be worthwhile, for the benefit of the FPI
   compression patch, to keep the bkp block data after *all* the
   headers. That'd make it easier to just compress the data.


Maybe. If we do that, I'd also be inclined to move all the bkp block 
headers to the beginning of the WAL record, just after the XLogInsert 
struct. Somehow it feels weird to have a bunch of header structs 
sandwiched between the rmgr-data and per-buffer data. Also, 4-byte 
alignment is enough for the XLogRecordBlockData struct, so that would be 
an easy way to make use of the space currently wasted for alignment 
padding in XLogRecord.



* I think heap_xlog_update is buggy for wal_level=logical because it
   computes the length of the tuple using
   tuplen = recdataend - recdata;
   But the old primary key/old tuple value might be stored there as
   well. Afaics that code has to continue to use xl_heap_header_len.


No, the old primary key or tuple is stored in the main data portion. 
That tuplen computation is done on backup block 0's data.



* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
   to get rid of:
+   /*
+* The new tuple is normally stored as buffer 0's data. But if
+* XLOG_HEAP_CONTAINS_NEW_TUPLE 

Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-11-06 Thread Robert Haas
On Wed, Nov 5, 2014 at 9:26 PM, Robert Haas robertmh...@gmail.com wrote:
 Yes, I think that's probably a net improvement in robustness quite
 apart from what we decide to do about any of the rest of this.  I've
 attached it here as revise-procglobal-tracking.patch and will commit
 that bit if nobody objects.  The remainder is reattached without
 change as group-locking-v0.1.patch.

 Per your other comment, I've developed the beginnings of a testing
 framework which I attached here as test_group_locking-v0.patch.

Urk.  I attached a version with some stupid bugs.  Here's a slightly better one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/contrib/test_group_locking/Makefile b/contrib/test_group_locking/Makefile
new file mode 100644
index 000..2d09341
--- /dev/null
+++ b/contrib/test_group_locking/Makefile
@@ -0,0 +1,21 @@
+# contrib/test_group_locking/Makefile
+
+MODULE_big = test_group_locking
+OBJS = test_group_locking.o $(WIN32RES)
+PGFILEDESC = "test_group_locking - test harness for group locking"
+
+EXTENSION = test_group_locking
+DATA = test_group_locking--1.0.sql
+
+REGRESS = test_group_locking
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/test_group_locking
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/test_group_locking/test_group_locking--1.0.sql b/contrib/test_group_locking/test_group_locking--1.0.sql
new file mode 100644
index 000..adb2be5
--- /dev/null
+++ b/contrib/test_group_locking/test_group_locking--1.0.sql
@@ -0,0 +1,8 @@
+/* contrib/test_group_locking/test_group_locking--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_group_locking" to load this file. \quit
+
+CREATE FUNCTION test_group_locking(spec pg_catalog.text)
+RETURNS pg_catalog.void STRICT
+	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/contrib/test_group_locking/test_group_locking.c b/contrib/test_group_locking/test_group_locking.c
new file mode 100644
index 000..904beff
--- /dev/null
+++ b/contrib/test_group_locking/test_group_locking.c
@@ -0,0 +1,1071 @@
+/*--
+ *
+ * test_group_locking.c
+ *		Test harness code for group locking.
+ *
+ * Copyright (C) 2013, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/test_shm_mq/test.c
+ *
+ * -
+ */
+
+#include "postgres.h"
+
+#include "access/xact.h"
+#include "catalog/namespace.h"
+#include "commands/dbcommands.h"
+#include "fmgr.h"
+#include "lib/ilist.h"
+#include "lib/stringinfo.h"
+#include "libpq/libpq.h"
+#include "libpq/pqformat.h"
+#include "libpq/pqmq.h"
+#include "mb/pg_wchar.h"
+#include "miscadmin.h"
+#include "nodes/makefuncs.h"
+#include "parser/scansup.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/procarray.h"
+#include "storage/shm_mq.h"
+#include "storage/shm_toc.h"
+#include "utils/builtins.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_group_locking);
+
+void test_group_locking_worker_main(Datum main_arg);
+
+/* Names of lock modes, for debug printouts */
+static const char *const lock_mode_names[] =
+{
+	"INVALID",
+	"AccessShareLock",
+	"RowShareLock",
+	"RowExclusiveLock",
+	"ShareUpdateExclusiveLock",
+	"ShareLock",
+	"ShareRowExclusiveLock",
+	"ExclusiveLock",
+	"AccessExclusiveLock"
+};
+
+typedef enum
+{
+	TGL_START,
+	TGL_STOP,
+	TGL_LOCK,
+	TGL_UNLOCK
+} TestGroupLockOp;
+
+typedef struct
+{
+	TestGroupLockOp op;
+	LOCKMODE lockmode;
+	Oid	relid;
+} TestGroupLockCommand;
+
+typedef struct
+{
+	dlist_node node;
+	bool verified;
+	int	group_id;
+	int task_id;
+	TestGroupLockCommand command;
+} TestGroupLockStep;
+
+typedef struct
+{
+	int group_id;
+	int task_id;
+} worker_key;
+
+typedef struct
+{
+	worker_key	key;
+	dsm_segment *seg;
+	BackgroundWorkerHandle *handle;
+	shm_mq_handle *requesth;
+	shm_mq_handle *responseh;
+	bool awaiting_response;
+} worker_info;
+
+typedef struct
+{
+	int	group_id;
+	int	leader_task_id;
+	bool has_followers;
+} leader_info;
+
+/* Fixed-size data passed via our dynamic shared memory segment. */
+typedef struct worker_fixed_data
+{
+	Oid	database_id;
+	Oid	authenticated_user_id;
+	NameData	database;
+	NameData	authenticated_user;
+	bool		use_group_locking;
+	pid_t		leader_pid;
+} worker_fixed_data;
+
+#define SHM_QUEUE_SIZE	32768
+#define TEST_GROUP_LOCKING_MAGIC		0x4c616e65
+
+static void check_for_messages(HTAB *worker_hash);
+static void determine_leader_info(dlist_head *plan, HTAB *leader_hash);
+static void handle_sigterm(SIGNAL_ARGS);
+static void process_message(HTAB *worker_hash, worker_info *info,
+char *message, Size message_bytes);
+static void rationalize_steps(dlist_head 

Re: [HACKERS] split builtins.h to quote.h

2014-11-06 Thread Julien Rouhaud

On 14/10/2014 10:00, Michael Paquier wrote:
 On Mon, Oct 13, 2014 at 11:14 PM, Robert Haas
 robertmh...@gmail.com wrote:
 
 IMHO, putting some prototypes for a .c file in one header and
 others in another header is going to make it significantly harder
 to figure out which files you need to #include when. Keeping a
 simple rule there seems essential to me.
 
 OK. Let's make the separation clear then. The attached does so,
 and moves the remaining quote_* functions previously in builtins.h
 to quote.h. I am adding it to the upcoming CF for tracking the
 effort on it.
 
 
 
 
Hi Michael,

I just reviewed this patch :

* applies cleanly to master(d2b8a2c7)
* all regression tests pass

As it only moves functions from builtins.h to quote.h and updates the
impacted files, there is nothing special to add.

It will probably break some user extensions using quote_* functions.
Otherwise it looks ready to commit.

Regards.
-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tweaking Foreign Keys for larger tables

2014-11-06 Thread Jim Nasby

On 11/6/14, 2:58 AM, Simon Riggs wrote:

On 5 November 2014 21:15, Peter Eisentraut pete...@gmx.net wrote:

On 10/31/14 6:19 AM, Simon Riggs wrote:

Various ways of tweaking Foreign Keys are suggested that are helpful
for larger databases.



*INITIALLY NOT ENFORCED
FK created, but is not enforced during DML.
Will be/Must be marked NOT VALID when first created.
We can run a VALIDATE on the constraint at any time; if it passes the
check it is marked VALID and presumed to stay that way until the next
VALIDATE run.


Does that mean the FK would become invalid after every DML operation,
until you explicitly revalidate it?  Is that practical?


I think so.

We store the validity on the relcache entry.

Constraint would add a statement-level after trigger for insert,
update, delete and truncate, which issues a relcache invalidation if
the state was marked valid. Marked as deferrable initially deferred.


I don't think you'd need to invalidate on insert, or on an update that didn't 
touch a referenced key.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] split builtins.h to quote.h

2014-11-06 Thread Alvaro Herrera
Julien Rouhaud wrote:

 I just reviewed this patch :
 
 * applies cleanly to master(d2b8a2c7)
 * all regression tests pass
 
 As it's only moving functions from builtins.h to quote.h and update
 impacted files, nothing special to add.
 
 It will probably break some user extensions using quote_* functions.
 Otherwise it looks ready to commit.

I thought the consensus was that the SQL-callable function declarations
should remain in builtins.h -- mainly so that quote.h does not need to
include fmgr.h.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tweaking Foreign Keys for larger tables

2014-11-06 Thread David G Johnston
On Thu, Nov 6, 2014 at 10:29 AM, Jim Nasby wrote:

 On 11/6/14, 2:58 AM, Simon Riggs wrote:

  On 5 November 2014 21:15, Peter Eisentraut wrote:
  On 10/31/14 6:19 AM, Simon Riggs wrote:
  Various ways of tweaking Foreign Keys are suggested that are helpful
  for larger databases.
 
  *INITIALLY NOT ENFORCED
  FK created, but is not enforced during DML.
  Will be/Must be marked NOT VALID when first created.
  We can run a VALIDATE on the constraint at any time; if it passes the
  check it is marked VALID and presumed to stay that way until the next
  VALIDATE run.
 
  Does that mean the FK would become invalid after every DML operation,
  until you explicitly revalidate it?  Is that practical?
 
  I think so.
 
  We store the validity on the relcache entry.
 
  Constraint would add a statement-level after trigger for insert,
  update, delete and trigger, which issues a relcache invalidation if
  the state was marked valid. Marked as deferrable initially deferred.

 I don't think you'd need to invalidate on insert,


Why?  Since the FK is not enforced there is no guarantee that what you
just inserted is valid.

 or on an update that didn't touch a referenced key.


OK - but you would still need the trigger on the FK columns.

DELETE is OK as well since you cannot invalidate the constraint by simply
removing the referencing row.

David J.




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Tweaking-Foreign-Keys-for-larger-tables-tp5825162p5825970.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

Re: [HACKERS] What exactly is our CRC algorithm?

2014-11-06 Thread Heikki Linnakangas

On 11/04/2014 03:21 PM, Robert Haas wrote:

On Tue, Nov 4, 2014 at 4:47 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

I hear none, so committed with some small fixes.


Are you going to get the slice-by-N stuff working next, to speed this up?


I don't have any concrete plans, but yeah, that would be great. So 
definitely maybe.
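For readers following along: a bit-at-a-time reference implementation of CRC-32C (the Castagnoli polynomial) is sketched below. Slice-by-N keeps the same polynomial but precomputes N lookup tables so the inner loop consumes N input bytes per iteration. This is only an illustration, not the PostgreSQL implementation.

#include <stddef.h>
#include <stdint.h>

static uint32_t
crc32c(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t	crc = 0xFFFFFFFF;

	while (len--)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	}
	return crc ^ 0xFFFFFFFF;
}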


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] json, jsonb, and casts

2014-11-06 Thread Andrew Dunstan


In 9.3 we changed the way json generating functions worked by taking 
account of cast functions to json from non-builtin types, such as hstore.


In 9.5 I am proposing to provide similar functionality for jsonb. The 
patch actually takes account of cast functions to both jsonb and json 
(with jsonb preferred). If there is a cast to jsonb, we use it and then 
merge the result into the jsonb being accumulated. If there is just a 
cast to json, we use it, and then parse that directly into the result 
datum.


It was arguably a bit of an oversight not to take account of casts to 
jsonb in 9.4 in datum_to_json(). So I'm thinking of rolling into this 
patch changes to json.c::datum_to_json() and friends to take analogous 
account of casts to jsonb (i.e. call the cast function, turn the 
resulting jsonb into a cstring and append it to the result).


Thoughts?

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tweaking Foreign Keys for larger tables

2014-11-06 Thread Alvaro Herrera
Simon Riggs wrote:
 On 5 November 2014 21:15, Peter Eisentraut pete...@gmx.net wrote:
 
  ON DELETE IGNORE
  ON UPDATE IGNORE
  If we allow this specification then the FK is one way - we check the
  existence of a row in the referenced table, but there is no need for a
  trigger on the referenced table to enforce an action on delete or
  update, so no need to lock the referenced table when adding FKs.
 
  Are you worried about locking the table at all, or about having to lock
  many rows?
 
 This is useful for smaller, highly referenced tables that don't change
 much, if ever.
 
 In that case the need for correctness thru locking is minimal. If we
 do lock it will cause very high multixact traffic, so that is worth
 avoiding alone.

This seems like a can of worms to me.  How about the ability to mark a
table READ ONLY, so that insert/update/delete operations on it raise an
error?  For such tables, you can just assume that tuples never go away,
which can help optimize some ri_triggers.c queries by doing plain
SELECT, not SELECT FOR KEY SHARE.

If you later need to add rows to the table, you set it READ WRITE, and
then ri_triggers.c automatically start using FOR KEY SHARE; add/modify
to your liking, then set READ ONLY again.  So you incur the cost of
tuple locking only while you have the table open for writes.

This way we don't get into the mess of reasoning about foreign keys that
might be violated some of the time.

There's a side effect of tables being READ ONLY which is that tuple
freezing can be optimized as well.  I vaguely recall we have discussed
this.  It's something like SET READ ONLY, then freeze it, which sets its
relfrozenxid to 0 or maybe FrozenXid; vacuum knows it can ignore the
table for freezing purposes.  When SET READ WRITE, relfrozenxid jumps to
RecentXmin.
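To make the proposed workflow concrete, a hypothetical session might look like the sketch below. The SET READ ONLY / SET READ WRITE syntax is part of the proposal and does not exist today, and the table is invented for illustration.

ALTER TABLE country SET READ ONLY;    -- RI checks may use plain SELECT; tuples may be frozen
-- ... later, when the lookup data must change:
ALTER TABLE country SET READ WRITE;   -- RI checks revert to SELECT FOR KEY SHARE
INSERT INTO country VALUES ('XK', 'Kosovo');
ALTER TABLE country SET READ ONLY;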

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] json, jsonb, and casts

2014-11-06 Thread Tom Lane
Andrew Dunstan andrew.duns...@pgexperts.com writes:
 In 9.3 we changed the way json generating functions worked by taking 
 account of cast functions to json from non-builtin types, such as hstore.

 In 9.5 I am proposing to provide similar functionality for jsonb. The 
 patch actually takes account of cast functions to both jsonb and json 
 (with jsonb preferred). If there is a cast to jsonb, we use it and then 
 merge the result into the jsonb being accumulated. If there is just a 
 cast to json, we use it, and then parse that directly into the result 
 datum.

 It was arguably a bit of an oversight not to take account of casts to 
 jsonb in 9.4 in datum_to_json(). So I'm thinking of rolling into this 
 patch changes to json.c::datum_to_json() and friends to take analogous 
 account of casts to jsonb (i.e. call the cast function, turn the 
 resulting jsonb into a cstring and append it to the result).

Meh.  This leaves it very ambiguous which cast function would be applied
if both are available.  I think it's overcomplicated anyway.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-06 Thread Jim Nasby

On 10/29/14, 11:49 AM, Jim Nasby wrote:

On 10/21/14, 6:05 PM, Tom Lane wrote:

Jim Nasby jim.na...@bluetreble.com writes:

- What happens if we run out of space to remember skipped blocks?


You forget some, and are no worse off than today.  (This might be an
event worthy of logging, if the array is large enough that we don't
expect it to happen often ...)


Makes sense. I'll see if there's some reasonable way to retry pages when the 
array fills up.

I'll make the array 2k in size; that allows for 512 pages without spending a 
bunch of memory.


Attached is a patch for this. It also adds logging of unobtainable cleanup 
locks, and refactors scanning a page for vacuum into its own function.

Anyone reviewing this might want to look at 
https://github.com/decibel/postgres/commit/69ab22f703d577cbb3d8036e4e42563977bcf74b,
 which is the refactor with no whitespace changes.

I've verified this works correctly by connecting to a backend with gdb and 
halting it with a page pinned. Both vacuum and vacuum freeze on that table do 
what's expected, but I also get this warning (which AFAICT is a false positive):

decibel@decina.local=# vacuum verbose i;
INFO:  vacuuming public.i
INFO:  i: found 0 removable, 399774 nonremovable row versions in 1769 out of 
1770 pages
DETAIL:  20 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
Retried cleanup lock on 0 pages, retry failed on 1, skipped retry on 0.
CPU 0.00s/0.06u sec elapsed 12.89 sec.
WARNING:  buffer refcount leak: [105] (rel=base/16384/16385, blockNum=0, 
flags=0x106, refcount=2 1)
VACUUM

I am doing a simple static allocation of retry_pages[]; my understanding is 
that will only exist for the duration of this function so it's OK. If not I'll 
palloc it. If it is OK then I'll do the same for the freeze array.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
From 1752751903a8d51b7b3b618072b6b0687f9f141c Mon Sep 17 00:00:00 2001
From: Jim Nasby jim.na...@bluetreble.com
Date: Thu, 6 Nov 2014 14:42:52 -0600
Subject: [PATCH] Vacuum cleanup lock retry

This patch will retry failed attempts to obtain the cleanup lock on a
buffer. It remembers failed block numbers in an array and retries after
vacuuming the relation. The array is currently fixed at 512 entries;
additional lock failures will not be re-attempted.

This patch also adds counters to report on failures, as well as
refactoring the guts of page vacuum scans into its own function.
---
 src/backend/commands/vacuumlazy.c | 964 +-
 1 file changed, 541 insertions(+), 423 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c 
b/src/backend/commands/vacuumlazy.c
index 3778d9d..240113f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -96,6 +96,14 @@
  */
 #define SKIP_PAGES_THRESHOLD   ((BlockNumber) 32)
 
+/*
+ * Instead of blindly skipping pages that we can't immediately acquire a
+ * cleanup lock for (assuming we're not freezing), we keep a list of pages we
+ * initially skipped, up to VACUUM_MAX_RETRY_PAGES. We retry those pages at the
+ * end of vacuuming.
+ */
+#define VACUUM_MAX_RETRY_PAGES 512
+
 typedef struct LVRelStats
 {
/* hasindex = true means two-pass strategy; false means one-pass */
@@ -143,6 +151,10 @@ static void lazy_vacuum_index(Relation indrel,
 static void lazy_cleanup_index(Relation indrel,
   IndexBulkDeleteResult *stats,
   LVRelStats *vacrelstats);
+static void lazy_scan_page(Relation onerel, LVRelStats *vacrelstats,
+   BlockNumber blkno, Buffer buf, Buffer vmbuffer, 
xl_heap_freeze_tuple *frozen,
+   int nindexes, bool all_visible_according_to_vm,
+   BlockNumber *empty_pages, BlockNumber 
*vacuumed_pages, double *nunused);
 static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 int tupindex, LVRelStats *vacrelstats, Buffer 
*vmbuffer);
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
@@ -422,13 +434,15 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 {
BlockNumber nblocks,
blkno;
-   HeapTupleData tuple;
char   *relname;
BlockNumber empty_pages,
-   vacuumed_pages;
-   double  num_tuples,
-   tups_vacuumed,
-   nkeep,
+   vacuumed_pages,
+   retry_pages[VACUUM_MAX_RETRY_PAGES];
+   int retry_pages_insert_ptr;
+   double  retry_page_count,
+   retry_fail_count,
+   retry_pages_skipped,
+   cleanup_lock_waits,
 

Re: [HACKERS] Tweaking Foreign Keys for larger tables

2014-11-06 Thread Kevin Grittner
Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Simon Riggs wrote:
 On 5 November 2014 21:15, Peter Eisentraut pete...@gmx.net wrote:

 ON DELETE IGNORE
 ON UPDATE IGNORE
 If we allow this specification then the FK is one way - we check the
 existence of a row in the referenced table, but there is no need for a
 trigger on the referenced table to enforce an action on delete or
 update, so no need to lock the referenced table when adding FKs.

 Are you worried about locking the table at all, or about having to lock
 many rows?

 This is useful for smaller, highly referenced tables that don't change
 much, if ever.

 In that case the need for correctness thru locking is minimal. If we
 do lock it will cause very high multixact traffic, so that is worth
 avoiding alone.

 This seems like a can of worms to me.  How about the ability to mark a
 table READ ONLY, so that insert/update/delete operations on it raise an
 error?  For such tables, you can just assume that tuples never go away,
 which can help optimize some ri_triggers.c queries by doing plain
 SELECT, not SELECT FOR KEY SHARE.

 If you later need to add rows to the table, you set it READ WRITE, and
 then ri_triggers.c automatically start using FOR KEY SHARE; add/modify
 to your liking, then set READ ONLY again.  So you incur the cost of
 tuple locking only while you have the table open for writes.

 This way we don't get into the mess of reasoning about foreign keys that
 might be violated some of the time.

On its face, that sounds more promising to me.

 There's a side effect of tables being READ ONLY which is that tuple
 freezing can be optimized as well.  I vaguely recall we have discussed
 this.  It's something like SET READ ONLY, then freeze it, which sets its
 relfrozenxid to 0 or maybe FrozenXid; vacuum knows it can ignore the
 table for freezing purposes.  When SET READ WRITE, relfrozenxid jumps to
 RecentXmin.

It could also allow a (potentially large) optimization to 
serializable transactions -- there is no need to take any predicate 
locks on a table or its indexes if it is read only.  To safely 
transition a table from read only to read write you would need at 
least two flags (similar in some ways to indisvalid and indisready) 
-- one to say whether any of these read only optimizations are 
allowed, and another flag that would only be set after all 
transactions which might have seen the read only state have 
completed which actually allows writes.  Or that could be done with 
a char column with three states.  So on transition to read only 
you would flag it as non-writable, and after all transactions which 
might have seen it in a writable state complete you flag it as 
allowing read only optimizations.  To transition to read write you 
disable the optimizations first and wait before actually flagging 
it as read write.
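A sketch of the three-state flag described above; the enum and its names are invented for illustration and are not part of any patch.

typedef enum RelReadOnlyState
{
	REL_READ_WRITE,				/* writes allowed, no read-only optimizations */
	REL_READ_ONLY_PENDING,		/* writes rejected, but transactions that saw the
								 * table as writable may still be running, so the
								 * optimizations cannot be trusted yet */
	REL_READ_ONLY				/* writes rejected and optimizations allowed:
								 * plain SELECT in RI checks, no predicate locks,
								 * tuples may be assumed frozen */
} RelReadOnlyState;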

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #11867: Strange behaviour with composite types after resetting database tablespace

2014-11-06 Thread Jim Nasby

On 11/4/14, 10:45 AM, Tom Lane wrote:

So it's safe as things stand; but this seems a bit, um, rickety.  Should
we insert a DropRelFileNodesAllBuffers call into ATExecSetTableSpace to
make it safer?  It's kind of annoying to have to scan the buffer pool
twice, but I'm afraid that sometime in the future somebody will try to get
clever about optimizing away the end-of-transaction buffer pool scan, and
break this case.  Or maybe I'm just worrying too much.


We could possibly remember what relations have been moved to different 
tablespaces in a transaction and avoid the call in that case, but that seems 
rather silly.

If there's any non-trivial actual data involved then presumably the buffer scan 
won't be noticed amidst all the IO, so this would only be an issue if you're 
playing games with mostly empty tables, and only then if you've got really 
large shared buffers. I can't actually come up with any real use case for that.

So +1 for doing the safe thing...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-06 Thread Andres Freund
On 2014-11-06 14:55:37 -0600, Jim Nasby wrote:
 On 10/29/14, 11:49 AM, Jim Nasby wrote:
 On 10/21/14, 6:05 PM, Tom Lane wrote:
 Jim Nasby jim.na...@bluetreble.com writes:
 - What happens if we run out of space to remember skipped blocks?
 
 You forget some, and are no worse off than today.  (This might be an
 event worthy of logging, if the array is large enough that we don't
 expect it to happen often ...)
 
 Makes sense. I'll see if there's some reasonable way to retry pages when the 
 array fills up.
 
 I'll make the array 2k in size; that allows for 512 pages without spending a 
 bunch of memory.
 
 Attached is a patch for this. It also adds logging of unobtainable cleanup 
 locks, and refactors scanning a page for vacuum into it's own function.
 
 Anyone reviewing this might want to look at 
 https://github.com/decibel/postgres/commit/69ab22f703d577cbb3d8036e4e42563977bcf74b,
  which is the refactor with no whitespace changes.
 
 I've verified this works correctly by connecting to a backend with gdb and 
 halting it with a page pinned. Both vacuum and vacuum freeze on that table do 
 what's expected, but I also get this waring (which AFAICT is a false 
 positive):

I think the retry logical is a largely pointless complication of already
complex enough code. You're fixing a problem for which there is
absolutely no evidence of its existance. Yes, this happens
occasionally. But it's going to be so absolutely minor in comparison to
just about every other source of bloat.

So, I pretty strongly against retrying. I could live with adding logging
of the number of pages skipped due to not being able to acquire the
cleanup lock. I don't think that's going to do help many people, but the
cost is pretty low.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] ltree::text not immutable?

2014-11-06 Thread Jim Nasby

On 11/5/14, 7:38 PM, Robert Haas wrote:

Attached is a complete patch along these lines.  As I suggested earlier,
this just makes the relevant changes in ltree--1.0.sql and
pg_trgm--1.1.sql without bumping their extension version numbers,
since it doesn't seem important enough to justify a version bump.


I don't understand why you went to all the trouble of building a
versioning system for extensions if you're not going to use it.


I was about to dismiss this comment since I don't see any real need  for a 
version bump here, except that AFAIK there's no way to upgrade an extension 
without bumping the version, other than resorting to hackery.

So I think this does need to be an upgrade. :(
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BRIN indexes - TRAP: BadArgument

2014-11-06 Thread Alvaro Herrera
Jeff Janes wrote:
 On Wed, Nov 5, 2014 at 12:54 PM, Alvaro Herrera alvhe...@2ndquadrant.com
 wrote:
 
 Thanks for the updated patch.
 
 Now when I run the test program (version with better error reporting
 attached), it runs fine until I open a psql session and issue:
 
 reindex table foo;

Interesting.  This was a more general issue actually -- if you dropped
the index at that point and created it again, the resulting index would
also be corrupt in the same way.  Inspecting with the supplied
pageinspect functions made the situation pretty obvious.  The old code
was skipping page ranges in which it could not find any tuples, but
that's bogus and inefficient.  I changed an if into a loop that
inserts intermediary tuples, if any are needed.  I cannot reproduce that
problem anymore.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


brin-24.patch.gz
Description: application/gzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-11-06 Thread Jim Nasby

On 11/5/14, 8:26 PM, Robert Haas wrote:

rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# select 
test_group_locking('1.0:start,2.0:start,1.0:lock:AccessExclusiveLock:foo,2.0:lock:AccessExclusiveLock:foo');
NOTICE:  starting worker 1.0
NOTICE:  starting worker 2.0
NOTICE:  instructing worker 1.0 to acquire AccessExclusiveLock on
relation with OID 16387
NOTICE:  instructing worker 2.0 to acquire AccessExclusiveLock on
relation with OID 16387


Damn that's cool to see! :)

FWIW, some creative use of a composite type, plpgsql and string_to_array() 
would have saved a bunch of parsing code... it's surprisingly easy to parse 
stuff like this using string_to_array().
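A minimal sketch of that idea, splitting the spec first on ',' and then on ':' with string_to_array(); the column aliases are illustrative only, and the 'start' steps simply come back with NULL lockmode/relname columns:

SELECT parts[1] AS task, parts[2] AS op, parts[3] AS lockmode, parts[4] AS relname
FROM unnest(string_to_array(
         '1.0:start,2.0:start,1.0:lock:AccessExclusiveLock:foo', ',')) AS step,
     LATERAL string_to_array(step, ':') AS parts;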
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tweaking Foreign Keys for larger tables

2014-11-06 Thread Jim Nasby

On 11/6/14, 11:49 AM, David G Johnston wrote:

 Constraint would add a statement-level after trigger for insert,
 update, delete and trigger, which issues a relcache invalidation if
 the state was marked valid. Marked as deferrable initially deferred.
I don't think you'd need to invalidate on insert,


Why?  Since the FK is not enforced there is no guarantee that what you just
inserted is valid


I'm talking about the referenced (aka 'parent') table, not the referring table.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] ltree::text not immutable?

2014-11-06 Thread Tom Lane
Jim Nasby jim.na...@bluetreble.com writes:
 On 11/5/14, 7:38 PM, Robert Haas wrote:
 I don't understand why you went to all the trouble of building a
 versioning system for extensions if you're not going to use it.

 I was about to dismiss this comment since I don't see any real need  for a 
 version bump here, except that AFAIK there's no way to upgrade an extension 
 without bumping the version, other than resorting to hackery.

Well, the one or two users who actually care can fix the problem with a
manual ALTER FUNCTION.  I'm not sure it's worth making everyone else
jump through update hoops.  If it hadn't taken us years to even notice
the issue, I might be more willing to expend work here, but I really
doubt the audience is larger than one or two people ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-11-06 Thread Robert Haas
On Fri, Oct 31, 2014 at 9:07 AM, Andres Freund and...@2ndquadrant.com wrote:
 Maybe we can, as a first step, make those edges in the lock graph
 visible to the deadlock detector? It's pretty clear that undetected
 deadlocks aren't ok, but detectable deadlocks in a couple corner cases
 might be acceptable.

An interesting point came to mind this morning on this topic, while
discussing this issue with Amit.  Suppose process A1 holds a lock on
some object and process A2, a member of the same locking group, waits
for a lock on that same object.  It follows that there must exist a
process B which either *holds* or *awaits* a lock which conflicts with
the one requested by A2; otherwise, A2 would have been granted the
lock at once.  If B *holds* the conflicting lock, it may eventually
release it; but if B *awaits* the conflicting lock, we have a soft
deadlock, because A2 waits for B waits for A1, which we presume to
wait for A1.[1]

The logical consequence of this is that, if a process seeks to acquire
a lock which is already held (in any mode) by a fellow group-member,
it had better jump over the entire wait queue and acquire the lock
immediately if no conflicting locks have already been granted.  If it
fails to do this, it immediately creates a soft deadlock.  What if
there *are* conflicting locks already granted?  In that case, things
are more complicated.  We could, for example, have this situation: A1
holds AccessShareLock, B holds ExclusiveLock, C1 awaits ShareLock, and
then (following it in the wait queue) A2 awaits RowExclusiveLock.
This is not a deadlock, because B can finish, then C can get the lock,
then C1 can finish, then A2 can get the lock.  But if C2 now waits for
some member of group A, then we have a soft deadlock between group A
and group C, which can be resolved by moving A2's request ahead of C1.
(This example is relatively realistic, too: a lock upgrade from
AccessShareLock to RowExclusiveLock is probably the most common type
of lock upgrade, for reasons that are hopefully obvious.)

In practice, I suspect it's best to take a more aggressive approach
and have A2 jump straight to the head of the line right out of the
chute.  Letting some parallel workers make progress while holding up
others out of a notion of locking fairness is probably stupid, because
they're probably dependent on each other in some way that makes it
useless for one to make progress while the others don't.  For the same
reason, I'd argue that we ought to wait until all pending lock
requests from a given group can be satisfied before granting any of
them.  In fact, I suspect it's best in general for the locking
machinery and the deadlock detector to view several simultaneous,
pending lock requests as if they were a single request conflicting
with everything that would conflict with any of the requested modes.
Consider the situation where there are 10 lock waiters, 5 in each of
two parallel groups.  The deadlock detector could spend a good deal of
time and energy trying various reorderings of the lock queue, but in
practice it seems highly unlikely that it's sensible to do anything
other than put all the locks from one group first and all of the locks
from the other group afterwards, regardless of whether the processes
want the same mode or different modes.  Most of the other orderings
will either be equivalent to one of those orderings (because commuting
two non-conflicting locks in the wait queue has no effect) or soft
deadlocks (because they interleave a conflicting lock request between
two requests from the same group).  And even if you can find some
corner case where neither of those things is the case, it's probably
still not useful to let half the lock group run and leave the other
half blocked.  What will probably happen is that some internal queue
will pretty quickly either fill up or drain completely and the process
filling it, or draining it, will wait.  Now we're no better off than
if we'd waited to grant the lock in the first place, and we're now
holding more locks that can block other things or cause deadlocks.
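As a sketch of the treat-the-group's-pending-requests-as-one-request idea: the combined conflict set is just the union of the per-mode conflict sets. The names below are illustrative (LOCKMASK mirrors the lock manager's bit-per-mode representation); this is not a patch.

typedef int LOCKMASK;			/* one bit per lock mode, as in the lock manager */

static LOCKMASK
group_conflict_mask(const LOCKMASK *conflictTab, const int *requestedModes,
					int nrequests)
{
	LOCKMASK	conflicts = 0;

	/* a mode conflicts with the group iff it conflicts with any member's request */
	for (int i = 0; i < nrequests; i++)
		conflicts |= conflictTab[requestedModes[i]];

	return conflicts;
}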

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1]  Here I'm assuming, as in previous discussion, that we regard all
members of a parallel group as mutually waiting for each other, on the
theory that the parallel computation won't complete until all workers
complete and, therefore, the members of the group are either already
waiting for each other or eventually will.  While this might not be
true in every case, I believe it's a good (and generally valid)
simplifying assumption.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] json, jsonb, and casts

2014-11-06 Thread Andrew Dunstan

On 11/06/2014 03:58 PM, Tom Lane wrote:

Andrew Dunstan andrew.duns...@pgexperts.com writes:

In 9.3 we changed the way json generating functions worked by taking
account of cast functions to json from non-builtin types, such as hstore.
In 9.5 I am proposing to provide similar functionality for jsonb. The
patch actually takes account of cast functions to both jsonb and json
(with jsonb preferred). If there is a cast to jsonb, we use it and then
merge the result into the jsonb being accumulated. If there is just a
cast to json, we use it, and then parse that directly into the result
datum.
It was arguably a bit of an oversight not to take account of casts to
jsonb in 9.4 in datum_to_json(). So I'm thinking of rolling into this
patch changes to json.c::datum_to_json() and friends to take analogous
account of casts to jsonb (i.e. call the cast function, turn the
resulting jsonb into a cstring and append it to the result).

Meh.  This leaves it very ambiguous which cast function would be applied
if both are available.  I think it's overcomplicated anyway.




OK, then we can do one of these:

 * just honor casts to json, whether generating json or jsonb, or
 * just honor casts to json when generating json (as now) and just
   casts to jsonb when generating jsonb, ignoring the other casts in
   both cases.


I can see a case for each, so I won't be too fussed either way. On 
balance I probably favor the first option, as it means you would only 
need to supply one cast function to have the type behave as you want, 
and functions providing casts to json are going to be a LOT easier to write.
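For illustration, the kind of cast such a patch would honor could be declared as below; mytype and mytype_to_jsonb() are stand-ins for a real extension type (such as hstore) and its conversion function. With a cast like this in place, the jsonb-generating functions could serialize mytype values through it instead of falling back to the type's text output.

CREATE TYPE mytype AS (a int, b text);

CREATE FUNCTION mytype_to_jsonb(mytype) RETURNS jsonb
    LANGUAGE sql IMMUTABLE
    AS $$ SELECT to_json($1)::jsonb $$;

CREATE CAST (mytype AS jsonb) WITH FUNCTION mytype_to_jsonb(mytype);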


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload

2014-11-06 Thread Peter Eisentraut
On 10/16/14 11:34 PM, Craig Ringer wrote:
 psql: FATAL:  Peer authentication failed for user fred
 HINT:  See the server error log for additional information.

I think this is wrong for many reasons.

I have never seen an authentication system that responds with, hey, what
you just did didn't get you in, but the administrators are currently in
the process of making a configuration change, so why don't you check
that out.

We don't know whether the user has access to the server log.  They
probably don't.  Also, it is vastly more likely that the user really
doesn't have access in the way they chose, so throwing in irrelevant
hints will be distracting.

Moreover, it will be confusing to regular users if this message
sometimes shows up and sometimes doesn't, independent of their own state
and actions.

Finally, the fact that a configuration change is in progress is
privileged information.  Unprivileged users can deduce from the presence
of this message that administrators are doing something, and possibly
that they have done something wrong.

I think it's fine to log a message in the server log if the pg_hba.conf
file needs reloading.  But the client shouldn't know about this at all.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] split builtins.h to quote.h

2014-11-06 Thread Michael Paquier
On Fri, Nov 7, 2014 at 2:31 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Julien Rouhaud wrote:

 I just reviewed this patch :

 * applies cleanly to master(d2b8a2c7)
 * all regression tests pass

 As it's only moving functions from builtins.h to quote.h and update
 impacted files, nothing special to add.

 It will probably break some user extensions using quote_* functions.
 Otherwise it looks ready to commit.

 I thought the consensus was that the SQL-callable function declarations
 should remain in builtins.h -- mainly so that quote.h does not need to
 include fmgr.h.
Moving everything to quote.h is done in-line with what Tom and Robert
suggested, builtins.h being a refuge for function declarations that
have nowhere else to go. Suggestion from here:
http://www.postgresql.org/message-id/ca+tgmozf3dkptua6ex6gxlnnd-nms-fbjcxroitwffh-+6y...@mail.gmail.com
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-06 Thread Greg Stark
On Thu, Nov 6, 2014 at 9:30 PM, Andres Freund and...@2ndquadrant.com wrote:
 I think the retry logical is a largely pointless complication of already
 complex enough code. You're fixing a problem for which there is
 absolutely no evidence of its existance. Yes, this happens
 occasionally. But it's going to be so absolutely minor in comparison to
 just about every other source of bloat.

I agree bloat isn't really a threat, but what about the relfrozenxid?
If we skip even one page we don't get to advance it and retrying could
eliminate those skipped pages and allow us to avoid a vacuum freeze
which can be really painful. Of course that only works if you can be
sure you haven't overflowed and forgotten any skipped pages and if you
don't find the page still pinned every time until you eventually give
up on it.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-06 Thread Jim Nasby

On 11/6/14, 5:40 PM, Greg Stark wrote:

On Thu, Nov 6, 2014 at 9:30 PM, Andres Freund and...@2ndquadrant.com wrote:

I think the retry logical is a largely pointless complication of already
complex enough code. You're fixing a problem for which there is
absolutely no evidence of its existance. Yes, this happens
occasionally. But it's going to be so absolutely minor in comparison to
just about every other source of bloat.


For some reason I don't have Andres' original email, so I'll reply here: I 
agree with you, and my original proposal was simply to log how many pages were 
skipped, but that was objected to. Simply logging this extra information would 
be a patch of a dozen lines or less.

The problem right now is there's no way to actually obtain evidence that this 
is (or isn't) something to worry about, because we just silently skip pages. If 
we had any kind of tracking on this we could stop guessing. :(


I agree bloat isn't really a threat, but what about the relfrozenxid?
If we skip even one page we don't get to advance it and retrying could
eliminate those skipped pages and allow us to avoid a vacuum freeze
which can be really painful. Of course that only works if you can be
sure you haven't overflowed and forgotten any skipped pages and if you
don't find the page still pinned every time until you eventually give
up on it.


The overflow part shouldn't be that big a deal. Either we just bump the array size 
up some more, or worst-case we scan it whenever it fills (like we do when we fill 
vacrelstats->dead_tuples).

But like I said above, I think this is already making a mountain out of a 
mole-hill, until we have evidence there's a real problem.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tracking commit timestamps

2014-11-06 Thread Craig Ringer
On 11/01/2014 09:00 PM, Michael Paquier wrote:
 1) It is untested and actually there is no direct use for it in core.
 2) Pushing code that we know as dead is no good, that's a feature more
 or less defined as maybe-useful-but-we-are-not-sure-yet-what-to-do-with-it.
 3) If you're going to re-use this API in BDR, which is a fork of
 Postgres.

I would like to emphasise that BDR is not really a fork at all, no more
than any big topic branch would be a fork.

BDR is two major parts:

- A collection of patches to core (commit timestamps/extradata, sequence
AM, replication identifiers, logical decoding, DDL deparse, event
triggers, etc). These are being progressively submitted to core,
maintained as multiple feature branches plus a merged version; and

- An extension that uses core features and, where necessary, the
additions to core to implement bi-directional logical replication.

Because of the time scales involved in getting things into core it's
been necessary to *temporarily* get the 9.4-based feature branch into
wider use so that it can be used to run the BDR extension, but if we can
get the required features into core that need will go away.

Event triggers and logical decoding were already merged in 9.4.

If we can get things like commit timestamps, commit extradata / logical
replication identifiers, the sequence access method, etc merged in 9.5
then it should be possible to do away with the need for the patches to
core entirely and run BDR on top of stock 9.5. I'd be delighted if that
were possible, as doing away with the patched 9.4 would get rid of a
great deal of work and frustration on my part.

Note that the BDR extension itself is PostgreSQL-licensed. Externally
maintained extensions have been brought in-core before. It's a lot of
code though, and I can't imagine that being a quick process.



 You'd better complete this API in BDR by yourself and not
 bother core with that.

This argument would've prevented the inclusion of logical decoding,
which is rapidly becoming the headline feature for 9.4, or at least
shortly behind jsonb. Slony is being adapted to use it, multiple people
are working on auditing systems based on it, and AFAIK EDB's xDB is
being adapted to take advantage of it too.

As development gets more complex and people attempt bigger features, the
One Big Patch that adds a feature and an in-core user of the feature is
not going to be a viable approach all the time. In my view it's already
well past that, and some recent patches (like RLS) really should've been
split up into patch series.

If we want to avoid unreviewable monster-patches it will, IMO, be
necessary to have progressive, iterative enhancement. That may sometimes
mean that there's code in core that's only used by future
yet-to-be-merged patches and/or by extensions.

Of course its desirable to have an in-tree user of the code wherever
possible/practical - but sometimes it may *not* be possible or
practical. It seems to me like the benefits of committing work in
smaller, more reviewable chunks outweigh the benefits of merging
multiple related but separate changes just so everything can have an
immediate in-tree user.

That's not to say that extradata must remain glued to commit timestamps.
It might make more sense as a separate patch with an API to allow
extensions to manipulate it directly, plus a dummy extension showing how
it works, like we do with various hooks and with APIs like FDWs.
However, just like the various hooks that we have, it *does* make sense
to have something in-core that has no real world in-core users.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-11-06 Thread Amit Langote

Hi,

 ow...@postgresql.org] On Behalf Of Robert Haas
 Sent: Tuesday, October 28, 2014 9:20 PM
 
 On Tue, Oct 28, 2014 at 6:06 AM, Andres Freund and...@2ndquadrant.com
 wrote:
  In my opinion we can reuse (some of) the existing logic for INHERITS to
  implement proper partitioning, but that should be an implementation
  detail.
 
 Sure, that would be a sensible way to do it.  I mostly care about not
 throwing out all the work that's been done on the planner and
 executor.  Maybe you're thinking we'll eventually replace that with
 something better, which is fine, but I wouldn't underestimate the
 effort to make that happen.  For example, I think it'd be sensible for
 the first patch to just add some new user-visible syntax with some
 additional catalog representation that doesn't actually do all that
 much yet.  Then subsequent patches could use that additional metadata
 to optimize partition prune, implement tuple routing, etc.
 

I mentioned upthread  about the possibility of resurrecting Itagaki-san's patch 
[1] to try to make things work in this direction. I would be willing to spend 
time on this.  I see useful reviews of the patch by Robert [2], Simon [3] at 
the time but it wasn't pursued further. I think those reviews were valuable 
design input that IMHO would still be relevant. It seems the reviews suggested 
some significant changes to the design proposed. Off course, there are many 
other considerations discussed upthread that need to be addressed. 
Incorporating those changes and others, I think such an approach could be 
worthwhile.

Thoughts?

[1] https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress
[2] 
http://www.postgresql.org/message-id/aanlktikp-1_8b04eyik0sdf8ua5kmo64o8sorfbze...@mail.gmail.com
[3] http://www.postgresql.org/message-id/1279196337.1735.9598.camel@ebony

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What exactly is our CRC algorithm?

2014-11-06 Thread Amit Kapila
On Tue, Nov 4, 2014 at 3:17 PM, Heikki Linnakangas hlinnakan...@vmware.com
wrote:

 On 10/27/2014 06:02 PM, Heikki Linnakangas wrote:

 I came up with the attached patches. They do three things:

 1. Get rid of the 64-bit CRC code. It's not used for anything, and
 haven't been for years, so it doesn't seem worth spending any effort to
 fix them.

 2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't
 need to remain compatible across major versions.


Will this change allow a database created before this commit to be
started after this commit?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] inherit support for foreign tables

2014-11-06 Thread Kyotaro HORIGUCHI
Hello, I haven't fully caught up on this topic, but I tried this one.

  Here are separated patches.
 
  fdw-chk.patch  - CHECK constraints on foreign tables
  fdw-inh.patch  - table inheritance with foreign tables
 
  The latter has been created on top of [1].
 
  [1]
  http://www.postgresql.org/message-id/540da168.3040...@lab.ntt.co.jp
 
  To be exact, it has been created on top of [1] and fdw-chk.patch.

I tried both patches on the current head. The newly added
parameter to analyze_rel() prevented them from applying, but that
seems easy to fix, and almost all of the rest applied cleanly.

By the way, are these the result of simply splitting your last
patch, foreign_inherit-v15.patch?

http://www.postgresql.org/message-id/53feef94.6040...@lab.ntt.co.jp

The results of applying the all-in-one version and this split
version seem to have many differences. Did you add other
changes as well? Or am I misunderstanding this patch?

git diff --numstat 0_foreign_inherit_one 0_foreign_inherit_two 
5   51  contrib/file_fdw/file_fdw.c
10  19  contrib/file_fdw/input/file_fdw.source
18  71  contrib/file_fdw/output/file_fdw.source
19  70  contrib/postgres_fdw/expected/postgres_fdw.out
9   66  contrib/postgres_fdw/postgres_fdw.c
12  35  contrib/postgres_fdw/sql/postgres_fdw.sql
13  48  doc/src/sgml/fdwhandler.sgml
39  39  doc/src/sgml/ref/alter_foreign_table.sgml
4   3   doc/src/sgml/ref/create_foreign_table.sgml
8   0   src/backend/catalog/heap.c
7   3   src/backend/commands/analyze.c
0   7   src/backend/commands/tablecmds.c
1   22  src/backend/optimizer/plan/createplan.c
7   7   src/backend/optimizer/prep/prepunion.c
0   26  src/backend/optimizer/util/pathnode.c
1   1   src/include/commands/vacuum.h
0   7   src/include/foreign/fdwapi.h
19  1   src/test/regress/expected/foreign_data.out
9   2   src/test/regress/sql/foreign_data.sql


  I noticed that the latter disallows TRUNCATE on inheritance trees that
  contain at least one child foreign table.  But I think it would be
  better to allow it, with the semantics that we quietly ignore the
  child
  foreign tables and apply the operation to the child plain tables,
  which
  is the same semantics as ALTER COLUMN SET STORAGE on such inheritance
  trees.  Comments welcome.
 
 Done.  And I've also a bit revised regression tests for both
 patches. Patches attached.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add generate_series(numeric, numeric)

2014-11-06 Thread Fujii Masao
On Tue, Oct 14, 2014 at 3:22 PM, Michael Paquier
michael.paqu...@gmail.com wrote:


 On Tue, Oct 7, 2014 at 8:38 AM, Ali Akbar the.ap...@gmail.com wrote:

 2014-10-06 22:51 GMT+07:00 Marti Raudsepp ma...@juffo.org:



  the one that tests values just before numeric overflow

 Actually I don't know if that's too useful. I think you should add a
 test case that causes an error to be thrown.


 Actually i added the test case because in the code, when adding step into
 current for the last value, i expected it to overflow:

 /* increment current in preparation for next iteration */
 add_var(&fctx->current, &fctx->step, &fctx->current);

 where in the last calculation, current is 9 * 10^131071. Plus 10^131071,
 it will be 10^131072, which i expected to overflow numeric type (in the doc,
 numeric's range is up to 131072 digits before the decimal point).

 In attached patch, i narrowed the test case to produce smaller result.


 Well, as things stand now, the logic relies on cmp_var and the sign of the
 stop and current values. it is right that it would be better to check for
 overflow before going through the next iteration, and the cleanest and
 cheapest way to do so would be to move the call to make_result after add_var
 and save the result variable in the function context, or something similar,
 but honestly I am not sure it is worth the complication as it would mean
 some refactoring on what make_result actually already does.

 I looked at this patch again a bit and finished with the attached, adding an
 example in the docs, refining the error messages and improving a bit the
 regression tests. I have nothing more to comment, so I am marking this patch
 as ready for committer.

The patch looks good to me. Barring any objection I will commit the patch.
Memo for me: CATALOG_VERSION_NO must be changed at the commit.
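For reference, usage with the three-argument form from the patch under review would look something like this (values shown are the expected series; exact display scale aside):

SELECT s FROM generate_series(0.0, 1.0, 0.25) AS s;
-- 0.0, 0.25, 0.5, 0.75, 1.0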

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tweaking Foreign Keys for larger tables

2014-11-06 Thread Simon Riggs
On 6 November 2014 20:47, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Simon Riggs wrote:
...
 In that case the need for correctness thru locking is minimal. If we
 do lock it will cause very high multixact traffic, so that is worth
 avoiding alone.

 This seems like a can of worms to me.  How about the ability to mark a
 table READ ONLY, so that insert/update/delete operations on it raise an
 error?  For such tables, you can just assume that tuples never go away,
 which can help optimize some ri_triggers.c queries by doing plain
 SELECT, not SELECT FOR KEY SHARE.

 If you later need to add rows to the table, you set it READ WRITE, and
 then ri_triggers.c automatically starts using FOR KEY SHARE; add/modify
 to your liking, then set READ ONLY again.  So you incur the cost of
 tuple locking only while you have the table open for writes.
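
For context, the check that ri_triggers.c runs against the referenced table
is roughly of the following form (simplified here for illustration; the real
query text is generated internally, and the table and column names below are
invented):

-- today: the check takes a row lock on the referenced row
SELECT 1 FROM ONLY referenced_table x WHERE x.ref_id = $1 FOR KEY SHARE OF x;

-- under the READ ONLY idea above, the same check could be issued without
-- the locking clause, since the referenced row cannot go away
SELECT 1 FROM ONLY referenced_table x WHERE x.ref_id = $1;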

How about we set the lock level on each Foreign Key, like this:

[USING LOCK [lock level]]

level is one of
KEY - [FOR KEY SHARE] - default
ROW -  [FOR SHARE]
TABLE SHARE - [ ]
TABLE EXCLUSIVE - [FOR TABLE EXCLUSIVE]

which introduces these new level descriptions:
TABLE SHARE - the default behavior of SELECT
TABLE EXCLUSIVE - we lock the referenced table against all writes;
this allows the table to be fully cached for use in speeding up checks
[FOR TABLE EXCLUSIVE] - uses ShareRowExclusiveLock

The last level is like Read Only tables apart from the fact that
they can be written to when needed, but we optimize things on the
assumption that such writes are very rare.
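
For illustration only, the clause might be used like this (hypothetical
syntax following the sketch above; nothing here is implemented, and the
table and column names are invented):

create table currencies (code text primary key);
create table line_items (
  id       serial primary key,
  currency text references currencies (code) using lock table exclusive
  -- the FK check could then read currencies without row locks, at the
  -- cost of locking the whole table against writes when needed
);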

We could add Read Only tables as well, but I don't see as much
use for them. Sounds like you'd spend a lot of time with ALTER TABLE
as you turn it on and off. I'd like to be able to do that
automatically as needed.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-06 Thread Andres Freund
On 2014-11-06 23:40:18 +, Greg Stark wrote:
 On Thu, Nov 6, 2014 at 9:30 PM, Andres Freund and...@2ndquadrant.com wrote:
  I think the retry logic is a largely pointless complication of already
  complex enough code. You're fixing a problem for which there is
  absolutely no evidence of its existence. Yes, this happens
  occasionally. But it's going to be so absolutely minor in comparison to
  just about every other source of bloat.
 
 I agree bloat isn't really a threat, but what about the relfrozenxid?
 If we skip even one page we don't get to advance it, and retrying could
 eliminate those skipped pages and allow us to avoid a vacuum freeze,
 which can be really painful. Of course that only works if you can be
 sure you haven't overflowed and forgotten any skipped pages, and if you
 don't find the page still pinned every time until you eventually give
 up on it.

I don't buy this argument. Either you're constantly vacuuming the whole
relation anyway - in which case we'll acquire the cleanup lock
unconditionally - or we're doing partial vacuums via the visibility map.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-11-06 Thread Andres Freund
On 2014-11-06 19:03:20 -0600, Jim Nasby wrote:
 On 11/6/14, 5:40 PM, Greg Stark wrote:
 On Thu, Nov 6, 2014 at 9:30 PM, Andres Freund and...@2ndquadrant.com wrote:
 I think the retry logic is a largely pointless complication of already
 complex enough code. You're fixing a problem for which there is
 absolutely no evidence of its existence. Yes, this happens
 occasionally. But it's going to be so absolutely minor in comparison to
 just about every other source of bloat.
 
 For some reason I don't have Andres' original email, so I'll reply
 here: I agree with you, and my original proposal was simply to log how
 many pages were skipped, but that was objected to. Simply logging this
 extra information would be a patch of a dozen lines or less.

The objection was that it's unnecessary complexity. So you made the
patch an order of magnitude more complex *and* added logging? That doesn't make
much sense.

 The problem right now is there's no way to actually obtain evidence
 that this is (or isn't) something to worry about, because we just
 silently skip pages. If we had any kind of tracking on this we could
 stop guessing. :(

What's the worst consequence this could have? A couple of pages not marked
all-visible and not immediately cleaned up. That's not particularly
harmful.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What exactly is our CRC algorithm?

2014-11-06 Thread Heikki Linnakangas

On 11/07/2014 07:08 AM, Amit Kapila wrote:

On Tue, Nov 4, 2014 at 3:17 PM, Heikki Linnakangas hlinnakan...@vmware.com
wrote:


On 10/27/2014 06:02 PM, Heikki Linnakangas wrote:


I came up with the attached patches. They do three things:

1. Get rid of the 64-bit CRC code. It's not used for anything, and
hasn't been for years, so it doesn't seem worth spending any effort to
fix it.

2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't
need to remain compatible across major versions.



Will this change allow a database created before this commit to be
started after this commit?


No. You could use pg_resetxlog to fix the WAL, but I think at least the
relmap files would still prevent you from starting up. You could use
pg_upgrade.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2014-11-06 Thread furuyao
 (2014/08/28 18:00), Etsuro Fujita wrote:
  (2014/08/22 11:51), Noah Misch wrote:
  Today's ANALYZE VERBOSE messaging for former inheritance parents
  (tables with relhassubclass = true but no pg_inherits.inhparent
  links) is deceptive, and I welcome a fix to omit the spurious
  message.  As defects go, this is quite minor.  There's fundamentally
  no value in collecting inheritance tree statistics for such a parent,
  and no PostgreSQL command will do so.
 
  A credible alternative is to emit a second message indicating that we
  skipped the inheritance tree statistics after all, and why we skipped
  them.
 
  I'd like to address this by emitting the second message as shown below:
 
  A separate patch (analyze.patch) handles the former case in a similar way.
 
 I'll add the analyze.patch to the upcoming CF as an independent item;
 it emits a second message indicating that we skipped the inheritance
 tree statistics and why we skipped them.

I did a review of the patch and found no problems.
I confirmed the following:

1. The patch applied cleanly and compiled without warnings or errors.
2. All regression tests passed.
3. The messages output by ANALYZE VERBOSE.

The following is the SQL I used to check the messages:

create table parent (id serial);
create table child (name text) inherits (parent);
ANALYZE VERBOSE parent ;
drop table child ;
ANALYZE VERBOSE parent ;

Regards,

--
Furuya Osamu


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers