Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-29 Thread Fabien COELHO


Hello Robert,

I wish to agree, but my interpretation of the previous code is that 
they were ignored before, so ISTM that we are stuck with keeping the 
same unfortunate behavior.


I don't agree.  I'm not in a huge hurry to fix all the places where 
pgbench currently lacks error checks just because I don't have enough to 
do (hint: I do have enough to do), but when we're adding more 
complicated syntax in one particular place, bringing the error checks in 
that portion of the code up to scratch is an eminently sensible thing to 
do, and we should do it.


Ok. I'm in favor of that anyway. It is just that I was afraid that changing 
the behavior, however poor the said behavior, could be a blocker.



Also, please stop changing the title of this thread every other post.
It breaks threading for me (and anyone else using gmail), and that
makes the thread hard to follow.


Sorry. It does not break my mailer, which relies on internal headers, but 
I'll try to be compatible with this gmail feature in the future.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] delta relations in AFTER triggers

2014-07-29 Thread Pavel Stehule
2014-07-28 19:27 GMT+02:00 Marti Raudsepp ma...@juffo.org:

 On Mon, Jul 28, 2014 at 6:24 PM, Kevin Grittner kgri...@ymail.com wrote:
  Do you have some other suggestion?  Keep in mind that it must allow
  the code which will *generate* the transition tables to know
  whether any of the attached triggers use a given transition table
  for the specific operation, regardless of the language of the
  trigger function.

 You will need to access the pg_proc record of the trigger function
 anyway, so it's just a matter of coming up with syntax that makes
 sense, right?

 What I had in mind was that we could re-use function argument
 declaration syntax. For instance, use the argmode specifier to
 declare OLD and NEW. Shouldn't cause grammar conflicts because the
 current OUT and INOUT aren't reserved keywords.

 We could also re-use the refcursor type, which already has bindings in
 some PLs, if that's not too much overhead. That would make the
 behavior straightforward without introducing new constructs, plus you
 can pass them around between functions. Though admittedly it's
 annoying to integrate cursor results into queries.

 Something like:

 CREATE FUNCTION trig(OLD old_rows refcursor, NEW new_rows refcursor)
 RETURNS trigger LANGUAGE plpgsql AS '...';


I dislike this proposal - it is strongly inconsistent with current trigger
design

Regards

Pavel



 Or maybe if the grammar allows, we could spell out NEW TABLE, OLD
 TABLE, but that's redundant since you can already deduce that from
 the refcursor type.

 It could also be extended for different types, like tid[], and maybe
 record for the FOR EACH ROW variant (dunno if that can be made to
 work).

 Regards,
 Marti


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] delta relations in AFTER triggers

2014-07-29 Thread Marti Raudsepp
On Tue, Jul 29, 2014 at 9:49 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 I dislike this proposal - it is strongly inconsistent with current trigger
 design

The real point I was trying to convey (in my previous email) is that
these declarations should be part of the trigger *function* not the
function-to-table relationship. CREATE TRIGGER shouldn't be in the
business of declaring new local variables for the trigger function.
Whether we define new syntax for that or re-use the argument list is
secondary.

But the inconsistency is deliberate, I find the current trigger API
horrible. Magic variables... Text-only TG_ARGV for arguments...
RETURNS trigger... No way to invoke trigger functions directly for
testing.

By not imitating past mistakes, maybe we can eventually arrive at a
language that makes sense.

Regards,
Marti


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-29 Thread Fabien COELHO


Hello Robert,


3. Similarly, I suggest that the use of gaussian or uniform be an
error when argc < 6 OR argc > 6.  I also suggest that the
parenthesized distribution type be dropped from the error message in
all cases.


I wish to agree, but my interpretation of the previous code is that they
were ignored before, so ISTM that we are stuck with keeping the same
unfortunate behavior.


I don't agree.


Attached B patch does turn incorrect setrandom syntax into errors instead 
of ignoring extra parameters.


First, the A patch is repeated to help with commitfest references.
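
For reference, here is a standalone sketch of the truncated Box-Muller
transform used by the A patch below. This is only an illustration, not the
patch itself: drand48() stands in for pg_erand48(), and it is meant to be
compiled on its own (with -lm) outside of pgbench.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* gaussian variate truncated to [-threshold, threshold), mapped onto [min, max] */
static long
gaussian_rand(long min, long max, double threshold)
{
	double		stdev;

	do
	{
		/* Box-Muller needs two uniform variates in (0, 1] */
		double		u1 = 1.0 - drand48();
		double		u2 = 1.0 - drand48();

		stdev = sqrt(-2.0 * log(u1)) * sin(2.0 * M_PI * u2);
	} while (stdev < -threshold || stdev >= threshold);

	/* normalize [-threshold, threshold) to [0, 1), then scale to [min, max] */
	return min + (long) ((max - min + 1) * (stdev + threshold) / (2.0 * threshold));
}

int
main(void)
{
	int			i;

	for (i = 0; i < 5; i++)
		printf("%ld\n", gaussian_rand(1, 100, 5.0));
	return 0;
}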

--
Fabien.

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..e07206a 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	Assert(threshold > 0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if threshold > 0),
+	 * rand in [0, 1)
+	 */
+	Assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold <= stdev < threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) >= 2 <=> r <= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have an 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test. To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -threshold || stdev >= threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + threshold) / (threshold * 2.0);
+
+	/* return int64 random number between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1391,7 @@ top:
 			char	   *var;
 			int64		min,
 		max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1437,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max >= min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * Generate random number functions need to be able to subtract
+			 * max from min and add one to the result without overflowing.
+			 * Since we know max >= min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the 

Re: [HACKERS] delta relations in AFTER triggers

2014-07-29 Thread Pavel Stehule
2014-07-29 9:41 GMT+02:00 Marti Raudsepp ma...@juffo.org:

 On Tue, Jul 29, 2014 at 9:49 AM, Pavel Stehule pavel.steh...@gmail.com
 wrote:
  I dislike this proposal - it is strongly inconsistent with current
 trigger
  design

 The real point I was trying to convey (in my previous email) is that
 these declarations should be part of the trigger *function* not the
 function-to-table relationship. CREATE TRIGGER shouldn't be in the
 business of declaring new local variables for the trigger function.
 Whether we define new syntax for that or re-use the argument list is
 secondary.

 But the inconsistency is deliberate, I find the current trigger API
 horrible. Magic variables... Text-only TG_ARGV for arguments...
 RETURNS trigger...



I don't like the RETURNS TRIGGER notation too much either - RETURNS void or
RETURNS record would be much more natural.

My dream is something like CREATE OR REPLACE TRIGGER FUNCTION trg() RETURNS
RECORD

but it is only syntactic sugar - and I don't see any benefit that would
justify implementing it.


 No way to invoke trigger functions directly for
 testing.


It is a horrible idea. I can agree it is a limitation - but not too hard a one -
there is a simple possibility of moving the code from the trigger into an
auxiliary function. But the current design is simple and robust, with few
possible user errors.



 By not imitating past mistakes, maybe we can eventually arrive at a
 language that makes sense.


Sorry, I disagree. Whether this API is bad enough to warrant a redesign is
subjective. More objective arguments - there is no performance issue and no
security issue. I think it makes sense, so I don't see a reason to change it,
nor why we should have two APIs. Last argument: if we change
something, then we should use ANSI SQL syntax everywhere it is
possible (when we don't get any new special functionality).

Regards

Pavel



 Regards,
 Marti



Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-29 Thread Fabien COELHO


Attached B patch does turn incorrect setrandom syntax into errors instead of 
ignoring extra parameters.


First, the A patch is repeated to help with commitfest references.


Oops, I applied the change to the wrong part :-(

Here is the change on part A which checks setrandom syntax, and B for 
completeness.
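
As with the previous message, here is a standalone sketch, this time of the
exponential transform from part A. Again this is only an illustration under
the same assumptions (drand48() instead of pg_erand48(), compiled on its own
with -lm):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* exponential variate cut off at the threshold, mapped onto [min, max] */
static long
exponential_rand(long min, long max, double threshold)
{
	double		cut = exp(-threshold);
	/* drand48() is in [0, 1), so uniform is in (0, 1] */
	double		uniform = 1.0 - drand48();
	/* inner expression is in (cut, 1], so r is in [0, 1) */
	double		r = -log(cut + (1.0 - cut) * uniform) / threshold;

	return min + (long) ((max - min + 1) * r);
}

int
main(void)
{
	int			i;

	for (i = 0; i < 5; i++)
		printf("%ld\n", exponential_rand(1, 100, 5.0));
	return 0;
}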


--
Fabien.

diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test < test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test < test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..16e44bd 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	Assert(threshold > 0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread->random_state);
+	/*
+	 * inner expression in (cut, 1] (if threshold > 0),
+	 * rand in [0, 1)
+	 */
+	Assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold <= stdev < threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) >= 2 <=> r <= e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have an 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread->random_state);
+		double rand2 = 1.0 - pg_erand48(thread->random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test. To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev < -threshold || stdev >= threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + threshold) / (threshold * 2.0);
+
+	/* return int64 random number between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1391,7 @@ top:
 			char	   *var;
 			int64		min,
 		max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1437,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max >= min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * Generate random number functions need to be able to subtract
+			 * max from min and add one to the result without overflowing.
+			 * Since we know max >= min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the result doesn't overflow either.
 			 */
 			if (max - min < 0 || (max - min) + 1 < 0)
 			{
@@ -1377,10 +1450,64 @@ top:
 return true;
 			}
 
+			if (argc == 4 || /* uniform without or with "uniform" keyword */
+				(argc == 5 && pg_strcasecmp(argv[4], "uniform") == 0))
+			

Re: [HACKERS] Is analyze_new_cluster.sh still useful?

2014-07-29 Thread Bruce Momjian
On Wed, Jun 18, 2014 at 05:41:06PM +0200, Christoph Berg wrote:
 Hi,
 
 now that we have vacuumdb --all --analyze-in-stages in 9.4, wouldn't
 it make sense to get rid of the analyze_new_cluster.sh file which
 pg_upgrade writes? The net content is a single line which could as
 well be printed by pg_upgrade itself. Instead of an lengthy
 explanation how to invoke that manually, there should be a short note
 and a pointer to some manual section. I think the chances of people
 reading that would even be increased.

I was not a big fan of keeping analyze_new_cluster.sh with one command
in it, but it does maintain the same user API, so I guess that is why
people wanted it kept.

 Similary, I don't really see the usefulness of delete_old_cluster.sh
 as a file, when rm -rf could just be presented on the console for
 the admin to execute by cut-and-paste.

Uh, that could be hard because delete_old_cluster.sh can also
delete the old major-version-specific subdirectories in tablespaces, so
I do think we need to keep that.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Is analyze_new_cluster.sh still useful?

2014-07-29 Thread Bruce Momjian
On Fri, Jun 20, 2014 at 05:15:05PM +0200, Christoph Berg wrote:
 Another nitpick here: What pg_upgrade outputs doesn't even work on
 most systems, you need to ./analyze_new_cluster.sh or sh
 analyze_new_cluster.sh.

Well, the output is:

Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:
analyze_new_cluster.sh

Running this script will delete the old cluster's data files:
delete_old_cluster.sh

It is not really telling you _how_ to run them.  Would adding a ./
prefix help?

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_receivexlog add synchronous mode

2014-07-29 Thread furuyao
I have improved the patch by making the following changes:

1. Since stream_stop() was redundant, the stream_stop() call at the time of
WAL file closing was deleted.

2. Changed the flush-check timing for the readability of the source code.
   The flush check is now performed before the feedback-message decision of the
   continue statement after execution, instead of right after a message is
   received.

Regards,

--
Furuya Osamu


pg_receivexlog-add-fsync-interval-v4.patch
Description: pg_receivexlog-add-fsync-interval-v4.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Production block comparison facility

2014-07-29 Thread Heikki Linnakangas

On 07/23/2014 05:14 PM, Michael Paquier wrote:

On Tue, Jul 22, 2014 at 4:49 PM, Michael Paquier michael.paqu...@gmail.com
wrote:


Then, looking at the code, we would need to tweak XLogInsert for the
WAL record construction to always do a FPW and to update
XLogCheckBufferNeedsBackup. Then for the redo part, we would need to
do some extra operations in the area of
RestoreBackupBlock/RestoreBackupBlockContents, including masking
operations before comparing the content of the FPW and the current
page.

Does that sound right?



I have spent some time digging more into this idea and finished with the
patch attached, doing the following: addition of a consistency check when
FPW is restored and applied on a given page.

The consistency check is made of two phases:
- Apply a mask on the FPW and the current page to eliminate potential
conflicts like hint bits for example.
- Check that the FPW is consistent with the current page, aka the current
page does not contain any new information that the FPW taken has not. This
is done by checking the masked portions of the FPW and the current page.
Also some more details:
- If an inconsistency is found, a WARNING is simply logged.
- The consistency check is done if current page is not empty, and if
database has reached a consistent state.
- The page masking API is taken from the WAL replay patch that was
submitted in CF1 and plugged in as an independent set of API.
- In the masking code, to make it possible to check whether a page is used by
a sequence relation, SEQ_MAGIC as well as its opaque data structure are
renamed and moved into sequence.h.
- To facilitate debugging and comparison, the masked FPW and current page
are also converted into hex.
Things could be refactored and improved for sure, but this patch is already
useful as-is so I am going to add it to the next commit fest.


I don't understand how this works. A full-page image contains the new 
page contents *after* the WAL-logged operation. For example, in a heap 
insert, the full-page image contains the new tuple. How can you compare 
that with what's on the disk already?


ISTM you'd need to log two full-page images for every WAL record. A 
before image and an after image. Then you could do a lot of checking:


1. the before image should match what's on disk already
2. the result after applying the WAL record should match the after image.

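To make point 1 concrete, here is a minimal standalone sketch of the kind of
masked comparison it would need. The names and the mask callback are
hypothetical, not the actual patch API; a real mask would hide hint bits and
other fields that may legitimately differ.

#include <stdbool.h>
#include <string.h>

#define BLCKSZ 8192

typedef void (*page_mask_fn) (char *page);

/* trivial mask for the example: hide nothing */
static void
noop_mask(char *page)
{
	(void) page;
}

/*
 * Compare a "before" full-page image with the current on-disk page, after
 * masking private copies of both so that fields allowed to differ are ignored.
 */
static bool
masked_pages_equal(const char *fpw_image, const char *disk_page,
				   page_mask_fn mask)
{
	char		a[BLCKSZ];
	char		b[BLCKSZ];

	memcpy(a, fpw_image, BLCKSZ);
	memcpy(b, disk_page, BLCKSZ);
	mask(a);
	mask(b);
	return memcmp(a, b, BLCKSZ) == 0;
}

int
main(void)
{
	static char before[BLCKSZ];
	static char ondisk[BLCKSZ];

	return masked_pages_equal(before, ondisk, noop_mask) ? 0 : 1;
}
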
That would be more handy than the approach I used, where the page images 
are logged to a separate file. You wouldn't need to deal with any new 
files, as all the data is in the WAL. Verification would be done 
directly in the standby, with no need to run any extra programs.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-07-29 Thread Heikki Linnakangas

On 07/10/2014 12:41 AM, Alvaro Herrera wrote:

Heikki Linnakangas wrote:

On 06/23/2014 08:07 PM, Alvaro Herrera wrote:



I feel that the below would nevertheless be simpler:


I wonder if it would be simpler to just always store the revmap
pages in the beginning of the index, before any other pages. Finding
the revmap page would then be just as easy as with a separate fork.
When the table/index is extended so that a new revmap page is
needed, move the existing page at that block out of the way. Locking
needs some consideration, but I think it would be feasible and
simpler than you have now.


Moving index items around is not easy, because you'd have to adjust the
revmap to rewrite the item pointers.


Hmm. Two alternative schemes come to mind:

1. Move each index tuple off the page individually, updating the
revmap while you do it, until the page is empty. Updating the revmap
for a single index tuple isn't difficult; you have to do it anyway
when an index tuple is replaced. (MMTuples don't contain a heap
block number ATM, but IMHO they should, see below)

2. Store the new block number of the page that you moved out of the
way in the revmap page, and leave the revmap pointers unchanged. The
revmap pointers can be updated later, lazily.

Both of those seem pretty straightforward.


The trouble I have with moving blocks around to make space, is that it
would cause the index to have periodic hiccups to make room for the new
revmap pages.  One nice property that these indexes are supposed to have
is that the effect into insertion times should be pretty minimal.  That
would cease to be the case if we have to do your proposed block moves.


Approach 2 above is fairly quick, quick enough that no-one would notice 
the hiccup. Moving the tuples individually (approach 1) would be slower.



ISTM that when the old tuple cannot be updated in-place, the new
index tuple is inserted with mm_doinsert(), but the old tuple is
never deleted.


It's deleted by the next vacuum.


Ah I see. Vacuum reads the whole index, and builds an in-memory hash
table that contains an ItemPointerData for every tuple in the index.
Doesn't that require a lot of memory, for a large index? That might
be acceptable - you ought to have plenty of RAM if you're pushing
around multi-terabyte tables - but it would nevertheless be nice to
not have a hard requirement for something as essential as vacuum.


I guess if you're expecting that pages_per_range=1 is a common case,
yeah it might become an issue eventually.


Not sure, but I find it easier to think of the patch that way. In any 
case, it would be nice to avoid the problem, even if it's not common.



One idea I just had is to
have a bit for each index tuple, which is set whenever the revmap no
longer points to it.  That way, vacuuming is much easier: just scan the
index and delete all tuples having that bit set.


The bit needs to be set atomically with the insertion of the new tuple, 
so why not just remove the old tuple right away?



Wouldn't it be simpler to remove the old tuple atomically with
inserting the new tuple and updating the revmap? Or at least mark
the old tuple as deletable, so that vacuum can just delete it,
without building the large hash table to determine that it's
deletable.


Yes, it might be simpler, but it'd require dirtying more pages on
insertions (and holding more page-level locks for longer, which is not good
for concurrent access).


I wouldn't worry much about the performance and concurrency of this 
operation. Remember that the majority of updates are expected to not 
have to update the index, otherwise the minmax index will degenerate 
quickly and performance will suck anyway. And even when updating the 
index is needed, in most cases the new tuple fits on the same page, 
after removing the old one. So the case where you have to insert a new 
index tuple, remove old one (or mark it dead), and update the revmap to 
point to the new tuple, is rare.



I'm quite surprised by the use of LockTuple on the index tuples. I
think the main reason for needing that is the fact that MMTuple
doesn't store the heap (range) block number that the tuple points
to: LockTuple is required to ensure that the tuple doesn't go away
while a scan is following a pointer from the revmap to it. If the
MMTuple contained the BlockNumber, a scan could check that and go
back to the revmap if it doesn't match. Alternatively, you could
keep the revmap page locked when you follow a pointer to the regular
index page.


There's the intention that these accesses be kept as concurrent as
possible; this is why we don't want to block the whole page.  Locking
individual TIDs is fine in this case (which is not in SELECT FOR UPDATE)
because we can only lock a single tuple in any one index scan, so
there's no unbounded growth of the lock table.

I prefer not to have BlockNumbers in index tuples, because that would
make them larger for not much gain.  That data would mostly be
redundant, and would be necessary 

Re: [HACKERS] postgresql.auto.conf and reload

2014-07-29 Thread Amit Kapila
On Mon, Jul 28, 2014 at 11:33 PM, Fujii Masao masao.fu...@gmail.com wrote:
 There is another side effect of this patch. With the patch, when reloading
 the configuration file, the server cannot warn about an invalid setting value
 if it's not the last setting of the parameter. This may cause problematic
 situation as follows.

 1. ALTER SYSTEM SET work_mem TO '1024kB';
 2. Reload the configuration file --- success
 3. Then, a user accidentally adds work_mem = '2048KB' into
postgresql.conf
  The setting value '2048KB' is invalid, and the unit should be
 'kB' instead of 'KB'.
 4. Reload the configuration file --- success
  The invalid setting is ignored because the setting of work_mem in
  postgresql.auto.conf is preferred. So a user cannot notice that
 postgresql.conf
  has an invalid setting.
 5. Failover on shared-disk HA configuration happens, then PostgreSQL
fails to
 start up because of such an invalid setting. When PostgreSQL
 starts up, the last
 setting is preferred. But all the settings are checked.

The reason is that during startup, DataDir is not set by the time the server
processes the config file, which is why we process the .auto.conf file in a
second pass.  I think ideally it should ignore the invalid setting in such a
case, as it does during reload; however, there is currently no good way to
process .auto.conf when DataDir is not set, so we handle it separately in a
second pass.

 Can we live with this issue?

If you are worried about the reload success case, it will fail later anyway
if the user actually wants to set it via postgresql.conf, because in such a
case the user has to remove the setting set by ALTER SYSTEM using
ALTER SYSTEM param_name TO DEFAULT.  In case he doesn't have any
such intention, then I think ignoring such invalid params is not a
concerning issue.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Use unique index for longer pathkeys.

2014-07-29 Thread Amit Kapila
On Mon, Jul 28, 2014 at 3:17 PM, Kyotaro HORIGUCHI 
horiguchi.kyot...@lab.ntt.co.jp wrote:

  Now drop the i_t1_pkey_1 and check the query plan again.
 
  drop index i_t1_pkey_1;
  explain (costs off, analyze off) select * from t,t1 where t.a=t1.a
order by
  t1.a,t1.b,t1.c,t1.d;
 QUERY PLAN
  
   Sort
     Sort Key: t.a, t1.b, t1.c, t1.d
     ->  Merge Join
           Merge Cond: (t.a = t1.a)
           ->  Index Scan using i_t_pkey on t
           ->  Index Scan using i_t1_pkey_2 on t1
  (6 rows)
 
  Can't above plan eliminate Sort Key even after dropping index
  (i_t1_pkey_1)?

 My patch doesn't, since there is no longer a 'common primary
 pathkey' in this query. Perhaps the query doesn't allow the sort to be
 eliminated. Since a is no longer a pkey, t1 can have duplicate rows
 for the same a, so the joined relation may also have duplicate
 values in the column a.

I think irrespective of that we can trim t1.c and t1.d, as we have a
primary key (unique and non-null) on t1.a, t1.b.  Basically,
even if we don't use the primary key index, we can still trim
the keys in such a case.  IIUC, Tom has mentioned the
same in his message related to this issue.

I am referring to below text:

If we have ORDER BY a, b, c and (a,b) is the
primary key, then including c in the ORDER BY list is semantically
redundant, *whether or not we use an indexscan on the pkey index at all*.

http://www.postgresql.org/message-id/5212.1397599...@sss.pgh.pa.us


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Production block comparison facility

2014-07-29 Thread Michael Paquier
On Tue, Jul 29, 2014 at 7:30 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 I don't understand how this works. A full-page image contains the new page
 contents *after* the WAL-logged operation. For example, in a heap insert,
 the full-page image contains the new tuple. How can you compare that with
 what's on the disk already?

An exact match of the FPW and the current page is not done; the patch
as it stands now checks whether a FPW is consistent with the content of
the current page, by checking that the current page does not include
changes that diverge from what the FPW has.
For example for a heap insert, if current page has N records
pointer1/tup1..pointerN/tupN, FPW should only contain (N+1) records
pointer1/tup1..pointer(N+1)/tup(N+1). After applying the mask at block
recovery, process simply checks that the FPW and current page contain
the first N records, marking FPW and current page as inconsistent if
the current page has some garbage like some extra tuple entries not in
the FPW. I am sure you have arguments against that though...
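
Roughly, the rule described above could be sketched like this (with a
hypothetical record type just for illustration, not the real page layout used
by the patch):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct Record
{
	size_t		len;
	const char *data;
} Record;

/*
 * The current page is consistent with the FPW if every record already on the
 * page appears unchanged, at the same position, in the FPW; the FPW may add
 * newer records on top of those.
 */
static bool
page_consistent_with_fpw(const Record *page, size_t npage,
						 const Record *fpw, size_t nfpw)
{
	size_t		i;

	if (npage > nfpw)
		return false;			/* the page has records the FPW does not */

	for (i = 0; i < npage; i++)
	{
		if (page[i].len != fpw[i].len ||
			memcmp(page[i].data, fpw[i].data, page[i].len) != 0)
			return false;
	}
	return true;
}

int
main(void)
{
	Record		shared = {5, "tuple"};
	Record		page[1];
	Record		fpw[2];

	page[0] = shared;
	fpw[0] = shared;
	fpw[1].len = 3;
	fpw[1].data = "new";

	/* FPW contains the page's single record plus one newer record */
	return page_consistent_with_fpw(page, 1, fpw, 2) ? 0 : 1;
}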

 ISTM you'd need to log two full-page images for every WAL record. A before
 image and an after image.
The after image is the current FPW, so there is nothing else to do for
it. But for the before buffer, what do you think about using
ReadBufferExtended with RBM_NORMAL? We could grab its content from
disk in XLogInsert only when we are sure that a backup block is added.

 Then you could do a lot of checking:
 1. the before image should match what's on disk already
 2. the result after applying the WAL record should match the after image.
A WAL record can contain up to XLR_MAX_BKP_BLOCKS backup blocks.
Should we double it from 4 to 8?

 That would be more handy than the approach I used, where the page images are
 logged to a separate file. You wouldn't need to deal with any new files, as
 all the data is in the WAL. Verification would be done directly in the
 standby, with no need to run any extra programs.
In this case, would it be better to control that with a GUC? Making that
the default would increase the amount of WAL for all types of
applications, except if coupled with FPW compression...
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2014-07-29 Thread Bruce Momjian
On Thu, Jun 26, 2014 at 09:59:59AM -0400, Stephen Frost wrote:
 Simon,
 
 * Simon Riggs (si...@2ndquadrant.com) wrote:
  Which tables are audited would be available via the reloptions
  field. 
 
 RLS could be implemented through reloptions too.  Would it be useful to
 some people?  Likely.  Would it satisfy the users who, today, are
 actually asking for that feature?  No (or at least, not the ones that
 I've talked with).  We could expand quite a few things to work through
 reloptions but I don't see it as a particularly good design for complex
 subsystems, of which auditing is absolutely one of those.

I saw many mentions of pg_upgrade in this old thread.  I think the focus
should be in pg_dump, which is where the SQL syntax or custom reloptions
would be dumped and restored.  pg_upgrade will just use that
functionality.  In summary, I don't think there is anything
pg_upgrade-specific here, but rather the issue of how this information will
be dumped and restored, regardless of whether a major upgrade is taking
place.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2014-07-29 Thread Bruce Momjian
On Tue, Jul 29, 2014 at 09:08:38AM -0400, Bruce Momjian wrote:
 On Thu, Jun 26, 2014 at 09:59:59AM -0400, Stephen Frost wrote:
  Simon,
  
  * Simon Riggs (si...@2ndquadrant.com) wrote:
   Which tables are audited would be available via the reloptions
   field. 
  
  RLS could be implemented through reloptions too.  Would it be useful to
  some people?  Likely.  Would it satisfy the users who, today, are
  actually asking for that feature?  No (or at least, not the ones that
  I've talked with).  We could expand quite a few things to work through
  reloptions but I don't see it as a particularly good design for complex
  subsystems, of which auditing is absolutely one of those.
 
 I saw many mentions of pg_upgrade in this old thread.  I think the focus
 should be in pg_dump, which is where the SQL syntax or custom reloptions
 would be dumped and restored.  pg_upgrade will just use that
 functionality.  In summary, I don't think there is anything
 pg_upgrade-specific here, but rather the issue of how this information will
 be dumped and restored, regardless of whether a major upgrade is taking
 place.

Actually, thinking more, Stephen Frost mentioned that the auditing
system has to modify database _state_, and dumping/restoring the state
of an extension might be tricky.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New developer TODO suggestions

2014-07-29 Thread Bruce Momjian
On Tue, Jun 24, 2014 at 10:58:54AM +0800, Craig Ringer wrote:
 Hi all
 
 Someone recently mentioned that there's no generate_series(numeric,
 numeric, numeric) .
 
 That strikes me as a great candidate for a
 new-developer-learning-PostgreSQL TODO.
 
 
 A couple of other things I occasionally run into that'd fit the bill:
 
 * A user-level elog(...) / ereport(...) function callable from SQL.
 Useful in CASE statements.
 
 * A log_ option to log whenever pg switches to a new xlog segment.

The above seem good.

 
 * A 'hex' option to 'decode' that decodes regular hex into bytea, or an
 equivalent decode_hex / hex_decode . That's for plain undecorated hex,
 not \x literals.
 
 * A corresponding encode_hex or hex_encode to emit hex 'text' without \x
 prefix (not a bytea literal)
 
 (Yes, I know you can form bytea literals with concatenation and decode
 that way, and can strip the \x prefix from a literal on output, but it's
 often pretty awkward).

Uh, don't our SQL string functions allow removal of \x?

 * A user-accessible function to decode unicode escapes like \U1011 in
 strings.

Makes sense.

 * A function that converts a json array to a PostgreSQL array of a given
 type if all json members are compatible with the type
 
 * Expanding the set of json/jsonb operations to introduce features that
 people are used to from jquery, mongo, etc.
 Replace-key-if-exists-without-adding, add-or-replace-key, etc.
 
 * (not really Pg proper, but enough users run into this that I think we
 should encourage interested people to tackle it): In PgAdmin-III either
 support \copy, \c, etc or detect their use and emit an informative error
 telling the user to use 'psql'.

I think you have to ask Andrew on these.

 * When a user tries to run psql -f some_custom_format_backup, detect
 this and emit a useful error message. Ditto stdin.

Uh, good idea, but can we really do that in psql?
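
A hypothetical sketch of what the detection could look like, since a
custom-format dump starts with the "PGDMP" magic string (the function name is
invented, and the non-seekable stdin case would still need separate handling):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* peek at the first bytes of the file and put the stream back */
static bool
looks_like_custom_dump(FILE *f)
{
	char		buf[5];
	bool		is_dump = false;

	if (fread(buf, 1, sizeof(buf), f) == sizeof(buf))
		is_dump = (memcmp(buf, "PGDMP", sizeof(buf)) == 0);
	rewind(f);
	return is_dump;
}

int
main(int argc, char **argv)
{
	FILE	   *f;

	if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
		return 1;
	if (looks_like_custom_dump(f))
		fprintf(stderr, "this looks like a pg_dump custom-format archive; use pg_restore\n");
	fclose(f);
	return 0;
}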

 * Add a built-in aggregate for array_agg(anyarray), i.e. build an array
 of dims n+1 from the input arrays of dims n. For n=1 this can be done
 with a simple SQL level aggregate definition, so all it really needs is
 to error on dims > 1 IMO.
 
 * Add a built-in aggregate form of array_cat
 
 ... probably other things I'm forgetting.

No idea on these.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-07-29 Thread Marco Nenciarini
On 25/07/14 16:15, Michael Paquier wrote:
 On Fri, Jul 25, 2014 at 10:14 PM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 0. Introduction:
 =
 This is a proposal for adding incremental backup support to streaming
 protocol and hence to pg_basebackup command.
 Not sure that incremental is a right word as the existing backup
 methods using WAL archives are already like that. I recall others
 calling that differential backup from some previous threads. Would
 that sound better?
 

differential backup is widely used to refer to a backup that is always
based on a full backup. An incremental backup can be based either on
a full backup or on a previous incremental backup. We picked that
name to emphasize this property.

 1. Proposal
 =
 Our proposal is to introduce the concept of a backup profile.
 Sounds good. Thanks for looking at that.
 
 The backup
 profile consists of a file with one line per file detailing tablespace,
 path, modification time, size and checksum.
 Using that file the BASE_BACKUP command can decide which file needs to
 be sent again and which is not changed. The algorithm should be very
 similar to rsync, but since our files are never bigger than 1 GB per
 file that is probably granular enough not to worry about copying parts
 of files, just whole files.
 There are actually two levels of differential backups: file-level,
 which is the approach you are taking, and block level. Block level
 backup makes necessary a scan of all the blocks of all the relations
 and take only the data from the blocks newer than the LSN given by the
 BASE_BACKUP command. In the case of file-level approach, you could
 already backup the relation file after finding at least one block
 already modified.

I like the idea of shortcutting the checksum when you find a block with
a LSN newer than the previous backup START WAL LOCATION, however I see
it as a further optimization. In any case, it is worth storing the
backup start LSN in the header section of the backup_profile together
with other useful information about the backup starting position.

As a first step we would have a simple and robust method to produce a
file-level incremental backup.

 Btw, the size of relation files depends on the size
 defined by --with-segsize when running configure. 1GB is the default
 though, and the value usually used. Differential backups can reduce
 the size of overall backups depending on the application, at the cost
 of some CPU to analyze the relation blocks that need to be included in
 the backup.

We tested the idea on several multi-terabyte installations using a
custom deduplication script which follows this approach. The result is
that it can reduce the backup size by more than 50%. Also, most
databases in the 50GB - 1TB range can take big advantage of it.

 
 It could also be used in 'refresh' mode, by allowing the pg_basebackup
 command to 'refresh' an old backup directory with a new backup.
 I am not sure this is really helpful...

Could you please elaborate on the last sentence?

 
 The final piece of this architecture is a new program called
 pg_restorebackup which is able to operate on a chain of incremental
 backups, allowing the user to build an usable PGDATA from them or
 executing maintenance operations like verify the checksums or estimate
 the final size of recovered PGDATA.
 Yes, right. Taking a differential backup is not difficult, but
 rebuilding a consistent base backup from a full base backup and a set
 of differential ones is the tricky part, and you need to be sure that
 all the pieces of the puzzle are here.

If we limit it to be file-based, the recovery procedure is conceptually
simple. Read every involved manifest from the start and take the latest
available version of any file (or mark it for deletion, if the last time
it is named is in a backup_exceptions file). Keeping the algorithm as
simple as possible is in our opinion the best way to go.

 
 We created a wiki page with all implementation details at
 https://wiki.postgresql.org/wiki/Incremental_backup
 I had a look at that, and I think that you are missing the shot in the
 way differential backups should be taken. What would be necessary is
 to pass a WAL position (or LSN, logical sequence number like
 0/260) with a new clause called DIFFERENTIAL (INCREMENTAL in your
 first proposal) in the BASE BACKUP command, and then have the server
 report back to client all the files that contain blocks newer than the
 given LSN position given for file-level backup, or the blocks newer
 than the given LSN for the block-level differential backup.

In our proposal a file is skipped if, and only if, it has the same
size, the same mtime and *the same checksum* as the original file. We
intentionally want to keep it simple, easily supporting also files that
are stored in $PGDATA but don't follow any format known by Postgres.
However, even with more complex algorithms, all the 

Re: [HACKERS] Proposal: Incremental Backup

2014-07-29 Thread Marco Nenciarini
On 25/07/14 20:21, Claudio Freire wrote:
 On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 1. Proposal
 =
 Our proposal is to introduce the concept of a backup profile. The backup
 profile consists of a file with one line per file detailing tablespace,
 path, modification time, size and checksum.
 Using that file the BASE_BACKUP command can decide which file needs to
 be sent again and which is not changed. The algorithm should be very
 similar to rsync, but since our files are never bigger than 1 GB per
 file that is probably granular enough not to worry about copying parts
 of files, just whole files.
 
 That wouldn't be nearly as useful as the LSN-based approach mentioned before.
 
 I've had my share of rsyncing live databases (when resizing
 filesystems, not for backup, but the anecdotal evidence applies
 anyhow) and with moderately write-heavy databases, even if you only
 modify a tiny portion of the records, you end up modifying a huge
 portion of the segments, because the free space choice is random.
 
 There have been patches going around to change the random nature of
 that choice, but none are very likely to make a huge difference for
 this application. In essence, file-level comparisons get you only a
 mild speed-up, and are not worth the effort.
 
 I'd go for the hybrid file+lsn method, or nothing. The hybrid avoids
 the I/O of inspecting the LSN of entire segments (necessary
 optimization for huge multi-TB databases) and backups only the
 portions modified when segments do contain changes, so it's the best
 of both worlds. Any partial implementation would either require lots
 of I/O (LSN only) or save very little (file only) unless it's an
 almost read-only database.
 

From my experience, if a database is big enough and there is any kind of
historical data in the database, the file only approach works well.
Moreover it has the advantage of being simple and easily verifiable.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Proposal: Incremental Backup

2014-07-29 Thread Claudio Freire
On Tue, Jul 29, 2014 at 1:24 PM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 1. Proposal
 =
 Our proposal is to introduce the concept of a backup profile. The backup
 profile consists of a file with one line per file detailing tablespace,
 path, modification time, size and checksum.
 Using that file the BASE_BACKUP command can decide which file needs to
 be sent again and which is not changed. The algorithm should be very
 similar to rsync, but since our files are never bigger than 1 GB per
 file that is probably granular enough not to worry about copying parts
 of files, just whole files.

 That wouldn't be nearly as useful as the LSN-based approach mentioned before.

 I've had my share of rsyncing live databases (when resizing
 filesystems, not for backup, but the anecdotal evidence applies
 anyhow) and with moderately write-heavy databases, even if you only
 modify a tiny portion of the records, you end up modifying a huge
 portion of the segments, because the free space choice is random.

 There have been patches going around to change the random nature of
 that choice, but none are very likely to make a huge difference for
 this application. In essence, file-level comparisons get you only a
 mild speed-up, and are not worth the effort.

 I'd go for the hybrid file+lsn method, or nothing. The hybrid avoids
 the I/O of inspecting the LSN of entire segments (necessary
 optimization for huge multi-TB databases) and backups only the
 portions modified when segments do contain changes, so it's the best
 of both worlds. Any partial implementation would either require lots
 of I/O (LSN only) or save very little (file only) unless it's an
 almost read-only database.


 From my experience, if a database is big enough and there is any kind of
 historical data in the database, the file only approach works well.
 Moreover it has the advantage of being simple and easily verifiable.

I don't see how that would be true if it's not full of read-only or
append-only tables.

Furthermore, even in that case, you need to have the database locked
while performing the file-level backup, and computing all the
checksums means processing the whole thing. That's a huge amount of
time to be locked for multi-TB databases, so how is that good enough?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [GENERAL] pg_dump behaves differently for different archive formats

2014-07-29 Thread Robert Haas
On Mon, Jul 28, 2014 at 10:55 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Stephen Frost sfr...@snowman.net writes:
 If we're going to change this, it seems to me that the only option would
 be to change the dump format...  Just off-the-cuff, I'm wondering if we
 could actually not change the real 'format' but simply promote each ACL
 entry (and similar cases..) to top-level objects and declare that TOC
 entries should be single statements.

 I don't think we want even more TOC entries, but it would not be
 unreasonable to insist that the statement(s) within a TOC entry be
 subdivided somehow.  Essentially the payload of a TOC entry becomes
 a list of strings rather than just one string.

 That would mean that the problem could not be fixed for existing archive
 files; but that seems OK, given the rather small number of complaints
 so far.

 If we had something like that, I'd be strongly inclined to get rid of
 the existing convention whereby comments and ACL commands are separate
 TOC entries, and make them part of the parent object's TOC entry (which'd
 mean we'd want to label the sub-strings so we can tell whether they are
 main object, comment, or ACL).  The fewer TOC entries we can have, the
 better; there is no reason why comments/ACLs should be independently
 sortable.

Maybe, but I think people will still want an option to skip restoring
them altogether (at least for ACLs).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-07-29 Thread Marco Nenciarini
On 25/07/14 20:44, Robert Haas wrote:
 On Fri, Jul 25, 2014 at 2:21 PM, Claudio Freire klaussfre...@gmail.com 
 wrote:
 On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini
 marco.nenciar...@2ndquadrant.it wrote:
 1. Proposal
 =
 Our proposal is to introduce the concept of a backup profile. The backup
 profile consists of a file with one line per file detailing tablespace,
 path, modification time, size and checksum.
 Using that file the BASE_BACKUP command can decide which file needs to
 be sent again and which is not changed. The algorithm should be very
 similar to rsync, but since our files are never bigger than 1 GB per
 file that is probably granular enough not to worry about copying parts
 of files, just whole files.

 That wouldn't be nearly as useful as the LSN-based approach mentioned before.

 I've had my share of rsyncing live databases (when resizing
 filesystems, not for backup, but the anecdotal evidence applies
 anyhow) and with moderately write-heavy databases, even if you only
 modify a tiny portion of the records, you end up modifying a huge
 portion of the segments, because the free space choice is random.

 There have been patches going around to change the random nature of
 that choice, but none are very likely to make a huge difference for
 this application. In essence, file-level comparisons get you only a
 mild speed-up, and are not worth the effort.

 I'd go for the hybrid file+lsn method, or nothing. The hybrid avoids
 the I/O of inspecting the LSN of entire segments (necessary
 optimization for huge multi-TB databases) and backups only the
 portions modified when segments do contain changes, so it's the best
 of both worlds. Any partial implementation would either require lots
 of I/O (LSN only) or save very little (file only) unless it's an
 almost read-only database.
 
 I agree with much of that.  However, I'd question whether we can
 really seriously expect to rely on file modification times for
 critical data-integrity operations.  I wouldn't like it if somebody
 ran ntpdate to fix the time while the base backup was running, and it
 set the time backward, and the next differential backup consequently
 omitted some blocks that had been modified during the base backup.
 

Our proposal doesn't rely on file modification times for data integrity.

We are using the file mtime only as a fast indication that the file has
changed, in which case we transfer it again without computing the checksum.
If timestamp and size match we rely on *checksums* to decide if it has
to be sent.
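
A minimal sketch of that decision, with an invented profile-entry type purely
for illustration (the real backup_profile line also carries tablespace and
path):

#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>

/* invented in-memory representation of one backup_profile entry */
typedef struct ProfileEntry
{
	time_t		mtime;			/* last modification time */
	off_t		size;			/* file size in bytes */
	char		checksum[65];	/* hex-encoded checksum of the file content */
} ProfileEntry;

/*
 * A differing size or mtime is only a fast "changed" hint: the file is sent
 * again without computing its checksum.  When both match, the checksum alone
 * decides whether the file can be skipped.
 */
static bool
file_must_be_sent(const ProfileEntry *prev, const ProfileEntry *cur)
{
	if (prev->size != cur->size || prev->mtime != cur->mtime)
		return true;

	return strcmp(prev->checksum, cur->checksum) != 0;
}

int
main(void)
{
	ProfileEntry a = {1000, 8192, "abc"};
	ProfileEntry b = {1000, 8192, "abc"};

	return file_must_be_sent(&a, &b) ? 1 : 0;
}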

In SMART MODE we would use the file mtime to skip the checksum check
in some cases, but it wouldn't be the default operation mode and it will
have all the necessary warnings attached. However the SMART MODE isn't
a core part of our proposal, and can be delayed until we agree on the
safest way to bring it to the end user.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Re: [GENERAL] pg_dump behaves differently for different archive formats

2014-07-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jul 28, 2014 at 10:55 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 If we had something like that, I'd be strongly inclined to get rid of
 the existing convention whereby comments and ACL commands are separate
 TOC entries, and make them part of the parent object's TOC entry (which'd
 mean we'd want to label the sub-strings so we can tell whether they are
 main object, comment, or ACL).  The fewer TOC entries we can have, the
 better; there is no reason why comments/ACLs should be independently
 sortable.

 Maybe, but I think people will still want an option to skip restoring
 them altogether (at least for ACLs).

Sure; we already have such an option, and I'm not suggesting removing
it.  The implementation would change though: it would have to look at
the individual command labels to see which part(s) of a TOC entry to
print out.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-07-29 Thread Robert Haas
On Mon, Jul 28, 2014 at 1:50 PM, Andres Freund and...@2ndquadrant.com wrote:
 Don't get me wrong, I don't object to anything in here. It's just that
 the bigger picture can help giving sensible feedback.

Right.  I did not get you wrong.  :-)

The reason I'm making a point of it is that, if somebody wants to
object to the way those facilities are designed, it'd be good to get
that out of the way now rather than waiting until 2 or 3 patch sets
from now and then saying "Uh, could you guys go back and rework all
that stuff?".  I'm not going to complain too loudly now if somebody
wants something in there done in a different way, but it's easier to
do that now while there's only pg_background sitting on top of it.

 What I'm thinking of is providing an actual API for the writes instead
 of hooking into the socket API in a couple places. I.e. have something
 like

 typedef struct DestIO DestIO;

 struct DestIO
 {
 void (*flush)(struct DestIO *io);
 int (*putbytes)(struct DestIO *io, const char *s, size_t len);
 int (*getbytes)(struct DestIO *io, const char *s, size_t len);
 ...
 }

 and do everything through it. I haven't thought much about the specific
 API we want, but abstracting the communication properly instead of
 adding hooks here and there is imo much more likely to succeed in the
 long term.

This sounds suspiciously like the DestReceiver thing we've already
got, except that the DestReceiver only applies to tuple results, not
errors and notices and so on.  I'm not totally unamenable to a bigger
refactoring here, but right now it looks to me like a solution in
search of a problem.  The hooks are simple and seem to work fine; I
don't want to add notation for its own sake.

  Also, you seem to have only touched receiving from the client, and not
  sending back to the subprocess. Is that actually sufficient? I'd expect
  that for this facility to be fully useful it'd have to be two way
  communication. But perhaps I'm overestimating what it could be used for.

 Well, the basic shm_mq infrastructure can be used to send any kind of
 messages you want between any pair of processes that care to establish
 them.  But in general I expect that data is going to flow mostly in
 one direction - the user backend will launch workers and give them an
 initial set of instructions, and then results will stream back from
 the workers to the user backend.  Other messaging topologies are
 certainly possible, and probably useful for something, but I don't
 really know exactly what those things will be yet, and I'm not sure
 the FEBE protocol will be the right tool for the job anyway.

 It's imo not particularly unreasonable to e.g. COPY to/from a bgworker. Which
 would require the ability to both read/write from the other side.

Well, that should work fine if the background worker and user backend
generate the CopyData messages via some bespoke code rather than
expecting to be able to jump into copy.c and have everything work.  If
you want that to work, why?  It doesn't make much sense for
pg_background, because I don't think it would be sensible for SELECT
pg_background_result(...) to return CopyInResponse or CopyOutResponse,
and even if it were sensible it doesn't seem useful.  And I can't
think of any other application off-hand, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making joins involving ctid work for the benefit of UPSERT

2014-07-29 Thread Robert Haas
On Mon, Jul 28, 2014 at 1:43 PM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Jul 28, 2014 at 8:37 AM, Robert Haas robertmh...@gmail.com wrote:
 AFAIUI, this is because your implementation uses lwlocks in a way that
 Andres and I both find unacceptable.

 That's not the case. My implementation uses page-level heavyweight
 locks. The nbtree AM used to use them for other stuff. Plenty of other
 systems use index level locks managed by a heavyweight lock manager.

Oh, OK.  I think I missed that development somewhere.

 Good point.  Maybe the syntax should be something like:

 UPSERT table (keycol [, keycol] ...) { VALUES (val [, val] ...) [,
 ...] | select_query }

 That would address both the concern about being able to pipe multiple
 tuples through it and the point you just raised.  We look for a row
 that matches each given tuple on the key columns; if one is found, we
 update it; if none is found, we insert.

 That basically is my design, except that (tangentially) yours risks
 bloat in exchange for not having to use a real locking mechanism, and
 has a different syntax.

I think it would be advisable to separate the syntax from the
implementation.  Presumably you can make your implementation use some
reasonable syntax we can all agree on, and conversely my proposed
syntax could be made to have a different set of semantics.  There's
some connection between the syntax and semantics, of course, but it's
not 100%.  I mention this because I was mostly concerned with getting
to a reasonable syntax proposal, not so much the implementation
details.  It may well be that your implementation details are perfect
at this point; I don't know because I haven't looked, and I'm not an
expert on that area of the code anyway.  But I have looked at your
syntax, which I wasn't altogether keen on.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making joins involving ctid work for the benefit of UPSERT

2014-07-29 Thread Peter Geoghegan
On Tue, Jul 29, 2014 at 9:56 AM, Robert Haas robertmh...@gmail.com wrote:
 I think it would be advisable to separate the syntax from the
 implementation.  Presumably you can make your implementation use some
 reasonable syntax we can all agree on, and conversely my proposed
 syntax could be made to have a different set of semantics.  There's
 some connection between the syntax and semantics, of course, but it's
 not 100%.  I mention this because I was mostly concerned with getting
 to a reasonable syntax proposal, not so much the implementation
 details.  It may well be that your implementation details are perfect
 at this point; I don't know because I haven't looked, and I'm not an
 expert on that area of the code anyway.  But I have looked at your
 syntax, which I wasn't altogether keen on.


Fair enough. I think the syntax should reflect the fact that upserts
are driven by inserts, though. Users will get into trouble with a
syntax that allows a predicate that is evaluated before any rows are
locked.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance issue in pg_dump's dependency loop searching

2014-07-29 Thread Simon Riggs
On 25 July 2014 20:47, Tom Lane t...@sss.pgh.pa.us wrote:

 Another idea would be to

...persist the optimal dump order in the database.

That way we can maintain the correct dump order each time we do DDL,
which is only a small incremental cost, no matter how many objects we
have.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance issue in pg_dump's dependency loop searching

2014-07-29 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 25 July 2014 20:47, Tom Lane t...@sss.pgh.pa.us wrote:
 Another idea would be to

 ...persist the optimal dump order in the database.

 That way we can maintain the correct dump order each time we do DDL,
 which is only a small incremental cost, no matter how many objects we
 have.

I don't see any obvious way to make it incremental; so I doubt that
it would be a small extra cost.  In any case I disagree that making DDL
slower to make pg_dump faster is a good tradeoff.  Many people seldom
or never use pg_dump.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance issue in pg_dump's dependency loop searching

2014-07-29 Thread Claudio Freire
On Tue, Jul 29, 2014 at 3:06 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On 25 July 2014 20:47, Tom Lane t...@sss.pgh.pa.us wrote:
 Another idea would be to

 ...persist the optimal dump order in the database.

 That way we can maintain the correct dump order each time we do DDL,
 which is only a small incremental cost, no matter how many objects we
 have.

 I don't see any obvious way to make it incremental; so I doubt that
 it would be a small extra cost.  In any case I disagree that making DDL
 slower to make pg_dump faster is a good tradeoff.  Many people seldom
 or never use pg_dump.

 regards, tom lane


Not to mention slowing down temp tables


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-07-29 Thread Peter Geoghegan
On Sun, Jul 27, 2014 at 12:00 AM, Peter Geoghegan p...@heroku.com wrote:
 I attach a new revision.

I think that I may have missed a trick here.

It turns out that it isn't expensive to also hash original text
values, to track their cardinality using HyperLogLog - it's hardly
measurable when done at an opportune point just after strxfrm()
expansion. I was already tracking the cardinality of poor man's
normalized keys using HyperLogLog. If I track the cardinality of both
sets (original values and normalized keys), I can compare the two when
evaluating if normalization should be aborted. This is by far the most
important consideration.

This causes the optimization to be applied when sorting millions of
tuples with only a tiny number of distinct values (like, 4 or 5),
without making bad cases that we fail to abort in a timely manner any
more likely. This is still significantly profitable - over 90% faster
in one test, because the optimistic memcmp() still allows us to avoid
any strcoll() calls. It looks about the same as using the C
collation. Not quite the huge boost we can see, but still well
worthwhile.

In general it seems well principled to have the "should I abort
normalization?" algorithm mostly weigh how effective a proxy normalized
key cardinality is for full key cardinality. If it is a good proxy
then nothing else matters. If it's not a very good proxy, that can
only be because there are many differences beyond the first 8 bytes.
Only then will we weigh the total number of distinct normalized keys,
and as the effectiveness of normalized key cardinality as a proxy for
overall cardinality falls, our requirements for the overall number of
distinct normalized keys shoots up rapidly.
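
To make that weighing concrete, here is a rough, hypothetical C sketch (the
function name, the 1000-tuple threshold, and the 0.5 / 0.05 constants are all
invented for the example; the patch's actual costing differs): abort only
when abbreviated-key cardinality is a poor proxy for full-key cardinality
and the overall number of distinct abbreviated keys is also too low.

#include <stdbool.h>

/*
 * abbrev_distinct and key_distinct are HyperLogLog estimates of distinct
 * abbreviated keys and distinct original key values seen so far;
 * memtupcount is the number of tuples copied so far.
 */
static bool
should_abort_abbreviation(double memtupcount,
						  double abbrev_distinct,
						  double key_distinct)
{
	double		proxy;

	if (memtupcount < 1000)
		return false;			/* too early to judge */

	if (key_distinct < 1.0)
		key_distinct = 1.0;

	/* How well does abbreviated-key cardinality track full-key cardinality? */
	proxy = abbrev_distinct / key_distinct;

	/* A good proxy: most distinct keys already differ in their first bytes. */
	if (proxy > 0.5)
		return false;

	/*
	 * A poor proxy: demand many distinct abbreviated keys overall, with the
	 * demand growing rapidly as proxy quality falls.
	 */
	return abbrev_distinct < memtupcount * 0.05 / (proxy * proxy);
}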

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] multixact optimization patches

2014-07-29 Thread Alvaro Herrera
I have just pushed two optimization patches for multixacts to HEAD only.
I hesitate to backpatch them to 9.3 right away, but will consider doing
so if people think differently.

I also have a patch that supposedly fixes the performance regression
reported in bug #8470, but it's considerably more obscure so I'm not
pushing it right now.  It's attached.  I'd need to spend some more time
with it to ensure it doesn't break other cases before pushing.  I know
some people are interested in this fix and thought I'd throw it out there
to gather some input.

Since it affects much of the same code as the two patches I just pushed,
using it in 9.3 requires cherry-picking those patches, but they apply
without conflicts so it should be easy.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
*** a/contrib/pgrowlocks/pgrowlocks.c
--- b/contrib/pgrowlocks/pgrowlocks.c
***
*** 136,149  pgrowlocks(PG_FUNCTION_ARGS)
  		infomask = tuple->t_data->t_infomask;
  
  		/*
! 		 * a tuple is locked if HTSU returns BeingUpdated, and if it returns
! 		 * MayBeUpdated but the Xmax is valid and pointing at us.
  		 */
! 		if (htsu == HeapTupleBeingUpdated ||
! 			(htsu == HeapTupleMayBeUpdated &&
! 			 !(infomask & HEAP_XMAX_INVALID) &&
! 			 !(infomask & HEAP_XMAX_IS_MULTI) &&
! 			 (xmax == GetCurrentTransactionIdIfAny())))
  		{
  			char	  **values;
  
--- 136,144 
  		infomask = tuple->t_data->t_infomask;
  
  		/*
! 		 * A tuple is locked if HTSU returns BeingUpdated.
  		 */
! 		if (htsu == HeapTupleBeingUpdated)
  		{
  			char	  **values;
  
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***
*** 2711,2791  l1:
  	}
  	else if (result == HeapTupleBeingUpdated && wait)
  	{
- 		TransactionId xwait;
- 		uint16		infomask;
- 
- 		/* must copy state data before unlocking buffer */
- 		xwait = HeapTupleHeaderGetRawXmax(tp.t_data);
- 		infomask = tp.t_data->t_infomask;
- 
- 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- 
  		/*
! 		 * Acquire tuple lock to establish our priority for the tuple (see
! 		 * heap_lock_tuple).  LockTuple will release us when we are
! 		 * next-in-line for the tuple.
  		 *
! 		 * If we are forced to start over below, we keep the tuple lock;
! 		 * this arranges that we stay at the head of the line while rechecking
! 		 * tuple state.
  		 */
! 		if (!have_tuple_lock)
  		{
! 			LockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
! 			have_tuple_lock = true;
  		}
  
! 		/*
! 		 * Sleep until concurrent transaction ends.  Note that we don't care
! 		 * which lock mode the locker has, because we need the strongest one.
! 		 */
  
! 		if (infomask & HEAP_XMAX_IS_MULTI)
! 		{
! 			/* wait for multixact */
! 			MultiXactIdWait((MultiXactId) xwait, MultiXactStatusUpdate, infomask,
! 			relation, &tp.t_data->t_ctid, XLTW_Delete,
! 			NULL);
! 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
  
  			/*
! 			 * If xwait had just locked the tuple then some other xact could
! 			 * update this tuple before we get to this point.  Check for xmax
! 			 * change, and start over if so.
  			 */
! 			if (xmax_infomask_changed(tp.t_data->t_infomask, infomask) ||
! !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tp.t_data),
! 	 xwait))
! goto l1;
  
  			/*
! 			 * You might think the multixact is necessarily done here, but not
! 			 * so: it could have surviving members, namely our own xact or
! 			 * other subxacts of this backend.  It is legal for us to delete
! 			 * the tuple in either case, however (the latter case is
! 			 * essentially a situation of upgrading our former shared lock to
! 			 * exclusive).  We don't bother changing the on-disk hint bits
! 			 * since we are about to overwrite the xmax altogether.
  			 */
- 		}
- 		else
- 		{
- 			/* wait for regular transaction to end */
- 			XactLockTableWait(xwait, relation, &tp.t_data->t_ctid, XLTW_Delete);
- 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
  
! 			/*
! 			 * xwait is done, but if xwait had just locked the tuple then some
! 			 * other xact could update this tuple before we get to this point.
! 			 * Check for xmax change, and start over if so.
! 			 */
! 			if (xmax_infomask_changed(tp.t_data-t_infomask, infomask) ||
! !TransactionIdEquals(HeapTupleHeaderGetRawXmax(tp.t_data),
! 	 xwait))
! goto l1;
  
! 			/* Otherwise check if it committed or aborted */
! 			UpdateXmaxHintBits(tp.t_data, buffer, xwait);
  		}
  
  		/*
--- 2711,2814 
  	}
  	else if (result == HeapTupleBeingUpdated && wait)
  	{
  		/*
! 		 * Somebody is holding a lock on the tuple, or updating it; we now
! 		 * need to sleep on that transaction before we can proceed.  However,
! 		 * if the only locker is our own transaction (or any subtransaction of
! 		 * the current top transaction), then it's not necessary to do so.
! 		 * Note we only check for this case when the locker is a single xid,
! 		 * because MultiXactIdWait 

Re: [HACKERS] Reminder: time to stand down from 8.4 maintenance

2014-07-29 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 PG 8.4.x is EOL as of last week's releases, so it's time to remove that
 branch from any auto-update scripts you might have, reconfigure buildfarm
 members that are force-building it, etc.

I've removed it from the Coverity weekly runs.

Thanks,

Stephen




Re: [HACKERS] plpgsql.consistent_into

2014-07-29 Thread Marko Tiikkaja

On 1/14/14, 6:15 PM, Tom Lane wrote:

We don't actually implement this in PG yet, except for trivial cases, but
it will certainly happen eventually. I think your sketch above deviates
unnecessarily from what the standard says for UPDATE.  In particular
I think it'd be better to write things like

(a, b) = ROW(1, 2);
(a, b, c) = (SELECT x, y, z FROM foo WHERE id = 42);

which would exactly match what you'd write in a multiple-assignment
UPDATE, and it has the same rejects-multiple-rows semantics too.


Just in case someone's interested: I won't be working on this for 9.5. 
If someone feels like picking this patch up, be my guest.



.marko


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-07-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Jul 25, 2014 at 6:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 This isn't really acceptable for production usage; if it were, we'd have
 done it already.  The POSIX APIs lack any way to tell how many processes
 are attached to a shmem segment, which is *necessary* functionality for
 us (it's a critical part of the interlock against starting multiple
 postmasters in one data directory).

 I think it would be good to spend some energy figuring out what to do
 about this.

Well, we've been around on this multiple times before, but if we have
any new ideas, sure ...

 In our last discussion on this topic, we talked about using file locks
 as a substitute for nattch.  You concluded that fcntl was totally
 broken for this purpose because of the possibility of some other piece
 of code accidentally opening and closing the lock file.[2]  lockf
 appears to have the same problem, but flock might not, at least on
 some systems.

My Linux man page for flock says

   flock()  does not lock files over NFS.  Use fcntl(2) instead: that does
   work over NFS, given a sufficiently  recent  version  of  Linux  and  a
   server which supports locking.

which seems like a showstopper problem; we might try to tell people not to
put their databases on NFS, but they're not gonna listen.  It also says

   flock()  and  fcntl(2)  locks  have different semantics with respect to
   forked processes and dup(2).  On systems that implement  flock()  using
   fcntl(2),  the  semantics  of  flock()  will  be  different  from those
   described in this manual page.

which is pretty scary if it's accurate for any still-extant platforms;
we might think we're using flock and still get fcntl behavior.  It's
also of concern that (AFAICS) flock is not in POSIX, which means we
can't even expect that platforms will agree on how it *should* behave.

I also noted that flock does not support atomic downgrade of exclusive
lock to shared lock, which seems like a problem for the lock inheritance
scheme sketched in
http://www.postgresql.org/message-id/18162.1340761...@sss.pgh.pa.us
... but OTOH, it sounds like flock locks are not only inherited through
fork() but even preserved across exec(), which would mean that we don't
need that scheme for file lock inheritance, even with EXEC_BACKEND.
Still, it's not clear to me how we could put much faith in flock.
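
For what it's worth, a hedged sketch of the flock()-based interlock under
discussion might look like the following (illustrative names only; it leans
on exactly the platform assumptions in question -- locks tied to the open
file description and inherited across fork(), plus a non-atomic downgrade):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

static int	lock_fd = -1;

static int
acquire_data_dir_lock(const char *lockpath)
{
	lock_fd = open(lockpath, O_RDWR | O_CREAT, 0600);
	if (lock_fd < 0)
		return -1;

	/* Fails while any process from a previous postmaster still holds it. */
	if (flock(lock_fd, LOCK_EX | LOCK_NB) < 0)
	{
		if (errno == EWOULDBLOCK)
			fprintf(stderr, "data directory appears to be in use\n");
		close(lock_fd);
		return -1;
	}

	/*
	 * Downgrade to a shared lock that children will keep alive via the
	 * inherited fd.  Note: the downgrade is not atomic, which is one of
	 * the gaps noted above.
	 */
	if (flock(lock_fd, LOCK_SH) < 0)
	{
		close(lock_fd);
		return -1;
	}
	return 0;					/* keep lock_fd open for the process's life */
}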

 Finally, how about named pipes? Linux says that trying to open a
 named pipe for write when there are no readers will return ENXIO, and
 attempting to write to an already-open pipe with no remaining readers
 will cause SIGPIPE.  So: create a permanent named pipe in the data
 directory that all PostgreSQL processes keep open.  When the
 postmaster starts, it opens the pipe for read, then for write, then
 closes it for read.  It then tries to write to the pipe.  If this
 fails to result in SIGPIPE, then somebody else has got the thing open;
 so the new postmaster should die at once.  But if it does get a SIGPIPE
 then there are as of that moment no other readers.

Hm.  That particular protocol is broken: two postmasters doing it at the
same time would both pass (because neither has it open for read at the
instant where they try to write).  But we could possibly frob the idea
until it works.  Bigger question is how portable is this behavior?
I see named pipes (fifos) in SUS v2, which is our usual baseline
assumption about what's portable across Unixen, so maybe it would work.
But does NFS support named pipes?
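
A hedged C sketch of the reader-detection half of that protocol
(illustrative names only): a non-blocking open for write fails with ENXIO
exactly when no process has the FIFO open for reading, which is the "no
other readers at this moment" test -- and the window between the probe and
taking the read end is precisely where the race described above lives.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int	fifo_fd = -1;

static int
claim_data_dir_fifo(const char *fifopath)
{
	int			probe;

	(void) mkfifo(fifopath, 0600);	/* ignore EEXIST; the fifo is permanent */

	probe = open(fifopath, O_WRONLY | O_NONBLOCK);
	if (probe >= 0)
	{
		/* Someone already holds the read end: another postmaster family. */
		close(probe);
		fprintf(stderr, "data directory appears to be in use\n");
		return -1;
	}
	if (errno != ENXIO)
		return -1;				/* unexpected failure */

	/* No readers right now: become the reader; children inherit this fd. */
	fifo_fd = open(fifopath, O_RDONLY | O_NONBLOCK);
	return (fifo_fd >= 0) ? 0 : -1;
}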

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-07-29 Thread Baker, Keith [OCDUS Non-JJ]
Thank you to all who have responded to this proposal.
PostgreSQL manages to meet all production requirements on Windows without 
System V shared memory, so I would think this can be achieved on QNX/Linux.

The old PostgreSQL QNX port ran on the very old QNX4 (1991), so I understand 
why it would be of little value today.
Currently, QNX Neutrino 6.5 is well established (and QNX 6.6 is emerging) and 
that is where a PostgreSQL port would be well received.

I have attached my current work-in-progress patches for 9.3.4 and 9.4beta2 for 
the curious.
To minimize risk, I have been careful to ensure my changes will take effect 
only on QNX builds; existing ports should see zero impact.
To minimize addition of new files, I have used the linux template rather than 
add qnx6 as a separate port/template.

All regression tests pass on my system, so while not perfect it is at least a 
reasonable start.
posix_shmem.c is still in need of some cleanup and mitigations to make it 
production-strength.

If there are existing tests I can run to ensure the QNX port meets your 
criteria for robust failure handling in this area I would be happy to run them.
If not, perhaps someone can provide a quick list of failure modes to consider.
As-is:
- starting of a second postmaster fails with message 'FATAL: lock file 
postmaster.pid already exists'
- Kill -9 of postmaster followed by a pg_ctl start seems to go through 
recovery, although the original shared memory segments hang out in /dev/shmem 
until reboot (that could be better).

Thanks again and please let me know if I can be of any assistance.

Keith Baker

-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Tuesday, July 29, 2014 7:06 PM
To: Robert Haas
Cc: Baker, Keith [OCDUS Non-JJ]; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

Robert Haas robertmh...@gmail.com writes:
 On Fri, Jul 25, 2014 at 6:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 This isn't really acceptable for production usage; if it were, we'd 
 have done it already.  The POSIX APIs lack any way to tell how many 
 processes are attached to a shmem segment, which is *necessary* 
 functionality for us (it's a critical part of the interlock against 
 starting multiple postmasters in one data directory).

 I think it would be good to spend some energy figuring out what to do 
 about this.

Well, we've been around on this multiple times before, but if we have any new 
ideas, sure ...

 In our last discussion on this topic, we talked about using file locks 
 as a substitute for nattch.  You concluded that fcntl was totally 
 broken for this purpose because of the possibility of some other piece 
 of code accidentally opening and closing the lock file.[2]  lockf 
 appears to have the same problem, but flock might not, at least on 
 some systems.

My Linux man page for flock says

   flock()  does not lock files over NFS.  Use fcntl(2) instead: that does
   work over NFS, given a sufficiently  recent  version  of  Linux  and  a
   server which supports locking.

which seems like a showstopper problem; we might try to tell people not to put 
their databases on NFS, but they're not gonna listen.  It also says

   flock()  and  fcntl(2)  locks  have different semantics with respect to
   forked processes and dup(2).  On systems that implement  flock()  using
   fcntl(2),  the  semantics  of  flock()  will  be  different  from those
   described in this manual page.

which is pretty scary if it's accurate for any still-extant platforms; we might 
think we're using flock and still get fcntl behavior.  It's also of concern 
that (AFAICS) flock is not in POSIX, which means we can't even expect that 
platforms will agree on how it *should* behave.

I also noted that flock does not support atomic downgrade of exclusive lock to 
shared lock, which seems like a problem for the lock inheritance scheme 
sketched in http://www.postgresql.org/message-id/18162.1340761...@sss.pgh.pa.us
... but OTOH, it sounds like flock locks are not only inherited through
fork() but even preserved across exec(), which would mean that we don't need 
that scheme for file lock inheritance, even with EXEC_BACKEND.
Still, it's not clear to me how we could put much faith in flock.

 Finally, how about named pipes? Linux says that trying to open a named 
 pipe for write when there are no readers will return ENXIO, and 
 attempting to write to an already-open pipe with no remaining readers 
 will cause SIGPIPE.  So: create a permanent named pipe in the data 
 directory that all PostgreSQL processes keep open.  When the 
 postmaster starts, it opens the pipe for read, then for write, then 
 closes it for read.  It then tries to write to the pipe.  If this 
 fails to result in SIGPIPE, then somebody else has got the thing open;
 so the new postmaster should die at once.  But if it does get a SIGPIPE
 then there are as of that moment no other readers.

Hm.  That 

Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-07-29 Thread Tom Lane
Baker, Keith [OCDUS Non-JJ] kbak...@its.jnj.com writes:
 If there are existing tests I can run to ensure the QNX port meets your 
 criteria for robust failure handling in this area I would be happy to run 
 them.
 If not, perhaps someone can provide a quick list of failure modes to consider.
 As-is:
 - starting of a second postmaster fails with message 'FATAL: lock file 
 postmaster.pid already exists'
 - Kill -9 of postmaster followed by a pg_ctl start seems to go through 
 recovery, although the original shared memory segments hang out in /dev/shmem 
 until reboot (that could be better).

Unfortunately, that probably proves it's broken rather than that it works.
The behavior we need is that after kill -9'ing the postmaster, subsequent
postmaster start attempts *fail* until all the original postmaster's child
processes are gone.  Otherwise you end up with two independent sets of
processes scribbling on the same files (and not sharing shmem either).
Kiss consistency goodbye ...

It's possible that all the children automatically exited, especially if
you had only background processes active; but if you had a live regular
session it would not exit just because the parent process died.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-07-29 Thread Michael Paquier
On Wed, Jul 30, 2014 at 1:11 AM, Marco Nenciarini
marco.nenciar...@2ndquadrant.it wrote:
 differential backup is widely used to refer to a backup that is always
 based on a full backup. An incremental backup can be based either on
 a full backup or on a previous incremental backup. We picked that
 name to emphasize this property.

You can refer to this email:
http://www.postgresql.org/message-id/cabuevexz-2nh6jxb5sjs_dss7qbmof0noypeeyaybkbufkp...@mail.gmail.com

 As a first step we would have a simple and robust method to produce a
 file-level incremental backup.
An approach using Postgres internals, which we are sure we can rely
on, is more robust. An LSN is similar to a timestamp in pg internals, as
it refers to the point in time when a block was last modified.
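
As a rough illustration of an LSN-driven approach (a C sketch with invented
names, not the proposed implementation; it assumes the standard page layout
where the page header begins with the page LSN, and it ignores complications
such as hint-bit-only changes that do not advance the LSN):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/* Page LSN is stored at the start of the page header, high word first. */
static uint64_t
page_lsn(const char *page)
{
	uint32_t	hi, lo;

	memcpy(&hi, page, sizeof(hi));
	memcpy(&lo, page + sizeof(hi), sizeof(lo));
	return ((uint64_t) hi << 32) | lo;
}

/* Report which blocks of a relation segment changed since the last backup. */
static void
scan_segment(FILE *seg, uint64_t backup_start_lsn)
{
	char		page[BLCKSZ];
	long		blkno = 0;

	while (fread(page, 1, BLCKSZ, seg) == BLCKSZ)
	{
		if (page_lsn(page) > backup_start_lsn)
			printf("block %ld changed, include it\n", blkno);
		blkno++;
	}
}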

 It could also be used in 'refresh' mode, by allowing the pg_basebackup
 command to 'refresh' an old backup directory with a new backup.
 I am not sure this is really helpful...

 Could you please elaborate the last sentence?
This overlaps with the features you are proposing with
pg_restorebackup, where a backup is rebuilt. Why implement two
interfaces for the same thing?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ALTER SYSTEM RESET?

2014-07-29 Thread Amit Kapila
On Thu, Jun 26, 2014 at 8:17 PM, Vik Fearing vik.fear...@dalibo.com wrote:
 I didn't quite follow your ALTER TABLE example because I don't think
 it's necessary,
I was asking to split the ALTER SYSTEM command the way it's done
for ALTER TABLE (AlterTableStmt:
ALTER TABLE relation_expr alter_table_cmds).
It would have made adding further commands to ALTER SYSTEM a bit
simpler and more systematic.  However, as there is no correctness issue here,
let's leave it as you have currently done in the patch.

I have verified the patch and found that it works well for
all scenarios.  A few minor suggestions:

1.
!values to the <filename>postgresql.auto.conf</filename> file.
!Setting the parameter to <literal>DEFAULT</literal>, or using the
!<command>RESET</command> variant, removes the configuration entry from

It would be better if we could add a separate line for RESET ALL,
as is done for both ALTER DATABASE and ALTER ROLE in their
respective documentation.

2.
! %type <vsetstmt> generic_set set_rest set_rest_more generic_reset
reset_rest SetResetClause FunctionSetResetClause

Good to break it into 2 lines.

3. I think we can add some text on top of the function
AlterSystemSetConfigFile() to explain its functionality w.r.t. RESET ALL.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com