[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Noah Misch
On Wed, Jun 03, 2015 at 04:53:46PM -0400, Robert Haas wrote:
> So here's a patch taking a different approach.  In this approach, if
> the multixact whose members we want to look up doesn't exist, we don't
> use a later one (that might or might not be valid).  Instead, we
> attempt to cope with the unknown.  That means:
> 
> 1. In TruncateMultiXact(), we don't truncate.

I like that change a lot.  It's much easier to seek forgiveness for wasting <=
28 GiB of disk than for deleting visibility information wrongly.

> 2. If setting the offset stop limit (the point where we refuse to
> create new multixact space), we don't arm the stop point.  This means
> that if you're in this situation, you run without member wraparound
> protection until it's corrected.  A message gets logged once per
> checkpoint telling you that you have this problem, and another message
> gets logged when things get straightened out and the guards are
> enabled.
> 
> 3. If setting the vacuum force point, we assume that it's appropriate
> to immediately force vacuum.

Those seem reasonable, too.

> I've only tested this very lightly - this is just to see what you and
> Noah and others think of the approach.  As compared with the previous
> approach, it has the advantage of making minimal assumptions about the
> sanity of what's on disk.  It has the disadvantage that, for some
> people, the member-wraparound guard won't be enabled at startup -- but
> note that those people can't start 9.3.7/9.4.2 *at all*, so currently
> they are either running without member wraparound protection anyway
> (if they haven't upgraded to those releases) or they're down entirely.

That disadvantage is negligible, considering.

> Another disadvantage is that we'll be triggering what may be quite a
> bit of autovacuum activity for some people, which could be painful.
> On the plus side, they'll hopefully end up with sane relminmxid and
> datminmxid guards afterwards.

That sounds good so long as each table requires just one successful emergency
autovacuum.  I'm not seeing code to ensure that the launched autovacuum will
indeed perform a full-table scan and update relminmxid; is it there?

For sites that can't tolerate an autovacuum storm, what alternative can we
provide?  Is "SET vacuum_multixact_freeze_table_age = 0; VACUUM " of
every table, done before applying the minor update, sufficient?
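For concreteness, a sketch of what such a pre-update pass might look like
(only a sketch: it merely generates the statements, the relkind filter is an
illustrative choice, and nothing in this thread establishes that it is
sufficient):

-- Run in each database; execute the generated statements in a session that
-- has first done: SET vacuum_multixact_freeze_table_age = 0;
SELECT format('VACUUM %I.%I;', n.nspname, c.relname)
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r';

-- Afterwards, the per-table and per-database guards can be inspected directly:
SELECT relname, relminmxid FROM pg_class WHERE relkind = 'r';
SELECT datname, datminmxid FROM pg_database;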

>  static void
> -DetermineSafeOldestOffset(MultiXactId oldestMXact)
> +DetermineSafeOldestOffset(MultiXactOffset oldestMXact)

Leftover change from an earlier iteration?  The values passed continue to be
MultiXactId values.

>   /* move back to start of the corresponding segment */
> - oldestOffset -= oldestOffset %
> - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
> + offsetStopLimit = oldestOffset - (oldestOffset %
> + (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
> + /* always leave one segment before the wraparound point */
> + offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * 
> SLRU_PAGES_PER_SEGMENT);
> +
> + /* if nothing has changed, we're done */
> + if (prevOffsetStopLimitKnown && offsetStopLimit == prevOffsetStopLimit)
> + return;
>  
>   LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
> - /* always leave one segment before the wraparound point */
> - MultiXactState->offsetStopLimit = oldestOffset -
> - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
> + MultiXactState->offsetStopLimit = oldestOffset;

That last line needs s/oldestOffset/offsetStopLimit/, I presume.

Thanks,
nm


Re: [HACKERS] Further issues with jsonb semantics, documentation

2015-06-03 Thread Peter Geoghegan
On Wed, Jun 3, 2015 at 7:02 PM, Peter Geoghegan  wrote:
> Consider this case:
>
> postgres=# select '{"c":5, "a":6, "b":7}'::jsonb - 1;
>  ?column?
> --
>  {"a": 6, "c": 5}
> (1 row)
>
> Clearly anyone expecting the value "a" to be removed here would be in
> for a surprise. Moreover, it is inconsistent with the established
> behavior of the corresponding array-wise subscript operator:
>
> postgres=# select '{"c":5, "a":6, "b":7}'::jsonb -> 1;
>  ?column?
> --
>  [null]
> (1 row)

For similar reasons, I think that this inconsistency is unacceptable:

postgres=# select '["a", "b", "c"]'::jsonb - -1;
  ?column?

 ["a", "b"]
(1 row)

postgres=# select '["a", "b", "c"]'::jsonb -> -1;
 ?column?
--
 [null]
(1 row)

jsonb now supports Python-style negative subscripting to index
backward. I think that this a fine idea. However, I also think it's a
big POLA violation that this was not done for the ordinary array
subscripting operator ("operator jsonb -> integer") at the same time
as "operator jsonb - integer" was added. Although doing this will
require a compatibility note in the 9.5 release notes, it's extremely
unlikely to destabilize anybody's app, and makes a lot of sense.
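To make the proposed change concrete, a sketch of the before/after behavior
(the "under the proposal" result is only what one would expect if "->" adopted
the same negative-subscript convention as the new minus operator; it is not
what current code does):

select '["a", "b", "c"]'::jsonb -> -1;
-- today:              NULL
-- under the proposal: "c"  (index backward from the end, as with "- -1")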
-- 
Peter Geoghegan


Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-06-03 Thread Amit Kapila
On Thu, Jun 4, 2015 at 1:52 AM, Andrew Dunstan  wrote:

>
> On 06/02/2015 11:55 PM, Amit Kapila wrote:
>
>  On Tue, Jun 2, 2015 at 10:26 PM, Andrew Dunstan  wrote:
>>
>> Well, it seems to me the new function is being altogether way too
>> trusting about the nature of what it's being asked to remove. In
>> the first place, the S_ISDIR/rmdir branch should only be for
>> Windows, and secondly in the other branch we should be checking
>> that S_ISLNK is true. It would actually be nice if we could test
>> for a junction point on Windows, but that seems to be a bit
>> difficult.
>>
>> I think during recovery for tablespace related operations, it is
>> quite possible to have a directory instead of symlink in some
>> special cases (see function TablespaceCreateDbspace() and comments
>> in destroy_tablespace_directories() { ..Try to remove the symlink..}).
>> Also this new function is being called from
>> create_tablespace_directories()
>> which uses the code as written in the new function, so it doesn't make much
>> sense to split it into Windows- and non-Windows-specific code.
>>
>
>
>
> Looking at it again, this might be not as bad as I thought, but I do think
> we should probably call the function something other than rmsymlink(). That
> seems too generic, since it also tries to remove directories - albeit that
> this will fail if the directory isn't empty. And I still think we should
> add a test for S_ISLNK in the second branch. As it stands the function
> could try to unlink anything that's not a directory. That might be safe-ish
> in the context it's used in for the tablespace code, but it's far from safe
> enough for a function that's in src/common
>
>
Okay, since we both seem to agree that it is mostly used in the
tablespace-symlink context, I have changed the name to
remove_tablespace_symlink() and moved the function to
tablespace.c.  The S_ISLNK check applies only to non-Windows code,
so I'm not sure adding it here makes any real difference now that
we have made the function specific to tablespaces, and we might need to
write a small amount of port-specific code if we want to add the S_ISLNK check.


> Given that the function raises an error on failure, I think it will
> otherwise be OK as is.
>
>
Please find an updated patch attached with this mail.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] Further issues with jsonb semantics, documentation

2015-06-03 Thread Peter Geoghegan
I've noticed some more issues with the jsonb documentation, and the
new jsonb stuff generally. I didn't set out to give Andrew feedback on
the semantics weeks after feature freeze, but unfortunately this feels
like another discussion that we need to have now rather than later.

"operator jsonb - integer"
===

Summary: I think that this operator has a problem, but a problem that
can easily be fixed.


I think it was a bad idea to allow array-style removal of object
key/value pairs. ISTM that it implies a level of stability in the
ordering that doesn't make sense. Besides, is it really all that
useful?

Consider this case:

postgres=# select '{"c":5, "a":6, "b":7}'::jsonb - 1;
 ?column?
--
 {"a": 6, "c": 5}
(1 row)

Clearly anyone expecting the value "a" to be removed here would be in
for a surprise. Moreover, it is inconsistent with the established
behavior of the corresponding array-wise subscript operator:

postgres=# select '{"c":5, "a":6, "b":7}'::jsonb -> 1;
 ?column?
--
 [null]
(1 row)

I suggest, given that this is conceptually a data-modifying operator,
that the minus operator/jsonb_delete() case raise an error rather than
matching "operator jsonb -> integer" and returning NULL. I say this as
the person who successfully argued that the -> operator case above
should return NULL during the 9.4 beta period; returning SQL NULL for
the delete/minus operator feels like going too far in the direction of
permissiveness, even for jsonb; my expression index argument does not
apply here as it did for the "operator jsonb -> integer" case.

"operator jsonb - text"


Summary: I think that this operator is fine.


Documentation needs work, though. The "operator jsonb - text" operator
ought to be documented as in the attached patch, which is closer to
the equivalent hstore operator, and emphasizes the "operator jsonb ?
text" definition of a key. It should emphasize its similarity to the
established "operator jsonb ? text" operator, and in particular that
array elements behave as keys *iff* they're strings.
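A quick illustration of that distinction (assuming the in-development operator
behaves as described here, i.e. only string elements are treated as matching
keys; the results shown are expectations, not verified output):

select '["a", 1]'::jsonb - 'a';  -- expected: [1]
select '["a", 1]'::jsonb - '1';  -- expected: ["a", 1], since 1 is a number, not a string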

"operator jsonb - text[]" (and *nested* deletion more generally)
===

Summary: I think that this operator has many problems, and should be
scrapped (although only as an operator). IMV nested deletion should
only be handled by functions, and the way that nested deletion works
in general should be slightly adjusted.


The new "operator jsonb - text[]" operator is confusingly inconsistent with:

A) "operator jsonb text"

and:

B) the established "operator hstore - text[]" operator, since that
operator deletes all key/value pairs that have keys that match any of
the right operand text array values. In contrast, this new operator is
passed as its right operand an array of text elements that constitute
a "path" (so the order in the rhs text[] operand matters). If the text
element in the rhs text[] operand happens to be what would pass for a
Postgres integer literal, it can be used to traverse lhs array values
through subscripting at that nesting level.
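For concreteness, a sketch of the path-style semantics just described (the
results are what one would expect under the behavior described in this thread;
the operator itself is still under discussion):

select '{"a": {"b": 1, "c": 2}}'::jsonb - '{a,b}'::text[];  -- expected: {"a": {"c": 2}}
select '{"a": ["x", "y", "z"]}'::jsonb - '{a,1}'::text[];   -- expected: {"a": ["x", "z"]}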

Regarding nested deletion behavior more generally, consider this
example of how this can work out badly:

postgres=# select jsonb_delete(jsonb_set('["a"]', '{5}', '"b"'), '{5}')  ;
 jsonb_delete
--
 ["a", "b"]
(1 row)

Here, we're adding and then deleting an array element at offset 5 (the
string "b"). But the element is never deleted by the outer
jsonb_delete(), because we can't rely on the element actually being
stored at offset 5. Seems a bit fragile.

More importantly, consider the inconsistency with "operator jsonb ?|
text[]" ("point A" above):

postgres=# select '["a"]'::jsonb  ?| '{a}'::text[]; -- historic/9.4 behavior
 ?column?
--
 t
(1 row)

postgres=# select '["a"]'::jsonb  - '{a}'::text[]; -- new to 9.5
operator, does not delete!
 ?column?
--
 ["a"]
(1 row)

Perhaps most questionably of all, the non-array based minus/delete
operator (which I like) *does* have the same idea of matching a key as
the established "operator jsonb ?| text[]" operator (and "operator
jsonb ? text", etc):

postgres=# select '["a"]'::jsonb  - 'a'::text; -- new to 9.5 operator,
*does* delete!
 ?column?
--
 []
(1 row)

This conceptual model for manipulating jsonb is entirely new and novel
to this new operator "operator text[]" (and jsonb_set()).

"operator jsonb - text[]" categorization/conceptual model
==

Operators like the established "operator jsonb -> integer" operator (a
jsonb "array-wise" operator) always seemed okay to me because the rhs
operand really was a Postgres integer, and because it's explicitly an
array-wise operator (just like "operator -> text" is explicitly
object-wise). But now, with these new operators, you've added a
"shadow type" system to certain rhs text[] operands, consisting of
types not explicitly delineated by JSON-style double quotes (for
strings, say). So there is kind of a second shad

Re: [HACKERS] [PATCH] Add error handling to byteaout.

2015-06-03 Thread Michael Paquier
On Thu, Jun 4, 2015 at 1:32 AM, Alvaro Herrera 
wrote:

> Andreas Seltenreich wrote:
> > Tom Lane  writes:
> >
> > > Andreas Seltenreich  writes:
> > >> The scary one is due to an integer overflow the attached patch also
> > >> fixes.
> > >
> > > s/int/Size/ doesn't fix anything on 32-bit machines.
> >
> > Well, it changes the signedness of the computation on 32-bit, and in
> > combination with the fact that "len" is always smaller than 2^32, but
> > may exceed 2^31-1, the change avoids the dependency on the undefined
> > behavior of signed integer overflows in C on 32-bit as well.
>
> Why not just use an unsigned 64 bit variable?  Also, perhaps
> palloc_huge() avoids the whole problem in the first place ... though it
> might only move the issue around, if you cannot ship the longer-than-1GB
> resulting escaped value.  (Of course, if you try to allocate 2 GB in a
> 32 bit machine, you're going to be having quite some fun ...)
>

Pure nitpicking: there is no palloc_huge, only repalloc_huge. Though we
could have one.
-- 
Michael


[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Thomas Munro
On Mon, Jun 1, 2015 at 4:55 PM, Noah Misch  wrote:
> While testing this (with inconsistent-multixact-fix-master.patch applied,
> FWIW), I noticed a nearby bug with a similar symptom.  TruncateMultiXact()
> omits the nextMXact==oldestMXact special case found in each other
> find_multixact_start() caller, so it reads the offset of a not-yet-created
> MultiXactId.  The usual outcome is to get rangeStart==0, so we truncate less
> than we could.  This can't make us truncate excessively, because
> nextMXact==oldestMXact implies no table contains any mxid.  If nextMXact
> happens to be the first of a segment, an error is possible.  Procedure:
>
> 1. Make a fresh cluster.
> 2. UPDATE pg_database SET datallowconn = true
> 3. Consume precisely 131071 mxids.  Number of offsets per mxid is unimportant.
> 4. vacuumdb --freeze --all
>
> Expected state after those steps:
> $ pg_controldata | grep NextMultiXactId
> Latest checkpoint's NextMultiXactId:  131072
>
> Checkpoint will fail like this:
> 26699 2015-05-31 17:22:33.134 GMT LOG:  statement: checkpoint
> 26661 2015-05-31 17:22:33.134 GMT DEBUG:  performing replication slot 
> checkpoint
> 26661 2015-05-31 17:22:33.136 GMT ERROR:  could not access status of 
> transaction 131072
> 26661 2015-05-31 17:22:33.136 GMT DETAIL:  Could not open file 
> "pg_multixact/offsets/0002": No such file or directory.
> 26699 2015-05-31 17:22:33.234 GMT ERROR:  checkpoint request failed
> 26699 2015-05-31 17:22:33.234 GMT HINT:  Consult recent messages in the 
> server log for details.
> 26699 2015-05-31 17:22:33.234 GMT STATEMENT:  checkpoint
>
> This does not block startup, and creating one mxid hides the problem again.
> Thus, it is not a top-priority bug like some other parts of this thread.  I
> mention it today mostly so it doesn't surprise hackers testing other fixes.

Thanks.   As mentioned elsewhere in the thread, I discovered that the
same problem exists for page boundaries, with a different error
message.  I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1
and master with the same results:

FATAL:  could not access status of transaction 2048
DETAIL:  Could not read from file "pg_multixact/offsets/" at
offset 8192: Undefined error: 0.

FATAL:  could not access status of transaction 131072
DETAIL:  Could not open file "pg_multixact/offsets/0002": No such file
or directory.

But, yeah, this isn't the bug we're looking for.

-- 
Thomas Munro
http://www.enterprisedb.com


checkpoint-page-boundary.sh
Description: Bourne shell script


checkpoint-segment-boundary.sh
Description: Bourne shell script


Re: [HACKERS] pg_xlog -> pg_xjournal?

2015-06-03 Thread Jim Nasby

On 6/2/15 4:58 PM, David Steele wrote:

On 5/31/15 1:46 PM, Tom Lane wrote:

Hm.  I think the impact on third-party backup tools would be rather bad,
but there's a simple modification of the idea that might fix that:
just always create pg_xlog as a symlink to pg_xjournal during initdb.
Anybody who blindly removes pg_xlog won't have done anything
irreversible.  We could deprecate pg_xlog and stop creating the symlink
after a few releases, once third-party tools have had a reasonable
amount of time to adjust.


As the author of a third-party backup tool I'd prefer to make a clean
break and just rename the directories in a single release.  9.5 has
similar backup/restore related changes with no nod to backwards
compatibility.

And that's fine.  Applications can iterate faster than databases and
they should.


+1. I think we're making a mountain out of a mole-hill and putting any 
possibility of improvement here at risk. (And I definitely think this 
needs improvement).



Two options to make lives easier:

1) An initdb option to create the necessary symlinks as Tom suggests,
just not by default.
2) Instructions in the release notes for users who did not see the
initdb option in the first place.


#2 seems reasonable. #1 seems like it's partway up the molemountain.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Robert Haas wrote:

> So here's a patch taking a different approach.

I tried to apply this to 9.3 but it's messy because of pgindent.  Would
anyone have a problem with me backpatching a pgindent run of multixact.c?

Also, you have a new function SlruPageExists, but we already have
SimpleLruDoesPhysicalPageExist.  No need for two copies of pretty much
the same code, I think.  Your code uses fstat() instead of
lseek(.., SEEK_END) but the exact technique used is probably not
relevant.

I think I like this approach better than your other patch, FWIW --
mainly because it seems simpler.

Will review.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] xid wrap / optimize frozen tables?

2015-06-03 Thread Jeff Janes
On May 24, 2015 6:42 AM, "Nils Goroll"  wrote:
>
> Hi Jeff and all,
>
> On 23/05/15 22:13, Jeff Janes wrote:
> > Are you sure it is the read IO that causes the problem?
>
> Yes. Trouble is here that we are talking about a 361 GB table
>
>                          List of relations
>  Schema |        Name         | Type  |  Owner   |  Size  | Description
> --------+---------------------+-------+----------+--------+-------------
>  public | *redacted*_y2015m04 | table | postgres | 361 GB |
>
> and while we have
>
> shared_buffers = 325GB
> huge_pages = on

As mentioned, that is a very large setting for shared_buffers.

>
> this is not the only table of this size (total db size is 1.8tb) and more
> current data got written to *redacted*_y2015m05 (the manually-partitioned table
> for may), so most of the m04 data would have got evicted from the cache when
> this issue surfaced initially.
>
> There is one application pushing data (bulk inserts) and we have transaction
> rates for this app in a log. The moment the vacuum started, these rates dropped.
> Unfortunately I cannot present helpful log excerpts here as the autovacuum never
> finished so far (because the admin killed the db), so we have zero logging about
> past autovac events.

Could you do an experiment in which you do a large sequential read on the
database files and measure the impact on the queries that way?  Like:

tar -cf - data_dir | wc -c

Or better, use some fancy version that throttles to the read rate observed
below.

> At the moment, the application is shut down and the machine is only running the
> vacs:
>
> query_start  | 2015-05-22 19:33:52.44334+02
> waiting      | f
> query        | autovacuum: VACUUM public.*redacted*_y2015m04 (to prevent wraparound)
> query_start  | 2015-05-22 19:34:02.46004+02
> waiting      | f
> query        | autovacuum: VACUUM ANALYZE public.*redacted*_y2015m05 (to prevent wraparound)
>
> so we know that any io must be caused by the vacs:
>
> shell# uptime
>  13:33:33 up 1 day, 18:01,  2 users,  load average: 5.75, 12.71, 8.43

What OS is this?  This load average looks very high.  Does the OS charge
processes that are blocked on IO against the load average?

> shell# zpool iostat
>                capacity     operations    bandwidth
> pool        alloc   free   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> tank1        358G  6.90T    872     55  15.1M  3.08M

I'm not familiar with zpool, but this shows a lot of writing going on.  If
the table was already frozen and just needed to be observed as being all
frozen, then it should not be dirtying roughly one block for every 5 blocks
read (the 15.1M/3.08M read/write bandwidth above is about a 5:1 ratio).

I would not be surprised if it were the reading, not the writing, which
caused the performance problem.

Cheers, Jeff


Re: [HACKERS] [PATCH] Document that directly callable functions may use fn_extra

2015-06-03 Thread Jim Nasby

On 5/29/15 10:21 AM, Peter Eisentraut wrote:

On 5/28/15 10:15 PM, Craig Ringer wrote:

I was puzzled by src/backend/utils/fmgr/README and fmgr.h's
descriptions of fcinfo->flinfo->fn_extra (FmgrInfo.fn_extra) as they
seem to conflict with actual usage.

The docs suggest that fn_extra is for the use of function call handlers,
but in practice it's also used heavily by function implementations as a
cache space.


The documentation could also be updated about this.  It mentions using
fn_extra for PL handlers, gist functions, and set-returning functions,
but it doesn't say that you can use it in any old function for anything you
want.

I'm not sure how up to date that README actually is.  It looks more like
a historical document describing the original proposal.


FWIW, I think it'd be good to also document what lifetime you can expect 
out of something in fn_extra. I realize you can figure it out from 
fn_mcxt, but I don't know that that's terribly obvious to others.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Wed, Jun 3, 2015 at 8:24 AM, Robert Haas  wrote:
> On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund  wrote:
>>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
>>> > the disk it'll always get one at a segment boundary, right? I'm not sure
>>> > that's actually ok; because the value at the beginning of the segment
>>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
>>> > filled the page with zeros.
>>> >
>>> > I think that should be harmless, the worst that can happen is that
>>> > oldestOffset erroneously is 0, which should be correct, even though
>>> > possibly overly conservative, in these cases.
>>>
>>> Uh oh.  That seems like a real bad problem for this approach.  What
>>> keeps that from being the opposite of too conservative?  There's no
>>> "safe" value in a circular numbering space.
>>
>> I think it *might* (I'm really jetlagged) be fine because that'll only
>> happen after an upgrade from < 9.3. And in that case we initialize
>> nextOffset to 0. That ought to save us?
>
> That's pretty much betting the farm on the bugs we know about today
> being the only ones there are.  That seems imprudent.

So here's a patch taking a different approach.  In this approach, if
the multixact whose members we want to look up doesn't exist, we don't
use a later one (that might or might not be valid).  Instead, we
attempt to cope with the unknown.  That means:

1. In TruncateMultiXact(), we don't truncate.

2. If setting the offset stop limit (the point where we refuse to
create new multixact space), we don't arm the stop point.  This means
that if you're in this situation, you run without member wraparound
protection until it's corrected.  A message gets logged once per
checkpoint telling you that you have this problem, and another message
gets logged when things get straightened out and the guards are
enabled.

3. If setting the vacuum force point, we assume that it's appropriate
to immediately force vacuum.

I've only tested this very lightly - this is just to see what you and
Noah and others think of the approach.  As compared with the previous
approach, it has the advantage of making minimal assumptions about the
sanity of what's on disk.  It has the disadvantage that, for some
people, the member-wraparound guard won't be enabled at startup -- but
note that those people can't start 9.3.7/9.4.2 *at all*, so currently
they are either running without member wraparound protection anyway
(if they haven't upgraded to those releases) or they're down entirely.
Another disadvantage is that we'll be triggering what may be quite a
bit of autovacuum activity for some people, which could be painful.
On the plus side, they'll hopefully end up with sane relminmxid and
datminmxid guards afterwards.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..4400fc5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -198,13 +198,24 @@ typedef struct MultiXactStateData
 	/* next-to-be-assigned offset */
 	MultiXactOffset nextOffset;
 
+	/* Have we completed multixact startup? */
+	bool		finishedStartup;
+
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that is still potentially referenced by a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
+
+	/*
+	 * Oldest multixact offset that is potentially referenced by a
+	 * multixact referenced by a relation.  We don't always know this value,
+	 * so there's a flag here to indicate whether or not we currently do.
+	 */
 	MultiXactOffset oldestOffset;
+	bool		oldestOffsetKnown;
 
 	/*
 	 * This is what the previous checkpoint stored as the truncate position.
@@ -221,6 +232,7 @@ typedef struct MultiXactStateData
 
 	/* support for members anti-wraparound measures */
 	MultiXactOffset offsetStopLimit;
+	bool offsetStopLimitKnown;
 
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
@@ -350,10 +362,11 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 		MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static void DetermineSafeOldestOffset(MultiXactId oldestMXact);
+static void DetermineSafeOldestOffset(MultiXactOffset oldestMXact);
 static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
 		 MultiXactOffset start, uint32 distance);
-static MultiXactOffset find_multixact_start(MultiXactId multi);
+static bool SetOffsetVacuumLimit(bool finish_setup);
+static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZero

Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-06-03 Thread Andrew Dunstan


On 06/02/2015 11:55 PM, Amit Kapila wrote:
On Tue, Jun 2, 2015 at 10:26 PM, Andrew Dunstan  wrote:



On 05/15/2015 02:21 AM, Amit Kapila wrote:


Find the patch which gets rid of rmtree usage.  I have made it as
a separate function because the same code is used from
create_tablespace_directories() as well.  I thought of extending the
same API for using it from destroy_tablespace_directories() as well,
but due to special handling (especially for ENOENT) in that function,
I left it as of now.






Well, it seems to me the new function is being altogether way too
trusting about the nature of what it's being asked to remove. In
the first place, the S_ISDIR/rmdir branch should only be for
Windows, and secondly in the other branch we should be checking
that S_ISLNK is true. It would actually be nice if we could test
for a junction point on Windows, but that seems to be a bit
difficult. 



I think during recovery for tablespace related operations, it is
quite possible to have a directory instead of symlink in some
special cases (see function TablespaceCreateDbspace() and comments
in destroy_tablespace_directories() { ..Try to remove the symlink..}).
Also this new function is being called from create_tablespace_directories(),
which uses the code as written in the new function, so it doesn't make much
sense to split it into Windows- and non-Windows-specific code.




Looking at it again, this might be not as bad as I thought, but I do 
think we should probably call the function something other than 
rmsymlink(). That seems too generic, since it also tries to remove 
directories - albeit that this will fail if the directory isn't empty. 
And I still think we should add a test for S_ISLNK in the second branch. 
As it stands the function could try to unlink anything that's not a 
directory. That might be safe-ish in the context it's used in for the 
tablespace code, but it's far from safe enough for a function that's in 
src/common


Given that the function raises an error on failure, I think it will 
otherwise be OK as is.


cheers

andrew



Re: [HACKERS] [PATCH] Add error handling to byteaout.

2015-06-03 Thread Andres Freund
On June 3, 2015 9:42:21 PM GMT+02:00, Andreas Seltenreich 
 wrote:
>Piotr Stefaniak writes:
 s/int/Size/ doesn't fix anything on 32-bit machines.
>>
>> Postgres requires twos-complement representation, so that the
>> assumption that signed integer types wrap around on overflow can be
>> safely made.

Uh, no. The C standard defines signed overflow as undefined, and compilers
assume it doesn't happen. They optimize based on that observation.


--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


Re: [HACKERS] [PATCH] Add error handling to byteaout.

2015-06-03 Thread Andreas Seltenreich
Piotr Stefaniak writes:
>>> s/int/Size/ doesn't fix anything on 32-bit machines.
>
> Postgres requires twos-complement representation, so that the
> assumption that signed integer types wrap around on overflow can be
> safely made.

Thanks for the clarification!


Re: [HACKERS] [PATCH] Add error handling to byteaout.

2015-06-03 Thread Andreas Seltenreich
Alvaro Herrera writes:

> Why not just use an unsigned 64 bit variable?  Also, perhaps
> palloc_huge() avoids the whole problem in the first place ...

Yes, that crossed my mind too, but the current limit is already far
beyond anything that is usually configured for per-backend memory use,
so I dismissed it.

> though it might only move the issue around, if you cannot ship the
> longer-than-1GB resulting escaped value.

For example, when client and server encodings do not match:

,----[ mbutils.c ]
|   result = palloc(len * MAX_CONVERSION_GROWTH + 1);
`----

This results in the fun fact that the maximum size for bytea values that
are guaranteed to be pg_dumpable regardless of encoding/escaping
settings is lower than 64MB.

One thing that would also mitigate the problem is supporting a more
efficient output format.  For example, there's already means for
base64-encoding in the backend:

self=# select c, length(encode(mkbytea(28),c)) from (values ('hex'),('base64')) as v(c);
   c    |  length
--------+-----------
 hex    | 536870912
 base64 | 362623337
(2 rows)
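(mkbytea() above is presumably a local test helper.  For anyone wanting to
reproduce the comparison with only core functions, a self-contained equivalent
might look like the following; the 1 MB input size is arbitrary:)

select length(encode(v, 'hex'))    as hex_len,     -- roughly 2x the input
       length(encode(v, 'base64')) as base64_len   -- roughly 1.35x, including newlines
from (select decode(repeat('00', 1024 * 1024), 'hex') as v) as t;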

Maybe it is reasonable to make it available as another option for use in
bytea_output?

regards,
Andreas


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Andres Freund wrote:
> On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:

> > One idea I had was: what if the oldestMulti pointed to another multi
> > earlier in the same 0046 file, so that it is read-as-zeroes (and the
> > file is created), and then a subsequent multixact truncate tries to read
> > a later page in the file.  In SlruPhysicalReadPage() this would give a
> > chance for open() to not fail, and then read() can fail as above.
> > However, we would have an earlier LOG message about "reading as zeroes".
> > 
> > Really, the whole question of how this code goes past the open() failure
> > in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> > the file to be created ...
> 
> Wouldn't a previous WAL record zeroing another part of that segment
> explain this? A zero sized segment pretty much would lead to this error,
> right? Or were you able to check how things look after the failure?

But why would there be a previous WAL record zeroing another part of
that segment?  Note that this segment is very old -- hasn't been written
in quite a while, it's certainly not in slru buffers anymore.

> > 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
> > 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file 
> > "000173DD00AD" from archive
> > 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access 
> > status of transaction 4624559
> > 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from 
> > file "pg_multixact/offsets/0046" at offset 147456: Success.
> > 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 
> > 4782) exited with exit code 1
> > 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due 
> > to startup process failure
> 
> From this it's not entirely clear where this error was triggered from.

Well, reading code, it seems reasonable to assume that replay of
the checkpoint record I mentioned leads to that error message when the
file exists but is not long enough to contain the given offset.  There
are no MultiXact WAL records in the segment.  Also note that there's no
other "restored log file" message after the "entering standby mode"
message.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > That's not necessarily the case though, given how the code currently
> > works. In a bunch of places the SLRUs are accessed *before* having been
> > made consistent by WAL replay. Especially if several checkpoints/vacuums
> > happened during the base backup the assumed state (i.e. the mxacts
> > checkpoints refer to) of the data directory soon after the initial
> > start, and the state of pg_multixact/ won't necessarily match at all.
> 
> Well, the log extract is quite simple; it says that as soon as the
> standby starts replay WAL, it fetches exactly one WAL segment, which
> contains exactly one online checkpoint record; this record contains
> oldestMulti=4624559, which belongs in offset file 0046.  But the
> basebackup only contains files starting from 004B onwards.  The base
> backup takes a long time, so my guess of what happened is that the
> backup label points to the start of the WAL stream (obviously), then the
> data files are copied to the standby; during this long process, a
> multixact truncation happens.  So by the time the base backup reaches
> the pg_multixact directory, the 0046 file has been removed.

Yes. That's precisely what I was concerned about above and elsewhere in
the thread. It's simply not ok to access a SLRU while not yet
consistent...

> One idea I had was: what if the oldestMulti pointed to another multi
> earlier in the same 0046 file, so that it is read-as-zeroes (and the
> file is created), and then a subsequent multixact truncate tries to read
> a later page in the file.  In SlruPhysicalReadPage() this would give a
> chance for open() to not fail, and then read() can fail as above.
> However, we would have an earlier LOG message about "reading as zeroes".
> 
> Really, the whole question of how this code goes past the open() failure
> in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> the file to be created ...

Wouldn't a previous WAL record zeroing another part of that segment
explain this? A zero sized segment pretty much would lead to this error,
right? Or were you able to check how things look after the failure?

> 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
> 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file 
> "000173DD00AD" from archive
> 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access 
> status of transaction 4624559
> 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from 
> file "pg_multixact/offsets/0046" at offset 147456: Success.
> 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 
> 4782) exited with exit code 1
> 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to 
> startup process failure

From this it's not entirely clear where this error was triggered from.

Greetings,

Andres Freund


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Alvaro Herrera wrote:

> Really, the whole question of how this code goes past the open() failure
> in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> the file to be created ...

Hmm, the checkpointer can call TruncateMultiXact when in recovery, on
restartpoints. I wonder if in recovery this could trigger file creation.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Andres Freund wrote:
> On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
> > Thomas Munro wrote:
> > > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera  
> > > wrote:
> > > > My guess is that the file existed, and perhaps had one or more pages,
> > > > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > > > back.  read() returns 0 in this case but doesn't set errno.
> > > >
> > > > I didn't find a way to set things so that the file exists but is of
> > > > shorter contents than oldestMulti by the time the checkpoint record is
> > > > replayed.
> > > 
> > > I'm just starting to learn about the recovery machinery, so forgive me
> > > if I'm missing something basic here, but I just don't get this.  As I
> > > understand it, offsets/0046 should either have been copied with that
> > > page present in it if it existed before the backup started (apparently
> > > not in this case), or extended to contain it by WAL records that come
> > > after the backup label but before the checkpoint record that
> > > references it (also apparently not in this case).
> 
> That's not necessarily the case though, given how the code currently
> works. In a bunch of places the SLRUs are accessed *before* having been
> made consistent by WAL replay. Especially if several checkpoints/vacuums
> happened during the base backup the assumed state (i.e. the mxacts
> checkpoints refer to) of the data directory soon after the initial
> start, and the state of pg_multixact/ won't necessarily match at all.

Well, the log extract is quite simple; it says that as soon as the
standby starts replay WAL, it fetches exactly one WAL segment, which
contains exactly one online checkpoint record; this record contains
oldestMulti=4624559, which belongs in offset file 0046.  But the
basebackup only contains files starting from 004B onwards.  The base
backup takes a long time, so my guess of what happened is that the
backup label points to the start of the WAL stream (obviously), then the
data files are copied to the standby; during this long process, a
multixact truncation happens.  So by the time the base backup reaches
the pg_multixact directory, the 0046 file has been removed.

2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file 
"000173DD00AD" from archive
2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status 
of transaction 4624559
2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from 
file "pg_multixact/offsets/0046" at offset 147456: Success.
2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) 
exited with exit code 1
2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to 
startup process failure

One idea I had was: what if the oldestMulti pointed to another multi
earlier in the same 0046 file, so that it is read-as-zeroes (and the
file is created), and then a subsequent multixact truncate tries to read
a later page in the file.  In SlruPhysicalReadPage() this would give a
change for open() to not fail, and then read() can fail as above.
However, we would have an earlier LOG message about "reading as zeroes".

Really, the whole question of how this code goes past the open() failure
in SlruPhysicalReadPage baffles me.  I don't see any possible way for
the file to be created ...

> > Exactly --- that's the spot at which I am, also.  I have had this
> > spinning in my head for three days now, and tried every single variation
> > that I could think of, but like you I was unable to reproduce the issue.
> > However, our customer took a second base backup and it failed in exactly
> > the same way, module some changes to the counters (the file that
> > didn't exist was 004B rather than 0046).  I'm still at a loss at what
> > the failure mode is.  We must be missing some crucial detail ...
> 
> I might have missed it in this already long thread. Could you share a
> bunch of details about the case? It'd be very interesting to see the
> contents of the backup label (to see where start/end are), the contents
> of the initial checkpoint (to see which mxacts we assume to exist at
> start) and what the initial contents of pg_multixact are (to match up).

pg_xlogdump says about the checkpoint record:

rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 
73DD/AD834470, prev 73DD/AD8343B8, bkp: , desc: checkpoint: redo 
73DD/4003F648; tli 1; prev tli 1; fpw true; xid 1/3530575042; oid 81303440; 
multi 14003656; offset 66311341; oldest xid 2530472730 in DB 17081243; oldest 
multi 4624559 in DB 17081243; oldest running xid 3530551992; online

I don't have the backup label.  But a listing of the files in
pg_multixact/offsets in the master has this:

./offsets:
total 35816
drwx------ 2 postgres postgres 4096 May 27 16:25 .
drwx------ 4 postgres postgres 4096 Jan 3 03:32 ..
-rw------- 1 postgres postgres 262144 Apr 20 13:07 004

Re: [CORE] [HACKERS] postpone next week's release

2015-06-03 Thread Stefan Kaltenbrunner
On 05/31/2015 03:51 AM, David Steele wrote:
> On 5/30/15 8:38 PM, Joshua D. Drake wrote:
>>
>> On 05/30/2015 03:48 PM, David Steele wrote:
>>> On 5/30/15 2:10 PM, Robert Haas wrote:
 What, in this release, could break things badly?  RLS? Grouping sets?
 Heikki's WAL format changes?  That last one sounds really scary to me;
 it's painful if not impossible to fix the WAL format in a minor
 release.
>>>
>>> I would argue Heikki's WAL stuff is a perfect case for releasing a
>>> public alpha/beta soon.  I'd love to test PgBackRest with an "official"
>>> 9.5dev build.  The PgBackRest test suite has lots of tests that run on
>>> versions 8.3+ and might well shake out any bugs that are lying around.
>>
>> You are right. Clone git, run it nightly automated and please, please
>> report anything you find. There is no reason for a tagged release for
>> that. Consider it a custom, purpose built, build-test farm.
> 
> Sure - I can write code to do that.  But then why release a beta at all?

FWIW: we also carry "official" snapshots on the download site (
https://ftp.postgresql.org/pub/snapshot/dev/) that you could use if you
don't want git directly - those even receive some form of QA (for a
snapshot to be posted it is required to pass a full buildfarm run on the
buildbox).



Stefan


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
> Thomas Munro wrote:
> > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera  
> > wrote:
> > > My guess is that the file existed, and perhaps had one or more pages,
> > > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > > back.  read() returns 0 in this case but doesn't set errno.
> > >
> > > I didn't find a way to set things so that the file exists but is of
> > > shorter contents than oldestMulti by the time the checkpoint record is
> > > replayed.
> > 
> > I'm just starting to learn about the recovery machinery, so forgive me
> > if I'm missing something basic here, but I just don't get this.  As I
> > understand it, offsets/0046 should either have been copied with that
> > page present in it if it existed before the backup started (apparently
> > not in this case), or extended to contain it by WAL records that come
> > after the backup label but before the checkpoint record that
> > references it (also apparently not in this case).

That's not necessarily the case though, given how the code currently
works. In a bunch of places the SLRUs are accessed *before* having been
made consistent by WAL replay. Especially if several checkpoints/vacuums
happened during the base backup the assumed state (i.e. the mxacts
checkpoints refer to) of the data directory soon after the initial
start, and the state of pg_multixact/ won't necessarily match at all.

> Exactly --- that's the spot at which I am, also.  I have had this
> spinning in my head for three days now, and tried every single variation
> that I could think of, but like you I was unable to reproduce the issue.
> However, our customer took a second base backup and it failed in exactly
> the same way, modulo some changes to the counters (the file that
> didn't exist was 004B rather than 0046).  I'm still at a loss at what
> the failure mode is.  We must be missing some crucial detail ...

I might have missed it in this already long thread. Could you share a
bunch of details about the case? It'd be very interesting to see the
contents of the backup label (to see where start/end are), the contents
of the initial checkpoint (to see which mxacts we assume to exist at
start) and what the initial contents of pg_multixact are (to match up).

Greetings,

Andres Freund


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Thomas Munro wrote:

> I have finally reproduced that error!  See attached repro shell script.
> 
> The conditions are:
> 
> 1.  next multixact == oldest multixact (no active multixacts, pointing
> past the end)
> 2.  next multixact would be the first item on a new page (multixact % 2048 == 
> 0)
> 3.  the page must not be the first in a segment (or we'd get the
> read-zeroes case)
> 
> That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
> vacuum followed by no multixact creations right before your backup
> checkpoint).  That seems like reasonably low odds... if it happened
> twice in a row, maybe I'm missing something here and there is some
> other way to get this...

Thanks, this is pretty cool (as far as these things go), but it's not
the scenario I see, in which the complained-about segment is very old by
the time it's truncated away by a checkpoint after freeze.  Segment
requested by checkpoint.oldestMulti is 0046 (offset 140k something --
just to clarify it's not at segment boundary), and the base backup
contains segments from 004B to 00D5.  My problem scenario has
oldestMulti close to 5 million and nextMulti close to 14 million.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [CORE] Restore-reliability mode

2015-06-03 Thread Andres Freund
On 2015-06-03 10:21:28 -0700, Josh Berkus wrote:
> So, historically, this is what the period between feature freeze and
> beta1 was for; the "consolidation" phase was supposed to deal with this.
>  The problem over the last few years, by my observation, has been that
> consolidation has been left to just a few people (usually just Bruce &
> Tom or Tom & Robert) and our code base is now much too large for that.
> 
> The way other projects deal with this is having continuous testing as
> stuff comes in, and *more* testing than just our regression tests (e.g.
> acceptance tests, integration tests, performance tests, etc.). So our
> other issue has been that our code complexity has been growing faster
> than our test suite.  Part of that is that this community has never
> placed much value in automated testing or testers, so people who are
> interested in it find other projects to contribute to.
> 
> I would argue that if we delay 9.5 in order to do a 100% manual review
> of code, without adding any new automated tests or other non-manual
> tools for improving stability, then it's a waste of time; we might as
> well just release the beta, and our users will find more issues than we
> will.  I am concerned that if we declare a cleanup period, especially in
> the middle of the summer, all that will happen is that the project will
> go to sleep for an extra three months.
> 
> I will also point out that there is a major adoption cost to delaying
> 9.5.   Right now users are excited about UPSERT, big data, and extra
> JSON features. If they have to wait another 7 months, they'll be a lot
> less excited, and we'll lose more potential users to the new databases
> and the MySQL forks.  It could also delay the BDR project (Simon/Craig
> can speak to this) which would suck.
> 
> Reliability of having a release every year is important as well as
> database reliability ... and for a lot of the new webdev generation,
> PostgreSQL is already the most reliable piece of software infrastructure
> they use.  So if we're going to have a cleanup delay, then let's please
> make it an *intensive* cleanup delay, with specific goals, milestones,
> and a schedule.  Otherwise, don't bother.

+very many


Re: [HACKERS] [CORE] Restore-reliability mode

2015-06-03 Thread Josh Berkus
On 06/03/2015 06:50 AM, Noah Misch wrote:
> Subject changed from "Re: [CORE] postpone next week's release".
> 
> On Sat, May 30, 2015 at 10:48:45PM -0400, Bruce Momjian wrote:
>> If we have to totally stop feature development until we are all happy
>> with the code we have, so be it.  If people feel they have to get into
>> cleanup mode or they will never get to add a feature to Postgres again,
>> so be it.  If people say, heh, I am not going to do anything and just
>> come back when cleanup is done (by someone else), then we will end up
>> with a smaller but more dedicated development team, and I am fine with
>> that too.  I am suggesting that until everyone is happy with the code we
>> have, we should not move forward.
> 
> I like the essence of this proposal.  Two suggestions.  We can't achieve or
> even robustly measure "everyone is happy with the code," so let's pick
> concrete exit criteria.  Given criteria framed like "Files A,B,C and patches
> X,Y,Z have a sign-off from a committer other than their original committer."
> anyone can monitor progress and find specific ways to contribute.  Second, I
> would define the subject matter as "bug fixes, testing and review", not
> "restructuring, testing and review."  Different code structures are clearest
> to different hackers.  Restructuring, on average, adds bugs even more quickly
> than feature development adds them.

So, historically, this is what the period between feature freeze and
beta1 was for; the "consolidation" phase was supposed to deal with this.
 The problem over the last few years, by my observation, has been that
consolidation has been left to just a few people (usually just Bruce &
Tom or Tom & Robert) and our code base is now much too large for that.

The way other projects deal with this is having continuous testing as
stuff comes in, and *more* testing than just our regression tests (e.g.
acceptance tests, integration tests, performance tests, etc.). So our
other issue has been that our code complexity has been growing faster
than our test suite.  Part of that is that this community has never
placed much value in automated testing or testers, so people who are
interested in it find other projects to contribute to.

I would argue that if we delay 9.5 in order to do a 100% manual review
of code, without adding any new automated tests or other non-manual
tools for improving stability, then it's a waste of time; we might as
well just release the beta, and our users will find more issues than we
will.  I am concerned that if we declare a cleanup period, especially in
the middle of the summer, all that will happen is that the project will
go to sleep for an extra three months.

I will also point out that there is a major adoption cost to delaying
9.5.   Right now users are excited about UPSERT, big data, and extra
JSON features. If they have to wait another 7 months, they'll be a lot
less excited, and we'll lose more potential users to the new databases
and the MySQL forks.  It could also delay the BDR project (Simon/Craig
can speak to this) which would suck.

Reliability of having a release every year is important as well as
database reliability ... and for a lot of the new webdev generation,
PostgreSQL is already the most reliable piece of software infrastructure
they use.  So if we're going to have a cleanup delay, then let's please
make it an *intensive* cleanup delay, with specific goals, milestones,
and a schedule.  Otherwise, don't bother.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] anole: assorted stability problems

2015-06-03 Thread Alvaro Herrera
Tom Lane wrote:
> Andres Freund  writes:

> > Uh. I'm pretty sure there were some back when that patch went in. And
> > there definitely used to be a couple earlier. I guess itanium really is
> > dying (mixed bag: It's a horrible architecture, but more coverage would
> > still be good).
> 
> Since that machine is run by EDB, maybe we could persuade them to set up
> a second critter on it that uses gcc.  That would at least help narrow
> down whether it's a compiler-specific issue.

I pinged EDB about this several days ago, and they have now set up
buildfarm member gharial running on the same machine, using gcc.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH] Add error handling to byteaout.

2015-06-03 Thread Alvaro Herrera
Andreas Seltenreich wrote:
> Tom Lane  writes:
> 
> > Andreas Seltenreich  writes:
> >> The scary one is due to an integer overflow, which the attached patch also
> >> fixes.
> >
> > s/int/Size/ doesn't fix anything on 32-bit machines.
> 
> Well, it changes the signedness of the computation on 32-bit; in
> combination with the fact that "len" is always smaller than 2^32 but
> may exceed 2^31-1, the change avoids depending on the undefined
> behavior of signed integer overflow in C on 32-bit as well.

Why not just use an unsigned 64 bit variable?  Also, perhaps
palloc_huge() avoids the whole problem in the first place ... though it
might only move the issue around, if you cannot ship the longer-than-1GB
resulting escaped value.  (Of course, if you try to allocate 2 GB in a
32 bit machine, you're going to be having quite some fun ...)
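
For concreteness, here is a small standalone illustration of the arithmetic
under discussion (it is not the byteaout patch itself, and it assumes the hex
output format: two output bytes per input byte plus a two-character "\x"
prefix):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* a bytea a bit over 1 GiB: the length itself still fits in a signed int */
	int32_t len = 1200 * 1024 * 1024;

	/*
	 * Hex output needs len * 2 + 2 bytes, which exceeds INT32_MAX here, so
	 * doing that arithmetic in a signed 32-bit int overflows, which is
	 * undefined behaviour in C.  Widening to an unsigned 64-bit type first
	 * keeps the value exact on both 32-bit and 64-bit platforms.
	 */
	uint64_t out_len = (uint64_t) len * 2 + 2;

	printf("escaped size: %llu bytes\n", (unsigned long long) out_len);
	return 0;
}

Whether a value of that size can then actually be allocated and shipped is
the separate palloc question above.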

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Restore-reliability mode

2015-06-03 Thread Joshua D. Drake

On 06/03/2015 07:18 AM, Andres Freund wrote:


On 2015-06-03 09:50:49 -0400, Noah Misch wrote:

Second, I would define the subject matter as "bug fixes, testing and
review", not "restructuring, testing and review."  Different code
structures are clearest to different hackers.  Restructuring, on
average, adds bugs even more quickly than feature development adds
them.


I can't agree with this. While I agree with not doing large
restructuring for 9.5, I think we can't afford not to refactor for
clarity, even if that introduces bugs. Noticeable parts of our code have
to frequently be modified for new features and are badly structured at
the same time. While restructuring may temporarily increase the
number of bugs in the short term, it'll decrease the number of bugs long
term while increasing the number of potential contributors and new
features.  That's obviously not to say we should just refactor for the
sake of it.



Our project has been continuing to increase momentum over the last few 
years and our adoption has increased at an amazing rate. It is important 
to remember that we have users. These users have needs that must be met 
or else those users will move on to a different technology.


I agree that we need to postpone this release. I also agree that there 
is likely re-factoring to be done. I have also never met a programmer 
who doesn't think something needs to be re-factored. The majority of 
programmers I know suffer from NIH and want to change how things are 
implemented.


If we are going to re-factor, it should not be considered global and 
should be attacked with specific goals in mind. If those goals are not 
specifically defined and agreed on, we will get very pretty code of 
very little use to our users. Then our users will leave because they 
are busy waiting on us to re-factor.


In short, this effort must balance the needs of the code against 
the needs of our users.


Sincerely,

JD

--
The most kicking donkey PostgreSQL Infrastructure company in existence.
The oldest, the most experienced, the consulting company to the stars.
Command Prompt, Inc. http://www.commandprompt.com/ +1 -503-667-4564 -
24x7 - 365 - Proactive and Managed Professional Services!




Re: [HACKERS] Restore-reliability mode

2015-06-03 Thread Andres Freund
On 2015-06-03 09:50:49 -0400, Noah Misch wrote:
> Second, I would define the subject matter as "bug fixes, testing and
> review", not "restructuring, testing and review."  Different code
> structures are clearest to different hackers.  Restructuring, on
> average, adds bugs even more quickly than feature development adds
> them.

I can't agree with this. While I agree with not doing large
restructuring for 9.5, I think we can't afford not to refactor for
clarity, even if that introduces bugs. Noticeable parts of our code have
to frequently be modified for new features and are badly structured at
the same time. While restructuring may temporarily increase the
number of bugs in the short term, it'll decrease the number of bugs long
term while increasing the number of potential contributors and new
features.  That's obviously not to say we should just refactor for the
sake of it.




Re: [HACKERS] Restore-reliability mode

2015-06-03 Thread Geoff Winkless
On 3 June 2015 at 14:50, Noah Misch  wrote:

> I would define the subject matter as "bug fixes, testing and review", not
> "restructuring, testing and review."  Different code structures are clearest
> to different hackers.  Restructuring, on average, adds bugs even more quickly
> than feature development adds them.
>

+1 to this. Rewriting or restructuring code because you don't trust it
(even though you have no reported real-world bugs) is a terrible idea.

Stopping all feature development to do it is even worse.

I know you're not talking about rewriting, but I think
http://www.joelonsoftware.com/articles/fog69.html is always worth a
re-read, if only because it's funny :)

I would always 100% support a decision to push back new releases because of
bugfixes for *known* issues, but if you think you *might* be able to find
bugs in code you don't like, you should do that on your own time. Iff you
find actual bugs, *then* you talk about halting new releases.

Geoff


[HACKERS] Restore-reliability mode

2015-06-03 Thread Noah Misch
Subject changed from "Re: [CORE] postpone next week's release".

On Sat, May 30, 2015 at 10:48:45PM -0400, Bruce Momjian wrote:
> Well, I think we stop what we are doing, focus on restructuring,
> testing, and reviewing areas that historically have had problems, and
> when we are done, we can look to go to 9.5 beta.  What we don't want to
> do is to push out more code and get back into a
> whack-a-bug-as-they-are-found mode, which obviously did not serve us well
> for multi-xact, and which is what releasing a beta will do, and of
> course, more commit-fests, and more features.  
> 
> If we have to totally stop feature development until we are all happy
> with the code we have, so be it.  If people feel they have to get into
> cleanup mode or they will never get to add a feature to Postgres again,
> so be it.  If people say, heh, I am not going to do anything and just
> come back when cleanup is done (by someone else), then we will end up
> with a smaller but more dedicated development team, and I am fine with
> that too.  I am suggesting that until everyone is happy with the code we
> have, we should not move forward.

I like the essence of this proposal.  Two suggestions.  We can't achieve or
even robustly measure "everyone is happy with the code," so let's pick
concrete exit criteria.  Given criteria framed like "Files A,B,C and patches
X,Y,Z have a sign-off from a committer other than their original committer."
anyone can monitor progress and find specific ways to contribute.  Second, I
would define the subject matter as "bug fixes, testing and review", not
"restructuring, testing and review."  Different code structures are clearest
to different hackers.  Restructuring, on average, adds bugs even more quickly
than feature development adds them.

Thanks,
nm




Re: [HACKERS] auto_explain sample rate

2015-06-03 Thread Craig Ringer
On 3 June 2015 at 20:04, Andres Freund  wrote:

> On 2015-06-03 18:54:24 +0800, Craig Ringer wrote:
> > OK, here we go.
>
> Hm. Wouldn't random sampling be better than what you do? If your queries
> have a pattern to them (e.g. you always issue the same 10 queries in
> succession), this will possibly only show a subset of the queries.
>
> I think a formulation in fraction (i.e. a float between 0 and 1) will
> also be easier to understand.


Could be, yeah. I was thinking about the cost of generating a random number
each time, but it's going to vanish in the noise compared to the rest of the
costs in query execution.
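
To make the aliasing concern concrete, here is a throwaway standalone program
(nothing from the patch; the query pattern, period and rate are invented for
the illustration) comparing an every-Nth counter with a per-statement random
draw on a workload that repeats the same 10 queries in order:

#include <stdio.h>
#include <stdlib.h>

#define QUERIES     10      /* a workload repeating the same 10 queries */
#define ITERATIONS  100000
#define EVERY_NTH   10      /* counter scheme: explain every 10th statement */
#define SAMPLE_RATE 0.1     /* random scheme: a fraction between 0 and 1 */

int
main(void)
{
	int counter_hits[QUERIES] = {0};
	int random_hits[QUERIES] = {0};

	srand(42);

	for (int i = 0; i < ITERATIONS; i++)
	{
		int query = i % QUERIES;	/* which of the 10 queries this is */

		/* counter scheme */
		if (i % EVERY_NTH == 0)
			counter_hits[query]++;

		/* random scheme */
		if ((double) rand() / RAND_MAX < SAMPLE_RATE)
			random_hits[query]++;
	}

	for (int q = 0; q < QUERIES; q++)
		printf("query %d: counter %6d  random %6d\n",
			   q, counter_hits[q], random_hits[q]);
	return 0;
}

With the workload period and the sampling period aligned like this, the
counter variant only ever explains query 0, while the random draw spreads its
samples across all ten, which is exactly Andres's point.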

---
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund  wrote:
>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
>> > the disk it'll always get one at a segment boundary, right? I'm not sure
>> > that's actually ok; because the value at the beginning of the segment
>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
>> > filled the page with zeros.
>> >
>> > I think that should be harmless, the worst that can happen is that
>> > oldestOffset erroneously is 0, which should be correct, even though
>> > possibly overly conservative, in these cases.
>>
>> Uh oh.  That seems like a real bad problem for this approach.  What
>> keeps that from being the opposite of too conservative?  There's no
>> "safe" value in a circular numbering space.
>
> I think it *might* (I'm really jetlagged) be fine because that'll only
> happen after an upgrade from < 9.3. And in that case we initialize
> nextOffset to 0. That ought to save us?

That's pretty much betting the farm on the bugs we know about today
being the only ones there are.  That seems imprudent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Wed, Jun 3, 2015 at 4:48 AM, Thomas Munro
 wrote:
> On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera  
> wrote:
>> Thomas Munro wrote:
>>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera  
>>> wrote:
>>> > My guess is that the file existed, and perhaps had one or more pages,
>>> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
>>> > back.  read() returns 0 in this case but doesn't set errno.
>>> >
>>> > I didn't find a way to set things so that the file exists but is of
>>> > shorter contents than oldestMulti by the time the checkpoint record is
>>> > replayed.
>>>
>>> I'm just starting to learn about the recovery machinery, so forgive me
>>> if I'm missing something basic here, but I just don't get this.  As I
>>> understand it, offsets/0046 should either have been copied with that
>>> page present in it if it existed before the backup started (apparently
>>> not in this case), or extended to contain it by WAL records that come
>>> after the backup label but before the checkpoint record that
>>> references it (also apparently not in this case).
>>
>> Exactly --- that's the spot at which I am, also.  I have had this
>> spinning in my head for three days now, and tried every single variation
>> that I could think of, but like you I was unable to reproduce the issue.
>> However, our customer took a second base backup and it failed in exactly
>> the same way, modulo some changes to the counters (the file that
>> didn't exist was 004B rather than 0046).  I'm still at a loss at what
>> the failure mode is.  We must be missing some crucial detail ...
>
> I have finally reproduced that error!  See attached repro shell script.
>
> The conditions are:
>
> 1.  next multixact == oldest multixact (no active multixacts, pointing
> past the end)
> 2.  next multixact would be the first item on a new page (multixact % 2048 == 0)
> 3.  the page must not be the first in a segment (or we'd get the
> read-zeroes case)
>
> That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
> vacuum followed by no multixact creations right before your backup
> checkpoint).  That seems like reasonably low odds... if it happened
> twice in a row, maybe I'm missing something here and there is some
> other way to get this...
>
> I realise now that this is actually a symptom of a problem spotted by
> Noah recently:
>
> http://www.postgresql.org/message-id/20150601045534.gb23...@tornado.leadboat.com
>
> He noticed the problem for segment boundaries, when not in recovery.
> In recovery, segment boundaries don't raise an error (the read-zeroes
> case applies), but page boundaries do.  The fix is probably to do
> nothing if they are the same, as we do elsewhere, like in the attached
> patch.

Actually, we still need to call DetermineSafeOldestOffset in that
case.  Otherwise, if someone goes from lots of multixacts to none, the
stop point won't advance.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] auto_explain sample rate

2015-06-03 Thread Andres Freund
On 2015-06-03 18:54:24 +0800, Craig Ringer wrote:
> OK, here we go.

Hm. Wouldn't random sampling be better than what you do? If your queries
have a pattern to them (e.g. you always issue the same 10 queries in
succession), this will possibly only show a subset of the queries.

I think a formulation in fraction (i.e. a float between 0 and 1) will
also be easier to understand.

Andres




Re: [HACKERS] why does txid_current() assign new transaction-id?

2015-06-03 Thread Michael Paquier
On Wed, Jun 3, 2015 at 12:13 PM, Fujii Masao  wrote:

> On Wed, Jun 3, 2015 at 9:04 AM, Naoya Anzai 
> wrote:
> >> > diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> >> > index 89a609f..6485b42 100644
> >> > --- a/doc/src/sgml/func.sgml
> >> > +++ b/doc/src/sgml/func.sgml
> >> > @@ -16233,7 +16233,7 @@ SELECT collation for ('foo' COLLATE "de_DE");
> >> >
> >> >
>  txid_current()
> >> > bigint
> >> > -   get current transaction ID
> >> > +   get current transaction ID, assigning a new one if
> current transaction does not have one
> >> ^
> the
> >
> > That looks good.
>
> Committed. Thanks!
>

Thanks for picking up the fix.
-- 
Michael


Re: [HACKERS] auto_explain sample rate

2015-06-03 Thread Craig Ringer
On 2 June 2015 at 15:07, Craig Ringer  wrote:

> On 29 May 2015 at 11:35, Tom Lane  wrote:
>
>> Craig Ringer  writes:
>> > It's sometimes desirable to collect auto_explain data with ANALYZE in
>> order
>> > to track down hard-to-reproduce issues, but the performance impacts can
>> be
>> > pretty hefty on the DB.
>>
>> > I'm inclined to add a sample rate to auto_explain so that it can trigger
>> > only on x percent of queries,
>>
>> That sounds reasonable ...
>>
>
> Cool, I'll cook that up then. Thanks for the sanity check.
>

OK, here we go.

To make sure it doesn't trigger on all backends at once, and to ensure it
doesn't rely on a shared point of contention in shmem, this sets up a
counter with a random value on each backend start.

Because it needs to either always run both the Start and End hooks, or run
neither, this doesn't count nested statements for sampling purposes. So if
you run my_huge_plpgsql_function() then either all its statements will be
explained or none of them will. This only applies if nested statement
explain is enabled. It's possible to get around this by adding a separate
nested statement counter that's reset at each top level End hook, but it
doesn't seem worthwhile.

The sample rate has no effect on ANALYZE, which remains enabled or disabled
for all queries. I don't see any point in adding a separate sample rate
control to ANALYZE only some sub-proportion of EXPLAINed statements.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Tra
From 1f395e11bad41a77881970c7179f65e33c523cdd Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Wed, 3 Jun 2015 17:48:02 +0800
Subject: [PATCH] Allow sampling of only some queries by auto_explain

The new parameter auto_explain.sample_ratio specifies the proportion
of queries that are to be explained.  The starting offset is randomized
so that somewhere between the 1st and n'th statement gets explained,
then every n'th statement. No statement classification is performed.

If nested statement auto_explain is enabled, then for any given
top level statement either all sub-statements or none will be explained,
i.e. sampling doesn't apply to nested statements.
---
 contrib/auto_explain/auto_explain.c | 56 -
 doc/src/sgml/auto-explain.sgml  | 18 
 2 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 2a184ed..161e24e 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -29,6 +29,7 @@ static bool auto_explain_log_triggers = false;
 static bool auto_explain_log_timing = true;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static bool auto_explain_log_nested_statements = false;
+static int  auto_explain_sample_ratio = 1;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -47,9 +48,14 @@ static ExecutorRun_hook_type prev_ExecutorRun = NULL;
 static ExecutorFinish_hook_type prev_ExecutorFinish = NULL;
 static ExecutorEnd_hook_type prev_ExecutorEnd = NULL;
 
+/* per-backend counter used for ratio sampling */
+static int  auto_explain_sample_counter = 0;
+
 #define auto_explain_enabled() \
 	(auto_explain_log_min_duration >= 0 && \
-	 (nesting_level == 0 || auto_explain_log_nested_statements))
+	 (nesting_level == 0 || auto_explain_log_nested_statements)) && \
+	 (auto_explain_sample_ratio == 1 || auto_explain_sample_counter == 0)
+
 
 void		_PG_init(void);
 void		_PG_fini(void);
@@ -61,6 +67,19 @@ static void explain_ExecutorRun(QueryDesc *queryDesc,
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
 
+static void
+auto_explain_sample_ratio_assign_hook(int newval, void *extra)
+{
+	if (auto_explain_sample_ratio != newval)
+	{
+		/* Schedule a counter reset when the sample ratio changed */
+		auto_explain_sample_counter = -1;
+	}
+
+	auto_explain_sample_ratio = newval;
+}
+
+
 
 /*
  * Module load callback
@@ -159,6 +178,18 @@ _PG_init(void)
 			 NULL,
 			 NULL);
 
+	DefineCustomIntVariable("auto_explain.sample_ratio",
+		"Only explain one in approx. every sample_ratio queries, or 1 for all",
+			NULL,
+			&auto_explain_sample_ratio,
+			1,
+			1, INT_MAX - 1,
+			PGC_SUSET,
+			0,
+			NULL,
+			auto_explain_sample_ratio_assign_hook,
+			NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -191,6 +222,29 @@ _PG_fini(void)
 static void
 explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
 {
+	/*
+	 * For ratio sampling, only increment the counter for top-level
+	 * statements. Either all nested statements will be explained
+	 * or none will, because we need to know at ExecutorEnd hook time
+	 * whether or not we explained any given statement.
+	 */
+	if (nesting_level == 0 && auto_explain_sample_ratio > 1)
+	{
+		if (auto_explain

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Thomas Munro
On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera  wrote:
> Thomas Munro wrote:
>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera  
>> wrote:
>> > My guess is that the file existed, and perhaps had one or more pages,
>> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
>> > back.  read() returns 0 in this case but doesn't set errno.
>> >
>> > I didn't find a way to set things so that the file exists but is of
>> > shorter contents than oldestMulti by the time the checkpoint record is
>> > replayed.
>>
>> I'm just starting to learn about the recovery machinery, so forgive me
>> if I'm missing something basic here, but I just don't get this.  As I
>> understand it, offsets/0046 should either have been copied with that
>> page present in it if it existed before the backup started (apparently
>> not in this case), or extended to contain it by WAL records that come
>> after the backup label but before the checkpoint record that
>> references it (also apparently not in this case).
>
> Exactly --- that's the spot at which I am, also.  I have had this
> spinning in my head for three days now, and tried every single variation
> that I could think of, but like you I was unable to reproduce the issue.
> However, our customer took a second base backup and it failed in exactly
> the same way, modulo some changes to the counters (the file that
> didn't exist was 004B rather than 0046).  I'm still at a loss at what
> the failure mode is.  We must be missing some crucial detail ...

I have finally reproduced that error!  See attached repro shell script.

The conditions are:

1.  next multixact == oldest multixact (no active multixacts, pointing
past the end)
2.  next multixact would be the first item on a new page (multixact % 2048 == 0)
3.  the page must not be the first in a segment (or we'd get the
read-zeroes case)

That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
vacuum followed by no multixact creations right before your backup
checkpoint).  That seems like reasonably low odds... if it happened
twice in a row, maybe I'm missing something here and there is some
other way to get this...

I realise now that this is actually a symptom of a problem spotted by
Noah recently:

http://www.postgresql.org/message-id/20150601045534.gb23...@tornado.leadboat.com

He noticed the problem for segment boundaries, when not in recovery.
In recovery, segment boundaries don't raise an error (the read-zeroes
case applies), but page boundaries do.  The fix is probably to do
nothing if they are the same, as we do elsewhere, like in the attached
patch.
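
Those three conditions are easy to sanity-check outside the server.  A quick
standalone sketch (illustration only; the constants match the 2048
offsets-per-page and 32 pages-per-segment figures used above, and the sample
value is made up):

#include <stdint.h>
#include <stdio.h>

#define OFFSETS_PER_PAGE	2048
#define PAGES_PER_SEGMENT	32

int
main(void)
{
	uint32_t oldest = 75 * OFFSETS_PER_PAGE;	/* first entry on page 75 */
	uint32_t next = oldest;						/* condition 1: nothing in between */

	uint32_t page = oldest / OFFSETS_PER_PAGE;
	int		first_on_page = (oldest % OFFSETS_PER_PAGE) == 0;		/* condition 2 */
	int		first_in_segment = (page % PAGES_PER_SEGMENT) == 0;		/* condition 3 must be false */

	if (next == oldest && first_on_page && !first_in_segment)
		printf("page %u: recovery would look for a page the base backup never contained\n",
			   page);

	/* matches the 1/2048 * 31/32 factor above */
	printf("per-checkpoint odds factor: %.6f\n",
		   (1.0 / OFFSETS_PER_PAGE) * (PAGES_PER_SEGMENT - 1.0) / PAGES_PER_SEGMENT);
	return 0;
}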

-- 
Thomas Munro
http://www.enterprisedb.com


fix-truncate-none.patch
Description: Binary data


copy-page-boundary.sh
Description: Bourne shell script



Re: [HACKERS] [GENERAL] psql weird behaviour with charset encodings

2015-06-03 Thread Michael Paquier
On Tue, Jun 2, 2015 at 4:19 PM, Michael Paquier  
wrote:

> On Sun, May 24, 2015 at 2:43 AM, Noah Misch  wrote:
> > It would be good to purge the code of precisions on "s" conversion
> specifiers,
> > then Assert(!pointflag) in fmtstr() to catch new introductions.  I won't
> plan
> > to do it myself, but it would be a nice little defensive change.
>
> This sounds like a good protection idea, but as it impacts existing
> backend code relying on sprintf's port version, we should only do the
> assertion in HEAD in my opinion, and mention it in the release notes of the
> next major version at the time a patch in this area is applied. I guess
> that we had better backpatch the places using .*s though on back-branches.
>

Attached is a patch purging a bunch of places of %.*s; this will
make the code more robust when facing non-ASCII strings. I left the pg_upgrade
and pg_basebackup code paths alone, as moving away from this specifier
there reduces code readability. We may want to fix them too if file path
names can contain non-ASCII characters, but that seems less critical.
Thoughts?
-- 
Michael
diff --git a/src/backend/nodes/read.c b/src/backend/nodes/read.c
index 0dabfa7..910c124 100644
--- a/src/backend/nodes/read.c
+++ b/src/backend/nodes/read.c
@@ -326,8 +326,18 @@ nodeRead(char *token, int tok_len)
 			break;
 		val = (int) strtol(token, &endptr, 10);
 		if (endptr != token + tok_len)
-			elog(ERROR, "unrecognized integer: \"%.*s\"",
- tok_len, token);
+		{
+			/*
+			 * Cannot use %.*s here because some machines
+			 * interpret precision of %s sometimes in
+			 * characters or in bytes.
+			 */
+			char *buf = (char *) palloc(tok_len + 1);
+			memcpy(buf, token, tok_len);
+			buf[tok_len] = '\0';
+			elog(ERROR, "unrecognized integer: \"%s\"",
+ buf);
+		}
 		l = lappend_int(l, val);
 	}
 }
@@ -346,8 +356,17 @@ nodeRead(char *token, int tok_len)
 			break;
 		val = (Oid) strtoul(token, &endptr, 10);
 		if (endptr != token + tok_len)
-			elog(ERROR, "unrecognized OID: \"%.*s\"",
- tok_len, token);
+		{
+			/*
+			 * Cannot use %.*s here because some machines
+			 * interpret precision of %s sometimes in
+			 * characters or in bytes.
+			 */
+			char *buf = (char *) palloc(tok_len + 1);
+			memcpy(buf, token, tok_len);
+			buf[tok_len] = '\0';
+			elog(ERROR, "unrecognized OID: \"%s\"", buf);
+		}
 		l = lappend_oid(l, val);
 	}
 }
@@ -380,7 +399,14 @@ nodeRead(char *token, int tok_len)
 			}
 			else
 			{
-elog(ERROR, "unrecognized token: \"%.*s\"", tok_len, token);
+/*
+ * Cannot use %.*s here because some machines interpret
+ * precision of %s sometimes in characters or in bytes.
+ */
+char *buf = (char *) palloc(tok_len + 1);
+memcpy(buf, token, tok_len);
+buf[tok_len] = '\0';
+elog(ERROR, "unrecognized token: \"%s\"", buf);
 result = NULL;	/* keep compiler happy */
 			}
 			break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index f5a40fb..444b54d 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -142,6 +142,13 @@
 #define nullable_string(token,length)  \
 	((length) == 0 ? NULL : debackslash(token, length))
 
+#define error_token(message, token, len)	\
+	do {	\
+		char *buf = palloc(len + 1);		\
+		memcpy(buf, token, len);			\
+		buf[len] = '\0';	\
+		elog(ERROR, message, buf);			\
+	} while (0);
 
 static Datum readDatum(bool typbyval);
 
@@ -159,13 +166,13 @@ _readBitmapset(void)
 	if (token == NULL)
 		elog(ERROR, "incomplete Bitmapset structure");
 	if (length != 1 || token[0] != '(')
-		elog(ERROR, "unrecognized token: \"%.*s\"", length, token);
+		error_token("unrecognized token: \"%s\"", token, length);
 
 	token = pg_strtok(&length);
 	if (token == NULL)
 		elog(ERROR, "incomplete Bitmapset structure");
 	if (length != 1 || token[0] != 'b')
-		elog(ERROR, "unrecognized token: \"%.*s\"", length, token);
+		error_token("unrecognized token: \"%s\"", token, length);
 
 	for (;;)
 	{
@@ -179,7 +186,7 @@ _readBitmapset(void)
 			break;
 		val = (int) strtol(token, &endptr, 10);
 		if (endptr != token + length)
-			elog(ERROR, "unrecognized integer: \"%.*s\"", length, token);
+			error_token("unrecognized integer: \"%s\"", token, length);
 		result = bms_add_member(result, val);
 	}
 
@@ -803,7 +810,7 @@ _readBoolExpr(void)
 	else if (strncmp(token, "not", 3) == 0)
 		local_node->boolop = NOT_EXPR;
 	else
-		elog(ERROR, "unrecognized boolop \"%.*s\"", length, token);
+		error_token("unrecognized boolop: \"%s\"", token, length);
 
 	READ_NODE_FIELD(args);
 	READ_LOCATION_FIELD(location);
@@ -1534,7 +1541,10 @@ parseNodeString(void)
 		return_value = _readDeclareCursorStmt();
 	else
 	{
-		elog(ERROR, "badly formatted node string \"%.32s\"...", token);
+		char buf[33];
+		memcpy(buf, token, 32);
+		buf[32] = '\0';
+		elog(ERROR, "

Re: [HACKERS] auto_explain sample rate

2015-06-03 Thread Craig Ringer
>
>
>> lot of variants - I would like to see cost and times for EXPLAIN ANALYZE every
> time - but the precision of time can be reduced to 1ms. It is a question whether
> we can significantly reduce the cost (or number of calls) of getting system
> time.
>
> Pavel
>
>
>
OK, so you're suggesting a sampling-based EXPLAIN.

That'd be interesting, but is totally unrelated to this work on
auto_explain.
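
Pavel's underlying question, how much a single clock read costs, is easy to
measure directly.  A throwaway program along these lines gives a rough
per-call figure (it uses POSIX clock_gettime(); the primitive the executor's
timing instrumentation actually uses varies by platform and release):

#include <stdio.h>
#include <time.h>		/* may need -lrt on older glibc */

#define CALLS 10000000

int
main(void)
{
	struct timespec start, end, tmp;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < CALLS; i++)
		clock_gettime(CLOCK_MONOTONIC, &tmp);
	clock_gettime(CLOCK_MONOTONIC, &end);

	double elapsed = (end.tv_sec - start.tv_sec) +
					 (end.tv_nsec - start.tv_nsec) / 1e9;

	printf("%d calls in %.3f s => %.1f ns/call\n",
		   CALLS, elapsed, elapsed * 1e9 / CALLS);
	return 0;
}

Depending on whether the clock is served from the vDSO or needs a real system
call, that lands anywhere from tens of nanoseconds to a microsecond or more
per call, and TIMING pays it twice per row per plan node, which is where the
20%-30% overhead Pavel mentions comes from.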


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] auto_explain sample rate

2015-06-03 Thread Pavel Stehule
2015-06-03 9:46 GMT+02:00 Craig Ringer :

>
> On 3 June 2015 at 15:22, Pavel Stehule  wrote:
>
>>
>>
>> 2015-06-03 9:17 GMT+02:00 Craig Ringer :
>>
>>>
>>>
>>> On 2 June 2015 at 15:11, Pavel Stehule  wrote:
>>>


 2015-06-02 9:07 GMT+02:00 Craig Ringer :

>
> For the majority of users I'm sure it's sufficient to just have a
> sample rate.
>
> Anything that's trying to match individual queries could be interested
> in all sorts of different things. Queries that touch a particular table
> being one of the more obvious things, or queries that mention a particular
> literal. Rather than try to design something complicated in advance that
> anticipates all needs, I'm thinking it makes sense to just throw a hook in
> there. If some patterns start to emerge in terms of useful real world
> filtering criteria then that'd better inform any more user accessible
> design down the track.
>

 same method can be interesting for interactive EXPLAIN ANALYZE too.
 TIMING has about 20%-30% overhead and usually we don't need perfectly
 exact numbers

>>>
>>> I don't understand what you are suggesting here.
>>>
>>
>> using some sampling for EXPLAIN ANALYZE statement
>>
>>
> Do you mean that you'd like to be able to set a fraction of queries on
> which auto_explain does ANALYZE, so most of the time it just outputs an
> ordinary EXPLAIN?
>
> Or maybe we're talking about different things re the original proposal? I
> don't see how this would work. If you run EXPLAIN ANALYZE interactively
> like you said above, you'd surely want it to report costs and timings, or
> whatever it is that you ask for, all the time. Not just some of the time
> based on some background setting.
>
> Are you advocating a profiling-based approach for EXPLAIN ANALYZE timing
> where we sample which executor node we're under at regular intervals,
> instead of timing everything? Or suggesting a way to filter out sub-trees
> so you only get timing data on some sub-portion of a query?
>
>
lot of variants - I would like to see cost and times for EXPLAIN ANALYZE every
time - but the precision of time can be reduced to 1ms. It is a question whether
we can significantly reduce the cost (or number of calls) of getting system
time.

Pavel


>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] auto_explain sample rate

2015-06-03 Thread Craig Ringer
On 3 June 2015 at 15:22, Pavel Stehule  wrote:

>
>
> 2015-06-03 9:17 GMT+02:00 Craig Ringer :
>
>>
>>
>> On 2 June 2015 at 15:11, Pavel Stehule  wrote:
>>
>>>
>>>
>>> 2015-06-02 9:07 GMT+02:00 Craig Ringer :
>>>

 For the majority of users I'm sure it's sufficient to just have a
 sample rate.

 Anything that's trying to match individual queries could be interested
 in all sorts of different things. Queries that touch a particular table
 being one of the more obvious things, or queries that mention a particular
 literal. Rather than try to design something complicated in advance that
 anticipates all needs, I'm thinking it makes sense to just throw a hook in
 there. If some patterns start to emerge in terms of useful real world
 filtering criteria then that'd better inform any more user accessible
 design down the track.

>>>
>>> same method can be interesting for interactive EXPLAIN ANALYZE too.
>>> TIMING has about 20%-30% overhead and usually we don't need perfectly
>>> exact numbers
>>>
>>
>> I don't understand what you are suggesting here.
>>
>
> using some sampling for EXPLAIN ANALYZE statement
>
>
Do you mean that you'd like to be able to set a fraction of queries on
which auto_explain does ANALYZE, so most of the time it just outputs an
ordinary EXPLAIN?

Or maybe we're talking about different things re the original proposal? I
don't see how this would work. If you run EXPLAIN ANALYZE interactively
like you said above, you'd surely want it to report costs and timings, or
whatever it is that you ask for, all the time. Not just some of the time
based on some background setting.

Are you advocating a profiling-based approach for EXPLAIN ANALYZE timing
where we sample which executor node we're under at regular intervals,
instead of timing everything? Or suggesting a way to filter out sub-trees
so you only get timing data on some sub-portion of a query?


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] auto_explain sample rate

2015-06-03 Thread Pavel Stehule
2015-06-03 9:17 GMT+02:00 Craig Ringer :

>
>
> On 2 June 2015 at 15:11, Pavel Stehule  wrote:
>
>>
>>
>> 2015-06-02 9:07 GMT+02:00 Craig Ringer :
>>
>>>
>>> For the majority of users I'm sure it's sufficient to just have a sample
>>> rate.
>>>
>>> Anything that's trying to match individual queries could be interested
>>> in all sorts of different things. Queries that touch a particular table
>>> being one of the more obvious things, or queries that mention a particular
>>> literal. Rather than try to design something complicated in advance that
>>> anticipates all needs, I'm thinking it makes sense to just throw a hook in
>>> there. If some patterns start to emerge in terms of useful real world
>>> filtering criteria then that'd better inform any more user accessible
>>> design down the track.
>>>
>>
>> same method can be interesting for interactive EXPLAIN ANALYZE too.
>> TIMING has about 20%-30% overhead and usually we don't need perfectly
>> exact numbers
>>
>
> I don't understand what you are suggesting here.
>

using some sampling for EXPLAIN ANALYZE statement

Pavel

>
>
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] auto_explain sample rate

2015-06-03 Thread Craig Ringer
On 2 June 2015 at 15:11, Pavel Stehule  wrote:

>
>
> 2015-06-02 9:07 GMT+02:00 Craig Ringer :
>
>>
>> For the majority of users I'm sure it's sufficient to just have a sample
>> rate.
>>
>> Anything that's trying to match individual queries could be interested in
>> all sorts of different things. Queries that touch a particular table being
>> one of the more obvious things, or queries that mention a particular
>> literal. Rather than try to design something complicated in advance that
>> anticipates all needs, I'm thinking it makes sense to just throw a hook in
>> there. If some patterns start to emerge in terms of useful real world
>> filtering criteria then that'd better inform any more user accessible
>> design down the track.
>>
>
> same method can be interesting for interactive EXPLAIN ANALYZE too. TIMING
> has about 20%-30% overhead and usually we don't need perfectly exact
> numbers
>

I don't understand what you are suggesting here.



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services