Re: [HACKERS] Parallel Seq Scan

2015-02-10 Thread Robert Haas
On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund and...@2ndquadrant.com wrote:
 If you make the chunks small enough, and then coordinate only the chunk
 distribution, not really.

True, but why do you want to do that in the executor instead of in the heapam?

 For this case, what I would imagine is that there is one parallel heap
 scan, and each PartialSeqScan attaches to it.  The executor says "give
 me a tuple" and heapam.c provides one.  Details like the chunk size
 are managed down inside heapam.c, and the executor does not know about
 them.  It just knows that it can establish a parallel scan and then
 pull tuples from it.

 I think that's a horrible approach that'll end up with far more
 entangled pieces than what you're trying to avoid. Unless the tuple flow
 is organized to only happen in the necessary cases the performance will
 be horrible.

I can't understand this at all.  A parallel heap scan, as I've coded
it up, involves no tuple flow at all.  All that's happening at the
heapam.c layer is that we're coordinating which blocks to scan.  Not
to be disrespectful, but have you actually looked at the patch?
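To make that concrete: the only thing the cooperating backends share is a
block allocator.  Each worker claims the next chunk of block numbers and
scans those blocks entirely locally, so no tuples ever cross a process
boundary.  A rough standalone sketch (the struct, names, and chunk size
here are illustrative, not the actual patch):

/*
 * Illustrative sketch of chunk-based block coordination.  In the real
 * thing the shared struct would live in dynamic shared memory and each
 * cooperating process would run the claim loop; here a single process
 * just drains it.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;	/* stand-in for PostgreSQL's BlockNumber */

typedef struct ParallelScanShared
{
	BlockNumber nblocks;			/* total blocks in the relation */
	BlockNumber chunk_size;			/* blocks handed out per request */
	_Atomic BlockNumber next_block;	/* next unclaimed block */
} ParallelScanShared;

/* Claim the next chunk; returns false once the relation is exhausted. */
static bool
claim_chunk(ParallelScanShared *shared, BlockNumber *start, BlockNumber *end)
{
	BlockNumber first = atomic_fetch_add(&shared->next_block,
										 shared->chunk_size);

	if (first >= shared->nblocks)
		return false;
	*start = first;
	*end = first + shared->chunk_size;
	if (*end > shared->nblocks)
		*end = shared->nblocks;
	return true;
}

int
main(void)
{
	ParallelScanShared shared;
	BlockNumber start, end;

	shared.nblocks = 100;
	shared.chunk_size = 8;
	atomic_init(&shared.next_block, 0);

	while (claim_chunk(&shared, &start, &end))
		printf("scanning blocks [%u, %u)\n", start, end);
	return 0;
}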

 And good chunk sizes et al depend on higher layers,
 selectivity estimates and such. And that's planner/executor work, not
 the physical layer (which heapam.c pretty much is).

If it's true that a good chunk size depends on the higher layers, then
that would be a good argument for doing this differently, or at least
exposing an API for the higher layers to tell heapam.c what chunk size
they want.  I hadn't considered that possibility - can you elaborate
on why you think we might want to vary the chunk size?

 An individual heap scan's state lives in process-private memory. And if
 the results inside the separate workers should be used directly in
 these workers without shipping them over the network, it'd be horrible to
 have the logic in the heapscan. How would you otherwise model an executor
 tree that does the seqscan and aggregation combined in multiple
 processes at the same time?

Again, the heap scan is not shipping anything anywhere ever in any
design of any patch proposed or written.  The results *are* directly
used inside each individual worker.

 I think we're in violent agreement here, except for some
 terminological confusion.  Are there N PartialSeqScan nodes, one
 running in each node, or is there one ParallelSeqScan node, which is
 copied and run jointly across N nodes?  You can talk about either way
 and have it make sense, but we haven't had enough conversations about
 this on this list to have settled on a consistent set of vocabulary
 yet.

 I pretty strongly believe that it has to be independent scan nodes. Both
 from an implementation and a conversational POV. They might have some
 very light cooperation between them (e.g. coordinating block ranges or
 such), but everything else should be separate. From an implementation
 POV it seems pretty awful to have an executor node that's accessed by
 multiple separate backends - that'd mean it would have to be concurrency safe,
 have state in shared memory and everything.

I don't agree with that, but again I think it's a terminological
dispute.  I think what will happen is that you will have a single node
that gets copied into multiple backends, and in some cases a small
portion of its state will live in shared memory.  That's more or less
what you're thinking of too, I think.

But what I don't want is - if we've got a parallel scan-and-aggregate
happening in N nodes, EXPLAIN shows N copies of all of that - not only
because it's display clutter, but also because a plan to do that thing
with 3 workers is fundamentally the same as a plan to do it with 30
workers.  Those plans shouldn't look different, except perhaps for a
line some place that says "Number of Workers: N".

 Now, there'll be a node that needs to do some parallel magic - but in
 the above example that should be the AggCombinerNode, which would not
 only ask for tuples from one of the children at a time, but ask multiple
 ones in parallel. But even then it doesn't have to deal with concurrency
 around its own state.

Sure, we clearly want to minimize the amount of coordination between nodes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] For cursors, there is FETCH and MOVE, why no TELL?

2015-02-10 Thread Tom Lane
Marc Balmer m...@msys.ch writes:
 That is simple indeed.  I tend to think, however, that it would be
 cleaner to return the position as a proper result from a function
 instead of using a side effect from a FETCH/MOVE command.

Yeah.  For one thing, a command tag wouldn't help you at all if you
wanted to know the current cursor position inside a plpgsql function.

There are also backwards-compatibility reasons to be nervous about
changing the long-standing command tag values for these commands.

An issue that would have to be addressed is what the function ought
to do if posOverflow is set, which is entirely feasible on Windows
(or anyplace else where long is only 32 bits).  Maybe we should
redeclare portalPos as int64 and get rid of the posOverflow logic.
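A minimal sketch of such a function, for the sake of discussion (the name
and error wording are made up; with portalPos still declared as long, the
posOverflow case mentioned above would also need handling):

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"
#include "utils/portal.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_cursor_position);

/* Return the current position of a named cursor as int64. */
Datum
pg_cursor_position(PG_FUNCTION_ARGS)
{
	char	   *name = text_to_cstring(PG_GETARG_TEXT_PP(0));
	Portal		portal = GetPortalByName(name);

	if (!PortalIsValid(portal))
		ereport(ERROR,
				(errcode(ERRCODE_UNDEFINED_CURSOR),
				 errmsg("cursor \"%s\" does not exist", name)));

	PG_RETURN_INT64((int64) portal->portalPos);
}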

regards, tom lane




Re: [HACKERS] For cursors, there is FETCH and MOVE, why no TELL?

2015-02-10 Thread Pavel Stehule
2015-02-10 16:21 GMT+01:00 Tom Lane t...@sss.pgh.pa.us:

 Marc Balmer m...@msys.ch writes:
  That is simple indeed.  I tend to think, however, that it would be
  cleaner to return the position as a proper result from a function
  instead of using a side effect from a FETCH/MOVE command.

 Yeah.  For one thing, a command tag wouldn't help you at all if you
 wanted to know the current cursor position inside a plpgsql function.


It can be solved via a GET DIAGNOSTICS statement.


 There are also backwards-compatibility reasons to be nervous about
 changing the long-standing command tag values for these commands.


Yes, this is a serious risk - and it is too high a cost for a relatively
less-used feature.

Regards

Pavel


 An issue that would have to be addressed is what the function ought
 to do if posOverflow is set, which is entirely feasible on Windows
 (or anyplace else where long is only 32 bits).  Maybe we should
 redeclare portalPos as int64 and get rid of the posOverflow logic.

 regards, tom lane



[HACKERS] Better error message on pg_upgrade checksum mismatches

2015-02-10 Thread Greg Sabino Mullane
Just a little thing that's been bugging me. If one side of the 
pg_upgrade has checksums and the other does not, give a less 
cryptic error message.


-- 
Greg Sabino Mullane g...@endpoint.com
End Point Corporation
PGP Key: 0x14964AC8
diff --git a/contrib/pg_upgrade/controldata.c b/contrib/pg_upgrade/controldata.c
index a02a8ec..8a7b976 100644
--- a/contrib/pg_upgrade/controldata.c
+++ b/contrib/pg_upgrade/controldata.c
@@ -572,9 +572,17 @@ check_control_data(ControlData *oldctrl,
 	 * We might eventually allow upgrades from checksum to no-checksum
 	 * clusters.
 	 */
+	if (!oldctrl->data_checksum_version && newctrl->data_checksum_version)
+	{
+		pg_fatal("old version does not use data checksums but new one does\n");
+	}
+	if (oldctrl->data_checksum_version && !newctrl->data_checksum_version)
+	{
+		pg_fatal("old version uses data checksums but new one does not\n");
+	}
 	if (oldctrl->data_checksum_version != newctrl->data_checksum_version)
 	{
-		pg_fatal("old and new pg_controldata checksum versions are invalid or do not match\n");
+		pg_fatal("old and new pg_controldata checksum versions do not match\n");
 	}
 }
 




[HACKERS] Fixing pg_dump's heuristic about casts

2015-02-10 Thread Tom Lane
I was reminded today by Greg Mullane's blog post
http://blog.endpoint.com/2015/02/postgres-custom-casts-and-pgdump.html
about how pg_dump fails to dump custom casts if they are casts between
built-in types.  Setting aside the wisdom of creating such a cast,
it's definitely annoying that pg_dump misses it.  He's hardly the
first to complain, too.

The current heuristic dates back to this discussion:
http://www.postgresql.org/message-id/flat/3f7066bf.6010...@yahoo.com
(see commit a6790ce85752b67ad994f55fdf1a450262ccc32e).  Re-reading it,
I'm wondering why we didn't go with plan B, namely determine whether
to dump a cast based on its OID.  That is, dump it if it has an OID =
FirstNormalObjectId (16384), otherwise not.  That's admittedly pretty
ugly, but the existing heuristic isn't exactly beautiful.  And it's not
like we haven't used the OID rule for other things --- see extensions.

So I propose ripping out the current heuristic (in HEAD, lines 10482-10530
of pg_dump.c) and instead inspecting the cast's OID to decide whether to
dump it, much like selectDumpableExtension() does.
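In code form, the replacement would be roughly this (a sketch modeled on
selectDumpableExtension(), using pg_dump's existing CastInfo/DumpableObject
fields; whether binary-upgrade mode needs anything extra is left aside):

/*
 * Sketch only: dump a cast if and only if it was created after initdb,
 * i.e. its OID is in the user range, regardless of the types involved.
 */
static void
selectDumpableCast(CastInfo *cast)
{
	if (cast->dobj.catId.oid < (Oid) FirstNormalObjectId)
		cast->dobj.dump = false;	/* built-in cast: not ours to dump */
	else
		cast->dobj.dump = include_everything;
}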

I'd further propose back-patching this fix.

Comments?

regards, tom lane




Re: [HACKERS] enabling nestedloop and disabling hashjon

2015-02-10 Thread Tom Lane
Ravi Kiran ravi.kolanp...@gmail.com writes:
 yes sir, I did try the pg_ctl reload command, but it's still using the hash
 join algorithm and not the nested loop algorithm. I even restarted the
 server; even then it's still using the hash join algorithm

Does "show enable_hashjoin" say it's off?  If not, I think you must've
fat-fingered the postgresql.conf change somehow.

regards, tom lane




Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Peter Geoghegan
On Tue, Feb 10, 2015 at 5:22 PM, Arthur Silva arthur...@gmail.com wrote:
 I assume if the hacker can intercept the server's unencrypted traffic and/or
 has access to its hard drive, the database is compromised anyway.

That sounds like an argument against hashing the passwords in general.


-- 
Peter Geoghegan




Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Robert Haas
On Tue, Feb 10, 2015 at 7:32 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Feb 10, 2015 at 4:21 PM, Robert Haas robertmh...@gmail.com wrote:
 Although the patch was described as "relatively easy to write", it never
 went anywhere, because it *replaced* MD5 authentication with bcrypt,
 which would be a big problem for existing clients.  It seems clear
 that we should add something new and not immediately kill off what
 we've already got, so that people can transition smoothly.  An idea I
 just had today is to keep using basically the same system that we are
 currently using for MD5, but with a stronger hash algorithm, like
 SHA-1 or SHA-2 (which includes SHA-224, SHA-256, SHA-384, and
 SHA-512).  Those are slower, but my guess is that even SHA-512 is not
 enough slower for anybody to care very much, and if they do, well
 that's another reason to make use of the new stuff optional.

 I believe that a big advantage of bcrypt for authentication is the
 relatively high memory requirements. This frustrates GPU based
 attacks.

I don't actually care which algorithm we use, and I dowannahafta care.
What I do want to do is provide a framework so that, when somebody
discovers that X is better than Y because Z, somebody who knows about
cryptography and not much about PostgreSQL can add support for X in a
relatively small number of lines of code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Peter Geoghegan
On Tue, Feb 10, 2015 at 5:14 PM, Arthur Silva arthur...@gmail.com wrote:
 I don't think the password storing best practices apply to db connection
 authentication.

Why not?

-- 
Peter Geoghegan




Re: [HACKERS] parallel mode and parallel contexts

2015-02-10 Thread Andres Freund
On 2015-02-10 11:49:58 -0500, Robert Haas wrote:
 On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund and...@2ndquadrant.com wrote:
  * Don't like CreateParallelContextForExtension much as a function name -
I don't think we normally equate the fact that code is located in a
loadable library with the extension mechanism.

 Suggestions for a better name?  CreateParallelContextForLoadableFunction?

*ExternalFunction maybe? That's what dfmgr.c calls it.

  * the plain shm calculations intentionally use mul_size/add_size to deal
with overflows. On 32bit it doesn't seem impossible, but unlikely to
overflow size_t.

 Yes, I do that here too, though as with the plain shm calculations,
 only in the estimate functions.  The functions that actually serialize
 stuff don't have to worry about overflow because it's already been
 checked.

I thought I'd seen some estimation functions that didn't use it. But
apparently I was wrong.

  * In +LaunchParallelWorkers, does it make sense trying to start workers
if one failed? ISTM that's likely to not be helpful. I.e. it should
just break; after the first failure.

 It can't just break, because clearing pcxt->worker[i].error_mqh is an
 essential step.  I could add a flag variable that tracks whether any
 registrations have failed and change the if statement to if
 (!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker,
 &pcxt->worker[i].bgwhandle)), if you want.  I thought about doing that
 but decided it was very unlikely to affect the real-world performance
 of anything, so easier just to keep the code simple. But I'll change
 it if you want.

I think it'd be better.

  * ParallelMain restores libraries before GUC state. Given that
libraries can, and actually somewhat frequently do, inspect GUCs on
load, it seems better to do it the other way round? You argue "We want
to do this before restoring GUCs, because the libraries might define
custom variables.", but I don't buy that. It's completely normal for
namespaced GUCs to be present before a library is loaded? Especially
as you then go and process_session_preload_libraries() after setting
the GUCs.

 This is a good question, but the answer is not entirely clear to me.
 I'm thinking I should probably just remove
 process_session_preload_libraries() altogether at this point.  That
 was added at a time when RestoreLibraryState() didn't exist yet, and I
 think RestoreLibraryState() is a far superior way of handling this,
 because surely the list of libraries that the parallel leader
 *actually had loaded* is more definitive than any GUC.

That sounds like a good idea to me.

 Now, that doesn't answer the question about whether we should load
 libraries first or GUCs.  I agree that the comment's reasoning is
 bogus, but I'm not sure I understand why you think the other order is
 better.  It *is* normal for namespaced GUCs to be present before a
 library is loaded, but it's equally normal (and, I think, often more
 desirable in practice) to load the library first and then set the
 GUCs.

Well, it's pretty much never the case that the library is loaded before
postgresql.conf gucs, right? A changed postgresql.conf is the only
exception I can see. Neither is it the normal case for
session|local_preload_libraries. Not even when GUCs are loaded via
pg_db_role_setting or the startup packet...

 Generally, I think that libraries ought to be loaded as early
 as possible, because they may install hooks that change the way other
 stuff works, and should therefore be loaded before that other stuff
 happens.

While that may be desirable I don't really see a reason for this to be
treated differently for worker processes than the majority of cases
otherwise.

Anyway, I think this is a relatively minor issue.

  * Should ParallelMain maybe enter the parallel context before loading
user defined libraries? It's far from absurd to execute code touching
the database on library initialization...

 It's hard to judge without specific examples.  What kinds of things do
 they do?

I've seen code filling lookup caches and creating system objects
(including tables and extensions).

 Are they counting on a transaction being active?

 I would have thought that was a no-no, since there are many instances
 in which it won't be true.  Also, you might have just gotten loaded
 because a function stored in your library was called, so you could be
 in a transaction that's busy doing something else, or deep in a
 subtransaction stack, etc.  It seems unsafe to do very much more than
 a few syscache lookups here, even if there does happen to be a
 transaction active.

The only reason I'd like it to be active is because that'd *prohibit*
doing the crazier stuff. There seems little reason not to do it, given
the additional protection against crazy things that'd give us.

  * I think restoring snapshots needs to fudge the worker's PGXACT->xmin
to be the minimum of the top transaction id and the
snapshots. 

Re: [HACKERS] ibm system z in the buildfarm

2015-02-10 Thread Robert Haas
On Tue, Feb 10, 2015 at 8:49 PM, Mark Wong m...@2ndquadrant.com wrote:
 I've seen in the archive a call for more architecture coverage so I just
 wanted to send a quick note that there is now Linux on System Z in the
 buildfarm now:

 http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=nudibranch&br=HEAD

Nice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Feb 10, 2015 at 9:30 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Another thing we need to keep in mind besides client compatibility
 is dump/reload compatibility.

 I don't think there's an issue with dump/reload compatibility as far
 as what I proposed, since it only has to do with the authentication
 procedure, not what gets stored in pg_authid.  We might have reasons
 for moving that away from MD5 as well, but it's a separate project.

Hm, well, that doesn't really square with your other expressed opinion:

 Are there other goals?

 I think the goal is to stop using MD5, or at least have an option to not
 use MD5, because people think that's insecure.

As you say, it's quite debatable whether MD5 is or isn't secure enough
given the way we use it, but what's not debatable is that the optics of it
are not very good anymore.  However, if we want to shut up the peanut
gallery on this point, we have to get rid of MD5 usage in pg_authid, not
just in the on-the-wire protocol --- I seriously doubt that the knee-jerk
MD5-is-insecure crowd will make any distinction there.  So I'm not
following how you're satisfied with a proposal for just the latter.

In any case, my larger point was that given the pain that we're going to
incur here, and the certainly years-long transition interval involved,
it would be foolish to think only about replacing the MD5 algorithm and
not about reconsidering the context we use it in.  Stuff like unreasonably
short salt values should be dealt with at the same time.

regards, tom lane




Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Arthur Silva
On Tue, Feb 10, 2015 at 11:19 PM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Feb 10, 2015 at 5:14 PM, Arthur Silva arthur...@gmail.com wrote:
  I don't think the password storing best practices apply to db
 connection
  authentication.

 Why not?

 --
 Peter Geoghegan


I assume if the hacker can intercept the server's unencrypted traffic and/or
has access to its hard drive, the database is compromised anyway.

I may be missing something, though.


[HACKERS] ibm system z in the buildfarm

2015-02-10 Thread Mark Wong
Hi everyone,

I've seen in the archive a call for more architecture coverage so I just
wanted to send a quick note that there is now Linux on System Z in the
buildfarm now:

http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=nudibranch&br=HEAD

Regards,
Mark
-- 
Mark Wong        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services




Re: [HACKERS] pgbench -f and vacuum

2015-02-10 Thread Peter Eisentraut
On 2/10/15 3:12 AM, Michael Paquier wrote:
 On Tue, Feb 10, 2015 at 5:03 PM, Tatsuo Ishii wrote:
 - The documentation misses some markups for pgbench and VACUUM and did
 not respect the 80-character limit.

 I didn't realize that there's such a style guide. Although I think
 it's a good thing, I just want to know where such a guide is
 described.
 
 That's self-learning based on the style of the other pages. I don't
 recall if there are actually convention guidelines for the docs, the
 only one I know of being the coding convention here:
 http://www.postgresql.org/docs/devel/static/source.html
 Maybe we could have a section dedicated to that. Thoughts?

We have http://www.postgresql.org/docs/devel/static/docguide-style.html,
although that doesn't cover formatting.  For that we have .dir-locals.el:

 (sgml-mode . ((fill-column . 78)
   (indent-tabs-mode . nil)))

;-)

Feel free to improve that.





Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Josh Berkus
On 02/10/2015 05:28 PM, Robert Haas wrote:
 I don't actually care which algorithm we use, and I dowannahafta care.
 What I do want to do is provide a framework so that, when somebody
 discovers that X is better than Y because Z, somebody who knows about
 cryptography and not much about PostgreSQL can add support for X in a
 relatively small number of lines of code.

+1

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On 2/10/15 8:28 PM, Robert Haas wrote:
 I don't actually care which algorithm we use, and I dowannahafta care.
 What I do want to do is provide a framework so that, when somebody
 discovers that X is better than Y because Z, somebody who knows about
 cryptography and not much about PostgreSQL can add support for X in a
 relatively small number of lines of code.

 sounds like SASL

Sounds like pie in the sky really :-(.  We could make the server turn on
a dime perhaps, but the client-side population will not come along nearly
that quickly, nor with small effort.  Stored passwords won't migrate to a
new scheme transparently either.

I think it's probably reasonable to think about a more modern password
auth method, but not to imagine that it will be pluggable or that the
adoption time for any revision will be less than years long.

regards, tom lane




Re: [HACKERS] 9.6 Feature help requested: Inclusion Constraints

2015-02-10 Thread Peter Eisentraut
On 2/9/15 3:12 AM, Jeff Davis wrote:
 On Sat, 2015-02-07 at 16:08 -0800, Jeff Davis wrote:
 I believe Inclusion Constraints will be important for postgres.
 
 I forgot to credit Darren Duncan with the name of this feature:
 
 http://www.postgresql.org/message-id/4f8bb9b0.5090...@darrenduncan.net

I think it would be confusing to name something "inclusion constraint"
that is not somehow the opposite of an exclusion constraint.





Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Claudio Freire
On Tue, Feb 10, 2015 at 10:19 PM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Feb 10, 2015 at 5:14 PM, Arthur Silva arthur...@gmail.com wrote:
 I don't think the password storing best practices apply to db connection
 authentication.

 Why not?


Usually because handshakes use a random salt on both sides. Not sure
about pg's though, but in general collision strength is required, not
slowness, since they're not brute-forceable.
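For readers following along, the exchange in question has this shape: the
server keeps md5(password || username), and the client proves knowledge of
that value by hashing it together with a 4-byte per-connection salt.  A
standalone sketch (md5_hex() is a local helper over OpenSSL, not a
PostgreSQL API; build with -lcrypto):

/*
 * Sketch of the md5 challenge/response shape, for illustration only.
 */
#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

static void
md5_hex(const unsigned char *data, size_t len, char *out)	/* out: 33 bytes */
{
	unsigned char digest[MD5_DIGEST_LENGTH];
	int		i;

	MD5(data, len, digest);
	for (i = 0; i < MD5_DIGEST_LENGTH; i++)
		sprintf(out + i * 2, "%02x", digest[i]);
}

int
main(void)
{
	const char *user = "alice";
	const char *password = "secret";
	unsigned char salt[4] = {0x01, 0x02, 0x03, 0x04};	/* random per connection */
	char		buf[64];
	char		stored[33];
	char		answer[33];
	unsigned char proof[36];

	/* what the server keeps: md5(password || username) */
	snprintf(buf, sizeof(buf), "%s%s", password, user);
	md5_hex((const unsigned char *) buf, strlen(buf), stored);

	/* what the client sends: "md5" || md5(stored_hex || salt) */
	memcpy(proof, stored, 32);
	memcpy(proof + 32, salt, 4);
	md5_hex(proof, sizeof(proof), answer);

	printf("stored:   md5%s\nresponse: md5%s\n", stored, answer);
	return 0;
}

The random salt is what makes the exchange a challenge-response rather than
a replayable password hash; whether MD5 itself is still an acceptable
primitive is the separate question.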




Re: [HACKERS] Typo in logicaldecoding.sgml

2015-02-10 Thread Robert Haas
On Tue, Feb 10, 2015 at 6:57 PM, Andres Freund and...@2ndquadrant.com wrote:
 I think one of them could be a fix. Any advice?

 I slightly prefer the second form...

I think "it uses" is definitely more standard English usage in this context.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Robert Haas
On Tue, Feb 10, 2015 at 9:30 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 2. We'd have to figure out how to get support for the new algorithms
 into libpq.  This actually seems a good bit harder than doing it on
 the server-side, because I don't think libpq has any dynamic loading
 facilities the way the server does.

 If you think libpq is the only problem, or even the main problem,
 on the client side then you have seriously misjudged the situation.
 There are multiple clients that do not use libpq at all, JDBC being
 the poster child here.

We don't want to make things unnecessarily difficult for clients that
implement the wire protocol from scratch, but I don't think it's our
job to tell the owners of other connectors if, or when, they want to
support new authentication methods.  They certainly won't support them
*before* we have server-side support; we can hope that if we add
server-side support, they will follow along.

 Another thing we need to keep in mind besides client compatibility
 is dump/reload compatibility.

I don't think there's an issue with dump/reload compatibility as far
as what I proposed, since it only has to do with the authentication
procedure, not what gets stored in pg_authid.  We might have reasons
for moving that away from MD5 as well, but it's a separate project.

 I think it would be wise to take two steps back and think about what
 the threat model is here, and what we actually need to improve.
 Offhand I can remember two distinct things we might wish to have more
 protection against:

 * scraping of passwords off the wire protocol (but is that still
 a threat in an SSL world?).  Better salting practice would do more
 than replacing the algorithm as such for this, IMO.

 * scraping of passwords directly from the pg_authid table or from
 a pg_dump dump.  In this case it's not so much that we are worried
 about preventing the scraper from accessing the database (he's
 evidently in already) as that we'd like to prevent him from recovering
 the original cleartext, since the user might've used that same password
 elsewhere.  (Bad user, but nonetheless something we might try to
 protect against.)  Again, more salt seems like it might go a lot further
 than just changing the algorithm.

 Are there other goals?

I think the goal is to stop using MD5, or at least have an option to not
use MD5, because people think that's insecure.  Whether those people
are right or not is probably extremely arguable - indeed, I can
imagine your opening salvo here giving rise to vigorous debate about
whether MD5, in the particular way we are using it, is or is not
secure.  The problem is that MD5 has enough weaknesses that, even if
it is actually secure in the way we are using it, some people are not
going to believe that, and are going to think we're bad people for
using it in any way whatsoever.  I'd like to keep those people as
users, whoever is right about the security issues.  The second problem
is that, even if we could convince all of the people who might be
worried about MD5 that their fears are unfounded, a new attack can be
discovered at any time that weakens it still further and renders what
we're doing unsafe.  We shouldn't wait until then to start the
years-long process of weaning ourselves off of it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Peter Eisentraut
On 2/10/15 8:28 PM, Robert Haas wrote:
 I don't actually care which algorithm we use, and I dowannahafta care.
 What I do want to do is provide a framework so that, when somebody
 discovers that X is better than Y because Z, somebody who knows about
 cryptography and not much about PostgreSQL can add support for X in a
 relatively small number of lines of code.

sounds like SASL





Re: [HACKERS] Show the LSN in rm_redo_error_callback

2015-02-10 Thread Peter Eisentraut
On 2/10/15 5:15 PM, Andres Freund wrote:
 Hi,
 
 I've repeatedly stared at logs containing PANICs during WAL
 replay. Unfortunately it's rather hard to find out which record actually
 caused the problem as we print the record's contents but not its LSN.
 
 I think it's a no-brainer to add that to master.

Makes sense.

 But I'd even argue that
 it'd be a good idea to add it to the backbranches - there've been a
 significant number of bugs causing PANICs due to replay errors in the
 past (and we might be hunting one more of them right now).

We've had WAL since 7.1, and this is the first I've heard of this idea,
so while I sympathize with the idea, it's hard to agree that this is a
must-have.  The standard for back-patching ought to be higher than mere
convenience.






Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Although the patch was described as "relatively easy to write", it never
 went anywhere, because it *replaced* MD5 authentication with bcrypt,
 which would be a big problem for existing clients.

Right.  The client end of it is the elephant in the room in any discussion
of improving the password hash algorithm.

 2. We'd have to figure out how to get support for the new algorithms
 into libpq.  This actually seems a good bit harder than doing it on
 the server-side, because I don't think libpq has any dynamic loading
 facilities the way the server does.

If you think libpq is the only problem, or even the main problem,
on the client side then you have seriously misjudged the situation.
There are multiple clients that do not use libpq at all, JDBC being
the poster child here.

Another thing we need to keep in mind besides client compatibility
is dump/reload compatibility.


I think it would be wise to take two steps back and think about what
the threat model is here, and what we actually need to improve.
Offhand I can remember two distinct things we might wish to have more
protection against:

* scraping of passwords off the wire protocol (but is that still
a threat in an SSL world?).  Better salting practice would do more
than replacing the algorithm as such for this, IMO.

* scraping of passwords directly from the pg_authid table or from
a pg_dump dump.  In this case it's not so much that we are worried
about preventing the scraper from accessing the database (he's
evidently in already) as that we'd like to prevent him from recovering
the original cleartext, since the user might've used that same password
elsewhere.  (Bad user, but nonetheless something we might try to
protect against.)  Again, more salt seems like it might go a lot further
than just changing the algorithm.

Are there other goals?

regards, tom lane




Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Arthur Silva
On Tue, Feb 10, 2015 at 11:25 PM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Feb 10, 2015 at 5:22 PM, Arthur Silva arthur...@gmail.com wrote:
  I assume if the hacker can intercept the server's unencrypted traffic
 and/or
  has access to its hard drive, the database is compromised anyway.

 That sounds like an argument against hashing the passwords in general.


 --
 Peter Geoghegan


Indeed.

In a perfect world SCRAM would be my choice. FWIW, MongoDB 3.0 also uses
SCRAM as the preferred method for password-based authentication.


Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords

2015-02-10 Thread Peter Eisentraut
On 2/5/15 10:13 AM, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 All that having been said, it wouldn't be crazy to try to invent a
 system to lock this down, but it *would* be complicated.  An
 individual FDW can call its authentication-related options anything it
 likes; they do not need to be called 'password'.  So we'd need a way
 to identify which options should be hidden from untrusted users, and
 then a bunch of mechanism to do that.
 
 It's also debatable whether this wouldn't be a violation of the SQL
 standard.  I see nothing in the SQL-MED spec authorizing filtering
 of the information_schema.user_mapping_options view.
 
 We actually are doing some filtering of values in user_mapping_options,
 but it's all-or-nothing so far as the options for any one mapping go.
 That's still not exactly supportable per spec but it's probably less of a
 violation than option-by-option filtering would be.
 
 It also looks like that filtering differs in corner cases from what the
 regular pg_user_mappings view does, which is kinda silly.  In particular
 I think we should try to get rid of the explicit provision for superuser
 access.
 
 I was hoping Peter would weigh in on what his design considerations
 were for these views ...

I recall that we had extensive discussions about this back in the day.
The SQL standard doesn't provide for any filtering of options, and the
current behavior is just a bare minimum to not appear completely stupid.

Since we don't know which options are security-sensitive, the only
choices are hide everything or hide nothing.  Note that if you opt for
hide everything, you will also need to hide everything from servers and
wrappers, since they could also contain passwords in their options.

We could ask the FDW which options are security sensitive, but that
seems quite complicated to implement.  We could require the user to
specify which option values they want hidden (SENSITIVE OPTION '...' or
whatever).

I would welcome improvements in this area.  I'm not worried about SQL
standard requirements here.





[HACKERS] pg_upgrade bug in handling postgres/template1 databases

2015-02-10 Thread Bruce Momjian
I received a private pg_upgrade bug report related to its effect on the
removal of clog files in the upgraded cluster.  The user reported that
an upgraded cluster yielded this error very soon after being started for
the first time:

SELECT * FROM test;
ERROR:  could not access status of transaction 685
DETAIL:  Could not open file pg_clog/: No such file or directory.

I was confused as I had never seen such a report before.  After much
digging, I found out that the user was storing data _only_ in the
'postgres' database, not any other databases, and that pg_upgrade was
not preserving pg_database.datfrozenxid and pg_database.datminmxid, even
though it was processing those databases and copying over any user
tables and indexes.

When the new server was started, pg_database.datminmxid equaled the
current transaction counter and old clog files were removed, yielding
the error.

This is not a problem for users who store just functions or extensions in
the postgres or template1 databases, only for those who store user tables
and indexes there.

The attached patch fixes the problem, and I have reproduced the bug and
verified the patch fixes it.  It would have been nice if I had figured
this out two weeks ago, before we released minor version upgrades.  This
bug has existed back to 9.0 or earlier, so it isn't new.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
new file mode 100644
index 3e3b433..ec7246a
*** a/src/bin/pg_dump/pg_dumpall.c
--- b/src/bin/pg_dump/pg_dumpall.c
*** dumpCreateDB(PGconn *conn)
*** 1417,1432 ****
  
  			appendPQExpBufferStr(buf, ";\n");
  
! 			if (binary_upgrade)
! 			{
! 				appendPQExpBufferStr(buf, "-- For binary upgrade, set datfrozenxid and datminmxid.\n");
! 				appendPQExpBuffer(buf, "UPDATE pg_catalog.pg_database "
! 								  "SET datfrozenxid = '%u', datminmxid = '%u' "
! 								  "  WHERE datname = ",
! 								  dbfrozenxid, dbminmxid);
! 				appendStringLiteralConn(buf, dbname, conn);
! 				appendPQExpBufferStr(buf, ";\n");
! 			}
  		}
  
  		if (!skip_acls &&
--- 1417,1433 ----
  
  			appendPQExpBufferStr(buf, ";\n");
  
! 		}
! 
! 		if (binary_upgrade)
! 		{
! 			appendPQExpBufferStr(buf, "-- For binary upgrade, set datfrozenxid and datminmxid.\n");
! 			appendPQExpBuffer(buf, "UPDATE pg_catalog.pg_database "
! 							  "SET datfrozenxid = '%u', datminmxid = '%u' "
! 							  "  WHERE datname = ",
! 							  dbfrozenxid, dbminmxid);
! 			appendStringLiteralConn(buf, dbname, conn);
! 			appendPQExpBufferStr(buf, ";\n");
! 		}
  		}
  



Re: [HACKERS] The return value of allocate_recordbuf()

2015-02-10 Thread Robert Haas
On Mon, Feb 9, 2015 at 7:02 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Mon, Feb 9, 2015 at 7:58 PM, Fujii Masao masao.fu...@gmail.com wrote:
 MemoryContextAllocExtended() was added, so isn't it time to replace palloc()
 with MemoryContextAllocExtended(CurrentMemoryContext, MCXT_ALLOC_NO_OOM)
 in allocate_recordbuf()?

 Yeah, let's move on here, but not with this patch I am afraid as this
 breaks any client utility using xlogreader.c in frontend, like
 pg_xlogdump or pg_rewind because MemoryContextAllocExtended is not
 available in frontend, only in backend. We are going to need something
 like palloc_noerror, defined on both backend (mcxt.c) and frontend
 (fe_memutils.c) side.

Unfortunately, that gets us back into the exact terminological dispute
that we were hoping to avoid.  Perhaps we could compromise on
palloc_extended().
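A sketch of what that might look like, taking the name as the compromise
just floated (backend side only; a malloc-based twin in fe_memutils.c would
cover frontend users, and the flag names follow MemoryContextAllocExtended):

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Sketch only: allocate in the current context, passing flags such as
 * MCXT_ALLOC_NO_OOM or MCXT_ALLOC_HUGE straight through, so callers like
 * allocate_recordbuf() can check for NULL instead of erroring out.
 */
void *
palloc_extended(Size size, int flags)
{
	return MemoryContextAllocExtended(CurrentMemoryContext, size, flags);
}

allocate_recordbuf() would then do something like
palloc_extended(newSize, MCXT_ALLOC_NO_OOM) and report the failure itself
when NULL comes back.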

 Btw, the huge flag should be used as well, as palloc only allows
 allocation up to 1GB, and this is incompatible with ~9.4 where malloc
 is used.

I think that flag should be used only if it's needed, not just on the
grounds that 9.4 has no such limit.  In general, limiting allocations
to 1GB has been good to us; for example, it prevents a corrupted TOAST
length from causing a memory allocation large enough to crash the
machine (yes, there are machines you can crash with a giant memory
allocation!).  We should bypass that limit only where it is clearly
necessary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] RangeType internal use

2015-02-10 Thread Robert Haas
On Mon, Feb 9, 2015 at 7:54 PM, Amit Langote
langote_amit...@lab.ntt.co.jp wrote:
 Well, that's debatable IMO (especially your claim that variable-size
 partitions would be needed by a majority of users).  But in any case,
 partitioning behavior that is emergent from a bunch of independent pieces
 of information scattered among N tables seems absolutely untenable from
 where I sit.  Whatever we support, the behavior needs to be described by
 *one* chunk of information --- a sorted list of bin bounding values,
 perhaps.

 I'm a bit confused here. I got an impression that partitioning formula
 as you suggest would consist of two pieces of information - an origin
  point & a bin width. Then routing a tuple consists of using exactly
 these two values to tell a bin number and hence a partition in O(1) time
 assuming we've made all partitions be exactly bin-width wide.

 You mention here a sorted list of bin bounding values which we can very
 well put together for a partitioned table in its relation descriptor
 based on whatever information we stored in catalog. That is, we can
 always have a *one* chunk of partitioning information as *internal*
 representation irrespective of how generalized we make our on-disk
 representation. We can get O(log N) if not O(1) from that I'd hope. In
 fact, that's what I had in mind about this.

Sure, we can always assemble data into a relation descriptor from
across multiple catalog entries.  I think the question is whether
there is any good reason to split up the information across multiple
relations or whether it might not be better, as I have suggested
multiple times, to serialize it using nodeToString() and stuff it in a
single column in pg_class.  There may be such a reason, but if you
said what it was, I missed that.  This thread started as a discussion
about using range types, and I think it's pretty clear that's a bad
idea, because:

1. There's no guarantee that a range type for the datatype exists at all.
2. If it does, there's no guarantee that it uses the same opclass that
we want to use for partitioning, and I certainly think it would be
strange if we refused to let the user pick the opclass she wants to
use.
3. Even if there is a suitable range type available, it's a poor
representational choice here, because it will be considerably more
verbose than just storing a sorted list of partition bounds.  In the
common case where the ranges are adjacent, you'll end up storing two
copies of every bound but the first and last for no discernable
benefit.
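To illustrate point 3: with a single sorted list of bounds, tuple routing is
just a binary search over that array, which also gets the O(log N) behavior
mentioned upthread.  A toy sketch (integer keys only; bounds[i] is the
exclusive upper bound of partition i, and the last partition is unbounded):

#include <stdio.h>

/* Return the index of the partition that should receive "value". */
static int
route_partition(const int *bounds, int nbounds, int value)
{
	int		lo = 0,
			hi = nbounds;	/* nbounds bounds define nbounds + 1 partitions */

	while (lo < hi)
	{
		int		mid = lo + (hi - lo) / 2;

		if (value < bounds[mid])
			hi = mid;
		else
			lo = mid + 1;
	}
	return lo;				/* first partition whose upper bound exceeds value */
}

int
main(void)
{
	int		bounds[] = {10, 20, 30};	/* three bounds => four partitions */

	printf("5 -> %d, 10 -> %d, 25 -> %d, 99 -> %d\n",
		   route_partition(bounds, 3, 5),
		   route_partition(bounds, 3, 10),
		   route_partition(bounds, 3, 25),
		   route_partition(bounds, 3, 99));
	return 0;
}

With adjacent ranges, the two-bounds-per-partition range representation
stores nearly every bound twice without buying anything over this.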

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-02-10 Thread Robert Haas
On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund and...@2ndquadrant.com wrote:
 Observations:
 * Some trailing whitespace in the readme. Very nice otherwise.

Fixed.  Thanks.

 * Don't like CreateParallelContextForExtension much as a function name -
   I don't think we normally equate the fact that code is located in a
   loadable library with the extension mechanism.

Suggestions for a better name?  CreateParallelContextForLoadableFunction?

 * StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's
   that about?

Gee, maybe I should have added a comment so I'd remember.  If the
buffer size isn't MAXALIGN'd, that would be really bad, because
shm_mq_create() assumes that it's being given an aligned address.
Maybe I should add an Assert() there.  If it is MAXALIGN'd but not
BUFFERALIGN'd, we might waste a few bytes of space, since
shm_toc_allocate() always allocates in BUFFERALIGN'd chunks, but I
don't think anything will actually break.  Not sure if that's worth an
assert or not.
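For reference, the compile-time check the review comment is asking about has
roughly this shape (the constant and the enclosing function are illustrative
placements, since StaticAssertStmt needs statement context):

#include "postgres.h"

#define PARALLEL_ERROR_QUEUE_SIZE	16384

static void
setup_error_queues(void)		/* hypothetical placement */
{
	/*
	 * Catch a size that isn't BUFFERALIGN'd: shm_mq_create() needs an
	 * aligned address, and shm_toc_allocate() hands out BUFFERALIGN'd
	 * chunks, so anything else either breaks or wastes space.
	 */
	StaticAssertStmt(PARALLEL_ERROR_QUEUE_SIZE ==
					 BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE),
					 "parallel error queue size not buffer-aligned");

	/* ... shm_toc_allocate() the queues and shm_mq_create() each one ... */
}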

 * the plain shm calculations intentionally use mul_size/add_size to deal
   with overflows. On 32bit it doesn't seem impossible, but unlikely to
   overflow size_t.

Yes, I do that here too, though as with the plain shm calculations,
only in the estimate functions.  The functions that actually serialize
stuff don't have to worry about overflow because it's already been
checked.

 * I'd s/very // in "We might be running in a very short-lived memory
   context.". Or replace it with "too".

Removed "very".

 * In +LaunchParallelWorkers, does it make sense trying to start workers
   if one failed? ISTM that's likely to not be helpful. I.e. it should
   just break; after the first failure.

It can't just break, because clearing pcxt->worker[i].error_mqh is an
essential step.  I could add a flag variable that tracks whether any
registrations have failed and change the if statement to if
(!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker,
&pcxt->worker[i].bgwhandle)), if you want.  I thought about doing that
but decided it was very unlikely to affect the real-world performance
of anything, so easier just to keep the code simple. But I'll change
it if you want.
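Concretely, the loop body with that flag would look something like this (a
sketch of the change being described, using the field names from the
discussion, not the patch text itself):

	bool		any_registrations_failed = false;

	for (i = 0; i < pcxt->nworkers; ++i)
	{
		memset(&worker, 0, sizeof(worker));
		/* ... fill in the bgw_* fields for this worker ... */

		if (!any_registrations_failed &&
			RegisterDynamicBackgroundWorker(&worker,
											&pcxt->worker[i].bgwhandle))
		{
			shm_mq_set_handle(pcxt->worker[i].error_mqh,
							  pcxt->worker[i].bgwhandle);
		}
		else
		{
			/*
			 * The essential cleanup: this slot has no worker whose errors
			 * we could read, so don't wait on its queue later.
			 */
			any_registrations_failed = true;
			pcxt->worker[i].bgwhandle = NULL;
			pcxt->worker[i].error_mqh = NULL;
		}
	}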

 * +WaitForParallelWorkersToFinish says that it waits for workers to exit
   cleanly. To me that's ambiguous. How about fully?

I've removed the word cleanly and added a comment to more fully
explain the danger:

+ * Even if the parallel operation seems to have completed successfully, it's
+ * important to call this function afterwards.  We must not miss any errors
+ * the workers may have thrown during the parallel operation, or any that they
+ * may yet throw while shutting down.

 * ParallelMain restores libraries before GUC state. Given that
   libraries can, and actually somewhat frequently do, inspect GUCs on
   load, it seems better to do it the other way round? You argue "We want
   to do this before restoring GUCs, because the libraries might define
   custom variables.", but I don't buy that. It's completely normal for
   namespaced GUCs to be present before a library is loaded? Especially
   as you then go and process_session_preload_libraries() after setting
   the GUCs.

This is a good question, but the answer is not entirely clear to me.
I'm thinking I should probably just remove
process_session_preload_libraries() altogether at this point.  That
was added at a time when RestoreLibraryState() didn't exist yet, and I
think RestoreLibraryState() is a far superior way of handling this,
because surely the list of libraries that the parallel leader
*actually had loaded* is more definitive than any GUC.

Now, that doesn't answer the question about whether we should load
libraries first or GUCs.  I agree that the comment's reasoning is
bogus, but I'm not sure I understand why you think the other order is
better.  It *is* normal for namespaced GUCs to be present before a
library is loaded, but it's equally normal (and, I think, often more
desirable in practice) to load the library first and then set the
GUCs.  Generally, I think that libraries ought to be loaded as early
as possible, because they may install hooks that change the way other
stuff works, and should therefore be loaded before that other stuff
happens.

 * Should ParallelMain maybe enter the parallel context before loading
   user defined libraries? It's far from absurd to execute code touching
   the database on library initialization...

It's hard to judge without specific examples.  What kinds of things do
they do?  Are they counting on a transaction being active?  I would
have thought that was a no-no, since there are many instances in which
it won't be true.  Also, you might have just gotten loaded because a
function stored in your library was called, so you could be in a
transaction that's busy doing something else, or deep in a
subtransaction stack, etc.  It seems unsafe to do very much more than
a few syscache lookups here, even if there does happen to be a
transaction active.

 * 

Re: [HACKERS] pgbench -f and vacuum

2015-02-10 Thread Tatsuo Ishii
 On Mon, Feb 9, 2015 at 2:54 PM, Tatsuo Ishii wrote:
 Agreed. Here is the patch to implement the idea: -f just implies -n.
 
 Some small comments:
  - is_no_vacuum, as well as is_init_mode, are defined as integers
  but their use implies that they are boolean switches. This patch sets
 is_no_vacuum to true, while the rest of the code actually increment
 its value when -n is used. Why not simply changing both flags as
 booleans? My suggestion is not really related to this patch and purely
 cosmetic but we could change things at the same time, or update your
  patch to increment is_no_vacuum++ when -f is used.

Yes, I have to admit that the current pgbench code is quite confusing
in this regard. I think we should change is_no_vacuum and is_init_mode
to boolean.

 - The documentation misses some markups for pgbench and VACUUM and did
 not respect the 80-character limit.

I didn't realize that there's such a style guide. Although I think
it's a good thing, I just want to know where such a guide is
described.

 Attached is an updated patch with those things changed.

Looks good.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

2015-02-10 Thread Andres Freund
On 2015-02-04 16:49:46 -0800, Peter Geoghegan wrote:
 On Tue, Feb 2, 2015 at 01:06 AM, Andres Freund and...@2ndquadrant.com wrote:
  Generally the split into the individual commits doesn't seem to make
  much sense to me.

I think trying to make that possible is a good idea in patches of this
size. It e.g. seems entirely possible to structure the patchset so that
the speculative lock infrastructure is added first and the rest
later. I've not thought more about how to split it up further, but I'm
pretty sure it's possible.

  The commits individually (except the first) aren't
  individually committable and aren't even meaningful. Splitting off the
  internal docs, tests and such actually just seems to make reviewing
  harder because you miss context. Splitting it so that individual pieces
  are committable and reviewable makes sense, but... I have no problem
  doing the user docs later. If you split off RLS support, you need to
  throw an error before it's implemented.

 I mostly agree. Basically, I did not intend for all of the patches to
 be individually committed. The mechanism by which EXCLUDED.*
 expressions are added is somewhat novel, and deserves to be
 independently *considered*. I'm trying to show how the parts fit
 together more so than breaking things down in to smaller commits (as
 you picked up on, 0001 is the exception - that is genuinely intended
 to be committed early). Also, those commit messages give me the
 opportunity to put those parts in their appropriate context vis-a-vis
 our discussions. They refer to the Wiki, for example, or reasons why
 pg_stat_statements shouldn't care about ExcludedExpr. Obviously the
 final commit messages won't look that way.

FWIW, I don't think anything here really should refer to the wiki...

  0002:
  * Tentatively I'd say that killspeculative should be done via a separate
function instead of heap_delete()

 Really? I guess if that were to happen, it would entail refactoring
 heap_delete() to call a static function, which was also called by a
 new kill_speculative() function that does this. Otherwise, you'd have
 far too much duplication.

I don't really think there actually is that much in common between
those. It looks to me like most of the code in heap_delete isn't
actually relevant for this case and could be cut short. My guess is that
only the WAL logging would be separated out.

  * I doubt logical decoding works with the patch as it stands.

 I thought so. Perhaps you could suggest a better use of the available
 XLOG_HEAP_* bits. I knew I needed to consider that more carefully
 (hence the XXX comment), but didn't get around to it.

I think you probably need to add test macros that make sure only the
individual bits are set, not the combination, and then only use those.

  * If an arbiter index is passed to ExecCheckIndexConstraints(), can't we
abort the loop after checking it? Also, do we really have to iterate
over indexes for that case? How about moving the loop contents to a
separate function and using that separately for the arbiter cases?

 Well, the failure to do that implies very few extra cycles, but sure.

It's not that much about the CPU cycles, but also about the mental
ones. If you have to think what happens if there's more than one
match...

  * ItemPointerIsValid

 What about it?

Uh. Oh. No idea. I wrote this pretty late at night ;)


  * /*
 * This may occur when an instantaneously invisible tuple is blamed
 * as a conflict because multiple rows are inserted with the same
 * constrained values.
 How can this happen? We don't insert multiple rows with the same
 command id?

 This is a cardinality violation [1]. It can definitely happen - just
 try the examples you see on the Wiki.

I don't care about the wiki from the point of code comments. This needs
to be understandable in five years time.

  * Perhaps it has previously been discussed but I'm not convinced by the
reasoning for not looking at opclasses in infer_unique_index(). This
seems like it'd prohibit ever having e.g. case insensitive opclasses -
  something surely worthwhile.

 I don't think anyone gave that idea the thumbs-up. However, I really
 don't see the problem. Sure, we could have case insensitive opclasses
 in the future, and you may want to make a unique index using one.

Then the problem suddenly becomes that previous choices of
indexes/statements aren't possible anymore. It seems much better to
introduce the syntax now and not have too much of a use case for
it.


  * The whole speculative insert logic isn't really well documented. Why,
for example, do we actually need the token? And why are there no
issues with overflow? And where is it documented that a 0 means
there's no token? ...

 Fair enough. Presumably it's okay that overflow theoretically could
 occur, because a race is all but impossible. The token represents a
 particular attempt by some backend at inserting a tuple, that needs to
 be waited on 

Re: [HACKERS] For cursors, there is FETCH and MOVE, why no TELL?

2015-02-10 Thread Pavel Stehule
Hi


the patch can be very simple:

diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c
new file mode 100644
index 2794537..20b9206
*** a/src/backend/commands/portalcmds.c
--- b/src/backend/commands/portalcmds.c
*** PerformPortalFetch(FetchStmt *stmt,
*** 181,189 ****
  
  	/* Return command status if wanted */
  	if (completionTag)
! 		snprintf(completionTag, COMPLETION_TAG_BUFSIZE, "%s %ld",
  				 stmt->ismove ? "MOVE" : "FETCH",
! 				 nprocessed);
  }
  
  /*
--- 181,190 ----
  
  	/* Return command status if wanted */
  	if (completionTag)
! 		snprintf(completionTag, COMPLETION_TAG_BUFSIZE, "%s %ld %ld",
  				 stmt->ismove ? "MOVE" : "FETCH",
! 				 nprocessed,
! 				 portal->portalPos);
  }
  
  /*


2015-02-09 10:59 GMT+01:00 Marc Balmer m...@msys.ch:

 
  2015-02-09 10:37 GMT+01:00 Marc Balmer m...@msys.ch:
 
  Currently there are FETCH and the (non standard) MOVE commands to
 work
  on cursors.
 
  (I use cursors to display large datasets in a page-wise way, where
 the
  user can move per-page, or, when displaying a single record, per
 record.
   When the user goes back from per-record view to page-view, I have to
  restore the cursor to the position it was on before the user changed
 to
  per-record view.)
 
  I have to manually keep track of the cursor position, but in some
  cases it would definitely be easier to just query the current cursor
  position directly from the database and later use MOVE ABSOLUTE to
  rewind it to that position.  That could be achieved e.g. by a
  hypothetical TELL cursor-name command.  It does, however, not
 exist
  and I have not found an alternative.  Is there a way to query the
   current cursor position at all?  If not, does a TELL command sound
 like
  a good or bad idea?
 
 
  It sounds like good idea.
 
  Do we need a new statement? We can implement returning the position to
  MOVE statement. It returns a delta, but it can returns a absolute
  position too.

 On second thought, a new statement is not needed at all.  As Heikki
 noticed in his reply, it could either be a new function or have MOVE
 return the current position somehow(tm).  Or a new option to MOVE, maybe
 MOVE NOT (don't move the cursor but return its position)?





Re: [HACKERS] pgbench -f and vacuum

2015-02-10 Thread Michael Paquier
On Tue, Feb 10, 2015 at 5:03 PM, Tatsuo Ishii wrote:
 - The documentation misses some markups for pgbench and VACUUM and did
 not respect the 80-character limit.

 I didn't realize that there's such a style guide. Although I think
 it's a good thing, I just want to know where such a guide is
 described.

That's self-learning based on the style of the other pages. I don't
recall if there are actually convention guidelines for the docs, the
only one I know of being the coding convention here:
http://www.postgresql.org/docs/devel/static/source.html
Maybe we could have a section dedicated to that. Thoughts?
-- 
Michael




Re: [HACKERS] pgbench -f and vacuum

2015-02-10 Thread Tatsuo Ishii
 On Tue, Feb 10, 2015 at 5:03 PM, Tatsuo Ishii wrote:
 - The documentation misses some markups for pgbench and VACUUM and did
 not respect the 80-character limit.

 I didn't realize that there's such a style guide. Although I think
 it's a good thing, I just want to know where such a guide is
 described.
 
 That's self-learning based on the style of the other pages. I don't
 recall if there are actually convention guidelines for the docs, the
 only one I know of being the coding convention here:
 http://www.postgresql.org/docs/devel/static/source.html
 Maybe we could have a section dedicated to that. Thoughts?

I think you need to talk to Peter.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] What exactly is our CRC algorithm?

2015-02-10 Thread Heikki Linnakangas

On 02/09/2015 03:20 PM, Abhijit Menon-Sen wrote:

At 2015-02-09 12:52:41 +0200, hlinnakan...@vmware.com wrote:

Do you have access to big-endian hardware to test this on?


Yes, I tested it on a Linux/ppc system. I wasn't speculating when I said
it does make a noticeable difference, though I'm afraid I did not keep
the timings after submitting the revised patch. The speedup was some
double-digit percentage, IIRC.


Ok, that's good enough for me.

Committed with a few more edits on comments and such. I'll start looking 
at the second patch now.


- Heikki





Re: [HACKERS] enabling nestedloop and disabling hashjon

2015-02-10 Thread Ravi Kiran
yes sir, I did try the pg_ctl reload command, but it's still using the hash
join algorithm and not the nested loop algorithm. I even restarted the
server, and even then it's still using the hash join algorithm.

Thanks

On Tue, Feb 10, 2015 at 5:28 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Ravi Kiran ravi.kolanp...@gmail.com writes:
  I tried modifying the postgresql.conf file where I set the value
  enable_hashjoin=off and also enable_mergejoin=off, so that I could force
  postgres to use nested loop.
  but postgres is still using the hash join algorithm even after modifying
  the postgresql code.

 Did you remember pg_ctl reload?  enable_hashjoin=off will most certainly
 work if it's active.

 regards, tom lane



Re: [HACKERS] pg_basebackup may fail to send feedbacks.

2015-02-10 Thread Kyotaro HORIGUCHI
Hello,

The attached patch is a fix for walreceiver that avoids gettimeofday,
and a fix for receivelog that uses it.

  XLogWalRcvProcessMsg doesn't send feedback when processing
  'w'=WAL record packet. So the same thing as pg_basebackup and
  pg_receivexlog will occur on walsender.
 
  Exiting the for(;;) loop via TimestampDifferenceExceeds just
  before line 439 is an equivalent measure to what I proposed for
  receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there
  is seemingly simpler (though I feel a bit uncomfortable about the
  latter).
 
 I'm concerned about the performance degradation caused by calling
 gettimeofday() per record.
 
  Or other measures?
 
 Introduce the maximum number of records that we can receive and
 process between feedbacks? For example, we can change
 XLogWalRcvSendHSFeedback() so that it's called per at least
 8 records. I'm not sure what number is good, though...

At the beginning of the while (len > 0) { if (len > 0) { block,
last_recv_timestamp is always kept up to date (by using
gettimeofday()). So we can use that variable instead of the
gettimeofday() call in my previous proposal.

The start time of the timeout could be last_recv_timestamp at the
beginning of the while loop, since it is a bit earlier than the
actual time at the point. 

Another solution would be to use RegisterTimeout/enable_timeout_after,
and it seemed to work for me. It would let us throw out the time
counting stuff in the loop we are talking about and that of
XLogWalRcvSendHSFeedback and XLogWalRcvSendReply, but it might be a
bit too large a change for the gain.

Considering pg_basebackup/receivexlog, the loop in receivelog.c
does not maintain the time value within it, so I think we are
forced to use feGetCurrentTimeStamp if not using SIGALRM. The wal
reading function simply gets the data from the buffer in memory
for most calls so the gettimeofday for each iteration could slow
the process significantly. SIGALRM seems to be valuable for the
case.
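As a minimal standalone sketch of the SIGALRM approach (plain signal()/alarm()
stands in for the backend's RegisterTimeout/enable_timeout_after, and all names
are illustrative rather than taken from the posted patch): the handler only
sets a flag, and the hot loop tests that flag instead of calling gettimeofday()
for every record.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Set by the timer; tested cheaply once per processed message. */
static volatile sig_atomic_t feedback_due = 0;

static void
feedback_timeout(int signo)
{
    (void) signo;
    feedback_due = 1;           /* nothing unsafe happens in the handler */
}

int
main(void)
{
    signal(SIGALRM, feedback_timeout);
    alarm(2);                   /* stand-in for enable_timeout_after() */

    for (int msg = 0; msg < 5000; msg++)
    {
        usleep(1000);           /* simulate processing one WAL message */

        if (feedback_due)       /* no per-message gettimeofday() needed */
        {
            printf("sending feedback after message %d\n", msg);
            feedback_due = 0;
            alarm(2);           /* re-arm for the next interval */
        }
    }
    return 0;
}

The per-message cost is a single flag test; the clock is only involved when the
timer actually fires.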

Thoughts?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From c9017f7c55de864bb3459f6f927803577e94c5eb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp
Date: Mon, 2 Feb 2015 12:49:45 +0900
Subject: [PATCH] Make sure to send feedback at desired timing.

A continuous stream caused by heavy load on the client side can prevent
feedback from being sent at the expected interval, and that results in a
replication timeout on the walsender. Exit the fast-path loop when the
time comes, to make sure the feedback is sent at the expected intervals.
---
 src/backend/replication/walreceiver.c | 54 +--
 src/bin/pg_basebackup/receivelog.c| 10 +--
 2 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index bfbc02f..d77dc91 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -409,35 +409,45 @@ WalReceiverMain(void)
 if (len != 0)
 {
 	/*
+	 * The feedback interval cannot be longer than
+	 * wal_receiver_status_interval. last_recv_timestamp is a
+	 * bit earlier than the actual time here so this is
+	 * available for the start time of the timeout in the loop
+	 * below.
+	 */
+	TimestampTz last_feedback = last_recv_timestamp;
+
+	/*
 	 * Process the received data, and any subsequent data we
 	 * can read without blocking.
 	 */
-	for (;;)
+	while (len > 0)
 	{
-		if (len > 0)
-		{
-			/*
-			 * Something was received from master, so reset
-			 * timeout
-			 */
-			last_recv_timestamp = GetCurrentTimestamp();
-			ping_sent = false;
-			XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
-		}
-		else if (len == 0)
-			break;
-		else if (len < 0)
-		{
-			ereport(LOG,
-					(errmsg("replication terminated by primary server"),
-					 errdetail("End of WAL reached on timeline %u at %X/%X.",
-							   startpointTLI,
-							   (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
-			endofwal = true;
+		/*
+		 * Something was received from master, so reset
+		 * timeout
+		 */
+		last_recv_timestamp = GetCurrentTimestamp();
+		ping_sent = false;
+		XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
+
+		/* Exit this loop if the time to reply has come */
+		if (TimestampDifferenceExceeds(last_feedback, last_recv_timestamp,
+	  wal_receiver_status_interval * 1000))
 			break;
-		}
+
 		len = walrcv_receive(0, buf);
 	}
+	if (len < 0)
+	{
+		ereport(LOG,
+				(errmsg("replication terminated by primary server"),
+				 errdetail("End of WAL reached on timeline %u at %X/%X.",
+						   startpointTLI,
+						   (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
+		endofwal = true;
+		break;
+	}
 
 	/* Let the master know that we received some data. */
 	

Re: [HACKERS] [pgsql-advocacy] GSoC 2015 - mentors, students and admins.

2015-02-10 Thread Fabrízio de Royes Mello
On Mon, Feb 9, 2015 at 6:52 PM, Thom Brown t...@linux.com wrote:

 Hi all,

 Google Summer of Code 2015 is approaching.  I'm intending on registering
PostgreSQL again this year.

 Before I do that, I'd like to have an idea of how many people are
interested in being either a student or a mentor.


I'm interested in participating as a student again.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello


Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

2015-02-10 Thread Michael Paquier
On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
 - * Wait for more WAL to arrive. Time out after 5 seconds,
 - * like when polling the archive, to react to a trigger
 - * file promptly.
 + * Wait for more WAL to arrive. Time out after the amount of
 + * time specified by wal_retrieve_retry_interval, like
 + * when polling the archive, to react to a trigger file promptly.
    */
   WaitLatch(&XLogCtl->recoveryWakeupLatch,
             WL_LATCH_SET | WL_TIMEOUT,
 -           5000L);
 +           wal_retrieve_retry_interval * 1000L);

 This change can prevent the startup process from reacting to
 a trigger file. Imagine the case where a large interval is set
 and the user wants to promote the standby by using the trigger file
 instead of pg_ctl promote. I think that the sleep time should be 5s
 if the interval is set to more than 5s. Thought?

 I disagree here. In some cases it is interesting to accelerate the check
 of WAL availability from a source for replication, but the opposite is
 true as well, as mentioned by Alexey at the beginning of the thread, to
 reduce the number of requests when fetching WAL archives from external
 storage such as AWS. Hence a correct solution would be to check
 periodically for the trigger file with a maximum one-time wait of 5s to
 ensure backward-compatible behavior. We could reduce it to 1s or
 something like that as well.

 You seem to have misunderstood the code in question. Or I'm missing something.
 The timeout of the WaitLatch is just the interval to check for the trigger 
 file
 while waiting for more WAL to arrive from streaming replication. Not related 
 to
 the retry time to restore WAL from the archive.

[Re-reading the code...]
Aah.. Yes, you are right. Sorry for the noise. Let's wait for a
maximum of 5s then. I also noticed that in the previous patch the wait
was capped at 5s. To begin with, a loop should have been used if it
were a sleep, but as the patch now uses a latch this limit does not
make much sense... An updated patch is attached.
Regards,
-- 
Michael
From 9b7e3bb32c744b328a0d99db3040cadfcba606aa Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 19 Jan 2015 16:08:48 +0900
Subject: [PATCH] Add wal_retrieve_retry_interval

This parameter helps control the timing at which WAL availability is
checked when a node is in recovery, particularly when successive failures
happen while fetching WAL archives, or when fetching WAL records from a
streaming source.

Default value is 5s.
---
 doc/src/sgml/config.sgml  | 17 ++
 src/backend/access/transam/xlog.c | 46 +++
 src/backend/utils/misc/guc.c  | 12 +++
 src/backend/utils/misc/postgresql.conf.sample |  3 ++
 src/include/access/xlog.h |  1 +
 5 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6bcb106..d82b26a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2985,6 +2985,23 @@ include_dir 'conf.d'
   </listitem>
  </varlistentry>
 
+ <varlistentry id="guc-wal-retrieve-retry-interval" xreflabel="wal_retrieve_retry_interval">
+  <term><varname>wal_retrieve_retry_interval</varname> (<type>integer</type>)
+  <indexterm>
+   <primary><varname>wal_retrieve_retry_interval</> configuration parameter</primary>
+  </indexterm>
+  </term>
+  <listitem>
+   <para>
+    Specify the amount of time to wait when WAL is not available from
+    any sources (streaming replication, local <filename>pg_xlog</> or
+    WAL archive) before retrying to retrieve WAL.  This parameter can
+    only be set in the <filename>postgresql.conf</> file or on the
+    server command line.  The default value is 5 seconds.
+   </para>
+  </listitem>
+ </varlistentry>
+
  </variablelist>
 </sect2>
</sect1>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..1f9c3c4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -93,6 +93,7 @@ int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
+int			wal_retrieve_retry_interval = 5000;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -10340,8 +10341,8 @@ static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			bool fetching_ckpt, XLogRecPtr tliRecPtr)
 {
-	static pg_time_t last_fail_time = 0;
-	pg_time_t	now;
+	TimestampTz now = 

Re: [HACKERS] Assertion failure when streaming logical changes

2015-02-10 Thread Andres Freund
On 2015-02-10 22:06:34 +0900, Michael Paquier wrote:
 On Tue, Feb 10, 2015 at 9:46 PM, Andres Freund and...@2ndquadrant.com
 wrote:
 
  Yea, it really looks like the above commit is to blame. The new xmin
  tracking infrastructure doesn't know about the historical snapshot...
 
 
 I think that we need a better regression coverage here... For example, we
 could add some tap tests in test_decoding to test streaming of logical
 changes. This would help in the future to detect such problems via the
 buildfarm.

I agree. It's more or less an accident that the assert - which should
just be moved into the regd_count == 0 branch - didn't trigger for the
SQL interface. The snapshot acquired by the SELECT statement prevents it
there.

It's not entirely trivial to add tests for pg_recvlogical though. You
need to stop it programmatically after a while.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] For cursors, there is FETCH and MOVE, why no TELL?

2015-02-10 Thread Pavel Stehule
2015-02-10 14:32 GMT+01:00 Marc Balmer m...@msys.ch:



 Am 10.02.15 um 09:06 schrieb Pavel Stehule:
  Hi
 
 
  the patch can be very simple:
 
  diff --git a/src/backend/commands/portalcmds.c
  b/src/backend/commands/portalcmds.c
  new file mode 100644
  index 2794537..20b9206
  *** a/src/backend/commands/portalcmds.c
  --- b/src/backend/commands/portalcmds.c
  *** PerformPortalFetch(FetchStmt *stmt,
  *** 181,189 
 
  /* Return command status if wanted */
  if (completionTag)
  !   snprintf(completionTag, COMPLETION_TAG_BUFSIZE, "%s %ld",
      stmt->ismove ? "MOVE" : "FETCH",
  !   nprocessed);
}
 
/*
  --- 181,190 
 
  /* Return command status if wanted */
  if (completionTag)
  !   snprintf(completionTag, COMPLETION_TAG_BUFSIZE, "%s %ld %ld",
      stmt->ismove ? "MOVE" : "FETCH",
  !   nprocessed,
  !   portal->portalPos);
}
 
/*
 

 That is simple indeed.  I tend to think, however, that it would be
 cleaner to return the position as a proper result from a function
 instead of using a side effect from a FETCH/MOVE command.


I have no strong opinion about it

Pavel


 
  2015-02-09 10:59 GMT+01:00 Marc Balmer m...@msys.ch mailto:
 m...@msys.ch:
 
  
   2015-02-09 10:37 GMT+01:00 Marc Balmer m...@msys.ch
  mailto:m...@msys.ch mailto:m...@msys.ch mailto:m...@msys.ch:
  
   Currently there are FETCH and the (non standard) MOVE commands
 to work
   on cursors.
  
   (I use cursors to display large datasets in a page-wise way,
 where the
   user can move per-page, or, when displaying a single record,
 per record.
When the user goes back from per-record view to page-view, I
 have to
   restore the cursor to the position it was on before the user
 changed to
   per-record view.)
  
   I have to manually keep track of the cursor position, but in
 some
   cases it would definitely be easier to just query the current
 cursor
   position directly from the database and later use MOVE
 ABSOLUTE to
   rewind it to that position.  That could be achieved e.g. by a
   hypothetical TELL cursor-name command.  It does, however,
 not exist
   and I have not found an alternative.  Is there a way to query
 the
   current cursor position at all?  If not, does a TELL command
 sound like
   a good or bad idea?
  
  
   It sounds like good idea.
  
   Do we need a new statement? We can implement returning the
 position to
   MOVE statement. It returns a delta, but it can returns a absolute
   position too.
 
  On second thought, a new statement is not needed at all.  As Heikki
  noticed in his reply, it could either be a new function or have MOVE
  return the current position somehow(tm).  Or a new option to MOVE,
  maybe MOVE NOT (don't move the cursor but return its position)?
 
 
 
 



Re: [HACKERS] Parallel Seq Scan

2015-02-10 Thread Robert Haas
On Tue, Feb 10, 2015 at 2:48 AM, Andres Freund and...@2ndquadrant.com wrote:
 Note that I'm not saying that Amit's patch is right - I haven't read it
 - but that I don't think a 'scan this range of pages' heapscan API would
 be a bad idea. Not even just for parallelism, but for a bunch of
 usecases.

We do have that, already.  heap_setscanlimits().  I'm just not
convinced that that's the right way to split up a parallel scan.
There's too much risk of ending up with a very-uneven distribution of
work.

 Regarding tuple flow between backends, I've thought about that before,
 I agree that we need it, and I don't think I know how to do it.  I can
 see how to have a group of processes executing a single node in
 parallel, or a single process executing a group of nodes we break off
 from the query tree and push down to it, but what you're talking about
 here is a group of processes executing a group of nodes jointly.

 I don't think it really is that. I think you'd do it essentially by
 introducing a couple more nodes. Something like

             SomeUpperLayerNode
                     |
                     |
              AggCombinerNode
              /      |      \
             /       |       \
            /        |        \
  PartialHashAggNode   PartialHashAggNode ... PartialHashAggNode ...
           |                    |
           |                    |
           |                    |
           |                    |
    PartialSeqScan        PartialSeqScan

 The only thing that'd potentially might need to end up working jointly
 jointly would be the block selection of the individual PartialSeqScans
 to avoid having to wait for stragglers for too long. E.g. each might
 just ask for a range of a 16 megabytes or so that it scans sequentially.

 In such a plan - a pretty sensible and not that uncommon thing for
 parallelized aggregates - you'd need to be able to tell the heap scans
 which blocks to scan. Right?

For this case, what I would imagine is that there is one parallel heap
scan, and each PartialSeqScan attaches to it.  The executor says give
me a tuple and heapam.c provides one.  Details like the chunk size
are managed down inside heapam.c, and the executor does not know about
them.  It just knows that it can establish a parallel scan and then
pull tuples from it.
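As a rough standalone sketch of that division of labour (a pthread mutex and
ordinary memory stand in for a spinlock and dynamic shared memory, and every
name here is hypothetical, not taken from the posted patch): the
executor-visible surface is only "attach" and "next block", while the handout
policy stays private to the heapam side.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* One of these per parallel scan; in shared memory in the real thing. */
typedef struct SharedHeapScan
{
    BlockNumber     nblocks;        /* size of the relation being scanned */
    BlockNumber     next_block;     /* next block nobody has claimed yet */
    pthread_mutex_t mutex;          /* stands in for a spinlock */
} SharedHeapScan;

/* Per-worker state: each PartialSeqScan keeps one of these. */
typedef struct WorkerHeapScan
{
    SharedHeapScan *shared;
} WorkerHeapScan;

static void
shared_scan_init(SharedHeapScan *s, BlockNumber nblocks)
{
    s->nblocks = nblocks;
    s->next_block = 0;
    pthread_mutex_init(&s->mutex, NULL);
}

/* Each PartialSeqScan attaches to the one shared scan. */
static void
worker_scan_attach(WorkerHeapScan *w, SharedHeapScan *s)
{
    w->shared = s;
}

/* The detail the executor never sees: hand out the next unclaimed block. */
static BlockNumber
worker_scan_next_block(WorkerHeapScan *w)
{
    SharedHeapScan *s = w->shared;
    BlockNumber blk;

    pthread_mutex_lock(&s->mutex);
    blk = (s->next_block < s->nblocks) ? s->next_block++ : InvalidBlockNumber;
    pthread_mutex_unlock(&s->mutex);
    return blk;
}

int
main(void)
{
    SharedHeapScan shared;
    WorkerHeapScan worker;
    BlockNumber blk;

    shared_scan_init(&shared, 8);
    worker_scan_attach(&worker, &shared);   /* every worker would do this */
    while ((blk = worker_scan_next_block(&worker)) != InvalidBlockNumber)
        printf("worker scans block %u and feeds its tuples upward\n", blk);
    return 0;
}

Chunk size, readahead and similar policies could later change inside
worker_scan_next_block() without the executor layer noticing.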

 Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
 scan, mostly, I would think) and can-absorb-parallel-tuple-streams
 (sort, hash, materialize), or something like that, but I'm really
 fuzzy on the details.

 I don't think we really should have individual nodes that produce
 multiple streams - that seems like it'd end up being really
 complicated. I'd more say that we have distinct nodes (like the
 PartialSeqScan ones above) that do a teensy bit of coordination about
 which work to perform.

I think we're in violent agreement here, except for some
terminological confusion.  Are there N PartialSeqScan nodes, one
running in each node, or is there one ParallelSeqScan node, which is
copied and run jointly across N nodes?  You can talk about either way
and have it make sense, but we haven't had enough conversations about
this on this list to have settled on a consistent set of vocabulary
yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fetch zero result rows when executing a query?

2015-02-10 Thread Robert Haas
On Sun, Feb 8, 2015 at 3:56 AM, Shay Rojansky r...@roji.org wrote:
 Just to be precise: what is strange to me is that the max_rows feature
 exists
 but has no 0 value. You and Marko are arguing that the whole feature should
 be
 deprecated (i.e. always return all rows).

I think the fact that it has no zero value is probably just a
historical accident; most likely, whoever designed it originally
(probably twenty years ago) didn't think about queries with
side-effects and therefore didn't consider that wanting 0 rows would
ever be sensible.  Meanwhile, a sentinel value was needed to request
all rows, so they used 0.  If they'd thought of it, they might have
picked -1 and we'd not be having this discussion.

FWIW, I'm in complete agreement that it would be good if we had this
feature.  I believe this is not the first report we've had of
PostgreSQL doing things in ways that don't mesh nicely with standardized
driver interfaces.  Whether we think those interfaces are
well-designed or not, they are standardized.  When people use $OTHERDB
and have a really great driver, and then they move to PostgreSQL and
get one with more warts, it does not encourage them to stick with
PostgreSQL.

.NET is not some fringe user community that we can dismiss as
irrelevant.  We need users of all languages to want to use PostgreSQL,
not just users of languages any one of us happens to personally like.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Odd behavior of updatable security barrier views on foreign tables

2015-02-10 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
 On 9 February 2015 at 21:17, Stephen Frost sfr...@snowman.net wrote:
   On Fri, Jan 30, 2015 at 5:20 AM, Etsuro Fujita
I noticed that when updating security barrier views on foreign tables,
we fail to give FOR UPDATE to selection queries issued at ForeignScan.
 
  I've looked into this a fair bit more over the weekend and the issue
  appears to be that the FDW isn't expecting a do-instead sub-query.
  I've been considering how we might be able to address that but haven't
  come up with any particularly great ideas and would welcome any
  suggestions.  Simply having the FDW try to go up through the query would
  likely end up with too many queries showing up with 'for update'.  We
  add the 'for update' to the sub-query before we even get called from
  the 'Modify' path too, which means we can't use that to realize when
  we're getting ready to modify rows and therefore need to lock them.
 
  In any case, I'll continue to look but would welcome any other thoughts.
 
 Sorry, I didn't have time to look at this properly. My initial thought
 is that expand_security_qual() needs to request a lock on rows coming
 from the relation it pushes down into a subquery if that relation was
 the result relation, because otherwise it won't have any locks, since
 preprocess_rowmarks() only adds PlanRowMarks to non-target relations.

Yes, that works.  I had been focused on trying to figure out a way to
make this work just in the FDW, but you're right, fixing it in
expand_security_qual() looks like the right approach.

 Of course that means that it may end up locking more rows than are
 actually updated, but that's essentially the same as a SELECT FOR
 UPDATE on a s.b. view right now.

Agreed.

Thanks!

Stephen




Re: [HACKERS] Better error message on pg_upgrade checksum mismatches

2015-02-10 Thread Bruce Momjian
On Tue, Feb 10, 2015 at 10:55:07AM -0500, Greg Sabino Mullane wrote:
 Just a little thing that's been bugging me. If one side of the 
 pg_upgrade has checksums and the other does not, give a less 
 cryptic error message.

OK, sure.  Good fix, will apply.  This seems to be the day for pg_upgrade
fixes/improvements.  :-)

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Odd behavior of updatable security barrier views on foreign tables

2015-02-10 Thread Stephen Frost
Etsuro,

* Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote:
 On 2015/02/10 7:23, Dean Rasheed wrote:
 Sorry, I didn't have time to look at this properly. My initial thought
 is that expand_security_qual() needs to request a lock on rows coming
 from the relation it pushes down into a subquery if that relation was
 the result relation, because otherwise it won't have any locks, since
 preprocess_rowmarks() only adds PlanRowMarks to non-target relations.
 
 That seems close to what I had in mind; expand_security_qual() needs
 to request a FOR UPDATE lock on rows coming from the relation it
 pushes down into a subquery only when that relation is the result
  relation and a *foreign table*.

I had been trying to work out an FDW-specific way to address this, but I
think Dean's right that this should be addressed in
expand_security_qual(), which means it'll apply to all cases and not
just these FDW calls.  I don't think that's actually an issue though and
it'll match up to how SELECT FOR UPDATE is handled today.

Thanks!

Stephen




Re: [HACKERS] Corner case for add_path_precheck

2015-02-10 Thread Tom Lane
Antonin Houska a...@cybertec.at writes:
 The special case is that the path passed to add_path_precheck() has costs
 *equal to* those of the old_path. If pathkeys, outer rels and costs are the
 same, as summarized in the comment above, I expect add_path_precheck() to
 return false. Do I misread anything?

It does, so I don't see your point?

regards, tom lane




Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory

2015-02-10 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
 I've now taken this idea as far as building the required infrastructure
 and revamping a couple of array operators to use it.  There's a lot yet
 to do, but I've done enough to get some preliminary ideas about
 performance (see below).

Nice!

 * Although I said above that everything owned by a deserialized object
 has to live in a single memory context, I do have ideas about relaxing
 that.  The core idea would be to invent a memory context reset/delete
 callback feature in mcxt.c.  Then a deserialized object could register
 such a callback on its own memory context, and use the callback to clean
 up resources outside its context.  This is potentially useful for instance
 for something like PostGIS, where an object likely includes some data that
 was allocated with malloc not palloc because it was created by library
 functions that aren't Postgres-aware.  Another likely use-case is for
 deserialized objects representing composite types to maintain reference
 counts on their tuple descriptors instead of having to copy said
 descriptors into their private contexts.  This'd be material for a
 separate patch though.

Being able to register a callback to be used on deletion of the context
would certainly be very nice and strikes me as pretty independent of the
rest of this.  You've probably thought of this already, but registering
the callback should probably allow the caller to pass in a pointer to be
passed back to the callback function when the delete happens, so that
there's a place for the metadata to be stored about what the callback
function needs to clean up when it's called.
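A standalone sketch of that shape, including the caller-supplied argument
pointer, using toy names rather than anything from mcxt.c: the context keeps a
list of (function, arg) pairs and runs them before its own memory would be
released.

#include <stdio.h>
#include <stdlib.h>

typedef void (*ContextCallbackFunc) (void *arg);

typedef struct ContextCallback
{
    ContextCallbackFunc func;
    void               *arg;    /* handed back verbatim at reset/delete */
    struct ContextCallback *next;
} ContextCallback;

typedef struct ToyMemoryContext
{
    ContextCallback *callbacks;
} ToyMemoryContext;

static void
register_context_callback(ToyMemoryContext *cxt,
                          ContextCallbackFunc func, void *arg)
{
    ContextCallback *cb = malloc(sizeof(ContextCallback));

    cb->func = func;
    cb->arg = arg;
    cb->next = cxt->callbacks;
    cxt->callbacks = cb;
}

static void
delete_context(ToyMemoryContext *cxt)
{
    /* Run callbacks first so they can release out-of-context resources. */
    ContextCallback *cb = cxt->callbacks;

    while (cb != NULL)
    {
        ContextCallback *next = cb->next;

        cb->func(cb->arg);
        free(cb);
        cb = next;
    }
    cxt->callbacks = NULL;
    /* ... the context's own memory would be freed after this point ... */
}

static void
free_external_blob(void *arg)
{
    printf("releasing external resource %p\n", arg);
    free(arg);                  /* e.g. something a library malloc'd */
}

int
main(void)
{
    ToyMemoryContext cxt = {NULL};
    void *blob = malloc(128);

    register_context_callback(&cxt, free_external_blob, blob);
    delete_context(&cxt);
    return 0;
}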

 So that's the plan, and attached is a very-much-WIP patch that uses this
 approach to speed up plpgsql array element assignments (and not a whole
 lot else as yet).  Here's the basic test case I've been using:

I've not looked at the code at all as yet, but it makes sense to me.

 With the attached patch, those timings drop to 80 and 150 ms respectively.

And those numbers are pretty fantastic and would address an area we
regularly get dinged on.

 Still, if the worst-case slowdown is around 20% on trivially-sized arrays,
 I'd gladly take that to have better performance on larger arrays.  And I
 think this example is close to the worst case for the patch's approach,
 since it's testing small, fixed-element-length, no-nulls arrays, which is
 what the existing code can handle without spending a lot of cycles.

Agreed.

 BTW, I'm not all that thrilled with the deserialized object terminology.
 I found myself repeatedly tripping up on which form was serialized and
 which de-.  If anyone's got a better naming idea I'm willing to adopt it.

Unfortunately, nothing comes to mind.  Serialization is, at least, a
pretty well understood concept and so the naming will likely make sense
to newcomers, even if it's difficult to keep track of which is
serialized and which is deserialized.

 I'm not sure exactly how to push this forward.  I would not want to
 commit it without converting a significant number of array functions to
 understand about deserialized inputs, and by the time I've finished that
 work it's likely to be too late for 9.5.  OTOH I'm sure that the PostGIS
 folk would love to have this infrastructure in 9.5 not 9.6 so they could
 make a start on fixing their issues.  (Further down the pike, I'd plan to
 look at adapting composite-type operations, JSONB, etc, to make use of
 this approach, but that certainly isn't happening for 9.5.)
 
 Thoughts, advice, better ideas?

I'm not really a big fan of putting an infrastructure out there for
modules to use that we don't use ourselves (particularly when it's clear
that there are places where we could/should be).  On the other hand,
this doesn't impact on-disk format and therefore I'm less worried that
we'll end up with a release-critical issue when we're getting ready to
put 9.5 out there.

So, I'm on the fence about it.  I'd love to see all of this in 9.5 with
the array functions converted, but I don't think it'd be horrible if
only a subset had been done in time for 9.5.  The others aren't going to
go anywhere and will still work.  I do think it'd be better to have at
least some core users of this new infrastructure rather than just
putting it out there for modules to use but I agree it'd be a bit grotty
to have only some of the array functions converted.

Thanks!

Stephen




Re: [HACKERS] Parallel Seq Scan

2015-02-10 Thread Andres Freund
On 2015-02-10 09:23:02 -0500, Robert Haas wrote:
 On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund and...@2ndquadrant.com wrote:
  And good chunk sizes et al depend on higher layers,
  selectivity estimates and such. And that's planner/executor work, not
  the physical layer (which heapam.c pretty much is).
 
 If it's true that a good chunk size depends on the higher layers, then
 that would be a good argument for doing this differently, or at least
 exposing an API for the higher layers to tell heapam.c what chunk size
 they want.  I hadn't considered that possibility - can you elaborate
 on why you think we might want to vary the chunk size?

Because things like chunk size depend on the shape of the entire
plan. If you have a 1TB table and want to sequentially scan it in
parallel with 10 workers you better use some rather large chunks. That
way readahead will be efficient in a cpu/socket local manner,
i.e. directly reading in the pages into the directly connected memory of
that cpu. Important for performance on a NUMA system, otherwise you'll
constantly have everything go over the shared bus.  But if you instead
have a plan where the sequential scan goes over a 1GB table, perhaps
with some relatively expensive filters, you'll really want a small
chunks size to avoid waiting.  The chunk size will also really depend on
what other nodes are doing, at least if they can run in the same worker.

Even without things like NUMA and readahead I'm pretty sure that you'll
want a chunk size a good bit above one page. The locks we acquire for
the buffercache lookup and for reading the page are already quite bad
for performance/scalability; even if we don't always/often hit the same
lock. Making 20 processes that scan pages in parallel acquire yet a
another lock (that's shared between all of them!) for every single page
won't be fun, especially with no filters or only fast ones.

  For this case, what I would imagine is that there is one parallel heap
  scan, and each PartialSeqScan attaches to it.  The executor says give
  me a tuple and heapam.c provides one.  Details like the chunk size
  are managed down inside heapam.c, and the executor does not know about
  them.  It just knows that it can establish a parallel scan and then
  pull tuples from it.
 
  I think that's a horrible approach that'll end up with far more
  entangled pieces than what you're trying to avoid. Unless the tuple flow
  is organized to only happen in the necessary cases the performance will
  be horrible.
 
 I can't understand this at all.  A parallel heap scan, as I've coded
 it up, involves no tuple flow at all.  All that's happening at the
 heapam.c layer is that we're coordinating which blocks to scan.  Not
 to be disrespectful, but have you actually looked at the patch?

No, and I said so upthread. I started commenting because you argued that
architecturally parallelism belongs in heapam.c instead of upper layers,
and I can't agree with that.  I now have, and it looks less bad than I
had assumed, sorry.

Unfortunately I still think it's wrong approach, also sorry.

As pointed out above (moved there after reading the patch...) I don't
think a chunk size of 1 or any other constant size can make sense. I
don't even believe it'll necessarily be constant across an entire query
execution (big initially, small at the end).  Now, we could move
determining that before the query execution into executor
initialization, but then we won't yet know how many workers we're going
to get. We could add a function setting that at runtime, but that'd mix
up responsibilities quite a bit.

I also can't agree with having a static snapshot in shared memory put
there by the initialization function. For one it's quite awkward to end
up with several equivalent snapshots at various places in shared
memory. Right now the entire query execution can share one snapshot,
this way we'd end up with several of them.  Imo for actual parallel
query execution the plan should be shared once and then be reused for
everything done in the name of the query.

Without the need to do that, you end up pretty much only with setup
infrastructure so that heap_parallelscan_nextpage is called. How about
instead renaming heap_beginscan_internal() to _extended and offering an
option to provide a callback + state that determines the next page?
Additionally provide some separate functions managing a simple
implementation of such a callback + state?

Btw, using an atomic uint32 you'd end up without the spinlock and just
about the same amount of code... Just do an atomic_fetch_add_until32(var,
1, InvalidBlockNumber)... ;)
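For illustration, a standalone sketch of that lock-free variant, which is also
one concrete shape for the next-page "callback + state" mentioned above;
atomic_fetch_add_until32() is hypothetical, so plain C11 atomics plus a bounds
check stand in for it, and all names are made up.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct NextPageState
{
    BlockNumber          nblocks;       /* relation size in blocks */
    atomic_uint_fast32_t next_block;    /* shared claim counter */
} NextPageState;

/* The callback deciding the next page for whichever worker asks. */
static BlockNumber
scan_next_block(NextPageState *state)
{
    BlockNumber blk = (BlockNumber) atomic_fetch_add(&state->next_block, 1);

    /* Past the end means nothing left; a real version would clamp the
     * counter so it cannot keep growing without bound. */
    return (blk < state->nblocks) ? blk : InvalidBlockNumber;
}

int
main(void)
{
    NextPageState state = {.nblocks = 4};
    BlockNumber blk;

    atomic_init(&state.next_block, 0);
    while ((blk = scan_next_block(&state)) != InvalidBlockNumber)
        printf("claimed block %u\n", blk);
    return 0;
}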

  I think we're in violent agreement here, except for some
  terminological confusion.  Are there N PartialSeqScan nodes, one
  running in each node, or is there one ParallelSeqScan node, which is
  copied and run jointly across N nodes?  You can talk about either way
  and have it make sense, but we haven't had enough conversations about
  this on this list to have settled on a 

Re: [HACKERS] RangeType internal use

2015-02-10 Thread David Fetter
On Mon, Feb 09, 2015 at 12:37:05PM -0500, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Mon, Feb 9, 2015 at 10:36 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  It's going to be complicated and probably buggy, and I think it is heading
  in the wrong direction altogether.  If you want to partition in some
  arbitrary complicated fashion that the system can't reason about very
  effectively, we *already have that*.  IMO the entire point of building
  a new partitioning infrastructure is to build something simple, reliable,
  and a whole lot faster than what you can get from inheritance
  relationships.  And faster is going to come mainly from making the
  partitioning rules as simple as possible, not as complex as possible.
 
  Yeah, but people expect to be able to partition on ranges that are not
  all of equal width.  I think any proposal that we shouldn't support
  that is the kiss of death for a feature like this - it will be so
  restricted as to eliminate 75% of the use cases.
 
 Well, that's debatable IMO (especially your claim that variable-size
 partitions would be needed by a majority of users).

It's ubiquitous.

Time range partition sets almost always have some sets with finite
range and at least one range with infinity in it: current end to
infinity, and somewhat less frequently in my experience, -infinity to
some arbitrary start.

 But in any case, partitioning behavior that is emergent from a bunch
 of independent pieces of information scattered among N tables seems
 absolutely untenable from where I sit.

Agreed.

 Whatever we support, the behavior needs to be described by *one*
 chunk of information --- a sorted list of bin bounding values,
 perhaps.

This would work for some interesting generalization of sorted.

Maybe going back to the mathematical definition of partition could
bear more fruit.  All that's needed for that is an equivalence
relation, however it's denoted.  In practical terms, you'd need to
have a quick way to enumerate the equivalence classes and another to
establish whether equivalence holds.
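Purely as an illustration of the single sorted-list-of-bounds idea upthread,
which also covers the open-ended ranges mentioned here: n bounds define n+1
equivalence classes, the first and last absorb -infinity and +infinity, and
class membership is one binary search.

#include <stdio.h>

/*
 * Bounds b[0..n-1] define n+1 partitions:
 *   partition 0: key <  b[0]          (covers -infinity)
 *   partition i: b[i-1] <= key < b[i]
 *   partition n: key >= b[n-1]        (covers +infinity)
 */
static int
partition_for_key(const int *bounds, int nbounds, int key)
{
    int lo = 0, hi = nbounds;

    while (lo < hi)
    {
        int mid = (lo + hi) / 2;

        if (key < bounds[mid])
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;                  /* 0 .. nbounds */
}

int
main(void)
{
    int bounds[] = {2010, 2012, 2014};      /* e.g. uneven yearly ranges */

    printf("%d %d %d\n",
           partition_for_key(bounds, 3, 1999),   /* 0: -infinity bin */
           partition_for_key(bounds, 3, 2013),   /* 2 */
           partition_for_key(bounds, 3, 2020));  /* 3: +infinity bin */
    return 0;
}

However the classes are denoted at the SQL level, the point is that one flat
structure answers both "enumerate the classes" and "which class does this key
fall in".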

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0

2015-02-10 Thread Peter Geoghegan
On Tue, Feb 10, 2015 at 12:04 AM, Andres Freund and...@2ndquadrant.com wrote:
 FWIW, I don't think anything here really should refer to the wiki...

The Wiki pages have done a good job of summarizing things...it
certainly didn't seem that hard for you to get up to speed here. Also,
as you'll know from working on logical decoding, it's hard to know
what is complex and esoteric and what is relatively accessible when
you're this close to a big project. I recall that you said as much
before. I'm focused on signposting so that reviewers can follow what
each patch does with the minimum of effort (with reference to the Wiki
or whatever). I see no reason to not do things that way until
commit...it feels like there is less chance of the concepts going over
people's head this way.

 I don't really think there actually is that much common inbetween
 those. It looks to me that most of the code in heap_delete isn't
 actually relevant for this case and could be cut short. My guess is that
 only the WAL logging would be separated out.

I'll think about that some more.

  * I doubt logical decoding works with the patch as it stands.

 I thought so. Perhaps you could suggest a better use of the available
 XLOG_HEAP_* bits. I knew I needed to consider that more carefully
 (hence the XXX comment), but didn't get around to it.

 I think you probably need to add test macros that make sure only the
 individual bits are sets, and not the combination and then only use those.

This too.

  * /*
 * This may occur when an instantaneously invisible tuple is blamed
 * as a conflict because multiple rows are inserted with the same
 * constrained values.
 How can this happen? We don't insert multiple rows with the same
 command id?

 This is a cardinality violation [1]. It can definitely happen - just
 try the examples you see on the Wiki.

 I don't care about the wiki from the point of code comments. This needs
 to be understandable in five years time.

That wasn't clear before - you seemed to me to be questioning if this
was even possible. There is a section in the INSERT SQL reference page
about cardinality violations, so we're certainly talking about
something that a code reader likely heard of. Also, the nitty gritty
showing various scenarios on the Wiki is the quickest way to know what
is possible (but is much too long winded for user visible
documentation or code comments).

  * Perhaps it has previously been discussed but I'm not convinced by the
reasoning for not looking at opclasses in infer_unique_index(). This
seems like it'd prohibit ever having e.g. case insensitive opclasses -
something surely worthwhile.

 I don't think anyone gave that idea the thumbs-up. However, I really
 don't see the problem. Sure, we could have case insensitive opclasses
 in the future, and you may want to make a unique index using one.

 Then the problem suddenly becomes that previous choices of
 indexes/statements aren't possible anymore. It seems much better to
 introduce the syntax now and not have too much of a usecase for
 it.

The only way the lack of a way of specifying which opclass to use
could bite us is if there were two *actually* defined unique indexes
on the same column, each with different equality operators. That
seems like kind of a funny scenario, even if that were quite possible
(even if non-default opclasses existed that had a non-identical
equality operators, which is basically not the case today).

I grant that is a bit odd that we're talking about unique indexes
definitions affecting semantics, but that is to a certain extent the
nature of the beast. As a compromise, I suggest having the inference
specification optionally accept a named opclass per attribute, in the
style of CREATE INDEX (I'm already reusing a bit of the raw parser
support for CREATE INDEX, you see) - that'll make inference insist on
that opclass, rather than make it a strict matter of costing available
alternatives (not that any alternative is expected with idiomatic
usage). That implies no additional parser overhead, really. If that's
considered ugly, then at least it's an ugly thing that literally no
one will ever use in the foreseeable future...and an ugly thing that
is no more necessary in CREATE INDEX than here (and yet CREATE INDEX
lives with the ugliness).

 It's really not about me understanding it right now, but about longer term.

Sure.

 Can you add a UPSERT test for logical decoding? I doubt it'll work right
 now, even in the repost.

Okay.

  * /*
  * Immediately VACUUM super-deleted tuples
  */
  if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple),
  InvalidTransactionId))
  return HEAPTUPLE_DEAD;
Is that branch really needed? Shouldn't it just be happening as a
consequence of the already existing code? Same in SatisfiesMVCC. If
you actually needed that block, it'd need to be done in SatisfiesSelf
as well, no? You have a comment about a possible loop - but that seems
wrong to me, 

[HACKERS] Manipulating complex types as non-contiguous structures in-memory

2015-02-10 Thread Tom Lane
I've been fooling around with a design to support computation-oriented,
not-necessarily-contiguous-blobs representations of datatypes in memory,
along the lines I mentioned here:
http://www.postgresql.org/message-id/2355.1382710...@sss.pgh.pa.us

In particular this is meant to reduce the overhead for repeated operations
on arrays, records, etc.  We've had several previous discussions about
that, and even some single-purpose patches such as in this thread:
http://www.postgresql.org/message-id/flat/cafj8prakudu_0md-dg6ftk0wsupvmlyrv1pb+hyc+gubzz3...@mail.gmail.com
There was also a thread discussing how this sort of thing could be
useful to PostGIS:
http://www.postgresql.org/message-id/526a61fb.1050...@oslandia.com
and it's been discussed a few other times too, but I'm too lazy to
search the archives any further.

I've now taken this idea as far as building the required infrastructure
and revamping a couple of array operators to use it.  There's a lot yet
to do, but I've done enough to get some preliminary ideas about
performance (see below).

The core ideas of this patch are:

* Invent a new TOAST datum type pointer to deserialized object, which
is physically just like the existing indirect-toast-pointer concept, but
it has a different va_tag code and somewhat different semantics.

* A deserialized object has a standard header (which is what the toast
pointers point to) and typically will have additional data-type-specific
fields after that.  One component of the standard header is a pointer to
a set of method functions that provide ways to accomplish standard
data-type-independent operations on the deserialized object.

* Another standard header component is a MemoryContext identifier: the
header, as well as all subsidiary data belonging to the deserialized
object, must live in this context.  (Well, I guess there could also be
child contexts.)  By exposing an explicit context identifier, we can
accomplish tasks like move this object into another context by
reparenting the object's context rather than physically copying anything.

* The only standard methods I've found a need for so far are functions
to re-serialize the object, that is generate a plain varlena value that is
semantically equivalent.  To avoid extra copying, this is split into
separate compute the space needed and serialize into this memory
steps, so that the result can be dropped exactly where the caller needs
it.
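A rough declarations-only sketch of what such a standard header might look
like; every type and field name here is illustrative, not the actual patch.

#include <stddef.h>

typedef struct MemoryContextData *MemoryContext;   /* opaque stand-in */

struct DeserializedObjectHeader;

typedef struct DeserializedObjectMethods
{
    /* how much space the equivalent flat (plain varlena) value needs */
    size_t (*flat_size) (struct DeserializedObjectHeader *obj);
    /* write that flat value into caller-supplied memory of that size */
    void   (*serialize_into) (struct DeserializedObjectHeader *obj,
                              void *result, size_t allocated_size);
} DeserializedObjectMethods;

typedef struct DeserializedObjectHeader
{
    const DeserializedObjectMethods *methods;
    MemoryContext   obj_context;    /* everything the object owns lives here */
    /* the embedded "primary" toast pointer and the data-type-specific
     * fields (e.g. an array's per-element pointers) would follow */
} DeserializedObjectHeader;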

* Currently, a deserialized object will be reserialized in that way
whenever we incorporate it into a physical tuple (ie, heap_form_tuple
or index_form_tuple), or whenever somebody applies datumCopy() to it.
I'd like to relax this later, but there's an awful lot of code that
supposes that heap_form_tuple or datumCopy will produce a self-contained
value that survives beyond, eg, destruction of the memory context that
contained the source Datums.  We can get good speedups in a lot of
interesting cases without solving that problem, so I don't feel too bad
about leaving it as a future project.

* In particular, things like PG_GETARG_ARRAYTYPE_P() treat a deserialized
toast pointer as something to be detoasted, and will produce a palloc'd
re-serialized value.  This means that we do not need to convert all the
C functions concerned with a given datatype at the same time (or indeed
ever); a function that hasn't been upgraded will build a re-serialized
representation and then operate on that.  We'll invent alternate
argument-fetching functions that skip the reserialization step, for use
by functions that have been upgraded to handle either case.  This is
basically the same approach we used when we introduced short varlena
headers, and that seems to have gone smoothly enough.

* There's a concept that a deserialized object has a primary toast
pointer, which is physically part of the object, as well as secondary
toast pointers which might or might not be part of the object.  If you
have a Datum pointer to the primary toast pointer then you are authorized
to modify the object in-place; if you have a Datum pointer to a secondary
toast pointer then you must treat it as read-only (ie, you have to make a
copy if you're going to change it).  Functions that construct a new
deserialized object always return its primary toast pointer; this allows a
nest of functions to modify an object in-place without copying, which was
the primary need that the PostGIS folks expressed.  On the other hand,
plpgsql can hand out secondary toast pointers to deserialized objects
stored in plpgsql function variables, thus ensuring that the objects won't
be modified unexpectedly, while never having to physically copy them if
the called functions just need to inspect them.

* Primary and secondary pointers are physically identical, but the
primary pointer resides in a specific spot in the deserialized object's
standard header.  (So you can tell if you've got the primary pointer via
a simple address comparison.)

* I've modified the array element assignment path in 

Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory

2015-02-10 Thread Jim Nasby

On 2/10/15 5:19 PM, Tom Lane wrote:

Jim Nasby jim.na...@bluetreble.com writes:

Without having read the patch, I think this is great. I've been wishing
for something like this while working on my variant data type.



Are there any cases where we would want to use this on a non-variant?
Perhaps types where we're paying an alignment penalty?


What do you mean by non-variant?


Ugh, sorry, brainfart. I meant to say non-varlena.

I can't think of any non-varlena types we'd want this for, but maybe 
someone else can think of a case. If there is a use-case I wouldn't 
handle it with this patch, but we'd want to consider it...

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Jim Nasby

On 2/10/15 6:32 PM, Peter Geoghegan wrote:

On Tue, Feb 10, 2015 at 4:21 PM, Robert Haas robertmh...@gmail.com wrote:

Although the patch was described as relatively easy to write, it never
went anywhere, because it *replaced* MD5 authentication with bcrypt,
which would be a big problem for existing clients.  It seems clear
that we should add something new and not immediately kill off what
we've already got, so that people can transition smoothly.  An idea I
just had today is to keep using basically the same system that we are
currently using for MD5, but with a stronger hash algorithm, like
SHA-1 or SHA-2 (which includes SHA-224, SHA-256, SHA-384, and
SHA-512).  Those are slower, but my guess is that even SHA-512 is not
enough slower for anybody to care very much, and if they do, well
that's another reason to make use of the new stuff optional.


I believe that a big advantage of bcrypt for authentication is the
relatively high memory requirements. This frustrates GPU based
attacks.


This is especially critical if things like bitcoin ASIC rigs could be 
put to use generating generic SHA256 hashes. A few grand will buy you 
hardware that can generate trillions of hash values per second. AFAIK 
there's no specialized hardware for scrypt though (even though it's used 
for other cryptocurrencies), in part because it's not cost effective to 
put enough memory in place.
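For concreteness, a standalone sketch of the swap-the-digest idea quoted above:
today's verifier is the MD5 of the password concatenated with the role name,
and this just computes the same construction with SHA-256 through OpenSSL's EVP
interface. The "sha256" prefix and any wire-protocol details are assumptions
made for illustration, not something settled on this thread, and the GPU/ASIC
concern above applies to any fast general-purpose hash like this.

#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>

/* Hex-encode the SHA-256 of the input into out (64 hex chars plus NUL). */
static void
sha256_hex(const unsigned char *in, size_t inlen, char *out)
{
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int  mdlen;

    EVP_Digest(in, inlen, md, &mdlen, EVP_sha256(), NULL);
    for (unsigned int i = 0; i < mdlen; i++)
        sprintf(out + 2 * i, "%02x", md[i]);
    out[2 * mdlen] = '\0';
}

int
main(void)
{
    const char *password = "secret";
    const char *role = "alice";
    unsigned char buf[256];
    char stored[8 + 2 * EVP_MAX_MD_SIZE] = "sha256";

    /* What the server would store: sha256(password || rolename), hex. */
    snprintf((char *) buf, sizeof(buf), "%s%s", password, role);
    sha256_hex(buf, strlen((char *) buf), stored + strlen("sha256"));
    printf("stored verifier: %s\n", stored);
    return 0;
}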

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords

2015-02-10 Thread Jim Nasby

On 2/5/15 10:48 AM, Tom Lane wrote:

Stephen Frost sfr...@snowman.net writes:

* Robert Haas (robertmh...@gmail.com) wrote:

On Thu, Feb 5, 2015 at 10:48 AM, Stephen Frostsfr...@snowman.net  wrote:

And I thought this was about FDW options and not about dblink, really..

The OP is pretty clearly asking about dblink.

I was just pointing out that it was an issue that all FDWs suffer from,
since we don't have any way for an FDW to say don't show this option,
as discussed.

The dblink example is entirely uncompelling, given that as you said
somebody with access to a dblink connection could execute ALTER USER on
the far end.


Actually, you can eliminate that by not granting direct access to dblink 
functions. Instead you create a SECURITY DEFINER function that sanity 
checks the SQL you're trying to run and rejects things like ALTER USER. 
While you're doing that, you can also lock away the connection 
information. A former coworker actually built a system that does this, 
at least to a limited degree.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




[HACKERS] Show the LSN in rm_redo_error_callback

2015-02-10 Thread Andres Freund
Hi,

I've repeatedly stared at logs containing PANICs during WAL
replay. Unfortunately it's rather hard to find out which record actually
caused the problem as we print the record's contents but not its LSN.

I think it's a no-brainer to add that to master. But I'd even argue that
it'd be a good idea to add it to the backbranches - there've been a
significant number of bugs causing PANICs due to replay errors in the
past (and we might be hunting one more of them right now).

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory

2015-02-10 Thread Jim Nasby
Without having read the patch, I think this is great. I've been wishing 
for something like this while working on my variant data type.


Are there any cases where we would want to use this on a non-variant? 
Perhaps types where we're paying an alignment penalty?


On 2/10/15 3:00 PM, Stephen Frost wrote:

BTW, I'm not all that thrilled with the deserialized object terminology.
I found myself repeatedly tripping up on which form was serialized and
which de-.  If anyone's got a better naming idea I'm willing to adopt it.

Unfortunately, nothing comes to mind.  Serialization is, at least, a
pretty well understood concept and so the naming will likely make sense
to newcomers, even if it's difficult to keep track of which is
serialized and which is deserialized.


Apologies if I'm just dense, but what's the confusion? Is it what a 
serialized datum means in this context? (de)serialized seems like a 
perfectly logical name to me...



I'm not sure exactly how to push this forward.  I would not want to
commit it without converting a significant number of array functions to
understand about deserialized inputs, and by the time I've finished that
work it's likely to be too late for 9.5.  OTOH I'm sure that the PostGIS
folk would love to have this infrastructure in 9.5 not 9.6 so they could
make a start on fixing their issues.  (Further down the pike, I'd plan to
look at adapting composite-type operations, JSONB, etc, to make use of
this approach, but that certainly isn't happening for 9.5.)

Thoughts, advice, better ideas?

I'm not really a big fan of putting an infrastructure out there for
modules to use that we don't use ourselves (particularly when it's clear
that there are places where we could/should be).  On the other hand,
this doesn't impact on-disk format and therefore I'm less worried that
we'll end up with a release-critical issue when we're getting ready to
put 9.5 out there.

So, I'm on the fence about it.  I'd love to see all of this in 9.5 with
the array functions converted, but I don't think it'd be horrible if
only a subset had been done in time for 9.5.  The others aren't going to
go anywhere and will still work.  I do think it'd be better to have at
least some core users of this new infrastructure rather than just
putting it out there for modules to use but I agree it'd be a bit grotty
to have only some of the array functions converted.


I think the solution here is to have people other than Tom do the 
gruntwork of applying this to the remaining array code. My thought is 
that if Tom shows how to do this correctly in a rather complex case (ie, 
where you need to worry about primary vs secondary), then less 
experienced hackers should be able to take the ball and run with it.


Maybe we won't get complete array coverage, but I think any performance 
gains here are a win. And really, don't we just need enough usage so the 
buildfarm will tell us if we accidentally break something?

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory

2015-02-10 Thread Tom Lane
[ this is addressing a tangential point ... ]

Stephen Frost sfr...@snowman.net writes:
 * Tom Lane (t...@sss.pgh.pa.us) wrote:
 * Although I said above that everything owned by a deserialized object
 has to live in a single memory context, I do have ideas about relaxing
 that.  The core idea would be to invent a memory context reset/delete
 callback feature in mcxt.c.  Then a deserialized object could register
 such a callback on its own memory context, and use the callback to clean
 up resources outside its context.

 Being able to register a callback to be used on deletion of the context
 would certainly be very nice and strikes me as pretty independent of the
 rest of this.  You've probably thought of this already, but registering
 the callback should probably allow the caller to pass in a pointer to be
 passed back to the callback function when the delete happens, so that
 there's a place for the metadata to be stored about what the callback
 function needs to clean up when it's called.

Yeah, there would likely be use-cases for that outside of deserialized
objects.  I could submit a separate patch for that now, but I'm hesitant
to add a mechanism without any use-case in the same patch.  But maybe we
could find a caller somewhere in the core aggregate code --- there are
some aggregates that need cleanup callbacks already, IIRC, and maybe we
could change them to use a memory context callback instead of whatever
they're doing now.
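
A rough sketch of what such a mechanism might look like (the names and
layout here are illustrative, not a settled mcxt.c API):

typedef void (*MemoryContextCallbackFunction) (void *arg);

typedef struct MemoryContextCallback
{
	MemoryContextCallbackFunction func; /* run at reset or delete of the context */
	void	   *arg;			/* passed back to func, per Stephen's point */
	struct MemoryContextCallback *next; /* next callback registered on this context */
} MemoryContextCallback;

/*
 * Register 'cb' to be called when 'context' is reset or deleted.  The
 * callback struct itself would normally be palloc'd in that same context.
 */
extern void MemoryContextRegisterResetCallback(MemoryContext context,
											   MemoryContextCallback *cb);

A deserialized object (or an aggregate needing cleanup) would allocate one
of these structs in its own context and point arg at whatever bookkeeping
describes the out-of-context resources to release.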

regards, tom lane




[HACKERS] reducing our reliance on MD5

2015-02-10 Thread Robert Haas
There have been a few previous attempts to wean PostgreSQL off of MD5.
Back in 2012, Heikki proposed using SCRAM for our next-generation
authentication mechanism:

http://www.postgresql.org/message-id/507550bd.2030...@vmware.com

That seems likely to be a good idea, but nobody's come up with a
patch.  Peter Bex *did* come up with a patch to replace MD5
authentication with bcrypt, which I guess is based on Blowfish:

http://www.postgresql.org/message-id/20121227144638.gc...@frohike.homeunix.org

Although the patch was described as relatively easy to write, it never
went anywhere, because it *replaced* MD5 authentication with bcrypt,
which would be a big problem for existing clients.  It seems clear
that we should add something new and not immediately kill off what
we've already got, so that people can transition smoothly.  An idea I
just had today is to keep using basically the same system that we are
currently using for MD5, but with a stronger hash algorithm, like
SHA-1 or SHA-2 (which includes SHA-224, SHA-256, SHA-384, and
SHA-512).  Those are slower, but my guess is that even SHA-512 is not
enough slower for anybody to care very much, and if they do, well
that's another reason to make use of the new stuff optional.

Maybe we could make the authentication handshake pluggable:

typedef bool (*password_encryption_fn)(const char *passwd, const char *salt,
                                       size_t salt_len, char *buf);  /* like pg_md5_encrypt */
extern void register_encrypted_authentication_type(char *name,
                                                    password_encryption_fn fn);

Then we could add a _PG_init() function to pgcrypto that would enable
the use of whatever pgcrypto has got to offer as authentication
handshake methods.  This has a couple of advantages.  First, we don't
have to suck pgcrypto into core, something we've been reluctant to do;
second, if MD5 or any successor algorithm is shown to be insecure,
users can just drop in a new one, either provided by us or one they
write themselves or something in an external library they have and can
integrate to.
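
For illustration, a pgcrypto _PG_init() might then look something like the
sketch below.  Both register_encrypted_authentication_type() and the SHA-256
helper are hypothetical pieces of the proposal above, not existing APIs:

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/* hypothetical: fill buf with a hex SHA-256 of password and salt, like pg_md5_encrypt */
static bool
pg_sha256_encrypt(const char *passwd, const char *salt, size_t salt_len,
				  char *buf)
{
	/* ... compute SHA-256 over passwd || salt and hex-encode into buf ... */
	return true;
}

void
_PG_init(void)
{
	/* advertise the new handshake method to the authentication code */
	register_encrypted_authentication_type("sha256", pg_sha256_encrypt);
}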

There are a few complications:

1. Each encryption algorithm requires a wire-protocol constant to
identify it, and the client and the server have to agree on that
value.  So we'd have to define constants for whatever algorithms we
want to provide via the core distribution, and possibly either
maintain a registry of other values or at least set aside a space
(values > 1e6, maybe) that is reserved for use by extensions.

2. We'd have to figure out how to get support for the new algorithms
into libpq.  This actually seems a good bit harder than doing it on
the server-side, because I don't think libpq has any dynamic loading
facilities the way the server does.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Peter Geoghegan
On Tue, Feb 10, 2015 at 4:21 PM, Robert Haas robertmh...@gmail.com wrote:
 Although the patch was described as relatively easy to write, it never
 went anywhere, because it *replaced* MD5 authentication with bcrypt,
 which would be a big problem for existing clients.  It seems clear
 that we should add something new and not immediately kill off what
 we've already got, so that people can transition smoothly.  An idea I
 just had today is to keep using basically the same system that we are
 currently using for MD5, but with a stronger hash algorithm, like
 SHA-1 or SHA-2 (which includes SHA-224, SHA-256, SHA-384, and
 SHA-512).  Those are slower, but my guess is that even SHA-512 is not
 enough slower for anybody to care very much, and if they do, well
 that's another reason to make use of the new stuff optional.

I believe that a big advantage of bcrypt for authentication is the
relatively high memory requirements. This frustrates GPU based
attacks.


-- 
Peter Geoghegan




Re: [HACKERS] What exactly is our CRC algorithm?

2015-02-10 Thread Heikki Linnakangas

On 01/09/2015 10:32 AM, Abhijit Menon-Sen wrote:

2. The sse4.2 patch has only some minor compile fixes.

I have built and tested both patches individually on little-endian
(amd64) and big-endian (ppc) systems. I verified that the _sse is
chosen at startup on the former, and _sb8 on the latter, and that
both implementations function correctly with respect to HEAD.


I'd like to arrange this somewhat differently, to keep the really 
low-level assembler blocks and such separate from the higher level 
parts. Especially now that pg_crc.c is in src/common and src/port, it 
doesn't seem appropriate to have assembler code in there. Also, some of 
the high-level code can be reused if we later add support e.g. for the 
ARMv8 CRC instructions.


I propose that we add a new header file in src/port, let's call it 
crc_instructions.h. That file contains the very low-level stuff, the 
pg_asm_crc32q() and pg_asm_crc32b() inline functions, which just contain 
the single assembler instruction, or the corresponding mnemonic or macro
depending on the compiler; the point of the crc_instructions.h header is 
to hide the differences between compilers and architectures.


If the CRC instructions are available, crc_instructions.h defines 
PG_HAVE_CRC32C_INSTRUCTIONS, as well as the pg_asm_crc32q() and 
pg_asm_crc32b() macros/functions. It also defines 
pg_crc32_instructions_runtime_check(), to perform the runtime check to 
determine whether the instructions can be used on the current platform 
(i.e. if you're running on a CPU with SSE 4.2 support). There's another 
symbol PG_CRC32C_INSTRUCTIONS_NEED_RUNTIME_CHECK, which indicates 
whether the runtime check is needed. That's currently always defined 
when PG_HAVE_CRC32C_INSTRUCTIONS is, but conceivably you might want to 
build a version that skips the runtime check altogether, and doesn't 
therefore require the slicing-by-8 fallback implementation at all. 
Gentoo users might like that ;-), as well as possible future 
architectures that always have CRC32 instructions.
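
To make that concrete, a stripped-down sketch of the x86 side of such a
header might look like this (illustrative only, not the attached patch; it
assumes pg_crc32 and bool from c.h):

/* crc_instructions.h, x86 portion */
#if defined(_MSC_VER)
#include <intrin.h>				/* __cpuid(), _mm_crc32_u8() */
#define PG_HAVE_CRC32C_INSTRUCTIONS
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>				/* __get_cpuid() */
#define PG_HAVE_CRC32C_INSTRUCTIONS
#endif

#ifdef PG_HAVE_CRC32C_INSTRUCTIONS

/* on x86 the instructions may be absent at runtime, so a check is needed */
#define PG_CRC32C_INSTRUCTIONS_NEED_RUNTIME_CHECK 1

static inline bool
pg_crc32_instructions_runtime_check(void)
{
#if defined(_MSC_VER)
	int			r[4];

	__cpuid(r, 1);
	return (r[2] & (1 << 20)) != 0;		/* CPUID.1:ECX bit 20 = SSE 4.2 */
#else
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;
	return (ecx & (1 << 20)) != 0;		/* CPUID.1:ECX bit 20 = SSE 4.2 */
#endif
}

static inline pg_crc32
pg_asm_crc32b(pg_crc32 crc, unsigned char data)
{
#if defined(_MSC_VER)
	return _mm_crc32_u8(crc, data);
#else
	__asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data));
	return crc;
#endif
}

#endif							/* PG_HAVE_CRC32C_INSTRUCTIONS */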


Attached is a patch along those lines. I haven't done much testing, but 
it demonstrates the code structure I have in mind.


A couple of remarks on your original patch, which also apply to the 
attached version:



+static inline pg_crc32
+pg_asm_crc32b(pg_crc32 crc, unsigned char data)
+{
+#if defined(__GNUC__) && defined(__x86_64__)
+   __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data));
+   return crc;
+#elif defined(_MSC_VER)
+   return _mm_crc32_u8(crc, data);
+#else
+   /* Can't generate crc32b, but keep the compiler quiet. */
+   return 0;
+#endif


I believe the CRC instructions are also available in 32-bit mode, so the 
check for __x86_64__ is not needed. Right? Any CPU recent enough to have 
these instructions certainly supports the x86-64 instruction set, but 
you might nevertheless have compiled and be running in i386 mode.


It would be nice to use GCC's builtin intrinsics, __builtin_ia32_crc32* 
where available, but I guess those are only available if you build with 
-msse4.2, and that would defeat the point of the runtime check.


I believe the _mm_* mnemonics would also work with the Intel compiler.

- Heikki
From 7e16b0d5be39f832769da463f069c51afa04fb57 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas heikki.linnakangas@iki.fi
Date: Tue, 10 Feb 2015 14:26:24 +0200
Subject: [PATCH 1/1] Use Intel SSE4.2 CRC instructions where available.

On x86, perform a runtime check to see if we're running on a CPU that
supports SSE 4.2. If we are, we can use the special crc32b and crc32q
instructions for the CRC-32C calculations. That greatly speeds up CRC
calculation.

Abhijit Menon-Sen, reviewed by Andres Freund and me.
---
 configure   |   2 +-
 configure.in|   2 +-
 src/common/pg_crc.c | 109 
 src/include/common/pg_crc.h |  12 -
 src/include/pg_config.h.in  |   3 ++
 5 files changed, 114 insertions(+), 14 deletions(-)

diff --git a/configure b/configure
index fa271fe..c352128 100755
--- a/configure
+++ b/configure
@@ -9204,7 +9204,7 @@ fi
 done
 
 
-for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h cpuid.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
 do :
   as_ac_Header=`$as_echo ac_cv_header_$ac_header | $as_tr_sh`
 ac_fn_c_check_header_mongrel $LINENO $ac_header $as_ac_Header $ac_includes_default
diff --git a/configure.in b/configure.in
index e6a49d1..588d626 100644

[HACKERS] Assertion failure when streaming logical changes

2015-02-10 Thread Michael Paquier
Hi all,

Using test_decoding on HEAD (cc761b1) I am seeing the following assertion
failure:
TRAP: FailedAssertion("!(!((RegisteredSnapshots)->ph_root == ((void*)0)))", File: "snapmgr.c", Line: 677)

(lldb) bt
* thread #1: tid = 0x, 0x7fff8b246d46 libsystem_kernel.dylib`__kill
+ 10, stop reason = signal SIGSTOP
  * frame #0: 0x7fff8b246d46 libsystem_kernel.dylib`__kill + 10
frame #1: 0x7fff8861bf83 libsystem_c.dylib`abort + 177
frame #2: 0x00010b556dd9
postgres`ExceptionalCondition(conditionName=0x00010b66ffcb,
errorType=0x00010b5c105d, fileName=0x00010b66fe68, lineNumber=677)
+ 137 at assert.c:54
frame #3: 0x00010b5af952
postgres`UnregisterSnapshotFromOwner(snapshot=0x7faea38c22d8,
owner=0x7faea38a6838) + 146 at snapmgr.c:677
frame #4: 0x00010b5af8b2
postgres`UnregisterSnapshot(snapshot=0x7faea38c22d8) + 50 at
snapmgr.c:663
frame #5: 0x00010b0166ce
postgres`systable_endscan(sysscan=0x7faea38cf090) + 110 at genam.c:504
frame #6: 0x00010b5474b8
postgres`RelationBuildTupleDesc(relation=0x000114932e68) + 952 at
relcache.c:568
frame #7: 0x00010b53e45c
postgres`RelationBuildDesc(targetRelId=3455, insertIt='\x01') + 604 at
relcache.c:1035
frame #8: 0x00010b53d564
postgres`RelationIdGetRelation(relationId=3455) + 324 at relcache.c:1777
frame #9: 0x00010aff093c postgres`relation_open(relationId=3455,
lockmode=1) + 108 at heapam.c:1047
frame #10: 0x00010b016ac9 postgres`index_open(relationId=3455,
lockmode=1) + 25 at indexam.c:167
frame #11: 0x00010b01603d
postgres`systable_beginscan(heapRelation=0x0001149214d0, indexId=3455,
indexOK='\x01', snapshot=0x, nkeys=2,
key=0x7fff54c6a990) + 93 at genam.c:334
frame #12: 0x00010b54a976
postgres`RelidByRelfilenode(reltablespace=0, relfilenode=12709) + 742 at
relfilenodemap.c:204
frame #13: 0x00010b3353d9
postgres`ReorderBufferCommit(rb=0x7faea38c1038, xid=1001,
commit_lsn=23817712, end_lsn=23818128, commit_time=476842122105685) + 665
at reorderbuffer.c:1338
frame #14: 0x00010b330ccb
postgres`DecodeCommit(ctx=0x7faea38b0838, buf=0x7fff54c6ad58,
xid=1001, dboid=16384, commit_time=476842122105685, nsubxacts=0,
sub_xids=0x7faea3885530, ninval_msgs=22, msgs=0x7faea3885530) + 443
at decode.c:548
frame #15: 0x00010b32f663
postgres`DecodeXactOp(ctx=0x7faea38b0838, buf=0x7fff54c6ad58) + 547
at decode.c:210
frame #16: 0x00010b32f10e
postgres`LogicalDecodingProcessRecord(ctx=0x7faea38b0838,
record=0x7faea38b0af8) + 142 at decode.c:103
frame #17: 0x00010b3433f5 postgres`XLogSendLogical + 165 at
walsender.c:2425
frame #18: 0x00010b3431ad
postgres`WalSndLoop(send_data=0x00010b343350) + 269 at walsender.c:1834
frame #19: 0x00010b341938
postgres`StartLogicalReplication(cmd=0x7faea38846a8) + 568 at
walsender.c:997
frame #20: 0x00010b34021c
postgres`exec_replication_command(cmd_string=0x7faea3833a38) + 524 at
walsender.c:1326
frame #21: 0x00010b3abaab postgres`PostgresMain(argc=1,
argv=0x7faea3803fc8, dbname=0x7faea3803ec0,
username=0x7faea3803ea0) + 2475 at postgres.c:4022
frame #22: 0x00010b312a2e
postgres`BackendRun(port=0x7faea3405d60) + 686 at postmaster.c:4141
frame #23: 0x00010b311ff0
postgres`BackendStartup(port=0x7faea3405d60) + 384 at postmaster.c:3826
frame #24: 0x00010b30e7f7 postgres`ServerLoop + 663 at
postmaster.c:1594
frame #25: 0x00010b30c017 postgres`PostmasterMain(argc=3,
argv=0x7faea34044d0) + 5623 at postmaster.c:1241
frame #26: 0x00010b24e11d postgres`main(argc=3,
argv=0x7faea34044d0) + 749 at main.c:221

The problem can be easily reproduced using pg_recvlogical after creating a
logical slot plugged with test_decoding.
Regards,
-- 
Michael


Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory

2015-02-10 Thread Tom Lane
Jim Nasby jim.na...@bluetreble.com writes:
 Without having read the patch, I think this is great. I've been wishing 
 for something like this while working on my variant data type.

 Are there any cases where we would want to use this on a non-variant? 
 Perhaps types where we're paying an alignment penalty?

What do you mean by non-variant?

The use cases that have come to mind for me are:

* arrays, of course
* composite types (records)
* PostGIS geometry type
* JSONB, hstore
* possibly regex patterns (we could invent a data type representing these
and then store the compiled form as a deserialized representation;
although there would be some issues to be worked out to get any actual
win, probably)

The principal thing that's a bit hard to figure out is when it's a win to
convert to a deserialized representation and when you should just leave
well enough alone.  I'm planning to investigate that further in the
context of plpgsql array variables, but I'm not sure how well those
answers will carry over to datatypes that plpgsql has no intrinsic
understanding of.

regards, tom lane




[HACKERS] Typo in logicaldecoding.sgml

2015-02-10 Thread Tatsuo Ishii
I found a small typo in logicaldecoding.sgml. I think one of them
could be a fix. Any advice?

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 3efe433..3650567 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -299,7 +299,7 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
 
    <para>
     The command <xref linkend="app-pgrecvlogical"> can be used to control
-    logical decoding over a streaming replication connection.  (It is uses
+    logical decoding over a streaming replication connection.  (It uses
     these commands internally.)
    </para>
   </sect1>

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 3efe433..d89545a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -299,7 +299,7 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
 
    <para>
     The command <xref linkend="app-pgrecvlogical"> can be used to control
-    logical decoding over a streaming replication connection.  (It is uses
+    logical decoding over a streaming replication connection.  (It is using
     these commands internally.)
    </para>
   </sect1>

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] Typo in logicaldecoding.sgml

2015-02-10 Thread Tatsuo Ishii
 Do you want to push this yourself or have me do it?

If ok, could you please push it?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: [HACKERS] Typo in logicaldecoding.sgml

2015-02-10 Thread Andres Freund
Hi,

On 2015-02-11 08:55:12 +0900, Tatsuo Ishii wrote:
 I found a small typo in logicaldecoding.sgml.

Oops.

 I think one of them could be a fix. Any advice?

I slightly prefer the second form...

Do you want to push this yourself or have me do it?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Fetch zero result rows when executing a query?

2015-02-10 Thread Shay Rojansky
Thanks for understanding Robert, that's more or less what I had in mind, I
was mainly wondering if I were missing some deeper explanation for the
absence of the possibility of requesting 0 rows.

Regardless of all of the above, it's really no big deal. I'll go ahead and
use max_rows=1 for now, hopefully you guys don't decide to deprecate it.

Shay

On Tue, Feb 10, 2015 at 3:00 PM, Robert Haas robertmh...@gmail.com wrote:

 On Sun, Feb 8, 2015 at 3:56 AM, Shay Rojansky r...@roji.org wrote:
  Just to be precise: what is strange to me is that the max_rows feature
  exists
  but has no 0 value. You and Marko are arguing that the whole feature
 should
  be
  deprecated (i.e. always return all rows).

 I think the fact that it has no zero value is probably just a
 historical accident; most likely, whoever designed it originally
 (probably twenty years ago) didn't think about queries with
 side-effects and therefore didn't consider that wanting 0 rows would
 ever be sensible.  Meanwhile, a sentinel value was needed to request
 all rows, so they used 0.  If they'd thought of it, they might have
 picked -1 and we'd not be having this discussion.

 FWIW, I'm in complete agreement that it would be good if we had this
 feature.  I believe this is not the first report we've had of
 PostgreSQL doing things in ways that don't mesh nicely with standardized
 driver interfaces.  Whether we think those interfaces are
 well-designed or not, they are standardized.  When people use $OTHERDB
 and have a really great driver, and then they move to PostgreSQL and
 get one with more warts, it does not encourage them to stick with
 PostgreSQL.

 .NET is not some fringe user community that we can dismiss as
 irrelevant.  We need users of all languages to want to use PostgreSQL,
 not just users of languages any one of us happens to personally like.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] Assertion failure when streaming logical changes

2015-02-10 Thread Heikki Linnakangas

On 02/10/2015 02:46 PM, Andres Freund wrote:

On 2015-02-10 21:20:37 +0900, Michael Paquier wrote:

Using test_decoding on HEAD (cc761b1) I am seeing the following assertion
failure:
TRAP: FailedAssertion("!(!((RegisteredSnapshots)->ph_root == ((void*)0)))", File: "snapmgr.c", Line: 677)


Ick. I guess a revert of 94028691609f8e148bd4ce72c46163f018832a5b fixes
it?
...

Yea, it really looks like the above commit is to blame. The new xmin
tracking infrastructure doesn't know about the historical snapshot...


The new xmin tracking code assumes that if a snapshots's regd_count  0, 
it has already been pushed to the RegisteredSnapshots heap. That 
assumption doesn't hold because the logical decoding stuff creates its 
own snapshots and sets regd_count=1 to prevent snapmgr.c from freeing 
them, even though they haven't been registered with RegisterSnapshot.


The second paragraph in this comment in snapmgr.c needs fixing:


 * Likewise, any snapshots that have been exported by pg_export_snapshot
 * have regd_count = 1 and are counted in RegisteredSnapshots, but are not
 * tracked by any resource owner.
 *
 * The same is true for historic snapshots used during logical decoding,
 * their lifetime is managed separately (as they life longer as one xact.c
 * transaction).


Besides the spelling, that's incorrect: historic snapshots are *not* 
counted in RegisteredSnapshots. That was harmless before commit 
94028691, but it was always wrong.



I think setting regd_count=1 outside snapmgr.c is a pretty ugly hack. 
snapbuild.c also abuses active_count as a reference counter, which is 
similarly ugly. I think regd_count and active_count should both be 
treated as private to snapmgr.c, and initialized to zero elsewhere.


As a minimal fix, we could change the logical decoding code to not use 
regd_count to prevent snapmgr.c from freeing its snapshots, but use 
active_count for that too. Setting regd_count to 1 in 
SnapBuildBuildSnapshot() seems unnecessary in the first place, as the 
callers also call SnapBuildSnapIncRefcount(). Patch attached.


But could we get rid of those active_count manipulations too? Could you 
replace the SnapBuildSnap[Inc|Dec]Refcount calls with 
[Un]RegisterSnapshot()?


- Heikki

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bcd5896..1f76731 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1188,8 +1188,8 @@ ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 	memcpy(snap, orig_snap, sizeof(SnapshotData));
 
 	snap->copied = true;
-	snap->active_count = 0;
-	snap->regd_count = 1;
+	snap->active_count = 1;
+	snap->regd_count = 0;
 	snap->xip = (TransactionId *) (snap + 1);
 
 	memcpy(snap->xip, orig_snap->xip, sizeof(TransactionId) * snap->xcnt);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e911453..80f5b44 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -348,7 +348,7 @@ SnapBuildFreeSnapshot(Snapshot snap)
 	Assert(snap->curcid == FirstCommandId);
 	Assert(!snap->suboverflowed);
 	Assert(!snap->takenDuringRecovery);
-	Assert(snap->regd_count == 1);
+	Assert(snap->regd_count == 0);
 
 	/* slightly more likely, so it's checked even without c-asserts */
 	if (snap->copied)
@@ -407,16 +407,16 @@ SnapBuildSnapDecRefcount(Snapshot snap)
 	Assert(!snap->suboverflowed);
 	Assert(!snap->takenDuringRecovery);
 
-	Assert(snap->regd_count == 1);
+	Assert(snap->regd_count == 0);
 
-	Assert(snap->active_count);
+	Assert(snap->active_count > 0);
 
 	/* slightly more likely, so it's checked even without casserts */
 	if (snap->copied)
 		elog(ERROR, "cannot free a copied snapshot");
 
 	snap->active_count--;
-	if (!snap->active_count)
+	if (snap->active_count == 0)
 		SnapBuildFreeSnapshot(snap);
 
@@ -495,7 +495,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder, TransactionId xid)
 	snapshot->copied = false;
 	snapshot->curcid = FirstCommandId;
 	snapshot->active_count = 0;
-	snapshot->regd_count = 1;	/* mark as registered so nobody frees it */
+	snapshot->regd_count = 0;
+	snapshot-regd_count = 0;
 
 	return snapshot;
 }



Re: [HACKERS] reducing our reliance on MD5

2015-02-10 Thread Arthur Silva
On Tue, Feb 10, 2015 at 10:32 PM, Peter Geoghegan p...@heroku.com wrote:

 On Tue, Feb 10, 2015 at 4:21 PM, Robert Haas robertmh...@gmail.com
 wrote:
  Although the patch was described as relatively easy to write, it never
  went anywhere, because it *replaced* MD5 authentication with bcrypt,
  which would be a big problem for existing clients.  It seems clear
  that we should add something new and not immediately kill off what
  we've already got, so that people can transition smoothly.  An idea I
  just had today is to keep using basically the same system that we are
  currently using for MD5, but with a stronger hash algorithm, like
  SHA-1 or SHA-2 (which includes SHA-224, SHA-256, SHA-384, and
  SHA-512).  Those are slower, but my guess is that even SHA-512 is not
  enough slower for anybody to care very much, and if they do, well
  that's another reason to make use of the new stuff optional.

 I believe that a big advantage of bcrypt for authentication is the
 relatively high memory requirements. This frustrates GPU based
 attacks.


 --
 Peter Geoghegan




There's also scrypt, which can be tuned for both memory and compute
requirements.

I don't think password-storage best practices apply to db connection
authentication. So SHA-256 (or any other hash that isn't terribly broken) is
probably fine for Pg.


Re: [HACKERS] Parallel Seq Scan

2015-02-10 Thread Andres Freund
On 2015-02-10 08:52:09 -0500, Robert Haas wrote:
 On Tue, Feb 10, 2015 at 2:48 AM, Andres Freund and...@2ndquadrant.com wrote:
  Note that I'm not saying that Amit's patch is right - I haven't read it
  - but that I don't think a 'scan this range of pages' heapscan API would
  be a bad idea. Not even just for parallelism, but for a bunch of
  usecases.
 
 We do have that, already.  heap_setscanlimits().  I'm just not
 convinced that that's the right way to split up a parallel scan.
 There's too much risk of ending up with a very-uneven distribution of
 work.

If you make the chunks small enough, and then coordate only the chunk
distribution, not really.

  Regarding tuple flow between backends, I've thought about that before,
  I agree that we need it, and I don't think I know how to do it.  I can
  see how to have a group of processes executing a single node in
  parallel, or a single process executing a group of nodes we break off
  from the query tree and push down to it, but what you're talking about
  here is a group of processes executing a group of nodes jointly.
 
  I don't think it really is that. I think you'd do it essentially by
  introducing a couple more nodes. Something like
 
         SomeUpperLayerNode
                 |
                 |
          AggCombinerNode
           /           \
          /             \
         /               \
  PartialHashAggNode   PartialHashAggNode ... PartialHashAggNode ...
         |                     |
         |                     |
         |                     |
         |                     |
   PartialSeqScan        PartialSeqScan
 
  The only thing that'd potentially might need to end up working jointly
  jointly would be the block selection of the individual PartialSeqScans
  to avoid having to wait for stragglers for too long. E.g. each might
  just ask for a range of a 16 megabytes or so that it scans sequentially.
 
  In such a plan - a pretty sensible and not that uncommon thing for
  parallelized aggregates - you'd need to be able to tell the heap scans
  which blocks to scan. Right?
 
 For this case, what I would imagine is that there is one parallel heap
 scan, and each PartialSeqScan attaches to it.  The executor says give
 me a tuple and heapam.c provides one.  Details like the chunk size
 are managed down inside heapam.c, and the executor does not know about
 them.  It just knows that it can establish a parallel scan and then
 pull tuples from it.

I think that's a horrible approach that'll end up with far more
entangled pieces than what you're trying to avoid. Unless the tuple flow
is organized to only happen in the necessary cases the performance will
be horrible. And good chunk sizes et al depend on higher layers,
selectivity estimates and such. And that's planner/executor work, not
the physical layer (which heapam.c pretty much is).

A individual heap scan's state lives in process private memory. And if
the results inside the separate workers should directly be used in the
these workers without shipping over the network it'd be horrible to have
the logic in the heapscan. How would you otherwise model an executor
tree that does the seqscan and aggregation combined in multiple
processes at the same time?

  Maybe we designate nodes as can-generate-multiple-tuple-streams (seq
  scan, mostly, I would think) and can-absorb-parallel-tuple-streams
  (sort, hash, materialize), or something like that, but I'm really
  fuzzy on the details.
 
  I don't think we really should have individual nodes that produce
  multiple streams - that seems like it'd end up being really
  complicated. I'd more say that we have distinct nodes (like the
  PartialSeqScan ones above) that do a teensy bit of coordination about
  which work to perform.
 
 I think we're in violent agreement here, except for some
 terminological confusion.  Are there N PartialSeqScan nodes, one
 running in each node, or is there one ParallelSeqScan node, which is
 copied and run jointly across N nodes?  You can talk about either way
 and have it make sense, but we haven't had enough conversations about
 this on this list to have settled on a consistent set of vocabulary
 yet.

I pretty strongly believe that it has to be independent scan nodes. Both
from a implementation and a conversational POV. They might have some
very light cooperation between them (e.g. coordinating block ranges or
such), but everything else should be separate. From an implementation
POV it seems pretty awful to have executor node that's accessed by
multiple separate backends - that'd mean it have to be concurrency safe,
have state in shared memory and everything.
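
As a rough illustration of the kind of light coordination being described,
the shared piece could be little more than a block-range dispenser that each
worker's scan node pulls chunks from; everything else stays process-private.
A sketch with invented names, not from any posted patch:

#include "postgres.h"
#include "access/heapam.h"
#include "storage/spin.h"

/* the only state shared between the cooperating scan nodes */
typedef struct ParallelScanChunks
{
	slock_t		mutex;
	BlockNumber next_block;		/* first block not yet handed out */
	BlockNumber nblocks;		/* total blocks in the relation */
	BlockNumber chunk_blocks;	/* chunk size, e.g. 16MB worth of blocks */
} ParallelScanChunks;

/*
 * Called by a worker whenever it has finished its current chunk; returns
 * false once the whole relation has been handed out.
 */
static bool
parallel_scan_next_chunk(ParallelScanChunks *ps,
						 BlockNumber *start, BlockNumber *nblocks)
{
	bool		got_chunk = false;

	SpinLockAcquire(&ps->mutex);
	if (ps->next_block < ps->nblocks)
	{
		*start = ps->next_block;
		*nblocks = Min(ps->chunk_blocks, ps->nblocks - ps->next_block);
		ps->next_block += *nblocks;
		got_chunk = true;
	}
	SpinLockRelease(&ps->mutex);

	/* the worker would then do heap_setscanlimits(scan, *start, *nblocks) */
	return got_chunk;
}

Each worker runs this inside its own scan node; only the tiny struct above
lives in shared memory, which is the separation being argued for here.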

Now, there'll be a node that needs to do some parallel magic - but in
the