Re: [HACKERS] Parallel Seq Scan
On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund and...@2ndquadrant.com wrote: If you make the chunks small enough, and then coordinate only the chunk distribution, not really. True, but why do you want to do that in the executor instead of in the heapam? For this case, what I would imagine is that there is one parallel heap scan, and each PartialSeqScan attaches to it. The executor says "give me a tuple" and heapam.c provides one. Details like the chunk size are managed down inside heapam.c, and the executor does not know about them. It just knows that it can establish a parallel scan and then pull tuples from it. I think that's a horrible approach that'll end up with far more entangled pieces than what you're trying to avoid. Unless the tuple flow is organized to only happen in the necessary cases the performance will be horrible. I can't understand this at all. A parallel heap scan, as I've coded it up, involves no tuple flow at all. All that's happening at the heapam.c layer is that we're coordinating which blocks to scan. Not to be disrespectful, but have you actually looked at the patch? And good chunk sizes et al depend on higher layers, selectivity estimates and such. And that's planner/executor work, not the physical layer (which heapam.c pretty much is). If it's true that a good chunk size depends on the higher layers, then that would be a good argument for doing this differently, or at least exposing an API for the higher layers to tell heapam.c what chunk size they want. I hadn't considered that possibility - can you elaborate on why you think we might want to vary the chunk size? An individual heap scan's state lives in process private memory. And if the results inside the separate workers should directly be used in these workers without shipping over the network it'd be horrible to have the logic in the heapscan. How would you otherwise model an executor tree that does the seqscan and aggregation combined in multiple processes at the same time?
Again, the heap scan is not shipping anything anywhere ever in any design of any patch proposed or written. The results *are* directly used inside each individual worker. I think we're in violent agreement here, except for some terminological confusion. Are there N PartialSeqScan nodes, one running in each worker, or is there one ParallelSeqScan node, which is copied and run jointly across N workers? You can talk about it either way and have it make sense, but we haven't had enough conversations about this on this list to have settled on a consistent set of vocabulary yet. I pretty strongly believe that it has to be independent scan nodes. Both from an implementation and a conversational POV. They might have some very light cooperation between them (e.g. coordinating block ranges or such), but everything else should be separate. From an implementation POV it seems pretty awful to have an executor node that's accessed by multiple separate backends - that'd mean it would have to be concurrency safe, have state in shared memory and everything. I don't agree with that, but again I think it's a terminological dispute. I think what will happen is that you will have a single node that gets copied into multiple backends, and in some cases a small portion of its state will live in shared memory. That's more or less what you're thinking of too, I think. But what I don't want is - if we've got a parallel scan-and-aggregate happening in N workers, EXPLAIN shows N copies of all of that - not only because it's display clutter, but also because a plan to do that thing with 3 workers is fundamentally the same as a plan to do it with 30 workers. Those plans shouldn't look different, except perhaps for a line some place that says Number of Workers: N. Now, there'll be a node that needs to do some parallel magic - but in the above example that should be the AggCombinerNode, which would not only ask for tuples from one of the children at a time, but ask multiple ones in parallel.
But even then it doesn't have to deal with concurrency around its own state. Sure, we clearly want to minimize the amount of coordination between nodes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
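The block-range coordination described above -- workers claiming chunks of blocks from a shared scan state while tuples stay worker-local -- can be sketched like this. This is a toy model in Python with threads standing in for backend processes; the names are illustrative, not PostgreSQL's actual API:

```python
import threading

class ParallelScanState:
    """Shared scan state: hands out block ranges to workers.

    Only block numbers are coordinated here; each worker reads and
    processes its blocks privately, so no tuples cross this boundary.
    """
    def __init__(self, nblocks, chunk_size=4):
        self.nblocks = nblocks
        self.chunk_size = chunk_size
        self.next_block = 0
        self.lock = threading.Lock()

    def next_chunk(self):
        """Atomically claim the next range of blocks, or None when done."""
        with self.lock:
            if self.next_block >= self.nblocks:
                return None
            start = self.next_block
            end = min(start + self.chunk_size, self.nblocks)
            self.next_block = end
            return (start, end)

def worker(scan, claimed):
    # Each worker scans its claimed blocks locally; results are used
    # directly inside the worker, never shipped anywhere.
    while (chunk := scan.next_chunk()) is not None:
        claimed.extend(range(*chunk))

scan = ParallelScanState(nblocks=100)
results = [[] for _ in range(4)]
threads = [threading.Thread(target=worker, args=(scan, r)) for r in results]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every block is scanned exactly once across all workers.
assert sorted(b for r in results for b in r) == list(range(100))
```

The chunk size lives entirely inside the shared scan state, which is the point under debate: higher layers could still pass a preferred chunk size down through the constructor without the executor knowing how chunks are handed out.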
Re: [HACKERS] For cursors, there is FETCH and MOVE, why no TELL?
Marc Balmer m...@msys.ch writes: That is simple indeed. I tend to think, however, that it would be cleaner to return the position as a proper result from a function instead of using a side effect from a FETCH/MOVE command. Yeah. For one thing, a command tag wouldn't help you at all if you wanted to know the current cursor position inside a plpgsql function. There are also backwards-compatibility reasons to be nervous about changing the long-standing command tag values for these commands. An issue that would have to be addressed is what the function ought to do if posOverflow is set, which is entirely feasible on Windows (or anyplace else where long is only 32 bits). Maybe we should redeclare portalPos as int64 and get rid of the posOverflow logic. regards, tom lane
Re: [HACKERS] For cursors, there is FETCH and MOVE, why no TELL?
2015-02-10 16:21 GMT+01:00 Tom Lane t...@sss.pgh.pa.us: Marc Balmer m...@msys.ch writes: That is simple indeed. I tend to think, however, that it would be cleaner to return the position as a proper result from a function instead of using a side effect from a FETCH/MOVE command. Yeah. For one thing, a command tag wouldn't help you at all if you wanted to know the current cursor position inside a plpgsql function. It can be solved via the GET DIAGNOSTICS statement There are also backwards-compatibility reasons to be nervous about changing the long-standing command tag values for these commands. yes, this is a serious risk - and this is too high a cost for a relatively little-used feature. Regards Pavel An issue that would have to be addressed is what the function ought to do if posOverflow is set, which is entirely feasible on Windows (or anyplace else where long is only 32 bits). Maybe we should redeclare portalPos as int64 and get rid of the posOverflow logic. regards, tom lane
[HACKERS] Better error message on pg_upgrade checksum mismatches
Just a little thing that's been bugging me. If one side of the pg_upgrade has checksums and the other does not, give a less cryptic error message. -- Greg Sabino Mullane g...@endpoint.com End Point Corporation PGP Key: 0x14964AC8

diff --git a/contrib/pg_upgrade/controldata.c b/contrib/pg_upgrade/controldata.c
index a02a8ec..8a7b976 100644
--- a/contrib/pg_upgrade/controldata.c
+++ b/contrib/pg_upgrade/controldata.c
@@ -572,9 +572,17 @@ check_control_data(ControlData *oldctrl,
 	 * We might eventually allow upgrades from checksum to no-checksum
 	 * clusters.
 	 */
+	if (!oldctrl->data_checksum_version && newctrl->data_checksum_version)
+	{
+		pg_fatal("old version does not use data checksums but new one does\n");
+	}
+	if (oldctrl->data_checksum_version && !newctrl->data_checksum_version)
+	{
+		pg_fatal("old version uses data checksums but new one does not\n");
+	}
 	if (oldctrl->data_checksum_version != newctrl->data_checksum_version)
 	{
-		pg_fatal("old and new pg_controldata checksum versions are invalid or do not match\n");
+		pg_fatal("old and new pg_controldata checksum versions do not match\n");
 	}
 }
[HACKERS] Fixing pg_dump's heuristic about casts
I was reminded today by Greg Mullane's blog post http://blog.endpoint.com/2015/02/postgres-custom-casts-and-pgdump.html about how pg_dump fails to dump custom casts if they are casts between built-in types. Setting aside the wisdom of creating such a cast, it's definitely annoying that pg_dump misses it. He's hardly the first to complain, too. The current heuristic dates back to this discussion: http://www.postgresql.org/message-id/flat/3f7066bf.6010...@yahoo.com (see commit a6790ce85752b67ad994f55fdf1a450262ccc32e). Re-reading it, I'm wondering why we didn't go with plan B, namely determine whether to dump a cast based on its OID. That is, dump it if it has an OID >= FirstNormalObjectId (16384), otherwise not. That's admittedly pretty ugly, but the existing heuristic isn't exactly beautiful. And it's not like we haven't used the OID rule for other things --- see extensions. So I propose ripping out the current heuristic (in HEAD, lines 10482-10530 of pg_dump.c) and instead inspecting the cast's OID to decide whether to dump it, much like selectDumpableExtension() does. I'd further propose back-patching this fix. Comments? regards, tom lane
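Plan B's decision rule is simple enough to state in a few lines. A sketch, assuming the well-known FirstNormalObjectId boundary; `cast_is_dumpable` is a hypothetical helper name for illustration, not pg_dump's actual code:

```python
FIRST_NORMAL_OBJECT_ID = 16384  # FirstNormalObjectId: first OID assigned after initdb

def cast_is_dumpable(cast_oid: int) -> bool:
    """Plan B from the message above: dump a cast iff its OID shows it
    was created after initdb, i.e. it is a user-defined object --
    regardless of whether the types it connects are built-in."""
    return cast_oid >= FIRST_NORMAL_OBJECT_ID

# A cast with an initdb-assigned OID (illustrative value) is skipped;
# a custom cast between built-in types gets a user-range OID and is dumped.
assert not cast_is_dumpable(11277)
assert cast_is_dumpable(70001)
```

This mirrors how selectDumpableExtension() already distinguishes built-in from user-created extensions, which is why the ugliness is at least a familiar kind of ugliness.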
Re: [HACKERS] enabling nestedloop and disabling hashjoin
Ravi Kiran ravi.kolanp...@gmail.com writes: yes sir, I did try the pg_ctl reload command, but it's still using the hash join algorithm and not the nested loop algorithm. I even restarted the server, even then it's still using the hash join algorithm Does show enable_hashjoin say it's off? If not, I think you must've fat-fingered the postgresql.conf change somehow. regards, tom lane
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 5:22 PM, Arthur Silva arthur...@gmail.com wrote: I assume if the hacker can intercept the server unencrypted traffic and/or has access to its hard-drive the database is compromised anyway. That sounds like an argument against hashing the passwords in general. -- Peter Geoghegan
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 7:32 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Feb 10, 2015 at 4:21 PM, Robert Haas robertmh...@gmail.com wrote: Although the patch was described as relatively easy to write, it never went anywhere, because it *replaced* MD5 authentication with bcrypt, which would be a big problem for existing clients. It seems clear that we should add something new and not immediately kill off what we've already got, so that people can transition smoothly. An idea I just had today is to keep using basically the same system that we are currently using for MD5, but with a stronger hash algorithm, like SHA-1 or SHA-2 (which includes SHA-224, SHA-256, SHA-384, and SHA-512). Those are slower, but my guess is that even SHA-512 is not enough slower for anybody to care very much, and if they do, well that's another reason to make use of the new stuff optional. I believe that a big advantage of bcrypt for authentication is the relatively high memory requirements. This frustrates GPU based attacks. I don't actually care which algorithm we use, and I dowannahafta care. What I do want to do is provide a framework so that, when somebody discovers that X is better than Y because Z, somebody who knows about cryptography and not much about PostgreSQL can add support for X in a relatively small number of lines of code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 5:14 PM, Arthur Silva arthur...@gmail.com wrote: I don't think the password storing best practices apply to db connection authentication. Why not? -- Peter Geoghegan
Re: [HACKERS] parallel mode and parallel contexts
On 2015-02-10 11:49:58 -0500, Robert Haas wrote: On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund and...@2ndquadrant.com wrote: * Don't like CreateParallelContextForExtension much as a function name - I don't think we normally equate the fact that code is located in a loadable library with the extension mechanism. Suggestions for a better name? CreateParallelContextForLoadableFunction? *ExternalFunction maybe? That's what dfmgr.c calls it. * the plain shm calculations intentionally use mul_size/add_size to deal with overflows. On 32bit it doesn't seem impossible, but unlikely to overflow size_t. Yes, I do that here too, though as with the plain shm calculations, only in the estimate functions. The functions that actually serialize stuff don't have to worry about overflow because it's already been checked. I thought I'd seen some estimation functions that didn't use it. But apparently I was wrong. * In +LaunchParallelWorkers, does it make sense trying to start workers if one failed? ISTM that's likely to not be helpful. I.e. it should just break; after the first failure. It can't just break, because clearing pcxt->worker[i].error_mqh is an essential step. I could add a flag variable that tracks whether any registrations have failed and change the if statement to if (!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker, &pcxt->worker[i].bgwhandle)), if you want. I thought about doing that but decided it was very unlikely to affect the real-world performance of anything, so easier just to keep the code simple. But I'll change it if you want. I think it'd be better. * ParallelMain restores libraries before GUC state. Given that libraries can, and actually somewhat frequently do, inspect GUCs on load, it seems better to do it the other way round? You argue "We want to do this before restoring GUCs, because the libraries might define custom variables.", but I don't buy that. It's completely normal for namespaced GUCs to be present before a library is loaded?
Especially as you then go and process_session_preload_libraries() after setting the GUCs. This is a good question, but the answer is not entirely clear to me. I'm thinking I should probably just remove process_session_preload_libraries() altogether at this point. That was added at a time when RestoreLibraryState() didn't exist yet, and I think RestoreLibraryState() is a far superior way of handling this, because surely the list of libraries that the parallel leader *actually had loaded* is more definitive than any GUC. That sounds like a good idea to me. Now, that doesn't answer the question about whether we should load libraries first or GUCs. I agree that the comment's reasoning is bogus, but I'm not sure I understand why you think the other order is better. It *is* normal for namespaced GUCs to be present before a library is loaded, but it's equally normal (and, I think, often more desirable in practice) to load the library first and then set the GUCs. Well, it's pretty much never the case that the library is loaded before postgresql.conf gucs, right? A changed postgresql.conf is the only exception I can see. Neither is it the normal case for session|local_preload_libraries. Not even when GUCs are loaded via pg_db_role_setting or the startup packet... Generally, I think that libraries ought to be loaded as early as possible, because they may install hooks that change the way other stuff works, and should therefore be loaded before that other stuff happens. While that may be desirable I don't really see a reason for this to be treated differently for worker processes than the majority of cases otherwise. Anyway, I think this is a relatively minor issue. * Should ParallelMain maybe enter the parallel context before loading user defined libraries? It's far from absurd to execute code touching the database on library initialization... It's hard to judge without specific examples. What kinds of things do they do? 
I've seen code filling lookup caches and creating system objects (including tables and extensions). Are they counting on a transaction being active? I would have thought that was a no-no, since there are many instances in which it won't be true. Also, you might have just gotten loaded because a function stored in your library was called, so you could be in a transaction that's busy doing something else, or deep in a subtransaction stack, etc. It seems unsafe to do very much more than a few syscache lookups here, even if there does happen to be a transaction active. The only reason I'd like it to be active is because that'd *prohibit* doing the crazier stuff. There seems little reason to not do it under the additional protection against crazy things that'd give us? * I think restoring snapshots needs to fudge the worker's PGXACT->xmin to be the minimum of the top transaction id and the snapshots.
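The mul_size/add_size idiom discussed above is overflow-checked Size arithmetic, as in PostgreSQL's shmem.c: instead of letting a shared-memory size estimate silently wrap around, the helpers fail cleanly. A Python sketch, simulating a 32-bit size_t since Python integers don't themselves overflow:

```python
SIZE_MAX_32 = 2**32 - 1  # size_t limit on a 32-bit build

def add_size(a, b, size_max=SIZE_MAX_32):
    """Overflow-checked addition, in the spirit of shmem.c's add_size."""
    if a > size_max - b:
        raise OverflowError("requested shared memory size overflows size_t")
    return a + b

def mul_size(a, b, size_max=SIZE_MAX_32):
    """Overflow-checked multiplication, in the spirit of mul_size."""
    if b != 0 and a > size_max // b:
        raise OverflowError("requested shared memory size overflows size_t")
    return a * b

# A normal estimate just adds up: header plus payload.
assert add_size(100, 28) == 128

# Estimating space for a million 8 KB entries exceeds 32 bits, so the
# estimate function fails cleanly instead of wrapping to a tiny number.
try:
    mul_size(1_000_000, 8192)
    overflowed = False
except OverflowError:
    overflowed = True
assert overflowed
```

This is why the serialize functions can skip the checks: by the time they run, the estimate phase has already proven the totals fit.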
Re: [HACKERS] ibm system z in the buildfarm
On Tue, Feb 10, 2015 at 8:49 PM, Mark Wong m...@2ndquadrant.com wrote: I've seen in the archive a call for more architecture coverage so I just wanted to send a quick note that there is now Linux on System Z in the buildfarm: http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=nudibranch&br=HEAD Nice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] reducing our reliance on MD5
Robert Haas robertmh...@gmail.com writes: On Tue, Feb 10, 2015 at 9:30 PM, Tom Lane t...@sss.pgh.pa.us wrote: Another thing we need to keep in mind besides client compatibility is dump/reload compatibility. I don't think there's an issue with dump/reload compatibility as far as what I proposed, since it only has to do with the authentication procedure, not what gets stored in pg_authid. We might have reasons for moving that away from MD5 as well, but it's a separate project. Hm, well, that doesn't really square with your other expressed opinion: Are there other goals? I think the goal is stop using MD5, or at least have an option to not use MD5, because people think that's insecure. As you say, it's quite debatable whether MD5 is or isn't secure enough given the way we use it, but what's not debatable is that the optics of it are not very good anymore. However, if we want to shut up the peanut gallery on this point, we have to get rid of MD5 usage in pg_authid not just the on-the-wire protocol --- I seriously doubt that the knee jerk MD5-is-insecure crowd will make any distinction there. So I'm not following how you're satisfied with a proposal for just the latter. In any case, my larger point was that given the pain that we're going to incur here, and the certainly years-long transition interval involved, it would be foolish to think only about replacing the MD5 algorithm and not about reconsidering the context we use it in. Stuff like unreasonably short salt values should be dealt with at the same time. regards, tom lane
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 11:19 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Feb 10, 2015 at 5:14 PM, Arthur Silva arthur...@gmail.com wrote: I don't think the password storing best practices apply to db connection authentication. Why not? -- Peter Geoghegan I assume if the hacker can intercept the server unencrypted traffic and/or has access to its hard-drive the database is compromised anyway. I may be missing something though.
[HACKERS] ibm system z in the buildfarm
Hi everyone, I've seen in the archive a call for more architecture coverage so I just wanted to send a quick note that there is now Linux on System Z in the buildfarm: http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=nudibranch&br=HEAD Regards, Mark -- Mark Wong http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, RemoteDBA, Training Services
Re: [HACKERS] pgbench -f and vacuum
On 2/10/15 3:12 AM, Michael Paquier wrote: On Tue, Feb 10, 2015 at 5:03 PM, Tatsuo Ishii wrote: - The documentation misses some markups for pgbench and VACUUM and did not respect the 80-character limit. I didn't realize that there's such a style guide. Although I think it's a good thing, I just want to know where such a guide is described. That's self-learning based on the style of the other pages. I don't recall if there are actually convention guidelines for the docs, the only one I know of being the coding convention here: http://www.postgresql.org/docs/devel/static/source.html Maybe we could have a section dedicated to that. Thoughts? We have http://www.postgresql.org/docs/devel/static/docguide-style.html, although that doesn't cover formatting. For that we have .dir-locals.el: ((sgml-mode . ((fill-column . 78) (indent-tabs-mode . nil)))) ;-) Feel free to improve that.
Re: [HACKERS] reducing our reliance on MD5
On 02/10/2015 05:28 PM, Robert Haas wrote: I don't actually care which algorithm we use, and I dowannahafta care. What I do want to do is provide a framework so that, when somebody discovers that X is better than Y because Z, somebody who knows about cryptography and not much about PostgreSQL can add support for X in a relatively small number of lines of code. +1 -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] reducing our reliance on MD5
Peter Eisentraut pete...@gmx.net writes: On 2/10/15 8:28 PM, Robert Haas wrote: I don't actually care which algorithm we use, and I dowannahafta care. What I do want to do is provide a framework so that, when somebody discovers that X is better than Y because Z, somebody who knows about cryptography and not much about PostgreSQL can add support for X in a relatively small number of lines of code. sounds like SASL Sounds like pie in the sky really :-(. We could make the server turn on a dime perhaps, but the client-side population will not come along nearly that quickly, nor with small effort. Stored passwords won't migrate to a new scheme transparently either. I think it's probably reasonable to think about a more modern password auth method, but not to imagine that it will be pluggable or that the adoption time for any revision will be less than years long. regards, tom lane
Re: [HACKERS] 9.6 Feature help requested: Inclusion Constraints
On 2/9/15 3:12 AM, Jeff Davis wrote: On Sat, 2015-02-07 at 16:08 -0800, Jeff Davis wrote: I believe Inclusion Constraints will be important for postgres. I forgot to credit Darren Duncan with the name of this feature: http://www.postgresql.org/message-id/4f8bb9b0.5090...@darrenduncan.net I think it would be confusing to name something inclusion constraint that is not somehow the opposite of an exclusion constraint.
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 10:19 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Feb 10, 2015 at 5:14 PM, Arthur Silva arthur...@gmail.com wrote: I don't think the password storing best practices apply to db connection authentication. Why not? Usually because handshakes use a random salt on both sides. I'm not sure about pg's though, but in general collision strength is required, not slowness, since handshake responses aren't brute-forceable.
Re: [HACKERS] Typo in logicaldecoding.sgml
On Tue, Feb 10, 2015 at 6:57 PM, Andres Freund and...@2ndquadrant.com wrote: I think one of them could be a fix. Any advice? I slightly prefer the second form... I think "it uses" is definitely more standard English usage in this context. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 9:30 PM, Tom Lane t...@sss.pgh.pa.us wrote: 2. We'd have to figure out how to get support for the new algorithms into libpq. This actually seems a good bit harder than doing it on the server-side, because I don't think libpq has any dynamic loading facilities the way the server does. If you think libpq is the only problem, or even the main problem, on the client side then you have seriously misjudged the situation. There are multiple clients that do not use libpq at all, JDBC being the poster child here. We don't want to make things unnecessarily difficult for clients that implement the wire protocol from scratch, but I don't think it's our job to tell the owners of other connectors if, or when, they want to support new authentication methods. They certainly won't support them *before* we have server-side support; we can hope that if we add server-side support, they will follow along. Another thing we need to keep in mind besides client compatibility is dump/reload compatibility. I don't think there's an issue with dump/reload compatibility as far as what I proposed, since it only has to do with the authentication procedure, not what gets stored in pg_authid. We might have reasons for moving that away from MD5 as well, but it's a separate project. I think it would be wise to take two steps back and think about what the threat model is here, and what we actually need to improve. Offhand I can remember two distinct things we might wish to have more protection against: * scraping of passwords off the wire protocol (but is that still a threat in an SSL world?). Better salting practice would do more than replacing the algorithm as such for this, IMO. * scraping of passwords directly from the pg_authid table or from a pg_dump dump. 
In this case it's not so much that we are worried about preventing the scraper from accessing the database (he's evidently in already) as that we'd like to prevent him from recovering the original cleartext, since the user might've used that same password elsewhere. (Bad user, but nonetheless something we might try to protect against.) Again, more salt seems like it might go a lot further than just changing the algorithm. Are there other goals? I think the goal is to stop using MD5, or at least have an option to not use MD5, because people think that's insecure. Whether those people are right or not is probably extremely arguable - indeed, I can imagine your opening salvo here giving rise to vigorous debate about whether MD5, in the particular way we are using it, is or is not secure. The problem is that MD5 has enough weaknesses that, even if it is actually secure in the way we are using it, some people are not going to believe that, and are going to think we're bad people for using it in any way whatsoever. I'd like to keep those people as users, whoever is right about the security issues. The second problem is that, even if we could convince all of the people who might be worried about MD5 that their fears are unfounded, a new attack can be discovered at any time that weakens it still further and renders what we're doing unsafe. We shouldn't wait until then to start the years-long process of weaning ourselves off of it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
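For reference, "the particular way we are using it" is documented behavior: the server stores md5(password || username) in pg_authid, and the wire handshake answers a random 4-byte salt challenge by hashing the stored hash again. A Python sketch of that exchange using only hashlib (the user and password values are of course made up):

```python
import hashlib
import os

def stored_md5(user: str, password: str) -> str:
    # What pg_authid stores: "md5" || md5(password || username).
    # Note the only "salt" on the stored form is the user name itself.
    return "md5" + hashlib.md5((password + user).encode()).hexdigest()

def wire_response(user: str, password: str, salt: bytes) -> str:
    # The client's answer to an AuthenticationMD5Password challenge:
    # md5 of the stored hash's hex digits concatenated with the salt.
    inner = hashlib.md5((password + user).encode()).hexdigest()
    return "md5" + hashlib.md5(inner.encode() + salt).hexdigest()

salt = os.urandom(4)  # only 32 bits of per-connection challenge salt
resp = wire_response("alice", "s3cret", salt)

# The server verifies using only the stored hash -- it never needs the
# cleartext. That is also why a scraped pg_authid entry is directly
# usable to answer the challenge, which is part of Tom's point above.
inner = stored_md5("alice", "s3cret")[3:]
assert resp == "md5" + hashlib.md5(inner.encode() + salt).hexdigest()
```

The 4-byte challenge salt is the "unreasonably short salt" Tom mentions elsewhere in the thread: it protects against naive replay, but it is small enough that an eavesdropper collecting a few sessions has real precomputation options.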
Re: [HACKERS] reducing our reliance on MD5
On 2/10/15 8:28 PM, Robert Haas wrote: I don't actually care which algorithm we use, and I dowannahafta care. What I do want to do is provide a framework so that, when somebody discovers that X is better than Y because Z, somebody who knows about cryptography and not much about PostgreSQL can add support for X in a relatively small number of lines of code. sounds like SASL
Re: [HACKERS] Show the LSN in rm_redo_error_callback
On 2/10/15 5:15 PM, Andres Freund wrote: Hi, I've repeatedly stared at logs containing PANICs during WAL replay. Unfortunately it's rather hard to find out which record actually caused the problem as we print the record's contents but not its LSN. I think it's a no-brainer to add that to master. Makes sense. But I'd even argue that it'd be a good idea to add it to the backbranches - there've been a significant number of bugs causing PANICs due to replay errors in the past (and we might be hunting one more of them right now). We've had WAL since 7.1, and this is the first I've heard of this idea, so while I sympathize with the idea, it's hard to agree that this is a must-have. The standard for back-patching ought to be higher than mere convenience.
Re: [HACKERS] reducing our reliance on MD5
Robert Haas robertmh...@gmail.com writes: Although the patch was described as relatively easy to write, it never went anywhere, because it *replaced* MD5 authentication with bcrypt, which would be a big problem for existing clients. Right. The client end of it is the elephant in the room in any discussion of improving the password hash algorithm. 2. We'd have to figure out how to get support for the new algorithms into libpq. This actually seems a good bit harder than doing it on the server-side, because I don't think libpq has any dynamic loading facilities the way the server does. If you think libpq is the only problem, or even the main problem, on the client side then you have seriously misjudged the situation. There are multiple clients that do not use libpq at all, JDBC being the poster child here. Another thing we need to keep in mind besides client compatibility is dump/reload compatibility. I think it would be wise to take two steps back and think about what the threat model is here, and what we actually need to improve. Offhand I can remember two distinct things we might wish to have more protection against: * scraping of passwords off the wire protocol (but is that still a threat in an SSL world?). Better salting practice would do more than replacing the algorithm as such for this, IMO. * scraping of passwords directly from the pg_authid table or from a pg_dump dump. In this case it's not so much that we are worried about preventing the scraper from accessing the database (he's evidently in already) as that we'd like to prevent him from recovering the original cleartext, since the user might've used that same password elsewhere. (Bad user, but nonetheless something we might try to protect against.) Again, more salt seems like it might go a lot further than just changing the algorithm. Are there other goals? 
regards, tom lane
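Tom's "more salt" point can be made concrete: because the stored form is salted only with the user name, the same (user, password) pair hashes identically on every cluster, so one precomputed table cracks every scraped pg_authid or pg_dump it appears in. A random per-user salt breaks that property. The sketch below uses SHA-256 with a random salt purely as an illustrative alternative, not as anything this thread proposed:

```python
import hashlib
import os

def pg_style(user: str, password: str) -> str:
    # Username-only "salting": fully deterministic across clusters.
    return hashlib.md5((password + user).encode()).hexdigest()

def random_salted(password: str, salt: bytes = None):
    # Hypothetical alternative: a random salt stored alongside the hash.
    if salt is None:
        salt = os.urandom(16)
    return salt, hashlib.sha256(salt + password.encode()).hexdigest()

# Same user and password on two unrelated clusters: identical stored
# hashes, so one precomputed dictionary attacks both dumps at once.
assert pg_style("postgres", "hunter2") == pg_style("postgres", "hunter2")

# With a random salt, two enrollments of the same password differ,
# forcing an attacker to crack each stored entry separately.
s1, h1 = random_salted("hunter2")
s2, h2 = random_salted("hunter2")
assert h1 != h2
```

Verification still works because the salt is stored next to the hash: recomputing `random_salted(password, s1)` reproduces `h1`.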
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 11:25 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Feb 10, 2015 at 5:22 PM, Arthur Silva arthur...@gmail.com wrote: I assume if the hacker can intercept the server's unencrypted traffic and/or has access to its hard drive, the database is compromised anyway. That sounds like an argument against hashing the passwords in general. -- Peter Geoghegan Indeed. In a perfect world SCRAM would be my choice. FWIW MongoDB 3.0 also uses SCRAM as the preferred method for password-based authentication.
Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords
On 2/5/15 10:13 AM, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: All that having been said, it wouldn't be crazy to try to invent a system to lock this down, but it *would* be complicated. An individual FDW can call its authentication-related options anything it likes; they do not need to be called 'password'. So we'd need a way to identify which options should be hidden from untrusted users, and then a bunch of mechanism to do that. It's also debatable whether this wouldn't be a violation of the SQL standard. I see nothing in the SQL-MED spec authorizing filtering of the information_schema.user_mapping_options view. We actually are doing some filtering of values in user_mapping_options, but it's all-or-nothing so far as the options for any one mapping go. That's still not exactly supportable per spec, but it's probably less of a violation than option-by-option filtering would be. It also looks like that filtering differs in corner cases from what the regular pg_user_mappings view does, which is kinda silly. In particular I think we should try to get rid of the explicit provision for superuser access. I was hoping Peter would weigh in on what his design considerations were for these views ... I recall that we had extensive discussions about this back in the day. The SQL standard doesn't provide for any filtering of options, and the current behavior is just a bare minimum to not appear completely stupid. Since we don't know which options are security-sensitive, the only choices are hide everything or hide nothing. Note that if you opt for hide everything, you will also need to hide everything from servers and wrappers, since they could also contain passwords in their options. We could ask the FDW which options are security-sensitive, but that seems quite complicated to implement. We could require the user to specify which option values they want hidden (SENSITIVE OPTION '...' or whatever). I would welcome improvements in this area.
I'm not worried about SQL standard requirements here.
[HACKERS] pg_upgrade bug in handling postgres/template1 databases
I received a private pg_upgrade bug report related to its effect on the removal of clog files in the upgraded cluster. The user reported that an upgraded cluster yielded this error very soon after being started for the first time: SELECT * FROM test; ERROR: could not access status of transaction 685 DETAIL: Could not open file pg_clog/: No such file or directory. I was confused, as I had never seen such a report before. After much digging, I found out that the user was storing data _only_ in the 'postgres' database, not any other databases, and that pg_upgrade was not preserving pg_database.datfrozenxid and pg_database.datminmxid, even though it was processing those databases and copying over any user tables and indexes. When the new server was started, pg_database.datminmxid equaled the current transaction counter and old clog files were removed, yielding the error. This is not a problem for users who store only functions or extensions in the postgres or template1 database; it hits those who store user tables and indexes there. The attached patch fixes the problem; I have reproduced the bug and verified that the patch fixes it. It would have been nice if I had figured this out two weeks ago, before we released minor version upgrades. This bug has existed back to 9.0 or earlier, so it isn't new. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +

diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
new file mode 100644
index 3e3b433..ec7246a
*** a/src/bin/pg_dump/pg_dumpall.c
--- b/src/bin/pg_dump/pg_dumpall.c
*** dumpCreateDB(PGconn *conn)
*** 1417,1432 ****
  		appendPQExpBufferStr(buf, ";\n");

! 		if (binary_upgrade)
! 		{
! 			appendPQExpBufferStr(buf, "-- For binary upgrade, set datfrozenxid and datminmxid.\n");
! 			appendPQExpBuffer(buf, "UPDATE pg_catalog.pg_database "
! 							  "SET datfrozenxid = '%u', datminmxid = '%u' "
! 							  "WHERE datname = ",
! 							  dbfrozenxid, dbminmxid);
! 			appendStringLiteralConn(buf, dbname, conn);
! 			appendPQExpBufferStr(buf, ";\n");
! 		}
  	}

  	if (!skip_acls
--- 1417,1433 ----
  		appendPQExpBufferStr(buf, ";\n");
! 	}
!
! 	if (binary_upgrade)
! 	{
! 		appendPQExpBufferStr(buf, "-- For binary upgrade, set datfrozenxid and datminmxid.\n");
! 		appendPQExpBuffer(buf, "UPDATE pg_catalog.pg_database "
! 						  "SET datfrozenxid = '%u', datminmxid = '%u' "
! 						  "WHERE datname = ",
! 						  dbfrozenxid, dbminmxid);
! 		appendStringLiteralConn(buf, dbname, conn);
! 		appendPQExpBufferStr(buf, ";\n");
  	}

  	if (!skip_acls
Re: [HACKERS] The return value of allocate_recordbuf()
On Mon, Feb 9, 2015 at 7:02 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Mon, Feb 9, 2015 at 7:58 PM, Fujii Masao masao.fu...@gmail.com wrote: MemoryContextAllocExtended() was added, so isn't it time to replace palloc() with MemoryContextAllocExtended(CurrentMemoryContext, MCXT_ALLOC_NO_OOM) in allocate_recordbuf()? Yeah, let's move on here, but not with this patch I am afraid, as it breaks any client utility using xlogreader.c in frontend, like pg_xlogdump or pg_rewind, because MemoryContextAllocExtended is not available in frontend, only in backend. We are going to need something like palloc_noerror, defined on both the backend (mcxt.c) and frontend (fe_memutils.c) side. Unfortunately, that gets us back into the exact terminological dispute that we were hoping to avoid. Perhaps we could compromise on palloc_extended(). Btw, the huge flag should be used as well, since palloc only allows allocations up to 1GB, and this is incompatible with ~9.4 where malloc is used. I think that flag should be used only if it's needed, not just on the grounds that 9.4 has no such limit. In general, limiting allocations to 1GB has been good to us; for example, it prevents a corrupted TOAST length from causing a memory allocation large enough to crash the machine (yes, there are machines you can crash with a giant memory allocation!). We should bypass that limit only where it is clearly necessary. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] RangeType internal use
On Mon, Feb 9, 2015 at 7:54 PM, Amit Langote langote_amit...@lab.ntt.co.jp wrote: Well, that's debatable IMO (especially your claim that variable-size partitions would be needed by a majority of users). But in any case, partitioning behavior that is emergent from a bunch of independent pieces of information scattered among N tables seems absolutely untenable from where I sit. Whatever we support, the behavior needs to be described by *one* chunk of information --- a sorted list of bin bounding values, perhaps. I'm a bit confused here. I got the impression that the partitioning formula you suggest would consist of two pieces of information - an origin point and a bin width. Then routing a tuple consists of using exactly these two values to compute a bin number, and hence a partition, in O(1) time, assuming we've made all partitions exactly bin-width wide. You mention here a sorted list of bin bounding values, which we can very well put together for a partitioned table in its relation descriptor based on whatever information we stored in the catalog. That is, we can always have *one* chunk of partitioning information as the *internal* representation irrespective of how generalized we make our on-disk representation. We can get O(log N) if not O(1) from that, I'd hope. In fact, that's what I had in mind about this. Sure, we can always assemble data into a relation descriptor from across multiple catalog entries. I think the question is whether there is any good reason to split up the information across multiple relations or whether it might not be better, as I have suggested multiple times, to serialize it using nodeToString() and stuff it in a single column in pg_class. There may be such a reason, but if you said what it was, I missed it. This thread started as a discussion about using range types, and I think it's pretty clear that's a bad idea, because: 1. There's no guarantee that a range type for the datatype exists at all. 2.
If it does, there's no guarantee that it uses the same opclass that we want to use for partitioning, and I certainly think it would be strange if we refused to let the user pick the opclass she wants to use. 3. Even if there is a suitable range type available, it's a poor representational choice here, because it will be considerably more verbose than just storing a sorted list of partition bounds. In the common case where the ranges are adjacent, you'll end up storing two copies of every bound but the first and last for no discernible benefit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund and...@2ndquadrant.com wrote: Observations: * Some trailing whitespace in the readme. Very nice otherwise. Fixed. Thanks. * Don't like CreateParallelContextForExtension much as a function name - I don't think we normally equate the fact that code is located in a loadable library with the extension mechanism. Suggestions for a better name? CreateParallelContextForLoadableFunction? * StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's that about? Gee, maybe I should have added a comment so I'd remember. If the buffer size isn't MAXALIGN'd, that would be really bad, because shm_mq_create() assumes that it's being given an aligned address. Maybe I should add an Assert() there. If it is MAXALIGN'd but not BUFFERALIGN'd, we might waste a few bytes of space, since shm_toc_allocate() always allocates in BUFFERALIGN'd chunks, but I don't think anything will actually break. Not sure if that's worth an assert or not. * the plain shm calculations intentionally use mul_size/add_size to deal with overflows. On 32bit it doesn't seem impossible, but unlikely, to overflow size_t. Yes, I do that here too, though as with the plain shm calculations, only in the estimate functions. The functions that actually serialize stuff don't have to worry about overflow because it's already been checked. * I'd s/very // in "We might be running in a very short-lived memory context." Or replace it with "too". Removed "very". * In +LaunchParallelWorkers, does it make sense trying to start workers if one failed? ISTM that's likely to not be helpful. I.e. it should just break; after the first failure. It can't just break, because clearing pcxt->worker[i].error_mqh is an essential step. I could add a flag variable that tracks whether any registrations have failed and change the if statement to if (!any_registrations_failed && RegisterDynamicBackgroundWorker(worker, &pcxt->worker[i].bgwhandle)), if you want.
I thought about doing that but decided it was very unlikely to affect the real-world performance of anything, so easier just to keep the code simple. But I'll change it if you want. * +WaitForParallelWorkersToFinish says that it waits for workers to exit "cleanly". To me that's ambiguous. How about "fully"? I've removed the word "cleanly" and added a comment to more fully explain the danger: + * Even if the parallel operation seems to have completed successfully, it's + * important to call this function afterwards. We must not miss any errors + * the workers may have thrown during the parallel operation, or any that they + * may yet throw while shutting down. * ParallelMain restores libraries before GUC state. Given that libraries can, and actually somewhat frequently do, inspect GUCs on load, it seems better to do it the other way round? You argue "We want to do this before restoring GUCs, because the libraries might define custom variables.", but I don't buy that. It's completely normal for namespaced GUCs to be present before a library is loaded? Especially as you then go and process_session_preload_libraries() after setting the GUCs. This is a good question, but the answer is not entirely clear to me. I'm thinking I should probably just remove process_session_preload_libraries() altogether at this point. That was added at a time when RestoreLibraryState() didn't exist yet, and I think RestoreLibraryState() is a far superior way of handling this, because surely the list of libraries that the parallel leader *actually had loaded* is more definitive than any GUC. Now, that doesn't answer the question about whether we should load libraries first or GUCs. I agree that the comment's reasoning is bogus, but I'm not sure I understand why you think the other order is better. It *is* normal for namespaced GUCs to be present before a library is loaded, but it's equally normal (and, I think, often more desirable in practice) to load the library first and then set the GUCs.
Generally, I think that libraries ought to be loaded as early as possible, because they may install hooks that change the way other stuff works, and should therefore be loaded before that other stuff happens. * Should ParallelMain maybe enter the parallel context before loading user defined libraries? It's far from absurd to execute code touching the database on library initialization... It's hard to judge without specific examples. What kinds of things do they do? Are they counting on a transaction being active? I would have thought that was a no-no, since there are many instances in which it won't be true. Also, you might have just gotten loaded because a function stored in your library was called, so you could be in a transaction that's busy doing something else, or deep in a subtransaction stack, etc. It seems unsafe to do very much more than a few syscache lookups here, even if there does happen to be a transaction active. *
Re: [HACKERS] pgbench -f and vacuum
On Mon, Feb 9, 2015 at 2:54 PM, Tatsuo Ishii wrote: Agreed. Here is the patch to implement the idea: -f just implies -n. Some small comments: - is_no_vacuum, as well as is_init_mode, are defined as integers but their use implies that they are boolean switches. This patch sets is_no_vacuum to true, while the rest of the code actually increments its value when -n is used. Why not simply changing both flags to booleans? My suggestion is not really related to this patch and purely cosmetic, but we could change things at the same time, or update your patch to increment is_no_vacuum++ when -f is used. Yes, I have to admit that the current pgbench code is quite confusing in this regard. I think we should change is_no_vacuum and is_init_mode to boolean. - The documentation misses some markups for pgbench and VACUUM and did not respect the 80-character limit. I didn't realize that there's such a style guide. Although I think it's a good thing, I just want to know where such a guide is described. Attached is an updated patch with those things changed. Looks good. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0
On 2015-02-04 16:49:46 -0800, Peter Geoghegan wrote: On Tue, Feb 2, 2015 at 01:06 AM, Andres Freund and...@2ndquadrant.com wrote: Generally the split into the individual commits doesn't seem to make much sense to me. I think trying to make that possible is a good idea in patches of this size. It e.g. seems entirely possible to structure the patchset so that the speculative lock infrastructure is added first and the rest later. I've not thought more about how to split it up further, but I'm pretty sure it's possible. The commits (except the first) aren't individually committable and aren't even meaningful. Splitting off the internal docs, tests and such actually just seems to make reviewing harder because you miss context. Splitting it so that individual pieces are committable and reviewable makes sense, but... I have no problem doing the user docs later. If you split off RLS support, you need to throw an error before it's implemented. I mostly agree. Basically, I did not intend for all of the patches to be individually committed. The mechanism by which EXCLUDED.* expressions are added is somewhat novel, and deserves to be independently *considered*. I'm trying to show how the parts fit together more so than breaking things down into smaller commits (as you picked up on, 0001 is the exception - that is genuinely intended to be committed early). Also, those commit messages give me the opportunity to put those parts in their appropriate context vis-a-vis our discussions. They refer to the Wiki, for example, or reasons why pg_stat_statements shouldn't care about ExcludedExpr. Obviously the final commit messages won't look that way. FWIW, I don't think anything here really should refer to the wiki... 0002: * Tentatively I'd say that killspeculative should be done via a separate function instead of heap_delete() Really?
I guess if that were to happen, it would entail refactoring heap_delete() to call a static function, which was also called by a new kill_speculative() function that does this. Otherwise, you'd have far too much duplication. I don't really think there actually is that much in common between those. It looks to me that most of the code in heap_delete isn't actually relevant for this case and could be cut short. My guess is that only the WAL logging would be separated out. * I doubt logical decoding works with the patch as it stands. I thought so. Perhaps you could suggest a better use of the available XLOG_HEAP_* bits. I knew I needed to consider that more carefully (hence the XXX comment), but didn't get around to it. I think you probably need to add test macros that make sure only the individual bits are set, and not the combination, and then only use those. * If an arbiter index is passed to ExecCheckIndexConstraints(), can't we abort the loop after checking it? Also, do we really have to iterate over indexes for that case? How about moving the loop contents to a separate function and using that separately for the arbiter cases? Well, the failure to do that implies very few extra cycles, but sure. It's not that much about the CPU cycles, but also about the mental ones. If you have to think what happens if there's more than one match... * ItemPointerIsValid What about it? Uh. Oh. No idea. I wrote this pretty late at night ;) * /* * This may occur when an instantaneously invisible tuple is blamed * as a conflict because multiple rows are inserted with the same * constrained values. How can this happen? We don't insert multiple rows with the same command id? This is a cardinality violation [1]. It can definitely happen - just try the examples you see on the Wiki. I don't care about the wiki from the point of code comments. This needs to be understandable in five years' time.
* Perhaps it has previously been discussed but I'm not convinced by the reasoning for not looking at opclasses in infer_unique_index(). This seems like it'd prohibit ever having e.g. case-insensitive opclasses - something surely worthwhile. I don't think anyone gave that idea the thumbs-up. However, I really don't see the problem. Sure, we could have case-insensitive opclasses in the future, and you may want to make a unique index using one. Then the problem suddenly becomes that previous choices of indexes/statements aren't possible anymore. It seems much better to introduce the syntax now and not have too much of a use case for it. * The whole speculative insert logic isn't really well documented. Why, for example, do we actually need the token? And why are there no issues with overflow? And where is it documented that a 0 means there's no token? ... Fair enough. Presumably it's okay that overflow theoretically could occur, because a race is all but impossible. The token represents a particular attempt by some backend at inserting a tuple, that needs to be waited on
Re: [HACKERS] For cursors, there is FETCH and MOVE, why no TELL?
Hi, the patch can be very simple:

diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c
new file mode 100644
index 2794537..20b9206
*** a/src/backend/commands/portalcmds.c
--- b/src/backend/commands/portalcmds.c
*** PerformPortalFetch(FetchStmt *stmt,
*** 181,189 ****
  	/* Return command status if wanted */
  	if (completionTag)
! 		snprintf(completionTag, COMPLETION_TAG_BUFSIZE, "%s %ld",
! 				 stmt->ismove ? "MOVE" : "FETCH",
! 				 nprocessed);
  }

  /*
--- 181,190 ----
  	/* Return command status if wanted */
  	if (completionTag)
! 		snprintf(completionTag, COMPLETION_TAG_BUFSIZE, "%s %ld %ld",
! 				 stmt->ismove ? "MOVE" : "FETCH",
! 				 nprocessed,
! 				 portal->portalPos);
  }

  /*

2015-02-09 10:59 GMT+01:00 Marc Balmer m...@msys.ch: 2015-02-09 10:37 GMT+01:00 Marc Balmer m...@msys.ch: Currently there are FETCH and the (non-standard) MOVE commands to work on cursors. (I use cursors to display large datasets in a page-wise way, where the user can move per page, or, when displaying a single record, per record. When the user goes back from per-record view to page view, I have to restore the cursor to the position it was on before the user changed to per-record view.) I have to manually keep track of the cursor position, but in some cases it would definitely be easier to just query the current cursor position directly from the database and later use MOVE ABSOLUTE to rewind it to that position. That could be achieved e.g. by a hypothetical TELL cursor-name command. It does, however, not exist and I have not found an alternative. Is there a way to query the current cursor position at all? If not, does a TELL command sound like a good or bad idea? It sounds like a good idea. Do we need a new statement? We can implement returning the position in the MOVE statement. It returns a delta, but it could return an absolute position too. On second thought, a new statement is not needed at all.
As Heikki noticed in his reply, it could either be a new function or have MOVE return the current position somehow(tm). Or a new option to MOVE, maybe MOVE NOT (don't move the cursor but return its position)?
Re: [HACKERS] pgbench -f and vacuum
On Tue, Feb 10, 2015 at 5:03 PM, Tatsuo Ishii wrote: - The documentation misses some markups for pgbench and VACUUM and did not respect the 80-character limit. I didn't realize that there's such a style guide. Although I think it's a good thing, I just want to know where such a guide is described. That's self-learning based on the style of the other pages. I don't recall if there are actually convention guidelines for the docs, the only one I know of being the coding conventions here: http://www.postgresql.org/docs/devel/static/source.html Maybe we could have a section dedicated to that. Thoughts? -- Michael
Re: [HACKERS] pgbench -f and vacuum
On Tue, Feb 10, 2015 at 5:03 PM, Tatsuo Ishii wrote: - The documentation misses some markups for pgbench and VACUUM and did not respect the 80-character limit. I didn't realize that there's such a style guide. Although I think it's a good thing, I just want to know where such a guide is described. That's self-learning based on the style of the other pages. I don't recall if there are actually convention guidelines for the docs, the only one I know of being the coding conventions here: http://www.postgresql.org/docs/devel/static/source.html Maybe we could have a section dedicated to that. Thoughts? I think you need to talk to Peter. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] What exactly is our CRC algorithm?
On 02/09/2015 03:20 PM, Abhijit Menon-Sen wrote: At 2015-02-09 12:52:41 +0200, hlinnakan...@vmware.com wrote: Do you have access to big-endian hardware to test this on? Yes, I tested it on a Linux/ppc system. I wasn't speculating when I said it does make a noticeable difference, though I'm afraid I did not keep the timings after submitting the revised patch. The speedup was some double-digit percentage, IIRC. Ok, that's good enough for me. Committed with a few more edits on comments and such. I'll start looking at the second patch now. - Heikki
Re: [HACKERS] enabling nestedloop and disabling hashjon
yes sir, I did try the pg_ctl reload command, but it's still using the hash join algorithm and not the nested loop algorithm. I even restarted the server, and even then it's still using the hash join algorithm. Thanks On Tue, Feb 10, 2015 at 5:28 AM, Tom Lane t...@sss.pgh.pa.us wrote: Ravi Kiran ravi.kolanp...@gmail.com writes: I tried modifying the postgresql.conf file where I set the value enable_hashjoin=off and also enable_mergejoin=off, so that I could force postgres to use a nested loop. but postgres is still using the hash join algorithm even after modifying the postgresql.conf file. Did you remember pg_ctl reload? enable_hashjoin=off will most certainly work if it's active. regards, tom lane
Re: [HACKERS] pg_basebackup may fail to send feedbacks.
Hello, The attached patch is a fix for walreceiver not using gettimeofday(), and a fix for receivelog using it. XLogWalRcvProcessMsg doesn't send feedback when processing a 'w' (WAL record) packet, so the same thing as with pg_basebackup and pg_receivexlog will occur on walsender. Exiting the for(;;) loop via TimestampDifferenceExceeds just before line 439 is a measure equivalent to what I proposed for receivelog.c, but calling XLogWalRcvSendHSFeedback(false) there is seemingly simpler (though I feel a bit uncomfortable about the latter). I'm concerned about the performance degradation of calling gettimeofday() per record. Or other measures? Introduce a maximum number of records that we can receive and process between feedbacks? For example, we can change XLogWalRcvSendHSFeedback() so that it's called at least every 8 records. I'm not sure what number is good, though... At the beginning of the while (len > 0) { if (len > 0) { block, last_recv_timestamp is always kept up to date (by using gettimeofday()). So we can use that variable instead of gettimeofday() in my previous proposal. The start time of the timeout could be last_recv_timestamp at the beginning of the while loop, since it is a bit earlier than the actual time at that point. Another solution would be using RegisterTimeout/enable_timeout_after, and it seemed to work for me. It can throw out the time-counting stuff in the loop we are talking about and that of XLogWalRcvSendHSFeedback and XLogWalRcvSendReply, but it might be a bit too large for the gain. Considering pg_basebackup/receivexlog, the loop in receivelog.c does not maintain the time value within it, so I think we are forced to use feGetCurrentTimeStamp if not using SIGALRM. The WAL reading function simply gets the data from the in-memory buffer for most calls, so a gettimeofday() for each iteration could slow the process significantly. SIGALRM seems to be valuable for that case. Thoughts?
regards, -- Kyotaro Horiguchi NTT Open Source Software Center

From c9017f7c55de864bb3459f6f927803577e94c5eb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp
Date: Mon, 2 Feb 2015 12:49:45 +0900
Subject: [PATCH] Make sure to send feedback at desired timing.

A continuous stream due to heavy load on the client side can prevent
feedback from being sent at the expected interval, and it results in a
replication timeout on the walsender. Exit the fast-path loop when the
time comes, to make sure the feedback is sent at the expected intervals.
---
 src/backend/replication/walreceiver.c | 54 +++++++++++++++++++++--------------
 src/bin/pg_basebackup/receivelog.c    | 10 +++++--
 2 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index bfbc02f..d77dc91 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -409,35 +409,45 @@ WalReceiverMain(void)
 			if (len != 0)
 			{
 				/*
+				 * The feedback interval cannot be longer than
+				 * wal_receiver_status_interval. last_recv_timestamp is a
+				 * bit earlier than the actual time here so this is
+				 * available for the start time of the timeout in the loop
+				 * below.
+				 */
+				TimestampTz last_feedback = last_recv_timestamp;
+
+				/*
 				 * Process the received data, and any subsequent data we
 				 * can read without blocking.
 				 */
-				for (;;)
+				while (len > 0)
 				{
-					if (len > 0)
-					{
-						/*
-						 * Something was received from master, so reset
-						 * timeout
-						 */
-						last_recv_timestamp = GetCurrentTimestamp();
-						ping_sent = false;
-						XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
-					}
-					else if (len == 0)
-						break;
-					else if (len < 0)
-					{
-						ereport(LOG,
-								(errmsg("replication terminated by primary server"),
-								 errdetail("End of WAL reached on timeline %u at %X/%X.",
-										   startpointTLI,
-										   (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
-						endofwal = true;
+					/*
+					 * Something was received from master, so reset
+					 * timeout
+					 */
+					last_recv_timestamp = GetCurrentTimestamp();
+					ping_sent = false;
+					XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
+
+					/* Exit this loop if the time to reply has come */
+					if (TimestampDifferenceExceeds(last_feedback, last_recv_timestamp,
+												   wal_receiver_status_interval * 1000))
 						break;
-					}
+
 					len = walrcv_receive(0, &buf);
 				}
+				if (len < 0)
+				{
+					ereport(LOG,
+							(errmsg("replication terminated by primary server"),
+							 errdetail("End of WAL reached on timeline %u at %X/%X.",
+									   startpointTLI,
+									   (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
+					endofwal = true;
+					break;
+				}

 				/* Let the master know that we received some data. */
Re: [HACKERS] [pgsql-advocacy] GSoC 2015 - mentors, students and admins.
On Mon, Feb 9, 2015 at 6:52 PM, Thom Brown t...@linux.com wrote: Hi all, Google Summer of Code 2015 is approaching. I'm intending on registering PostgreSQL again this year. Before I do that, I'd like to have an idea of how many people are interested in being either a student or a mentor. I'm interested in participating as a student again. Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog: http://fabriziomello.github.io Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello Github: http://github.com/fabriziomello
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:

- * Wait for more WAL to arrive. Time out after 5 seconds,
- * like when polling the archive, to react to a trigger
- * file promptly.
+ * Wait for more WAL to arrive. Time out after the amount of
+ * time specified by wal_retrieve_retry_interval, like
+ * when polling the archive, to react to a trigger file promptly.
  */
  WaitLatch(&XLogCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT,
-           5000L);
+           wal_retrieve_retry_interval * 1000L);

This change can prevent the startup process from reacting to a trigger file. Imagine the case where a large interval is set and the user wants to promote the standby by using the trigger file instead of pg_ctl promote. I think that the sleep time should be 5s if the interval is set to more than 5s. Thought? I disagree here. It is interesting to accelerate the check of WAL availability from a source in some cases for replication, but the opposite is true as well, as mentioned by Alexey at the beginning of the thread, to reduce the number of requests when requesting WAL archives from an external storage type like AWS. Hence a correct solution would be to check periodically for the trigger file with a maximum one-time wait of 5s to ensure backward-compatible behavior. We could reduce it to 1s or something like that as well. You seem to have misunderstood the code in question. Or I'm missing something. The timeout of the WaitLatch is just the interval to check for the trigger file while waiting for more WAL to arrive from streaming replication. Not related to the retry time to restore WAL from the archive. [Re-reading the code...] Aah.. Yes you are right. Sorry for the noise. Yes let's wait for a maximum of 5s then. I also noticed in the previous patch that the wait was capped at 5s.
To begin with, a loop should have been used if it was a sleep, but as the patch now uses a latch this limit does not make much sense... Updated patch is attached. Regards, -- Michael

From 9b7e3bb32c744b328a0d99db3040cadfcba606aa Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 19 Jan 2015 16:08:48 +0900
Subject: [PATCH] Add wal_retrieve_retry_interval

This parameter helps control the timing at which WAL availability is
checked when a node is in recovery, particularly when successive
failures happen while fetching WAL archives, or when fetching WAL
records from a streaming source. Default value is 5s.

---
 doc/src/sgml/config.sgml                      | 17 ++
 src/backend/access/transam/xlog.c             | 46 +++
 src/backend/utils/misc/guc.c                  | 12 +++
 src/backend/utils/misc/postgresql.conf.sample |  3 ++
 src/include/access/xlog.h                     |  1 +
 5 files changed, 66 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6bcb106..d82b26a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2985,6 +2985,23 @@ include_dir 'conf.d'
      </listitem>
     </varlistentry>

+    <varlistentry id="guc-wal-retrieve-retry-interval" xreflabel="wal_retrieve_retry_interval">
+     <term><varname>wal_retrieve_retry_interval</varname> (<type>integer</type>)
+     <indexterm>
+      <primary><varname>wal_retrieve_retry_interval</varname> configuration parameter</primary>
+     </indexterm>
+     </term>
+     <listitem>
+      <para>
+       Specify the amount of time to wait when WAL is not available from
+       any sources (streaming replication, local <filename>pg_xlog</filename> or
+       WAL archive) before retrying to retrieve WAL.  This parameter can
+       only be set in the <filename>postgresql.conf</filename> file or on the
+       server command line.  The default value is 5 seconds.
+      </para>
+     </listitem>
+    </varlistentry>
+
    </variablelist>
   </sect2>
  </sect1>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..1f9c3c4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -93,6 +93,7 @@ int sync_method = DEFAULT_SYNC_METHOD;
 int		wal_level = WAL_LEVEL_MINIMAL;
 int		CommitDelay = 0;	/* precommit delay in microseconds */
 int		CommitSiblings = 5; /* # concurrent xacts needed to sleep */
+int		wal_retrieve_retry_interval = 5000;

 #ifdef WAL_DEBUG
 bool	XLOG_DEBUG = false;
@@ -10340,8 +10341,8 @@ static bool
 WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 							bool fetching_ckpt, XLogRecPtr tliRecPtr)
 {
-	static pg_time_t last_fail_time = 0;
-	pg_time_t	now;
+	TimestampTz now =
Re: [HACKERS] Assertion failure when streaming logical changes
On 2015-02-10 22:06:34 +0900, Michael Paquier wrote: On Tue, Feb 10, 2015 at 9:46 PM, Andres Freund and...@2ndquadrant.com wrote: Yea, it really looks like the above commit is to blame. The new xmin tracking infrastructure doesn't know about the historical snapshot... I think that we need better regression coverage here... For example, we could add some tap tests in test_decoding to test streaming of logical changes. This would help in the future to detect such problems via the buildfarm. I agree. It's more or less an accident that the assert - which should just be moved into the regd_count == 0 branch - didn't trigger for the SQL interface. The snapshot acquired by the SELECT statement prevents it there. It's not entirely trivial to add tests for receivelogical though. You need to stop it programmatically after a while. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] For cursors, there is FETCH and MOVE, why no TELL?
2015-02-10 14:32 GMT+01:00 Marc Balmer m...@msys.ch: On 10.02.15 at 09:06, Pavel Stehule wrote: Hi, the patch can be very simple:

diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c
new file mode 100644
index 2794537..20b9206
*** a/src/backend/commands/portalcmds.c
--- b/src/backend/commands/portalcmds.c
*** PerformPortalFetch(FetchStmt *stmt,
*** 181,189 ****
  	/* Return command status if wanted */
  	if (completionTag)
! 		snprintf(completionTag, COMPLETION_TAG_BUFSIZE, "%s %ld",
  				 stmt->ismove ? "MOVE" : "FETCH",
! 				 nprocessed);
  }
--- 181,190 ----
  	/* Return command status if wanted */
  	if (completionTag)
! 		snprintf(completionTag, COMPLETION_TAG_BUFSIZE, "%s %ld %ld",
  				 stmt->ismove ? "MOVE" : "FETCH",
! 				 nprocessed,
! 				 portal->portalPos);
  }

That is simple indeed. I tend to think, however, that it would be cleaner to return the position as a proper result from a function instead of using a side effect from a FETCH/MOVE command. I have no strong opinion about it Pavel 2015-02-09 10:59 GMT+01:00 Marc Balmer m...@msys.ch: 2015-02-09 10:37 GMT+01:00 Marc Balmer m...@msys.ch: Currently there are FETCH and the (non standard) MOVE commands to work on cursors. (I use cursors to display large datasets in a page-wise way, where the user can move per-page, or, when displaying a single record, per record. When the user goes back from per-record view to page-view, I have to restore the cursor to the position it was on before the user changed to per-record view.) I have to manually keep track of the cursor position, but in some cases it would definitely be easier to just query the current cursor position directly from the database and later use MOVE ABSOLUTE to rewind it to that position. That could be achieved e.g. by a hypothetical TELL cursor-name command. It does, however, not exist and I have not found an alternative. Is there a way to query the current cursor position at all?
If not, does a TELL command sound like a good or bad idea? It sounds like a good idea. Do we need a new statement? We can implement returning the position in the MOVE statement. It returns a delta, but it could return an absolute position too. On second thought, a new statement is not needed at all. As Heikki noticed in his reply, it could either be a new function or have MOVE return the current position somehow(tm). Or a new option to MOVE, maybe MOVE NOT (don't move the cursor but return its position)?
Re: [HACKERS] Parallel Seq Scan
On Tue, Feb 10, 2015 at 2:48 AM, Andres Freund and...@2ndquadrant.com wrote: Note that I'm not saying that Amit's patch is right - I haven't read it - but that I think a 'scan this range of pages' heapscan API would not be a bad idea. Not even just for parallelism, but for a bunch of use cases. We do have that, already. heap_setscanlimits(). I'm just not convinced that that's the right way to split up a parallel scan. There's too much risk of ending up with a very-uneven distribution of work. Regarding tuple flow between backends, I've thought about that before, I agree that we need it, and I don't think I know how to do it. I can see how to have a group of processes executing a single node in parallel, or a single process executing a group of nodes we break off from the query tree and push down to it, but what you're talking about here is a group of processes executing a group of nodes jointly. I don't think it really is that. I think you'd do it essentially by introducing a couple more nodes. Something like

                SomeUpperLayerNode
                        |
                        |
                 AggCombinerNode
                /       |       \
               /        |        \
              /         |         \
 PartialHashAggNode  PartialHashAggNode ... PartialHashAggNode ...
        |                   |
        |                   |
        |                   |
        |                   |
 PartialSeqScan        PartialSeqScan

The only thing that'd potentially need to end up working jointly would be the block selection of the individual PartialSeqScans, to avoid having to wait for stragglers for too long. E.g. each might just ask for a range of 16 megabytes or so that it scans sequentially. In such a plan - a pretty sensible and not that uncommon thing for parallelized aggregates - you'd need to be able to tell the heap scans which blocks to scan. Right? For this case, what I would imagine is that there is one parallel heap scan, and each PartialSeqScan attaches to it. The executor says "give me a tuple" and heapam.c provides one. Details like the chunk size are managed down inside heapam.c, and the executor does not know about them. It just knows that it can establish a parallel scan and then pull tuples from it.
Maybe we designate nodes as can-generate-multiple-tuple-streams (seq scan, mostly, I would think) and can-absorb-parallel-tuple-streams (sort, hash, materialize), or something like that, but I'm really fuzzy on the details. I don't think we really should have individual nodes that produce multiple streams - that seems like it'd end up being really complicated. I'd more say that we have distinct nodes (like the PartialSeqScan ones above) that do a teensy bit of coordination about which work to perform. I think we're in violent agreement here, except for some terminological confusion. Are there N PartialSeqScan nodes, one running in each worker, or is there one ParallelSeqScan node, which is copied and run jointly across N workers? You can talk about it either way and have it make sense, but we haven't had enough conversations about this on this list to have settled on a consistent set of vocabulary yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Fetch zero result rows when executing a query?
On Sun, Feb 8, 2015 at 3:56 AM, Shay Rojansky r...@roji.org wrote: Just to be precise: what is strange to me is that the max_rows feature exists but has no 0 value. You and Marko are arguing that the whole feature should be deprecated (i.e. always return all rows). I think the fact that it has no zero value is probably just a historical accident; most likely, whoever designed it originally (probably twenty years ago) didn't think about queries with side-effects and therefore didn't consider that wanting 0 rows would ever be sensible. Meanwhile, a sentinel value was needed to request all rows, so they used 0. If they'd thought of it, they might have picked -1 and we'd not be having this discussion. FWIW, I'm in complete agreement that it would be good if we had this feature. I believe this is not the first report we've had of PostgreSQL doing things in ways that don't mesh nicely with standardized driver interfaces. Whether we think those interfaces are well-designed or not, they are standardized. When people use $OTHERDB and have a really great driver, and then they move to PostgreSQL and get one with more warts, it does not encourage them to stick with PostgreSQL. .NET is not some fringe user community that we can dismiss as irrelevant. We need users of all languages to want to use PostgreSQL, not just users of languages any one of us happens to personally like. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Odd behavior of updatable security barrier views on foreign tables
Dean, * Dean Rasheed (dean.a.rash...@gmail.com) wrote: On 9 February 2015 at 21:17, Stephen Frost sfr...@snowman.net wrote: On Fri, Jan 30, 2015 at 5:20 AM, Etsuro Fujita I noticed that when updating security barrier views on foreign tables, we fail to give FOR UPDATE to selection queries issued at ForeignScan. I've looked into this a fair bit more over the weekend and the issue appears to be that the FDW isn't expecting a do-instead sub-query. I've been considering how we might be able to address that but haven't come up with any particularly great ideas and would welcome any suggestions. Simply having the FDW try to go up through the query would likely end up with too many queries showing up with 'for update'. We add the 'for update' to the sub-query before we even get called from the 'Modify' path too, which means we can't use that to realize when we're getting ready to modify rows and therefore need to lock them. In any case, I'll continue to look but would welcome any other thoughts. Sorry, I didn't have time to look at this properly. My initial thought is that expand_security_qual() needs to request a lock on rows coming from the relation it pushes down into a subquery if that relation was the result relation, because otherwise it won't have any locks, since preprocess_rowmarks() only adds PlanRowMarks to non-target relations. Yes, that works. I had been focused on trying to figure out a way to make this work just in the FDW, but you're right, fixing it in expand_security_qual() looks like the right approach. Of course that means that it may end up locking more rows than are actually updated, but that's essentially the same as a SELECT FOR UPDATE on a s.b. view right now. Agreed. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Better error message on pg_upgrade checksum mismatches
On Tue, Feb 10, 2015 at 10:55:07AM -0500, Greg Sabino Mullane wrote: Just a little thing that's been bugging me. If one side of the pg_upgrade has checksums and the other does not, give a less cryptic error message. OK, sure. Good fix, will apply. This seems to be the day for pg_upgrade fixes/improvements. :-) -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] Odd behavior of updatable security barrier views on foreign tables
Etsuro, * Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote: On 2015/02/10 7:23, Dean Rasheed wrote: Sorry, I didn't have time to look at this properly. My initial thought is that expand_security_qual() needs to request a lock on rows coming from the relation it pushes down into a subquery if that relation was the result relation, because otherwise it won't have any locks, since preprocess_rowmarks() only adds PlanRowMarks to non-target relations. That seems close to what I had in mind; expand_security_qual() needs to request a FOR UPDATE lock on rows coming from the relation it pushes down into a subquery only when that relation is the result relation and *foreign table*. I had been trying to work out an FDW-specific way to address this, but I think Dean's right that this should be addressed in expand_security_qual(), which means it'll apply to all cases and not just these FDW calls. I don't think that's actually an issue though and it'll match up to how SELECT FOR UPDATE is handled today. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Corner case for add_path_precheck
Antonin Houska a...@cybertec.at writes: The special case is that the path passed to add_path_precheck() has costs *equal to* those of the old_path. If pathkeys, outer rels and costs are the same, as summarized in the comment above, I expect add_path_precheck() to return false. Do I misread anything? It does, so I don't see your point? regards, tom lane
Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory
Tom, * Tom Lane (t...@sss.pgh.pa.us) wrote: I've now taken this idea as far as building the required infrastructure and revamping a couple of array operators to use it. There's a lot yet to do, but I've done enough to get some preliminary ideas about performance (see below). Nice! * Although I said above that everything owned by a deserialized object has to live in a single memory context, I do have ideas about relaxing that. The core idea would be to invent a memory context reset/delete callback feature in mcxt.c. Then a deserialized object could register such a callback on its own memory context, and use the callback to clean up resources outside its context. This is potentially useful for instance for something like PostGIS, where an object likely includes some data that was allocated with malloc not palloc because it was created by library functions that aren't Postgres-aware. Another likely use-case is for deserialized objects representing composite types to maintain reference counts on their tuple descriptors instead of having to copy said descriptors into their private contexts. This'd be material for a separate patch though. Being able to register a callback to be used on deletion of the context would certainly be very nice and strikes me as pretty independent of the rest of this. You've probably thought of this already, but registering the callback should probably allow the caller to pass in a pointer to be passed back to the callback function when the delete happens, so that there's a place for the metadata to be stored about what the callback function needs to clean up when it's called. So that's the plan, and attached is a very-much-WIP patch that uses this approach to speed up plpgsql array element assignments (and not a whole lot else as yet). Here's the basic test case I've been using: I've not looked at the code at all as yet, but it makes sense to me. With the attached patch, those timings drop to 80 and 150 ms respectively. 
And those numbers are pretty fantastic and would address an area we regularly get dinged on. Still, if the worst-case slowdown is around 20% on trivially-sized arrays, I'd gladly take that to have better performance on larger arrays. And I think this example is close to the worst case for the patch's approach, since it's testing small, fixed-element-length, no-nulls arrays, which is what the existing code can handle without spending a lot of cycles. Agreed. BTW, I'm not all that thrilled with the deserialized object terminology. I found myself repeatedly tripping up on which form was serialized and which de-. If anyone's got a better naming idea I'm willing to adopt it. Unfortunately, nothing comes to mind. Serialization is, at least, a pretty well understood concept and so the naming will likely make sense to newcomers, even if it's difficult to keep track of which is serialized and which is deserialized. I'm not sure exactly how to push this forward. I would not want to commit it without converting a significant number of array functions to understand about deserialized inputs, and by the time I've finished that work it's likely to be too late for 9.5. OTOH I'm sure that the PostGIS folk would love to have this infrastructure in 9.5 not 9.6 so they could make a start on fixing their issues. (Further down the pike, I'd plan to look at adapting composite-type operations, JSONB, etc, to make use of this approach, but that certainly isn't happening for 9.5.) Thoughts, advice, better ideas? I'm not really a big fan of putting an infrastructure out there for modules to use that we don't use ourselves (particularly when it's clear that there are places where we could/should be). On the other hand, this doesn't impact on-disk format and therefore I'm less worried that we'll end up with a release-critical issue when we're getting ready to put 9.5 out there. So, I'm on the fence about it. 
I'd love to see all of this in 9.5 with the array functions converted, but I don't think it'd be horrible if only a subset had been done in time for 9.5. The others aren't going to go anywhere and will still work. I do think it'd be better to have at least some core users of this new infrastructure rather than just putting it out there for modules to use but I agree it'd be a bit grotty to have only some of the array functions converted. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Parallel Seq Scan
On 2015-02-10 09:23:02 -0500, Robert Haas wrote: On Tue, Feb 10, 2015 at 9:08 AM, Andres Freund and...@2ndquadrant.com wrote: And good chunk sizes et al depend on higher layers, selectivity estimates and such. And that's planner/executor work, not the physical layer (which heapam.c pretty much is). If it's true that a good chunk size depends on the higher layers, then that would be a good argument for doing this differently, or at least exposing an API for the higher layers to tell heapam.c what chunk size they want. I hadn't considered that possibility - can you elaborate on why you think we might want to vary the chunk size? Because things like chunk size depend on the shape of the entire plan. If you have a 1TB table and want to sequentially scan it in parallel with 10 workers you better use some rather large chunks. That way readahead will be efficient in a cpu/socket local manner, i.e. directly reading in the pages into the directly connected memory of that cpu. Important for performance on a NUMA system, otherwise you'll constantly have everything go over the shared bus. But if you instead have a plan where the sequential scan goes over a 1GB table, perhaps with some relatively expensive filters, you'll really want a small chunks size to avoid waiting. The chunk size will also really depend on what other nodes are doing, at least if they can run in the same worker. Even without things like NUMA and readahead I'm pretty sure that you'll want a chunk size a good bit above one page. The locks we acquire for the buffercache lookup and for reading the page are already quite bad for performance/scalability; even if we don't always/often hit the same lock. Making 20 processes that scan pages in parallel acquire yet a another lock (that's shared between all of them!) for every single page won't be fun, especially without or fast filters. For this case, what I would imagine is that there is one parallel heap scan, and each PartialSeqScan attaches to it. 
The executor says give me a tuple and heapam.c provides one. Details like the chunk size are managed down inside heapam.c, and the executor does not know about them. It just knows that it can establish a parallel scan and then pull tuples from it. I think that's a horrible approach that'll end up with far more entangled pieces than what you're trying to avoid. Unless the tuple flow is organized to only happen in the necessary cases the performance will be horrible. I can't understand this at all. A parallel heap scan, as I've coded it up, involves no tuple flow at all. All that's happening at the heapam.c layer is that we're coordinating which blocks to scan. Not to be disrespectful, but have you actually looked at the patch? No, and I said so upthread. I started commenting because you argued that architecturally parallelism belongs in heapam.c instead of upper layers, and I can't agree with that. I now have, and it looks less bad than I had assumed, sorry. Unfortunately I still think it's wrong approach, also sorry. As pointed out above (moved there after reading the patch...) I don't think a chunk size of 1 or any other constant size can make sense. I don't even believe it'll necessarily be constant across an entire query execution (big initially, small at the end). Now, we could move determining that before the query execution into executor initialization, but then we won't yet know how many workers we're going to get. We could add a function setting that at runtime, but that'd mix up responsibilities quite a bit. I also can't agree with having a static snapshot in shared memory put there by the initialization function. For one it's quite awkward to end up with several equivalent snapshots at various places in shared memory. Right now the entire query execution can share one snapshot, this way we'd end up with several of them. Imo for actual parallel query execution the plan should be shared once and then be reused for everything done in the name of the query. 
Without the need to do that, you end up pretty much only with setup infrastructure so that heap_parallelscan_nextpage is called. How about instead renaming heap_beginscan_internal() to _extended and offering an option to provide a callback + state that determines the next page? Additionally provide some separate functions managing a simple implementation of such a callback + state? Btw, using an atomic uint32 you'd end up without the spinlock and just about the same amount of code... Just do an atomic_fetch_add_until32(var, 1, InvalidBlockNumber)... ;) I think we're in violent agreement here, except for some terminological confusion. Are there N PartialSeqScan nodes, one running in each worker, or is there one ParallelSeqScan node, which is copied and run jointly across N workers? You can talk about it either way and have it make sense, but we haven't had enough conversations about this on this list to have settled on a
Re: [HACKERS] RangeType internal use
On Mon, Feb 09, 2015 at 12:37:05PM -0500, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: On Mon, Feb 9, 2015 at 10:36 AM, Tom Lane t...@sss.pgh.pa.us wrote: It's going to be complicated and probably buggy, and I think it is heading in the wrong direction altogether. If you want to partition in some arbitrary complicated fashion that the system can't reason about very effectively, we *already have that*. IMO the entire point of building a new partitioning infrastructure is to build something simple, reliable, and a whole lot faster than what you can get from inheritance relationships. And faster is going to come mainly from making the partitioning rules as simple as possible, not as complex as possible. Yeah, but people expect to be able to partition on ranges that are not all of equal width. I think any proposal that we shouldn't support that is the kiss of death for a feature like this - it will be so restricted as to eliminate 75% of the use cases. Well, that's debatable IMO (especially your claim that variable-size partitions would be needed by a majority of users). It's ubiquitous. Time range partition sets almost always have some sets with finite range and at least one range with infinity in it: current end to infinity, and somewhat less frequently in my experience, -infinity to some arbitrary start. But in any case, partitioning behavior that is emergent from a bunch of independent pieces of information scattered among N tables seems absolutely untenable from where I sit. Agreed. Whatever we support, the behavior needs to be described by *one* chunk of information --- a sorted list of bin bounding values, perhaps. This would work for some interesting generalization of sorted. Maybe going back to the mathematical definition of partition could bear more fruit. All that's needed for that is an equivalence relation, however it's denoted. 
In practical terms, you'd need to have a quick way to enumerate the equivalence classes and another to establish whether equivalence holds. Cheers, David. -- David Fetter da...@fetter.org http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE} 2.0
On Tue, Feb 10, 2015 at 12:04 AM, Andres Freund and...@2ndquadrant.com wrote: FWIW, I don't think anything here really should refer to the wiki... The Wiki pages have done a good job of summarizing things...it certainly didn't seem that hard for you to get up to speed here. Also, as you'll know from working on logical decoding, it's hard to know what is complex and esoteric and what is relatively accessible when you're this close to a big project. I recall that you said as much before. I'm focused on signposting so that reviewers can follow what each patch does with the minimum of effort (with reference to the Wiki or whatever). I see no reason to not do things that way until commit...it feels like there is less chance of the concepts going over people's head this way. I don't really think there actually is that much common inbetween those. It looks to me that most of the code in heap_delete isn't actually relevant for this case and could be cut short. My guess is that only the WAL logging would be separated out. I'll think about that some more. * I doubt logical decoding works with the patch as it stands. I thought so. Perhaps you could suggest a better use of the available XLOG_HEAP_* bits. I knew I needed to consider that more carefully (hence the XXX comment), but didn't get around to it. I think you probably need to add test macros that make sure only the individual bits are sets, and not the combination and then only use those. This too. * /* * This may occur when an instantaneously invisible tuple is blamed * as a conflict because multiple rows are inserted with the same * constrained values. How can this happen? We don't insert multiple rows with the same command id? This is a cardinality violation [1]. It can definitely happen - just try the examples you see on the Wiki. I don't care about the wiki from the point of code comments. This needs to be understandable in five years time. 
That wasn't clear before - you seemed to me to be questioning if this was even possible. There is a section in the INSERT SQL reference page about cardinality violations, so we're certainly talking about something that a code reader likely heard of. Also, the nitty gritty showing various scenarios on the Wiki is the quickest way to know what is possible (but is much too long winded for user visible documentation or code comments). * Perhaps it has previously been discussed but I'm not convinced by the reasoning for not looking at opclasses in infer_unique_index(). This seems like it'd prohibit ever having e.g. case insensitive opclasses - something surely worthwile. I don't think anyone gave that idea the thumbs-up. However, I really don't see the problem. Sure, we could have case insensitive opclasses in the future, and you may want to make a unique index using one. Then the problem suddenly becomes that previous choices of indexes/statements aren't possible anymore. It seems much better to introduce the syntax now and not have too much of a usecase for it. The only way the lack of a way of specifying which opclass to use could bite us is if there were two *actually* defined unique indexes on the same column, each with different equality operators. That seems like kind of a funny scenario, even if that were quite possible (even if non-default opclasses existed that had a non-identical equality operators, which is basically not the case today). I grant that is a bit odd that we're talking about unique indexes definitions affecting semantics, but that is to a certain extent the nature of the beast. 
As a compromise, I suggest having the inference specification optionally accept a named opclass per attribute, in the style of CREATE INDEX (I'm already reusing a bit of the raw parser support for CREATE INDEX, you see) - that'll make inference insist on that opclass, rather than make it a strict matter of costing available alternatives (not that any alternative is expected with idiomatic usage). That implies no additional parser overhead, really. If that's considered ugly, then at least it's an ugly thing that literally no one will ever use in the foreseeable future...and an ugly thing that is no more necessary in CREATE INDEX than here (and yet CREATE INDEX lives with the ugliness). It's really not about me understanding it right now, but about longer term. Sure. Can you add a UPSERT test for logical decoding? I doubt it'll work right now, even in the repost. Okay. * /* * Immediately VACUUM super-deleted tuples */ if (TransactionIdEquals(HeapTupleHeaderGetRawXmin(tuple), InvalidTransactionId)) return HEAPTUPLE_DEAD; Is that branch really needed? Shouldn't it just be happening as a consequence of the already existing code? Same in SatisfiesMVCC. If you actually needed that block, it'd need to be done in SatisfiesSelf as well, no? You have a comment about a possible loop - but that seems wrong to me,
[HACKERS] Manipulating complex types as non-contiguous structures in-memory
I've been fooling around with a design to support computation-oriented, not-necessarily-contiguous-blobs representations of datatypes in memory, along the lines I mentioned here: http://www.postgresql.org/message-id/2355.1382710...@sss.pgh.pa.us In particular this is meant to reduce the overhead for repeated operations on arrays, records, etc. We've had several previous discussions about that, and even some single-purpose patches such as in this thread: http://www.postgresql.org/message-id/flat/cafj8prakudu_0md-dg6ftk0wsupvmlyrv1pb+hyc+gubzz3...@mail.gmail.com There was also a thread discussing how this sort of thing could be useful to PostGIS: http://www.postgresql.org/message-id/526a61fb.1050...@oslandia.com and it's been discussed a few other times too, but I'm too lazy to search the archives any further. I've now taken this idea as far as building the required infrastructure and revamping a couple of array operators to use it. There's a lot yet to do, but I've done enough to get some preliminary ideas about performance (see below). The core ideas of this patch are:

* Invent a new TOAST datum type, "pointer to deserialized object", which is physically just like the existing indirect-toast-pointer concept, but it has a different va_tag code and somewhat different semantics.

* A deserialized object has a standard header (which is what the toast pointers point to) and typically will have additional data-type-specific fields after that. One component of the standard header is a pointer to a set of "method" functions that provide ways to accomplish standard data-type-independent operations on the deserialized object.

* Another standard header component is a MemoryContext identifier: the header, as well as all subsidiary data belonging to the deserialized object, must live in this context. (Well, I guess there could also be child contexts.) 
By exposing an explicit context identifier, we can accomplish tasks like "move this object into another context" by reparenting the object's context rather than physically copying anything.

* The only standard methods I've found a need for so far are functions to re-serialize the object, that is, generate a plain varlena value that is semantically equivalent. To avoid extra copying, this is split into separate "compute the space needed" and "serialize into this memory" steps, so that the result can be dropped exactly where the caller needs it.

* Currently, a deserialized object will be reserialized in that way whenever we incorporate it into a physical tuple (ie, heap_form_tuple or index_form_tuple), or whenever somebody applies datumCopy() to it. I'd like to relax this later, but there's an awful lot of code that supposes that heap_form_tuple or datumCopy will produce a self-contained value that survives beyond, eg, destruction of the memory context that contained the source Datums. We can get good speedups in a lot of interesting cases without solving that problem, so I don't feel too bad about leaving it as a future project.

* In particular, things like PG_GETARG_ARRAYTYPE_P() treat a deserialized toast pointer as something to be detoasted, and will produce a palloc'd re-serialized value. This means that we do not need to convert all the C functions concerned with a given datatype at the same time (or indeed ever); a function that hasn't been upgraded will build a re-serialized representation and then operate on that. We'll invent alternate argument-fetching functions that skip the reserialization step, for use by functions that have been upgraded to handle either case. This is basically the same approach we used when we introduced short varlena headers, and that seems to have gone smoothly enough. 
* There's a concept that a deserialized object has a "primary" toast pointer, which is physically part of the object, as well as "secondary" toast pointers which might or might not be part of the object. If you have a Datum pointer to the primary toast pointer then you are authorized to modify the object in-place; if you have a Datum pointer to a secondary toast pointer then you must treat it as read-only (ie, you have to make a copy if you're going to change it). Functions that construct a new deserialized object always return its primary toast pointer; this allows a nest of functions to modify an object in-place without copying, which was the primary need that the PostGIS folks expressed. On the other hand, plpgsql can hand out secondary toast pointers to deserialized objects stored in plpgsql function variables, thus ensuring that the objects won't be modified unexpectedly, while never having to physically copy them if the called functions just need to inspect them.

* Primary and secondary pointers are physically identical, but the primary pointer resides in a specific spot in the deserialized object's standard header. (So you can tell if you've got the primary pointer via a simple address comparison.)

* I've modified the array element assignment path in
Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory
On 2/10/15 5:19 PM, Tom Lane wrote: Jim Nasby jim.na...@bluetreble.com writes: Without having read the patch, I think this is great. I've been wishing for something like this while working on my variant data type. Are there any cases where we would want to use this on a non-variant? Perhaps types where we're paying an alignment penalty? What do you mean by non-variant? Ugh, sorry, brainfart. I meant to say non-varlena. I can't think of any non-varlena types we'd want this for, but maybe someone else can think of a case. If there is a use-case I wouldn't handle it with this patch, but we'd want to consider it... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] reducing our reliance on MD5
On 2/10/15 6:32 PM, Peter Geoghegan wrote: On Tue, Feb 10, 2015 at 4:21 PM, Robert Haas robertmh...@gmail.com wrote: Although the patch was described as relatively easy to write, it never went anywhere, because it *replaced* MD5 authentication with bcrypt, which would be a big problem for existing clients. It seems clear that we should add something new and not immediately kill off what we've already got, so that people can transition smoothly. An idea I just had today is to keep using basically the same system that we are currently using for MD5, but with a stronger hash algorithm, like SHA-1 or SHA-2 (which includes SHA-224, SHA-256, SHA-384, and SHA-512). Those are slower, but my guess is that even SHA-512 is not enough slower for anybody to care very much, and if they do, well that's another reason to make use of the new stuff optional. I believe that a big advantage of bcrypt for authentication is the relatively high memory requirements. This frustrates GPU based attacks. This is especially critical if things like bitcoin ASIC rigs could be put to use generating generic SHA256 hashes. A few grand will buy you hardware that can generate trillions of hash values per second. AFAIK there's no specialized hardware for scrypt though (even though it's used for other cryptocurrencies), in part because it's not cost effective to put enough memory in place. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] GRANT USAGE on FOREIGN SERVER exposes passwords
On 2/5/15 10:48 AM, Tom Lane wrote: Stephen Frost sfr...@snowman.net writes: * Robert Haas (robertmh...@gmail.com) wrote: On Thu, Feb 5, 2015 at 10:48 AM, Stephen Frost sfr...@snowman.net wrote: And I thought this was about FDW options and not about dblink, really.. The OP is pretty clearly asking about dblink. I was just pointing out that it was an issue that all FDWs suffer from, since we don't have any way for an FDW to say "don't show this option", as discussed. The dblink example is entirely uncompelling, given that as you said somebody with access to a dblink connection could execute ALTER USER on the far end. Actually, you can eliminate that by not granting direct access to dblink functions. Instead you create a SECURITY DEFINER function that sanity checks the SQL you're trying to run and rejects things like ALTER USER. While you're doing that, you can also lock away the connection information. A former coworker actually built a system that does this, at least to a limited degree. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
[HACKERS] Show the LSN in rm_redo_error_callback
Hi, I've repeatedly stared at logs containing PANICs during WAL replay. Unfortunately it's rather hard to find out which record actually caused the problem, as we print the record's contents but not its LSN. I think it's a no-brainer to add that to master. But I'd even argue that it'd be a good idea to add it to the back branches - there've been a significant number of bugs causing PANICs due to replay errors in the past (and we might be hunting one more of them right now). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory
Without having read the patch, I think this is great. I've been wishing for something like this while working on my variant data type. Are there any cases where we would want to use this on a non-variant? Perhaps types where we're paying an alignment penalty? On 2/10/15 3:00 PM, Stephen Frost wrote: BTW, I'm not all that thrilled with the deserialized object terminology. I found myself repeatedly tripping up on which form was serialized and which de-. If anyone's got a better naming idea I'm willing to adopt it. Unfortunately, nothing comes to mind. Serialization is, at least, a pretty well understood concept and so the naming will likely make sense to newcomers, even if it's difficult to keep track of which is serialized and which is deserialized. Apologies if I'm just dense, but what's the confusion? Is it what a serialized datum means in this context? (de)serialized seems like a perfectly logical name to me... I'm not sure exactly how to push this forward. I would not want to commit it without converting a significant number of array functions to understand about deserialized inputs, and by the time I've finished that work it's likely to be too late for 9.5. OTOH I'm sure that the PostGIS folk would love to have this infrastructure in 9.5 not 9.6 so they could make a start on fixing their issues. (Further down the pike, I'd plan to look at adapting composite-type operations, JSONB, etc, to make use of this approach, but that certainly isn't happening for 9.5.) Thoughts, advice, better ideas? I'm not really a big fan of putting an infrastructure out there for modules to use that we don't use ourselves (particularly when it's clear that there are places where we could/should be). On the other hand, this doesn't impact on-disk format and therefore I'm less worried that we'll end up with a release-critical issue when we're getting ready to put 9.5 out there. So, I'm on the fence about it. 
I'd love to see all of this in 9.5 with the array functions converted, but I don't think it'd be horrible if only a subset had been done in time for 9.5. The others aren't going to go anywhere and will still work. I do think it'd be better to have at least some core users of this new infrastructure rather than just putting it out there for modules to use but I agree it'd be a bit grotty to have only some of the array functions converted. I think the solution here is to have people other than Tom do the gruntwork of applying this to the remaining array code. My thought is that if Tom shows how to do this correctly in a rather complex case (ie, where you need to worry about primary vs secondary), then less experienced hackers should be able to take the ball and run with it. Maybe we won't get complete array coverage, but I think any performance gains here are a win. And really, don't we just need enough usage so the buildfarm will tell us if we accidentally break something? -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory
[ this is addressing a tangential point ... ] Stephen Frost sfr...@snowman.net writes: * Tom Lane (t...@sss.pgh.pa.us) wrote: * Although I said above that everything owned by a deserialized object has to live in a single memory context, I do have ideas about relaxing that. The core idea would be to invent a memory context reset/delete callback feature in mcxt.c. Then a deserialized object could register such a callback on its own memory context, and use the callback to clean up resources outside its context. Being able to register a callback to be used on deletion of the context would certainly be very nice and strikes me as pretty independent of the rest of this. You've probably thought of this already, but registering the callback should probably allow the caller to pass in a pointer to be passed back to the callback function when the delete happens, so that there's a place for the metadata to be stored about what the callback function needs to clean up when it's called. Yeah, there would likely be use-cases for that outside of deserialized objects. I could submit a separate patch for that now, but I'm hesitant to add a mechanism without any use-case in the same patch. But maybe we could find a caller somewhere in the core aggregate code --- there are some aggregates that need cleanup callbacks already, IIRC, and maybe we could change them to use a memory context callback instead of whatever they're doing now. regards, tom lane
[HACKERS] reducing our reliance on MD5
There have been a few previous attempts to wean PostgreSQL off of MD5. Back in 2012, Heikki proposed using SCRAM for our next-generation authentication mechanism: http://www.postgresql.org/message-id/507550bd.2030...@vmware.com That seems likely to be a good idea, but nobody's come up with a patch. Peter Bex *did* come up with a patch to replace MD5 authentication with bcrypt, which I guess is based on Blowfish: http://www.postgresql.org/message-id/20121227144638.gc...@frohike.homeunix.org Although the patch was described as relatively easy to write, it never went anywhere, because it *replaced* MD5 authentication with bcrypt, which would be a big problem for existing clients. It seems clear that we should add something new and not immediately kill off what we've already got, so that people can transition smoothly. An idea I just had today is to keep using basically the same system that we are currently using for MD5, but with a stronger hash algorithm, like SHA-1 or SHA-2 (which includes SHA-224, SHA-256, SHA-384, and SHA-512). Those are slower, but my guess is that even SHA-512 is not enough slower for anybody to care very much, and if they do, well that's another reason to make use of the new stuff optional. Maybe we could make the authentication handshake pluggable:

typedef bool (*password_encryption_fn)(const char *passwd, const char *salt,
                                       size_t salt_len, char *buf); /* like pg_md5_encrypt */
extern void register_encrypted_authentication_type(const char *name,
                                                   password_encryption_fn fn);

Then we could add a _PG_init() function to pgcrypto that would enable the use of whatever pgcrypto has got to offer as authentication handshake methods. This has a couple of advantages. First, we don't have to suck pgcrypto into core, something we've been reluctant to do; second, if MD5 or any successor algorithm is shown to be insecure, users can just drop in a new one, either provided by us or one they write themselves or something in an external library they have and can integrate with. 
There are a few complications:

1. Each encryption algorithm requires a wire-protocol constant to identify it, and the client and the server have to agree on that value. So we'd have to define constants for whatever algorithms we want to provide via the core distribution, and possibly either maintain a registry of other values or at least set aside a space (values above 1e6, maybe) that is reserved for use by extensions.

2. We'd have to figure out how to get support for the new algorithms into libpq. This actually seems a good bit harder than doing it on the server-side, because I don't think libpq has any dynamic loading facilities the way the server does.

Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 4:21 PM, Robert Haas robertmh...@gmail.com wrote: Although the patch was described as relatively easy to write, it never went anywhere, because it *replaced* MD5 authentication with bcrypt, which would be a big problem for existing clients. It seems clear that we should add something new and not immediately kill off what we've already got, so that people can transition smoothly. An idea I just had today is to keep using basically the same system that we are currently using for MD5, but with a stronger hash algorithm, like SHA-1 or SHA-2 (which includes SHA-224, SHA-256, SHA-384, and SHA-512). Those are slower, but my guess is that even SHA-512 is not enough slower for anybody to care very much, and if they do, well that's another reason to make use of the new stuff optional. I believe that a big advantage of bcrypt for authentication is the relatively high memory requirements. This frustrates GPU based attacks. -- Peter Geoghegan
Re: [HACKERS] What exactly is our CRC algorithm?
On 01/09/2015 10:32 AM, Abhijit Menon-Sen wrote: 2. The sse4.2 patch has only some minor compile fixes. I have built and tested both patches individually on little-endian (amd64) and big-endian (ppc) systems. I verified that the _sse variant is chosen at startup on the former, and _sb8 on the latter, and that both implementations function correctly with respect to HEAD. I'd like to arrange this somewhat differently, to keep the really low-level assembler blocks and such separate from the higher level parts. Especially now that pg_crc.c is in src/common and src/port, it doesn't seem appropriate to have assembler code in there. Also, some of the high-level code can be reused if we later add support e.g. for the ARMv8 CRC instructions. I propose that we add a new header file in src/port, let's call it crc_instructions.h. That file contains the very low-level stuff, the pg_asm_crc32q() and pg_asm_crc32b() inline functions, which just contain the single assembler instruction. Or the corresponding mnemonic or macro depending on the compiler; the point of the crc_instructions.h header is to hide the differences between compilers and architectures. If the CRC instructions are available, crc_instructions.h defines PG_HAVE_CRC32C_INSTRUCTIONS, as well as the pg_asm_crc32q() and pg_asm_crc32b() macros/functions. It also defines pg_crc32_instructions_runtime_check(), to perform the runtime check to determine whether the instructions can be used on the current platform (i.e. if you're running on a CPU with SSE 4.2 support). There's another symbol, PG_CRC32C_INSTRUCTIONS_NEED_RUNTIME_CHECK, which indicates whether the runtime check is needed. That's currently always defined when PG_HAVE_CRC32C_INSTRUCTIONS is, but conceivably you might want to build a version that skips the runtime check altogether, and doesn't therefore require the slicing-by-8 fallback implementation at all. Gentoo users might like that ;-), as well as possible future architectures that always have CRC32 instructions. 
Attached is a patch along those lines. I haven't done much testing, but it demonstrates the code structure I have in mind. A couple of remarks on your original patch, which also apply to the attached version:

+static inline pg_crc32
+pg_asm_crc32b(pg_crc32 crc, unsigned char data)
+{
+#if defined(__GNUC__) && defined(__x86_64__)
+	__asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" (data));
+	return crc;
+#elif defined(_MSC_VER)
+	return _mm_crc32_u8(crc, data);
+#else
+	/* Can't generate crc32b, but keep the compiler quiet. */
+	return 0;
+#endif

I believe the CRC instructions are also available in 32-bit mode, so the check for __x86_64__ is not needed. Right? Any CPU recent enough to have these instructions certainly supports the x86-64 instruction set, but you might nevertheless have compiled and be running in i386 mode. It would be nice to use GCC's builtin intrinsics, __builtin_ia32_crc32*, where available, but I guess those are only available if you build with -msse4.2, and that would defeat the point of the runtime check. I believe the _mm_* mnemonics would also work with the Intel compiler. - Heikki From 7e16b0d5be39f832769da463f069c51afa04fb57 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas heikki.linnakangas@iki.fi Date: Tue, 10 Feb 2015 14:26:24 +0200 Subject: [PATCH 1/1] Use Intel SSE4.2 CRC instructions where available. On x86, perform a runtime check to see if we're running on a CPU that supports SSE 4.2. If we are, we can use the special crc32b and crc32q instructions for the CRC-32C calculations. That greatly speeds up CRC calculation. Abhijit Menon-Sen, reviewed by Andres Freund and me. 
--- configure | 2 +- configure.in| 2 +- src/common/pg_crc.c | 109 src/include/common/pg_crc.h | 12 - src/include/pg_config.h.in | 3 ++ 5 files changed, 114 insertions(+), 14 deletions(-) diff --git a/configure b/configure index fa271fe..c352128 100755 --- a/configure +++ b/configure @@ -9204,7 +9204,7 @@ fi done -for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h +for ac_header in atomic.h cpuid.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h do : as_ac_Header=`$as_echo ac_cv_header_$ac_header | $as_tr_sh` ac_fn_c_check_header_mongrel $LINENO $ac_header $as_ac_Header $ac_includes_default diff --git a/configure.in b/configure.in index e6a49d1..588d626 100644
[HACKERS] Assertion failure when streaming logical changes
Hi all, Using test_decoding on HEAD (cc761b1) I am seeing the following assertion failure: TRAP: FailedAssertion(!(!((RegisteredSnapshots)-ph_root == ((void*)0))), File: snapmgr.c, Line: 677) (lldb) bt * thread #1: tid = 0x, 0x7fff8b246d46 libsystem_kernel.dylib`__kill + 10, stop reason = signal SIGSTOP * frame #0: 0x7fff8b246d46 libsystem_kernel.dylib`__kill + 10 frame #1: 0x7fff8861bf83 libsystem_c.dylib`abort + 177 frame #2: 0x00010b556dd9 postgres`ExceptionalCondition(conditionName=0x00010b66ffcb, errorType=0x00010b5c105d, fileName=0x00010b66fe68, lineNumber=677) + 137 at assert.c:54 frame #3: 0x00010b5af952 postgres`UnregisterSnapshotFromOwner(snapshot=0x7faea38c22d8, owner=0x7faea38a6838) + 146 at snapmgr.c:677 frame #4: 0x00010b5af8b2 postgres`UnregisterSnapshot(snapshot=0x7faea38c22d8) + 50 at snapmgr.c:663 frame #5: 0x00010b0166ce postgres`systable_endscan(sysscan=0x7faea38cf090) + 110 at genam.c:504 frame #6: 0x00010b5474b8 postgres`RelationBuildTupleDesc(relation=0x000114932e68) + 952 at relcache.c:568 frame #7: 0x00010b53e45c postgres`RelationBuildDesc(targetRelId=3455, insertIt='\x01') + 604 at relcache.c:1035 frame #8: 0x00010b53d564 postgres`RelationIdGetRelation(relationId=3455) + 324 at relcache.c:1777 frame #9: 0x00010aff093c postgres`relation_open(relationId=3455, lockmode=1) + 108 at heapam.c:1047 frame #10: 0x00010b016ac9 postgres`index_open(relationId=3455, lockmode=1) + 25 at indexam.c:167 frame #11: 0x00010b01603d postgres`systable_beginscan(heapRelation=0x0001149214d0, indexId=3455, indexOK='\x01', snapshot=0x, nkeys=2, key=0x7fff54c6a990) + 93 at genam.c:334 frame #12: 0x00010b54a976 postgres`RelidByRelfilenode(reltablespace=0, relfilenode=12709) + 742 at relfilenodemap.c:204 frame #13: 0x00010b3353d9 postgres`ReorderBufferCommit(rb=0x7faea38c1038, xid=1001, commit_lsn=23817712, end_lsn=23818128, commit_time=476842122105685) + 665 at reorderbuffer.c:1338 frame #14: 0x00010b330ccb postgres`DecodeCommit(ctx=0x7faea38b0838, 
buf=0x7fff54c6ad58, xid=1001, dboid=16384, commit_time=476842122105685, nsubxacts=0, sub_xids=0x7faea3885530, ninval_msgs=22, msgs=0x7faea3885530) + 443 at decode.c:548 frame #15: 0x00010b32f663 postgres`DecodeXactOp(ctx=0x7faea38b0838, buf=0x7fff54c6ad58) + 547 at decode.c:210 frame #16: 0x00010b32f10e postgres`LogicalDecodingProcessRecord(ctx=0x7faea38b0838, record=0x7faea38b0af8) + 142 at decode.c:103 frame #17: 0x00010b3433f5 postgres`XLogSendLogical + 165 at walsender.c:2425 frame #18: 0x00010b3431ad postgres`WalSndLoop(send_data=0x00010b343350) + 269 at walsender.c:1834 frame #19: 0x00010b341938 postgres`StartLogicalReplication(cmd=0x7faea38846a8) + 568 at walsender.c:997 frame #20: 0x00010b34021c postgres`exec_replication_command(cmd_string=0x7faea3833a38) + 524 at walsender.c:1326 frame #21: 0x00010b3abaab postgres`PostgresMain(argc=1, argv=0x7faea3803fc8, dbname=0x7faea3803ec0, username=0x7faea3803ea0) + 2475 at postgres.c:4022 frame #22: 0x00010b312a2e postgres`BackendRun(port=0x7faea3405d60) + 686 at postmaster.c:4141 frame #23: 0x00010b311ff0 postgres`BackendStartup(port=0x7faea3405d60) + 384 at postmaster.c:3826 frame #24: 0x00010b30e7f7 postgres`ServerLoop + 663 at postmaster.c:1594 frame #25: 0x00010b30c017 postgres`PostmasterMain(argc=3, argv=0x7faea34044d0) + 5623 at postmaster.c:1241 frame #26: 0x00010b24e11d postgres`main(argc=3, argv=0x7faea34044d0) + 749 at main.c:221 The problem can be easily reproduced using pg_recvlogical after creating a logical slot plugged with test_decoding. Regards, -- Michael
Re: [HACKERS] Manipulating complex types as non-contiguous structures in-memory
Jim Nasby jim.na...@bluetreble.com writes: Without having read the patch, I think this is great. I've been wishing for something like this while working on my variant data type. Are there any cases where we would want to use this on a non-variant? Perhaps types where we're paying an alignment penalty? What do you mean by non-variant? The use cases that have come to mind for me are:

* arrays, of course
* composite types (records)
* PostGIS geometry type
* JSONB, hstore
* possibly regex patterns (we could invent a data type representing these and then store the compiled form as a deserialized representation; although there would be some issues to be worked out to get any actual win, probably)

The principal thing that's a bit hard to figure out is when it's a win to convert to a deserialized representation and when you should just leave well enough alone. I'm planning to investigate that further in the context of plpgsql array variables, but I'm not sure how well those answers will carry over to datatypes that plpgsql has no intrinsic understanding of. regards, tom lane
[HACKERS] Typo in logicaldecoding.sgml
I found a small typo in logicaldecoding.sgml. Below are two candidate fixes; I think one of them could be applied. Any advice?

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 3efe433..3650567 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -299,7 +299,7 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
 <para>
  The command <xref linkend="app-pgrecvlogical"> can be used to control
-logical decoding over a streaming replication connection. (It is uses
+logical decoding over a streaming replication connection. (It uses
 these commands internally.)
 </para>
 </sect1>

diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 3efe433..d89545a 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -299,7 +299,7 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
 <para>
  The command <xref linkend="app-pgrecvlogical"> can be used to control
-logical decoding over a streaming replication connection. (It is uses
+logical decoding over a streaming replication connection. (It is using
 these commands internally.)
 </para>
 </sect1>

Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Typo in logicaldecoding.sgml
Do you want to push this yourself or have me do it? If ok, could you please push it? Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Re: [HACKERS] Typo in logicaldecoding.sgml
Hi, On 2015-02-11 08:55:12 +0900, Tatsuo Ishii wrote: I found a small typo in logicaldecoding.sgml. Oops. I think one of them could be a fix. Any advice? I slightly prefer the second form... Do you want to push this yourself or have me do it? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Fetch zero result rows when executing a query?
Thanks for understanding Robert, that's more or less what I had in mind, I was mainly wondering if I were missing some deeper explanation for the absence of the possibility of requesting 0 rows. Regardless of all of the above, it's really no big deal. I'll go ahead and use max_rows=1 for now, hopefully you guys don't decide to deprecate it. Shay On Tue, Feb 10, 2015 at 3:00 PM, Robert Haas robertmh...@gmail.com wrote: On Sun, Feb 8, 2015 at 3:56 AM, Shay Rojansky r...@roji.org wrote: Just to be precise: what is strange to me is that the max_rows feature exists but has no 0 value. You and Marko are arguing that the whole feature should be deprecated (i.e. always return all rows). I think the fact that it has no zero value is probably just a historical accident; most likely, whoever designed it originally (probably twenty years ago) didn't think about queries with side-effects and therefore didn't consider that wanting 0 rows would ever be sensible. Meanwhile, a sentinel value was needed to request all rows, so they used 0. If they'd thought of it, they might have picked -1 and we'd not be having this discussion. FWIW, I'm in complete agreement that it would be good if we had this feature. I believe this is not the first report we've had of PostgreSQL doing things in ways that don't mesh nicely with standardized driver interfaces. Whether we think those interfaces are well-designed or not, they are standardized. When people use $OTHERDB and have a really great driver, and then they move to PostgreSQL and get one with more warts, it does not encourage them to stick with PostgreSQL. .NET is not some fringe user community that we can dismiss as irrelevant. We need users of all languages to want to use PostgreSQL, not just users of languages any one of us happens to personally like. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Assertion failure when streaming logical changes
On 02/10/2015 02:46 PM, Andres Freund wrote: On 2015-02-10 21:20:37 +0900, Michael Paquier wrote: Using test_decoding on HEAD (cc761b1) I am seeing the following assertion failure: TRAP: FailedAssertion("!(!((RegisteredSnapshots)->ph_root == ((void *) 0)))", File: "snapmgr.c", Line: 677) Ick. I guess a revert of 94028691609f8e148bd4ce72c46163f018832a5b fixes it? ... Yea, it really looks like the above commit is to blame. The new xmin tracking infrastructure doesn't know about the historical snapshot... The new xmin tracking code assumes that if a snapshot's regd_count > 0, it has already been pushed to the RegisteredSnapshots heap. That assumption doesn't hold because the logical decoding stuff creates its own snapshots and sets regd_count=1 to prevent snapmgr.c from freeing them, even though they haven't been registered with RegisterSnapshot. The second paragraph in this comment in snapmgr.c needs fixing:

 * Likewise, any snapshots that have been exported by pg_export_snapshot
 * have regd_count = 1 and are counted in RegisteredSnapshots, but are not
 * tracked by any resource owner.
 *
 * The same is true for historic snapshots used during logical decoding,
 * their lifetime is managed separately (as they life longer as one xact.c
 * transaction).

Besides the spelling, that's incorrect: historic snapshots are *not* counted in RegisteredSnapshots. That was harmless before commit 94028691, but it was always wrong. I think setting regd_count=1 outside snapmgr.c is a pretty ugly hack. snapbuild.c also abuses active_count as a reference counter, which is similarly ugly. I think regd_count and active_count should both be treated as private to snapmgr.c, and initialized to zero elsewhere. As a minimal fix, we could change the logical decoding code to not use regd_count to prevent snapmgr.c from freeing its snapshots, but use active_count for that too.
Setting regd_count to 1 in SnapBuildBuildSnapshot() seems unnecessary in the first place, as the callers also call SnapBuildSnapIncRefcount(). Patch attached. But could we get rid of those active_count manipulations too? Could you replace the SnapBuildSnap[Inc|Dec]Refcount calls with [Un]RegisterSnapshot()?

- Heikki

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bcd5896..1f76731 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1188,8 +1188,8 @@ ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
 	memcpy(snap, orig_snap, sizeof(SnapshotData));
 	snap->copied = true;
-	snap->active_count = 0;
-	snap->regd_count = 1;
+	snap->active_count = 1;
+	snap->regd_count = 0;
 	snap->xip = (TransactionId *) (snap + 1);
 	memcpy(snap->xip, orig_snap->xip, sizeof(TransactionId) * snap->xcnt);
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index e911453..80f5b44 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -348,7 +348,7 @@ SnapBuildFreeSnapshot(Snapshot snap)
 	Assert(snap->curcid == FirstCommandId);
 	Assert(!snap->suboverflowed);
 	Assert(!snap->takenDuringRecovery);
-	Assert(snap->regd_count == 1);
+	Assert(snap->regd_count == 0);
 	/* slightly more likely, so it's checked even without c-asserts */
 	if (snap->copied)
@@ -407,16 +407,16 @@ SnapBuildSnapDecRefcount(Snapshot snap)
 	Assert(!snap->suboverflowed);
 	Assert(!snap->takenDuringRecovery);
-	Assert(snap->regd_count == 1);
+	Assert(snap->regd_count == 0);
-	Assert(snap->active_count);
+	Assert(snap->active_count > 0);
 	/* slightly more likely, so it's checked even without casserts */
 	if (snap->copied)
 		elog(ERROR, "cannot free a copied snapshot");
 	snap->active_count--;
-	if (!snap->active_count)
+	if (snap->active_count == 0)
 		SnapBuildFreeSnapshot(snap);
 }
@@ -495,7 +495,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder, TransactionId xid)
 	snapshot->copied = false;
 	snapshot->curcid = FirstCommandId;
 	snapshot->active_count = 0;
-	snapshot->regd_count = 1;	/* mark as registered so nobody frees it */
+	snapshot->regd_count = 0;
 	return snapshot;
 }
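The arrangement Heikki argues for - both counters private to one module, starting at zero, with the snapshot freed only once no reference of either kind remains - can be sketched as standalone toy code. The struct and function names below are hypothetical illustrations, not the actual PostgreSQL API:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy snapshot with the two reference counts discussed in the thread. */
typedef struct ToySnapshot
{
	int		active_count;	/* references from being an "active" snapshot */
	int		regd_count;		/* references from explicit registration */
	int		freed;			/* for demonstration: set instead of free() */
} ToySnapshot;

/* Both counters start at zero; only this module ever touches them. */
static ToySnapshot *
snap_create(void)
{
	return calloc(1, sizeof(ToySnapshot));
}

static void
snap_inc_refcount(ToySnapshot *snap)
{
	snap->active_count++;
}

/* Free the snapshot only when no reference of either kind remains. */
static void
snap_dec_refcount(ToySnapshot *snap)
{
	assert(snap->active_count > 0);
	snap->active_count--;
	if (snap->active_count == 0 && snap->regd_count == 0)
		snap->freed = 1;		/* stands in for actually freeing */
}
```

With this shape, snapbuild.c would never need to seed regd_count=1 just to keep snapmgr.c from freeing the snapshot; holding an active_count reference does that job.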
Re: [HACKERS] reducing our reliance on MD5
On Tue, Feb 10, 2015 at 10:32 PM, Peter Geoghegan p...@heroku.com wrote: On Tue, Feb 10, 2015 at 4:21 PM, Robert Haas robertmh...@gmail.com wrote: Although the patch was described as relatively easy to write, it never went anywhere, because it *replaced* MD5 authentication with bcrypt, which would be a big problem for existing clients. It seems clear that we should add something new and not immediately kill off what we've already got, so that people can transition smoothly. An idea I just had today is to keep using basically the same system that we are currently using for MD5, but with a stronger hash algorithm, like SHA-1 or SHA-2 (which includes SHA-224, SHA-256, SHA-384, and SHA-512). Those are slower, but my guess is that even SHA-512 is not enough slower for anybody to care very much, and if they do, well that's another reason to make use of the new stuff optional. I believe that a big advantage of bcrypt for authentication is the relatively high memory requirements. This frustrates GPU based attacks. -- Peter Geoghegan There's also scrypt, which can be tuned for both memory and compute requirements. I don't think the password storing best practices apply to db connection authentication. So SHA256 (or any other non terribly broken hash) is probably fine for Pg.
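For reference, the scheme Robert proposes keeping is a salted challenge-response: the server stores H(password || username), and for each connection the client answers a random salt with H(stored || salt), so only the hash function H would need to change. The sketch below shows that shape using FNV-1a as a deliberately toy stand-in for H (not MD5 or SHA; all names here are illustrative, not PostgreSQL's):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for the hash function: FNV-1a. The real scheme uses
 * MD5 today and would use SHA-256/512 under the proposal. */
static uint64_t
toy_hash(const char *s)
{
	uint64_t	h = 1469598103934665603ULL;

	for (; *s; s++)
	{
		h ^= (unsigned char) *s;
		h *= 1099511628211ULL;
	}
	return h;
}

/* Hex-encode the hash; out must hold at least 17 bytes. */
static void
hash_hex(const char *s, char *out)
{
	snprintf(out, 17, "%016llx", (unsigned long long) toy_hash(s));
}

/* What the server stores: H(password || username), hex-encoded. */
static void
stored_secret(const char *password, const char *user, char *out)
{
	char		buf[256];

	snprintf(buf, sizeof(buf), "%s%s", password, user);
	hash_hex(buf, out);
}

/* What the client sends for a given per-connection salt:
 * H(stored_secret || salt). The server can verify this knowing only
 * the stored secret, never the cleartext password. */
static void
auth_response(const char *stored, const char *salt, char *out)
{
	char		buf[256];

	snprintf(buf, sizeof(buf), "%s%s", stored, salt);
	hash_hex(buf, out);
}
```

The point of the shape is that swapping H is invisible at the protocol level apart from the response length and a format marker, which is why the same system with a stronger hash is attractive for a smooth transition.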
Re: [HACKERS] Parallel Seq Scan
On 2015-02-10 08:52:09 -0500, Robert Haas wrote: On Tue, Feb 10, 2015 at 2:48 AM, Andres Freund and...@2ndquadrant.com wrote: Note that I'm not saying that Amit's patch is right - I haven't read it - but that I don't think a 'scan this range of pages' heapscan API would be a bad idea. Not even just for parallelism, but for a bunch of usecases. We do have that, already. heap_setscanlimits(). I'm just not convinced that that's the right way to split up a parallel scan. There's too much risk of ending up with a very-uneven distribution of work. If you make the chunks small enough, and then coordinate only the chunk distribution, not really. Regarding tuple flow between backends, I've thought about that before, I agree that we need it, and I don't think I know how to do it. I can see how to have a group of processes executing a single node in parallel, or a single process executing a group of nodes we break off from the query tree and push down to it, but what you're talking about here is a group of processes executing a group of nodes jointly. I don't think it really is that. I think you'd do it essentially by introducing a couple more nodes. Something like

             SomeUpperLayerNode
                     |
                     |
              AggCombinerNode
             /               \
            /                 \
           /                   \
 PartialHashAggNode    PartialHashAggNode ... PartialHashAggNode ...
           |                   |
           |                   |
           |                   |
           |                   |
   PartialSeqScan       PartialSeqScan

The only thing that potentially might need to end up working jointly would be the block selection of the individual PartialSeqScans, to avoid having to wait for stragglers for too long. E.g. each might just ask for a range of 16 megabytes or so that it scans sequentially. In such a plan - a pretty sensible and not that uncommon thing for parallelized aggregates - you'd need to be able to tell the heap scans which blocks to scan. Right? For this case, what I would imagine is that there is one parallel heap scan, and each PartialSeqScan attaches to it. The executor says "give me a tuple" and heapam.c provides one.
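The coordination just described - one shared parallel scan descriptor that each PartialSeqScan attaches to, with heapam.c handing out block ranges - might look roughly like the single-process toy sketch below. Names, fields, and the chunk-handling details are hypothetical; the real thing would live in shared memory behind a spinlock or atomics:

```c
#include <assert.h>

/* Shared state for one parallel scan: which block to hand out next.
 * In PostgreSQL this would sit in shared memory and need locking;
 * a plain struct is enough for a single-process sketch. */
typedef struct ParallelScanDesc
{
	int		nblocks;		/* total blocks in the relation */
	int		next_block;		/* next block not yet assigned to any worker */
	int		chunk_size;		/* blocks handed out per request */
} ParallelScanDesc;

/* Each worker repeatedly asks for the next chunk of blocks to scan.
 * Returns 1 and fills [*start, *end) if work remains, else 0. */
static int
scan_get_chunk(ParallelScanDesc *pscan, int *start, int *end)
{
	if (pscan->next_block >= pscan->nblocks)
		return 0;				/* scan exhausted */
	*start = pscan->next_block;
	*end = *start + pscan->chunk_size;
	if (*end > pscan->nblocks)
		*end = pscan->nblocks;
	pscan->next_block = *end;
	return 1;
}
```

Small chunks keep the distribution of work even when some workers stall, which is the straggler concern raised above; a larger chunk_size trades that evenness for more sequential I/O per worker.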
Details like the chunk size are managed down inside heapam.c, and the executor does not know about them. It just knows that it can establish a parallel scan and then pull tuples from it. I think that's a horrible approach that'll end up with far more entangled pieces than what you're trying to avoid. Unless the tuple flow is organized to only happen in the necessary cases the performance will be horrible. And good chunk sizes et al depend on higher layers, selectivity estimates and such. And that's planner/executor work, not the physical layer (which heapam.c pretty much is). An individual heap scan's state lives in process private memory. And if the results inside the separate workers should directly be used in these workers without shipping over the network it'd be horrible to have the logic in the heapscan. How would you otherwise model an executor tree that does the seqscan and aggregation combined in multiple processes at the same time? Maybe we designate nodes as can-generate-multiple-tuple-streams (seq scan, mostly, I would think) and can-absorb-parallel-tuple-streams (sort, hash, materialize), or something like that, but I'm really fuzzy on the details. I don't think we really should have individual nodes that produce multiple streams - that seems like it'd end up being really complicated. I'd more say that we have distinct nodes (like the PartialSeqScan ones above) that do a teensy bit of coordination about which work to perform. I think we're in violent agreement here, except for some terminological confusion. Are there N PartialSeqScan nodes, one running in each node, or is there one ParallelSeqScan node, which is copied and run jointly across N nodes? You can talk about it either way and have it make sense, but we haven't had enough conversations about this on this list to have settled on a consistent set of vocabulary yet. I pretty strongly believe that it has to be independent scan nodes. Both from an implementation and a conversational POV.
They might have some very light cooperation between them (e.g. coordinating block ranges or such), but everything else should be separate. From an implementation POV it seems pretty awful to have an executor node that's accessed by multiple separate backends - that'd mean it'd have to be concurrency safe, have state in shared memory and everything. Now, there'll be a node that needs to do some parallel magic - but in the