Re: [HACKERS] On partitioning
On Fri, Dec 5, 2014 at 10:03 PM, Robert Haas robertmh...@gmail.com wrote:
> On Tue, Dec 2, 2014 at 10:43 PM, Amit Langote langote_amit...@lab.ntt.co.jp wrote:
>> I wonder if your suggestion of pg_node_tree plays well here. This could then be a list of CONSTs or some such... And I am thinking it's a concern only for range partitions, no? (that is, a multicolumn partition key)
>
> I guess you could list or hash partition on multiple columns, too.

How would you distinguish the values in a list partition on multiple columns? For a range partition we can be sure there will be exactly one value per column in each bound, but for a list partition there could be several values per column, and the number is not fixed across partitions, so I think it will not be easy to support a multicolumn partition key for list partitions.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
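To make the asymmetry concrete, here is a rough SQL sketch; the declarative syntax below is only an illustration (nothing like it existed at the time of this thread), and the table and partition names are made up:

    -- Range: each bound supplies exactly one value per key column.
    CREATE TABLE measurements (logdate date, region int, reading numeric)
        PARTITION BY RANGE (logdate, region);
    CREATE TABLE measurements_p1 PARTITION OF measurements
        FOR VALUES FROM ('2014-01-01', 1) TO ('2015-01-01', 5);

    -- List: each partition would need an arbitrary-sized set of
    -- (logdate, region) combinations, and that set can differ in size from
    -- partition to partition, which is what makes a simple catalog
    -- representation (e.g. a list of CONSTs) awkward.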
Re: [HACKERS] On partitioning
On Fri, Dec 5, 2014 at 10:12 PM, Robert Haas robertmh...@gmail.com wrote:
> On Fri, Dec 5, 2014 at 2:18 AM, Amit Kapila amit.kapil...@gmail.com wrote:
>> Do we really need to support dml or pg_dump for individual partitions?
>
> I think we do. It's quite reasonable for a DBA (or developer or whatever) to want to dump all the data that's in a single partition; for example, maybe they have the table partitioned, but also spread across several servers. When the data on one machine grows too big, they want to dump that partition, move it to a new machine, and drop the partition from the old machine. That needs to be easy and efficient.
>
> More generally, with inheritance, I've seen the ability to reference individual inheritance children be a real life-saver on any number of occasions. Now, a new partitioning system that is not as clunky as constraint exclusion will hopefully be fast enough that people don't need to do it very often any more. But I would be really cautious about removing the option. That is the equivalent of installing a new fire suppression system and then boarding up the emergency exit. Yeah, you *hope* the new fire suppression system is good enough that nobody will ever need to go out that way any more. But if you're wrong, people will die, so getting rid of it isn't prudent. The stakes are not quite so high here, but the principle is the same.

Sure, I don't mean we should provide no way to dump an individual partition, just that it should not be at the level of an independent table. Maybe something like --table table_name --partition partition_name. In general, I think we should try to avoid exposing that partitions are individual tables, as that might hinder future enhancements in that area (for example, if someone later finds a different and better way to arrange the partition data, the currently exposed syntax might leave us feeling blocked).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Mon, Dec 1, 2014 at 12:18 PM, Dilip kumar dilip.ku...@huawei.com wrote:
> On 24 November 2014 11:29, Amit Kapila Wrote,

I have verified that all previous comments are addressed and the new version is much better than the previous version.

> here we are setting each target once and doing for all the tables..

Hmm, theoretically I think the new behaviour could lead to more I/O in certain cases compared to the existing behaviour. The reason for the extra I/O is that in the new behaviour, while doing Analyze for a particular table at different targets, Analyze of a different table happens in between, so the pages in shared buffers or the OS cache for a particular table need to be reloaded again for the next target, whereas currently all stages of Analyze for a particular table are done in one go, which means each stage of Analyze can benefit from the pages of the table loaded by the previous stage. If you agree, then we should try to avoid this change in behaviour. Please provide your opinion.

I have a few questions regarding function GetIdleSlot():

+ static int
+ GetIdleSlot(ParallelSlot *pSlot, int max_slot, const char *dbname,
+             const char *progname, bool completedb)
{
..
+   /*
+    * Some of the slot are free, Process the results for slots whichever
+    * are free
+    */
+   do
+   {
+       SetCancelConn(pSlot[0].connection);
+
+       i = select_loop(maxFd, slotset);
+
+       ResetCancelConn();
+
+       if (i < 0)
+       {
+           /*
+            * This can only happen if user has sent the cancel request using
+            * Ctrl+C, Cancel is handled by 0th slot, so fetch the error result.
+            */
+           GetQueryResult(pSlot[0].connection, dbname, progname,
+                          completedb);
+           return NO_SLOT;
+       }
+
+       Assert(i != 0);
+
+       for (i = 0; i < max_slot; i++)
+       {
+           if (!FD_ISSET(pSlot[i].sock, slotset))
+               continue;
+
+           PQconsumeInput(pSlot[i].connection);
+           if (PQisBusy(pSlot[i].connection))
+               continue;
+
+           pSlot[i].isFree = true;
+
+           if (!GetQueryResult(pSlot[i].connection, dbname, progname,
+                               completedb))
+               return NO_SLOT;
+
+           if (firstFree < 0)
+               firstFree = i;
+       }
+   } while (firstFree < 0);
}

I wanted to understand what exactly the above loop is doing.

a. First of all, the comment on top of it says "Some of the slot are free, ...". If some slot is free, then why do you want to process the results? (Do you mean to say that *none* of the slots is free?)

b. IIUC, you have called select_loop(maxFd, slotset) to check if a socket descriptor is readable; if so, then why is the same maxFd checked on every iteration of the do..while loop? Don't you want to check different socket descriptors? I am not sure if I am missing something here.

c. After checking the socket descriptor for maxFd, why do you want to run the below for loop for all slots?

    for (i = 0; i < max_slot; i++)

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel Seq Scan
* Amit Kapila (amit.kapil...@gmail.com) wrote:
> 1. As the patch currently stands, it just shares the relevant data (like relid, target list, block range each worker should perform on etc.) to the worker and then worker receives that data and forms the planned statement which it will execute and send the results back to master backend. So the question here is do you think it is reasonable or should we try to form the complete plan for each worker and then share the same and may be other information as well like range table entries which are required. My personal gut feeling in this matter is that for long term it might be better to form the complete plan of each worker in master and share the same, however I think the current way as done in patch (okay that needs some improvement) is also not bad and quite easier to implement.

For my 2c, I'd like to see it support exactly what the SeqScan node supports and then also what Foreign Scan supports. That would mean we'd then be able to push filtering down to the workers which would be great. Even better would be figuring out how to parallelize an Append node (perhaps only possible when the nodes underneath are all SeqScan or ForeignScan nodes) since that would allow us to then parallelize the work across multiple tables and remote servers.

One of the big reasons why I was asking about performance data is that, today, we can't easily split a single relation across multiple i/o channels. Sure, we can use RAID and get the i/o channel that the table sits on faster than a single disk and possibly fast enough that a single CPU can't keep up, but that's not quite the same. The historical recommendation for Hadoop nodes is around one CPU per drive (of course, it'll depend on workload, etc, etc, but still) and while there's still a lot of testing, etc, to be done before we can be sure about the 'right' answer for PG (and it'll also vary based on workload, etc), that strikes me as a pretty reasonable rule-of-thumb to go on. Of course, I'm aware that this won't be as easy to implement..

> 2. Next question related to above is what should be the output of ExplainPlan, as currently worker is responsible for forming its own plan, Explain Plan is not able to show the detailed plan for each worker, is that okay?

I'm not entirely following this. How can the worker be responsible for its own plan when the information passed to it (per the above paragraph..) is pretty minimal? In general, I don't think we need to have specifics like "this worker is going to do exactly X" because we will eventually need some communication to happen between the worker and the master process where the worker can ask for more work because it's finished what it was tasked with and the master will need to give it another chunk of work to do. I don't think we want exactly what each worker process will do to be fully formed at the outset because, even with the best information available, given concurrent load on the system, it's not going to be perfect and we'll end up starving workers. The plan, as formed by the master, should be more along the lines of "this is what I'm gonna have my workers do", along w/ how many workers, etc, and then it goes and does it. Perhaps for an 'explain analyze' we return information about what workers actually *did* what, but that's a whole different discussion.

> 3. Some places where optimizations are possible:
>
> - Currently after getting the tuple from heap, it is deformed by worker and sent via message queue to master backend, master backend then forms the tuple and sends it to the upper layer which, before sending it to frontend, again deforms it via slot_getallattrs(slot).

If this is done as I was proposing above, we might be able to avoid this, but I don't know that it's a huge issue either way.. The bigger issue is getting the filtering pushed down.

> - Master backend currently receives the data from multiple workers serially. We can optimize in a way that it can check other queues, if there is no data in current queue.

Yes, this is pretty critical. In fact, it's one of the recommendations I made previously about how to change the Append node to parallelize Foreign Scan node work.

> - Master backend is just responsible for coordination among workers. It shares the required information to workers and then fetches the data processed by each worker; by using some more logic, we might be able to make the master backend also fetch data from heap rather than doing just co-ordination among workers.

I don't think this is really necessary...

> I think in all above places we can do some optimisation, however we can do that later as well, unless they hit the performance badly for cases which people care most.

I agree that we can improve the performance through various optimizations later, but it's important to get the general structure and design right or we'll end up having to reimplement a lot of it.

> 4. Should parallel_seqscan_degree value be
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Sat, Dec 6, 2014 at 12:17 AM, Andres Freund and...@2ndquadrant.com wrote:
> On 2014-12-06 00:10:11 +0900, Michael Paquier wrote:
>> On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier michael.paqu...@gmail.com wrote:
>>> On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed rahilasyed...@gmail.com wrote:
>>>> I attempted quick review and could not come up with much except this
>>>>
>>>> + /*
>>>> +  * Calculate the amount of FPI data in the record. Each backup block
>>>> +  * takes up BLCKSZ bytes, minus the hole length.
>>>> +  *
>>>> +  * XXX: We peek into xlogreader's private decoded backup blocks for the
>>>> +  * hole_length. It doesn't seem worth it to add an accessor macro for
>>>> +  * this.
>>>> +  */
>>>> + fpi_len = 0;
>>>> + for (block_id = 0; block_id <= record->max_block_id; block_id++)
>>>> + {
>>>> +     if (XLogRecHasCompressedBlockImage(record, block_id))
>>>> +         fpi_len += BLCKSZ - record->blocks[block_id].compress_len;
>>>>
>>>> IIUC, fpi_len in case of compressed block image should be
>>>> fpi_len = record->blocks[block_id].compress_len;
>>>
>>> Yep, true.
>>
>> Patches need a rebase btw as Heikki fixed a commit related to the stats of pg_xlogdump. In any case, any opinions to switch this patch as Ready for committer?
>
> Needing a rebase is an obvious conflict to that... But I guess some wider looks afterwards won't hurt.

Here are rebased versions, which are patches 1 and 2. And I am switching as well the patch to Ready for Committer.

The important point to consider for this patch is the use of the additional 2 bytes as uint16 in the block information structure to save the length of a compressed block, which may be compressed without its hole to achieve a double level of compression (image compressed without its hole). We may use a simple flag on one or two bits, using for example a bit from hole_length, but in this case we would need to always compress images with their hole included, something more expensive as the compression would take more time.

Robert wrote:
> What I would suggest is instrument the backend with getrusage() at startup and shutdown and have it print the difference in user time and system time. Then, run tests for a fixed number of transactions and see how the total CPU usage for the run differs.

That's a nice idea, which is done with patch 3 as a simple hack calling getrusage twice, at the beginning of PostgresMain and before proc_exit, calculating the time difference and logging it for each process (used as well log_line_prefix with %p). Then I just did a small test with a load of a pgbench-scale-100 database on fresh instances:

1) Compression = on:
   Stop LSN: 0/487E49B8
   getrusage: proc 11163: LOG: user diff: 63.071127, system diff: 10.898386
   pg_xlogdump: FPI size: 122296653 [90.52%]

2) Compression = off:
   Stop LSN: 0/4E54EB88
   Result: proc 11648: LOG: user diff: 43.855212, system diff: 7.857965
   pg_xlogdump: FPI size: 204359192 [94.10%]

And the CPU consumption is showing quite some difference... I'd expect as well pglz_compress to show up high in a perf profile for this case (don't have the time to do that now, but a perf record -a -g would be fine I guess).

Regards,
--
Michael

From 63264c7bc1fc22c647b73f67a41224f6b6ad3150 Mon Sep 17 00:00:00 2001
From: Michael Paquier michael@otacoo.com
Date: Tue, 25 Nov 2014 14:05:59 +0900
Subject: [PATCH 1/3] Move pg_lzcompress.c to src/common

Exposing compression and decompression APIs of pglz makes possible its use by extensions and contrib modules. pglz_decompress contained a call to elog to emit an error message in case of corrupted data. This function is changed to return a status code to let its callers return an error instead.
Compression function is changed similarly to make the whole set consistent.
---
 src/backend/access/heap/tuptoaster.c  |  11 +-
 src/backend/utils/adt/Makefile        |   4 +-
 src/backend/utils/adt/pg_lzcompress.c | 779 -
 src/common/Makefile                   |   3 +-
 src/common/pg_lzcompress.c            | 784 ++
 src/include/utils/pg_lzcompress.h     |  19 +-
 src/tools/msvc/Mkvcbuild.pm           |   3 +-
 7 files changed, 813 insertions(+), 790 deletions(-)
 delete mode 100644 src/backend/utils/adt/pg_lzcompress.c
 create mode 100644 src/common/pg_lzcompress.c

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index ce44bbd..9af456f 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -142,7 +142,8 @@ heap_tuple_untoast_attr(struct varlena * attr)
 		attr = (struct varlena *) palloc(PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
 		SET_VARSIZE(attr, PGLZ_RAW_SIZE(tmp) + VARHDRSZ);
-		pglz_decompress(tmp, VARDATA(attr));
+		if (pglz_decompress(tmp, VARDATA(attr)) != PGLZ_OK)
+			elog(ERROR, "compressed data is corrupted");
 		pfree(tmp);
 	}
 }
@@ -167,7 +168,8 @@ heap_tuple_untoast_attr(struct varlena * attr)
 		attr = (struct
Re: [HACKERS] superuser() shortcuts
On 12/4/14 3:32 PM, Stephen Frost wrote:
>> On reflection, this seemed odd because of how the code was written but perhaps it was intentional after all. In general, superuser should be able to bypass permissions restrictions and I don't see why this case should be different.
>
> In general, I don't think we want to allow giving away of objects by unprivileged users. We don't allow that to be done for tables and I'm surprised to hear that it's possible to give domains away. Superuser should be able to bypass the restriction, BUT the object given away by the superuser to an unprivileged user should NOT be able to be further given away by that unprivileged user.

Clearly, this issue is a bit more complex than a simple code cleanup. So I'm going to set this patch as returned with feedback.

My suggestion for moving forward would be to define a general security policy for the ALTER OWNER cases, and then fix those properly. The changes for integrating the superuser check into the replication role check should perhaps be tackled as part of a general refactoring of capability checks.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
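For reference, the "giving away" being discussed looks like this in SQL (the role and domain names are made up; whether the second statement should succeed for an unprivileged owner is exactly the policy question raised above):

    -- Executed by an ordinary, unprivileged role that owns the domain.
    CREATE DOMAIN positive_int AS int CHECK (VALUE > 0);
    ALTER DOMAIN positive_int OWNER TO some_other_role;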
Re: [HACKERS] Role Attribute Bitmask Catalog Representation
Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
> Attached is an updated patch.

Awesome, thanks!

> diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
> *** pg_extension_ownercheck(Oid ext_oid, Oid *** 5051,5102
>   }
>
>   /*
> !  * Check whether specified role has CREATEROLE privilege (or is a superuser)
>    *
> !  * Note: roles do not have owners per se; instead we use this test in
> !  * places where an ownership-like permissions test is needed for a role.
> !  * Be sure to apply it to the role trying to do the operation, not the
> !  * role being operated on! Also note that this generally should not be
> !  * considered enough privilege if the target role is a superuser.
> !  * (We don't handle that consideration here because we want to give a
> !  * separate error message for such cases, so the caller has to deal with it.)
>    */

The comment above is pretty big and I don't think we want to completely remove it. Can you add it as another 'Note' in the 'has_role_attribute' comment and reword it accordingly?

> *** AlterRole(AlterRoleStmt *stmt)
> *** 508,513
> --- 512,518
>   	DefElem    *dvalidUntil = NULL;
>   	DefElem    *dbypassRLS = NULL;
>   	Oid         roleid;
> + 	RoleAttr attributes;

Whitespace issue that should be fixed - attributes should line up with roleid.

> --- 1405,1421
>      FROM (pg_get_replication_slots() l(slot_name, plugin, slot_type, datoid, active, xmin, catalog_xmin, restart_lsn)
>        LEFT JOIN pg_database d ON ((l.datoid = d.oid)));
>  pg_roles| SELECT pg_authid.rolname,
> !    pg_check_role_attribute(pg_authid.oid, 'SUPERUSER'::text) AS rolsuper,
> !    pg_check_role_attribute(pg_authid.oid, 'INHERIT'::text) AS rolinherit,
> !    pg_check_role_attribute(pg_authid.oid, 'CREATEROLE'::text) AS rolcreaterole,
> !    pg_check_role_attribute(pg_authid.oid, 'CREATEDB'::text) AS rolcreatedb,
> !    pg_check_role_attribute(pg_authid.oid, 'CATUPDATE'::text) AS rolcatupdate,
> !    pg_check_role_attribute(pg_authid.oid, 'CANLOGIN'::text) AS rolcanlogin,
> !    pg_check_role_attribute(pg_authid.oid, 'REPLICATION'::text) AS rolreplication,
> !    pg_check_role_attribute(pg_authid.oid, 'BYPASSRLS'::text) AS rolbypassrls,
>      pg_authid.rolconnlimit,
>      ''::text AS rolpassword,
>      pg_authid.rolvaliduntil,
>      s.setconfig AS rolconfig,
>      pg_authid.oid
>     FROM (pg_authid

It occurs to me that in this case (and a few others..), we're doing a lot of extra work - each call to pg_check_role_attribute() is doing a lookup on the oid just to get back to what the rolattr value on this row is. How about a function which takes rolattr and the text representation of the attribute and returns if the bit is set for that rolattr value? Then you'd pass pg_authid.rolattr into the function calls above instead of the role's oid.

I don't see any changes to the regression test files, were they forgotten in the patch? I would think that at least the view definition changes would require updates to the regression tests, though perhaps nothing else.

Overall, I'm pretty happy with the patch and would suggest moving on to writing up the documentation changes to go along with the code changes. I'll continue to play around with it but it all seems pretty clean to me and will allow us to easily add the additional role attributes being discussed.

Thanks!

Stephen

signature.asc
Description: Digital signature
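For illustration only, the kind of rework being suggested would make the view columns look roughly like this; the function shown is a hypothetical variant of pg_check_role_attribute() that accepts the rolattr bitmask directly instead of the role's oid, so no per-call catalog lookup is needed:

    -- Hypothetical signature: pg_check_role_attribute(rolattr bigint, attribute text)
    SELECT pg_authid.rolname,
           pg_check_role_attribute(pg_authid.rolattr, 'SUPERUSER'::text)  AS rolsuper,
           pg_check_role_attribute(pg_authid.rolattr, 'CREATEROLE'::text) AS rolcreaterole
      FROM pg_authid;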
Re: [HACKERS] [v9.5] Custom Plan API
On 27 November 2014 at 20:48, Simon Riggs si...@2ndquadrant.com wrote:
> On 27 November 2014 at 10:33, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
>> The reason why documentation portion was not yet committed is, sorry, it is due to quality of documentation from the standpoint of native English speaker. Now, I'm writing up a documentation stuff according to the latest code base, please wait for several days and help to improve.
>
> Happy to help with that. Please post to the Wiki first so we can edit it communally.

I've corrected a spelling mistake, but it reads OK at the moment.

>> The example contrib module was not committed and I am advised no longer works. May I submit the contrib/ctidscan module again for an example?
>
> Yes please. We have other contrib modules that exist as tests, so this seems reasonable to me.

I can't improve the docs without the example code. Is that available now?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] inherit support for foreign tables
On Thu, Dec 04, 2014 at 10:00:14AM +0530, Ashutosh Bapat wrote:
> On Thu, Dec 4, 2014 at 9:05 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote:
>> (2014/12/03 19:35), Ashutosh Bapat wrote:
>>> On Tue, Dec 2, 2014 at 8:29 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote:
>>>> IIUC, even the transactions over the local and the *single* remote server are not guaranteed to be executed atomically in the current form. It is possible that the remote transaction succeeds and the local one fails, for example, resulting in data inconsistency between the local and the remote.
>>>
>>> IIUC, while committing transactions involving a single remote server, the steps taken are as follows:
>>> 1. the local changes are brought to PRE-COMMIT stage, which means that the transaction *will* succeed locally after successful completion of this phase,
>>> 2. COMMIT message is sent to the foreign server,
>>> 3. if step two succeeds, local changes are committed and successful commit is conveyed to the client,
>>> 4. if step two fails, local changes are rolled back and abort status is conveyed to the client,
>>> 5. if step 1 itself fails, the remote changes are rolled back.
>>> This is as per the one-phase commit protocol which guarantees ACID for a single foreign data source. So, the changes involving local and a single foreign server seem to be atomic and consistent.
>>
>> Really? Maybe I'm missing something, but I don't think the current implementation for committing transactions has such a mechanism stated in step 1. So, I think it's possible that the local transaction fails in step 3 while the remote transaction succeeds, as mentioned above. PFA a script attached which shows this.
>
> You may want to check the code in pgfdw_xact_callback() for actions taken by postgres_fdw on various events, and CommitTransaction() for how those events are generated. The code there complies with the sequence above.

While postgres_fdw delays remote commit to eliminate many causes for having one server commit while another aborts, other causes remain. The local transaction could still fail due to WAL I/O problems or a system crash. 2PC is the reliable answer, but that was explicitly out of scope for the initial postgres_fdw write support.

Does this inheritance patch add any atomicity problem that goes away when one breaks up the inheritance hierarchy and UPDATEs each table separately? If not, this limitation is okay.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
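As a rough SQL-level illustration of the window being described (the table layout is made up, loosely following the example used elsewhere in this thread, and this is not the attached script):

    -- "parent" has one plain local child and one postgres_fdw foreign child.
    BEGIN;
    UPDATE parent SET a = a * 2;   -- modifies rows in both children
    COMMIT;
    -- Commit sequence: the remote transaction is committed first (step 2);
    -- if the local commit then fails (e.g. WAL I/O error or a crash), the
    -- foreign child keeps the doubled values while the local child does not.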
Re: [HACKERS] Role Attribute Bitmask Catalog Representation
Stephen,

> The comment above is pretty big and I don't think we want to completely remove it. Can you add it as another 'Note' in the 'has_role_attribute' comment and reword it accordingly?

Ok, agreed. Will address.

> Whitespace issue that should be fixed - attributes should line up with roleid.

Ok. Will address.

> It occurs to me that in this case (and a few others..), we're doing a lot of extra work - each call to pg_check_role_attribute() is doing a lookup on the oid just to get back to what the rolattr value on this row is. How about a function which takes rolattr and the text representation of the attribute and returns if the bit is set for that rolattr value? Then you'd pass pg_authid.rolattr into the function calls above instead of the role's oid.

Makes sense, I'll put that together.

> I don't see any changes to the regression test files, were they forgotten in the patch? I would think that at least the view definition changes would require updates to the regression tests, though perhaps nothing else.

Hmmm... :-/ The regression tests that changed were in 'src/test/regress/expected/rules.out' and should be near the bottom of the patch.

> Overall, I'm pretty happy with the patch and would suggest moving on to writing up the documentation changes to go along with the code changes. I'll continue to play around with it but it all seems pretty clean to me and will allow us to easily add the additional role attributes being discussed.

Sounds good. I'll start on those changes next.

Thanks,
Adam

--
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com
Re: [HACKERS] Role Attribute Bitmask Catalog Representation
* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
>> I don't see any changes to the regression test files, were they forgotten in the patch? I would think that at least the view definition changes would require updates to the regression tests, though perhaps nothing else.
>
> Hmmm... :-/ The regression tests that changed were in 'src/test/regress/expected/rules.out' and should be near the bottom of the patch.

Hah, looked just like changes to the system_views, sorry for the confusion. :)

>> Overall, I'm pretty happy with the patch and would suggest moving on to writing up the documentation changes to go along with the code changes. I'll continue to play around with it but it all seems pretty clean to me and will allow us to easily add the additional role attributes being discussed.
>
> Sounds good. I'll start on those changes next.

Great!

Thanks,
Stephen

signature.asc
Description: Digital signature
Re: [HACKERS] inherit support for foreign tables
On Thu, Dec 04, 2014 at 12:35:54PM +0900, Etsuro Fujita wrote:
> (2014/12/03 19:35), Ashutosh Bapat wrote:
>> On Tue, Dec 2, 2014 at 8:29 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote:
>>
>> This is not exactly an extension of the non-inheritance case: the non-inheritance case doesn't show two remote SQLs under the same plan node. Maybe you can rename the label "Remote SQL" as "Remote UPDATE/INSERT/DELETE" (or something to that effect) for the DML command, and the Foreign plan node should be renamed to "Foreign access node" or something to indicate that it does both the scan as well as DML. I am not keen about the actual terminology, but I think a reader of the plan shouldn't get confused. We can leave this for committer's judgement.
>
> Thanks for the proposal! I think that would be a good idea. But I think there would be another idea. An example will be shown below. We show the update commands below the ModifyTable node, not above the corresponding ForeignScan nodes, so maybe less confusing. If there are no objections of you and others, I'll update the patch this way.
>
> postgres=# explain verbose update parent set a = a * 2 where a = 5;
>                               QUERY PLAN
> -------------------------------------------------------------------
>  Update on public.parent  (cost=0.00..280.77 rows=25 width=10)
>    On public.ft1
>      Remote SQL: UPDATE public.mytable_1 SET a = $2 WHERE ctid = $1
>      ^^

It occurs to me that the command generated by the FDW might well not be SQL at all, as is the case with file_fdw and anything else that talks to a NoSQL engine. Would it be reasonable to call this "Remote command" or something similarly generic?

Cheers,
David.
--
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer
On 2.12.2014 02:52, Tom Lane wrote:
> Tomas Vondra t...@fuzzy.cz writes:
>> On 2.12.2014 01:33, Tom Lane wrote:
>>> What I suspect you're looking at here is the detritus of creation of a huge number of memory contexts. mcxt.c keeps its own state about existing contexts in TopMemoryContext. So, if we posit that those contexts weren't real small, there's certainly room to believe that there was some major memory bloat going on recently.
>>
>> Aha! MemoryContextCreate allocates the memory for the new context from TopMemoryContext explicitly, so that it survives resets of the parent context. Is that what you had in mind by keeping state about existing contexts?
>
> Right.
>
>> That'd probably explain the TopMemoryContext size, because array_agg() creates a separate context for each group. So if you have 1M groups, you have 1M contexts. Although I don't see how the size of those contexts would matter?
>
> Well, if they're each 6K, that's your 6GB right there.

FWIW, I tried to repeat the test with the array_agg() patch, submitted to the next CF:

http://www.postgresql.org/message-id/5479faea.3070...@fuzzy.cz

and the memory context stats after the first iteration look like this:

master (unpatched)
------------------
TopMemoryContext: 204878128 total in 25011 blocks; 204010192 free (750034 chunks); 867936 used

patched with array_agg
----------------------
TopMemoryContext: 78128 total in 11 blocks; 10144 free (34 chunks); 67984 used

Also, the RSS issue (which was the initial subject of this thread) pretty much disappeared. So indeed, this seems to be caused by the 'islands' referenced by TopMemoryContext.

I wonder whether it'd be worth looking for places doing something similar to array_agg(), potentially creating many contexts in parallel, and maybe modifying them as in the patch.

regards
Tomas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
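A minimal way to reproduce the pattern being discussed (the table name and row counts here are made up, not taken from the original test):

    -- Unpatched, array_agg() sets up one per-group memory context, and each
    -- context's bookkeeping lives in TopMemoryContext; with ~1M groups those
    -- headers alone add up.
    CREATE TABLE t AS
        SELECT i % 1000000 AS grp, i AS val
        FROM generate_series(1, 5000000) s(i);
    SELECT grp, array_agg(val) FROM t GROUP BY grp;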
Re: [HACKERS] [v9.5] Custom Plan API
Simon,

> Yes please. We have other contrib modules that exist as tests, so this seems reasonable to me.
>
> I can't improve the docs without the example code. Is that available now?

Please wait for a few days. The ctidscan module is not adjusted for the latest interface yet.
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3)
Hi!

This was initially posted to pgsql-performance in this thread:

http://www.postgresql.org/message-id/5472416c.3080...@fuzzy.cz

but pgsql-hackers seems like a more appropriate place for further discussion.

Anyways, attached is v3 of the patch implementing the adaptive ndistinct estimator. Just like the previous version, the original estimate is the one stored/used, and the alternative one is just printed, to make it possible to compare the results.

Changes in this version:

1) implementing compute_minimal_stats

   - So far only the 'scalar' (more common) case was handled.

   - The algorithm requires more detailed input data; the MCV-based stats are insufficient, so the code hashes the values and then determines the f1, f2, ..., fN coefficients by sorting and walking the array of hashes.

2) handling wide values properly (now are counted into f1)

3) compensating for NULL values when calling optimize_estimate

   - The estimator has no notion of NULL values, so it's necessary to remove them both from the total number of rows and from the sampled rows.

4) some minor fixes and refactorings

I also repeated the tests comparing the results to the current estimator - full results are at the end of the post.

The one interesting case is the 'step skew' with statistics_target=10, i.e. estimates based on mere 3000 rows. In that case, the adaptive estimator significantly overestimates:

    values   current   adaptive
    ------------------------------
    106 99 107
    1068 6449190
    1006 38 6449190
    10006327 42441

I don't know why I didn't get these errors in the previous runs, because when I repeat the tests with the old patches I get similar results with a 'good' result from time to time. Apparently I had a lucky day back then :-/

I've been messing with the code for a few hours, and I haven't found any significant error in the implementation, so it seems that the estimator does not perform terribly well for very small samples (in this case it's 3000 rows out of 10.000.000, i.e. ~0.03%).

However, I've been able to come up with a simple way to limit such errors, because the number of distinct values is naturally bounded by

    (totalrows / samplerows) * ndistinct

where ndistinct is the number of distinct values in the sample. This essentially means that if you slice the table into sets of samplerows rows, you get different ndistinct values. BTW, this also fixes the issue reported by Jeff Janes on 21/11.

With this additional sanity check, the results look like this:

    values   current   adaptive
    ------------------------------
    106 99 116
    1068 23331
    1006 38 96657
    10006327 12400

Which is much better, but clearly still a bit on the high side.

So either the estimator really is a bit unstable for such small samples (it tends to overestimate a bit in all the tests), or there's a bug in the implementation - I'd be grateful if someone could peek at the code and maybe compare it to the paper describing the estimator. I've spent a fair amount of time analyzing it, but found nothing.

But maybe the estimator really is unstable for such small samples - in that case we could probably use the current estimator as a fallback. After all, this only happens when someone explicitly decreases the statistics target to 10 - with the default statistics target it's damn accurate.
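For the data set used above, that cap works out to roughly:

    totalrows / samplerows = 10,000,000 / 3,000 ≈ 3,333

so the corrected estimate can never exceed about 3,333 times the number of distinct values actually observed in the 3,000-row sample.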
kind regards
Tomas

statistics_target = 10
======================

a) smooth skew, 101 values, different skew ('k')

   - defaults to the current estimator

b) smooth skew, 10.001 values, different skew ('k')

    k    current   adaptive
    ------------------------
    1    10231     11259
    2     6327      8543
    3     4364      7707
    4     3436      7052
    5     2725      5868
    6     2223      5071
    7     1979      5011
    8     1802      5017
    9     1581      4546

c) step skew (different numbers of values)

    values   current   adaptive
    ------------------------------
    106 99 107
    1068 6449190
    1006 38 6449190
    10006327 42441

   patched:

    values   current   adaptive
    ------------------------------
    106 99 116
    1068 23331
    1006 38 96657
    10006327 12400

statistics_target = 100
=======================

a) smooth skew, 101 values, different skew ('k')

   - defaults to the current estimator

b) smooth skew, 10.001 values, different skew ('k')

    k    current   adaptive
    ------------------------
    1    10011     10655
    2     9641     10944
    3     8837     10846
    4     8315     10992
    5     7654     10760
    6     7162     10524
    7     6650     10375
    8     6268     10275
    9     5871
[HACKERS] PATCH: hashjoin - gracefully increasing NTUP_PER_BUCKET instead of batching
Hi,

back when we were discussing the hashjoin patches (now committed), Robert proposed that maybe it'd be a good idea to sometimes increase the number of tuples per bucket instead of batching.

That is, while initially sizing the hash table - if the hash table with enough buckets to satisfy the NTUP_PER_BUCKET load factor does not fit into work_mem, try a bit higher load factor before starting to batch. Attached patch is an initial attempt to implement this - it's a bit rough on the edges, but hopefully enough to judge the benefit of this.

The default load factor is 1. The patch tries to degrade this to 2, 4 or 8 in an attempt to fit the hash table into work_mem. If it doesn't work, it starts batching with the default load factor. If batching is required while the hashjoin is running, the load factor is switched back to the default one (if we're batching, there's no point in keeping the slower hash table). The patch also modifies the explain output, to show the load factor.

The testing I've done so far is rather disappointing, though.

    create table a as select i from generate_series(1,100) s(i);
    create table b as select i from generate_series(1,100) s(i);

    analyze a;
    analyze b;

    select a.i, b.i from a join b on (a.i = b.i);

    work_mem   batches   tuples per bucket   duration
    ---------------------------------------------------
    64 MB      1         1                     585 ms
    46 MB      1         2                     639 ms
    43 MB      1         4                     794 ms
    40 MB      1         8                    1075 ms
    39 MB      2         1                     623 ms

So, even increasing the load factor to 2 is slower than batching. Of course, with other examples the results may be different. For example with a much larger outer table:

    create table a as select mod(i,100) i from generate_series(1,1000) s(i);
    analyze a;

    work_mem   batches   tuples per bucket   duration
    ---------------------------------------------------
    64 MB      1         1                    3904 ms
    46 MB      1         2                    4692 ms
    43 MB      1         4                    6499 ms
    40 MB      1         8                    9264 ms
    39 MB      2         1                    4054 ms

Same results. Of course, for huge outer tables it will make a difference. For example on a ~8GB outer table (on a machine with 8GB of RAM), the batching causes enough I/O to make it slower than ntup=2 (50s vs. 80s, for example).

I'm not sure what's the best way forward - clearly, for cases that fit into RAM (temp files into page cache), batching is faster. For tables large enough to cause a lot of I/O, it may make a difference - but I'm not sure how to identify these cases.

So ISTM implementing this requires a reliable way to identify which case we're dealing with - if the outer table is large enough (prefer increasing load factor) or not (prefer batching). Until then, keeping the current simple/predictable approach is probably better. Of course, this also depends on additional variables (e.g. is the system memory-stressed?).
regards
Tomas

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 332f04a..06e612a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1936,26 +1936,33 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		ExplainPropertyLong("Original Hash Batches",
 							hashtable->nbatch_original, es);
 		ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
+		ExplainPropertyLong("Hash Tuples Per Bucket",
+							hashtable->ntup_per_bucket, es);
+		ExplainPropertyLong("Original Tuples Per Bucket",
+							hashtable->ntup_per_bucket, es);
+		ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 	}
 	else if (hashtable->nbatch_original != hashtable->nbatch ||
-			 hashtable->nbuckets_original != hashtable->nbuckets)
+			 hashtable->nbuckets_original != hashtable->nbuckets ||
+			 hashtable->ntup_per_bucket_original != hashtable->ntup_per_bucket)
 	{
 		appendStringInfoSpaces(es->str, es->indent * 2);
 		appendStringInfo(es->str,
-						 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+						 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Tuples: %d (originally %d)  Memory Usage: %ldkB\n",
 						 hashtable->nbuckets,
 						 hashtable->nbuckets_original,
 						 hashtable->nbatch,
 						 hashtable->nbatch_original,
+						 hashtable->ntup_per_bucket,
+						 hashtable->ntup_per_bucket_original,
 						 spacePeakKb);
 	}
 	else
 	{
 		appendStringInfoSpaces(es->str, es->indent * 2);
 		appendStringInfo(es->str,
-						 "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
+						 "Buckets: %d  Batches: %d  Tuples: %d  Memory Usage: %ldkB\n",
 						 hashtable->nbuckets,
 						 hashtable->nbatch,
-						 spacePeakKb);
+						 hashtable->ntup_per_bucket,
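For illustration, the explain.c hunk above extends the text-format line printed for the Hash node; with the patch applied the output would look roughly like the following (the numbers are made up, only the line format comes from the patch):

    EXPLAIN ANALYZE SELECT a.i, b.i FROM a JOIN b ON (a.i = b.i);
    ...
      ->  Hash  (actual rows=... loops=1)
            Buckets: 262144  Batches: 1  Tuples: 4  Memory Usage: 39000kB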
Re: [HACKERS] Testing DDL deparsing support
On Tue, Dec 2, 2014 at 03:13:07PM -0300, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Thu, Nov 27, 2014 at 11:43 PM, Ian Barwick i...@2ndquadrant.com wrote:
>>> A simple schedule to demonstrate this is available; execute from the src/test/regress/ directory like this:
>>>
>>>     ./pg_regress \
>>>         --temp-install=./tmp_check \
>>>         --top-builddir=../../.. \
>>>         --dlpath=. \
>>>         --schedule=./schedule_ddl_deparse_demo
>>
>> I haven't read the code, but this concept seems good to me.
>
> Excellent, thanks.
>
>> It has the unfortunate weakness that a difference could exist during the *middle* of the regression test run that is gone by the *end* of the run, but our existing pg_upgrade testing has the same weakness, so I guess we can view this as one more reason not to be too aggressive about having regression tests drop the unshared objects they create.
>
> Agreed. Not dropping objects also helps test pg_dump itself; the normal procedure there is to run the regression tests, then pg_dump the regression database. Objects that are dropped never exercise their corresponding pg_dump support code, which I think is a bad thing. I think we should institute a policy that regression tests must keep the objects they create; maybe not all of them, but at least a sample large enough to cover all interesting possibilities.

This means creation DDL gets checked as long as the object is still used in the regression database, but what about ALTER and DROP? pg_dump doesn't issue those, except in special cases like inheritance.

--
Bruce Momjian br...@momjian.us http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers