Re: [HACKERS] increasing the default WAL segment size
Hello,

On Tue, Mar 28, 2017 at 1:06 AM, Beena Emerson wrote:
> Hello,
>
> On Sat, Mar 25, 2017 at 10:32 PM, Peter Eisentraut wrote:
>> At this point, I suggest splitting this patch up into several potentially less controversial pieces.
>>
>> One big piece is that we currently don't support segment sizes larger than 64 GB, for various internal arithmetic reasons. Your patch appears to address that. So I suggest isolating that. Assuming it works correctly, I think there would be no great concern about it.
>>
>> The next piece would be making the various tools aware of varying segment sizes without having to rely on a built-in value.
>>
>> The third piece would then be the rest that allows you to set the size at initdb.
>>
>> If we take these in order, we would make it easier to test various sizes and see if there are any more unforeseen issues when changing sizes. It would also make it easier to do performance testing so we can address the original question of what the default size should be.
>
> PFA the patches divided into 3 parts:
>
> 02-increase-max-wal-segsize.patch - Increases the wal-segsize and changes the internal representation of max_wal_size and min_wal_size to MB.

Already committed.

> 03-modify-tools.patch - Makes XLogSegSize into a variable, currently set as XLOG_SEG_SIZE, and modifies the tools to fetch the size instead of using the built-in value.

The updated 03-modify-tools_v2.patch has the following changes:
- Rebased over current HEAD.
- Improved comments.
- Added error messages where applicable.
- Replaced XLOG_SEG_SIZE in the tools and xlog_internal.h with XLogSegSize. XLOG_SEG_SIZE is the wal_segment_size the server was compiled with, and XLogSegSize is the wal_segment_size of the target server on which the tool is run. When the binaries used and the target server are compiled with different wal_segment_size values, the calculations would be affected and the tool would crash. To avoid that, all calculations in the tools should use XLogSegSize.
- pg_waldump: fuzzy_open_file is split into two functions - open_file_in_directory and identify_target_directory - so that code can be reused when determining the XLogSegSize from the WAL file header.
- The IsValidXLogSegSize macro is moved from 04 to here so that we can use it for validating the size in all the tools.

> 04-initdb-walsegsize.patch - Adds the initdb option to set wal-segsize and make related changes. Update pg_test_fsync to use DEFAULT_XLOG_SEG_SIZE instead of XLOG_SEG_SIZE.

The 04-initdb-walsegsize_v2.patch has the following improvements:
- Rebased over the new 03 patch.
- Pass the wal-segsize initdb option as a command-line option rather than in an environment variable.
- Since the new function check_wal_size had only two checks and was used once, moved the code to ReadControlFile where it is used and removed the function.
- Improve comments and add validations where required.
- Use DEFAULT_XLOG_SEG_SIZE to set min_wal_size and max_wal_size, instead of the value 16.
- Use XLogSegMaxSize and XLogSegMinSize to calculate the range of the GUC wal_segment_size, instead of 16 - INT_MAX.

>
>> One concern I have is that your patch does not contain any tests. There should probably be lots of tests.
>
> 05-initdb_tests.patch adds TAP tests to initialize a cluster with different wal_segment_size values and then check the config values. What other tests do you have in mind? Checking the various tools?
> > -- Beena Emerson EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

Attachments: 04-initdb-walsegsize_v2.patch, 03-modify-tools_v2.patch, 01-add-XLogSegmentOffset-macro_rebase.patch

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
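As a side note on what patch 03 is driving at: once the segment size is a per-cluster property rather than a compile-time constant, it has to be discoverable at run time. On a live server it is already exposed through the read-only GUC mentioned above; the sketch below is only an illustration of that, not part of the patches (offline tools would read the value from pg_control or the WAL file header instead).

    -- Read the target cluster's WAL segment size instead of assuming the
    -- compiled-in XLOG_SEG_SIZE.
    SHOW wal_segment_size;
    SELECT name, setting, unit FROM pg_settings WHERE name = 'wal_segment_size';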
Re: [HACKERS] Proposal : For Auto-Prewarm.
On Wed, Jul 5, 2017 at 6:25 PM, Mithun Cy wrote: > On Mon, Jul 3, 2017 at 11:58 AM, Amit Kapila wrote: >> On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy >> wrote: >>> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy >>> wrote: On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown wrote: > > Also, I find it a bit messy that launch_autoprewarm_dump() doesn't > detect an autoprewarm process already running. I'd want this to > return NULL or an error if called for a 2nd time. We log instead of error as we try to check only after launching the worker and inside worker. One solution could be as similar to autoprewam_dump_now(), the autoprewarm_start_worker() can init shared memory and check if we can launch worker in backend itself. I will try to fix same. >>> >>> I have fixed it now as follows >>> >>> +Datum >>> +autoprewarm_start_worker(PG_FUNCTION_ARGS) >>> +{ >>> + pid_t pid; >>> + >>> + init_apw_shmem(); >>> + pid = apw_state->bgworker_pid; >>> + if (pid != InvalidPid) >>> + ereport(ERROR, >>> + (errmsg("autoprewarm worker is already running under PID >>> %d", >>> + pid))); >>> + >>> + autoprewarm_dump_launcher(); >>> + PG_RETURN_VOID(); >>> +} >>> >>> In backend itself, we shall check if an autoprewarm worker is running >>> then only start the server. There is a possibility if this function is >>> executed concurrently when there is no worker already running (Which I >>> think is not a normal usage) then both call will say it has >>> successfully launched the worker even though only one could have >>> successfully done that (other will log and silently die). >> >> Why can't we close this remaining race condition? Basically, if we >> just perform all of the actions in this function under the lock and >> autoprewarm_dump_launcher waits till the autoprewarm worker has >> initialized the bgworker_pid, then there won't be any remaining race >> condition. I think if we don't close this race condition, it will be >> unpredictable whether the user will get the error or there will be >> only a server log for the same. > > Yes, I can make autoprewarm_dump_launcher to wait until the launched > bgworker set its pid, but this requires one more synchronization > variable between launcher and worker. More than that I see > ShmemInitStruct(), LWLockAcquire can throw ERROR (restarts the > worker), which needs to be called before setting pid. So I thought it > won't be harmful let two concurrent calls to launch workers and we > just log failures. Please let me know if I need to rethink about it. > I don't know whether you need to rethink but as presented in the patch, it seems unclear to me about the specs of API. As this is an exposed function to the user, I think the behavior should be well defined. If you don't think changing code has many advantages, then at the very least update the docs to indicate the expectation and behavior of the API. Also, I think it is better to add few comments in the code to tell about the unpredictable behavior in case of race condition and the reason for same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
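For orientation, the two user-facing entry points under discussion are plain SQL functions; the sketch below only shows how they are invoked, and the behaviour of a second concurrent call (error versus a log-and-exit worker) is exactly what the thread is debating.

    -- Usage sketch of the functions from the patch:
    SELECT autoprewarm_start_worker();  -- expected to ERROR if a worker is already running under some PID
    SELECT autoprewarm_dump_now();      -- expected to ERROR if the dump file is in use by another PID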
Re: [HACKERS] Proposal : For Auto-Prewarm.
On Wed, Jul 5, 2017 at 6:25 PM, Mithun Cy wrote: > On Mon, Jul 3, 2017 at 3:34 PM, Amit Kapila wrote: >> >> Few comments on the latest patch: >> >> 1. >> + LWLockRelease(&apw_state->lock); >> + if (!is_bgworker) >> + ereport(ERROR, >> + (errmsg("could not perform block dump because dump file is being >> used by PID %d", >> + apw_state->pid_using_dumpfile))); >> + ereport(LOG, >> + (errmsg("skipping block dump because it is already being performed by PID >> %d", >> + apw_state->pid_using_dumpfile))); >> >> >> The above code looks confusing as both the messages are saying the >> same thing in different words. I think you keep one message (probably >> the first one) and decide error level based on if this is invoked for >> bgworker. Also, you can move LWLockRelease after error message, >> because if there is any error, then it will automatically release all >> lwlocks. > > ERROR is used for autoprewarm_dump_now which is called from the backend. > LOG is used for bgworker. > wordings used are to match the context if failing to dump is > acceptable or not. In the case of bgworker, it is acceptable we are > not particular about the start time of dump but the only interval > between the dumps. So if already somebody doing it is acceptable. But > one who calls autoprewarm_dump_now might be particular about the start > time of dump so we throw error making him retry same. > > The wording's are suggested by Robert(below snapshot) in one of his > previous comments and I also agree with it. If you think I should > reconsider this and I am missing something I am open to suggestions. > Not an issue, if you and Robert think having two different messages is better, then let's leave it. One improvement we could do here is to initialize a boolean global variable for AutoPrewarmWorker, then use it wherever required. >> 3. >> +dump_now(bool is_bgworker) >> { >> .. >> + if (buf_state & BM_TAG_VALID) >> + { >> + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode; >> + block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode; >> + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode; >> + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum; >> + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum; >> + ++num_blocks; >> + } >> .. >> } >>I think there is no use of writing Unlogged buffers unless the dump is >>for the shutdown. You might want to use BufferIsPermanent to detect the same. > > -- I do not think that is true pages of the unlogged table are also > read into buffers for read-only purpose. So if we miss to dump them > while we shut down then the previous dump should be used. > I am not able to understand what you want to say. Unlogged tables should be empty in case of crash recovery. Also, we never flush unlogged buffers except for shutdown checkpoint, refer BufferAlloc and in particular below comment: * Make sure BM_PERMANENT is set for buffers that must be written at every * checkpoint. Unlogged buffers only need to be written at shutdown * checkpoints, except for their "init" forks, which need to be treated * just like permanent relations. >> 4. >> +static uint32 >> +dump_now(bool is_bgworker) >> { >> .. >> + for (num_blocks = 0, i = 0; i < NBuffers; i++) >> + { >> + uint32 buf_state; >> + >> + if (!is_bgworker) >> + CHECK_FOR_INTERRUPTS(); >> .. >> } >> >> Why checking for interrupts is only for non-bgwroker cases? > > -- autoprewarm_dump_now is directly called from the backend. 
In such > case, we have to handle signals registered for backend in dump_now(). > For bgworker dump_block_info_periodically caller of dump_now() handles > SIGTERM, SIGUSR1 which we are interested in. > Okay, but what about signal handler for SIGUSR1 (procsignal_sigusr1_handler). Have you verified that it will never set the InterruptPending flag? > >> 6. >> +dump_now(bool is_bgworker) >> { >> .. >> + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR); >> + apw_state->pid_using_dumpfile = InvalidPid; >> .. >> } >> >> How will pid_using_dumpfile be set to InvalidPid in the case of error >> for non-bgworker cases? > > -- I have a try() - catch() in autoprewarm_dump_now I think that is okay. > Okay, then that will work. >> >> 7. >> +dump_now(bool is_bgworker) >> { >> .. >> + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR); >> >> .. >> } >> >> How will transient_dump_file_path be unlinked in the case of error in >> durable_rename? I think you need to use PG_TRY..PG_CATCH to ensure >> same? > > -- If durable_rename is failing that seems basic functionality of > autoperwarm is failing so I want it to be an ERROR. I do not want to > remove the temp file as we always truncate before reusing it again. So > I think there is no need to catch all ERROR in dump_now() just to > remove the temp file. > I am not getting your argument here, do you mean to say that if writing to a transient file is failed then we should remove the transient
Re: [HACKERS] pgsql 10: hash indexes testing
On Thu, Jul 6, 2017 at 2:40 AM, AP wrote: > On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote: >> >> > version | bucket_pages | overflow_pages | bitmap_pages | unused_pages >> >> > | live_items | dead_items | free_percent >> >> > -+--++--+--+++-- >> >> >3 | 10485760 |2131192 | 66 |0 >> >> > | 2975444240 | 0 | 1065.19942179026 >> >> > (1 row) > ... >> >> > And I do appear to have an odd percentage of free space. :) >> >> Are you worried about "unused_pages"? If so, then this is not a major > > Nope. "free_percent" Just a bit weird that I have it at over 1000% free. :) > Shouldn't that number be < 100? > Yes, there seems to be some gotcha in free percent calculation. Is it possible for you to debug or in some way share the test? >> reason to worry, because these are probably freed overflow pages which >> can be used in future. In the hash index, when we free the overflow >> pages, they are not returned back to OS, rather they are tracked in >> the index as unused pages which will get used when required in future. > >> >> It looks like Vacuum hasn't been triggered. >> >> Vacuum won't be triggered on insert load. I think that is one of the >> reasons why in your initial copy, you might have got the error. We >> had some discussion in the past to trigger Vacuum on insert heavy >> workloads [1], but the patch still didn't get committed. I think if >> that patch or some other form of that patch gets committed, it will >> help the workload what you are trying here. > > Well, if this is the cause of my little issue, it might be nice. ATM > my import script bombs out on errors (that I've duplicated! :) It took > 11 hours but it bombed) and it sounds like I'll need to do a manual > VACUUM before it can be run again. > Yeah, I think after manual vacuum you should be able to proceed. > The stats you were looking for before are: > > # select * from pgstathashindex('link_datum_id_idx'); > version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | > live_items | dead_items | free_percent > -+--++--+--+++-- >3 | 8559258 |4194176 | 128 | 1926502 | > 3591812743 | 0 | 942.873199357466 > (1 row) > > # select 4194176.0/128/8; >?column? > --- > 4095.8750 > (1 row) > >From above stats, it is clear that you have hit the maximum number of overflow pages we can support today. Now, here one can argue that we should increase the limit of overflow pages in hash index which we can do, but I think you can again hit such a problem after some more time. So at this stage, there are two possibilities for you (a) run manual Vacuum in-between (b) create the index after bulk load. In general, whatever I have mentioned in (b) is a better way for bulk loading. Note here again the free_percent seems to be wrong. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
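Spelled out, the two workarounds in (a) and (b) look roughly like this; the table and index names are taken from the report above, and the indexed column name is only a guess based on the index name.

    -- (a) run a manual VACUUM between bulk loads so freed overflow pages can be reused:
    VACUUM link;

    -- (b) or drop the index, bulk load, and build the hash index afterwards:
    DROP INDEX IF EXISTS link_datum_id_idx;
    -- ... COPY the data ...
    CREATE INDEX link_datum_id_idx ON link USING hash (datum_id);  -- column name assumed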
Re: [HACKERS] hash index on unlogged tables doesn't behave as expected
On Mon, Jul 03, 2017 at 09:12:01AM +0530, Amit Kapila wrote:
> While discussing the behavior of hash indexes with Bruce in the nearby thread [1], it has been noticed that a hash index on unlogged tables doesn't behave as expected. Prior to 10, it had a different set of problems (mainly because hash indexes are not WAL-logged) which were discussed on that thread [1]; however, when I checked, it doesn't work even for 10. Below are the steps to reproduce the problem.
>
> 1. Set up master and standby.
> 2. On the master, create an unlogged table and hash index.
>    2A. Create unlogged table t1(c1 int);
>    2B. Create hash index idx_t1_hash on t1 using hash(c1);
> 3. On the standby, try selecting data:
>    select * from t1;
>    ERROR: cannot access temporary or unlogged relations during recovery
> --- Till here everything works as expected
> 4. Promote the standby to master (I have just stopped the standby and master and removed the recovery.conf file from the standby database location) and try starting the new master; it gives the below error and doesn't get started.
>    FATAL: could not create file "base/12700/16387": File exists
>
> The basic issue was that the WAL logging for the Create Index operation was oblivious to the fact that for unlogged tables only INIT forks need to be logged. Another point which we need to consider is that while replaying the WAL for the create index operation, we need to flush the buffer if it is for the init fork. This needs to be done only for pages that can be part of the init fork file (like metapage, bitmappage). Attached patch fixes the issue.
>
> Another approach to fix the issue could be to always log complete pages for the create index operation on unlogged tables (in hashbuildempty). However, the logic for initial hash index pages needs us to perform certain other actions (like updating the metapage after the creation of the bitmappage) which make it difficult to follow that approach.
>
> I think this should be considered a PostgreSQL 10 open item.
>
> [1] - https://www.postgresql.org/message-id/20170630005634.GA4448%40momjian.us

[Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Robert, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi column range partition table
Hi Dean, On 2017/07/05 23:18, Dean Rasheed wrote: > On 5 July 2017 at 10:43, Amit Langote wrote: >> In retrospect, that sounds like something that was implemented in the >> earlier versions of the patch, whereby there was no ability to specify >> UNBOUNDED on a per-column basis. So the syntax was: >> >> FROM { (x [, ...]) | UNBOUNDED } TO { (y [, ...]) | UNBOUNDED } >> > Yes, that's where I ended up too. I see. >> But, it was pointed out to me [1] that that doesn't address the use case, >> for example, where part1 goes up to (10, 10) and part2 goes from (10, 10) >> up to (10, unbounded). >> >> The new design will limit the usage of unbounded range partitions at the >> tail ends. > > True, but I don't think that's really a problem. When the first column > is a discrete type, an upper bound of (10, unbounded) can be rewritten > as (11) in the new design. When it's a continuous type, e.g. floating > point, it can no longer be represented, because (10.0, unbounded) > really means (col1 <= 10.0). But we've already decided not to support > anything other than inclusive lower bounds and exclusive upper bounds, > so allowing this upper bound goes against that design choice. Yes. >>> Of course, it's pretty late in the day to be proposing this kind of >>> redesign, but I fear that if we don't tackle it now, it will just be >>> harder to deal with in the future. >>> >>> Actually, a quick, simple hacky implementation might be to just fill >>> in any omitted values in a partition bound with negative infinity >>> internally, and when printing a bound, omit any values after an >>> infinite value. But really, I think we'd want to tidy up the >>> implementation, and I think a number of things would actually get much >>> simpler. For example, get_qual_for_range() could simply stop when it >>> reached the end of the list of values for the bound, and it wouldn't >>> need to worry about an unbounded value following a bounded one. >>> >>> Thoughts? >> >> I cooked up a patch for the "hacky" implementation for now, just as you >> described in the above paragraph. Will you be willing to give it a look? >> I will also think about the non-hacky way of implementing this. > > OK, I'll take a look. Thanks a lot for your time. > Meanwhile, I already had a go at the "non-hacky" implementation (WIP > patch attached). The more I worked on it, the simpler things got, > which I think is a good sign. It definitely looks good to me. I was thinking of more or less the same approach, but couldn't have done as clean a job as you've done with your patch. > Part-way through, I realised that the PartitionRangeDatum Node type is > no longer needed, because each bound value is now necessarily finite, > so the lowerdatums and upperdatums lists in a PartitionBoundSpec can > now be made into lists of Const nodes, making them match the > listdatums field used for LIST partitioning, and then a whole lot of > related code gets simplified. Yeah, seems that way. > It needed a little bit more code in partition.c to track individual > bound sizes, but there were a number of other places that could be > simplified, so overall this represents a reduction in the code size > and complexity. Sounds reasonable. > It's not complete (e.g., no doc updates yet), but it passes all the > tests, and so far seems to work as I would expect. Thanks a lot for working on it. Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
To make the queries fired by the RI triggers GIN indexed. We need to ‒ as Tom Lane has previously suggested[1] ‒ to replace the query SELECT 1 FROM ONLY fktable x WHERE $1 = ANY (fkcol) FOR SHARE OF x; with SELECT 1 FROM ONLY fktable x WHERE ARRAY[$1] <@ fkcol FOR SHARE OF x; but since we have @<(anyarray, anyelement) it can be improved to SELECT 1 FROM ONLY fktable x WHERE $1 @> fkcol FOR SHARE OF x; and the piece of code responsible for all of this is ri_GenerateQual in ri_triggers.c. How to accomplish that is the next step. I don't know if we should hardcode the "@>" symbol or if we just index the fk table then ri_GenerateQual would be able to find the operator on it's own. *What I plan to do:* - study how to index the fk table upon its creation. I suspect this can be done in tablecmds.c *Questions:* - how can you programmatically in C index a table? [1] https://www.postgresql.org/message-id/28389.1351094795%40sss.pgh.pa.us Best Regards, Mark Rofail diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c index 3a25ba52f3..0045f64c9e 100644 --- a/src/backend/utils/adt/ri_triggers.c +++ b/src/backend/utils/adt/ri_triggers.c @@ -2650,7 +2650,7 @@ quoteRelationName(char *buffer, Relation rel) * ri_GenerateQual --- generate a WHERE clause equating two variables * * The idea is to append " sep leftop op rightop" to buf, or if fkreftype is - * FKCONSTR_REF_EACH_ELEMENT, append " sep leftop op ANY(rightop)" to buf. + * FKCONSTR_REF_EACH_ELEMENT, append " sep leftop <@ rightop" to buf. * * The complexity comes from needing to be sure that the parser will select * the desired operator. We always name the operator using @@ -2697,17 +2697,10 @@ ri_GenerateQual(StringInfo buf, appendStringInfo(buf, " %s %s", sep, leftop); if (leftoptype != operform->oprleft) ri_add_cast_to(buf, operform->oprleft); - - appendStringInfo(buf, " OPERATOR(%s.%s) ", - quote_identifier(nspname), oprname); - - if (fkreftype == FKCONSTR_REF_EACH_ELEMENT) - appendStringInfoString(buf, "ANY ("); + appendStringInfo(buf, " @> "); appendStringInfoString(buf, rightop); if (rightoptype != oprright) ri_add_cast_to(buf, oprright); - if (fkreftype == FKCONSTR_REF_EACH_ELEMENT) - appendStringInfoChar(buf, ')'); ReleaseSysCache(opertup); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
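A hedged sketch of the indexing side of this (table and column names invented for illustration): with the referencing column stored as an array, the stock GIN opclass for arrays already handles array-to-array containment, so an index built as below is what the rewritten RI query would be expected to use.

    CREATE TABLE pktable (id int PRIMARY KEY);
    CREATE TABLE fktable (fkcol int[]);  -- each element meant to reference pktable(id)

    -- GIN index on the referencing array column:
    CREATE INDEX fktable_fkcol_gin ON fktable USING gin (fkcol);

    -- Containment form of the RI check, which can use the GIN index:
    SELECT 1 FROM ONLY fktable x WHERE fkcol @> ARRAY[42] FOR SHARE OF x;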
Re: [HACKERS] SCRAM auth and Pgpool-II
Michael, > Couldn't you cache one single SASL exchange status for each > connection, meaning one PGconn saved for each? As the challenge sent > by the server and the response generated by the client are different > by design, I am afraid you would need to do that anyway in this > context (Isn't PG-pool using already the weaknesses of MD5 to make > things easier?). As the server decides first which authentication type > should happen before beginning the real message exchange, that should > not be difficult. It seems to me that you would need something more > modular than you have now if you want for example to handle > automatically connections to multiple servers that have different > password hashes stored for the same user. The latter may be an edge > case with pgpool though. Thank you for the quick response. I will study your suggestion along with the SCRAM code in PostgreSQL whether it could be possible in Pgpool-II. Regarding your question on md5 auth handling in Pgpool-II, please look into: https://pgpool.net/mediawiki/index.php/FAQ#How_does_pgpool-II_handle_md5_authentication.3F Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SCRAM auth and Pgpool-II
On Thu, Jul 6, 2017 at 10:03 AM, Tatsuo Ishii wrote: > For Pgpool-II, things would go as follows: > > 1) clients sends user name to Pgpool-II. > 2) Pgpool-II forwards it to PostgreSQL servers. > 3) Each PostgreSQL server sends their own salt to Pgpool-II. > 4) Pgpool-II is confused because there are multiple salts and each has >different values. The client only accepts single salt obviously. Couldn't you cache one single SASL exchange status for each connection, meaning one PGconn saved for each? As the challenge sent by the server and the response generated by the client are different by design, I am afraid you would need to do that anyway in this context (Isn't PG-pool using already the weaknesses of MD5 to make things easier?). As the server decides first which authentication type should happen before beginning the real message exchange, that should not be difficult. It seems to me that you would need something more modular than you have now if you want for example to handle automatically connections to multiple servers that have different password hashes stored for the same user. The latter may be an edge case with pgpool though. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
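On the point about different password hashes being stored for the same user: the verifier type is visible in the catalog, which is one way a pooler could find out up front whether a given backend holds md5 or SCRAM secrets (a sketch only; requires superuser access).

    -- 'md5...' versus 'SCRAM-SHA-256$...' prefixes identify the stored verifier type.
    SELECT rolname, left(rolpassword, 14) AS verifier_prefix
      FROM pg_authid
     WHERE rolpassword IS NOT NULL;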
Re: [HACKERS] Suspicious place in heap_prepare_freeze_tuple()
Masahiko Sawada wrote: > On Thu, Jul 6, 2017 at 1:36 AM, Alvaro Herrera > wrote: > > Teodor Sigaev wrote: > > > >> Playing around freezing tuple I found suspicious piece of code: > >> > >> heap_prepare_freeze_tuple(): > >> ... > >> frz->t_infomask = tuple->t_infomask; > >> ... > >> frz->t_infomask &= ~HEAP_XMAX_BITS; > >> frz->xmax = newxmax; > >> if (flags & FRM_MARK_COMMITTED) > >> frz->t_infomask &= HEAP_XMAX_COMMITTED; > >> > >> Seems, in last line it should be a bitwise OR instead of AND. Now this line > >> cleans all bits in t_infomask which later will be copied directly in tuple. > > > > I think you're right. > > I also think that's right. Should we back-patch it down to 9.3? Of course. I think this could cause data corruption. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] SCRAM auth and Pgpool-II
Hi PostgreSQL hackers, I would like to hear ideas how Pgpool-II can deal with SCRAM auth which will be in PostgreSQL 10. For those who are not familiar with Pgpool-II[1], it is an external OSS project to provide some additional features to PostgreSQL, including load balancing and automatic failover. Pgpool-II works as a proxy between PostgreSQL client and PostgreSQL server(s). When a client wants to connects to PostgreSQL and SCRAM auth is enabled, it sends user name to server. Then the server sends information including a salt to the client. The client computes a "ClientProof" using the salt and other information, and sends it to the server[2]. For Pgpool-II, things would go as follows: 1) clients sends user name to Pgpool-II. 2) Pgpool-II forwards it to PostgreSQL servers. 3) Each PostgreSQL server sends their own salt to Pgpool-II. 4) Pgpool-II is confused because there are multiple salts and each has different values. The client only accepts single salt obviously. So my question is, is there any solution or workaround for the problem #4. Someone at PGCon 2017 suggested that the problem could be avoided if the auth method between the client and Pgpool-II is "trust" (which means no auth). But this does not seem to be a best solution for me because it would weaken the security. I suspect this problem may not be specific to Pgpool-II. Any middle ware which handles multiple PostgreSQL servers could have the similar problem. Any suggestion would be appreciated. Thanks in advance. [1] https://pgpool.net [2] https://tools.ietf.org/html/rfc5802#section-3 -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Suspicious place in heap_prepare_freeze_tuple()
On Thu, Jul 6, 2017 at 1:36 AM, Alvaro Herrera wrote: > Teodor Sigaev wrote: > >> Playing around freezing tuple I found suspicious piece of code: >> >> heap_prepare_freeze_tuple(): >> ... >> frz->t_infomask = tuple->t_infomask; >> ... >> frz->t_infomask &= ~HEAP_XMAX_BITS; >> frz->xmax = newxmax; >> if (flags & FRM_MARK_COMMITTED) >> frz->t_infomask &= HEAP_XMAX_COMMITTED; >> >> Seems, in last line it should be a bitwise OR instead of AND. Now this line >> cleans all bits in t_infomask which later will be copied directly in tuple. > > I think you're right. > I also think that's right. Should we back-patch it down to 9.3? Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: pg_ctl wait exit code (was Re: [COMMITTERS] pgsql: Additional tests for subtransactions in recovery)
On Thu, Jul 6, 2017 at 2:41 AM, Peter Eisentraut wrote: > On 7/2/17 20:28, Michael Paquier wrote: >>> I was going to hold this back for PG11, but since we're now doing some >>> other tweaks in pg_ctl, it might be useful to add this too. Thoughts? >> >> The use of 0 as exit code for the new promote -w if timeout is reached >> looks like an open item to me. Cleaning up the pool queries after >> promotion would be nice to see as well. > > committed Thanks for finishing the cleanup. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Error-like LOG when connecting with SSL for password authentication
On Thu, Jul 6, 2017 at 12:45 AM, Ryan Murphy wrote: > I tried to apply your patch to HEAD and it failed. > But I think that's because some version of it has already been committed, > correct? I see some of your ECONNRESET and "SSL connection has been closed > unexpectedly" code already in HEAD. Thanks Ryan for the reminder. Heikki has pushed a minimum patch set as b93827c7. So I have updated the CF app accordingly. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=
I wrote: > (Pokes at it some more...) Oh, interesting: it behaves that way except > when p is exactly the lowest histogram entry. Here's a revised version that addresses that point and cleans up some other minor details about treatment of cases near the histogram endpoints. I'm still pretty unclear on exactly why it works (see XXX comments in patch), but testing shows that it does work very well indeed given a flat data distribution. On less-flat distributions, you get some estimation errors, but that's neither new nor surprising. regards, tom lane diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml index 333a36c..e9ff110 100644 --- a/doc/src/sgml/xindex.sgml +++ b/doc/src/sgml/xindex.sgml @@ -792,8 +792,7 @@ CREATE OPERATOR < ( It is important to specify the correct commutator and negator operators, as well as suitable restriction and join selectivity functions, otherwise the optimizer will be unable to make effective - use of the index. Note that the less-than, equal, and - greater-than cases should use different selectivity functions. + use of the index. diff --git a/doc/src/sgml/xoper.sgml b/doc/src/sgml/xoper.sgml index 8568e21..d484d80 100644 --- a/doc/src/sgml/xoper.sgml +++ b/doc/src/sgml/xoper.sgml @@ -242,20 +242,11 @@ column OP constant eqsel for = neqsel for <> - scalarltsel for < or <= - scalargtsel for > or >= - -It might seem a little odd that these are the categories, but they -make sense if you think about it. = will typically accept only -a small fraction of the rows in a table; <> will typically reject -only a small fraction. < will accept a fraction that depends on -where the given constant falls in the range of values for that table -column (which, it just so happens, is information collected by -ANALYZE and made available to the selectivity estimator). -<= will accept a slightly larger fraction than < for the same -comparison constant, but they're close enough to not be worth -distinguishing, especially since we're not likely to do better than a -rough guess anyhow. Similar remarks apply to > and >=. + scalarltsel for < + scalarlesel for <= + scalargtsel for > + scalargesel for >= + @@ -267,10 +258,12 @@ column OP constant -You can use scalarltsel and scalargtsel for comparisons on data types that -have some sensible means of being converted into numeric scalars for -range comparisons. If possible, add the data type to those understood -by the function convert_to_scalar() in src/backend/utils/adt/selfuncs.c. +You can use scalarltsel, scalarlesel, +scalargtsel and scalargesel for comparisons on +data types that have some sensible means of being converted into numeric +scalars for range comparisons. If possible, add the data type to those +understood by the function convert_to_scalar() in +src/backend/utils/adt/selfuncs.c. (Eventually, this function should be replaced by per-data-type functions identified through a column of the pg_type system catalog; but that hasn't happened yet.) 
If you do not do this, things will still work, but the optimizer's @@ -310,8 +303,10 @@ table1.column1 OP table2.column2 eqjoinsel for = neqjoinsel for <> - scalarltjoinsel for < or <= - scalargtjoinsel for > or >= + scalarltjoinsel for < + scalarlejoinsel for <= + scalargtjoinsel for > + scalargejoinsel for >= areajoinsel for 2D area-based comparisons positionjoinsel for 2D position-based comparisons contjoinsel for 2D containment-based comparisons diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index 9d34025..b4cbc34 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -71,7 +71,7 @@ static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root, * * We also recognize "range queries", such as "x > 34 AND x < 42". Clauses * are recognized as possible range query components if they are restriction - * opclauses whose operators have scalarltsel() or scalargtsel() as their + * opclauses whose operators have scalarltsel or a related function as their * restriction selectivity estimator. We pair up clauses of this form that * refer to the same variable. An unpairable clause of this kind is simply * multiplied into the selectivity product in the normal way. But when we @@ -92,8 +92,8 @@ static RelOptInfo *find_single_rel_for_clauses(PlannerInfo *root, * A free side-effect is that we can recognize redundant inequalities such * as "x < 4 AND x < 5"; only the tighter constraint will be counted. * - * Of course this is all very dependent on the behavior of - * scalarltsel/scalargtsel; perhaps some day we can generalize the approach. + * Of course this is all very dependent on the behavior
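With the old scalarltsel/scalargtsel pair split into four estimators as the patch above does, an extension defining its own comparison operators would hook them up along these lines (the type and support-function names here are placeholders, not taken from the patch).

    CREATE OPERATOR <= (
        LEFTARG = mytype, RIGHTARG = mytype,
        PROCEDURE = mytype_le,
        COMMUTATOR = >=, NEGATOR = >,
        RESTRICT = scalarlesel,     -- new estimator, distinct from scalarltsel
        JOIN = scalarlejoinsel      -- new estimator, distinct from scalarltjoinsel
    );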
Re: [HACKERS] pgsql 10: hash indexes testing
On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote: > >> > version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | > >> > live_items | dead_items | free_percent > >> > -+--++--+--+++-- > >> >3 | 10485760 |2131192 | 66 |0 | > >> > 2975444240 | 0 | 1065.19942179026 > >> > (1 row) ... > >> > And I do appear to have an odd percentage of free space. :) > > Are you worried about "unused_pages"? If so, then this is not a major Nope. "free_percent" Just a bit weird that I have it at over 1000% free. :) Shouldn't that number be < 100? > reason to worry, because these are probably freed overflow pages which > can be used in future. In the hash index, when we free the overflow > pages, they are not returned back to OS, rather they are tracked in > the index as unused pages which will get used when required in future. > >> It looks like Vacuum hasn't been triggered. > > Vacuum won't be triggered on insert load. I think that is one of the > reasons why in your initial copy, you might have got the error. We > had some discussion in the past to trigger Vacuum on insert heavy > workloads [1], but the patch still didn't get committed. I think if > that patch or some other form of that patch gets committed, it will > help the workload what you are trying here. Well, if this is the cause of my little issue, it might be nice. ATM my import script bombs out on errors (that I've duplicated! :) It took 11 hours but it bombed) and it sounds like I'll need to do a manual VACUUM before it can be run again. The stats you were looking for before are: # select * from pgstathashindex('link_datum_id_idx'); version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items | free_percent -+--++--+--+++-- 3 | 8559258 |4194176 | 128 | 1926502 | 3591812743 | 0 | 942.873199357466 (1 row) # select 4194176.0/128/8; ?column? --- 4095.8750 (1 row) If you need more info or whatnot, shout. I've a problematic index to play with now. > [1] - > https://www.postgresql.org/message-id/b970f20f-f096-2d3a-6c6d-ee887bd30cfb%402ndquadrant.fr AP -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: pg_ctl wait exit code (was Re: [COMMITTERS] pgsql: Additional tests for subtransactions in recovery)
On 7/2/17 20:28, Michael Paquier wrote: >> I was going to hold this back for PG11, but since we're now doing some >> other tweaks in pg_ctl, it might be useful to add this too. Thoughts? > > The use of 0 as exit code for the new promote -w if timeout is reached > looks like an open item to me. Cleaning up the pool queries after > promotion would be nice to see as well. committed -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More race conditions in logical replication
Noah Misch wrote: > The above-described topic is currently a PostgreSQL 10 open item. Peter, > since you committed the patch believed to have created it, you own this open > item. I volunteer to own this item. My next update is going to be on or before Friday 7th at 19:00 Chilean time, though I don't think I can have this ready before then. I expect a fix for this to miss next week's beta release, but I'll try to get it done sooner. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi column range partition table
On 5 July 2017 at 10:43, Amit Langote wrote: > In retrospect, that sounds like something that was implemented in the > earlier versions of the patch, whereby there was no ability to specify > UNBOUNDED on a per-column basis. So the syntax was: > > FROM { (x [, ...]) | UNBOUNDED } TO { (y [, ...]) | UNBOUNDED } > > But, it was pointed out to me [1] that that doesn't address the use case, > for example, where part1 goes up to (10, 10) and part2 goes from (10, 10) > up to (10, unbounded). > [Reading that other thread] It's a reasonable point that our syntax is quite different from Oracle's, and doing this takes it even further away, and removes support for things that they do support. For the record, Oracle allows things like the following: DROP TABLE t1; CREATE TABLE t1 (a NUMBER, b NUMBER, c NUMBER) PARTITION BY RANGE (a,b,c) (PARTITION t1p1 VALUES LESS THAN (1,2,3), PARTITION t1p2 VALUES LESS THAN (2,3,4), PARTITION t1p3 VALUES LESS THAN (3,MAXVALUE,5), PARTITION t1p4 VALUES LESS THAN (4,MAXVALUE,6) ); INSERT INTO t1 VALUES(1,2,3); INSERT INTO t1 VALUES(2,3,4); INSERT INTO t1 VALUES(3,4,5); INSERT INTO t1 VALUES(3.01,4,5); INSERT INTO t1 VALUES(4,5,10); COLUMN subobject_name FORMAT a20; SELECT a, b, c, subobject_name FROM t1, user_objects o WHERE o.data_object_id = dbms_rowid.rowid_object(t1.ROWID) ORDER BY a,b,c; A B C SUBOBJECT_NAME -- -- -- 1 2 3 T1P2 2 3 4 T1P3 3 4 5 T1P3 3.01 4 5 T1P4 4 5 10 T1P4 So they use MAXVALUE instead of UNBOUNDED for an upper bound, which is more explicit. They don't have an equivalent MINVALUE, but it's arguably not necessary, since the first partition's lower bound is implicitly unbounded. With this syntax they don't need to worry about gaps or overlaps between partitions, which is nice, but arguably less flexible. They're also more lax about allowing finite values after MAXVALUE, and they document the fact that any value after a MAXVALUE is ignored. I don't think their scheme provides any way to define a partition of the above table that would hold all rows for which a < some value. So if we were to go for maximum flexibility and compatibility with Oracle, then perhaps what we would do is more like the original idea of UNBOUNDED ABOVE/BELOW, except call them MINVALUE and MAXVALUE, which conveniently are already unreserved keywords, as well as being much shorter. Plus, we would also relax the constraint about having finite values after MINVALUE/MAXVALUE. I think I'll go play around with that idea to see what it looks like in practice. Your previous patch already does much of that, and is far less invasive. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
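Translated into the PostgreSQL syntax being floated here, the Oracle example above would look roughly like the following if MINVALUE/MAXVALUE keywords were adopted; this is a sketch of the proposal only, not syntax that existed at the time of this thread.

    CREATE TABLE t1 (a numeric, b numeric, c numeric)
        PARTITION BY RANGE (a, b, c);

    -- implicitly unbounded below, like Oracle's first partition:
    CREATE TABLE t1p1 PARTITION OF t1
        FOR VALUES FROM (MINVALUE, MINVALUE, MINVALUE) TO (1, 2, 3);

    CREATE TABLE t1p3 PARTITION OF t1
        FOR VALUES FROM (2, 3, 4) TO (3, MAXVALUE, MAXVALUE);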
Re: [HACKERS] Suspicious place in heap_prepare_freeze_tuple()
Teodor Sigaev wrote: > Playing around freezing tuple I found suspicious piece of code: > > heap_prepare_freeze_tuple(): > ... > frz->t_infomask = tuple->t_infomask; > ... > frz->t_infomask &= ~HEAP_XMAX_BITS; > frz->xmax = newxmax; > if (flags & FRM_MARK_COMMITTED) > frz->t_infomask &= HEAP_XMAX_COMMITTED; > > Seems, in last line it should be a bitwise OR instead of AND. Now this line > cleans all bits in t_infomask which later will be copied directly in tuple. I think you're right. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Suspicious place in heap_prepare_freeze_tuple()
Hi! Playing around freezing tuple I found suspicious piece of code: heap_prepare_freeze_tuple(): ... frz->t_infomask = tuple->t_infomask; ... frz->t_infomask &= ~HEAP_XMAX_BITS; frz->xmax = newxmax; if (flags & FRM_MARK_COMMITTED) frz->t_infomask &= HEAP_XMAX_COMMITTED; Seems, in last line it should be a bitwise OR instead of AND. Now this line cleans all bits in t_infomask which later will be copied directly in tuple. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 8deb344d09..ec227bac80 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -6639,7 +6639,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid, frz->t_infomask &= ~HEAP_XMAX_BITS; frz->xmax = newxmax; if (flags & FRM_MARK_COMMITTED) -frz->t_infomask &= HEAP_XMAX_COMMITTED; +frz->t_infomask |= HEAP_XMAX_COMMITTED; changed = true; totally_frozen = false; } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
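The difference is easy to demonstrate with made-up flag values (illustrative numbers only, not the real infomask constants): AND-ing with the flag clears every other bit (and here does not even set the flag), while OR-ing sets the flag and preserves the rest.

    -- "infomask" 0x0193, "committed" flag 0x0400 (values invented for the demo)
    SELECT to_hex(x'0193'::int & x'0400'::int) AS broken_and,  -- 0: flag lost along with everything else
           to_hex(x'0193'::int | x'0400'::int) AS correct_or;  -- 593: flag set, existing bits kept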
Re: [HACKERS] Extra Vietnamese unaccent rules
On 2017/07/05 15:28, Michael Paquier wrote: I have finally been able to look at this patch. Thanks for reviewing and the new version of the patch. (Surprised to see that generate_unaccent_rules.py is inconsistent on MacOS, runs fine on Linux). def get_plain_letter(codepoint, table): """Return the base codepoint without marks.""" if is_letter_with_marks(codepoint, table): -return table[codepoint.combining_ids[0]] +if len(table[codepoint.combining_ids[0]].combining_ids) > 1: +# Recursive to find the plain letter +return get_plain_letter(table[codepoint.combining_ids[0]],table) +elif is_plain_letter(table[codepoint.combining_ids[0]]): +return table[codepoint.combining_ids[0]] +else: +return None elif is_plain_letter(codepoint): return codepoint else: -raise "mu" +return None The code paths returning None should not be reached, so I would suggest adding an assertion instead. Callers of get_plain_letter would blow up on None, still that would make future debugging harder. def is_letter_with_marks(codepoint, table): -"""Returns true for plain letters combined with one or more marks.""" +"""Returns true for letters combined with one or more marks.""" # See http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values return len(codepoint.combining_ids) > 1 and \ - is_plain_letter(table[codepoint.combining_ids[0]]) and \ + (is_plain_letter(table[codepoint.combining_ids[0]]) or\ +is_letter_with_marks(table[codepoint.combining_ids[0]],table)) and \ all(is_mark(table[i]) for i in codepoint.combining_ids[1:] This was already hard to follow, and this patch makes its harder. I think that the thing should be refactored with multiple conditions. if is_letter_with_marks(codepoint, table): -charactersSet.add((codepoint.id, +if get_plain_letter(codepoint, table) <> None: +charactersSet.add((codepoint.id, This change is not necessary as a letter with marks is not a plain character anyway. Testing with characters having two accents, the results are produced as wanted. I am attaching an updated patch with all those simplifications. Thoughts? Thanks, so pretty. The patch is fine to me. --- Thanks and best regards, Dang Minh Huong --- このEメールはアバスト アンチウイルスによりウイルススキャンされています。 https://www.avast.com/antivirus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
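For reference, the behaviour the recursion change is after can be checked from SQL once unaccent.rules is regenerated with the patched script; the expectation (not verified here) is that Vietnamese letters carrying two combining marks now reduce to their plain base letter.

    CREATE EXTENSION IF NOT EXISTS unaccent;
    -- U+1EBF is 'e' with both a circumflex and an acute accent; with the new
    -- rules it should map to the plain letter 'e'.
    SELECT unaccent('ế');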
[HACKERS] Re: Error-like LOG when connecting with SSL for password authentication
Hi Michael, I tried to apply your patch to HEAD and it failed. But I think that's because some version of it has already been committed, correct? I see some of your ECONNRESET and "SSL connection has been closed unexpectedly" code already in HEAD. Best, Ryan The new status of this patch is: Waiting on Author -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] outfuncs.c utility statement support
On 6/21/17 23:40, Tom Lane wrote: > Peter Eisentraut writes: >> On 6/18/17 10:14, Tom Lane wrote: >>> Doesn't cope with backslash-quoted characters. If we're going to bother >>> to do anything here, I think we ought to make it reversible for all >>> possible characters. > >> Makes sense. Updated patch attached. > > That looks sane to me, though I've still not actually tested any > interesting cases. I have committed this. I had to build a bunch more stuff around it to be able to test it, which I will tidy up and submit at some later point. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi column range partition table
On 5 July 2017 at 10:43, Amit Langote wrote: >> So the more I think about this, the more I think that a cleaner design >> would be as follows: >> >> 1). Don't allow UNBOUNDED, except in the first column, where it can >> keep it's current meaning. >> >> 2). Allow the partition bounds to have fewer columns than the >> partition definition, and have that mean the same as it would have >> meant if you were partitioning by that many columns. So, for >> example, if you were partitioning by (col1,col2), you'd be allowed >> to define a partition like so: >> >> FROM (x) TO (y) >> >> and it would mean >> >> x <= col1 < y >> >> Or you'd be able to define a partition like >> >> FROM (x1,x2) TO (y) >> >> which would mean >> >> (col1 > x1) OR (col1 = x1 AND col2 >= x2) AND col1 < y >> >> 3). Don't allow any value after UNBOUNDED (i.e., only specify >> UNBOUNDED once in a partition bound). > > I assume we don't need the ability of specifying ABOVE/BELOW in this design. > Yes that's right. > In retrospect, that sounds like something that was implemented in the > earlier versions of the patch, whereby there was no ability to specify > UNBOUNDED on a per-column basis. So the syntax was: > > FROM { (x [, ...]) | UNBOUNDED } TO { (y [, ...]) | UNBOUNDED } > Yes, that's where I ended up too. > But, it was pointed out to me [1] that that doesn't address the use case, > for example, where part1 goes up to (10, 10) and part2 goes from (10, 10) > up to (10, unbounded). > > The new design will limit the usage of unbounded range partitions at the > tail ends. > True, but I don't think that's really a problem. When the first column is a discrete type, an upper bound of (10, unbounded) can be rewritten as (11) in the new design. When it's a continuous type, e.g. floating point, it can no longer be represented, because (10.0, unbounded) really means (col1 <= 10.0). But we've already decided not to support anything other than inclusive lower bounds and exclusive upper bounds, so allowing this upper bound goes against that design choice. >> Of course, it's pretty late in the day to be proposing this kind of >> redesign, but I fear that if we don't tackle it now, it will just be >> harder to deal with in the future. >> >> Actually, a quick, simple hacky implementation might be to just fill >> in any omitted values in a partition bound with negative infinity >> internally, and when printing a bound, omit any values after an >> infinite value. But really, I think we'd want to tidy up the >> implementation, and I think a number of things would actually get much >> simpler. For example, get_qual_for_range() could simply stop when it >> reached the end of the list of values for the bound, and it wouldn't >> need to worry about an unbounded value following a bounded one. >> >> Thoughts? > > I cooked up a patch for the "hacky" implementation for now, just as you > described in the above paragraph. Will you be willing to give it a look? > I will also think about the non-hacky way of implementing this. > OK, I'll take a look. Meanwhile, I already had a go at the "non-hacky" implementation (WIP patch attached). The more I worked on it, the simpler things got, which I think is a good sign. 
Part-way through, I realised that the PartitionRangeDatum Node type is no longer needed, because each bound value is now necessarily finite, so the lowerdatums and upperdatums lists in a PartitionBoundSpec can now be made into lists of Const nodes, making them match the listdatums field used for LIST partitioning, and then a whole lot of related code gets simplified. It needed a little bit more code in partition.c to track individual bound sizes, but there were a number of other places that could be simplified, so overall this represents a reduction in the code size and complexity. It's not complete (e.g., no doc updates yet), but it passes all the tests, and so far seems to work as I would expect. Regards, Dean diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c new file mode 100644 index 7da2058..aade9f5 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -66,23 +66,26 @@ * is an upper bound. */ -/* Ternary value to represent what's contained in a range bound datum */ -typedef enum RangeDatumContent +/* Type of an individual bound of a range partition */ +typedef enum RangeBoundKind { - RANGE_DATUM_FINITE = 0, /* actual datum stored elsewhere */ - RANGE_DATUM_NEG_INF, /* negative infinity */ - RANGE_DATUM_POS_INF /* positive infinity */ -} RangeDatumContent; + RANGE_BOUND_FINITE = 0, /* actual bound stored in the datums array */ + RANGE_BOUND_NEG_INF, /* negative infinity; NULL datums array */ + RANGE_BOUND_POS_INF /* positive infinity; NULL datums array */ +} RangeBoundKind; typedef struct PartitionBoundInfoData { char strategy; /* list or range bounds? */ int ndatums; /* Length of the datums following a
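To make the proposed semantics concrete: a bound may list fewer columns than the partition key, and comparison simply stops at the end of the shorter bound. A sketch with an invented table, using the syntax proposed in this thread rather than anything committed:

    CREATE TABLE tbl (a int, b int) PARTITION BY RANGE (a, b);

    -- FROM (1) TO (10) means: 1 <= a < 10, for any b
    CREATE TABLE tbl_p1 PARTITION OF tbl FOR VALUES FROM (1) TO (10);

    -- FROM (10, 100) TO (20) means: ((a > 10) OR (a = 10 AND b >= 100)) AND a < 20
    CREATE TABLE tbl_p2 PARTITION OF tbl FOR VALUES FROM (10, 100) TO (20);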
Re: [HACKERS] Request more documentation for incompatibility of parallelism and plpgsql exec_run_select
> On Jul 5, 2017, at 5:30 AM, Amit Kapila wrote: > > On Wed, Jul 5, 2017 at 9:44 AM, Mark Dilger wrote: >> >>> On Jul 3, 2017, at 10:25 PM, Amit Kapila wrote: >>> >>> On Mon, Jul 3, 2017 at 8:57 PM, Simon Riggs wrote: On 30 June 2017 at 05:14, Amit Kapila wrote: > This is explained in section 15.2 [1], refer below para: > "The query might be suspended during execution. In any situation in > which the system thinks that partial or incremental execution might > occur, no parallel plan is generated. For example, a cursor created > using DECLARE CURSOR will never use a parallel plan. Similarly, a > PL/pgSQL loop of the form FOR x IN query LOOP .. END LOOP will never > use a parallel plan, because the parallel query system is unable to > verify that the code in the loop is safe to execute while parallel > query is active." Can you explain "unable to verify that the code in the loop is safe to execute while parallel query is active". Surely we aren't pushing code in the loop into the actual query, so the safety of command in the FOR loop has nothing to do with the parallel safety of the query. Please give an example of something that would be unsafe? Is that documented anywhere, README etc? FOR x IN query LOOP .. END LOOP seems like a case that would be just fine, since we're going to loop thru every row or break early. >>> >>> It is not fine because we don't support partial execution support. In >>> above case, if the loop breaks, we can't break parallel query >>> execution. Now, I don't think it will impossible to support the same, >>> but as of now, parallel subsystem doesn't have such a support. >> >> I can understand this, but wonder if I could use something like >> >> FOR I TOTALLY PROMISE TO USE ALL ROWS rec IN EXECUTE sql LOOP >> ... >> END LOOP; >> >> if I hacked the grammar up a bit. Would the problem go away, or would >> I still have problems when exceptions beyond my control get thrown inside >> the loop? >> > > I don't think it is just a matter of hacking grammar, internally we > are using cursor fetch to fetch the rows and there we are passing some > fixed number of rows to fetch which again is a killer to invoke the > parallel query. Is the risk that a RETURN will be executed within the loop block the only risk that is causing parallel query to be disabled in these plpgsql loops? If so, it seems that a plpgsql language syntax which created a loop that disabled RETURN from within the loop body would solve the problem. I do not propose any such language extension. It was just a thought experiment. Is there a conditional somewhere in the source that says, in effect, "if this is a plpgsql loop, then disable parallel query"? If so, and if I removed that check such that parallel query would run in plpgsql loops, would I only get breakage when I returned out of a loop? Or would there be other situations where that would break? Thanks, mark -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
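For readers following along, the shape of code the quoted documentation paragraph refers to is simply the following (names invented); because the loop may exit early, the driving query is fetched through a cursor in batches and is never given a parallel plan.

    CREATE OR REPLACE FUNCTION scan_some(limit_n bigint) RETURNS bigint
    LANGUAGE plpgsql AS $$
    DECLARE
        rec record;
        n   bigint := 0;
    BEGIN
        FOR rec IN SELECT * FROM big_table LOOP  -- fetched via a cursor: no parallel plan
            n := n + 1;
            EXIT WHEN n >= limit_n;              -- partial execution is possible here
        END LOOP;
        RETURN n;
    END;
    $$;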
Re: [HACKERS] pgsql 10: hash indexes testing
On Tue, Jul 04, 2017 at 08:23:20PM -0700, Jeff Janes wrote: > On Tue, Jul 4, 2017 at 3:57 AM, AP wrote: > > The data being indexed is BYTEA, (quasi)random and 64 bytes in size. > > The table has over 2 billion entries. The data is not unique. There's > > an average of 10 duplicates for every unique value. > > What is the number of duplicates for the most common value? Damn. Was going to collect this info as I was doing a fresh upload but it fell through the cracks of my mind. It'll probably take at least half a day to collect (a simple count(*) on the table takes 1.5-1.75 hours parallelised across 11 processes) so I'll probably have this in around 24 hours if all goes well. (and I don't stuff up the SQL :) ) AP. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
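A hedged sketch of one way to get that number, assuming the indexed BYTEA column is called datum (the actual column name is not given in the thread); on a table of this size it will of course be another long-running scan:

SELECT datum, count(*) AS duplicates
FROM link
GROUP BY datum
ORDER BY duplicates DESC
LIMIT 1;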
Re: [HACKERS] pgsql 10: hash indexes testing
On Wed, Jul 05, 2017 at 03:33:45PM +1000, AP wrote: > > Do you have any deletes? How have you verified whether autovacuum has > > No DELETEs. Just the initial COPY, then SELECTs, then a DB rename to get it > out of the way of other testing, then the REINDEX. > > > been triggered or not? > > I just checked pg_stat_user_tables (which I hope is the right place for > this info :) > >relid | schemaname | relname | seq_scan | seq_tup_read | idx_scan | > idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | > n_live_tup | n_dead_tup | n_mod_since_analyze | last_vacuum | last_autovacuum > | last_analyze | last_autoanalyze| > vacuum_count | autovacuum_count | analyze_count | autoanalyze_count > ---++-+--+--+--+---++---+---+---+++-+-+-+---+---+--+--+---+--- > 129311803 | public | link| 70 | 15085880072 | 5779 | > 465623 | 2975444240 | 0 | 0 | 0 | 928658178 | > 0 | 0 | | | > | 2017-06-28 10:43:51.273241+10 |0 | > 0 | 0 | 2 > > So it appears not. Actually, after a bit more late-arvo thought, I take it this would be the case as the table has not changed since creation. Thus no need to autovac. I've newer timestamps on the live db (whose data was uploaded more recently) for its tables so I think autovac is functioning. AP -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
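For completeness, a sketch of the kind of cross-check being described, using only the view and columns already shown above:

SELECT relname, last_autovacuum, last_autoanalyze, autovacuum_count, autoanalyze_count
FROM pg_stat_user_tables
ORDER BY GREATEST(last_autovacuum, last_autoanalyze) DESC NULLS LAST;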
Re: [HACKERS] pgsql 10: hash indexes testing
On Wed, Jul 05, 2017 at 10:29:09AM +0530, Amit Kapila wrote: > >> bitmappages. Can you try to use pgstattuple extension and get us the > >> results of Select * from pgstathashindex('index_name');? If the > >> number of bitmappages is 128 and total overflow pages are 128 * 4096, > >> then that would mean that all the pages are used. Then maybe we can > > > > Hmm. Unless I misunderstood that'd mean that overflow_pages/4096 should > > result in a number <= 128 at the moment, right? > > No, sorry, I think my calculation above has something missing. It > should be 128 * 4096 * 8. How we can compute this number is > no_bitmap_pages * no_bits_used_to_represent_overflow_pages. AHA! Ok. Then that appears to match. I get 65.041. > >If so then something is > > amiss: > > > > # select * from pgstathashindex('link_datum_id_hash_idx'); > > version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | > > live_items | dead_items | free_percent > > -+--++--+--+++-- > >3 | 10485760 |2131192 | 66 |0 | > > 2975444240 | 0 | 1065.19942179026 > > (1 row) > > > > oldmdstash=# select 2131192/4096; > > ?column? > > -- > > 520 > > (1 row) > > You need to divide 520 by 8 to get the bitmap page. Is this the index > in which you get the error or is this the one on which you have done > REINDEX? Post REINDEX. > > And I do appear to have an odd percentage of free space. :) > > > > It looks like Vacuum hasn't been triggered. :( > > This index was created yesterday so it has been around for maybe 18 hours. > > Autovac is likely to have hit it by now. > > Do you have any deletes? How have you verified whether autovacuum has No DELETEs. Just the initial COPY, then SELECTs, then a DB rename to get it out of the way of other testing, then the REINDEX. > been triggered or not? I just checked pg_stat_user_tables (which I hope is the right place for this info :) relid | schemaname | relname | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd | n_live_tup | n_dead_tup | n_mod_since_analyze | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze| vacuum_count | autovacuum_count | analyze_count | autoanalyze_count ---++-+--+--+--+---++---+---+---+++-+-+-+---+---+--+--+---+--- 129311803 | public | link| 70 | 15085880072 | 5779 | 465623 | 2975444240 | 0 | 0 | 0 | 928658178 | 0 | 0 | | | | 2017-06-28 10:43:51.273241+10 |0 |0 | 0 | 2 So it appears not. # show autovacuum; autovacuum on (1 row) All autovacuum parameters are as per default. The autovacuum launcher process exists. :( AP -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
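Restating the arithmetic from this exchange as a query (index name as quoted above; 4096 * 8 is the bits-per-bitmap-page approximation used in the thread):

SELECT overflow_pages / (4096 * 8.0) AS bitmap_pages_needed,
       bitmap_pages
FROM pgstathashindex('link_datum_id_hash_idx');
-- 2131192 / 32768 is roughly 65.04, which fits within the 66 bitmap pages reported.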
Re: [HACKERS] Proposal : For Auto-Prewarm.
On Mon, Jul 3, 2017 at 3:34 PM, Amit Kapila wrote: > > Few comments on the latest patch: > > 1. > + LWLockRelease(&apw_state->lock); > + if (!is_bgworker) > + ereport(ERROR, > + (errmsg("could not perform block dump because dump file is being > used by PID %d", > + apw_state->pid_using_dumpfile))); > + ereport(LOG, > + (errmsg("skipping block dump because it is already being performed by PID > %d", > + apw_state->pid_using_dumpfile))); > > > The above code looks confusing as both the messages are saying the > same thing in different words. I think you keep one message (probably > the first one) and decide error level based on if this is invoked for > bgworker. Also, you can move LWLockRelease after error message, > because if there is any error, then it will automatically release all > lwlocks. ERROR is used for autoprewarm_dump_now which is called from the backend. LOG is used for bgworker. wordings used are to match the context if failing to dump is acceptable or not. In the case of bgworker, it is acceptable we are not particular about the start time of dump but the only interval between the dumps. So if already somebody doing it is acceptable. But one who calls autoprewarm_dump_now might be particular about the start time of dump so we throw error making him retry same. The wording's are suggested by Robert(below snapshot) in one of his previous comments and I also agree with it. If you think I should reconsider this and I am missing something I am open to suggestions. On Wed, May 31, 2017 at 10:18 PM, Robert Haas wrote: +If we go to perform +an immediate dump process and finds a non-zero value already just does +ereport(ERROR, ...), including the PID of the other process in the +message (e.g. "unable to perform block dump because dump file is being +used by PID %d"). In a background worker, if we go to dump and find +the file in use, log a message (e.g. "skipping block dump because it +is already being performed by PID %d", "skipping prewarm because block +dump file is being rewritten by PID %d"). Thanks moved the LWLockRelease after ERROR call. > 2. > +autoprewarm_dump_now(PG_FUNCTION_ARGS) > +{ > + uint32 num_blocks = 0; > + > .. > + PG_RETURN_INT64(num_blocks); > .. > } > > Is there any reason for using PG_RETURN_INT64 instead of PG_RETURN_UINT32? Return type autoprewarm_dump_now() is pg_catalog.int8 to accommodate uint32 so I used PG_RETURN_INT64. I think PG_RETURN_UINT32 can be used as well I have replaced now. > 3. > +dump_now(bool is_bgworker) > { > .. > + if (buf_state & BM_TAG_VALID) > + { > + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode; > + block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode; > + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode; > + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum; > + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum; > + ++num_blocks; > + } > .. > } >I think there is no use of writing Unlogged buffers unless the dump is >for the shutdown. You might want to use BufferIsPermanent to detect the same. -- I do not think that is true pages of the unlogged table are also read into buffers for read-only purpose. So if we miss to dump them while we shut down then the previous dump should be used. > 4. > +static uint32 > +dump_now(bool is_bgworker) > { > .. > + for (num_blocks = 0, i = 0; i < NBuffers; i++) > + { > + uint32 buf_state; > + > + if (!is_bgworker) > + CHECK_FOR_INTERRUPTS(); > .. > } > > Why checking for interrupts is only for non-bgwroker cases? 
-- autoprewarm_dump_now is directly called from the backend. In such case, we have to handle signals registered for backend in dump_now(). For bgworker dump_block_info_periodically caller of dump_now() handles SIGTERM, SIGUSR1 which we are interested in. > 5. > + * Each entry of BlockRecordInfo consists of database, tablespace, filenode, > + * forknum, blocknum. Note that this is in the text form so that the dump > + * information is readable and can be edited, if required. > + */ > > In the above comment, you are saying that the dump file is in text > form whereas in the code you are using binary form. I think code > should match comments. Is there a reason of preferring binary over > text or vice versa? -- Previously I used the write() on file descriptor. Sorry I should have changed the mode of opening to text mode when I moved the code to use AllocateFile Sorry fixed same now. > 6. > +dump_now(bool is_bgworker) > { > .. > + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR); > + apw_state->pid_using_dumpfile = InvalidPid; > .. > } > > How will pid_using_dumpfile be set to InvalidPid in the case of error > for non-bgworker cases? -- I have a try() - catch() in autoprewarm_dump_now I think that is okay. > > 7. > +dump_now(bool is_bgworker) > { > .. > + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR); > > .. > } > > How will transient_dump_file_path be unlinked in the
Re: [HACKERS] Proposal : For Auto-Prewarm.
On Mon, Jul 3, 2017 at 11:58 AM, Amit Kapila wrote: > On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy wrote: >> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy >> wrote: >>> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown wrote: Also, I find it a bit messy that launch_autoprewarm_dump() doesn't detect an autoprewarm process already running. I'd want this to return NULL or an error if called for a 2nd time. >>> >>> We log instead of error as we try to check only after launching the >>> worker and inside worker. One solution could be as similar to >>> autoprewam_dump_now(), the autoprewarm_start_worker() can init shared >>> memory and check if we can launch worker in backend itself. I will try >>> to fix same. >> >> I have fixed it now as follows >> >> +Datum >> +autoprewarm_start_worker(PG_FUNCTION_ARGS) >> +{ >> + pid_t pid; >> + >> + init_apw_shmem(); >> + pid = apw_state->bgworker_pid; >> + if (pid != InvalidPid) >> + ereport(ERROR, >> + (errmsg("autoprewarm worker is already running under PID %d", >> + pid))); >> + >> + autoprewarm_dump_launcher(); >> + PG_RETURN_VOID(); >> +} >> >> In backend itself, we shall check if an autoprewarm worker is running >> then only start the server. There is a possibility if this function is >> executed concurrently when there is no worker already running (Which I >> think is not a normal usage) then both call will say it has >> successfully launched the worker even though only one could have >> successfully done that (other will log and silently die). > > Why can't we close this remaining race condition? Basically, if we > just perform all of the actions in this function under the lock and > autoprewarm_dump_launcher waits till the autoprewarm worker has > initialized the bgworker_pid, then there won't be any remaining race > condition. I think if we don't close this race condition, it will be > unpredictable whether the user will get the error or there will be > only a server log for the same. Yes, I can make autoprewarm_dump_launcher to wait until the launched bgworker set its pid, but this requires one more synchronization variable between launcher and worker. More than that I see ShmemInitStruct(), LWLockAcquire can throw ERROR (restarts the worker), which needs to be called before setting pid. So I thought it won't be harmful let two concurrent calls to launch workers and we just log failures. Please let me know if I need to rethink about it. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
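To make the behaviour under discussion concrete, a hedged sketch based on the code quoted above (function name as in the patch under review):

SELECT autoprewarm_start_worker();  -- launches the background worker
SELECT autoprewarm_start_worker();  -- once the worker has registered its PID, this raises
                                    -- "autoprewarm worker is already running under PID ..."
-- The remaining race is two such calls arriving before either launched worker
-- has set bgworker_pid: both report success, but only one worker survives and
-- the other failure is only logged.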
Re: [HACKERS] Proposal : For Auto-Prewarm.
On Mon, Jul 3, 2017 at 3:55 PM, Amit Kapila wrote: > On Fri, Jun 23, 2017 at 3:22 AM, Robert Haas wrote: >> On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy >> wrote: >> >> * Instead of creating our own buffering system via buffer_file_write() >> and buffer_file_flush(), why not just use the facilities provided by >> the operating system? fopen() et. al. provide buffering, and we have >> AllocateFile() to provide a FILE *; it's just like >> OpenTransientFile(), which you are using, but you'll get the buffering >> stuff for free. Maybe there's some reason why this won't work out >> nicely, but off-hand it seems like it might. It looks like you are >> already using AllocateFile() to read the dump, so using it to write >> the dump as well seems like it would be logical. >> > > One thing that is worth considering is AllocateFile is recommended to > be used for short operations. Refer text atop AllocateFile(). If the > number of blocks to be dumped is large, then the file can remain open > for the significant period of time. -- Agreed. I could use a suggestion here: we will hold on to one fd, but I am not sure how long the file would need to stay open to count as a case against using AllocateFile(). -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition : Append node over a single SeqScan
Hi, On Wed, Jul 5, 2017 at 3:58 PM, Ashutosh Bapat wrote: > On Wed, Jul 5, 2017 at 3:48 PM, Ashutosh Sharma wrote: >> Hi All, >> >> Today while exploring a bit on Range table partitioning, I could see >> that even if scan is performed on a single partition, the plan node >> has Append node in it. Isn't it a bug? > > No. See following comment from create_append_plan() > 1045 /* > 1046 * XXX ideally, if there's just one child, we'd not bother to > generate an > 1047 * Append node but just return the single child. At the > moment this does > 1048 * not work because the varno of the child scan plan won't match the > 1049 * parent-rel Vars it'll be asked to emit. > 1050 */ > Thanks for the information. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core
On 3 Jul. 2017 23:01, "K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> wrote: Hi Craig, Thanks for the response. The scenario tried here is restarting the system multiple times. The sh-QUIT core is generated when Postgres invokes the shell to exit, and may not be due to kernel or file system issues. I will try to reproduce the issue with dmesg output being printed. However, is there any instance in Postgres where 'sh -c exit 1' will be invoked? Most likely it's used directly or indirectly by an archive_command or restore_command you have configured.
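As a hedged illustration of where such a shell invocation can come from (paths hypothetical, the command itself borrowed from the documentation's archiving example), a setting like this is handed to the shell by the archiver, so a failing command surfaces as a shell child exiting non-zero:

ALTER SYSTEM SET archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f';
SELECT pg_reload_conf();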
Re: [HACKERS] [POC] hash partitioning
On Wed, Jul 5, 2017 at 4:50 PM, Dilip Kumar wrote: > On Mon, Jul 3, 2017 at 4:39 PM, amul sul wrote: >> Thanks to catching this, fixed in the attached version. > > Few comments on the latest version. > Thanks for your review, please find my comment inline: > 0001 looks fine, for 0002 I have some comments. > > 1. > + hbounds = (PartitionHashBound * *) palloc(nparts * > + sizeof(PartitionHashBound *)); > > /s/(PartitionHashBound * *)/(PartitionHashBound **)/g > Fixed in the attached version. > 2. > RelationBuildPartitionDesc > { > > > > * catalog scan that retrieved them, whereas that in the latter is > * defined by canonicalized representation of the list values or the > * range bounds. > */ > for (i = 0; i < nparts; i++) > result->oids[mapping[i]] = oids[i]; > > Should this comments mention about hash as well? > Instead, I have generalised this comment in the attached patch > 3. > > if (b1->datums[b1->ndatums - 1][0] != b2->datums[b2->ndatums - 1][0]) > return false; > > if (b1->ndatums != b2->ndatums) > return false; > > If ndatums itself is different then no need to access datum memory, so > better to check ndatum first. > You are correct, we already doing this in the partition_bounds_equal(). This is a redundant code, removed in the attached version. > 4. > + * next larger modulus. For example, if you have a bunch > + * of partitions that all have modulus 5, you can add a > + * new new partition with modulus 10 or a new partition > > Typo, "new new partition" -> "new partition" > Fixed in the attached version. Regards, Amul 0001-Cleanup_v6.patch Description: Binary data 0002-hash-partitioning_another_design-v15.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
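For readers following along, a hedged sketch of the modulus rule described in that comment, written in the syntax proposed by this patch series (table and partition names are made up, and the syntax may still change before commit): four partitions with modulus 5 cover remainders 0 through 3, and the remaining slice of the hash space is split using the next larger modulus, 10, which 5 divides evenly.

CREATE TABLE h (a int) PARTITION BY HASH (a);
CREATE TABLE h_m5_r0 PARTITION OF h FOR VALUES WITH (MODULUS 5, REMAINDER 0);
CREATE TABLE h_m5_r1 PARTITION OF h FOR VALUES WITH (MODULUS 5, REMAINDER 1);
CREATE TABLE h_m5_r2 PARTITION OF h FOR VALUES WITH (MODULUS 5, REMAINDER 2);
CREATE TABLE h_m5_r3 PARTITION OF h FOR VALUES WITH (MODULUS 5, REMAINDER 3);
-- remainder 4 of modulus 5 corresponds exactly to remainders 4 and 9 of modulus 10
CREATE TABLE h_m10_r4 PARTITION OF h FOR VALUES WITH (MODULUS 10, REMAINDER 4);
CREATE TABLE h_m10_r9 PARTITION OF h FOR VALUES WITH (MODULUS 10, REMAINDER 9);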
Re: [HACKERS] Request more documentation for incompatibility of parallelism and plpgsql exec_run_select
On Wed, Jul 5, 2017 at 9:44 AM, Mark Dilger wrote: > >> On Jul 3, 2017, at 10:25 PM, Amit Kapila wrote: >> >> On Mon, Jul 3, 2017 at 8:57 PM, Simon Riggs wrote: >>> On 30 June 2017 at 05:14, Amit Kapila wrote: >>> This is explained in section 15.2 [1], refer below para: "The query might be suspended during execution. In any situation in which the system thinks that partial or incremental execution might occur, no parallel plan is generated. For example, a cursor created using DECLARE CURSOR will never use a parallel plan. Similarly, a PL/pgSQL loop of the form FOR x IN query LOOP .. END LOOP will never use a parallel plan, because the parallel query system is unable to verify that the code in the loop is safe to execute while parallel query is active." >>> >>> Can you explain "unable to verify that the code in the loop is safe to >>> execute while parallel query is active". Surely we aren't pushing code >>> in the loop into the actual query, so the safety of command in the FOR >>> loop has nothing to do with the parallel safety of the query. >>> >>> Please give an example of something that would be unsafe? Is that >>> documented anywhere, README etc? >>> >>> FOR x IN query LOOP .. END LOOP >>> seems like a case that would be just fine, since we're going to loop >>> thru every row or break early. >>> >> >> It is not fine because we don't support partial execution support. In >> above case, if the loop breaks, we can't break parallel query >> execution. Now, I don't think it will impossible to support the same, >> but as of now, parallel subsystem doesn't have such a support. > > I can understand this, but wonder if I could use something like > > FOR I TOTALLY PROMISE TO USE ALL ROWS rec IN EXECUTE sql LOOP > ... > END LOOP; > > if I hacked the grammar up a bit. Would the problem go away, or would > I still have problems when exceptions beyond my control get thrown inside > the loop? > I don't think it is just a matter of hacking grammar, internally we are using cursor fetch to fetch the rows and there we are passing some fixed number of rows to fetch which again is a killer to invoke the parallel query. > And if exception handling is a problem in the loop, are exceptions > somehow not a problem in other parallel queries? > I don't see exceptions as a blocking factor to choose any parallel query inside PL. However, if you have something in mind, feel free to share with some example? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
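The same limitation can be seen with an explicit cursor in plain SQL (table and column names hypothetical); the documentation passage quoted above already notes that DECLARE CURSOR never uses a parallel plan, and a PL/pgSQL FOR-over-query loop goes through the same fetch-a-few-rows-at-a-time interface:

BEGIN;
DECLARE c CURSOR FOR SELECT * FROM big_table WHERE a % 97 = 0;
FETCH 50 FROM c;   -- only part of the result set is requested
CLOSE c;
COMMIT;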
Re: [HACKERS] pgsql 10: hash indexes testing
On Wed, Jul 5, 2017 at 11:03 AM, AP wrote: > On Wed, Jul 05, 2017 at 10:29:09AM +0530, Amit Kapila wrote: >> >> bitmappages. Can you try to use pgstattuple extension and get us the >> >> results of Select * from pgstathashindex('index_name');? If the >> >> number of bitmappages is 128 and total overflow pages are 128 * 4096, >> >> then that would mean that all the pages are used. Then maybe we can >> > >> > Hmm. Unless I misunderstood that'd mean that overflow_pages/4096 should >> > result in a number <= 128 at the moment, right? >> >> No, sorry, I think my calculation above has something missing. It >> should be 128 * 4096 * 8. How we can compute this number is >> no_bitmap_pages * no_bits_used_to_represent_overflow_pages. > > AHA! Ok. Then that appears to match. I get 65.041. > >> >If so then something is >> > amiss: >> > >> > # select * from pgstathashindex('link_datum_id_hash_idx'); >> > version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | >> > live_items | dead_items | free_percent >> > -+--++--+--+++-- >> >3 | 10485760 |2131192 | 66 |0 | >> > 2975444240 | 0 | 1065.19942179026 >> > (1 row) >> > >> > oldmdstash=# select 2131192/4096; >> > ?column? >> > -- >> > 520 >> > (1 row) >> >> You need to divide 520 by 8 to get the bitmap page. Is this the index >> in which you get the error or is this the one on which you have done >> REINDEX? > > Post REINDEX. > >> > And I do appear to have an odd percentage of free space. :) Are you worried about "unused_pages"? If so, then this is not a major reason to worry, because these are probably freed overflow pages which can be used in future. In the hash index, when we free the overflow pages, they are not returned back to OS, rather they are tracked in the index as unused pages which will get used when required in future. >> > >> >> It looks like Vacuum hasn't been triggered. >> Vacuum won't be triggered on insert load. I think that is one of the reasons why in your initial copy, you might have got the error. We had some discussion in the past to trigger Vacuum on insert heavy workloads [1], but the patch still didn't get committed. I think if that patch or some other form of that patch gets committed, it will help the workload what you are trying here. [1] - https://www.postgresql.org/message-id/b970f20f-f096-2d3a-6c6d-ee887bd30cfb%402ndquadrant.fr -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Append implementation
On 30 June 2017 at 15:10, Rafia Sabih wrote: > > On Tue, Apr 4, 2017 at 12:37 PM, Amit Khandekar > wrote: >> >> Attached is an updated patch v13 that has some comments changed as per >> your review, and also rebased on latest master. > > > This is not applicable on the latest head i.e. commit -- > 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7, looks like need a rebasing. Thanks for notifying. Attached is the rebased version of the patch. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company ParallelAppend_v13_rebased.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [POC] hash partitioning
On Mon, Jul 3, 2017 at 4:39 PM, amul sul wrote: > Thanks to catching this, fixed in the attached version. Few comments on the latest version. 0001 looks fine, for 0002 I have some comments. 1. + hbounds = (PartitionHashBound * *) palloc(nparts * + sizeof(PartitionHashBound *)); /s/(PartitionHashBound * *)/(PartitionHashBound **)/g 2. RelationBuildPartitionDesc { * catalog scan that retrieved them, whereas that in the latter is * defined by canonicalized representation of the list values or the * range bounds. */ for (i = 0; i < nparts; i++) result->oids[mapping[i]] = oids[i]; Should this comments mention about hash as well? 3. if (b1->datums[b1->ndatums - 1][0] != b2->datums[b2->ndatums - 1][0]) return false; if (b1->ndatums != b2->ndatums) return false; If ndatums itself is different then no need to access datum memory, so better to check ndatum first. 4. + * next larger modulus. For example, if you have a bunch + * of partitions that all have modulus 5, you can add a + * new new partition with modulus 10 or a new partition Typo, "new new partition" -> "new partition" -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition : Append node over a single SeqScan
On Wed, Jul 5, 2017 at 3:48 PM, Ashutosh Sharma wrote: > Hi All, > > Today while exploring a bit on Range table partitioning, I could see > that even if scan is performed on a single partition, the plan node > has Append node in it. Isn't it a bug? No. See following comment from create_append_plan() 1045 /* 1046 * XXX ideally, if there's just one child, we'd not bother to generate an 1047 * Append node but just return the single child. At the moment this does 1048 * not work because the varno of the child scan plan won't match the 1049 * parent-rel Vars it'll be asked to emit. 1050 */ -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Partition : Append node over a single SeqScan
Hi All, While exploring range table partitioning today, I noticed that even when the scan is performed on a single partition, the plan has an Append node in it. Isn't that a bug? As I understand the use of the Append node, it should only appear in the query plan when the scan covers multiple tables. Here are the steps to reproduce the problem:

CREATE TABLE part_tab (c1 int, c2 int) PARTITION BY RANGE (c1);
CREATE TABLE part1 PARTITION OF part_tab FOR VALUES FROM (0) TO (100);
CREATE TABLE part2 PARTITION OF part_tab FOR VALUES FROM (100) TO (200);

postgres=# explain analyze select * from part_tab where c1 > 100;
                              QUERY PLAN
 Append  (cost=0.00..38.25 rows=753 width=8) (actual time=0.009..0.009 rows=0 loops=1)
   ->  Seq Scan on part2  (cost=0.00..38.25 rows=753 width=8) (actual time=0.009..0.009 rows=0 loops=1)
         Filter: (c1 > 100)
 Planning time: 166698.973 ms
 Execution time: 0.043 ms
(5 rows)

-- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi column range partition table
Hi Dean, On 2017/07/04 16:49, Dean Rasheed wrote: > On 3 July 2017 at 10:32, Amit Langote wrote: >> On 2017/07/03 17:36, Dean Rasheed wrote: >>> The bigger question is do we want this for PG10? If so, time is >>> getting tight. My feeling is that we do, because otherwise we'd be >>> changing the syntax in PG11 of a feature only just released in PG10, >>> and I think the current syntax is flawed, so it would be better not to >>> have it in any public release. I'd feel better hearing from the >>> original committer though. >> >> The way I have extended the syntax in the posted patch, ABOVE/BELOW (or >> whatever we decide instead) are optional. UNBOUNDED without the >> ABOVE/BELOW specifications implicitly means UNBOUNDED ABOVE if in FROM and >> vice versa, which seems to me like sensible default behavior and what's >> already present in PG 10. >> >> Do you think ABOVE/BELOW shouldn't really be optional? >> > > Hmm, I'm not so sure about that. > > The more I think about this, the more I think that the current design > is broken, and that introducing UNBOUNDED ABOVE/BELOW is just a > sticking plaster to cover that up. Yes, it allows nicer multi-column > ranges to be defined, as demonstrated upthread. But, it also allows > some pretty counterintuitive things like making the lower bound > exclusive and the upper bound inclusive. Yes, I kind of got that impression from the example, but wasn't able to reach the same conclusion as yours that it stems from the underlying design issues; I thought we'd just have to document them as caveats, but that doesn't really sound nice. Thanks for pointing that out. > I think that's actually the real problem with the current design. If I > have a single-column partition like > > (col) FROM (x) TO (y) > > it's pretty clear that's a simple range, inclusive at the lower end > and exclusive at the upper end: > > (x) <= (col) < (y) > > If I now make that a 2-column partition, but leave the second column > unbounded: > > (col1,col2) FROM (x,UNBOUNDED) TO (y,UNBOUNDED) > > my initial expectation would have been for that to mean the same > thing, i.e., > > (x) <= (col1) < (y) > > but that only happens if "UNBOUNDED" means negative infinity in both > places. That then starts to give the sort of desirable properties > you'd expect, like using the same expression for the lower bound of > one partition as the upper bound of another makes the two partitions > contiguous. > > But of course, that's not exactly a pretty design either, because then > you'd be saying that UNBOUNDED means positive infinity if it's the > upper bound of the first column, and negative infinity if it's the > lower bound of the first column or either bound of any other column. Initially, I didn't understand the part where you said FROM (x, UNBOUNDED) TO (y, UNBOUNDED) would mean the same thing as (x) <= (col1) < (y), because row comparison logic that underlying multi-column range partition key comparisons appears to me to contradict the same. But, maybe it's thinking about the implementation details like this that's clouding my judgement about the correctness or the intuitiveness of the current design. > Another aspect of the current design I don't like is that you have to > keep repeating UNBOUNDED [ABOVE/BELOW], for each of the rest of the > columns in the bound, and anything else is an error. That's a pretty > verbose way of saying "the rest of the columns are unbounded". > > So the more I think about this, the more I think that a cleaner design > would be as follows: > > 1). 
Don't allow UNBOUNDED, except in the first column, where it can > keep it's current meaning. > > 2). Allow the partition bounds to have fewer columns than the > partition definition, and have that mean the same as it would have > meant if you were partitioning by that many columns. So, for > example, if you were partitioning by (col1,col2), you'd be allowed > to define a partition like so: > > FROM (x) TO (y) > > and it would mean > > x <= col1 < y > > Or you'd be able to define a partition like > > FROM (x1,x2) TO (y) > > which would mean > > (col1 > x1) OR (col1 = x1 AND col2 >= x2) AND col1 < y > > 3). Don't allow any value after UNBOUNDED (i.e., only specify > UNBOUNDED once in a partition bound). I assume we don't need the ability of specifying ABOVE/BELOW in this design. In retrospect, that sounds like something that was implemented in the earlier versions of the patch, whereby there was no ability to specify UNBOUNDED on a per-column basis. So the syntax was: FROM { (x [, ...]) | UNBOUNDED } TO { (y [, ...]) | UNBOUNDED } But, it was pointed out to me [1] that that doesn't address the use case, for example, where part1 goes up to (10, 10) and part2 goes from (10, 10) up to (10, unbounded). The new design will limit the usage of unbounded range partitions at the tail ends. > This design has a few neat properties: > > - Lower bounds ar
Re: [HACKERS] UPDATE of partition key
On 4 July 2017 at 15:23, Amit Khandekar wrote: > On 4 July 2017 at 14:48, Amit Khandekar wrote: >> On 4 July 2017 at 14:38, Amit Langote wrote: >>> On 2017/07/04 17:25, Etsuro Fujita wrote: On 2017/07/03 18:54, Amit Langote wrote: > On 2017/07/02 20:10, Robert Haas wrote: >> That >> seems pretty easy to do - just have expand_inherited_rtentry() notice >> that it's got a partitioned table and call >> RelationGetPartitionDispatchInfo() instead of find_all_inheritors() to >> produce the list of OIDs. Seems like a good idea. > Interesting idea. > > If we are going to do this, I think we may need to modify > RelationGetPartitionDispatchInfo() a bit or invent an alternative that > does not do as much work. Currently, it assumes that it's only ever > called by ExecSetupPartitionTupleRouting() and hence also generates > PartitionDispatchInfo objects for partitioned child tables. We don't need > that if called from within the planner. > > Actually, it seems that RelationGetPartitionDispatchInfo() is too coupled > with its usage within the executor, because there is this comment: > > /* > * We keep the partitioned ones open until we're done using the > * information being collected here (for example, see > * ExecEndModifyTable). > */ Yeah, we need some refactoring work. Is anyone working on that? >>> >>> I would like to take a shot at that if someone else hasn't already cooked >>> up a patch. Working on making RelationGetPartitionDispatchInfo() a >>> routine callable from both within the planner and the executor should be a >>> worthwhile effort. >> >> What I am currently working on is to see if we can call >> find_all_inheritors() or find_inheritance_children() instead of >> generating the leaf_part_oids using APPEND_REL_PARTITION_OIDS(). >> Possibly we don't have to refactor it completely. >> find_inheritance_children() needs to return the oids in canonical >> order. So in find_inheritance_children () need to re-use part of >> RelationBuildPartitionDesc() where it generates those oids in that >> order. I am checking this part, and am going to come up with an >> approach based on findings. > > The other approach is to make canonical ordering only in > find_all_inheritors() by replacing call to find_inheritance_children() > with the refactored/modified RelationGetPartitionDispatchInfo(). But > that would mean that the callers of find_inheritance_children() would > have one ordering, while the callers of find_all_inheritors() would > have a different ordering; that brings up chances of deadlocks. That's > why I think, we need to think about modifying the common function > find_inheritance_children(), so that we would be consistent with the > ordering. And then use find_inheritance_children() or > find_all_inheritors() in RelationGetPartitionDispatchInfo(). So yes, > there would be some modifications to > RelationGetPartitionDispatchInfo(). > >> >> Also, need to investigate whether *always* sorting the oids in >> canonical order is going to be much expensive than the current sorting >> using oids. But I guess it won't be that expensive. Like I mentioned upthread... in expand_inherited_rtentry(), if we replace find_all_inheritors() with something else that returns oids in canonical order, that will change the order in which children tables get locked, which increases the chance of deadlock. Because, then the callers of find_all_inheritors() will lock them in one order, while callers of expand_inherited_rtentry() will lock them in a different order. 
Even in the current code, I think there is a chance of deadlocks because RelationGetPartitionDispatchInfo() and find_all_inheritors() have different lock ordering. Now, to get the oids of a partitioned table children sorted by canonical ordering, (i.e. using the partition bound values) we need to either use the partition bounds to sort the oids like the way it is done in RelationBuildPartitionDesc() or, open the parent table and get it's Relation->rd_partdesc->oids[] which are already sorted in canonical order. So if we generate oids using this way in find_all_inheritors() and find_inheritance_children(), that will generate consistent ordering everywhere. But this method is quite expensive as compared to the way oids are generated and sorted using oid values in find_inheritance_children(). In both expand_inherited_rtentry() and RelationGetPartitionDispatchInfo(), each of the child tables are opened. So, in both of these functions, what we can do is : call a new function partition_tree_walker() which does following : 1. Lock the children using the existing order (i.e. sorted by oid values) using the same function find_all_inheritors(). Rename find_all_inheritors() to lock_all_inheritors(... , bool return_oids) which returns the oid list only if requested. 2. And then scan through each of the partition
Re: [HACKERS] pg_stop_backup(wait_for_archive := true) on standby server
On Wed, Jul 5, 2017 at 10:19 AM, Masahiko Sawada wrote: > I feel that since we cannot switch the WAL forcibly on the standby > server we need to find a new way to do so. I'm not sure but it might > be a hard work and be late for PG10. Or you meant that you have a idea > for this? Why not refactor do_pg_stop_backup() a bit so that the wait phase happens even if a backup is started in recovery? Right now wait_for_archive is ignored because no wait happens and the stop point is returned directly to the caller. As for making the wait actually happen, I don't have a better idea than documenting that a segment switch needs to be enforced manually on the primary. That's better than having users include WAL in their base backups without actually getting everything they need. And I think that documenting that properly is better than restricting things that should work. In most workloads, multiple WAL segments can be generated per second, and in even more of them a new segment is generated in less than a minute, so waiting for a segment switch on the primary should not be a problem for most users. The log message letting the user know about the wait should be more informative when this happens on a standby, like "waiting for segment to be finished or switched on the primary". If the restriction approach is preferred, I think that the check should happen in do_pg_stop_backup as well, and not in pg_stop_backup_v2 as your patch suggests. pg_basebackup is not able to do non-exclusive backups but this may happen some day, who knows.. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
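To sketch the documented workaround being proposed (only a sketch, using the function signatures discussed in this thread):

-- on the standby, finish the non-exclusive backup:
SELECT pg_stop_backup(exclusive := false, wait_for_archive := true);
-- on the primary, force the last required segment to be completed and archived:
SELECT pg_switch_wal();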