[HACKERS] Fix header comment of streamutil.c
Hi, While reading source code, I found that the header comment of streamutil.c is not correct. I guess pg_receivelog is a mistake of pg_receivexlog and it's an oversight when changing xlog to wal. Attached patch fixes this. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center fix_header_of_streamutil_c.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql 10: hash indexes testing
On Thu, Jul 06, 2017 at 05:19:59PM +0530, Amit Kapila wrote: > I think if you are under development, it is always advisable to create > indexes after initial bulk load. That way it will be faster and will > take lesser space atleast in case of hash index. This is a bit of a pickle, actually: * if I do have a hash index I'll wind up with a bloated one at some stage that refused to allow more inserts until the index is re-created * if I don't have an index then I'll wind up with a table where I cannot create a hash index because it has too many rows for it to handle I'm at a bit of a loss as to how to deal with this. The DB design does come with a kind of partitioning where a bundle of tables get put off to the side and searched seperately as needed but too many of those and the impact on performance can be noticed so I need to minimise them. > >> As mentioned above REINDEX might be a better option. I think for such > >> situation we should have some provision to allow squeeze functionality > >> of hash exposed to the user, this will be less costly than REINDEX and > >> might serve the purpose for the user. Hey, can you try some hack in > > > > Assuming it does help, would this be something one would need to guess > > at? "I did a whole bunch of concurrent INSERT heavy transactions so I > > guess I should do a squeeze now"? > > > > Or could it be figured out programmatically? > > I think one can refer free_percent and number of overflow pages to > perform such a command. It won't be 100% correct, but we can make a > guess. We can even check free space in overflow pages with page > inspect to make it more accurate. Does this take much time? Main reason I am asking is that this looks like something that the db ought to handle underneath (say as part of an autovac run) and so if there are stats that the index code can maintain that can then be used by the autovac (or something) code to trigger a cleanup this I think would be of benefit. Unless I am being /so/ unusual that it's not worth it. :) I'll reply to the rest in a separate stream as I'm still poking other work related things atm so can't do the debug testing as yet. AP -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi column range partition table
On 2017/07/07 4:55, Dean Rasheed wrote: > On 5 July 2017 at 18:07, Dean Rasheedwrote: >> So if we were to go for maximum flexibility and compatibility with >> Oracle, then perhaps what we would do is more like the original idea >> of UNBOUNDED ABOVE/BELOW, except call them MINVALUE and MAXVALUE, >> which conveniently are already unreserved keywords, as well as being >> much shorter. Plus, we would also relax the constraint about having >> finite values after MINVALUE/MAXVALUE. >> > So I know that I have flip-flopped a few times on this now, but I'm > now starting to think that this approach, replacing UNBOUNDED with > MINVALUE and MAXVALUE is the best way to go, along with permitting > finite values after MINVALUE/MAXVALUE. Sure. > This gives the greatest flexibility, it's not too verbose, and it > makes it easy to define contiguous sets of partitions just by making > the lower bound of one match the upper bound of another. > > With this approach, any partition bounds that Oracle allows are also > valid in PostgreSQL, not that I would normally give too much weight to > that, but it is I think quite a nice syntax. Of course, we also > support things that Oracle doesn't allow, such as MINVALUE and gaps > between partitions. Agreed. MINVALUE/MAXVALUE seems like a good way forward. > Parts of the patch are similar to your UNBOUNDED ABOVE/BELOW patch, > but there are a number of differences -- most notably, I replaced the > "infinite" boolean flag on PartitionRangeDatum with a 3-value enum and > did away with all the DefElem nodes and the associated special string > constants being copied and compared. That's better. > However, this is also an incompatible syntax change, and any attempt > to support both the old and new syntaxes is likely to be messy, so we > really need to get consensus on whether this is the right thing to do, > and whether it *can* be done now for PG10. +1 to releasing this syntax in PG 10. The patch looks generally good, although I found and fixed some minor issues (typos and such). Please find attached the updated patch. Thanks, Amit diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml index b15c19d3d0..06bea2a18a 100644 --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml @@ -87,8 +87,8 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI and partition_bound_spec is: IN ( { numeric_literal | string_literal | NULL } [, ...] ) | -FROM ( { numeric_literal | string_literal | UNBOUNDED } [, ...] ) - TO ( { numeric_literal | string_literal | UNBOUNDED } [, ...] ) +FROM ( { numeric_literal | string_literal | MINVALUE | MAXVALUE } [, ...] ) + TO ( { numeric_literal | string_literal | MINVALUE | MAXVALUE } [, ...] ) index_parameters in UNIQUE, PRIMARY KEY, and EXCLUDE constraints are: @@ -269,10 +269,10 @@ FROM ( { numeric_literal | Each of the values specified in the partition_bound_spec is - a literal, NULL, or UNBOUNDED. - Each literal value must be either a numeric constant that is coercible - to the corresponding partition key column's type, or a string literal - that is valid input for that type. + a literal, NULL, MINVALUE, or + MAXVALUE. Each literal value must be either a + numeric constant that is coercible to the corresponding partition key + column's type, or a string literal that is valid input for that type. @@ -300,13 +300,47 @@ FROM ( { numeric_literal | - Writing UNBOUNDED in FROM - signifies -infinity as the lower bound of the - corresponding column, whereas when written in TO, - it signifies +infinity as the upper bound. - All items following an UNBOUNDED item within - a FROM or TO list must also - be UNBOUNDED. + The special values MINVALUE and MAXVALUE + may be use when creating a range partition to indicate that there + is no lower or upper bound on the column's value. For example, a + partition defined using FROM (MINVALUE) TO (10) allows + any values less than 10, and a partition defined using + FROM (10) TO (MAXVALUE) allows any values greater than + or equal to 10. + + + + When creating a range partition involving more than one column, it + can also make sense to use MAXVALUE as part of the lower + bound, and MINVALUE as part of the upper bound. For + example, a partition defined using + FROM (0, MAXVALUE) TO (10, MAXVALUE) allows any rows + where the first partitioned column is greater than 0 and less than + or equal to 10. Similarly, a partition defined using + FROM ('a', MINVALUE) TO ('b', MINVALUE) + allows only rows where the first partitioned column starts with "a". + + + + Note that any values after MINVALUE or + MAXVALUE in a partition bound are ignored; so the bound + (10,
Re: [HACKERS] Error while copying a large file in pg_rewind
On Thu, Jul 6, 2017 at 8:51 PM, Dilip Kumarwrote: > On Thu, Jul 6, 2017 at 4:18 PM, Kuntal Ghosh > wrote: >> But, I'm little concerned/doubt regarding the following part of the code. >> +/* >> + * Converts an int64 from network byte order to native format. >> + */ >> +static int64 >> +pg_recvint64(int64 value) >> +{ >> + union >> + { >> + int64 i64; >> + uint32 i32[2]; >> + } swap; >> + int64 result; >> + >> + swap.i64 = value; >> + >> + result = (uint32) ntohl(swap.i32[0]); >> + result <<= 32; >> + result |= (uint32) ntohl(swap.i32[1]); >> + >> + return result; >> +} >> Does this always work correctly irrespective of the endianess of the >> underlying system? > > I think this will have problem, we may need to do like > > and reverse complete array if byte order is changed This comes from the the implementation of 64b-large objects, which was first broken on big-endian systems, until Tom fixed it with the following commit, and this looks fine to me: commit: 26fe56481c0f7baa705f0b3265b5a0676f894a94 author: Tom Lane date: Mon, 8 Oct 2012 18:24:32 -0400 Code review for 64-bit-large-object patch. At some point it would really make sense to group all things under the same banner (64-b LO, pg_basebackup, and now pg_rewind). > or we can use something like "be64toh" That's even less portable. For example macos would need a macro to map to OSSwapBigToHostInt64 or OSSwapLittleToHostInt64 from OSByteOrder.h. Brr. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New partitioning - some feedback
Hi Mark, On 2017/07/07 9:02, Mark Kirkwood wrote: > I've been trying out the new partitioning in version 10. Firstly, I must > say this is excellent - so much nicer than the old inheritance based method! Thanks. :) > My only niggle is the display of partitioned tables via \d etc. e.g: > > part=# \d > List of relations > Schema | Name | Type | Owner > +--+---+-- > public | date_fact| table | postgres > public | date_fact_201705 | table | postgres > public | date_fact_201706 | table | postgres > public | date_fact_20170601 | table | postgres > public | date_fact_2017060100 | table | postgres > public | date_fact_201707 | table | postgres > public | date_fact_rest | table | postgres > (7 rows) > > Now it can be inferred from the names that date_fact is a partitioned > table and the various date_fact_ are its partitions - but \d is not > providing any hints of this. The more detailed individual describe is fine: > > part=# \d date_fact > Table "public.date_fact" > Column | Type | Collation | Nullable | Default > +--+---+--+- > id | integer | | not null | > dte| timestamp with time zone | | not null | > val| integer | | not null | > Partition key: RANGE (dte) > Number of partitions: 6 (Use \d+ to list them.) > > I'd prefer *not* to see a table and its partitions all intermixed in the > same display (especially with nothing indicating which are partitions) - > as this will make for unwieldy long lists when tables have many > partitions. Also it would be good if the 'main' partitioned table and its > 'partitions' showed up as a different type in some way. > I note the they do in pg_class: > > part=# SELECT relname,relkind,relispartition FROM pg_class WHERE relname > LIKE 'date_fact%'; >relname| relkind | relispartition > --+-+ > date_fact| p | f > date_fact_201705 | r | t > date_fact_201706 | r | t > date_fact_20170601 | r | t > date_fact_2017060100 | r | t > date_fact_201707 | r | t > date_fact_rest | r | t > (7 rows) > > ...so it looks to be possible to hide the partitions from the main display > and/or mark them as such. Now I realize that making this comment now that > beta is out is a bit annoying - apologies, but I think seeing a huge list > of 'tables' is going to make \d frustrating for folk doing partitioning. Someone complained about this awhile back [1]. And then it came up again [2], where Noah appeared to take a stance that partitions should be visible in views / output of commands that list "tables". Although I too tend to prefer not filling up the \d output space by listing partitions (pg_class.relispartition = true relations), there wasn't perhaps enough push for creating a patch for that. If some committer is willing to consider such a patch, I can make one. Thanks, Amit [1] https://www.postgresql.org/message-id/CAM-w4HOZ5fPS7GoCTTrW42q01%2BwPrOWFCnr9H0iDyVTZP2H1CA%40mail.gmail.com [2] https://www.postgresql.org/message-id/20170406070227.GA2741046%40tornado.leadboat.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Out of date comment in predicate.c
On Sat, Jul 1, 2017 at 6:38 AM, Peter Eisentrautwrote: > On 6/27/17 01:21, Thomas Munro wrote: >> Commit ea9df812d8502fff74e7bc37d61bdc7d66d77a7f got rid of >> FirstPredicateLockMgrLock, but it's still referred to in a comment in >> predicate.c where the locking protocol is documented. I think it's >> probably best to use the name of the macro that's usually used to >> access the lock array in the code. Please see attached. > > Does this apply equally to PredicateLockHashPartitionLock() and > PredicateLockHashPartitionLockByIndex()? Should the comment mention or > imply both? Yeah, I guess so. How about listing the hashcode variant, as it's the more commonly used and important for a reader to understand of the two, but mentioning the ByIndex variant in a bullet point below? Like this. -- Thomas Munro http://www.enterprisedb.com fix-comments-v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi column range partition table
On 2017/07/06 18:30, Dean Rasheed wrote: > On 5 July 2017 at 10:43, Amit Langotewrote: >> 0001 is your patch to tidy up check_new_partition_bound() (must be >> applied before 0002) >> > > I pushed this first patch, simplifying check_new_partition_bound() for > range partitions, since it seemed like a good simplification, but note > that I don't think that was actually the cause of the latent bug you > saw upthread. I like how simple check_new_partition_bound() has now become. > I think the real issue was in partition_rbound_cmp() -- normally, if > the upper bound of one partition coincides with the lower bound of > another, that function would report the upper bound as the smaller > one, but that logic breaks if any of the bound values are infinite, > since then it will exit early, returning 0, without ever comparing the > "lower" flags on the bounds. > > I'm tempted to push a fix for that independently, since it's a bug > waiting to happen, even though it's not possible to hit it currently. Oops, you're right. Thanks for the fix. Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] New partitioning - some feedback
I've been trying out the new partitioning in version 10. Firstly, I must say this is excellent - so much nicer than the old inheritance based method! My only niggle is the display of partitioned tables via \d etc. e.g: part=# \d List of relations Schema | Name | Type | Owner +--+---+-- public | date_fact| table | postgres public | date_fact_201705 | table | postgres public | date_fact_201706 | table | postgres public | date_fact_20170601 | table | postgres public | date_fact_2017060100 | table | postgres public | date_fact_201707 | table | postgres public | date_fact_rest | table | postgres (7 rows) Now it can be inferred from the names that date_fact is a partitioned table and the various date_fact_ are its partitions - but \d is not providing any hints of this. The more detailed individual describe is fine: part=# \d date_fact Table "public.date_fact" Column | Type | Collation | Nullable | Default +--+---+--+- id | integer | | not null | dte| timestamp with time zone | | not null | val| integer | | not null | Partition key: RANGE (dte) Number of partitions: 6 (Use \d+ to list them.) I'd prefer *not* to see a table and its partitions all intermixed in the same display (especially with nothing indicating which are partitions) - as this will make for unwieldy long lists when tables have many partitions. Also it would be good if the 'main' partitioned table and its 'partitions' showed up as a different type in some way. I note the they do in pg_class: part=# SELECT relname,relkind,relispartition FROM pg_class WHERE relname LIKE 'date_fact%'; relname| relkind | relispartition --+-+ date_fact| p | f date_fact_201705 | r | t date_fact_201706 | r | t date_fact_20170601 | r | t date_fact_2017060100 | r | t date_fact_201707 | r | t date_fact_rest | r | t (7 rows) ...so it looks to be possible to hide the partitions from the main display and/or mark them as such. Now I realize that making this comment now that beta is out is a bit annoying - apologies, but I think seeing a huge list of 'tables' is going to make \d frustrating for folk doing partitioning. regards Mark -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi column range partition table
On 07/06/2017 01:24 PM, Dean Rasheed wrote: > On 6 July 2017 at 21:04, Tom Lanewrote: >> Dean Rasheed writes: >>> However, this is also an incompatible syntax change, and any attempt >>> to support both the old and new syntaxes is likely to be messy, so we >>> really need to get consensus on whether this is the right thing to do, >>> and whether it *can* be done now for PG10. >> >> FWIW, I'd much rather see us get it right the first time than release >> PG10 with a syntax that we'll regret later. I do not think that beta2, >> or even beta3, is too late for such a change. >> >> I'm not taking a position on whether this proposal is actually better >> than what we have. But if there's a consensus that it is, we should >> go ahead and do it, not worry that it's too late. >> > > OK, thanks. That's good to know. I agree we should get this right the first time and I also agree with Dean's proposal, so I guess I'm a +2 Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development signature.asc Description: OpenPGP digital signature
Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=
Kuntal Ghoshwrites: > On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane wrote: > + * In the first bin (i==1), add a fudge factor that ensures > + * that histfrac is at least eq_selec. We do this because we > + * know that the first histogram entry does satisfy the > + * inequality (if !isgt) or not satisfy it (if isgt), so our > + * estimate here should certainly not be zero even if binfrac > + * is zero. (XXX experimentally this is the correct way to do > + * it, but why isn't it a linear adjustment across the whole > + * histogram rather than just the first bin?) > Given that the values are distinct, (I've some doubts for real number case) Well, really, we are *always* dealing with discrete distributions here. > if histogram_bounds are assigned as, > {0,9,19,29,39,49,59,69,79,89,99,109,119,129,13,..} > ... > So, if val=low, then hisfrac = (bucket_current - 1)/num_of_buckets > which means it assumes val is included in the previous bucket. I looked at that again and realized that one of the answers I was missing lies in the behavior of analyze.c's compute_scalar_stats(). When it has collected "nvals" values and wants to make a histogram containing "num_hist" entries, it does this: * The object of this loop is to copy the first and last values[] * entries along with evenly-spaced values in between. So the * i'th value is values[(i * (nvals - 1)) / (num_hist - 1)]. (where i runs from 0 to num_hist - 1). Because the "/" denotes integer division, this effectively means that for all but the first entry, we are taking the last value out of the corresponding bucket of samples. That's obviously true for the last histogram entry, which will be the very last sample value, and it's also true for earlier ones --- except for the *first* histogram entry, which is necessarily the first one in its bucket. So the first histogram bucket is actually a tad narrower than the others, which is visible in the typical data you showed above: 0 to 9 is only 9 counts wide, but all the remaining buckets are 10 counts wide. That explains why we need a scale adjustment just in the first bucket. I think I'm also beginning to grasp why scalarineqsel's basic estimation process produces an estimate for "x <= p" rather than some other condition such as "x < p". For clarity, imagine that we're given p equal to the last histogram entry. If the test operator is in fact "<=", then it will say "true" for every histogram entry, and we'll fall off the end of the histogram and return 1.0, which is correct. If the test operator is "<", it will return "true" for all but the last entry, so that we end up with "i" equal to sslot.nvalues - 1 --- but we will compute binfrac = 1.0, if convert_to_scalar() produces sane answers, again resulting in histfrac = 1.0. Similar reasoning applies for ">" and ">=", so that in all cases we arrive at histfrac = 1.0 if p equals the last histogram entry. More generally, if p is equal to some interior histogram entry, say the k'th one out of n total, we end up with histfrac = (k-1)/(n-1) no matter which operator we probe with, assuming that convert_to_scalar() correctly gives us a binfrac of 0.0 or 1.0 depending on which of the adjacent bins we end up examining. So, remembering that the histogram entries occupy the right ends of their buckets, the proper interpretation of that is that it's the probability of "x <= p". (This is wrong for the first histogram entry, but that's why we need a correction there.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi column range partition table
On 6 July 2017 at 21:04, Tom Lanewrote: > Dean Rasheed writes: >> However, this is also an incompatible syntax change, and any attempt >> to support both the old and new syntaxes is likely to be messy, so we >> really need to get consensus on whether this is the right thing to do, >> and whether it *can* be done now for PG10. > > FWIW, I'd much rather see us get it right the first time than release > PG10 with a syntax that we'll regret later. I do not think that beta2, > or even beta3, is too late for such a change. > > I'm not taking a position on whether this proposal is actually better > than what we have. But if there's a consensus that it is, we should > go ahead and do it, not worry that it's too late. > OK, thanks. That's good to know. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi column range partition table
Dean Rasheedwrites: > However, this is also an incompatible syntax change, and any attempt > to support both the old and new syntaxes is likely to be messy, so we > really need to get consensus on whether this is the right thing to do, > and whether it *can* be done now for PG10. FWIW, I'd much rather see us get it right the first time than release PG10 with a syntax that we'll regret later. I do not think that beta2, or even beta3, is too late for such a change. I'm not taking a position on whether this proposal is actually better than what we have. But if there's a consensus that it is, we should go ahead and do it, not worry that it's too late. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi column range partition table
On 5 July 2017 at 18:07, Dean Rasheedwrote: > So if we were to go for maximum flexibility and compatibility with > Oracle, then perhaps what we would do is more like the original idea > of UNBOUNDED ABOVE/BELOW, except call them MINVALUE and MAXVALUE, > which conveniently are already unreserved keywords, as well as being > much shorter. Plus, we would also relax the constraint about having > finite values after MINVALUE/MAXVALUE. > So I know that I have flip-flopped a few times on this now, but I'm now starting to think that this approach, replacing UNBOUNDED with MINVALUE and MAXVALUE is the best way to go, along with permitting finite values after MINVALUE/MAXVALUE. This gives the greatest flexibility, it's not too verbose, and it makes it easy to define contiguous sets of partitions just by making the lower bound of one match the upper bound of another. With this approach, any partition bounds that Oracle allows are also valid in PostgreSQL, not that I would normally give too much weight to that, but it is I think quite a nice syntax. Of course, we also support things that Oracle doesn't allow, such as MINVALUE and gaps between partitions. Parts of the patch are similar to your UNBOUNDED ABOVE/BELOW patch, but there are a number of differences -- most notably, I replaced the "infinite" boolean flag on PartitionRangeDatum with a 3-value enum and did away with all the DefElem nodes and the associated special string constants being copied and compared. However, this is also an incompatible syntax change, and any attempt to support both the old and new syntaxes is likely to be messy, so we really need to get consensus on whether this is the right thing to do, and whether it *can* be done now for PG10. Regards, Dean Replace_UNBOUNDED_with_MINVALUE_and_MAXVALUE.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: pg_test_timing: Add NLS
Peter Eisentraut wrote: > pg_test_timing: Add NLS > > Also straighten out use of time unit abbreviations a bit. We (well, Carlos Chapi, who's doing the translation work now) just noticed that this has a bug in this line + printf("%6s %10s %10s\n", _("< us"), _("% of total"), _("count")); _() marks the strings with the c-string flag, which means that the %-specifiers are checked by gettext, but the % in the third literal is not a printf specifier. So there's no correct way to write the translation. We need to use a different xgettext trigger there, one that doesn't set c-format, but I don't know what. Babel is now complaining: /home/nlsuser/admin/wwwtools/scm/postgresql-master/src/bin/pg_test_timing/po/es.po:73: format specifications in 'msgid' and 'msgstr' for argument 1 are not the same where the line is #: pg_test_timing.c:181 #, c-format msgid "% of total" msgstr "% del total" -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rust bindings to pgtypes lib
Hi, On 2017-07-06 20:26:29 +0200, Michael Meskes wrote: > > But is this possible, or feasible? I see that the makefile refers to > > Possible yes, but in general I'm not a big fan of duplicating code. I > spend too much time to keep those copies in sync. Indeed. I'm quite strongly against exposing / using pgtypeslib in more places. If anything it should be phased out. Because that code is definitely not always kept up2date, and it's a lot of work to do so. If anything the code should be rewritten to *not* require so much duplication, then we could talk. > > How much of the source tree would I have to carve out? > > Or from another perspective; how do other language (if any) solve this? > > I'm not aware of any other language binding for pgtypeslib. Some people use http://libpqtypes.esilo.com/ Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rust bindings to pgtypes lib
> But is this possible, or feasible? I see that the makefile refers to Possible yes, but in general I'm not a big fan of duplicating code. I spend too much time to keep those copies in sync. > Makefile.global, and also includes stuff from ecpg, aat least. I don't want > to Yes, but these should be minor. Makefile.global is mostly included for some rules and other than some includes I don't know what else is taken from ecpg. However, there are a couple small files that are taken from teh backend source tree, like pgstrcasecmp.c > How much of the source tree would I have to carve out? > Or from another perspective; how do other language (if any) solve this? I'm not aware of any other language binding for pgtypeslib. Michael -- Michael Meskes Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org) Meskes at (Debian|Postgresql) dot Org Jabber: michael at xmpp dot meskes dot org VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?
On 6 July 2017 at 15:29, Jim Finnertywrote: > > Feel free to knock down this 'straw man' and propose something better! I think the pattern in this design that we don't want is that it imposes extra complexity on every user of every page even when the page doesn't have the problem and even when the problem isn't anywhere in the database. Even years from now when this problem is long gone you'll have code paths for dealing with this special page format that are rarely executed and never tested that will have to be maintained blind. Ideally a solution to this problem that imposes a cost only on the weird pages and only temporarily and leave the database in a "consistent" state that doesn't require any special processing when reading the data would be better. The "natural" solution is what was discussed for incompatible page format changes in the past where there's an point release of one Postgres version that tries to ensure there's enough space on the page for the next version and keeps track of whether there are any problematic pages. Then you would be blocked from upgrading until you had ensured all pages had space (presumably by running some special "vacuum upgrade" or something like that). Incidentally it's somewhat intriguing to think about what would happen if we *always* did such a tombstone for deletes. Or perhaps only when it's a full_page_write. Since the whole page is going into the log and that tuple will never be modified again you could imagine just replacing the tuple with the LSN of the deletion and letting anyone who really needs it fetch it from the xlog. That would be a completely different model from the way Postgres works though. More like a log-structured storage system. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More race conditions in logical replication
On 06/07/17 17:33, Petr Jelinek wrote: > On 03/07/17 01:54, Tom Lane wrote: >> I noticed a recent failure that looked suspiciously like a race condition: >> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet=2017-07-02%2018%3A02%3A07 >> >> The critical bit in the log file is >> >> error running SQL: 'psql::1: ERROR: could not drop the replication >> slot "tap_sub" on publisher >> DETAIL: The error was: ERROR: replication slot "tap_sub" is active for PID >> 3866790' >> while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R >> dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION >> tap_sub' at >> /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm >> line 1198. >> >> After poking at it a bit, I found that I can cause several different >> failures of this ilk in the subscription tests by injecting delays at >> the points where a slot's active_pid is about to be cleared, as in the >> attached patch (which also adds some extra printouts for debugging >> purposes; none of that is meant for commit). It seems clear that there >> is inadequate interlocking going on when we kill and restart a logical >> rep worker: we're trying to start a new one before the old one has >> gotten out of the slot. >> > > Thanks for the test case. > > It's not actually that we start new worker fast. It's that we try to > drop the slot right after worker process was killed but if the code that > clears slot's active_pid takes too long, it still looks like it's being > used. I am quite sure it's possible to make this happen for physical > replication as well when using slots. > > This is not something that can be solved by locking on subscriber. ISTM > we need to make pg_drop_replication_slot behave more nicely, like making > it wait for the slot to become available (either by default or as an > option). > > I'll have to think about how to do it without rewriting half of > replication slots or reimplementing lock queue though because the > replication slots don't use normal catalog access so there is no object > locking with wait queue. We could use some latch wait with small timeout > but that seems ugly as that function can be called by user without > having dropped the slot before so the wait can be quite long (as in > "forever"). > Naive fix would be something like attached. But as I said, it's not exactly pretty. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services >From 6d016a3fad8635c2366bcf5438d49f41c905621c Mon Sep 17 00:00:00 2001 From: Petr JelinekDate: Thu, 6 Jul 2017 18:16:44 +0200 Subject: [PATCH] Wait for slot to become free in when dropping it --- src/backend/replication/logical/logicalfuncs.c | 2 +- src/backend/replication/slot.c | 51 ++ src/backend/replication/slotfuncs.c| 2 +- src/backend/replication/walsender.c| 6 +-- src/include/replication/slot.h | 4 +- 5 files changed, 50 insertions(+), 15 deletions(-) diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c index 363ca82..a3ba2b1 100644 --- a/src/backend/replication/logical/logicalfuncs.c +++ b/src/backend/replication/logical/logicalfuncs.c @@ -244,7 +244,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin else end_of_wal = GetXLogReplayRecPtr(); - ReplicationSlotAcquire(NameStr(*name)); + ReplicationSlotAcquire(NameStr(*name), true); PG_TRY(); { diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index dc7de20..ab115f4 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -46,6 +46,7 @@ #include "pgstat.h" #include "replication/slot.h" #include "storage/fd.h" +#include "storage/ipc.h" #include "storage/proc.h" #include "storage/procarray.h" #include "utils/builtins.h" @@ -323,7 +324,7 @@ ReplicationSlotCreate(const char *name, bool db_specific, * Find a previously created slot and mark it as used by this backend. */ void -ReplicationSlotAcquire(const char *name) +ReplicationSlotAcquire(const char *name, bool nowait) { ReplicationSlot *slot = NULL; int i; @@ -331,6 +332,8 @@ ReplicationSlotAcquire(const char *name) Assert(MyReplicationSlot == NULL); +retry: + /* Search for the named slot and mark it active if we find it. */ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); for (i = 0; i < max_replication_slots; i++) @@ -350,16 +353,48 @@ ReplicationSlotAcquire(const char *name) } LWLockRelease(ReplicationSlotControlLock); - /* If we did not find the slot or it was already active, error out. */ + /* If we did not find the slot, error out. */ if (slot == NULL) ereport(ERROR, (errcode(ERRCODE_UNDEFINED_OBJECT), errmsg("replication slot \"%s\" does not exist", name))); + + /* +
[HACKERS] paths in partitions of a dummy partitioned table
If a partitioned table is proven dummy, set_rel_pathlist() doesn't mark the partition relations dummy and thus doesn't set any (dummy) paths in the partition relations. The lack of paths in the partitions means that we can not use partition-wise join while joining this table with some other similarly partitioned table as the partitions do not have any paths that child-joins can use. This means that we can not create partition relations for such a join and thus can not consider the join to be partitioned. This doesn't matter much when the dummy relation is the outer relation, since the resultant join is also dummy. But when the dummy relation is inner relation, the resultant join is not dummy and can be considered to be partitioned with same partitioning scheme as the outer relation to be joined with other similarly partitioned table. Not having paths in the partitions deprives us of this future optimization. Without partition-wise join, not setting dummy paths in the partition relations makes sense, since those do not have any use. But with partition-wise join that changes. If we always mark the partitions dummy, that effort may get wasted if the partitioned table doesn't participate in the partition-wise join. A possible solution would be to set the partitions dummy when only the partitioned table participates in the join, but that happens during make_join_rel(), much after we have set up the simple relations. So, I am not sure, whether that's a good option. But I am open to suggestions. What if we always mark the partition relations of a dummy partitioned table dummy? I tried attached path on a thousand partition table, the planning time increased by 3-5%. That doesn't look that bad for a thousand partitions. Any other options/comments? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index f087ddb..4af95b9 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -421,7 +421,7 @@ static void set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte) { - if (IS_DUMMY_REL(rel)) + if (IS_DUMMY_REL(rel) && !rte->inh) { /* We already proved the relation empty, so nothing more to do */ } @@ -1244,6 +1244,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, /* * Compute the child's access paths. */ + if (IS_DUMMY_REL(rel)) + set_dummy_rel_pathlist(childrel); set_rel_pathlist(root, childrel, childRTindex, childRTE); /* @@ -1258,6 +1260,12 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, live_childrels = lappend(live_childrels, childrel); } + if (!live_childrels) + { + Assert(IS_DUMMY_REL(rel)); + return; + } + /* Add paths to the "append" relation. */ add_paths_to_append_rel(root, rel, live_childrels); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] More race conditions in logical replication
On 03/07/17 01:54, Tom Lane wrote: > I noticed a recent failure that looked suspiciously like a race condition: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet=2017-07-02%2018%3A02%3A07 > > The critical bit in the log file is > > error running SQL: 'psql::1: ERROR: could not drop the replication > slot "tap_sub" on publisher > DETAIL: The error was: ERROR: replication slot "tap_sub" is active for PID > 3866790' > while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R > dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION > tap_sub' at > /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm > line 1198. > > After poking at it a bit, I found that I can cause several different > failures of this ilk in the subscription tests by injecting delays at > the points where a slot's active_pid is about to be cleared, as in the > attached patch (which also adds some extra printouts for debugging > purposes; none of that is meant for commit). It seems clear that there > is inadequate interlocking going on when we kill and restart a logical > rep worker: we're trying to start a new one before the old one has > gotten out of the slot. > Thanks for the test case. It's not actually that we start new worker fast. It's that we try to drop the slot right after worker process was killed but if the code that clears slot's active_pid takes too long, it still looks like it's being used. I am quite sure it's possible to make this happen for physical replication as well when using slots. This is not something that can be solved by locking on subscriber. ISTM we need to make pg_drop_replication_slot behave more nicely, like making it wait for the slot to become available (either by default or as an option). I'll have to think about how to do it without rewriting half of replication slots or reimplementing lock queue though because the replication slots don't use normal catalog access so there is no object locking with wait queue. We could use some latch wait with small timeout but that seems ugly as that function can be called by user without having dropped the slot before so the wait can be quite long (as in "forever"). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?
Poking this. Looking back through the discussion, this issue has been reproduced by multiple people. The patch still applies to HEAD without issues. I have no experience with PostgreSQL replication, so I'm not qualified to really review this. From what I can see with the patch, it's just a small block of code added to /src/backend/replication/walreceiver.c to handle some edge case where the WAL file no longer exists or something. I think one thing that would help move this forward is if we edited the patch to include a comment explaining why this new code is necessary. There's lots of great discussion on this issue in the email list, so if a summary of that gets into the code I think it would make the patch easier to understand and make the new walreceiver.c less confusing. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Rust bindings to pgtypes lib
Hi For learning purposes (and because I probably need it) , I've started to make Rust bindings to the pgtypes Library. One problem I'm thinking about right now is how to know if, and where, the library and the include files are installed. One way to avoid that problem is to include the source of pgtypeslib in my project, and let the rust build system take care of it *. But is this possible, or feasible? I see that the makefile refers to Makefile.global, and also includes stuff from ecpg, aat least. I don't want to include the whole pg source tree, just the nice, safe, little corner that would be used for this purpose. How much of the source tree would I have to carve out? Or from another perspective; how do other language (if any) solve this? * Not sure it can, but other rust crates seem to do it that way, e.g. https:// github.com/alexcrichton/bzip2-rs -- Med venlig hilsen Kaare Rasmussen, Jasonic Jasonic: Nordre Fasanvej 12, 2000 Frederiksberg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql 10: hash indexes testing
On Wed, Jul 05, 2017 at 07:31:39PM +1000, AP wrote: > On Tue, Jul 04, 2017 at 08:23:20PM -0700, Jeff Janes wrote: > > On Tue, Jul 4, 2017 at 3:57 AM, APwrote: > > > The data being indexed is BYTEA, (quasi)random and 64 bytes in size. > > > The table has over 2 billion entries. The data is not unique. There's > > > an average of 10 duplicates for every unique value. > > > > What is the number of duplicates for the most common value? > > Damn. Was going to collect this info as I was doing a fresh upload but > it fell through the cracks of my mind. It'll probably take at least > half a day to collect (a simple count(*) on the table takes 1.5-1.75 > hours parallelised across 11 processes) so I'll probably have this in > around 24 hours if all goes well. (and I don't stuff up the SQL :) ) Well... num_ids | count -+-- 1 | 91456442 2 | 56224976 4 | 14403515 16 | 13665967 3 | 12929363 17 | 12093367 15 | 10347006 So the most common is a unique value, then a dupe. AP. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql 10: hash indexes testing
On Thu, Jul 06, 2017 at 12:38:38PM +0530, Amit Kapila wrote: > On Thu, Jul 6, 2017 at 9:32 AM, APwrote: > > On Thu, Jul 06, 2017 at 08:52:03AM +0530, Amit Kapila wrote: > >> On Thu, Jul 6, 2017 at 2:40 AM, AP wrote: > >> > On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote: > >> >> >> > version | bucket_pages | overflow_pages | bitmap_pages | > >> >> >> > unused_pages | live_items | dead_items | free_percent > >> >> >> > -+--++--+--+++-- > >> >> >> >3 | 10485760 |2131192 | 66 | > >> >> >> > 0 | 2975444240 | 0 | 1065.19942179026 > >> >> >> > (1 row) > >> > ... > >> >> >> > And I do appear to have an odd percentage of free space. :) > >> >> > >> >> Are you worried about "unused_pages"? If so, then this is not a major > >> > > >> > Nope. "free_percent" Just a bit weird that I have it at over 1000% free. > >> > :) > >> > Shouldn't that number be < 100? > >> > >> Yes, there seems to be some gotcha in free percent calculation. Is it > >> possible for you to debug or in some way share the test? > > > > I can try to debug but I need to know what to look for and how. > > Okay, you need to debug function pgstathashindex and have your > breakpoint at free_percent calculation, then try to get the values of > nblocks, all the values in stats struct and total_space. I think > after getting this info we can further decide what to look for. Ok. I'll try and get to this tomorrow amidst fun with NFS. Hopefully there'll be time. So... I'll need postgresql-10-dbg - debug symbols for postgresql-10 Then given https://doxygen.postgresql.org/pgstatindex_8c.html#af86e3b4c40779d4f30a73b0bfe06316f set a breakpoint at pgstatindex.c:710 via gdb and then have fun with print? > > As for sharing the test, that'd mean sharing the data. If it helps I can > > provide the content of that column but you're looking at an sql dump that > > is roughly (2*64+1)*2.3 billion (give or take a (few) billion) in size. :) > > This is tricky, will Ibe able to import that column values by creating > table, if so, then probably it is worth. Should do. Thinking about it a little more, I can shrink the file down by roughly half if I don't do a pg_dump or similar. Doing COPY link (datum_id) TO '/tmp/moocow.copy' WITH (FORMAT BINARY) should allow you to use COPY FROM to restore the file and produce something a lot smaller than a dump, right? The table is simple: CREATE TABLE link (datum_id BYTEA); I can't give you the rest of the table (one other column) as the stuff hidden in there is private. The only thing that wont give you is the manner in which the column is filled (ie: the transactions, their size, how long they run, their concurrency etc). Don't know if that's important. > >> So at this stage, there are two possibilities for you (a) run manual > >> Vacuum in-between (b) create the index after bulk load. In general, > >> whatever I have mentioned in (b) is a better way for bulk loading. > >> Note here again the free_percent seems to be wrong. > > > > If you didn't mean VACUUM FULL then (a) does not appear to work and (b) > > would kill usability of the db during import, which would happen daily > > (though with a vastly reduced data size). > > If the data size in subsequent import is very less, then you only need > to Create the index after first import and then let it continue like > that. Managing this has suddenly gotten a lot more complex. :) > > It also messes with the > > permission model that has been set up for the least-trusted section of > > the project (at the moment that section can only INSERT). > > As per your permission model Vacuum Full is allowed, but not Create index? Well, at the moment it's all in development (though time for that to end is coming up). As such I can do things with enhanced permissions manually. When it hits production, that rather stops. > As mentioned above REINDEX might be a better option. I think for such > situation we should have some provision to allow squeeze functionality > of hash exposed to the user, this will be less costly than REINDEX and > might serve the purpose for the user. Hey, can you try some hack in Assuming it does help, would this be something one would need to guess at? "I did a whole bunch of concurrent INSERT heavy transactions so I guess I should do a squeeze now"? Or could it be figured out programmatically? > the code which can at least tell us whether squeezing hash can give us > any benefit or is the index in such a situation that only REINDEX will > help. Basically, while doing Vacuum (non-full), you need to somehow > bypass the below line in lazy_scan_heap > lazy_scan_heap > { > .. > if (vacrelstats->num_dead_tuples > 0) > .. > } > > I am not sure it will work, but we can try once if you are okay. So this would be at L1284 of
Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility
On 07/03/2017 06:30 PM, Chapman Flack wrote: On 07/03/2017 09:39 AM, Heikki Linnakangas wrote: Hmm. That's not the problem, though. Imagine that instead of the loop above, you do just: WALInsertLockUpdateInsertingAt(CurrPos); AdvanceXLInsertBuffer(EndPos, false); AdvanceXLInsertBuffer() will call XLogWrite(), to flush out any pages before EndPos, to make room in the wal_buffers for the new pages. Before doing that, it will call WaitXLogInsertionsToFinish() Although it's moot in the straightforward approach of re-zeroing in the loop, it would still help my understanding of the system to know if there is some subtlety that would have broken what I proposed earlier, which was an extra flag to AdvanceXLInsertBuffer() that would tell it not only to skip initializing headers, but also to skip the WaitXLogInsertionsToFinish() check ... because I have the entire region reserved and I hold all the writer slots at that moment, it seems safe to assure AdvanceXLInsertBuffer() that there are no outstanding writes to wait for. Yeah, I suppose that would work, too. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql 10: hash indexes testing
On Thu, Jul 06, 2017 at 08:52:03AM +0530, Amit Kapila wrote: > On Thu, Jul 6, 2017 at 2:40 AM, APwrote: > > On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote: > >> >> > version | bucket_pages | overflow_pages | bitmap_pages | > >> >> > unused_pages | live_items | dead_items | free_percent > >> >> > -+--++--+--+++-- > >> >> >3 | 10485760 |2131192 | 66 | > >> >> > 0 | 2975444240 | 0 | 1065.19942179026 > >> >> > (1 row) > > ... > >> >> > And I do appear to have an odd percentage of free space. :) > >> > >> Are you worried about "unused_pages"? If so, then this is not a major > > > > Nope. "free_percent" Just a bit weird that I have it at over 1000% free. :) > > Shouldn't that number be < 100? > > Yes, there seems to be some gotcha in free percent calculation. Is it > possible for you to debug or in some way share the test? I can try to debug but I need to know what to look for and how. If it requires data reloads then that's around 12-15 hours per hit. As for sharing the test, that'd mean sharing the data. If it helps I can provide the content of that column but you're looking at an sql dump that is roughly (2*64+1)*2.3 billion (give or take a (few) billion) in size. :) > > Well, if this is the cause of my little issue, it might be nice. ATM > > my import script bombs out on errors (that I've duplicated! :) It took > > 11 hours but it bombed) and it sounds like I'll need to do a manual > > VACUUM before it can be run again. > > > > Yeah, I think after manual vacuum you should be able to proceed. I don't think that'll help. I did a small check which I hope is helpful in seeing if it will. Working off a similar db that completed (as it was smaller and I did not want to mess with my one copy of the broken db) I got the following results: Pre-VACUUM: --- # \di+ List of relations Schema | Name| Type | Owner | Table | Size | Description +---+---+---+-+-+- ... public | link_datum_id_idx | index | mdkingpin | link| 90 GB | ... # select * from pgstathashindex('link_datum_id_idx'); version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items | free_percent -+--++--+--+++- 3 | 7611595 |3451261 | 106 | 777013 | 3076131325 | 0 | 1512.8635780908 # vacuum VERBOSE ANALYZE link; INFO: vacuuming "public.link" INFO: "link": found 0 removable, 2272156152 nonremovable row versions in 120507771 out of 123729466 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 8594 There were 0 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 457.16 s, system: 755.84 s, elapsed: 4196.75 s. INFO: vacuuming "pg_toast.pg_toast_183727891" INFO: index "pg_toast_183727891_index" now contains 1441820 row versions in 3956 pages DETAIL: 0 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.56 s. INFO: "pg_toast_183727891": found 0 removable, 1441820 nonremovable row versions in 332271 out of 332271 pages DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 8594 There were 0 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 1.16 s, system: 2.26 s, elapsed: 22.80 s. INFO: analyzing "public.link" INFO: "link": scanned 300 of 123729466 pages, containing 56661337 live rows and 0 dead rows; 300 rows in sample, 2330296882 estimated total rows VACUUM Time: 7057484.079 ms (01:57:37.484) Post-VACUUM: # \di+ Schema | Name| Type | Owner | Table | Size | Description +---+---+---+-+-+- public | link_datum_id_idx | index | mdkingpin | link| 90 GB | # select * from pgstathashindex('link_datum_id_idx'); version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items | free_percent -+--++--+--+++- 3 | 7611595 |3451261 | 106 | 777013 | 3076131325 | 0 | 1512.8635780908 The end results are the same. Then I did: # CREATE INDEX CONCURRENTLY ON link USING hash (datum_id) WITH (fillfactor = 90); CREATE INDEX Time: 12545612.560 ms (03:29:05.613) # \di+ List of relations Schema |Name| Type | Owner | Table | Size | Description
Re: [HACKERS] Proposal : For Auto-Prewarm.
On Thu, Jul 6, 2017 at 10:52 AM, Amit Kapilawrote: > I am not able to understand what you want to say. Unlogged tables > should be empty in case of crash recovery. Also, we never flush > unlogged buffers except for shutdown checkpoint, refer BufferAlloc and > in particular below comment: -- Sorry I said that because of my lack of knowledge about unlogged tables. Yes, what you say is right "an unlogged table is automatically truncated after a crash or unclean shutdown". So it will be enough if we just dump them during shutdown time. -- Thanks and Regards Mithun C Y EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Error while copying a large file in pg_rewind
On Thu, Jul 6, 2017 at 4:18 PM, Kuntal Ghoshwrote: > On Wed, Jul 5, 2017 at 9:35 AM, Michael Paquier > wrote: >> On Tue, Jul 4, 2017 at 4:41 PM, Kuntal Ghosh >> wrote: >>> I've not yet started the patch and it may take some time for me to >>> understand and write >>> the patch in a correct way. Since, you've almost written the patch, >>> IMHO, please go ahead >>> and submit the patch. I'll happily review and test it. :-) >>> >>> Thanks for the notes. >> >> OK, thanks. Here you go. >> > Thanks for the patch. It looks good and it solves the existing issues. > > But, I'm little concerned/doubt regarding the following part of the code. > +/* > + * Converts an int64 from network byte order to native format. > + */ > +static int64 > +pg_recvint64(int64 value) > +{ > + union > + { > + int64 i64; > + uint32 i32[2]; > + } swap; > + int64 result; > + > + swap.i64 = value; > + > + result = (uint32) ntohl(swap.i32[0]); > + result <<= 32; > + result |= (uint32) ntohl(swap.i32[1]); > + > + return result; > +} > Does this always work correctly irrespective of the endianess of the > underlying system? I think this will have problem, we may need to do like union { int64 i64; uint8 i8[8]; } swap; and reverse complete array if byte order is changed or we can use something like "be64toh" > > > -- > Thanks & Regards, > Kuntal Ghosh > EnterpriseDB: http://www.enterprisedb.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgsql 10: hash indexes testing
On Thu, Jul 6, 2017 at 5:04 PM, APwrote: > On Thu, Jul 06, 2017 at 12:38:38PM +0530, Amit Kapila wrote: >> On Thu, Jul 6, 2017 at 9:32 AM, AP wrote: >> > On Thu, Jul 06, 2017 at 08:52:03AM +0530, Amit Kapila wrote: >> >> On Thu, Jul 6, 2017 at 2:40 AM, AP wrote: >> >> > On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote: >> >> >> >> > version | bucket_pages | overflow_pages | bitmap_pages | >> >> >> >> > unused_pages | live_items | dead_items | free_percent >> >> >> >> > -+--++--+--+++-- >> >> >> >> >3 | 10485760 |2131192 | 66 | >> >> >> >> >0 | 2975444240 | 0 | 1065.19942179026 >> >> >> >> > (1 row) >> >> > ... >> >> >> >> > And I do appear to have an odd percentage of free space. :) >> >> >> >> >> >> Are you worried about "unused_pages"? If so, then this is not a major >> >> > >> >> > Nope. "free_percent" Just a bit weird that I have it at over 1000% >> >> > free. :) >> >> > Shouldn't that number be < 100? >> >> >> >> Yes, there seems to be some gotcha in free percent calculation. Is it >> >> possible for you to debug or in some way share the test? >> > >> > I can try to debug but I need to know what to look for and how. >> >> Okay, you need to debug function pgstathashindex and have your >> breakpoint at free_percent calculation, then try to get the values of >> nblocks, all the values in stats struct and total_space. I think >> after getting this info we can further decide what to look for. > > Ok. I'll try and get to this tomorrow amidst fun with NFS. Hopefully > there'll be time. > Cool. > So... I'll need > > postgresql-10-dbg - debug symbols for postgresql-10 > > Then given > https://doxygen.postgresql.org/pgstatindex_8c.html#af86e3b4c40779d4f30a73b0bfe06316f > set a breakpoint at pgstatindex.c:710 via gdb and then have fun with > print? > If I am reading it correctly it should be line 706 as below: if (get_call_result_type(fcinfo, NULL, ) != TYPEFUNC_COMPOSITE) >> > As for sharing the test, that'd mean sharing the data. If it helps I can >> > provide the content of that column but you're looking at an sql dump that >> > is roughly (2*64+1)*2.3 billion (give or take a (few) billion) in size. :) >> >> This is tricky, will Ibe able to import that column values by creating >> table, if so, then probably it is worth. > > Should do. Thinking about it a little more, I can shrink the file down by > roughly half if I don't do a pg_dump or similar. Doing > > COPY link (datum_id) TO '/tmp/moocow.copy' WITH (FORMAT BINARY) > > should allow you to use COPY FROM to restore the file and produce something > a lot smaller than a dump, right? > I think so. You can share it. I will try. > The table is simple: > > CREATE TABLE link (datum_id BYTEA); > > I can't give you the rest of the table (one other column) as the stuff hidden > in there is private. > No issues. > The only thing that wont give you is the manner in which the column is filled > (ie: the transactions, their size, how long they run, their concurrency etc). > Don't know if that's important. > >> >> So at this stage, there are two possibilities for you (a) run manual >> >> Vacuum in-between (b) create the index after bulk load. In general, >> >> whatever I have mentioned in (b) is a better way for bulk loading. >> >> Note here again the free_percent seems to be wrong. >> > >> > If you didn't mean VACUUM FULL then (a) does not appear to work and (b) >> > would kill usability of the db during import, which would happen daily >> > (though with a vastly reduced data size). >> >> If the data size in subsequent import is very less, then you only need >> to Create the index after first import and then let it continue like >> that. > > Managing this has suddenly gotten a lot more complex. :) > >> > It also messes with the >> > permission model that has been set up for the least-trusted section of >> > the project (at the moment that section can only INSERT). >> >> As per your permission model Vacuum Full is allowed, but not Create index? > > Well, at the moment it's all in development (though time for that to end > is coming up). As such I can do things with enhanced permissions manually. > I think if you are under development, it is always advisable to create indexes after initial bulk load. That way it will be faster and will take lesser space atleast in case of hash index. > When it hits production, that rather stops. > >> As mentioned above REINDEX might be a better option. I think for such >> situation we should have some provision to allow squeeze functionality >> of hash exposed to the user, this will be less costly than REINDEX and >> might serve the purpose for the user. Hey, can you try some hack in > > Assuming it does help, would this be something one would need to guess > at? "I did a whole bunch of concurrent
Re: [HACKERS] hash index on unlogged tables doesn't behave as expected
On Thu, Jul 6, 2017 at 3:24 PM, Kyotaro HORIGUCHIwrote: > Hello, > > At Tue, 4 Jul 2017 14:49:26 +0530, Amit Kapila > wrote in >> On Tue, Jul 4, 2017 at 1:23 PM, Michael Paquier >> wrote: >> > On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila >> > wrote: >> > >> > It seems to me that this qualifies as an open item for 10. WAL-logging >> > is new for hash tables. Amit, could you add one? >> > >> >> Added. >> >> > + use_wal = RelationNeedsWAL(rel) || >> > + (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED && >> > + forkNum == INIT_FORKNUM); >> > In access AMs, empty() routines are always only called for unlogged >> > relations, so you could relax that to check for INIT_FORKNUM only. >> > >> >> Yeah, but _hash_init() is an exposed API, so I am not sure it is a >> good idea to make such an assumption at that level of code. Now, as >> there is no current user which wants to use it any other way, we can >> make such an assumption, but does it serve any purpose? > > Check for INIT_FORKNUM appears both accompanied and not > accompanied by check for RELPER.._UNLOGGED, so I'm not sure which > is the convention here. > > The difference of the two is an init fork of TEMP > relations. However I believe that init fork is by definition a > part of an unlogged relation, it seems to me that it ought not to > be wal-logged if we had it. From this viewpoint, the middle line > makes sense. > What is the middle line? Are you okay with a current check or do you see any problem with it or do you prefer to write it differently? > > >> > Looking at the patch, I think that you are right to do the logging >> > directly in _hash_init() and not separately in hashbuildempty(). >> > Compared to other relations the init data is more dynamic. >> > >> > + /* >> > +* Force the on-disk state of init forks to always be in sync with >> > the >> > +* state in shared buffers. See XLogReadBufferForRedoExtended. >> > +*/ >> > + XLogRecGetBlockTag(record, 0, NULL, , NULL); >> > + if (forknum == INIT_FORKNUM) >> > + FlushOneBuffer(metabuf); >> > I find those forced syncs a bit surprising. The buffer is already >> > marked as dirty, so even if there is a concurrent checkpoint they >> > would be flushed. >> > >> >> What if there is no concurrent checkpoint activity and the server is >> shutdown? Remember this replay happens on standby and we will just >> try to perform Restartpoint and there is no guarantee that it will >> flush the buffers. If the buffers are not flushed at shutdown, then >> the Init fork will be empty on restart and the hash index will be >> corrupt. > > XLogReadBufferForRedoExtended() automatically force the flush for > a XLOG record on init fork that having FPW imaeges. And > _hash_init seems to have emitted such a XLOG record using > log_newpage. > Sure, but log_newpage is not used for metapage and bitmappage. >> > If recovery comes again here, then they would just >> > be replayed. I am unclear why XLogReadBufferForRedoExtended is not >> > enough to replay a FPW of that. >> > >> >> Where does FPW come into the picture for a replay of metapage or bitmappage? > > Since required FPW seems to be emitted and the flush should be > done automatically, the issuer side (_hash_init) may attach wrong > FPW images if hash_xlog_init_meta_page requires additional > flushes. But I don't see such a flaw there. I think Michael wants > to say such a kind of thing. > I am not able to understand what you want to say, but I think you don't see any problem with the patch as such. > > > By the way the comment of the function ReadBufferWithoutRelcache > has the following sentense. > I don't think we need anything with respect to this patch because ReadBufferWithoutRelcache is used in case of FPW replay of INIT_FORK pages as well. Do you have something else in mind which I am not able to see? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add support for tuple routing to foreign partitions
On 2017/07/05 9:13, Amit Langote wrote: On 2017/06/29 20:20, Etsuro Fujita wrote: In relation to this, I also allowed expand_inherited_rtentry() to build an RTE and AppendRelInfo for each partition of a partitioned table that is an INSERT target, as mentioned by Amit in [1], by modifying transformInsertStmt() so that we set the inh flag for the target table's RTE to true when the target table is a partitioned table. About this part, Robert sounded a little unsure back then [1]; his words: "Doing more inheritance expansion early is probably not a good idea because it likely sucks for performance, and that's especially unfortunate if it's only needed for foreign tables." But... The partition RTEs are not only referenced by FDWs, but used in selecting relation aliases for EXPLAIN (see below) as well as extracting plan dependencies in setref.c so that we force rebuilding of the plan on any change to partitions. (We do replanning on FDW table-option changes to foreign partitions, currently. See regression tests in the attached patch.) ...this part means we cannot really avoid locking at least the foreign partitions during the planning stage, which we currently don't, as we don't look at partitions at all before ExecSetupPartitionTupleRouting() is called by ExecInitModifyTable(). Another case where we need inheritance expansion during the planning stage would probably be INSERT .. ON CONFLICT into a partitioned table, I guess. We don't support that yet, but if implementing that, I guess we would probably need to look at each partition and do something like infer_arbiter_indexes() for each partition during the planner stage. Not sure. Since we are locking *all* the partitions anyway, it may be better to shift the cost of locking to the planner by doing the inheritance expansion even in the insert case (for partitioned tables) and avoid calling the lock manager again in the executor when setting up tuple-routing data structures (that is, in ExecSetupPartitionTupleRouting). We need the execution-time lock, anyway. See the comments from Robert in [3]. An idea that Robert recently mentioned on the nearby "UPDATE partition key" thread [2] seems relevant in this regard. By that idea, expand_inherited_rtentry(), instead of reading in the partition OIDs in the old catalog order, will instead call RelationBuildPartitionDispatchInfo(), which locks the partitions and returns their OIDs in the canonical order. If we store the foreign partition RT indexes in that order in ModifyTable.partition_rels introduced by this patch, then the following code added by the patch could be made more efficient (as also explained in aforementioned Robert's email): +foreach(l, node->partition_rels) +{ +Index rti = lfirst_int(l); +Oid relid = getrelid(rti, estate->es_range_table); +boolfound; +int j; + +/* First, find the ResultRelInfo for the partition */ +found = false; +for (j = 0; j < mtstate->mt_num_partitions; j++) +{ +resultRelInfo = partitions + j; + +if (RelationGetRelid(resultRelInfo->ri_RelationDesc) == relid) +{ +Assert(mtstate->mt_partition_indexes[i] == 0); +mtstate->mt_partition_indexes[i] = j; +found = true; +break; +} +} +if (!found) +elog(ERROR, "failed to find ResultRelInfo for rangetable index %u", rti); Agreed. I really want to improve this based on that idea. Thank you for the comments! Best regards, Etsuro Fujita [3] https://www.postgresql.org/message-id/ca+tgmoyiwvicdri3zk+quxj1r7umu9t_kdnq+17pcwgzrbz...@mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Error while copying a large file in pg_rewind
On Wed, Jul 5, 2017 at 9:35 AM, Michael Paquierwrote: > On Tue, Jul 4, 2017 at 4:41 PM, Kuntal Ghosh > wrote: >> I've not yet started the patch and it may take some time for me to >> understand and write >> the patch in a correct way. Since, you've almost written the patch, >> IMHO, please go ahead >> and submit the patch. I'll happily review and test it. :-) >> >> Thanks for the notes. > > OK, thanks. Here you go. > Thanks for the patch. It looks good and it solves the existing issues. But, I'm little concerned/doubt regarding the following part of the code. +/* + * Converts an int64 from network byte order to native format. + */ +static int64 +pg_recvint64(int64 value) +{ + union + { + int64 i64; + uint32 i32[2]; + } swap; + int64 result; + + swap.i64 = value; + + result = (uint32) ntohl(swap.i32[0]); + result <<= 32; + result |= (uint32) ntohl(swap.i32[1]); + + return result; +} Does this always work correctly irrespective of the endianess of the underlying system? -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] hash index on unlogged tables doesn't behave as expected
FWIW.. At Thu, 06 Jul 2017 18:54:47 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170706.185447.256482539.horiguchi.kyot...@lab.ntt.co.jp> > > > + /* > > > +* Force the on-disk state of init forks to always be in sync > > > with the > > > +* state in shared buffers. See XLogReadBufferForRedoExtended. > > > +*/ > > > + XLogRecGetBlockTag(record, 0, NULL, , NULL); > > > + if (forknum == INIT_FORKNUM) > > > + FlushOneBuffer(metabuf); > > > I find those forced syncs a bit surprising. The buffer is already > > > marked as dirty, so even if there is a concurrent checkpoint they > > > would be flushed. > > > > > > > What if there is no concurrent checkpoint activity and the server is > > shutdown? Remember this replay happens on standby and we will just > > try to perform Restartpoint and there is no guarantee that it will > > flush the buffers. If the buffers are not flushed at shutdown, then > > the Init fork will be empty on restart and the hash index will be > > corrupt. > > XLogReadBufferForRedoExtended() automatically force the flush for > a XLOG record on init fork that having FPW imaeges. And > _hash_init seems to have emitted such a XLOG record using > log_newpage. > > > > If recovery comes again here, then they would just > > > be replayed. I am unclear why XLogReadBufferForRedoExtended is not > > > enough to replay a FPW of that. > > > > > > > Where does FPW come into the picture for a replay of metapage or bitmappage? > > Since required FPW seems to be emitted and the flush should be > done automatically, the issuer side (_hash_init) may attach wrong "may attach" -> "may have attached" > FPW images if hash_xlog_init_meta_page requires additional > flushes. But I don't see such a flaw there. I think Michael wants > to say such a kind of thing. regards. -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] increasing the default WAL segment size
On Thu, Jul 6, 2017 at 3:21 PM, tusharwrote: > On 07/06/2017 12:04 PM, Beena Emerson wrote: >> >> The 04-initdb-walsegsize_v2.patch has the following improvements: >> - Rebased over new 03 patch >> - Pass the wal-segsize intidb option as command-line option rathern >> than in an environment variable. >> - Since new function check_wal_size had only had two checks and was >> sed once, moved the code to ReadControlFile where it is used and >> removed this function. >> - improve comments and add validations where required. >> - Use DEFAULT_XLOG_SEG_SIZE to set the min_wal_size and >> max_wal_size,instead of the value 16. >> - Use XLogSegMaxSize and XLogSegMinSize to calculate the range of guc >> wal_segment_size instead 16 - INT_MAX. > > Thanks Beena. I tested with below following scenarios and all are working > as expected > .)Different WAL segment size i.e 1,2,8,16,32,64,512,1024 at the time of > initdb > .)SR setup in place. > .)Combinations of max/min_wal_size in postgresql.conf with different > wal_segment_size. > .)shutdown the server forcefully (kill -9) / promote slave / to make sure > -recovery happened successfully. > .)with different utilities like pg_resetwal/pg_upgrade/pg_waldump > .)running pg_bench for substantial workloads (~ 1 hour) > .)WAL segment size is not default (changed at the time of ./configure) + > different wal_segment_size (at the time of initdb) . > > Things looks fine. > Thank you Tushar. -- Beena Emerson EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] hash index on unlogged tables doesn't behave as expected
Hello, At Tue, 4 Jul 2017 14:49:26 +0530, Amit Kapilawrote in > On Tue, Jul 4, 2017 at 1:23 PM, Michael Paquier > wrote: > > On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila wrote: > > > > It seems to me that this qualifies as an open item for 10. WAL-logging > > is new for hash tables. Amit, could you add one? > > > > Added. > > > + use_wal = RelationNeedsWAL(rel) || > > + (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED && > > + forkNum == INIT_FORKNUM); > > In access AMs, empty() routines are always only called for unlogged > > relations, so you could relax that to check for INIT_FORKNUM only. > > > > Yeah, but _hash_init() is an exposed API, so I am not sure it is a > good idea to make such an assumption at that level of code. Now, as > there is no current user which wants to use it any other way, we can > make such an assumption, but does it serve any purpose? Check for INIT_FORKNUM appears both accompanied and not accompanied by check for RELPER.._UNLOGGED, so I'm not sure which is the convention here. The difference of the two is an init fork of TEMP relations. However I believe that init fork is by definition a part of an unlogged relation, it seems to me that it ought not to be wal-logged if we had it. From this viewpoint, the middle line makes sense. > > Looking at the patch, I think that you are right to do the logging > > directly in _hash_init() and not separately in hashbuildempty(). > > Compared to other relations the init data is more dynamic. > > > > + /* > > +* Force the on-disk state of init forks to always be in sync with > > the > > +* state in shared buffers. See XLogReadBufferForRedoExtended. > > +*/ > > + XLogRecGetBlockTag(record, 0, NULL, , NULL); > > + if (forknum == INIT_FORKNUM) > > + FlushOneBuffer(metabuf); > > I find those forced syncs a bit surprising. The buffer is already > > marked as dirty, so even if there is a concurrent checkpoint they > > would be flushed. > > > > What if there is no concurrent checkpoint activity and the server is > shutdown? Remember this replay happens on standby and we will just > try to perform Restartpoint and there is no guarantee that it will > flush the buffers. If the buffers are not flushed at shutdown, then > the Init fork will be empty on restart and the hash index will be > corrupt. XLogReadBufferForRedoExtended() automatically force the flush for a XLOG record on init fork that having FPW imaeges. And _hash_init seems to have emitted such a XLOG record using log_newpage. > > If recovery comes again here, then they would just > > be replayed. I am unclear why XLogReadBufferForRedoExtended is not > > enough to replay a FPW of that. > > > > Where does FPW come into the picture for a replay of metapage or bitmappage? Since required FPW seems to be emitted and the flush should be done automatically, the issuer side (_hash_init) may attach wrong FPW images if hash_xlog_init_meta_page requires additional flushes. But I don't see such a flaw there. I think Michael wants to say such a kind of thing. By the way the comment of the function ReadBufferWithoutRelcache has the following sentense. | * NB: At present, this function may only be used on permanent relations, which | * is OK, because we only use it during XLOG replay. If in the future we | * want to use it on temporary or unlogged relations, we could pass additional | * parameters. and does | return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum, mode, strategy, ); This surely works since BufferAlloc recognizes INIT_FORK. But some adjustment may be needed around here. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] increasing the default WAL segment size
On 07/06/2017 12:04 PM, Beena Emerson wrote: The 04-initdb-walsegsize_v2.patch has the following improvements: - Rebased over new 03 patch - Pass the wal-segsize intidb option as command-line option rathern than in an environment variable. - Since new function check_wal_size had only had two checks and was sed once, moved the code to ReadControlFile where it is used and removed this function. - improve comments and add validations where required. - Use DEFAULT_XLOG_SEG_SIZE to set the min_wal_size and max_wal_size,instead of the value 16. - Use XLogSegMaxSize and XLogSegMinSize to calculate the range of guc wal_segment_size instead 16 - INT_MAX. Thanks Beena. I tested with below following scenarios and all are working as expected .)Different WAL segment size i.e 1,2,8,16,32,64,512,1024 at the time of initdb .)SR setup in place. .)Combinations of max/min_wal_size in postgresql.conf with different wal_segment_size. .)shutdown the server forcefully (kill -9) / promote slave / to make sure -recovery happened successfully. .)with different utilities like pg_resetwal/pg_upgrade/pg_waldump .)running pg_bench for substantial workloads (~ 1 hour) .)WAL segment size is not default (changed at the time of ./configure) + different wal_segment_size (at the time of initdb) . Things looks fine. -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Multi column range partition table
On 5 July 2017 at 10:43, Amit Langotewrote: > 0001 is your patch to tidy up check_new_partition_bound() (must be > applied before 0002) > I pushed this first patch, simplifying check_new_partition_bound() for range partitions, since it seemed like a good simplification, but note that I don't think that was actually the cause of the latent bug you saw upthread. I think the real issue was in partition_rbound_cmp() -- normally, if the upper bound of one partition coincides with the lower bound of another, that function would report the upper bound as the smaller one, but that logic breaks if any of the bound values are infinite, since then it will exit early, returning 0, without ever comparing the "lower" flags on the bounds. I'm tempted to push a fix for that independently, since it's a bug waiting to happen, even though it's not possible to hit it currently. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=
On Thu, Jul 6, 2017 at 3:45 AM, Tom Lanewrote: > I wrote: >> (Pokes at it some more...) Oh, interesting: it behaves that way except >> when p is exactly the lowest histogram entry. > + /* + * In the first bin (i==1), add a fudge factor that ensures + * that histfrac is at least eq_selec. We do this because we + * know that the first histogram entry does satisfy the + * inequality (if !isgt) or not satisfy it (if isgt), so our + * estimate here should certainly not be zero even if binfrac + * is zero. (XXX experimentally this is the correct way to do + * it, but why isn't it a linear adjustment across the whole + * histogram rather than just the first bin?) + */ Given that the values are distinct, (I've some doubts for real number case) if histogram_bounds are assigned as, {0,9,19,29,39,49,59,69,79,89,99,109,119,129,13,..} I think the buckets are defined as, 0 < bucket1 <= 9 9 < bucket2 <=19 19 < bucket3 <= 29 and so on. Because, the histfrac is calculated as follows: histfrac = (double) (bucket_current - 1) + (val - low) / (high - low); (where bucket_current is obtained by doing a binary search on histogram_bounds.) histfrac /= (double) (nvalues - 1); So, if val=low, then hisfrac = (bucket_current - 1)/num_of_buckets which means it assumes val is included in the previous bucket. This means that it always fails to calculate the selectivity for lowest histogram boundary. Hence, we need adjustment only for the first bucket. Do you think my reasoning justifies your concern? -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Another comment typo in execMain.c
Here is a comment in ExecFindPartition() in execMain.c: /* * First check the root table's partition constraint, if any. No point in * routing the tuple it if it doesn't belong in the root table itself. */ I think that in the second sentence "it" just before "if" is a typo. Attached is a small patch for fixing that. Best regards, Etsuro Fujita diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index 0f08283..06b467b 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -3310,7 +3310,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd, /* * First check the root table's partition constraint, if any. No point in -* routing the tuple it if it doesn't belong in the root table itself. +* routing the tuple if it doesn't belong in the root table itself. */ if (resultRelInfo->ri_PartitionCheck) ExecPartitionCheck(resultRelInfo, slot, estate); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Oddity in error handling of constraint violation in ExecConstraints for partitioned tables
Here is an example: postgres=# create table col_desc (a int, b int) partition by list (a); postgres=# create table col_desc_1 partition of col_desc for values in (1); postgres=# alter table col_desc_1 add check (b > 0); postgres=# create role col_desc_user; postgres=# grant insert on col_desc to col_desc_user; postgres=# revoke select on col_desc from col_desc_user; postgres=# set role col_desc_user; postgres=> insert into col_desc values (1, -1) returning 1; ERROR: new row for relation "col_desc_1" violates check constraint "col_desc_1_b_check" DETAIL: Failing row contains (a, b) = (1, -1). Looks good, but postgres=> reset role; postgres=# create table result (f1 text default 'foo', f2 text default 'bar', f3 int); postgres=# grant insert on result to col_desc_user; postgres=# set role col_desc_user; postgres=> with t as (insert into col_desc values (1, -1) returning 1) insert into result (f3) select * from t; ERROR: new row for relation "col_desc_1" violates check constraint "col_desc_1_b_check" Looks odd to me because the error message doesn't show any DETAIL info; since the CTE query, which produces the message, is the same as the above query, the message should also be the same as the one for the above query. I think the reason for that is: ExecConstraints looks at an incorrect resultRelInfo to build the message for a violating tuple that has been routed *in the case where the partitioned table isn't the primary ModifyTable node*, which leads to deriving an incorrect modifiedCols and then invoking ExecBuildSlotValueDescription with the wrong bitmap. I think this should be fixed. Attached is a patch for that. Best regards, Etsuro Fujita *** a/src/backend/executor/execMain.c --- b/src/backend/executor/execMain.c *** *** 92,97 static bool ExecCheckRTEPermsModified(Oid relOid, Oid userid, --- 92,99 Bitmapset *modifiedCols, AclMode requiredPerms); static void ExecCheckXactReadOnly(PlannedStmt *plannedstmt); + static ResultRelInfo *ExecFindResultRelInfo(EState *estate, Oid relid, + bool missing_ok); static char *ExecBuildSlotValueDescription(Oid reloid, TupleTableSlot *slot, TupleDesc tupdesc, *** *** 1360,1365 InitResultRelInfo(ResultRelInfo *resultRelInfo, --- 1362,1392 } /* + * ExecFindResultRelInfo -- find the ResultRelInfo struct for given relation OID + * + * If no such struct, either return NULL or throw error depending on missing_ok + */ + static ResultRelInfo * + ExecFindResultRelInfo(EState *estate, Oid relid, bool missing_ok) + { + ResultRelInfo *rInfo; + int nr; + + rInfo = estate->es_result_relations; + nr = estate->es_num_result_relations; + while (nr > 0) + { + if (RelationGetRelid(rInfo->ri_RelationDesc) == relid) + return rInfo; + rInfo++; + nr--; + } + if (!missing_ok) + elog(ERROR, "failed to find ResultRelInfo for relation %u", relid); + return NULL; + } + + /* *ExecGetTriggerResultRel * * Get a ResultRelInfo for a trigger target relation. Most of the time, *** *** 1379,1399 ResultRelInfo * ExecGetTriggerResultRel(EState *estate, Oid relid) { ResultRelInfo *rInfo; - int nr; ListCell *l; Relationrel; MemoryContext oldcontext; /* First, search through the query result relations */ ! rInfo = estate->es_result_relations; ! nr = estate->es_num_result_relations; ! while (nr > 0) ! { ! if (RelationGetRelid(rInfo->ri_RelationDesc) == relid) ! return rInfo; ! rInfo++; ! nr--; ! } /* Nope, but maybe we already made an extra ResultRelInfo for it */ foreach(l, estate->es_trig_target_relations) { --- 1406,1419 ExecGetTriggerResultRel(EState *estate, Oid relid) { ResultRelInfo *rInfo; ListCell *l; Relationrel; MemoryContext oldcontext; /* First, search through the query result relations */ ! rInfo = ExecFindResultRelInfo(estate, relid, true); ! if (rInfo) ! return rInfo; /* Nope, but maybe we already made an extra ResultRelInfo for it */ foreach(l, estate->es_trig_target_relations) { *** *** 1828,1833 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot, --- 1848,1854 EState *estate) { Relationrel =
Re: [HACKERS] pgsql 10: hash indexes testing
On Thu, Jul 6, 2017 at 9:32 AM, APwrote: > On Thu, Jul 06, 2017 at 08:52:03AM +0530, Amit Kapila wrote: >> On Thu, Jul 6, 2017 at 2:40 AM, AP wrote: >> > On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote: >> >> >> > version | bucket_pages | overflow_pages | bitmap_pages | >> >> >> > unused_pages | live_items | dead_items | free_percent >> >> >> > -+--++--+--+++-- >> >> >> >3 | 10485760 |2131192 | 66 | >> >> >> > 0 | 2975444240 | 0 | 1065.19942179026 >> >> >> > (1 row) >> > ... >> >> >> > And I do appear to have an odd percentage of free space. :) >> >> >> >> Are you worried about "unused_pages"? If so, then this is not a major >> > >> > Nope. "free_percent" Just a bit weird that I have it at over 1000% free. :) >> > Shouldn't that number be < 100? >> >> Yes, there seems to be some gotcha in free percent calculation. Is it >> possible for you to debug or in some way share the test? > > I can try to debug but I need to know what to look for and how. > Okay, you need to debug function pgstathashindex and have your breakpoint at free_percent calculation, then try to get the values of nblocks, all the values in stats struct and total_space. I think after getting this info we can further decide what to look for. > If it > requires data reloads then that's around 12-15 hours per hit. > > As for sharing the test, that'd mean sharing the data. If it helps I can > provide the content of that column but you're looking at an sql dump that > is roughly (2*64+1)*2.3 billion (give or take a (few) billion) in size. :) > This is tricky, will Ibe able to import that column values by creating table, if so, then probably it is worth. >> > Well, if this is the cause of my little issue, it might be nice. ATM >> > my import script bombs out on errors (that I've duplicated! :) It took >> > 11 hours but it bombed) and it sounds like I'll need to do a manual >> > VACUUM before it can be run again. >> > >> >> Yeah, I think after manual vacuum you should be able to proceed. > > > That's markedly different. At a rough estimate I should be able to double > the row count before I hit the "real" limits of a hash index. > > When you refer to VACUUM do you mean VACUUM FULL? > Normal Vauum won't work for this case as you don't have dead tuples (deleted rows in table). > That's going to get nasty > if that's the case as the table is almost 1TB in size. > Yeah, I think for this situation REINDEX will be a better option because anyway Vacuum Full will rewrite the entire index and heap. >> >From above stats, it is clear that you have hit the maximum number of >> overflow pages we can support today. Now, here one can argue that we >> should increase the limit of overflow pages in hash index which we can >> do, but I think you can again hit such a problem after some more time. > > True, though I'm not far off hitting it "for real" at the present limits > so an increase would be of benefit in other respects (ie not needing to > have as many tables to manage because we have to bundle a set off due > to index limits). > >> So at this stage, there are two possibilities for you (a) run manual >> Vacuum in-between (b) create the index after bulk load. In general, >> whatever I have mentioned in (b) is a better way for bulk loading. >> Note here again the free_percent seems to be wrong. > > If you didn't mean VACUUM FULL then (a) does not appear to work and (b) > would kill usability of the db during import, which would happen daily > (though with a vastly reduced data size). > If the data size in subsequent import is very less, then you only need to Create the index after first import and then let it continue like that. > It also messes with the > permission model that has been set up for the least-trusted section of > the project (at the moment that section can only INSERT). > As per your permission model Vacuum Full is allowed, but not Create index? > Still, I may wind up going with (b) if a VACUUM FULL is the only other > real choice but would prefer to avoid it. The fact that the index is > around 300GB smaller (so far) than btree may well be worth the pain > all by its lonesome. > As mentioned above REINDEX might be a better option. I think for such situation we should have some provision to allow squeeze functionality of hash exposed to the user, this will be less costly than REINDEX and might serve the purpose for the user. Hey, can you try some hack in the code which can at least tell us whether squeezing hash can give us any benefit or is the index in such a situation that only REINDEX will help. Basically, while doing Vacuum (non-full), you need to somehow bypass the below line in lazy_scan_heap lazy_scan_heap { .. if (vacrelstats->num_dead_tuples > 0) .. } I am not sure it will work, but we can try once if you are
Re: [HACKERS] Update comment in ExecPartitionCheck
On 2017/07/04 18:15, Amit Langote wrote: On 2017/07/04 17:55, Etsuro Fujita wrote: This comment in an error handling in ExecPartitionCheck(): if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext)) { char *val_desc; Relationorig_rel = rel; /* See the comment above. */ if (resultRelInfo->ri_PartitionRoot) should be updated because we don't have any comment on that above in the code. Since we have a comment on that in ExecConstraints() defined just below that function, I think the comment should be something like this: "See the comment in ExecConstraints().". Attached is a patch for that. Thanks for fixing that. I forgot to do the same in the patch that got committed as 15ce775faa428 [1], which moved that code block from ExecConstraints() to ExecPartitionCheck(). Thanks for the explanation! In relation to this, I found odd behavior in the error handling. Since that is another topic, I'll start a new thread. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] increasing the default WAL segment size
Hello, On Tue, Mar 28, 2017 at 1:06 AM, Beena Emersonwrote: > Hello, > > On Sat, Mar 25, 2017 at 10:32 PM, Peter Eisentraut > wrote: >> >> At this point, I suggest splitting this patch up into several >> potentially less controversial pieces. >> >> One big piece is that we currently don't support segment sizes larger >> than 64 GB, for various internal arithmetic reasons. Your patch appears >> to address that. So I suggest isolating that. Assuming it works >> correctly, I think there would be no great concern about it. >> >> The next piece would be making the various tools aware of varying >> segment sizes without having to rely on a built-in value. >> >> The third piece would then be the rest that allows you to set the size >> at initdb >> >> If we take these in order, we would make it easier to test various sizes >> and see if there are any more unforeseen issues when changing sizes. It >> would also make it easier to do performance testing so we can address >> the original question of what the default size should be. > > > PFA the patches divided into 3 parts: > > 02-increase-max-wal-segsize.patch - Increases the wal-segsize and changes > the internal representation of max_wal_size and min_wal_size to mb. Already committed. > > 03-modify-tools.patch - Makes XLogSegSize into a variable, currently set as > XLOG_SEG_SIZE and modifies the tools to fetch the size instead of using > inbuilt value. > The updated 03-modify-tools_v2.patch has the following changes: - Rebased over current HEAD - Impoverished comments - Adding error messages where applicable. - Replace XLOG_SEG_SIZE in the tools and xlog_internal.h to XLogSegSize. XLOG_SEG_SIZE is the wal_segment_size the server was compiled with and XLogSegSize is the wal_segment_size of the target server on which the tool is run. When the binaries used and the target server are compiled with different wal_segment_size, the calculations would be be affected and the tool would crash. To avoid it, all the calculations used by tool should use XLogSegSize. - pg_waldump : The fuzzy_open_file is split into two functions - open_file_in_directory and identify_target_directory so that code can be reused when determining the XLogSegSize from the WAL file header. - IsValidXLogSegSize macro is moved from 04 to here so that we can use it for validating the size in all the tools. > 04-initdb-walsegsize.patch - Adds the initdb option to set wal-segsize and > make related changes. Update pg_test_fsync to use DEFAULT_XLOG_SEG_SIZE > instead of XLOG_SEG_SIZE The 04-initdb-walsegsize_v2.patch has the following improvements: - Rebased over new 03 patch - Pass the wal-segsize intidb option as command-line option rathern than in an environment variable. - Since new function check_wal_size had only had two checks and was sed once, moved the code to ReadControlFile where it is used and removed this function. - improve comments and add validations where required. - Use DEFAULT_XLOG_SEG_SIZE to set the min_wal_size and max_wal_size,instead of the value 16. - Use XLogSegMaxSize and XLogSegMinSize to calculate the range of guc wal_segment_size instead 16 - INT_MAX. > >> >> One concern I have is that your patch does not contain any tests. There >> should probably be lots of tests. > > > 05-initdb_tests.patch adds tap tests to initialize cluster with different > wal_segment_size and then check the config values. What other tests do you > have in mind? Checking the various tools? > > -- Beena Emerson EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company 04-initdb-walsegsize_v2.patch Description: Binary data 03-modify-tools_v2.patch Description: Binary data 01-add-XLogSegmentOffset-macro_rebase.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal : For Auto-Prewarm.
On Wed, Jul 5, 2017 at 6:25 PM, Mithun Cywrote: > On Mon, Jul 3, 2017 at 11:58 AM, Amit Kapila wrote: >> On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy >> wrote: >>> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy >>> wrote: On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown wrote: > > Also, I find it a bit messy that launch_autoprewarm_dump() doesn't > detect an autoprewarm process already running. I'd want this to > return NULL or an error if called for a 2nd time. We log instead of error as we try to check only after launching the worker and inside worker. One solution could be as similar to autoprewam_dump_now(), the autoprewarm_start_worker() can init shared memory and check if we can launch worker in backend itself. I will try to fix same. >>> >>> I have fixed it now as follows >>> >>> +Datum >>> +autoprewarm_start_worker(PG_FUNCTION_ARGS) >>> +{ >>> + pid_t pid; >>> + >>> + init_apw_shmem(); >>> + pid = apw_state->bgworker_pid; >>> + if (pid != InvalidPid) >>> + ereport(ERROR, >>> + (errmsg("autoprewarm worker is already running under PID >>> %d", >>> + pid))); >>> + >>> + autoprewarm_dump_launcher(); >>> + PG_RETURN_VOID(); >>> +} >>> >>> In backend itself, we shall check if an autoprewarm worker is running >>> then only start the server. There is a possibility if this function is >>> executed concurrently when there is no worker already running (Which I >>> think is not a normal usage) then both call will say it has >>> successfully launched the worker even though only one could have >>> successfully done that (other will log and silently die). >> >> Why can't we close this remaining race condition? Basically, if we >> just perform all of the actions in this function under the lock and >> autoprewarm_dump_launcher waits till the autoprewarm worker has >> initialized the bgworker_pid, then there won't be any remaining race >> condition. I think if we don't close this race condition, it will be >> unpredictable whether the user will get the error or there will be >> only a server log for the same. > > Yes, I can make autoprewarm_dump_launcher to wait until the launched > bgworker set its pid, but this requires one more synchronization > variable between launcher and worker. More than that I see > ShmemInitStruct(), LWLockAcquire can throw ERROR (restarts the > worker), which needs to be called before setting pid. So I thought it > won't be harmful let two concurrent calls to launch workers and we > just log failures. Please let me know if I need to rethink about it. > I don't know whether you need to rethink but as presented in the patch, it seems unclear to me about the specs of API. As this is an exposed function to the user, I think the behavior should be well defined. If you don't think changing code has many advantages, then at the very least update the docs to indicate the expectation and behavior of the API. Also, I think it is better to add few comments in the code to tell about the unpredictable behavior in case of race condition and the reason for same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers