Re: [HACKERS] path toward faster partition pruning
Horiguchi-san, Thanks for taking a look. Replying to all your emails here. On 2017/11/10 12:30, Kyotaro HORIGUCHI wrote: > In 0002, bms_add_range has a bit naive-looking loop > > + while (wordnum <= uwordnum) > + { > + bitmapword mask = (bitmapword) ~0; > + > + /* If working on the lower word, zero out bits below 'lower'. */ > + if (wordnum == lwordnum) > + { > + int lbitnum = BITNUM(lower); > + mask >>= lbitnum; > + mask <<= lbitnum; > + } > + > + /* Likewise, if working on the upper word, zero bits above > 'upper' */ > + if (wordnum == uwordnum) > + { > + int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + > 1); > + mask <<= ushiftbits; > + mask >>= ushiftbits; > + } > + > + a->words[wordnum++] |= mask; > + } > > Without some aggressive optimization, the loop spends most of its > time on check-and-jump for nothing, especially with many > partitions, and it is somewhat unintuitive. > > The following uses a slightly tricky bitmap operation but > is straightforward as a whole. > > = > /* fill the bits upper from BITNUM(lower) (0-based) of the first word */ > a->words[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1); > > /* fill up intermediate words */ > while (wordnum < uwordnum) > a->words[wordnum++] = ~(bitmapword) 0; > > /* fill up to BITNUM(upper) bit (0-based) of the last word */ > a->words[wordnum++] |= > (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1)); > = Considering also David's comment downthread, I will try to incorporate this into bms_add_range(). > In 0003, > > +match_clauses_to_partkey(RelOptInfo *rel, > ... > + if (rinfo->pseudoconstant && > + (IsA(clause, Const) && > + ((((Const *) clause)->constisnull) || > + !DatumGetBool(((Const *) clause)->constvalue)))) > + { > + *constfalse = true; > + continue; > > If we call this function in both conjunction and disjunction > contexts (the latter only in the recursive case), constfalse == true means no need of any clauses for the former case.
> > Since (I think) just a list of RestrictInfo is expected to be > treated as a conjunction (it's quite dubious, though..), I think it makes sense to consider a list of RestrictInfo's, such as baserestrictinfo, that is passed as input to match_clauses_to_partkey(), to be mutually conjunctive for our purpose here. > we > might better call this for each subnode of a disjunction. Or, > like match_clauses_to_index, we might better provide > match_clause_to_partkey(rel, rinfo, contains_const), which > returns NULL if constfalse. (I'm not self-confident on this..) After reading your comment, I realized that it was wrong that the recursive call to match_clauses_to_partkey() passed the arguments of an OR clause all at once. That broke the assumption mentioned above that all of the clauses in the list passed to match_clauses_to_partkey() are mutually conjunctive. Instead, we must make a single-member list for each of the OR clause's arguments and pass that instead. Then, if we get constfalse for all of the OR's arguments, we return constfalse=true to the original caller. > > + /* > + * If no commutator exists, cannot flip the qual's args, > + * so give up. > + */ > + if (!OidIsValid(expr_op)) > + continue; > > I suppose it's better to keep leftop and rightop as is rather than > flipping them over so that the var is placed on the left side. Does that make > things so complex? The reason to do it that way is that code placed far away (code in partition.c that extracts constant values to use for pruning from matched clauses) can always assume that the clauses determined to be useful for partition pruning come in the 'partkey op constant' form. > + * It the operator happens to be '<>', which is never listed > If? Right, will fix.
> + if (!op_in_opfamily(expr_op, partopfamily)) > + { > + Oid negator = get_negator(expr_op); > + > + if (!OidIsValid(negator) || > + !op_in_opfamily(negator, partopfamily)) > + continue; > > classify_partition_bounding_keys() checks the same thing by > checking whether the negator's strategy is > BTEqualStrategyNumber. (I'm not sure the operator is guaranteed > to be of btree, though..) Don't they need to be checked in a similar > way? You're right. The <>'s negator may not always be a btree operator. So, we should add a check in match_clauses_to_partkey() that list or range partitioning is in use, because only those require a btree operator family. We now have hash partitioning, so we need to be careful not to make the assumption that all partitioning operators are from btree operator families. If match_clauses_to_partkey()
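The word-at-a-time fill discussed in the bms_add_range() review above can be made concrete with a standalone sketch. The typedefs and macros below are stand-ins for the PostgreSQL definitions, not the actual bitmapset.c code; note also that the last word should keep its low BITNUM(upper) + 1 bits, whereas the quoted snippet's `- 1` shift appears to drop the two highest of them.

```c
#include <stdint.h>

/* Stand-ins for the PostgreSQL definitions (assumed for this sketch). */
typedef uint32_t bitmapword;
#define BITS_PER_BITMAPWORD 32
#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)  ((x) % BITS_PER_BITMAPWORD)

/*
 * Set bits [lower, upper] (inclusive) in 'words': partial masks for the
 * first and last words, whole-word stores for everything in between --
 * no per-iteration boundary tests as in the quoted loop.
 */
static void
add_range(bitmapword *words, int lower, int upper)
{
    int     wordnum = WORDNUM(lower);
    int     uwordnum = WORDNUM(upper);

    if (wordnum == uwordnum)
    {
        /* Range fits in one word: intersect the two partial masks. */
        bitmapword mask = (~(bitmapword) 0) << BITNUM(lower);

        mask &= (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) + 1));
        words[wordnum] |= mask;
        return;
    }

    /* First word: bits from BITNUM(lower) upward. */
    words[wordnum++] |= (~(bitmapword) 0) << BITNUM(lower);

    /* Intermediate words: all bits set. */
    while (wordnum < uwordnum)
        words[wordnum++] = ~(bitmapword) 0;

    /* Last word: the low BITNUM(upper) + 1 bits. */
    words[wordnum] |= (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) + 1));
}
```

The single-word case needs separate handling because the first-word and last-word masks must be intersected rather than OR'ed in independently.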
Re: [HACKERS] path toward faster partition pruning
Hi Amul. On 2017/11/09 20:05, amul sul wrote: > On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote > wrote: >> On 2017/11/06 14:32, David Rowley wrote: >>> On 6 November 2017 at 17:30, Amit Langote wrote: >>>> On 2017/11/03 13:32, David Rowley wrote: >>>>> On 31 October 2017 at 21:43, Amit Langote wrote: > [] >> >> Attached updated set of patches, including the fix to make the new pruning >> code handle Boolean partitioning. >> > > I am getting following warning on mac os x: Thanks for the review. > partition.c:1800:24: warning: comparison of constant -1 with > expression of type 'NullTestType' > (aka 'enum NullTestType') is always false > [-Wtautological-constant-out-of-range-compare] > if (keynullness[i] == -1) > ~~ ^ ~~ > partition.c:1932:25: warning: comparison of constant -1 with > expression of type 'NullTestType' > (aka 'enum NullTestType') is always false > [-Wtautological-constant-out-of-range-compare] > if (keynullness[i] == -1) > ~~ ^ ~~ > 2 warnings generated. > > > Comment for 0004 patch: > 270 + /* -1 represents an invalid value of NullTestType. */ > 271 + memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType)); > > I think we should not use memset to set a value other than 0 or true/false. > This will work for -1 on the system where values are stored in the 2's > complement but I am afraid of other architecture. OK, I will remove all instances of comparing and setting variables of type NullTestType to a value of -1. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
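As an aside on the memset(-1) concern above: -1 is not a member of NullTestType at all, so one portable way to express "no NullTest seen yet for this key" is an explicit validity array next to the enum array. The definitions below are assumed stand-ins for illustration, not the patch's actual fix.

```c
#include <stdbool.h>

/* Stand-ins for the PostgreSQL definitions (assumed for this sketch). */
typedef enum NullTestType { IS_NULL, IS_NOT_NULL } NullTestType;
#define PARTITION_MAX_KEYS 32

/*
 * Portable alternative to memset(keynullness, -1, ...): track "no NullTest
 * seen for this key" with an explicit boolean array instead of smuggling an
 * out-of-range value into the enum array.
 */
typedef struct KeyNullness
{
    NullTestType keynullness[PARTITION_MAX_KEYS];
    bool         keynullness_set[PARTITION_MAX_KEYS];   /* false = not seen */
} KeyNullness;

static void
record_nulltest(KeyNullness *kn, int keyno, NullTestType type)
{
    kn->keynullness[keyno] = type;
    kn->keynullness_set[keyno] = true;
}
```

This sidesteps both compiler warnings quoted above, since the enum array is only ever read when the corresponding flag is set.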
Re: [HACKERS] [PATCH] fix wrong create table statement in documentation
On 2017/11/09 7:21, Tom Lane wrote: > jotpe writes: >> In the current documentation [1] this create table statement is listed: >> CREATE TABLE measurement_y2008m01 PARTITION OF measurement >> FOR VALUES FROM ('2008-01-01') TO ('2008-02-01') >> TABLESPACE fasttablespace >> WITH (parallel_workers = 4); > > Yup, that's wrong. Fix pushed, thanks! Oops! Thanks Tom for the fix. Regards, Amit
Re: [HACKERS] path toward faster partition pruning
Hi Rajkumar, Thanks for testing. On 2017/11/08 15:52, Rajkumar Raghuwanshi wrote: > On Mon, Nov 6, 2017 at 3:31 PM, Amit Langote > wrote: > >> Attached updated set of patches, including the fix to make the new pruning >> code handle Boolean partitioning. >> > > Hi Amit, > > I have tried pruning for different values of constraint exclusion GUC > change, not sure exactly how it should behave, but I can see with the > delete statement pruning is not happening when constraint_exclusion is off, > but select is working as expected. Is this expected behaviour? Hmm, the new pruning only works for selects, not DML. The patch also changes get_relation_constraints() to not include the internal partition constraints, but mistakenly does so for all query types, not just select. Will look into it. Thanks, Amit
Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
Hi David. Thanks for the review. (..also looking at the comments you sent earlier today.) On 2017/11/07 11:14, David Rowley wrote: > On 7 November 2017 at 01:52, David Rowley > wrote: >> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13) > > I have a little more review to share: > > 1. Missing "in" in comment. Should be "mentioned in" > > * get_append_rel_partitions > * Return the list of partitions of rel that pass the clauses mentioned > * rel->baserestrictinfo > > 2. Variable should be declared in inner scope with the following fragment: > > void > set_basic_child_rel_properties(PlannerInfo *root, >RelOptInfo *rel, >RelOptInfo *childrel, >AppendRelInfo *appinfo) > { > AttrNumber attno; > > if (rel->part_scheme) > { > > which makes the code the same as where you moved it from. It seems like you included the above changes in your attached C file, which I will incorporate into my repository. > 3. Normally lfirst() is assigned to a variable at the start of a > foreach() loop. You have code which does not follow this. > > foreach(lc, clauses) > { > Expr *clause; > int i; > > if (IsA(lfirst(lc), RestrictInfo)) > { > RestrictInfo *rinfo = lfirst(lc); > > You could assign this to a Node * since the type is unknown to you at > the start of the loop. Will make the suggested changes to match_clauses_to_partkey(). > 4. > /* > * Useless if what we're thinking of as a constant is actually > * a Var coming from this relation. > */ > if (bms_is_member(rel->relid, constrelids)) > continue; > > should this be moved to just above the op_strict() test? This one seems > cheaper. Agreed, will do. Also makes sense to move the PartCollMatchesExprColl() test together. > 5. Typo "paritions": /* No clauses to prune paritions, so scan all > partitions. */ > > But thinking about it more the comment should something more along the > lines of /* No useful clauses for partition pruning. Scan all > partitions. */ You fixed it. 
:) > The key difference is that there might be clauses, just without Consts. > > Actually, the more I look at get_append_rel_partitions() I think it > would be better if you re-shaped that if/else if test so that it only > performs the loop over the partindexes if it's been set. > > I ended up with the attached version of the function after moving > things around a little bit. Thanks a lot for that. Looks much better now. > I'm still reviewing but thought I'd share this part so far. As mentioned at the top, I'm looking at your latest comments and they all seem to be good points to me, so will address those in the next version. Thanks, Amit
Re: [HACKERS] UPDATE of partition key
On 2017/11/07 14:40, Amit Khandekar wrote: > On 7 November 2017 at 00:33, Robert Haas wrote: > >> Also, +1 for Amit Langote's idea of trying to merge >> mt_perleaf_childparent_maps with mt_persubplan_childparent_maps. > > Currently I am trying to see if it simplifies things if we do that. We > will be merging these arrays into one, but we are adding a new int[] > array that maps subplans to leaf partitions. Will get back with how it > looks finally. One thing to note is that the int[] array I mentioned will be much faster to compute than going to convert_tuples_by_name() to build the additional maps array. Thanks, Amit
Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
On 2017/11/06 21:52, David Rowley wrote: > On 6 November 2017 at 23:01, Amit Langote > wrote: >> OK, I have gotten rid of the min/max partition index interface and instead >> adopted the bms_add_range() approach by including your patch to add the >> same in the patch set (which is now 0002 in the whole set). I have to >> admit that it's simpler to understand the new code with just Bitmapsets to >> look at, but I'm still a bit concerned about materializing the whole set >> right within partition.c, although we can perhaps optimize it later. > > Thanks for making that change. The code looks much more simple now. > > For performance, if you're worried about a very large number of > partitions, then I think you're better off using bms_next_member() > rather than bms_first_member(), (likely this applies globally, but you > don't need to worry about those). > > The problem with bms_first_member is that it must always loop over the > 0 words before it finds any bits set for each call, whereas > bms_next_member will start on the word it was last called for. There > will likely be a pretty big performance difference between the two > when processing a large Bitmapset. Ah, thanks for the explanation. I will change it to bms_next_member() in the next version. >> Attached updated set of patches, including the fix to make the new pruning >> code handle Boolean partitioning. > > Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13) Thank you. Regards, Amit
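A toy model of the bms_next_member() calling convention (a hypothetical simplified version, not the real bitmapset.c implementation) illustrates why it scales better than repeated bms_first_member() calls: the previous member is passed back in, so each call resumes scanning where the last one stopped instead of walking the leading zero words again.

```c
#include <stdint.h>

/*
 * Simplified sketch of bms_next_member(): return the smallest member
 * greater than 'prevbit', or -2 when there are no more members (the
 * same sentinel the real function uses).  Starting the scan at
 * prevbit + 1 is what lets a full iteration visit each word only once
 * in total, rather than rescanning from word 0 on every call.
 */
static int
next_member(const uint32_t *words, int nwords, int prevbit)
{
    int b;

    for (b = prevbit + 1; b < nwords * 32; b++)
        if (words[b / 32] & ((uint32_t) 1 << (b % 32)))
            return b;
    return -2;
}
```

The caller's loop then takes the shape `x = -1; while ((x = next_member(words, nwords, x)) >= 0) ...`, matching how bms_next_member() is meant to be used; bms_first_member(), by contrast, destructively clears each bit and restarts its scan from the first word every time.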
Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
On 2017/11/06 13:15, David Rowley wrote: > On 31 October 2017 at 21:43, Amit Langote > wrote: >> Attached updated version of the patches > > match_clauses_to_partkey() needs to allow for the way quals on Bool > columns are represented. > > create table pt (a bool not null) partition by list (a); > create table pt_true partition of pt for values in('t'); > create table pt_false partition of pt for values in('f'); > explain select * from pt where a = true; > QUERY PLAN > -- > Append (cost=0.00..76.20 rows=2810 width=1) >-> Seq Scan on pt_false (cost=0.00..38.10 rows=1405 width=1) > Filter: a >-> Seq Scan on pt_true (cost=0.00..38.10 rows=1405 width=1) > Filter: a > (5 rows) > > match_clause_to_indexcol() shows an example of how to handle this. > > explain select * from pt where a = false; > > will need to be allowed too. This works slightly differently. You're right. I've fixed things to handle Boolean partitioning in the updated set of patches I will post shortly. Thanks, Amit
Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning
On 2017/11/06 12:53, David Rowley wrote: > On 3 November 2017 at 17:32, David Rowley > wrote: >> 2. This code is way more complex than it needs to be. >> >> if (num_parts > 0) >> { >> int j; >> >> all_indexes = (int *) palloc(num_parts * sizeof(int)); >> j = 0; >> if (min_part_idx >= 0 && max_part_idx >= 0) >> { >> for (i = min_part_idx; i <= max_part_idx; i++) >> all_indexes[j++] = i; >> } >> if (!bms_is_empty(other_parts)) >> while ((i = bms_first_member(other_parts)) >= 0) >> all_indexes[j++] = i; >> if (j > 1) >> qsort((void *) all_indexes, j, sizeof(int), intcmp); >> } >> >> It looks like the min/max partition stuff is just complicating things >> here. If you need to build this array of all_indexes[] anyway, I don't >> quite understand the point of the min/max. It seems like min/max would >> probably work a bit nicer if you didn't need the other_parts >> BitmapSet, so I recommend just getting rid of min/max completely and >> just have a BitmapSet with bit set for each partition's index you >> need, you'd not need to go to the trouble of performing a qsort on an >> array and you could get rid of quite a chunk of code too. >> >> The entire function would then not be much more complex than: >> >> partindexes = get_partitions_from_clauses(parent, partclauses); >> >> while ((i = bms_first_member(partindexes)) >= 0) >> { >> AppendRelInfo *appinfo = rel->part_appinfos[i]; >> result = lappend(result, appinfo); >> } >> >> Then you can also get rid of your intcmp() function too. > > I've read a bit more of the patch and I think even more now that the > min/max stuff should be removed. Oops, I didn't catch this email before sending my earlier reply. Thanks for the bms range patch. Will reply to this shortly after reading your patch and thinking on it a bit. Thanks, Amit
Re: [HACKERS] dropping partitioned tables without CASCADE
On 2017/11/03 21:39, Alvaro Herrera wrote: > Ashutosh Bapat wrote: >> On Fri, Nov 3, 2017 at 1:42 PM, Alvaro Herrera >> wrote: > >>> I think adding "is partitioned" at end of line isn't good; looks like a >>> phrase but isn't translatable. Maybe add keyword PARTITIONED instead? >> >> In that case may be we should separate bounds and "PARTITIONED" with a >> ",". "part_default DEFAULT, PARTITIONED" would read better than >> "part_default DEFAULT PARTITIONED"? > > Hmm, I vote +0.5 for the comma. Me too. >>> Is it possible to put it at either start or end of the list? >> >> Right now, we could do that if we order the list by bound expression; >> lexically DEFAULT would come before FOR VALUES ... . But that's not >> future-safe; we may have a bound expression starting with A, B or C. >> Beyond that it really gets tricky to order the partitions by bounds. > > I was just thinking in changing the query to be "order by > is_the_default_partition, partition_name" instead of just "order by > partition_name". Sorting by bounds rather than name (a feature whose > worth should definitely be discussed separately IMV) sounds a lot more > complicated. Yeah, it sounds like a desirable feature, but as you both say, it should be discussed separately. Since the facility to order partitions by bound is currently internal to the server, we'd need some new server-side functionality to expose it with a sane SQL-callable interface, which clearly needs its own separate discussion. Thanks, Amit
Re: [HACKERS] Setting pd_lower in GIN metapage
On 2017/11/03 6:24, Tom Lane wrote: > Amit Langote writes: >> On 2017/09/26 16:30, Michael Paquier wrote: >>> Cool, let's switch it back to a ready for committer status then. > >> Sure, thanks. > > Pushed with some cosmetic adjustments --- mostly, making the comments more > explicit about why we need the apparently-redundant assignments to > pd_lower. Thank you. Regards, Amit
Re: [HACKERS] UPDATE of partition key
Hi Amit. Thanks a lot for the updated patches and sorry that I couldn't get to looking at your emails sooner. Note that I'm replying here to both of your emails, but looking at only the latest v22 patch. On 2017/10/24 0:15, Amit Khandekar wrote: > On 16 October 2017 at 08:28, Amit Langote > wrote: >> >> + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup >> == NULL))) >> >> Is there some reason why a bitwise operator is used here? > > That exact condition means that the function is called for transition > capture for updated rows being moved to another partition. For this > scenario, either the oldtup or the newtup is NULL. I wanted to exactly > capture that condition there. I think the bitwise operator is more > user-friendly in emphasizing the point that it is indeed an "either a > or b, not both" condition. I see. In that case, since this patch adds the new condition, a note about it in the comment just above would be good, because the situation you describe here seems to arise only during update-tuple-routing, IIUC. >> + * 'perleaf_parentchild_maps' receives an array of TupleConversionMap >> objects >> + * with one entry for every leaf partition (required to convert input >> tuple >> + * based on the root table's rowtype to a leaf partition's rowtype after >> + * tuple routing is done) >> >> Could this be named leaf_tupconv_maps, maybe? It perhaps makes clear that >> they are maps needed for "tuple conversion". And the other field holding >> the reverse map as leaf_rev_tupconv_maps. Either that or use underscores >> to separate words, but then it gets too long I guess. > > In the master branch, this param is now already there with the name > "tup_conv_maps". In the rebased version in the earlier mail, I haven't > changed it again. I think "tup_conv_maps" looks clear enough. OK. In the latest patch: + * 'update_rri' has the UPDATE per-subplan result rels. 
These are re-used + * instead of allocating new ones while generating the array of all leaf + * partition result rels. Instead of: "These are re-used instead of allocating new ones while generating the array of all leaf partition result rels." how about: "There is no need to allocate a new ResultRelInfo entry for leaf partitions for which one already exists in this array" >> ISTM, ConvertPartitionTupleSlot has a very specific job and a bit complex >> interface. I guess it could simply have the following interface: >> >> static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate, >> HeapTuple tuple, bool is_update); >> >> And figure out, based on the value of is_update, which map to use and >> which slot to set *p_new_slot to (what is now "new_slot" argument). >> You're getting mtstate here anyway, which contains all the information you >> need here. It seems better to make that (selecting which map and which >> slot) part of the function's implementation if we're having this function >> at all, imho. Maybe I'm missing some details there, but my point still >> remains that we should try to put more logic in that function instead of >> it just doing the mechanical tuple conversion. > I tried to see how the interface would look if we do that way. Here is > how the code looks: > > static TupleTableSlot * > ConvertPartitionTupleSlot(ModifyTableState *mtstate, > bool for_update_tuple_routing, > int map_index, > HeapTuple *tuple, > TupleTableSlot *slot) > { > TupleConversionMap *map; > TupleTableSlot *new_slot; > > if (for_update_tuple_routing) > { > map = mtstate->mt_persubplan_childparent_maps[map_index]; > new_slot = mtstate->mt_rootpartition_tuple_slot; > } > else > { > map = mtstate->mt_perleaf_parentchild_maps[map_index]; > new_slot = mtstate->mt_partition_tuple_slot; > } > > if (!map) > return slot; > > *tuple = do_convert_tuple(*tuple, map); > > /* > * Change the partition tuple slot descriptor, as per converted tuple. 
> */ > ExecSetSlotDescriptor(new_slot, map->outdesc); > ExecStoreTuple(*tuple, new_slot, InvalidBuffer, true); > > return new_slot; > } > > It looks like the interface does not simplify much, and on top of that, we > have more lines in that function. Also, the caller anyway > has to be aware of whether the map_index is the index into the leaf > partitions or the
Re: [HACKERS] Adding column_constraint description in ALTER TABLE synopsis
On 2017/10/31 21:31, Stephen Frost wrote: > * Lætitia Avrot (laetitia.av...@gmail.com) wrote: >> As Amit Langote pointed out, the column_constraint definition is missing >> whereas it is used in the ALTER TABLE synopsis. It can be easily found in the >> CREATE TABLE synopsis, but it's not very user friendly. > > Thanks, this looks pretty reasonable, but did you happen to look for any > other keywords in the ALTER TABLE that should really be in ALTER TABLE > also? > > I'm specifically looking at, at least, partition_bound_spec. Maybe you > could propose an updated patch which addresses that also, and any other > cases you find? Ah, yes. I remember having left out partition_bound_spec simply because I thought it was kind of how it was supposed to be done, seeing that neither column_constraint nor table_constraint were expanded in the ALTER TABLE synopsis. It seems that there are indeed a couple of other things that need to be brought over to the ALTER TABLE synopsis, including partition_bound_spec. 9f295c08f877 [1] added table_constraint, but missed adding the description of index_parameters and exclude_element which are referenced therein. Attached find an updated version of Lætitia's patch. 
Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9f295c diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 41acda003f..e059f87875 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -85,6 +85,20 @@ ALTER TABLE [ IF EXISTS ] name OWNER TO { new_owner | CURRENT_USER | SESSION_USER } REPLICA IDENTITY { DEFAULT | USING INDEX index_name | FULL | NOTHING } +and column_constraint is: + +[ CONSTRAINT constraint_name ] +{ NOT NULL | + NULL | + CHECK ( expression ) [ NO INHERIT ] | + DEFAULT default_expr | + GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ] | + UNIQUE index_parameters | + PRIMARY KEY index_parameters | + REFERENCES reftable [ ( refcolumn ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ] +[ ON DELETE action ] [ ON UPDATE action ] } +[ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ] + and table_constraint is: [ CONSTRAINT constraint_name ] @@ -96,6 +110,15 @@ ALTER TABLE [ IF EXISTS ] name [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ] [ ON DELETE action ] [ ON UPDATE action ] } [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ] +index_parameters in UNIQUE, PRIMARY KEY, and EXCLUDE constraints are: + +[ WITH ( storage_parameter [= value] [, ... ] ) ] +[ USING INDEX TABLESPACE tablespace_name ] + +exclude_element in an EXCLUDE constraint is: + +{ column_name | ( expression ) } [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] + and table_constraint_using_index is: [ CONSTRAINT constraint_name ] @@ -104,6 +127,13 @@ ALTER TABLE [ IF EXISTS ] name +and partition_bound_spec is: + +IN ( { numeric_literal | string_literal | NULL } [, ...] ) | +FROM ( { numeric_literal | string_literal | MINVALUE | MAXVALUE } [, ...] ) + TO ( { numeric_literal | string_literal | MINVALUE | MAXVALUE } [, ...] 
) + + Description
Re: [HACKERS] path toward faster partition pruning
Thanks for the test case. On 2017/10/30 17:09, Rajkumar Raghuwanshi wrote: > I am getting wrong output when default is sub-partitioned further, below is > a test case. > > CREATE TABLE lpd(a int, b varchar, c float) PARTITION BY LIST (a); > CREATE TABLE lpd_p1 PARTITION OF lpd FOR VALUES IN (1,2,3); > CREATE TABLE lpd_p2 PARTITION OF lpd FOR VALUES IN (4,5); > CREATE TABLE lpd_d PARTITION OF lpd DEFAULT PARTITION BY LIST(a); > CREATE TABLE lpd_d1 PARTITION OF lpd_d FOR VALUES IN (7,8,9); > CREATE TABLE lpd_d2 PARTITION OF lpd_d FOR VALUES IN (10,11,12); > CREATE TABLE lpd_d3 PARTITION OF lpd_d FOR VALUES IN (6,null); > INSERT INTO lpd SELECT i,i,i FROM generate_Series (1,12)i; > INSERT INTO lpd VALUES (null,null,null); > > --on HEAD > postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE > a IS NOT NULL ORDER BY 1; > QUERY PLAN > - > Sort >Sort Key: ((lpd_p1.tableoid)::regclass) >-> Result > -> Append >-> Seq Scan on lpd_p1 > Filter: (a IS NOT NULL) >-> Seq Scan on lpd_p2 > Filter: (a IS NOT NULL) >-> Seq Scan on lpd_d3 > Filter: (a IS NOT NULL) >-> Seq Scan on lpd_d1 > Filter: (a IS NOT NULL) >-> Seq Scan on lpd_d2 > Filter: (a IS NOT NULL) > (14 rows) > > postgres=# > postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT NULL ORDER > BY 1; > tableoid | a | b | c > --+++ > lpd_p1 | 1 | 1 | 1 > lpd_p1 | 2 | 2 | 2 > lpd_p1 | 3 | 3 | 3 > lpd_p2 | 4 | 4 | 4 > lpd_p2 | 5 | 5 | 5 > lpd_d1 | 7 | 7 | 7 > lpd_d1 | 8 | 8 | 8 > lpd_d1 | 9 | 9 | 9 > lpd_d2 | 12 | 12 | 12 > lpd_d2 | 10 | 10 | 10 > lpd_d2 | 11 | 11 | 11 > lpd_d3 | 6 | 6 | 6 > (12 rows) > > > --on HEAD + v8 patches > > postgres=# EXPLAIN (COSTS OFF) SELECT tableoid::regclass, * FROM lpd WHERE > a IS NOT NULL ORDER BY 1; > QUERY PLAN > - > Sort >Sort Key: ((lpd_p1.tableoid)::regclass) >-> Result > -> Append >-> Seq Scan on lpd_p1 > Filter: (a IS NOT NULL) >-> Seq Scan on lpd_p2 > Filter: (a IS NOT NULL) > (8 rows) > > postgres=# SELECT tableoid::regclass, * FROM lpd WHERE a IS NOT 
NULL ORDER > BY 1; > tableoid | a | b | c > --+---+---+--- > lpd_p1 | 1 | 1 | 1 > lpd_p1 | 2 | 2 | 2 > lpd_p1 | 3 | 3 | 3 > lpd_p2 | 4 | 4 | 4 > lpd_p2 | 5 | 5 | 5 > (5 rows) I found bugs in 0003 and 0005 that caused this. Will post the patches containing the fix in reply to Dilip's email which contains some code review comments [1]. Also, I noticed that the new pruning code was having a hard time to deal with the fact that the default "range" partition doesn't explicitly say in its partition constraint that it might contain null values. More precisely perhaps, the default range partition's constraint appears to imply that it can only contain non-null values, which confuses the new pruning code. Thanks, Amit [1] https://www.postgresql.org/message-id/CAFiTN-thYsobXxPS6bwOA_9erpax_S=iztsn3rtuxkkmkg4...@mail.gmail.com
Re: [HACKERS] path toward faster partition pruning
On 2017/10/27 13:57, Robert Haas wrote: > On Fri, Oct 27, 2017 at 3:17 AM, Amit Langote > wrote: >>> I don't think we really want to get into theorem-proving here, because >>> it's slow. >> >> Just to be clear, I'm saying we could use theorem-proving (if at all) just >> for the default partition. > > I don't really see why it should be needed there either. We've got > all the bounds in order, so we should know where there are any gaps > that are covered by the default partition in the range we care about. Sorry, I forgot to add: "...just for the default partition, for cases like the one in Beena's example." In normal cases, default partition selection doesn't require any theorem-proving. It proceeds in a straightforward manner, more or less like what you said it should. After thinking more on it, I think there is a rather straightforward trick to implement the idea you mentioned to get this working for the case presented in Beena's example, which works as follows: For any non-root partitioned table, we add the list of its partition constraint clauses to the query-provided list of clauses and use the whole list to drive the partition-pruning algorithm. So, when partition pruning runs for tprt_1, along with (< 1) which the original query provides, we also have (>= 1) which comes from the partition constraint of tprt_1 (which is >= 1 and < 5). Note that there exists a trick in the new code for the (< 5) coming from the constraint to be overridden by the more restrictive (< 1) coming from the original query. Thanks, Amit
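The bound-combination behavior described here can be illustrated with a toy sketch. The types and values below are hypothetical, not the patch's actual data structures: when pruning within a subpartitioned table like tprt_1, the query-supplied bounds are merged with the bounds implied by the table's own partition constraint, and on each side the more restrictive limit wins.

```c
#include <limits.h>

/*
 * Hypothetical sketch of merging a query-supplied range with a
 * partition-constraint range on the same key: the larger lower bound
 * and the smaller upper bound survive, so a looser constraint-derived
 * bound is overridden by a tighter query-derived one, and vice versa.
 */
typedef struct KeyBounds
{
    int lower;                  /* inclusive; INT_MIN = unbounded below */
    int upper;                  /* exclusive; INT_MAX = unbounded above */
} KeyBounds;

static KeyBounds
combine_bounds(KeyBounds query, KeyBounds constraint)
{
    KeyBounds r;

    r.lower = (query.lower > constraint.lower) ? query.lower : constraint.lower;
    r.upper = (query.upper < constraint.upper) ? query.upper : constraint.upper;
    return r;
}
```

With the query contributing only an upper bound and the partition constraint contributing both, the merged range picks up the constraint's lower bound and keeps the query's tighter upper bound, which is the effect described for tprt_1 above.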
Re: [HACKERS] path toward faster partition pruning
On 2017/10/26 20:34, Robert Haas wrote: > On Thu, Oct 26, 2017 at 1:17 PM, Amit Langote > wrote: >> It can perhaps be taught to not make that conclusion by taking into account >> the default partition's partition constraint, which includes the constraint >> inherited from the parent, viz. 1 <= col1 < 50001. To do that, it might >> be possible to summon up predtest.c's powers to conclude from the default >> partition's partition constraint that it cannot contain any keys < 1, but >> then we'll have to frame up a clause expression describing the latter. >> Generating such a clause expression can be a bit daunting for a >> multi-column key. So, I haven't yet tried really hard to implement this. >> Any thoughts on that? > > I don't think we really want to get into theorem-proving here, because > it's slow. Just to be clear, I'm saying we could use theorem-proving (if at all) just for the default partition. > Whatever we're going to do we should be able to do without > that - keeping it in the form of btree-strategy + value. It doesn't > seem that hard. Suppose we're asked to select partitions from tprt > subject to (<, 1). Well, we determine that some of the tprt_1 > partitions may be relevant, so we tell tprt_1 to select partitions > subject to (>=, 1, <, 1). We know to do that because we know that > 1 < 5 and we know to include >= 1 because we haven't got any > lower bound currently at all. What's the problem? Hmm, that's interesting. With the approach that the patch currently takes, (>= 1) wouldn't be passed down when selecting the partitions of tprt_1. The source of values (+ btree strategy) to use to select partitions is the same original set of clauses for all partitioned tables in the tree, as the patch currently implements it. Nothing would get added to that set (>= 1, as in this example), nor subtracted (such as clauses that are trivially true). I will think about this approach in general, and about solving this problem in particular. 
> In some sense it's tempting to say that this case just doesn't matter > very much; after all, subpartitioning on the same column used to > partition at the top level is arguably lame. But if we can get it > right in a relatively straightforward manner then let's do it. Yeah, I tend to agree. Thanks for the input. Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
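The "btree strategy + value" pass-down scheme discussed above can be sketched in a few lines. This is purely illustrative Python, not PostgreSQL code; the function names, the dict-of-operators representation, and the numbers in the usage note are all invented for the sketch, and only `>=`/`<` bounds on a single key are handled:

```python
def merge_bounds(quals, child_lower, child_upper):
    """Tighten the caller's quals with a sub-partitioned child's own
    range bounds before recursing into it (no theorem proving; the
    conditions stay in operator-strategy + value form)."""
    merged = dict(quals)                  # e.g. {'<': 50}
    if child_lower is not None:
        lo = merged.get('>=')
        if lo is None or lo < child_lower:
            merged['>='] = child_lower    # add/raise the lower bound
    if child_upper is not None:
        hi = merged.get('<')
        if hi is None or hi > child_upper:
            merged['<'] = child_upper     # add/lower the upper bound
    return merged

def select_partitions(partitions, quals):
    """Keep range partitions [lower, upper) that can overlap the quals."""
    lo, hi = quals.get('>='), quals.get('<')
    selected = []
    for name, plower, pupper in partitions:
        if hi is not None and plower >= hi:
            continue                      # wholly above the upper bound
        if lo is not None and pupper <= lo:
            continue                      # wholly below the lower bound
        selected.append(name)
    return selected
```

For example, a top-level qual of (<, 50) recursing into a child subtree that covers [1, 100) becomes (>=, 1, <, 50) for that child's own partitions, which is the kind of bound-merging Robert describes.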
Re: [HACKERS] [sqlsmith] Failed assertion in adjust_appendrel_attrs_mutator
On 2017/10/24 0:22, Tom Lane wrote: > Amit Langote writes: >> On 2017/10/23 2:07, Tom Lane wrote: >>> Hmm. adjust_appendrel_attrs() thinks it's only used after conversion >>> of sublinks to subplans, but this is a counterexample. I wonder if >>> that assumption was ever correct? Or maybe we need to rethink what >>> it does when recursing into RTE subqueries. > >> Looking at the backtrace, the crashing SubLink seems to have been >> referenced from a PlaceHolderVar that is in turn referenced by >> joinaliasvars of a JOIN rte in query->rtable. > > Right. The core of the issue is that joinaliasvars lists don't get run > through preprocess_expression, so (among other things) any SubLinks in > them don't get expanded to subplans. Ah, I missed that. > Probably the reason we've not > heard field complaints about this is that in a non-assert build there > would be no observable bad effects --- the scan would simply ignore > the subquery, and whether the joinaliasvars entry has been correctly > mutated doesn't matter at all because it will never be used again. I see. >> I wonder if we shouldn't just ignore those (joinaliasvars in JOIN rtes) >> while adjust_appendrel_attrs() is doing its job on a Query, just like we >> ask to ignore subqueries by passing QTW_IGNORE_RC_SUBQUERIES to >> query_tree_mutator()? > > I don't really like this fix, because ISTM it's fixing one symptom rather > than the root of the problem. That's true. > The root is that joinaliasvars lists > diverge from the representation of expressions elsewhere in the tree, > and not only with respect to SubLinks --- another example is that function > calls with named arguments won't have been rearranged into executable > form. That could easily be a dangerous thing, if we allow arbitrary > expression processing to get done on them. Moreover, your patch is > letting the divergence get even bigger: now the joinaliasvars lists don't > even have the right varnos, making them certainly unusable for anything. Hmm, yes. 
Although I only proposed that patch because I had convinced myself that joinaliasvars lists weren't looked at anymore. > So what I'm thinking is that we would be well advised to actually remove > the untransformed joinaliasvars from the tree as soon as we're done with > preprocessing expressions. We'd drop them at the end of planning anyway > (cf. add_rte_to_flat_rtable) so this is just getting rid of them a bit > sooner, and it won't affect anything after the planner. > > In short, I'm thinking something more like the attached. Yeah, that makes more sense. Thanks, Amit
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 2017/10/24 1:15, Alvaro Herrera wrote: > Robert Haas wrote: >> On Mon, Oct 23, 2017 at 11:12 AM, Alvaro Herrera >> wrote: >>> I started with Maksim's submitted code, and developed according to the >>> ideas discussed in this thread. Attached is a very WIP patch series for >>> this feature. Nice! >>> Many things remain to be done before this is committable: pg_dump >>> support needs to be written. ALTER INDEX ATTACH/DETACH not yet >>> implemented. No REINDEX support yet. Docs not updated (but see the >>> regression test as a guide for how this is supposed to work; see patch >>> 0005). CREATE INDEX CONCURRENTLY not done yet. >>> >>> I'm now working on the ability to build unique indexes (and unique >>> constraints) on top of this. >> >> Cool. Are you planning to do that by (a) only allowing the special >> case where the partition key columns/expressions are included in the >> indexed columns/expressions, (b) trying to make every insert to any >> index check all of the indexes for uniqueness conflicts, or (c) >> implementing global indexes? Because (b) sounds complex - think about >> attach operations, for example - and (c) sounds super-hard. I'd >> suggest doing (a) first, just on the basis of complexity. > > Yes, I think (a) is a valuable thing to have -- not planning on doing > (c) at all because I fear it'll be a huge time sink. I'm not sure about > (b), but it's not currently on my plan. +1 to proceeding with (a) first. >> I hope that you don't get so involved in making this unique index >> stuff work that we don't get the cascading index feature, at least, >> committed to v11. That's already a considerable step forward in terms >> of ease of use, and I'd really like to have it. > > Absolutely -- I do plan to get this one finished regardless of unique > indexes. +1 Thanks, Amit
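The restriction behind option (a) above can be captured in a one-line check. This is a hedged sketch, not the actual implementation; the function name and column-list representation are invented. The idea is that per-partition unique indexes can enforce table-wide uniqueness only when every partition key column is among the indexed columns, since rows with equal indexed values then necessarily route to the same partition:

```python
def unique_index_allowed(partition_key_cols, index_cols):
    """Option (a): allow a unique index on a partitioned table only if
    the partition key columns are a subset of the indexed columns."""
    return set(partition_key_cols) <= set(index_cols)
```

For example, a unique index on (id) over a table partitioned by (region) would be rejected, while one on (region, id) would be allowed.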
Re: [HACKERS] [sqlsmith] Failed assertion in adjust_appendrel_attrs_mutator
On 2017/10/23 2:07, Tom Lane wrote: > Andreas Seltenreich writes: >> testing master as of 7c981590c2, sqlsmith just triggered the following >> assertion: >> TRAP: FailedAssertion("!(!((((const Node*)(node))->type) == T_SubLink))", >> File: "prepunion.c", Line: 2231) > > Hmm. adjust_appendrel_attrs() thinks it's only used after conversion > of sublinks to subplans, but this is a counterexample. I wonder if > that assumption was ever correct? Or maybe we need to rethink what > it does when recursing into RTE subqueries. Looking at the backtrace, the crashing SubLink seems to have been referenced from a PlaceHolderVar that is in turn referenced by joinaliasvars of a JOIN rte in query->rtable. I wonder if we shouldn't just ignore those (joinaliasvars in JOIN rtes) while adjust_appendrel_attrs() is doing its job on a Query, just like we ask to ignore subqueries by passing QTW_IGNORE_RC_SUBQUERIES to query_tree_mutator()? IOW, it doesn't look like anything critically depends on the Vars referenced from joinaliasvars being translated at that point in inheritance_planner(), but I may be missing something. The attached patch does the same; it didn't break any existing tests, and it fixed the crash reported by the OP. Thoughts? Thanks, Amit

diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 1c84a2cb28..4bdfe73d29 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1964,7 +1964,8 @@ adjust_appendrel_attrs(PlannerInfo *root, Node *node, int nappinfos,
         newnode = query_tree_mutator((Query *) node,
                                      adjust_appendrel_attrs_mutator,
                                      (void *) &context,
-                                     QTW_IGNORE_RC_SUBQUERIES);
+                                     QTW_IGNORE_RC_SUBQUERIES |
+                                     QTW_IGNORE_JOINALIASES);
     for (cnt = 0; cnt < nappinfos; cnt++)
     {
         AppendRelInfo *appinfo = appinfos[cnt];
Re: [HACKERS] alter table doc fix
On 2017/10/18 20:37, Alvaro Herrera wrote: > Amit Langote wrote: >> Hi. >> >> Noticed that a alter table sub-command's name in Description (where it's >> OWNER) differs from that in synopsis (where it's OWNER TO). Attached >> patch to make them match, if the difference is unintentional. > > I agree -- pushed. Thanks for committing. > This paragraph > > > The actions for identity columns (ADD > GENERATED, SET etc., DROP > IDENTITY), as well as the actions > TRIGGER, CLUSTER, > OWNER, > and TABLESPACE never recurse to descendant tables; > that is, they always act as though ONLY were specified. > Adding a constraint recurses only for CHECK constraints > that are not marked NO INHERIT. > > > is a bit annoying, though I think it'd be worse if we "fix" it to be > completely strict about the subcommands it refers to. I hadn't noticed it in this paragraph before you pointed it out, but maybe, as you say, there's not much point in trying to be strict here either. If we do fix it though, we might want to do something about TRIGGER and CLUSTER too, because there are no sub-commands named just TRIGGER or CLUSTER. Thanks, Amit
Re: [HACKERS] Block level parallel vacuum WIP
On 2017/10/22 5:25, Thomas Munro wrote: > On Sun, Oct 22, 2017 at 5:09 AM, Robert Haas wrote: >> On Tue, Sep 19, 2017 at 3:31 AM, Masahiko Sawada >> wrote: Down at the bottom of the build log in the regression diffs file you can see: ! ERROR: cache lookup failed for relation 32893 https://travis-ci.org/postgresql-cfbot/postgresql/builds/277165907 >>> >>> Thank you for letting me know. >>> >>> Hmm, it's an interesting failure. I'll investigate it and post the new >>> patch. >> >> Did you ever find out what the cause of this problem was? > > I wonder if it might have been the same issue that commit > 19de0ab23ccba12567c18640f00b49f01471018d fixed a week or so later. Hmm, 19de0ab23ccba seems to prevent the "cache lookup failed for relation XXX" error in a different code path though (the code path handling manual vacuum). Not sure if the commit could have prevented that error being emitted by DROP SCHEMA ... CASCADE, which seems to be what produced it in this case. Maybe I'm missing something though. Thanks, Amit
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
On 2017/10/18 1:52, Alvaro Herrera wrote: > Alvaro Herrera wrote: >> Robert Haas wrote: >>> Implement table partitioning. >> >> Is it intentional that you can use ALTER TABLE OWNER TO on the parent >> table, and that this does not recurse to modify the partitions' owners? >> This doesn't seem to be mentioned in comments nor documentation, so it >> seems an oversight to me. Hmm, I would say that the new partitioning simply didn't modify the behavior that already existed for inheritance. That said, I'm not sure if the lack of recursive application of ownership change to descendant tables is unintentional. > The alter table docs say that ONLY must be specified if one does not > want to modify descendants, so I think this is a bug. Just to clarify, if we do think of it as a bug, then it will apply to the inheritance case as well, right? Thanks, Amit
[HACKERS] alter table doc fix
Hi. Noticed that an ALTER TABLE sub-command's name in the Description (where it's OWNER) differs from that in the synopsis (where it's OWNER TO). Attached a patch to make them match, in case the difference is unintentional. Thanks, Amit

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 68393d70b4..b4b8dab911 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -713,7 +713,7 @@ ALTER TABLE [ IF EXISTS ] name
-OWNER
+OWNER TO
 This form changes the owner of the table, sequence, view, materialized view,
Re: [HACKERS] relkind check in DefineIndex
On 2017/10/14 4:32, Robert Haas wrote: > On Fri, Oct 13, 2017 at 12:38 PM, Alvaro Herrera > wrote: >> The relkind check in DefineIndex has grown into an ugly rats nest of >> 'if' statements. I propose to change it into a switch, as per the >> attached. > > wfm +1 Thanks, Amit
Re: [HACKERS] UPDATE of partition key
Hi Amit. On 2017/10/04 22:51, Amit Khandekar wrote: > Main patch : > update-partition-key_v20.patch Guess you're already working on it but the patch needs a rebase. A couple of hunks in the patch to execMain.c and nodeModifyTable.c fail. Meanwhile a few comments: +void +pull_child_partition_columns(Bitmapset **bitmapset, +Relation rel, +Relation parent) Nitpick: don't we normally list the output argument(s) at the end? Also, "bitmapset" could be renamed to something that conveys what it contains? + if (partattno != 0) + child_keycols = + bms_add_member(child_keycols, + partattno - FirstLowInvalidHeapAttributeNumber); + } + foreach(lc, partexprs) + { Elsewhere (in quite a few places), we don't iterate over partexprs separately like this, although I'm not saying it is bad, just different from other places. + * the transition tuplestores can be built. Furthermore, if the transition + * capture is happening for UPDATEd rows being moved to another partition due + * partition-key change, then this function is called once when the row is + * deleted (to capture OLD row), and once when the row is inserted to another + * partition (to capture NEW row). This is done separately because DELETE and + * INSERT happen on different tables. Extra space at the beginning from the 2nd line onwards. + (event == TRIGGER_EVENT_UPDATE && ((oldtup == NULL) ^ (newtup == NULL Is there some reason why a bitwise operator is used here? + * 'update_rri' has the UPDATE per-subplan result rels. Could you explain why they are being received as input here? + * 'perleaf_parentchild_maps' receives an array of TupleConversionMap objects + * with on entry for every leaf partition (required to convert input tuple + * based on the root table's rowtype to a leaf partition's rowtype after + * tuple routing is done) Could this be named leaf_tupconv_maps, maybe? It perhaps makes clear that they are maps needed for "tuple conversion". And the other field holding the reverse map as leaf_rev_tupconv_maps. 
Either that or use underscores to separate words, but then it gets too long I guess. + tuple = ConvertPartitionTupleSlot(mtstate, + mtstate->mt_perleaf_parentchild_maps[leaf_part_index], + The 2nd line here seems to have gone over 80 characters. ISTM, ConvertPartitionTupleSlot has a very specific job and a somewhat complex interface. I guess it could simply have the following interface: static HeapTuple ConvertPartitionTuple(ModifyTableState *mtstate, HeapTuple tuple, bool is_update); And figure out, based on the value of is_update, which map to use and which slot to set *p_new_slot to (what is now the "new_slot" argument). You're getting mtstate here anyway, which contains all the information you need here. It seems better to make that (selecting which map and which slot) part of the function's implementation if we're having this function at all, imho. Maybe I'm missing some details there, but my point still remains that we should try to put more logic in that function instead of having it just do the mechanical tuple conversion. + * We have already checked partition constraints above, so skip them + * below. How about: ", so skip checking here."? ISTM, the newly introduced logic in ExecSetupTransitionCaptureState() to try to reuse the per-subplan child-to-parent map as the per-leaf child-to-parent map could be simplified a bit. I mean the following code: +/* + * But for Updates, we can share the per-subplan maps with the per-leaf + * maps. 
+ */
+update_rri_index = 0;
+update_rri = mtstate->resultRelInfo;
+if (mtstate->mt_nplans > 0)
+    cur_reloid = RelationGetRelid(update_rri[0].ri_RelationDesc);
-/* Choose the right set of partitions */
-if (mtstate->mt_partition_dispatch_info != NULL)
+for (i = 0; i < numResultRelInfos; ++i)
+{

How about (pseudo-code):

j = 0;
for (i = 0; i < n_leaf_parts; i++)
{
    if (j < n_subplans && leaf_rri[i]->oid == subplan_rri[j]->oid)
    {
        leaf_childparent_map[i] = subplan_childparent_map[j];
        j++;
    }
    else
    {
        leaf_childparent_map[i] = new map
    }
}

I think the above would also be useful in ExecSetupPartitionTupleRouting() where you've added similar code to try to reuse per-subplan ResultRelInfos.

In ExecInitModifyTable(), can we try to minimize the number of places where update_tuple_routing_needed is being set. Currently, it's being set in 3 places:

+bool update_tuple_routing_needed = node->part_cols_updated;

&

+/*
+ * If this is an UPDATE and a BEFORE UPDATE trigger is present, we may
+ * need to do update tuple routing.
+ */
+if (resultRelInfo->ri_TrigDesc &&
+    resultRelInfo->ri_TrigDesc->trig_update_before_row &&
+
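The lockstep merge suggested in the review above can be made concrete with a small runnable sketch. This is illustrative Python, not executor code; the function and variable names are invented. Both lists are assumed to be in the same (partition bound) order, so a single forward pass suffices: whenever a leaf partition's OID matches the next UPDATE subplan result rel's OID, the subplan's existing child-to-parent conversion map is reused, otherwise a new map is built:

```python
def build_leaf_maps(leaf_oids, subplan_oids, subplan_maps, make_new_map):
    """Walk leaf partitions and subplan result rels in lockstep,
    reusing a subplan's map when the relations coincide."""
    leaf_maps = []
    j = 0
    for oid in leaf_oids:
        if j < len(subplan_oids) and oid == subplan_oids[j]:
            leaf_maps.append(subplan_maps[j])    # reuse the subplan's map
            j += 1
        else:
            leaf_maps.append(make_new_map(oid))  # leaf not among subplans
    return leaf_maps
```

The same shape of loop would serve for reusing per-subplan ResultRelInfos during tuple-routing setup, as suggested in the review.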
Re: [HACKERS] v10 bottom-listed
On 2017/10/13 22:58, Magnus Hagander wrote: > On Fri, Oct 6, 2017 at 3:59 AM, Amit Langote > wrote: > >> On 2017/10/05 22:28, Erik Rijkers wrote: >>> In the 'ftp' listing, v10 appears at the bottom: >>> https://www.postgresql.org/ftp/source/ >>> >>> With all the other v10* directories at the top, we could get a lot of >>> people installing wrong binaries... >>> >>> Maybe it can be fixed so that it appears at the top. >> >> Still see it at the bottom. Maybe ping pgsql-www? >> > > Turns out there was already quite an ugly hack to deal with this, so I just > made it a bit more ugly. Once the caches expire I believe sorting should be > correct. Saw that it's now listed all the way up, thanks! :) Regards, Amit
Re: [HACKERS] UPDATE of partition key
On 2017/10/13 6:18, Robert Haas wrote: > Is anybody still reviewing the main patch here? (It would be good if > the answer is "yes".) I am going to try to look at the latest version over the weekend and early next week. Thanks, Amit
Re: [HACKERS] Optimise default partition scanning while adding new partition
On 2017/10/13 4:18, Robert Haas wrote: > On Thu, Oct 5, 2017 at 9:29 PM, Amit Langote > wrote: >> Attached a patch to modify the INFO messages in check_default_allows_bound. > > Committed. However, I didn't see a reason to adopt the comment change > you proposed, so I left that part out. Okay, thanks. Regards, Amit
Re: [HACKERS] [POC] hash partitioning
On 2017/09/30 1:53, Robert Haas wrote: > On Thu, Sep 28, 2017 at 1:54 AM, Amit Langote > wrote: >> I looked into how satisfies_hash_partition() works and came up with an >> idea that I think will make constraint exclusion work. What if we emitted >> the hash partition constraint in the following form instead: >> >> hash_partition_mod(hash_partition_hash(key1-exthash, key2-exthash), >>) = >> >> With that form, constraint exclusion seems to work as illustrated below: >> >> \d+ p0 >> <...> >> Partition constraint: >> (hash_partition_modulus(hash_partition_hash(hashint4extended(a, >> '8816678312871386367'::bigint)), 4) = 0) >> >> -- note only p0 is scanned >> explain select * from p where >> hash_partition_modulus(hash_partition_hash(hashint4extended(a, >> '8816678312871386367'::bigint)), 4) = 0; > > What we actually want constraint exclusion to cover is SELECT * FROM p > WHERE a = 525600; I agree. > As Amul says, nobody's going to enter a query in the form you have it > here. Life is too short to take time to put queries into bizarre > forms. Here too. I was falsely thinking that satisfies_hash_partition() is intended to be used for more than just enforcing the partition constraint when data is directly inserted into a hash partition, or more precisely to be used in the CHECK constraint of the table that is to be attached as a hash partition. Now, we ask users to add such a constraint to avoid the constraint validation scan, because the system knows how to infer from the constraint that the partition constraint is satisfied. I observed however that, unlike range and list partitioning, the hash partition's constraint could only ever be implied because of structural equality (equal()'ness) of the existing constraint expression and the partition constraint expression. For example, a more restrictive range or list qual implies the partition constraint, but it requires performing btree operator based proof. 
The proof is impossible with the chosen structure of the hash partitioning constraint, but it seems that that's OK. That is, it's OK to ask users to add the exact constraint (matching modulus and remainder values in the call to satisfies_hash_partition() specified in the CHECK constraint) to avoid the validation scan. Thanks, Amit
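The distinction drawn above can be shown with a toy model. This is a hedged sketch with invented names, not the real satisfies_hash_partition(): a hash partition's implicit constraint has the shape "hash(key) % modulus == remainder", and one such constraint can imply another only by being structurally equal (same modulus and remainder), whereas range/list constraints admit btree-operator-based "more restrictive qual implies it" proofs:

```python
def satisfies_hash_partition(modulus, remainder, hashed_key):
    """Toy version of the hash partition constraint check: the row
    belongs to the partition iff its key's hash lands on remainder."""
    return hashed_key % modulus == remainder

def hash_constraint_implies(existing, required):
    """existing/required are (modulus, remainder) pairs taken from
    CHECK constraints; implication here is structural equality only."""
    return existing == required
```

So a CHECK constraint with (modulus 4, remainder 0) lets ATTACH skip the validation scan for the matching partition, but no "tighter" hash constraint can stand in for it the way a tighter range qual could for a range partition.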
Re: [HACKERS] v10 bottom-listed
On 2017/10/05 22:28, Erik Rijkers wrote: > In the 'ftp' listing, v10 appears at the bottom: > https://www.postgresql.org/ftp/source/ > > With all the other v10* directories at the top, we could get a lot of > people installing wrong binaries... > > Maybe it can be fixed so that it appears at the top. Still see it at the bottom. Maybe ping pgsql-www? Thanks, Amit
Re: [HACKERS] Optimise default partition scanning while adding new partition
On 2017/10/06 2:25, Robert Haas wrote: > On Tue, Sep 26, 2017 at 4:27 AM, Amit Langote wrote: >> I guess we don't need to squash, as they could be seen as implementing >> different features. Reordering the patches helps though. So, apply them >> in this order: >> >> 1. My patch to teach ValidatePartitionConstraints() to skip scanning >>a partition's own partitions, which optimizes ATTACH PARTITION >>command's partition constraint validation scan (this also covers the >>case of scanning the default partition to validate its updated >>constraint when attaching a new partition) >> >> 2. Jeevan's patch to teach check_default_allows_bound() to skip scanning >>the default partition's own partitions, which covers the case of >>scanning the default partition to validate its updated constraint when >>adding a new partition using CREATE TABLE >> >> Attached 0001 and 0002 are ordered that way. > > OK, I pushed all of this, spread out over 3 commits. I reworked the > test cases not to be entangled with the existing test cases, and also > to test both of these highly-related features together. Hopefully you > like the result. Thanks for committing. I noticed that 6476b26115f3 doesn't use the same INFO message as 14f67a8ee28. You can notice in the following example that the message emitted (that the default partition scan is skipped) is different when a new partition is added with CREATE TABLE and with ATTACH PARTITION, respectively. 
create table foo (a int, b char) partition by list (a);

-- default partition
create table food partition of foo default partition by list (b);

-- default partition's partition with the check constraint
create table fooda partition of food (check (not (a is not null and a = 1 and a = 2))) for values in ('a');

create table foo1 partition of foo for values in (1);
INFO:  partition constraint for table "fooda" is implied by existing constraints
CREATE TABLE

create table foo2 (like foo);
alter table foo attach partition foo2 for values in (2);
INFO:  updated partition constraint for default partition "fooda" is implied by existing constraints
ALTER TABLE

That's because it appears that you applied Jeevan's original patch. The revised version of his patch that I had last posted contained the new message. Actually, the revised patch had not only addressed the case where the default partition's child's scan is skipped, but also the case where the scan of the default partition itself is skipped. As things stand now:

alter table food add constraint skip_check check (not (a is not null and (a = any (array[1, 2]))));

create table foo1 partition of foo for values in (1);
INFO:  partition constraint for table "food" is implied by existing constraints
CREATE TABLE

create table foo2 (like foo);
CREATE TABLE

alter table foo attach partition foo2 for values in (2);
INFO:  updated partition constraint for default partition "food" is implied by existing constraints
ALTER TABLE

Attached a patch to modify the INFO messages in check_default_allows_bound. 
Thanks, Amit

From 1bd3b03bcd1f8af6cec3a22c40e96c9cde46e705 Mon Sep 17 00:00:00 2001
From: amit
Date: Fri, 6 Oct 2017 10:23:24 +0900
Subject: [PATCH] Like c31e9d4bafd, but for check_default_allows_bound

---
 src/backend/catalog/partition.c           | 10 +++---
 src/test/regress/expected/alter_table.out |  4 ++--
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 3a8306a055..1459fba778 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -920,7 +920,7 @@ check_default_allows_bound(Relation parent, Relation default_rel,
 	if (PartConstraintImpliedByRelConstraint(default_rel, def_part_constraints))
 	{
 		ereport(INFO,
-				(errmsg("partition constraint for table \"%s\" is implied by existing constraints",
+				(errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints",
 						RelationGetRelationName(default_rel))));
 		return;
 	}
@@ -956,16 +956,12 @@ check_default_allows_bound(Relation parent, Relation default_rel,
 	{
 		part_rel = heap_open(part_relid, NoLock);

-		/*
-		 * If the partition constraints on default partition child imply
-		 * that it will not contain any row that would belong to the new
-		 * partition, we can avoid scanning the child table.
-		 */
+
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/10/04 4:27, Robert Haas wrote: > On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat > wrote: >>> Regarding nomenclature and my previous griping about wisdom, I was >>> wondering about just calling this a "partition join" like you have in >>> the regression test. So the GUC would be enable_partition_join, you'd >>> have generate_partition_join_paths(), etc. Basically just delete >>> "wise" throughout. >> >> Partition-wise join is standard term used in literature and in >> documentation of other popular DBMSes, so partition_wise makes more >> sense. But I am fine with partition_join as well. Do you want it >> partition_join or partitionjoin like enable_mergejoin/enable_hashjoin >> etc.? > > Well, you're making me have second thoughts. It's really just that > partition_wise looks a little awkward to me, and maybe that's not > enough reason to change anything. I suppose if I commit it this way > and somebody really hates it, it can always be changed later. We're > not getting a lot of input from anyone else at the moment. FWIW, the name enable_partition_join seems enough to convey the core feature, that is, I see "_wise" as redundant, even though I'm now quite used to seeing "_wise" in the emails here and saying it out loud every now and then. Ashutosh may have a point though that users coming from other databases might miss the "_wise". :) Thanks, Amit
Re: [HACKERS] path toward faster partition pruning
Hi David. Thanks a lot for your review comments and sorry it took me a while to reply. On 2017/09/28 18:16, David Rowley wrote: > On 27 September 2017 at 14:22, Amit Langote > wrote: >> - 0001 includes refactoring that Dilip proposed upthread [1] (added him as >> an author). I slightly tweaked his patch -- renamed the function >> get_matching_clause to match_clauses_to_partkey, similar to >> match_clauses_to_index. > > Hi Amit, > > I've made a pass over the 0001 patch while trying to get myself up to > speed with the various developments that are going on in partitioning > right now. > > These are just my thoughts from reading over the patch. I understand > that there's quite a bit up in the air right now about how all this is > going to work, but I have some thoughts about that too, which I > wouldn't mind some feedback on as my line of thought may be off. > > Anyway, I will start with some small typos that I noticed while reading: > > partition.c: > > + * telling what kind of NullTest has been applies to the corresponding > > "applies" should be "applied" Will fix. > plancat.c: > > * might've occurred to satisfy the query. Rest of the fields are set in > > "Rest of the" should be "The remaining" Will fix. > Any onto more serious stuff: > > allpath.c: > > get_rel_partitions() > > I wonder if this function does not belong in partition.c. I'd have > expected a function to exist per partition type, RANGE and LIST, I'd > imagine should have their own function in partition.c to eliminate > partitions > which cannot possibly match, and return the remainder. I see people > speaking of HASH partitioning, but we might one day also want > something like RANDOM or ROUNDROBIN (I'm not really kidding, imagine > routing records to be processed into foreign tables where you never > need to query them again). It would be good if we could easily expand > this list and not have to touch files all over the optimizer to do > that. 
> Of course, there would be other work to do in the executor to > implement any new partitioning method too. I think there will have to be at least some code in the optimizer. That is, the code to match the query's clauses to the table's partition keys and to use the constant values therein to look up the table's partitions. More importantly, once the partitions (viz. their offsets in the table's partition descriptor) have been looked up by partition.c, we must be able to then map them to their planner data structures viz. their AppendRelInfo, and subsequently RelOptInfo. This last part will have to be in the optimizer. In fact, that was the role of get_rel_partitions in the initial versions of the patch, when neither the code for matching keys nor the code for pruning using constants was implemented. One may argue that the first part, that is, matching clauses to the partition key, and the subsequent lookup of partitions using constants could both be implemented in partition.c, but it seems better to me to keep at least the part of matching clauses to the partition keys within the planner (just like matching clauses to indexes is done by the planner). Looking up partitions using constants cannot be done outside partition.c anyway, so that's something we have to implement there. > I know there's mention of it somewhere about get_rel_partitions() not > being so smart about WHERE partkey > 20 AND partkey > 10, the code > does not understand what's more restrictive. I think you could > probably follow the same rules here that are done in > eval_const_expressions(). Over there I see that evaluate_function() > will call anything that's not marked as volatile. Hmm, unless I've missed it, I don't see in evaluate_function() anything about determining which clause is more restrictive. 
AFAIK, such determination depends on the btree operator class semantics (at least in the most common cases where, say, ">" means greater than in a sense that btree code uses it), so I was planning to handle it the way btree code handles it in _bt_preprocess_keys(). In fact, the existing code in predtest.c, which makes decisions of the similar vein also relies on btree semantics. It's OK to do so, because partitioning methods that exist today and for which we'd like to have such smarts use btree semantics to partition the data. Also, we won't need to optimize such cases for hash partitioning anyway. > I'd imagine, for > each partition key, you'd want to store a Datum with the minimum and > maximum possible value based on the quals processed. If either the > minimum or maximum is still set to NULL, then it's unbounded at that > end. If you encou
Re: [HACKERS] Commitfest 201709 is now closed
On 2017/10/03 7:16, Michael Paquier wrote: > On Tue, Oct 3, 2017 at 1:23 AM, Alexander Korotkov > wrote: >> On Mon, Oct 2, 2017 at 6:57 PM, Tom Lane wrote: >>> >>> Daniel Gustafsson writes: Thanks to everyone who participated, and to everyone who have responded to my nagging via the CF app email function. This is clearly an awesome community. >>> >>> And thanks to you for your hard work as CFM! That's tedious and >>> largely thankless work, but it's needed to keep things moving. >> >> +1, >> Thank you very much, Daniel! It was a pleasure working with you at >> commitfest. > > Thanks for doing a bunch of work, Daniel. This is a difficult task. +1. Thank you, Daniel! Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] path toward faster partition pruning
On 2017/09/30 1:28, Robert Haas wrote: > On Thu, Sep 28, 2017 at 5:16 AM, David Rowley > wrote: >> I'd imagine, for >> each partition key, you'd want to store a Datum with the minimum and >> maximum possible value based on the quals processed. If either the >> minimum or maximum is still set to NULL, then it's unbounded at that >> end. If you encounter partkey = Const, then minimum and maximum can be >> set to the value of that Const. From there you can likely ignore any >> other quals for that partition key, as if the query did contain >> another qual with partkey = SomeOtherConst, then that would have >> become a gating qual during the constant folding process. This way if >> the user had written WHERE partkey >= 1 AND partkey <= 1 the >> evaluation would end up the same as if they'd written WHERE partkey = >> 1 as the minimum and maximum would be the same value in both cases, >> and when those two values are the same then it would mean just one >> theoretical binary search on a partition range to find the correct >> partition instead of two. > > I have not looked at the code submitted here in detail yet but I do > think we should try to avoid wasting cycles in the > presumably-quite-common case where equality is being tested. The > whole idea of thinking of this as minimum/maximum seems like it might > be off precisely for that reason. I agree. Equality checks are going to be common enough to warrant handling them specially, instead of implementing equality pruning on top of a min/max framework. Thanks, Amit
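To make the trade-off concrete, here is a minimal standalone C sketch of the min/max folding David describes, with the equality fast path Robert asks for. All types and names here are hypothetical illustrations, not PostgreSQL code; quals on a single integer partition key are folded into a [min, max] interval, and an equality qual short-circuits the scan.

```c
#include <limits.h>
#include <stdbool.h>

/* Btree-style comparison strategies for quals on one integer key. */
typedef enum { QUAL_LT, QUAL_LE, QUAL_EQ, QUAL_GE, QUAL_GT } QualStrategy;

typedef struct
{
    long min;    /* lower bound (inclusive); LONG_MIN if unbounded */
    long max;    /* upper bound (inclusive); LONG_MAX if unbounded */
    bool is_eq;  /* true if the interval collapsed to one value */
} KeyBounds;

/* Fold n quals (strategy/constant pairs) into a single interval. */
static KeyBounds fold_quals(const QualStrategy *strat, const long *val, int n)
{
    KeyBounds b = {LONG_MIN, LONG_MAX, false};

    for (int i = 0; i < n; i++)
    {
        switch (strat[i])
        {
            case QUAL_EQ:
                /* equality wins outright: one value, one bound lookup */
                b.min = b.max = val[i];
                b.is_eq = true;
                return b;
            case QUAL_GT:
                if (val[i] + 1 > b.min) b.min = val[i] + 1;
                break;
            case QUAL_GE:
                if (val[i] > b.min) b.min = val[i];
                break;
            case QUAL_LT:
                if (val[i] - 1 < b.max) b.max = val[i] - 1;
                break;
            case QUAL_LE:
                if (val[i] < b.max) b.max = val[i];
                break;
        }
    }
    b.is_eq = (b.min == b.max);
    return b;
}
```

With this shape, WHERE partkey >= 1 AND partkey <= 1 and WHERE partkey = 1 both end with min == max, so a single bound lookup suffices; and of redundant quals such as partkey > 10 AND partkey > 20, only the tighter bound survives.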
Re: [HACKERS] Partitions: \d vs \d+
On 2017/09/28 22:19, Maksim Milyutin wrote: > I also noticed ambiguity in printing "No partition constraint" in > non-verbose mode and "Partition constraint:..." in verbose one for > partition tables regardless of the type of partition. > Attached small patch removes any output about partition constraint in > non-verbose mode. Patch looks good. So, we should be looking at partconstraintdef only when verbose is true, because that's only when we set it to a valid value. Now, if partconstraintdef is NULL even after verbose is true, that means backend returned that there exists no constraint for that partition, which I thought would be true for a default partition (because the commit that introduced default partitions also introduced "No partition constraint"), but it's really not. For example, \d and \d+ show contradictory outputs for a default partition. create table p (a int) partition by list (a); create table p1 partition of p for values in (1); create table pd partition of p default; \d pd Table "public.pd" Column | Type | Collation | Nullable | Default +-+---+--+- a | integer | | | Partition of: p DEFAULT No partition constraint \d+ pd Table "public.pd" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description +-+---+--+-+-+--+- a | integer | | | | plain | | Partition of: p DEFAULT Partition constraint: (NOT ((a IS NOT NULL) AND (a = ANY (ARRAY[1] Perhaps, there is no case when "No partition constraint" should be output, but I may be missing something. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partitions: \d vs \d+
On 2017/09/28 22:29, Jesper Pedersen wrote: > On 09/28/2017 09:19 AM, Maksim Milyutin wrote: >>> E.g. "No partition constraint" vs. "Partition constraint: >>> satisfies_hash_partition(...)". >> >> I also noticed ambiguity in printing "No partition constraint" in >> non-verbose mode and "Partition constraint:..." in verbose one for >> partition tables regardless of the type of partition. >> Attached small patch removes any output about partition constraint in >> non-verbose mode. >> > > Yeah, that could be one way. > > It should likely be backported to REL_10_STABLE, so the question is if we > are too late in the release cycle to change that output. I think the default partition commit [1] introduced some change around that code, so the behavior is new in 11dev and I think it needs a fix like the one that Maksim proposed. When I check with REL_10_STABLE tip, I find things to be normal: create table p (a int) partition by list (a); create table p1 partition of p for values in (1); \d p1 Table "public.p1" Column | Type | Collation | Nullable | Default +-+---+--+- a | integer | | | Partition of: p FOR VALUES IN (1) \d+ p1 Table "public.p1" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description +-+---+--+-+-+--+- a | integer | | | | plain | | Partition of: p FOR VALUES IN (1) Partition constraint: ((a IS NOT NULL) AND (a = ANY (ARRAY[1]))) Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/bin/psql/describe.c;h=d22ec68431e231d9c781c2256a6030d66e0fd09d;hp=6fb9bdd063583fb8b60ad282aeb5256df67942e4;hb=6f6b99d1335be8ea1b74581fc489a97b109dd08a;hpb=2cf15ec8b1cb29bea149559700566a21a790b6d3 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] path toward faster partition pruning
On 2017/09/28 13:58, Dilip Kumar wrote: > On Wed, Sep 27, 2017 at 6:52 AM, Amit Langote > wrote: > > I was looking into the latest patch set, seems like we can reuse some > more code between this path and runtime pruning[1] > > + foreach(lc1, matchedclauses[i]) > + { > + Expr *clause = lfirst(lc1); > + Const *rightop = (Const *) get_rightop(clause); > + Oid opno = ((OpExpr *) clause)->opno, > + opfamily = rel->part_scheme->partopfamily[i]; > + StrategyNumber strategy; > + > + strategy = get_op_opfamily_strategy(opno, opfamily); > + switch (strategy) > + { > + case BTLessStrategyNumber: > + case BTLessEqualStrategyNumber: > + if (need_next_max) > + { > + maxkeys[i] = rightop->constvalue; > + if (!maxkey_set[i]) > + n_maxkeys++; > + maxkey_set[i] = true; > + max_incl = (strategy == BTLessEqualStrategyNumber); > + } > + if (strategy == BTLessStrategyNumber) > + need_next_max = false; > > I think the above logic is common between this patch and the runtime > pruning. I think we can make > a reusable function. Here we are preparing minkey and maxkey of Datum > because we can directly fetch rightop->constvalue whereas for runtime > pruning we are making minkeys and maxkeys of Expr because during > planning time we don't have the values for the Param. I think we can > always make these minkey, maxkey array of Expr and later those can be > processed in whatever way we want it. So this path will fetch the > constval out of Expr and runtime pruning will Eval that expression at > runtime. I think that makes sense. In fact we could even move the minkey/maxkey collection code to match_clauses_to_partkey() itself. No need for a different function and worrying about defining a separate interface for the same. We match clauses exactly because we want to extract the constant or param values out of them. No need to do the two activities independently and in different places. 
> Does this make sense or it will cause one level of extra processing > for this path i.e converting the Expr array to CONST array? Hm, it's not such a big cost to pay I'd think. I will update the planner patch accordingly. Thanks, Amit
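The "one level of extra processing" amounts to a single linear pass over the collected expressions. A standalone C sketch under hypothetical types (not PostgreSQL code): the plan-time path pulls constant values out of the expression array and bails on anything it cannot fold, which is exactly where the runtime-pruning path would instead evaluate the expression once parameter values are known.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for an Expr node. */
typedef enum { EXPR_CONST, EXPR_PARAM } ExprKind;

typedef struct
{
    ExprKind kind;
    long     constvalue;  /* valid only when kind == EXPR_CONST */
} MiniExpr;

/*
 * Returns true and fills values[] if every expression is a constant;
 * returns false on the first non-constant (e.g. a Param), which the
 * plan-time path cannot use but the run-time path would evaluate.
 */
static bool const_exprs_to_values(const MiniExpr *exprs, int n, long *values)
{
    for (int i = 0; i < n; i++)
    {
        if (exprs[i].kind != EXPR_CONST)
            return false;
        values[i] = exprs[i].constvalue;
    }
    return true;
}
```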
Re: [HACKERS] PartitionSchemaData & partcollation (Re: [COMMITTERS] pgsql: Associate partitioning information with each RelOptInfo.)
On 2017/09/28 16:13, Ashutosh Bapat wrote: > On Thu, Sep 28, 2017 at 11:47 AM, Amit Langote wrote: >> On 2017/09/21 12:42, Robert Haas wrote: >>> Associate partitioning information with each RelOptInfo. >>> >>> This is not used for anything yet, but it is necessary infrastructure >>> for partition-wise join and for partition pruning without constraint >>> exclusion. >>> >>> Ashutosh Bapat, reviewed by Amit Langote and with quite a few changes, >>> mostly cosmetic, by me. Additional review and testing of this patch >>> series by Antonin Houska, Amit Khandekar, Rafia Sabih, Rajkumar >>> Raghuwanshi, Thomas Munro, and Dilip Kumar. >> >> I noticed that this commit does not add partcollation field to >> PartitionSchemeData, while it adds parttypcoll. I think it'd be necessary >> to have partcollation too, because partitioning would have used the same, >> not parttypcoll. For, example see the following code in >> partition_rbound_datum_cmp: >> >> cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[i], >> key->partcollation[i], >> rb_datums[i], >> tuple_datums[i])); >> >> So, it would be wrong to use parttypcoll, if we are to use the collation >> to match a clause with the partition bounds when doing partition-pruning. >> Concretely, a clause's inputcollid should match partcollation for the >> corresponding column, not the column's parttypcoll. >> >> Attached is a patch that adds the same. I first thought of including it >> in the partition-pruning patch set [1], but thought we could independently >> fix this. >> > > I think PartitionSchemeData structure will grow as we need more > information about partition key for various things. E.g. partsupfunc > is not part of this structure right now, but it would be required to > compare partition bound datums. Similarly partcollation. Please add > this to partition pruning patchset. May be parttypcoll won't be used > at all and we may want to remove it altogether. Okay, I will post it with the partition-pruning patch set. 
Thanks, Amit
Re: [HACKERS] Use of RangeVar for partitioned tables in autovacuum
Thanks Michael for working on this. On 2017/09/27 11:28, Michael Paquier wrote: > Hi all, > > I have been looking more closely at the problem in $subject, that I > have mentioned a couple of times, like here: > https://www.postgresql.org/message-id/cab7npqqa37oune_rjzpmwc4exqalx9f27-ma_-rsfl_3mj+...@mail.gmail.com > > As of HEAD, the RangeVar defined in calls of vacuum() is used for > error reporting, only in two cases: for autoanalyze and autovacuum > when reporting to users that a lock cannot be taken on a relation. The > thing is that autovacuum has the good idea to call vacuum() on only > one relation at the same time, with always a relation OID and a > RangeVar defined, so the code currently on HEAD is basically fine with > its error reporting, because get_rel_oids() will always work on the > relation OID with its RangeVar, and because code paths with manual > VACUUMs don't use the RangeVar for any reports. Yeah, autovacuum_do_vac_analyze will never pass partitioned table OIDs to vacuum, for example, because they're not RELKIND_RELATION relations. > While v10 is safe, I think that this is wrong in the long-term and > that may be a cause of bugs if people begin to generate reports for > manual VACUUMs, which is plausible in my opinion. The patch attached, > is what one solution would look like if we would like to make things > more robust as long as manual VACUUM commands can only specify one > relation at the same time. I have found that tracking the parent OID > by tweaking a bit get_rel_oids() was the most elegant solution. The patch basically looks good to me, FWIW. > Please > note that this range of problems is something that I think is better > solved with the infrastructure that support for VACUUM with multiple > relations includes (last version here > https://www.postgresql.org/message-id/766556dd-aa3c-42f7-acf4-5dc97f41f...@amazon.com). 
> So I don't think that the patch attached should be integrated, still I > am attaching it to satisfy the curiosity of anybody looking at this > message. Yeah, after looking at the code a bit, it seems that the problem won't really occur for the case we're trying to apply this patch for. > In conclusion, I think that the open item of $subject should be > removed from the list, and we should try to get the multi-VACUUM patch > in to cover any future problems. I'll do so if there are no > objections. +1 to this, taking the previous +1 back [1]. :) > One comment on top of vacuum() is wrong by the way: in the case of a > manual VACUUM or ANALYZE, a NULL RangeVar is used if no relation is > specified in the command. Yep, but it doesn't ever get accessed per our observations. Thanks, Amit [1] https://www.postgresql.org/message-id/8d810dd9-5f64-a5f3-c016-a81f05528fa8%40lab.ntt.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PartitionSchemaData & partcollation (Re: [COMMITTERS] pgsql: Associate partitioning information with each RelOptInfo.)
Sorry, I meant to say PartitionSchem"e"Data in subject. Thanks, Amit
[HACKERS] PartitionSchemaData & partcollation (Re: [COMMITTERS] pgsql: Associate partitioning information with each RelOptInfo.)
On 2017/09/21 12:42, Robert Haas wrote: > Associate partitioning information with each RelOptInfo. > > This is not used for anything yet, but it is necessary infrastructure > for partition-wise join and for partition pruning without constraint > exclusion. > > Ashutosh Bapat, reviewed by Amit Langote and with quite a few changes, > mostly cosmetic, by me. Additional review and testing of this patch > series by Antonin Houska, Amit Khandekar, Rafia Sabih, Rajkumar > Raghuwanshi, Thomas Munro, and Dilip Kumar. I noticed that this commit does not add partcollation field to PartitionSchemeData, while it adds parttypcoll. I think it'd be necessary to have partcollation too, because that is what the partitioning code would have used, not parttypcoll. For example, see the following code in partition_rbound_datum_cmp: cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[i], key->partcollation[i], rb_datums[i], tuple_datums[i])); So, it would be wrong to use parttypcoll, if we are to use the collation to match a clause with the partition bounds when doing partition-pruning. Concretely, a clause's inputcollid should match partcollation for the corresponding column, not the column's parttypcoll. Attached is a patch that adds the same. I first thought of including it in the partition-pruning patch set [1], but thought we could independently fix this. Thoughts? Thanks, Amit [1] https://commitfest.postgresql.org/14/1272/ From 4e25d503a00c5fb42919e73aae037eacb0164af6 Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 27 Sep 2017 15:49:50 +0900 Subject: [PATCH 1/5] Add partcollation field to PartitionSchemeData It copies PartitionKeyData.partcollation. We need that in addition to parttypcoll. 
--- src/backend/optimizer/util/plancat.c | 1 + src/include/nodes/relation.h | 7 ++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index cac46bedf9..1273f28069 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -1897,6 +1897,7 @@ find_partition_scheme(PlannerInfo *root, Relation relation) part_scheme->partopfamily = partkey->partopfamily; part_scheme->partopcintype = partkey->partopcintype; part_scheme->parttypcoll = partkey->parttypcoll; + part_scheme->partcollation = partkey->partcollation; part_scheme->parttyplen = partkey->parttyplen; part_scheme->parttypbyval = partkey->parttypbyval; diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h index 48e6012f7f..2adefd0873 100644 --- a/src/include/nodes/relation.h +++ b/src/include/nodes/relation.h @@ -342,6 +342,10 @@ typedef struct PlannerInfo * partition bounds. Since partition key data types and the opclass declared * input data types are expected to be binary compatible (per ResolveOpClass), * both of those should have same byval and length properties. + * + * Since partitioning might be using a collation for a given partition key + * column that is not same as the collation implied by column's type, store + * the same separately. */ typedef struct PartitionSchemeData { @@ -349,7 +353,8 @@ typedef struct PartitionSchemeData int16 partnatts; /* number of partition attributes */ Oid*partopfamily; /* OIDs of operator families */ Oid*partopcintype; /* OIDs of opclass declared input data types */ - Oid*parttypcoll;/* OIDs of collations of partition keys. */ + Oid*parttypcoll;/* OIDs of partition key type collation. */ + Oid*partcollation; /* OIDs of partitioning collation */ /* Cached information about partition key data types. 
*/ int16 *parttyplen; -- 2.11.0
Re: [HACKERS] [POC] hash partitioning
On 2017/09/27 22:41, Jesper Pedersen wrote: > On 09/27/2017 03:05 AM, amul sul wrote: Attached rebased patch, thanks. >>> While reading through the patch I thought it would be better to keep >>> MODULUS and REMAINDER in caps, if CREATE TABLE was in caps too in order to >>> highlight that these are "keywords" for hash partition. >>> >>> Also updated some of the documentation. >>> >>> >> Thanks a lot for the patch, included in the attached version. >> > > Thank you. > > Based on [1] I have moved the patch to "Ready for Committer". Thanks a lot Amul for working on this. Like Jesper said, the patch looks pretty good overall. I was looking at the latest version with the intent to study certain things about hash partitioning as the patch implements it, during which I noticed some things. + The modulus must be a positive integer, and the remainder must be a + suppose you have a hash-partitioned table with 8 children, each of which + has modulus 8, but find it necessary to increase the number of partitions + to 16. Might it be a good idea to say 8 "partitions" instead of "children" in the first sentence? + each modulus-8 partition until none remain. While this may still involve + a large amount of data movement at each step, it is still better than + having to create a whole new table and move all the data at once. + + I read the paragraph that ends with the above text and started wondering whether the procedure to redistribute data in hash partitions by detaching and attaching with a new modulus/remainder could be illustrated with an example. Maybe in the examples section of the ALTER TABLE page? + Since hash operator class provide only equality, not ordering, collation Either "Since hash operator classes provide" or "Since hash operator class provides" Other than the above points, patch looks good. By the way, I noticed a couple of things about hash partition constraints: 1. 
In get_qual_for_hash(), using get_fn_expr_rettype(&key->partsupfunc[i]), which returns InvalidOid because fn_expr is not set to a non-NULL value, causes the funcrettype of the FuncExpr being generated for hashing partition key columns to be set to InvalidOid, which I think is wrong. That is, the following if condition in get_fn_expr_rettype() is always satisfied: if (!flinfo || !flinfo->fn_expr) return InvalidOid; I think we could use get_func_rettype(key->partsupfunc[i].fn_oid) instead. Attached is a patch, hash-v21-set-funcexpr-funcrettype-correctly.patch, which applies on top of v21 of your patch. 2. It seems that the reason constraint exclusion doesn't work with hash partitions as implemented by the patch is that predtest.c's operator_predicate_proof() returns false even without looking into the hash partition constraint, which is of the following form: satisfies_hash_partition(<modulus>, <remainder>, <hash of key columns>, ...), because the above constraint expression doesn't translate into a binary opclause (an OpExpr), which operator_predicate_proof() knows how to work with. 
So, false is returned at the beginning of that function by the following code: if (!is_opclause(predicate)) return false; For example, create table p (a int) partition by hash (a); create table p0 partition of p for values with (modulus 4, remainder 0); create table p1 partition of p for values with (modulus 4, remainder 1); \d+ p0 <...> Partition constraint: satisfies_hash_partition(4, 0, hashint4extended(a, '8816678312871386367'::bigint)) -- both p0 and p1 scanned explain select * from p where satisfies_hash_partition(4, 0, hashint4extended(a, '8816678312871386367'::bigint)); QUERY PLAN Append (cost=0.00..96.50 rows=1700 width=4) -> Seq Scan on p0 (cost=0.00..48.25 rows=850 width=4) Filter: satisfies_hash_partition(4, 0, hashint4extended(a, '8816678312871386367'::bigint)) -> Seq Scan on p1 (cost=0.00..48.25 rows=850 width=4) Filter: satisfies_hash_partition(4, 0, hashint4extended(a, '8816678312871386367'::bigint)) (5 rows) -- both p0 and p1 scanned explain select * from p where satisfies_hash_partition(4, 1, hashint4extended(a, '8816678312871386367'::bigint)); QUERY PLAN Append (cost=0.00..96.50 rows=1700 width=4) -> Seq Scan on p0 (cost=0.00..48.25 rows=850 width=4) Filter: satisfies_hash_partition(4, 1, hashint4extended(a, '8816678312871386367'::bigint)) -> Seq Scan on p1 (cost=0.00..48.25 rows=850 width=4) Filter: satisfies_hash_partition(4, 1, hashint4extended(a, '8816678312871386367'::bigint)) (5 rows) I looked into how satisfies_hash_partition() works and came up with an idea that I think will make constraint exclusion work. What if we emitted t
Re: [HACKERS] path toward faster partition pruning
Hi Jesper. Firstly, thanks for looking at the patch. On 2017/09/26 22:00, Jesper Pedersen wrote: > Hi Amit, > > On 09/15/2017 04:50 AM, Amit Langote wrote: >> On 2017/09/15 11:16, Amit Langote wrote: >>> I will post rebased patches later today, although I think the overall >>> design of the patch on the planner side of things is not quite there yet. >>> Of course, your and others' feedback is greatly welcome. >> >> Rebased patches attached. Because Dilip complained earlier today about >> clauses of the form (const op var) not causing partition-pruning, I've >> added code to commute the clause where it is required. Some other >> previously mentioned limitations remain -- no handling of OR clauses, no >> elimination of redundant clauses for given partitioning column, etc. >> >> A note about 0001: this patch overlaps with >> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch >> series that Ashutosh Bapat posted yesterday [1]. Because I implemented >> the planner-portion of this patch based on what 0001 builds, I'm posting >> it here. It might actually turn out that we will review and commit >> 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply >> 0001 if you want to play with the later patches. I would certainly like >> to review 0003-Canonical-partition-scheme.patch myself, but won't be able >> to immediately (see below). >> > > Could you share your thoughts on the usage of PartitionAppendInfo's > min_datum_idx / max_datum_idx ? Especially in relation to hash partitions. > > I'm looking at get_partitions_for_keys. Sure. You may have noticed that min_datum_idx and max_datum_idx in PartitionAppendInfo are offsets in the PartitionDescData.boundinfo.datums array, of datums that lie within the query-specified range (that is, using =, >, >=, <, <= btree operators in the query). That array contains bounds of all partitions sorted in the btree operator class defined order, at least for list and range partitioning. 
I haven't (yet) closely looked at the composition of hash partition datums in PartitionBoundInfo, which perhaps have different ordering properties (or maybe none) than list and range partitioning datums. Now, since they are offsets of datums in PartitionBoundInfo, not indexes of partitions themselves, their utility outside partition.c might be questionable. But partition-wise join patch, for example, to determine if two partitioned tables can be joined partition-wise, is going to check if PartitionBoundInfos in RelOptInfos of two partitioned tables are bound-to-bound equal [1]. Partition-pruning may select only a subset of partitions of each of the joining partitioned tables. Equi-join requirement for partition-wise join means that the subset of partitions will be same on both sides of the join. My intent of having min_datum_idx, max_datum_idx, along with contains_null_partition, and contains_default_partition in PartitionAppendInfo is to have sort of a cross check that those values end up being same on both sides of the join after equi-join requirement has been satisfied. That is, get_partitions_for_keys will have chosen the same set of partitions for both partitioned tables and hence will have set the same values for those fields. As mentioned above, that may be enough for list and range partitioning, but since hash partition datums do not appear to have the same properties as list and range datums [2], we may need an additional field(s) to describe the hash partition selected by get_partitions_for_keys. I guess only one field will be enough, that will be the offset in the datums array of the hash partition chosen for the query or -1 if query quals couldn't conclusively determine one (maybe not all partition keys were specified to be hashed or some or all used non-equality operator). Hope that answers your question at least to some degree. 
Now, there are some points Robert mentioned in his reply that I will need to also consider in the patch, which I'll go do now. :) Thanks, Amit [1] https://www.postgresql.org/message-id/CAFjFpRc4UdCYknBai9pBu2GA1h4nZVNPDmzgs4jOkqFamT1huA%40mail.gmail.com [2] It appears that in the hash partitioning case, unlike list and range partitioning, PartitionBoundInfo doesn't contain values that are directly comparable to query-specified constants, but a pair (modulus, remainder) for each partition. We'll first hash *all* the key values (mentioned in the query) using the partitioning hash machinery and then determine the hash partition index by using (hash % largest_modulus) as offset into the PartitionBoundInfo.indexes array.
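Footnote [2] can be sketched in standalone C (hypothetical types and names; the real logic lives in partition.c around PartitionBoundInfo): each hash partition carries a (modulus, remainder) pair, and partition lookup takes the combined hash of the key values modulo the greatest modulus as an offset into an indexes array.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* A hash partition is identified by a (modulus, remainder) pair. */
typedef struct
{
    int modulus;
    int remainder;
} HashBound;

/* True if a tuple's combined hash belongs to the given partition. */
static bool satisfies_bound(uint64_t hash, HashBound b)
{
    return (int) (hash % (uint64_t) b.modulus) == b.remainder;
}

/*
 * indexes[] has one slot per remainder under the greatest modulus; slot i
 * holds the partition index for tuples with (hash % greatest_modulus) == i.
 */
static int find_partition(uint64_t hash, const int *indexes,
                          int greatest_modulus)
{
    return indexes[(size_t) (hash % (uint64_t) greatest_modulus)];
}
```

The example indexes array in the test below shows why every partition's modulus must divide the greatest one: a (modulus 4, remainder 2) partition owns slots 2 and 6 under a greatest modulus of 8.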
Re: [HACKERS] path toward faster partition pruning
Hi David, On 2017/09/27 6:04, David Rowley wrote: > On 25 September 2017 at 23:04, Amit Langote > wrote: >> By the way, I'm now rebasing these patches on top of [1] and will try to >> merge your refactoring patch in some appropriate way. Will post more >> tomorrow. >> >> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826 > > Yeah, I see 0001 conflicts with that. I'm going to set this to waiting > on author while you're busy rebasing this. Thanks for the reminder. Just thought I'd say that while I'm actually done rebasing itself (attaching rebased patches to try 'em out), I'm now considering Robert's comments and will be busy for a bit revising things based on those comments. Some notes about the attached patches: - 0001 includes refactoring that Dilip proposed upthread [1] (added him as an author). I slightly tweaked his patch -- renamed the function get_matching_clause to match_clauses_to_partkey, similar to match_clauses_to_index. - Code to set AppendPath's partitioned_rels in add_paths_to_append_rel() revised by 0a480502b09 (was originally introduced in d3cc37f1d80) is still revised to get partitioned_rels from a source that is not PlannerInfo.pcinfo_list. With the new code, partitioned_rels won't contain RT indexes of the partitioned child tables that were pruned. Thanks, Amit [1] https://www.postgresql.org/message-id/CAFiTN-tGnQzF_4QtbOHT-3hE%3DOvNaMfbbeRxa4UY0CQyF0G8gQ%40mail.gmail.com From f20aebcad9b089434ba60cd4439fa1a9d55091b8 Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 13 Sep 2017 18:24:55 +0900 Subject: [PATCH 1/4] WIP: planner-side changes for partition-pruning Firstly, this adds a stub get_partitions_for_keys() in partition.c with appropriate interface for the caller to specify bounding scan keys, along with other information about the scan keys extracted from the query, such as NULL-ness of the keys, inclusive-ness, etc. 
More importantly, this implements the planner-side logic to extract bounding scan keys to be passed to get_partitions_for_keys. That is, it will go through rel->baserestrictinfo and match individual clauses to partition keys and construct lower bound and upper bound tuples, which may cover only a prefix of a multi-column partition key. A bunch of smarts are still missing when mapping the clause operands with keys. For example, code to match a clause that is specified as (constant op var) doesn't exist. Also, redundant keys are not eliminated; for example, a combination of clauses a = 10 and a > 1 will cause the latter clause a > 1 to take over, resulting in needless scanning of partitions containing values a > 1 and a < 10. ...constraint exclusion is still used, because get_partitions_for_keys is just a stub... Authors: Amit Langote, Dilip Kumar --- src/backend/catalog/partition.c | 43 src/backend/optimizer/path/allpaths.c | 390 ++ src/backend/optimizer/util/plancat.c | 10 + src/backend/optimizer/util/relnode.c | 7 + src/include/catalog/partition.h | 9 + src/include/nodes/nodes.h | 1 + src/include/nodes/relation.h | 66 ++ 7 files changed, 487 insertions(+), 39 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 1ab6dba7ae..ccf8a1fa67 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -1335,6 +1335,49 @@ get_partition_dispatch_recurse(Relation rel, Relation parent, } } +/* + * get_partitions_for_keys + * Returns the list of indexes of rel's partitions that will need to be + * scanned given the bounding scan keys. + * + * Each value in the returned list can be used as an index into the oids array + * of the partition descriptor. + * + * Inputs: + * keynullness contains between 0 and (key->partnatts - 1) values, each + * telling what kind of NullTest has been applied to the corresponding + * partition key column. 
minkeys represents the lower bound on the partition + key of the records that the query will return, while maxkeys + represents the upper bound. min_inclusive and max_inclusive tell whether the + bounds specified by minkeys and maxkeys are inclusive, respectively. + * + * Other outputs: + * *min_datum_index will return the index in boundinfo->datums of the first + datum that the query's bounding keys allow to be returned for the query. + Similarly, *max_datum_index. *null_partition_chosen returns whether + the null partition will be scanned. + * + * TODO: Implement. + */ +List * +get_partitions_for_keys(Relation rel, + NullTestType *keynullness, + Datum *minkeys, int n_minkeys, bool min_inclusive, +
Re: [HACKERS] path toward faster partition pruning
On 2017/09/27 1:51, Robert Haas wrote: > On Tue, Sep 26, 2017 at 10:57 AM, Jesper Pedersen > wrote: >> One could advocate (*cough*) that the hash partition patch [1] should be >> merged first in order to find other instances of where other CommitFest >> entries doesn't account for hash partitions at the moment in their method >> signatures; Beena noted something similar in [2]. I know that you said >> otherwise [3], but this is CommitFest 1, so there is time for a revert >> later, and hash partitions are already useful in internal testing. > > Well, that's a fair point. I was assuming that committing things in > that order would cause me to win the "least popular committer" award > at least for that day, but maybe not. It's certainly not ideal to > have to juggle that patch along and keep rebasing it over other > changes when it's basically done, and just waiting on other > improvements to land. Anybody else wish to express an opinion? FWIW, I tend to agree that it would be nice to get the hash partitioning patch in, even with old constraint exclusion based partition-pruning not working for hash partitions. That way, it might be more clear what we need to do in the partition-pruning patches to account for hash partitions. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] path toward faster partition pruning
On 2017/09/25 20:21, Dilip Kumar wrote: > On Mon, Sep 25, 2017 at 3:34 PM, Amit Langote > wrote: > >> Thanks for looking at the patches and the comments. > >> It's not clear to me whether get_rel_partitions() itself, as it is, is >> callable from outside the planner, because its signature contains >> RelOptInfo. We have the RelOptInfo in the signature, because we want to >> mark certain fields in it so that latter planning steps can use them. So, >> get_rel_partitions()'s job is not just to match clauses and find >> partitions, but also to perform certain planner-specific tasks of >> generating information that the later planning steps will want to use. >> That may turn out to be unnecessary, but until we know that, let's not try >> to export get_rel_partitions() itself out of the planner. >> >> OTOH, the function that your refactoring patch separates out to match >> clauses to partition keys and extract bounding values seems reusable >> outside the planner and we should export it in such a way that it can be >> used in the executor. Then, the hypothetical executor function that does >> the pruning will first call the planner's clause-matching function, >> followed by calling get_partitions_for_keys() in partition.c to get the >> selected partitions. > > Thanks for your reply. > > Actually, we are still planning to call get_matching_clause at the > optimizer time only. Since we can not use get_rel_partitions function > directly for runtime pruning because it does all the work (find > matching clause, create minkey and maxkey and call > get_partitions_for_keys) during planning time itself. > > For runtime pruning, we are planning to first get_matching_clause > function during optimizer time to identify the clause which is > matching with partition keys, but for PARAM_EXEC case we can not > depend upon baserelrestriction instead we will get the from join > clause, that's the reason I have separated out get_matching_clause. 
> But it will still be used during planning time. I see. So, in the run-time pruning case, only the work of extracting bounding values is deferred to execution time. Matching clauses with the partition key still occurs during planning time. The difference is that the clauses requiring run-time pruning are not those in rel->baserestrictinfo. > After separating out the matching clause we will do somewhat similar > processing what "get_rel_partitions" is doing. But, at optimizer time > for PARAM we will not have Datum values for rightop, so we will keep > track of the PARAM itself. I guess information about which PARAMs map to which partition keys will be kept in the plan somehow. > And, finally at runtime when we get the PARAM value we can prepare > minkey and maxkey and call get_partitions_for_keys function. Note that get_partitions_for_keys() is not planner code, nor is it bound to any other planning code. It's callable from the executor without much change. Maybe you already know that though. Thanks, Amit
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
On 2017/09/26 11:12, Michael Paquier wrote: > On Tue, Sep 26, 2017 at 10:54 AM, Amit Langote > wrote: >> I think that's right, although, I don't see any new RangeVar created under >> vacuum() at the moment. Maybe, you're referring to the Nathan's patch >> that perhaps does that. > > Yes, you can check what it does here (last version): > 766556dd-aa3c-42f7-acf4-5dc97f41f...@amazon.com Thanks, will look. Regards, Amit
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
On 2017/09/26 11:14, Michael Paquier wrote: > On Tue, Sep 26, 2017 at 10:55 AM, Amit Langote wrote: >> On 2017/09/26 9:51, Michael Paquier wrote: >>> On Tue, Sep 26, 2017 at 8:48 AM, Michael Paquier wrote: >>>> Something like that looks like a good compromise for v10. I would >>>> rather see a more complete fix with each relation name reported >>>> correctly on HEAD though. The information provided would be useful for >>>> users when using autovacuum when skipping a relation because no lock >>>> could be taken on it. >>> >>> Actually, perhaps this should be tracked as an open item? A simple fix >>> is leading to the path that no information is better than misleading >>> information in this case. >> >> +1. > > Let's track it then and spawn a separate thread with a patch. Do you > want to work on it or should I? The solution proposed by Tom seems > like the correct answer. I am adding an item for now, we could always > link it to a thread later on. I assume you mean Tom's solution wherein we pass a null RangeVar for tables that were not mentioned in the command (that is, for partitions of a partitioned table that was mentioned in the command or for tables read from pg_class when a user ran VACUUM without mentioning any table name). Please feel free to come up with the patch for the same, if you have time. I'll be glad to review. > Let's also track the problem that has been reported on this thread. Yes. Just to be clear, the original problem this thread was started for is that get_rel_oids() may emit a less user-friendly "cache lookup failed for relation " error, which it shouldn't. We should fix the locking like the patch you posted does, to avoid having to come across this syscache lookup error. Thanks, Amit
Re: [HACKERS] Optimise default partition scanning while adding new partition
On 2017/09/16 1:57, Amit Langote wrote: > On Sat, Sep 16, 2017 at 12:59 AM, Robert Haas wrote: >> I believe the intended advantage of the current system is that if you >> specify multiple operations in a single ALTER TABLE command, you only >> do one scan rather than having a second scan per operation. If that's >> currently working, we probably don't want to make it stop working. > > OK. > > How about squash Jeevan's and my patch, so both > check_default_allows_bound() and ValidatePartitionConstraints() know > to scan default partition's children and there won't be any surprises > in the regression test output as you found after applying just the > Jeevan's patch. Unfortunately, I'm not able to post such a patch > right now. I guess we don't need to squash, as they could be seen as implementing different features. Reordering the patches helps though. So, apply them in this order: 1. My patch to teach ValidatePartitionConstraints() to skip scanning a partition's own partitions, which optimizes ATTACH PARTITION command's partition constraint validation scan (this also covers the case of scanning the default partition to validate its updated constraint when attaching a new partition) 2. Jeevan's patch to teach check_default_allows_bound() to skip scanning the default partition's own partitions, which covers the case of scanning the default partition to validate its updated constraint when adding a new partition using CREATE TABLE Attached 0001 and 0002 are ordered that way. In addition to implementing the features mentioned in 1 and 2 above, the patches also modify the INFO message to mention "updated partition constraint for default partition \"%s\"", instead of "partition constraint for table \"%s\"", when the default partition is involved. That's so that it's consistent with the error message that would be emitted by either check_default_allows_bound() or ATRewriteTable() when the scan finds a row that violates updated default partition constraint, viz. 
the following: "updated partition constraint for default partition \"%s\" would be violated by some row" Thoughts? Thanks, Amit From 166cc082b9d652e186df4ef7b6c41976a22e55a8 Mon Sep 17 00:00:00 2001 From: amit Date: Mon, 7 Aug 2017 10:51:47 +0900 Subject: [PATCH 1/2] Teach ValidatePartitionConstraints to skip validation in more cases In cases where the table being attached is a partitioned table and the table itself does not have constraints that would allow validation on the whole table to be skipped, we can still skip the validations of individual partitions if they each happen to have the requisite constraints. Idea by Robert Haas. --- src/backend/commands/tablecmds.c | 26 +++--- src/test/regress/expected/alter_table.out | 17 +++-- src/test/regress/sql/alter_table.sql | 10 ++ 3 files changed, 48 insertions(+), 5 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 563bcda30c..2d4dcd7556 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13635,9 +13635,14 @@ ValidatePartitionConstraints(List **wqueue, Relation scanrel, */ if (PartConstraintImpliedByRelConstraint(scanrel, partConstraint)) { - ereport(INFO, - (errmsg("partition constraint for table \"%s\" is implied by existing constraints", - RelationGetRelationName(scanrel)))); + if (!validate_default) + ereport(INFO, + (errmsg("partition constraint for table \"%s\" is implied by existing constraints", + RelationGetRelationName(scanrel)))); + else + ereport(INFO, + (errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints", + RelationGetRelationName(scanrel)))); return; } @@ -13678,6 +13683,21 @@ ValidatePartitionConstraints(List **wqueue, Relation scanrel, /* There can never be a whole-row reference here */ if (found_whole_row) elog(ERROR, "unexpected whole-row reference found in partition key"); + + /* Can we skip scanning this part_rel?
*/ + if (PartConstraintImpliedByRelConstraint(part_rel, my_partconstr)) +
Re: [HACKERS] Setting pd_lower in GIN metapage
On 2017/09/26 16:30, Michael Paquier wrote: > On Tue, Sep 26, 2017 at 4:22 PM, Amit Langote > wrote: >>> Except that small thing, the patches do their duty. >> >> Thanks, revised patches attached. > > Cool, let's switch it back to a ready for committer status then. Sure, thanks. Regards, Amit
Re: [HACKERS] Setting pd_lower in GIN metapage
On 2017/09/26 12:17, Michael Paquier wrote: > On Mon, Sep 25, 2017 at 3:48 PM, Amit Langote wrote: >> So, ISTM, comments that the patches add should all say that setting the >> meta pages' pd_lower to the correct value helps to pass those pages to >> xlog.c as compressible standard layout pages, regardless of whether they >> are actually passed that way. Even if the patches do take care of the >> latter as well. >> >> Did I miss something? > > Not that I think of. Thanks. > Buffer metabuffer; > + Pagemetapage; > SpGistMetaPageData *metadata; > > metabuffer = ReadBuffer(index, SPGIST_METAPAGE_BLKNO); > + metapage = BufferGetPage(metabuffer); > No need to define metapage here and to call BufferGetPage() as long as > the lock on the buffer is not taken. Ah, okay. Moved those additions inside the if (ConditionalLockBuffer(metabuffer)) block. > Except that small thing, the patches do their duty. Thanks, revised patches attached. Regards, Amit From e82e787bc9533977e73456b0782ed78354637bda Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 23 Jun 2017 11:20:41 +0900 Subject: [PATCH 1/5] Set pd_lower correctly in the GIN metapage. Also tell xlog.c to treat the metapage like a standard page, so any hole in it is compressed. --- src/backend/access/gin/ginfast.c | 20 ++-- src/backend/access/gin/gininsert.c | 4 ++-- src/backend/access/gin/ginutil.c | 17 - src/backend/access/gin/ginxlog.c | 25 ++--- 4 files changed, 46 insertions(+), 20 deletions(-) diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c index 59e435465a..359de85dfc 100644 --- a/src/backend/access/gin/ginfast.c +++ b/src/backend/access/gin/ginfast.c @@ -399,6 +399,14 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector) /* * Write metabuffer, make xlog entry */ + + /* +* Set pd_lower just past the end of the metadata. This is essential, +* because without doing so, metadata will be lost if xlog.c compresses +* the page. 
+*/ + ((PageHeader) metapage)->pd_lower = + ((char *) metadata + sizeof(GinMetaPageData)) - (char *) metapage; MarkBufferDirty(metabuffer); if (needWal) @@ -407,7 +415,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector) memcpy(&data.metadata, metadata, sizeof(GinMetaPageData)); - XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT); + XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT | REGBUF_STANDARD); XLogRegisterData((char *) &data, sizeof(ginxlogUpdateMeta)); recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_UPDATE_META_PAGE); @@ -572,6 +580,13 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead, metadata->nPendingHeapTuples = 0; } + /* +* Set pd_lower just past the end of the metadata. This is essential, +* because without doing so, metadata will be lost if xlog.c compresses +* the page. +*/ + ((PageHeader) metapage)->pd_lower = + ((char *) metadata + sizeof(GinMetaPageData)) - (char *) metapage; MarkBufferDirty(metabuffer); for (i = 0; i < data.ndeleted; i++) @@ -586,7 +601,8 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead, XLogRecPtr recptr; XLogBeginInsert(); - XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT); + XLogRegisterBuffer(0, metabuffer, + REGBUF_WILL_INIT | REGBUF_STANDARD); for (i = 0; i < data.ndeleted; i++) XLogRegisterBuffer(i + 1, buffers[i], REGBUF_WILL_INIT); diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c index 5378011f50..c9aa4ee147 100644 --- a/src/backend/access/gin/gininsert.c +++ b/src/backend/access/gin/gininsert.c @@ -348,7 +348,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo) Pagepage; XLogBeginInsert(); - XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT); + XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT | REGBUF_STANDARD); XLogRegisterBuffer(1, RootBuffer, REGBUF_WILL_INIT); recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_CREATE_INDEX); @@ -447,7 +447,7 @@ ginbuildempty(Relation index) START_CRIT_SECTION(); 
GinInitMetabuffer(MetaBuffer); MarkBufferDirty(MetaBuffer); - log_newpage_buffer(Me
Re: [HACKERS] Setting pd_lower in GIN metapage
On 2017/09/26 11:34, Amit Kapila wrote: > On Mon, Sep 25, 2017 at 12:18 PM, Amit Langote wrote: >> So, ISTM, comments that the patches add should all say that setting the >> meta pages' pd_lower to the correct value helps to pass those pages to >> xlog.c as compressible standard layout pages, regardless of whether they >> are actually passed that way. Even if the patches do take care of the >> latter as well. >> >> Did I miss something? >> >> Looking at Amit K's updated patches for btree and hash, it seems that he >> updated the comment to say that setting pd_lower to the correct value is >> *essential*, because those pages are passed as REGBUF_STANDARD pages and >> hence will be compressed. That seems to contradict what I wrote above, >> but I think that's not wrong, too. >> > > I think you are missing that there are many cases where we use only > REGBUF_STANDARD for meta-pages (cf. hash index). For btree, where the > advantage is in fewer cases, I have explicitly stated those cases. I do see that there are some places that use only REGBUF_STANDARD. I also understand that specifying this flag is a necessary condition for XLogRecordAssemble() to perform the hole compression, if it is to be performed at all. ISTM, the hole is compressed only if we write the FP image. However, the reasons why an FP image needs to be written, AFAICS, are independent of whether the hole is (should be) compressed or not. Whether compression should occur or not depends on whether the REGBUF_STANDARD flag is passed, that is, whether a caller is sure that the page is of standard layout. The last part is only true if the page's pd_lower is always set correctly. Perhaps, we should focus on just writing exactly why setting pd_lower to the correct value is essential. I think that's a good change, so I adopted it from your patch.
The *only* reason why it's essential to set pd_lower to the correct value, as I see it, is that xloginsert.c *might* compress the hole in the page, and its pd_lower had better be correct if we want compression to preserve the metadata content (in meta pages). OTOH, it's not clear to me why we should write in that comment *why* the compression would occur or why the FP image would be written in the first place. Whether or not an FP image is written has nothing to do with whether we should set pd_lower correctly, does it? Also, since we want to repeat the same comment in multiple places where we now set pd_lower (gin, brin, spgist patches have many more new places where the meta page's pd_lower is set), keeping it to the point would be good. But as you said a few times, we should leave it to the committer to decide the exact text to write in these comments, which I think is fine. > Do you still have confusion? This is an area of PG code I don't have much experience with and is also a bit complex, so sorry if I'm saying things that are not quite right. Thanks, Amit
Re: [HACKERS] moving some partitioning code to executor
On 2017/09/14 16:13, Amit Langote wrote: > Hi. > > It seems to me that some of the code in partition.c is better placed > somewhere under the executor directory. There was even a suggestion > recently [1] to introduce a execPartition.c to house some code around > tuple-routing. > > IMO, catalog/partition.c should present an interface for handling > operations on a *single* partitioned table and avoid pretending to support > any operations on the whole partition tree. For example, the > PartitionDispatch structure embeds some knowledge about the partition tree > it's part of, which is useful when used for tuple-routing, because of the > way it works now (lock and form ResultRelInfos of *all* leaf partitions > before the first input row is processed). > > So, let's move that structure, along with the code that creates and > manipulates the same, out of partition.c/h and to execPartition.c/h. > Attached patch attempts to do that. > > While doing the same, I didn't move *all* of get_partition_for_tuple() out > to execPartition.c, instead modified its signature as shown below: > > -extern int get_partition_for_tuple(PartitionDispatch *pd, > -TupleTableSlot *slot, > -EState *estate, > -PartitionDispatchData **failed_at, > -TupleTableSlot **failed_slot); > +extern int get_partition_for_tuple(Relation relation, Datum *values, > +bool *isnull); > > That way, we keep the core partition bound comparison logic inside > partition.c and move rest of the stuff to its caller ExecFindPartition(), > which includes navigating the enveloping PartitionDispatch's. > > Thoughts? > > PS: 0001 of the attached is the patch from [2] which is here to be applied > on HEAD before applying the main patch (0002) itself Since that 0001 patch was committed [1], here is the rebased patch. Will add this to the November commit-fest. 
Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=77b6b5e9c From 08b337436b5862b0ee593314864d6ba98f95e6c0 Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 8 Sep 2017 19:07:38 +0900 Subject: [PATCH] Move certain partitioning code to the executor --- src/backend/catalog/partition.c| 438 +- src/backend/commands/copy.c| 1 + src/backend/executor/Makefile | 2 +- src/backend/executor/execMain.c| 265 +--- src/backend/executor/execPartition.c | 559 + src/backend/executor/nodeModifyTable.c | 1 + src/include/catalog/partition.h| 48 +-- src/include/executor/execPartition.h | 65 src/include/executor/executor.h| 14 +- 9 files changed, 708 insertions(+), 685 deletions(-) create mode 100644 src/backend/executor/execPartition.c create mode 100644 src/include/executor/execPartition.h diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 1ab6dba7ae..903c8c4def 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -147,8 +147,6 @@ static int32 partition_bound_cmp(PartitionKey key, static int partition_bound_bsearch(PartitionKey key, PartitionBoundInfo boundinfo, void *probe, bool probe_is_bound, bool *is_equal); -static void get_partition_dispatch_recurse(Relation rel, Relation parent, - List **pds, List **leaf_part_oids); /* * RelationBuildPartitionDesc @@ -1193,148 +1191,6 @@ get_partition_qual_relid(Oid relid) return result; } -/* - * RelationGetPartitionDispatchInfo - * Returns information necessary to route tuples down a partition tree - * - * The number of elements in the returned array (that is, the number of - * PartitionDispatch objects for the partitioned tables in the partition tree) - * is returned in *num_parted and a list of the OIDs of all the leaf - * partitions of rel is returned in *leaf_part_oids. - * - * All the relations in the partition tree (including 'rel') must have been - * locked (using at least the AccessShareLock) by the caller. 
- */ -PartitionDispatch * -RelationGetPartitionDispatchInfo(Relation rel, -int *num_parted, List **leaf_part_oids) -{ - List *pdlist = NIL; - PartitionDispatchData **pd; - ListCell *lc; - int i; - - Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE); - - *num_parted = 0; - *leaf_part_oids = NIL; - - get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids); - *num_parted = list_length(pdlist
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
On 2017/09/26 9:51, Michael Paquier wrote: > On Tue, Sep 26, 2017 at 8:48 AM, Michael Paquier > wrote: >> On Mon, Sep 25, 2017 at 11:32 PM, Tom Lane wrote: >>> Yeah, I'd noticed that while reviewing the vacuum-multiple-tables patch. >>> My thought about fixing it was to pass a null RangeVar when handling a >>> table we'd identified through inheritance or pg_class-scanning, to >>> indicate that this wasn't a table named in the original command. This >>> only works conveniently if you decide that it's appropriate to silently >>> ignore relation_open failure on such table OIDs, but I think it is. >>> >>> Not sure about whether we ought to try to fix that in v10. It's a >>> mostly-cosmetic problem in what ought to be an infrequent corner case, >>> so it might not be worth taking risks for post-RC1. OTOH, I would >>> not be surprised to get bug reports about it down the road. >> >> Something like that looks like a good compromise for v10. I would >> rather see a more complete fix with each relation name reported >> correctly on HEAD though. The information provided would be useful for >> users when using autovacuum when skipping a relation because no lock >> could be taken on it. > > Actually, perhaps this should be tracked as an open item? A simple fix > is leading to the path that no information is better than misleading > information in this case. +1. Thanks, Amit
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
On 2017/09/25 18:37, Michael Paquier wrote: > On Mon, Sep 25, 2017 at 4:42 PM, Amit Langote > wrote: >> On 2017/09/25 12:10, Michael Paquier wrote: >> Hmm, I'm not sure if we need to lock the partitions, too. Locks taken by >> find_all_inheritors() will be gone once the already-running transaction is >> ended by the caller (vacuum()). get_rel_oids() should just lock the table >> mentioned in the command (the parent table), so that find_all_inheritors() >> returns the correct partition list, as Tom perhaps alluded to when he said >> "it would also make the find_all_inheritors call a lot more meaningful.", >> but... > > It is important to hold the lock for child tables until the end of the > transaction actually to fill in correctly the RangeVar associated to > each partition when scanning them. I think that's right, although, I don't see any new RangeVar created under vacuum() at the moment. Maybe, you're referring to the Nathan's patch that perhaps does that. >> When vacuum() iterates that list and calls vacuum_rel() for each relation >> in the list, it might be the case that a relation in that list is no >> longer a partition of the parent. So, one might say that it would get >> vacuumed unnecessarily. Perhaps that's fine? > > I am personally fine with that. Any checks would prove to complicate > the code for not much additional value. Tend to agree here. >>> I am not >>> saying that this needs to be fixed in REL_10_STABLE at this stage >>> though as this would require some refactoring similar to what the >>> patch adding support for VACUUM with multiple relations does. >> >> I think it would be better to fix that independently somehow. VACUUM on >> partitioned tables is handled by calling vacuum_rel() on individual >> partitions. It would be nice if vacuum_rel() didn't have to refer to the >> RangeVar. Could we not use get_rel_name(relid) in place of >> relation->relname? I haven't seen the other patch yet though, so maybe >> I'm missing something. 
> The RangeVar is used for error reporting, and once you reach > vacuum_rel, the relation and its OID may be gone. So you need to save > the information of the RangeVar beforehand when scanning the > relations. That's one reason behind having the VacuumRelation stuff > discussed on the thread for multiple table VACUUM, this way you store > for each item to be vacuumed: > - A RangeVar. > - A list of columns. > - An OID saved by vacuum deduced from the RangeVar. > That's quite some refactoring, so my opinion is that this ship has > sailed already for REL_10_STABLE, and that we had better live with > that in 10. Yeah, your simple patch seems enough for v10. Thanks, Amit
Re: [HACKERS] path toward faster partition pruning
Hi Dilip. Thanks for looking at the patches and the comments. On 2017/09/16 18:43, Dilip Kumar wrote: > On Fri, Sep 15, 2017 at 2:20 PM, Amit Langote > wrote: >> On 2017/09/15 11:16, Amit Langote wrote: > > Thanks for the updated patch. I was going through the logic of > get_rel_partitions in 0002 as almost similar functionality will be > required by runtime partition pruning on which Beena is working. The > only difference is that here we are processing the > "rel->baserestrictinfo" and in case of runtime pruning, we also need > to process join clauses which are pushed down to appendrel. Yeah, I agree with the point you seem to be making that get_rel_partitions() covers a lot of functionality, which it would be nice to break down into reusable function(s) with suitable signature(s) that the executor will also be able to use. Your proposed refactoring patch down-thread seems to be a good step in that direction. Thanks for working on it. > So can we make some generic logic which can be used for both the patches. > > So basically, we need to do two changes > > 1. In get_rel_partitions instead of processing the > "rel->baserestrictinfo" we can take clause list as input that way we > can pass any clause list to this function. > > 2. Don't call "get_partitions_for_keys" routine from the > "get_rel_partitions", instead, get_rel_partitions can just prepare > minkey, maxkey and the caller of the get_rel_partitions can call > get_partitions_for_keys, because for runtime pruning we need to call > get_partitions_for_keys at runtime. It's not clear to me whether get_rel_partitions() itself, as it is, is callable from outside the planner, because its signature contains RelOptInfo. We have the RelOptInfo in the signature, because we want to mark certain fields in it so that latter planning steps can use them. 
So, get_rel_partitions()'s job is not just to match clauses and find partitions, but also to perform certain planner-specific tasks of generating information that the later planning steps will want to use. That may turn out to be unnecessary, but until we know that, let's not try to export get_rel_partitions() itself out of the planner. OTOH, the function that your refactoring patch separates out to match clauses to partition keys and extract bounding values seems reusable outside the planner and we should export it in such a way that it can be used in the executor. Then, the hypothetical executor function that does the pruning will first call the planner's clause-matching function, followed by calling get_partitions_for_keys() in partition.c to get the selected partitions. We should be careful when designing the interface of the exported function to make sure it's not bound to the planner. Your patch still maintains the RelOptInfo in the signature of the clause-matching function, which the executor pruning function won't have access to. > After these changes also there will be one problem that the > get_partitions_for_keys is directly fetching the "rightop->constvalue" > whereas, for runtime pruning, we need to store rightop itself and > calculate the value at runtime by param evaluation, I haven't yet > thought how can we make this last part generic. I don't think any code introduced by the patch in partition.c itself looks inside OpExpr (or any type of clause for that matter). That is, I don't see where get_partitions_for_keys() is looking at rightop->constvalue. All it receives to work with are arrays of Datums and some other relevant information like inclusivity, nullness, etc. By the way, I'm now rebasing these patches on top of [1] and will try to merge your refactoring patch in some appropriate way. Will post more tomorrow. 
Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
On 2017/09/25 12:10, Michael Paquier wrote: > On Sat, Sep 23, 2017 at 4:13 AM, Tom Lane wrote: >> Somebody inserted this into vacuum.c's get_rel_oids(): >> >> tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid)); >> if (!HeapTupleIsValid(tuple)) >> elog(ERROR, "cache lookup failed for relation %u", relid); >> >> apparently without having read the very verbose comment two lines above, >> which points out that we're not taking any lock on the target relation. >> So, if that relation is concurrently being dropped, you're likely to >> get "cache lookup failed for relation " rather than anything more >> user-friendly. > > This has been overlooked during the lookups of 3c3bb99, and by > multiple people including me. elog() should never be things users can > face, as well as cache lookups. Agreed, that was an oversight. >> A minimum-change fix would be to replace the elog() with an ereport >> that produces the same "relation does not exist" error you'd have >> gotten from RangeVarGetRelid, had the concurrent DROP TABLE committed >> a few microseconds earlier. But that feels like its's band-aiding >> around the problem. > > Yeah, that's not right. This is a cache lookup error at the end. Also agreed that fixing the locking here would be a better solution. >> What I'm wondering about is changing the RangeVarGetRelid call to take >> ShareUpdateExclusiveLock rather than no lock. That would protect the >> syscache lookup, and it would also make the find_all_inheritors call >> a lot more meaningful. >> >> If we're doing a VACUUM, the ShareUpdateExclusiveLock would be dropped >> as soon as we close the caller's transaction, and then we'd acquire >> the same or stronger lock inside vacuum_rel(). So that seems fine. >> If we're doing an ANALYZE, then the lock would continue to be held >> and analyze_rel would merely be acquiring it an extra time, so we'd >> actually be removing a race-condition failure scenario for ANALYZE. 
>> This would mean a few more cycles in lock management, but since this >> only applies to a manual VACUUM or ANALYZE that specifies a table >> name, I'm not too concerned about that. > > I think that I am +1 on that. Testing such a thing I am not seeing > anything wrong either. The call to find_all_inheritors should also use > ShareUpdateExclusiveLock, and the lock on top of RangeVarGetRelid() > needs to be reworked. > > Attached is a proposal of patch. Hmm, I'm not sure if we need to lock the partitions, too. Locks taken by find_all_inheritors() will be gone once the already-running transaction is ended by the caller (vacuum()). get_rel_oids() should just lock the table mentioned in the command (the parent table), so that find_all_inheritors() returns the correct partition list, as Tom perhaps alluded to when he said "it would also make the find_all_inheritors call a lot more meaningful.", but... When vacuum() iterates that list and calls vacuum_rel() for each relation in the list, it might be the case that a relation in that list is no longer a partition of the parent. So, one might say that it would get vacuumed unnecessarily. Perhaps that's fine? >> Thoughts? > > As long as I don't forget... Another thing currently on HEAD and > REL_10_STABLE is that OIDs of partitioned tables are used, but the > RangeVar of the parent is used for error reports. This leads to > incorrect reports if a partition gets away in the middle of autovacuum > as only information about the parent is reported to the user. Oh, you're right. The original RangeVar (corresponding to the table mentioned in the command) refers to the parent table. > I am not > saying that this needs to be fixed in REL_10_STABLE at this stage > though as this would require some refactoring similar to what the > patch adding support for VACUUM with multiple relations does. I think it would be better to fix that independently somehow. 
VACUUM on partitioned tables is handled by calling vacuum_rel() on individual partitions. It would be nice if vacuum_rel() didn't have to refer to the RangeVar. Could we not use get_rel_name(relid) in place of relation->relname? I haven't seen the other patch yet though, so maybe I'm missing something. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
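To make the proposed fix above concrete, here is a toy model of the lock-conflict reasoning (this is my own simplification for illustration — it is not PostgreSQL's lock-manager code, and the `toy_` names are invented): DROP TABLE takes AccessExclusiveLock, which conflicts with every real lock mode, so doing the syscache lookup under ShareUpdateExclusiveLock forces a concurrent DROP to wait, whereas a NoLock lookup leaves the race window open:

```c
#include <stdbool.h>

typedef enum ToyLockMode
{
    NoLock = 0,
    AccessShareLock,
    ShareUpdateExclusiveLock,
    AccessExclusiveLock
} ToyLockMode;

/*
 * Toy conflict rules, modeling only the modes discussed in the thread:
 * AccessExclusiveLock (what DROP TABLE takes) conflicts with every real
 * lock mode, ShareUpdateExclusiveLock is self-conflicting, and NoLock
 * conflicts with nothing.
 */
bool
toy_lock_conflicts(ToyLockMode held, ToyLockMode requested)
{
    if (held == NoLock || requested == NoLock)
        return false;
    if (held == AccessExclusiveLock || requested == AccessExclusiveLock)
        return true;
    if (held == ShareUpdateExclusiveLock &&
        requested == ShareUpdateExclusiveLock)
        return true;
    return false;
}

/* Does a concurrent DROP TABLE (AccessExclusiveLock) have to wait for us? */
bool
toy_drop_must_wait(ToyLockMode lookup_lock)
{
    return toy_lock_conflicts(lookup_lock, AccessExclusiveLock);
}
```

With `lookup_lock = NoLock`, the DROP proceeds and the subsequent SearchSysCache1() fails with the unfriendly "cache lookup failed" error; with ShareUpdateExclusiveLock (or even AccessShareLock), the DROP waits.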
Re: [HACKERS] Setting pd_lower in GIN metapage
Hi. Trying to catch up. On 2017/09/25 13:43, Michael Paquier wrote: > On Sun, Sep 24, 2017 at 2:25 PM, Amit Kapila wrote: >> Added and updated the comments for both btree and hash index patches. > > I don't have real complaints about this patch, this looks fine to me. > > +* Currently, the advantage of setting pd_lower is in limited cases like > +* during wal_consistency_checking or while logging for unlogged relation > +* as for all other purposes, we initialize the metapage. Note, it also > +* helps in page masking by allowing to mask unused space. > I would have reworked this comment a bit, say like that: > Setting pd_lower is useful for two cases which make use of WAL > compressibility even if the meta page is initialized at replay: > - Logging of init forks for unlogged relations. > - wal_consistency_checking logs extra full-page writes, and this > allows masking of the unused space of the page. > > Now I often get complains that I suck at this exercise ;) So, I think I understand the discussion so far and the arguments about what we should write to explain why we're setting pd_lower to the correct value. Just to remind, I didn't actually start this thread [1] to address the issue that the FPWs of meta pages written to WAL are not getting compressed. An external backup tool relies on pd_lower to give the correct starting offset of the hole to compress, provided the page's other fields suggest it has the standard page layout. Since we discovered that pd_lower is not set correctly in gin, brin, and spgist meta pages, I created patches to set it correctly. You (Michael) pointed out that that actually helps compress their FPW images in WAL as well, so we began considering that. Also, you pointed out that WAL checking code masks pages based on the respective masking functions' assumptions about the page's layout properties, which the original patches forgot to consider. 
So, we updated the patches to change the respective masking functions to mask meta pages as pages with standard page layout, too. But as Tom pointed out [2], WAL compressibility benefit cannot be obtained unless we change how the meta page is passed to xlog.c to be written to WAL. So, we found all the places that write/register the meta page to WAL and changed the respective XLogRegisterBuffer calls to include the REGBUF_STANDARD flag. Some of these calls already passed REGBUF_WILL_INIT, which would result in no FPW image to be written to the WAL and so there would be no question of compressibility. But, passing REGBUF_STANDARD would still help the case where WAL checking is enabled, because the FPW image *is* written in that case. So, ISTM, comments that the patches add should all say that setting the meta pages' pd_lower to the correct value helps to pass those pages to xlog.c as compressible standard layout pages, regardless of whether they are actually passed that way. Even if the patches do take care of the latter as well. Did I miss something? Looking at Amit K's updated patches for btree and hash, it seems that he updated the comment to say that setting pd_lower to the correct value is *essential*, because those pages are passed as REGBUF_STANDARD pages and hence will be compressed. That seems to contradict what I wrote above, but I think that's not wrong, either. So, I updated the gin, brin, and spgist patches to say the same, viz. the following: /* * Set pd_lower just past the end of the metadata. This is essential, * because without doing so, metadata will be lost if xlog.c compresses * the page. */ Attached updated patches. 
Thanks, Amit [1] https://www.postgresql.org/message-id/0d273805-0e9e-ec1a-cb84-d4da400b8f85%40lab.ntt.co.jp [2] https://www.postgresql.org/message-id/22268.1504815869%40sss.pgh.pa.us From 600caa054e98673428fc9595b13df54efe06a6a4 Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 23 Jun 2017 11:20:41 +0900 Subject: [PATCH 1/5] Set pd_lower correctly in the GIN metapage. Also tell xlog.c to treat the metapage like a standard page, so any hole in it is compressed. --- src/backend/access/gin/ginfast.c | 20 ++-- src/backend/access/gin/gininsert.c | 4 ++-- src/backend/access/gin/ginutil.c | 17 - src/backend/access/gin/ginxlog.c | 25 ++--- 4 files changed, 46 insertions(+), 20 deletions(-) diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c index 59e435465a..359de85dfc 100644 --- a/src/backend/access/gin/ginfast.c +++ b/src/backend/access/gin/ginfast.c @@ -399,6 +399,14 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector) /* * Write metabuffer, make xlog entry */ + + /* +* Set pd_lower just past the end of the metadata. This is essential, +* because without doing so, metadata will be lost if xlog.c compresses +* the page. +*/ + ((PageHeader) metapage)->pd_lower = +
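The compressibility argument in the message above can be sketched with a toy calculation (mine, not xlog.c's actual code; the `TOY_` constants stand in for BLCKSZ and SizeOfPageHeaderData): for a page registered with REGBUF_STANDARD, the bytes between pd_lower and pd_upper are treated as a hole and omitted from the full-page image, so a metapage whose pd_lower is left at zero produces a near-8kB image while one with pd_lower set just past the metadata produces a tiny one:

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BLCKSZ 8192
#define TOY_PAGE_HEADER_SIZE 24     /* stand-in for SizeOfPageHeaderData */

/*
 * Toy model of xlog.c's hole elision for standard-layout pages: the
 * region between pd_lower and pd_upper is assumed unused and is left
 * out of the full-page image.  A page that never sets pd_lower past
 * the header presents no usable hole and is logged nearly whole.
 */
size_t
toy_fpw_image_size(uint16_t pd_lower, uint16_t pd_upper)
{
    size_t hole = 0;

    if (pd_lower >= TOY_PAGE_HEADER_SIZE && pd_lower < pd_upper)
        hole = (size_t) (pd_upper - pd_lower);
    return TOY_BLCKSZ - hole;
}
```

For a GIN metapage with pd_lower set just past the metadata (say offset 40) and pd_upper at BLCKSZ, nearly the whole block is elided from the image.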
Re: [HACKERS] path toward faster partition pruning
On Sat, Sep 16, 2017 at 4:04 AM, Robert Haas wrote: > On Fri, Sep 15, 2017 at 4:50 AM, Amit Langote > wrote: >> Rebased patches attached. Because Dilip complained earlier today about >> clauses of the form (const op var) not causing partition-pruning, I've >> added code to commute the clause where it is required. Some other >> previously mentioned limitations remain -- no handling of OR clauses, no >> elimination of redundant clauses for given partitioning column, etc. >> >> A note about 0001: this patch overlaps with >> 0003-Canonical-partition-scheme.patch from the partitionwise-join patch >> series that Ashutosh Bapat posted yesterday [1]. > > It doesn't merely overlap; it's obviously a derivative work, Yes it is. I noted upthread [1] that most of these are derived from Ashutosh's patch on his suggestion. I guess I should have repeated that in this message too, sorry. > and the > commit message in your version should credit all the authors. That was a mistake on my part, too. Will be careful from here on. Thanks, Amit [1] https://www.postgresql.org/message-id/0e829199-a43c-2a66-b966-89a0020a6cd4%40lab.ntt.co.jp
Re: [HACKERS] Optimise default partition scanning while adding new partition
On Sat, Sep 16, 2017 at 12:59 AM, Robert Haas wrote: > On Fri, Sep 15, 2017 at 2:00 AM, Amit Langote > wrote: >> I wonder if we should call check_default_allows_bound() from >> ATExecAttachPartition(), too, instead of validating updated default >> partition constraint using ValidatePartitionConstraints()? That is, call >> the latter only to validate the partition constraint of the table being >> attached and call check_default_allows_bound() to validate the updated >> default partition constraint. That way, INFO/ERROR messages related to >> default partition constraint are consistent across the board. > > I believe the intended advantage of the current system is that if you > specify multiple operations in a single ALTER TABLE command, you only > do one scan rather than having a second scan per operation. If that's > currently working, we probably don't want to make it stop working. OK. How about squashing Jeevan's patch and mine, so that both check_default_allows_bound() and ValidatePartitionConstraints() know to scan the default partition's children, and there won't be any surprises in the regression test output like the one you found after applying just Jeevan's patch? Unfortunately, I'm not able to post such a patch right now. Thanks, Amit
Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...
On Fri, Sep 15, 2017 at 9:20 PM, Robert Haas wrote: > On Thu, Sep 14, 2017 at 8:30 AM, Ashutosh Bapat > wrote: >> LGTM. The patch applies cleanly on the current HEAD, compiles (small >> bit in regress.c requires compilation), and make check passes. Marking >> this as "ready for committer". > > Committed. Thanks, closed in the CF app. Regards, Amit
Re: [HACKERS] path toward faster partition pruning
On 2017/09/15 11:16, Amit Langote wrote: > I will post rebased patches later today, although I think the overall > design of the patch on the planner side of things is not quite there yet. > Of course, your and others' feedback is greatly welcome. Rebased patches attached. Because Dilip complained earlier today about clauses of the form (const op var) not causing partition-pruning, I've added code to commute the clause where it is required. Some other previously mentioned limitations remain -- no handling of OR clauses, no elimination of redundant clauses for given partitioning column, etc. A note about 0001: this patch overlaps with 0003-Canonical-partition-scheme.patch from the partitionwise-join patch series that Ashutosh Bapat posted yesterday [1]. Because I implemented the planner-portion of this patch based on what 0001 builds, I'm posting it here. It might actually turn out that we will review and commit 0003-Canonical-partition-scheme.patch on that thread, but meanwhile apply 0001 if you want to play with the later patches. I would certainly like to review 0003-Canonical-partition-scheme.patch myself, but won't be able to immediately (see below). > Also, I must inform to all of those who're looking at this thread that I > won't be able to respond to emails from tomorrow (9/16, Sat) until 9/23, > Sat, due to some personal business. To remind. Thanks, Amit [1] https://www.postgresql.org/message-id/CAFiTN-skmaqeCVaoAHCBqe2DyfO3f6sgdtEjHWrUgi0kV1yPLQ%40mail.gmail.com From ff9ccd8df6555cfca31e54e22293ac1613db327c Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 13 Sep 2017 18:24:55 +0900 Subject: [PATCH 1/5] Some optimizer data structures for partitioned rels Nobody uses it though. 
--- src/backend/optimizer/util/plancat.c | 120 +++ src/backend/optimizer/util/relnode.c | 20 ++ src/include/nodes/nodes.h| 1 + src/include/nodes/relation.h | 135 +++ src/include/optimizer/plancat.h | 2 + 5 files changed, 278 insertions(+) diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c index a1ebd4acc8..f7e3a1df5f 100644 --- a/src/backend/optimizer/util/plancat.c +++ b/src/backend/optimizer/util/plancat.c @@ -68,6 +68,8 @@ static List *get_relation_constraints(PlannerInfo *root, static List *build_index_tlist(PlannerInfo *root, IndexOptInfo *index, Relation heapRelation); static List *get_relation_statistics(RelOptInfo *rel, Relation relation); +static void get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel, + Relation relation); /* * get_relation_info - @@ -420,6 +422,10 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent, /* Collect info about relation's foreign keys, if relevant */ get_relation_foreign_keys(root, rel, relation, inhparent); + /* Collect partitioning info, if relevant. 
*/ + if (relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) + get_relation_partition_info(root, rel, relation); + heap_close(relation, NoLock); /* @@ -1802,3 +1808,117 @@ has_row_triggers(PlannerInfo *root, Index rti, CmdType event) heap_close(relation, NoLock); return result; } + +static void +get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel, + Relation relation) +{ + int i; + ListCell *l; + PartitionKey key = RelationGetPartitionKey(relation); + PartitionDesc partdesc = RelationGetPartitionDesc(relation); + + rel->part_scheme = find_partition_scheme(root, relation); + rel->partexprs = (List **) palloc0(key->partnatts * sizeof(List *)); + + l = list_head(key->partexprs); + for (i = 0; i < key->partnatts; i++) + { + Expr *keyCol; + + if (key->partattrs[i] != 0) + { + keyCol = (Expr *) makeVar(rel->relid, + key->partattrs[i], + key->parttypid[i], + key->parttypmod[i], + key->parttypcoll[i], + 0); + } + else + { + if (l == NULL) + elog(ERROR, "wrong number of partition key expressions"); + keyCol = (Expr *) copyObject(lfirst(l)); + l = lnext(l); + } + + rel->partexprs[i] = list_m
Re: [HACKERS] Setting pd_lower in GIN metapage
On 2017/09/14 16:00, Michael Paquier wrote: > On Wed, Sep 13, 2017 at 4:43 PM, Amit Langote > wrote: >> Sure, no problem. > > OK, I took a closer look at all patches, but did not run any manual > tests to test the compression except some stuff with > wal_consistency_checking. Thanks for the review. > + if (opaque->flags & GIN_DELETED) > + mask_page_content(page); > + else if (pagehdr->pd_lower != 0) > + mask_unused_space(page); > [...] > + /* Mask the unused space, provided the page's pd_lower is set. */ > + if (pagehdr->pd_lower != 0) > mask_unused_space(page); > > For the masking functions, shouldn't those check use (pd_lower > > SizeOfPageHeaderData) instead of 0? pd_lower gets set to a non-zero > value on HEAD, so you would apply the masking even if the meta page is > upgraded from an instance that did not enforce the value of pd_lower > later on. Those conditions also definitely need comments. That will be > a good reminder so as why it needs to be kept. Agreed, done. > +* > +* This won't be of any help unless we use option like REGBUF_STANDARD > +* while registering the buffer for metapage as otherwise, it won't try to > +* remove the hole while logging the full page image. > */ > This comment is in the btree code. But you actually add > REGBUF_STANDARD. So I think that this could be just removed. > > * Set pd_lower just past the end of the metadata. This is not essential > -* but it makes the page look compressible to xlog.c. > +* but it makes the page look compressible to xlog.c. See > +* _bt_initmetapage. > This reference could be removed as well as _bt_initmetapage does not > provide any information, the existing comment is wrong without your > patch, and then becomes right with this patch. Amit K's reply may have addressed these comments. > After that I have spotted a couple of places for btree, hash and > SpGist where the updates of pd_lower are not happening. 
Let's keep in > mind that updated meta pages could come from an upgraded version, so > we need to be careful here about all code paths updating meta pages, > and WAL-logging them. > > It seems to me that an update of pd_lower is missing in _bt_getroot(), > just before marking the buffer as dirty I think. And there is a second > in _bt_unlink_halfdead_page(), a third in _bt_insertonpg() and finally > one in _bt_newroot(). Amit K's reply about btree and hash should've resolved any doubts for those index types. About SP-Gist, see the comment below. > For SpGist, I think that there are two missing: spgbuild() and spgGetCache(). spgbuild() calls SpGistInitMetapage() before marking the metapage buffer dirty. The latter already sets pd_lower correctly, so we don't need to do it explicitly in spgbuild() itself. spgGetCache() doesn't write the metapage, only reads it: /* Last, get the lastUsedPages data from the metapage */ metabuffer = ReadBuffer(index, SPGIST_METAPAGE_BLKNO); LockBuffer(metabuffer, BUFFER_LOCK_SHARE); metadata = SpGistPageGetMeta(BufferGetPage(metabuffer)); if (metadata->magicNumber != SPGIST_MAGIC_NUMBER) elog(ERROR, "index \"%s\" is not an SP-GiST index", RelationGetRelationName(index)); cache->lastUsedPages = metadata->lastUsedPages; UnlockReleaseBuffer(metabuffer); So, I think it won't be correct to set pd_lower here, no? > For hash, hashbulkdelete(), _hash_vacuum_one_page(), > _hash_addovflpage(), _hash_freeovflpage() and _hash_doinsert() are > missing the shot, no? We could have a meta page of a hash index > upgraded from PG10. Amit K's reply. :) Updated patch attached, which implements your suggested changes to the masking functions. By the way, as I noted on another unrelated thread, I will not be able to respond to emails from tomorrow until 9/23. Thanks, Amit From 6741334ce5f3261b5e5caffa3914d4e1485fe5d8 Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 23 Jun 2017 11:20:41 +0900 Subject: [PATCH 1/3] Set pd_lower correctly in the GIN metapage. 
Also tell xlog.c to treat the metapage like a standard page, so any hole in it is compressed. --- src/backend/access/gin/ginfast.c | 22 -- src/backend/access/gin/gininsert.c | 4 ++-- src/backend/access/gin/ginutil.c | 19 ++- src/backend/access/gin/ginxlog.c | 25 ++--- 4 files changed, 50 insertions(+), 20 deletions(-) diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c index 59e435465a..d96529cf72 100644 --- a/src/backend/access/gin/ginfast.c +++ b/src/backend/access/gin/ginfast.c @@ -399,6 +399,15 @@ ginHeapTupleFastInsert(GinState *ginstate, Gi
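Michael's point about upgraded meta pages can be sketched with a toy guard (names and constants are mine, not PostgreSQL's; this is only a model of the check being discussed): a metapage physically carried over from a release that never set pd_lower may still read pd_lower == 0, so the masking code should only treat the page as having a standard layout when pd_lower points past the page header, not merely when it is non-zero:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_HEADER_SIZE 24     /* stand-in for SizeOfPageHeaderData */

/*
 * Only mask when pd_lower points past the page header, i.e. the page
 * plausibly has a standard layout.  "pd_lower != 0" would be the wrong
 * test: on HEAD, pd_lower is always at least the page header size, so
 * that test would also fire on pages upgraded from older releases whose
 * pd_lower was never maintained.
 */
bool
toy_can_mask_unused_space(uint16_t pd_lower)
{
    return pd_lower > TOY_PAGE_HEADER_SIZE;
}

/* Zero the presumed-unused region, but only when the guard passes. */
void
toy_mask_unused_space(unsigned char *page,
                      uint16_t pd_lower, uint16_t pd_upper)
{
    if (toy_can_mask_unused_space(pd_lower) && pd_lower <= pd_upper)
        memset(page + pd_lower, 0, (size_t) (pd_upper - pd_lower));
}
```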
Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT
On 2017/09/15 15:36, Amit Langote wrote: > The fact that > parent is locked after the child and with ShareUpdateExclusiveLock instead > of AccessExclusiveLock, we observe this race condition when SELECTing from > the parent. Oops, I meant "parent table is locked with AccessShareLock instead of AccessExclusiveLock..." Thanks, Amit
Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT
Hi. On 2017/08/28 18:28, Kyotaro HORIGUCHI wrote: > << the following is another topic >> > BTW, in the partitioned table case, the parent is always locked first using an AccessExclusiveLock. There are other considerations in that case such as needing to recreate the partition descriptor upon termination of inheritance (both the DETACH PARTITION and also DROP TABLE child cases). >>> >>> Apart from the degree of concurrency, if we keep parent->children >>> order of locking, such recreation does not seem to be >>> needed. Maybe I'm missing something. >> >> Sorry to have introduced that topic in this thread, but I will try to >> explain anyway why things are the way they are currently: >> >> Once a table is no longer a partition of the parent (detached or dropped), >> we must make sure that the next commands in the transaction don't see it >> as one. That information is currently present in the relcache >> (rd_partdesc), which is used by a few callers, most notably the >> tuple-routing code. Next commands must recreate the entry so that the >> correct thing happens based on the updated information. More precisely, >> we must invalidate the current entry. RelationClearRelation() will either >> delete the entry or rebuild it. If it's being referenced somewhere, it >> will be rebuilt. The place holding the reference may also be looking at >> the content of rd_partdesc, which we don't require them to make a copy of, >> so we must preserve its content while other fields of RelationData are >> being read anew from the catalog. We don't have to preserve it if there >> has been any change (partition added/dropped), but to make such a change >> one would need to take a strong enough lock on the relation (parent). We >> assume here that anyone who wants to reference rd_partdesc takes at least >> AccessShareLock lock on the relation, and anyone who wants to change its >> content must take a lock that will conflict with it, so >> AccessExclusiveLock. 
Note that in all of this, we are only talking about >> one relation, that is the parent, so parent -> child ordering of taking >> locks may be irrelevant. > I think I understand this, anyway DropInherit and DropPartition > is different-but-at-the-same-level operations so surely needs > amendment for drop/detach cases. Is there already a solution? Or > reproducing steps? Sorry, I think I forgot to reply to this. Since you seem to have chosen the other solution (checking that child is still a child), maybe this reply is a bit too late, but anyway. DropInherit or NO INHERIT is seen primarily as changing a child table's (which is the target table of the command) property that it is no longer a child of the parent, so we lock the child table to block concurrent operations from considering it a child of the parent anymore. The fact that parent is locked after the child and with ShareUpdateExclusiveLock instead of AccessExclusiveLock, we observe this race condition when SELECTing from the parent. DropPartition or DETACH PARTITION is seen primarily as changing the parent table's (which is the target table of the command) property that one of the partitions is removed, so we lock the parent. Any concurrent operations that rely on the parent's relcache to get the partition list will wait for the session that is dropping the partition to finish, so that they get the fresh information from the relcache (or more importantly do not end up with information obtained from the relcache going invalid under them without notice). Note that the lock on the partition/child is also present and it plays more or less the same role as it does in the DropInherit case, but due to different order of locking, the reported race condition does not occur between SELECT on partitioned table and DROP/DETACH PARTITION. By the way, I will take a look at your patch when I come back from the vacation. Meanwhile, I noticed that it needs another rebase after 0a480502b092 [1]. 
Thanks, Amit [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0a480502b092
Re: [HACKERS] Optimise default partition scanning while adding new partition
On 2017/09/15 0:59, Robert Haas wrote: > On Thu, Sep 14, 2017 at 4:03 AM, Jeevan Ladhe > wrote: >> Thanks Amit for reviewing. >>> Patch looks fine to me. By the way, why don't we just say "Can we skip >>> scanning part_rel?" in the comment before the newly added call to >>> PartConstraintImpliedByRelConstraint()? We don't need to repeat the >>> explanation of what it does at the every place we call it. >> >> I have changed the comment. >> PFA. > > I'm probably missing something stupid, but why does this happen? > > ALTER TABLE list_parted2 ATTACH PARTITION part_7 FOR VALUES IN (7); > -INFO: partition constraint for table "list_parted2_def" is implied > by existing constraints > ERROR: partition constraint is violated by some row > > Based on the regression test changes made up higher in the file, I'd > expect that line to be replaced by two lines, one for > list_parted2_def_p1 and another for list_parted2_def_p2, because at > this point, list_parted2_def is a partitioned table with those two > children, and they seem to have appropriate constraints. > > list_parted2_def_p1 has > Check constraints: > "check_a" CHECK (a = ANY (ARRAY[21, 22])) > > list_parted2_def_p2 has > Check constraints: > "check_a" CHECK (a = ANY (ARRAY[31, 32])) > > Well, if a is 21 or 22 for the first, or 31 or 32 for the second, then > it's not 7. Or so I would think. ISTM, Jeevan's patch modifies check_default_allows_bound(), which only DefineRelation() calls as things stand today. IOW, it doesn't handle the ATTACH PARTITION case, so you're not seeing the lines for list_parted2_def_p1 and list_parted2_def_p2, because ATExecAttachPartition doesn't yet know how to skip the validation scan for default partition's children. I wonder if we should call check_default_allows_bound() from ATExecAttachPartition(), too, instead of validating updated default partition constraint using ValidatePartitionConstraints()? 
That is, call the latter only to validate the partition constraint of the table being attached and call check_default_allows_bound() to validate the updated default partition constraint. That way, INFO/ERROR messages related to default partition constraint are consistent across the board. I hacked that up in the attached 0001. The patch also modifies the INFO message that check_default_allows_bound() emits when the scan is skipped to make it apparent that a default partition is involved. (Sorry, this should've been a review comment I should've posted before the default partition patch was committed.) Then apply 0002, which is Jeevan's patch. With 0001 in place, Robert's complaint is taken care of. Then comes 0003, which is my patch to teach ValidatePartitionConstraints() to skip child table scans using their existing constraint. Thanks, Amit From 00765f2166c78902d0c92ed9d1a4dea033a50a12 Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 15 Sep 2017 14:30:36 +0900 Subject: [PATCH 1/3] Teach ATExecAttachPartition to use check_default_allows_bound Currently it schedules validation scan of the default partition in the same way as it schedules the scan of the partition being attached. Instead, call check_default_allows_bound() to do the scanning, so the handling is symmetric with DefineRelation(). Also, update the INFO message in check_default_allows_bound(), so that it's clear from the message that the default partition is involved. 
--- src/backend/catalog/partition.c | 8 ++- src/backend/commands/tablecmds.c | 40 ++- src/test/regress/expected/alter_table.out | 12 +- 3 files changed, 20 insertions(+), 40 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 1ab6dba7ae..027b98d850 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -912,15 +912,11 @@ check_default_allows_bound(Relation parent, Relation default_rel, def_part_constraints = get_proposed_default_constraint(new_part_constraints); - /* -* If the existing constraints on the default partition imply that it will -* not contain any row that would belong to the new partition, we can -* avoid scanning the default partition. -*/ + /* Can we skip scanning default_rel? */ if (PartConstraintImpliedByRelConstraint(default_rel, def_part_constraints)) { ereport(INFO, - (errmsg("partition constraint for table \"%s\" is implied by existing constraints", + (errmsg("updated partition constraint for default partition \"%s\" is implied by existing constraints", RelationGetRelationName(default_rel; return; } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 563bcda30c..ed4a2092b3 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/c
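The skip-the-scan reasoning that PartConstraintImpliedByRelConstraint() applies in Robert's example can be modeled with a toy disjointness check (this is my own stand-in for the predicate-proof machinery, not the patch's code; `toy_can_skip_child_scan` is an invented name): a default-partition child constrained by CHECK (a = ANY (allowed)) cannot hold a row belonging to a new list partition accepting exactly the values in `incoming`, as long as the two value sets are disjoint, so its validation scan can be skipped:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true when no value accepted by the child's CHECK constraint
 * overlaps the values of the new partition, i.e. when the existing
 * constraint refutes the updated default-partition constraint's negation
 * and the child need not be scanned.
 */
bool
toy_can_skip_child_scan(const int *allowed, size_t nallowed,
                        const int *incoming, size_t nincoming)
{
    size_t i, j;

    for (i = 0; i < nallowed; i++)
        for (j = 0; j < nincoming; j++)
            if (allowed[i] == incoming[j])
                return false;   /* overlap: the child must be scanned */
    return true;
}
```

In Robert's example, list_parted2_def_p1 allows {21, 22} and list_parted2_def_p2 allows {31, 32}; neither contains 7, so ATTACH PARTITION part_7 FOR VALUES IN (7) can skip scanning both.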
Re: [HACKERS] path toward faster partition pruning
Hi Dilip, Thanks for looking at the patch. On 2017/09/15 13:43, Dilip Kumar wrote: > On Wed, Sep 6, 2017 at 4:08 PM, Amit Langote >> [PATCH 2/5] WIP: planner-side changes for partition-pruning >> >> This patch adds a stub get_partitions_for_keys in partition.c with a >> suitable interface for the caller to pass bounding keys extracted from the >> query and other related information. >> >> Importantly, it implements the logic within the planner to match query's >> scan keys to a parent table's partition key and form the bounding keys >> that will be passed to partition.c to compute the list of partitions that >> satisfy those bounds. > > + Node *leftop = get_leftop(clause); > + > + if (IsA(leftop, RelabelType)) > + leftop = (Node *) ((RelabelType *) leftop)->arg; > + > + if (equal(leftop, partkey)) > > It appears that this patch always assume that clause will be of form > "var op const", but it can also be "const op var" > > That's the reason in below example where in both the queries condition > is same it can only prune in the first case but not in the second. > > postgres=# explain select * from t where t.a < 2; >QUERY PLAN > > Append (cost=0.00..2.24 rows=1 width=8) >-> Seq Scan on t1 (cost=0.00..2.24 rows=1 width=8) > Filter: (a < 2) > (3 rows) > > postgres=# explain select * from t where 2>t.a; >QUERY PLAN > > Append (cost=0.00..4.49 rows=2 width=8) >-> Seq Scan on t1 (cost=0.00..2.24 rows=1 width=8) > Filter: (2 > a) >-> Seq Scan on t2 (cost=0.00..2.25 rows=1 width=8) > Filter: (2 > a) > (5 rows) Yeah, there are a bunch of smarts still missing in that patch as it is. Thanks, Amit
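The commutation step that the rebased patches add for Dilip's "2 > t.a" case can be sketched as follows (a toy model of the idea, not the patch's actual code; the `toy_` names and the string-based operator table are invented for illustration): flip a "const op var" clause into "var op const" using the operator's commutator, and give up on clauses whose operator has none, so that downstream code can always assume the "partkey op constant" shape:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct ToyOpClause
{
    bool        var_on_left;    /* already in "var op const" form? */
    const char *op;
    int         constval;
} ToyOpClause;

/* Tiny commutator table for the btree comparison operators. */
const char *
toy_commutator(const char *op)
{
    if (strcmp(op, "<") == 0)  return ">";
    if (strcmp(op, ">") == 0)  return "<";
    if (strcmp(op, "<=") == 0) return ">=";
    if (strcmp(op, ">=") == 0) return "<=";
    if (strcmp(op, "=") == 0)  return "=";
    return NULL;                /* no commutator: give up on this clause */
}

/* Returns false if the clause cannot be put into "var op const" form. */
bool
toy_commute_to_var_op_const(ToyOpClause *clause)
{
    const char *commuted;

    if (clause->var_on_left)
        return true;            /* nothing to do */
    commuted = toy_commutator(clause->op);
    if (commuted == NULL)
        return false;
    clause->op = commuted;
    clause->var_on_left = true;
    return true;
}
```

So "2 > a" becomes "a < 2" and can be matched against the partition key just like the first query's clause.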
Re: [HACKERS] path toward faster partition pruning
On 2017/09/15 10:55, David Rowley wrote: > On 21 August 2017 at 18:37, Amit Langote > wrote: >> I've been working on implementing a way to perform plan-time >> partition-pruning that is hopefully faster than the current method of >> using constraint exclusion to prune each of the potentially many >> partitions one-by-one. It's not fully cooked yet though. > > I'm interested in seeing improvements in this area, so I've put my > name down to review this. Great, thanks! I will post rebased patches later today, although I think the overall design of the patch on the planner side of things is not quite there yet. Of course, your and others' feedback is greatly welcome. Also, I must inform all of those who're looking at this thread that I won't be able to respond to emails from tomorrow (9/16, Sat) until 9/23, Sat, due to some personal business. Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/15 4:43, Robert Haas wrote: > On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat > wrote: >> I have few changes to multi-level expansion patch as per discussion in >> earlier mails > > OK, I have committed > 0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic > changes. > > Phew, getting that sorted out has been an astonishing amount of work. Yeah, thanks to both of you. Now on to other complicated stuff. :) Regards, Amit
Re: [HACKERS] expanding inheritance in partition bound order
On 2017/09/15 1:37, Robert Haas wrote: > On Thu, Sep 14, 2017 at 7:56 AM, Amit Khandekar > wrote: >> On 14 September 2017 at 06:43, Amit Langote >>> langote_amit...@lab.ntt.co.jp> wrote: >>> Attached updated patch. >> >> @@ -1222,151 +1209,130 @@ PartitionDispatch * >> RelationGetPartitionDispatchInfo(Relation rel, >> int >> *num_parted, List **leaf_part_oids) >> { >> + List *pdlist; >> PartitionDispatchData **pd; >> >> + get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids); >> >> Above, pdlist is passed uninitialized. And then inside >> get_partition_dispatch_recurse(), it is used here : >> *pds = lappend(*pds, pd); >> >> >> >> pg_indent says more alignments needed. For e.g. gettext_noop() call >> below needs to be aligned: >> pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent), >> tupdesc, >> gettext_noop("could not convert row type")); >> >> >> >> Other than that, the patch looks good to me. I verified that the leaf >> oids are ordered exactly in the order of the UPDATE subplans output. > > Committed with fixes for those issues and a few other cosmetic changes. Thanks Amit for the review and Robert for committing. Regards, Amit
Re: [HACKERS] Re: [COMMITTERS] pgsql: Use MINVALUE/MAXVALUE instead of UNBOUNDED for range partition b
On 2017/09/14 16:53, Dean Rasheed wrote: > On 13 September 2017 at 10:05, Amit Langote > wrote: >> Coincidentally, I just wrote the patch for canonicalizing stored values, >> instead of erroring out. Please see attached if that's what you were >> thinking too. >> > > Looks reasonable to me, if we decide to go this way. > > One minor review comment -- Thanks for the review. > it isn't really necessary to have the > separate new boolean local variables as well as the datum kind > variables. Just tracking the datum kind is sufficient and slightly > simpler. That would also address a slight worry I have that your > coding might result in a compiler warning about the kind variables > being used uninitialised with some less intelligent compilers, not > smart enough to work out that it can't actually happen. Got rid of the boolean variables in the updated patch. Thanks, Amit From 61bfaeb46623572f5adba0b624d00dfecb6ed495 Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 13 Sep 2017 17:51:38 +0900 Subject: [PATCH] Canonicalize catalog representation of range partition bounds Since the system effectively ignores values of the bound for partition key columns following the first unbounded key, it's pointless to accurately remember them in the system catalog. Instead store minvalue or maxvalue for all such columns, matching what the first unbounded key is. This makes \d output of range partitions look more canonical. 
--- doc/src/sgml/ref/create_table.sgml | 6 +- src/backend/parser/parse_utilcmd.c | 30 -- src/test/regress/expected/create_table.out | 20 +--- src/test/regress/expected/insert.out | 8 src/test/regress/sql/create_table.sql | 6 ++ 5 files changed, 60 insertions(+), 10 deletions(-) diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml index 824253de40..7d2eb30722 100644 --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml @@ -328,7 +328,11 @@ FROM ( { numeric_literal | MAXVALUE in a partition bound are ignored; so the bound (10, MINVALUE, 0) is equivalent to (10, MINVALUE, 10) and (10, MINVALUE, MINVALUE) - and (10, MINVALUE, MAXVALUE). + and (10, MINVALUE, MAXVALUE). Morever, the system will + store (10, MINVALUE, MINVALUE) into the system catalog if + the user specifies (10, MINVALUE, 0), and + (10, MAXVALUE, MAXVALUE) if the user specifies + (10, MAXVALUE, 0). diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c index 655da02c10..ca983105cc 100644 --- a/src/backend/parser/parse_utilcmd.c +++ b/src/backend/parser/parse_utilcmd.c @@ -3381,6 +3381,8 @@ transformPartitionBound(ParseState *pstate, Relation parent, *cell2; int i, j; + PartitionRangeDatumKind lower_kind = PARTITION_RANGE_DATUM_VALUE, + upper_kind = PARTITION_RANGE_DATUM_VALUE; if (spec->strategy != PARTITION_STRATEGY_RANGE) ereport(ERROR, @@ -3426,7 +3428,17 @@ transformPartitionBound(ParseState *pstate, Relation parent, coltype = get_partition_col_typid(key, i); coltypmod = get_partition_col_typmod(key, i); - if (ldatum->value) + /* +* If we've found the first infinite bound, use the same for +* subsequent columns. 
+*/ + if (lower_kind != PARTITION_RANGE_DATUM_VALUE) + { + ldatum = copyObject(ldatum);/* don't scribble on input */ + ldatum->kind = lower_kind; + ldatum->value = NULL; + } + else if (ldatum->value) { con = castNode(A_Const, ldatum->value); value = transformPartitionBoundValue(pstate, con, @@ -3439,8 +3451,20 @@ transformPartitionBound(ParseState *pstate, Relation parent, ldatum = copyObject(ldatum);/* don't scribble on input */ ldatum->value = (Node *) value; } + else + lower_kind = ldatum->kind; - if (rdatum->value) + /* +* If we've found the first infinite bound, use the same for +* subsequent columns. +*/ +
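The effect of the canonicalization patch above can be described independently of the parser code: once the first MINVALUE or MAXVALUE appears in a range bound, every later column is stored with that same kind, since the system ignores the trailing values anyway. A toy Python model of that rule, purely illustrative and not PostgreSQL code:

```python
def canonicalize_range_bound(bound):
    """Replace every column after the first MINVALUE/MAXVALUE with the
    same kind, mimicking the canonical catalog representation."""
    out, kind = [], None
    for col in bound:
        if kind is None and col in ('MINVALUE', 'MAXVALUE'):
            kind = col
        out.append(kind if kind is not None else col)
    return out

print(canonicalize_range_bound([10, 'MINVALUE', 0]))
# [10, 'MINVALUE', 'MINVALUE']
print(canonicalize_range_bound([10, 'MAXVALUE', 0]))
# [10, 'MAXVALUE', 'MAXVALUE']
```

This is exactly the equivalence the documentation hunk spells out: (10, MINVALUE, 0) is stored as (10, MINVALUE, MINVALUE).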
Re: [HACKERS] moving some partitioning code to executor
Repeating links for better accessibility: On 2017/09/14 16:13, Amit Langote wrote: > [1] https://www.postgresql.org/message-id/CA%2BTgmoafr%3DhUrM%3Dcbx-k%3DBDHOF2OfXaw95HQSNAK4mHBwmSjtw%40mail.gmail.com > [2] https://www.postgresql.org/message-id/7fe0007b-7ad1-a593-df11-ad05364ebce4%40lab.ntt.co.jp Thanks, Amit
[HACKERS] moving some partitioning code to executor
Hi. It seems to me that some of the code in partition.c is better placed somewhere under the executor directory. There was even a suggestion recently [1] to introduce a execPartition.c to house some code around tuple-routing. IMO, catalog/partition.c should present an interface for handling operations on a *single* partitioned table and avoid pretending to support any operations on the whole partition tree. For example, the PartitionDispatch structure embeds some knowledge about the partition tree it's part of, which is useful when used for tuple-routing, because of the way it works now (lock and form ResultRelInfos of *all* leaf partitions before the first input row is processed). So, let's move that structure, along with the code that creates and manipulates the same, out of partition.c/h and to execPartition.c/h. Attached patch attempts to do that. While doing the same, I didn't move *all* of get_partition_for_tuple() out to execPartition.c, instead modified its signature as shown below: -extern int get_partition_for_tuple(PartitionDispatch *pd, -TupleTableSlot *slot, -EState *estate, -PartitionDispatchData **failed_at, -TupleTableSlot **failed_slot); +extern int get_partition_for_tuple(Relation relation, Datum *values, +bool *isnull); That way, we keep the core partition bound comparison logic inside partition.c and move rest of the stuff to its caller ExecFindPartition(), which includes navigating the enveloping PartitionDispatch's. Thoughts? 
PS: 0001 of the attached is the patch from [2] which is here to be applied on HEAD before applying the main patch (0002) itself Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmoafr%3DhUrM%3Dcbx-k%3DBDHOF2OfXa w95HQSNAK4mHBwmSjtw%40mail.gmail.com [2] https://www.postgresql.org/message-id/7fe0007b-7ad1-a593-df11-ad05364ebce4%40l ab.ntt.co.jp From cb16fb6c89f60a8cf6b8f713ca9c831fe3824b2d Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 8 Sep 2017 17:35:10 +0900 Subject: [PATCH 1/2] Make RelationGetPartitionDispatch expansion order depth-first This is so as it matches what the planner is doing with partitioning inheritance expansion. Matching with planner order helps because it helps ease matching the executor's per-partition objects with planner-created per-partition nodes. --- src/backend/catalog/partition.c | 242 1 file changed, 99 insertions(+), 143 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 73eff17202..ddb46a80cb 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -147,6 +147,8 @@ static int32 partition_bound_cmp(PartitionKey key, static int partition_bound_bsearch(PartitionKey key, PartitionBoundInfo boundinfo, void *probe, bool probe_is_bound, bool *is_equal); +static void get_partition_dispatch_recurse(Relation rel, Relation parent, + List **pds, List **leaf_part_oids); /* * RelationBuildPartitionDesc @@ -1192,21 +1194,6 @@ get_partition_qual_relid(Oid relid) } /* - * Append OIDs of rel's partitions to the list 'partoids' and for each OID, - * append pointer rel to the list 'parents'. 
- */ -#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \ - do\ - {\ - int i;\ - for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\ - {\ - (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\ - (parents) = lappend((parents), (rel));\ - }\ - } while(0) - -/* * RelationGetPartitionDispatchInfo * Returns information necessary to route tuples down a partition tree * @@ -1222,151 +1209,120 @@ PartitionDispatch * RelationGetPartitionDispatchInfo(Relation rel, int *num_parted, List **leaf_part_oids) { + List *pdlist; PartitionDispatchData **pd; - List *all_parts = NIL, - *all_parents = NIL, - *parted_rels, - *parted_rel_parents; - ListCell *lc1, - *lc2; - int i, - k, - offset; + ListCell *lc; + int i; - /* -* We rely on the relcache to traverse the partition tree to build both -* the leaf partition OIDs list and the array of PartitionDispatch objects -* for the partitioned tables in the tree. That means every partitioned -* table in the tree must be locked, which is fine since we require the -* caller to lock all the
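The division of labor proposed in this email, a per-table routing primitive kept in partition.c plus an executor-side ExecFindPartition() that walks the enveloping PartitionDispatch entries, can be modeled roughly as follows. This is an illustrative Python sketch under invented structures; it does not mirror the actual C types or the real get_partition_for_tuple() signature:

```python
def route_list(bounds):
    """Per-table routing primitive (the partition.c side): given the
    accepted value sets of a table's partitions, return a function
    mapping a key value to a partition index, or -1 if none accepts it."""
    def route(value):
        for i, accepted in enumerate(bounds):
            if value in accepted:
                return i
        return -1
    return route

def exec_find_partition(node, value):
    """Executor side: walk the dispatch tree level by level until the
    routing lands on a leaf partition's name."""
    route, children = node
    while True:
        i = route(value)
        if i < 0:
            raise LookupError("no partition found for value %r" % value)
        child = children[i]
        if isinstance(child, str):
            return child               # leaf: routing is done
        route, children = child        # descend into sub-partitioned table

# Invented two-level tree: p12 accepts {1,2}; a sub-partitioned table
# splits {3,4} into p3 and p4.
tree = (route_list([{1, 2}, {3, 4}]),
        ["p12",
         (route_list([{3}, {4}]), ["p3", "p4"])])
print(exec_find_partition(tree, 4))    # p4
```

The point of the split is visible here: the bound-comparison logic (route_list) knows nothing about the tree, while the tree walk (exec_find_partition) knows nothing about bound comparison.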
Re: [HACKERS] Optimise default partition scanning while adding new partition
Hi Jeevan, On 2017/09/12 18:22, Jeevan Ladhe wrote: > Commit 6f6b99d1335be8ea1b74581fc489a97b109dd08a introduced default > partitioning support. This commit added a new function > check_default_allows_bound(), > which checks if there exists a row in the default partition that would > belong to > the new partition being added. If it finds one, it throws an error. Before > taking > the decision to scan the default partition, this function checks if there > are > existing constraints on default partition that would imply the new partition > constraints, if yes it skips scanning the default partition, otherwise it > scans the > default partition and its children(if any). But, while doing so the current > code > misses the fact that there can be constraints on the child of default > partition > such that they would imply the constraints of the new partition being added, > and hence individual child scan can also be skipped. > Attached is the patch which does this. > > This is previously discussed in default partitioning thread[1], and decision > was made that we can take this a separate patch rather than as a part of the > default partitioning support. Patch looks fine to me. By the way, why don't we just say "Can we skip scanning part_rel?" in the comment before the newly added call to PartConstraintImpliedByRelConstraint()? We don't need to repeat the explanation of what it does at every place we call it. > Amit Langote has a similar patch[2] for scanning the children of a > partitioned > table which is being attached as partition of another partitioned table. I just posted my rebased patch there [1]. Thanks, Amit [1] https://www.postgresql.org/message-id/a83a0899-19f5-594c-9aac-3ba0f16989a1%40lab.ntt.co.jp
Re: [HACKERS] A bug in mapping attributes in ATExecAttachPartition()
On 2017/08/07 11:05, Amit Langote wrote: > By the way, bulk of 0004 is refactoring which it seems is what Jeevan's > default partition patch set also includes as one of the patches [1]. It > got a decent amount review from Ashutosh. I broke it down into a separate > patch, so that the patch to add the new feature is its own tiny patch. > > I also spotted a couple of comments referring to attachRel that we just > recently renamed. > > So, attached are: > > 0001: s/attachRel/attachrel/g > 0002: Refactoring to introduce a PartConstraintImpliedByRelConstraint > 0003: Add the feature to skip the scan of individual leaf partitions Since Jeevan Ladhe mentioned this patch [1] earlier this week, sending the rebased patches here for consideration. Actually there are only 2 patches now, because 0002 above is rendered unnecessary by ecfe59e50fb [2]. Thanks, Amit [1] https://www.postgresql.org/message-id/CAOgcT0MWwG8WBw8frFMtRYHAgDD=tpt6u7wcso_l2k0kypm...@mail.gmail.com [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ecfe59e50fb From 55e1e14a821de541c2d24c152c193bf57eb91d43 Mon Sep 17 00:00:00 2001 From: amit Date: Mon, 7 Aug 2017 10:45:39 +0900 Subject: [PATCH 1/2] Typo: attachRel is now attachrel --- src/backend/commands/tablecmds.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 96354bdee5..563bcda30c 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13779,7 +13779,7 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd) * Prevent circularity by seeing if rel is a partition of attachrel. (In * particular, this disallows making a rel a partition of itself.) * -* We do that by checking if rel is a member of the list of attachRel's +* We do that by checking if rel is a member of the list of attachrel's * partitions provided the latter is partitioned at all. 
We want to avoid * having to construct this list again, so we request the strongest lock * on all partitions. We need the strongest lock, because we may decide -- 2.11.0 From a7d0d781bd9e3730f90d902d0e09abf79962f872 Mon Sep 17 00:00:00 2001 From: amit Date: Mon, 7 Aug 2017 10:51:47 +0900 Subject: [PATCH 2/2] Teach ATExecAttachPartition to skip validation in more cases In cases where the table being attached is a partitioned table and the table itself does not have constraints that would allow validation on the whole table to be skipped, we can still skip the validations of individual partitions if they each happen to have the requisite constraints. Per an idea of Robert Haas', with code refactoring suggestions from Ashutosh Bapat. --- src/backend/commands/tablecmds.c | 10 ++ src/test/regress/expected/alter_table.out | 13 + src/test/regress/sql/alter_table.sql | 10 ++ 3 files changed, 33 insertions(+) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 563bcda30c..901eea7fe2 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13678,6 +13678,16 @@ ValidatePartitionConstraints(List **wqueue, Relation scanrel, /* There can never be a whole-row reference here */ if (found_whole_row) elog(ERROR, "unexpected whole-row reference found in partition key"); + + /* Check if we can we skip scanning this part_rel. */ + if (PartConstraintImpliedByRelConstraint(part_rel, my_partconstr)) + { + ereport(INFO, + (errmsg("partition constraint for table \"%s\" is implied by existing constraints", + RelationGetRelationName(part_rel; + heap_close(part_rel, NoLock); + continue; + } } /* Grab a work queue entry. 
*/ diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out index 0478a8ac60..e3415837b6 100644 --- a/src/test/regress/expected/alter_table.out +++ b/src/test/regress/expected/alter_table.out @@ -3464,6 +3464,19 @@ ERROR: updated partition constraint for default partition would be violated by -- should be ok after deleting the bad row DELETE FROM part5_def_p1 WHERE b = 'y'; ALTER TABLE part_5 ATTACH PARTITION part5_p1 FOR VALUES IN ('y'); +-- If the partitioned table being attached does not have a constraint that +-- would allow validation scan to be skipped, but an individual partition +-- does, then the partition's validation scan is skipped.
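The skip-validation logic in the patch above rests on constraint implication: if a partition's existing constraints already imply the new partition constraint, scanning its rows can prove nothing new. The real check, PartConstraintImpliedByRelConstraint(), works on arbitrary quals via the planner's predicate-proving machinery; a toy single-column interval version (invented here for illustration) conveys the idea:

```python
def interval_implies(existing, required):
    """True if every value satisfying `existing` also satisfies
    `required`.  Both are closed-open [low, high) intervals on a
    single partition key column."""
    elo, ehi = existing
    rlo, rhi = required
    return rlo <= elo and ehi <= rhi

# New partition accepts a in [0, 100); a child is already constrained
# to [10, 20), so every row it holds satisfies the new constraint and
# its validation scan can be skipped.
print(interval_implies((10, 20), (0, 100)))    # True: skip the scan
print(interval_implies((10, 200), (0, 100)))   # False: must scan
```

When the check succeeds, the server reports it, as in the expected output above: "partition constraint for table ... is implied by existing constraints".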
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/14 7:43, Robert Haas wrote: > On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat > wrote: >> I debugged what happens in case of query "select 1 from t1 union all >> select 2 from t1;" with the current HEAD (without multi-level >> expansion patch attached). It doesn't set partitioned_rels in Append >> path that gets converted into Append plan. Remember t1 is a >> multi-level partitioned table here with t1p1 as its immediate >> partition and t1p1p1 as partition of t1p1. So, the >> set_append_rel_pathlist() recurses once as shown in the following >> stack trace. > > Nice debugging. +1. > I spent some time today looking at this and I think > it's a bug in v10, and specifically in add_paths_to_append_rel(), > which only sets partitioned_rels correctly when the appendrel is a > partitioned rel, and not when it's a subquery RTE with one or more > partitioned queries beneath it. > > Attached are two patches either one of which will fix it. First, I > wrote mechanical-partrels-fix.patch, which just mechanically > propagates partitioned_rels lists from accumulated subpaths into the > list used to construct the parent (Merge)AppendPath. I wasn't entire > happy with that, because it ends up building multiple partitioned_rels > lists for the same RelOptInfo. That seems silly, but there's no > principled way to avoid it; avoiding it amounts to hoping that all the > paths for the same relation carry the same partitioned_rels list, > which is uncomfortable. > > So then I wrote pcinfo-for-subquery.patch. That patch notices when an > RTE_SUBQUERY appendrel is processed and accumulates the > partitioned_rels of its immediate children; in case there can be > multiple nested levels of subqueries before we get down to the actual > partitioned rel, it also adds a PartitionedChildRelInfo for the > subquery RTE, so that there's no need to walk the whole tree to build > the partitioned_rels list at higher levels, just the immediate > children. I find this fix a lot more satisfying. 
It adds less code > and does no extra work in the common case. I very much like pcinfo-for-subquery.patch, although I'm not sure if we need to create PartitionedChildRelInfo for the sub-query parent RTE as the patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL subqueries are flattened way before we get to add_paths_to_append_rel(); if it could not be flattened, there wouldn't be a call to add_paths_to_append_rel() in the first place, because no AppendRelInfos would be generated. See what happens when is_simple_union_all_recurse() returns false to flatten_simple_union_all() -- no AppendRelInfos will be generated and added to root->append_rel_list in that case. IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries like we're setting out to build for multi-level partitioned tables. So, as things stand today, there can at most be one recursive call of add_paths_to_append_rel() for a sub-query parent RTE, that is, if its child sub-queries contain partitioned tables, but not more. The other patch (multi-level expansion of partitioned tables) will change that, but even then we won't need sub-query's own PartitionedChildRelInfo. > Notice that the choice of fix we adopt has consequences for your > 0001-Multi-level-partitioned-table-expansion.patch -- with > mechanical-partrels-fix.patch, that patch could either associated all > partitioned_rels with the top-parent or it could work level by level > and everything would get properly assembled later. But with > pcinfo-for-subquery.patch, we need everything associated with the > top-parent. That doesn't seem like a problem to me, but it's > something to note. I think it's fine. With 0001-Multi-level-partitioned-table-expansion.patch, get_partitioned_child_rels() will get called even for non-root partitioned tables, for which it won't find a valid pcinfo. I think that patch must also change its callers to stop Asserting that a valid pcinfo is returned. 
Spotted a typo in pcinfo-for-subquery.patch: + * A plain relation will alread have Thanks, Amit
Re: [HACKERS] expanding inheritance in partition bound order
On 2017/09/14 1:42, Robert Haas wrote: > On Wed, Sep 13, 2017 at 6:02 AM, Amit Langote > wrote: >> It seems to me we don't really need the first patch all that much. That >> is, let's keep PartitionDispatchData the way it is for now, since we don't >> really have any need for it beside tuple-routing (EIBO as committed didn't >> need it for one). So, let's forget about "decoupling >> RelationGetPartitionDispatchInfo() from the executor" thing for now and >> move on. >> >> So, attached is just the patch to make RelationGetPartitionDispatchInfo() >> traverse the partition tree in depth-first manner to be applied on HEAD. > > I like this patch. Not only does it improve the behavior, but it's > actually less code than we have now, and in my opinion, the new code > is easier to understand, too. > > A few suggestions: Thanks for the review. > - I think get_partition_dispatch_recurse() get a check_stack_depth() > call just like expand_partitioned_rtentry() did, and for the same > reasons: if the catalog contents are corrupted so that we have an > infinite loop in the partitioning hierarchy, we want to error out, not > crash. Ah, missed that. Done. > - I think we should add a comment explaining that we're careful to do > this in the same order as expand_partitioned_rtentry() so that callers > can assume that the N'th entry in the leaf_part_oids array will also > be the N'th child of an Append node. Done. Since the Append/ModifyTable may skip some leaf partitions due to pruning, added a note about that too. > + * For every partitioned table other than root, we must store a > > other than the root > > + * partitioned table. The value multiplied back by -1 is returned as the > > multiplied by -1, not multiplied back by -1 > > + * tables in the tree, using which, search is continued further down the > + * partition tree. > > Period after "in the tree". Then continue: "This value is used to > continue the search in the next level of the partition tree." Fixed. 
Attached updated patch. Thanks, Amit From c2599d52267cc362e059452efe69ddd09b94c083 Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 8 Sep 2017 17:35:10 +0900 Subject: [PATCH] Make RelationGetPartitionDispatch expansion order depth-first This is so as it matches what the planner is doing with partitioning inheritance expansion. Matching with planner order helps because it helps ease matching the executor's per-partition objects with planner-created per-partition nodes. --- src/backend/catalog/partition.c | 252 +--- 1 file changed, 109 insertions(+), 143 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 73eff17202..36f52ddb98 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -147,6 +147,8 @@ static int32 partition_bound_cmp(PartitionKey key, static int partition_bound_bsearch(PartitionKey key, PartitionBoundInfo boundinfo, void *probe, bool probe_is_bound, bool *is_equal); +static void get_partition_dispatch_recurse(Relation rel, Relation parent, + List **pds, List **leaf_part_oids); /* * RelationBuildPartitionDesc @@ -1192,21 +1194,6 @@ get_partition_qual_relid(Oid relid) } /* - * Append OIDs of rel's partitions to the list 'partoids' and for each OID, - * append pointer rel to the list 'parents'. 
- */ -#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \ - do\ - {\ - int i;\ - for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\ - {\ - (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\ - (parents) = lappend((parents), (rel));\ - }\ - } while(0) - -/* * RelationGetPartitionDispatchInfo * Returns information necessary to route tuples down a partition tree * @@ -1222,151 +1209,130 @@ PartitionDispatch * RelationGetPartitionDispatchInfo(Relation rel, int *num_parted, List **leaf_part_oids) { + List *pdlist; PartitionDispatchData **pd; - List *all_parts = NIL, - *all_parents = NIL, - *parted_rels, - *parted_rel_parents; - ListCell *lc1, - *lc2; - int i, - k, - offset; + ListCell *lc; + int i; - /*
Re: [HACKERS] expanding inheritance in partition bound order
On 2017/09/11 18:56, Amit Langote wrote: > Attached updated patch does it that way for both partitioned table indexes > and leaf partition indexes. Thanks for pointing it out. It seems to me we don't really need the first patch all that much. That is, let's keep PartitionDispatchData the way it is for now, since we don't really have any need for it beside tuple-routing (EIBO as committed didn't need it for one). So, let's forget about "decoupling RelationGetPartitionDispatchInfo() from the executor" thing for now and move on. So, attached is just the patch to make RelationGetPartitionDispatchInfo() traverse the partition tree in depth-first manner to be applied on HEAD. Thoughts? Thanks, Amit From 1e99c776eda30c29fdb0e48570d6b3acd6b9a05d Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 8 Sep 2017 17:35:10 +0900 Subject: [PATCH] Make RelationGetPartitionDispatch expansion order depth-first This is so as it matches what the planner is doing with partitioning inheritance expansion. Matching with planner order helps because it helps ease matching the executor's per-partition objects with planner-created per-partition nodes. --- src/backend/catalog/partition.c | 242 1 file changed, 99 insertions(+), 143 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 73eff17202..ddb46a80cb 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -147,6 +147,8 @@ static int32 partition_bound_cmp(PartitionKey key, static int partition_bound_bsearch(PartitionKey key, PartitionBoundInfo boundinfo, void *probe, bool probe_is_bound, bool *is_equal); +static void get_partition_dispatch_recurse(Relation rel, Relation parent, + List **pds, List **leaf_part_oids); /* * RelationBuildPartitionDesc @@ -1192,21 +1194,6 @@ get_partition_qual_relid(Oid relid) } /* - * Append OIDs of rel's partitions to the list 'partoids' and for each OID, - * append pointer rel to the list 'parents'. 
- */ -#define APPEND_REL_PARTITION_OIDS(rel, partoids, parents) \ - do\ - {\ - int i;\ - for (i = 0; i < (rel)->rd_partdesc->nparts; i++)\ - {\ - (partoids) = lappend_oid((partoids), (rel)->rd_partdesc->oids[i]);\ - (parents) = lappend((parents), (rel));\ - }\ - } while(0) - -/* * RelationGetPartitionDispatchInfo * Returns information necessary to route tuples down a partition tree * @@ -1222,151 +1209,120 @@ PartitionDispatch * RelationGetPartitionDispatchInfo(Relation rel, int *num_parted, List **leaf_part_oids) { + List *pdlist; PartitionDispatchData **pd; - List *all_parts = NIL, - *all_parents = NIL, - *parted_rels, - *parted_rel_parents; - ListCell *lc1, - *lc2; - int i, - k, - offset; + ListCell *lc; + int i; - /* -* We rely on the relcache to traverse the partition tree to build both -* the leaf partition OIDs list and the array of PartitionDispatch objects -* for the partitioned tables in the tree. That means every partitioned -* table in the tree must be locked, which is fine since we require the -* caller to lock all the partitions anyway. -* -* For every partitioned table in the tree, starting with the root -* partitioned table, add its relcache entry to parted_rels, while also -* queuing its partitions (in the order in which they appear in the -* partition descriptor) to be looked at later in the same loop. This is -* a bit tricky but works because the foreach() macro doesn't fetch the -* next list element until the bottom of the loop. 
-*/ - *num_parted = 1; - parted_rels = list_make1(rel); - /* Root partitioned table has no parent, so NULL for parent */ - parted_rel_parents = list_make1(NULL); - APPEND_REL_PARTITION_OIDS(rel, all_parts, all_parents); - forboth(lc1, all_parts, lc2, all_parents) + Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE); + + *num_parted = 0; + *leaf_part_oids = NIL; + + get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids); + *num_parted = list_length(pdlist); + pd = (PartitionDispatchData **) palloc(*num_parted * +
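The behavioral change in this patch is purely about ordering: the old RelationGetPartitionDispatchInfo() expanded the partition tree breadth-first (queuing all partitions of one level before descending), while get_partition_dispatch_recurse() yields leaf OIDs depth-first, matching expand_partitioned_rtentry() in the planner. A small Python model of the two orders, using an invented tree shape for illustration:

```python
from collections import deque

# Invented partition tree: 'root' and 'p1' are partitioned tables,
# everything else is a leaf partition.
TREE = {
    'root': ['p1', 'leaf1'],
    'p1':   ['leaf2', 'leaf3'],
}

def leaves_breadth_first(rel):
    """Old order: queue partitioned children, expanding level by level."""
    order, queue = [], deque([rel])
    while queue:
        for child in TREE.get(queue.popleft(), []):
            if child in TREE:
                queue.append(child)    # partitioned: expand later
            else:
                order.append(child)    # leaf partition
    return order

def leaves_depth_first(rel):
    """New order: fully expand each partitioned child before moving on,
    as the planner's inheritance expansion does."""
    order = []
    for child in TREE.get(rel, []):
        if child in TREE:
            order.extend(leaves_depth_first(child))
        else:
            order.append(child)
    return order

print(leaves_breadth_first('root'))    # ['leaf1', 'leaf2', 'leaf3']
print(leaves_depth_first('root'))      # ['leaf2', 'leaf3', 'leaf1']
```

Because the two orders differ whenever the tree has more than one level, matching the planner's depth-first order is what lets the executor line up the N'th leaf OID with the N'th child of an Append node.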
Re: [HACKERS] Re: [COMMITTERS] pgsql: Use MINVALUE/MAXVALUE instead of UNBOUNDED for range partition b
Hi Dean, On 2017/09/13 17:51, Dean Rasheed wrote: > Robert Haas writes: >> On Tue, Sep 12, 2017 at 9:58 AM, Alvaro Herrera >> wrote: >>> Did anything happen on this, or did we just forget it completely? >> >> I forgot it. :-( >> >> I really think we should fix this. > > Ah, sorry. This was for me to follow up, and I dropped the ball. > > Here's a patch restoring the original error checks (actually not the > same coding as used in previous versions of the patch, because that > would have allowed a MINVALUE after a MAXVALUE and vice versa). > > A drawback to doing this is that we lose compatibility with syntaxes > supported by other databases, which was part of the reason for > choosing the terms MINVALUE and MAXVALUE in the first place. > > So thinking about this afresh, my preference would actually be to just > canonicalise the values stored rather than erroring out. > > Thoughts? Coincidentally, I just wrote the patch for canonicalizing stored values, instead of erroring out. Please see attached if that's what you were thinking too. Thanks, Amit From e5fc04c915cb98b0c56b234372ece4b9b1043033 Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 13 Sep 2017 17:51:38 +0900 Subject: [PATCH] Canonicalize catalog representation of range partition bounds Since the system effectively ignores values of the bound for partition key columns following the first unbounded key, it's pointless to accurately remember them in the system catalog. Instead store minvalue or maxvalue for all such columns, matching what the first unbounded key is. This makes \d output of range partitions look more canonical. 
--- doc/src/sgml/ref/create_table.sgml | 6 +- src/backend/parser/parse_utilcmd.c | 30 -- src/test/regress/expected/create_table.out | 6 +++--- src/test/regress/expected/insert.out | 8 4 files changed, 40 insertions(+), 10 deletions(-) diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml index 824253de40..7d2eb30722 100644 --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml @@ -328,7 +328,11 @@ FROM ( { numeric_literal | MAXVALUE in a partition bound are ignored; so the bound (10, MINVALUE, 0) is equivalent to (10, MINVALUE, 10) and (10, MINVALUE, MINVALUE) - and (10, MINVALUE, MAXVALUE). + and (10, MINVALUE, MAXVALUE). Morever, the system will + store (10, MINVALUE, MINVALUE) into the system catalog if + the user specifies (10, MINVALUE, 0), and + (10, MAXVALUE, MAXVALUE) if the user specifies + (10, MAXVALUE, 0). diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c index 655da02c10..9086ac90ef 100644 --- a/src/backend/parser/parse_utilcmd.c +++ b/src/backend/parser/parse_utilcmd.c @@ -3381,6 +3381,10 @@ transformPartitionBound(ParseState *pstate, Relation parent, *cell2; int i, j; + PartitionRangeDatumKind first_lower_unbounded_kind, + first_upper_unbounded_kind; + boollower_unbounded_found = false, + upper_unbounded_found = false; if (spec->strategy != PARTITION_STRATEGY_RANGE) ereport(ERROR, @@ -3426,7 +3430,13 @@ transformPartitionBound(ParseState *pstate, Relation parent, coltype = get_partition_col_typid(key, i); coltypmod = get_partition_col_typmod(key, i); - if (ldatum->value) + if (lower_unbounded_found) + { + ldatum = copyObject(ldatum); + ldatum->kind = first_lower_unbounded_kind; + ldatum->value = NULL; + } + else if (ldatum->value) { con = castNode(A_Const, ldatum->value); value = transformPartitionBoundValue(pstate, con, @@ -3439,8 +3449,19 @@ transformPartitionBound(ParseState *pstate, Relation parent, ldatum = copyObject(ldatum);/* don't scribble on input 
*/ ldatum->value = (Node *) value; } + else + { + lower_unbounded_found = true; + first_lower_unbounded_kind = ldatum->kind; + } - if (rdatum->value) + if (upper_unbounded_found) + { + rdatum = copyObject(rdatum); + rdatum->kind = first_upper_
Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...
On 2017/09/13 16:59, Ashutosh Bapat wrote: > On Wed, Sep 13, 2017 at 1:27 PM, Amit Langote > wrote: >> On 2017/09/13 16:42, Ashutosh Bapat wrote: >>> On Wed, Sep 13, 2017 at 7:49 AM, Amit Langote wrote: >>>> In the attached updated patch, I created separate .source files in >>>> src/test/regress/input and output directories called fdw_handler.source >>>> and put the test_fdw_handler function definition there. When I had >>>> originally thought of it back when I wrote the patch, it seemed to be an >>>> overkill, because we're just normally defining a single C function there >>>> to be used in the newly added foreign_data tests. In any case, we need to >>>> go the .source file way, because that's the only way to refer to paths to >>>> .so library when defining C language functions. >>> >>> It still looks like an overkill to add a new file to define a dummy >>> FDW handler. Why do we need to define a handler as a C function? Can't >>> we define handler as a SQL function. If we could do that we could add >>> the function definition in foreign_data.sql itself. >> >> I guess that's because the last time I tried to define the handler as a >> SQL function, I couldn't: >> >> create function test_fdw_handler() returns fdw_handler as '' language sql; >> ERROR: SQL functions cannot return type fdw_handler >> >> fdw_handler is a pseudo-type, which neither SQL nor plpgsql function can >> return. >> >> Am I missing something? > > Ok. May be then create_function_1.sql is the right place. Just add it > to the section of passing tests and annotate that it's testing > creating an FDW handler. Sorry for jumping back and forth. Alright, done. Thanks for reviewing. 
Regards, Amit From d0c28965b892ac76d83987e9bf35e8ab8fd62e11 Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 10 May 2017 10:37:42 +0900 Subject: [PATCH] Add some FDW HANDLER DDL tests --- src/test/regress/expected/foreign_data.out | 28 +++- src/test/regress/input/create_function_1.source | 6 + src/test/regress/output/create_function_1.source | 5 + src/test/regress/regress.c | 7 ++ src/test/regress/sql/foreign_data.sql| 13 +++ 5 files changed, 53 insertions(+), 6 deletions(-) diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out index c6e558b07f..331f7a911f 100644 --- a/src/test/regress/expected/foreign_data.out +++ b/src/test/regress/expected/foreign_data.out @@ -89,6 +89,14 @@ CREATE FOREIGN DATA WRAPPER foo VALIDATOR postgresql_fdw_validator; postgresql | regress_foreign_data_user | - | postgresql_fdw_validator | | | (3 rows) +-- HANDLER related checks +CREATE FUNCTION invalid_fdw_handler() RETURNS int LANGUAGE SQL AS 'SELECT 1;'; +CREATE FOREIGN DATA WRAPPER test_fdw HANDLER invalid_fdw_handler; -- ERROR +ERROR: function invalid_fdw_handler must return type fdw_handler +CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler HANDLER invalid_fdw_handler; -- ERROR +ERROR: conflicting or redundant options +CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler; +DROP FOREIGN DATA WRAPPER test_fdw; -- ALTER FOREIGN DATA WRAPPER ALTER FOREIGN DATA WRAPPER foo; -- ERROR ERROR: syntax error at or near ";" @@ -188,18 +196,26 @@ ALTER FOREIGN DATA WRAPPER foo RENAME TO foo1; (3 rows) ALTER FOREIGN DATA WRAPPER foo1 RENAME TO foo; +-- HANDLER related checks +ALTER FOREIGN DATA WRAPPER foo HANDLER invalid_fdw_handler; -- ERROR +ERROR: function invalid_fdw_handler must return type fdw_handler +ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler HANDLER anything; -- ERROR +ERROR: conflicting or redundant options +ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler; +WARNING: changing the foreign-data wrapper 
handler can change behavior of existing foreign tables +DROP FUNCTION invalid_fdw_handler(); -- DROP FOREIGN DATA WRAPPER DROP FOREIGN DATA WRAPPER nonexistent; -- ERROR ERROR: foreign-data wrapper "nonexistent" does not exist DROP FOREIGN DATA WRAPPER IF EXISTS nonexistent; NOTICE: foreign-data wrapper "nonexistent" does not exist, skipping \dew+ -List of foreign-data wrappers -Name| Owner | Handler |Validator | Access privileges | FDW options | Description -+---+-+--+---+--+- - dummy | regress_foreign
Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...
On 2017/09/13 16:42, Ashutosh Bapat wrote: > On Wed, Sep 13, 2017 at 7:49 AM, Amit Langote wrote: >> In the attached updated patch, I created separate .source files in >> src/test/regress/input and output directories called fdw_handler.source >> and put the test_fdw_handler function definition there. When I had >> originally thought of it back when I wrote the patch, it seemed to be an >> overkill, because we're just normally defining a single C function there >> to be used in the newly added foreign_data tests. In any case, we need to >> go the .source file way, because that's the only way to refer to paths to >> .so library when defining C language functions. > > It still looks like an overkill to add a new file to define a dummy > FDW handler. Why do we need to define a handler as a C function? Can't > we define handler as a SQL function. If we could do that we could add > the function definition in foreign_data.sql itself. I guess that's because the last time I tried to define the handler as a SQL function, I couldn't: create function test_fdw_handler() returns fdw_handler as '' language sql; ERROR: SQL functions cannot return type fdw_handler fdw_handler is a pseudo-type, which neither SQL nor plpgsql function can return. Am I missing something? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Setting pd_lower in GIN metapage
On 2017/09/13 16:20, Michael Paquier wrote: > On Wed, Sep 13, 2017 at 2:48 PM, Amit Langote > wrote: >> I updated the patches so that the metapage's pd_lower is set to the >> correct value just before *every* point where we are about to insert a >> full page image of the metapage into WAL. That's in addition to doing the >> same in various metapage init routines, which the original patch did >> already anyway. I guess this now ensures that wal_consistency_checking >> masking of these metapages as standard layout pages always works, even for >> pre-v11 indexes that were upgraded. > > Please note that I do have plans to look at all the patches proposed > on this thread for all the indexes next. No report for today though as > those deal with many code paths so it requires some attention. I think > I'll group the review for all index AMs into the same email if you > don't mind, each patch deals with its own thing in its own > src/backend/access/ path. Sure, no problem. Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/13 16:21, Ashutosh Bapat wrote: > On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas wrote: >> locks taken from the executor are worthless because plancache.c will >> always do the job for us. I don't know of a case where we execute a >> saved plan without going through the plan cache, but that doesn't mean >> that there isn't one or that there couldn't be one in the future. >> It's not the job of these partitioning patches to whack around the way >> we do locking in general -- they should preserve the existing behavior >> as much as possible. If we want to get rid of the locking in the >> executor altogether, that's a separate discussion where, I have a >> feeling, there will prove to be better reasons for the way things are >> than we are right now supposing. >> > > I agree that it's not the job of these patches to change the locking > or even get rid of partitioned_rels. In order to continue returning > partitioned_rels in Append paths esp. in the case of queries involving > set operations and partitioned table e.g "select 1 from t1 union all > select 2 from t1;" in which t1 is multi-level partitioned table, we > need a fix in add_paths_to_append_rels(). The fix provided in [1] is > correct but we will need a longer explanation of why we have to > involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation > is complicated. If we get rid of partitioned_rels, we don't need to > fix that code in add_paths_to_append_rel(). Yeah, let's get on with setting partitioned_rels in AppendPath correctly in this patch. Ashutosh's suggested approach seems fine, although it needlessly requires to scan root->pcinfo_list. But it shouldn't be longer than the number of partitioned tables in the query, so maybe that's fine too. At least, it doesn't require us to add code to add_paths_to_append_rel() that can be pretty hard to wrap one's head around. 
That said, we might someday need to look carefully at some of the things that Robert mentioned, especially around the order of locks taken by AcquireExecutorLocks() in light of the EIBO patch getting committed. Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/12 19:56, Ashutosh Bapat wrote: > I think the code here expects the original parent_rte and not the one > we set around line 1169. > > This isn't a bug right now, since both the parent_rtes have the same > content. But I am not sure if that will remain to be so. Here's a patch > to fix the thinko. Instead of the new bool is_parent_partitioned, why not move the code that sets partitioned_rels to the block where you're now setting is_parent_partitioned? Also, since we know this isn't a bug at the moment but will turn into one once we have step-wise expansion, why not include this fix in that patch itself? Thanks, Amit
Re: [HACKERS] Setting pd_lower in GIN metapage
On 2017/09/13 13:05, Tom Lane wrote: > Amit Langote writes: >> On 2017/09/12 23:27, Amit Kapila wrote: >>> I think one point which might be missed is that the patch needs to >>> modify pd_lower for all usages of metapage, not only when it is first >>> time initialized. > >> Maybe I'm missing something, but isn't the metadata size fixed and hence >> pd_lower won't change once it's initialized? Maybe, it's not true for all >> index types? > > No, the point is that you might be dealing with an index recently > pg_upgraded from v10 or before, which does not have the correct > value for pd_lower on that page. This has to be coped with. Ah, got it. Thanks for the explanation. I updated the patches so that the metapage's pd_lower is set to the correct value just before *every* point where we are about to insert a full page image of the metapage into WAL. That's in addition to doing the same in various metapage init routines, which the original patch did already anyway. I guess this now ensures that wal_consistency_checking masking of these metapages as standard layout pages always works, even for pre-v11 indexes that were upgraded. Also, we now pass the metapage buffer as containing a page of standard layout to XLogRegisterBuffer(), so that any hole in it is compressed when actually writing to WAL. Thanks, Amit From 607b4ab062652e7ffc0f95338c9265b09be18b56 Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 23 Jun 2017 11:20:41 +0900 Subject: [PATCH 1/3] Set pd_lower correctly in the GIN metapage. Also tell xlog.c to treat the metapage like a standard page, so any hole in it is compressed. 
--- src/backend/access/gin/ginfast.c | 22 -- src/backend/access/gin/gininsert.c | 4 ++-- src/backend/access/gin/ginutil.c | 19 ++- src/backend/access/gin/ginxlog.c | 24 +--- 4 files changed, 49 insertions(+), 20 deletions(-) diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c index 59e435465a..d96529cf72 100644 --- a/src/backend/access/gin/ginfast.c +++ b/src/backend/access/gin/ginfast.c @@ -399,6 +399,15 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector) /* * Write metabuffer, make xlog entry */ + + /* +* Set pd_lower just past the end of the metadata. This is not essential +* but it makes the page look compressible to xlog.c, because we pass the +* buffer containing this page to XLogRegisterBuffer() as a page with +* standard layout. +*/ + ((PageHeader) metapage)->pd_lower = + ((char *) metadata + sizeof(GinMetaPageData)) - (char *) metapage; MarkBufferDirty(metabuffer); if (needWal) @@ -407,7 +416,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector) memcpy(&data.metadata, metadata, sizeof(GinMetaPageData)); - XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT); + XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT | REGBUF_STANDARD); XLogRegisterData((char *) &data, sizeof(ginxlogUpdateMeta)); recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_UPDATE_META_PAGE); @@ -572,6 +581,14 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead, metadata->nPendingHeapTuples = 0; } + /* +* Set pd_lower just past the end of the metadata. This is not +* essential but it makes the page look compressible to xlog.c, +* because we pass the buffer containing this page to +* XLogRegisterBuffer() as page with standard layout. 
+*/ + ((PageHeader) metapage)->pd_lower = + ((char *) metadata + sizeof(GinMetaPageData)) - (char *) metapage; MarkBufferDirty(metabuffer); for (i = 0; i < data.ndeleted; i++) @@ -586,7 +603,8 @@ shiftList(Relation index, Buffer metabuffer, BlockNumber newHead, XLogRecPtr recptr; XLogBeginInsert(); - XLogRegisterBuffer(0, metabuffer, REGBUF_WILL_INIT); + XLogRegisterBuffer(0, metabuffer, + REGBUF_WILL_INIT | REGBUF_STANDARD); for (i = 0; i < data.ndeleted; i++) XLogRegisterBuffer(i + 1, buffers[i], REGBUF_WILL_INIT); diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c index 5378011f50..c9aa4ee147 100644 --- a/src/backend/access/gin/gininsert.c +++ b/src/backend/access/gin/gininsert.c @@ -348,7 +348,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
Re: [HACKERS] Race between SELECT and ALTER TABLE NO INHERIT
On 2017/09/13 12:05, Simon Riggs wrote: > On 26 June 2017 at 10:16, Amit Langote wrote: >> BTW, in the partitioned table case, the parent is always locked first >> using an AccessExclusiveLock. There are other considerations in that case >> such as needing to recreate the partition descriptor upon termination of >> inheritance (both the DETACH PARTITION and also DROP TABLE child cases). > > Is this requirement documented or in comments anywhere? Yes. See the last sentence in the description of PARTITION OF clause in CREATE TABLE: https://www.postgresql.org/docs/devel/static/sql-createtable.html#sql-createtable-partition And, the 4th point in the list of differences between declarative partitioning and inheritance: https://www.postgresql.org/docs/devel/static/ddl-partitioning.html#ddl-partitioning-implementation-inheritance Thanks, Amit
Re: [HACKERS] Setting pd_lower in GIN metapage
Thanks for the review. On 2017/09/12 23:27, Amit Kapila wrote: > On Tue, Sep 12, 2017 at 3:51 PM, Amit Langote wrote: >> I updated the patches for GIN, BRIN, and SP-GiST to include the following >> changes: >> >> 1. Pass REGBUF_STANDARD flag when registering the metapage buffer >> > > I have looked into brin patch and it seems you have not considered all > usages of meta page. The structure BrinRevmap also contains a > reference to meta page buffer and when that is modified (ex. in > revmap_physical_extend), then also I think you need to consider using > REGBUF_STANDARD flag. Fixed. >> Did I miss something from the discussion? >> > > I think one point which might be missed is that the patch needs to > modify pd_lower for all usages of metapage, not only when it is first > time initialized. Maybe I'm missing something, but isn't the metadata size fixed and hence pd_lower won't change once it's initialized? Maybe, it's not true for all index types? Thanks, Amit From c73183871632b368e2662ff5a35bfb6b3eaaade1 Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 23 Jun 2017 11:20:41 +0900 Subject: [PATCH 1/3] Set pd_lower correctly in the GIN metapage. Also tell xlog.c to treat the metapage like a standard page, so any hole in it is compressed.
--- src/backend/access/gin/gininsert.c | 4 ++-- src/backend/access/gin/ginutil.c | 9 + src/backend/access/gin/ginxlog.c | 24 +--- 3 files changed, 20 insertions(+), 17 deletions(-) diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c index 5378011f50..c9aa4ee147 100644 --- a/src/backend/access/gin/gininsert.c +++ b/src/backend/access/gin/gininsert.c @@ -348,7 +348,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo) Pagepage; XLogBeginInsert(); - XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT); + XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT | REGBUF_STANDARD); XLogRegisterBuffer(1, RootBuffer, REGBUF_WILL_INIT); recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_CREATE_INDEX); @@ -447,7 +447,7 @@ ginbuildempty(Relation index) START_CRIT_SECTION(); GinInitMetabuffer(MetaBuffer); MarkBufferDirty(MetaBuffer); - log_newpage_buffer(MetaBuffer, false); + log_newpage_buffer(MetaBuffer, true); GinInitBuffer(RootBuffer, GIN_LEAF); MarkBufferDirty(RootBuffer); log_newpage_buffer(RootBuffer, false); diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c index 136ea27718..e926649dd2 100644 --- a/src/backend/access/gin/ginutil.c +++ b/src/backend/access/gin/ginutil.c @@ -374,6 +374,15 @@ GinInitMetabuffer(Buffer b) metadata->nDataPages = 0; metadata->nEntries = 0; metadata->ginVersion = GIN_CURRENT_VERSION; + + /* +* Set pd_lower just past the end of the metadata. This is not essential +* but it makes the page look compressible to xlog.c, as long as the +* buffer containing the page is passed to XLogRegisterBuffer() as a +* REGBUF_STANDARD page. 
+*/ + ((PageHeader) page)->pd_lower = + ((char *) metadata + sizeof(GinMetaPageData)) - (char *) page; } /* diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c index 7ba04e324f..f5c11b2d9a 100644 --- a/src/backend/access/gin/ginxlog.c +++ b/src/backend/access/gin/ginxlog.c @@ -514,7 +514,7 @@ ginRedoUpdateMetapage(XLogReaderState *record) Assert(BufferGetBlockNumber(metabuffer) == GIN_METAPAGE_BLKNO); metapage = BufferGetPage(metabuffer); - GinInitPage(metapage, GIN_META, BufferGetPageSize(metabuffer)); + GinInitMetabuffer(metabuffer); memcpy(GinPageGetMeta(metapage), &data->metadata, sizeof(GinMetaPageData)); PageSetLSN(metapage, lsn); MarkBufferDirty(metabuffer); @@ -656,7 +656,7 @@ ginRedoDeleteListPages(XLogReaderState *record) Assert(BufferGetBlockNumber(metabuffer) == GIN_METAPAGE_BLKNO); metapage = BufferGetPage(metabuffer); - GinInitPage(metapage, GIN_META, BufferGetPageSize(metabuffer)); + GinInitMetabuffer(metabuffer); memcpy(GinPageGetMeta(metapage), &data->metadata, sizeof(GinMetaPageData)); PageSetLSN(metapage, lsn); @@ -768,6 +768,7 @@ void gin_mask(char *pagedata, BlockNumber blkno) { Pagepage = (Page) pagedata; + PageHeader pagehdr = (PageHeader) page; GinPageOpaque opaque; mask_page_lsn(page); @@ -776,18 +777,11 @@ gin_mask(char *pagedata, BlockNumber blkno) mask_page_hint_bits(page); /* -* GIN metapage doesn't use pd_lower/pd_upper. Other page types do. Hence, -* we need to apply
Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...
On 2017/09/12 20:17, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 2:27 PM, Amit Langote > wrote: >> Thanks Ashutosh for taking a look at this. >> >> On 2017/09/05 21:16, Ashutosh Bapat wrote: >>> The patch needs a rebase. >> >> Attached rebased patch. > > Thanks for rebased patch. Thanks for the review. > We could annotate each ERROR with an explanation as to why that's an > error, but then this file doesn't do that for other commands, so may > be the patch is just fine. Agreed. Note that this patch is just about adding the tests, not modifying foreigncmds.c to change error handling around HANDLER functions. > Also, I am wondering whether we should create the new handler function > in foreign.c similar to postgresql_fdw_validator(). The prologue has a > caution > > 606 * Caution: this function is deprecated, and is now meant only for testing > 607 * purposes, because the list of options it knows about doesn't > necessarily > 608 * square with those known to whichever libpq instance you might be using. > 609 * Inquire of libpq itself, instead. > > So, may be we don't want to add it there. But adding the handler > function in create_function_1 doesn't seem good. If that's the correct > place, then at least it should be moved before " -- Things that > shouldn't work:"; it doesn't belong to functions that don't work. In the attached updated patch, I created separate .source files in src/test/regress/input and output directories called fdw_handler.source and put the test_fdw_handler function definition there. When I had originally thought of it back when I wrote the patch, it seemed to be an overkill, because we're just normally defining a single C function there to be used in the newly added foreign_data tests. In any case, we need to go the .source file way, because that's the only way to refer to paths to .so library when defining C language functions. 
Thanks, Amit From 510987531bfdf22df0bc8eef27f232e580d415b1 Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 10 May 2017 10:37:42 +0900 Subject: [PATCH] Add some FDW HANDLER DDL tests --- src/test/regress/expected/foreign_data.out | 28 ++-- src/test/regress/input/fdw_handler.source | 5 + src/test/regress/output/fdw_handler.source | 5 + src/test/regress/parallel_schedule | 2 +- src/test/regress/regress.c | 7 +++ src/test/regress/serial_schedule | 1 + src/test/regress/sql/.gitignore| 1 + src/test/regress/sql/foreign_data.sql | 13 + 8 files changed, 55 insertions(+), 7 deletions(-) create mode 100644 src/test/regress/input/fdw_handler.source create mode 100644 src/test/regress/output/fdw_handler.source diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out index c6e558b07f..331f7a911f 100644 --- a/src/test/regress/expected/foreign_data.out +++ b/src/test/regress/expected/foreign_data.out @@ -89,6 +89,14 @@ CREATE FOREIGN DATA WRAPPER foo VALIDATOR postgresql_fdw_validator; postgresql | regress_foreign_data_user | - | postgresql_fdw_validator | | | (3 rows) +-- HANDLER related checks +CREATE FUNCTION invalid_fdw_handler() RETURNS int LANGUAGE SQL AS 'SELECT 1;'; +CREATE FOREIGN DATA WRAPPER test_fdw HANDLER invalid_fdw_handler; -- ERROR +ERROR: function invalid_fdw_handler must return type fdw_handler +CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler HANDLER invalid_fdw_handler; -- ERROR +ERROR: conflicting or redundant options +CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler; +DROP FOREIGN DATA WRAPPER test_fdw; -- ALTER FOREIGN DATA WRAPPER ALTER FOREIGN DATA WRAPPER foo; -- ERROR ERROR: syntax error at or near ";" @@ -188,18 +196,26 @@ ALTER FOREIGN DATA WRAPPER foo RENAME TO foo1; (3 rows) ALTER FOREIGN DATA WRAPPER foo1 RENAME TO foo; +-- HANDLER related checks +ALTER FOREIGN DATA WRAPPER foo HANDLER invalid_fdw_handler; -- ERROR +ERROR: function invalid_fdw_handler must return type fdw_handler 
+ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler HANDLER anything; -- ERROR +ERROR: conflicting or redundant options +ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler; +WARNING: changing the foreign-data wrapper handler can change behavior of existing foreign tables +DROP FUNCTION invalid_fdw_handler(); -- DROP FOREIGN DATA WRAPPER DROP FOREIGN DATA WRAPPER nonexistent; -- ERROR ERROR: foreign-data wrapper "nonexistent" does not exist DROP FOREIGN DATA WRAPPER IF EXISTS nonexistent; NOTICE: foreign-data wrapper "nonexistent" does not exist, skipping \dew+ -
Re: [HACKERS] dropping partitioned tables without CASCADE
On 2017/09/06 19:14, Amit Langote wrote: > On 2017/09/06 18:46, Rushabh Lathia wrote: >> Okay, I have marked this as ready for committer. > > Thanks Ashutosh and Rushabh for rebasing and improving the patch. Looks > good to me too. Patch needed to be rebased after the default partitions patch went in, so done. Per build status on http://commitfest.cputube.org :) Thanks, Amit From 0ac21ff604b5dccf818f9d69c945ff845d1771bf Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 13 Sep 2017 09:56:34 +0900 Subject: [PATCH] Some enhancments for \d+ output of partitioned tables --- src/bin/psql/describe.c| 32 -- src/test/regress/expected/create_table.out | 13 +++- src/test/regress/expected/foreign_data.out | 3 +++ src/test/regress/expected/insert.out | 17 src/test/regress/sql/create_table.sql | 2 +- src/test/regress/sql/insert.sql| 4 6 files changed, 59 insertions(+), 12 deletions(-) diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c index d22ec68431..855e6870e9 100644 --- a/src/bin/psql/describe.c +++ b/src/bin/psql/describe.c @@ -2831,7 +2831,7 @@ describeOneTableDetails(const char *schemaname, /* print child tables (with additional info if partitions) */ if (pset.sversion >= 10) printfPQExpBuffer(&buf, - "SELECT c.oid::pg_catalog.regclass, pg_catalog.pg_get_expr(c.relpartbound, c.oid)" + "SELECT c.oid::pg_catalog.regclass, pg_get_expr(c.relpartbound, c.oid), c.relkind" " FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i" " WHERE c.oid=i.inhrelid AND i.inhparent = '%s'" " ORDER BY c.oid::pg_catalog.regclass::pg_catalog.text;", oid); @@ -2854,7 +2854,18 @@ describeOneTableDetails(const char *schemaname, else tuples = PQntuples(result); - if (!verbose) + /* +* For a partitioned table with no partitions, always print the number +* of partitions as zero, even when verbose output is expected. +* Otherwise, we will not print "Partitions" section for a partitioned +* table without any partitions. 
+*/ + if (tableinfo.relkind == RELKIND_PARTITIONED_TABLE && tuples == 0) + { + printfPQExpBuffer(&buf, _("Number of partitions: %d"), tuples); + printTableAddFooter(&cont, buf.data); + } + else if (!verbose) { /* print the number of child tables, if any */ if (tuples > 0) @@ -2886,12 +2897,21 @@ describeOneTableDetails(const char *schemaname, } else { + char *partitioned_note; + + if (*(PQgetvalue(result, i, 2)) == RELKIND_PARTITIONED_TABLE) + partitioned_note = " is partitioned"; + else + partitioned_note = ""; + if (i == 0) - printfPQExpBuffer(&buf, "%s: %s %s", - ct, PQgetvalue(result, i, 0), PQgetvalue(result, i, 1)); + printfPQExpBuffer(&buf, "%s: %s %s%s", + ct, PQgetvalue(result, i, 0), PQgetvalue(result, i, 1), + partitioned_note); else - printfPQExpBuffer(&buf, "%*s %s %s", - ctw, "", PQgetvalue(result, i, 0), PQgetvalue(result, i, 1)); + printfPQExpBuffer(&buf, "%*s %s %s%s", + ctw, "", PQgetvalue(result, i, 0), PQgetvalue(result, i, 1), + partitioned_note);
Re: [HACKERS] Setting pd_lower in GIN metapage
On 2017/09/11 18:13, Michael Paquier wrote: > On Mon, Sep 11, 2017 at 5:40 PM, Amit Langote wrote: >> On 2017/09/10 15:22, Michael Paquier wrote: >>> Coordinating efforts here would be nice. If you, Amit K, are taking >>> care of a patch for btree and hash, would you, Amit L, write the part >>> for GIN, BRIN and SpGist? This needs a careful lookup as many code >>> paths need a lookup so it may take time. Please note that I don't mind >>> double-checking this part if you don't have enough room to do so. >> >> Sorry, I didn't have time today to carefully go through the recent >> discussion on this thread (starting with Tom's email wherein he said he >> set the status of the patch to Waiting on Author). I will try tomorrow. > > Thanks for the update! Once you get to this point, please let me know > if you would like to work on a more complete patch for brin, gin and > spgist. If you don't have enough room, I am fine to produce something. I updated the patches for GIN, BRIN, and SP-GiST to include the following changes: 1. Pass REGBUF_STANDARD flag when registering the metapage buffer 2. Pass true for page_std argument of log_newpage() or log_newpage_buffer() when using it to log the metapage 3. Update comment near where pd_lower is set in the respective metapages, similar to what Amit K did in his patches for btree and hash metapages. Did I miss something from the discussion? Thanks, Amit From bf291247e8f49d4558ab70fa335bd5a7bdda88b4 Mon Sep 17 00:00:00 2001 From: amit Date: Fri, 23 Jun 2017 11:20:41 +0900 Subject: [PATCH 1/3] Set pd_lower correctly in the GIN metapage. Also tell xlog.c to treat the metapage like a standard page, so any hole in it is compressed.
--- src/backend/access/gin/gininsert.c | 4 ++-- src/backend/access/gin/ginutil.c | 9 + src/backend/access/gin/ginxlog.c | 24 +--- 3 files changed, 20 insertions(+), 17 deletions(-) diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c index 5378011f50..c9aa4ee147 100644 --- a/src/backend/access/gin/gininsert.c +++ b/src/backend/access/gin/gininsert.c @@ -348,7 +348,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo) Pagepage; XLogBeginInsert(); - XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT); + XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT | REGBUF_STANDARD); XLogRegisterBuffer(1, RootBuffer, REGBUF_WILL_INIT); recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_CREATE_INDEX); @@ -447,7 +447,7 @@ ginbuildempty(Relation index) START_CRIT_SECTION(); GinInitMetabuffer(MetaBuffer); MarkBufferDirty(MetaBuffer); - log_newpage_buffer(MetaBuffer, false); + log_newpage_buffer(MetaBuffer, true); GinInitBuffer(RootBuffer, GIN_LEAF); MarkBufferDirty(RootBuffer); log_newpage_buffer(RootBuffer, false); diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c index 136ea27718..e926649dd2 100644 --- a/src/backend/access/gin/ginutil.c +++ b/src/backend/access/gin/ginutil.c @@ -374,6 +374,15 @@ GinInitMetabuffer(Buffer b) metadata->nDataPages = 0; metadata->nEntries = 0; metadata->ginVersion = GIN_CURRENT_VERSION; + + /* +* Set pd_lower just past the end of the metadata. This is not essential +* but it makes the page look compressible to xlog.c, as long as the +* buffer containing the page is passed to XLogRegisterBuffer() as a +* REGBUF_STANDARD page. 
+*/ + ((PageHeader) page)->pd_lower = + ((char *) metadata + sizeof(GinMetaPageData)) - (char *) page; } /* diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c index 7ba04e324f..f5c11b2d9a 100644 --- a/src/backend/access/gin/ginxlog.c +++ b/src/backend/access/gin/ginxlog.c @@ -514,7 +514,7 @@ ginRedoUpdateMetapage(XLogReaderState *record) Assert(BufferGetBlockNumber(metabuffer) == GIN_METAPAGE_BLKNO); metapage = BufferGetPage(metabuffer); - GinInitPage(metapage, GIN_META, BufferGetPageSize(metabuffer)); + GinInitMetabuffer(metabuffer); memcpy(GinPageGetMeta(metapage), &data->metadata, sizeof(GinMetaPageData)); PageSetLSN(metapage, lsn); MarkBufferDirty(metabuffer); @@ -656,7 +656,7 @@ ginRedoDeleteListPages(XLogReaderState *record) Assert(BufferGetBlockNumber(metabuffer) == GIN_METAPAGE_BLKNO); metapage = BufferGetPage(metabuffer); - GinInitPage(metapage, GIN_META, BufferGetPageSize(metabuffer)); + GinInitMetabuffer(metabuffer); memcpy(GinPageGetMeta(metapage), &data->metadata, sizeof(GinMetaPageData)); PageSetLSN(metapage, lsn); @@ -768,6 +768,7 @@ void
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/12 18:49, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote > wrote: >> >> That said, I noticed that we might need to be careful about what the value >> of the root parent's PlanRowMark's allMarkType field gets set to. We need >> to make sure that it reflects markType of all partitions in the tree, >> including those that are not root parent's direct children. Is that true >> with the proposed implementation? > > Yes. We include child's allMarkTypes into parent's allMarkTypes. So, > top parent's PlanRowMarks should have all descendant's allMarkTypes, > which is not happening in the patch right now. There are two ways to > fix that. > > 1. Pass top parent's PlanRowMark all the way down to the leaf > partitions, so that current expand_single_inheritance_child() collects > allMarkTypes of all children correctly. But this way, PlanRowMarks of > intermediate parent does not reflect allMarkTypes of its children, > only top root records that. > 2. Pass immediate parent's PlanRowMark to > expand_single_inheritance_child(), so that it records allMarkTypes of > its children. In expand_partitioned_rtentry() have following sequence > > expand_single_inheritance_child(root, parentrte, parentRTindex, > parentrel, parentrc, childrel, > appinfos, &childrte, &childRTindex, > &childrc); > > /* If this child is itself partitioned, recurse */ > if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) >{ > expand_partitioned_rtentry(root, childrte, childRTindex, >childrel, childrc, lockmode, appinfos, >partitioned_child_rels); > > /* Include child's rowmark type in parent's allMarkTypes */ > parentrc->allMarkTypes |= childrc->allMarkTypes; >} > so that we push allMarkTypes up the hierarchy. > > I like the second way, since every intermediate parent records > allMarkTypes of its descendants. I like the second way, too. 
Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/12 17:53, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote: >> So, we can remove partitioned_rels from (Merge)AppendPath and >> (Merge)Append nodes and remove ExecLockNonLeafAppendTables(). > > Don't we need partitioned_rels from Append paths to be transferred to > ModifyTable node or we have a different way of calculating > nonleafResultRelations? No, we don't transfer partitioned_rels from the Append path to the ModifyTable node. inheritance_planner(), which builds the ModifyTable path for UPDATE/DELETE on a partitioned table, fetches partitioned_rels from root->pcinfo_list itself and passes it to create_modifytable_path(). No Append path is involved in that case. PlannedStmt.nonleafResultRelations is built by concatenating the partitioned_rels lists of all ModifyTable nodes appearing in the query. It does not depend on Append's or AppendPath's partitioned_rels. Thanks, Amit
Re: [HACKERS] no test coverage for ALTER FOREIGN DATA WRAPPER name HANDLER ...
Thanks Ashutosh for taking a look at this. On 2017/09/05 21:16, Ashutosh Bapat wrote: > The patch needs a rebase. Attached rebased patch. Thanks, Amit From 75bcb6ebcc00193cb0251fced994f03d205e9e7f Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 10 May 2017 10:37:42 +0900 Subject: [PATCH] Add some FDW HANDLER DDL tests --- src/test/regress/expected/foreign_data.out | 28 +++- src/test/regress/input/create_function_1.source | 3 +++ src/test/regress/output/create_function_1.source | 2 ++ src/test/regress/regress.c | 7 ++ src/test/regress/sql/foreign_data.sql| 13 +++ 5 files changed, 47 insertions(+), 6 deletions(-) diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out index c6e558b07f..331f7a911f 100644 --- a/src/test/regress/expected/foreign_data.out +++ b/src/test/regress/expected/foreign_data.out @@ -89,6 +89,14 @@ CREATE FOREIGN DATA WRAPPER foo VALIDATOR postgresql_fdw_validator; postgresql | regress_foreign_data_user | - | postgresql_fdw_validator | | | (3 rows) +-- HANDLER related checks +CREATE FUNCTION invalid_fdw_handler() RETURNS int LANGUAGE SQL AS 'SELECT 1;'; +CREATE FOREIGN DATA WRAPPER test_fdw HANDLER invalid_fdw_handler; -- ERROR +ERROR: function invalid_fdw_handler must return type fdw_handler +CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler HANDLER invalid_fdw_handler; -- ERROR +ERROR: conflicting or redundant options +CREATE FOREIGN DATA WRAPPER test_fdw HANDLER test_fdw_handler; +DROP FOREIGN DATA WRAPPER test_fdw; -- ALTER FOREIGN DATA WRAPPER ALTER FOREIGN DATA WRAPPER foo; -- ERROR ERROR: syntax error at or near ";" @@ -188,18 +196,26 @@ ALTER FOREIGN DATA WRAPPER foo RENAME TO foo1; (3 rows) ALTER FOREIGN DATA WRAPPER foo1 RENAME TO foo; +-- HANDLER related checks +ALTER FOREIGN DATA WRAPPER foo HANDLER invalid_fdw_handler; -- ERROR +ERROR: function invalid_fdw_handler must return type fdw_handler +ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler HANDLER anything; -- ERROR 
+ERROR: conflicting or redundant options +ALTER FOREIGN DATA WRAPPER foo HANDLER test_fdw_handler; +WARNING: changing the foreign-data wrapper handler can change behavior of existing foreign tables +DROP FUNCTION invalid_fdw_handler(); -- DROP FOREIGN DATA WRAPPER DROP FOREIGN DATA WRAPPER nonexistent; -- ERROR ERROR: foreign-data wrapper "nonexistent" does not exist DROP FOREIGN DATA WRAPPER IF EXISTS nonexistent; NOTICE: foreign-data wrapper "nonexistent" does not exist, skipping \dew+ -List of foreign-data wrappers -Name| Owner | Handler |Validator | Access privileges | FDW options | Description -+---+-+--+---+--+- - dummy | regress_foreign_data_user | - | -| | | useless - foo| regress_test_role_super | - | -| | (b '3', c '4', a '2', d '5') | - postgresql | regress_foreign_data_user | - | postgresql_fdw_validator | | | + List of foreign-data wrappers +Name| Owner | Handler |Validator | Access privileges | FDW options | Description ++---+--+--+---+--+- + dummy | regress_foreign_data_user | -| - | | | useless + foo| regress_test_role_super | test_fdw_handler | - | | (b '3', c '4', a '2', d '5') | + postgresql | regress_foreign_data_user | -| postgresql_fdw_validator | | | (3 rows) DROP ROLE regress_test_role_super; -- ERROR diff --git a/src/test/regress/input/create_function_1.source b/src/test/regress/input/create_function_1.source index f2b1561cc2..669a5355b0 100644 --- a/src/test/regress/input/create_function_1.source +++ b/src/test/regress/input/create_function_1.source @@ -87,3 +87,6 @@ CREATE FUNCTION test1 (int) RETURNS int LANGUAGE C CREATE FUNCTION test1 (int) RETURNS int LANGUAGE internal AS 'nosuch'; + +CREATE FUNCTION test_fdw_handler () RETURNS fdw_handler LANGUAGE C +AS '@libdir@/regress@DLSUFFIX@', 'test_fdw_handler'; diff --git a/src/test/regress/output/create_function_1.source b/src/te
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/12 16:39, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote > wrote: >> On 2017/09/11 19:45, Ashutosh Bapat wrote: >>> On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote: >>>> IMHO, we should make it the responsibility of the future patch to set a >>>> child PlanRowMark's prti to the direct parent's RT index, when we actually >>>> know that it's needed for something. We clearly know today why we need to >>>> pass the other objects like child RT entry, RT index, and Relation, so we >>>> should limit this patch to pass only those objects to the recursive call. >>>> That makes this patch a relatively easy to understand change. >>> >>> I think you are mixing two issues here 1. setting parent RTI in child >>> PlanRowMark and 2. passing immediate parent's PlanRowMark to >>> expand_single_inheritance_child(). >>> >>> I have discussed 1 in my reply to Robert. >>> >>> About 2 you haven't given any particular comments to my reply. To me >>> it looks like it's this patch that introduces the notion of >>> multi-level expansion, so it's natural for this patch to pass >>> PlanRowMark in cascaded fashion similar to other structures. >> >> You patch does 2 to be able to do 1, doesn't it? That is, to be able to >> set the child PlanRowMark's prti to the direct parent's RT index, you pass >> the immediate parent's PlanRowMark to the recursive call of >> expand_single_inheritance_child(). > > No. child PlanRowMark's prti is set to parentRTIndex, which is a > separate argument and is used to also set parent_relid in > AppendRelInfo. OK. So, to keep the old behavior (if at all), we'd actually need a new argument rootParentRTindex. The old behavior was that all child PlanRowMarks had the root parent's RT index as their prti. It seems, though, that the new behavior, where prti is now set to the direct parent's RT index, is more or less harmless, because whatever we set prti to, as long as it's different from rti, we can consider it a child PlanRowMark. 
So it might be fine to set prti to the direct parent's RT index. That said, I noticed that we might need to be careful about what the value of the root parent's PlanRowMark's allMarkTypes field gets set to. We need to make sure that it reflects the markType of all partitions in the tree, including those that are not the root parent's direct children. Is that true with the proposed implementation? Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/12 16:55, Ashutosh Bapat wrote: > On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote: >> So I looked at this a bit closely and came to the conclusion that we may >> not need to keep partitioned table RT indexes in the >> (Merge)Append.partitioned_rels after all, as far as execution-time locking >> is concerned. >> >> Consider two cases: >> >> 1. Plan is created and executed in the same transaction >> >> In this case, locks taken on the partitioned tables by the planner will >> suffice. >> >> 2. Plan is executed in a different transaction from the one in which it >>was created (a cached plan) >> >> In this case, AcquireExecutorLocks will lock all the relations in >> PlannedStmt.rtable, which must include all partitioned tables of all >> partition trees involved in the query. Of those, it will lock the tables >> whose RT indexes appear in PlannedStmt.nonleafResultRelations with >> RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global >> list of all partitioned table RT indexes obtained by concatenating >> partitioned_rels lists of all ModifyTable nodes involved in the query >> (set_plan_refs does that). We need to distinguish nonleafResultRelations, >> because we need to take the stronger lock on a given table before any >> weaker one if it happens to appear in the query as a non-result relation >> too, to avoid lock strength upgrade deadlock hazard. >> >> Moreover, because all the tables from plannedstmt->rtable, including the >> partitioned tables, will be added to PlannedStmt.relationsOids, any >> invalidation events affecting the partitioned tables (for example, >> add/remove a partition) will cause the plan involving partitioned tables >> to be recreated. >> >> In none of this do we rely on the partitioned table RT indexes appearing >> in the (Merge)Append node itself. Maybe, we should just remove >> partitioned_rels from (Merge)AppendPath and (Merge)Append node in a >> separate patch and move on. >> >> Thoughts? 
> > Yes, I did the same analysis (to which I refer in my earlier reply to > you). I too think we should just remove partitioned_rels from Append > paths. But then the question is those are then transferred to > ModifyTable node in create_modifytable_plan() and use it for something > else. What should we do about that code? I don't think we are really > using that list from ModifyTable node as well, so may be we could > remove it from there as well. What do you think? Does that mean > partitioned_rels isn't used at all in the code? No, we cannot simply get rid of partitioned_rels altogether. We'll need to keep it in the ModifyTable node, because we *do* need the nonleafResultRelations list in PlannedStmt to distinguish partitioned table result relations, which set_plan_refs builds by concatenating the partitioned_rels lists of the various ModifyTable nodes of the query. The PlannedStmt.nonleafResultRelations list actually has some use (which parallels PlannedStmt.resultRelations), but the partitioned_rels list in the individual (Merge)Append, as it turns out, doesn't. So, we can remove partitioned_rels from (Merge)AppendPath and (Merge)Append nodes and remove ExecLockNonLeafAppendTables(). Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/11 21:07, Ashutosh Bapat wrote: > On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas wrote: >> On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat >> wrote: >>> So, all partitioned partitions are getting locked correctly. Am I >>> missing something? >> >> That's not a valid test. In that scenario, you're going to hold all >> the locks acquired by the planner, all the locks acquired by the >> rewriter, and all the locks acquired by the executor, but when using >> prepared queries, it's possible to execute the plan after the planner >> and rewriter locks are no longer held. > > I see the same thing when I use prepare and execute So I looked at this a bit closely and came to the conclusion that we may not need to keep partitioned table RT indexes in the (Merge)Append.partitioned_rels after all, as far as execution-time locking is concerned. Consider two cases: 1. Plan is created and executed in the same transaction In this case, locks taken on the partitioned tables by the planner will suffice. 2. Plan is executed in a different transaction from the one in which it was created (a cached plan) In this case, AcquireExecutorLocks will lock all the relations in PlannedStmt.rtable, which must include all partitioned tables of all partition trees involved in the query. Of those, it will lock the tables whose RT indexes appear in PlannedStmt.nonleafResultRelations with RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global list of all partitioned table RT indexes obtained by concatenating partitioned_rels lists of all ModifyTable nodes involved in the query (set_plan_refs does that). We need to distinguish nonleafResultRelations, because we need to take the stronger lock on a given table before any weaker one if it happens to appear in the query as a non-result relation too, to avoid lock strength upgrade deadlock hazard. 
Moreover, because all the tables from plannedstmt->rtable, including the partitioned tables, will be added to PlannedStmt.relationOids, any invalidation events affecting the partitioned tables (for example, adding/removing a partition) will cause the plan involving partitioned tables to be recreated. In none of this do we rely on the partitioned table RT indexes appearing in the (Merge)Append node itself. Maybe we should just remove partitioned_rels from (Merge)AppendPath and (Merge)Append node in a separate patch and move on. Thoughts? Thanks, Amit
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/11 19:45, Ashutosh Bapat wrote: > On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote: >> IMHO, we should make it the responsibility of the future patch to set a >> child PlanRowMark's prti to the direct parent's RT index, when we actually >> know that it's needed for something. We clearly know today why we need to >> pass the other objects like child RT entry, RT index, and Relation, so we >> should limit this patch to pass only those objects to the recursive call. >> That makes this patch a relatively easy to understand change. > > I think you are mixing two issues here 1. setting parent RTI in child > PlanRowMark and 2. passing immediate parent's PlanRowMark to > expand_single_inheritance_child(). > > I have discussed 1 in my reply to Robert. > > About 2 you haven't given any particular comments to my reply. To me > it looks like it's this patch that introduces the notion of > multi-level expansion, so it's natural for this patch to pass > PlanRowMark in cascaded fashion similar to other structures. Your patch does 2 to be able to do 1, doesn't it? That is, to be able to set the child PlanRowMark's prti to the direct parent's RT index, you pass the immediate parent's PlanRowMark to the recursive call of expand_single_inheritance_child(). All I am trying to say is that this patch's mission is to expand inheritance step-wise to be able to do certain things in the *planner* that weren't possible before. The patch accomplishes that by creating child AppendRelInfos such that their parent_relid field is set to the immediate parent's RT index. It's quite clear why we're doing so. It's not clear why we should do so for PlanRowMarks too. Maybe it's fine as long as nothing breaks. 
>> If we go with your patch, partitioned tables won't get locked, for >> example, in case of the following query (p is a partitioned table): >> >> select 1 from p union all select 2 from p; >> >> That's because the RelOptInfos for the two instances of p in the above >> query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children >> of the Append corresponding to the UNION ALL subquery RTE. So, >> partitioned_rels does not get set per your proposed code. > [...] > So, all partitioned partitions are getting locked correctly. Am I > missing something? Will reply to this separately to your other email. > Actually, the original problem that caused this discussion started > with an assertion failure in get_partitioned_child_rels() as > Assert(list_length(result) >= 1); > > This assertion fails if result is NIL when an intermediate partitioned > table is passed. May be we should assert (result == NIL || > list_length(result) == 1) and allow that function to be called even > for intermediate partitioned partitions for which the function will > return NIL. That will leave the code in add_paths_to_append_rel() > simple. Thoughts? Yeah, I guess that could work. We'll just have to write comments to describe why the Assert is written that way. >> In create_lateral_join_info(): >> >> +Assert(IS_SIMPLE_REL(brel)); >> +Assert(brte); >> >> The second Assert is either unnecessary or should be placed first. > > simple_rte_array[] may have some NULL entries. Second assert makes > sure that we aren't dealing with a NULL entry. Any particular reason > to reorder the asserts? Sorry, I missed that the 2nd Assert has b"rte". I thought it's b"rel". >> The following comment could be made a bit clearer. >> >> + * In the case of table inheritance, the parent RTE is directly >> linked >> + * to every child table via an AppendRelInfo. In the case of table >> + * partitioning, the inheritance hierarchy is expanded one level at >> a >> + * time rather than flattened. 
Therefore, an other member rel >> that is * a partitioned table may have children of its own, and must >> + * therefore be marked with the appropriate lateral info so that >> those >> + * children eventually get marked also. >> >> How about: In the case of partitioned table inheritance, the original >> parent RTE is linked, via AppendRelInfo, only to its immediate partitions. >> Partitions below the first level are accessible only via their immediate >> parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so >> consider those as well. > > I don't see much difference between those two. We usually do not use > macros in comments, so usually comments mention "other member" rel.
Re: [HACKERS] expanding inheritance in partition bound order
Hi Amit, On 2017/09/11 16:16, Amit Khandekar wrote: > Thanks Amit for the patch. I am still reviewing it, but meanwhile > below are a few comments so far ... Thanks for the review. > + next_parted_idx += (list_length(*pds) - next_parted_idx - 1); > > I think this can be replaced just by : > + next_parted_idx = list_length(*pds) - 1; > Or, how about removing this variable next_parted_idx altogether ? > Instead, we can just do this : > pd->indexes[i] = -(1 + list_length(*pds)); That seems like the simplest possible way to do it. > + next_leaf_idx += (list_length(*leaf_part_oids) - next_leaf_idx); > > Didn't understand why next_leaf_idx needs to be updated in case when > the current partrelid is partitioned. I think it should be incremented > only for leaf partitions, no ? Or for that matter, again, how about > removing the variable 'next_leaf_idx' and doing this : > *leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid); > pd->indexes[i] = list_length(*leaf_part_oids) - 1; Yep. Attached updated patch does it that way for both partitioned table indexes and leaf partition indexes. Thanks for pointing it out. > --- > > * For every partitioned table in the tree, starting with the root > * partitioned table, add its relcache entry to parted_rels, while also > * queuing its partitions (in the order in which they appear in the > * partition descriptor) to be looked at later in the same loop. This is > * a bit tricky but works because the foreach() macro doesn't fetch the > * next list element until the bottom of the loop. > > I think the above comment needs to be modified with something > explaining the relevant changed code. For e.g. there is no > parted_rels, and the "tricky" part was there earlier because of the > list being iterated and at the same time being appended. > > I think I forgot to update this comment. > I couldn't see the existing comments like "Indexes corresponding to > the internal partitions are multiplied by" anywhere in the patch. 
I > think those comments are still valid, and important. Again, I failed to keep this comment. Anyway, I reworded the comments a bit to describe what the code is doing more clearly. Hope you find it so too. Thanks, Amit From 0e04cee14a5168e0652c2aa646c169789ae41e8e Mon Sep 17 00:00:00 2001 From: amit Date: Wed, 30 Aug 2017 10:02:05 +0900 Subject: [PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor Currently it and the structure it generates viz. PartitionDispatch objects are too coupled with the executor's tuple-routing code. In particular, it's pretty undesirable that it makes it the responsibility of the caller to release some resources, such as executor tuple table slots, tuple-conversion maps, etc. That makes it harder to use in places other than where it's currently being used. After this refactoring, ExecSetupPartitionTupleRouting() now needs to do some of the work that was previously done in RelationGetPartitionDispatchInfo(). --- src/backend/catalog/partition.c| 53 +++- src/backend/commands/copy.c| 32 +++-- src/backend/executor/execMain.c| 88 ++ src/backend/executor/nodeModifyTable.c | 37 +++--- src/include/catalog/partition.h| 20 +++- src/include/executor/executor.h| 4 +- src/include/nodes/execnodes.h | 40 +++- 7 files changed, 181 insertions(+), 93 deletions(-) diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c index 73eff17202..555b7c21c7 100644 --- a/src/backend/catalog/partition.c +++ b/src/backend/catalog/partition.c @@ -1292,7 +1292,6 @@ RelationGetPartitionDispatchInfo(Relation rel, Relationpartrel = lfirst(lc1); Relationparent = lfirst(lc2); PartitionKey partkey = RelationGetPartitionKey(partrel); - TupleDesc tupdesc = RelationGetDescr(partrel); PartitionDesc partdesc = RelationGetPartitionDesc(partrel); int j, m; @@ -1300,29 +1299,12 @@ RelationGetPartitionDispatchInfo(Relation rel, pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData)); pd[i]->reldesc = partrel; pd[i]->key = partkey; - 
pd[i]->keystate = NIL; pd[i]->partdesc = partdesc; if (parent != NULL) - { - /* -* For every partitioned table other than root, we must store a -* tuple table slot initialized with its tuple descriptor and a -* tuple conversion map to convert a tuple from its parent's -* rowtype to its own. That is to make sure that we are looking at -* the correct row using the correct tuple descriptor when -
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables
On 2017/09/11 16:23, Ashutosh Bapat wrote: > On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas wrote: >> I'm a bit suspicious about the fact that there are now executor >> changes related to the PlanRowMarks. If the rowmark's prti is now the >> intermediate parent's RT index rather than the top-parent's RT index, >> it'd seem like that'd matter somehow. Maybe it doesn't, because the >> code that cares about prti seems to only care about whether it's >> different from rti. But if that's true everywhere, then why even >> change this? I think we might be well off not to tinker with things >> that don't need to be changed. > > In the definition of ExecRowMark, I see > Index prti; /* parent range table index, if child */ > It just says parent, by which I take as direct parent. For > inheritance, which earlier flattened inheritance hierarchy, direct > parent was top parent. So, probably nobody thought whether a parent is > direct parent or top parent. But now that we have introduced that > concept we need to interpret this comment anew. And I think > interpreting it as direct parent is non-lossy. But then we also don't have anything to say about why we're making that change. If you could describe what non-lossy is in this context, then fine. But that we'd like to match with what we're going to do for AppendRelInfos does not seem to be a sufficient explanation for this change. > If we set top parent's > index, parent RTI in AppendRelInfo and PlanRowMark would not agree. > So, it looks quite natural that we set the direct parent's index in > PlanRowMark. They would not agree, yes, but aren't they unrelated? If we have a reason for them to agree, (for example, row-locking breaks in the inherited table case if we didn't), then we should definitely make them agree. Updating the comment for prti definition might be something that this patch could (should?) do, but I'm not quite sure about that too. 
Thanks, Amit