On 2016/08/05 21:38, Ashutosh Bapat wrote: >>> Consider lists ('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', 'a') for a >>> list partitioned tables. I am suggesting that we arrange them as >>> ('a','b','l'), ('d', 'h', 'm') and ('e', 'f', 'i'). If the given row >> (either >>> for comparison or for inserting) has value 'c', we will search for it in >>> ('a','b','l') but will be eliminate other two partitions since the >> second's >>> partition's lowest value is higher than 'c' and lowest values of rest of >> the >>> partitions are higher than that of the second partition. Without this >> order >>> among the partitions, we will have to compare lowest values of all the >>> partitions. >> >> I would think that for that case what we'd want to do is create two >> lists. One looks like this: >> >> [ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ] >> >> The other looks like this: >> >> [3, 3, 2, 1, 1, 2, 1, 1, 3, 2] >> > > small correction; there's an extra 1 here. Every partition in the example > has only three values. > > >> >> Given a particular value, bsearch the first list. If the value is not >> found, it's not in any partition. If it is found, then go look up the >> same array index in the second list; that's the containing partition. >> > > +1, if we could do it. It will need a change in the way Amit's patch stores > partitioning scheme in PartitionDesc.
Okay, I will try to implement this in the next version of the patch. One thing that comes to mind is what if a user wants to apply hash operator class equality to list partitioned key by specifying a hash operator class for the corresponding column. In that case, we would not have the ordering procedure with an hash operator class, hence any ordering based optimization becomes impossible to implement. The current patch rejects a column for partition key if its type does not have a btree operator class for both range and list methods, so this issue doesn't exist, however it could be seen as a limitation. > This way specifications {('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', > 'a')} and {('h', 'd','m') , ('e', 'i', 'f'), and ('l', 'b', 'a')} which are > essentially same, are represented in different ways. It makes matching > partitions for partition-wise join a bit tedius. We have to make sure that > the first array matches for both the joining relations and then make sure > that all the values belonging to the same partition for one table also > belong to the same partition in the other table. Some more complex logic > for matching subsets of lists for partition-wise join. So, we have 3 choices for the internal representation of list partitions: Choice 1 (the current approach): Load them in the same order as they are found in the partition catalog: Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'} Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'} In this case, mismatch on the first list would make the two tables incompatibly partitioned, whereas they really aren't incompatible. Choice 2: Representation with 2 arrays: Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1] Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3] It still doesn't help the case of pairwise joins because it's hard to tell which value belongs to which partition (the 2nd array carries the original partition numbers). Although it might still work for tuple-routing. Choice 3: Order all lists' elements for each list individually and then order the lists themselves on their first values: Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'} Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'} This representation makes pairing partitions for pairwise joining convenient but for tuple-routing we still need to visit each partition in the worst case. > At least for straight forward partitioned table matching it helps to have > both these array look same independent of the user specification. From that > point of view, the partition be ordered by their lowest or highest list > values and the second array is the index in the ordered set. For both the > specifications above, the list will look like > > [ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ] > [1, 1, 2, 3, 3, 2, 3, 1, 2] IIUC, this seems like a combination of 2 and 3 above: So, we have the following list partitions (as read from the catalog) Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'} Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'} By applying the method of 3: Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'} Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'} Then applying 2: Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [3, 1, 2] Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [2, 3, 1] This is user-specification independent representation wherein the partition numbers in the 2nd array are based on canonical representation (ordered lists). To check pairwise join compatibility, simply compare the first two arrays. As to which partitions (think OIDs, RTEs whatever) pair with each other, simply pair corresponding elements of the 3rd array which are original partitions numbers (or OIDs). Also when routing a tuple, we find partition number in the array 2 and then look up the array 3 to get the actual partition to insert the tuple. Neither of these representations make the logic of checking pairwise-join compatibility and pairing a subset of partitions (on either side) any easier, although I haven't given it a good thought yet. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers