Hi,
Here's a patch with some cosmetic fixes to 0002, to be applied on top of 0002.
On Tue, May 16, 2017 at 1:02 PM, Ashutosh Bapat
<[email protected]> wrote:
> On Sun, May 14, 2017 at 12:30 PM, amul sul <[email protected]> wrote:
>> On Fri, May 12, 2017 at 10:39 PM, Ashutosh Bapat
>> <[email protected]> wrote:
>>> On Fri, May 12, 2017 at 6:08 PM, amul sul <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> Please find the following updated patches attached:
>>>>
>>>> 0001-Cleanup.patch: does some cleanup and code refactoring required for
>>>> the hash partition patch. Otherwise, there will be an unnecessary diff
>>>> in the 0002 patch.
>>>
>>> Thanks for splitting the patch.
>>>
>>> +    if (isnull[0])
>>> +        cur_index = partdesc->boundinfo->null_index;
>>> This code assumes that null_index will be set to -1 when has_null is
>>> false. Per RelationBuildPartitionDesc() this is true, but probably we
>>> should write this code as
if (isnull[0])
{
    if (partdesc->boundinfo->has_null)
        cur_index = partdesc->boundinfo->null_index;
}
>>> That way we are certain that when has_null is false, cur_index = -1,
>>> similar to the original code.
>>>
>> Okay, will add this.
>
> Thanks.
>
>> I still don't understand the point of having the has_null variable; if no
>> null-accepting partition exists then null_index is always set to -1 in
>> RelationBuildPartitionDesc. Anyway, let's not change the original code.
>
> I agree. has_null might have been folded into null_index == -1, but
> that's not a problem for this patch.
>
> 0001 looks good to me now.
>
>
>>
>> [...]
>>>
>>> +    if (key->strategy == PARTITION_STRATEGY_HASH)
>>> +    {
>>> +        ndatums = nparts;
>>> +        hbounds = (PartitionHashBound **) palloc(nparts *
>>> +                                          sizeof(PartitionHashBound *));
>>> +        i = 0;
>>> +        foreach (cell, boundspecs)
>>> +        {
>>> +            PartitionBoundSpec *spec = lfirst(cell);
>>> +
>>> [ clipped ]
>>> +            hbounds[i]->index = i;
>>> +            i++;
>>> +        }
>>> For list and range partitioned tables we order the bounds so that two
>>> partitioned tables have them in the same order irrespective of the order
>>> in which they are specified by the user and hence stored in the catalogs.
>>> The partitions then get indexes according to the order in which their
>>> bounds appear in the ordered arrays of bounds. Thus any two partitioned
>>> tables with the same partition specification always have the same
>>> PartitionBoundInfoData. This helps partition-wise join match the partition
>>> bounds of two given tables. The above code assigns indexes to the
>>> partitions in the order they appear in the catalogs. This means that two
>>> partitioned tables with the same partition specification, but a different
>>> order of partition bound specifications, will have different
>>> PartitionBoundInfoData representations.
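
A rough, untested sketch of how that could be done (assuming PartitionHashBound
carries modulus, remainder and index fields, as 0002 appears to do) would be to
sort hbounds into a canonical (modulus, remainder) order before assigning
indexes:

    /* Hypothetical comparator: order hash bounds by (modulus, remainder). */
    static int32
    qsort_partition_hbound_cmp(const void *a, const void *b)
    {
        PartitionHashBound *h1 = (*(PartitionHashBound *const *) a);
        PartitionHashBound *h2 = (*(PartitionHashBound *const *) b);

        if (h1->modulus != h2->modulus)
            return (h1->modulus > h2->modulus) ? 1 : -1;
        if (h1->remainder != h2->remainder)
            return (h1->remainder > h2->remainder) ? 1 : -1;
        return 0;
    }

    /* ... and in RelationBuildPartitionDesc(), before assigning indexes: */
    qsort(hbounds, nparts, sizeof(PartitionHashBound *),
          qsort_partition_hbound_cmp);
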
>>>
>>> If we do that, partition_bounds_equal() would probably reduce to just
>>> matching the indexes and the last element of the datums array, i.e. the
>>> greatest modulus datum. If the ordered datums arrays of two partitioned
>>> tables do not match exactly, the mismatch can be because of missing datums
>>> or different datums. A missing datum will either change the greatest
>>> modulus or leave the corresponding entry in the indexes array as -1. A
>>> differing datum will cause mismatching entries in the indexes arrays.
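
Just to spell that out, an untested sketch of the hash branch of
partition_bounds_equal(), assuming indexes[] has one entry per remainder up to
the greatest modulus and datums[ndatums - 1][0] stores the greatest modulus:

    int     greatest_modulus = DatumGetInt32(b1->datums[b1->ndatums - 1][0]);

    /* Different greatest moduli means the partition schemes don't match. */
    if (greatest_modulus != DatumGetInt32(b2->datums[b2->ndatums - 1][0]))
        return false;
    /* Every remainder must map to the same partition position in both. */
    for (i = 0; i < greatest_modulus; i++)
        if (b1->indexes[i] != b2->indexes[i])
            return false;
    return true;
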
>>>
>> Makes sense, will fix this.
>
> I don't see this being addressed in the patches attached in the reply to
> Dilip.
>
>>
>>>
>>> +        if (key->partattrs[i] != 0)
>>> +        {
>>> +            keyCol = (Node *) makeVar(1,
>>> +                                      key->partattrs[i],
>>> +                                      key->parttypid[i],
>>> +                                      key->parttypmod[i],
>>> +                                      key->parttypcoll[i],
>>> +                                      0);
>>> +
>>> +            /* Form hash_fn(value) expression */
>>> +            keyCol = (Node *) makeFuncExpr(key->partsupfunc[i].fn_oid,
>>> +                                get_fn_expr_rettype(&key->partsupfunc[i]),
>>> +                                           list_make1(keyCol),
>>> +                                           InvalidOid,
>>> +                                           InvalidOid,
>>> +                                           COERCE_EXPLICIT_CALL);
>>> +        }
>>> +        else
>>> +        {
>>> +            keyCol = (Node *) copyObject(lfirst(partexprs_item));
>>> +            partexprs_item = lnext(partexprs_item);
>>> +        }
>>> I think we should add FuncExpr for column Vars as well as expressions.
>>>
>> Okay, will fix this.
>
> Here, please add a check similar to the one in get_quals_for_range():
> 1840         if (partexprs_item == NULL)
> 1841             elog(ERROR, "wrong number of partition key expressions");
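
Putting both points together, the loop body could be shaped roughly as below
(untested sketch, based only on the code quoted above):

    if (key->partattrs[i] != 0)
        keyCol = (Node *) makeVar(1,
                                  key->partattrs[i],
                                  key->parttypid[i],
                                  key->parttypmod[i],
                                  key->parttypcoll[i],
                                  0);
    else
    {
        if (partexprs_item == NULL)
            elog(ERROR, "wrong number of partition key expressions");
        keyCol = (Node *) copyObject(lfirst(partexprs_item));
        partexprs_item = lnext(partexprs_item);
    }

    /* Form hash_fn(value) for plain columns and expressions alike */
    keyCol = (Node *) makeFuncExpr(key->partsupfunc[i].fn_oid,
                                   get_fn_expr_rettype(&key->partsupfunc[i]),
                                   list_make1(keyCol),
                                   InvalidOid,
                                   InvalidOid,
                                   COERCE_EXPLICIT_CALL);
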
>
>
>>
>>> I think we need more comments for compute_hash_value(), mix_hash_value() and
>>> satisfies_hash_partition() as to what each of them accepts and what it
>>> computes.
>>>
>>> +        /* key's hash values start from third argument of function. */
>>> +        if (!PG_ARGISNULL(i + 2))
>>> +        {
>>> +            values[i] = PG_GETARG_DATUM(i + 2);
>>> +            isnull[i] = false;
>>> +        }
>>> +        else
>>> +            isnull[i] = true;
>>> You could write this as
>>>     isnull[i] = PG_ARGISNULL(i + 2);
>>>     if (!isnull[i])
>>>         values[i] = PG_GETARG_DATUM(i + 2);
>>>
>> Okay.
>
> If we have used this technique somewhere else in PG code, please
> mention that function/place.
> /*
> * Rotate hash left 1 bit before mixing in the next column. This
> * prevents equal values in different keys from cancelling each other.
> */
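
To illustrate what that comment is guarding against (made-up values; this
assumes the per-column hashes are XORed together, as the "cancelling" wording
suggests):

    uint32  h_a = 0xDEADBEEF;
    uint32  h_b = 0xDEADBEEF;           /* equal column hashes */
    uint32  naive = h_a ^ h_b;          /* 0: the values cancel */
    /* with the 1-bit left rotation the first hash is shifted first */
    uint32  mixed = ((h_a << 1) | ((h_a & 0x80000000) ? 1 : 0)) ^ h_b;
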
>
>
>>
>>> +            foreach (lc, $5)
>>> +            {
>>> +                DefElem    *opt = (DefElem *) lfirst(lc);
>>> A search on WITH in gram.y shows that we do not handle WITH options in
>>> gram.y; usually they are handled at the transformation stage. Why is this
>>> an exception? If you do that, we can have all the error handling in
>>> transformPartitionBound().
>>>
>> If so, ForValues would need to return a list for hash and a
>> PartitionBoundSpec for the other two; wouldn't that break code consistency?
>> And such validation is not new in gram.y; see xmltable_column_el.
>
> Thanks for pointing that out. OK, then maybe leave it in gram.y. But maybe
> we should move the error handling into the transform function.
>
>
>>
>>> +DATA(insert OID = 5028 ( satisfies_hash_partition PGNSP PGUID 12 1 0
>>> 2276 0 f f f f f f i s 3 0 16 "23 23 2276" _null_ _null_ _null_ _null_
>>> _null_ satisfies_hash_partition _null_ _null_ _null_ ));
>>> Why is the third argument to this function ANY? Shouldn't it be INT4ARRAY
>>> (variadic INT4)?
>>>
>> Will use INT4ARRAY in the next patch, but I am a little sceptical of it.
>> We need an unsigned int32, but unfortunately there is no variadic uint32
>> support. How about INT8ARRAY?
>
> Hmm, I think as long as the binary representation of the given unsigned
> integer doesn't change in the function call, we can cast INT32 datums to
> unsigned int32, so spending an extra 4 bytes per partition key doesn't look
> worth the effort.
>
> A related question: all hash functions have "integer" as their return type
> but internally return uint32. Why not do the same for this function as
> well?
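
For what it's worth, the reinterpretation discussed above could be as simple
as the following (variable name is made up):

    /* Reuse the int4 argument's bits as an unsigned hash value. */
    uint32  keyHash = (uint32) DatumGetInt32(values[i]);
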
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 987bb73..e9b09dd 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -58,7 +58,8 @@
* In the case of range partitioning, ndatums will typically be far less than
* 2 * nparts, because a partition's upper bound and the next partition's lower
* bound are the same in most common cases, and we only store one of them.
- * In case of hash partitioning, ndatums be same as the number of partitions.
+ * In the case of hash partitioning, ndatums will be the same as the number
+ * of partitions.
*
* For range and list partitioned tables, datums is an array of datum-tuples
* with key->partnatts datums each.
@@ -1504,12 +1505,12 @@ make_partition_op_expr(PartitionKey key, int keynum,
/*
* get_qual_for_hash
*
- * Given a list of partition columns, modulus and remainder this function
- * returns an expression Node for the partition table's CHECK constraint.
+ * Given the list of partition columns, modulus and remainder corresponding
+ * to a partition, this function returns the CHECK constraint expression
+ * Node for that partition.
*
- * For example, given a partition definition such as:
- * CREATE TABLE simple_hash (a int, b char(10))
- * PARTITION BY HASH (a, b);
+ * For a partitioned table defined as:
+ * CREATE TABLE simple_hash (a int, b char(10)) PARTITION BY HASH (a, b);
*
* CREATE TABLE p_p1 PARTITION OF simple_hash
* FOR VALUES WITH (modulus 2, remainder 1);
@@ -1520,16 +1521,16 @@ make_partition_op_expr(PartitionKey key, int keynum,
* CREATE TABLE p_p4 PARTITION OF simple_hash
* FOR VALUES WITH (modulus 8, remainder 4);
*
- * This function will return one of the following in the form of a
- * subexpression:
+ * This function will return one of the following in the form of an
+ * expression:
*
* for p_p1: satisfies_hash_partition(2, 1, hash_fn_1(a), hash_fn_2(b))
* for p_p2: satisfies_hash_partition(4, 2, hash_fn_1(a), hash_fn_2(b))
* for p_p3: satisfies_hash_partition(8, 0, hash_fn_1(a), hash_fn_2(b))
* for p_p4: satisfies_hash_partition(8, 4, hash_fn_1(a), hash_fn_2(b))
*
- * hash_fn_1 and hash_fn_2 will be datatype-specific hash functions for
- * column a and b respectively.
+ * where hash_fn_1 and hash_fn_2 are datatype-specific hash functions for
+ * columns a and b respectively.
*/
static List *
get_qual_for_hash(PartitionKey key, PartitionBoundSpec *spec)
@@ -1568,7 +1569,6 @@ get_qual_for_hash(PartitionKey key, PartitionBoundSpec *spec)
/* Left operand */
if (key->partattrs[i] != 0)
- {
keyCol = (Node *) makeVar(1,
key->partattrs[i],
key->parttypid[i],
@@ -1576,7 +1576,6 @@ get_qual_for_hash(PartitionKey key, PartitionBoundSpec *spec)
key->parttypcoll[i],
0);
- }
else
{
keyCol = (Node *) copyObject(lfirst(partexprs_item));
@@ -2684,7 +2683,7 @@ mix_hash_value(int nkeys, uint32 *hash_array, bool *isnull)
/*
* Rotate hash left 1 bit before mixing in the next column. This
* prevents equal values in different keys from cancelling each other.
- * */
+ */
rowHash = (rowHash << 1) | ((rowHash & 0x80000000) ? 1 : 0);
if (!isnull[i])
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 46344ee..8f8bbcd 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13231,8 +13231,9 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
}
/*
- * Collation is irrelevant for hash partition key, because hash
- * operator classes provide only equality, not ordering.
+ * Hash operator classes provide only equality, not ordering.
+ * Collation, which is relevant for ordering but not equality, is
+ * irrelevant for hash partitioning.
*/
if (*strategy == PARTITION_STRATEGY_HASH && pelem->collation != NIL)
{
@@ -13404,9 +13405,10 @@ ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
partcollation[attn] = attcollation;
- /* Identify opclass to use. For list and range partitioning we use only
- * btree operators, which seems enough for those. For hash partitioning,
- * we use hash operators. */
+ /*
+ * Identify the opclass to use. For list and range partitioning we use only
+ * btree operator classes, which seem enough for those. For hash
+ * partitioning, we use hash operator classes. */
if (strategy == PARTITION_STRATEGY_HASH)
{
am_method = "hash";
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers