On Wed, Jun 28, 2023 at 4:50 PM Joel Jacobson <j...@compiler.org> wrote:
>
> On Wed, Jun 28, 2023, at 08:26, jian he wrote:
>
> > Hi there.
> > I changed the function hashset_contains to strict.
>
> Changing hashset_contains to STRICT would cause it to return NULL
> if any of the operands are NULL, which I don't believe is correct, since:
>
> SELECT NULL = ANY('{}'::int4[]);
>  ?column?
> ----------
>  f
> (1 row)
>
> Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE,
> to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in
> SQL:2023.
>
> Did you try running `make installcheck` after your change?
> You would then have seen one of the tests failing:
>
> test array-and-multiset-semantics ... FAILED 21 ms
>
> Check the content of `regression.diffs` to see why:
>
> % cat regression.diffs
> diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out /Users/joel/src/hashset/results/array-and-multiset-semantics.out
> --- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 2023-06-27 10:07:38
> +++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out 2023-06-28 10:13:27
> @@ -158,7 +158,7 @@
>           |      | {NULL}      | {NULL}       |                  |
>           |    1 | {1}         | {1}          |                  |
>           |    4 | {4}         | {4}          |                  |
> - {}      |      | {NULL}      | {NULL}       | f                | f
> + {}      |      | {NULL}      | {NULL}       |                  | f
>   {}      |    1 | {1}         | {1}          | f                | f
>   {}      |    4 | {4}         | {4}          | f                | f
>   {NULL}  |      | {NULL}      | {NULL,NULL}  |                  |
> @@ -284,7 +284,8 @@
>  "= ANY(...)";
>   arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
>  ------+------+-------------+--------------+------------------+------------
> -(0 rows)
> + {}   |      | {NULL}      | {NULL}       |                  | f
> +(1 row)
>
> > also change the way to return an empty array.
>
> Nice.
> I agree the `Datum d` variable was unnecessary.
> I also removed the unused includes.
>
> > in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that
> > int4hashset can speed up distinct aggregates and distinct counts?
> > like the following:
> >
> > explain(analyze, costs off, timing off, buffers)
> > SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3
> >
> > explain(analyze, costs off, timing off, buffers)
> > SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3
>
> The 100k tables seem to be too small to give any meaningful results
> when trying to measure individual queries:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
> Execution Time: 26.790 ms
> Execution Time: 30.616 ms
> Execution Time: 33.253 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_100k;
> Execution Time: 32.797 ms
> Execution Time: 27.605 ms
> Execution Time: 26.228 ms
>
> If we instead try the 10M tables, it looks like array_agg(DISTINCT ...)
> is actually faster for the `i` column, where all input integers are unique:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
> Execution Time: 799.017 ms
> Execution Time: 796.008 ms
> Execution Time: 799.121 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_10M;
> Execution Time: 1204.873 ms
> Execution Time: 1221.822 ms
> Execution Time: 1216.340 ms
>
> For random integers, hashset is a win though:
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
> Execution Time: 1874.722 ms
> Execution Time: 1878.760 ms
> Execution Time: 1861.640 ms
>
> EXPLAIN(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(rnd) FROM benchmark_input_10M;
> Execution Time: 1253.709 ms
> Execution Time: 1222.651 ms
> Execution Time: 1237.849 ms
>
> > explain(costs off, timing off, analyze, buffers)
> > select count(distinct rnd) from benchmark_input_100k \watch c=3
> >
> > explain(costs off, timing off, analyze, buffers)
> > SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
> > benchmark_input_100k) sub(x) \watch c=3
>
> I tried these with 10M:
>
> EXPLAIN(costs off, timing off, analyze, buffers)
> SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M;
> Execution Time: 1733.320 ms
> Execution Time: 1725.214 ms
> Execution Time: 1716.636 ms
>
> EXPLAIN(costs off, timing off, analyze, buffers)
> SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
> benchmark_input_10M) sub(x);
> Execution Time: 1249.612 ms
> Execution Time: 1240.558 ms
> Execution Time: 1252.103 ms
>
> Not sure what I think of the current benchmark suite.
>
> I think it would be better to only include some realistic examples from
> real life, such as the graph query that was the reason I personally started
> working on this. Otherwise there is a risk we optimise for some hypothetical
> scenario that is not relevant in practice.
>
> It would be good to have more examples of typical workloads for which the
> hashset type would be useful.
>
> /Joel
> Did you try running `make installcheck` after your change?

At first I ran:

    make installcheck PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config

and found out it was connecting to a different active cluster, so I killed that
cluster. Later I found it was still using another database port, so it took me
some time to figure out I needed to run:

    make installcheck PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config PGPORT=5421

Anyway, this time I added another macro, which seems to simplify the code:

    #define SET_DATA_PTR(a) \
        (((char *) (a->data)) + CEIL_DIV(a->capacity, 8))

It passes all the tests on my local machine.

I should have produced only one patch, but when committing I forgot to include
one file, so I needed two commits.

> Not sure what I think of the current benchmark suite.

Your results are quite different from mine. I use the default config, and I see
a big difference. Yeah, I agree, the performance tests should be done more
carefully.
From ece7e6bf34facb67c7a938b2862f3d3af06aefc0 Mon Sep 17 00:00:00 2001
From: pgaddict <jian.universal...@gmail.com>
Date: Thu, 29 Jun 2023 14:27:02 +0800
Subject: [PATCH 2/2] macro SET_DATA_PTR to quickly access hashset data region

---
 hashset-api.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/hashset-api.c b/hashset-api.c
index 28eb2387..2d856b47 100644
--- a/hashset-api.c
+++ b/hashset-api.c
@@ -194,7 +194,7 @@ int4hashset_out(PG_FUNCTION_ARGS)

 	/* Calculate the pointer to the bitmap and values array */
 	bitmap = set->data;
-	values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+	values = (int32 *) SET_DATA_PTR(set);

 	/* Initialize the StringInfo buffer */
 	initStringInfo(&str);
@@ -411,7 +411,7 @@ int4hashset_union(PG_FUNCTION_ARGS)
 	int4hashset_t  *seta = PG_GETARG_INT4HASHSET_COPY(0);
 	int4hashset_t  *setb = PG_GETARG_INT4HASHSET(1);
 	char		   *bitmap = setb->data;
-	int32		   *values = (int32 *) (bitmap + CEIL_DIV(setb->capacity, 8));
+	int32		   *values = (int32 *) SET_DATA_PTR(setb);

 	for (i = 0; i < setb->capacity; i++)
 	{
@@ -598,7 +598,7 @@ int4hashset_agg_add_set(PG_FUNCTION_ARGS)
 		value = PG_GETARG_INT4HASHSET(1);

 	bitmap = value->data;
-	values = (int32 *) (value->data + CEIL_DIV(value->capacity, 8));
+	values = (int32 *) SET_DATA_PTR(value);

 	for (i = 0; i < value->capacity; i++)
 	{
@@ -665,7 +665,7 @@ int4hashset_agg_combine(PG_FUNCTION_ARGS)
 	dst = (int4hashset_t *) PG_GETARG_POINTER(0);

 	bitmap = src->data;
-	values = (int32 *) (src->data + CEIL_DIV(src->capacity, 8));
+	values = (int32 *) SET_DATA_PTR(src);

 	for (i = 0; i < src->capacity; i++)
 	{
@@ -698,7 +698,7 @@ int4hashset_to_array(PG_FUNCTION_ARGS)
 		PG_RETURN_ARRAYTYPE_P(construct_empty_array(INT4OID));

 	sbitmap = set->data;
-	svalues = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+	svalues = (int32 *) SET_DATA_PTR(set);

 	/* number of values to store in the array */
 	nvalues = set->nelements;
@@ -757,7 +757,7 @@ int4hashset_eq(PG_FUNCTION_ARGS)
 		PG_RETURN_BOOL(false);

 	bitmap_a = a->data;
-	values_a = (int32 *) (a->data + CEIL_DIV(a->capacity, 8));
+	values_a = (int32 *) SET_DATA_PTR(a);

 	/*
 	 * Check if every element in a is also in b
@@ -935,7 +935,8 @@ int4hashset_intersection(PG_FUNCTION_ARGS)
 	int4hashset_t  *seta = PG_GETARG_INT4HASHSET(0);
 	int4hashset_t  *setb = PG_GETARG_INT4HASHSET(1);
 	char		   *bitmap = setb->data;
-	int32		   *values = (int32 *) (bitmap + CEIL_DIV(setb->capacity, 8));
+	int32		   *values = (int32 *) SET_DATA_PTR(setb);
+
 	int4hashset_t  *intersection;

 	intersection = int4hashset_allocate(
@@ -971,7 +972,7 @@ int4hashset_difference(PG_FUNCTION_ARGS)
 	int4hashset_t  *setb = PG_GETARG_INT4HASHSET(1);
 	int4hashset_t  *difference;
 	char		   *bitmap = seta->data;
-	int32		   *values = (int32 *) (bitmap + CEIL_DIV(seta->capacity, 8));
+	int32		   *values = (int32 *) SET_DATA_PTR(seta);

 	difference = int4hashset_allocate(
 		seta->capacity,
@@ -1007,8 +1008,8 @@ int4hashset_symmetric_difference(PG_FUNCTION_ARGS)
 	int4hashset_t  *result;
 	char		   *bitmapa = seta->data;
 	char		   *bitmapb = setb->data;
-	int32		   *valuesa = (int32 *) (bitmapa + CEIL_DIV(seta->capacity, 8));
-	int32		   *valuesb = (int32 *) (bitmapb + CEIL_DIV(setb->capacity, 8));
+	int32		   *valuesa = (int32 *) SET_DATA_PTR(seta);
+	int32		   *valuesb = (int32 *) SET_DATA_PTR(setb);

 	result = int4hashset_allocate(
 		seta->nelements + setb->nelements,
--
2.34.1
From f94c1261f691c6473cd27de2a8d9465f77e73b40 Mon Sep 17 00:00:00 2001
From: pgaddict <jian.universal...@gmail.com>
Date: Thu, 29 Jun 2023 14:20:00 +0800
Subject: [PATCH 1/2] macro SET_DATA_PTR to quickly access hashset data region

---
 hashset.c | 8 ++++----
 hashset.h | 3 +++
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/hashset.c b/hashset.c
index 91907abc..d786d379 100644
--- a/hashset.c
+++ b/hashset.c
@@ -82,7 +82,7 @@ int4hashset_resize(int4hashset_t * set)

 	/* Calculate the pointer to the bitmap and values array */
 	bitmap = set->data;
-	values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+	values = (int32 *) SET_DATA_PTR(set);

 	for (i = 0; i < set->capacity; i++)
 	{
@@ -132,7 +132,7 @@ int4hashset_add_element(int4hashset_t *set, int32 value)
 	position = hash % set->capacity;

 	bitmap = set->data;
-	values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+	values = (int32 *) SET_DATA_PTR(set);

 	while (true)
 	{
@@ -204,7 +204,7 @@ int4hashset_contains_element(int4hashset_t *set, int32 value)
 	position = hash % set->capacity;

 	bitmap = set->data;
-	values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+	values = (int32 *) SET_DATA_PTR(set);

 	while (true)
 	{
@@ -238,7 +238,7 @@ int4hashset_extract_sorted_elements(int4hashset_t *set)

 	/* Access the data array */
 	char	   *bitmap = set->data;
-	int32	   *values = (int32 *) (set->data + CEIL_DIV(set->capacity, 8));
+	int32	   *values = (int32 *) SET_DATA_PTR(set);

 	/* Counter for the number of extracted elements */
 	int32		nextracted = 0;
diff --git a/hashset.h b/hashset.h
index 86f5d1b0..2513ab55 100644
--- a/hashset.h
+++ b/hashset.h
@@ -42,6 +42,9 @@ typedef struct int4hashset_t {
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } int4hashset_t;

+#define SET_DATA_PTR(a) \
+	(((char *) (a->data)) + CEIL_DIV(a->capacity, 8))
+
 int4hashset_t *int4hashset_allocate(int capacity, float4 load_factor, float4 growth_factor, int hashfn_id);
 int4hashset_t *int4hashset_resize(int4hashset_t * set);
 int4hashset_t *int4hashset_add_element(int4hashset_t *set, int32 value);
--
2.34.1