Re: Propose a new hook for mutating the query bounds

2021-11-18 Thread Xiaozhe Yao
Hi,

Thanks for the previous feedbacks!

> The way the hook is used seems pretty inconvenient, though.

I see the problem, and I agree.

I looked into how other hooks work, and I am wondering if it looks ok if
we: pass a pointer to the hook, and let the hook check if there is any
information applicable. If there is none, the hook just returns False and
we let the rest of the code handle. If it is true, we get the selectivity
from the hook and return it. So something like

```
if (clauselist_selectivity_hook &&
(*clauselist_selectivity_hook) (root, clauses, varRelid, jointype, sjinfo,
use_extended_stats, ))
{
return s1;
}
```

What I am trying to mock is the get_index_stats_hook (
https://github.com/taminomara/psql-hooks/blob/master/Detailed.md#get_index_stats_hook).


Am I understanding your idea correctly and does this look somehow better?

Best regards,
Xiaozhe

On Wed, Nov 17, 2021 at 7:47 PM Tomas Vondra 
wrote:

> On 11/17/21 16:39, Xiaozhe Yao wrote:
> > Hi Tom,
> >
> > Thanks for your feedback. I completely agree with you that a
> > higher-level hook is better suited for this case. I have adjusted the
> > PoC patch to this email.
> >
> > Now it is located in the clauselist_selectivity_ext function, where we
> > first check if the hook is defined. If so, we let the hook estimate the
> > selectivity and return the result. With this one, I can also develop
> > extensions to better estimate the selectivity.
> >
>
> I think clauselist_selectivity is the right level, because this is
> pretty similar to what extended statistics are doing. I'm not sure if
> the hook should be called in clauselist_selectivity_ext or in the plain
> clauselist_selectivity. But it should be in clauselist_selectivity_or
> too, probably.
>
> The way the hook is used seems pretty inconvenient, though. I mean, if
> you do this
>
>  if (clauselist_selectivity_hook)
>  return clauselist_selectivity_hook(...);
>
> then what will happen when the ML model has no information applicable to
> a query? This is called for all relations, all conditions, etc. and
> you've short-circuited all the regular code, so the hook will have to
> copy all of that. Seems pretty silly and fragile.
>
> IMO the right approach is what statext_clauselist_selectivity is doing,
> i.e. estimate clauses, mark them as estimated in a bitmap, and let the
> rest of the existing code take care of the remaining clauses. So more
> something like
>
>  if (clauselist_selectivity_hook)
>  s1 *= clauselist_selectivity_hook(..., );
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..94f7993529 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -26,6 +26,9 @@
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 
+/* Hooks for plugins to get control when we ask for selectivity */
+clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;
+
 /*
  * Data structure for accumulating info about possible range-query
  * clause pairs in clauselist_selectivity.
@@ -130,6 +133,18 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
+	/*
+	* If we have a hook for selectivity estimation and it returns True, then go for it.
+	*/
+	if (clauselist_selectivity_hook && 
+		(*clauselist_selectivity_hook) (root, clauses, varRelid, jointype, sjinfo, use_extended_stats, ))
+		{
+			/*
+			* The hook takes control of estimating the selectivity
+			* and saves the estimated selectivity to s1.
+			*/
+			return s1;
+		}
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
@@ -370,7 +385,19 @@ clauselist_selectivity_or(PlannerInfo *root,
 	Bitmapset  *estimatedclauses = NULL;
 	ListCell   *lc;
 	int			listidx;
-
+	
+	/*
+	 * If we have a hook for selectivity estimation and it returns True, then go for it.
+	 */
+	if (clauselist_selectivity_hook && 
+		(*clauselist_selectivity_hook) (root, clauses, varRelid, jointype, sjinfo, use_extended_stats, ))
+		{
+			/*
+			* The hook takes control of estimating the selectivity
+			* and saves the estimated selectivity to s1.
+			*/
+			return s1;
+		}
 	/*
 	 * Determine if these clauses reference a single relation.  If so, and if
 	 * it has extended statistics, try to apply those.
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 41b49b2662..7e10144822 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -202,4 +202,17 @@ extern int	locate_var_of_level(Node *node, int levelsup);

Re: Propose a new hook for mutating the query bounds

2021-11-17 Thread Xiaozhe Yao
Hi Tom,

Thanks for your feedback. I completely agree with you that a higher-level
hook is better suited for this case. I have adjusted the PoC patch to this
email.

Now it is located in the clauselist_selectivity_ext function, where we
first check if the hook is defined. If so, we let the hook estimate the
selectivity and return the result. With this one, I can also develop
extensions to better estimate the selectivity.

I hope it makes more sense. Also please forgive me if I am understanding
Postgres somehow wrong, as I am quite new to this community :)

Best regards,
Xiaozhe
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..841306ca6c 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -26,6 +26,9 @@
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
 
+/* Hooks for plugins to get control when we ask for selectivity */
+clauselist_selectivity_hook_type clauselist_selectivity_hook = NULL;
+
 /*
  * Data structure for accumulating info about possible range-query
  * clause pairs in clauselist_selectivity.
@@ -130,6 +133,12 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
+	/*
+	* If we have a hook for selectivity estimation, then go for it
+	*/
+	if (clauselist_selectivity_hook)
+		return clauselist_selectivity_hook(root, clauses, varRelid, jointype, sjinfo, use_extended_stats);
+
 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity_ext(). None of what we might do below is relevant.
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 41b49b2662..b8b9b34946 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -202,4 +202,16 @@ extern int	locate_var_of_level(Node *node, int levelsup);
 extern List *pull_var_clause(Node *node, int flags);
 extern Node *flatten_join_alias_vars(Query *query, Node *node);
 
+/* Hooks for external selectivity estimation */
+typedef Selectivity (*clauselist_selectivity_hook_type) (
+		   PlannerInfo *root,
+		   List *clauses,
+		   int varRelid,
+		   JoinType jointype,
+		   SpecialJoinInfo *sjinfo,
+		   bool use_extended_stats);
+
+extern PGDLLIMPORT clauselist_selectivity_hook_type clauselist_selectivity_hook;
+
+
 #endif			/* OPTIMIZER_H */


Re: Propose a new hook for mutating the query bounds

2021-11-17 Thread Xiaozhe Yao
Hi Tomas and Hackers,

Thanks for your reply and feedback!

> I don't understand how you could achieve this by mutating pg_statistic,
without also breaking estimation for queries that only have Y<20.

I agree, if we mutate pg_statistics, we will break lots of stuff and the
process becomes complicated. That's also why I think mutating the bounds
makes more sense and is easier to achieve.

> Maybe, but it's really hard to comment on this without seeing any PoC
patches. We don't know where you you'd like the hook called, what info
would it have access to, how would it tweak the selectivities etc.

I have attached a PoC patch to this mail. Essentially in this patch, I only
try to pass the pointer of the constval in ```scalarineqsql``` function. It
is enough from the Postgres side. With that, I can handle other things in
an independent extension.

I hope this makes sense.

Best regards,
Xiaozhe

On Wed, Nov 17, 2021 at 2:49 PM Tomas Vondra 
wrote:

>
>
> On 11/17/21 2:24 PM, Xiaozhe Yao wrote:
> > Hi hackers,
> >
> > I am currently working on improving the cardinality estimation component
> > in PostgreSQL with machine learning. I came up with a solution that
> > mutates the bounds for different columns. For example, assume that we
> > have a query
> >
> > ```
> > select * from test where X<10 and Y<20;
> > ```
> >
> > Our approach tries to learn the relation between X and Y. For example,
> > if we have a linear relation, Y=X+10. Then Y<20 is essentially
> > equivalent to X<10. Therefore we can mutate the Y<20 to Y > that the selectivity will be 1, and we will have a more accurate
> estimation.
> >
>
> OK. FWIW the extended statistics patch originally included a patch for
> multi-dimensional histograms, and that would have worked for this
> example just fine, I guess. But yeah, there are various other
> dependencies for which a histogram would not help. And ML might discover
> that and help ...
>
> > It seems to me that we can achieve something similar by mutating the
> > pg_statistics, however, mutating the bounds is something more
> > straightforward to me and less expensive.
> >
>
> I don't understand how you could achieve this by mutating pg_statistic,
> without also breaking estimation for queries that only have Y<20.
>
> > I am wondering if it is possible to have such an extension? Or if there
> > is a better solution to this? I have already implemented this stuff in a
> > private repository, and if this is something you like, I can further
> > propose the patch to the list.
> >
>
> Maybe, but it's really hard to comment on this without seeing any PoC
> patches. We don't know where you you'd like the hook called, what info
> would it have access to, how would it tweak the selectivities etc.
>
> If you think this would work, write a PoC patch and we'll see.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 10895fb287..cdf4762c60 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -143,6 +143,8 @@
 /* Hooks for plugins to get control when we ask for stats */
 get_relation_stats_hook_type get_relation_stats_hook = NULL;
 get_index_stats_hook_type get_index_stats_hook = NULL;
+mutate_bounds_hook_type mutate_bounds_hook = NULL;
+
 
 static double eqsel_internal(PG_FUNCTION_ARGS, bool negate);
 static double eqjoinsel_inner(Oid opfuncoid, Oid collation,
@@ -669,6 +671,9 @@ scalarineqsel(PlannerInfo *root, Oid operator, bool isgt, bool iseq,
 
 	fmgr_info(get_opcode(operator), );
 
+	if (mutate_bounds_hook) {
+		mutate_bounds_hook(root, , isgt, iseq);
+	}
 	/*
 	 * If we have most-common-values info, add up the fractions of the MCV
 	 * entries that satisfy MCV OP CONST.  These fractions contribute directly
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 9dd444e1ff..b406b0aa2b 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -145,6 +145,9 @@ typedef bool (*get_index_stats_hook_type) (PlannerInfo *root,
 		   VariableStatData *vardata);
 extern PGDLLIMPORT get_index_stats_hook_type get_index_stats_hook;
 
+typedef bool (*mutate_bounds_hook_type) (PlannerInfo *root,  Datum *bound, bool isgt, bool iseq);
+extern PGDLLIMPORT mutate_bounds_hook_type mutate_bounds_hook;
+
 /* Functions in selfuncs.c */
 
 extern void examine_variable(PlannerInfo *root, Node *node, int varRelid,


Propose a new hook for mutating the query bounds

2021-11-17 Thread Xiaozhe Yao
Hi hackers,

I am currently working on improving the cardinality estimation component in
PostgreSQL with machine learning. I came up with a solution that mutates
the bounds for different columns. For example, assume that we have a query

```
select * from test where X<10 and Y<20;
```

Our approach tries to learn the relation between X and Y. For example, if
we have a linear relation, Y=X+10. Then Y<20 is essentially equivalent to
X<10. Therefore we can mutate the Y<20 to Y