Change comment about removing useless joins.

2024-01-08 Thread ywgrit
After reading the logic of removing useless joins, I think the comment on
this might need to be changed: "Currently, join_is_removable only succeeds
if sjinfo's right hand is a single baserel." could be changed to
"Currently, join_is_removable only succeeds if sjinfo's min_righthand is a
single baserel.", because the useless join in the query "select t1.* from
t1 left join (t2 left join t3 on t3.a=t2.b) on t2.a=t1.a;" would also be
eliminated. That is, the query will be converted to "select t1.* from t1;".
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 6c02fe8908..70e0ae372f 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -112,7 +112,7 @@ restart:
 
 		/*
 		 * Currently, join_is_removable can only succeed when the sjinfo's
-		 * righthand is a single baserel.  Remove that rel from the query and
+		 * min_righthand is a single baserel.  Remove that rel from the query and
 		 * joinlist.
 		 */
 		innerrelid = bms_singleton_member(sjinfo->min_righthand);


Re: planner chooses incremental but not the best one

2023-12-26 Thread ywgrit
Hi, Tomas

Recently, I looked at papers related to estimation of cardinality under
selection. I may be biased towards the scheme provided by the paper
"Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries
and Event Reports". This paper uses distinct sampling as opposed to the
current uniform sampling; the main differences between the two are as
follows.

1) It not only counts the distinct values on each column or combination of
columns, but also saves some rows corresponding to each distinct value,
i.e., it saves part of the rows of the original relation as samples. The
purpose of saving these rows is to accommodate restrictions on the queries,
such as where clauses.

2) The queries are executed on the samples, and the result of the execution
is used as the cardinality estimate.

The advantages of this paper over existing practices are as follows.

1) The samples collected can be applied to arbitrary predicates, e.g.
predicates that are correlated with the columns of group by clauses.

2) The accuracy is very high; in some scenarios, the estimation error can
be hundreds of times smaller than with uniform sampling.

However, the scheme provided in this paper also has some drawbacks: as
mentioned above, it relies on the collected samples, which leads to a
significant increase in the storage overhead of the statistics.

I'd like to hear your opinions.

ywgrit.

ywgrit wrote on Fri, Dec 22, 2023 at 16:20:

> The possible solution for one scenario I can come up with so far is when
> the query's predicate columns and group columns belong to one table.
>
> For a query that contains a where clause, perhaps num_groups could be
> estimated according to the following formula.
>
> num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where
> clause * ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1,
> sort_col_2, ... sort_col_m) / ndistinct(pred_col_1, pred_col_2, ...
> pred_col_n).
>
> ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause =
> ndistinct(pred_col_1, pred_col_2, ... pred_col_n) * selectivity of rel.
>
> pred_col_n belong to the columns involved in the where clause and
> sort_col_m belong to the columns involved in the group by clause.
>
> The reason for multiplying by the selectivity of rel directly is that the
> selectivity of rel depends only on pred_col, not sort_col. So the above
> formula can be simplified as follows.
>
> num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1,
> sort_col_2, ... sort_col_m) * selectivity of rel.
>
> The correctness of the above formula depends on the following conditions.
>
>    - ndistinct(pred_col_1, pred_col_2, ... pred_col_n) and
>      ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1,
>      sort_col_2, ... sort_col_m) statistics already exist, and need to
>      be accurate.
>
>    - Both pred_col_n and sort_col_m are uniformly distributed; if not,
>      statistics such as MCV are needed for correction.
>
>    - The tuples of rel are the number of total tuples of the table, not
>      the number of filtered tuples.
>
> After experimentation in the scenario mentioned in the previous thread,
> the estimated num_groups is 3. The accuracy of the result strongly relies
> on the uniform distribution of b, which allows ndistinct(pred_col_1,
> pred_col_2, ... pred_col_n) with the where clause to be estimated
> accurately.
>
> I'd like to hear your opinions.
>
> Regards.
>
> ywgrit.
>
> Tomas Vondra wrote on Mon, Dec 18, 2023 at 20:53:
>
>>
>>
>> On 12/18/23 11:40, Richard Guo wrote:
>> >
>> > On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra
>> > <tomas.von...@enterprisedb.com> wrote:
>> >
>> > Oh! Now I see what you meant by using the new formula in 84f9a35e3
>> > depending on how we sum tuples. I agree that seems like the right
>> > thing.
>> >
>> > I'm not sure it'll actually help with the issue, though - if I apply
>> > the patch, the plan does not actually change (and the cost changes
>> > just a little bit).
>> >
>> > I looked at this only very briefly, but I believe it's due to the
>> > assumption of independence I mentioned earlier - we end up using the
>> > new formula introduced in 84f9a35e3, but it assumes the selectivity
>> > and number of groups are independent. But that's not the case here,
>> > because the groups are very clearly correlated (with the condition
>> > on "b").
>> >
>> >
>> > You're right.  The patch allows us to adjust the estimate of distinct
>> > values for appendrels

Re: planner chooses incremental but not the best one

2023-12-22 Thread ywgrit
The possible solution for one scenario I can come up with so far is when
the query's predicate columns and group columns belong to one table.

For a query that contains a where clause, perhaps num_groups could be
estimated according to the following formula.

num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where
clause * ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1,
sort_col_2, ... sort_col_m) / ndistinct(pred_col_1, pred_col_2, ...
pred_col_n).

ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause =
ndistinct(pred_col_1, pred_col_2, ... pred_col_n) * selectivity of rel.

pred_col_n belong to the columns involved in the where clause and
sort_col_m belong to the columns involved in the group by clause.

The reason for multiplying by the selectivity of rel directly is that the
selectivity of rel depends only on pred_col, not sort_col. So the above
formula can be simplified as follows.

num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1,
sort_col_2, ... sort_col_m) * selectivity of rel.

The correctness of the above formula depends on the following conditions.

   - ndistinct(pred_col_1, pred_col_2, ... pred_col_n) and
     ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1,
     sort_col_2, ... sort_col_m) statistics already exist, and need to be
     accurate.

   - Both pred_col_n and sort_col_m are uniformly distributed; if not,
     statistics such as MCV are needed for correction.

   - The tuples of rel are the number of total tuples of the table, not
     the number of filtered tuples.

After experimentation in the scenario mentioned in the previous thread, the
estimated num_groups is 3. The accuracy of the result strongly relies on
the uniform distribution of b, which allows ndistinct(pred_col_1,
pred_col_2, ... pred_col_n) with the where clause to be estimated
accurately.

I'd like to hear your opinions.

Regards.

ywgrit.

Tomas Vondra wrote on Mon, Dec 18, 2023 at 20:53:

>
>
> On 12/18/23 11:40, Richard Guo wrote:
> >
> > On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra
> > <tomas.von...@enterprisedb.com> wrote:
> >
> > Oh! Now I see what you meant by using the new formula in 84f9a35e3
> > depending on how we sum tuples. I agree that seems like the right
> > thing.
> >
> > I'm not sure it'll actually help with the issue, though - if I apply
> > the patch, the plan does not actually change (and the cost changes
> > just a little bit).
> >
> > I looked at this only very briefly, but I believe it's due to the
> > assumption of independence I mentioned earlier - we end up using the
> > new formula introduced in 84f9a35e3, but it assumes the selectivity
> > and number of groups are independent. But that's not the case here,
> > because the groups are very clearly correlated (with the condition
> > on "b").
> >
> >
> > You're right.  The patch allows us to adjust the estimate of distinct
> > values for appendrels using the new formula introduced in 84f9a35e3.
> > But if the restrictions are correlated with the grouping expressions,
> > the new formula does not behave well.  That's why the patch does not
> > help in case [1], where 'b' and 'c' are correlated.
> >
> > OTOH, if the restrictions are not correlated with the grouping
> > expressions, the new formula would perform quite well.  And in this case
> > the patch would help a lot, as shown in [2] where estimate_num_groups()
> > gives a much more accurate estimate with the help of this patch.
> >
> > So this patch could be useful in certain situations.  I'm wondering if
> > we should at least have this patch (if it is right).
> >
>
> I do agree the patch seems to do the right thing, and it's worth pushing
> on its own.
>
> >
> > If that's the case, I'm not sure how to fix this :-(
> >
> >
> > The commit message of 84f9a35e3 says
> >
> > This could possibly be improved upon in the future by identifying
> > correlated restrictions and using a hybrid of the old and new
> > formulae.
> >
> > Maybe this is something we can consider trying.  But anyhow this is not
> > an easy task I suppose.
>
> Yeah, if it was easy, it'd have been done in 84f9a35e3 already ;-)
>
> The challenge is where to get usable information about correlation
> between columns. I only have a couple of very rough ideas of what we
> might try. For example, if we have multi-column ndistinct statistics, we
> might look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from
>
> ndistinct(b,c,d) / ndistinct(b,c)
>
> If we know how many distinct values we have for the predicate column, we
&

Re: Fix bug with indexes on whole-row expressions

2023-12-17 Thread ywgrit
Thanks, Tom. Considering the scenario where the indexed column is a
function on a whole-row Var, it's really not a good idea to disable
creating indexes on whole-row expressions. I tried
find_composite_type_dependencies; it seems that this function can only
detect dependencies created by statements such as 'CREATE INDEX
test_tbl1_idx ON test_tbl1((row(x,y)::test_type1))', and cannot detect
dependencies created by statements such as 'CREATE INDEX test_tbl1_idx ON
test_tbl1((test_tbl1))'. After the former SQL statement is executed, 4
rows are added to the pg_depend table, one of which is the index ->
pg_type dependency. After the latter SQL statement is executed, only one
row is added to pg_depend, and there is no index -> pg_type dependency, so
I guess this function doesn't detect all cases of an index on a whole-row
expression. I would suggest doing the detection when the index is created,
because then we can get the details of the index and give a warning in the
way you mentioned.

Tom Lane wrote on Wed, Dec 13, 2023 at 23:01:

> ywgrit  writes:
> > I forbid creating indexes on whole-row expressions in the following
> > patch.
> > I'd like to hear your opinions.
>
> As I said in the previous thread, I don't think this can possibly
> be acceptable.  Surely there are people depending on the capability.
> I'm not worried so much about the exact case of an index column
> being a whole-row Var --- I agree that that's pretty useless ---
> but an index column that is a function on a whole-row Var seems
> quite useful.  (Your patch fails to detect that, BTW, which means
> it does not block the case presented in bug #18244.)
>
> I thought about extending the ALTER TABLE logic to disallow changes
> in composite types that appear in index expressions.  We already have
> find_composite_type_dependencies(), and it turns out that this already
> blocks ALTER for the case you want to forbid, but we concluded that we
> didn't need to prevent it for the bug #18244 case:
>
>  * If objsubid identifies a specific column, refer to that in error
>  * messages.  Otherwise, search to see if there's a user column of the
>  * type.  (We assume system columns are never of interesting types.)
>  * The search is needed because an index containing an expression
>  * column of the target type will just be recorded as a whole-relation
>  * dependency.  If we do not find a column of the type, the dependency
>  * must indicate that the type is transiently referenced in an index
>  * expression but not stored on disk, which we assume is OK, just as
>  * we do for references in views.  (It could also be that the target
>  * type is embedded in some container type that is stored in an index
>  * column, but the previous recursion should catch such cases.)
>
> Perhaps a reasonable answer would be to issue a WARNING (not error)
> in the case where an index has this kind of dependency.  The index
> might need to be reindexed --- but it might not, too, and in any case
> I doubt that flat-out forbidding the ALTER is a helpful idea.
>
> regards, tom lane
>


Fix bug with indexes on whole-row expressions

2023-12-12 Thread ywgrit
Hi, Thomas Munro and Laurenz Albe.

Since I didn't subscribe to the pgsql-hackers mailing list before this bug
was raised, please forgive me for not being able to reply to that thread
by placing my message below the original email.
https://www.postgresql.org/message-id/flat/e48a5d9a2d3d72985d61ee254314f5f5f5444a55.ca...@cybertec.at

I forbid creating indexes on whole-row expressions in the following patch.
I'd like to hear your opinions.


--
Best Wishes,
ywgrit
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index cd23ab3b25..e4451b1d36 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -572,9 +572,18 @@ DefineIndex(Oid relationId,
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
+	ListCell   *lc;
 
 	root_save_nestlevel = NewGUCNestLevel();
 
+	foreach (lc, stmt->indexParams)
+	{
+		IndexElem *ielem = castNode(IndexElem, lfirst(lc));
+		if (IsA(ielem->expr, Var) && castNode(Var, ielem->expr)->varattno == 0)
+			ereport(ERROR,
+	(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+	errmsg("cannot create index on whole-row expression of table '%s'", ielem->indexcolname)));
+	}
 	/*
 	 * Some callers need us to run with an empty default_tablespace; this is a
 	 * necessary hack to be able to reproduce catalog state accurately when


Re: Is the member name of hashctl inappropriate?

2023-09-13 Thread ywgrit
You are right; I came to this erroneous conclusion from the following
mishandled experiment. My test program is shown below.

typedef struct ColumnIdentifier
{
	Oid			relid;
	AttrNumber	resno;
} ColumnIdentifier;

typedef struct ColumnType
{
	ColumnIdentifier colId;
	Oid			vartype;
} ColumnType;

int
ColumnIdentifier_compare(const void *key1, const void *key2, Size keysize)
{
	const ColumnIdentifier *colId_key1 = (const ColumnIdentifier *) key1;
	const ColumnIdentifier *colId_key2 = (const ColumnIdentifier *) key2;

	return colId_key1->relid == colId_key2->relid &&
		colId_key1->resno == colId_key2->resno ? 0 : 1;
}

void *
ColumnIdentifier_copy(void *dest, const void *src, Size keysize)
{
	ColumnIdentifier *colId_dest = (ColumnIdentifier *) dest;
	const ColumnIdentifier *colId_src = (const ColumnIdentifier *) src;

	colId_dest->relid = colId_src->relid;
	colId_dest->resno = colId_src->resno;

	return NULL;				/* not used */
}

	HASHCTL		hashctl;

	hashctl.hash = tag_hash;
	hashctl.match = ColumnIdentifier_compare;
	hashctl.keycopy = ColumnIdentifier_copy;
	hashctl.keysize = sizeof(ColumnIdentifier);
	hashctl.entrysize = sizeof(ColumnType);

	HTAB	   *htab = hash_create("type of column",
								   512 /* nelem */ ,
								   &hashctl,
								   HASH_ELEM | HASH_FUNCTION |
								   HASH_COMPARE | HASH_KEYCOPY);
	ColumnType *entry = NULL;

	ColumnIdentifier *colId = (ColumnIdentifier *)
		MemoryContextAllocZero(CurrentMemoryContext, sizeof(ColumnIdentifier));
	ColumnType *coltype = (ColumnType *)
		MemoryContextAllocZero(CurrentMemoryContext, sizeof(ColumnType));

	coltype->colId.relid = colId->relid = 16384;
	coltype->colId.resno = colId->resno = 1;
	coltype->vartype = INT4OID;

	hash_search(htab, coltype, HASH_ENTER, NULL);
	entry = hash_search(htab, colId, HASH_FIND, NULL);

	Assert(entry->colId.relid == colId->relid);
	Assert(entry->colId.resno == colId->resno);
	Assert(entry->vartype == INT4OID);	/* fails: entry->vartype == 0 */

As shown above, entry->vartype is not assigned when keycopy copies only the
key. I modified ColumnIdentifier_copy as shown below so that keycopy copies
the entire entry.

void *
ColumnIdentifier_copy(void *dest, const void *src, Size keysize)
{
	ColumnType *coltype_dest = (ColumnType *) dest;
	const ColumnType *coltype_src = (const ColumnType *) src;

	coltype_dest->colId.relid = coltype_src->colId.relid;
	coltype_dest->colId.resno = coltype_src->colId.resno;
	coltype_dest->vartype = coltype_src->vartype;

	return NULL;				/* not used */
}

The result is that entry->vartype is now the same as coltype->vartype,
which led me to believe that keycopy "should" copy the entire entry.
Before sending the initial email, I looked at the implementation of
hash_search_with_hash_value and found the line
"hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, keysize)", which made me
wonder how the data field is copied into the HTAB.

But at the time I ignored a note above: "Caller is expected to fill the
data field on return". Now I know that the data field needs to be filled
manually, so it was my misuse. Thanks for the correction!

Thanks

Tom Lane wrote on Wed, Sep 13, 2023 at 09:36:

> ywgrit  writes:
> > According to the description, the keycopy function only copies the key,
> but
> > in reality it copies the entire entry, i.e., the key and the value,
>
> On what grounds do you claim that?  dynahash.c only ever passes "keysize"
> as the size parameter.
>
> regards, tom lane
>


Is the member name of hashctl inappropriate?

2023-09-12 Thread ywgrit
The definition of hashctl is shown below

typedef struct HASHCTL
{
	long		num_partitions; /* # partitions (must be power of 2) */
	long		ssize;			/* segment size */
	long		dsize;			/* (initial) directory size */
	long		max_dsize;		/* limit to dsize if dir size is limited */
	long		ffactor;		/* fill factor */
	Size		keysize;		/* hash key length in bytes */
	Size		entrysize;		/* total user element size in bytes */
	HashValueFunc hash;			/* hash function */
	HashCompareFunc match;		/* key comparison function */
	HashCopyFunc keycopy;		/* key copying function */
	HashAllocFunc alloc;		/* memory allocator */
	MemoryContext hcxt;			/* memory context to use for allocations */
	HASHHDR    *hctl;			/* location of header in shared mem */
} HASHCTL;


/*
* Key copying functions must have this signature. The return value is not
* used. (The definition is set up to allow memcpy() and strlcpy() to be
* used directly.)
*/
typedef void *(*HashCopyFunc) (void *dest, const void *src, Size keysize);

According to the description, the keycopy function only copies the key, but
in reality it copies the entire entry, i.e., the key and the value. Is the
name wrong? This may lead a developer to pass an inappropriate keycopy
function when creating the htab.

Thanks