Re: Optimize WindowAgg's use of tuplestores

2024-07-18 Thread Andy Fan
Ashutosh Bapat  writes:

> On Fri, Jul 12, 2024 at 11:59 AM Ashutosh Bapat
>  wrote:
>>
>>
> Out of curiosity, I measured the performance with just the "unlikely"
> change and with the full patch. Below are the results
> Testing with 100 partitions
...
> latency average = 333.538 ms
>
> with just unlikely change
> Testing with 100 partitions
..
> Testing with 1 partitions
>
> There's noticeable change across all the number of partitions with
> just "unlikely" change.

I'm curious why an 'unlikely' change alone can cause a noticeable
difference, especially since there is just one function call in the
if-statement (my assumption was that the more instructions in the
if-statement body, the more impact the hint could have).

+   if (unlikely(winstate->buffer == NULL))
+   prepare_tuplestore(winstate);
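For readers unfamiliar with the hint: in PostgreSQL, unlikely() expands to
GCC/Clang's __builtin_expect (see src/include/c.h), which tells the compiler
to lay out the code so the hinted-false branch stays off the hot path. Below
is a minimal self-contained sketch of the pattern under discussion; the
ToyWinState names are hypothetical, not the actual WindowAgg code.

```c
#include <stdlib.h>

/*
 * Simplified version of PostgreSQL's hint macro (see src/include/c.h):
 * on GCC/Clang it expands to __builtin_expect, so the compiler arranges
 * the machine code with the hinted-false branch out of line.
 */
#if defined(__GNUC__) || defined(__clang__)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define unlikely(x) (x)
#endif

/* Toy stand-in for the WindowAgg state; names are hypothetical. */
typedef struct ToyWinState
{
	int *buffer;				/* lazily-created tuplestore stand-in */
	int  ncalls;				/* how many times the accessor ran */
} ToyWinState;

static void
toy_prepare_buffer(ToyWinState *state)
{
	state->buffer = malloc(16 * sizeof(int));
}

/*
 * Hot path: the branch is taken only on the very first call, so hinting
 * it as unlikely keeps the common fall-through path dense in the
 * instruction stream.
 */
static int *
toy_get_buffer(ToyWinState *state)
{
	if (unlikely(state->buffer == NULL))
		toy_prepare_buffer(state);
	state->ncalls++;
	return state->buffer;
}
```

Even so, any measurable effect comes from code layout and branch prediction
on the hot path, not from the single call inside the branch body.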


-- 
Best Regards
Andy Fan





Re: New function normal_rand_array function to contrib/tablefunc.

2024-07-17 Thread Andy Fan
Andy Fan  writes:


(just noticed this reply was sent to Jim privately; re-sending it to
the list.)

> Hi Jim,
>
>>
>> When either minval or maxval exceeds int4 the function cannot be
>> executed/found
>>
>> SELECT * FROM normal_rand_array(5, 10, 8, 42::bigint);
>>
>> ERROR:  function normal_rand_array(integer, integer, integer, bigint)
>> does not exist
>> LINE 1: SELECT * FROM normal_rand_array(5, 10, 8, 42::bigint);
>>   ^
>> HINT:  No function matches the given name and argument types. You might
>> need to add explicit type casts.
>> ---
>>
>> SELECT * FROM normal_rand_array(5, 10, 8::bigint, 42);
>>
>> ERROR:  function normal_rand_array(integer, integer, bigint, integer)
>> does not exist
>> LINE 1: SELECT * FROM normal_rand_array(5, 10, 8::bigint, 42);
>>   ^
>> HINT:  No function matches the given name and argument types. You might
>> need to add explicit type casts.
>> ---
>
>>
>> However, when both are int8 it works fine:
>
> I defined the function as below:
>
> postgres=# \df normal_rand_array
>                                        List of functions
>  Schema |       Name        | Result data type |          Argument data types             | Type 
> --------+-------------------+------------------+------------------------------------------+------
>  public | normal_rand_array | SETOF anyarray   | integer, integer, anyelement, anyelement | func
> (1 row)
>
> so it is required that the 3rd and 4th arguments have the same
> data type, which is why your first 2 test cases failed and the third
> one works.  I also think we should not add a test case / documentation
> for this, since this behavior comes from the 'anyelement' polymorphic
> type system.
>

This issue can be fixed with the new API suggested by Dean. 

>>
>> SELECT * FROM normal_rand_array(5, 10, 8::bigint, 42::bigint);
>>
>>         normal_rand_array         
>> ----------------------------------
>>  {29,38,31,10,23,39,9,32}
>>  {8,39,19,31,29,15,17,15,36,20,33,19}
>>  {15,18,42,19}
>>  {16,31,33,11,14,20,24,9,12,17,22,42,41,24,11,41}
>>  {15,11,36,8,28,37}
>> (5 rows)
>> ---
>>
>> Is it the expected behaviour?
>
> Yes, see the above statements.
>
>>
>> In some cases the function returns an empty array. Is it also expected?
>>
>> SELECT count(*)
>> FROM normal_rand_array(10, 10, 8, 42) i
>> WHERE array_length(i,1) IS NULL;
>>
>>  count
>> ---
>>   4533
>> (1 row)
>
> Yes, by design.  I think it is a feature that it can generate the []
> case, which is useful as a special case for testing.  On the
> implementation side, [] means the length is 0, which can happen because
> I choose the 'len' at random from [0 .. len * 2], so 0 is possible and
> doesn't conflict with the declared behavior. 
>
>> In both cases, perhaps mentioning these behaviors in the docs would
>> avoid some confusion.
>
> hmm, it doesn't take much effort to add them, but I feel that would
> make the documentation a bit too verbose/detailed.
>
> Sorry for the late response! 
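
The length selection described above can be sketched as follows; this is a
toy using C's rand() (the toy_* helpers are hypothetical, not the patch's
code): lengths are drawn uniformly from [0, 2*meanlen], so 0-length (empty)
arrays are an expected outcome and the average length converges to meanlen.

```c
#include <stdlib.h>

/*
 * Draw an array length uniformly from [0, 2*meanlen].  The expected
 * value is meanlen, and 0 -- an empty array -- is a legitimate outcome
 * rather than a bug.
 */
static int
toy_pick_len(int meanlen)
{
	return rand() % (2 * meanlen + 1);
}

/* Average many draws; this converges to meanlen. */
static double
toy_mean_len(int meanlen, int ndraws)
{
	long	sum = 0;

	for (int i = 0; i < ndraws; i++)
		sum += toy_pick_len(meanlen);
	return (double) sum / ndraws;
}
```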

-- 
Best Regards
Andy Fan





Re: New function normal_rand_array function to contrib/tablefunc.

2024-07-17 Thread Andy Fan
Dean Rasheed  writes:

> My suggestion would be to mirror the signatures of the core random()
> functions more closely, and have this:
>
> 1). rand_array(numvals int, minlen int, maxlen int)
> returns setof float8[]
>
> 2). rand_array(numvals int, minlen int, maxlen int,
>minval int, maxval int)
> returns setof int[]
>
> 3). rand_array(numvals int, minlen int, maxlen int,
>minval bigint, maxval bigint)
> returns setof bigint[]
>
> 4). rand_array(numvals int, minlen int, maxlen int,
>minval numeric, maxval numeric)
> returns setof numeric[]

These are indeed cleaner and more correct APIs; I will use them in the
next version. Thanks for the suggestion. 

It is just not clear to me how verbose the documentation should be, and
where it should live: tablefunc.sgml, the comment above the function, or
just the code?  Many of the details you mention above and below are
implementation details rather than the API's declared purpose. 

> Something else that's not obvious is that this patch is relying on the
> core random functions, which means that it's using the same PRNG state
> as those functions. That's probably OK, but it should be documented,
> because it's different from tablefunc's normal_rand() function, which
> uses pg_global_prng_state.

My question above applies to this comment as well. 

> In particular, what this means is that
> calling setseed() will affect the output of these new functions, but
> not normal_rand(). Incidentally, that makes writing tests much simpler
> -- just call setseed() first and the output will be repeatable.

Good to know this use case. For example, should this be documented?

>> In some cases the function returns an empty array. Is it also expected?
>>
>
> Perhaps it would be useful to have separate minimum and maximum array
> length arguments, rather than a mean array length argument.

I'm not sure which one is better, but the main use case of this function
is testing, so minimum and maximum array length arguments are fine with
me too.

> 
> Actually, I find the current behaviour somewhat counterintuitive. Only
> after reading the source code did I realise what it's actually doing,
> which is this:
>
> Row 1: array of random length in range [0, meanarraylen]
> Row 2: array of length 2*meanarraylen - length of previous array
> Row 3: array of random length in range [0, meanarraylen]
> Row 4: array of length 2*meanarraylen - length of previous array
> ...
>
> That's far from obvious (it's certainly not documented) and I don't
> think it's a particularly good way of achieving a specified mean array
> length, because only half the lengths are random.

I'm not sure how much this matters in a real use case.
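
Dean's description can be turned into a small sketch (toy code with
hypothetical names, not the patch): pairs of rows always sum to exactly
2*meanarraylen, which is how the mean is achieved even though only half the
lengths are actually random.

```c
#include <stdlib.h>

/*
 * Length of the rownum'th row (0-based) under the scheme Dean describes:
 * even-numbered rows get a random length in [0, meanlen]; each
 * odd-numbered row gets the complement 2*meanlen - prevlen.  Every pair
 * of rows therefore sums to exactly 2*meanlen.
 */
static int
toy_next_len(int meanlen, int *prevlen, int rownum)
{
	int		len;

	if (rownum % 2 == 0)
		len = rand() % (meanlen + 1);	/* the random half */
	else
		len = 2 * meanlen - *prevlen;	/* the deterministic half */
	*prevlen = len;
	return len;
}
```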

> One thing that's definitely needed is more documentation. It should
> have its own new subsection, like the other tablefunc functions.

Is the documentation you have in mind about the '2*meanarraylen -
lastarraylen' behavior?

And what is a "new subsection" -- do you mean something is wrong in
'tablefunc.sgml'?  I did have some issues running 'make html', but the
errors exist before my patch, so I changed the documentation carefully
without testing it.  Do you know how to fix the below error from 'make
html'?


$/usr/bin/xsltproc --nonet --path . --path . --stringparam pg.version '18devel' 
 stylesheet.xsl postgres-full.xml

I/O error : Attempt to load network entity 
http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl
warning: failed to load external entity 
"http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl"
compilation error: file stylesheet.xsl line 6 element import
xsl:import : unable to load 
http://docbook.sourceforge.net/release/xsl/current/xhtml/chunk.xsl
I/O error : Attempt to load network entity 
http://docbook.sourceforge.net/release/xsl/current/common/entities.ent
stylesheet-html-common.xsl:4: warning: failed to load external entity 
"http://docbook.sourceforge.net/release/xsl/current/common/entities.ent"
%common.entities;
 ^
stylesheet-html-common.xsl:124: parser error : Entity 'primary' not defined
  translate(substring(, 1, 1),


> Something else that confused me is why this function is called
> normal_rand_array(). The underlying random functions that it's calling
> return random values from a uniform distribution, not a normal
> distribution. Arrays of normally distributed random values might also
> be useful, but that's not what this patch is doing.

OK, you are right; your new names would be better. 

> Also, the function accepts float8 minval and maxval arguments, and
> then simply ignores them and returns random float8 values in the range
> [0,1), which is highly counterintuitive.

This is an obvious bug, and IIUC it only exists in the float8 case; I
will fix it in the next version.


-- 
Best Regards
Andy Fan





Re: Parallel CREATE INDEX for GIN indexes

2024-07-08 Thread Andy Fan
Andy Fan  writes:

I just realized that all my recent replies went to the sender only,
probably because I upgraded my email client and the shortcut changed
silently. Re-sending the latest one only.

>>> Suppose RBTree's output is:
>>> 
>>> batch-1 at RBTree:
>>> 1  [tid1, tid8, tid100]
>>> 2  [tid1, tid9, tid800]
>>> ...
>>> 78 [tid23, tid99, tid800]
>>> 
>>> batch-2 at RBTree
>>> 1  [tid1001, tid1203, tid1991]
>>> ...
>>> ...
>>> 97 [tid1023, tid1099, tid1800]
>>> 
>>> Since all the tuples in each batch (1, 2, .. 78) are sorted already, we
>>> can just flush them into the tuplesort as a 'run' *without any sorts*;
>>> however, this way it is possible to produce more 'runs' than what
>>> you did in your patch.
>>> 
>>> 
>>
>> Oh! Now I think I understand what you were proposing - you're saying
>> that when dumping the RBTree to the tuplesort, we could tell the
>> tuplesort this batch of tuples is already sorted, and tuplesort might
>> skip some of the work when doing the sort later.
>>
>> I guess that's true, and maybe it'd be useful elsewhere, I still think
>> this could be left as a future improvement. Allowing it seems far from
>> trivial, and it's not quite clear if it'd be a win (it might interfere
>> with the existing sort code in unexpected ways).
>
> Yes, and I agree that can be done later and I'm thinking Matthias's
> proposal is more promising now. 
>
>>> new way: the number of batches depends on the RBTree's batch size.
>>> existing way: the number of batches depends on the size of work_mem in
>>> tuplesort.  Usually the new way would produce more runs, which is
>>> harmful for mergeruns, so I couldn't say whether it is an improvement,
>>> and I didn't include it in my previous patch.
>>> 
>>> However, case 1 sounds like a good candidate for this method.
>>> 
>>> Tuples from state->bs_worker_state after the perform_sort and ctid
>>> merge: 
>>> 
>>> 1 [tid1, tid8, tid100, tid1001, tid1203, tid1991]
>>> 2 [tid1, tid9, tid800]
>>> 78 [tid23, tid99, tid800]
>>> 97 [tid1023, tid1099, tid1800]
>>> 
>>> then when we move tuples to bs_sort_state, a). we don't need to sort at
>>> all. b). we can merge all of them into 1 run which is good for mergerun
>>> on leader as well. That's the thing I did in the previous patch.
>>> 
>>
>> I'm sorry, I don't understand what you're proposing. Could you maybe
>> elaborate in more detail?
>
> After we called "tuplesort_performsort(state->bs_worker_sort);" in
> _gin_process_worker_data, all the tuples in bs_worker_sort are sorted
> already, and in the same function _gin_process_worker_data, we have
> code:
>
> while ((tup = tuplesort_getgintuple(worker_sort, , true)) != NULL)
> {
>
>   (1)
>   
>   tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
>
> }
>
> and later we called 'tuplesort_performsort(state->bs_sortstate);'.  Even
> though we have some CTID merge activity in '(1)', the tuples are still
> ordered, so the sorts (in both tuplesort_putgintuple and
> tuplesort_performsort) are not necessary.  What's more, each
> 'flush-memory-to-disk' in tuplesort creates a 'sorted run', and in this
> case we actually need only 1 run, since all the input tuples in the
> worker are sorted.  The reduction of sorted runs in the worker is
> helpful for the leader's final mergeruns.  This 'sorted-run' benefit
> doesn't exist for case-1 (RBTree -> worker_state). 
>
> If Matthias's proposal is adopted, my optimization will no longer be
> useful; Matthias's proposal looks like a more natural and efficient
> way.

-- 
Best Regards
Andy Fan





Re: Make tuple deformation faster

2024-07-01 Thread Andy Fan
David Rowley  writes:

> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses.  This should have
> helped contribute to the "insn per cycle" increase.  4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs. Maybe someone else can explain
> better than I can. In more simple terms, generally, the higher the
> "insn per cycle", the better. Also, the lower all of the idle and
> branch miss percentages are that's generally also better. However,
> you'll notice that the patched version has more front and backend
> stalls. I assume this is due to performing more instructions per cycle
> from improved branch prediction causing memory and instruction stalls
> to occur more frequently, effectively (I think) it's just hitting the
> next bottleneck(s) - memory and instruction decoding. At least, modern
> CPUs should be able to out-pace RAM in many workloads, so perhaps it's
> not that surprising that "backend cycles idle" has gone up due to such
> a large increase in instructions per cycle due to improved branch
> prediction.

Thanks for the answer; this is another area that deserves exploring.

> It would be nice to see this tested on some modern Intel CPU. A 13th
> series or 14th series, for example, or even any intel from the past 5
> years would be better than nothing.

I have two kinds of CPUs:

a). Intel Xeon Processor (Icelake) on my ECS
b). Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz on a Mac.

My ECS reports " branch-misses", probably because it runs in
virtualization software, and the Mac doesn't support perf yet :( 

-- 
Best Regards
Andy Fan





Re: Parallel CREATE INDEX for GIN indexes

2024-07-01 Thread Andy Fan
Tomas Vondra  writes:

Hi Tomas,

I am in the middle of an incomplete review, and I probably won't have
much time for this because of my internal tasks, so I am just sharing
what I did along with the not-so-good-result patch.

What I tried to do is:

1). remove all the "sort" effort for the state->bs_sort_state tuples,
since its input comes from state->bs_worker_state, which is sorted
already.

2). remove the *partial* "sort" operations between accum.rbtree and
state->bs_worker_state, because the tuples in accum.rbtree are sorted
already. 

Both of them depend on the same API changes. 

1. 
struct Tuplesortstate
{
..
+ bool input_presorted;  /* the tuples are presorted */
+ bool new_tapes;        /* write the in-memory tuples into a new 'run' */
}

and the user can set it during tuplesort_begin_xx(, presorted=true);


2. in tuplesort_puttuple, if memory is full but presorted is
true, we can (a) avoid the sort and (b) reuse the existing 'runs'
to reduce the effort of 'mergeruns', unless new_tapes is set to
true. Once it switches to a new tape, set state->new_tapes to false
and wait for 3) to change it to true again.

3. tuplesort_dumptuples(..);  // dump the tuples in memory and set
new_tapes=true to say *this batch of input was presorted but is now
done; the next batch is presorted within its own batch*.

In the gin-parallel-build case, for case 1) we can just use:

for tuple in bs_worker_sort: 
	tuplesort_putgintuple(state->bs_sortstate, ..);
tuplesort_dumptuples(..);

In the end we get: a). only 1 run in the worker, so that the leader has
fewer runs to merge in mergeruns; b). less sorting, both in
perform_sort_tuplesort and in sortstate_puttuple_common.

For case 2) we can have:

   for tuple in RBTree.tuples:
	  tuplesort_puttuples(tuple);
	  // this may cause a dumptuples internally when the memory is full,
	  // but that is OK.
tuplesort_dumptuples(..);

This removes the "sort" from sortstate_puttuple_common, but probably
increases the number of 'runs' in the sortstate, which increases the
effort of mergeruns at the end.
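
To make steps 1-3 concrete, here is a toy model (hypothetical Toy* names and
a tiny in-memory buffer standing in for work_mem, not the real tuplesort
API): presorted input keeps appending its spills to one run with no sort at
all, while unsorted input sorts every spill and opens a new run each time.

```c
#include <stdbool.h>
#include <stdlib.h>

#define TOY_MEM 4				/* tiny "work_mem", in tuples */

/* Toy stand-in for Tuplesortstate with the proposed fields. */
typedef struct ToySortState
{
	int		mem[TOY_MEM];
	int		nmem;
	bool	input_presorted;	/* caller promises sorted input */
	bool	start_new_run;		/* next flush opens a new run */
	int		nruns;				/* sorted runs created so far */
	int		nsorts;				/* qsort calls performed */
} ToySortState;

static int
toy_cmp(const void *a, const void *b)
{
	return *(const int *) a - *(const int *) b;
}

/*
 * Flush the in-memory batch.  Presorted input is appended to the
 * current run with no sort; unsorted input is sorted and always starts
 * a new run, because independently sorted batches can't share one.
 */
static void
toy_flush(ToySortState *st)
{
	if (st->nmem == 0)
		return;
	if (!st->input_presorted)
	{
		qsort(st->mem, st->nmem, sizeof(int), toy_cmp);
		st->nsorts++;
		st->start_new_run = true;
	}
	if (st->start_new_run || st->nruns == 0)
		st->nruns++;
	st->start_new_run = !st->input_presorted;
	st->nmem = 0;
}

static void
toy_puttuple(ToySortState *st, int v)
{
	if (st->nmem == TOY_MEM)
		toy_flush(st);			/* memory full: spill mid-batch */
	st->mem[st->nmem++] = v;
}

/*
 * The proposed tuplesort_dumptuples(): flush and declare the presorted
 * batch finished, so the next batch starts its own run.
 */
static void
toy_end_batch(ToySortState *st)
{
	toy_flush(st);
	st->start_new_run = true;
}
```

With presorted input the worker ends with a single run and zero sorts, which
is exactly the property that would help the leader's mergeruns.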

But the test results are not good; maybe the 'sort' is not a key factor
here. I did miss the perf step before doing this, or maybe my test data
is too small. 

Here is the patch I used for the above activity, and the following SQL
I used for testing. 

CREATE TABLE t(a int[], b numeric[]);

-- generate 1000 * 1000 rows.
insert into t select i, n
from normal_rand_array(1000, 90, 1::int4, 1::int4) i,
normal_rand_array(1000, 90, '1.00233234'::numeric, '8.239241989134'::numeric) n;

alter table t set (parallel_workers=4);
set debug_parallel_query to on;
set max_parallel_maintenance_workers to 4;

create index on t using gin(a);
create index on t using gin(b);

I found normal_rand_array handy to use in this case, and I registered
it at https://commitfest.postgresql.org/48/5061/.

Besides the above, I didn't find anything wrong in the current patch,
and the above items can be categorized as "future improvements", even
though they are worth doing. 

-- 
Best Regards
Andy Fan

>From 48c2e03fd854c8f88f781adc944f37b004db0721 Mon Sep 17 00:00:00 2001
From: Andy Fan 
Date: Sat, 8 Jun 2024 13:21:08 +0800
Subject: [PATCH v20240702 1/3] Add function normal_rand_array function to
 contrib/tablefunc.

It can produce arrays of numbers with a controllable array length and
duplicated elements in these arrays.
---
 contrib/tablefunc/Makefile|   2 +-
 contrib/tablefunc/expected/tablefunc.out  |  26 
 contrib/tablefunc/sql/tablefunc.sql   |  10 ++
 contrib/tablefunc/tablefunc--1.0--1.1.sql |   7 ++
 contrib/tablefunc/tablefunc.c | 140 ++
 contrib/tablefunc/tablefunc.control   |   2 +-
 doc/src/sgml/tablefunc.sgml   |  10 ++
 src/backend/utils/adt/arrayfuncs.c|   7 ++
 8 files changed, 202 insertions(+), 2 deletions(-)
 create mode 100644 contrib/tablefunc/tablefunc--1.0--1.1.sql

diff --git a/contrib/tablefunc/Makefile b/contrib/tablefunc/Makefile
index 191a3a1d38..f0c67308fd 100644
--- a/contrib/tablefunc/Makefile
+++ b/contrib/tablefunc/Makefile
@@ -3,7 +3,7 @@
 MODULES = tablefunc
 
 EXTENSION = tablefunc
-DATA = tablefunc--1.0.sql
+DATA = tablefunc--1.0.sql tablefunc--1.0--1.1.sql
 PGFILEDESC = "tablefunc - various functions that return tables"
 
 REGRESS = tablefunc
diff --git a/contrib/tablefunc/expected/tablefunc.out b/contrib/tablefunc/expected/tablefunc.out
index ddece79029..9f0cbbfbbe 100644
--- a/contrib/tablefunc/expected/tablefunc.out
+++ b/contrib/tablefunc/expected/tablefunc.out
@@ -12,6 +12,32 @@ SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
 -- negative number of tuples
 SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
 ERROR:  number of rows cannot be negative
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::numeric, 8::numeric) as i;
+ count |avg

Re: Make tuple deformation faster

2024-07-01 Thread Andy Fan
David Rowley  writes:

> Currently, TupleDescData contains the descriptor's attributes in a
> variable length array of FormData_pg_attribute allocated within the
> same allocation as the TupleDescData. According to my IDE,
> sizeof(FormData_pg_attribute) == 104 bytes. It's that large mainly due
> to attname being 64 bytes. The TupleDescData.attrs[] array could end
> up quite large on tables with many columns and that could result in
> poor CPU cache hit ratios when deforming tuples.
...
>
> To test the performance of this, I tried using the attached script
> which creates a table where the first column is a variable length
> column and the final column is an int.  The query I ran to test the
> performance inserted 1 million rows into this table and performed a
> sum() on the final column.  The attached graph shows that the query is
> 30% faster than master with 15 columns between the first and last
> column.

Yet another wonderful optimization! I just want to know how you found
this optimization (CPU cache hit) case and decided it was worth some
time, because before we invest our time in optimizing something, it is
better to know that we can get some measurable improvement afterwards.
For this case, 30% is really a huge number, even if it is an
artificial case.

Another case: Andrew introduced NullableDatum 5 years ago and said that
using it in TupleTableSlot could be CPU-cache friendly. I can follow
that, but how much can it improve in an ideal case -- is it possible to
forecast it somehow? I ask here because both cases are optimizing for
the CPU cache.
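
To make the cache argument concrete without perf, here is a toy comparison
(hypothetical Toy* structs, NOT the real catalog definitions, just shapes
with similar sizes): the NAMEDATALEN-sized attname pushes the full struct
past one 64-byte cache line, while a struct trimmed to the fields tuple
deforming actually needs packs several entries per line.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy stand-in for a full attribute entry: attname alone is
 * NAMEDATALEN (64) bytes, which dominates the struct size.
 */
typedef struct ToyFullAttr
{
	char		attname[64];
	uint32_t	atttypid;
	int16_t		attlen;
	int16_t		attnum;
	int32_t		atttypmod;
	char		attalign;
	bool		attbyval;
	bool		attnotnull;
} ToyFullAttr;

/* Only the fields tuple deforming actually touches. */
typedef struct ToyCompactAttr
{
	int32_t		atttypmod;
	int16_t		attlen;
	char		attalign;
	bool		attbyval;
} ToyCompactAttr;

/* How many attribute entries fit in a single 64-byte cache line. */
static unsigned
toy_attrs_per_cacheline(unsigned long attrsize)
{
	return attrsize >= 64 ? 1 : (unsigned) (64 / attrsize);
}
```

With roughly one full-size entry per cache line, deforming an N-column tuple
touches about N lines of attribute metadata; the compact layout cuts that
several-fold, which is one way to see where a large win can come from.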


-- 
Best Regards
Andy Fan





Re: Question about maxTapes & selectnewtape & dumptuples

2024-06-30 Thread Andy Fan
Heikki Linnakangas  writes:

> On 30/06/2024 12:48, Andy Fan wrote:

>> for example, at the first use of outputTapes[x], it stores (1, 3, 5,
>> 7),
>> and later (2, 4, 6, 8) are put into it.  so the overall of (1, 3, 5, 7,
>> 2, 4, 6, 8) are not sorted? Where did I go wrong?
>
> There's a distinction between "runs" and "tapes". Each "run" is sorted,
> but there can be multiple runs on a tape. In that case, multiple merge
> passes are needed. Note the call to markrunend() between the runs. In
> your example, the tape would look like (1, 3, 5, 7, , 2, 4,
> 6, 8).

I see, Thank you for your answer!
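
Heikki's distinction can be sketched with a toy "tape" in which a sentinel
value plays the role of the markrunend() marker (a hypothetical helper, not
tuplesort code): each run between markers must be sorted, while the tape as
a whole need not be.

```c
#include <limits.h>
#include <stdbool.h>

#define TOY_RUNEND INT_MIN		/* stand-in for the markrunend() marker */

/*
 * Check that every run on the toy tape is individually sorted.  The
 * tape as a whole is allowed to go "backwards" across a run boundary.
 */
static bool
toy_runs_sorted(const int *tape, int len)
{
	for (int i = 1; i < len; i++)
	{
		if (tape[i] == TOY_RUNEND || tape[i - 1] == TOY_RUNEND)
			continue;			/* a new run may restart lower */
		if (tape[i] < tape[i - 1])
			return false;
	}
	return true;
}
```

The example tape {1, 3, 5, 7, TOY_RUNEND, 2, 4, 6, 8} passes this check even
though 7 > 2 across the boundary; such a tape needs an extra merge pass
before it becomes a single sorted run.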

-- 
Best Regards
Andy Fan





Question about maxTapes & selectnewtape & dumptuples

2024-06-30 Thread Andy Fan


Hi,

Merge sort requires all the tuples in each input to be pre-sorted (1),
and in tuplesort.c, when work_mem is full, we dumptuples into
destTape (2).  It looks to me like we would need an unlimited number of
tapes if the number of input tuples is big enough. However, we set a
maxTapes in 'inittapes', and we may reuse an already-used tape for
storing new tuples in 'selectnewtape'. Wouldn't this break
prerequisite (1)? 

selectnewtape(Tuplesortstate *state)
{
	if (state->nOutputTapes < state->maxTapes)
	{..}
	else
	{
		/*
		 * We have reached the max number of tapes.  Append to an existing
		 * tape.
		 */
		state->destTape = state->outputTapes[state->nOutputRuns %
											 state->nOutputTapes];
		state->nOutputRuns++;
	}
}


for example, at the first use of outputTapes[x], it stores (1, 3, 5, 7),
and later (2, 4, 6, 8) are put into it.  so the overall of (1, 3, 5, 7,
2, 4, 6, 8) are not sorted? Where did I go wrong?

-- 
Best Regards
Andy Fan





Re: cost delay brainstorming

2024-06-25 Thread Andy Fan
Andy Fan  writes:

>
>> - Longrunning transaction prevents increasing relfrozenxid, we run autovacuum
>>   over and over on the same relation, using up the whole cost budget. This is
>>   particularly bad because often we'll not autovacuum anything else, building
>>   up a larger and larger backlog of actual work.
>
> Could we maintain a pg_class.last_autovacuum_min_xid during vacuum, so
> that we compare the OldestXminXid with pg_class.last_autovacuum_min_xid
> before doing the real work?

Maintaining the oldestXminXid on this relation might be expensive. 

>>
>> - Tables, where on-access pruning works very well, end up being vacuumed far
>>   too frequently, because our autovacuum scheduling doesn't know about tuples
>>   having been cleaned up by on-access pruning.
>
> Good to know this case. If we updated the pg_stats_xx metrics during
> on-access pruning, would that be helpful here?

I got the answer myself: it doesn't work because on-access pruning works
at the per-index level, and one relation may have many indexes, while
pg_stats_xx works at the relation level. So the above proposal doesn't
work. 

-- 
Best Regards
Andy Fan





Re: cost delay brainstorming

2024-06-21 Thread Andy Fan
> - Longrunning transaction prevents increasing relfrozenxid, we run autovacuum
>   over and over on the same relation, using up the whole cost budget. This is
>   particularly bad because often we'll not autovacuum anything else, building
>   up a larger and larger backlog of actual work.

Could we maintain a pg_class.last_autovacuum_min_xid during vacuum, so
that we compare the OldestXminXid with pg_class.last_autovacuum_min_xid
before doing the real work? I think we can use an in-place update on it
to avoid creating too many versions of pg_class tuples when updating
pg_class.last_autovacuum_min_xid.

>
> - Tables, where on-access pruning works very well, end up being vacuumed far
>   too frequently, because our autovacuum scheduling doesn't know about tuples
>   having been cleaned up by on-access pruning.

Good to know this case. If we updated the pg_stats_xx metrics during
on-access pruning, would that be helpful here? 

> - Larger tables with occasional lock conflicts cause autovacuum to be
>   cancelled and restarting from scratch over and over. If that happens before
>   the second table scan, this can easily eat up the whole cost budget without
>   making forward progress.

Off-peak time + manual vacuum should be helpful I think.

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-06-21 Thread Andy Fan


Hi,

Let me clarify the current state of this patch and what kind of help is
needed.  You can check [1] for the kind of problem it tries to solve
and the current approach.

The current blocker is whether the approach is safe (potential memory
leaks, crashes, etc.), and I want to get some agreement on this step at
least.

"""
When we access [some desired] toast column in EEOP_{SCAN|INNER_OUTER}_VAR
steps, we can detoast it immediately and save it back to
slot->tts_values. so the later ExprState can use the same detoast version
all the time.  this should be more the most effective and simplest way.
"""

IMO, it doesn't break any design aspect of TupleTableSlot and should be
safe. Here is my understanding of TupleTableSlot; please correct me if
I missed anything.

(a). TupleTableSlot defines tts_values[*] / tts_isnull[*] at the high level.
  When the upper layer wants attribute n, it just accesses tts_values/isnull[n].

(b). TupleTableSlot has many different variants, which differ in how
  tts_values[*] is filled from them -- e.g. VirtualTupleTableSlot,
  HeapTupleTableSlot -- and in how their resources (mainly memory) are
  managed when the life cycle is done, for example BufferHeapTupleTableSlot. 
  
(c). TupleTableSlot defines a lot of operations on top of that, like
getsomeattrs, copy, clear and so on, for the different variants.  
  
The impacts of my proposal on the above factors:

(a). it clearly doesn't break this.
(b). it adds a 'detoast' step to (b) when filling tts_values[*]
(this is done conditionally, see [1] for the overall design; we
currently focus on the safety). What's more, I only added the step in the
EEOP_{SCAN|INNER|OUTER}_VAR_TOAST steps, which limits the impact on the
other steps.
(c). a special MemoryContext in TupleTableSlot is introduced for the
detoast datums; it is reset whenever TupleTableSlot.tts_nvalid is set
to 0. I also double-checked every operation defined in
TupleTableSlotOps, and it looks like nothing needs special handling.
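
Point (c) can be sketched with a toy model (hypothetical ToySlot names, not
the real slot code): "detoasted" values live in a per-slot arena whose
lifetime is tied to tts_nvalid, so resetting tts_nvalid to 0 releases them
all in one step, the way the proposed MemoryContext reset would.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Toy slot: a per-slot arena holds "detoasted" copies, and
 * tts_values[*] point into it.  Clearing the slot resets the arena,
 * mimicking the proposed MemoryContext reset when tts_nvalid = 0.
 */
typedef struct ToySlot
{
	int		tts_nvalid;
	char   *tts_values[8];		/* detoasted datums point into arena */
	char   *arena;				/* stand-in for the detoast MemoryContext */
	size_t	arena_used;
} ToySlot;

static void
toy_slot_store_detoasted(ToySlot *slot, int attnum, const char *val)
{
	size_t	len = strlen(val) + 1;

	if (slot->arena == NULL)
		slot->arena = malloc(4096);
	memcpy(slot->arena + slot->arena_used, val, len);
	slot->tts_values[attnum] = slot->arena + slot->arena_used;
	slot->arena_used += len;
	if (attnum >= slot->tts_nvalid)
		slot->tts_nvalid = attnum + 1;
}

/* Invalidating the slot resets the detoast arena in the same step. */
static void
toy_slot_clear(ToySlot *slot)
{
	slot->tts_nvalid = 0;
	slot->arena_used = 0;		/* "MemoryContextReset" */
}
```

The point of the sketch: the detoasted datum's lifetime exactly tracks the
validity of tts_values[*], so no caller can see a dangling pointer without
first seeing tts_nvalid = 0.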

Robert expressed his concern as below and wanted Andres to have a look,
but unluckily Andres hasn't had the time so far :( 

> Robert Haas  writes:
>
>> On Wed, May 22, 2024 at 9:46 PM Andy Fan  wrote:
>>> Please give me one more chance to explain this. What I mean is:
>>>
>>> Take SELECT f(a) FROM t1 join t2...; for example,
>>>
>>> When we read the Datum of a Var, we read it from tts->tts_values[*], no
>>> matter what kind of TupleTableSlot is.  So if we can put the "detoast
>>> datum" back to the "original " tts_values, then nothing need to be
>>> changed.
>>
>> Yeah, I don't think you can do that.
>>
>>> - Saving the "detoast datum" version into tts_values[*] doesn't break
>>>   the above semantic and the ExprEval engine just get a detoast version
>>>   so it doesn't need to detoast it again.
>>
>> I don't think this is true. If it is true, it needs to be extremely
>> well-justified. Even if this seems to work in simple cases, I suspect
>> there will be cases where it breaks badly, resulting in memory leaks
>> or server crashes or some other kind of horrible problem that I can't
>> quite imagine. Unfortunately, my knowledge of this code isn't
>> fantastic, so I can't say exactly what bad thing will happen, and I
>> can't even say with 100% certainty that anything bad will happen, but
>> I bet it will. It seems like it goes against everything I understand
>> about how TupleTableSlots are supposed to be used. (Andres would be
>> the best person to give a definitive answer here.)

I think having a discussion of the design of TupleTableSlots would help
make progress, since to my knowledge the proposal doesn't go against
anything in TupleTableSlots, as I stated above. I'm certainly open on
this. 

>>
>>> - The keypoint is the memory management and effeiciency. for now I think
>>>   all the places where "slot->tts_nvalid" is set to 0 means the
>>>   tts_values[*] is no longer validate, then this is the place we should
>>>   release the memory for the "detoast datum".  All the other places like
>>>   ExecMaterializeSlot or ExecCopySlot also need to think about the
>>>   "detoast datum" as well.
>>
>> This might be a way of addressing some of the issues with this idea,
>> but I doubt it will be acceptable. I don't think we want to complicate
>> the slot API for this feature. There's too much downside to doing
>> that, in terms of performance and understandability.

Complexity: I double-checked the code; it makes very limited changes to
the TupleTableSlot APIs (fewer than 10 lines); see tuptable.h.

Performance: by design, it just moves the detoast action
*conditionally* and puts the result back into tts_values[*]; there is
no extra overhead to find the detoast version for sharing.

Understandability: depends on how we understand TupleTableSlot :) 

[1] https://www.postgresql.org/message-id/87il4jrk1l.fsf%40163.com

-- 
Best Regards
Andy Fan





configure error when CFLAGS='-Wall -Werror

2024-06-20 Thread Andy Fan


Hi,

I rely on some compiler checks to catch simple coding issues; I use
clang 18.1.6 for now. However, "CFLAGS='-Wall -Werror' ./configure"
fails, while running './configure' directly is OK. I'm not sure why
that happens. More details below:

(master)> echo $CC
clang
(master)> clang --version
clang version 18.1.6 (https://gitee.com/mirrors/llvm-project.git 
1118c2e05e67a36ed8ca250524525cdb66a55256)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/bin

(master)> CFLAGS='-Wall -Werror ' ./configure

checking for clang option to accept ISO C89... unsupported
checking for clang option to accept ISO C99... unsupported
configure: error: C compiler "clang" does not support C99

In config.log, we can see:

configure:4433: clang -qlanglvl=extc89 -c -Wall -Werror   conftest.c >&5
clang: error: unknown argument: '-qlanglvl=extc89'

and clang doesn't support -qlanglvl.

in 'configure', the related code is:

"""
for ac_arg in '' -std=gnu99 -std=c99 -c99 -AC99 -D_STDC_C99= -qlanglvl=extc99
do
  CC="$ac_save_CC $ac_arg"
  if ac_fn_c_try_compile "$LINENO"; then :
  ac_cv_prog_cc_c99=$ac_arg
fi
rm -f core conftest.err conftest.$ac_objext
  test "x$ac_cv_prog_cc_c99" != "xno" && break
done
rm -f conftest.$ac_ext
CC=$ac_save_CC



# Error out if the compiler does not support C99, as the codebase
# relies on that.
if test "$ac_cv_prog_cc_c99" = no; then
as_fn_error $? "C compiler \"$CC\" does not support C99" "$LINENO" 5
fi
"""

So my questions are:

1. Given that clang never supports '-qlanglvl', why does removing the
CFLAGS matter?

2. If you are using clang as well, what CFLAGS do you use that work?
For example: IIRC, clang by default doesn't report an error when a
variable is set but not used; we have to add some extra flags to get
that.

-- 
Best Regards
Andy Fan





New function normal_rand_array function to contrib/tablefunc.

2024-06-08 Thread Andy Fan

Here is a new function which can produce arrays of numbers with a
controllable array length and duplicated elements in these arrays. I
used it when working with gin indexes, and I think it would be helpful
for others as well, so here it is.

select * from normal_rand_array(5, 10, 1.8::numeric, 3.5::numeric);
   normal_rand_array   
---
 {3.3,2.3,2.7,3.2,2.0,2.7,3.4,2.7,2.3,2.9}
 {3.3,1.8,2.9,3.4,2.0,1.8,2.0,3.5,2.8,2.5}
 {2.1,1.9,2.3,1.9,2.5,2.7,2.4,2.9,1.8}
 {2.3,2.5,2.4,2.7,2.7,2.3,2.9,3.3,3.3,1.9,3.5}
 {2.8,3.4,2.7,1.8,3.3,2.3,2.2,3.5,2.6,2.5}
(5 rows)

select * from normal_rand_array(5, 10, 1.8::int4, 3.5::int4);
  normal_rand_array  
-
 {3,2,2,3,4,2}
 {2,4,2,3,3,3,3,2,2,3,3,2,3,2}
 {2,4,3}
 {4,2,3,4,2,4,2,2,3,4,3,3,2,4,4,2,3}
 {4,3,3,4,3,3,4,2,4}
(5 rows)

The 5 means it needs to produce 5 rows in total, the 10 is the average
array length, 1.8 is the minimum value for the random function, and 3.5
is the maximum value.

-- 
Best Regards
Andy Fan

From 397dcaf67f29057b80aebbb6116b49ac8344547c Mon Sep 17 00:00:00 2001
From: Andy Fan 
Date: Sat, 8 Jun 2024 13:21:08 +0800
Subject: [PATCH v20240608 1/1] Add function normal_rand_array function to
 contrib/tablefunc.

It can produce arrays of numbers with a controllable average array
length and duplicated elements within the arrays.
---
 contrib/tablefunc/Makefile|   2 +-
 contrib/tablefunc/expected/tablefunc.out  |  26 
 contrib/tablefunc/sql/tablefunc.sql   |  10 ++
 contrib/tablefunc/tablefunc--1.0--1.1.sql |   7 ++
 contrib/tablefunc/tablefunc.c | 140 ++
 contrib/tablefunc/tablefunc.control   |   2 +-
 doc/src/sgml/tablefunc.sgml   |  10 ++
 src/backend/utils/adt/arrayfuncs.c|   7 ++
 8 files changed, 202 insertions(+), 2 deletions(-)
 create mode 100644 contrib/tablefunc/tablefunc--1.0--1.1.sql

diff --git a/contrib/tablefunc/Makefile b/contrib/tablefunc/Makefile
index 191a3a1d38..f0c67308fd 100644
--- a/contrib/tablefunc/Makefile
+++ b/contrib/tablefunc/Makefile
@@ -3,7 +3,7 @@
 MODULES = tablefunc
 
 EXTENSION = tablefunc
-DATA = tablefunc--1.0.sql
+DATA = tablefunc--1.0.sql tablefunc--1.0--1.1.sql
 PGFILEDESC = "tablefunc - various functions that return tables"
 
 REGRESS = tablefunc
diff --git a/contrib/tablefunc/expected/tablefunc.out b/contrib/tablefunc/expected/tablefunc.out
index ddece79029..9f0cbbfbbe 100644
--- a/contrib/tablefunc/expected/tablefunc.out
+++ b/contrib/tablefunc/expected/tablefunc.out
@@ -12,6 +12,32 @@ SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
 -- negative number of tuples
 SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
 ERROR:  number of rows cannot be negative
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::numeric, 8::numeric) as i;
+ count |avg 
+---+
+10 | 3.
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int4, 8::int4) as i;
+ count |avg 
+---+
+10 | 3.
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int8, 8::int8) as i;
+ count |avg 
+---+
+10 | 3.
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::float8, 8::float8) as i;
+ count |avg 
+---+
+10 | 3.
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 'abc'::text, 'def'::text) as i;
+ERROR:  unsupported type 25 in normal_rand_array.
 --
 -- crosstab()
 --
diff --git a/contrib/tablefunc/sql/tablefunc.sql b/contrib/tablefunc/sql/tablefunc.sql
index 0fb8e40de2..dec57cfc66 100644
--- a/contrib/tablefunc/sql/tablefunc.sql
+++ b/contrib/tablefunc/sql/tablefunc.sql
@@ -8,6 +8,16 @@ SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
 -- negative number of tuples
 SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
 
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::numeric, 8::numeric) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int4, 8::int4) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int8, 8::int8) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::float8, 8::float8) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 'abc'::text, 'def'::text) as i;
+
 --
 -- crosstab()
 --
diff --git a/contrib/tablefunc/tablefunc--1.0--1.1.sql b/contrib/tablefunc/tablefunc--1.0--1.1

differential test coverage when working on a patch

2024-06-03 Thread Andy Fan


When testing my own patches or reviewing others' patches, I want to
know whether the new code has been tested; however, our current 'make
coverage-html' shows coverage for all of the code rather than just the
new code. Is there a good way to answer that question?

I looked at lcov at [1], and options like '--diff-file' or
'--select-script' look very promising, but all of them need some time
to try out and then automate, so I'd like to ask here first.

[1] https://github.com/linux-test-project/lcov/blob/master/README 

-- 
Best Regards
Andy Fan





Re: why memoize is not used for correlated subquery

2024-05-28 Thread Andy Fan



> I imagined making this work by delaying the plan creation for
> subqueries until the same time as create_plan() for the outer query.

Do you mean sublinks rather than subqueries? If so, we can get another
benefit that I want very much.

explain (costs off) select * from t1 where t1.a = 1
and exists (select 1 from t2 where  t2.a = t1.a and 
random() > 0);
  QUERY PLAN   
---
 Seq Scan on t1
   Filter: ((a = 1) AND EXISTS(SubPlan 1))
   SubPlan 1
 ->  Seq Scan on t2
   Filter: ((a = t1.a) AND (random() > '0'::double precision))

As of now, when we are planning the sublinks, we don't know that
t1.a = 1, which may lose some optimization chances. Considering that
t2.a is a partition key of t2, this could yield a big improvement when
planning queries over a large number of partitioned tables.

-- 
Best Regards
Andy Fan





Re: Parallel CREATE INDEX for GIN indexes

2024-05-28 Thread Andy Fan


Hi Tomas,

I have completed my first round of review; generally it looks good to
me, though more testing needs to be done in the next few days. Here are
some tiny comments from my side, just FYI.

1. The comment about GinBuildState.bs_leader doesn't look right to me.
   
/*
 * bs_leader is only present when a parallel index build is performed, 
and
 * only in the leader process. (Actually, only the leader process has a
 * GinBuildState.)
 */
GinLeader  *bs_leader;

In the worker function _gin_parallel_build_main,
initGinState(, indexRel); is called, and the workers use at least the
following members: buildstate.funcCtx,
buildstate.accum and so on.  So is the comment "only the leader process
has a GinBuildState" correct?

2. progress argument is not used?
_gin_parallel_scan_and_build(GinBuildState *state,
 GinShared *ginshared, 
Sharedsort *sharedsort,
 Relation heap, 
Relation index,
 int sortmem, bool 
progress)


3. In function tuplesort_begin_index_gin, the comment about nKeys took
me some time: why is 1 correct (rather than
IndexRelationGetNumberOfKeyAttributes), and what does "only the index
key" mean?

base->nKeys = 1;/* Only the index key */

Finally I think it is because a GIN index stores each attribute value in
an individual index entry for a multi-column index, so each index entry
has only 1 key.  So could we comment it as follows?

"A GIN index stores the value of each attribute in a separate index
entry for a multi-column index, so each index entry always has only 1
key."  This probably makes it easier to understand.


4. GinBuffer: The comment "Similar purpose to BuildAccumulator, but much
simpler." made me wonder why we need a simpler but similar structure.
After some thought, they are similar only in accumulating TIDs;
GinBuffer is designed for the "same key value" case (hence
GinBufferCanAddKey). So IMO the first comment is good enough, and the
second comment introduces confusion for newcomers and could be removed.

/*
 * State used to combine accumulate TIDs from multiple GinTuples for the same
 * key value.
 *
 * XXX Similar purpose to BuildAccumulator, but much simpler.
 */
typedef struct GinBuffer


5. GinBuffer: ginMergeItemPointers always allocates new memory for the
new items, and hence we have to pfree the old memory each time. However,
that is not necessary in some places; for example, the new items could
simply be appended to Buffer->items (and this should be a common case).
So could we pre-allocate some space for items, reducing the number of
pfree/palloc calls and saving some copying of TID items in the common
case?

6. GinTuple.ItemPointerData first;  /* first TID in the array */

Is ItemPointerData.ip_blkid good enough for its purpose?  If so, we can
save the memory of an OffsetNumber for each GinTuple.

Items 5) and 6) need some coding and testing. If that is OK to do, I'd
like to take them as an exercise in this area (also including items
1-4).


-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-05-23 Thread Andy Fan


Robert Haas  writes:

> On Wed, May 22, 2024 at 9:46 PM Andy Fan  wrote:
>> Please give me one more chance to explain this. What I mean is:
>>
>> Take SELECT f(a) FROM t1 join t2...; for example,
>>
>> When we read the Datum of a Var, we read it from tts->tts_values[*], no
>> matter what kind of TupleTableSlot it is.  So if we can put the "detoast
>> datum" back into the original tts_values, then nothing needs to be
>> changed.
>
> Yeah, I don't think you can do that.
>
>> - Saving the "detoast datum" version into tts_values[*] doesn't break
>>   the above semantics and the ExprEval engine just gets a detoasted version
>>   so it doesn't need to detoast it again.
>
> I don't think this is true. If it is true, it needs to be extremely
> well-justified. Even if this seems to work in simple cases, I suspect
> there will be cases where it breaks badly, resulting in memory leaks
> or server crashes or some other kind of horrible problem that I can't
> quite imagine. Unfortunately, my knowledge of this code isn't
> fantastic, so I can't say exactly what bad thing will happen, and I
> can't even say with 100% certainty that anything bad will happen, but
> I bet it will. It seems like it goes against everything I understand
> about how TupleTableSlots are supposed to be used. (Andres would be
> the best person to give a definitive answer here.)
>
>> - The key point is the memory management and efficiency. For now I think
>>   all the places where "slot->tts_nvalid" is set to 0 mean the
>>   tts_values[*] are no longer valid, so those are the places we should
>>   release the memory for the "detoast datum".  All the other places like
>>   ExecMaterializeSlot or ExecCopySlot also need to take the
>>   "detoast datum" into account.
>
> This might be a way of addressing some of the issues with this idea,
> but I doubt it will be acceptable. I don't think we want to complicate
> the slot API for this feature. There's too much downside to doing
> that, in terms of performance and understandability.

Thanks for voicing the doubts! Let's see whether Andres has time to look
at this. I feel good about the current state.

-- 
Best Regards
Andy Fan





Re: using extended statistics to improve join estimates

2024-05-22 Thread Andy Fan


Andrei Lepikhov  writes:

> On 20/5/2024 15:52, Andy Fan wrote:
>> Hi Andrei,
>> 
>>> On 4/3/24 01:22, Tomas Vondra wrote:
>>>> Cool! There's obviously no chance to get this into v18, and I have stuff
>>>> to do in this CF. But I'll take a look after that.
>>> I'm looking at your patch now - an excellent start to an eagerly awaited
>>> feature!
>>> A couple of questions:
>>> 1. I didn't find the implementation of strategy 'c' - estimation by the
>>> number of distinct values. Do you forget it?
>> What do you mean the "strategy 'c'"?
> As described in 0001-* patch:
> * c) No extended stats with MCV. If there are multiple join clauses,
> * we can try using ndistinct coefficients and do what eqjoinsel does.

OK, I didn't pay enough attention to this comment before, and yes, I
reach the same conclusion as you - there is no implementation of this.

If so, I think we should remove the comment and do the implementation in
a follow-up patch.

>>> 2. Can we add a clauselist selectivity hook into the core (something
>>> similar the code in attachment)? It can allow the development and
>>> testing of multicolumn join estimations without patching the core.
>> The idea LGTM. But do you want
>> +if (clauselist_selectivity_hook)
>> +s1 = clauselist_selectivity_hook(root, clauses, varRelid, 
>> jointype,
>> +
>> rather than
>> +if (clauselist_selectivity_hook)
>> +*return* clauselist_selectivity_hook(root, clauses, ..)
> Of course - library may estimate not all the clauses - it is a reason,
> why I added input/output parameter 'estimatedclauses' by analogy with
> statext_clauselist_selectivity.

OK.

Do you think the hook proposal is closely connected with the current
topic? IIUC, it seems not. So a dedicated thread explaining the problem
to solve and the proposal, with the following discussion, would be
helpful for both topics. I'm just worried that mixing the two in one
thread would make things unnecessarily complex.

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-05-22 Thread Andy Fan


Nikita Malakhov  writes:

> Hi,

> Andy, glad you've not lost interest in this work, I'm looking
> forward to your improvements!

Thanks for your kind words; I've adjusted to the rhythm of the community
and welcome more feedback :)

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-05-22 Thread Andy Fan

Robert Haas  writes:

> On Tue, May 21, 2024 at 10:02 PM Andy Fan  wrote:
>> One more thing I want to highlight: "syscache" is used for metadata
>> and the *detoast cache* is used for user data. User data is likely
>> bigger than metadata, so cache size control (and hence eviction logic
>> running more often) is more necessary in this case. The first graph in [1]
>> (including the quoted text) highlights this idea.
>
> Yes.
>
>> I think that is same idea as design a.
>
> Sounds similar.
>

This is really great to know!

>> I think the key points includes:
>>
>> 1. Where to put the *detoast value* so that ExprEval system can find out
>> it cheaply. The soluation in A slot->tts_values[*].
>>
>> 2. How to manage the memory for holding the detoast value.
>>
>> The soluation in A is:
>>
>> - using a lazy created MemoryContext in slot.
>> - let the detoast system detoast the datum into the designed
>> MemoryContext.
>>
>> 3. How to let Expr evalution run the extra cycles when needed.
>
> I don't understand (3) but I agree that (1) and (2) are key points. In
> many - nearly all - circumstances just overwriting slot->tts_values is
> unsafe and will cause problems. To make this work, we've got to find
> some way of doing this that is compatible with general design of
> TupleTableSlot, and I think in that API, just about the only time you
> can touch slot->tts_values is when you're trying to populate a virtual
> slot. But often we'll have some other kind of slot. And even if we do
> have a virtual slot, I think you're only allowed to manipulate
> slot->tts_values up until you call ExecStoreVirtualTuple.

Please give me one more chance to explain this. What I mean is:

Take SELECT f(a) FROM t1 join t2...; for example,

When we read the Datum of a Var, we read it from tts->tts_values[*], no
matter what kind of TupleTableSlot it is.  So if we can put the "detoast
datum" back into the original tts_values, then nothing needs to be
changed.

Aside from efficiency, safety and soundness are also important
considerations. So what I think now is:

- tts_values[*] holds the Datums from the slot, and it has its operations
  like slot_getsomeattrs, ExecMaterializeSlot, ExecCopySlot,
  ExecClearTuple and so on. None of these characteristics depend on what
  kind of slot it is.

- Saving the "detoast datum" version into tts_values[*] doesn't break
  the above semantics, and the ExprEval engine just gets a detoasted
  version, so it doesn't need to detoast it again.

- The key point is the memory management and efficiency. For now I think
  all the places where "slot->tts_nvalid" is set to 0 mean the
  tts_values[*] are no longer valid, so those are the places we should
  release the memory for the "detoast datum".  All the other places like
  ExecMaterializeSlot or ExecCopySlot also need to take the
  "detoast datum" into account.

> That's why I mentioned using something temporary. You could legally
> (a) materialize the existing slot, (b) create a new virtual slot, (c)
> copy over the tts_values and tts_isnull arrays, (c) detoast any datums
> you like from tts_values and store the new datums back into the array,
> (d) call ExecStoreVirtualTuple, and then (e) use the new slot in place
> of the original one. That would avoid repeated detoasting, but it
> would also have a bunch of overhead.

Yes, I agree about the overhead, so I prefer to find out first whether
my current strategy has any fundamental issue.

> Having to fully materialize the
> existing slot is a cost that you wouldn't always need to pay if you
> didn't do this, but you have to do it to make this work. Creating the
> new virtual slot costs something. Copying the arrays costs something.
> Detoasting costs quite a lot potentially, if you don't end up using
> the values. If you end up needing a heap tuple representation of the
> slot, you're going to now have to make a new one rather than just
> using the one that came straight from the buffer.

> But I do agree with you
> that *if* we could make it work, it would probably be better than a
> longer-lived cache.


I noticed the "if", and it is great to know that :), especially for the
efficiency and memory usage control purposes.

Another big challenge here is how to decide whether a Var needs this
pre-detoasting strategy; we have to consider:

- The in-place tts_values[*] update makes the slot bigger, which is
  harmful for the CP_SMALL_TLIST (createplan.c) purpose.
- Checking at ExprEval run time whether the Var is toastable is an
  overhead; it must be decided at ExecInitExpr time.

The code complexity caused by the above 2 factors makes people
(including me) unhappy.

===
Yesterday I did some worst case testing fo

Re: Shared detoast Datum proposal

2024-05-21 Thread Andy Fan


Andy Fan  writes:

> Hi Robert, 
>
>> Andy Fan asked me off-list for some feedback about this proposal. I
>> have hesitated to comment on it for lack of having studied the matter
>> in any detail, but since I've been asked for my input, here goes:
>
> Thanks for doing this! Since we have two totally different designs
> between Tomas and me and I think we both understand the two sides of
> each design, just that the following things are really puzzled to me.

My emacs got stuck and somehow this incomplete message was sent out.
Please ignore this post for now; sorry about that.

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-05-21 Thread Andy Fan


Hi Robert, 

> Andy Fan asked me off-list for some feedback about this proposal. I
> have hesitated to comment on it for lack of having studied the matter
> in any detail, but since I've been asked for my input, here goes:

Thanks for doing this! We have two totally different designs between
Tomas and me, and I think we both understand the two sides of each
design; it is just that the following things really puzzle me:

- which factors we should care about more;
- the possibility of improving design A/B where it is weak.

But for point 2, we probably need a preferred design first.

For now let's highlight that there are 2 totally different methods to
handle this problem. Let's call them:

design a:  detoast the datum into slot->tts_values (and manage the
memory with the lifespan of the *slot*). From me.

design b:  detoast the datum into a CACHE (and manage the memory with
CACHE eviction, released at end-of-query). From Tomas.

Currently I prefer design a, though I'm not sure whether that is because
design a is really good or just because it is my own design. So I will
summarize the current state for easier discussion. I summarize it so
that you don't need to go through the long discussion, and I summarize
each point with a link to the original post, since I may misunderstand
some points and the original post can be used as a double check.

> I agree that there's a problem here, but I suspect this is not the
> right way to solve it. Let's compare this to something like the
> syscache. Specifically, let's think about the syscache on
> pg_class.relname.

OK.

> In the case of the syscache on pg_class.relname, there are two reasons
> why we might repeatedly look up the same values in the cache. One is
> that the code might do multiple name lookups when it really ought to
> do only one. Doing multiple lookups is bad for security and bad for
> performance, but we have code like that in some places and some of it
> is hard to fix. The other reason is that it is likely that the user
> will mention the same table names repeatedly; each time they do, they
> will trigger a lookup on the same name. By caching the result of the
> lookup, we can make it much faster. An important point to note here is
> that the queries that the user will issue in the future are unknown to
> us. In a certain sense, we cannot even know whether the same table
> name will be mentioned more than once in the same query: when we reach
> the first instance of the table name and look it up, we do not have
> any good way of knowing whether it will be mentioned again later, say
> due to a self-join. Hence, the pattern of cache lookups is
> fundamentally unknowable.

True. 

> But that's not the case in the motivating example that started this
> thread. In that case, the target list includes three separate
> references to the same column. We can therefore predict that if the
> column value is toasted, we're going to de-TOAST it exactly three
> times.

True.

> If, on the other hand, the column value were mentioned only
> once, it would be detoasted just once. In that latter case, which is
> probably quite common, this whole cache idea seems like it ought to
> just waste cycles, which makes me suspect that no matter what is done
> to this patch, there will always be cases where it causes significant
> regressions.

I agree.

> In the former case, where the column reference is
> repeated, it will win, but it will also hold onto the detoasted value
> after the third instance of detoasting, even though there will never
> be any more cache hits.

True.

> Here, unlike the syscache case, the cache is
> thrown away at the end of the query, so future queries can't benefit.
> Even if we could find a way to preserve the cache in some cases, it's
> not very likely to pay off. People are much more likely to hit the
> same table more than once than to hit the same values in the same
> table more than once in the same session.

I agree.

One more thing I want to highlight: "syscache" is used for metadata and
the *detoast cache* is used for user data. User data is likely bigger
than metadata, so cache size control (and hence eviction logic running
more often) is more necessary in this case. The first graph in [1]
(including the quoted text) highlights this idea.

> Suppose we had a design where, when a value got detoasted, the
> detoasted value went into the place where the toasted value had come
> from. Then this problem would be handled pretty much perfectly: the
> detoasted value would last until the scan advanced to the next row,
> and then it would be thrown out. So there would be no query-lifespan
> memory leak.

I think that is the same idea as design a.

> Now, I don't really know how this would work, because the
> TOAST pointer is coming out of a slot and we can't necessarily write

Re: using extended statistics to improve join estimates

2024-05-20 Thread Andy Fan


Hi Andrei,

> On 4/3/24 01:22, Tomas Vondra wrote:
>> Cool! There's obviously no chance to get this into v18, and I have stuff
>> to do in this CF. But I'll take a look after that.
> I'm looking at your patch now - an excellent start to an eagerly awaited
> feature!
> A couple of questions:
> 1. I didn't find the implementation of strategy 'c' - estimation by the
> number of distinct values. Do you forget it?

What do you mean by "strategy 'c'"?

> 2. Can we add a clauselist selectivity hook into the core (something
> similar the code in attachment)? It can allow the development and
> testing of multicolumn join estimations without patching the core.

The idea LGTM. But do you want 

+   if (clauselist_selectivity_hook)
+   s1 = clauselist_selectivity_hook(root, clauses, varRelid, 
jointype,
+

rather than

+   if (clauselist_selectivity_hook)
+   *return* clauselist_selectivity_hook(root, clauses, ..)


?

-- 
Best Regards
Andy Fan





Re: UniqueKey v2

2024-05-13 Thread Andy Fan


>> Consider join of tables "a", "b" and "c". My understanding is that
>> make_join_rel() is called once with rel1={a} and rel2={b join c}, then with
>> rel1={a join b} and rel2={c}, etc. I wanted to say that each call should
>> produce the same set of unique keys.
>> 
>> I need to check this part more in detail.
>
> I think I understand now. By calling populate_joinrel_uniquekeys() for various
> orderings, you can find out that various input relation unique keys can
> represent the whole join. For example, if the ordering is
>
> A JOIN (B JOIN C)
>
> you can prove that the unique keys of A can be used for the whole join, while
> for the ordering
>
> B JOIN (A JOIN C)
>
> you can prove the same for the unique keys of B, and so on.

Yes.

>> > We don't create the component uniquekey if either of the two
>> > sides is unique already. For example:
>> > 
>> > "(t1.id) in joinrel(t1, t2) is unique" OR "(t2.id) in joinrel is
>> > unique", there is no need to create component UniqueKey (t1.id, t2.id);  
>> 
>> ok, I need to check more in detail how this part works.
>
> This optimization makes sense to me.

OK.

>> > >
>> > >   Of course one problem is that the number of combinations can grow
>> > >   exponentially as new relations are joined.
>> > 
>> > Yes, that's why "rule 1" needed and "How to reduce the overhead" in
>> > UniqueKey.README is introduced. 
>
> I think there should yet be some guarantee that the number of unique keys does
> not grow exponentially. Perhaps a constant that allows a relation (base or
> join) to have at most N unique keys. (I imagine N to be rather small, e.g. 3
> or 4.) And when picking the "best N keys", one criterion could be the number
> of expressions in the key (the shorter key the better).

I don't want to introduce this complexity right now. I'm more interested
in how to work with them effectively. The main effort includes:

- the design of a bitmapset which is memory-usage friendly and easy to
use for combinations;
- optimizing the single-row cases to reduce N UniqueKeys to 1 UniqueKey.

I hope we can pay more attention to this optimization (at most N
UniqueKeys) once the major infrastructure is done.

>> > You are correct and IMO my current code are able to tell it is a single
>> > row as well.
>> > 
>> > 1. Since t1.id = 1, t1 is a single row, so t1.d is a uniquekey as a
>> > consequence.
>> > 2. Given t2.id is unique and t1.e = t2.id, t1's uniquekey can be kept
>> > after the join because of rule 1 on the joinrel; and t1 is a single row,
>> > so the joinrel is a single row as well.
>
> ok, I think I understand now.

OK.

Lastly, this is probably my first non-trivial patch with multiple
authors. I don't want to be the bottleneck for the cooperation, so if
you need something done sooner, please don't hesitate to ask me for it
explicitly.

Here is my schedule: I can provide the next version, based on our
discussion and your patches, early next week, and update
UniqueKey.README to make the overall design clearer. What I hope you pay
more attention to, besides the code, is UniqueKey.README. I hope it can
greatly reduce the effort needed for others to understand the overall
design.

-- 
Best Regards
Andy Fan





Re: UniqueKey v2

2024-05-13 Thread Andy Fan
uniquekey_useful_for_merging()
>> >
>> >   How does uniqueness relate to merge join? In README.uniquekey you seem to
>> >   point out that a single row is always sorted, but I don't think this
>> >   function is related to that fact. (Instead, I'd expect that pathkeys are
>> >   added to all paths for a single-row relation, but I'm not sure you do 
>> > that
>> >   in the current version of the patch.)
>> 
>> The merging is for "mergejoinable join clauses", see function
>> eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a";
>
> My question is: why is the uniqueness important specifically to merge join? I
> understand that join evaluation can be more efficient if we know that one
> input relation is unique (i.e. we only scan that relation until we find the
> first match), but this is not specific to merge join.

So the answer is that the "merging" in uniquekey_useful_for_merging()
has nothing to do with merge join.

>> > * is_uniquekey_useful_afterjoin()
>> >
>> >   Now that my patch (0004) allows propagation of the unique keys from a
>> >   subquery to the upper query, I was wondering if the UniqueKey structure
>> >   needs the 'use_for_distinct field' I mean we should now propagate the 
>> > unique
>> >   keys to the parent query whether the subquery has DISTINCT clause or 
>> > not. I
>> >   noticed that the field is checked by is_uniquekey_useful_afterjoin(), so 
>> > I
>> >   changed the function to always returned true. However nothing changed in 
>> > the
>> >   output of regression tests (installcheck). Do you insist that the
>> >   'use_for_distinct' field is needed?
>
> I miss your answer to this comment.

After we consider the uniquekeys from the subquery, the
'use_for_distinct' field is not needed.

>> >   ** What does the 'multi' word in the function name mean?
>> 
>> multi means multiple, I thought we use this short name in the many
>> places, for ex bt_multi_page_stats after a quick search. 
>
> Why not simply uniquekey_contains_nulls() ?

> Actually I wouldn't say that an instance of UniqueKey contains any value (NULL
> or NOT NULL) because it describes the whole relation rather than particular
> row. I consider UniqueKey to be a set of expressions. How about
> uniquekey_expression_nullable() ?

uniquekey_expression_nullable() is a better name; I will use it in the
next version.

IIUC, based on your latest response we have reached agreement on most of
the questions. Please point it out if I missed anything.

>> >   Of course one problem is that the number of combinations can grow
>> >   exponentially as new relations are joined.
>> 
>> Yes, that's why "rule 1" needed and "How to reduce the overhead" in
>> UniqueKey.README is introduced. 
>
> What if we are interested in unique keys of a subquery, but the subquery has
> no DISTINCT clause?

I agree we should remove the prerequisite of "use_for_distinct". 

-- 
Best Regards
Andy Fan





Re: First draft of PG 17 release notes

2024-05-13 Thread Andy Fan


Bruce Momjian  writes:

> On Sat, May 11, 2024 at 01:27:25PM +0800, Andy Fan wrote:
>> 
>> Hello Bruce,
>> 
>> > I have committed the first draft of the PG 17 release notes;  you can
>> > see the results here:
>> >
>> >https://momjian.us/pgsql_docs/release-17.html
>> 
>> Thank you for working on this!
>> 
>> > I welcome feedback.  For some reason it was an easier job than usual.
>> 
>> Do you think we need to add the following 2 items?
>> 
>> - 9f133763961e280d8ba692bcad0b061b861e9138 this is an optimizer
>>   transform improvement.
>
> It was unclear from the commit message exactly what user-visible
> optimization this allowed.  Do you have details?

Yes, it allows a query like "SELECT * FROM t1 WHERE t1.a IN (SELECT a
FROM t2 WHERE t2.b = t1.b)" to be pulled up into a semi-join, so more
join methods / join orders become possible.

>
>> - a8a968a8212ee3ef7f22795c834b33d871fac262 this is an optimizer costing
>>   improvement.
>
> Does this allow faster UNION ALL with LIMIT, perhaps?

Yes, for example:  (subquery-1) UNION ALL (subquery-2) LIMIT n;

When planning subquery-1 or subquery-2, the LIMIT N should be taken into
account. As a consequence, a hash join might need to be replaced with a
Nested Loop. Before this commit, the limit was ignored whenever the set
operation was flattened into an appendrel, and that flattening happens very often.

David provided a summary for the both commits in [1].

[1]
https://www.postgresql.org/message-id/CAApHDvqAQgq27LgYmJ85VVGTR0%3DhRW6HHq2oZgK0ZiYC_a%2BEww%40mail.gmail.com
 

-- 
Best Regards
Andy Fan





Re: Parallel CREATE INDEX for GIN indexes

2024-05-13 Thread Andy Fan


Tomas Vondra  writes:

>>> 7) v20240502-0007-Detect-wrap-around-in-parallel-callback.patch
>>>
>>> There's one more efficiency problem - the parallel scans are required to
>>> be synchronized, i.e. the scan may start half-way through the table, and
>>> then wrap around. Which however means the TID list will have a very wide
>>> range of TID values, essentially the min and max TID for the key.
>> 
>> I have two questions here and both of them are general GIN index questions
>> rather than questions about the patch here.
>> 
>> 1. What does the "wrap around" mean in the "the scan may start half-way
>> through the table, and then wrap around".  Searching "wrap" in
>> gin/README gets nothing. 
>> 
>
> The "wrap around" is about the scan used to read data from the table
> when building the index. A "sync scan" may start e.g. at TID (1000,0)
> and read till the end of the table, and then wraps and returns the
> remaining part at the beginning of the table for blocks 0-999.
>
> This means the callback would not see a monotonically increasing
> sequence of TIDs.
>
> Which is why the serial build disables sync scans, allowing simply
> appending values to the sorted list, and even with regular flushes of
> data into the index we can simply append data to the posting lists.

Thanks for the hints; now I see the sync strategy comes from syncscan.c.

>>> Without 0006 this would cause frequent failures of the index build, with
>>> the error I already mentioned:
>>>
>>>   ERROR: could not split GIN page; all old items didn't fit

>> 2. I can't understand the below error.
>> 
>>>   ERROR: could not split GIN page; all old items didn't fit

>   if (!append || ItemPointerCompare(&maxOldItem, &remaining) >= 0)
> elog(ERROR, "could not split GIN page; all old items didn't fit");
>
> It can fail simply because of the !append part.

Got it, Thanks!

>> If we split the blocks among worker 1-block by 1-block, we will have a
>> serious issue like here.  If we can have N-block by N-block, and N-block
>> is somehow fill the work_mem which makes the dedicated temp file, we
>> can make things much better, can we? 

> I don't understand the question. The blocks are distributed to workers
> by the parallel table scan, and it certainly does not do that block by
> block. But even it it did, that's not a problem for this code.

OK, I see that ParallelBlockTableScanWorkerData.phsw_chunk_size is
designed for this.

> The problem is that if the scan wraps around, then one of the TID lists
> for a given worker will have the min TID and max TID, so it will overlap
> with every other TID list for the same key in that worker. And when the
> worker does the merging, this list will force a "full" merge sort for
> all TID lists (for that key), which is very expensive.

OK.

Thanks for all the answers, they are pretty instructive!

-- 
Best Regards
Andy Fan





Re: First draft of PG 17 release notes

2024-05-10 Thread Andy Fan


Hello Bruce,

> I have committed the first draft of the PG 17 release notes;  you can
> see the results here:
>
>   https://momjian.us/pgsql_docs/release-17.html

Thank you for working on this!

> I welcome feedback.  For some reason it was an easier job than usual.

Do you think we need to add the following 2 items?

- 9f133763961e280d8ba692bcad0b061b861e9138 this is an optimizer
  transform improvement.

- a8a968a8212ee3ef7f22795c834b33d871fac262 this is an optimizer costing
  improvement.

Both of them can generate a better plan in some cases. 

-- 
Best Regards
Andy Fan





Re: Parallel CREATE INDEX for GIN indexes

2024-05-10 Thread Andy Fan


Tomas Vondra  writes:

>> I guess both of you are talking about the worker process; here is
>> something on my mind:
>> 
>> *btbuild* also lets the WORKER dump the tuples into the Sharedsort struct
>> and lets the LEADER merge them directly. I think the aim of this design
>> is that it can potentially save a merge run. In the current patch, a worker
>> dumps to a local tuplesort, merges its runs, and then the leader runs the
>> merge again. I admit the goal of this patch is reasonable, but I feel we
>> need to adopt this approach conditionally somehow, and if we find the way,
>> we can apply it to btbuild as well. 
>> 
>
> I'm a bit confused about what you're proposing here, or how is that
> related to what this patch is doing and/or to the what Matthias
> mentioned in his e-mail from last week.
>
> Let me explain the relevant part of the patch, and how I understand the
> improvement suggested by Matthias. The patch does the work in three phases:

Here is what I have in mind:

1. WORKER-1

Tempfile 1:

key1:  1
key3:  2
...

Tempfile 2:

key5:  3
key7:  4
...

2. WORKER-2

Tempfile 1:

Key2: 1
Key6: 2
...

Tempfile 2:
Key3: 3
Key6: 4
..

In the above example, if we do the merge in the LEADER, only 1 merge run
is needed: reading 4 tempfiles (8 tuples in total) and writing 8 tuples.

If we add another merge run in the WORKERs, the result will be:

WORKER1:  reads 2 tempfiles (4 tuples) and writes 1 tempfile (called X)
with 4 tuples.
WORKER2:  reads 2 tempfiles (4 tuples) and writes 1 tempfile (called Y)
with 4 tuples.

LEADER:  reads the 2 tempfiles (X & Y), 8 tuples in total, and writes
them into the final tempfile.

So the intermediate results X & Y require some extra effort. I therefore
think the "extra merge run in worker" is *not always* a win, and my
proposal is that we need to distinguish the cases and decide in which
ones we should add the "extra merge run in worker" step.

> The trouble with (2) is that it "just copies" data from one tuplesort
> into another, increasing the disk space requirements. In an extreme
> case, when nothing can be combined, it pretty much doubles the amount of
> disk space, and makes the build longer.

This sounds like the same question I raised above. However, my proposal
is to work out which cost is bigger: "the cost saved by merging TIDs in
the WORKERs" versus "the cost paid for the extra copy", and then do the
extra merge only when we are sure we can benefit from it. But I know that
is hard to estimate, and I am not sure it is doable.

> What I think Matthias is suggesting, is that this "TID list merging"
> could be done directly as part of the tuplesort in step (1). So instead
> of just moving the "sort tuples" from the appropriate runs, it could
> also do an optional step of combining the tuples and writing this
> combined tuple into the tuplesort result (for that worker).

OK, I get it now. So far we have talked about several merges, at
different stages and for different sets of tuples.

1. The "GIN deform buffer" merges the TIDs for the same key for the
tuples in one "deform buffer" batch, as the current master does.

2. At the "in-memory buffer sort" stage, there is currently no TID
merge, and Matthias suggests adding one.

3. Merging the TIDs for the same keys in the LEADER vs. in the WORKERs
first and then in the LEADER: this is what your 0002 commit does now,
and I raised some concerns about it above.

> Matthias also mentioned this might be useful when building btree indexes
> with key deduplication.

> AFAICS this might work, although it probably requires for the "combined"
> tuple to be smaller than the sum of the combined tuples (in order to fit
> into the space). But at least in the GIN build in the workers this is
> likely true, because the TID lists do not overlap (and thus not hurting
> the compressibility).
>
> That being said, I still see this more as an optimization than something
> required for the patch,

If the GIN deform buffer is big enough (e.g. larger than the in-memory
buffer sort), would we gain anything from this, given that the scope is
limited to the tuples in the in-memory buffer sort?

> and I don't think I'll have time to work on this
> anytime soon. The patch is not extremely complex, but it's not trivial
> either. But if someone wants to take a stab at extending tuplesort to
> allow this, I won't object ...

Agreed. I am more interested in understanding the whole design and the
scope of the fix in this patch; then I can do some code review and
testing. For now, I am still in the "understanding design and scope"
stage. If I'm too slow on this patch, please feel free to commit it at
any time; I don't expect to find any valuable improvements or bugs. I
probably need another 1~2 weeks to study this patch.

-- 
Best Regards
Andy Fan





Re: Parallel CREATE INDEX for GIN indexes

2024-05-09 Thread Andy Fan


Tomas Vondra  writes:

> 3) v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch
>
> In 0002 the workers still do an explicit qsort() on the TID list before
> writing the data into the shared tuplesort. But we can do better - the
> workers can do a merge sort too. To help with this, we add the first TID
> to the tuplesort tuple, and sort by that too - it helps the workers to
> process the data in an order that allows simple concatenation instead of
> the full mergesort.
>
> Note: There's a non-obvious issue due to parallel scans always being
> "sync scans", which may lead to very "wide" TID ranges when the scan
> wraps around. More about that later.

This is really amazing.

> 7) v20240502-0007-Detect-wrap-around-in-parallel-callback.patch
>
> There's one more efficiency problem - the parallel scans are required to
> be synchronized, i.e. the scan may start half-way through the table, and
> then wrap around. Which however means the TID list will have a very wide
> range of TID values, essentially the min and max TID for the key.
>
> Without 0006 this would cause frequent failures of the index build, with
> the error I already mentioned:
>
>   ERROR: could not split GIN page; all old items didn't fit

I have two questions here, and both of them are general GIN index
questions rather than questions about the patch here.

1. What does the "wrap around" mean in the "the scan may start half-way
through the table, and then wrap around".  Searching "wrap" in
gin/README gets nothing. 

2. I can't understand the below error.

>   ERROR: could not split GIN page; all old items didn't fit

When the posting list is too long, we have the posting tree strategy, so
in which situation could we get this ERROR?

> issue with efficiency - having such a wide TID list forces the mergesort
> to actually walk the lists, because this wide list overlaps with every
> other list produced by the worker.

If we split the blocks among workers 1 block at a time, we will have a
serious issue like the one here.  If we can assign N blocks at a time,
with N chosen so the blocks roughly fill work_mem (which produces the
dedicated temp file), we can make things much better, can't we? 

-- 
Best Regards
Andy Fan





Re: Parallel CREATE INDEX for GIN indexes

2024-05-09 Thread Andy Fan


Hello Tomas,

>>> 2) v20240502-0002-Use-mergesort-in-the-leader-process.patch
>>>
>>> The approach implemented by 0001 works, but there's a little bit of
>>> issue - if there are many distinct keys (e.g. for trigrams that can
>>> happen very easily), the workers will hit the memory limit with only
>>> very short TID lists for most keys. For serial build that means merging
>>> the data into a lot of random places, and in parallel build it means the
>>> leader will have to merge a lot of tiny lists from many sorted rows.
>>>
>>> Which can be quite annoying and expensive, because the leader does so
>>> using qsort() in the serial part. It'd be better to ensure most of the
>>> sorting happens in the workers, and the leader can do a mergesort. But
>>> the mergesort must not happen too often - merging many small lists is
>>> not cheaper than a single qsort (especially when the lists overlap).
>>>
>>> So this patch changes the workers to process the data in two phases. The
>>> first works as before, but the data is flushed into a local tuplesort.
>>> And then each workers sorts the results it produced, and combines them
>>> into results with much larger TID lists, and those results are written
>>> to the shared tuplesort. So the leader only gets very few lists to
>>> combine for a given key - usually just one list per worker.
>> 
>> Hmm, I was hoping we could implement the merging inside the tuplesort
>> itself during its own flush phase, as it could save significantly on
>> IO, and could help other users of tuplesort with deduplication, too.
>> 
>
> Would that happen in the worker or leader process? Because my goal was
> to do the expensive part in the worker, because that's what helps with
> the parallelization.

I guess both of you are talking about the worker process; here is
something on my mind:

*btbuild* also lets the WORKER dump the tuples into the Sharedsort struct
and lets the LEADER merge them directly. I think the aim of this design
is that it can potentially save a merge run. In the current patch, a
worker dumps to a local tuplesort, merges its runs, and then the leader
runs the merge again. I admit the goal of this patch is reasonable, but I
feel we need to adopt this approach conditionally somehow, and if we find
the way, we can apply it to btbuild as well. 

-- 
Best Regards
Andy Fan





Re: UniqueKey v2

2024-05-06 Thread Andy Fan
  b) is referenced by an UK of a relation which has already been proven
>  single-row.

I can't follow here...

>
>   I think that in the example above, an EC {t1.e, t2.id} should exist. So when
>   checking whether 't2' is single-row, the condition b) cam be ised: the UK of
>   't2' should reference the EC {t1.e, t2.id}, which in turn contains the
>   column t1.e. And 't1' is unique because its EC meets the condition a). (Of
>   course you need to check em_jdomain before you use particular EM.)

I think the existing rule 1 for joinrels works well with the single-row
case naturally; what would be improved if we added the theory you
suggest here? 

-- 
Best Regards
Andy Fan





Re: using extended statistics to improve join estimates

2024-04-30 Thread Andy Fan


Hello Justin,

Thanks for showing interest on this!

> On Sun, Apr 28, 2024 at 10:07:01AM +0800, Andy Fan wrote:
>> 's/estimiatedcluases/estimatedclauses/' typo error in the
>> commit message is not fixed since I have to regenerate all the commits
>
> Maybe you know this, but some of these patches need to be squashed.
> Regenerating the patches to address feedback is the usual process.
> When they're not squished, it makes it hard to review the content of the
> patches.

You might have overlooked the fact that each individual commit exists
just to make communication effective (easy to review), and all of them
will be merged into 1 commit at the end of / during the review process. 

Even if something makes it hard to review, I am pretty happy to
regenerate the patches, but does 's/estimiatedcluases/estimatedclauses/'
belong to that category? I'm pretty sure that is not the only typo
or inappropriate word; if we needed to regenerate the 22 patches
because of that, we would have to regenerate them pretty often.

Would you mind providing all the feedback at once so I can merge it in
one modification, or do you think the typo has blocked the review
process?

>
> For example:
> [PATCH v1 18/22] Fix error "unexpected system attribute" when join with 
> system attr
> ..adds .sql regression tests, but the expected .out isn't updated until
> [PATCH v1 19/22] Fix the incorrect comment on extended stats.
>
> That fixes an elog() in Tomas' original commit, so it should probably be
> 002 or 003.

Which elog are you talking about?

> It might make sense to keep the first commit separate for
> now, since it's nice to keep Tomas' original patch "pristine" to make
> more apparent the changes you're proposing.

This is my goal as well. Did you find anything I did that breaks this
rule? That is absolutely not my intention.

> Another:
> [PATCH v1 20/22] Add fastpath when combine the 2 MCV like eqjoinsel_inner.
> ..doesn't compile without
> [PATCH v1 21/22] When mcv->ndimensions == list_length(clauses), handle it 
> same as
>
> Your 022 patch fixes a typo in your 002 patch, which means that first
> one reads a patch with a typo, and then later, a 10 line long patch
> reflowing the comment with a typo fixed.

I would like to regenerate the 22 patches if you think the typo error
make the reivew process hard. I can do such things but not willing to
do that often.

>
> A good guideline is that each patch should be self-contained, compiling
> and passing tests.  Which is more difficult with a long stack of
> patches.

I agree.

-- 
Best Regards
Andy Fan





Re: a wrong index choose when statistics is out of date

2024-04-27 Thread Andy Fan


Andy Fan  writes:

> Hello everyone,
>
>> After some more thoughts about the diference of the two ideas, then I
>> find we are resolving two different issues, just that in the wrong index
>> choose cases, both of them should work generally. 
>
> Here is the formal version for the attribute reloptions direction.

> commit 0d842e39275710a544b11033f5eec476147daf06 (HEAD -> force_generic)
> Author: yizhi.fzh 
> Date:   Sun Mar 31 11:51:28 2024 +0800
>
> Add a attopt to disable MCV when estimating for Var = Const
> 
> As of the current code, when calculating the selectivity for Var = Const,
> the planner first checks whether the Const is a most common value; if
> not, it subtracts the MCVs' total selectivity and their number of
> distinct values, and treats the selectivity for the Const as spread
> evenly across the remaining n_distinct.
> 
> This logic works great when the optimizer statistic is up to date,
> however if the known most common value has taken up most of the
> selectivity at the last run of analyze, and the new most common value in
> reality has not been gathered, the estimation for the new MCV will be
> pretty bad. A common case for this would be created_at = {current_date};
> 
> To overcome this issue, we provides a new syntax:
> 
> ALTER TABLE tablename ALTER COLUMN created_at SET (force_generic=on);
> 
> After this, planner ignores the value of MCV for this column when
> estimating for Var = Const and treating all the values equally.
> 
> This would cause some badness if the values of a column are quite
> unequal (which is what MCV is designed for); however, this patch just
> provides one more option to the user and lets the user make the decision.
>
> Here is an example about its user case.

...

Here are some follow-up notes on this feature:

- After using this feature, we still gather the MCV on these
  columns because it is still useful for the join case; see the
  eqjoinsel_inner function.

- Will this feature make some cases worse, since it relies on not using
  the MCV list for Var = Const? That is true in theory. But if users use
  this feature correctly, they will not enable it for such columns. The
  feature is just designed for the use case in the commit message, and
  the theory is exactly the same as for a generic plan. If users use it
  right, they may save the effort of running ANALYZE pretty frequently
  and get better results for both index choice and row estimation. Plus,
  the patch is not aggressive at all and is easy to maintain.

- Is 'force_generic' a good name for the attribute option? Probably not;
  we can find a better name once we agree on this direction.  

- Will it conflict with David's idea of certainty_factor? Probably not,
  even though both of them can handle the index-choice case. See my point
  at [1].

[1] https://www.postgresql.org/message-id/877cicao6e.fsf%40163.com 

-- 
Best Regards
Andy Fan





Re: using extended statistics to improve join estimates

2024-04-27 Thread Andy Fan


Hello Justin!

Justin Pryzby  writes:


> |../src/backend/statistics/extended_stats.c:3151:36: warning: ‘relid’ may be 
> used uninitialized [-Wmaybe-uninitialized]
> | 3151 | if (var->varno != relid)
> |  |^
> |../src/backend/statistics/extended_stats.c:3104:33: note: ‘relid’ was 
> declared here
> | 3104 | int relid;
> |  | ^
> |[1016/1893] Compiling C object 
> src/backend/postgres_lib.a.p/statistics_mcv.c.o
> |../src/backend/statistics/mcv.c: In function ‘mcv_combine_extended’:
> |../src/backend/statistics/mcv.c:2431:49: warning: declaration of ‘idx’ 
> shadows a previous local [-Wshadow=compatible-local]

Thanks for the feedback. The warning should be fixed in the latest
revision; the 's/estimiatedcluases/estimatedclauses/' typo in the
commit message is not fixed, since I would have to regenerate all the
commits to do so. We are still in the discussion stage, and I think its
impact on the discussion is pretty limited.

> FYI, I also ran the patch with a $large number of reports without
> observing any errors or crashes.

Good to know that.

> I'll try to look harder at the next patch revision.

Thank you!

-- 
Best Regards
Andy Fan





Re: New committers: Melanie Plageman, Richard Guo

2024-04-27 Thread Andy Fan


"Jonathan S. Katz"  writes:

> [[PGP Signed Part:Undecided]]
> The Core Team would like to extend our congratulations to Melanie
> Plageman and Richard Guo, who have accepted invitations to become our
> newest PostgreSQL committers.
>
> Please join us in wishing them much success and few reverts!
>

Congratulations to both, Well deserved!

-- 
Best Regards
Andy Fan





Re: UniqueKey v2

2024-04-19 Thread Andy Fan


Hello Antonin,

> zhihuifan1...@163.com wrote:
>
>> Here is the v3, ...
>
> I'm trying to enhance the join removal functionality (will post my patch in a
> separate thread soon) and I consider your patch very helpful in this
> area.

Thanks for these words.  Points 2) and 3) at [1] are pretty interesting
to me, and "enhance join removal" is another interesting use case. 

> Following is my review. Attached are also some fixes and one enhancement:
> propagation of the unique keys (UK) from a subquery to the parent query
> (0004). (Note that 0001 is just your patch rebased).

Thanks for that! More enhancements, like UniqueKey for partitioned
tables, are needed. This post is mainly meant to check whether more
people are still interested in this. 

> Are you going to submit the patch to the first CF of PG 18?

Since there is still known work to do, I'm not sure whether it is OK to
submit it to a CF. What do you think about this part?

>
> Please let me know if I can contribute to the effort by reviewing or writing
> some code.

Absolutely yes! Please feel free to review or write any of them, and do
remember to add yourself to the author list if you do. 

Thanks for your review suggestions; I will get to them as soon as I have
time, hopefully within 4 weeks. 

[1]
https://www.postgresql.org/message-id/7mlamswjp81p.fsf%40e18c07352.et15sqa

-- 
Best Regards
Andy Fan





Re: Extract numeric filed in JSONB more effectively

2024-04-16 Thread Andy Fan

Andy Fan  writes:

> Here is latest version, nothing changed besides the rebase to the latest
> master. The most recent 3 questions should be addressed.
>
> - The error message compatible issue [1] and the Peter's answer at [2].
> - Peter's new question at [2] and my answer at [3].
>
> Any effort to move this patch ahead is welcome, and thanks to all the
> people who provided high-quality feedback so far, especially Chapman!
>
> [1] https://www.postgresql.org/message-id/87r0hmvuvr@163.com
> [2]
> https://www.postgresql.org/message-id/8102ff5b-b156-409e-a48f-e53e63a39b36%40eisentraut.org
> [3] https://www.postgresql.org/message-id/8734t6c5rh.fsf%40163.com

rebase to the latest master again.

commit bc990b983136ef658cd3be03cdb07f2eadc4cd5c (HEAD -> jsonb_numeric)
Author: yizhi.fzh 
Date:   Mon Apr 1 09:36:08 2024 +0800

Improve the performance of Jsonb numeric/bool extraction.

JSONB object uses a binary compatible numeric format with the numeric
data type in SQL. However in the past, extracting a numeric value from a
JSONB field still needs to find the corresponding JsonbValue first,
then convert the JsonbValue to Jsonb, and finally use the cast system to
convert the Jsonb to a Numeric data type. This approach was very
inefficient in terms of performance.

In the current patch, the JsonbValue is converted to the numeric
data type directly.  This is done by the planner support
function which detects the above case and simplify the expression.
Because the boolean type and numeric type share certain similarities in
their attributes, we have implemented the same optimization approach for
both.  In the ideal test case, the performance can be 2x than before.

The optimized functions and operators includes:
1. jsonb_object_field / ->
2. jsonb_array_element / ->
3. jsonb_extract_path / #>
4. jsonb_path_query
    5. jsonb_path_query_first

-- 
Best Regards
Andy Fan

>From bc990b983136ef658cd3be03cdb07f2eadc4cd5c Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Mon, 1 Apr 2024 09:36:08 +0800
Subject: [PATCH v18 1/1] Improve the performance of Jsonb numeric/bool
 extraction.

JSONB object uses a binary compatible numeric format with the numeric
data type in SQL. However in the past, extracting a numeric value from a
JSONB field still needs to find the corresponding JsonbValue first,
then convert the JsonbValue to Jsonb, and finally use the cast system to
convert the Jsonb to a Numeric data type. This approach was very
inefficient in terms of performance.

In the current patch, the JsonbValue is converted to the numeric
data type directly.  This is done by the planner support
function which detects the above case and simplify the expression.
Because the boolean type and numeric type share certain similarities in
their attributes, we have implemented the same optimization approach for
both.  In the ideal test case, the performance can be 2x than before.

The optimized functions and operators includes:
1. jsonb_object_field / ->
2. jsonb_array_element / ->
3. jsonb_extract_path / #>
4. jsonb_path_query
5. jsonb_path_query_first
---
 src/backend/utils/adt/jsonb.c | 206 ++
 src/backend/utils/adt/jsonbsubs.c |   4 +-
 src/backend/utils/adt/jsonfuncs.c | 123 ++-
 src/backend/utils/adt/jsonpath_exec.c |  32 +++-
 src/include/catalog/pg_proc.dat   |  46 +-
 src/include/utils/jsonb.h |  11 +-
 src/test/regress/expected/jsonb.out   | 112 +-
 src/test/regress/sql/jsonb.sql|  66 -
 src/tools/pgindent/typedefs.list  |   1 +
 9 files changed, 542 insertions(+), 59 deletions(-)

diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index e4562b3c6c..e05b5b35f1 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -16,9 +16,15 @@
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
+#include "nodes/makefuncs.h"
+#include "nodes/supportnodes.h"
+#include "parser/parse_coerce.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "utils/builtins.h"
+#include "utils/date.h"
+#include "utils/datetime.h"
+#include "utils/fmgroids.h"
 #include "utils/json.h"
 #include "utils/jsonb.h"
 #include "utils/jsonfuncs.h"
@@ -2034,6 +2040,206 @@ cannotCastJsonbValue(enum jbvType type, const char *sqltype)
 	elog(ERROR, "unknown jsonb type: %d", (int) type);
 }
 
+
+/*
+ * jsonb_cast_support()
+ *
+ * Planner support function for extracting numeric or bool data type more
+ * effectively. After finding out the corresponding JsonbValue, instead of
+ * casting it to Jsonb as an intermediate type, we

Re: Replace FunctionCall2Coll with FunctionCallInvoke

2024-04-07 Thread Andy Fan



Hello Michael, 

> [[PGP Signed Part:Undecided]]
> On Mon, Apr 08, 2024 at 12:25:00PM +0800, Andy Fan wrote:
>> During the review of using extended statistics to improve join estimates
>> [1], I found some code level optimization opportunities which apply to
>> existing code as well. To make the discussion easier, open this new
>> thread.
>
> Is that measurable?

I didn't spend time testing it. Rather than whether the improvement is
measurable, I'm more interested in whether it is better than before or
not. As for measurability, I agree with the idea at [1]. Do you think
being measurable matters? 

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZJ0a_Dcn%2BST4YSeSrLnnmajmcsi7ZvEpgkKNiF0SwBuw%40mail.gmail.com
 

-- 
Best Regards
Andy Fan





Replace FunctionCall2Coll with FunctionCallInvoke

2024-04-07 Thread Andy Fan

Hello,

During the review of using extended statistics to improve join estimates
[1], I found some code level optimization opportunities which apply to
existing code as well. To make the discussion easier, open this new
thread.

I wasn't sure what to name the thread, so I just used the subject of
patch 1.

commit d228d9734e70b4f43ad824d736fb1279d2aad5fc (HEAD -> misc_stats)
Author: yizhi.fzh 
Date:   Mon Apr 8 11:43:37 2024 +0800

Fast path for empty clauselist in clauselist_selectivity_ext

It should be common in real life, for example:

SELECT * FROM t1, t2 WHERE t1.a = t2.a AND t1.a > 3;

clauses == NIL at the scan level of t2.

This would also make debugging a bit easier.

commit e852ce631f9348d5d29c8a53090ee71f7253767c
Author: yizhi.fzh 
Date:   Mon Apr 8 11:13:57 2024 +0800

Improve FunctionCall2Coll with FunctionCallInvoke

If FunctionCall2Coll is called multiple times with the same FmgrInfo,
turning it into FunctionCallInvoke will be helpful, since the latter
can reuse a shared FunctionCallInfo.

There are many other places, like GISTSTATE, which have similar
opportunities, but I'd first like to see whether such changes are
necessary at all.


[1]
https://www.postgresql.org/message-id/c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c%40enterprisedb.com
 

-- 
Best Regards
Andy Fan
>From e852ce631f9348d5d29c8a53090ee71f7253767c Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Mon, 8 Apr 2024 11:13:57 +0800
Subject: [PATCH v1 1/2] Improve FunctionCall2Coll with FunctionCallInvoke

If FunctionCall2Coll is called multiple times with the same FmgrInfo,
turning it into FunctionCallInvoke will be helpful, since the latter
can reuse a shared FunctionCallInfo.

There are many other places, like GISTSTATE, which have similar
opportunities, but I'd first like to see whether such changes are
necessary at all.
---
 src/backend/utils/adt/selfuncs.c | 35 +---
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 35f8f306ee..8dfeba5444 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -196,7 +196,7 @@ static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 			   Oid sortop, Oid collation,
 			   Datum *min, Datum *max);
 static void get_stats_slot_range(AttStatsSlot *sslot,
- Oid opfuncoid, FmgrInfo *opproc,
+ FunctionCallInfo fcinfo,
  Oid collation, int16 typLen, bool typByVal,
  Datum *min, Datum *max, bool *p_have_data);
 static bool get_actual_variable_range(PlannerInfo *root,
@@ -5905,6 +5905,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 	bool		typByVal;
 	Oid			opfuncoid;
 	FmgrInfo	opproc;
+	LOCAL_FCINFO(fcinfo, 2);
 	AttStatsSlot sslot;
 
 	/*
@@ -5936,8 +5937,6 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 	   (opfuncoid = get_opcode(sortop
 		return false;
 
-	opproc.fn_oid = InvalidOid; /* mark this as not looked up yet */
-
 	get_typlenbyval(vardata->atttype, &typLen, &typByVal);
 
 	/*
@@ -5957,6 +5956,11 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 		free_attstatsslot(&sslot);
 	}
 
+	fmgr_info(opfuncoid, &opproc);
+	InitFunctionCallInfoData(*fcinfo, &opproc, 2, collation, NULL, NULL);
+	fcinfo->args[0].isnull = false;
+	fcinfo->args[1].isnull = false;
+
 	/*
 	 * Otherwise, if there is a histogram with some other ordering, scan it
 	 * and get the min and max values according to the ordering we want.  This
@@ -5968,7 +5972,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 		 STATISTIC_KIND_HISTOGRAM, InvalidOid,
 		 ATTSTATSSLOT_VALUES))
 	{
-		get_stats_slot_range(&sslot, opfuncoid, &opproc,
+		get_stats_slot_range(&sslot, fcinfo,
			 collation, typLen, typByVal,
			 &tmin, &tmax, &have_data);
		free_attstatsslot(&sslot);
@@ -6003,7 +6007,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
 		}
 
 		if (use_mcvs)
-			get_stats_slot_range(&sslot, opfuncoid, &opproc,
+			get_stats_slot_range(&sslot, fcinfo,
  collation, typLen, typByVal,
  &tmin, &tmax, &have_data);
		free_attstatsslot(&sslot);
@@ -6021,7 +6025,7 @@ get_variable_range(PlannerInfo *root, VariableStatData *vardata,
  * to what we find in the statistics array.
  */
 static void
-get_stats_slot_range(AttStatsSlot *sslot, Oid opfuncoid, FmgrInfo *opproc,
+get_stats_slot_range(AttStatsSlot *sslot, FunctionCallInfo fcinfo,
 	 Oid collation, int16 typLen, bool typByVal,
 	 Datum *min, Datum *max, bool *p_have_data)
 {
@@ -6031,10 +6035,6 @@ get_stats_slot_range(AttStatsSlot *sslot, Oid opfuncoid, FmgrInfo *opproc,
 	bool		found_tmin = false;
 	bool		found_tmax = false;
 
-	/* Look up the comparison function, if we didn't already do so */
-	if (opproc->fn_oid != opfuncoid)
-		fmgr_info(opfuncoid, opproc);
-
 	/* Scan all the slot's values */
 	for (int i = 0; i < sslot->nvalues; i++)
 	{
@@ -604

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-03 Thread Andy Fan


> I also can confirm that a lot of users would be very happy to have
> "read your writes consistency" and ready to do something to achieve
> this at an application level.  However, they typically don't know what
> exactly they need.
>
> So, blogging about pg_wal_replay_wait() and spreading words about it
> at conferences would be highly appreciated.

Sure, once it is committed, I promise to do a knowledge-sharing session in
our organization and write an article in Chinese.

-- 
Best Regards
Andy Fan





Re: using extended statistics to improve join estimates

2024-04-02 Thread Andy Fan


Tomas Vondra  writes:

> On 4/2/24 10:23, Andy Fan wrote:
>> 
>>> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>>>> Rebased over 269b532ae and muted compiler warnings.
>> 
>> Thank you Justin for the rebase!
>> 
>> Hello Tomas,
>> 
>> Thanks for the patch! Before I review the patch at the code level, I want
>> to explain my understanding of this patch first.
>> 
>
> If you want to work on this patch, that'd be cool. A review would be
> great, but if you want to maybe take over and try moving it forward,
> that'd be even better. I don't know when I'll have time to work on it
> again, but I'd promise to help you with working on it.

OK, I'll try to move it forward.

>
>> Before this patch, we already use MCV information for eqjoinsel; it
>> works by combining the MCVs on both sides to figure out the mcv_freq
>> and then treating the rest equally, but this doesn't work for MCVs in
>> extended statistics; this patch fills that gap. Besides that, since
>> extended statistics mean more than one column is involved, if one or
>> more columns are Const based on RestrictInfo, we can use that
>> information to filter the MCVs we are interested in; that's really cool.
>> 
>
> Yes, I think that's an accurate description of what the patch does.

Great to know that:)

>
>> I did some more testing, all of it with inner joins so far; all of them
>> work amazingly well, and I am surprised this patch didn't draw enough
>> attention.

> I think it didn't go forward for a bunch of reasons:
>
..
>
> 3) Uncertainty about how applicable the patch is in practice.
>
> I suppose it was some combination of these reasons, not sure.
>
> As for the "practicality" mentioned in (3), it's been a while since I
> worked on the patch so I don't recall the details, but I think I've been
> thinking mostly about "star join" queries, where a big "fact" table
> joins to small dimensions. And in that case the fact table may have a
> MCV, but the dimensions certainly don't have any (because the join
> happens on a PK).
>
> But maybe that's a wrong way to think about it - it was clearly useful
> to consider the case with (per-attribute) MCVs on both sides as worth
> special handling. So why not to do that for multi-column MCVs, right?

Yes, that's what my current understanding is.

There are some cases where there are 2+ clauses between two tables AND
the rows estimation is bad AND the plan is not the best one. In such
situations, I'd think this patch would probably be helpful. The current
case in hand is on PG11, where there is no MCV information for extended
statistics, so I can't even verify manually whether the patch here is
useful. When I see such cases next time on a newer version of PG, I can
verify manually whether the rows estimation gets better.

>> As for the code level, I reviewed it in a top-down manner and am almost
>> 40% complete. Here are some findings, just FYI. For efficiency, I
>> provide each piece of feedback as an individual commit; after all, I
>> want to make sure my comments are practical, and coding and testing are
>> a good way to achieve that. I tried to make each of them as small as
>> possible so that you can reject or accept them conveniently.
>> 
>> 0001 is your patch; I just rebased it against the current master. 0006
>> is not very relevant to the current patch, and I think it can be
>> committed individually if you are OK with that.
>> 
>> Hope this kind of review is helpful.
>> 
>
> Cool! There's obviously no chance to get this into v18, and I have stuff
> to do in this CF. But I'll take a look after that.

Good to know. I will continue my work before then. 

-- 
Best Regards
Andy Fan





Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Andy Fan


Bharath Rupireddy  writes:

> On Tue, Apr 2, 2024 at 3:41 PM Kartyshov Ivan
>  wrote:
>>
>> 8 years, we tried to add this feature, and now we suggest the best way
>> for this feature is to commit the minimal version first.
>
> Just curious, do you or anyone else have an immediate use for this
> function? If yes, how are they achieving read-after-write-consistency
> on streaming standbys in their application right now without a
> function like this?

The link [1] may be helpful, and I think the reasoning there is sound.

Actually, we also discussed how to ensure "read your writes" consistency
internally, and the solution here looks good to me.

Glad to know that this patch will be committed very soon. 

[1]
https://www.postgresql.org/message-id/CAPpHfdtuiL1x4APTs7u1fCmxkVp2-ZruXcdCfprDMdnOzvdC%2BA%40mail.gmail.com
 

-- 
Best Regards
Andy Fan





Re: using extended statistics to improve join estimates

2024-04-02 Thread Andy Fan

> On Wed, Mar 02, 2022 at 11:38:21AM -0600, Justin Pryzby wrote:
>> Rebased over 269b532ae and muted compiler warnings.

Thank you Justin for the rebase!

Hello Tomas,

Thanks for the patch! Before I review the patch at the code level, I want
to explain my understanding of this patch first.

Before this patch, we already use MCV information for eqjoinsel; it
works by combining the MCVs on both sides to figure out the mcv_freq and
then treating the rest equally, but this doesn't work for MCVs in
extended statistics; this patch fills that gap. Besides that, since
extended statistics mean more than one column is involved, if one or
more columns are Const based on RestrictInfo, we can use that
information to filter the MCVs we are interested in; that's really cool.

I did some more testing, all of it with inner joins so far; all of them
work amazingly well, and I am surprised this patch didn't draw enough
attention. I will test more after I go through the code.

As for the code level, I reviewed it in a top-down manner and am almost
40% complete. Here are some findings, just FYI. For efficiency, I
provide each piece of feedback as an individual commit; after all, I
want to make sure my comments are practical, and coding and testing are
a good way to achieve that. I tried to make each of them as small as
possible so that you can reject or accept them conveniently.

0001 is your patch; I just rebased it against the current master. 0006
is not very relevant to the current patch, and I think it can be
committed individually if you are OK with that.

Hope this kind of review is helpful.

-- 
Best Regards
Andy Fan

>From daa6c27bc7dd0631607f0f254cc15491633a9ccc Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 13 Dec 2021 14:05:17 +0100
Subject: [PATCH v1 1/8] Estimate joins using extended statistics

Use extended statistics (MCV) to improve join estimates. In general this
is similar to how we use regular statistics - we search for extended
statistics (with MCV) covering all join clauses, and if we find such MCV
on both sides of the join, we combine those two MCVs.

Extended statistics allow a couple of additional improvements - e.g. if
there are baserel conditions, we can use them to restrict the part of
the MCVs combined. This means we're building a conditional probability
distribution and calculating the conditional probability

P(join clauses | baserel conditions)

instead of just P(join clauses).

The patch also allows combining regular and extended MCV - we don't need
extended MCVs on both sides. This helps when one of the tables does not
have extended statistics (e.g. because there are no correlations).
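As a toy sketch of the combination step described above (assumed Python pseudocode, not the patch's actual code), matching entries of the two MCV lists contribute the product of their frequencies to the join selectivity, and a baserel condition simply filters one MCV list before combining:

```python
# Each MCV maps a value (a tuple, as in a multi-column extended-statistics
# MCV) to its frequency; matching entries contribute freq1 * freq2.
def join_mcv_selectivity(mcv1, mcv2):
    return sum(f1 * mcv2[v] for v, f1 in mcv1.items() if v in mcv2)

mcv_r = {(1, "x"): 0.5, (2, "y"): 0.2}
mcv_s = {(1, "x"): 0.4, (3, "z"): 0.3}
sel = join_mcv_selectivity(mcv_r, mcv_s)   # only (1, "x") matches
assert abs(sel - 0.5 * 0.4) < 1e-12

# Restricting by a baserel condition (say r.col1 = 1) before combining is
# the conditional-probability idea: P(join clauses | baserel conditions).
mcv_r_cond = {v: f for v, f in mcv_r.items() if v[0] == 1}
assert (1, "x") in mcv_r_cond and (2, "y") not in mcv_r_cond
```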
---
 src/backend/optimizer/path/clausesel.c|  63 +-
 src/backend/statistics/extended_stats.c   | 805 ++
 src/backend/statistics/mcv.c  | 758 +
 .../statistics/extended_stats_internal.h  |  20 +
 src/include/statistics/statistics.h   |  12 +
 src/test/regress/expected/stats_ext.out   | 167 
 src/test/regress/sql/stats_ext.sql|  66 ++
 7 files changed, 1890 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 0ab021c1e8..bedf76edae 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -48,6 +48,9 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
 			 JoinType jointype,
 			 SpecialJoinInfo *sjinfo,
 			 bool use_extended_stats);
+static inline bool treat_as_join_clause(PlannerInfo *root,
+		Node *clause, RestrictInfo *rinfo,
+		int varRelid, SpecialJoinInfo *sjinfo);
 
 /
  *		ROUTINES TO COMPUTE SELECTIVITIES
@@ -127,12 +130,53 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	bool		single_clause_optimization = true;
+
+	/*
+	 * The optimization of skipping to clause_selectivity_ext for single
+	 * clauses means we can't improve join estimates with a single join
+	 * clause but additional baserel restrictions. So we disable it when
+	 * estimating joins.
+	 *
+	 * XXX Not sure if this is the right way to do it, but more elaborate
+	 * checks would mostly negate the whole point of the optimization.
+	 * The (Var op Var) patch has the same issue.
+	 *
+	 * XXX An alternative might be making clause_selectivity_ext smarter
+	 * and make it use the join extended stats there. But that seems kinda
+	 * against the whole point of the optimization (skipping expensive
+	 * stuff) and it's making other parts more complex.
+	 *
+	 * XXX Maybe this should check if there are at least some restrictions
+	 * on some base relations, which seems important. But then again, that
+	 * seems to go against the idea of this check to be cheap. Moreover, it
+	 * won't work for OR clauses, which may have multiple parts but we still
+

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-02 Thread Andy Fan


Hi Alexander!

>> +
>> +pg_wal_replay_wait (
>> +  target_lsn pg_lsn,
>> +  timeout bigint 
>> DEFAULT 0)
>> +void
>> +   
>>
>> Should we return the milliseconds of waiting time?  I think this
>> information may be useful for customers if they want to know how long
>> it waited, for monitoring purposes.
>
> Please, check it more carefully.  In v17 timeout is in integer milliseconds.

I guess one of us misunderstood the other :( and I indeed didn't check
the code very carefully. 

Actually, I meant the "return value" rather than the function argument.
IIUC the current return value is void per the statement below.

>> +void

If so, when users call pg_wal_replay_wait, they are informed when the
WAL is replayed to the target_lsn, but they can't know how long it
waited unless they measure it on the application side. I think such
information will sometimes be useful for monitoring purposes. 

select pg_wal_replay_wait(lsn, 1000);  may in fact take just 1 ms; in
this case, I want this function to return 1. 
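For what it's worth, the application-side workaround looks roughly like the following Python sketch (hypothetical; wait_for_replay stands in for running SELECT pg_wal_replay_wait(lsn, 1000) over a connection):

```python
import time

def wait_for_replay():
    # Stand-in for the blocking pg_wal_replay_wait() call; sleep ~1 ms to
    # simulate waiting for replay to reach the target LSN.
    time.sleep(0.001)

start = time.monotonic()
wait_for_replay()
elapsed_ms = int((time.monotonic() - start) * 1000)
assert elapsed_ms >= 1  # this is the number the function could return itself
```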

-- 
Best Regards
Andy Fan





Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-01 Thread Andy Fan


Alexander Korotkov  writes:

Hello,

>  
>  Did you forget to attach the new patch?
>
> Yes, here it is. 
>
> --
> Regards,
> Alexander Korotkov 
>
> [4. text/x-diff; 
> v17-0001-Implement-pg_wal_replay_wait-stored-procedure.patch]...

+
+pg_wal_replay_wait (
+  target_lsn pg_lsn,
+  timeout bigint 
DEFAULT 0)
+void
+   

Should we return the milliseconds of waiting time?  I think this
information may be useful for customers if they want to know how long
it waited, for monitoring purposes. 

-- 
Best Regards
Andy Fan





Re: Extract numeric filed in JSONB more effectively

2024-03-31 Thread Andy Fan

Here is the latest version; nothing changed besides the rebase onto the
latest master. The three most recent questions should be addressed:

- The error-message compatibility issue [1] and Peter's answer at [2].
- Peter's new question at [2] and my answer at [3].

Any effort to move this patch ahead is welcome, and thanks to all the
people who provided high-quality feedback so far, especially Chapman!

[1] https://www.postgresql.org/message-id/87r0hmvuvr@163.com
[2]
https://www.postgresql.org/message-id/8102ff5b-b156-409e-a48f-e53e63a39b36%40eisentraut.org
[3] https://www.postgresql.org/message-id/8734t6c5rh.fsf%40163.com

-- 
Best Regards
Andy Fan

>From fb38be5addb93d7c0b8c1a3e8376751c9b3be5f5 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Mon, 1 Apr 2024 09:36:08 +0800
Subject: [PATCH v17 1/1] Improve the performance of Jsonb numeric/bool 
 extraction.

A JSONB object uses a numeric format that is binary-compatible with the
numeric data type in SQL. However, in the past, extracting a numeric value
from a JSONB field still needed to find the corresponding JsonbValue first,
then convert the JsonbValue to Jsonb, and finally use the cast system to
convert the Jsonb to the numeric data type. This approach was very
inefficient in terms of performance.

With the current patch, the JsonbValue is converted to the numeric data
type directly.  This is done by a planner support function which detects
the above case and simplifies the expression. Because the boolean type and
the numeric type share certain similarities in their attributes, we have
implemented the same optimization approach for both.  In the ideal test
case, the performance can be 2x better than before.

The optimized functions and operators include:
1. jsonb_object_field / ->
2. jsonb_array_element / ->
3. jsonb_extract_path / #>
4. jsonb_path_query
5. jsonb_path_query_first
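The old and new conversion paths can be mimicked with a small Python analogy (an assumed illustration only; the json module stands in for the jsonb machinery):

```python
import json

def old_path(jsonb_text, key):
    # Old path: locate the value, serialize it back into a single-value
    # "Jsonb" just so the cast machinery can parse it again as numeric.
    val = json.loads(jsonb_text)[key]      # find the JsonbValue
    tmp = json.dumps(val)                  # JsonbValue -> Jsonb (wasted work)
    return float(json.loads(tmp))          # Jsonb -> numeric via the cast

def new_path(jsonb_text, key):
    # Patched path: convert the located value to numeric directly.
    return float(json.loads(jsonb_text)[key])

assert old_path('{"a": 2.5}', "a") == new_path('{"a": 2.5}', "a") == 2.5
```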
---
 src/backend/utils/adt/jsonb.c | 206 ++
 src/backend/utils/adt/jsonbsubs.c |   4 +-
 src/backend/utils/adt/jsonfuncs.c | 123 ++-
 src/backend/utils/adt/jsonpath_exec.c |  32 +++-
 src/include/catalog/pg_proc.dat   |  46 +-
 src/include/utils/jsonb.h |  11 +-
 src/test/regress/expected/jsonb.out   | 112 +-
 src/test/regress/sql/jsonb.sql|  66 -
 src/tools/pgindent/typedefs.list  |   1 +
 9 files changed, 542 insertions(+), 59 deletions(-)

diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index a5e48744ac..6e93b34fd6 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -16,9 +16,15 @@
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
+#include "nodes/makefuncs.h"
+#include "nodes/supportnodes.h"
+#include "parser/parse_coerce.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "utils/builtins.h"
+#include "utils/date.h"
+#include "utils/datetime.h"
+#include "utils/fmgroids.h"
 #include "utils/json.h"
 #include "utils/jsonb.h"
 #include "utils/jsonfuncs.h"
@@ -2035,6 +2041,206 @@ cannotCastJsonbValue(enum jbvType type, const char *sqltype)
 	elog(ERROR, "unknown jsonb type: %d", (int) type);
 }
 
+
+/*
+ * jsonb_cast_support()
+ *
+ * Planner support function for extracting numeric or bool data type more
+ * effectively. After finding out the corresponding JsonbValue, instead of
+ * casting it to Jsonb as an intermediate type, we covert it to the desired
+ * data type directly.
+ */
+Datum
+jsonb_cast_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+
+	if (IsA(rawreq, SupportRequestSimplify))
+	{
+		SupportRequestSimplify *req = (SupportRequestSimplify *) rawreq;
+		FuncExpr   *fexpr = req->fcall;
+		FuncExpr   *jsonb_start_func = NULL,
+   *jsonb_finish_func = NULL,
+   *final_func = NULL;
+		Node	   *input;
+		Oid			new_func_id = InvalidOid;
+		List	   *args;
+		Oid			input_func_id,
+	collid,
+	inputcollid;
+		bool		retset = false,
+	variadic = false;
+
+		Assert(list_length(fexpr->args) == 1);
+		input = (Node *) linitial(fexpr->args);
+
+		if (IsA(input, OpExpr))
+		{
+			OpExpr	   *opExpr = castNode(OpExpr, input);
+
+			input_func_id = opExpr->opfuncid;
+			collid = opExpr->opcollid;
+			inputcollid = opExpr->inputcollid;
+			args = opExpr->args;
+		}
+		else if (IsA(input, FuncExpr))
+		{
+			FuncExpr   *funcExpr = castNode(FuncExpr, input);
+
+			input_func_id = funcExpr->funcid;
+			collid = funcExpr->funccollid;
+			inputcollid = funcExpr->inputcollid;
+			args = funcExpr->args;
+		}
+		else
+			/* not the desired pattern. */
+			PG_RETURN_POINTER(NULL);
+
+		/* build a function to return the JsonbValue directly. */
+		switch (input_func_id)
+		{
+			case F_JSONB_OBJECT_FIELD:
+new_func_id = F_JSONB_OBJECT_

Re: a wrong index choose when statistics is out of date

2024-03-30 Thread Andy Fan

Hello everyone,

> After some more thought about the difference between the two ideas, I
> find we are resolving two different issues; it is just that in the
> wrong-index-choice cases, both of them should work generally.

Here is the formal version for the attribute reloptions direction.

commit 0d842e39275710a544b11033f5eec476147daf06 (HEAD -> force_generic)
Author: yizhi.fzh 
Date:   Sun Mar 31 11:51:28 2024 +0800

Add an attopt to disable MCV when estimating for Var = Const

As the code currently stands, when calculating the selectivity for
Var = Const, the planner first checks whether the Const is a most common
value; if not, it subtracts the MCVs' portions of the selectivity and of
the number of distinct values, and treats the selectivity for the Const
as equal across the remaining n_distinct values.

This logic works great when the optimizer statistics are up to date.
However, if the known most common values took up most of the selectivity
at the last run of analyze, and the new most common value in reality has
not been gathered yet, the estimation for the new MCV will be pretty
bad. A common case for this would be created_at = {current_date}.

To overcome this issue, we provide a new syntax:

ALTER TABLE tablename ALTER COLUMN created_at SET (force_generic=on);

After this, the planner ignores the MCV values for this column when
estimating Var = Const and treats all the values equally.

This could cause some badness if the values of a column are quite
unevenly distributed, which is what MCV is designed for; however, this
patch just provides one more option to the user and lets the user make
the decision.
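The estimation logic described above can be sketched numerically (a simplification of var_eq_const in selfuncs.c, with invented numbers):

```python
# Selectivity for "Var = Const" when the Const is NOT in the MCV list: the
# selectivity left over after the MCVs is spread evenly across the
# remaining distinct values (the null fraction is ignored for simplicity).
def non_mcv_selectivity(mcv_freqs, n_distinct):
    other_distinct = n_distinct - len(mcv_freqs)
    return (1.0 - sum(mcv_freqs)) / other_distinct

# 100 known MCVs cover 90% of the rows; a brand-new value such as
# created_at = current_date is assumed to be one of the 10 rare leftovers,
# so its estimate stays tiny even if it now dominates the table.
sel = non_mcv_selectivity([0.009] * 100, n_distinct=110)
assert abs(sel - 0.01) < 1e-9

# With force_generic=on, every constant would instead be estimated as the
# plain average, trading MCV precision for robustness to stale statistics.
generic_sel = 1.0 / 110
```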

Here is an example about its user case.

create table t(a int, b int, c int) with (autovacuum_enabled=off);
create index on t(a, b);
create index on t(a, c);
create table t2 (id int primary key, a int);
insert into t2 select i , i from generate_series(1, 800)i;

insert into t select floor(random() * 100 + 1)::int, i, i
from generate_series(1, 10) i;
analyze t,t2;

insert into t
select floor(random() * 10 + 1)::int + 100 , i, i
from generate_series(1, 1) i;

explain (costs off) select * from t where a = 109 and b = 8;
explain (costs off, analyze)
select * from t join t2 on t.c = t2.id where t.a = 109;

ALTER TABLE t ALTER COLUMN a SET (force_generic=on);

-- We will see some good result now.
explain (costs off) select * from t where a = 109 and b = 8;
explain (costs off, analyze)
select * from t join t2 on t.c = t2.id where t.a = 109;

I will add this to the commitfest application; any feedback is welcome! 

-- 
Best Regards
Andy Fan
>From 0d842e39275710a544b11033f5eec476147daf06 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Sun, 31 Mar 2024 11:51:28 +0800
Subject: [PATCH v1 1/1] Add an attopt to disable MCV when estimating for Var =
 Const

As the code currently stands, when calculating the selectivity for
Var = Const, the planner first checks whether the Const is a most common
value; if not, it subtracts the MCVs' portions of the selectivity and of
the number of distinct values, and treats the selectivity for the Const
as equal across the remaining n_distinct values.

This logic works great when the optimizer statistics are up to date.
However, if the known most common values took up most of the selectivity
at the last run of analyze, and the new most common value in reality has
not been gathered yet, the estimation for the new MCV will be pretty
bad. A common case for this would be created_at = {current_date}.

To overcome this issue, we provide a new syntax:

ALTER TABLE tablename ALTER COLUMN created_at SET (force_generic=on);

After this, the planner ignores the MCV values for this column when
estimating Var = Const and treats all the values equally.

This could cause some badness if the values of a column are quite
unevenly distributed, which is what MCV is designed for; however, this
patch just provides one more option to the user and lets the user make
the decision.
---
 src/backend/access/common/reloptions.c| 12 +++-
 src/backend/utils/adt/selfuncs.c  | 36 +++
 src/bin/psql/tab-complete.c   |  2 +-
 src/include/utils/attoptcache.h   |  1 +
 src/test/regress/expected/alter_table.out |  2 ++
 src/test/regress/sql/alter_table.sql  |  2 ++
 6 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d8559..2c74a3fcf9 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -129,6 +129,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"force_generic",
+			"estimate the selectivity like generic plan",
+			RELOPT_KIND_ATTRIBUTE,
+			ShareUpdateExclusiveLock
+		},
+		false
+	},
 	{
 		{
 			"security_barrier",
@@ -2070,7 +2079,8 @@ attribute_reloptions(Datum reloptions, bool validate)
 {
 	static const relopt_parse_elt tab[] = {
 		{"n_dis

Re: a wrong index choose when statistics is out of date

2024-03-13 Thread Andy Fan


>
> Having had the same problem for a long time, I've made an attempt and
> invented a patch that probes an index to determine whether the estimated
> constant is within statistics' scope.
> I remember David's remark on the overhead problem, but I don't argue it
> here. This patch is on the table to have one more solution sketch for
> further discussion.

I think the following code will be really horrendous from a performance
perspective; think about cases where we have thousands of tuples.

+   index_rescan(index_scan, scankeys, 1, NULL, 0);
+   while (index_getnext_tid(index_scan, ForwardScanDirection) != 
NULL)
+   {
+   ntuples++;
+   }
+

> Also, Andy, if you have a specific problem with index choosing, you can
> try a tiny option that makes the index-picking technique less dependent
> on the ordering of index lists [1].

Thanks. The index-choosing issue is already not the only issue I want to address now.

You said my patch was kind of lucky to work in [1]; have you figured
out an example to prove that?

[1]
https://www.postgresql.org/message-id/701d2097-2c5b-41e2-8629-734e3c8ba613%40postgrespro.ru
 

-- 
Best Regards
Andy Fan





Re: Extract numeric filed in JSONB more effectively

2024-03-09 Thread Andy Fan


>> But I have a different question about this patch set.  This has some
>> overlap with the JSON_VALUE function that is being discussed at
>> [0][1]. For example, if I apply the patch
>> v39-0001-Add-SQL-JSON-query-functions.patch from that thread, I can run
>>
>> select count(*) from tb where json_value(a, '$.a' returning numeric) = 2;
>>
>> and I get a noticeable performance boost over
>>
>> select count(*) from tb where cast (a->'a' as numeric) = 2;
>
> Here is my test and profile about the above 2 queries.
>
..
> As we can see, the patch here has the best performance (this result looks
> to be different from yours?).
>
> After I check the code, I am sure both patches *don't* have the problem
> in master where it get a jsonbvalue first and convert it to jsonb and
> then cast to numeric.
>
> Then I perf the result, and find the below stuff:
>
..

> JSON_VALUE has a much longer path to reach getKeyJsonValueFromContainer,
> so I think JSON_VALUE is probably designed for more complex paths,
> which need to pay extra effort, and that brings about the above
> performance difference.


Hello Peter,

Thanks for highlighting the JSON_VALUE patch! Here is the situation in
my mind now.

My patch is designed to *not* introduce any new user-facing functions,
but to let some existing functions run faster. The JSON_VALUE patch is
designed to follow the SQL standard more closely, so it introduces one
new function which has more flexibility in ERROR handling [1].

Both patches are helpful for the subject here, but my patch has better
performance per my testing. I don't think I did anything better here; it
is just that the JSON_VALUE function is designed for a more generic
purpose and has to pay some extra cost. Even if we have some chance to
improve JSON_VALUE, I don't think that should block the patch here (I'd
like to learn more about this; it may take some time!).

So I think my patch here can go ahead again; what do you think? 

[1]
https://www.postgresql.org/message-id/CACJufxGtetrn34Hwnb9D2if5D_HOPAh235MtEZ1meVYx-BiNtg%40mail.gmail.com
 

-- 
Best Regards
Andy Fan





Re: remaining sql/json patches

2024-03-09 Thread Andy Fan


jian he  writes:

> On Tue, Mar 5, 2024 at 12:38 PM Andy Fan  wrote:
>>
>>
>> In the commit message of 0001, we have:
>>
>> """
>> Both JSON_VALUE() and JSON_QUERY() functions have options for
>> handling EMPTY and ERROR conditions, which can be used to specify
>> the behavior when no values are matched and when an error occurs
>> during evaluation, respectively.
>>
>> All of these functions only operate on jsonb values. The workaround
>> for now is to cast the argument to jsonb.
>> """
>>
>> which does not make it clear to me why we introduce the JSON_VALUE()
>> function; is it for handling EMPTY or ERROR conditions? I think the
>> existing cast workaround has a similar capability.
>>
>
> I guess because it's in the standard.
> but I don't see individual sql standard Identifier, JSON_VALUE in
> sql_features.txt
> I do see JSON_QUERY.
> mysql also have JSON_VALUE, [1]
>
> EMPTY, ERROR: there is a standard Identifier: T825: SQL/JSON: ON EMPTY
> and ON ERROR clauses
>
> [1] 
> https://dev.mysql.com/doc/refman/8.0/en/json-search-functions.html#function_json-value

Thank you for this information!

-- 
Best Regards
Andy Fan





Re: a wrong index choose when statistics is out of date

2024-03-08 Thread Andy Fan


After some more thought about the difference between the two ideas, I
find we are resolving two different issues; it is just that in the
wrong-index-choice cases, both of them should work generally.

Your idea actually adds some rule-based logic named certainty_factor,
and the implementation is very graceful. For the example in this case,
it takes effect *when both indexes have the same cost*. I believe that
can resolve the index choice here, but what about the rows-estimation
issue? Given that the design will not fudge the cost anyway, I assume
you will not fudge the rows or selectivity either. Then if the optimizer
statistics are missing, what can we do for both index choosing and rows
estimation? I think that's where my idea comes in.

Given that optimizer statistics can't always be up to date by design,
and assuming we have a situation where the customer's queries need those
statistics often, how about doing the prediction with the historical
statistics? That can cover both index choice and rows estimation. Then
the following objections may be raised: a) we can't decide
*automatically* when the missing optimizer statistics are wanted; b) if
we predict the estimation with the historical statistics, the value of
the MCV information is lost. My answers: a) it is controlled by a human
with "alter table t alter column a set (force_generic=on)"; b) that
can't be resolved, I think, and it only matters when the real Const is
very different from the ones in history; a generic plan has the same
issue, I think.

I just reviewed the bad query plans from the past half year internally
and found that many queries used a Nested Loop, which was the direct
cause. Now I think I have found a new reason for this: the missing
optimizer statistics cause the estimated rows of the outer relation to
be 1, which makes the Nested Loop be chosen. I'm not sure whether your
idea could help with this, or help more than mine in this aspect.

-- 
Best Regards
Andy Fan





Re: a wrong index choose when statistics is out of date

2024-03-08 Thread Andy Fan


David Rowley  writes:

> On Thu, 7 Mar 2024 at 23:40, Andy Fan  wrote:
>>
>> David Rowley  writes:
>> > If you don't want the planner to use the statistics for the column why
>> > not just do the following?
>>
>> Actually, I didn't want the planner to ignore the statistics totally;
>> I want the planner to treat the "Const" that is probably missing from
>> the optimizer statistics as average, which is just like what we do in
>> a generic plan for the query below.
>
> I'm with Andrei on this one and agree with his "And it is just luck
> that you've got the right answer".

Any example to support this conclusion? And what is the new problem
after this?

-- 
Best Regards
Andy Fan





Re: a wrong index choose when statistics is out of date

2024-03-07 Thread Andy Fan


Andrei Lepikhov  writes:

> On 5/3/2024 19:56, Andy Fan wrote:
>> I think it is OK for a design review; on the implementation side, the
>> known issues include:
>> 1. Supporting grabbing such information from its parent for a
>> partitioned table if the child doesn't have such information.
>> 2. Built-in documentation and testing.
>> Any feedback is welcome.
> Thanks for your efforts.
> I was confused when you showed the problem connected to clauses like
> "Var op Const" and "Var op Param".

Hmm, then what is the solution in your mind when you say "tricky" in
[1]? I am thinking we have some communication gap here.

> As far as I know, the estimation logic of such clauses uses MCV and
> number-distinct statistics. So, being out of MCV values, it becomes
> totally insensitive to any internal skew in data and any data outside
> the statistics boundaries.
> Having studied the example you provided with the patch, I think it is
> not a correct example:
> Difference between var_eq_const and var_eq_non_const quite obvious:

The response should be the same as what I gave in [2]; let's see if we
can make the gap between us smaller.

> In the second routine, you don't have information about the const value
> and can't use MCV for estimation. Also, you can't exclude MCV values
> from the estimation. And it is just luck that you've got the right
> answer. I think if you increased the weight of the unknown part, you
> would get a bad result, too.

> I would like to ask David why the var_eq_const estimator doesn't have an
> option for estimation with a histogram. Having that would relieve a
> problem with skewed data. Detecting the situation with incoming const
> that is out of the covered area would allow us to fall back to ndistinct
> estimation or something else. At least, histogram usage can be
> restricted by the reltuples value and ratio between the total number of
> MCV values and the total number of distinct values in the table.

I think an example which shows your algorithm would be pretty helpful
for communication. 
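To show concretely the behavior I am talking about, here is a toy
Python model of an MCV-based equality estimator (a simplified
illustration of the idea only, not the real var_eq_const code; all
names are made up). A constant that only arrived after ANALYZE is
outside both the MCV list and the histogram, so its estimate is driven
entirely by the stale residual count, no matter how many matching rows
actually exist now:

```python
from collections import Counter

def analyze(data, mcv_size=3):
    """Gather toy statistics: an MCV list (value -> count) and ndistinct."""
    freq = Counter(data)
    mcv = dict(freq.most_common(mcv_size))
    return mcv, len(freq), len(data)

def est_rows_const(const, mcv, ndistinct, n):
    """var = const: use the MCV count if present, otherwise spread the
    non-MCV rows evenly over the remaining distinct values."""
    if const in mcv:
        return mcv[const]
    rest = n - sum(mcv.values())
    return rest / max(ndistinct - len(mcv), 1)

# Data present at ANALYZE time: values 1, 2, 3 only.
mcv, nd, n = analyze([1] * 50 + [2] * 30 + [3] * 20)

print(est_rows_const(1, mcv, nd, n))  # 50: in the MCV list, good estimate
print(est_rows_const(4, mcv, nd, n))  # 0.0: arrived after ANALYZE, "no rows"
```

With the estimate stuck at ~0 rows for the new value, any index on the
column looks equally cheap, which is exactly how the wrong index gets
chosen.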

[1] 
https://www.postgresql.org/message-id/15381eea-cbc3-4087-9d90-ab752292bd54%40postgrespro.ru
[2] https://www.postgresql.org/message-id/87msra9vgo.fsf%40163.com
-- 
Best Regards
Andy Fan





Re: a wrong index choose when statistics is out of date

2024-03-07 Thread Andy Fan


David Rowley  writes:

> On Wed, 6 Mar 2024 at 02:09, Andy Fan  wrote:
>> This patch introduces a new attoptions like this:
>>
>> ALTER TABLE t ALTER COLUMN col set (force_generic=true);
>>
>> Then selfunc.c realizes this and ignore the special Const value, then
>> average selectivity is chosen. This fall into the weakness of generic
>> plan, but this patch doesn't introduce any new weakness and we leave the
>> decision to user which could resolve some problem. Also this logic only
>> apply to eqsel since the ineqsel have the get_actual_variable_range
>> mechanism which is helpful for index choose case at least.
>
> If you don't want the planner to use the statistics for the column why
> not just do the following?

Actually I didn't want the planner to ignore the statistics totally; I
want the planner to treat a "Const" that probably falls outside the
optimizer statistics as an average value, just like what we do in a
generic plan for the query below.

prepare s as SELECT * FROM t WHERE a = $1 and b = $2;
explain (costs off) execute s(109, 8);
   QUERY PLAN
-
 Index Scan using t_a_c_idx on t
   Index Cond: (a = 109)
   Filter: (b = 8) 

(3 rows)

Custom plan: the wrong index is chosen because we have a bad estimate
for a = 109.


set plan_cache_mode to force_generic_plan ;
explain (costs off) execute s(109, 8);
  QUERY PLAN   
---
 Index Scan using t_a_b_idx on t
   Index Cond: ((a = $1) AND (b = $2))   -- Correct index.
(2 rows)

Generic plan - we use the average estimate for the part missing from
the optimizer statistics, and *if the new value is not so different
from the existing ones*, we can get the desired result.

It is true that the "generic" way is not as accurate as the "custom"
way, since the latter can use the data in the MCV list, but that is the
cost we have to pay to make the missing optimizer statistics less
important, and a generic plan has the same issue as well. In this
respect, I think the way you proposed probably has a wider use case.
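To spell out the custom-vs-generic difference in a few lines, here is a
simplified toy model in Python (an illustration only, not the real
selfuncs.c logic; the function names are made up):

```python
from collections import Counter

def custom_est(const, analyzed_rows):
    # "var = const" with a known constant: look the value up in the
    # analyzed sample (MCV-style); unseen values estimate to 0 rows.
    return Counter(analyzed_rows)[const]

def generic_est(analyzed_rows):
    # "var = $1" with an unknown parameter: assume the average
    # frequency, n / ndistinct.
    return len(analyzed_rows) / len(set(analyzed_rows))

# Statistics were gathered before any a = 109 rows existed.
analyzed = [1] * 50 + [2] * 30 + [3] * 20

print(custom_est(109, analyzed))  # 0: both indexes look equally cheap
print(generic_est(analyzed))      # ~33.3: average, insensitive to staleness
```

The generic estimate is less precise for values the statistics do know
about, but it never collapses to zero for the ones they don't.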

-- 
Best Regards
Andy Fan





Re: a wrong index choose when statistics is out of date

2024-03-05 Thread Andy Fan

Hi,

>
>> We should do anything like add column options in the meantime. Those
>> are hard to remove once added.
>
> I will try it very soon.

Attached is a PoC version, and here is the test case.

create table t(a int, b int, c int) with (autovacuum_enabled=off);
create index on t(a, b);
create index on t(a, c);

insert into t select floor(random() * 100 + 1)::int, i, i
from generate_series(1, 10) i;

analyze t;

insert into t
select floor(random() * 10 + 1)::int + 100 , i, i
from generate_series(1, 1000) i;

-- one of the below queries would choose a wrong index.
-- here is the result from my test.
explain (costs off) select * from t where a = 109 and c = 8;
  QUERY PLAN   
---
 Index Scan using t_a_c_idx on t
   Index Cond: ((a = 109) AND (c = 8))
(2 rows)

explain (costs off) select * from t where a = 109 and b = 8;
   QUERY PLAN
-
 Index Scan using t_a_c_idx on t
   Index Cond: (a = 109)
   Filter: (b = 8)
(3 rows)

Wrong index is chosen for the second case!

-- After applying the new API.

alter table t alter column a set (force_generic=on);

explain (costs off) select * from t where a = 109 and c = 8;
  QUERY PLAN   
---
 Index Scan using t_a_c_idx on t
   Index Cond: ((a = 109) AND (c = 8))
(2 rows)

explain (costs off) select * from t where a = 109 and b = 8;
  QUERY PLAN   
---
 Index Scan using t_a_b_idx on t
   Index Cond: ((a = 109) AND (b = 8))
(2 rows)

Then both cases can choose a correct index.

commit f8cca472479c50ba73479ec387882db43d203522 (HEAD -> shared_detoast_value)
Author: yizhi.fzh 
Date:   Tue Mar 5 18:27:48 2024 +0800

Add a "force_generic" attoptions for selfunc.c

Sometimes users just care about recent data, and the optimizer
statistics for such data are not gathered yet, so some bad decisions
may happen. Before this patch, we had to run autoanalyze very
frequently, which is not only expensive but may also be too late.

This patch introduces a new attoption like this:

ALTER TABLE t ALTER COLUMN col set (force_generic=true);

Then selfunc.c recognizes this and ignores the special Const value, so
the average selectivity is chosen. This falls back on the weakness of a
generic plan, but the patch doesn't introduce any new weakness, and we
leave the decision to the user, which can resolve some problems. Also,
this logic only applies to eqsel, since ineqsel has the
get_actual_variable_range mechanism, which is helpful for the index
choice case at least.

I think it is OK for a design review. On the implementation side, the
known issues include:

1. Supporting grabbing such information from the parent of a
partitioned table if the child doesn't have it.
2. Built-in documentation and testing. 

Any feedback is welcome.

-- 
Best Regards
Andy Fan

>From f8cca472479c50ba73479ec387882db43d203522 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Tue, 5 Mar 2024 18:27:48 +0800
Subject: [PATCH v0 1/1] Add a "force_generic" attoptions for selfunc.c

Sometime user just care about the recent data and the optimizer
statistics for such data is not gathered, then some bad decision may
happen. Before this patch, we have to make the autoanalyze often and
often, but it is not only expensive but also may be too late.

This patch introduces a new attoptions like this:

ALTER TABLE t ALTER COLUMN col set (force_generic=true);

Then selfunc.c realizes this and ignore the special Const value, then
average selectivity is chosen. This fall into the weakness of generic
plan, but this patch doesn't introduce any new weakness and we leave the
decision to user which could resolve some problem. Also this logic only
apply to eqsel since the ineqsel have the get_actual_variable_range
mechanism which is helpful for index choose case at least.
---
 src/backend/access/common/reloptions.c | 12 -
 src/backend/utils/adt/selfuncs.c   | 35 ++
 src/bin/psql/tab-complete.c|  2 +-
 src/include/utils/attoptcache.h|  1 +
 4 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 0921a736ab..c8444ea577 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -131,6 +131,15 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	{
+		{
+			"force_generic",
+			"estimate the selectivity like generic plan",
+			RELOPT_KIND_ATTRIBUTE,
+			ShareUpdateExclusiveLock
+		},
+		false
+	},
 	{
 		{
 			"security_barrier",
@@ -2072,7 +2081,8 @@ attribute_reloptions(Datum reloptions, bool validate)
 {
 	static const relopt_parse_elt tab[] = {
 		{"n_distinct", REL

Re: a wrong index choose when statistics is out of date

2024-03-04 Thread Andy Fan


David Rowley  writes:

> On Tue, 5 Mar 2024 at 00:37, Andy Fan  wrote:
>>
>> David Rowley  writes:
>> > I don't think it would be right to fudge the costs in any way, but I
>> > think the risk factor for IndexPaths could take into account the
>> > number of unmatched index clauses and increment the risk factor, or
>> > "certainty_factor" as it is currently in my brain-based design. That
>> > way add_path() would be more likely to prefer the index that matches
>> > the most conditions.
>>
>> This is somehow similar with my proposal at [1]?  What do you think
>> about the treat 'col op const' as 'col op $1' for the marked column?
>> This could just resolve a subset of questions in your mind, but the
>> method looks have a solid reason.
>
> Do you mean this?

Yes, though it was not cautious enough of me to say "similar" so
quickly.

After reading your opinion again, I think what you are trying to do is
add one more dimension to Path, in addition to the existing cost and
pathkey information, which would take effect at the add_path stage.
That is impressive, and I'm quite willing to do more testing once v1 is
done.

I just noted you have expressed your idea about my proposal 1,

> We should do anything like add column options in the meantime. Those
> are hard to remove once added.

I will try it very soon.  And I'm a bit upset that no one cares about
my proposal 2, the AI method; I see many companies wanting to introduce
AI into the planner, even though I haven't seen any impressive success
yet, but this use case looks like a candidate. 

>> + /*
>> + * To make the planner more robust to handle some inaccurate statistics
>> + * issue, we will add a extra cost to qpquals so that the less qpquals
>> + * the lower cost it has.
>> + */
>> + cpu_run_cost += 0.01 * list_length(qpquals);
>
> I don't think it's a good idea to add cost penalties like you proposed
> there. This is what I meant by "I don't think it would be right to
> fudge the costs in any way".
>
> If you modify the costs to add some small penalty so that the planner
> is more likely to favour some other plan, what happens if we then
> decide the other plan has some issue and we want to penalise that for
> some other reason? Adding the 2nd penalty might result in the original
> plan choice again. Which one should be penalised more? I think the
> uncertainty needs to be tracked separately.
>
> Fudging the costs like this is also unlikely to play nicely with
> add_path's use of STD_FUZZ_FACTOR.  There'd be an incentive to do
> things like total_cost *= STD_FUZZ_FACTOR; to ensure we get a large
> enough penalty.

I agree and I just misunderstood your proposal yesterday. 
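To check that I now read the idea correctly, here is a toy sketch of an
add_path-style comparison where certainty is tracked separately from
cost (entirely my own illustration, not planner code; FUZZ, the
certainty values, and the numbers are all made up). Costs stay honest,
and the certainty factor only breaks near-ties, so stacking penalties
cannot silently flip a clearly cheaper plan:

```python
FUZZ = 1.01  # stand-in for something like add_path's STD_FUZZ_FACTOR

def better(a, b):
    """Pick between two (cost, certainty) paths: a clearly cheaper path
    wins outright; near-ties go to the path with higher certainty."""
    (cost_a, cert_a), (cost_b, cert_b) = a, b
    if cost_a * FUZZ < cost_b:
        return a
    if cost_b * FUZZ < cost_a:
        return b
    return a if cert_a >= cert_b else b

idx_ab = (100.0, 2)  # matches both index clauses -> higher certainty
idx_ac = (99.5, 1)   # one clause left as a filter -> riskier

print(better(idx_ab, idx_ac))     # (100.0, 2): near-tie, certainty decides
print(better((50.0, 1), idx_ab))  # (50.0, 1): real cost gaps still win
```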

-- 
Best Regards
Andy Fan





Re: remaining sql/json patches

2024-03-04 Thread Andy Fan


Hi,

> On Tue, Mar 5, 2024 at 12:03 AM Alvaro Herrera  
> wrote:
>> On 2024-Mar-04, Erik Rijkers wrote:
>>
>> > In my hands (applying with patch), the patches, esp. 0001, do not apply.
>> > But I see the cfbot builds without problem so maybe just ignore these 
>> > FAILED
>> > lines.  Better get them merged - so I can test there...
>>
>> It's because of dbbca2cf299b.  It should apply cleanly if you do "git
>> checkout dbbca2cf299b^" first ...  That commit is so recent that
>> evidently the cfbot hasn't had a chance to try this patch again since it
>> went in, which is why it's still green.
>
> Thanks for the heads up.  Attaching rebased patches.

In the commit message of 0001, we have:

"""
Both JSON_VALUE() and JSON_QUERY() functions have options for
handling EMPTY and ERROR conditions, which can be used to specify
the behavior when no values are matched and when an error occurs
during evaluation, respectively.

All of these functions only operate on jsonb values. The workaround
for now is to cast the argument to jsonb.
"""

It is not clear to me from this why we are introducing the JSON_VALUE()
function. Is it for handling the EMPTY or ERROR conditions? I think the
existing cast workaround has similar capability.

Then, if it is introduced as a performance improvement like [1], the
test at [1] might be interesting. If that is the case, the method in
[1] would save users from having to modify their queries to use the new
function. 

[1] https://www.postgresql.org/message-id/8734t6c5rh@163.com

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-03-04 Thread Andy Fan


>
>> 2. more likely to use up all the memory which is allowed. for example:
>> if we set the limit to 1MB, then we kept more data which will be not
>> used and then consuming all of the 1MB. 
>> 
>> My method is resolving this with some helps from other modules (kind of
>> invasive) but can control the eviction well and use the memory as less
>> as possible.
>> 
>
> Is the memory usage really an issue? Sure, it'd be nice to evict entries
> as soon as we can prove they are not needed anymore, but if the cache
> limit is set to 1MB it's not really a problem to use 1MB.

This might be a key point that leads us in different directions, so I
want to explain more about it, to see if we can reach some agreement
here.

It is a bit hard to decide which memory limit to set: 1MB, 10MB, 40MB
or 100MB. In my current case it is at least 40MB. A lower memory limit
makes the cache ineffective; a higher limit causes a potential
memory-use-up issue in the TOAST cache design. But in my method, even
if we set a higher value, memory is only used when it is really (or
nearly) needed, and no more values are cached once the limit is hit.
This makes a noticeable difference when we want to set a high limit and
we have some highly active sessions, like 100 * 40MB = 4GB. 

> On 3/4/24 18:08, Andy Fan wrote:
>> ...
>>>
>>>> I assumed that releasing all of the memory at the end of executor once
>>>> is not an option since it may consumed too many memory. Then, when and
>>>> which entry to release becomes a trouble for me. For example:
>>>>
>>>>   QUERY PLAN
>>>> --
>>>>  Nested Loop
>>>>Join Filter: (t1.a = t2.a)
>>>>->  Seq Scan on t1
>>>>->  Seq Scan on t2
>>>> (4 rows)
>>>>
>>>> In this case t1.a needs a longer lifespan than t2.a since it is
>>>> in outer relation. Without the help from slot's life-cycle system, I
>>>> can't think out a answer for the above question.
>>>>
>>>
>>> This is true, but how likely such plans are? I mean, surely no one would
>>> do nested loop with sequential scans on reasonably large tables, so how
>>> representative this example is?
>> 
>> Acutally this is a simplest Join case, we still have same problem like
>> Nested Loop + Index Scan which will be pretty common. 
>> 
>
> Yes, I understand there are cases where LRU eviction may not be the best
> choice - like here, where the "t1" should stay in the case. But there
> are also cases where this is the wrong choice, and LRU would be better.
>
> For example a couple paragraphs down you suggest to enforce the memory
> limit by disabling detoasting if the memory limit is reached. That means
> the detoasting can get disabled because there's a single access to the
> attribute somewhere "up the plan tree". But what if the other attributes
> (which now won't be detoasted) are accessed many times until then?

I am not sure I follow here, but I want to explain more about the
disable-detoast-sharing logic when the memory limit is hit. When this
happens, detoast sharing is disabled, but since the detoasted datum is
released very soon once the slot->tts_values[*] is discarded, the
'disabled' state turns back to 'enabled' quickly. So it is not true
that once sharing is disabled, it can't be enabled again for the given
query.

> I think LRU is a pretty good "default" algorithm if we don't have a very
> good idea of the exact life span of the values, etc. Which is why other
> nodes (e.g. Memoize) use LRU too.

> But I wonder if there's a way to count how many times an attribute is
> accessed (or is likely to be). That might be used to inform a better
> eviction strategy.

Yes, but for the current issue we can get a better estimate with the
help of the plan shape, and Memoize depends on some planner information
as well. If we bypass the planner information and try to resolve this
at the cache level, the code may become too complex as well, and all of
the cost is run-time overhead, while the other way is planning-time
overhead.

> Also, we don't need to evict the whole entry - we could evict just the
> data part (guaranteed to be fairly large), but keep the header, and keep
> the counts, expected number of hits, and other info. And use this to
> e.g. release entries that reached the expected number of hits. But I'd
> probably start with LRU and only do this as an improvement later.

A great lesson learned here, thanks for sharing this!
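For the record, the LRU-with-a-memory-limit scheme being discussed can
be sketched like this (a generic Python illustration of the idea only;
it has nothing to do with the actual patch, and the byte-based limit
and class name are my own assumptions). Entries are keyed by the toast
pointer, sized by the detoasted value, and the oldest entries are
evicted once the limit is exceeded:

```python
from collections import OrderedDict

class ToastCache:
    """LRU cache bounded by total detoasted bytes, not entry count."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0
        self.entries = OrderedDict()      # toast pointer -> detoasted value

    def get(self, ptr):
        if ptr in self.entries:
            self.entries.move_to_end(ptr)  # mark as most recently used
            return self.entries[ptr]
        return None

    def put(self, ptr, value):
        if ptr in self.entries:
            self.used -= len(self.entries.pop(ptr))
        self.entries[ptr] = value
        self.used += len(value)
        while self.used > self.limit:      # evict oldest until under limit
            _, old = self.entries.popitem(last=False)
            self.used -= len(old)

cache = ToastCache(limit_bytes=8)
cache.put("t1", b"aaaa")
cache.put("t2", b"bbbb")
cache.get("t1")                # touch t1 so t2 becomes the eviction victim
cache.put("t3", b"cccc")       # over limit: evicts t2, keeps t1 and t3
print(sorted(cache.entries))   # ['t1', 't3']
```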

What I want to highlight for the current use case is that we are
"caching" "user data" "locally".


Re: Shared detoast Datum proposal

2024-03-04 Thread Andy Fan
 difference in real
>> case, so my current concern mainly comes from factor 1.
>> """
>> 
>
> This seems a bit dismissive of the (possible) issue.

Hmm, sorry about this, that is not my intention:(

> It'd be good to do
> some measurements, especially on simple queries that can't benefit from
> the detoasting (e.g. because there's no varlena attribute).

Testing of queries which have no varlena attribute was done at the very
beginning of this thread, and by now the code is much nicer for this
situation: all the changes in execExpr.c and execExprInterp.c have no
impact on it; only the plan walker code matters. Here is a test based
on tenk1. 

q1: explain select count(*) from tenk1 where odd > 10 and even > 30;
q2: select count(*) from tenk1 where odd > 10 and even > 30;

pgbench -n -f qx.sql regression -T10

| Query | master (ms) | patched (ms) |
|---+-+--|
| q1|   0.129 |0.129 |
| q2|   1.876 |1.870 |

There is some noise in the patched q2 numbers, but at least this shows
the patch doesn't cause a noticeable regression.

> In any case, my concern is more about having to do this when creating
> the plan at all, the code complexity etc. Not just because it might have
> performance impact.

I think the main trade-off is that the TOAST cache method is pretty
non-invasive but can't control eviction well. The impacts include:
1. It may evict the datum we want and keep the datum we don't need.
2. It is more likely to use up all of the memory it is allowed. For
example, if we set the limit to 1MB, it may keep data that will never
be used and still consume the whole 1MB. 

My method resolves this with some help from other modules (which is
somewhat invasive), but it can control eviction well and uses as little
memory as possible.

-- 
Best Regards
Andy Fan





Re: Extract numeric filed in JSONB more effectively

2024-03-04 Thread Andy Fan


Peter Eisentraut  writes:

> On 09.02.24 10:05, Andy Fan wrote:
>> 2. Where is the current feature blocked for the past few months?
>> It's error message compatible issue! Continue with above setup:
>> master:
>> select * from tb where (a->'b')::numeric > 3::numeric;
>> ERROR:  cannot cast jsonb string to type numeric
>> select * from tb where (a->'b')::int4 > 3::numeric;
>> ERROR:  cannot cast jsonb string to type integer
>> You can see the error message is different (numeric vs integer).
>> Patched:
>> We still can get the same error message as master BUT the code
>> looks odd.
>> select * from tb where (a->'b')::int4 > 3;
>>  QUERY PLAN
>> ---
>>   Seq Scan on public.tb
>> Output: a
>> Filter: 
>> ((jsonb_finish_numeric(jsonb_object_field_start((tb.a)::internal, 
>> 'b'::text), '23'::oid))::integer > 3)
>> (3 rows)
>> You can see "jsonb_finish_numeric(..,  '23::oid)" the '23::oid' is
>> just
>> for the *"integer"* output in error message:
>> "cannot cast jsonb string to type*integer*"
>> Now the sistuation is either we use the odd argument (23::oid) in
>> jsonb_finish_numeric, or we use a incompatible error message with the
>> previous version. I'm not sure which way is better, but this is the
>> place the current feature is blocked.
>
> I'm not bothered by that.  It also happens on occasion in the backend C
> code that we pass around extra information to be able to construct
> better error messages.  The functions here are not backend C code, but
> they are internal functions, so similar considerations can apply.

Thanks for speaking on this!

>
> But I have a different question about this patch set.  This has some
> overlap with the JSON_VALUE function that is being discussed at
> [0][1]. For example, if I apply the patch
> v39-0001-Add-SQL-JSON-query-functions.patch from that thread, I can run
>
> select count(*) from tb where json_value(a, '$.a' returning numeric) = 2;
>
> and I get a noticeable performance boost over
>
> select count(*) from tb where cast (a->'a' as numeric) = 2;

Here is my test and profile about the above 2 queries.

create table tb(a jsonb);
insert into tb
select jsonb_build_object('a', i) from generate_series(1, 1)i;

cat a.sql
select count(*) from tb
where json_value(a, '$.a' returning numeric) = 2;

cat b.sql
select count(*) from tb where cast (a->'a' as numeric) = 2;

pgbench -n -f xxx.sql postgres -T100 | grep lat

Then here is the result:

| query | master  | patched (patch here and jsonb_value) |
|---+-+-|
| a.sql | /   | 2.59 (ms)   |
| b.sql | 3.34 ms | 1.75 (ms)   |

As we can see, the patch here has the best performance (this result
looks different from yours?).

After checking the code, I am sure both patches *don't* have the
problem in master where it gets a JsonbValue first, converts it to
jsonb, and then casts that to numeric.

Then I profiled both queries with perf and found the following:

JSON_VALUE

-   32.02% 4.30%  postgres  postgres   [.] JsonPathValue
   - 27.72% JsonPathValue
  - 22.63% executeJsonPath (inlined)
 - 19.97% executeItem (inlined)
- executeItemOptUnwrapTarget
   - 17.79% executeNextItem
  - 15.49% executeItem (inlined)
 - executeItemOptUnwrapTarget
+ 8.50% getKeyJsonValueFromContainer (note here..)
+ 2.30% executeNextItem
  0.79% findJsonbValueFromContainer
+ 0.68% check_stack_depth
  + 1.51% jspGetNext
   + 0.73% check_stack_depth
   1.27% jspInitByBuffer
   0.85% JsonbExtractScalar
  + 4.91% DatumGetJsonbP (inlined)

Patch here for b.sql:
-

-   19.98% 2.10%  postgres  postgres   [.] jsonb_object_field_start
   - 17.88% jsonb_object_field_start
  - 17.70% jsonb_object_field_internal (inlined)
 + 11.03% getKeyJsonValueFromContainer
 - 6.26% DatumGetJsonbP (inlined)
+ 5.78% detoast_attr
   + 1.21% _start
   + 0.54% 0x55ddb44552a

JSON_VALUE takes a much longer path to reach
getKeyJsonValueFromContainer, so I think JSON_VALUE is probably
designed for more complex paths which require extra effort, and that
explains the performance difference above.

I added Amit and Alvaro to this thread in case they have more insight
on this.

> So some questions to think about:
>
> 1. Compare performance of base case vs

Re: Shared detoast Datum proposal

2024-03-04 Thread Andy Fan


Tomas Vondra  writes:

>>> But I'm a bit surprised the patch needs to pass a
>>> MemoryContext to so many places as a function argument - that seems to
>>> go against how we work with memory contexts. Doubly so because it seems
>>> to only ever pass CurrentMemoryContext anyway. So what's that about?
>> 
>> I think you are talking about the argument like this:
>>  
>>  /* --
>> - * detoast_attr -
>> + * detoast_attr_ext -
>>   *
>>   *  Public entry point to get back a toasted value from compression
>>   *  or external storage.  The result is always non-extended varlena form.
>>   *
>> + * ctx: The memory context which the final value belongs to.
>> + *
>>   * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
>>   * datum, the result will be a pfree'able chunk.
>>   * --
>>   */
>> 
>> +extern struct varlena *
>> +detoast_attr_ext(struct varlena *attr, MemoryContext ctx)
>> 
>> This is mainly because 'detoast_attr' will apply more memory than the
>> "final detoast datum" , for example the code to scan toast relation
>> needs some memory as well, and what I want is just keeping the memory
>> for the final detoast datum so that other memory can be released sooner,
>> so I added the function argument for that. 
>> 
>
> What exactly does detoast_attr allocate in order to scan toast relation?
> Does that happen in master, or just with the patch?

It is in current master; for example, the TupleTableSlot creation
needed for scanning the toast relation requires a memory allocation. 

> If with master, I
> suggest to ignore that / treat that as a separate issue and leave it for
> a different patch.

OK, I can make that a separate commit in the next version. 

> In any case, the custom is to allocate results in the context that is
> set in CurrentMemoryContext at the moment of the call, and if there's
> substantial amount of allocations that we want to free soon, we either
> do that by explicit pfree() calls, or create a small temporary context
> in the function (but that's more expensive).
>
> I don't think we should invent a new convention where the context is
> passed as an argument, unless absolutely necessary.

Hmm, in this case, if we don't add the new argument, we have to produce
the detoasted datum in one context and copy it to the desired memory
context, which is what I want to avoid.  I think we have the same
decision to make for the TOAST cache method as well.

-- 
Best Regards
Andy Fan





Re: Eager aggregation, take 3

2024-03-04 Thread Andy Fan


Richard Guo  writes:

> Hi All,
>
> Eager aggregation is a query optimization technique that partially
> pushes a group-by past a join, and finalizes it once all the relations
> are joined.  Eager aggregation reduces the number of input rows to the
> join and thus may result in a better overall plan.  This technique is
> thoroughly described in the 'Eager Aggregation and Lazy Aggregation'
> paper [1].

This is a really helpful and difficult task. Even though I am not sure
when I can find time to study it, I want to say "Thanks for working on
this!" first, and I hope we can make real progress on this topic. Good
luck! 

-- 
Best Regards
Andy Fan





Re: a wrong index choose when statistics is out of date

2024-03-04 Thread Andy Fan


Andrei Lepikhov  writes:

> On 3/3/2024 14:01, Andy Fan wrote:
>> 1. We can let the user define the column as the value is increased day by
>> day. the syntax may be:
>> ALTER TABLE x_events ALTER COLUMN created_at ALWAYS_INCREASED.
>> then when a query like 'create_at op const', the statistics module
>> can
>> treat it as 'created_at = $1'. so the missing statistics doesn't make
>> difference. Then I think the above issue can be avoided.
> Let me write some words to support your efforts in that way.
> I also have some user cases where they periodically insert data in large
> chunks. These chunks contain 'always increased' values, and it causes
> trouble each time they start an analytic query over this new data before
> the analyze command.
> I have thought about that issue before but invented nothing special
> except a more aggressive analysis of such tables.

I have to say we ran into exactly the same situation and used the same
trick to solve the problem, and we know that no matter how aggressive
the analysis is, the problem may still happen.

> Your trick can work, but it needs a new parameter in pg_type and a lot
> of additional code for such a rare case.
> I'm looking forward to the demo patch.

Maybe my word "auto_increased" sounds too much like a type, but what I
actually want is to add a new attribute to pg_attribute which is tied
to one column of one relation. When we compute a selectivity on that
*column*, we apply this trick. This probably doesn't need much code.

-- 
Best Regards
Andy Fan





Re: a wrong index choose when statistics is out of date

2024-03-04 Thread Andy Fan


David Rowley  writes:

> On Sun, 3 Mar 2024 at 20:08, Andy Fan  wrote:
>> The issue can be reproduced with the following steps:
>>
>> create table x_events (.., created_at timestamp, a int, b int);
>>
>> create index idx_1 on t(created_at, a);
>> create index idx_2 on t(created_at, b);
>>
>> query:
>> select * from t where create_at = current_timestamp and b = 1;
>>
>> index (created_at, a) rather than (created_at, b) may be chosen for the
>> above query if the statistics think "create_at = current_timestamp" has
>> no rows, then both index are OK, actually it is true just because
>> statistics is out of date.
>
> I don't think there's really anything too special about the fact that
> the created_at column is always increasing. We commonly get 1-row
> estimates after multiplying the selectivities from individual stats.
> Your example just seems like yet another reason that this could
> happen.

You are right that there are more cases which lead to this. However,
this is the only case where the created_at = $1 trick can work, which
was the problem I wanted to resolve when writing this. 

> I've been periodically talking about introducing "risk" as a factor
> that the planner should consider.  I did provide some detail in [1]
> about the design that was in my head at that time.  I'd not previously
> thought that it could also solve this problem, but after reading your
> email, I think it can.

Haha, I remember you were against a "risk factor" before at [1]; at
that time we were talking about exactly the same topic as here, and I
proposed another risk factor. Without an agreement, I implemented it in
my own internal version and got hurt by it; for example, I didn't pay
enough attention to Bitmap Index Scan vs. Index Scan. Then I dropped
the "risk factor".

>
> I don't think it would be right to fudge the costs in any way, but I
> think the risk factor for IndexPaths could take into account the
> number of unmatched index clauses and increment the risk factor, or
> "certainty_factor" as it is currently in my brain-based design. That
> way add_path() would be more likely to prefer the index that matches
> the most conditions.

Is this somewhat similar to my proposal at [1]?  What do you think
about treating 'col op const' as 'col op $1' for the marked column?
This would only resolve a subset of the questions in your mind, but the
method seems to have a solid rationale.

Currently I view the risk factor the way you did before, but this may
be another time for me to change my mind.

[1] 
https://www.postgresql.org/message-id/CAApHDvovVWCbeR4v%2BA4Dkwb%3DYS_GuJG9OyCm8jZu%2B%2BcP2xsY_A%40mail.gmail.com
-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-03-03 Thread Andy Fan


Tomas Vondra  writes:

> On 3/3/24 07:10, Andy Fan wrote:
>> 
>> Hi,
>> 
>> Here is a updated version, the main changes are:
>> 
>> 1. an shared_detoast_datum.org file which shows the latest desgin and
>> pending items during discussion. 
>> 2. I removed the slot->pre_detoast_attrs totally.
>> 3. handle some pg_detoast_datum_slice use case.
>> 4. Some implementation improvement.
>> 
>
> I only very briefly skimmed the patch, and I guess most of my earlier
> comments still apply.

Yes, the overall design is not changed.

> But I'm a bit surprised the patch needs to pass a
> MemoryContext to so many places as a function argument - that seems to
> go against how we work with memory contexts. Doubly so because it seems
> to only ever pass CurrentMemoryContext anyway. So what's that about?

I think you are talking about the argument like this:
 
 /* --
- * detoast_attr -
+ * detoast_attr_ext -
  *
  * Public entry point to get back a toasted value from compression
  * or external storage.  The result is always non-extended varlena form.
  *
+ * ctx: The memory context which the final value belongs to.
+ *
  * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
  * datum, the result will be a pfree'able chunk.
  * --
  */

+extern struct varlena *
+detoast_attr_ext(struct varlena *attr, MemoryContext ctx)

This is mainly because 'detoast_attr' will allocate more memory than
the "final detoast datum"; for example, the code that scans the toast
relation needs some memory as well. What I want is to keep only the
memory for the final detoasted datum so that the other memory can be
released sooner, which is why I added the function argument. 
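To illustrate the intent with a toy model (the Context class below and
every name in it are made up for illustration; this is not the real
MemoryContext API): passing the destination context avoids building the
final value in a short-lived context and then paying for an extra copy.

```python
class Context:
    """A toy stand-in for a memory context: it just tracks its chunks."""
    def __init__(self, name):
        self.name, self.chunks = name, []

    def alloc(self, data):
        self.chunks.append(data)
        return data

def detoast_then_copy(work_ctx, result_ctx, raw):
    # Conventional pattern: build the result in the current context,
    # then copy it into the long-lived one (an extra copy).
    tmp = work_ctx.alloc(raw.upper())
    return result_ctx.alloc(tmp)

def detoast_into(work_ctx, result_ctx, raw):
    # Pattern with an explicit destination: scratch memory (toast-scan
    # state etc.) stays in work_ctx and can be freed early; only the
    # final datum lands in result_ctx.
    work_ctx.alloc(b"scan state")
    return result_ctx.alloc(raw.upper())

work, result = Context("work"), Context("result")
detoast_into(work, result, b"abc")
print(result.chunks)  # [b'ABC']: only the final datum in the result context
```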

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-03-03 Thread Andy Fan


Tomas Vondra  writes:

> On 2/26/24 16:29, Andy Fan wrote:
>>
>> ...>
>> I can understand the benefits of the TOAST cache, but the following
>> issues looks not good to me IIUC: 
>> 
>> 1). we can't put the result to tts_values[*] since without the planner
>> decision, we don't know if this will break small_tlist logic. But if we
>> put it into the cache only, which means a cache-lookup as a overhead is
>> unavoidable.
>
> True - if you're comparing having the detoasted value in tts_values[*]
> directly with having to do a lookup in a cache, then yes, there's a bit
> of an overhead.
>
> But I think from the discussion it's clear having to detoast the value
> into tts_values[*] has some weaknesses too, in particular:
>
> - It requires decisions which attributes to detoast eagerly, which is
> quite invasive (having to walk the plan, ...).
>
> - I'm sure there will be cases where we choose to not detoast, but it
> would be beneficial to detoast.
>
> - Detoasting just the initial slices does not seem compatible with this.
>
> IMHO the overhead of the cache lookup would be negligible compared to
> the repeated detoasting of the value (which is the current baseline). I
> somewhat doubt the difference compared to tts_values[*] will be even
> measurable.
>
>> 
>> 2). It is hard to decide which entry should be evicted without attaching
>> it to the TupleTableSlot's life-cycle. then we can't grantee the entry
>> we keep is the entry we will reuse soon?
>> 
>
> True. But is that really a problem? I imagined we'd set some sort of
> memory limit on the cache (work_mem?), and evict oldest entries. So the
> entries would eventually get evicted, and the memory limit would ensure
> we don't consume arbitrary amounts of memory.
>

Here is a copy from the shared_detoast_datum.org in the patch. My
feeling is that when / which entry to free is the key problem, while
the run-time (detoast_attr) overhead vs. the createplan.c overhead is a
small difference; the overhead I pay in createplan.c/setref.c does not
look huge either.

"""
An alternative design: toast cache
-

This method was proposed by Tomas during the review process. IIUC, this
method would maintain a local HTAB which maps a toast datum to a
detoasted datum, and the entry is maintained / used in the
detoast_attr function. With this method, the overall design is pretty
clear and the code modification can be confined to the toasting system
only.

I assume that releasing all of the memory at once at the end of the
executor is not an option since it may consume too much memory. Then,
when and which entry to release becomes the difficult part. For example:

  QUERY PLAN
--
 Nested Loop
   Join Filter: (t1.a = t2.a)
   ->  Seq Scan on t1
   ->  Seq Scan on t2
(4 rows)

In this case t1.a needs a longer lifespan than t2.a since it is in the
outer relation. Without help from the slot's life-cycle system, I can't
come up with an answer to the above question.

Another difference between the two methods is that my method makes many
modifications to createplan.c/setref.c/execExpr.c/execExprInterp.c, but
it saves some run-time effort, such as the hash_search find / enter
calls of method 2, since I put the values directly into tts_values[*].

I'm not sure factor 2 makes a real, measurable difference in practice,
so my current concern mainly comes from factor 1.
"""


-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-03-03 Thread Andy Fan


Tomas Vondra  writes:

> On 3/3/24 02:52, Andy Fan wrote:
>> 
>> Hi Nikita,
>> 
>>>
>>> Have you considered another one - to alter pg_detoast_datum
>>> (actually, it would be detoast_attr function) and save
>>> detoasted datums in the detoast context derived
>>> from the query context? 
>>>
>>> We have just enough information at this step to identify
>>> the datum - toast relation id and value id, and could
>>> keep links to these detoasted values in a, say, linked list
>>>  or hash table. Thus we would avoid altering the executor
>>> code and all detoast-related code would reside within
>>> the detoast source files?
>> 
>> I think you are talking about the way Tomas provided.  I am really
>> afraid that I was thought of too self-opinionated, but I do have some
>> concerns about this approch as I stated here [1], looks my concerns is
>> still not addressed, or the concerns itself are too absurd which is
>> really possible I think? 
>> 
>
> I can't speak for others, but I did not consider you and your
> proposals too self-opinionated. 

This would be really great to know:)

> You did
> propose a solution that you consider the right one. That's fine. People
> will question that and suggest possible alternatives. That's fine too,
> it's why we have this list in the first place.

I agree.

-- 
Best Regards
Andy Fan





a wrong index choose when statistics is out of date

2024-03-02 Thread Andy Fan


The issue can be reproduced with the following steps:

create table x_events (.., created_at timestamp, a int, b int); 

create index idx_1 on x_events(created_at, a);
create index idx_2 on x_events(created_at, b);

query:
select * from x_events where created_at = current_timestamp and b = 1;

index (created_at, a) rather than (created_at, b) may be chosen for the
above query if the statistics think "created_at = current_timestamp"
matches no rows, in which case both indexes look equally good. Actually
that only happens because the statistics are out of date.

I ran into this again recently and have two new ideas this time;
I'd like to gather some feedback on them.

1. We can let the user declare that the column's value increases day by
   day. The syntax may be:

   ALTER TABLE x_events ALTER COLUMN created_at ALWAYS_INCREASED.

   Then, when a query contains 'created_at op const', the statistics
   module can treat it as 'created_at = $1', so the missing statistics
   make no difference, and I think the above issue can be avoided.

   This is different from letting the user use a PreparedStmt directly,
   because there it is possible that we always choose a custom plan;
   the easiest way for that to happen is a planning-time partition
   prune.

2. Use some AI approach to forecast the data that has not been gathered
   yet. The training stage may happen at ANALYZE time; taking the above
   case as an example, it may produce a model like "there are 100 rows
   per second between 9:00 and 18:00, and 2 rows per second in the
   other time ranges".

For now, I think option 1 may be easier to implement.
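
A minimal sketch of how the statistics module might consume such a
declaration. All names and the 0.005 default are invented placeholders
(the real planner would go through its selectivity machinery); the
point is only the fallback when the constant lies beyond the values
ANALYZE has seen:

```c
#include <assert.h>

/* Sketch of idea 1: for a column declared "always increasing", a
 * constant above the highest value ANALYZE recorded should not be
 * estimated as matching zero rows; fall back to a default selectivity,
 * as if the query were "col = $1". */
static double eq_selectivity(double constval, double stats_max,
                             double histogram_sel, int always_increasing)
{
    if (always_increasing && constval > stats_max)
        return 0.005;       /* stats are stale by construction */
    return histogram_sel;   /* normal path: trust the histogram */
}
```

With that fallback, "created_at = current_timestamp" no longer looks
like a zero-row predicate, so the two indexes stop appearing
interchangeable.
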

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-03-02 Thread Andy Fan

Hi,

Here is a updated version, the main changes are:

1. a shared_detoast_datum.org file which shows the latest design and
the items still pending from the discussion.
2. I removed the slot->pre_detoast_attrs totally.
3. handle some pg_detoast_datum_slice use case.
4. Some implementation improvement.

commit 66c64c197a5dab97a563be5a291127e4c5d6841d (HEAD -> shared_detoast_value)
Author: yizhi.fzh 
Date:   Sun Mar 3 13:48:25 2024 +0800

shared detoast datum

See the overall design & alternative design & testing in
shared_detoast_datum.org

In the shared_detoast_datum.org, I added the alternative design part for
the idea of TOAST cache.

-- 
Best Regards
Andy Fan

>From 66c64c197a5dab97a563be5a291127e4c5d6841d Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Sun, 3 Mar 2024 13:48:25 +0800
Subject: [PATCH v9 1/1] shared detoast datum

See the overall design & alternative design & testing in
shared_detoast_datum.org
---
 src/backend/access/common/detoast.c   |  68 +-
 src/backend/access/common/toast_compression.c |  10 +-
 src/backend/executor/execExpr.c   |  60 +-
 src/backend/executor/execExprInterp.c | 179 ++
 src/backend/executor/execTuples.c | 127 
 src/backend/executor/execUtils.c  |   2 +
 src/backend/executor/nodeHashjoin.c   |   2 +
 src/backend/executor/nodeMergejoin.c  |   2 +
 src/backend/executor/nodeNestloop.c   |   1 +
 src/backend/executor/shared_detoast_datum.org | 203 ++
 src/backend/jit/llvm/llvmjit_expr.c   |  26 +-
 src/backend/jit/llvm/llvmjit_types.c  |   1 +
 src/backend/optimizer/plan/createplan.c   | 107 +++-
 src/backend/optimizer/plan/setrefs.c  | 590 +++---
 src/include/access/detoast.h  |   3 +
 src/include/access/toast_compression.h|   4 +-
 src/include/executor/execExpr.h   |  12 +
 src/include/executor/tuptable.h   |  14 +
 src/include/nodes/execnodes.h |  14 +
 src/include/nodes/plannodes.h |  53 ++
 src/tools/pgindent/typedefs.list  |   2 +
 21 files changed, 1342 insertions(+), 138 deletions(-)
 create mode 100644 src/backend/executor/shared_detoast_datum.org

diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index 3547cdba56..acc9644689 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -22,11 +22,11 @@
 #include "utils/expandeddatum.h"
 #include "utils/rel.h"
 
-static struct varlena *toast_fetch_datum(struct varlena *attr);
+static struct varlena *toast_fetch_datum(struct varlena *attr, MemoryContext ctx);
 static struct varlena *toast_fetch_datum_slice(struct varlena *attr,
 			   int32 sliceoffset,
 			   int32 slicelength);
-static struct varlena *toast_decompress_datum(struct varlena *attr);
+static struct varlena *toast_decompress_datum(struct varlena *attr, MemoryContext ctx);
 static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32 slicelength);
 
 /* --
@@ -42,7 +42,7 @@ static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32
  * --
  */
 struct varlena *
-detoast_external_attr(struct varlena *attr)
+detoast_external_attr_ext(struct varlena *attr, MemoryContext ctx)
 {
 	struct varlena *result;
 
@@ -51,7 +51,7 @@ detoast_external_attr(struct varlena *attr)
 		/*
 		 * This is an external stored plain value
 		 */
-		result = toast_fetch_datum(attr);
+		result = toast_fetch_datum(attr, ctx);
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -68,13 +68,13 @@ detoast_external_attr(struct varlena *attr)
 
 		/* recurse if value is still external in some other way */
 		if (VARATT_IS_EXTERNAL(attr))
-			return detoast_external_attr(attr);
+			return detoast_external_attr_ext(attr, ctx);
 
 		/*
 		 * Copy into the caller's memory context, in case caller tries to
 		 * pfree the result.
 		 */
-		result = (struct varlena *) palloc(VARSIZE_ANY(attr));
+		result = (struct varlena *) MemoryContextAlloc(ctx, VARSIZE_ANY(attr));
 		memcpy(result, attr, VARSIZE_ANY(attr));
 	}
 	else if (VARATT_IS_EXTERNAL_EXPANDED(attr))
@@ -87,7 +87,7 @@ detoast_external_attr(struct varlena *attr)
 
 		eoh = DatumGetEOHP(PointerGetDatum(attr));
 		resultsize = EOH_get_flat_size(eoh);
-		result = (struct varlena *) palloc(resultsize);
+		result = (struct varlena *) MemoryContextAlloc(ctx, resultsize);
 		EOH_flatten_into(eoh, (void *) result, resultsize);
 	}
 	else
@@ -101,32 +101,45 @@ detoast_external_attr(struct varlena *attr)
 	return result;
 }
 
+struct varlena *
+detoast_external_attr(struct varlena *attr)
+{
+	return detoast_external_attr_ext(attr, CurrentMemoryContext);
+}
+
 
 /* --
- * detoast_attr -
+ * detoast_attr_ext -
  *
  *	Public entry point to get back a toasted value from compression
  *	or external storage.  The 

Re: Shared detoast Datum proposal

2024-03-02 Thread Andy Fan


Hi Nikita,

>
> Have you considered another one - to alter pg_detoast_datum
> (actually, it would be detoast_attr function) and save
> detoasted datums in the detoast context derived
> from the query context? 
>
> We have just enough information at this step to identify
> the datum - toast relation id and value id, and could
> keep links to these detoasted values in a, say, linked list
>  or hash table. Thus we would avoid altering the executor
> code and all detoast-related code would reside within
> the detoast source files?

I think you are talking about the approach Tomas provided. I am really
afraid of coming across as too self-opinionated, but I do have some
concerns about this approach, as I stated in [1]. It looks like my
concerns are still not addressed, or perhaps the concerns themselves
are too absurd, which is really possible, I think.

[1] https://www.postgresql.org/message-id/875xyb1a6q.fsf%40163.com

-- 
Best Regards
Andy Fan





Re: Better error messages for %TYPE and %ROWTYPE in plpgsql

2024-02-26 Thread Andy Fan


"David G. Johnston"  writes:

> On Mon, Feb 26, 2024 at 5:46 PM Andy Fan  wrote:
>
>  > Per recent discussion[1], plpgsql returns fairly unhelpful "syntax
>  > error" messages when a %TYPE or %ROWTYPE construct references a
>  > nonexistent object.  Here's a quick little finger exercise to try
>  > to improve that.
>
>  Looks this modify the error message, I want to know how ould we treat
>  error-message-compatible issue during minor / major upgrade.
>
> There is no bug here so no back-patch; and we are not yet past feature freeze 
> for v17.

Actually I didn't ask about back-patching. I meant that an error
message is part of the user interface: if we change an error message,
the end user may be impacted, at least in theory. For example, an end
user may have some code like this:

String errMsg = ex.getErrorMsg();

if (errMsg.contains("a-target-string"))
{
// do sth.
}

So if the error message is changed, the above code may be broken.

I have little experience with this, so I want to know what policy we
use, given the background I described in my previous reply.

-- 
Best Regards
Andy Fan





Re: Better error messages for %TYPE and %ROWTYPE in plpgsql

2024-02-26 Thread Andy Fan



> Per recent discussion[1], plpgsql returns fairly unhelpful "syntax
> error" messages when a %TYPE or %ROWTYPE construct references a
> nonexistent object.  Here's a quick little finger exercise to try
> to improve that.

It looks like this modifies the error message, so I want to know how we
treat error-message compatibility during minor / major upgrades. I'm
not sure if my question is too far-fetched; I ask because one of my
patches [1] has been blocked on this kind of issue [only] for 2 months.
Actually the error-message-compatibility requirement was raised by me
in the first place and resolved by adding an odd parameter; then the
odd parameter blocked the whole process.

[1] https://www.postgresql.org/message-id/87r0hmvuvr@163.com

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-02-26 Thread Andy Fan


Nikita Malakhov  writes:

> Hi,
>
> Tomas, we already have a working jsonb partial detoast prototype,
> and currently I'm porting it to the recent master.

This is really awesome! Actually, when I talked to some MySQL folks,
they said MySQL has already done this, and I admit it can resolve some
issues different from the ones the current patch targets. It's just
that I have many implementation questions in my mind.

> Andy, thank you! I'll check the last patch set out and reply in a day
> or two.

Thank you for your attention!


-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-02-26 Thread Andy Fan


> On 2/26/24 14:22, Andy Fan wrote:
>> 
>>...
>>
>>> Also, toasted values
>>> are not always being used immediately and as a whole, i.e. jsonb values are 
>>> fully
>>> detoasted (we're working on this right now) to extract the smallest value 
>>> from
>>> big json, and these values are not worth keeping in memory. For text values 
>>> too,
>>> we often do not need the whole value to be detoasted.
>> 
>> I'm not sure how often this is true, espeically you metioned text data
>> type. I can't image why people acquire a piece of text and how can we
>> design a toasting system to fulfill such case without detoast the whole
>> as the first step. But for the jsonb or array data type, I think it is
>> more often. However if we design toasting like that, I'm not sure if it
>> would slow down other user case. for example detoasting the whole piece
>> use case. I am justing feeling both way has its user case, kind of heap
>> store and columna store.  
>> 
>
> Any substr/starts_with call benefits from this optimization, and such
> calls don't seem that uncommon. I certainly can imagine people doing
> this fairly often.

This leads me to pay more attention to the pg_detoast_datum_slice use
case, which I had overlooked and which caused some regression here.
Thanks!

As you said later:

> Is there a way to identify cases that are likely to benefit from this
> slicing, and disable the detoasting for them? We already disable it for
> other cases, so maybe we can do this here too?

I think the answer is yes, if our goal is to disable the whole-value
detoast for some special calls like substr/starts_with. In the current
patch, we have:

/*
 * increase_level_for_pre_detoast
 *  Check if the given Expr could detoast a Var directly, if yes,
 * increase the level and return true. otherwise return false;
 */
static inline void
increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
{
/* The following nodes is impossible to detoast a Var directly. */
if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
{
ctx->level_added = false;
}
	else if (IsA(node, FuncExpr) &&
			 castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
	{
		/* let's not detoast first so that pg_column_compression works. */
		ctx->level_added = false;
	}

While this works, the blacklist approach doesn't look great.

> In any case, it's a long-standing behavior /
> optimization, and I don't think we can just dismiss it this quickly.

I agree. So I said both have their use cases, and by *both* I mean the
other method is the "partial detoast prototype" which Nikita has told
me about before. While I am sure it has many use cases, I also feel it
is complex and I have many questions in my mind, but I'd like to see
the design before asking them.

> Or maybe there's a way to do the detoasting lazily, and only keep the
> detoasted value when it's requesting the whole value. Or perhaps even
> better, remember what part we detoasted, and maybe it'll be enough for
> future requests?
>
> I'm not sure how difficult would this be with the proposed approach,
> which eagerly detoasts the value into tts_values. I think it'd be
> easier to implement with the TOAST cache I briefly described ... 

I can understand the benefits of the TOAST cache, but the following
issues look problematic to me, IIUC:

1). We can't put the result into tts_values[*], since without the
planner's decision we don't know if this would break the small_tlist
logic. But if we put it into the cache only, a cache lookup becomes an
unavoidable overhead.

2). It is hard to decide which entry should be evicted without
attaching the cache to the TupleTableSlot's life-cycle; without that,
how can we guarantee the entry we keep is the entry we will reuse soon?

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-02-26 Thread Andy Fan


Hi, 


> When we store and detoast large values, say, 1Gb - that's a very likely 
> scenario,
> we have such cases from prod systems - we would end up in using a lot of 
> shared
> memory to keep these values alive, only to discard them later.

First, the feature makes sure that if the original expression would not
detoast a value, this feature will not detoast it either. Based on
that, a 1GB datum took 1GB of memory in the past as well, though it may
have been discarded more quickly than in the patched version. But the
patched version will *not* keep it for too long either: once
slot->tts_values[*] is invalidated, the memory is released immediately.

For example:

create table t(a text, b int);
insert into t select '1-gb-length-text' from generate_series(1, 100);

select * from t where a like '%gb%';

The patched version will take only 1GB of extra memory at a time.

Are you worried about this case or some other cases? 

> Also, toasted values
> are not always being used immediately and as a whole, i.e. jsonb values are 
> fully
> detoasted (we're working on this right now) to extract the smallest value from
> big json, and these values are not worth keeping in memory. For text values 
> too,
> we often do not need the whole value to be detoasted.

I'm not sure how often this is true, especially since you mentioned the
text data type. I can't imagine why people would fetch only a piece of
a text value, or how we could design a toasting system to serve such a
case without detoasting the whole value as the first step. But for the
jsonb or array data types, I think it is more common. However, if we
design toasting like that, I'm not sure whether it would slow down
other use cases, for example detoasting the whole value. My feeling is
that both ways have their use cases, a bit like heap store vs. column
store.

FWIW, as I showed in the test case, this feature is not only helpful
for big datums; it is also helpful for small toasted datums.

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-02-25 Thread Andy Fan
xpr will recurse to Var for each
   expression, so it is a good time to track the attribute number and
   see whether the Var is accessed directly or indirectly. Usually,
   accessing a Var indirectly means a detoast would happen, for example
   in an expression like a > 3. However, some known expressions, like
   "Var IS NULL", are ignored since they never detoast. The output is
   {Scan|Join}.xxx_reference_attrs;

As a result, the final information is added to the Scan and Join plan
nodes.

typedef struct Scan
{
/*
 * Records of var's varattno - 1 where the Var is accessed indirectly by
 * any expression, like a > 3.  However a IS [NOT] NULL is not included
 * since it doesn't access the tts_values[*] at all.
 *
 * This is essential information to figure out which attrs should use
 * the pre-detoast-attrs logic.
 */
Bitmapset  *reference_attrs;
} Scan;

typedef struct Join
{
..
/*
 * Records of var's varattno - 1 where the Var is accessed indirectly by
 * any expression, like a > 3.  However a IS [NOT] NULL is not included
 * since it doesn't access the tts_values[*] at all.
 *
 * This is essential information to figure out which attrs should use
 * the pre-detoast-attrs logic.
 */
Bitmapset  *outer_reference_attrs;
Bitmapset  *inner_reference_attrs;
} Join;


Note that I used '_reference_' rather than '_detoast_' here because at
this point we still don't know whether an attribute is toastable; that
is only known at the MakeTupleTableSlot stage.

3. At the InitPlan stage, we calculate the final xxx_pre_detoast_attrs
   in ScanState & JoinState. These are passed into the expression
   engine at the ExecInitExprRec stage, where the
   EEOP_{INNER|OUTER|SCAN}_VAR_TOAST steps are finally generated; then
   everything is connected through ExecSlotDetoastDatum!
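
Conceptually, the InitPlan computation described above is a set
intersection: the attributes to pre-detoast are those both referenced
indirectly by some expression (known to the planner) and of a toastable
type (known once the tuple descriptor exists). With attribute sets
shown as plain bitmasks (a simplification of the real Bitmapset), the
step is just:

```c
#include <assert.h>

/* Sketch only: bit i set means "attribute i" is in the set.
 * 'referenced' comes from the planner's reference_attrs; 'toastable'
 * is derived from the tuple descriptor at slot-creation time. */
static unsigned pre_detoast_attrs(unsigned referenced, unsigned toastable)
{
    return referenced & toastable;
}
```
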


Testing
---

Case 1:

create table t (a numeric);
insert into t select i from generate_series(1, 10)i;

cat 1.sql

select * from t where a > 0;

In this test, current master detoasts each datum twice: once in
numeric_gt, once in numeric_out. With this feature the detoast happens
only once.

pgbench -f 1.sql -n postgres -T 10 -M prepared

master: 30.218 ms
patched(Bitmapset): 30.881ms


Then we can see the perf report as below:

-7.34% 0.00%  postgres  postgres   [.] ExecSlotDetoastDatum 
(inlined)
   - ExecSlotDetoastDatum (inlined)
  - 3.47% bms_add_member
   - 3.06% bms_make_singleton (inlined)
- palloc0
 1.30% AllocSetAlloc

-5.99% 0.00%  postgres  postgres  [.] ExecFreePreDetoastDatum (inlined)
   - ExecFreePreDetoastDatum (inlined)
2.64% bms_next_member
1.17% bms_del_members
0.94% AllocSetFree

One of the reasons is that Bitmapset deallocates its memory when all
the bits are deleted (due to commit 00b41463c), so we have to allocate
memory again the next time a member is added. This kind of action is
executed 10 times in the above workload.

Then I introduced a bitset data structure (commit 0002), which is much
like Bitmapset but does not deallocate its memory when all the bits are
unset, and used it in this feature (commit 0003). Then the result
became 28.715ms.
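
The behavioral difference can be shown with a toy fixed-capacity bitset
whose clear operation zeroes the words in place instead of freeing them
(invented illustration code, not the patch's implementation):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy bitset: unlike Bitmapset after commit 00b41463c, clearing all
 * bits keeps the allocation, so a hot loop never has to re-allocate
 * the set for every tuple. */
typedef struct bitset {
    int nwords;
    unsigned long *words;
} bitset;

static bitset *bitset_init(int nbits)
{
    bitset *b = malloc(sizeof(bitset));
    b->nwords = (nbits + 63) / 64;
    b->words = calloc(b->nwords, sizeof(unsigned long));
    return b;
}

static void bitset_add(bitset *b, int bit)
{
    b->words[bit / 64] |= 1UL << (bit % 64);
}

static int bitset_member(const bitset *b, int bit)
{
    return (b->words[bit / 64] >> (bit % 64)) & 1;
}

static void bitset_clear(bitset *b)
{
    /* zero the words in place; no free, no re-allocation */
    memset(b->words, 0, b->nwords * sizeof(unsigned long));
}
```
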

-5.22% 0.00%  postgres  postgres  [.] ExecFreePreDetoastDatum (inlined)
   - ExecFreePreDetoastDatum (inlined)
  - 2.82% bitset_next_member
   1.69% bms_next_member_internal (inlined)
   0.95% bitset_next_member
0.66% AllocSetFree

Here we can see that the expensive calls are bitset_next_member on
slot->pre_detoast_attrs and pfree. If we put the detoasted datums into
a dedicated memory context, we can save the cost of bitset_next_member,
since we can discard all the memory at once with MemoryContextReset
instead of AllocSetFree (commit 0004). Then the result became 27.840ms.

So the final result for case 1: 

master: 30.218 ms
patched(Bitmapset): 30.881ms
patched(bitset): 28.715ms
latency average(bitset + tts_value_mctx) = 27.840 ms


Big jsonbs test:

create table b(big jsonb);

insert into b select
jsonb_object_agg(x::text,
random()::text || random()::text || random()::text)
from generate_series(1,600) f(x);

insert into b select (select big from b) from generate_series(1, 1000)i;

explain analyze
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

master: 702.224 ms
patched: 133.306 ms

Memory usage test:

I ran the TPC-H scale-10 workload against both the master and patched
versions; the memory usage looks stable.

Work in progress:

I'm still running TPC-H scale 100 to see if there is any interesting
finding; that is in progress. As for scale 10:

master: 55s
patched: 56s

The reason is that the q9 plan changed a bit; finding the real cause
needs some more time. Since this patch doesn't affect best-path
generation, the plan change doesn't look reasonable to me.

-- 
Best Regards
Andy Fan

>From cce81190cca7209cbb8aa314a27c52832fbe0a2d Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Tue, 20 Feb 2024 14:1

Re: Reduce useless changes before reassembly during logical replication

2024-02-22 Thread Andy Fan

Hi, 
>
> This is all true but note that in successful cases (where the table is
> published) all the work done by FilterByTable(accessing caches,
> transaction-related stuff) can add noticeable overhead as anyway we do
> that later in pgoutput_change(). I think I gave the same comment
> earlier as well but didn't see any satisfactory answer or performance
> data for successful cases to back this proposal.

I did some benchmarking yesterday at [1] and found it adds 20% CPU
time. I then came up with a basic idea that I think deserves sharing.
The "transaction related stuff" comes from the syscache/systable
accesses, except for the historic snapshot, and the syscache is
required in the following situations:

1. relfilenode (from WAL) -> relid.
2. relid -> namespaceid (to check if the relid is a toast relation).
3. if it is a toast relation, get its original relid.
4. access the data from pg_publication_tables.
5. see if the relid is a partition; if yes, we may need its root
relation.

Actually we already have RelationSyncCache for #4, and it *only* needs
to access the syscache when replicate_valid is false. I think that case
should be rare, but the callers can't know, so they must prepare the
transaction stuff in advance even though in most cases it is never
used. So I think we can get an optimization here.
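
The pattern of the patch, deferring the expensive setup until a cache
miss proves it necessary, can be sketched like this (counters stand in
for the real transaction machinery; all names below are invented):

```c
#include <assert.h>

/* Counts how many times the expensive setup actually ran. */
static int txn_started;

static void ensure_txn(void) { txn_started++; }

typedef struct rel_entry { int valid; int data; } rel_entry;

/* Sketch of the proposed get_rel_sync_entry shape: the transaction
 * environment is started inside the function, and only on the rare
 * invalid-entry path, instead of by every caller up front. */
static int get_entry(rel_entry *e)
{
    if (!e->valid)
    {
        ensure_txn();       /* rare path: catalog access needs a txn */
        e->data = 42;       /* rebuild from catalogs (stubbed) */
        e->valid = 1;
    }
    return e->data;         /* common path: no transaction work at all */
}
```
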

then the attached patch is made.

Author: yizhi.fzh 
Date:   Wed Feb 21 18:40:03 2024 +0800

Make get_rel_sync_entry less dependent on transaction state.

get_rel_sync_entry needs a transaction only when a replicate_valid =
false entry is found, which should be a rare case. However, the caller
can't know whether an entry is valid, so it has to prepare the
transaction state before calling this function. Such preparation is
expensive.

This patch makes get_rel_sync_entry manage the transaction state
itself, only when necessary, so the callers don't need to prepare it
blindly.

Then comes #1. Actually we already have RelfilenumberMapHash as a
cache; when the cache is hit (suppose that is the usual case), no
transaction stuff is needed. I have two ideas then:

1. Optimize the cache-hit situation as we just did for
get_rel_sync_entry, for all 5 kinds of data, and only pay the cost in
the cache-miss case. For the data for #2, #3 and #5, all the keys are
relids, so I think a single HTAB should be OK.

2. Add the content for #1, #2, #3 and #5 to WAL when wal_level is set
to logical.

In either case, the changes to get_rel_sync_entry are needed.

> Note, users can
> configure to stream_in_progress transactions in which case they
> shouldn't see such a big problem.

People would see the changes spilled to disk, but I think the CPU cost
in the reorder buffer still has to be paid.

[1] https://www.postgresql.org/message-id/87o7cadqj3.fsf%40163.com

-- 
Best Regards
Andy Fan

>From 90b8df330df049bd1d7e881dc6e9b108c17b0924 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Wed, 21 Feb 2024 18:40:03 +0800
Subject: [PATCH v1 1/1] Make get_rel_sync_entry less depending on transaction
 state.

get_rel_sync_entry needs a transaction only when a replicate_valid =
false entry is found, which should be a rare case. However, the caller
can't know whether an entry is valid, so it has to prepare the
transaction state before calling this function. Such preparation is
expensive.

This patch makes get_rel_sync_entry manage the transaction state
itself, only when necessary, so the callers don't need to prepare it
blindly.
---
 src/backend/replication/pgoutput/pgoutput.c | 60 -
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 998f92d671..25e55590a2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -13,6 +13,7 @@
 #include "postgres.h"
 
 #include "access/tupconvert.h"
+#include "access/relation.h"
 #include "catalog/partition.h"
 #include "catalog/pg_publication.h"
 #include "catalog/pg_publication_rel.h"
@@ -214,6 +215,11 @@ static void init_rel_sync_cache(MemoryContext cachectx);
 static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data,
 			 Relation relation);
+static RelationSyncEntry *get_rel_sync_entry_by_relid(PGOutputData *data,
+	  Oid relid);
+static RelationSyncEntry *get_rel_sync_entry_internal(PGOutputData *data,
+	  Relation relation,
+	  Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 		  uint32 hashvalue);
@@ -1962,11 +1968,29 @@ set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
  */
 static RelationSyncEntry *
 get_rel_sync_entry(PGOutputData *data, Relation relation)
+{
+	r

Re: Shared detoast Datum proposal

2024-02-21 Thread Andy Fan



I saw your reply just as I started to write a more high-level document.
Thanks for the step-by-step help!

Tomas Vondra  writes:

> On 2/20/24 19:38, Andy Fan wrote:
>> 
>> ...
>>
>>> I think it should be done the other
>>> way around, i.e. the patch should introduce the main feature first
>>> (using the traditional Bitmapset), and then add Bitset on top of that.
>>> That way we could easily measure the impact and see if it's useful.
>> 
>> Acutally v4 used the Bitmapset, and then both perf and pgbench's tps
>> indicate it is too expensive. and after talk with David at [2], I
>> introduced bitset and use it here. the test case I used comes from [1].
>> IRCC, there were 5% performance difference because of this.
>> 
>> create table w(a int, b numeric);
>> insert into w select i, i from generate_series(1, 100)i;
>> select b from w where b > 0;
>> 
>> To reproduce the difference, we can replace the bitset_clear() with
>> 
>> bitset_free(slot->pre_detoasted_attrs);
>> slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);
>> 
>> in ExecFreePreDetoastDatum. then it works same as Bitmapset.
>> 
>
> I understand the bitset was not introduced until v5, after noticing the
> bitmapset is not quite efficient. But I still think the patches should
> be the other way around, i.e. the main feature first, then the bitset as
> an optimization.
>
> That allows everyone to observe the improvement on their own (without
> having to tweak the code), and it also doesn't require commit of the
> bitset part before it gets actually used by anything.

I'm starting to think this is the better order, rather than the
opposite. The next version will be:

0001: shared detoast datum feature with high level introduction.
0002: introduce bitset and use it in the shared-detoast-datum feature,
with a test case to show the improvement.

>> I will do more test on the memory leak stuff, since there are so many
>> operation aginst slot like ExecCopySlot etc, I don't know how to test it
>> fully. the method in my mind now is use TPCH with 10GB data size, and
>> monitor the query runtime memory usage.
>> 
>
> I think this is exactly the "high level design" description that should
> be in a comment, somewhere.

Got it.

>>> Of course, expanded datums are
>>> not meant to be long-lived, while "shared detoasted values" are meant to
>>> exist (potentially) for the query duration.
>> 
>> Hmm, actually the "shared detoast value" just lives in
>> TupleTableSlot->tts_values[*], rather than for the whole query duration.
>> The simple case is:
>> 
>> SELECT * FROM t WHERE a_text LIKE 'abc%';
>> 
>> when we scan to the next tuple, the detoast value for the previous tuple
>> will be released.
>> 
>
> But if the (detoasted) value is passed to the next executor node, it'll
> be kept, right?

Yes, and there is only one copy shared by all the executor nodes.

> Unrelated question - could this actually end up being too aggressive?
> That is, could it detoast attributes that we end up not needing? For
> example, what if the tuple gets eliminated for some reason (e.g. because
> of a restriction on the table, or not having a match in a join)? Won't
> we detoast the tuple only to throw it away?

The detoast datum has exactly the same lifespan as the other
tts_values[*]. If the tuple gets eliminated for any reason, those detoast
datums still exist until the slot is cleared to store the next tuple.

>> typedef struct Join
>> {
>> ..
>>  /*
>>   * Records of var's varattno - 1 where the Var is accessed indirectly by
>>   * any expression, like a > 3.  However, an IS [NOT] NULL check is not
>>   * included since it doesn't access tts_values[*] at all.
>>   *
>>   * This is essential information to figure out which attrs should use
>>   * the pre-detoast-attrs logic.
>>   */
>>  Bitmapset  *outer_reference_attrs;
>>  Bitmapset  *inner_reference_attrs;
>> } Join;
>> 
>
> Is it actually necessary to add new fields to these nodes? Also, the
> names are not particularly descriptive of the purpose - it'd be better
> to have "detoast" in the name, instead of generic "reference".

Because of the way the data is transformed, I think we must add the
fields to keep such information. This information is then used to
initialize the necessary state in PlanState. Maybe I am having a
fixed mindset; I can't think of a way to avoid that right now.

I used 'reference' rather than 'detoast' because of some implementation
issues. In the createpla

Re: Reduce useless changes before reassembly during logical replication

2024-02-21 Thread Andy Fan

Hi Jie,

> Most importantly, if we filter out unpublished relationship-related
> changes after
> constructing the changes but before queuing the changes into a transaction,
> will it reduce the workload of logical decoding and avoid disk or memory 
> growth
> as much as possible?

Thanks for the report!

Discarding the unused changes as soon as possible looks like a valid
optimization to me, but I would really like more experienced people to
double-check it.

> Attached is the patch I used to implement this optimization.

After a quick look at the patch, I found FilterByTable is too expensive
because of the StartTransaction and AbortTransaction. With your above
setup, I ran the test below:

insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba',
i),repeat('dfds', i) from generate_series(0,999100) i;

I profiled the walsender of mypub with perf for 30 seconds, and I get:

- 22.04%  1.53%  postgres  postgres  [.] FilterByTable
   - 20.51% FilterByTable
      - AbortTransaction
         - ResourceOwnerReleaseInternal
            - LockReleaseAll
                 hash_seq_search
The main part comes from AbortTransaction, and the 20% is not trivial.

From your patch:
+
+   /*
+    * Decoding needs access to syscaches et al., which in turn use
+    * heavyweight locks and such. Thus we need to have enough state around to
+    * keep track of those.  The easiest way is to simply use a transaction
+    * internally.
+    */
+   using_subtxn = IsTransactionOrTransactionBlock();
+
+   if (using_subtxn)
+   BeginInternalSubTransaction("filter change by table");
+   else
+   StartTransactionCommand();

Actually, FilterByTable here is simpler than "decoding": we access the
syscache only when get_rel_sync_entry finds an entry whose
replicate_valid is false, and the invalid case should be rare.

What I'm thinking now is that we let get_rel_sync_entry build its own
transaction state *only when it finds an invalid entry*. If the caller
has built it already, like the existing cases in master, nothing happens
except a simple transaction-state check. Then in the FilterByTable case
we just leave it to get_rel_sync_entry. See the attachment for the idea.

-- 
Best Regards
Andy Fan

From 90b8df330df049bd1d7e881dc6e9b108c17b0924 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Wed, 21 Feb 2024 18:40:03 +0800
Subject: [PATCH v1 1/1] Make get_rel_sync_entry less depending on transaction
 state.

get_rel_sync_entry needs a transaction only when a replicate_valid = false
entry is found, which should be a rare case. However, the caller can't
know whether an entry is valid, so it has to prepare the transaction state
before calling this function. Such preparation is expensive.

This patch makes get_rel_sync_entry manage a transaction state itself
only when necessary, so the callers don't need to prepare it blindly.
---
 src/backend/replication/pgoutput/pgoutput.c | 60 -
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 998f92d671..25e55590a2 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -13,6 +13,7 @@
 #include "postgres.h"
 
 #include "access/tupconvert.h"
+#include "access/relation.h"
 #include "catalog/partition.h"
 #include "catalog/pg_publication.h"
 #include "catalog/pg_publication_rel.h"
@@ -214,6 +215,11 @@ static void init_rel_sync_cache(MemoryContext cachectx);
 static void cleanup_rel_sync_cache(TransactionId xid, bool is_commit);
 static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data,
 			 Relation relation);
+static RelationSyncEntry *get_rel_sync_entry_by_relid(PGOutputData *data,
+	  Oid relid);
+static RelationSyncEntry *get_rel_sync_entry_internal(PGOutputData *data,
+	  Relation relation,
+	  Oid relid);
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 		  uint32 hashvalue);
@@ -1962,11 +1968,29 @@ set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
  */
 static RelationSyncEntry *
 get_rel_sync_entry(PGOutputData *data, Relation relation)
+{
+	return get_rel_sync_entry_internal(data, relation, InvalidOid);
+}
+
+static RelationSyncEntry *
+__attribute__ ((unused))
+get_rel_sync_entry_by_relid(PGOutputData *data, Oid relid)
+{
+	return get_rel_sync_entry_in

Re: Shared detoast Datum proposal

2024-02-20 Thread Andy Fan
(struct varlena *) oldDatum));
Assert(slot->tts_nvalid > attnum);
Assert(oldDatum != slot->tts_values[attnum]);
bitset_add_member(slot->pre_detoasted_attrs, attnum);
MemoryContextSwitchTo(old);
}


> For example, what if we add a "TOAST cache" as a query-level hash table,
> and modify the detoasting to first check the hash table (with the TOAST
> pointer as a key)? It'd be fairly trivial to enforce a memory limit on
> the hash table, evict values from it, etc. And it wouldn't require any
> of the createplan/setrefs changes, I think ...

Hmm, I am not sure I understand you correctly on this part. In the
current patch, to avoid the run-time (ExecExprInterp) check of whether we
should detoast and save the datum, I defined 3 extra steps so that
the *extra check itself* is not needed for unnecessary attributes; for
example, a datum for an int, or a detoast datum that should not be saved
back to tts_values[*] for the small_tlist reason. However, whether these
steps can be generated is based on the output of the createplan/setrefs
changes. Take the INNER_VAR for example:

In ExecInitExprRec:

switch (variable->varno)
{
case INNER_VAR:
if (is_join_plan(plan) &&
bms_is_member(attnum,
  ((JoinState *) 
state->parent)->inner_pre_detoast_attrs))
{
scratch.opcode = EEOP_INNER_VAR_TOAST;
}
else
{
scratch.opcode = EEOP_INNER_VAR;
}
}

The data workflow is:

1. set_plan_forbid_pre_detoast_vars_recurse (in createplan.c)
decides which Vars should *not* be pre-detoasted because of the
small_tlist reason and records them in Plan.forbid_pre_detoast_vars.

2. fix_scan_expr (in setrefs.c) tracks which Vars should be
detoasted for the specific plan node and records them in it. Currently
only Scan and Join nodes support this feature.

typedef struct Scan
{
...
/*
 * Records of var's varattno - 1 where the Var is accessed indirectly by
 * any expression, like a > 3.  However, an IS [NOT] NULL check is not
 * included since it doesn't access tts_values[*] at all.
 *
 * This is essential information to figure out which attrs should use
 * the pre-detoast-attrs logic.
 */
Bitmapset  *reference_attrs;
} Scan;

typedef struct Join
{
..
/*
 * Records of var's varattno - 1 where the Var is accessed indirectly by
 * any expression, like a > 3.  However, an IS [NOT] NULL check is not
 * included since it doesn't access tts_values[*] at all.
 *
 * This is essential information to figure out which attrs should use
 * the pre-detoast-attrs logic.
 */
Bitmapset  *outer_reference_attrs;
Bitmapset  *inner_reference_attrs;
} Join;

3. During the InitPlan stage, we maintain
PlanState.xxx_pre_detoast_attrs and generate different StepOps for them.

4. At the ExecExprInterp stage, only the new StepOps do the extra check
to see whether the detoast should happen. The other steps don't need this
check at all.

If we avoid the createplan.c/setrefs.c changes, probably some unrelated
StepOps would need the extra check as well?

When I worked on the UniqueKey feature, I maintained a UniqueKey.README
to summarize all the topics discussed in the threads; the README is
designed to save effort for reviewers. I think I should apply the same
approach to this feature.

Thank you very much for your feedback!

v7 is attached, with just some comment and Assert changes.

[1] https://www.postgresql.org/message-id/87il4jrk1l.fsf%40163.com
[2]
https://www.postgresql.org/message-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com

-- 
Best Regards
Andy Fan

From f2e7772228e8a18027b9c29f10caba9c6570d934 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Tue, 20 Feb 2024 11:11:53 +0800
Subject: [PATCH v7 1/2] Introduce a Bitset data struct.

While Bitmapset is designed for a variable-length set of bits, Bitset is
designed for a fixed-length set of bits; the fixed length must be specified
at bitset_init time and stays unchanged for the whole lifespan. Because
of this, some operations on Bitset are simpler than on Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it
is the main reason to add this data struct.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c | 200 +-
 src/backend/nodes/outfuncs.c  |  51 +
 src/include/nodes/bitmapset.h |  28 +++
 src/include/nodes/nodes.h |   4 +
 src/test/modules/test

Re: Shared detoast Datum proposal

2024-02-19 Thread Andy Fan

Hi,

I did another round of self-review.  Comments, variable names, and the
order of function definitions are improved so that it reads as smoothly
as possible, so v6 is attached.

-- 
Best Regards
Andy Fan
From f2e7772228e8a18027b9c29f10caba9c6570d934 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Tue, 20 Feb 2024 11:11:53 +0800
Subject: [PATCH v6 1/2] Introduce a Bitset data struct.

While Bitmapset is designed for a variable-length set of bits, Bitset is
designed for a fixed-length set of bits; the fixed length must be specified
at bitset_init time and stays unchanged for the whole lifespan. Because
of this, some operations on Bitset are simpler than on Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it
is the main reason to add this data struct.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c | 200 +-
 src/backend/nodes/outfuncs.c  |  51 +
 src/include/nodes/bitmapset.h |  28 +++
 src/include/nodes/nodes.h |   4 +
 src/test/modules/test_misc/Makefile   |  11 +
 src/test/modules/test_misc/README |   4 +-
 .../test_misc/expected/test_bitset.out|   7 +
 src/test/modules/test_misc/meson.build|  17 ++
 .../modules/test_misc/sql/test_bitset.sql |   3 +
 src/test/modules/test_misc/test_misc--1.0.sql |   5 +
 src/test/modules/test_misc/test_misc.c| 118 +++
 src/test/modules/test_misc/test_misc.control  |   4 +
 src/tools/pgindent/typedefs.list  |   1 +
 13 files changed, 441 insertions(+), 12 deletions(-)
 create mode 100644 src/test/modules/test_misc/expected/test_bitset.out
 create mode 100644 src/test/modules/test_misc/sql/test_bitset.sql
 create mode 100644 src/test/modules/test_misc/test_misc--1.0.sql
 create mode 100644 src/test/modules/test_misc/test_misc.c
 create mode 100644 src/test/modules/test_misc/test_misc.control

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 65805d4527..40cfea2308 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1315,23 +1315,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
 
-	Assert(bms_is_valid_set(a));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1351,6 +1346,19 @@ bms_next_member(const Bitmapset *a, int prevbit)
 	return -2;
 }
 
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	Assert(bms_is_valid_set(a));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
  *
@@ -1458,3 +1466,177 @@ bitmap_match(const void *key1, const void *key2, Size keysize)
 	return !bms_equal(*((const Bitmapset *const *) key1),
 	  *((const Bitmapset *const *) key2));
 }
+
+/*
+ * bitset_init - create a Bitset; the size is rounded up to whole bitmapwords.
+ */
+Bitset *
+bitset_init(size_t size)
+{
+	int			nword = (size + BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+	Bitset	   *result;
+
+	if (size == 0)
+		return NULL;
+
+	result = (Bitset *) palloc0(sizeof(Bitset) + nword * sizeof(bitmapword));
+	result->nwords = nword;
+
+	return result;
+}
+
+/*
+ * bitset_clear - clear the bits only, but the memory is still there.
+ */
+void
+bitset_clear(Bitset *a)
+{
+	if (a != NULL)
+		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
+void
+bitset_free(Bitset *a)
+{
+	if (a != NULL)
+		pfree(a);
+}
+
+bool
+bitset_is_empty(Bitset *a)
+{
+	int			i;
+
+	if (a == NULL)
+		return true;
+
+	for (i = 0; i < a->nwords; i++)
+	{
+		bitmapword	w = a->words[i];
+
+		if (w != 0)
+			return false;
+	}
+
+	return true;
+}
+
+Bitset *
+bitset_copy(Bitset *a)
+{
+	Bitset	   *result;
+
+	if (a == NULL)
+		return NULL;
+
+	result = bitset_init(a->nwords * BITS_PER_BITMAPWORD);
+
+	memcpy(result->words, a->words, sizeof(bitmapword) * a->nwords);
+	return result;
+}
+
+void
+bitset_add_member(Bitset *a, int x)
+{
+	int			wordnum,
+bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] |= ((bi

Re: serial not accepted as datatype in ALTER TABLE ... ALTER COLUMN

2024-02-19 Thread Andy Fan


Ashutosh Bapat  writes:

> On Sun, Feb 18, 2024 at 1:59 PM Andy Fan  wrote:
>>
>>
>> I tried your idea with the attachment; it is still in a draft state,
>> but it can be used as a proof-of-concept and to ease further
>> communication.  Just one point needs mentioning: serial implies a
>> "default value" + "not null" constraint. So when we modify a column into
>> serial, we need to change only the NULL values to the default value
>> at the RewriteTable stage.
>>
>
> I am surprised that this requires changes in ReWrite. I thought adding
> NOT NULL constraint and default value commands would be done by
> transformColumnDefinition(). But I haven't looked at the patch close
> enough.

Hmm, I think this depends on how the NULL values are handled before the
table rewrite.

Consider the example like this:

\pset null (null)
create table t(a int);
insert into t select 1;
insert into t select;

postgres=# select * from t;
   a
--------
      1
 (null)
(2 rows)

Since the serial type implies "not null" + "default value", shall we raise
an error or fill the value with the default value?  The patch chooses the
latter way, which needs changes at the RewriteTable stage, but now I think
raising an error directly is an option as well.

-- 
Best Regards
Andy Fan





Re: Properly pathify the union planner

2024-02-18 Thread Andy Fan


David Rowley  writes:

>>
>> The two if-clauses look unnecessary; they should be handled by
>> pathkeys_count_contained_in already. The same issue exists in
>> pathkeys_useful_for_ordering as well. The attached patch fixes it in master.
>
> I agree.  I'd rather not have those redundant checks in
> pathkeys_useful_for_setop(), and I do want those functions to be as
> similar as possible.  So I think adjusting it in master is a good
> idea.
>
> I've pushed your patch.
>
Thanks for pushing it!


-- 
Best Regards
Andy Fan





Re: serial not accepted as datatype in ALTER TABLE ... ALTER COLUMN

2024-02-18 Thread Andy Fan

Hi Ashutosh, 

> data_type is described on that page as "Data type of the new column,
> or new data type for an existing column." but CREATE TABLE
> documentation [2] redirects data_type to [3], which mentions serial.
> The impression created by the documentation is the second statement
> above is a valid statement as should not throw an error; instead
> change the data type of the column (and create required sequence).

I didn't find a reason not to support it; if there is any reason, I
think it is better to have some explanation in the documentation.

> In code ATPrepAlterColumnType() calls typenameTypeIdAndMod(), whereas
> transformColumnDefinition() (called for ALTER TABLE ... ADD COLUMN and
> CREATE TABLE) handles "serial" data type separately. Looks like we are
> missing a call to transformColumnDefinition() in
> transformAlterTableStmt() under case AT_AlterColumnType.

I tried your idea with the attachment; it is still in a draft state,
but it can be used as a proof-of-concept and to ease further
communication.  Just one point needs mentioning: serial implies a
"default value" + "not null" constraint. So when we modify a column into
serial, we need to change only the NULL values to the default value
at the RewriteTable stage.

-- 
Best Regards
Andy Fan

From 4485a9af33c9c7d493e23c2804bde4c15df0b79a Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Sun, 18 Feb 2024 16:14:29 +0800
Subject: [PATCH v0 1/1] POC: Support ALTER TABLE tableName ALTER COLUMN col
 type serial.

---
 src/backend/commands/tablecmds.c  | 60 +--
 src/backend/parser/gram.y |  1 +
 src/backend/parser/parse_utilcmd.c| 41 -
 src/include/nodes/parsenodes.h|  1 +
 src/include/parser/parse_utilcmd.h|  4 +-
 src/test/regress/expected/alter_table.out | 70 +--
 src/test/regress/sql/alter_table.sql  | 20 +--
 7 files changed, 180 insertions(+), 17 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 46ece07338..7e13b04efa 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -231,6 +231,8 @@ typedef struct NewColumnValue
 	Expr	   *expr;			/* expression to compute */
 	ExprState  *exprstate;		/* execution state */
 	bool		is_generated;	/* is it a GENERATED expression? */
+	bool		new_on_null;	/* set the new value only when the old value
+ * is NULL. */
 } NewColumnValue;
 
 /*
@@ -5594,7 +5596,8 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	 atstmt,
 	 context->queryString,
 	 ,
-	 );
+	 ,
+	 context);
 
 	/* Execute any statements that should happen before these subcommand(s) */
 	foreach(lc, beforeStmts)
@@ -6230,10 +6233,11 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 	if (ex->is_generated)
 		continue;
 
-	newslot->tts_values[ex->attnum - 1]
-		= ExecEvalExpr(ex->exprstate,
-	   econtext,
-	   >tts_isnull[ex->attnum - 1]);
+	if (!ex->new_on_null || oldslot->tts_isnull[ex->attnum - 1])
+		newslot->tts_values[ex->attnum - 1]
+			= ExecEvalExpr(ex->exprstate,
+		   econtext,
+		   >tts_isnull[ex->attnum - 1]);
 }
 
 ExecStoreVirtualTuple(newslot);
@@ -13284,6 +13288,7 @@ ATPrepAlterColumnType(List **wqueue,
 		newval->attnum = attnum;
 		newval->expr = (Expr *) transform;
 		newval->is_generated = false;
+		newval->new_on_null = def->from_serial;
 
 		tab->newvals = lappend(tab->newvals, newval);
 		if (ATColumnChangeRequiresRewrite(transform, attnum))
@@ -13495,6 +13500,48 @@ ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
 	HeapTuple	depTup;
 	ObjectAddress address;
 
+	if (def->from_serial)
+	{
+		/*
+		 * Store the DEFAULT for the from_serial
+		 */
+		/* XXX: copy from ATExecAddColumn */
+		RawColumnDefault *rawEnt;
+
+		heapTup = SearchSysCacheCopyAttName(RelationGetRelid(rel), colName);
+		attTup = (Form_pg_attribute) GETSTRUCT(heapTup);
+
+		rawEnt = (RawColumnDefault *) palloc(sizeof(RawColumnDefault));
+		rawEnt->attnum = attTup->attnum;
+		rawEnt->raw_default = copyObject(def->raw_default);
+
+		/*
+		 * Attempt to skip a complete table rewrite by storing the specified
+		 * DEFAULT value outside of the heap.  This may be disabled inside
+		 * AddRelationNewConstraints if the optimization cannot be applied.
+		 */
+		rawEnt->missingMode = (!def->generated);
+
+		rawEnt->generated = def->generated;
+
+		/*
+		 * This function is intended for CREATE TABLE, so it processes a
+		 * _list_ of defaults, but we just do one.
+		 */
+		AddRelationNewConstraints(rel, list_make1(rawEnt), NIL,
+  false, true, false, NULL);
+
+		/* Make the additional catalog changes v

Re: make add_paths_to_append_rel aware of startup cost

2024-02-15 Thread Andy Fan


David Rowley  writes:

> I'd be more happy using this one as percentage-wise, the cost
> difference is much larger.

+1 for the percentage-wise.
>
> I checked that the t2.thounsand = 0 query still tests the cheap
> startup paths in add_paths_to_append_rel().

I get the same conclusion here. 
>
> So, if nobody has any better ideas, I'm just going to push the " and
> t2.thousand = 0" adjustment.

LGTM.

-- 
Best Regards
Andy Fan





Re: A new strategy for pull-up correlated ANY_SUBLINK

2024-02-15 Thread Andy Fan


Hi Alexander,

> Hi!
>
>> If the changes of Alena are ok, can you merge the changes and post an
>> updated version so that CFBot can apply the patch and verify the
>> changes. As currently CFBot is trying to apply only Alena's changes
>> and failing with the following at [1]:
>
> I think this is a very nice and pretty simple optimization.  I've
> merged the changes by Alena, and slightly revised the code changes in
> convert_ANY_sublink_to_join().  I'm going to push this if there are no
> objections.

Thanks for picking this up! I double-checked the patch; it looks good to
me.

-- 
Best Regards
Andy Fan





Re: [PATCH] Expose port->authn_id to extensions and triggers

2024-02-15 Thread Andy Fan

Hi,

Michael Paquier  writes:

> [[PGP Signed Part:Undecided]]
> On Tue, Aug 23, 2022 at 10:04:30AM -0700, Jacob Champion wrote:
>
> My main worry here is EXEC_BACKEND, where we would just use our own
> implementation of fork(), and it is a bad idea at the end to leave
> that untouched while we could have code paths that attempt to access
> it.  At the end, I have moved the initialization at the same place as
> where we set MyProcPort for a backend in BackendInitialize(), mainly
> as a matter of consistency because ClientConnectionInfo is aimed at
> being a subset of that.  And applied.

I found a compiler complaint about this patch. The attached fixes that.

-- 
Best Regards
Andy Fan

From 9e8a3fb7a044704fbfcd682a897f72260266bd54 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 15 Feb 2024 16:46:57 +0800
Subject: [PATCH v1 1/1] Remove the "parameter 'maxsize' set but not used"
 error.

maxsize is only used in assert-enabled builds; to keep the compiler quiet,
it is better to maintain it only in assert-enabled builds.
---
 src/backend/utils/init/miscinit.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 23f77a59e5..0e72fcefab 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -1050,7 +1050,9 @@ SerializeClientConnectionInfo(Size maxsize, char *start_address)
 	Assert(maxsize >= sizeof(serialized));
 	memcpy(start_address, , sizeof(serialized));
 
+#ifdef USE_ASSERT_CHECKING
 	maxsize -= sizeof(serialized);
+#endif
 	start_address += sizeof(serialized);
 
 	/* Copy authn_id into the space after the struct */
-- 
2.34.1



Re: make add_paths_to_append_rel aware of startup cost

2024-02-15 Thread Andy Fan

Andy Fan  writes:

> Thomas Munro  writes:
>
>> On Thu, Oct 5, 2023 at 9:07 PM David Rowley  wrote:
>>> Thanks. Pushed.
>>
>> FYI somehow this plan from a8a968a8212e flipped in this run:
>>
>> === dumping 
>> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/regression.diffs
>> ===
>> diff -U3 
>> /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out
>> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out
>> --- /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out
>> 2024-01-15 00:31:13.947555940 +
>> +++ 
>> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out
>> 2024-02-14 00:06:17.075584839 +
>> @@ -1447,9 +1447,9 @@
>> ->  Append
>>   ->  Nested Loop
>> Join Filter: (t1.tenthous = t2.tenthous)
>> -   ->  Seq Scan on tenk1 t1
>> +   ->  Seq Scan on tenk2 t2
>> ->  Materialize
>> - ->  Seq Scan on tenk2 t2
>> + ->  Seq Scan on tenk1 t1
>>   ->  Result
>>  (8 rows)
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2024-02-14%2000%3A01%3A03
>
> Thanks for this information! I will take a look at this.

I found that both plans have the same cost. I couldn't pin down the
exact cause after some hours of research, but it is pretty similar to
7516056c584e3, so I used a similar strategy to stabilize it. Is it
acceptable?

-- 
Best Regards
Andy Fan

From 01c2b5ae76a4493f33b894afdd4a8d3ce31e1a3e Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 15 Feb 2024 16:29:30 +0800
Subject: [PATCH v1 1/1] Attempt to stabilize new unionall + limit regression
 test

This test was recently added in a8a968a8212e, and it sometimes appears
to be unstable with regard to the join order, presumably due to the
relations on either side of the join being equal in size.  Here we add a
qual to make one of them smaller so the planner is more likely to choose
to materialize the smaller of the two.

Reported-by: Thomas Munro
Author: Andy Fan

Discussion: https://postgr.es/m/CA%2BhUKGLqC-NobKYfjxNM3Gexv9OJ-Fhvy9bugUcXsZjTqH7W%3DQ%40mail.gmail.com#88e6420b59a863403b9b67a0128fdacc
---
 src/test/regress/expected/union.out | 11 ++-
 src/test/regress/sql/union.sql  |  4 ++--
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out
index 73e320bad4..0e527b65b3 100644
--- a/src/test/regress/expected/union.out
+++ b/src/test/regress/expected/union.out
@@ -1438,8 +1438,8 @@ where (x = 0) or (q1 >= q2 and q1 <= q2);
 -- Ensure we get a Nested Loop join between tenk1 and tenk2
 explain (costs off)
 select t1.unique1 from tenk1 t1
-inner join tenk2 t2 on t1.tenthous = t2.tenthous
-   union all
+inner join tenk2 t2 on t1.tenthous = t2.tenthous and t1.ten > 5
+union all
 (values(1)) limit 1;
QUERY PLAN   
 
@@ -1447,11 +1447,12 @@ inner join tenk2 t2 on t1.tenthous = t2.tenthous
->  Append
  ->  Nested Loop
Join Filter: (t1.tenthous = t2.tenthous)
-   ->  Seq Scan on tenk1 t1
+   ->  Seq Scan on tenk2 t2
->  Materialize
- ->  Seq Scan on tenk2 t2
+ ->  Seq Scan on tenk1 t1
+   Filter: (ten > 5)
  ->  Result
-(8 rows)
+(9 rows)
 
 -- Ensure there is no problem if cheapest_startup_path is NULL
 explain (costs off)
diff --git a/src/test/regress/sql/union.sql b/src/test/regress/sql/union.sql
index 6c509ac80c..13ba9f21ad 100644
--- a/src/test/regress/sql/union.sql
+++ b/src/test/regress/sql/union.sql
@@ -548,8 +548,8 @@ where (x = 0) or (q1 >= q2 and q1 <= q2);
 -- Ensure we get a Nested Loop join between tenk1 and tenk2
 explain (costs off)
 select t1.unique1 from tenk1 t1
-inner join tenk2 t2 on t1.tenthous = t2.tenthous
-   union all
+inner join tenk2 t2 on t1.tenthous = t2.tenthous and t1.ten > 5
+union all
 (values(1)) limit 1;
 
 -- Ensure there is no problem if cheapest_startup_path is NULL
-- 
2.34.1



Re: make add_paths_to_append_rel aware of startup cost

2024-02-14 Thread Andy Fan


Thomas Munro  writes:

> On Thu, Oct 5, 2023 at 9:07 PM David Rowley  wrote:
>> Thanks. Pushed.
>
> FYI somehow this plan from a8a968a8212e flipped in this run:
>
> === dumping 
> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/regression.diffs
> ===
> diff -U3 
> /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out
> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out
> --- /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/regress/expected/union.out
> 2024-01-15 00:31:13.947555940 +
> +++ 
> /home/bf/bf-build/mylodon/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/union.out
> 2024-02-14 00:06:17.075584839 +
> @@ -1447,9 +1447,9 @@
> ->  Append
>   ->  Nested Loop
> Join Filter: (t1.tenthous = t2.tenthous)
> -   ->  Seq Scan on tenk1 t1
> +   ->  Seq Scan on tenk2 t2
> ->  Materialize
> - ->  Seq Scan on tenk2 t2
> + ->  Seq Scan on tenk1 t1
>   ->  Result
>  (8 rows)
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon=2024-02-14%2000%3A01%3A03

Thanks for this information! I will take a look at this.

-- 
Best Regards
Andy Fan





Re: Extract numeric filed in JSONB more effectively

2024-02-09 Thread Andy Fan

Hi,

Here is the update of this patch.

1. What is it for?

commit f7b93acc24b4a152984048fefc6d71db606e3204 (HEAD -> jsonb_numeric)
Author: yizhi.fzh 
Date:   Fri Feb 9 16:54:06 2024 +0800

Improve the performance of Jsonb numeric/bool extraction.

A JSONB object stores numeric values in a format binary-compatible with
the numeric data type in SQL. However, in the past, extracting a numeric
value from a JSONB field still needed to find the corresponding JsonbValue
first, then convert the JsonbValue to Jsonb, and finally use the cast
system to convert the Jsonb to the numeric data type. This approach was
very inefficient in terms of performance.

With the current patch, the JsonbValue is converted to the numeric data
type directly.  This is done by a planner support function which detects
the above case and simplifies the expression.  Because the boolean type
and numeric type share certain similarities in their attributes, we have
implemented the same optimization approach for both.  In the ideal test
case, the performance can be 2x what it was before.

The optimized functions and operators includes:
1. jsonb_object_field / ->
2. jsonb_array_element / ->
3. jsonb_extract_path / #>
4. jsonb_path_query
5. jsonb_path_query_first

example:
create table tb(a jsonb);
insert into tb select '{"a": 1, "b": "a"}'::jsonb;


master:
explain (costs off, verbose) select * from tb where (a->'a')::numeric > 
3::numeric;
QUERY PLAN 
---
 Seq Scan on public.tb
   Output: a
   Filter: (((tb.a -> 'a'::text))::numeric > '3'::numeric)
(3 rows)

patched:

postgres=# explain (costs off, verbose) select * from tb where 
(a->'a')::numeric > 3::numeric;
 QUERY PLAN 
 
-
 Seq Scan on public.tb
   Output: a
   Filter: (jsonb_finish_numeric(jsonb_object_field_start((tb.a)::internal, 
'a'::text), '1700'::oid) > '3'::numeric)
(3 rows)

The final expression generated by the planner support function includes:

1).
jsonb_object_field_start((tb.a)::internal, 'a'::text) runs first; this
function returns the internal datum, which is in fact a JsonbValue.
2).
jsonb_finish_numeric(internal (jsonbvalue), '1700'::oid) converts the
JsonbValue to numeric directly, without the Jsonb as an intermediate
result.

The reason for the "1700::oid" argument will be explained later; that's
the key issue right now.

The reason why we need the 2 steps rather than 1 is that the code can
be better abstracted this way. The idea comes from Chap; the detailed
explanation is at [1]. You can search for "Now, it would make me happy
to further reduce some of the code duplication" and read the following
paragraph.
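
To illustrate the shape of that two-step API, here is a toy sketch with
hypothetical names (not the patch's actual functions): step 1 locates the
value inside an already-parsed container and returns an internal pointer;
step 2 finishes the extraction into the target type, with no serialized
intermediate form in between.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy "container": an array of key -> value entries standing in for a
 * parsed JSONB object.  All names here are hypothetical. */
typedef struct Entry { const char *key; double num; } Entry;
typedef struct Container { const Entry *entries; int n; } Container;

/* Step 1: locate the value and return an internal pointer, the analogue
 * of returning a JsonbValue instead of a materialized Jsonb. */
static const Entry *
field_start(const Container *c, const char *key)
{
	for (int i = 0; i < c->n; i++)
		if (strcmp(c->entries[i].key, key) == 0)
			return &c->entries[i];
	return NULL;				/* key absent */
}

/* Step 2: finish the extraction into the target type directly. */
static double
finish_numeric(const Entry *e)
{
	assert(e != NULL);
	return e->num;
}
```

Splitting the lookup from the finishing step is what lets several
finishers (numeric, boolean, ...) share one lookup routine.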


2. Where is the current feature blocked for the past few months?

It's an error-message compatibility issue! Continuing with the above setup:

master:

select * from tb where (a->'b')::numeric > 3::numeric;
ERROR:  cannot cast jsonb string to type numeric

select * from tb where (a->'b')::int4 > 3::numeric;
ERROR:  cannot cast jsonb string to type integer

You can see the error message is different (numeric vs integer). 


Patched:

We still can get the same error message as master BUT the code
looks odd.

select * from tb where (a->'b')::int4 > 3;
QUERY PLAN  
   
---
 Seq Scan on public.tb
   Output: a
   Filter: ((jsonb_finish_numeric(jsonb_object_field_start((tb.a)::internal, 
'b'::text), '23'::oid))::integer > 3)
(3 rows)

You can see "jsonb_finish_numeric(.., '23'::oid)"; the '23::oid' is just
for the *"integer"* output in the error message:

"cannot cast jsonb string to type *integer*"

Now the situation is that either we use the odd argument (23::oid) in
jsonb_finish_numeric, or we use an error message incompatible with the
previous version. I'm not sure which way is better, but this is the
place where the current feature is blocked.

3. what do I want now?

Since this feature uses a planner support function which needs some
catalog changes, it would be better to merge it in PG17; otherwise, we
have to target PG18. So it would be pretty helpful if some senior
developers could chime in, at least on the current blocking issue.

[1]
https://www.postgresql.org/message-id/5138c6b5fd239e7ce4e1a4e63826ac27%40anastigmatix.net
 

-- 
Best Regards
Andy Fan

>From f7b93acc24b4a152984048fefc6d71db606e3204 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Fri, 9

Re: the s_lock_stuck on perform_spin_delay

2024-02-08 Thread Andy Fan
her spinlock" has many more use cases than
SpinLockAcquire / LockBufHdr, I think sharing the same function will be
OK, which is the VerifyNoSpinLocksHeld function for now.


> - spinlock_finish_acquire() - have acquired spinlock
>   empty in optimized builds, in assert builds sets variable indicating we're
>   in spinlock
>
>   This would get called in SpinLockRelease() etc.

I think you mean "SpinLockAcquire" here.

Matthias suggested "we need to 'arm' the checks just before we lock the
spin lock, and 'disarm' the checks just after we unlock the spin lock"
at [1], and I'm kind of persuaded by that.  So I used
spinlock_prepare_acquire to set the variable indicating we're in a
spinlock. Which one do you prefer now?

>
> - spinlock_finish_release()  - not holding the lock anymore
>
>   This would get called by SpinLockRelease(), UnlockBufHdr()
>
>
> - spinlock_prepare_spin() - about to spin waiting for a spinlock
>   like the current init_spin_delay()
>
>   This would get called in s_lock(), LockBufHdr() etc.
>
>
> - spinlock_finish_spin() - completed waiting for a spinlock
>   like the current finish_spin_delay()
>
>   This would get called in s_lock(), LockBufHdr() etc.

All done in v10. For consistency, I also renamed
perform_spin_delay to spinlock_perform_delay.

I have gotten much more out of this review process than I expected;
thank you very much for that!

[1]
https://www.postgresql.org/message-id/CAEze2WggP-2Dhocmdhp-LxBzic%3DMXRgGA_tmv1G_9n-PDt2MQg%40mail.gmail.com

-- 
Best Regards
Andy Fan

>From 961ba292a9a87685c5ddeefbbaa5edc02a17e0e8 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 8 Feb 2024 21:52:24 +0800
Subject: [PATCH v10 1/1] Detect misuse of spin lock automatically

Spin locks are intended for very short-term locks, but it is possible
to misuse them in many cases, e.g. acquiring LWLocks or regular
locks, allocating memory, or calling errstart while holding a spin
lock. This patch detects such misuse automatically in a
USE_ASSERT_CHECKING build.

CHECK_FOR_INTERRUPTS should be avoided as well when holding a spin lock.
Depending on which signals are left to handle, PG may raise ERROR/FATAL,
which makes the code jump somewhere else from which the spin lock can
hardly be released anyway.
---
 src/backend/storage/buffer/bufmgr.c | 18 +
 src/backend/storage/lmgr/lock.c |  6 +++
 src/backend/storage/lmgr/lwlock.c   | 16 +---
 src/backend/storage/lmgr/s_lock.c   | 61 +++--
 src/backend/tcop/postgres.c |  8 
 src/backend/utils/error/elog.c  |  9 +
 src/backend/utils/mmgr/mcxt.c   | 16 
 src/include/miscadmin.h | 21 +-
 src/include/storage/buf_internals.h |  1 +
 src/include/storage/s_lock.h| 61 ++---
 src/include/storage/spin.h  | 15 ++-
 src/test/regress/regress.c  | 12 --
 src/tools/pgindent/typedefs.list|  2 +-
 13 files changed, 192 insertions(+), 54 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index eb1ec3b86d..f97612ca92 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5390,12 +5390,13 @@ rlocator_comparator(const void *p1, const void *p2)
 uint32
 LockBufHdr(BufferDesc *desc)
 {
-	SpinDelayStatus delayStatus;
 	uint32		old_buf_state;
 
 	Assert(!BufferIsLocal(BufferDescriptorGetBuffer(desc)));
 
-	init_local_spin_delay();
+	VerifyNoSpinLocksHeld(true);
+
+	spinlock_local_prepare_spin();
 
 	while (true)
 	{
@@ -5404,9 +5405,9 @@ LockBufHdr(BufferDesc *desc)
 		/* if it wasn't set before we're OK */
 		if (!(old_buf_state & BM_LOCKED))
 			break;
-		perform_spin_delay();
+		spinlock_perform_delay();
 	}
-	finish_spin_delay();
+	spinlock_finish_spin();
 	return old_buf_state | BM_LOCKED;
 }
 
@@ -5420,20 +5421,21 @@ LockBufHdr(BufferDesc *desc)
 static uint32
 WaitBufHdrUnlocked(BufferDesc *buf)
 {
-	SpinDelayStatus delayStatus;
 	uint32		buf_state;
 
-	init_local_spin_delay();
+	spinlock_local_prepare_spin();
 
 	buf_state = pg_atomic_read_u32(&buf->state);
 
 	while (buf_state & BM_LOCKED)
 	{
-		perform_spin_delay();
+		spinlock_perform_delay();
 		buf_state = pg_atomic_read_u32(&buf->state);
 	}
 
-	finish_spin_delay();
+	spinlock_finish_spin();
+
+	spinlock_finish_release();
 
 	return buf_state;
 }
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index c70a1adb9a..a711e63664 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -776,6 +776,12 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	bool		found_conflict;
 	bool		log_lock = false;
 
+	/*
+	 * Spin lock should not be held for a long time, but the time needed here
+	 * may be too long, so let's make sure no spin lock is held now.
+	 */
+	VerifyNoSpinLocksHeld(true);
+
 	if (lockmethodid <= 0 

Re: Properly pathify the union planner

2024-02-06 Thread Andy Fan

Hi,

> * I think we should update truncate_useless_pathkeys() to account for
> the ordering requested by the query's set operation;

Nice catch.

> I'm thinking that maybe it'd be better to move the work of sorting the
> subquery's paths to the outer query level, specifically within the
> build_setop_child_paths() function, just before we stick SubqueryScanPath
> on top of the subquery's paths.  I think this is better because:
>
> 1. This minimizes the impact on subquery planning and reduces the
> footprint within the grouping_planner() function as much as possible.
>
> 2. This can help avoid the aforementioned add_path() issue because the
> two involved paths will be structured as:
>
> cheapest_path -> subqueryscan
> and
> cheapest_path -> sort -> subqueryscan
>
> If the two paths cost fuzzily the same and add_path() decides to keep
> the second one due to it having better pathkeys and pfree the first one,
> it would not be a problem.

This is a smart idea; it works because you create two different
subqueryscans for the cheapest_input_path.

FWIW, I found that we don't call create_sort_path while building a merge
join path; instead it just costs the sort, adds that to the cost of the
mergejoin path, and notes that the path needs presorted data. Only
during create_mergejoin_*plan* is the sort plan really created. For the
mergeappend case, could we use a similar strategy? That way, we might
simplify the code using the MergeAppend node, since the caller would
just need to say "I want to try MergeAppend with the given pathkeys"
without creating the sorts themselves.

(Having had a quick glance at initial_cost_mergejoin and
create_mergejoin_plan, it looks like incremental sort doesn't work with merge join?)

>
> To assist the discussion I've attached a diff file that includes all the
> changes above.

+ */
+static int
+pathkeys_useful_for_setop(PlannerInfo *root, List *pathkeys)
+{
+   int n_common_pathkeys;
+
+   if (root->setop_pathkeys == NIL)
+   return 0;   /* no special setop ordering requested */
+
+   if (pathkeys == NIL)
+   return 0;   /* unordered path */
+
+   (void) pathkeys_count_contained_in(root->setop_pathkeys, pathkeys,
+  &n_common_pathkeys);
+
+   return n_common_pathkeys;
+}

The two if-clauses look unnecessary; they should be handled by
pathkeys_count_contained_in already. The same issue exists in
pathkeys_useful_for_ordering as well. The attached patch fixes it on master.
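
Why the NIL pre-checks are redundant can be seen from a minimal sketch of
a prefix count over linked lists (toy types, not the Postgres List API):
the loop condition already yields 0 when either list is empty, so the
early returns add nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Toy singly linked list standing in for a pathkey list. */
typedef struct Cell { int key; struct Cell *next; } Cell;

/* Count how many leading cells of the two lists match, the same shape as
 * pathkeys_count_contained_in's common-prefix count.  NULL (the analogue
 * of NIL) on either side falls out of the loop immediately, returning 0. */
static int
count_common_prefix(const Cell *wanted, const Cell *pathkeys)
{
	int		n = 0;

	while (wanted && pathkeys && wanted->key == pathkeys->key)
	{
		n++;
		wanted = wanted->next;
		pathkeys = pathkeys->next;
	}
	return n;
}
```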

-- 
Best Regards
Andy Fan

>From f562dd8b80bd0b7c2d66ee9633434a1e5be82e04 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Wed, 7 Feb 2024 07:00:33 +0800
Subject: [PATCH v1 1/1] Remove the unnecessary checks for pathkey routines

Both pathkeys_count_contained_in and foreach can handle NIL input,
so the extra checks are unnecessary.
---
 src/backend/optimizer/path/pathkeys.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 82ff31273b..98334e5c52 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -2126,12 +2126,6 @@ pathkeys_useful_for_ordering(PlannerInfo *root, List *pathkeys)
 {
 	int			n_common_pathkeys;
 
-	if (root->query_pathkeys == NIL)
-		return 0;/* no special ordering requested */
-
-	if (pathkeys == NIL)
-		return 0;/* unordered path */
-
 	(void) pathkeys_count_contained_in(root->query_pathkeys, pathkeys,
	   &n_common_pathkeys);
 
@@ -2167,10 +2161,6 @@ pathkeys_useful_for_grouping(PlannerInfo *root, List *pathkeys)
 	if (root->group_pathkeys == NIL)
 		return 0;
 
-	/* unordered path */
-	if (pathkeys == NIL)
-		return 0;
-
 	/* walk the pathkeys and search for matching group key */
 	foreach(key, pathkeys)
 	{
-- 
2.34.1



Re: the s_lock_stuck on perform_spin_delay

2024-01-24 Thread Andy Fan


Andres Freund  writes:

> On 2024-01-18 14:00:58 -0500, Robert Haas wrote:
>> > The LockBufHdr also used init_local_spin_delay / perform_spin_delay
>> > infrastruce and then it has the same issue like ${subject}, it is pretty
>> > like the code in s_lock; Based on my current knowledge, I think we
>> > should add the check there.
>> 
>> I'd like to hear from Andres, if possible. @Andres: Should these
>> sanity checks apply only to spin locks per se, or also to buffer
>> header locks?
>
> They also should apply to buffer header locks. The exact same dangers apply
> there. The only reason this isn't using a plain spinlock is that this way we
> can modify more state with a single atomic operation. But all the dangers of
> using spinlocks apply just as well.

Thanks for speaking on this! 

-- 
Best Regards
Andy Fan





Re: the s_lock_stuck on perform_spin_delay

2024-01-24 Thread Andy Fan

Hi, 

> Hi,
>
> On 2024-01-22 15:18:35 +0800, Andy Fan wrote:
>> I used sigismember(, SIGQUIT) to detect if a process is doing a
>> quickdie, however this is bad not only because it doesn't work on
>> Windows, but also it has too poor performance even it impacts on
>> USE_ASSERT_CHECKING build only. In v8, I introduced a new global
>> variable quickDieInProgress to handle this.
>
> For reasons you already noted, using sigismember() isn't viable. But I am not
> convinced by quickDieInProgress either. I think we could just reset the state
> for the spinlock check in the code handling PANIC, perhaps via a helper
> function in spin.c.

Handled with the action for your suggestion #3. 

>> +void
>> +VerifyNoSpinLocksHeld(void)
>> +{
>> +#ifdef USE_ASSERT_CHECKING
>> +if (last_spin_lock_file != NULL)
>> +elog(PANIC, "A spin lock has been held at %s:%d",
>> + last_spin_lock_file, last_spin_lock_lineno);
>> +#endif
>> +}
>
> I think the #ifdef for this needs to be in the header, not here. Otherwise we
> add a pointless external function call to a bunch of performance critical
> code.
>
Done.

>> diff --git a/src/backend/storage/buffer/bufmgr.c 
>> b/src/backend/storage/buffer/bufmgr.c
>> index 7d601bef6d..c600a113cf 100644
>> --- a/src/backend/storage/buffer/bufmgr.c
>> +++ b/src/backend/storage/buffer/bufmgr.c
>> @@ -5409,6 +5409,7 @@ LockBufHdr(BufferDesc *desc)
>>  
>>  init_local_spin_delay();
>>  
>> +START_SPIN_LOCK();
>>  while (true)
>>  {
>>  /* set BM_LOCKED flag */
>
> Seems pretty odd that we now need init_local_spin_delay() and
> START_SPIN_LOCK(). Note that init_local_spin_delay() also wraps handling of
> __FILE__, __LINE__ etc, so it seems we're duplicating state here.

Thanks for catching this! Based on the feedback so far, it is not OK to
acquire another spin lock when holding one already. So I refactored the
code like this:

 /*
- * Support for spin delay which is useful in various places where
- * spinlock-like procedures take place.
+ * Support for spin delay and spin misuse detection purpose.
+ *
+ * spin delay which is useful in various places where spinlock-like
+ * procedures take place.
+ *
+ * spin misuse is based on global spinStatus to know if a spin lock
+ * is held when a heavy operation is taking.
  */
 typedef struct
 {
@@ -846,22 +854,40 @@ typedef struct
const char *file;
int line;
const char *func;
-} SpinDelayStatus;
+   bool    in_panic; /* works for spin lock misuse purpose. */
+} SpinLockStatus;

+extern PGDLLIMPORT SpinLockStatus spinStatus;

Now init_local_spin_delay, perform_spin_delay, and finish_spin_delay all
access the same global variable spinStatus, and just 2 extra functions
are added (previously there were 3). They are:

extern void VerifyNoSpinLocksHeld(bool check_in_panic);
extern void ResetSpinLockStatus(void);

The panic-check stuff is still part of SpinLockStatus.

v9 attached.
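
As a toy illustration of the single-global-status idea above (hypothetical
names, not the patch's exact API): one process-wide record of the at most
one spin lock held is armed on acquire and disarmed on release, and heavy
operations (palloc, elog, LWLockAcquire, ...) assert that it is clear.

```c
#include <assert.h>
#include <stddef.h>

/* One process-wide record of the (at most one) spin lock held.
 * file == NULL means "no spin lock held". */
typedef struct SpinLockStatus
{
	const char *file;
	int			line;
} SpinLockStatus;

static SpinLockStatus spinStatus;

/* Arm the check: nested spin locks are a misuse, so holding one
 * already when acquiring is an assertion failure. */
static void
spinlock_prepare_acquire(const char *file, int line)
{
	assert(spinStatus.file == NULL);
	spinStatus.file = file;
	spinStatus.line = line;
}

/* Disarm the check on release. */
static void
spinlock_finish_release(void)
{
	spinStatus.file = NULL;
}

/* The hook heavy operations would call to verify no spin lock is held. */
static int
no_spin_locks_held(void)
{
	return spinStatus.file == NULL;
}
```

Because a backend can hold at most one spin lock at a time under this
rule, a single global record is enough; no stack is needed.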

-- 
Best Regards
Andy Fan

>From 4c8fd0ab71299e57fbeb18dec70051bd1d035c7c Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 25 Jan 2024 15:19:49 +0800
Subject: [PATCH v9 1/1] Detect misuse of spin lock automatically

Spin locks are intended for very short-term locks, but it is possible
to misuse them in many cases, e.g. acquiring LWLocks or regular
locks, allocating memory, or calling errstart while holding a spin
lock. This patch detects such misuse automatically in a
USE_ASSERT_CHECKING build.

CHECK_FOR_INTERRUPTS should be avoided as well when holding a spin lock.
Depending on which signals are left to handle, PG may raise ERROR/FATAL,
which makes the code jump somewhere else from which the spin lock can
hardly be released anyway.
---
 src/backend/storage/buffer/bufmgr.c | 24 +++
 src/backend/storage/lmgr/lock.c |  6 +++
 src/backend/storage/lmgr/lwlock.c   | 21 +++---
 src/backend/storage/lmgr/s_lock.c   | 63 -
 src/backend/tcop/postgres.c |  6 +++
 src/backend/utils/error/elog.c  | 10 +
 src/backend/utils/mmgr/mcxt.c   | 16 
 src/include/miscadmin.h | 21 +-
 src/include/storage/buf_internals.h |  1 +
 src/include/storage/s_lock.h| 56 ++---
 src/tools/pgindent/typedefs.list|  2 +-
 11 files changed, 176 insertions(+), 50 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6d..739a94209b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5402,12 +5402,11 @@ rlocator_comparator(const void *p1, const void *p2)
 uint32
 LockBufHdr(BufferDesc *desc)
 {
-	SpinDelayStatus delayStatus;
 	uint32		old_bu

Re: Shared detoast Datum proposal

2024-01-23 Thread Andy Fan

Michael Zhilin  writes:

> Hi Andy,
>
> It looks like v5 is missing in your mail. Could you please check and resend 
> it?

ha, yes.. v5 is really attached this time.

commit eee0b2058f912d0d56282711c5d88bc0b1b75c2f (HEAD -> 
shared_detoast_value_v3)
Author: yizhi.fzh 
Date:   Tue Jan 23 13:38:34 2024 +0800

shared detoast feature.

details at https://postgr.es/m/87il4jrk1l.fsf%40163.com

commit eeca405f5ae87e7d4e5496de989ac7b5173bcaa9
Author: yizhi.fzh 
Date:   Mon Jan 22 15:48:33 2024 +0800

Introduce a Bitset data struct.

While Bitmapset is designed for a variable-length set of bits, Bitset is
designed for a fixed-length one: the fixed length must be specified at
bitset_init time and stays unchanged over the whole lifespan. Because of
this, some operations on Bitset are simpler than on Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it is
the main reason to add this data struct.

Also, for performance, the Bitset functions removed some unlikely
checks, replacing them with Asserts.

[1] 
https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com


As for the commit "Introduce a Bitset data struct.", the test coverage
is 100% now, so it would be great if people could review that one first.
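
For readers skimming the thread, here is a user-space toy sketch of the
clear-keeps-memory idea from the commit message above (hypothetical names
and layout, not the patch's actual Bitset):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Fixed-length bitset: the number of words is fixed at init time and
 * never changes afterwards. */
typedef struct Bitset
{
	int			nwords;			/* fixed at init */
	uint64_t	words[];		/* flexible array member */
} Bitset;

static Bitset *
bitset_init(int nbits)
{
	int		nwords = (nbits + 63) / 64;
	Bitset *bs = calloc(1, sizeof(Bitset) + nwords * sizeof(uint64_t));

	bs->nwords = nwords;
	return bs;
}

static void
bitset_add_member(Bitset *bs, int bit)
{
	bs->words[bit / 64] |= UINT64_C(1) << (bit % 64);
}

static bool
bitset_is_member(const Bitset *bs, int bit)
{
	return (bs->words[bit / 64] >> (bit % 64)) & 1;
}

/* Unset every bit but keep the allocation, so the set can be reused
 * (e.g. per tuple) without alloc/free churn. */
static void
bitset_clear(Bitset *bs)
{
	memset(bs->words, 0, bs->nwords * sizeof(uint64_t));
}
```

Because the length is fixed, clear is a single memset with no
reallocation, which is exactly what Bitmapset cannot promise.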

-- 
Best Regards
Andy Fan

>From eeca405f5ae87e7d4e5496de989ac7b5173bcaa9 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Mon, 22 Jan 2024 15:48:33 +0800
Subject: [PATCH v5 1/2] Introduce a Bitset data struct.

While Bitmapset is designed for a variable-length set of bits, Bitset is
designed for a fixed-length one: the fixed length must be specified at
bitset_init time and stays unchanged over the whole lifespan. Because of
this, some operations on Bitset are simpler than on Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it is
the main reason to add this data struct.

Also, for performance, the Bitset functions removed some unlikely
checks, replacing them with Asserts.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c | 201 +-
 src/backend/nodes/outfuncs.c  |  51 +
 src/include/nodes/bitmapset.h |  28 +++
 src/include/nodes/nodes.h |   4 +
 src/test/modules/test_misc/Makefile   |  11 +
 src/test/modules/test_misc/README |   4 +-
 .../test_misc/expected/test_bitset.out|   7 +
 src/test/modules/test_misc/meson.build|  17 ++
 .../modules/test_misc/sql/test_bitset.sql |   3 +
 src/test/modules/test_misc/test_misc--1.0.sql |   5 +
 src/test/modules/test_misc/test_misc.c| 132 
 src/test/modules/test_misc/test_misc.control  |   4 +
 src/tools/pgindent/typedefs.list  |   1 +
 13 files changed, 456 insertions(+), 12 deletions(-)
 create mode 100644 src/test/modules/test_misc/expected/test_bitset.out
 create mode 100644 src/test/modules/test_misc/sql/test_bitset.sql
 create mode 100644 src/test/modules/test_misc/test_misc--1.0.sql
 create mode 100644 src/test/modules/test_misc/test_misc.c
 create mode 100644 src/test/modules/test_misc/test_misc.control

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 65805d4527..eb38834e21 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1315,23 +1315,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
 
-	Assert(bms_is_valid_set(a));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1348,9 +1343,23 @@ bms_next_member(const Bitmapset *a, int prevbit)
 		/* in subsequent words, consider all bits */
 		mask = (~(bitmapword) 0);
 	}
+
 	return -2;
 }
 
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	Assert(bms_is_valid_set(a));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
  *
@@ -1458,3 +14

Re: Shared detoast Datum proposal

2024-01-22 Thread Andy Fan


Hi,

Peter Smith  writes:

> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there were CFbot test failures last time it was run [2]. Please have a
> look and post an updated version if necessary.
>
> ==
> [1] https://commitfest.postgresql.org/46/4759/
> [2] 
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

v5 attached; it should fix the above issue.  This version also introduces
a data struct called Bitset, which has APIs similar to Bitmapset but has
the ability to reset all bits without recycling its allocated memory;
this is important for this feature.

commit 44754fb03accb0dec9710a962a334ee73eba3c49 (HEAD -> 
shared_detoast_value_v2)
Author: yizhi.fzh 
Date:   Tue Jan 23 13:38:34 2024 +0800

shared detoast feature.

commit 14a6eafef9ff4926b8b877d694de476657deee8a
Author: yizhi.fzh 
Date:   Mon Jan 22 15:48:33 2024 +0800

Introduce a Bitset data struct.

While Bitmapset is designed for a variable-length set of bits, Bitset is
designed for a fixed-length one: the fixed length must be specified at
bitset_init time and stays unchanged over the whole lifespan. Because of
this, some operations on Bitset are simpler than on Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it is
the main reason to add this data struct.

Also, for performance, the Bitset functions removed some unlikely
checks, replacing them with Asserts.

[1] 
https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com


I didn't write a good commit message for commit 2; people who are
interested in it can see the first message in this thread for an
explanation.

I think anyone whose customers use lots of jsonb can probably get
benefits from this. The precondition is that the toast value is
accessed more than once, including by the jsonb_out function. I think
that would not be rare.

-- 
Best Regards
Andy Fan





Re: the s_lock_stuck on perform_spin_delay

2024-01-22 Thread Andy Fan


Robert Haas  writes:

> On Mon, Jan 22, 2024 at 12:13 PM Andy Fan  wrote:
>> > On Mon, Jan 22, 2024 at 11:58 AM Andy Fan  wrote:
>> >> I get your point! Acquiring an already held spinlock in quickdie is
>> >> unlikely to happen, but since our existing infrastructure can handle it,
>> >> then there is no reason to bypass it.
>> >
>> > No, the existing infrastructure cannot handle that at all.
>>
>> Actually I mean we can handle it without 0003. am I still wrong?
>> Without the 0003, if we acquiring the spin lock which is held by
>> ourself already. VerifyNoSpinLocksHeld in SpinLockAcquire should catch
>> it.
>
> But that's only going to run in assert-only builds. The whole point of
> the patch set is to tell developers that there are bugs in the code
> that need fixing, not to catch problems that actually occur in
> production.

I see. As to this aspect, then yes.

-- 
Best Regards
Andy Fan





Re: the s_lock_stuck on perform_spin_delay

2024-01-22 Thread Andy Fan


Robert Haas  writes:

> On Mon, Jan 22, 2024 at 11:58 AM Andy Fan  wrote:
>> I get your point! Acquiring an already held spinlock in quickdie is
>> unlikely to happen, but since our existing infrastructure can handle it,
>> then there is no reason to bypass it.
>
> No, the existing infrastructure cannot handle that at all.

Actually I mean we can handle it without 0003; am I still wrong?
Without 0003, if we acquire a spin lock that is already held by
ourselves, VerifyNoSpinLocksHeld in SpinLockAcquire should catch
it.

-- 
Best Regards
Andy Fan




