Re: Skipping logical replication transactions on subscriber side

2021-05-28 Thread Amit Kapila
On Sat, May 29, 2021 at 8:27 AM Masahiko Sawada  wrote:
>
> On Thu, May 27, 2021 at 7:04 PM Amit Kapila  wrote:
> >
> > On Thu, May 27, 2021 at 1:46 PM Masahiko Sawada  
> > wrote:
> > >
> > > 1. the worker records the XID and commit LSN of the failed transaction
> > > to a catalog.
> > >
> >
> > When will you record this info? I am not sure if we can try to update
> > this when an error has occurred. We can think of using try..catch in
> > apply worker and then record it in catch on error but would that be
> > advisable? One random thought that occurred to me is to that apply
> > worker notifies such information to the launcher (or maybe another
> > process) which will log this information.
>
> Yeah, I was concerned about that too and had the same idea. The
> information still could not be written if the server crashes before
> the launcher writes it. But I think that's acceptable.
>

True, because even if the launcher restarts, the apply worker will
error out again and resend the information. I guess we can have an
error queue where apply workers can add their information and the
launcher will then process those.  If we do that, then we probably need
to define what we want to do if the queue gets full: either the apply
worker nudges the launcher and waits, or it just throws an error and
continues. If you have any better ideas for sharing this information
then we can consider those as well.

> >
> > > 2. the user specifies how to resolve that conflict transaction
> > > (currently only 'skip' is supported) and writes to the catalog.
> > > 3. the worker does the resolution method according to the catalog. If
> > > the worker didn't start to apply those changes, it can skip the entire
> > > transaction. If it did, it rolls back the transaction and ignores the
> > > remaining changes.
> > >
> > > The worker needs neither to reset information of the last failed
> > > transaction nor to mark the conflicted transaction as resolved. The
> > > worker will ignore that information when checking the catalog if the
> > > commit LSN is passed.
> > >
> >
> > So won't this require us to check the required info in the catalog
> > before applying each transaction? If so, that might be overhead, maybe
> > we can build some cache of the highest commitLSN that can be consulted
> > rather than the catalog table.
>
> I think workers can cache that information when they start, and
> invalidate and reload the cache when the catalog gets updated.
> Specifying an XID to skip will update the catalog, invalidating the cache.
>
> > I think we need to think about when to
> > remove rows for which conflict has been resolved as we can't let that
> > information grow infinitely.
>
> I guess we can update catalog tuples in place when another conflict
> happens next time. The catalog tuple should be of fixed size. An
> already-resolved conflict will have a commit LSN older than its
> replication origin's LSN.
>

Okay, but I have a slight concern that we will keep an xid in the
system which might no longer be valid. So, we will keep this info
about subscribers around till one performs DROP SUBSCRIPTION;
hopefully, that doesn't lead to too many rows. This will be okay as
per the current design, but say tomorrow we decide to parallelize the
apply for a subscription, then there could be multiple errors
corresponding to a subscription, and in that case such a design might
appear quite limiting. One possibility could be that when the launcher
is periodically checking for new error messages, it can clean up the
conflicts catalog as well, or maybe autovacuum does this periodically
as it does for stats (via pgstat_vacuum_stat).

-- 
With Regards,
Amit Kapila.




Re: Decoding speculative insert with toast leaks memory

2021-05-28 Thread Amit Kapila
On Fri, May 28, 2021 at 5:16 PM Tomas Vondra
 wrote:
>
> I wonder if there's a way to free the TOASTed data earlier, instead of
> waiting until the end of the transaction (as this patch does).
>

IIUC we are anyway freeing the toasted data at the next
insert/update/delete. We can try to free it at other change message
types like REORDER_BUFFER_CHANGE_MESSAGE but, as you said, that may
make the patch more complex, so it seems better to do the fix along
the lines of what is proposed in the patch.

-- 
With Regards,
Amit Kapila.




Re: Parallel Inserts in CREATE TABLE AS

2021-05-28 Thread Amit Kapila
On Fri, May 28, 2021 at 8:53 AM Amit Kapila  wrote:
>
> On Thu, May 27, 2021 at 7:37 PM Bharath Rupireddy
> >
> > I captured below information with the attached patch
> > 0001-test-times-and-block-counts.patch applied on top of CTAS v23
> > patch set. Testing details are attached in the file named "test".
> > Total time spent in LockRelationForExtension
> > Total time spent in GetPageWithFreeSpace
> > Total time spent in RelationAddExtraBlocks
> > Total number of times extended the relation in bulk
> > Total number of times extended the relation by one block
> > Total number of blocks added in bulk extension
> > Total number of times getting the page from FSM
> >
>
> In your results, the number of pages each process is getting from the FSM
> does not match the number of blocks added. I think we need to
> increment 'fsm_hit_count' in RecordAndGetPageWithFreeSpace as well
> because that is also called and the process can get a free page via
> it. The other thing to check via a debugger is whether, when one worker
> adds the blocks in bulk, another parallel worker gets all those
> blocks. You can achieve that by allowing one worker (say worker-1) to
> extend the relation in bulk, then letting it wait and allowing another
> worker (say worker-2) to proceed, and seeing if it gets all the pages
> added by worker-1 from the FSM. You need to keep the leader waiting as
> well, or have it not perform any operation.
>

While looking at the results, I have observed one more thing: we are
trying to parallelize I/O, due to which we might not be seeing a benefit
in such cases. I think even for non-write queries there won't be any
(much) benefit if we can't parallelize CPU usage. Basically, the test
you are doing is for the statement: explain analyze verbose create table
test as select * from tenk1;. Now, in this statement, there is no
qualification and still the Gather node is generated for it; this
won't be the case if we check "select * from tenk1". Is it due to the
reason that the patch completely ignores parallel_tuple_cost? But
still, it should prefer a serial plan due to parallel_setup_cost, why is
that not happening? Anyway, I think we should not parallelize such
queries where we can't parallelize CPU usage. Have you tried the cases
without changing any of the cost settings for parallelism?
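
For reference, the comparison in question looks like this (statement
text as above; resetting the cost GUCs first is an assumption about how
to test with the unmodified costings):

RESET parallel_setup_cost;
RESET parallel_tuple_cost;

-- the write query, which currently gets a Gather node:
EXPLAIN (ANALYZE, VERBOSE) CREATE TABLE test AS SELECT * FROM tenk1;

-- the equivalent read-only query, which should stay serial:
EXPLAIN (ANALYZE, VERBOSE) SELECT * FROM tenk1;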

-- 
With Regards,
Amit Kapila.




Regarding the necessity of RelationGetNumberOfBlocks for every rescan / bitmap heap scan.

2021-05-28 Thread Andy Fan
Hi:

I'm always confused about the following code.

static void
initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
{
    ParallelBlockTableScanDesc bpscan = NULL;
    bool        allow_strat;
    bool        allow_sync;

    /*
     * Determine the number of blocks we have to scan.
     *
     * It is sufficient to do this once at scan start, since any tuples added
     * while the scan is in progress will be invisible to my snapshot anyway.
     * (That is not true when using a non-MVCC snapshot.  However, we couldn't
     * guarantee to return tuples added after scan start anyway, since they
     * might go into pages we already scanned.  To guarantee consistent
     * results for a non-MVCC snapshot, the caller must hold some higher-level
     * lock that ensures the interesting tuple(s) won't change.)
     */
    if (scan->rs_base.rs_parallel != NULL)
    {
        bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
        scan->rs_nblocks = bpscan->phs_nblocks;
    }
    else
        scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_base.rs_rd);

    ...
}

1. Why do we need scan->rs_nblocks =
   RelationGetNumberOfBlocks(scan->rs_base.rs_rd) for every rescan? That
   looks mismatched with the comment in the code, and the comment looks
   reasonable to me.
2. For the heap scan after an IndexScan, we don't need to know the heap
   size, so why do we need to get the nblocks for a bitmap heap scan? I
   think the similarity between the two is that both of them can get a
   "valid" CTID/page number from the index scan.  To be clearer, I think
   for a bitmap heap scan we don't even need to check
   RelationGetNumberOfBlocks in initscan.
3. If we need to check nblocks every time, why doesn't a parallel scan
   change it every time?

Shall we remove the RelationGetNumberOfBlocks call for bitmap heap scans
entirely, and the one in the rescan path for a normal heap scan?

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


Re: Skipping logical replication transactions on subscriber side

2021-05-28 Thread Masahiko Sawada
On Thu, May 27, 2021 at 7:04 PM Amit Kapila  wrote:
>
> On Thu, May 27, 2021 at 1:46 PM Masahiko Sawada  wrote:
> >
> > On Thu, May 27, 2021 at 2:48 PM Amit Kapila  wrote:
> > >
> > > Okay, that makes sense but still not sure how will you identify if we
> > > need to reset XID in case of failure doing that in the previous
> > > attempt.
> >
> > It's just an idea, but we can record the failed transaction with its XID
> > as well as the commit LSN passed? The sequence I'm thinking of is:
> >
> > 1. the worker records the XID and commit LSN of the failed transaction
> > to a catalog.
> >
>
> When will you record this info? I am not sure if we can try to update
> this when an error has occurred. We can think of using try..catch in
> apply worker and then record it in catch on error but would that be
> advisable? One random thought that occurred to me is to that apply
> worker notifies such information to the launcher (or maybe another
> process) which will log this information.

Yeah, I was concerned about that too and had the same idea. The
information still could not be written if the server crashes before
the launcher writes it. But I think that's acceptable.

>
> > 2. the user specifies how to resolve that conflict transaction
> > (currently only 'skip' is supported) and writes to the catalog.
> > 3. the worker does the resolution method according to the catalog. If
> > the worker didn't start to apply those changes, it can skip the entire
> > transaction. If it did, it rolls back the transaction and ignores the
> > remaining changes.
> >
> > The worker needs neither to reset information of the last failed
> > transaction nor to mark the conflicted transaction as resolved. The
> > worker will ignore that information when checking the catalog if the
> > commit LSN is passed.
> >
>
> So won't this require us to check the required info in the catalog
> before applying each transaction? If so, that might be overhead, maybe
> we can build some cache of the highest commitLSN that can be consulted
> rather than the catalog table.

I think workers can cache that information when they start, and
invalidate and reload the cache when the catalog gets updated.
Specifying an XID to skip will update the catalog, invalidating the cache.

> I think we need to think about when to
> remove rows for which conflict has been resolved as we can't let that
> information grow infinitely.

I guess we can update catalog tuples in place when another conflict
happens next time. The catalog tuple should be of fixed size. An
already-resolved conflict will have a commit LSN older than its
replication origin's LSN.
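
To illustrate that last point, a rough sketch (the conflict catalog
and its columns are hypothetical; pg_replication_origin_status is the
existing view):

-- entries whose commit LSN the origin has not yet passed are still
-- pending; anything older can be treated as already resolved:
SELECT c.failed_xid, c.commit_lsn
FROM pg_subscription_conflict c            -- hypothetical catalog
JOIN pg_replication_origin_status s ON s.local_id = c.origin_id
WHERE c.commit_lsn > s.remote_lsn;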


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Skipping logical replication transactions on subscriber side

2021-05-28 Thread Masahiko Sawada
On Thu, May 27, 2021 at 7:26 PM Amit Kapila  wrote:
>
> On Thu, May 27, 2021 at 12:01 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, May 26, 2021 at 6:11 PM Amit Kapila  wrote:
> > >
> > > I agree with you that specifying XID could be easier and
> > > understandable for users. I was thinking and studying a bit about what
> > > other systems do in this regard. Why don't we try to provide conflict
> > > resolution methods for users? The idea could be that either the
> > > conflicts can be resolved automatically or manually. In the case of
> > > manual resolution, users can use the existing methods or the XID stuff
> > > you are proposing here and in case of automatic resolution, the
> > > in-built or corresponding user-defined functions will be invoked for
> > > conflict resolution. There are more details to figure out in the
> > > automatic resolution scheme but I see a lot of value in doing the
> > > same.
> >
> > Yeah, I also see a lot of value in automatic conflict resolution. But
> > maybe we can have both ways? For example, in case where the user wants
> > to resolve conflicts in different ways or a conflict that cannot be
> > resolved by automatic resolution (not sure there is in practice
> > though), the manual resolution would also have value.
> >
>
> Right, that is exactly what I was saying. So, even if both can be done
> as separate patches, we should try to design the manual resolution in
> a way that can be extended for an automatic resolution system. I think
> we can try to have some initial idea/design/POC for an automatic
> resolution as well to ensure that the manual resolution scheme can be
> further extended.

Totally agreed.

But perhaps we might want to note that the conflict resolution we're
talking about resolves conflicts at the row or column level. It
doesn't necessarily raise an ERROR, and the granularity of resolution
is per record or column. For example, if a DELETE and an UPDATE
process the same tuple (searched by PK), the UPDATE may not find the
tuple and be ignored because the tuple has already been deleted. In
this case, no ERROR will occur (i.e. the UPDATE will be ignored), but the
user may want a different conflict resolution. On the other hand,
the feature proposed here assumes that an error has already occurred
and logical replication has already been stopped, and resolves it by
skipping the entire transaction.
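
As a concrete sketch of that DELETE/UPDATE case (table and values are
made up, not taken from an actual test):

-- table replicated from publisher to subscriber:
--   CREATE TABLE t (id int PRIMARY KEY, val text);

-- on the subscriber, the row is deleted locally:
DELETE FROM t WHERE id = 1;

-- later, this UPDATE arrives from the publisher; the apply worker
-- looks the row up by its PK, finds nothing, and silently skips it:
UPDATE t SET val = 'new' WHERE id = 1;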

IIUC the conflict resolution can be thought of as a combination of
types of conflicts and the resolutions that can be applied to them. For
example, if there is a conflict between INSERT and INSERT and the
latter INSERT violates the unique constraint, an ERROR is raised. So
the DBA can resolve it manually. But there is another way: automatically
resolve it by selecting the tuple having a newer timestamp. On the
other hand, in the DELETE and UPDATE conflict described above, it's
possible to automatically ignore the fact that the UPDATE could not
update the tuple. Or we can even generate an ERROR so that the DBA can
resolve it manually. The DBA can manually resolve the conflict in
various ways: skipping the entire transaction from the origin, choosing
the tuple having a newer/older timestamp, etc.

In that sense, we can think of the feature proposed here as a feature
that provides a way to resolve the conflict that would originally
cause an ERROR by skipping the entire transaction. If we add a
solution that raises an ERROR for conflicts that don't originally
raise an ERROR (like DELETE and UPDATE conflict) in the future, we
will be able to manually skip each transaction for all types of
conflicts.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




ANALYZE's dead tuple accounting can get confused

2021-05-28 Thread Peter Geoghegan
The accounting used by ANALYZE to count dead tuples in
acquire_sample_rows() (actually in heapam_scan_analyze_next_tuple()
these days) makes some dubious assumptions about how it should count
dead tuples. This is something that I discussed with Masahiko in the
context of our Postgres 14 work on VACUUM, which ultimately led to
better documentation of the issues (see commit 7136bf34). But I want
to talk about it again now. This is not a new issue.

The ANALYZE dead tuple accounting takes a 100% quantitative approach
-- it is entirely unconcerned about qualitative distinctions about the
number of dead tuples per logical row. Sometimes that doesn't matter,
but there are many important cases where it clearly is important. I'll
show one such case now. This is a case where the system frequently
launches autovacuum workers that really never manage to do truly
useful work:

$ pgbench -i -s 50 -F 80
  ...
$ pgbench -s 50 -j 4 -c 32 -M prepared -T 300 --rate=15000
 ...

I've made the heap fill factor 80 (with -F). I've also set both
autovacuum_vacuum_scale_factor and autovacuum_analyze_scale_factor to
0.02 here, which is aggressive but still basically reasonable. I've
enabled autovacuum logging so we can see exactly what's going on with
autovacuum when pgbench runs -- that's the interesting part.
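
For reference, a sketch of those non-default settings (the scale factor
values are as stated above; using ALTER SYSTEM and
log_autovacuum_min_duration = 0 for the logging part is my assumption
about how they were applied):

ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.02;
ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.02;
ALTER SYSTEM SET log_autovacuum_min_duration = 0;  -- log every autovacuum run
SELECT pg_reload_conf();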

The log output shows that an autovacuum worker was launched and ran
VACUUM against pgbench_accounts on 11 separate occasions during the 5
minute pgbench benchmark. All 11 autovacuum log reports show the
details are virtually the same in each case. Here is the 11th and
final output concerning the accounts table (I could have used any of
the other 10 just as easily):

p 593300/2021-05-28 16:16:47 PDT LOG:  automatic vacuum of table
"regression.public.pgbench_accounts": index scans: 0
pages: 0 removed, 102041 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 100300 removed, 500 remain, 0 are dead but not yet
removable, oldest xmin: 7269905
buffer usage: 204166 hits, 0 misses, 3586 dirtied
index scan not needed: 0 pages from table (0.00% of total) had 0
dead item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 11.250 MB/s
I/O Timings:
system usage: CPU: user: 2.31 s, system: 0.02 s, elapsed: 2.49 s
WAL usage: 200471 records, 31163 full page images, 44115415 bytes

Notice that we have 0 LP_DEAD items left behind by pruning -- either
opportunistic pruning or pruning by VACUUM. Pruning by VACUUM inside
lazy_scan_prune() does "remove 100300 dead tuples", so arguably VACUUM
does some useful work. Though I would argue that we don't -- I think
that this is a total waste of cycles. This particular quantitative
measure has little to do with anything that matters to the workload.
This workload shouldn't ever need to VACUUM the accounts table (except
when the time comes to freeze its tuples) -- the backends can clean up
after themselves opportunistically, without ever faltering (i.e.
without ever failing to keep a HOT chain on the same page).

The picture we see here seems contradictory, even if you think about
the problem in exactly the same way as vacuumlazy.c thinks about the
problem. On the one hand autovacuum workers are launched because
opportunistic cleanup techniques (mainly opportunistic heap page
pruning) don't seem to be able to keep up with the workload. On the
other hand, when VACUUM actually runs we consistently see 0 LP_DEAD
stub items in heap pages, which is generally an excellent indicator
that opportunistic HOT pruning is in fact working perfectly. Only one
of those statements can be correct.

The absurdity of autovacuum's behavior with this workload becomes
undeniable once you tweak just one detail and see what changes. For
example, I find that if I repeat the same process but increase
autovacuum_vacuum_scale_factor from 0.02 to 0.05, everything changes.
Instead of getting 11 autovacuum runs against pgbench_accounts I get 0
autovacuum runs! This effect is stable, and won't change if the
workload runs for more than 5 minutes. Apparently vacuuming less
aggressively results in less need for vacuuming!
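
One simple way to see the flip (not something from the test itself,
just a check against the standard statistics view) is to compare the
table's cumulative counters after each run:

-- autovacuum_count ends up around 11 with scale_factor = 0.02
-- and at 0 with scale_factor = 0.05:
SELECT relname, n_live_tup, n_dead_tup, autovacuum_count, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'pgbench_accounts';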

I believe that there is a sharp discontinuity here -- a crossover
point for autovacuum_vacuum_scale_factor at which the behavior of the
system *flips*, from very *frequent* autovacuum runs against the
accounts table, to *zero* runs. This seems like a real problem to me.
I bet it has real consequences that are hard to model. In any case
this simple model seems convincing enough. The dead tuple accounting
makes it much harder to set autovacuum_vacuum_scale_factor very
aggressively (say 0.02 or so) -- nobody is going to want to do that as
long as it makes the system launch useless autovacuum workers that
never end up doing useful work in a subset of tables. Users are
currently missing out on the benefit of very aggressive autovacuums
against tables where it truly makes sense.

The code in acquire_sample_rows()/heapam_scan_analyze_next_tuple()
counts tuples/line pointers o

Re: Degression (PG10 > 11, 12 or 13)

2021-05-28 Thread Johannes Graën
On 28/05/2021 18.24, Tom Lane wrote:
> Pavel Stehule  writes:
>> pá 28. 5. 2021 v 16:12 odesílatel Johannes Graën 
>> napsal:
>>> When trying to upgrade an existing database from version 10 to 13 I came
>>> across a degression in some existing code used by clients. Further
>>> investigations showed that performance measures are similar in versions
>>> 11 to 13, while in the original database on version 10 it's around 100
>>> times faster. I could boil it down to perl functions used for sorting.
> 
>> Are you sure, so all databases use the same encoding and same locale?
> 
> Yeah ... I don't know too much about the performance of Perl regexps,
> but it'd be plausible that it varies depending on locale setting.

It probably wasn't Perl at all. Thanks to the hint I checked the initial
database again and, while encoding and ctype are set to UTF8, the
collation is C, which makes a huge difference:

... order by tab(attr) => Execution Time: 51429.875 ms
... order by tab(attr collate "C") => Execution Time: 537.757 ms

in the original database. Any other version yields similar times.


On 28/05/2021 17.47, Tomas Vondra wrote:
> That function is pretty much just a sequence of ~120 regular
> expressions, doing something similar to unaccent(). I wonder if we're
> calling the function much more often, perhaps due to some changes in the
> sort code (the function is immutable, but that does not guarantee it's
> called just once).

> Also, maybe try materializing the function results before doing the
> sort, perhaps like this:
>
> SELECT * FROM (select attr, func(attr) as fattr from tab offset 0) foo
> ORDER BY fattr;

I was expecting it to be called once in the process of sorting, and it
seems that this is actually true for all versions and different
collations, but sorting with a collation that is not C requires
considerably more resources (that still needs to be shown for other
collations, but I see the overhead of having more or less complex
definitions vs. just comparing numbers).

That being said, I would have used unaccent or, if that wasn't an
option, maybe have those values calculated by a trigger function when
the corresponding rows are changed. But I don't control the code.

Now what keeps me wondering is how the sorting works internally and if
we could conclude that using the C collation in order expressions and
indexes is a general way to speed up queries - if the actual order is of
less importance.
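
If that conclusion holds, a sketch of what it would look like in
practice, using the names from the test case above:

-- sort with the C collation explicitly:
SELECT attr FROM tab ORDER BY func(attr) COLLATE "C";

-- or build the expression index with it, so the planner can use an
-- index scan to provide the order:
CREATE INDEX ON tab ((func(attr)) COLLATE "C");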

Best
  Johannes




Re: Add option --drop-cascade for pg_dump/restore

2021-05-28 Thread Greg Sabino Mullane
Overall the patch looks good, but I did notice a few small things:

1. In pg_dumpall.c, in the section /* Add long options to the pg_dump
argument list */, we are now passing along the --drop-cascade option.
However, --clean is not passed in, so any call to pg_dumpall using
--drop-cascade fails at the pg_dump step. You'll note that --if-exists
is not passed along either; because we are dropping the whole database,
we don't need to have pg_dump worry about dropping objects at all. So I
think that --drop-cascade should NOT be passed along from pg_dumpall to
pg_dump.

2. I'm not even sure if --drop-cascade makes sense for pg_dumpall, as you 
cannot cascade global things like databases and roles.

3. In the file pg_backup_archiver.c, the patch does a
stmtEnd = strstr(mark + strlen(buffer), ";");
and then spits out things "past" the semicolon as the final %s in the
appendPQExpBuffer line. I'm not clear why: are we expecting more things
to appear after the semicolon? Why not just append a "\n" manually as
part of the previous %s?

Cheers,
Greg

The new status of this patch is: Waiting on Author


Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)

2021-05-28 Thread Mark Dilger



> On May 27, 2021, at 11:06 PM, Noah Misch  wrote:
> 
> On Tue, May 25, 2021 at 01:33:54PM -0700, Mark Dilger wrote:
>> v3-0001 adds a new pg_logical_replication role with permission to manage 
>> publications and subscriptions.
> 
>> v3-0004 adds a new pg_database_security role with permission to perform many
>> actions that would otherwise require superuser, so long as those actions do
>> not compromise the security of the host or network.  This role, along with
>> pg_logical_replication, is intended to be safe to delegate to the tenant of
>> a database provided as a service.
> 
> pg_logical_replication would not be safe to delegate that way:
> https://postgr.es/m/flat/CACqFVBbx6PDq%2B%3DvHM0n78kHzn8tvOM-kGO_2q_q0zNAMT%2BTzdA%40mail.gmail.com

Oh, I agree that this patch set does not go the extra step to make it safe.  
You are quite right to push back, as my email was poorly worded.  I should have 
said "intended to be eventually made safe to delegate". The idea is that the 
patch set addresses most places in the sources where we test for superuser and 
tests instead for (superuser || ), and then uses that same set of 
roles to control who has sufficient privileges to set GUCs.  The 
pg_host_security and pg_network_security roles are not intended to eventually 
be safe to delegate.  Or at least, I can't see any clear path to getting there. 
 The pg_database_security and pg_logical_replication roles should be ones we 
can make safe.  If we can agree as a community which set of roles is 
appropriate, then we can have separate patches as needed for tightening the 
security around them.
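
For what it's worth, delegation with the proposed roles would
presumably be an ordinary GRANT (the tenant role name here is
hypothetical):

GRANT pg_database_security TO tenant_role;
GRANT pg_logical_replication TO tenant_role;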

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: terminate called after throwing an instance of 'std::bad_alloc'

2021-05-28 Thread Justin Pryzby
There's a memory leak affecting JIT expressions, even when inlining,
optimization, and tuple deforming are disabled.

The server that got OOM after installing PG13.3 (because your patch wasn't
applied) still gets OOM even with inline_above_cost=-1, optimize_above_cost=-1,
and deforming=off.

Actually I think this is an even more severe leak, even though it seems to be
infrequent.  The inlining leak is slow enough that I don't think that it,
alone, would cause a crash.  It's possible there's a race condition or
something that makes this hard to notice or reproduce, or the inlining leak is
just easier to find.

I guess your patch avoids this issue, too, since it didn't OOM with your patch.

-- 
Justin




Re: Degression (PG10 > 11, 12 or 13)

2021-05-28 Thread Tom Lane
Pavel Stehule  writes:
> pá 28. 5. 2021 v 16:12 odesílatel Johannes Graën 
> napsal:
>> When trying to upgrade an existing database from version 10 to 13 I came
>> across a degression in some existing code used by clients. Further
>> investigations showed that performance measures are similar in versions
>> 11 to 13, while in the original database on version 10 it's around 100
>> times faster. I could boil it down to perl functions used for sorting.

> Are you sure, so all databases use the same encoding and same locale?

Yeah ... I don't know too much about the performance of Perl regexps,
but it'd be plausible that it varies depending on locale setting.

regards, tom lane




Re: Add ZSON extension to /contrib/

2021-05-28 Thread Tomas Vondra



On 5/28/21 4:22 PM, Andrew Dunstan wrote:
> 
> On 5/28/21 6:35 AM, Tomas Vondra wrote:
>>
>>>
>>> IMO the main benefit of having different dictionaries is that you
>>> could have a small dictionary for small and very structured JSONB
>>> fields (e.g. some time-series data), and a large one for large /
>>> unstructured JSONB fields, without having the significant performance
>>> impact of having that large and varied dictionary on the
>>> small&structured field. Although a binary search is log(n) and thus
>>> still quite cheap even for large dictionaries, the extra size is
>>> certainly not free, and you'll be touching more memory in the process.
>>>
>> I'm sure we can think of various other arguments for allowing separate
>> dictionaries. For example, what if you drop a column? With one huge
>> dictionary you're bound to keep the data forever. With per-column dicts
>> you can just drop the dict and free disk space / memory.
>>
>> I also find it hard to believe that no one needs 2**16 strings. I mean,
>> 65k is not that much, really. To give an example, I've been toying with
>> storing bitcoin blockchain in a database - one way to do that is storing
>> each block as a single JSONB document. But each "item" (eg. transaction)
>> is identified by a unique hash, so that means (tens of) thousands of
>> unique strings *per document*.
>>
>> Yes, it's a bit silly and extreme, and maybe the compression would not
>> help much in this case. But it shows that 2**16 is damn easy to hit.
>>
>> In other words, this seems like a nice example of survivor bias, where
>> we only look at cases for which the existing limitations are acceptable,
>> ignoring the (many) remaining cases eliminated by those limitations.
>>
>>
> 
> I don't think we should lightly discard the use of 2 byte keys though.
> Maybe we could use a scheme similar to what we use for text lengths,
> where the first bit indicates whether we have a 1 byte or 4 byte length
> indicator. Many dictionaries will have fewer than 2^15-1 entries, so they
> would use exclusively the smaller keys.
> 

I didn't mean to discard that, of course. I'm sure a lot of data sets
may be perfectly fine with 64k keys, of course, and it may be worth
optimizing that as a special case. All I'm saying is that if we start
from the position that this limit is perfectly fine and no one is going
to hit it in practice, it may be due to people not even trying it on
documents with more keys.

That being said, I still don't think the 1MB vs. 1.7MB figure is
particularly meaningful, because it's for an "empty" dictionary, which is
something you'll not have in practice. And once you start adding keys,
the difference will get less and less significant.

However, if we care about efficiency for "small" JSON documents, it's
probably worth using something like varint [1], which is 1-4B depending
on the value.

[1] https://learnmeabitcoin.com/technical/varint


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Degression (PG10 > 11, 12 or 13)

2021-05-28 Thread Pavel Stehule
Hi

pá 28. 5. 2021 v 16:12 odesílatel Johannes Graën 
napsal:

> Hi,
>
> When trying to upgrade an existing database from version 10 to 13 I came
> across a degression in some existing code used by clients. Further
> investigations showed that performance measures are similar in versions
> 11 to 13, while in the original database on version 10 it's around 100
> times faster. I could boil it down to perl functions used for sorting.
>
> From the real data that I don't own, I created a test case that is
> sufficient to observe the degression: http://ix.io/3o7f
>
>
> These are the numbers on PG 10:
>
> > test=# explain (analyze, verbose, buffers)
> > select attr from tab order by func(attr);
> >   QUERY PLAN
> >
> --
> >  Sort  (cost=3269.68..3294.36 rows=9869 width=40) (actual
> time=179.374..180.558 rows=9869 loops=1)
> >Output: attr, (func(attr))
> >Sort Key: (func(tab.attr))
> >Sort Method: quicksort  Memory: 1436kB
> >Buffers: shared hit=49
> >->  Seq Scan on public.tab  (cost=0.00..2614.94 rows=9869 width=40)
> (actual time=2.293..169.060 rows=9869 loops=1)
> >  Output: attr, func(attr)
> >  Buffers: shared hit=49
> >  Planning time: 0.318 ms
> >  Execution time: 182.061 ms
> > (10 rows)
> >
> > test=# explain (analyze, verbose, buffers)
> > select attr from tab;
> >  QUERY PLAN
> >
> 
> >  Seq Scan on public.tab  (cost=0.00..147.69 rows=9869 width=8) (actual
> time=0.045..3.975 rows=9869 loops=1)
> >Output: attr
> >Buffers: shared hit=49
> >  Planning time: 0.069 ms
> >  Execution time: 6.020 ms
> > (5 rows)
>
>
> And here we have PG 11:
>
> > test=# explain (analyze, verbose, buffers)
> > select attr from tab order by func(attr);
> >   QUERY PLAN
> >
> --
> >  Sort  (cost=3269.68..3294.36 rows=9869 width=40) (actual
> time=597.877..599.805 rows=9869 loops=1)
> >Output: attr, (func(attr))
> >Sort Key: (func(tab.attr))
> >Sort Method: quicksort  Memory: 1436kB
> >Buffers: shared hit=49
> >->  Seq Scan on public.tab  (cost=0.00..2614.94 rows=9869 width=40)
> (actual time=0.878..214.188 rows=9869 loops=1)
> >  Output: attr, func(attr)
> >  Buffers: shared hit=49
> >  Planning Time: 0.151 ms
> >  Execution Time: 601.767 ms
> > (10 rows)
> >
> > test=# explain (analyze, verbose, buffers)
> > select attr from tab;
> >  QUERY PLAN
> >
> 
> >  Seq Scan on public.tab  (cost=0.00..147.69 rows=9869 width=8) (actual
> time=0.033..1.628 rows=9869 loops=1)
> >Output: attr
> >Buffers: shared hit=49
> >  Planning Time: 0.043 ms
> >  Execution Time: 2.581 ms
> > (5 rows)
>
>
> In the real scenario it's 500ms vs. 50s. The reason is obviously the
> perl function used as sort key. All different versions have been tested
> with an unmodified config and one tuned with pgtune. Creating a
> functional index does not help in the original database as the planner
> doesn't use it, while it *is* used in the test case. But the question
> what causes that noticeable difference in performance is untouched by
> the fact that it could be circumvented in some cases.
>
> The perl version used is v5.24.1.
>

 I looked at the profile - Postgres 14

   5,67%  libperl.so.5.32.1 [.] Perl_utf8_length
   5,44%  libc-2.33.so  [.] __strcoll_l
   4,73%  libperl.so.5.32.1 [.] Perl_pp_subst
   4,33%  libperl.so.5.32.1 [.] Perl_re_intuit_start
   3,25%  libperl.so.5.32.1 [.] Perl_fbm_instr
   1,92%  libperl.so.5.32.1 [.] Perl_regexec_flags
   1,79%  libperl.so.5.32.1 [.] Perl_runops_standard
   1,16%  libperl.so.5.32.1 [.] Perl_pp_const
   0,97%  perf  [.] 0x002fcf93
   0,94%  libperl.so.5.32.1 [.] Perl_pp_nextstate
   0,68%  libperl.so.5.32.1 [.] Perl_do_trans
   0,54%  perf  [.] 0x003dd0c5

and Postgres - 10

   5,45%  libperl.so.5.32.1  [.] Perl_utf8_length
   4,78%  libc-2.33.so   [.] __strcoll_l
   4,15%  libperl.so.5.32.1  [.] Perl_re_intuit_start
   3,92%  libperl.so.5.32.1  [.] Perl_pp_subst
   2,99%  libperl.so.5.32.1  [.] Perl_fbm_instr
   1

Re: Degression (PG10 > 11, 12 or 13)

2021-05-28 Thread Tomas Vondra
On 5/28/21 4:12 PM, Johannes Graën wrote:
> Hi,
> 
> When trying to upgrade an existing database from version 10 to 13 I came
> across a degression in some existing code used by clients. Further
> investigations showed that performance measures are similar in versions
> 11 to 13, while in the original database on version 10 it's around 100
> times faster. I could boil it down to perl functions used for sorting.
> 
> From the real data that I don't own, I created a test case that is
> sufficient to observe the degression: http://ix.io/3o7f
> 

That function is pretty much just a sequence of ~120 regular
expressions, doing something similar to unaccent(). I wonder if we're
calling the function much more often, perhaps due to some changes in the
sort code (the function is immutable, but that does not guarantee it's
called just once).

It'd be interesting to see profiles from perf, both from 10 and 11.

Also, maybe try materializing the function results before doing the
sort, perhaps like this:

SELECT * FROM (select attr, func(attr) as fattr from tab offset 0) foo
ORDER BY fattr;

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: be-secure-gssapi.c and auth.c with setenv() not compatible on Windows

2021-05-28 Thread Tom Lane
Michael Paquier  writes:
> We can do two things here:
> 1) Switch be-secure-gssapi.c and auth.c to use putenv().
> 2) Backport into 12 and 13 the fallback implementation of setenv
> introduced in 7ca37fb, and keep be-secure-gssapi.c as they are now.

There's a lot of value in keeping the branches looking alike.
On the other hand, 7ca37fb hasn't survived contact with the
public yet, so I'm a bit nervous about it.

It's not clear to me how much of 7ca37fb you're envisioning
back-patching in (2).  I think it'd be best to back-patch
only the addition of pgwin32_setenv, and then let the gssapi
code use it.  In that way, if there's anything wrong with
pgwin32_setenv, we're only breaking code that never worked
on Windows before anyway.

regards, tom lane




Re: Command statistics system (cmdstats)

2021-05-28 Thread Mark Dilger



> On May 28, 2021, at 6:42 AM, Alvaro Herrera  wrote:
> 
 On Mar 4, 2020, at 7:43 PM, Mark Dilger  
 wrote:
> 
 as mentioned in [1], I have created an implementation of command
 counter statistics very similar in purpose to the one already
 pending in the commitfest going by the name "pg_stat_sql".  I don't
 really care if this implementation is seen as building on that one
 or as separate, but I was worried about hijacking that thread, so
 I'm starting this thread. There are elements of this patch that
 borrowed from that one, so if this is accepted, authorship should
 reflect that.
> 
> Hello, was this abandoned or is there another thread somewhere?  The
> patch in the email I'm replying to doesn't apply, but the conflicts
> don't look very bad.

I abandoned it for v14, choosing instead to work on amcheck features.  Would 
you like me to take this up again for v15?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Add ZSON extension to /contrib/

2021-05-28 Thread Andrew Dunstan


On 5/28/21 6:35 AM, Tomas Vondra wrote:
>
>>
>> IMO the main benefit of having different dictionaries is that you
>> could have a small dictionary for small and very structured JSONB
>> fields (e.g. some time-series data), and a large one for large /
>> unstructured JSONB fields, without having the significant performance
>> impact of having that large and varied dictionary on the
>> small&structured field. Although a binary search is log(n) and thus
>> still quite cheap even for large dictionaries, the extra size is
>> certainly not free, and you'll be touching more memory in the process.
>>
> I'm sure we can think of various other arguments for allowing separate
> dictionaries. For example, what if you drop a column? With one huge
> dictionary you're bound to keep the data forever. With per-column dicts
> you can just drop the dict and free disk space / memory.
>
> I also find it hard to believe that no one needs 2**16 strings. I mean,
> 65k is not that much, really. To give an example, I've been toying with
> storing bitcoin blockchain in a database - one way to do that is storing
> each block as a single JSONB document. But each "item" (eg. transaction)
> is identified by a unique hash, so that means (tens of) thousands of
> unique strings *per document*.
>
> Yes, it's a bit silly and extreme, and maybe the compression would not
> help much in this case. But it shows that 2**16 is damn easy to hit.
>
> In other words, this seems like a nice example of survivor bias, where
> we only look at cases for which the existing limitations are acceptable,
> ignoring the (many) remaining cases eliminated by those limitations.
>
>

I don't think we should lightly discard the use of 2 byte keys though.
Maybe we could use a scheme similar to what we use for text lengths,
where the first bit indicates whether we have a 1 byte or 4 byte length
indicator. Many dictionaries will have fewer than 2^15-1 entries, so they
would use exclusively the smaller keys.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Degression (PG10 > 11, 12 or 13)

2021-05-28 Thread Johannes Graën
Hi,

When trying to upgrade an existing database from version 10 to 13 I came
across a degression in some existing code used by clients. Further
investigations showed that performance measures are similar in versions
11 to 13, while in the original database on version 10 it's around 100
times faster. I could boil it down to perl functions used for sorting.

From the real data that I don't own, I created a test case that is
sufficient to observe the degression: http://ix.io/3o7f


These are the numbers on PG 10:

> test=# explain (analyze, verbose, buffers)
> select attr from tab order by func(attr);
>   QUERY PLAN
> --
>  Sort  (cost=3269.68..3294.36 rows=9869 width=40) (actual 
> time=179.374..180.558 rows=9869 loops=1)
>Output: attr, (func(attr))
>Sort Key: (func(tab.attr))
>Sort Method: quicksort  Memory: 1436kB
>Buffers: shared hit=49
>->  Seq Scan on public.tab  (cost=0.00..2614.94 rows=9869 width=40) 
> (actual time=2.293..169.060 rows=9869 loops=1)
>  Output: attr, func(attr)
>  Buffers: shared hit=49
>  Planning time: 0.318 ms
>  Execution time: 182.061 ms
> (10 rows)
> 
> test=# explain (analyze, verbose, buffers)
> select attr from tab;
>  QUERY PLAN
> 
>  Seq Scan on public.tab  (cost=0.00..147.69 rows=9869 width=8) (actual 
> time=0.045..3.975 rows=9869 loops=1)
>Output: attr
>Buffers: shared hit=49
>  Planning time: 0.069 ms
>  Execution time: 6.020 ms
> (5 rows)


And here we have PG 11:

> test=# explain (analyze, verbose, buffers)
> select attr from tab order by func(attr);
>   QUERY PLAN
> --
>  Sort  (cost=3269.68..3294.36 rows=9869 width=40) (actual 
> time=597.877..599.805 rows=9869 loops=1)
>Output: attr, (func(attr))
>Sort Key: (func(tab.attr))
>Sort Method: quicksort  Memory: 1436kB
>Buffers: shared hit=49
>->  Seq Scan on public.tab  (cost=0.00..2614.94 rows=9869 width=40) 
> (actual time=0.878..214.188 rows=9869 loops=1)
>  Output: attr, func(attr)
>  Buffers: shared hit=49
>  Planning Time: 0.151 ms
>  Execution Time: 601.767 ms
> (10 rows)
> 
> test=# explain (analyze, verbose, buffers)
> select attr from tab;
>  QUERY PLAN
> 
>  Seq Scan on public.tab  (cost=0.00..147.69 rows=9869 width=8) (actual 
> time=0.033..1.628 rows=9869 loops=1)
>Output: attr
>Buffers: shared hit=49
>  Planning Time: 0.043 ms
>  Execution Time: 2.581 ms
> (5 rows)


In the real scenario it's 500ms vs. 50s. The reason is obviously the
perl function used as sort key. All different versions have been tested
with an unmodified config and one tuned with pgtune. Creating a
functional index does not help in the original database as the planner
doesn't use it, while it *is* used in the test case. But the question
what causes that noticeable difference in performance is untouched by
the fact that it could be circumvented in some cases.

The perl version used is v5.24.1.

Best
  Johannes





Re: Reduce lock level for ALTER TABLE ... ADD CHECK .. NOT VALID

2021-05-28 Thread Greg Sabino Mullane
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Looks fine to me

The new status of this patch is: Ready for Committer


Re: Asynchronous Append on postgres_fdw nodes.

2021-05-28 Thread Etsuro Fujita
Horiguchi-san,

On Fri, May 28, 2021 at 5:29 PM Kyotaro Horiguchi
 wrote:
> At Fri, 28 May 2021 16:30:29 +0900, Etsuro Fujita  
> wrote in
> > The root cause would
> > be that we call classify_matching_subplans() to re-determine
> > sync/async subplans when called from the first ExecAppend() after the
> > first ReScan, even if do_exec_prune=false, which is incorrect because
> > in that case it is assumed to re-use sync/async subplans determined
> > during the first ExecAppend() after Init.

I noticed I wrote it wrong.  If do_exec_prune=false, we would
determine sync/async subplans during ExecInitAppend(), so the “re-use
sync/async subplans determined during the first ExecAppend() after
Init" part should be corrected as “re-use sync/async subplans
determined during ExecInitAppend()”.  Sorry for that.

> The patch drops some "= NULL" (initial) initializations when
> nasyncplans == 0. AFAICS makeNode() fills the returned memory with
> zeroes but I'm not sure it is our convention to omit the
> initializations.

I’m not sure, but I think we omit it in some cases; for example, we
don’t set as_valid_subplans to NULL explicitly in ExecInitAppend(), if
do_exec_prune=true.

> Otherwise the patch seems to make the code around cleaner.

Thanks for reviewing!

Best regards,
Etsuro Fujita




Re: Command statistics system (cmdstats)

2021-05-28 Thread Alvaro Herrera
> >> On Mar 4, 2020, at 7:43 PM, Mark Dilger  
> >> wrote:

> >> as mentioned in [1], I have created an implementation of command
> >> counter statistics very similar in purpose to the one already
> >> pending in the commitfest going by the name "pg_stat_sql".  I don't
> >> really care if this implementation is seen as building on that one
> >> or as separate, but I was worried about hijacking that thread, so
> >> I'm starting this thread. There are elements of this patch that
> >> borrowed from that one, so if this is accepted, authorship should
> >> reflect that.

Hello, was this abandoned or is there another thread somewhere?  The
patch in the email I'm replying to doesn't apply, but the conflicts
don't look very bad.

-- 
Álvaro Herrera   Valdivia, Chile




Re: storing an explicit nonce

2021-05-28 Thread Bruce Momjian
On Thu, May 27, 2021 at 04:36:23PM -0400, Stephen Frost wrote:
> At this point I'm wondering if it's just:
> 
> dboid/relfilenode:block-offset
> 
> and then we hash it to whatever size EVP_CIPHER_iv_length(AES-XTS-128)
> (or -256, whatever we're using based on what was passed to initdb)
> returns.

FYI, the dboid is not preserved by pg_upgrade.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: [HACKERS] EvalPlanQual behaves oddly for FDW queries involving system columns

2021-05-28 Thread Etsuro Fujita
On Fri, May 28, 2021 at 8:25 AM Alvaro Herrera  wrote:
> On 2019-Mar-15, Etsuro Fujita wrote:
> > (2019/03/04 12:10), Etsuro Fujita wrote:
> > > (2019/03/02 3:57), Andres Freund wrote:
> > > > FWIW, I pushed the EPQ patch, doing this conversion blindly. It'd be
> > > > awesome if you'd check that it actually works...
> > >
> > > I'll start the work later this week. I think I can post an (initial)
> > > report on that next week, maybe.
> >
> > Here is an updated version of the patch [1].
>
> What happened to this patch?  Seems it was forgotten ...

We didn't do anything about the patch.  IIRC, the patch was created
for testing the preliminary work for pluggable table access methods
(ad0bda5d2), rather than making it into postgres_fdw.

Best regards,
Etsuro Fujita




Re: Decoding speculative insert with toast leaks memory

2021-05-28 Thread Amit Kapila
On Fri, May 28, 2021 at 6:01 PM Tomas Vondra
 wrote:
>
> On 5/28/21 2:17 PM, Dilip Kumar wrote:
> > On Fri, May 28, 2021 at 5:16 PM Tomas Vondra
> >  wrote:
> >> On 5/27/21 6:36 AM, Dilip Kumar wrote:
> >>> On Thu, May 27, 2021 at 9:47 AM Amit Kapila  
> >>> wrote:
> 
>  On Thu, May 27, 2021 at 9:40 AM Dilip Kumar  
>  wrote:
> 
>  True, but if you do this clean-up in ReorderBufferCleanupTXN then you
>  don't need to take care at separate places. Also, toast_hash is stored
>  in txn so it appears natural to clean it up in while releasing TXN.
> >>>
> >>> Make sense, basically, IMHO we will have to do in TruncateTXN and
> >>> ReturnTXN as attached?
> >>>
> >>
> >> Yeah, I've been working on a fix over the last couple days (because of a
> >> customer issue), and I ended up with the reset in ReorderBufferReturnTXN
> >> too - it does solve the issue in the case I've been investigating.
> >>
> >> I'm not sure the reset in ReorderBufferTruncateTXN is correct, though.
> >> Isn't it possible that we'll need the TOAST data after streaming part of
> >> the transaction? After all, we're not resetting invalidations, tuplecids
> >> and snapshot either
> >
> > Actually, as per the current design, we don't need the toast data
> > after the streaming.  Because currently, we don't allow streaming the
> > transaction if we need to keep the toast across a stream; e.g., if we only
> > have a toast insert without the main insert, we treat this as partial
> > changes; another case is if we have a multi-insert with toast, then we
> > keep the transaction marked as partial until we get the last insert of
> > the multi-insert.  So with the current design we don't have any case
> > where we need to keep toast data across streams.
> >
> >  ... And we'll eventually clean it after the streamed
> >> transaction gets commited (ReorderBufferStreamCommit ends up calling
> >> ReorderBufferReturnTXN too).
> >
> > Right, but generally after streaming we assert that txn->size should
> > be 0.  That could be changed if we change the above design but this is
> > what it is today.
> >
>
> Can we add some assert to enforce this?
>

There is already an Assert for this. See ReorderBufferCheckMemoryLimit.

-- 
With Regards,
Amit Kapila.




be-secure-gssapi.c and auth.c with setenv() not compatible on Windows

2021-05-28 Thread Michael Paquier
Hi all,

Now that hamerkop has been fixed and that we have some coverage with
builds of GSSAPI on Windows thanks to 02511066, the buildfarm has been
complaining about a build failure on Windows for 12 and 13:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamerkop&dt=2021-05-28%2011%3A06%3A18&stg=make

The logs are hard to decrypt, but I guess that this is caused by the
use of setenv() in be-secure-gssapi.c and auth.c, as the tree has no
implementation that MSVC could feed on for those branches.

The recent commit 7ca37fb has changed things so as setenv() is used
instead of putenv(), and provides a fallback implementation, which
explains why the compilation of be-secure-gssapi.c and auth.c works
with MSVC, as reported by hamerkop.

We can do two things here:
1) Switch be-secure-gssapi.c and auth.c to use putenv().
2) Backport into 12 and 13 the fallback implementation of setenv
introduced in 7ca37fb, and keep be-secure-gssapi.c as they are now.

It is worth noting that 860fe27 mentions the use of setenv() in
be-secure-gssapi.c but has done nothing for it.  I would choose 1), on
the ground that adding a new file on back-branches adds an additional
cost to Windows maintainers if they use their own MSVC scripts (I know
of one case here that would be impacted), and that does not seem
mandatory here as putenv() would just work.

Thoughts?
--
Michael




Re: Decoding speculative insert with toast leaks memory

2021-05-28 Thread Tomas Vondra
On 5/28/21 2:17 PM, Dilip Kumar wrote:
> On Fri, May 28, 2021 at 5:16 PM Tomas Vondra
>  wrote:
>> On 5/27/21 6:36 AM, Dilip Kumar wrote:
>>> On Thu, May 27, 2021 at 9:47 AM Amit Kapila  wrote:

 On Thu, May 27, 2021 at 9:40 AM Dilip Kumar  wrote:

 True, but if you do this clean-up in ReorderBufferCleanupTXN then you
 don't need to take care at separate places. Also, toast_hash is stored
 in txn so it appears natural to clean it up in while releasing TXN.
>>>
>>> Make sense, basically, IMHO we will have to do in TruncateTXN and
>>> ReturnTXN as attached?
>>>
>>
>> Yeah, I've been working on a fix over the last couple days (because of a
>> customer issue), and I ended up with the reset in ReorderBufferReturnTXN
>> too - it does solve the issue in the case I've been investigating.
>>
>> I'm not sure the reset in ReorderBufferTruncateTXN is correct, though.
>> Isn't it possible that we'll need the TOAST data after streaming part of
>> the transaction? After all, we're not resetting invalidations, tuplecids
>> and snapshot either
> 
> Actually, as per the current design, we don't need the toast data
> after the streaming.  Because currently, we don't allow streaming the
> transaction if we need to keep the toast across a stream; e.g., if we only
> have a toast insert without the main insert, we treat this as partial
> changes; another case is if we have a multi-insert with toast, then we
> keep the transaction marked as partial until we get the last insert of
> the multi-insert.  So with the current design we don't have any case
> where we need to keep toast data across streams.
> 
>  ... And we'll eventually clean it after the streamed
>> transaction gets commited (ReorderBufferStreamCommit ends up calling
>> ReorderBufferReturnTXN too).
> 
> Right, but generally after streaming we assert that txn->size should
> be 0.  That could be changed if we change the above design but this is
> what it is today.
> 

Can we add some assert to enforce this?

>> I wonder if there's a way to free the TOASTed data earlier, instead of
>> waiting until the end of the transaction (as this patch does). But I
>> suspect it'd be way more complex, harder to backpatch, and destroying
>> the hash table is a good idea anyway.
> 
> Right.
> 
>> Also, I think the "if (txn->toast_hash != NULL)" checks are not needed,
>> because it's the first thing ReorderBufferToastReset does.
> 
> I see, I will change this.  If we all agree with this idea.
> 

+1 from me. I think it's good enough to do the cleanup at the end, and
it's an improvement compared to the current state. There might be cases of
transactions doing many such speculative inserts and accumulating a lot
of data in the TOAST hash, but I find it very unlikely.
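
For context, roughly the kind of statement that produces such
speculative inserts with TOASTed values (names and sizes are made up):

CREATE TABLE spec (id int PRIMARY KEY, payload text);

-- a speculative insert carrying a value large enough to be TOASTed;
-- if it loses the speculative race and is never confirmed, the toast
-- chunks decoded for it were not freed, which is the leak discussed
-- in this thread:
INSERT INTO spec VALUES (1, repeat('x', 1000000))
ON CONFLICT (id) DO NOTHING;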

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Decoding speculative insert with toast leaks memory

2021-05-28 Thread Dilip Kumar
On Fri, May 28, 2021 at 5:16 PM Tomas Vondra
 wrote:
> On 5/27/21 6:36 AM, Dilip Kumar wrote:
> > On Thu, May 27, 2021 at 9:47 AM Amit Kapila  wrote:
> >>
> >> On Thu, May 27, 2021 at 9:40 AM Dilip Kumar  wrote:
> >>
> >> True, but if you do this clean-up in ReorderBufferCleanupTXN then you
> >> don't need to take care at separate places. Also, toast_hash is stored
> >> in txn so it appears natural to clean it up in while releasing TXN.
> >
> > Make sense, basically, IMHO we will have to do in TruncateTXN and
> > ReturnTXN as attached?
> >
>
> Yeah, I've been working on a fix over the last couple days (because of a
> customer issue), and I ended up with the reset in ReorderBufferReturnTXN
> too - it does solve the issue in the case I've been investigating.
>
> I'm not sure the reset in ReorderBufferTruncateTXN is correct, though.
> Isn't it possible that we'll need the TOAST data after streaming part of
> the transaction? After all, we're not resetting invalidations, tuplecids
> and snapshot either

Actually, as per the current design, we don't need the toast data
after the streaming.  Because currently, we don't allow streaming the
transaction if we need to keep the toast across a stream; e.g., if we only
have a toast insert without the main insert, we treat this as partial
changes; another case is if we have a multi-insert with toast, then we
keep the transaction marked as partial until we get the last insert of
the multi-insert.  So with the current design we don't have any case
where we need to keep toast data across streams.

 ... And we'll eventually clean it after the streamed
> transaction gets commited (ReorderBufferStreamCommit ends up calling
> ReorderBufferReturnTXN too).

Right, but generally after streaming we assert that txn->size should
be 0.  That could be changed if we change the above design but this is
what it is today.

> I wonder if there's a way to free the TOASTed data earlier, instead of
> waiting until the end of the transaction (as this patch does). But I
> suspect it'd be way more complex, harder to backpatch, and destroying
> the hash table is a good idea anyway.

Right.

> Also, I think the "if (txn->toast_hash != NULL)" checks are not needed,
> because it's the first thing ReorderBufferToastReset does.

I see, I will change this.  If we all agree with this idea.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Decoding speculative insert with toast leaks memory

2021-05-28 Thread Tomas Vondra



On 5/27/21 6:36 AM, Dilip Kumar wrote:
> On Thu, May 27, 2021 at 9:47 AM Amit Kapila  wrote:
>>
>> On Thu, May 27, 2021 at 9:40 AM Dilip Kumar  wrote:
>>
>> True, but if you do this clean-up in ReorderBufferCleanupTXN then you
>> don't need to take care at separate places. Also, toast_hash is stored
>> in txn so it appears natural to clean it up in while releasing TXN.
> 
> Make sense, basically, IMHO we will have to do in TruncateTXN and
> ReturnTXN as attached?
> 

Yeah, I've been working on a fix over the last couple days (because of a
customer issue), and I ended up with the reset in ReorderBufferReturnTXN
too - it does solve the issue in the case I've been investigating.

I'm not sure the reset in ReorderBufferTruncateTXN is correct, though.
Isn't it possible that we'll need the TOAST data after streaming part of
the transaction? After all, we're not resetting invalidations, tuplecids
and snapshot either ... And we'll eventually clean it after the streamed
transaction gets committed (ReorderBufferStreamCommit ends up calling
ReorderBufferReturnTXN too).

I wonder if there's a way to free the TOASTed data earlier, instead of
waiting until the end of the transaction (as this patch does). But I
suspect it'd be way more complex, harder to backpatch, and destroying
the hash table is a good idea anyway.

Also, I think the "if (txn->toast_hash != NULL)" checks are not needed,
because it's the first thing ReorderBufferToastReset does.
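
(As an aside for readers following along: the pattern being relied on here
is simply an idempotent reset, so callers can invoke it unconditionally.
Below is a tiny, self-contained C sketch of that pattern; all names are
invented toys, not the actual reorderbuffer.c code.)

#include <stdlib.h>

/* Toy stand-ins for the per-transaction TOAST bookkeeping. */
typedef struct ToyToastHash
{
    char  **chunks;
    int     nchunks;
} ToyToastHash;

typedef struct ToyTxn
{
    ToyToastHash *toast_hash;   /* lazily created, may well stay NULL */
} ToyTxn;

/*
 * Idempotent reset: safe to call from every place that releases or
 * truncates a transaction, because it returns immediately when nothing
 * was ever accumulated -- no "!= NULL" guard needed at the call sites.
 */
static void
toy_toast_reset(ToyTxn *txn)
{
    if (txn->toast_hash == NULL)
        return;

    for (int i = 0; i < txn->toast_hash->nchunks; i++)
        free(txn->toast_hash->chunks[i]);
    free(txn->toast_hash->chunks);
    free(txn->toast_hash);
    txn->toast_hash = NULL;
}

int
main(void)
{
    ToyTxn txn = {NULL};

    toy_toast_reset(&txn);      /* no TOAST data: harmless no-op */
    toy_toast_reset(&txn);      /* calling it again is equally harmless */
    return 0;
}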



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add ZSON extension to /contrib/

2021-05-28 Thread Tomas Vondra



On 5/27/21 4:15 AM, Andrew Dunstan wrote:
> 
> On 5/26/21 5:29 PM, Bruce Momjian wrote:
>> On Tue, May 25, 2021 at 01:55:13PM +0300, Aleksander Alekseev wrote:
>>> Hi hackers,
>>>
>>> Back in 2016 while being at PostgresPro I developed the ZSON extension [1].
>>> The extension introduces the new ZSON type, which is 100% compatible with
>>> JSONB but uses a shared dictionary of strings most frequently used in given
>>> JSONB documents for compression. These strings are replaced with integer IDs.
>>> Afterward, PGLZ (and now LZ4) applies if the document is large enough by
>>> common PostgreSQL logic. Under certain conditions (many large documents),
>>> this saves disk space, memory and increases the overall performance. More
>>> details can be found in README on GitHub.
>> I think this is interesting because it is one of the few cases that
>> allow compression outside of a single column.  Here is a list of
>> compression options:
>>
>>  https://momjian.us/main/blogs/pgblog/2020.html#April_27_2020
>>  
>>  1. single field
>>  2. across rows in a single page
>>  3. across rows in a single column
>>  4. across all columns and rows in a table
>>  5. across tables in a database
>>  6. across databases
>>
>> While standard Postgres does #1, ZSON allows 2-5, assuming the data is
>> in the ZSON data type.  I think this cross-field compression has great
>> potential for cases where the data is not relational, or hasn't had time
>> to be structured relationally.  It also opens questions of how to do
>> this cleanly in a relational system.
>>
> 
> I think we're going to get the best bang for the buck on doing 2, 3, and
> 4. If it's confined to a single table then we can put a dictionary in
> something like a fork.

Agreed.

> Maybe given partitioning we want to be able to do multi-table
> dictionaries, but that's less certain.
> 

Yeah. I think it'll have many of the same issues/complexity as global
indexes, and the gains are likely limited. At least assuming the
partitions are sufficiently large, but tiny partitions are inefficient
in general, I think.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [BUG]Update Toast data failure in logical replication

2021-05-28 Thread Dilip Kumar
On Fri, May 28, 2021 at 12:31 PM tanghy.f...@fujitsu.com
 wrote:
>
> On Friday, May 28, 2021 3:02 PM, tanghy.f...@fujitsu.com wrote:
> > FYI. The problem also occurs in PG-13. I will try to check from which
> > version it got introduced.
>
> I reproduced it in PG 10, 11, 12, and 13.
> I think the problem has existed since logical replication was introduced in
> PG 10.

It seems you did not set a suitable replica identity on the table before
updating the tuple.  Try one of these before updating, and it should work:

ALTER TABLE toasted_key REPLICA IDENTITY USING INDEX toasted_key_pkey;

or

ALTER TABLE toasted_key REPLICA IDENTITY FULL;

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Add ZSON extension to /contrib/

2021-05-28 Thread Tomas Vondra



On 5/26/21 2:49 AM, Stephen Frost wrote:
> Greetings,
> 
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Matthias van de Meent  writes:
>>> I like the idea of the ZSON type, but I'm somewhat disappointed by its
>>> current limitations:
>>
>> I've not read the code, so maybe this thought is completely off-point,
>> but I wonder if anything could be learned from PostGIS.  AIUI they
>> have developed the infrastructure needed to have auxiliary info
>> (particularly, spatial reference data) attached to a geometry column,
>> without duplicating it in every value of the column.  Seems like that
>> is a close analog of what's needed here.
> 
> Err, not exactly the same- there aren't *that* many SRIDs and therefore
> they can be stuffed into the typemod (my, probably wrong, recollection
> was that I actually pushed Paul in that direction due to being
> frustrated with CHECK constraints they had been using previously..).
> 
> Not something you could do with a dictionary as what's contemplated
> here.  I do agree that each jsonb/zson/whatever column should really be
> able to have its own dictionary though and maybe you could shove *which*
> of those dictionaries a given column uses into the typemod for that
> column...  In an ideal world, however, we wouldn't make a user have to
> actually do that though and instead we'd just build our own magically
> for them when they use jsonb.
> 

I think doing this properly will require inventing new infrastructure to
associate custom parameters with a column (and/or data type).  In principle
it seems quite similar to 911e702077, which introduced opclass parameters
and thereby allowed implementing the new BRIN opclasses in PG14.

Even if we eventually decide not to add zson to contrib (or core), such
infrastructure would make zson more usable in practice.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add ZSON extension to /contrib/

2021-05-28 Thread Tomas Vondra



On 5/26/21 6:43 PM, Matthias van de Meent wrote:
> On Wed, 26 May 2021 at 12:49, Aleksander Alekseev
>  wrote:
>>
>> Hi hackers,
>>
>> Many thanks for your feedback, I very much appreciate it!
>>
>>> If the extension is mature enough, why make it an extension in
>>> contrib, and not instead either enhance the existing jsonb type with
>>> it or make it a built-in type?
>>
>>> IMO we have too d*mn many JSON types already.  If we can find a way
>>> to shoehorn this optimization into JSONB, that'd be great.  Otherwise
>>> I do not think it's worth the added user confusion.
>>
>> Magnus, Tom,
>>
>> My reasoning is that if the problem can be solved with an extension
>> there is little reason to modify the core. This seems to be in the
>> spirit of PostgreSQL. If the community reaches the consensus to modify
>> the core to introduce a similar feature, we could discuss this as
>> well. It sounds like a lot of unnecessary work to me though (see
>> below).
>>
>>> * doesn't cover all cases, notably indexes.
>>
>> Tom,
>>
>> Not sure if I follow. What cases do you have in mind?
>>
>>> Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.
>>
>> Matthias,
>>
>> Good point. I suspect that PostGIS is an exception though...
> 
> Quite a few other non-/common/ extensions are available in RDS[0],
> some of which are HLL (from citusdata), pglogical (from 2ndQuadrant)
> and orafce (from Pavel Stehule, orafce et al.).
> 
>>> I like the idea of the ZSON type, but I'm somewhat disappointed by its
>>> current limitations
>>
>> Several people suggested various enhancements right after learning
>> about ZSON. Time showed, however, that none of the real-world users
>> really need e.g. more than one common dictionary per database. I
>> suspect this is because no one has more than 2**16 repeatable unique
>> strings (one dictionary limitation) in their documents. Thus there is
>> no benefit in having separate dictionaries and corresponding extra
>> complexity.
> 
> IMO the main benefit of having different dictionaries is that you
> could have a small dictionary for small and very structured JSONB
> fields (e.g. some time-series data), and a large one for large /
> unstructured JSONB fields, without having the significant performance
> impact of having that large and varied dictionary on the
> small&structured field. Although a binary search is log(n) and thus
> still quite cheap even for large dictionaries, the extra size is
> certainly not free, and you'll be touching more memory in the process.
> 

I'm sure we can think of various other arguments for allowing separate
dictionaries. For example, what if you drop a column? With one huge
dictionary you're bound to keep the data forever. With per-column dicts
you can just drop the dict and free disk space / memory.

I also find it hard to believe that no one needs 2**16 strings. I mean,
65k is not that much, really. To give an example, I've been toying with
storing the bitcoin blockchain in a database - one way to do that is storing
each block as a single JSONB document. But each "item" (e.g. transaction)
is identified by a unique hash, so that means (tens of) thousands of
unique strings *per document*.

Yes, it's a bit silly and extreme, and maybe the compression would not
help much in this case. But it shows that 2**16 is damn easy to hit.

In other words, this seems like a nice example of survivor bias, where
we only look at cases for which the existing limitations are acceptable,
ignoring the (many) remaining cases eliminated by those limitations.
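
(To make the mechanics and the 2**16 ceiling concrete, here is a minimal,
self-contained C sketch of the general scheme -- a sorted dictionary whose
entries are looked up by binary search and replaced with 16-bit IDs. It is
only an illustration of the idea with invented names, not ZSON's actual
code.)

#include <stdio.h>
#include <string.h>

#define DICT_MAX 65536          /* 2**16: every ID a uint16 can represent */

typedef struct
{
    const char *words[DICT_MAX];    /* kept sorted; the index doubles as ID */
    int         nwords;
} Dict;

/* Return the 16-bit ID of "word", or -1 if it must be stored verbatim. */
static int
dict_lookup(const Dict *d, const char *word)
{
    int         lo = 0;
    int         hi = d->nwords - 1;

    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;
        int         cmp = strcmp(word, d->words[mid]);

        if (cmp == 0)
            return mid;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}

int
main(void)
{
    /* One shared dictionary; after 65536 distinct keys it is simply full. */
    static const Dict d = {{"amount", "currency", "timestamp", "txid"}, 4};

    printf("\"currency\"  -> ID %d\n", dict_lookup(&d, "currency"));
    printf("\"blockhash\" -> ID %d (kept verbatim)\n",
           dict_lookup(&d, "blockhash"));
    return 0;
}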

>>> - Each dictionary uses a lot of memory, regardless of the number of
>>> actual stored keys. For 32-bit systems the base usage of a dictionary
>>> without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be
>>> almost 1MB, and for 64-bit it would be 1.7MB. That is significantly
>>> more than I'd want to install.
>>
>> You are probably right on this one, this part could be optimized. I
>> will address this if we agree on submitting the patch.
>>

I'm sure it can be optimized, but I also think this focuses too much on the
base memory usage.

What I care about is the end result, i.e. how much disk space / memory I
save at the end. I don't care whether it's 1MB or 1.7MB if using the
compression saves me e.g. 50% of disk space. And it's completely irrelevant
if I can't use the feature because of restrictions stemming from the
"single dictionary" limitation (in which case I save the 0.7MB, but I also
lose the 50% disk space savings - not a great trade-off, if you ask me).

>>> - You call gettimeofday() in both dict_get and in get_current_dict_id.
>>> These functions can be called in short and tight loops (for small ZSON
>>> fields), in which case it would add significant overhead through the
>>> implied syscalls.
>>
>> I must admit, I'm not an expert in this area. My understanding is that
>> gettimeofday() is implemented as a single virtual memory access on
>> modern operating systems, e.g. VDSO on Linux, thus it's very 

RE: Parallel Inserts in CREATE TABLE AS

2021-05-28 Thread houzj.f...@fujitsu.com
From: Bharath Rupireddy 
Sent: Thursday, May 27, 2021 10:07 PM
> On Thu, May 27, 2021 at 9:53 AM Bharath Rupireddy
>  wrote:
> > > One idea to find this out could be that we have three counters for
> > > each worker which counts the number of times each worker extended
> > > the relation in bulk, the number of times each worker extended the
> > > relation by one block, the number of times each worker gets the page
> > > from FSM. It might be possible that with this we will be able to
> > > figure out why there is a difference between your and Hou-San's
> > > results.
> >
> > Yeah, that helps. And also, the time spent in
> > LockRelationForExtension, ConditionalLockRelationForExtension,
> > GetPageWithFreeSpace and RelationAddExtraBlocks too can give some
> > insight.
> >
> > My plan is to have a patch with above info added in (which I will
> > share it here so that others can test and see the results too) and run
> > the "case 4" where there's a regression seen on my system.
> 
> I captured below information with the attached patch
> 0001-test-times-and-block-counts.patch applied on top of CTAS v23 patch set.
> Testing details are attached in the file named "test".
> - Total time spent in LockRelationForExtension
> - Total time spent in GetPageWithFreeSpace
> - Total time spent in RelationAddExtraBlocks
> - Total number of times extended the relation in bulk
> - Total number of times extended the relation by one block
> - Total number of blocks added in bulk extension
> - Total number of times getting the page from FSM
> 
> Here is a summary of what I observed:
> 1) The execution time with 2 workers, without TABLE_INSERT_SKIP_FSM
> (140 sec) is more than with 0 workers (112 sec)
> 2) The execution time with 2 workers, with TABLE_INSERT_SKIP_FSM (225
> sec) is more than with 2 workers, without TABLE_INSERT_SKIP_FSM (140
> sec)
> 3) Majority of the time is going into waiting for relation extension lock in
> LockRelationForExtension. With 2 workers, without TABLE_INSERT_SKIP_FSM,
> out of total execution time 140 sec, the time spent in 
> LockRelationForExtension
> is ~40 sec and the time spent in RelationAddExtraBlocks is ~20 sec. So, ~60 
> sec
> are being spent in these two functions. With 2 workers, with
> TABLE_INSERT_SKIP_FSM, out of total execution time 225 sec, the time spent
> in LockRelationForExtension is ~135 sec and the time spent in
> RelationAddExtraBlocks is 0 sec (because we skip FSM, no bulk extend logic
> applies). So, most of the time is being spent in LockRelationForExtension.
> 
> I'm still not sure why the execution time with 0 workers (or serial execution 
> or
> no parallelism involved) on my testing system is 112 sec compared to 58 sec on
> Hou-San's system for the same use case. Maybe the testing system I'm using is
> not of the latest configuration compared to others.
> 
> Having said that, I request others to try and see if the same observations
> (as above) are made on their testing systems for the same use case, and to
> report if they don't see a regression (with just 2 workers) or observe not
> much difference with and without TABLE_INSERT_SKIP_FSM.
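
(For anyone reproducing this kind of measurement, below is a minimal,
self-contained C sketch of the style of instrumentation described above:
a counter plus accumulated wait time around a lock acquisition. It uses a
pthread mutex as a stand-in for the relation extension lock and invented
names throughout; it is not the actual 0001 test patch.)

/* build: cc instr.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t extension_lock = PTHREAD_MUTEX_INITIALIZER;
static long long lock_wait_ns;      /* total time spent waiting for the lock */
static long      lock_acquisitions; /* number of times the lock was taken */

static long long
elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

static void
extend_relation_by_one_block(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_mutex_lock(&extension_lock);
    clock_gettime(CLOCK_MONOTONIC, &end);

    lock_wait_ns += elapsed_ns(start, end);
    lock_acquisitions++;

    /* ... pretend to add a block to the relation here ... */

    pthread_mutex_unlock(&extension_lock);
}

int
main(void)
{
    for (int i = 0; i < 1000; i++)
        extend_relation_by_one_block();

    printf("acquisitions: %ld, total wait: %lld ns\n",
           lock_acquisitions, lock_wait_ns);
    return 0;
}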

Thanks for the patch!
I attached my test results. Note I did not change wal_level to minimal.

I only changed the following configuration:

shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

Best regards,
houzj


test_results
Description: test_results


Re: Parallel INSERT SELECT take 2

2021-05-28 Thread Greg Nancarrow
On Mon, May 24, 2021 at 3:15 PM houzj.f...@fujitsu.com
 wrote:
>
>
> Thanks for the comments and your descriptions looks good.
> Attaching v5 patchset with all these changes.
>

A few other minor things I noticed:

(1) error message wording when declaring a table SAFE for parallel DML

src/backend/commands/tablecmds.c

Since data modification for the RELKIND_FOREIGN_TABLE and
RELPERSISTENCE_TEMP types is allowed in the parallel-restricted case
(i.e. the leader may modify in parallel mode),
I'm thinking it may be better to use wording like:

"cannot support foreign or temporary table data modification by
parallel workers"

instead of

"cannot support parallel data modification on a foreign or temporary table"

There are TWO places where this error message is used.

(What do you think?)

(2) Minor formatting issue

src/backend/optimizer/util/clauses.c

static safety_object *make_safety_object(Oid objid, Oid classid,
char proparallel)

should be:

static safety_object *
make_safety_object(Oid objid, Oid classid, char proparallel)

(3) Minor formatting issue

src/backend/utils/cache/typcache.c


List *GetDomainConstraints(Oid type_id)

should be:

List *
GetDomainConstraints(Oid type_id)


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Asynchronous Append on postgres_fdw nodes.

2021-05-28 Thread Kyotaro Horiguchi
At Fri, 28 May 2021 16:30:29 +0900, Etsuro Fujita  
wrote in 
> On Wed, Mar 31, 2021 at 6:55 PM Etsuro Fujita  wrote:
> > On Tue, Mar 30, 2021 at 8:40 PM Etsuro Fujita  
> > wrote:
> > > I'm happy with the patch, so I'll commit it if there are no objections.
> >
> > Pushed.
> 
> I noticed that rescan of async Appends is broken when
> do_exec_prune=false, leading to incorrect results on normal builds and
> the following failure on assertion-enabled builds:
> 
> TRAP: FailedAssertion("node->as_valid_asyncplans == NULL", File:
> "nodeAppend.c", Line: 1126, PID: 76644)
> 
> See a test case for this added in the attached.  The root cause would
> be that we call classify_matching_subplans() to re-determine
> sync/async subplans when called from the first ExecAppend() after the
> first ReScan, even if do_exec_prune=false, which is incorrect because
> in that case it is assumed to re-use sync/async subplans determined
> during the first ExecAppend() after Init.  The attached fixes this
> issue.  (A previous patch also had this issue, so I fixed it, but I
> think I broke this again when simplifying the patch :-(.)  I did a bit
> of cleanup, and modified ExecReScanAppend() to initialize an async
> state variable as_nasyncresults to zero, to be sure.  I think the
> variable would have been set to zero before we get to that function,
> so I don't think we really need to do so, though.
> 
> I will add this to the open items list for v14.

The patch drops some "= NULL" (initial) initializations when
nasyncplans == 0. AFAICS makeNode() fills the returned memory with
zeroes, but I'm not sure it is our convention to omit such
initializations.
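
(A trivial stand-alone illustration of that point, using calloc() as a
stand-in for makeNode()'s zero-filled allocation; the struct and field
names are just toys borrowed from this discussion, not the real
AppendState.)

#include <stdio.h>
#include <stdlib.h>

typedef struct ToyAppendState
{
    void   *as_valid_asyncplans;    /* pointer members come back NULL */
    int     as_nasyncresults;       /* integer members come back zero */
} ToyAppendState;

int
main(void)
{
    /* calloc() zero-fills, much like makeNode() zero-fills node structs. */
    ToyAppendState *node = calloc(1, sizeof(ToyAppendState));

    /*
     * So these explicit resets are redundant; whether to keep them anyway
     * for readability is exactly the convention question raised above.
     */
    node->as_valid_asyncplans = NULL;
    node->as_nasyncresults = 0;

    printf("%p %d\n", node->as_valid_asyncplans, node->as_nasyncresults);
    free(node);
    return 0;
}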

Otherwise the patch seems to make the surrounding code cleaner.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Race condition in recovery?

2021-05-28 Thread Kyotaro Horiguchi
(Sorry for being a bit off-topic)

At Fri, 28 May 2021 12:18:35 +0900, Tatsuro Yamada 
 wrote in 
> Hi Horiguchi-san,

(Why me?)

> In a project I helped with, I encountered an issue where
> the archive command kept failing. I thought this issue was
> related to the problem in this thread, so I'm sharing it here.
> If I should create a new thread, please let me know.
> 
> * Problem
>   - The archive_command always fails.

Although I think the configuration is somewhat broken, it can be seen as
mimicking the shared-archive case, where primary and standby share the
same archive directory.

Basically we need to use an archive command like the following for that
case to avoid this kind of failure. The script returns "success" when the
target file already exists but is identical to the source file. I don't
find such a description in the documentation, and I haven't bothered
digging into the mailing-list archives.

==
#! /bin/bash

# $1 = file to archive, $2 = destination in the archive
if [ -f "$2" ]; then
    # Already archived: succeed only if the existing copy is identical.
    if ! cmp -s "$1" "$2"; then
        exit 1
    fi
    exit 0
fi

cp "$1" "$2"
==

A possibly non-optimal behavior is that both 0002.history.done and .ready
files can be found at once in the archive_status directory, but that doesn't
matter in practice. (Or I faintly remember that it is designed to work even
in that case.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Asynchronous Append on postgres_fdw nodes.

2021-05-28 Thread Etsuro Fujita
On Wed, Mar 31, 2021 at 6:55 PM Etsuro Fujita  wrote:
> On Tue, Mar 30, 2021 at 8:40 PM Etsuro Fujita  wrote:
> > I'm happy with the patch, so I'll commit it if there are no objections.
>
> Pushed.

I noticed that rescan of async Appends is broken when
do_exec_prune=false, leading to incorrect results on normal builds and
the following failure on assertion-enabled builds:

TRAP: FailedAssertion("node->as_valid_asyncplans == NULL", File:
"nodeAppend.c", Line: 1126, PID: 76644)

See a test case for this added in the attached.  The root cause would
be that we call classify_matching_subplans() to re-determine
sync/async subplans when called from the first ExecAppend() after the
first ReScan, even if do_exec_prune=false, which is incorrect because
in that case it is assumed to re-use sync/async subplans determined
during the first ExecAppend() after Init.  The attached fixes this
issue.  (A previous patch also had this issue, so I fixed it, but I
think I broke this again when simplifying the patch :-(.)  I did a bit
of cleanup, and modified ExecReScanAppend() to initialize an async
state variable as_nasyncresults to zero, to be sure.  I think the
variable would have been set to zero before we get to that function,
so I don't think we really need to do so, though.

I will add this to the open items list for v14.

Best regards,
Etsuro Fujita


fix-rescan-of-async-appends.patch
Description: Binary data


RE: [BUG]Update Toast data failure in logical replication

2021-05-28 Thread tanghy.f...@fujitsu.com
On Friday, May 28, 2021 3:02 PM, tanghy.f...@fujitsu.com wrote: 
> FYI. The problem also occurs in PG-13. I will try to check from which
> version it got introduced.

I reproduced it in PG 10, 11, 12, and 13.
I think the problem has existed since logical replication was introduced in
PG 10.

Regards
Tang