Re: Undo logs

2018-09-02 Thread Dilip Kumar
On Sun, Sep 2, 2018 at 12:18 AM, Thomas Munro
 wrote:
> On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar
>  wrote:
>> On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar  wrote:
>> > As Thomas has already mentioned upthread, we are working on
>> > undo-log-based storage, and he has posted the patch sets for the lowest
>> > layer, called undo-log-storage.
>> >
>> > This is the next layer, which sits on top of the undo-log storage and
>> > provides an interface for preparing, inserting, and fetching undo
>> > records. This layer uses undo-log-storage to reserve space for the
>> > undo records and the buffer management routines to write and read them.
>
> I have also pushed a new WIP version of the lower level undo log
> storage layer patch set to a public branch[1].  I'll leave the earlier
> branch[2] there because the record-level patch posted by Dilip depends
> on it for now.

Rebased undo_interface patches on top of the new branch of undo-log-storage[1].

[1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage-v2
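
Purely as an illustration of the prepare/insert/fetch layering described in
the quoted text above, a record-level interface could be shaped roughly like
the declarations below.  Every type, name, and signature here is hypothetical
and for orientation only; none of it is taken from the posted patches.

#include "postgres.h"
#include "storage/block.h"
#include "storage/off.h"

/* Hypothetical pointer into an undo log (illustration only). */
typedef uint64 IllustrativeUndoRecPtr;

/* Hypothetical record-level view of one undo record. */
typedef struct IllustrativeUndoRecord
{
	uint8		uur_type;		/* operation that generated the undo */
	Oid			uur_reloid;		/* relation the undo applies to */
	BlockNumber	uur_block;		/* affected block */
	OffsetNumber uur_offset;	/* affected offset within the block */
	Size		uur_payload_len;	/* length of operation-specific payload */
	char	   *uur_payload;	/* operation-specific payload */
} IllustrativeUndoRecord;

/*
 * "Prepare" would ask the undo-log-storage layer to reserve space and pin
 * the buffers needed for the record; "insert" would copy the record into
 * those buffers (e.g. inside the caller's critical section); "fetch" would
 * read a record back given its location.
 */
extern IllustrativeUndoRecPtr IllustrativePrepareUndoInsert(IllustrativeUndoRecord *rec);
extern void IllustrativeInsertPreparedUndo(void);
extern IllustrativeUndoRecord *IllustrativeUndoFetchRecord(IllustrativeUndoRecPtr urp);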

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


undo_interface_v2.patch
Description: Binary data


undo_interface_test_v2.patch
Description: Binary data


Re: speeding up planning with partitions

2018-09-02 Thread Amit Langote
Thank you Kato-san for testing.

On 2018/08/31 19:48, Kato, Sho wrote:
> Hi, Amit
> 
> Great!
> With the total number of records being 6400, I benchmarked while increasing 
> the number of partitions from 100 to 6400.
> Applying all three patches, performance improved by 20% for 100 partitions.
> 
> I have the same number of records for each partition, do you do the same?

I didn't load any data into the tables when running the tests, because these
patches are meant to reduce planner latency.  More specifically, they address
the current planner behavior whereby latency increases with the number of
partitions, focusing on the common case where only a single partition needs
to be scanned by a given query.

I'd try to avoid using a benchmark whose results are affected by anything
other than planning latency.  It would be a good idea to leave the tables
empty and vary only the number of partitions, nothing else.

Also, we're interested in planning latency, so using just SELECT and
UPDATE in your benchmark script will be enough, because their planning
time is affected by the number of partitions.  No need to try to measure
the INSERT latency, because its planning latency is not affected by the
number of partitions.  Moreover, I'd rather suggest you take out the
INSERT statement from the benchmark for now, because its execution time
does vary unfavorably with increasing number of partitions.  Sure, there
are other patches to address that, but it's better not to mix the patches
to avoid confusion.

> Also, in my case, performance was better when not using prepared statements.

The patches in this thread do nothing for execution, so there is no need
to compare prepared vs non-prepared.  In fact, just measure non-prepared
tps and latency, because we're only interested in planning time here.

> I think these patches do not improve the execute case, so we need the faster
> runtime pruning patch[1], right?

We already have run-time pruning code (that is the code in the patch you
linked) committed into the tree in PG 11, so the master tree also has it.
But since we're not interested in execution time, no need to worry about
run-time pruning.

> Details of measurement conditions and results are as follows.
> - base source
>   master(@777e6ddf17) + Speeding up Insert v8 patch[1]
>
> - table definition(e.g. 100 partition)
>   create table test.accounts(aid serial, abalance int)
>   partition by range(aid);
>   create table test.accounts_history(id serial, aid int, delta int,
>   mtime timestamp without time zone)
>   partition by range(aid);
>
> - command option
>   pgbench -d testdb -f benchmark.pgbench -T 180 -r -n -M prepare
>   pgbench -d testdb -f benchmark.pgbench -T 180 -r -n
>   
> -results
>   base source no prepared
>part_num |   tps_ex   | update_latency | select_latency | insert_latency 
>   ----------+------------+----------------+----------------+----------------
> 100 | 662.414805 |  0.357 |  0.265 |  0.421
> 200 | 494.478431 |  0.439 |  0.349 |  0.579
> 400 | 307.982089 |  0.651 |  0.558 |  0.927
> 800 | 191.360676 |  0.979 |  0.876 |  1.548
>1600 |  75.344947 |  2.253 |  2.003 |  4.301
>3200 |  30.643902 |  5.716 |  4.955 | 10.118
>6400 |  16.726056 | 12.512 |  8.582 | 18.054
> 
>   0001 no prepared
>part_num |   tps_ex   | update_latency | select_latency | insert_latency 
>   ----------+------------+----------------+----------------+----------------
> 100 | 429.816329 |  0.811 |   0.75 |  0.365
> 200 | 275.211531 |  1.333 |  1.248 |  0.501
> 400 | 160.499833 |  2.384 |  2.252 |  0.754
> 800 |  79.387776 |  4.935 |  4.698 |  1.468
>1600 |  24.787377 | 16.593 | 15.954 |  4.302
>3200 |   9.846421 |  42.96 | 42.139 |  8.848
>6400 |   4.919772 |  87.43 | 83.109 |  16.56

Hmm, since 0001 is meant to improve update planning time, it seems odd
that you'd get poorer results compared to the base source.  But it seems the
base source is actually master plus the patch to improve the execution time
of update, so maybe that patch is playing a part here, although I'm not sure
why even that would make this much of a difference.

I suggest that you use unpatched master as the base source, that is, leave
out any patches meant to improve execution time.

[ ... ]

>   0001 + 0002 no prepared
>part_num |   tps_ex   | update_latency | select_latency | insert_latency 
>   ----------+------------+----------------+----------------+----------------
> 100 |  682.53091 |  0.388 |   0.35 |   0.35
> 200 | 469.906601 |  0.543 |  0.496 |  

Re: Adding a note to protocol.sgml regarding CopyData

2018-09-02 Thread Tatsuo Ishii
> Hello Bradley & Tatsuo-san,
> 
>>> ... references to the protocol version lacks homogeneity.
>>> ... I'd suggest to keep "the vX.0 protocol" for a short version,
>>> and "the version X.0 protocol" for long ...
>>
>> I agree. Change made.
> 
> Patch applies cleanly. Doc build ok.
> 
> One part talks about "terminating line", the other of "termination
> line". I wonder whether one is better than the other?

I am not a native English speaker, so I may be wrong... but to me the
current usage of "terminating line" and "termination line" looks correct.
The former refers to a concrete example and thus uses "the", while the
latter refers to the more general case and thus uses "an".

BTW, I think the patch should apply to master and REL_11_STABLE
branches at least. But should this be applied to other back branches
as well?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Hint to set owner for tablespace directory

2018-09-02 Thread Rafia Sabih
On Fri, Aug 31, 2018 at 4:11 PM, Maksim Milyutin 
wrote:

> On 08/31/2018 01:12 PM, Rafia Sabih wrote:
>
>
>
> On Fri, Aug 31, 2018 at 3:30 PM, Maksim Milyutin 
> wrote:
>
>> On 08/31/2018 11:55 AM, Rafia Sabih wrote:
>>
>>
>> Adding to that this patch needs a rebase. And, please don't forget to run
>> 'git diff --check', as it has some white-space issues.
>>
>>
>>  git apply /home/edb/Desktop/patches/others/owner_tbsp_hint.patch
>  /home/edb/Desktop/patches/others/owner_tbsp_hint.patch:9: trailing
> whitespace.
> {
> /home/edb/Desktop/patches/others/owner_tbsp_hint.patch:10: trailing
> whitespace.
> bool wrong_owner = (errno == EPERM);
>
>
> This is hex+ASCII display (from *hexdump -C owner_tbsp_hint.patch* output)
> of code fragment above:
>
> 01a0  0a 20 09 09 65 6c 73 65  0a 2b 09 09 7b 0a 2b 09  |.
> ..else.+..{.+.|
> 01b0  09 09 62 6f 6f 6c 20 77  72 6f 6e 67 5f 6f 77 6e  |..bool
> wrong_own|
> 01c0  65 72 20 3d 20 28 65 72  72 6e 6f 20 3d 3d 20 45  |er = (errno
> == E|
> 01d0  50 45 52 4d 29 3b 0a 2b  0a 20 09 09 09 65 72 65  |PERM);.+.
> ...ere|
>
> Problem lines don't have trailing whitespaces, only line feed (0x0a)
> symbols
>
> This is what I got. Please correct me if I am missing anything. I am using
> CentOS 7, just an FYI.
>
>
> My git version is 2.7.4.
>
> I am using version 2.12.1, that might be the reason for the discrepancy.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


Re: pg_verify_checksums failure with hash indexes

2018-09-02 Thread Amit Kapila
On Tue, Aug 28, 2018 at 6:43 PM Michael Paquier  wrote:
>
> On Tue, Aug 28, 2018 at 11:21:34AM +0200, Peter Eisentraut wrote:
> > The files in question correspond to
> >
> > hash_i4_index
> > hash_name_index
> > hash_txt_index
>
> The hash index code has been largely refactored in v10, so I would
> imagine that you can see the problem as well there.
>
> [... digging digging ...]
>
> And indeed I can see the problem in 10 as well with my own pg_checksums,
> and I can see hash_f8_index with a problem on top of what Peter has
> reported.
>

AFAICS, this problem exists in 9.6 and prior branches as well,
although I can't test it.  I am not sure whether we need to back-patch
this beyond 10 (where hash indexes became WAL-logged), as prior to that
hash indexes were not reliable anyway.  What's your opinion?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Proposal for disk quota feature

2018-09-02 Thread Pavel Stehule
2018-09-03 3:49 GMT+02:00 Hubert Zhang :

> Thanks Pavel.
> Your patch did enforcement on storage level(md.c or we could also use
> smgr_extend). It's straight forward.
> But I prefer to implement disk_quota as a feature with following
> objectives:
> 1 set/alter disk quota setting on different database objects, e.g. user,
> database, schema etc. not only a general GUC, but we could set separate
> quota limit for a specific objects.
> 2 enforcement operator should work at two positions: before query is
> running and when query is running. The latter one's implementation maybe
> similar to your patch.
>

The patch was just an example. The resource quotas should be more complex -
per partition, table, schema, database, user - so GUCs are possible, but not
very user friendly.

Our case is specific, but not too much. The servers are used for
multidimensional analyses, and some tables can grow too fast (COPY, INSERT
SELECT), so we need to enforce limits immediately. The implementation is
simple, which is why I did it this way. The same implementation at the
database or schema level needs more locks, so it would not be as effective.
Resource management in general can be very complex, and I expect it will be
hard work.

Regards

Pavel


> On Sun, Sep 2, 2018 at 8:44 PM, Pavel Stehule 
> wrote:
>
>> Hi
>>
>> 2018-09-02 14:18 GMT+02:00 Hubert Zhang :
>>
>>> Thanks Chapman.
>>> @Pavel,  could you please explain more about your second suggestion 
>>> "implement
>>> some quotas on storage level?"
>>>
>>
>> See attached patch - it is very simple - and good enough for our
>> purposes.
>>
>> Regards
>>
>> Pavel
>>
>>
>>
>>> We will not keep the long-lived processes attach to all databases(just
>>> like you mentioned servers with thousands of databases)
>>> And you are right, we could share ideas with autovacuum process, fork
>>> worker processes in need.
>>> "autovacuum checks for tables that have had a large number of inserted,
>>> updated or deleted tuples. These checks use the statistics collection
>>> facility"
>>> diskquota process is similar to autovacuum at caring about insert, but
>>> the difference is that it also care about vucuum full, truncate and drop.
>>> While update and delete may not be interested since no file change happens.
>>> So a separate diskquota process is preferred.
>>>
>>> So if we implemented disk quota as a full native feature, and in the
>>> first initial version I prefer to implement the following features:
>>> 1 Fork diskquota launcher process under Postmaster serverloop, which is
>>> long-lived.
>>> 2 Diskquota launcher process is responsible for creating diskquota
>>> worker process for every database.
>>> 3 DIskquota setting is stored in a separate catalog table for each
>>> database.
>>> 4 Initialization stage, Diskquota launcher process creates diskquota worker
>>> process for all the databases(traverse like autovacuum). Worker process
>>> calculates disk usage of db objects and their diskquota setting. If any
>>> db object exceeds its quota limit, put them into the blacklist in the
>>> shared memory, which will later be used by enforcement operator. Worker
>>> process exits when works are done.
>>> 5 Running stage, Diskquota launcher process creates diskquota worker
>>> process for the database with a large number of insert, copy, truncate,
>>> drop etc. or create disk quota statement. Worker process updates the file
>>> size for db objects containing the result relation, and compare with the
>>> diskquota setting. Again, if exceeds quota limit, put them into blacklist,
>>> remove from blacklist vice versa. Worker process exits when works are
>>> done and a GUC could control the frequency of worker process restart to a
>>> specific database. As you know, this GUC also controls the delay when we do
>>> enforcement.
>>> 6 Enforcement. When postgres backend executes queries, check the
>>> blacklist in shared memory to determine whether the query is allowed(before
>>> execute) or need rollback(is executing)?
>>>
>>> If we implemented disk quota as an extension, we could just use
>>> background worker to start diskquota launcher process and use
>>> RegisterDynamicBackgroundWorker() to fork child diskquota worker
>>> processes by the launcher process as suggested by @Chapman.
>>> Diskquota setting could be stored in user table in a separate schema for
>>> each database(Schema and table created by create extension statement) just
>>> like what Heikki has done in pg_quota project. But in this case, we need to
>>> create extension for each database before diskquota worker process can be
>>> set up for that database.
>>>
>>> Any comments on the above design and which is preferred, native feature
>>> or extension as the POC?
>>>
>>>
>>> -- Hubert
>>>
>>>
>>>
>>> On Fri, Aug 31, 2018 at 3:32 AM, Pavel Stehule 
>>> wrote:
>>>


 2018-08-30 16:22 GMT+02:00 Chapman Flack :

> On 08/30/2018 09:57 AM, Hubert Zhang wrote:
>
> > 2 Keep one worker process for each database. But using a
> 

Re: pg_verify_checksums failure with hash indexes

2018-09-02 Thread Amit Kapila
On Sat, Sep 1, 2018 at 10:28 AM Dilip Kumar  wrote:
>
> On Sat, Sep 1, 2018 at 8:22 AM, Robert Haas  wrote:
> > On Thu, Aug 30, 2018 at 7:27 AM, Amit Kapila  
> > wrote:
> >
> > I wouldn't bother bumping HASH_VERSION.  First, the fix needs to be
> > back-patched, and you certainly can't back-patch a HASH_VERSION bump.
> > Second, you should just pick a formula that gives the same answer as
> > now for the cases where the overrun doesn't occur, and some other
> > sufficiently large value for the cases where an overrun currently does
> > occur.  If you do that, you're not changing the behavior in any case
> > that currently works, so there's really no reason for a version bump.
> > It just becomes a bug fix at that point.
> >

makes sense.

>
> I think if we compute with below formula which I suggested upthread
>
> #define HASH_MAX_BITMAPS   Min(BLCKSZ / 8, 1024)
>
> then for BLCKSZ 8K and bigger, it will remain the same value where it
> does not overrun.  And, for the small BLCKSZ, I think it will give
> sufficient space for the hash map. If the BLCKSZ is 1K then the sizeof
> (HashMetaPageData) + sizeof (HashPageOpaque) = 968 which is very close
> to the BLCKSZ.
>

Yeah, so at 1K, the value of HASH_MAX_BITMAPS will be 128 as per the above
formula, which is what its value was prior to commit 620b49a1.
I think it would be better if you add a comment in your patch
indicating the importance/advantage of such a formula.
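
For concreteness, here is a tiny standalone check of how the proposed formula
behaves across block sizes; it uses only the expression quoted above and is
not code from the patch.

#include <stdio.h>

/* The formula proposed upthread: #define HASH_MAX_BITMAPS  Min(BLCKSZ / 8, 1024) */
#define Min(x, y) ((x) < (y) ? (x) : (y))

int
main(void)
{
	int		blcksz[] = {1024, 2048, 4096, 8192, 16384, 32768};
	int		i;

	/* 1K gives 128 (the pre-620b49a1 value); 8K and larger stay at 1024. */
	for (i = 0; i < 6; i++)
		printf("BLCKSZ = %5d -> HASH_MAX_BITMAPS = %d\n",
			   blcksz[i], Min(blcksz[i] / 8, 1024));
	return 0;
}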

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [PATCH] Improve geometric types

2018-09-02 Thread Tom Lane
Tomas Vondra  writes:
> On 08/17/2018 11:23 PM, Tom Lane wrote:
>> Yeah, we've definitely hit such problems before.  The geometric logic
>> seems particularly prone to it because it's doing more and subtler
>> float arithmetic than the rest of the system ... but it's not the sole
>> source of minus-zero issues.

> We can go through the patch and fix places with obvious -0 risks, but
> then what? I don't have access to any powerpc machines, so I can't check
> if there are any other failures. So the only thing I could do is commit
> and fix based on buildfarm failures ...

That's the usual solution.  I don't see anything particularly wrong
with it.  If the buildfarm results suggest a whole mess of problems,
it might be appropriate to revert and rethink; but you shouldn't be
afraid to use the buildfarm as a testbed.

regards, tom lane



Re: Proposal for disk quota feature

2018-09-02 Thread Hubert Zhang
Thanks Pavel.
Your patch does enforcement at the storage level (md.c, or we could also use
smgr_extend). It's straightforward.
But I prefer to implement disk_quota as a feature with the following
objectives:
1 Set/alter disk quota settings on different database objects, e.g. user,
database, schema etc.; not only a general GUC, but the ability to set a
separate quota limit for a specific object.
2 The enforcement operator should work at two points: before a query runs
and while a query is running. The latter's implementation may be similar to
your patch.

On Sun, Sep 2, 2018 at 8:44 PM, Pavel Stehule 
wrote:

> Hi
>
> 2018-09-02 14:18 GMT+02:00 Hubert Zhang :
>
>> Thanks Chapman.
>> @Pavel,  could you please explain more about your second suggestion 
>> "implement
>> some quotas on storage level?"
>>
>
> See attached patch - it is very simple - and good enough for our purposes.
>
> Regards
>
> Pavel
>
>
>
>> We will not keep the long-lived processes attach to all databases(just
>> like you mentioned servers with thousands of databases)
>> And you are right, we could share ideas with autovacuum process, fork
>> worker processes in need.
>> "autovacuum checks for tables that have had a large number of inserted,
>> updated or deleted tuples. These checks use the statistics collection
>> facility"
>> diskquota process is similar to autovacuum at caring about insert, but
>> the difference is that it also care about vucuum full, truncate and drop.
>> While update and delete may not be interested since no file change happens.
>> So a separate diskquota process is preferred.
>>
>> So if we implemented disk quota as a full native feature, and in the
>> first initial version I prefer to implement the following features:
>> 1 Fork diskquota launcher process under Postmaster serverloop, which is
>> long-lived.
>> 2 Diskquota launcher process is responsible for creating diskquota
>> worker process for every database.
>> 3 DIskquota setting is stored in a separate catalog table for each
>> database.
>> 4 Initialization stage, Diskquota launcher process creates diskquota worker
>> process for all the databases(traverse like autovacuum). Worker process
>> calculates disk usage of db objects and their diskquota setting. If any
>> db object exceeds its quota limit, put them into the blacklist in the
>> shared memory, which will later be used by enforcement operator. Worker
>> process exits when works are done.
>> 5 Running stage, Diskquota launcher process creates diskquota worker
>> process for the database with a large number of insert, copy, truncate,
>> drop etc. or create disk quota statement. Worker process updates the file
>> size for db objects containing the result relation, and compare with the
>> diskquota setting. Again, if exceeds quota limit, put them into blacklist,
>> remove from blacklist vice versa. Worker process exits when works are
>> done and a GUC could control the frequency of worker process restart to a
>> specific database. As you know, this GUC also controls the delay when we do
>> enforcement.
>> 6 Enforcement. When postgres backend executes queries, check the
>> blacklist in shared memory to determine whether the query is allowed(before
>> execute) or need rollback(is executing)?
>>
>> If we implemented disk quota as an extension, we could just use
>> background worker to start diskquota launcher process and use
>> RegisterDynamicBackgroundWorker() to fork child diskquota worker
>> processes by the launcher process as suggested by @Chapman.
>> Diskquota setting could be stored in user table in a separate schema for
>> each database(Schema and table created by create extension statement) just
>> like what Heikki has done in pg_quota project. But in this case, we need to
>> create extension for each database before diskquota worker process can be
>> set up for that database.
>>
>> Any comments on the above design and which is preferred, native feature
>> or extension as the POC?
>>
>>
>> -- Hubert
>>
>>
>>
>> On Fri, Aug 31, 2018 at 3:32 AM, Pavel Stehule 
>> wrote:
>>
>>>
>>>
>>> 2018-08-30 16:22 GMT+02:00 Chapman Flack :
>>>
 On 08/30/2018 09:57 AM, Hubert Zhang wrote:

 > 2 Keep one worker process for each database. But using a parent/global
 > quota worker process to manage the lifecycle of database level worker
 > processes. It could handle the newly created database(avoid restart
 > database) and save resource when a database is not used. But this
 needs to
 > change worker process to be hierarchical. Postmaster becomes the
 grandfather
 >  of database level worker processes in this case.

 I am using background workers this way in 9.5 at $work.

 In my case, one worker lives forever, wakes up on a set period, and
 starts a short-lived worker for every database, waiting for each
 one before starting the next.

 It was straightforward to implement. Looking back over the code,
 I see the global worker assigns its own PID to 

Re: [PATCH] Improve geometric types

2018-09-02 Thread Tomas Vondra
On 08/17/2018 11:23 PM, Tom Lane wrote:
> Tomas Vondra  writes:
>> Hmm, yeah. Based on past experience, the powerpc machines are likely to
>> stumble on this.
> 
>> FWIW my understanding is that these failures actually happen in new
>> tests, it's not an issue introduced by this patch series.
> 
> Yeah, we've definitely hit such problems before.  The geometric logic
> seems particularly prone to it because it's doing more and subtler
> float arithmetic than the rest of the system ... but it's not the sole
> source of minus-zero issues.
> 

It's not entirely clear to me what the best way forward is here.

We can go through the patch and fix places with obvious -0 risks, but
then what? I don't have access to any powerpc machines, so I can't check
whether there are any other failures. So the only thing I could do is commit
and then fix based on buildfarm failures ...
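
For anyone following along, the platform-dependent failures being discussed
typically come down to the sign of a floating-point zero leaking into
regression output.  A minimal standalone illustration (not taken from the
patch); which sign shows up in intermediate results is exactly the kind of
thing that differs across buildfarm members:

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double	pz = 0.0;
	double	nz = -0.0;

	printf("pz == nz        : %d\n", pz == nz);			/* 1: they compare equal */
	printf("signbit(nz)     : %d\n", signbit(nz) != 0);	/* 1: but the sign is still there */
	printf("pz * -1.0       : %g\n", pz * -1.0);			/* prints -0 */
	printf("atan2(nz, -1.0) : %g\n", atan2(nz, -1.0));	/* -pi here, +pi for a +0 input */
	return 0;
}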

Emre, any better ideas?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: memory leak when serializing TRUNCATE in reorderbuffer

2018-09-02 Thread Tomas Vondra
I've pushed this, with some minor tweaks, and backpatched to 11 (which is
where TRUNCATE decoding was introduced).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: libpq debug log

2018-09-02 Thread Tom Lane
Robert Haas  writes:
> On Fri, Aug 24, 2018 at 9:48 AM, Tom Lane  wrote:
>> PQtrace() is utterly useless for anything except debugging libpq
>> internals, and it's not tremendously useful even for that.  Don't
>> bother with that part.

> I think that improving the output format could help with that a lot.
> What it currently produces is almost unreadable; adjusting it to emit
> one line per protocol message would, I think, help a lot.  There are
> probably other improvements that could be made at the same time.

I wouldn't mind throwing it out and reimplementing it ;-) ... tracing
at the logical message level rather than the byte level would help.
But still, I'm not really convinced that it has very much to do with
what you'd want in a user-level debug log.

Part of what's really bad about the PQtrace output is that in v2 protocol
the same output can be repeated several times as we try to parse a message
and conclude we don't have it all yet.  I believe that problem is gone
in v3, but it may be hard to do a consistent redesign until we nuke libpq's
v2 support.  Still, it might be past time for the latter, seeing that
we no longer promise pre-7.4 compatibility in either psql or pg_dump.
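
For context, the hook itself is only a stream redirect; the complaints above
are all about what libpq writes to that stream.  A minimal usage sketch (the
connection string is just an example):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");	/* example connection string */
	FILE	   *trace = fopen("libpq.trace", "w");

	if (PQstatus(conn) == CONNECTION_OK && trace != NULL)
	{
		PQtrace(conn, trace);			/* dump the raw protocol traffic to the file */
		PQclear(PQexec(conn, "SELECT 1"));
		PQuntrace(conn);
	}

	if (trace)
		fclose(trace);
	PQfinish(conn);
	return 0;
}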

regards, tom lane



Re: libpq debug log

2018-09-02 Thread Robert Haas
On Fri, Aug 24, 2018 at 9:48 AM, Tom Lane  wrote:
> PQtrace() is utterly useless for anything except debugging libpq
> internals, and it's not tremendously useful even for that.  Don't
> bother with that part.

I think that improving the output format could help with that a lot.
What it currently produces is almost unreadable; adjusting it to emit
one line per protocol message would, I think, help a lot.  There are
probably other improvements that could be made at the same time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_dump --load-via-partition-root vs. parallel restore

2018-09-02 Thread Tom Lane
Robert Haas  writes:
> On Tue, Aug 28, 2018 at 3:53 PM, Tom Lane  wrote:
>> Parallel pg_restore generally assumes that the archive file is telling it
>> the truth about data dependencies; in particular, that a TABLE DATA item
>> naming a particular target table is loading data into exactly that table.
>> --load-via-partition-root creates a significant probability that that
>> assumption is wrong, at least in scenarios where the data really does get
>> redirected into other partitions than the original one.  This can result
>> in inefficiencies (e.g., index rebuild started before a table's data is
>> really all loaded) or outright failures (foreign keys or RLS policies
>> applied before the data is all loaded).  I suspect that deadlock failures
>> during restore are also possible, since identify_locking_dependencies
>> is not going to be nearly close to the truth about which operations
>> might hold which locks.

> Hmm.  I had the idea that this wasn't a problem because I thought we
> had all of pg_dump strictly separated into pre-data, data, and
> post-data; therefore, I thought it would be the case that none of that
> other stuff would happen until all table data was loaded.  From what
> you are saying here, I guess that's not the case?

We don't run restore operations in parallel during the pre-data phase,
mainly because we lack faith that the dependencies are really completely
represented for those objects (particularly in older archives).  Once we
get to parallel operation, though, everything is schedulable as soon as
its dependencies are satisfied; so it's possible for post-data objects
to run before supposedly-unrelated data objects are done.  There's no
explicit representation of the data/post-data boundary in the archive's
dependencies.

regards, tom lane



Re: pg_dump --load-via-partition-root vs. parallel restore

2018-09-02 Thread Robert Haas
On Tue, Aug 28, 2018 at 3:53 PM, Tom Lane  wrote:
> Parallel pg_restore generally assumes that the archive file is telling it
> the truth about data dependencies; in particular, that a TABLE DATA item
> naming a particular target table is loading data into exactly that table.
> --load-via-partition-root creates a significant probability that that
> assumption is wrong, at least in scenarios where the data really does get
> redirected into other partitions than the original one.  This can result
> in inefficiencies (e.g., index rebuild started before a table's data is
> really all loaded) or outright failures (foreign keys or RLS policies
> applied before the data is all loaded).  I suspect that deadlock failures
> during restore are also possible, since identify_locking_dependencies
> is not going to be nearly close to the truth about which operations
> might hold which locks.

Hmm.  I had the idea that this wasn't a problem because I thought we
had all of pg_dump strictly separated into pre-data, data, and
post-data; therefore, I thought it would be the case that none of that
other stuff would happen until all table data was loaded.  From what
you are saying here, I guess that's not the case?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: typcache.c typos

2018-09-02 Thread David Rowley
On 3 September 2018 at 10:24, Robert Haas  wrote:
> On Tue, Aug 28, 2018 at 9:39 AM, Tom Lane  wrote:
>> I don't see any typos here; it looks to me more like a difference of
>> opinion about whether "information" is singular or plural.  I don't
>> really feel a need to change the existing wording.
>
> +1.  "To speed up lookup" would be OK but I don't think "to speed
> lookup" is wrong, and "to speedup lookup" does sound wrong.

Not sure where changing it to "to speedup lookup" was proposed.  In
any case, it seems most people are happy with the way it is. That's fine
by me.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: logical decoding: ABI break in 10.5 et al

2018-09-02 Thread Robert Haas
On Tue, Aug 28, 2018 at 11:38 AM, Alvaro Herrera
 wrote:
> [TL;DR: I propose to set up abidiff.postgresql.org.]

Nice idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: typcache.c typos

2018-09-02 Thread Robert Haas
On Tue, Aug 28, 2018 at 9:39 AM, Tom Lane  wrote:
>> Oops. I mistakenly attached the incorrect patch. The correct one is
>> attached to this email.
>
> I don't see any typos here; it looks to me more like a difference of
> opinion about whether "information" is singular or plural.  I don't
> really feel a need to change the existing wording.

+1.  "To speed up lookup" would be OK but I don't think "to speed
lookup" is wrong, and "to speedup lookup" does sound wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

2018-09-02 Thread Robert Haas
On Mon, Aug 27, 2018 at 11:38 AM, Alexander Korotkov
 wrote:
> The aspect I'm more concerned about here is whether we lose the ability
> to detect some IO errors if we don't distinguish new pages
> from pages whose tuples were removed by vacuum.

My main concern is correctness.  If we ever have valid-looking buffers
in shared_buffers after the corresponding data has been truncated away
on disk, we've got to make sure that nobody ever confuses one of them
with an actually-valid buffer.  Reading over your algorithm, I can't
convince myself that you have that case nailed down tightly enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why hash OIDs?

2018-09-02 Thread Robert Haas
On Tue, Aug 28, 2018 at 8:02 PM, Tom Lane  wrote:
> I think this argument is a red herring TBH.  The example Robert shows is
> of *zero* interest for dynahash or catcache, unless it's taking only the
> low order 3 bits of the OID for the bucket number.  But actually we'll
> increase the table size proportionally to the number of entries, so
> that you can't have say 1000 table entries without at least 10 bits
> being used for the bucket number.  That means that you'd only have
> trouble if those 1000 tables all had OIDs exactly 1K (or some multiple
> of that) apart.  Such a case sounds quite contrived from here.

Hmm.  I was thinking that it was a problem if the number of OIDs
consumed per table was a FACTOR of 1000, not just if it was a POWER of
1000.  I mean, if it's, say, 4, that means three-quarters of your hash
table buckets are unused, which seems poor.  But maybe it's not really
a big enough problem in practice for us to care?  Dunno.
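
As a toy model of that concern (dynahash and catcache do hash the key, as Tom
notes, so this only illustrates the identity-mapping scenario): with 1024
buckets and table OIDs spaced 4 apart, only a quarter of the buckets are ever
used.

#include <stdio.h>

int
main(void)
{
	const unsigned nbuckets = 1024;	/* power-of-two table size */
	const unsigned stride = 4;		/* OIDs consumed per table */
	int			used[1024] = {0};
	unsigned	oid;
	unsigned	i;
	int			nused = 0;

	/* identity-style bucket mapping: bucket = oid mod nbuckets */
	for (oid = 16384; oid < 16384 + 1000 * stride; oid += stride)
		used[oid % nbuckets] = 1;

	for (i = 0; i < nbuckets; i++)
		nused += used[i];

	printf("buckets used: %d of %u\n", nused, nbuckets);	/* prints 256 of 1024 */
	return 0;
}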

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Stored procedures and out parameters

2018-09-02 Thread Robert Haas
On Thu, Aug 30, 2018 at 7:45 PM, Chapman Flack  wrote:
> On 08/30/18 15:35, Robert Haas wrote:
>> On Tue, Aug 28, 2018 at 6:30 AM, Peter Eisentraut
>>  wrote:
>>> CALL compatible with the SQL standard.  For example, if you have a
>>> function f1(IN a int, OUT b int), you would call it as SELECT f1(x)
>>> and the "b" would somehow be the return value.  But a procedure call
>>> would be CALL p1(x, y), where x and y could be, say, PL/pgSQL
>>> variables.
>
> I suppose the key question for most driver writers is going to be,
> what does that difference look like at the fe-be protocol level?

I don't think the issue is so much the FE-BE protocol level as the SQL syntax.

> PL/pgSQL might be an unrepresentative example for that question,
> as it lives in the backend and could have some other way of retrieving
> b to store in y. For any remote client, the result still needs to get
> back there before the client can apply any "this result gets assigned
> to my y variable" semantics, and is there any material difference between
> the protocol message sequences that return these results
>
>   select foo(1,2);
>   select * from foo(1,2);
>   call bar(1,2);
>
> to the client? And, in the parallel universe where functions got
> implemented according to the standard, what in that picture would
> be different?

You may (or may not) be missing the point here.  Your first two
examples do not obviously involve OUT parameters, although in theory
they could, since whatever OUT parameters exist are going to show up
in the third one.  The third one definitely does not, since CALL
apparently would require a variable that could be set to be passed as
an argument.  I'm not actually sure how it's supposed to work, because
the documentation for CALL addresses neither the proper usage nor the
incompatibility as compared with functions:

https://www.postgresql.org/docs/11/static/sql-call.html

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Stored procedures and out parameters

2018-09-02 Thread Robert Haas
On Thu, Aug 30, 2018 at 4:14 PM, Dave Cramer  wrote:
> Reading this from the (JDBC) driver's perspective, which is probably a fairly
> popular one: we now have a standard that we can't really support. Either the
> driver will have to support the new PROCEDURE with the {call } mechanism or
> stay with the existing FUNCTIONS.
> This puts the drivers in a no-win situation.

I understand and I agree.

> Undoubtedly.. surely the opportunity to do something about this has not
> passed as this has not been
> officially released ?

I agree with that, too, but I can't make other people do things they
don't want to do, and then there's the question of time.  On the basis
of previous experience, there is going to be little appetite for
holding up the release to address this issue no matter how badly
busted it is.  Ultimately, it's the RMT that must decide what to do in
cases like this.  I have confidence that they are watching this
thread, but I don't know what they will decide to do about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Incorrect use of errcode_for_file_access in backend code

2018-09-02 Thread Michael Paquier
Hi all,

While looking at another patch for slot.c, I noticed what looks
like an incorrect use of errcode_for_file_access:
- errcode_for_file_access() is used with rmtree(), which makes no sense
because that function comes from common/rmtree.c and already emits a warning
using %m.
- errcode_for_file_access() is called where corrupted data is being reported,
instead of ERRCODE_DATA_CORRUPTED.

Wouldn't something like the attached provide more appropriate error
handling?  That's mostly error-handling beautification, so I would be
inclined to just fix HEAD.
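
In other words, the two patterns being distinguished look roughly like this
(a sketch of backend-style usage, not the actual hunks from the attached
patch):

#include "postgres.h"

/* Failed system call: derive the SQLSTATE from errno and report it via %m. */
static void
report_write_failure(const char *path)
{
	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not write to file \"%s\": %m", path)));
}

/* Corrupted on-disk content: errno is meaningless, so use a fixed error code. */
static void
report_bad_magic(const char *path, uint32 magic, uint32 expected)
{
	ereport(PANIC,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg("replication slot file \"%s\" has wrong magic number: %u instead of %u",
					path, magic, expected)));
}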

(I have noticed some inconsistent error string format in autoprewarm.c
on the way.)

Thoughts?
--
Michael
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index cc5e2dd89c..03bf90ce2d 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -641,7 +641,7 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged)
 		errno = save_errno;
 		ereport(ERROR,
 (errcode_for_file_access(),
- errmsg("could not write to file \"%s\" : %m",
+ errmsg("could not write to file \"%s\": %m",
 		transient_dump_file_path)));
 	}
 
@@ -664,7 +664,7 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged)
 			errno = save_errno;
 			ereport(ERROR,
 	(errcode_for_file_access(),
-	 errmsg("could not write to file \"%s\" : %m",
+	 errmsg("could not write to file \"%s\": %m",
 			transient_dump_file_path)));
 		}
 	}
@@ -684,7 +684,7 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged)
 		errno = save_errno;
 		ereport(ERROR,
 (errcode_for_file_access(),
- errmsg("could not close file \"%s\" : %m",
+ errmsg("could not close file \"%s\": %m",
 		transient_dump_file_path)));
 	}
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index cb781e6e9a..800ca14488 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -628,8 +628,7 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
 	 */
 	if (!rmtree(tmppath, true))
 		ereport(WARNING,
-(errcode_for_file_access(),
- errmsg("could not remove directory \"%s\"", tmppath)));
+(errmsg("could not remove directory \"%s\"", tmppath)));
 
 	/*
 	 * We release this at the very end, so that nobody starts trying to create
@@ -1132,8 +1131,8 @@ StartupReplicationSlots(void)
 			if (!rmtree(path, true))
 			{
 ereport(WARNING,
-		(errcode_for_file_access(),
-		 errmsg("could not remove directory \"%s\"", path)));
+		(errmsg("could not remove directory \"%s\"",
+path)));
 continue;
 			}
 			fsync_fname("pg_replslot", true);
@@ -1432,21 +1431,21 @@ RestoreSlotFromDisk(const char *name)
 	/* verify magic */
 	if (cp.magic != SLOT_MAGIC)
 		ereport(PANIC,
-(errcode_for_file_access(),
+(errcode(ERRCODE_DATA_CORRUPTED),
  errmsg("replication slot file \"%s\" has wrong magic number: %u instead of %u",
 		path, cp.magic, SLOT_MAGIC)));
 
 	/* verify version */
 	if (cp.version != SLOT_VERSION)
 		ereport(PANIC,
-(errcode_for_file_access(),
+(errcode(ERRCODE_DATA_CORRUPTED),
  errmsg("replication slot file \"%s\" has unsupported version %u",
 		path, cp.version)));
 
 	/* boundary check on length */
 	if (cp.length != ReplicationSlotOnDiskV2Size)
 		ereport(PANIC,
-(errcode_for_file_access(),
+(errcode(ERRCODE_DATA_CORRUPTED),
  errmsg("replication slot file \"%s\" has corrupted length %u",
 		path, cp.length)));
 
@@ -1496,8 +1495,8 @@ RestoreSlotFromDisk(const char *name)
 		if (!rmtree(slotdir, true))
 		{
 			ereport(WARNING,
-	(errcode_for_file_access(),
-	 errmsg("could not remove directory \"%s\"", slotdir)));
+	(errmsg("could not remove directory \"%s\"",
+			slotdir)));
 		}
 		fsync_fname("pg_replslot", true);
 		return;




Re: Bug in slot.c and are replication slots ever used at Window?

2018-09-02 Thread Michael Paquier
On Fri, Aug 31, 2018 at 11:46:47AM -0700, Michael Paquier wrote:
> The checks will not be included in the final fix, still they look useful
> so I am planning to start a new thread on the matter as perhaps other
> folks have more and/or better ideas.

Pushed the fix down to 9.4, without the extra sanity checks.
--
Michael




Re: pg_constraint.conincluding is useless

2018-09-02 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Alvaro Herrera  writes:
> > This requires a catversion bump, for which it may seem a bit late;
> > however I think it's better to release pg11 without a useless catalog
> > column only to remove it in pg12 ...

Yeah, I really don't think we want to have another catalog column that
we end up wanting to remove later on, if we can avoid it.

> Catversion bumps during beta are routine.  If we had put out rc1
> I'd say it was too late, but we have not.

I agree that rc1 would be too late.  On the flip side, I don't think
we should really consider catversion bumps during beta to be 'routine'.

> If we do do a bump for beta4, I'd be strongly tempted to address the
> lack of a unique index for pg_constraint as well, cf
> https://www.postgresql.org/message-id/10110.1535907...@sss.pgh.pa.us

All that said, given that we've got two pretty good reasons for a
catversion bump, and one is to remove a useless column before it ever
gets in a release, I'm +1 for making both of these changes.

Thanks!

Stephen




Re: pg_constraint.conincluding is useless

2018-09-02 Thread Michael Paquier
On Sun, Sep 02, 2018 at 01:27:25PM -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > This requires a catversion bump, for which it may seem a bit late;
> > however I think it's better to release pg11 without a useless catalog
> > column only to remove it in pg12 ...
> 
> Catversion bumps during beta are routine.  If we had put out rc1
> I'd say it was too late, but we have not.

At the same time, covering indexes are a new thing, so if the timing
allows, let's move forward with a cleaner catalog layer from the
start; that's fewer compatibility breakages to justify later on.
Hopefully.

> If we do do a bump for beta4, I'd be strongly tempted to address the
> lack of a unique index for pg_constraint as well, cf
> https://www.postgresql.org/message-id/10110.1535907...@sss.pgh.pa.us

Yeah...  I looked at the thread.
--
Michael




Re: pg_constraint.conincluding is useless

2018-09-02 Thread Tom Lane
Alvaro Herrera  writes:
> This requires a catversion bump, for which it may seem a bit late;
> however I think it's better to release pg11 without a useless catalog
> column only to remove it in pg12 ...

Catversion bumps during beta are routine.  If we had put out rc1
I'd say it was too late, but we have not.

If we do do a bump for beta4, I'd be strongly tempted to address the
lack of a unique index for pg_constraint as well, cf
https://www.postgresql.org/message-id/10110.1535907...@sss.pgh.pa.us

regards, tom lane



pg_constraint.conincluding is useless

2018-09-02 Thread Alvaro Herrera
Hi

I already mentioned this in 
https://postgr.es/m/20180831205020.nxhw6ypysgshjtnl@alvherre.pgsql

While trying to add support for foreign keys to partitioned tables, I
noticed that commit 8224de4f42cc ("Indexes with INCLUDE columns and
their support in B-tree") added a column to pg_constraint that appears
to be there only to enable ruleutils.c to print out the list of columns
in a PRIMARY KEY or UNIQUE constraint that uses included columns.
However, this is pretty easy to obtain from pg_index.conkey instead, so
I claim that that column is useless.  In fact, here's a patch to remove
it.

This requires a catversion bump, for which it may seem a bit late;
however, I think it's better to release pg11 without the useless catalog
column than to ship it only to remove it in pg12 ...

Thoughts?

-- 
Álvaro Herrera
>From 5e9c74c4dc59afb348b8b3ea5233b7cbf43c0616 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Sun, 2 Sep 2018 13:35:46 -0300
Subject: [PATCH] remove conincluding

---
 doc/src/sgml/catalogs.sgml|  8 
 src/backend/catalog/pg_constraint.c   | 21 --
 src/backend/utils/adt/ruleutils.c | 57 +--
 src/include/catalog/pg_constraint.h   |  6 ---
 src/test/regress/expected/index_including.out | 40 +--
 src/test/regress/sql/index_including.sql  | 10 ++---
 6 files changed, 70 insertions(+), 72 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 07e8b3325f..0179deea2e 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2374,14 +2374,6 @@ SCRAM-SHA-256$iteration count:
  
 
  
-  conincluding
-  int2[]
-  pg_attribute.attnum
-  List of the non-constrained columns which are included into
-the same index as the constrained columns
- 
-
- 
   confkey
   int2[]
   pg_attribute.attnum
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index 7a6d158f89..ea84441360 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -85,7 +85,6 @@ CreateConstraintEntry(const char *constraintName,
boolnulls[Natts_pg_constraint];
Datum   values[Natts_pg_constraint];
ArrayType  *conkeyArray;
-   ArrayType  *conincludingArray;
ArrayType  *confkeyArray;
ArrayType  *conpfeqopArray;
ArrayType  *conppeqopArray;
@@ -116,21 +115,6 @@ CreateConstraintEntry(const char *constraintName,
else
conkeyArray = NULL;
 
-   if (constraintNTotalKeys > constraintNKeys)
-   {
-   Datum  *conincluding;
-   int j = 0;
-   int constraintNIncludedKeys = constraintNTotalKeys - constraintNKeys;
-
-   conincluding = (Datum *) palloc(constraintNIncludedKeys * sizeof(Datum));
-   for (i = constraintNKeys; i < constraintNTotalKeys; i++)
-   conincluding[j++] = Int16GetDatum(constraintKey[i]);
-   conincludingArray = construct_array(conincluding, constraintNIncludedKeys,
-   INT2OID, 2, true, 's');
-   }
-   else
-   conincludingArray = NULL;
-
if (foreignNKeys > 0)
{
Datum  *fkdatums;
@@ -204,11 +188,6 @@ CreateConstraintEntry(const char *constraintName,
else
nulls[Anum_pg_constraint_conkey - 1] = true;
 
-   if (conincludingArray)
-   values[Anum_pg_constraint_conincluding - 1] = PointerGetDatum(conincludingArray);
-   else
-   nulls[Anum_pg_constraint_conincluding - 1] = true;
-
if (confkeyArray)
values[Anum_pg_constraint_confkey - 1] = PointerGetDatum(confkeyArray);
else
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 03e9a28a63..bd0792e13e 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -315,7 +315,7 @@ static char *deparse_expression_pretty(Node *expr, List *dpcontext,
 static char *pg_get_viewdef_worker(Oid viewoid,
  int prettyFlags, int wrapColumn);
 static char *pg_get_triggerdef_worker(Oid trigid, bool pretty);
-static void decompile_column_index_array(Datum column_index_array, Oid relId,
+static int decompile_column_index_array(Datum column_index_array, Oid relId,
 StringInfo buf);
 static char *pg_get_ruledef_worker(Oid ruleoid, int prettyFlags);
 static char *pg_get_indexdef_worker(Oid indexrelid, int colno,
@@ -2055,6 +2055,8 @@ pg_get_constraintdef_worker(Oid constraintId, bool fullCommand,
Datum   val;
boolisnull;

Re[3]: Adding a note to protocol.sgml regarding CopyData

2018-09-02 Thread Fabien COELHO



Hello Bradley & Tatsuo-san,


... references to the protocol version lacks homogeneity.
... I'd suggest to keep "the vX.0 protocol" for a short version,
and "the version X.0 protocol" for long ...


I agree. Change made.


Patch applies cleanly. Doc build ok.

One part talks about "terminating line", the other of "termination line". 
I wonder whether one is better than the other?


--
Fabien.



Re: Proposal for disk quota feature

2018-09-02 Thread Pavel Stehule
Hi

2018-09-02 14:18 GMT+02:00 Hubert Zhang :

> Thanks Chapman.
> @Pavel,  could you please explain more about your second suggestion "implement
> some quotas on storage level?"
>

See attached patch - it is very simple - and good enough for our purposes.

Regards

Pavel



> We will not keep the long-lived processes attach to all databases(just
> like you mentioned servers with thousands of databases)
> And you are right, we could share ideas with autovacuum process, fork
> worker processes in need.
> "autovacuum checks for tables that have had a large number of inserted,
> updated or deleted tuples. These checks use the statistics collection
> facility"
> diskquota process is similar to autovacuum at caring about insert, but the
> difference is that it also care about vucuum full, truncate and drop. While
> update and delete may not be interested since no file change happens. So a
> separate diskquota process is preferred.
>
> So if we implemented disk quota as a full native feature, and in the
> first initial version I prefer to implement the following features:
> 1 Fork diskquota launcher process under Postmaster serverloop, which is
> long-lived.
> 2 Diskquota launcher process is responsible for creating diskquota worker
> process for every database.
> 3 DIskquota setting is stored in a separate catalog table for each
> database.
> 4 Initialization stage, Diskquota launcher process creates diskquota worker
> process for all the databases(traverse like autovacuum). Worker process
> calculates disk usage of db objects and their diskquota setting. If any
> db object exceeds its quota limit, put them into the blacklist in the
> shared memory, which will later be used by enforcement operator. Worker
> process exits when works are done.
> 5 Running stage, Diskquota launcher process creates diskquota worker
> process for the database with a large number of insert, copy, truncate,
> drop etc. or create disk quota statement. Worker process updates the file
> size for db objects containing the result relation, and compare with the
> diskquota setting. Again, if exceeds quota limit, put them into blacklist,
> remove from blacklist vice versa. Worker process exits when works are
> done and a GUC could control the frequency of worker process restart to a
> specific database. As you know, this GUC also controls the delay when we do
> enforcement.
> 6 Enforcement. When postgres backend executes queries, check the blacklist
> in shared memory to determine whether the query is allowed(before execute)
> or need rollback(is executing)?
>
> If we implemented disk quota as an extension, we could just use background
> worker to start diskquota launcher process and use
> RegisterDynamicBackgroundWorker() to fork child diskquota worker
> processes by the launcher process as suggested by @Chapman.
> Diskquota setting could be stored in user table in a separate schema for
> each database(Schema and table created by create extension statement) just
> like what Heikki has done in pg_quota project. But in this case, we need to
> create extension for each database before diskquota worker process can be
> set up for that database.
>
> Any comments on the above design and which is preferred, native feature or
> extension as the POC?
>
>
> -- Hubert
>
>
>
> On Fri, Aug 31, 2018 at 3:32 AM, Pavel Stehule 
> wrote:
>
>>
>>
>> 2018-08-30 16:22 GMT+02:00 Chapman Flack :
>>
>>> On 08/30/2018 09:57 AM, Hubert Zhang wrote:
>>>
>>> > 2 Keep one worker process for each database. But using a parent/global
>>> > quota worker process to manage the lifecycle of database level worker
>>> > processes. It could handle the newly created database(avoid restart
>>> > database) and save resource when a database is not used. But this
>>> needs to
>>> > change worker process to be hierarchical. Postmaster becomes the
>>> grandfather
>>> >  of database level worker processes in this case.
>>>
>>> I am using background workers this way in 9.5 at $work.
>>>
>>> In my case, one worker lives forever, wakes up on a set period, and
>>> starts a short-lived worker for every database, waiting for each
>>> one before starting the next.
>>>
>>> It was straightforward to implement. Looking back over the code,
>>> I see the global worker assigns its own PID to worker.bgw_notify_pid
>>> of each of its children, and also obtains a handle for each child
>>> from RegisterDynamicBackgroundWorker().
>>>
>>> I imagine the global quota worker would prefer to start workers
>>> for every database and then just wait for notifications from any
>>> of them, but that seems equally straightforward at first glance.
>>>
>>
>> There are servers with thousands databases. Worker per database is not
>> good idea.
>>
>> It should to share ideas, code with autovacuum process.
>>
>> Not sure, how to effective implementation based on bg workers can be. On
>> servers with large set of databases, large set of tables it can identify
>> too big table too late.
>>
>> Isn't better to implement 

Re: Proposal for disk quota feature

2018-09-02 Thread Hubert Zhang
Thanks Chapman.
@Pavel, could you please explain more about your second suggestion,
"implement some quotas on storage level"?
We will not keep long-lived processes attached to all databases (just as you
mentioned, there are servers with thousands of databases).
And you are right, we could share ideas with the autovacuum process and fork
worker processes as needed.
"autovacuum checks for tables that have had a large number of inserted,
updated or deleted tuples. These checks use the statistics collection
facility"
The diskquota process is similar to autovacuum in caring about inserts, but
the difference is that it also cares about VACUUM FULL, TRUNCATE and DROP,
while UPDATE and DELETE may not be of interest since no file change happens.
So a separate diskquota process is preferred.

So, if we implemented disk quota as a full native feature, in the first
version I would prefer to implement the following:
1 Fork a diskquota launcher process from the Postmaster server loop; this
process is long-lived.
2 The diskquota launcher process is responsible for creating a diskquota
worker process for every database.
3 The diskquota settings are stored in a separate catalog table for each
database.
4 Initialization stage: the diskquota launcher process creates a diskquota
worker process for every database (traversing them like autovacuum). The
worker process calculates the disk usage of db objects and reads their
diskquota settings. If any db object exceeds its quota limit, it is put into
a blacklist in shared memory, which will later be used by the enforcement
operator. The worker process exits when its work is done.
5 Running stage: the diskquota launcher process creates a diskquota worker
process for any database with a large number of INSERT, COPY, TRUNCATE,
DROP etc., or a "create disk quota" statement. The worker process updates
the file sizes for db objects containing the result relation and compares
them with the diskquota settings. Again, if a quota limit is exceeded, the
object is put into the blacklist, and removed from the blacklist otherwise.
The worker process exits when its work is done, and a GUC could control how
frequently a worker process is restarted for a specific database. As you
know, this GUC also controls the delay before we do enforcement.
6 Enforcement: when a postgres backend executes queries, it checks the
blacklist in shared memory to determine whether the query is allowed (before
execution) or needs to be rolled back (while executing).

If we implemented disk quota as an extension, we could just use a background
worker to start the diskquota launcher process and use
RegisterDynamicBackgroundWorker() to fork child diskquota worker processes
from the launcher process, as suggested by @Chapman (a rough sketch follows
below). The diskquota settings could be stored in a user table in a separate
schema for each database (schema and table created by the CREATE EXTENSION
statement), just like what Heikki has done in the pg_quota project. But in
this case, we would need to create the extension in each database before a
diskquota worker process can be set up for that database.

Any comments on the above design, and which is preferred as the POC, native
feature or extension?
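
A rough sketch of how the extension variant could wire up the launcher and
per-database workers with the background worker API.  The library name, the
entry-point names, and the restart interval below are made-up placeholders,
error handling is omitted, and registering the launcher in _PG_init like this
assumes the library is listed in shared_preload_libraries.

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

PG_MODULE_MAGIC;

/* Hypothetical worker entry points, implemented elsewhere in the extension. */
extern void diskquota_launcher_main(Datum main_arg);
extern void diskquota_worker_main(Datum main_arg);

/* Register the long-lived diskquota launcher when the library is loaded. */
void
_PG_init(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = 60;	/* restart the launcher if it dies */
	snprintf(worker.bgw_name, BGW_MAXLEN, "diskquota launcher");
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "diskquota");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "diskquota_launcher_main");
	worker.bgw_notify_pid = 0;

	RegisterBackgroundWorker(&worker);
}

/* Called from the launcher: start a short-lived worker for one database. */
static bool
launch_worker_for_database(Oid dbid)
{
	BackgroundWorker worker;
	BackgroundWorkerHandle *handle;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_name, BGW_MAXLEN, "diskquota worker %u", dbid);
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "diskquota");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "diskquota_worker_main");
	worker.bgw_main_arg = ObjectIdGetDatum(dbid);
	worker.bgw_notify_pid = MyProcPid;	/* so the launcher hears about start/exit */

	return RegisterDynamicBackgroundWorker(&worker, &handle);
}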


-- Hubert



On Fri, Aug 31, 2018 at 3:32 AM, Pavel Stehule 
wrote:

>
>
> 2018-08-30 16:22 GMT+02:00 Chapman Flack :
>
>> On 08/30/2018 09:57 AM, Hubert Zhang wrote:
>>
>> > 2 Keep one worker process for each database. But using a parent/global
>> > quota worker process to manage the lifecycle of database level worker
>> > processes. It could handle the newly created database(avoid restart
>> > database) and save resource when a database is not used. But this needs
>> to
>> > change worker process to be hierarchical. Postmaster becomes the
>> grandfather
>> >  of database level worker processes in this case.
>>
>> I am using background workers this way in 9.5 at $work.
>>
>> In my case, one worker lives forever, wakes up on a set period, and
>> starts a short-lived worker for every database, waiting for each
>> one before starting the next.
>>
>> It was straightforward to implement. Looking back over the code,
>> I see the global worker assigns its own PID to worker.bgw_notify_pid
>> of each of its children, and also obtains a handle for each child
>> from RegisterDynamicBackgroundWorker().
>>
>> I imagine the global quota worker would prefer to start workers
>> for every database and then just wait for notifications from any
>> of them, but that seems equally straightforward at first glance.
>>
>
> There are servers with thousands databases. Worker per database is not
> good idea.
>
> It should to share ideas, code with autovacuum process.
>
> Not sure, how to effective implementation based on bg workers can be. On
> servers with large set of databases, large set of tables it can identify
> too big table too late.
>
> Isn't better to implement some quotas on storage level?
>
> Regards
>
> Pavel
>
>
>
>> -Chap
>>
>>
>


-- 
Thanks

Hubert Zhang