Re: Should we warn against using too many partitions?

2019-06-07 Thread Justin Pryzby
I made another pass, hopefully it's useful and not too much of a pain.

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index cce1618fc1..be2ca3be48 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4674,6 +4675,88 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate >= DATE '2008-01-01';


   
+  
+  
+   Declarative Partitioning Best Practices
+
+   
+The choice of how to partition a table should be made carefully as the
+performance of query planning and execution can be negatively affected by
+poorly made design decisions.

Maybe just "poor design"

+partitioned table.  WHERE clause items that match and
+are compatible with the partition key can be used to prune away unneeded

remove "away" ?

+requirements for the PRIMARY KEY or a
+UNIQUE constraint.  Removal of unwanted data is also
+a factor to consider when planning your partitioning strategy as an entire
+partition can be removed fairly quickly.  However, if data that you want

Can we just say "dropped" ?  On my first (re)reading, I briefly thought this
was now referring to "pruning" as "removal".

+to keep exists in that partition then that means having to resort to using
+DELETE instead of removing the partition.
+   
+
+   
+Choosing the target number of partitions by which the table should be
+divided into is also a critical decision to make.  Not having enough

Should be: ".. target number .. into which .. should be divided .."

+partitions may mean that indexes remain too large and that data locality
+remains poor which could result in poor cache hit ratios.  However,

Change the 2nd remains to "is" and the second poor to "low" ?

+dividing the table into too many partitions can also cause issues.
+Too many partitions can mean slower query planning times and higher memory

s/slower/longer/

+consumption during both query planning and execution.  It's also important
+to consider what changes may occur in the future when choosing how to
+partition your table.  For example, if you choose to have one partition

Remove "when choosing ..."?  Or say:

|When choosing how to partition your table, it's also important to consider
|what changes may occur in the future.

+per customer and you currently have a small number of large customers,
+what will the implications be if in several years you obtain a large
+number of small customers.  In this case, it may be better to choose to
+partition by HASH and choose a reasonable number of
+partitions rather than trying to partition by LIST and
+hoping that the number of customers does not increase significantly over
+time.
+   

It's an unusual thing for which to hope :)
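
For illustration only, a minimal sketch of the HASH layout described above
(the table and column names here are invented, not taken from the patch):

-- invented example schema
CREATE TABLE customers (
    customer_id   bigint NOT NULL,
    customer_name text
) PARTITION BY HASH (customer_id);

CREATE TABLE customers_p0 PARTITION OF customers
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE customers_p1 PARTITION OF customers
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE customers_p2 PARTITION OF customers
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE customers_p3 PARTITION OF customers
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- New customers simply hash into one of the existing partitions, so the
-- partition count stays fixed no matter how the customer base changes.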

+   
+Sub-partitioning can be useful to further divide partitions that are
+expected to become larger than other partitions, although excessive
+sub-partitioning can easily lead to large numbers of partitions and can
+cause the problems mentioned in the preceding paragraph.
+   

cause the SAME problems ?
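
As a sketch of what that sub-partitioning shape looks like (using the
measurement example from elsewhere in ddl.sgml with the column list trimmed;
illustrative only, not part of the patch):

CREATE TABLE measurement (
    logdate   date NOT NULL,
    peaktemp  int
) PARTITION BY RANGE (logdate);

-- A partition expected to grow much larger than its siblings can itself
-- be partitioned ...
CREATE TABLE measurement_y2019 PARTITION OF measurement
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01')
    PARTITION BY RANGE (logdate);

-- ... but every extra level multiplies the total partition count.
CREATE TABLE measurement_y2019_q1 PARTITION OF measurement_y2019
    FOR VALUES FROM ('2019-01-01') TO ('2019-04-01');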

+It is also important to consider the overhead of partitioning during
+query planning and execution.  The query planner is generally able to
+handle partition hierarchies up a few thousand partitions fairly well,
+provided that typical queries prune all but a small number of partitions
+during query planning.  Planning times become slower and memory

s/slower/longer/

Hm, maybe say "typical queries ALLOW PRUNING .."

+consumption becomes higher when more partitions remain after the planner
+performs partition pruning.  This is particularly true for the

Just say: "remain after planning" ?

+UPDATE and DELETE commands.  Also,
+even if most queries are able to prune a large number of partitions during
+query planning, it still may be undesirable to have a large number of

may still ?

+partitions as each partition requires metadata about the partition to be
+stored in each session that touches it.  If each session touches a large

stored for ?

+number of partitions over a period of time then the memory consumption for
+this may become significant.
+   

Remove "over a period of time" ?
Add a comma?

Maybe say:

|If each session touches a large number of partitions, then the memory
|overhead may become significant.

+   
+With data warehouse type workloads it can make sense to use a larger
+number of partitions than with an OLTP type workload.  Generally, in data
+warehouses, query planning time is less of a concern as the majority of

VAST majority?  Or "essentially all"?  Or ".. query planning time is
insignificant compared to the time spent during query execution."

+processing time is spent during query execution.  With either of these two
+types of workload it is important to make the right decisions early as

early COMMA

Justin




Re: Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !

2019-06-07 Thread Fabien COELHO



Hello Avinash,


I was testing bloom indexes today. I understand bloom indexes use bloom
filters. [...]

So the question here is:
I assume number of bits = k, where k is the total number of hash
functions used on top of the data that needs to be validated. Is that
correct? If yes, why do we see Index 1 performing better than Index 2?
Since the data has to go through more hash functions (4 vs 2) in Index 1
than in Index 2, Index 1 should take more time.
Also, both indexes have ZERO false positives.
Please let me know if there is anything simple that I am missing here.


You may have a look at the blog entry about these parameters that I wrote a
few years ago:


  http://blog.coelho.net/database/2016/12/11/postgresql-bloom-index.html
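
As a general note (this is standard Bloom filter theory, not anything
specific to the bloom extension's exact parameter mapping, which the post
above goes into): with m bits, n stored elements and k hash functions the
expected false positive rate is roughly

$$ p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad k_{\text{opt}} = \frac{m}{n}\ln 2 $$

so setting more bits per value can reduce false positives, up to that
optimum, even though each value costs a few more hash computations.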

--
Fabien.




Re: [PROPOSAL] Drop orphan temp tables in single-mode

2019-06-07 Thread Arthur Zakirov
Hello Alexander,

On Friday, June 7, 2019, Alexander Korotkov wrote:
> BTW, does this patch check that a temporary table is really orphan?
> AFAICS, a user may define some temporary tables in single-user mode
> before running VACUUM.

As far as I remember, the patch checks it.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !

2019-06-07 Thread Avinash Kumar
Hi All,

I was testing bloom indexes today. I understand bloom indexes use bloom
filters.

As I understand it, a bloom filter is a bit array of m bits, and a constant
number k of hash functions is used to generate hashes for the data; the
appropriate bits are then set to 1.

I was doing the following test, where I generated 10 million records and
tested bloom indexes.

CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5
int,id6 int,id7 int,details text, zipcode int);

INSERT INTO foo.bar SELECT (random() * 100)::int, (random() *
100)::int,(random() * 100)::int,(random() * 100)::int,(random()
* 100)::int,(random() * 100)::int, (random() *
100)::int,(random() * 100)::int,md5(g::text), floor(random()*
(2- + 1) + ) from generate_series(1,100*1e6) g;

As per the documentation, bloom indexes accept two parameters: *length* and
the *number of bits for each column*.

Here is the problem, or rather the question, I have.

I have created the following two indexes.

*Index 1*
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4,
col8=4);

*Index 2*
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=2, col2=2, col3=2, col4=2, col5=2, col6=2, col7=2,
col8=2);

With a change in length, we of course see a difference in the index size,
which is understandable. Here I have the same length for both indexes, but
I reduced the number of bits for each index column from 4 to 2. Both of the
above indexes are the same size, yet there is a slight difference in
execution time between Index 1 and Index 2, with the same cost.

So the question here is:
I assume number of bits = k, where k is the total number of hash
functions used on top of the data that needs to be validated. Is that
correct? If yes, why do we see Index 1 performing better than Index 2?
Since the data has to go through more hash functions (4 vs 2) in Index 1
than in Index 2, Index 1 should take more time.
Also, both indexes have ZERO false positives.
Please let me know if there is anything simple that I am missing here.

*Query *

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) select * from foo.bar where id = 736833
and dept = 89861 and id2 = 573221 and id3 = 675911 and id4 = 943394 and id5
= 326756 and id6 = 597560 and zipcode = 10545;

*With Index 1 *

   QUERY PLAN

 Bitmap Heap Scan on foo.bar  (cost=2647060.00..2647061.03 rows=1 width=69)
(actual time=307.000..307.000 rows=0 loops=1)
   Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
   Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 =
573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
   Buffers: shared hit=147059
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..2647060.00 rows=1
width=0) (actual time=306.997..306.997 rows=0 loops=1)
 Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2
= 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
 Buffers: shared hit=147059
 Planning Time: 0.106 ms
* Execution Time: 307.030 ms*
(9 rows)

*With Index 2 *

   QUERY PLAN

 Bitmap Heap Scan on foo.bar  (cost=2647060.00..2647061.03 rows=1 width=69)
(actual time=420.881..420.881 rows=0 loops=1)
   Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
   Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 =
573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
   Buffers: shared hit=147059
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..2647060.00 rows=1
width=0) (actual time=420.878..420.878 rows=0 loops=1)
 Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2
= 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
 Buffers: shared hit=147059
 Planning Time: 0.104 ms
* Execution Time: 420.913 ms*
(9 rows)

Thanks,
Avinash.


Re: Temp table handling after anti-wraparound shutdown (Was: BUG #15840)

2019-06-07 Thread Michael Paquier
On Fri, Jun 07, 2019 at 05:26:32PM -0700, Andres Freund wrote:
> I was more thinking that we'd move the check for orphaned-ness into a
> separate function (maybe IsOrphanedRelation()), and move the code to
> drop orphan relations into a separate function (maybe
> DropOrphanRelations()). That'd limit the amount of code duplication for
> doing this both in autovacuum and all-database vacuums quite
> considerably.

A separation makes sense.  At some point we should actually try to
separate vacuum and orphan relation cleanup, so separate functions
make sense.  The only reason why we are doing it with autovacuum is
that it is the only thing in-core spawning a worker connected to a
database which does a full scan of pg_class.
--
Michael


signature.asc
Description: PGP signature


Re: be-gssapi-common.h should be located in src/include/libpq/

2019-06-07 Thread Michael Paquier
On Fri, Jun 07, 2019 at 09:52:26AM +0200, Daniel Gustafsson wrote:
> While looking at libpq.h I noticed what seems to be a few nitpicks: the GSS
> function prototype isn’t using the common format of having a comment 
> specifying
> the file it belongs to; ssl_loaded_verify_locations is defined as extern even
> though it’s only available under USE_SSL (which works fine since it’s only
> accessed under USE_SSL but seems kinda wrong); and FeBeWaitSet is not listed
> under the pqcomm.c prototypes like how the extern vars from be-secure.c are.
> All of these are in the attached.

Indeed, this makes the header more consistent.  Thanks for noticing.
--
Michael


signature.asc
Description: PGP signature


Re: be-gssapi-common.h should be located in src/include/libpq/

2019-06-07 Thread Michael Paquier
On Fri, Jun 07, 2019 at 08:11:07AM -0400, Stephen Frost wrote:
> I'm pretty sure it ended up there just because that's how things are in
> src/interfaces/libpq.  I don't have any objection to moving it, I had
> really just been waiting to see where that thread ended up going.
> 
> On a quick look, the patch looks fine to me.

OK thanks.  I have committed this portion of the patch for now.  If
there are any remaining issues let's take care of them afterwards.
--
Michael


signature.asc
Description: PGP signature


Re: Binary support for pgoutput plugin

2019-06-07 Thread Andres Freund
Hi,

On 2019-06-07 21:16:12 -0400, Chapman Flack wrote:
> On 06/07/19 21:01, Andres Freund wrote:
> > On 2019-06-07 20:52:38 -0400, Chapman Flack wrote:
> > That'd be a *lot* of additional complexity, and pretty much prohibitive
> > from a performance POV. We'd have to not continue decoding on the server
> > side *all* the time to give the client a chance to inquire additional
> > information.
> 
> Does anything travel in the client->server direction during replication?
> I thought I saw CopyBoth mentioned. Is there a select()/poll() being done
> so those messages can be received?

Yes, acknowledgements of how far data has been received (and how far
processed), which is then used to release resources (WAL, xid horizon)
and allow synchronous replication to block until something has been
received.

- Andres




Re: Binary support for pgoutput plugin

2019-06-07 Thread Chapman Flack
On 06/07/19 21:01, Andres Freund wrote:
> On 2019-06-07 20:52:38 -0400, Chapman Flack wrote:
>> It seems they had ended up designing a whole 'nother "protocol level"
>> involving queries wrapping their results as JSON and an app layer that
>> unwraps again, after trying a simpler first approach that was foiled by the
>> inability to see into arrays and anonymous record types in the 'describe'
>> response.
> 
> I suspect quite a few people would have to have left the project before
> this would happen.

I'm not sure I understand what you're getting at. The "whole 'nother
protocol" was something they actually implemented, at the application
level, by rewriting their queries to produce JSON and their client to
unwrap it. It wasn't proposed to go into postgres ... but it was a
workaround they were forced into by the current protocol's inability
to tell them what they were getting.

> That'd be a *lot* of additional complexity, and pretty much prohibitive
> from a performance POV. We'd have to not continue decoding on the server
> side *all* the time to give the client a chance to inquire additional
> information.

Does anything travel in the client->server direction during replication?
I thought I saw CopyBoth mentioned. Is there a select()/poll() being done
so those messages can be received?

It does seem that the replication protocol would be the tougher problem.
For the regular extended-query protocol, it seems like allowing an extra
Describe or two before Execute might not be awfully hard.

Regards,
-Chap




Re: Binary support for pgoutput plugin

2019-06-07 Thread Andres Freund
Hi,

On 2019-06-07 20:52:38 -0400, Chapman Flack wrote:
> It seems they had ended up designing a whole 'nother "protocol level"
> involving queries wrapping their results as JSON and an app layer that
> unwraps again, after trying a simpler first approach that was foiled by the
> inability to see into arrays and anonymous record types in the 'describe'
> response.

I suspect quite a few people would have to have left the project before
this would happen.


> I thought, in a new protocol rev, why not let the driver send additional
> 'describe' messages after the first one, to drill into structure of
> individual columns mentioned in the first response, before sending the
> 'execute' message?
> 
> If it doesn't want the further detail, it doesn't have to ask.
> 
> > And then we suddenly need tracking for all these, so we don't always
> > send out that information when we previously already did
> 
> If it's up to the client driver, it can track what it needs or already has.

> I haven't looked too deeply into the replication protocol ... it happens
> under a kind of copy-both, right?, so maybe there's a way for the receiver
> to send some inquiries back, but maybe in a windowed, full-duplex way where
> it might have to buffer some incoming messages before getting the response
> to an inquiry message it sent.

That'd be a *lot* of additional complexity, and pretty much prohibitive
from a performance POV. We'd have to not continue decoding on the server
side *all* the time to give the client a chance to inquire additional
information.

Greetings,

Andres Freund




Re: Binary support for pgoutput plugin

2019-06-07 Thread Chapman Flack
On 06/07/19 19:27, Andres Freund wrote:
> The problem is that I don't recognize a limiting principle:
> 
> If we want NOT NULL information for clients, why don't we include the
> underlying types for arrays, and the fields in composite types? What
> about foreign keys? And unique keys?

This reminds me of an idea I had for a future fe/be protocol version,
right after a talk by Alyssa Ritchie and Henrietta Dombrovskaya at the
last 2Q PGConf. [1]

It seems they had ended up designing a whole 'nother "protocol level"
involving queries wrapping their results as JSON and an app layer that
unwraps again, after trying a simpler first approach that was foiled by the
inability to see into arrays and anonymous record types in the 'describe'
response.

I thought, in a new protocol rev, why not let the driver send additional
'describe' messages after the first one, to drill into structure of
individual columns mentioned in the first response, before sending the
'execute' message?

If it doesn't want the further detail, it doesn't have to ask.

> And then we suddenly need tracking for all these, so we don't always
> send out that information when we previously already did

If it's up to the client driver, it can track what it needs or already has.

I haven't looked too deeply into the replication protocol ... it happens
under a kind of copy-both, right?, so maybe there's a way for the receiver
to send some inquiries back, but maybe in a windowed, full-duplex way where
it might have to buffer some incoming messages before getting the response
to an inquiry message it sent.

Would those be thinkable thoughts for a future protocol rev?

Regards,
-Chap


[1]
https://www.2qpgconf.com/schedule/information-exchange-techniques-for-javapostgresql-applications/




Re: Temp table handling after anti-wraparound shutdown (Was: BUG #15840)

2019-06-07 Thread Andres Freund
Hi,

On 2019-06-08 08:59:37 +0900, Michael Paquier wrote:
> On Fri, Jun 07, 2019 at 03:58:43PM -0700, Andres Freund wrote:
> > Do we need to move the orphan temp cleanup code into database vacuums or
> > such?
> 
> When entering into the vacuum() code path for an autovacuum, only one
> relation at a time is processed, and we have, prior to that, extra
> processing related to toast relations when selecting the relations to
> work on, or to potentially delete orphaned temp tables.  For a manual
> vacuum, we finish by deciding which relation to process in
> get_all_vacuum_rels(), so the localized processing is a bit different
> than what's done in do_autovacuum() when scanning pg_class for
> relations.

Yea, I know.  I didn't mean that we should handle orphan cleanup
only within database-wide vacuums, just *also* there. ISTM that'd mean
that at least some of the code ought to be in vacuum.c, and then also
called by autovacuum.c.


> Technically, I think that it would work to give up on gathering the
> orphaned OIDs separately and let them be gathered in the list
> of items to vacuum, and then push the deletion logic down to
> vacuum_rel().

I don't think it makes much sense to go there. The API would probably
look pretty bad.

I was more thinking that we'd move the check for orphaned-ness into a
separate function (maybe IsOrphanedRelation()), and move the code to
drop orphan relations into a separate function (maybe
DropOrphanRelations()). That'd limit the amount of code duplication for
doing this both in autovacuum and all-database vacuums quite
considerably.

A more aggressive approach would be to teach vac_update_datfrozenxid()
to ignore orphaned temp tables - perhaps even by heap_inplace'ing an
orphaned table's relfrozenxid/relminmxid with InvalidTransactionId. We'd
not want to do that in do_autovacuum() - otherwise the schema won't get
cleaned up, but for database-wide vacuums that seems like it could be
a good approach.



Random observation while re-reading this code: Having do_autovacuum()
and ExecVacuum() both go through vacuum() seems like it adds too much
complexity to be worth it. Like half of the file is only concerned with
checks related to that.

Greetings,

Andres Freund




Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-06-07 Thread Michael Paquier
On Fri, Jun 07, 2019 at 08:55:36AM -0400, Robert Haas wrote:
> Are you talking about the call to ReleaseBulkInsertStatePin, or something 
> else?

Yes, I was referring to ReleaseBulkInsertStatePin()
--
Michael


signature.asc
Description: PGP signature


Re: Temp table handling after anti-wraparound shutdown (Was: BUG #15840)

2019-06-07 Thread Michael Paquier
On Fri, Jun 07, 2019 at 03:58:43PM -0700, Andres Freund wrote:
> Do we need to move the orphan temp cleanup code into database vacuums or
> such?

When entering into the vacuum() code path for an autovacuum, only one
relation at a time is processed, and we have, prior to that, extra
processing related to toast relations when selecting the relations to
work on, or to potentially delete orphaned temp tables.  For a manual
vacuum, we finish by deciding which relation to process in
get_all_vacuum_rels(), so the localized processing is a bit different
than what's done in do_autovacuum() when scanning pg_class for
relations. 

Technically, I think that it would work to give up on gathering the
orphaned OIDs separately and let them be gathered in the list
of items to vacuum, and then push the deletion logic down to
vacuum_rel().  However, there is a catch: for autovacuum we gather the
orphaned entries and the other relations to process, then drop all the
orphaned OIDs, and finally vacuum/analyze the entries collected.  So
if you push the deletion logic down into vacuum_rel() then we won't be
able to drop orphaned tables before working on a database, which would
be bad if we know about an orphaned set but autovacuum works for a
long time on other legit entries first.
--
Michael


signature.asc
Description: PGP signature


Re: Binary support for pgoutput plugin

2019-06-07 Thread Andres Freund
Hi,

On 2019-06-05 19:05:05 -0400, Dave Cramer wrote:
> I am curious why you are "strongly" opposed however. We already have the
> information. Adding doesn't seem onerous.

(thought I'd already replied with this)

The problem is that I don't recognize a limiting principle:

If we want NOT NULL information for clients, why don't we include the
underlying types for arrays, and the fields in composite types? What
about foreign keys? And unique keys?

And then we suddenly need tracking for all these, so we don't always
send out that information when we previously already did - and in some
of the cases there's no infrastructure for that.

I just don't quite buy that the output plugin build for pg's logical
replication needs is a good place to include a continually increasing
amount of metadata that logical replication doesn't need.  That's going
to add overhead and make the code more complicated.

Greetings,

Andres Freund




Temp table handling after anti-wraparound shutdown (Was: BUG #15840)

2019-06-07 Thread Andres Freund
Hi,

(Moving a part of this discussion to hackers)

In #15840 Thierry hit a case where autovacuum apparently could not
keep up, and he ended up with a wraparound situation. Following the
hints and shutting down the cluster and vacuuming the relevant DB in
single-user mode did not resolve the issue, however. That's because there
was a session with temp tables:

On 2019-06-07 16:40:27 -0500, Thierry Husson wrote:
>   oid   |                 oid                  | relkind | relfrozenxid |    age
> --------+--------------------------------------+---------+--------------+------------
>  460564 | pg_temp_3.cur_semt700_progsync_4996  | r       |        36464 | 2146483652
>  460764 | pg_temp_8.cur_semt700_progsync_5568  | r       |     19836544 | 2126683572
>  460718 | pg_temp_4.cur_semt700_progsync_5564  | r       |     19836544 | 2126683572
>  460721 | pg_temp_5.cur_semt700_progsync_5565  | r       |     19836544 | 2126683572
>  461068 | pg_temp_22.cur_semt700_progsync_5581 | r       |     19836544 | 2126683572
> 
> These are temporary tables to manage concurrency & server load. It seems the
> sudden disconnection due to wraparound protection didn't get them removed. I
> removed them manually under single mode and there is no more warning now,
> vacuum command included. Your command is very interesting to know.

And our logic for dropping temp tables only kicks in during autovacuum, not
in a manual database-wide VACUUM.

Which means that currently the advice we give, namely to shut down and
vacuum the database in single-user mode, plainly doesn't work, without any
warnings hinting in the right direction.

Do we need to move the orphan temp cleanup code into database vacuums or
such?

Greetings,

Andres Freund




Re: tableam: abstracting relation sizing code

2019-06-07 Thread Daniel Gustafsson
> On 7 Jun 2019, at 14:43, Robert Haas  wrote:

> I think that's probably going in the wrong direction.  It's arguable,
> of course.  However, it seems to me that there's nothing heap-specific
> about the number 10.  It's not computed based on the format of a heap
> page or a heap tuple.  It's just somebody's guess (likely Tom's) about
> how to plan with empty relations.  If somebody finds that another
> number works better for their AM, it's probably also better for heap,
> at least on that person's workload.  

Fair enough, that makes sense.

> Good catch, and now I notice that the callback is not called
> estimate_rel_size but relation_estimate_size.  Updated patch attached;
> thanks for the review.

The commit message still refers to it as estimate_rel_size though. The
comment on table_block_relation_estimate_size does too but that one might
be intentional.

The v3 patch applies cleanly and passes tests (I did not see the warning that
Alvaro mentioned, so yay for testing on multiple compilers).

During re-review, the below part stood out as a bit odd however:

+   if (curpages < 10 &&
+   relpages == 0 &&
+   !rel->rd_rel->relhassubclass)
+   curpages = 10;
+
+   /* report estimated # pages */
+   *pages = curpages;
+   /* quick exit if rel is clearly empty */
+   if (curpages == 0)
+   {
+   *tuples = 0;
+   *allvisfrac = 0;
+   return;
+   }

While I know this codepath isn’t introduced by this patch (it was introduced in
696d78469f3), I hadn’t seen it before so sorry for thread-jacking slightly.

Maybe I’m a bit thick but if the rel is totally empty and without children,
then curpages as well as relpages would be both zero.  But if so, how can we
enter the second "quick exit” block since curpages by then will be increased to
10 in the block just before for the empty case?  It seems to me that the blocks
should be the other way around to really have a fast path, but I might be
missing something.

cheers ./daniel



Re: heapam_index_build_range_scan's anyvisible

2019-06-07 Thread Alvaro Herrera
On 2019-Jun-07, Robert Haas wrote:

> Yeah, I wondered whether SnapshotNonVacuumable might've been added
> later, but I was too lazy to check the commit log.  I'll try coding up
> that approach and see how it looks.

Thanks.

> But do you have any comment on the question of whether this function
> is actually safe with < ShareLock, per the comments about caching
> HOT-related state across buffer lock releases?

Well, as far as I understand we do hold a buffer pin on the page the
whole time until we abandon it, which prevents HOT pruning, so the root
offset cache should be safe (since heap_page_prune requires cleanup
lock).  The thing we don't keep held is a buffer lock, so I/U/D could
occur, but those are not supposed to be hazards for the BRIN use, since
that's covered by the anyvisible / SnapshotNonVacuumable
hack^Wtechnique.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: heapam_index_build_range_scan's anyvisible

2019-06-07 Thread Robert Haas
On Fri, Jun 7, 2019 at 4:30 PM Alvaro Herrera  wrote:
> On 2019-Jun-07, Robert Haas wrote:
> > I spent some time today studying heapam_index_build_range_scan and
> > quickly reached the conclusion that it's kind of a mess.  At heart
> > it's pretty simple: loop over the whole table, check each tuple against
> > any qual, and pass the visible ones to the callback.  However, in an
> > attempt to make it cater to various needs slightly outside of its
> > original design purpose, various warts have been added, and there are
> > enough of them now that I at least find it fairly difficult to
> > understand.  One of those warts is anyvisible, which I gather was
> > added in support of BRIN.
>
> Yes, commit 2834855cb9fd added that flag.  SnapshotNonVacuumable did not
> exist back then.  It seems like maybe it would work to remove the flag
> and replace with passing SnapshotNonVacuumable.  The case that caused
> that flag to be added is tested by a dedicated isolation test, so if
> BRIN becomes broken by the change at least it'd be obvious ...

Yeah, I wondered whether SnapshotNonVacuumable might've been added
later, but I was too lazy to check the commit log.  I'll try coding up
that approach and see how it looks.

But do you have any comment on the question of whether this function
is actually safe with < ShareLock, per the comments about caching
HOT-related state across buffer lock releases?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: heapam_index_build_range_scan's anyvisible

2019-06-07 Thread Alvaro Herrera
On 2019-Jun-07, Robert Haas wrote:

> I spent some time today studying heapam_index_build_range_scan and
> quickly reached the conclusion that it's kind of a mess.  At heart
> it's pretty simple: loop over the whole table, check each tuple against
> any qual, and pass the visible ones to the callback.  However, in an
> attempt to make it cater to various needs slightly outside of its
> original design purpose, various warts have been added, and there are
> enough of them now that I at least find it fairly difficult to
> understand.  One of those warts is anyvisible, which I gather was
> added in support of BRIN.

Yes, commit 2834855cb9fd added that flag.  SnapshotNonVacuumable did not
exist back then.  It seems like maybe it would work to remove the flag
and replace with passing SnapshotNonVacuumable.  The case that caused
that flag to be added is tested by a dedicated isolation test, so if
BRIN becomes broken by the change at least it'd be obvious ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




heapam_index_build_range_scan's anyvisible

2019-06-07 Thread Robert Haas
I spent some time today studying heapam_index_build_range_scan and
quickly reached the conclusion that it's kind of a mess.  At heart
it's pretty simple: loop over the whole table, check each tuple against
any qual, and pass the visible ones to the callback.  However, in an
attempt to make it cater to various needs slightly outside of its
original design purpose, various warts have been added, and there are
enough of them now that I at least find it fairly difficult to
understand.  One of those warts is anyvisible, which I gather was
added in support of BRIN.

I first spent some time looking at how the 'anyvisible' flag affects
the behavior of the function. AFAICS, setting the flag to true results
in three behavior changes:

1. The elog(WARNING, ...) calls about concurrent insert/delete calls
in progress can't be reached.
2. In some cases, reltuples += 1 might not occur where it would've
happened otherwise.
3. If we encounter a HOT-updated tuple which was deleted by our own
transaction, we index it instead of skipping it.

Change #2 doesn't matter because the only caller that passes
anyvisible = true seems to be BRIN, and BRIN ignores the return value.
I initially thought that change #1 must not matter either, because
the function has comments in several places saying that the caller must
hold ShareLock or better.  And I thought change #3 must also not
matter, because as the comments explain, this function is used to
build indexes, and if our CREATE INDEX command commits, then any
deletions that it has already performed will commit too, so the fact
that we haven't indexed the now-deleted tuples will be fine.  Then I
realized that brin_summarize_new_values() is calling this function
*without* ShareLock and *not* for the purpose of creating a new
index but rather for the purpose of updating an existing index, which
means #1 and #3 do matter after all. But I think it's kind of
confusing because anyvisible doesn't change anything about which
tuples are visible.  SnapshotAny is already making "any" tuple
"visible." This flag really means "caller is holding a
lower-than-normal lock level and is not inserting into a brand new
relfilenode".

There may be more than just a cosmetic problem here, because the comments say:

 * It might look unsafe to use this information across buffer
 * lock/unlock.  However, we hold ShareLock on the table so no
 * ordinary insert/update/delete should occur; and we hold pin on the
 * buffer continuously while visiting the page, so no pruning
 * operation can occur either.

In the BRIN case that doesn't apply; I don't know whether this is safe
in that case for some other reason.

I also note that amcheck's bt_check_every_level can also call this
without ShareLock. It doesn't need to set anyvisible because passing a
snapshot bypasses the WARNINGs anyway, but it might have whatever
problem the above comment is thinking about.

Also, it's just cosmetic, but this comment definitely needs updating:

/*
 * We could possibly get away with not locking the buffer here,
 * since caller should hold ShareLock on the relation, but let's
 * be conservative about it.  (This remark is still correct even
 * with HOT-pruning: our pin on the buffer prevents pruning.)
 */
LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);

One more thing.  Assuming that there are no live bugs here, or that we
fix them, another possible simplification would be to remove the
anyvisible = true flag and have BRIN pass SnapshotNonVacuumable.
SnapshotNonVacuumable returns true when HeapTupleSatisfiesVacuum
doesn't return HEAPTUPLE_DEAD, so I think we'd get exactly the same
behavior (again, modulo reltuples, which doesn't matter).
heap_getnext() would perform functionally the same check as the
bespoke code internally, and just wouldn't return the dead tuples in
the first place.  There's an assertion that would trip, but we could
probably just change it.  BRIN's callback might also get a different
value for tupleIsAlive in some cases, but it ignores that value
anyway.

So to summarize:

1. Is this function really safe with < ShareLock?  Both BRIN and
amcheck think so, but the function itself isn't sure. If yes, we need
to adapt the comments.  If no, we need to think about some other fix.

2. anyvisible is a funny name given what the flag really does. Maybe
we can simplify by replacing it with SnapshotNonVacuumable().
Otherwise maybe we should rename the flag.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_dump: fail to restore partition table with serial type

2019-06-07 Thread Alvaro Herrera
On 2019-May-06, Alvaro Herrera wrote:

> On 2019-May-06, Rushabh Lathia wrote:
> 
> > Found another scenario where check constraint is not getting
> > dump for the child table.
> 
> You're right, the patched code is bogus; I'm reverting it all for
> today's minors.  Thanks for reporting.

Here's another version of this patch.  This time, I added some real
tests in pg_dump's suite, including a SERIAL column and NOT NULL
constraints.  The improved test verifies that the partition is created
separately and later attached, and it includes constraints from the
parent as well as some locally defined ones.  I also added tests with
legacy inheritance, which was not considered previously in pg_dump tests
as far as I could see.

I looked for other cases that could have been broken by changing the
partition creation methodology in pg_dump, and didn't find anything.
That part of pg_dump (dumpTableSchema) is pretty spaghettish, though;
the fact that shouldPrintColumn makes some partition-related decisions
and then dumpTableSchema makes them again is notoriously confusing.  I
could have easily missed something.


One weird thing about pg_dump's output of the serial column in a
partitioned table is that it emits the parent table itself first without
a DEFAULT clause, then the sequence and marks it as owned by the column;
then it emits the partition *with* the default clause, and finally it
alters the parent table's column to set the default.  Now there is some
method in this madness (the OWNED BY clause for the sequence is mingled
together with the sequence itself), but I think this arrangement makes
a partial restore of the partition fail.
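
In sketch form, with invented object names (this shows the shape of the
output described above; it is not verbatim pg_dump output):

-- invented names, illustrating the ordering described above
-- 1. the parent, serial column present but without its DEFAULT
CREATE TABLE parent (
    id integer NOT NULL
) PARTITION BY RANGE (id);

-- 2. the sequence, immediately marked as owned by the parent's column
CREATE SEQUENCE parent_id_seq;
ALTER SEQUENCE parent_id_seq OWNED BY parent.id;

-- 3. the partition, created standalone *with* the default, then attached
CREATE TABLE parent_p1 (
    id integer DEFAULT nextval('parent_id_seq') NOT NULL
);
ALTER TABLE parent ATTACH PARTITION parent_p1 FOR VALUES FROM (1) TO (1000);

-- 4. only at the end does the parent column get its default
ALTER TABLE parent ALTER COLUMN id SET DEFAULT nextval('parent_id_seq');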

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 72ad894d79b6f26d24a8857e5f33c6d51bd65051 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 24 Apr 2019 15:30:37 -0400
Subject: [PATCH v1 1/2] Make pg_dump emit ATTACH PARTITION instead of
 PARTITION OF
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Using PARTITION OF can result in column ordering being changed from the
database being dumped, if the partition uses a column layout different
from the parent's.  It's not pg_dump's job to editorialize on table
definitions, so this is not acceptable; back-patch all the way back to
pg10, where partitioned tables were introduced.

This change also ensures that partitions end up in the correct
tablespace, if different from the parent's; this is an oversight in
ca4103025dfe (in pg12 only).  Partitioned indexes (in pg11) don't have
this problem, because they're already created as independent indexes and
attached to their parents afterwards.

This change also has the advantage that the partition is restorable from
the dump (as a standalone table) even if its parent table isn't
restored.

Author: David Rowley
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/cakjs1f_1c260not_vbj067az3jxptxvrohdvmlebmudx1ye...@mail.gmail.com
Discussion: https://postgr.es/m/20190423185007.GA27954@alvherre.pgsql
---
 src/bin/pg_dump/pg_dump.c| 123 +--
 src/bin/pg_dump/t/002_pg_dump.pl |  12 ++-
 2 files changed, 60 insertions(+), 75 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 9f59cc74ee..408eee9119 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8631,9 +8631,12 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
  * Normally this is always true, but it's false for dropped columns, as well
  * as those that were inherited without any local definition.  (If we print
  * such a column it will mistakenly get pg_attribute.attislocal set to true.)
- * However, in binary_upgrade mode, we must print all such columns anyway and
- * fix the attislocal/attisdropped state later, so as to keep control of the
- * physical column order.
+ * For partitions, it's always true, because we want the partitions to be
+ * created independently and ATTACH PARTITION used afterwards.
+ *
+ * In binary_upgrade mode, we must print all columns and fix the attislocal/
+ * attisdropped state later, so as to keep control of the physical column
+ * order.
  *
  * This function exists because there are scattered nonobvious places that
  * must be kept in sync with this decision.
@@ -8643,7 +8646,9 @@ shouldPrintColumn(DumpOptions *dopt, TableInfo *tbinfo, int colno)
 {
 	if (dopt->binary_upgrade)
 		return true;
-	return (tbinfo->attislocal[colno] && !tbinfo->attisdropped[colno]);
+	if (tbinfo->attisdropped[colno])
+		return false;
+	return (tbinfo->attislocal[colno] || tbinfo->ispartition);
 }
 
 
@@ -15594,27 +15599,6 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
 		if (tbinfo->reloftype && !dopt->binary_upgrade)
 			appendPQExpBuffer(q, " OF %s", tbinfo->reloftype);
 
-		/*
-		 * If the table is a partition, dump it as such; except in the case of
-		 * a binary upgrade, w

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2019-06-07 Thread Alexander Korotkov
On Fri, Dec 28, 2018 at 11:32 AM Masahiko Sawada  wrote:
> On Thu, Dec 27, 2018 at 10:24 PM Alvaro Herrera wrote:
> >
> > On 2018-Dec-27, Alexey Kondratov wrote:
> >
> > > To summarize:
> > >
> > > 1) Alvaro and Michael agreed, that REINDEX with tablespace move may be
> > > useful. This is done in the patch attached to my initial email. Adding
> > > REINDEX to ALTER TABLE as new action seems quite questionable for me and 
> > > not
> > > completely semantically correct. ALTER already looks bulky.
> >
> > Agreed on these points.
>
> As an alternative idea, I think we can have a new ALTER INDEX variants
> that rebuilds the index while moving tablespace, something like ALTER
> INDEX ... REBUILD SET TABLESPACE 

+1

It seems the easiest way to have feature-full commands. If we put the
functionality of CLUSTER and VACUUM FULL into ALTER TABLE, and the
functionality of REINDEX into ALTER INDEX, then CLUSTER, VACUUM FULL and
REINDEX would be just syntactic sugar.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [PROPOSAL] Drop orphan temp tables in single-mode

2019-06-07 Thread Alexander Korotkov
On Fri, Mar 8, 2019 at 9:28 AM Michael Paquier  wrote:
> On Thu, Mar 07, 2019 at 10:49:29AM -0500, Robert Haas wrote:
> > On Thu, Mar 7, 2019 at 10:24 AM Tom Lane  wrote:
> > > So if we think we can invent a "MAGICALLY FIX MY DATABASE" command,
> > > let's do that.  But please let's not turn a well defined command
> > > like VACUUM into something that you don't quite know what it will do.
> >
> > I am on the fence about that.  I see your point, but on the other
> > hand, autovacuum drops temp tables all the time in multi-user mode and
> > I think it's pretty clear that, with the possible exception of you,
> > users find that an improvement.  So it could be argued that we're
> > merely proposing to make the single-user mode behavior of vacuum
> > consistent with the behavior people are already expecting it to do.
>
> It is possible for a session to drop temporary tables of other
> sessions.  Wouldn't it work as well in this case for single-user mode
> when seeing an orphan temp table still defined?  Like Tom, I don't
> think that it is a good idea to play with the heuristics of VACUUM in
> the way the patch proposes.

I think we have a kind of agreement that having a simple way to get rid
of all orphan tables in single-user mode is good.  The question is
whether it's good for the VACUUM command to do this.  Naturally, a user may
enter single-user mode for different reasons, not only to fix xid
wraparound.  For example, one may enter this mode to examine the
temporary tables present in the database.  Then it would be a
disappointing surprise that all of them are gone after a VACUUM command.

So, what about a special option which would make VACUUM drop orphan
tables in single-user mode?  Do we need it in multi-user mode too?

BTW, does this patch check that a temporary table is really orphan?
AFAICS, a user may define some temporary tables in single-user mode
before running VACUUM.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: tableam: inconsistent parameter name

2019-06-07 Thread Robert Haas
On Fri, Jun 7, 2019 at 1:19 PM Andres Freund  wrote:
> On 2019-06-07 13:11:21 -0400, Robert Haas wrote:
> > I found what appears to be another typo very nearby.  Attached please
> > find v2, fixing both issues.
>
> Hm, I think that's fixed already?

Oops, you're right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: tableam: inconsistent parameter name

2019-06-07 Thread Andres Freund
On 2019-06-07 13:11:21 -0400, Robert Haas wrote:
> I found what appears to be another typo very nearby.  Attached please
> find v2, fixing both issues.

Hm, I think that's fixed already?




Re: tableam: inconsistent parameter name

2019-06-07 Thread Robert Haas
On Fri, Jun 7, 2019 at 12:52 PM Andres Freund  wrote:
> On 2019-06-07 12:37:33 -0400, Robert Haas wrote:
> > TableAmRoutine's index_build_range_scan thinks that parameter #8 is
> > called end_blockno, but table_index_build_range_scan and
> > heapam_index_build_range_scan and BRIN's summarize_range all agree
> > that it's the number of blocks to be scanned.  I assume that this got
> > changed at some point while Andres was hacking on it and this one
> > place just never got updated.
>
> Not sure where it came from :/
>
> > Proposed patch attached.  Since this seems like a bug, albeit a
> > harmless one, I propose to commit this to v12.
>
> Yea, please do!

I found what appears to be another typo very nearby.  Attached please
find v2, fixing both issues.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v2-0001-tableam-Fix-cosmetic-mistakes-related-to-index_bu.patch
Description: Binary data


Re: tableam: inconsistent parameter name

2019-06-07 Thread Andres Freund
Hi,

On 2019-06-07 12:37:33 -0400, Robert Haas wrote:
> TableAmRoutine's index_build_range_scan thinks that parameter #8 is
> called end_blockno, but table_index_build_range_scan and
> heapam_index_build_range_scan and BRIN's summarize_range all agree
> that it's the number of blocks to be scanned.  I assume that this got
> changed at some point while Andres was hacking on it and this one
> place just never got updated.

Not sure where it came from :/

> Proposed patch attached.  Since this seems like a bug, albeit a
> harmless one, I propose to commit this to v12.

Yea, please do!

Greetings,

Andres Freund




Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-06-07 Thread Andres Freund
Hi,

(David, see bottom if you're otherwise not interested).

On 2019-06-07 09:48:29 -0400, Robert Haas wrote:
> On Sat, Jun 1, 2019 at 3:20 PM Andres Freund  wrote:
> > > I'd like to think that the best way to deal with that and reduce the
> > > confusion would be to move anything related to bulk inserts into their
> > > own header/file, meaning the following set:
> > > - ReleaseBulkInsertStatePin
> > > - GetBulkInsertState
> > > - FreeBulkInsertState
> > > There is the argument that we could also move that part into tableam.h
> > > itself though as some of the rather generic table-related callbacks,
> > > but that seems grotty.  So I think that we could just move that stuff
> > > as backend/access/common/bulkinsert.c.
> >
> > Yea, I think we should do that at some point. But I'm not sure this is
> > the right design. Bulk insert probably needs to rather be something
> > that's allocated inside the AM.
> 
> As far as I can see, any on-disk, row-oriented, block-based AM is
> likely to want the same implementation as the heap.

I'm pretty doubtful about that. It'd, e.g., make a ton of sense to
keep the VM pinned, even for heap. You could also do a lot better with
toast.  And for zheap we'd - unless we change the design - quite
possibly benefit from keeping the last needed tpd buffer around.


> Here's a draft design for adding some abstraction, roughly modeled on
> the abstraction Andres added for TupleTableSlots:

Hm, I'm not sure I see the need for a vtable based approach here. Won't
every AM know exactly what they need / have?  I'm not convinced it's
worthwhile to treat that separately from the tableam. I.e.  have a
BulkInsertState struct with *no* members, and then, as you suggest:

> 
> 3. table AM gets a new member BulkInsertState
> *(*create_bistate)(Relation Rel) and a corresponding function
> table_create_bistate(), analogous to table_create_slot(), which can
> call the constructor function for the appropriate type of
> BulkInsertState and return the result

but also route the following through the AM:

> 2. that structure has a member for each defined operation on a 
> BulkInsertState:
> 
> void (*free)(BulkInsertState *);
> void (*release_pin)(BulkInsertState *); // maybe rename to make it more 
> generic

Where free would just be part of finish_bulk_insert, and release_pin a
new callback.


> 4. each type of BulkInsertState has its own functions for making use
> of it, akin to ReadBufferBI.

Right, I don't think that's avoidable unfortunately.



> I see Michael's point about the relationship between
> finish_bulk_insert() and the BulkInsertState, and maybe if we could
> figure that out we could avoid the need for a BulkInsertState to have
> a free method (or maybe any methods at all, in which case it could
> just be an opaque struct, like a Node).

Right, so we actually ended up at the same place. And you found a bug:

> However, it looks to me as though copy.c can create a bunch of
> BulkInsertStates but only call finish_bulk_insert() once, so unless
> that's a bug in need of fixing I don't quite see how to make that
> approach work.

That is a bug. Not a currently "active" one with in-core AMs (no
dangerous bulk insert flags ever get set for partitioned tables), but we
obviously need to fix it nevertheless.

Robert, it seems we'll have to - regardless of where we come down on fixing
this bug - make copy use multiple BulkInsertStates, even in the
CIM_SINGLE (with proute) case. Or do you have a better idea?

David, any opinions on how to best fix this? It's not extremely obvious
how to do so best in the current setup of the partition actually being
hidden somewhere a few calls away (i.e. the table_open happening in
ExecInitPartitionInfo()).

Greetings,

Andres Freund




tableam: inconsistent parameter name

2019-06-07 Thread Robert Haas
TableAmRoutine's index_build_range_scan thinks that parameter #8 is
called end_blockno, but table_index_build_range_scan and
heapam_index_build_range_scan and BRIN's summarize_range all agree
that it's the number of blocks to be scanned.  I assume that this got
changed at some point while Andres was hacking on it and this one
place just never got updated.

Proposed patch attached.  Since this seems like a bug, albeit a
harmless one, I propose to commit this to v12.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


0001-tableam-Fix-index_build_range_scan-parameter-name.patch
Description: Binary data


Re: tableam: abstracting relation sizing code

2019-06-07 Thread Alvaro Herrera
On 2019-Jun-07, Robert Haas wrote:

> On Fri, Jun 7, 2019 at 8:43 AM Robert Haas  wrote:
> > Good catch, and now I notice that the callback is not called
> > estimate_rel_size but relation_estimate_size.  Updated patch attached;
> > thanks for the review.
> 
> Let's try that one more time, and this time perhaps I'll make it compile.

It looks good to me, passes tests.  This version seems to introduce a warning
in my build:

/pgsql/source/master/src/backend/access/table/tableam.c: In function 
'table_block_relation_estimate_size':
/pgsql/source/master/src/backend/access/table/tableam.c:633:12: warning: 
implicit declaration of function 'rint' [-Wimplicit-function-declaration]
  *tuples = rint(density * (double) curpages);
^~~~
/pgsql/source/master/src/backend/access/table/tableam.c:633:12: warning: 
incompatible implicit declaration of built-in function 'rint'
/pgsql/source/master/src/backend/access/table/tableam.c:633:12: note: include
'<math.h>' or provide a declaration of 'rint'

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: tableam: abstracting relation sizing code

2019-06-07 Thread Andres Freund
Hi,

On 2019-06-07 08:32:37 -0400, Robert Haas wrote:
> On Thu, Jun 6, 2019 at 10:08 PM Michael Paquier  wrote:
> > "allvisfrac", "pages" and "tuples" had better be documented about
> > which result they represent.
> 
> A lot of the table AM stuff (and the related slot stuff) lacks
> function header comments; I don't like that and think it should be
> improved. However, that's not the job of this patch. I think it's
> completely correct for this patch to document, as it does, that the
> arguments have the same meaning as for the estimate_rel_size method,
> and leave it at that. There is certainly negative value in duplicating
> the definitions in multiple places, where they might get out of sync.
> The header comment for table_relation_estimate_size() refers the
> reader to the comments for estimate_rel_size(), which says:

Note that these functions ended up that way by precisely this logic... ;)

Greetings,

Andres Freund




Re: tableam: abstracting relation sizing code

2019-06-07 Thread Robert Haas
On Fri, Jun 7, 2019 at 8:43 AM Robert Haas  wrote:
> Good catch, and now I notice that the callback is not called
> estimate_rel_size but relation_estimate_size.  Updated patch attached;
> thanks for the review.

Let's try that one more time, and this time perhaps I'll make it compile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v3-0001-tableam-Provide-helper-functions-for-relation-siz.patch
Description: Binary data


Re: doc: pg_trgm missing description for GUC "pg_trgm.strict_word_similarity_threshold"

2019-06-07 Thread Ian Barwick

On 6/7/19 9:00 PM, Michael Paquier wrote:
> On Fri, Jun 07, 2019 at 03:44:14PM +0900, Masahiko Sawada wrote:
>> BTW while looking at the GUC variables defined in trgm_op.c, the operators
>> in each short description seem not correct; there is an extra percent
>> sign. Should we also fix them?
>
> Both of you are right here

I did notice the double percent signs but my brain skipped over them
assuming they were translatable strings, thanks for catching that.

> and the additional documentation looks fine to me (except the indentation).

The indentation in the additional documentation seems fine to me; it's
the section for the preceding GUC which is offset one column to the right.
Patch attached for that.

> The fix for the parameter descriptions can be back-patched safely as they
> would reload correctly once the version is updated.

Yup, they would appear the first time one of the pg_trgm functions is called
in a session after the new object file is installed.

> Or is that not worth bothering except on HEAD?  Thoughts?

Personally I don't think it's that critical, but not bothered either way.
Presumably no-one has complained so far anyway (I only chanced upon the missing
GUC description because I was poking about looking for examples of custom
GUC handling...)


Regards

Ian Barwick

--
 Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml
index 83b0033d7a..e353ecc954 100644
--- a/doc/src/sgml/pgtrgm.sgml
+++ b/doc/src/sgml/pgtrgm.sgml
@@ -320,23 +320,23 @@
  
 

-
- 
-  pg_trgm.word_similarity_threshold (real)
-  
-   
-pg_trgm.word_similarity_threshold configuration parameter
-   
-  
- 
- 
-  
-   Sets the current word similarity threshold that is used by
-   <% and %> operators.  The threshold
-   must be between 0 and 1 (default is 0.6).
-  
- 
-
+   
+
+ pg_trgm.word_similarity_threshold (real)
+ 
+  
+   pg_trgm.word_similarity_threshold configuration parameter
+  
+ 
+
+
+ 
+  Sets the current word similarity threshold that is used by
+  <% and %> operators.  The threshold
+  must be between 0 and 1 (default is 0.6).
+ 
+
+   
   
  
 


Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-06-07 Thread Robert Haas
On Fri, Jun 7, 2019 at 10:17 AM Tomas Vondra wrote:
> Yes, they could get quite big, and I think you're right we need to
> keep that in mind, because it's on the outer (often quite large) side of
> the join. And if we're aiming to restrict memory usage, it'd be weird to
> just ignore this.
>
> But I think Thomas Munro originally proposed to treat this as a separate
> BufFile, so my assumption was each worker would simply rewrite the bitmap
> repeatedly for each hash table fragment. That means a bit more I/O, but
> those files are buffered and written in 8kB pages, with just 1 bit per
> tuple. I think that's pretty OK and way cheaper than rewriting the whole
> batch, where each tuple can be hundreds of bytes.

Yes, this is also my thought.  I'm not 100% sure I understand
Melanie's proposal, but I think that it involves writing every
still-unmatched outer tuple for every inner batch.  This proposal --
assuming we can get the tuple numbering worked out -- involves writing
a bit for every outer tuple for every inner batch.  So each time you
do an inner batch, you write either (a) one bit for EVERY outer tuple
or (b) the entirety of each unmatched tuple.  It's possible for the
latter to be cheaper if the number of unmatched tuples is really,
really tiny, but it's not very likely.

For example, suppose that you've got 4 batches and each batch matches
99% of the tuples, which are each 50 bytes wide.  After each batch,
approach A writes 1 bit per tuple, so a total of 4 bits per tuple
after 4 batches.  Approach B writes a different amount of data after
each batch.  After the first batch, it writes 1% of the tuples, and
for each one written it writes 50 bytes, so it writes 50 bytes * 0.01
= ~4 bits/tuple.  That's already equal to what approach A wrote after
all 4 batches, and it's going to do a little more I/O over the course
of the remaining matches - although not much, because the unmatched
tuples file will be very very tiny after we eliminate 99% of the 1%
that survived the first batch.  However, these are extremely favorable
assumptions for approach B.  If the tuples are wider or the batches
match only, say, 20% of the tuples, approach B is going to be way
more I/O.

Assuming I understand correctly, which I may not.
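
For what it's worth, here's that arithmetic as a throwaway C program, using
the assumptions from the example above (4 batches, a 99% match rate per
batch, 50-byte tuples).  It's purely illustrative and not PostgreSQL code:

#include <stdio.h>

int
main(void)
{
    const int    nbatches = 4;         /* inner batches */
    const double match_rate = 0.99;    /* fraction matched per batch */
    const double tuple_bytes = 50.0;   /* width of an outer tuple */

    /* Approach A: one bit per outer tuple written after every batch. */
    double  a_bits = 1.0 * nbatches;

    /* Approach B: rewrite only the still-unmatched tuples after each batch. */
    double  b_bits = 0.0;
    double  unmatched = 1.0;

    for (int i = 0; i < nbatches; i++)
    {
        unmatched *= (1.0 - match_rate);
        b_bits += unmatched * tuple_bytes * 8.0;
    }

    printf("approach A: %.2f bits written per outer tuple\n", a_bits);
    printf("approach B: %.2f bits written per outer tuple\n", b_bits);
    return 0;
}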

> Also, it does not require any concurrency control, which rewriting the
> batches themselves probably does (because we'd be feeding the tuples into
> some shared file, I suppose). Except for the final step when we need to
> merge the bitmaps, of course.

I suppose that rewriting the batches -- or really the unmatched tuples
file -- could just use a SharedTuplestore, so we probably wouldn't
need a lot of new code for this.  I don't know whether contention
would be a problem or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-06-07 Thread Robert Haas
On Thu, Jun 6, 2019 at 7:31 PM Melanie Plageman
 wrote:
> I'm not sure I understand why you would need to compare the original
> tuples to the unmatched tuples file.

I think I was confused.  Actually, I'm still not sure I understand this part:

> Then, you iterate again through the outer side a third time to join it
> to the unmatched tuples in the unmatched tuples file (from the first
> chunk) and write the following to a new unmatched tuples file:
> 5
> 9
> 11
> 11

and likewise here

> Then you iterate a fifth time through the outer side to join it to the
> unmatched tuples in the unmatched tuples file (from the second chunk)
> and write the following to a new unmatched tuples file:
> 11
> 11

So you refer to joining the outer side to the unmatched tuples file,
but how would that tell you which outer tuples had no matches on the
inner side?  I think what you'd need to do is anti-join the unmatched
tuples file to the current inner batch.  So the algorithm would be
something like:

for each inner batch:
  for each outer tuple:
    if tuple matches inner batch then emit match
    if tuple does not match inner batch and this is the first inner batch:
      write tuple to unmatched tuples file
  if this is not the first inner batch:
    for each tuple from the unmatched tuples file:
      if tuple does not match inner batch:
        write to new unmatched tuples file
    discard previous unmatched tuples file and use the new one for the
    next iteration

for each tuple in the final unmatched tuples file:
  null-extend and emit

If that's not what you have in mind, maybe you could provide some
similar pseudocode?  Or you can just ignore me. I'm not trying to
interfere with an otherwise-fruitful discussion by being the only one
in the room who is confused...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-06-07 Thread Tomas Vondra

On Thu, Jun 06, 2019 at 04:33:31PM -0700, Melanie Plageman wrote:

On Tue, Jun 4, 2019 at 12:15 PM Robert Haas  wrote:


On Tue, Jun 4, 2019 at 3:09 PM Melanie Plageman
 wrote:
> One question I have is, how would the OR'd together bitmap be
> propagated to workers after the first chunk? That is, when there are
> no tuples left in the outer bunch, for a given inner chunk, would you
> load the bitmaps from each worker into memory, OR them together, and
> then write the updated bitmap back out so that each worker starts with
> the updated bitmap?

I was assuming we'd elect one participant to go read all the bitmaps,
OR them together, and generate all the required null-extended tuples,
sort of like the PHJ_BUILD_ALLOCATING, PHJ_GROW_BATCHES_ALLOCATING,
PHJ_GROW_BUCKETS_ALLOCATING, and/or PHJ_BATCH_ALLOCATING states only
involve one participant being active at a time. Now you could hope for
something better -- why not parallelize that work?  But on the other
hand, why not start simple and worry about that in some future patch
instead of right away? A committed patch that does something good is
better than an uncommitted patch that does something AWESOME.



What if you have a lot of tuples -- couldn't the bitmaps get pretty
big? And then you have to OR them all together and if you can't put
the whole bitmap from each worker into memory at once to do it, it
seems like it would be pretty slow. (I mean maybe not as slow as
reading the outer side 5 times when you only have 3 chunks on the
inner + all the extra writes from my unmatched tuple file idea, but
still...)



Yes, they could get quite big, and I think you're right we need to
keep that in mind, because it's on the outer (often quite large) side of
the join. And if we're aiming to restrict memory usage, it'd be weird to
just ignore this.

But I think Thomas Munro originally proposed to treat this as a separate
BufFile, so my assumption was each worker would simply rewrite the bitmap
repeatedly for each hash table fragment. That means a bit more I/O, but
those files are buffered and written in 8kB pages, with just 1 bit per
tuple. I think that's pretty OK and way cheaper than rewriting the whole
batch, where each tuple can be hundreds of bytes.

Also, it does not require any concurrency control, which rewriting the
batches themselves probably does (because we'd be feeding the tuples into
some shared file, I suppose). Except for the final step when we need to
merge the bitmaps, of course.

So I think this would work; it does not have the issue with using too much
memory, and I don't think the overhead is too bad.
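
To sketch the final merge step (the names are made up, and the BufFile reads,
writes and error handling are elided -- this only shows the shape of it):

#include <stddef.h>
#include <string.h>

/*
 * One elected participant ORs the per-worker match bitmaps together, one
 * 8kB page at a time.  Every bit that is still zero in the result
 * corresponds to an unmatched outer tuple that needs to be null-extended
 * and emitted.
 */
static void
merge_match_bitmap_pages(unsigned char **worker_pages, int nworkers,
                         unsigned char *result, size_t page_bytes)
{
    memset(result, 0, page_bytes);
    for (int w = 0; w < nworkers; w++)
        for (size_t i = 0; i < page_bytes; i++)
            result[i] |= worker_pages[w][i];
}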


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Avoiding hash join batch explosions with extreme skew and weird stats

2019-06-07 Thread Tomas Vondra

On Thu, Jun 06, 2019 at 04:37:19PM -0700, Melanie Plageman wrote:

On Thu, May 16, 2019 at 3:22 PM Thomas Munro  wrote:


Admittedly I don't have a patch, just a bunch of handwaving.  One
reason I haven't attempted to write it is because although I know how
to do the non-parallel version using a BufFile full of match bits in
sync with the tuples for outer joins, I haven't figured out how to do
it for parallel-aware hash join, because then each loop over the outer
batch could see different tuples in each participant.  You could use
the match bit in HashJoinTuple header, but then you'd have to write
all the tuples out again, which is more IO than I want to do.  I'll
probably start another thread about that.



Going back to the idea of using the match bit in the HashJoinTuple header
and writing out all of the outer side for every chunk of the inner
side, I was wondering if there was something we could do that was kind
of like mmap'ing the outer side file to give the workers in parallel
hashjoin the ability to update a match bit in the tuple in place and
avoid writing the whole outer side out each time.



I think this was one of the things we discussed in Ottawa - we could pass
the index of the tuple (in the batch) along with the tuple, so that each
worker knows which bit to set.
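
Given that index, setting the right bit is then trivial.  A rough sketch,
where the 8kB page size and the names are just assumptions for illustration:

#include <stddef.h>
#include <stdint.h>

#define MATCH_BITMAP_PAGE_BYTES 8192

/* which bitmap page and which bit within it correspond to outer tuple N */
static inline void
match_bit_locate(uint64_t tupleno, uint64_t *pageno, size_t *bitno)
{
    *pageno = tupleno / (MATCH_BITMAP_PAGE_BYTES * 8);
    *bitno = (size_t) (tupleno % (MATCH_BITMAP_PAGE_BYTES * 8));
}

/* set the match bit for that tuple within its (already loaded) page */
static inline void
match_bit_set(unsigned char *page, size_t bitno)
{
    page[bitno / 8] |= (unsigned char) (1 << (bitno % 8));
}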

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-06-07 Thread Robert Haas
On Sat, Jun 1, 2019 at 3:20 PM Andres Freund  wrote:
> > I'd like to think that the best way to deal with that and reduce the
> > confusion would be to move anything related to bulk inserts into their
> > own header/file, meaning the following set:
> > - ReleaseBulkInsertStatePin
> > - GetBulkInsertState
> > - FreeBulkInsertState
> > There is the argument that we could also move that part into tableam.h
> > itself though as some of the rather generic table-related callbacks,
> > but that seems grotty.  So I think that we could just move that stuff
> > as backend/access/common/bulkinsert.c.
>
> Yea, I think we should do that at some point. But I'm not sure this is
> the right design. Bulk insert probably needs to rather be something
> that's allocated inside the AM.

As far as I can see, any on-disk, row-oriented, block-based AM is
likely to want the same implementation as the heap.  Column stores
might want to pin multiple buffers, and an in-memory AM might have a
completely different set of requirements, but something like zheap
really has no reason to depart from what the heap does.  I think it's
really important that new table AMs not only have the option to do
something different than the heap in any particular area, but that
they also have the option to do the SAME thing as the heap without
having to duplicate a bunch of code.  So I think it would be
reasonable to start by doing some pure code movement here, along the
lines proposed by Michael -- not sure if src/backend/access/common is
right or if it should be src/backend/access/table -- and then add the
abstraction afterwards.  Worth noting is ReadBufferBI() also needs
moving and is actually a bigger problem than the functions that
Michael mentioned, because the other functions are accessible if
you're willing to stoop to including heap-specific headers, but that
function is static and you'll have to just copy-and-paste it.  Uggh.

Here's a draft design for adding some abstraction, roughly modeled on
the abstraction Andres added for TupleTableSlots:

1. a BulkInsertState becomes a struct whose only member is a pointer
to const BulkInsertStateOps *const ops

2. that structure has a member for each defined operation on a BulkInsertState:

void (*free)(BulkInsertState *);
void (*release_pin)(BulkInsertState *); // maybe rename to make it more generic

3. table AM gets a new member BulkInsertState
*(*create_bistate)(Relation Rel) and a corresponding function
table_create_bistate(), analogous to table_create_slot(), which can
call the constructor function for the appropriate type of
BulkInsertState and return the result

4. each type of BulkInsertState has its own functions for making use
of it, akin to ReadBufferBI.  That particular function signature is
only likely to be correct for something that does more-or-less what
the existing type of BulkInsertState does; if you're using a
column-store that pins multiple buffers or something, you'll need your
own code path.  But that's OK, because ReadBufferBI or whatever other
functions you have are only going to get called from AM-specific code,
which will know what type of BulkInsertState they have got, because
they are in control of which kind of BulkInsertState gets created for
their relations as per point #3, so they can just call the right
functions.

5. The current implementation of BulkInsertState gets renamed to
BlockBulkInsertState (or something else) and is used by heap and any
AMs that like it.
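
To make the shape of points 1-3 a bit more concrete, a minimal sketch (all
names hypothetical, not a worked-out patch):

struct BulkInsertStateData;     /* forward declaration */

/* point 2: the operations every kind of bulk-insert state must provide */
typedef struct BulkInsertStateOps
{
    void    (*free) (struct BulkInsertStateData *bistate);
    void    (*release_pin) (struct BulkInsertStateData *bistate);
} BulkInsertStateOps;

/* point 1: the only common member is the ops pointer */
typedef struct BulkInsertStateData
{
    const BulkInsertStateOps *ops;
} BulkInsertStateData;

Point 3 would then just be a create_bistate callback in the table AM struct
plus a thin table_create_bistate() wrapper around it, in the same spirit as
the slot-creation callback.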

I see Michael's point about the relationship between
finish_bulk_insert() and the BulkInsertState, and maybe if we could
figure that out we could avoid the need for a BulkInsertState to have
a free method (or maybe any methods at all, in which case it could
just be an opaque struct, like a Node). However, it looks to me as
though copy.c can create a bunch of BulkInsertStates but only call
finish_bulk_insert() once, so unless that's a bug in need of fixing I
don't quite see how to make that approach work.

Comments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Custom table AMs need to include heapam.h because of BulkInsertState

2019-06-07 Thread Robert Haas
On Thu, Jun 6, 2019 at 10:29 PM Michael Paquier  wrote:
> One thing which is a bit tricky is that for example with COPY FROM we
> have a code path which is able to release a buffer held by the bulk
> insert state.

Are you talking about the call to ReleaseBulkInsertStatePin, or something else?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: hyrax vs. RelationBuildPartitionDesc

2019-06-07 Thread Robert Haas
On Thu, Jun 6, 2019 at 2:48 AM Amit Langote  wrote:
> Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch,
> which seems to fix this issue for me.

Yeah, that looks right.  I think my patch was full of fuzzy thinking
and inadequate testing; thanks for checking it over and coming up with
the right solution.

Anyone else want to look/comment?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: tableam: abstracting relation sizing code

2019-06-07 Thread Robert Haas
On Fri, Jun 7, 2019 at 4:11 AM Daniel Gustafsson  wrote:
> Makes sense. Regarding one of the hacks:
>
> +* HACK: if the relation has never yet been vacuumed, use a minimum size
> +* estimate of 10 pages.  The idea here is to avoid assuming a
> +* newly-created table is really small, even if it currently is, because
> +* that may not be true once some data gets loaded into it.
>
> When this is a generic function for AM’s, would it make sense to make the
> minimum estimate a passed in value rather than hardcoded at 10?  I don’t have
> a case in mind, but I also don’t think that assumptions made for heap
> necessarily make sense for all AM’s. Just thinking out loud.

I think that's probably going in the wrong direction.  It's arguable,
of course.  However, it seems to me that there's nothing heap-specific
about the number 10.  It's not computed based on the format of a heap
page or a heap tuple.  It's just somebody's guess (likely Tom's) about
how to plan with empty relations.  If somebody finds that another
number works better for their AM, it's probably also better for heap,
at least on that person's workload.  It seems more likely to me that
this needs to be a GUC or reloption than that it needs to be an
AM-specific property.  It also seems to me that if the parameters of
the hack randomly vary from one AM to another, it's likely to create
confusing plan differences that have nothing to do with the actual
merits of what the AMs are doing.  But all that being said, I'm not
blocking anybody from fooling around with this; nobody's obliged to
use the helper function.  It's just there for people who want the same
AM-independent logic that the heap uses.

> > Here is a patch that tries to improve the situation.  I don't know
> > whether there is some better approach; this seemed like the obvious
> > thing to do.
>
> A small nitpick on the patch:
>
> + * estimate_rel_size callback, because it has a few additional paramters.
>
> s/paramters/parameters/

Good catch, and now I notice that the callback is not called
estimate_rel_size but relation_estimate_size.  Updated patch attached;
thanks for the review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v2-0001-tableam-Provide-helper-functions-for-relation-siz.patch
Description: Binary data


Re: tableam: abstracting relation sizing code

2019-06-07 Thread Robert Haas
On Thu, Jun 6, 2019 at 10:08 PM Michael Paquier  wrote:
> Looks like a neat split.

Thanks.

> "allvisfrac", "pages" and "tuples" had better be documented about
> which result they represent.

A lot of the table AM stuff (and the related slot stuff) lacks
function header comments; I don't like that and think it should be
improved. However, that's not the job of this patch. I think it's
completely correct for this patch to document, as it does, that the
arguments have the same meaning as for the estimate_rel_size method,
and leave it at that. There is certainly negative value in duplicating
the definitions in multiple places, where they might get out of sync.
The header comment for table_relation_estimate_size() refers the
reader to the comments for estimate_rel_size(), which says:

 * estimate_rel_size - estimate # pages and # tuples in a table or index
 *
 * We also estimate the fraction of the pages that are marked all-visible in
 * the visibility map, for use in estimation of index-only scans.
 *
 * If attr_widths isn't NULL, it points to the zero-index entry of the
 * relation's attr_widths[] cache; we fill this in if we have need to compute
 * the attribute widths for estimation purposes.

...which AFAICT constitutes as much documentation of these parameters
as we have got.  I think this is all a bit byzantine and could
probably be made clearer, but (1) fortunately it's not all that hard
to guess what these are supposed to mean and (2) I don't feel obliged
to do semi-related comment cleanup every time I patch tableam.

> Could you explain what's the use cases you have in mind for
> usable_bytes_per_page?  All AMs using smgr need to have a page header,
> which implies that the usable number of bytes is (BLCKSZ - page
> header) like heap for tuple data.  In the AMs you got to work with, do
> you store some extra data in the page which is not used for tuple
> storage?  I think that makes sense, just wondering about the use
> case.

Exactly.  BLCKSZ - page header is probably the maximum unless you roll
your own page format, but you could easily have less if you use the
special space.  zheap is storing transaction slots there; you could
store an epoch to avoid freezing, and there's probably quite a few
other reasonable possibilities.
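
As an illustration, an AM that reserves some special space might derive the
usable-bytes-per-page value roughly like this (a sketch only; MyAmSpecialData
and its field layout are hypothetical):

#include "postgres.h"
#include "storage/bufpage.h"

/* hypothetical per-page special area for some table AM */
typedef struct MyAmSpecialData
{
    TransactionId   slot_xids[4];   /* e.g. per-page transaction slots */
    uint32          epoch;          /* e.g. to avoid freezing */
} MyAmSpecialData;

static inline Size
my_am_usable_bytes_per_page(void)
{
    /* the page header and special space are not available for tuple data */
    return BLCKSZ - SizeOfPageHeaderData - MAXALIGN(sizeof(MyAmSpecialData));
}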

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Fix inconsistencies for v12

2019-06-07 Thread Alexander Lakhin
Hello Amit,
07.06.2019 7:30, Amit Kapila wrote:
> Leaving the changes related to the above two points, I have combined
> all the changes and fixed the things as per my review in the attached
> patch.  Alexander, see if you can verify once the combined patch.  I
> am planning to commit the attached by tomorrow and then we can deal
> with the remaining two.  However, in the meantime, if Andres shared
> his views on the above two points, then we can include the changes
> corresponding to them as well.
Amit, I agree with all of your changes. All I could do is to move a dot:
.. (see contrib/postgres_fdw/postgres_fdw.c: postgresBeginDirectModify()
as one example).

Best regards,
Alexander





Re: be-gssapi-common.h should be located in src/include/libpq/

2019-06-07 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> As mentioned on another thread about test coverage, I have noticed
> that be-gssapi-common.h is not placed at the correct location, even
> if its identification path at the top points to where the file should
> be:
> https://www.postgresql.org/message-id/20190604014630.gh1...@paquier.xyz
> 
> The file has been introduced at its current location as of b0b39f72.
> Any objections to something like the attached?

I'm pretty sure it ended up there just because that's how things are in
src/interfaces/libpq.  I don't have any objection to moving it; I had
really just been waiting to see where that thread ended up going.

On a quick look, the patch looks fine to me.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: doc: pg_trgm missing description for GUC "pg_trgm.strict_word_similarity_threshold"

2019-06-07 Thread Michael Paquier
On Fri, Jun 07, 2019 at 03:44:14PM +0900, Masahiko Sawada wrote:
> BTW, while looking at the GUC variables defined in trgm_op.c, the operators in
> each short description seem incorrect; there is an extra percent
> sign. Should we also fix them?

Both of you are right here, and the additional documentation looks fine
to me (except the indentation).  The fix for the parameter
descriptions can be back-patched safely as they would reload correctly
once the version is updated.  Or is that not worth bothering except on
HEAD?  Thoughts?
--
Michael


signature.asc
Description: PGP signature


Re: Small review comment on pg_checksums

2019-06-07 Thread Michael Paquier
On Fri, Jun 07, 2019 at 03:30:35PM +0900, Masahiko Sawada wrote:
> Agreed. Please find an attached patch.

Thanks, committed.
--
Michael


signature.asc
Description: PGP signature


Re: PG 12 draft release notes

2019-06-07 Thread Adrien Nayrat
On 5/11/19 10:33 PM, Bruce Momjian wrote:
> I have posted a draft copy of the PG 12 release notes here:
> 
>   http://momjian.us/pgsql_docs/release-12.html
> 
> They are committed to git.  It includes links to the main docs, where
> appropriate.  Our official developer docs will rebuild in a few hours.
> 

Hello,

While looking at a user request to add Greek to our FTS [1], I suggest mentioning
which languages have been added in fd582317e.

Patch attached.

Should we also mention these changes?

> These all work in UTF8, and the indonesian and irish ones also work in LATIN1.

> The non-UTF8 version of the hungarian stemmer now works in LATIN2 not LATIN1.


1:
https://www.postgresql.org/message-id/trinity-f09793a1-8c13-4b56-94fe-10779e96c87e-1559896268438%403c-app-mailcom-bs16

Cheers,

-- 
Adrien
diff --git a/doc/src/sgml/release-12.sgml b/doc/src/sgml/release-12.sgml
index 532055456c..c50a23bd1c 100644
--- a/doc/src/sgml/release-12.sgml
+++ b/doc/src/sgml/release-12.sgml
@@ -2085,8 +2085,8 @@ Author: Tom Lane 
 -->
 
   
-   Update Snowball stemmer dictionaries with support for new languages
-   (Arthur Zakirov)
+   Update Snowball stemmer dictionaries with support for new languages:
+   arabic, indonesian, irish, lithuanian, nepali and tamil (Arthur Zakirov)
   
 
   


signature.asc
Description: OpenPGP digital signature


Missing generated column in ALTER TABLE ADD COLUMN doc

2019-06-07 Thread Masahiko Sawada
Hi,

We support ALTER TABLE ADD COLUMN .. GENERATED ALWAYS AS .. but the
doc is missing it. The attached small patch fixes this.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


doc_alter_column_add_generated_column.patch
Description: Binary data


Re: tableam: abstracting relation sizing code

2019-06-07 Thread Daniel Gustafsson
> On 6 Jun 2019, at 22:40, Robert Haas  wrote:

> It looks to me as though any table AM that uses the relation forks
> supported by PostgreSQL in a more or less normal manner is likely to
> require an implementation of the relation_size callback that is
> identical to the one for heap, and an implementation of the
> estimate_rel_size method that is extremely similar to the one for
> heap.  The latter is especially troubling as the amount of code
> duplication is non-trivial, and it's full of special hacks.

Makes sense. Regarding one of the hacks:

+* HACK: if the relation has never yet been vacuumed, use a minimum size
+* estimate of 10 pages.  The idea here is to avoid assuming a
+* newly-created table is really small, even if it currently is, because
+* that may not be true once some data gets loaded into it.

When this is a generic function for AM’s, would it make sense to make the
minimum estimate a passed in value rather than hardcoded at 10?  I don’t have a
case in mind, but I also don’t think that assumptions made for heap necessarily
make sense for all AM’s. Just thinking out loud.

> Here is a patch that tries to improve the situation.  I don't know
> whether there is some better approach; this seemed like the obvious
> thing to do.

A small nitpick on the patch:

+ * estimate_rel_size callback, because it has a few additional paramters.

s/paramters/parameters/

cheers ./daniel



Wording variations in descriptions of gucs.

2019-06-07 Thread Kyotaro Horiguchi
Hello.

In guc.c, many of the variables are described as "Set_s_ something",
as if the variable name is the omitted subject. A few seem to be
wrongly written as "Set something" with the same intention.

Would it be useful to unify them with the majority?

wal_level
> gettext_noop("Set the level of information written to the WAL."),

log_transaction_sample_rate
> gettext_noop("Set the fraction of transactions to log for new transactions."),


Though, recovery_target seems written as intended.

> gettext_noop("Set to 'immediate' to end recovery as soon as a consistent 
> state is reached.

# rather it seems to be the detailed description..

regards.
-- 
Kyotaro Horiguchi
NTT Open Source Software Center


guc_desc_fix.patch
Description: Binary data


Re: be-gssapi-common.h should be located in src/include/libpq/

2019-06-07 Thread Daniel Gustafsson
> On 7 Jun 2019, at 06:34, Michael Paquier  wrote:

> Any objections to something like the attached?

No objections to moving the file per the patch.

While looking at libpq.h I noticed what seem to be a few nitpicks: the GSS
function prototype isn’t using the common format of having a comment specifying
the file it belongs to; ssl_loaded_verify_locations is defined as extern even
though it’s only available under USE_SSL (which works fine since it’s only
accessed under USE_SSL but seems kinda wrong); and FeBeWaitSet is not listed
under the pqcomm.c prototypes like how the extern vars from be-secure.c are.
All of these are in the attached.

cheers ./daniel



libpq_reorg.diff
Description: Binary data


Re: Should we warn against using too many partitions?

2019-06-07 Thread Amit Langote
Hi,

Thanks for the updated patches.

On Fri, Jun 7, 2019 at 2:34 PM David Rowley
 wrote:
> Anyway comments welcome.  If I had a few more minutes to spare I'd
> have wrapped OLTP in  tags, but out of time for now.

Some rewording suggestions.

1.

+...Removal of unwanted data is also a factor to consider when
+planning your partitioning strategy as an entire partition can be removed
+fairly quickly.  However, if data that you want to keep exists in that
+partition then that means having to resort to using
+DELETE instead of removing the partition.

Not sure if the 2nd sentence is necessary, or whether it should perhaps be
rewritten in a way that helps one design to benefit from this.

Maybe:

...Removal of unwanted data is also a factor to consider when
planning your partitioning strategy as an entire partition can be
removed fairly quickly, especially if the partition keys are chosen
such that all data that can be deleted together are grouped into
separate partitions.

2.

+... For example, if you choose to have one partition
+per customer and you currently have a small number of large customers,
+what will the implications be if in several years you obtain a large
+number of small customers.

The sentence could be rewritten a bit.  Maybe as:

... For example, choosing a design with one partition per customer,
because you currently have a small number of large customers, will not
scale well several years down the line when you might have a large
number of small customers.

Btw, doesn't it suffice here to say "large number of customers"
instead of "large number of small customers"?

3.

+... In this case, it may be better to choose to
+partition by RANGE and choose a reasonable number of
+partitions

Maybe:

... and choose a reasonable number of partitions, each containing the
data of a fixed number of customers.

4.

+...  It also
+may be undesirable to have a large number of partitions as each partition
+requires metadata about the partition to be stored in each session that
+touches it.  If each session touches a large number of partitions over a
+period of time then the memory consumption for this may become
+significant.

It might be a good idea to reorder the sentences here to put the
problem first and the cause later.  Maybe like this:

Another reason to be concerned about having a large number of
partitions is that the server's memory consumption may grow
significantly over a period of time, especially if many sessions touch
large numbers of partitions.  That's because each partition requires
its own metadata that must be loaded into the local memory of each
session that touches it.

5.

+With data warehouse type workloads it can make sense to use a larger
+number of partitions than with an OLTP type workload.

Is there a comma missing between "With data warehouse type workloads"
and the rest of the sentence?

Thanks,
Amit